excel - VBA script to remove duplicates when the gap values between duplicate pair meets certain criteria - Stack Overflow

时间: 2025-01-06 admin 业界

I'm still a novice in VBA script. I'd like to automatically remove duplicate(s) from their corresponding pairs with the conditions as follows:

  1. If the gap value of "Score" variable >= 0.02, remove the remaining duplicates by keeping the highest "Score" value
  2. If the gap value of "Score" variable <= 0.01, remove the remaining duplicates by keeping the higher value in "Peak" variable.

For example,

ID Score Peak
CompoundA 0.85 300
CompoundA 0.84 310
Compound_F 0.75 270
Compound_F 0.71 250
Compound_F 0.70 244
Compound_1z 0.93 370
Compound_1z 0.90 353
Compound_1z 0.88 349

I'm still a novice in VBA script. I'd like to automatically remove duplicate(s) from their corresponding pairs with the conditions as follows:

  1. If the gap value of "Score" variable >= 0.02, remove the remaining duplicates by keeping the highest "Score" value
  2. If the gap value of "Score" variable <= 0.01, remove the remaining duplicates by keeping the higher value in "Peak" variable.

For example,

ID Score Peak
CompoundA 0.85 300
CompoundA 0.84 310
Compound_F 0.75 270
Compound_F 0.71 250
Compound_F 0.70 244
Compound_1z 0.93 370
Compound_1z 0.90 353
Compound_1z 0.88 349

Expected outcome:

ID Score Peak
CompoundA 0.84 310
Compound_F 0.75 270
Compound_1z 0.93 370

I tried this:

Sub RemoveDuplicatesByScoreAndPeak() 
Dim ws As Worksheet 
Dim lastRow As Long 
Dim i As Long 
Dim gap As Double

' Define the worksheet
Set ws = ActiveSheet

' Find the last used row
lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row

' Sort the data by ID (ascending) and Score (descending)
ws.Sort.SortFields.Clear
ws.Sort.SortFields.Add Key:=ws.Columns(1), Order:=xlAscending
ws.Sort.SortFields.Add Key:=ws.Columns(2), Order:=xlDescending
With ws.Sort
    .SetRange ws.Range("A2:C" & lastRow)
    .Header = xlYes
    .MatchCase = False
    .Orientation = xlTopToBottom
    .SortMethod = xlPinYin
    .Apply
End With

' Loop through rows to remove duplicates
For i = lastRow To 2 Step -1
    If ws.Cells(i, 1).Value = ws.Cells(i - 1, 1).Value Then
        ' Calculate the gap in "Score"
        gap = Abs(ws.Cells(i, 2).Value - ws.Cells(i - 1, 2).Value)
        
        If gap >= 0.02 Then
            ' Keep the row with the highest "Score"
            If ws.Cells(i, 2).Value < ws.Cells(i - 1, 2).Value Then
                ws.Rows(i).Delete
            Else
                ws.Rows(i - 1).Delete
            End If
        ElseIf gap <= 0.01 Then
            ' Keep the row with the higher "Peak" value
            If ws.Cells(i, 3).Value < ws.Cells(i - 1, 3).Value Then
                ws.Rows(i).Delete
            Else
                ws.Rows(i - 1).Delete
            End If
        End If
    End If
Next i

' Notify the user
MsgBox "Duplicates processed successfully!", vbInformation
End Sub

I just want to use the script without loop cause I keep receiving error pop-up message. Thanks in advance.

Share Improve this question edited yesterday Tim Williams 166k8 gold badges100 silver badges137 bronze badges asked yesterday FransFrans 11 New contributor Frans is a new contributor to this site. Take care in asking for clarification, commenting, and answering. Check out our Code of Conduct. 4
  • Describing the logic step by step to explain why there are three rows in the output would be helpful for revising your code effectively. – taller Commented yesterday
  • These data are representation of GCMS or LCMS metabolomics information in which more than one peaks for the same metabolite can be identified with different time points (referred as retention time) and signal intensity (peak). The duplicate number is often not uniformly identified in such tests. – Frans Commented yesterday
  • * keep receiving error pop-up message* - This is absolutely useless without the complete, exact error message you're getting, as well as which line is executing when it happens. It's difficult to help you when you provide such vague information - please edit your post to be specific. – Ken White Commented yesterday
  • For ID Compound_F, the gap between the 1st and 2nd rows meets Condition 1, and the gap between the 2nd and 3rd rows meets Condition 2. Why is the first row still included in the output table? – taller Commented yesterday
Add a comment  | 

2 Answers 2

Reset to default 0

I feel like you have to loop and you know how to build a loop. Your error probably comes from deleting the row above where you are at when it moves up to the next iteration. Excel really only likes deleting the row you are on or previous ones you have visited. And, of course, as you designed it, when deleting rows, you should process from the bottom to the top.

Instead of deleting the row above when conditions are met, don't. Your "If" should only delete the row you are on or the one you came from. In other words, check down, not up.

This seems to work (at least, it matches your sample output)

Sub RemoveDuplicatesByScoreAndPeak()
    Dim ws As Worksheet, rngData As Range, data, rwA As Range, rwB As Range
    Dim lastRow As Long, rngDel As Range, n As Long, rStart As Long, rComp As Long
    Dim compCol As Long, indx As Long, i As Long
    Dim gap As Double, groups As Object, k, rng As Range, lr As Long
    
    
    Set ws = ActiveSheet ' Define the worksheet
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row ' Find the last used row
    
    Set rngData = ws.Range("A2:C" & lastRow)
    rngData.Font.Color = vbBlack 'when testing using font color to indicate deletes
    
    ' Sort the data by ID (ascending) and Score (ascending)
    With ws.Sort
        .SortFields.Clear
        .SortFields.Add Key:=rngData.Columns(1), Order:=xlAscending
        .SortFields.Add Key:=rngData.Columns(2), Order:=xlAscending
        .SetRange rngData
        .Header = xlNo
        .MatchCase = False
        .Orientation = xlTopToBottom
        .SortMethod = xlPinYin
        .Apply
    End With
    
    Set groups = RowGroups(rngData) 'rows grouped by id
    For Each k In groups            'loop over compound id's
        Set rng = groups(k)
        Debug.Print "Id: " & k, rng.Address
        n = rng.Rows.Count
        If n > 1 Then              'id has replicates ?
            lr = rng.Rows(n).Row   'last row for this id...
            Set rwA = rng.Rows(1)  'first pair of rows to compare
            Set rwB = rng.Rows(2)
            Do
                gap = Abs(rwA.Cells(2).Value - rwB.Cells(2).Value)
                compCol = IIf(gap >= 0.02, 2, 3) 'set comparison column
                If rwA.Cells(compCol).Value < rwB.Cells(compCol).Value Then
                    BuildRange rngDel, rwA
                    Set rwA = rwB           'reset the comparison rows...
                    Set rwB = rwB.Offset(1)
                Else
                    BuildRange rngDel, rwB
                    Set rwB = rwB.Offset(1) 'next row down
                End If
                If rwB.Row > lr Then Exit Do 'ran out of rows
            Loop
        End If
    Next k
    If Not rngDel Is Nothing Then rngDel.Font.Color = vbRed 'or rngDel.Entirerow.delete
   
    ' Notify the user
    MsgBox "Duplicates processed successfully!", vbInformation
End Sub

'return a dictionary keyed on column 1 of rngData, with corresponding ranges as values
'  assumes data is sorted on at least column 1
Function RowGroups(rngData As Range) As Object 'dictionary of ranges
    Dim dict As Object, id, rw As Range
    Set dict = CreateObject("scripting.dictionary")
    For Each rw In rngData.Rows
        id = rw.Cells(1).Value
        If Not dict.exists(id) Then
            Set dict(id) = rw
        Else
            Set dict(id) = Application.Union(dict(id), rw)
        End If
    Next rw
    Set RowGroups = dict
End Function


Sub BuildRange(ByRef rngTot As Range, rngAdd As Range)
    If rngTot Is Nothing Then
        Set rngTot = rngAdd
    Else
        Set rngTot = Application.Union(rngTot, rngAdd)
    End If
End Sub