1

我正在创建一个倒排索引来获取一个单词字典,其中包含该单词出现的行号的相关列表(开始行号和出现在该行内给定单元格中的单词列表)。

我已经设法让一些代码为此工作,但我发现处理添加到数组(字典中的值)有点麻烦,我想知道是否有更有效或更优雅的方法来处理这个问题。

我愿意使用数组、集合或任何其他可以轻松搜索的数据类型来将行号列表存储在字典的值中。我已经粘贴了我的代码的缩减版本来演示下面的核心问题,问题实际上只是关于BuildInvertedIndex过程,但包含其余部分是为了让重新创建场景变得更容易:

Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary

    Set vRange = ActiveSheet.Range("F2:F20585")
    Set vDict = New Dictionary

    BuildInvertedIndex vDict, vRange

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ArrayToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next


End Sub


Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)

    Dim cell As Range
    Dim words As Variant, word As Variant, val As Variant
    Dim tmpArr() As Long
    Dim newLen As Long, i As Long

    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pRange.Cells

        ' loop through words in line
        words = Split(cell.Value)
        For Each word In words

            If Not pDict.exists(word) Then
                ' start line array with first row number
                pDict.Add word, Array(cell.Row())
            Else
                i = 0
                If Not InArray(cell.Row(), pDict.Item(word)) Then
                    newLen = UBound(pDict.Item(word)) + 1
                    ReDim tmpArr(newLen)
                    For Each val In tmpArr
                        If i < newLen Then
                            tmpArr(i) = pDict.Item(word)(i)
                        Else
                            tmpArr(i) = cell.Row()
                        End If
                        i = i + 1
                    Next val
                    pDict.Item(word) = tmpArr
                End If
            End If
        Next word
    Next cell

End Sub


Function ArrayToString(vArray As Variant, _
                       Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)

    Dim vDelimString As String
    Dim i As Long

    For i = LBound(vArray) To UBound(vArray)
        vDelimString = vDelimString & CStr(vArray(i)) & _
                       IIf(vCounter < UBound(vArray), vDelim, "")
    Next

    ArrayToString = vDelimString
End Function

要运行它,您需要活动工作表(句子)的 F 列中的值,如果您还没有它,您还需要在 VBA 环境中添加对 Microsoft Scripting Runtime 的引用,以便字典数据类型可用(工具-> 参考-> Microsoft 脚本运行时)。

正如您将从代码中看到的那样,这有点混乱,我必须将新行号插入现有数组(存储为字典中的值)。由于我不知道扩展此数组的方法(不清除现有值),因此我使用变量 tmpArr 创建了一个适当大小的数组,然后从字典中的现有数组中一一复制这些值然后将当前行号添加到末尾。然后使用临时数组替换该键(当前单词)的现有值。

对此的任何建议将不胜感激。

4

1 回答 1

1

我愿意使用数组、集合或任何其他数据类型

如我所见,使用集合代替数组会简单得多:

Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)
    Dim cell As Range
    Dim words, word
    Dim i As Long    
    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pRange.Cells    
        ' loop through words in line
        words = Split(cell.Value)
        For Each word In words    
            If Not pDict.Exists(word) Then
                ' initialize collection
                pDict.Add word, New Collection
            End If
            'try to add to collection. If row is already in collecton, nothing happend. Storing key makes you sure there're only unique rows
            On Error Resume Next
            pDict.Item(word).Add Item:=cell.Row, Key:=CStr(cell.Row)
            On Error GoTo 0                
        Next word
    Next cell
End Sub

下一步,稍微修改ArrayToStringColToString

Function ColToString(vCol As Collection, _
                   Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)

    Dim vDelimString As String
    Dim i As Long

    For i = 1 To vCol.Count
        vDelimString = vDelimString & CStr(vCol.Item(i)) & _
                       IIf(i < vCol.Count, vDelim, "")
    Next

    ColToString = vDelimString
End Function

和测试子例程(仅更改一行 -Debug.Print k & ": " & ColToString(vDict.Item(k))目标范围为"F2:F5"):

Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary

    Set vRange = ActiveSheet.Range("F2:F5")
    Set vDict = New Dictionary

    BuildInvertedIndex vDict, vRange

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ColToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next

    'clean up memory
    Set vDict = Nothing
End Sub

结果:

在此处输入图像描述


更新:

为了提高代码的速度,您可以将范围存储在数组中(下一种方法仅适用于单列范围,但您可以轻松修改它):

测试子:

Sub TestWirhArray()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary
    Dim myArr As Variant

    Set vDict = New Dictionary
    Set vRange = ActiveSheet.Range("F2:F20585")
    myArr = vRange.Value
    BuildInvertedIndexWithArr vDict, myArr, vRange.Row

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ColToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next

    'clean up memory
    Set vDict = Nothing
End Sub

新版本BuildInvertedIndexWithArr

Sub BuildInvertedIndexWithArr(pDict As Dictionary, pArr, firstRow As Long)
    Dim cell, words, word
    Dim i As Long, j As Long

    j = firstRow
    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pArr

        ' loop through words in line
        words = Split(cell)
        For Each word In words

            If Not pDict.exists(word) Then
                ' initialize collection
                pDict.Add word, New Collection
            End If

            On Error Resume Next
            pDict.Item(word).Add Item:=j, Key:=CStr(j)
            On Error GoTo 0

        Next word
        j = j + 1
    Next cell
End Sub
于 2014-03-03T08:58:34.657 回答