我正在创建一个倒排索引来获取一个单词字典,其中包含该单词出现的行号的相关列表(开始行号和出现在该行内给定单元格中的单词列表)。
我已经设法让一些代码为此工作,但我发现处理添加到数组(字典中的值)有点麻烦,我想知道是否有更有效或更优雅的方法来处理这个问题。
我愿意使用数组、集合或任何其他可以轻松搜索的数据类型来将行号列表存储在字典的值中。我已经粘贴了我的代码的缩减版本来演示下面的核心问题,问题实际上只是关于BuildInvertedIndex
过程,但包含其余部分是为了让重新创建场景变得更容易:
Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure
Dim vRange As Range
Dim vDict As Dictionary
Set vRange = ActiveSheet.Range("F2:F20585")
Set vDict = New Dictionary
BuildInvertedIndex vDict, vRange
' test values returned in dictionary (word: [line 1, ..., line n])
Dim k As Variant, vCounter As Long
vCounter = 0
For Each k In vDict.Keys
Debug.Print k & ": " & ArrayToString(vDict.Item(k))
vCounter = vCounter + 1
If vCounter >= 10 Then
Exit For
End If
Next
End Sub
Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)
Dim cell As Range
Dim words As Variant, word As Variant, val As Variant
Dim tmpArr() As Long
Dim newLen As Long, i As Long
' loop through cells (one col wide so same as looping through lines)
For Each cell In pRange.Cells
' loop through words in line
words = Split(cell.Value)
For Each word In words
If Not pDict.exists(word) Then
' start line array with first row number
pDict.Add word, Array(cell.Row())
Else
i = 0
If Not InArray(cell.Row(), pDict.Item(word)) Then
newLen = UBound(pDict.Item(word)) + 1
ReDim tmpArr(newLen)
For Each val In tmpArr
If i < newLen Then
tmpArr(i) = pDict.Item(word)(i)
Else
tmpArr(i) = cell.Row()
End If
i = i + 1
Next val
pDict.Item(word) = tmpArr
End If
End If
Next word
Next cell
End Sub
Function ArrayToString(vArray As Variant, _
Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)
Dim vDelimString As String
Dim i As Long
For i = LBound(vArray) To UBound(vArray)
vDelimString = vDelimString & CStr(vArray(i)) & _
IIf(vCounter < UBound(vArray), vDelim, "")
Next
ArrayToString = vDelimString
End Function
要运行它,您需要活动工作表(句子)的 F 列中的值,如果您还没有它,您还需要在 VBA 环境中添加对 Microsoft Scripting Runtime 的引用,以便字典数据类型可用(工具-> 参考-> Microsoft 脚本运行时)。
正如您将从代码中看到的那样,这有点混乱,我必须将新行号插入现有数组(存储为字典中的值)。由于我不知道扩展此数组的方法(不清除现有值),因此我使用变量 tmpArr 创建了一个适当大小的数组,然后从字典中的现有数组中一一复制这些值然后将当前行号添加到末尾。然后使用临时数组替换该键(当前单词)的现有值。
对此的任何建议将不胜感激。