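I'm a beginner data scientist trying to write a fast duplicate search using the LSH implementation from datasketch. When I run my program on a large input (more than 250,000 documents), step 1 completes fine, but then the program hangs at step 2. With a small input everything works. Is there a way to solve this problem?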
from datetime import datetime

from datasketch import MinHash, MinHashLSH
from sklearn.feature_extraction.text import CountVectorizer


def LSH(data, num_perm=128, threshold=0.5, check_const=0.9):
    # data: list of documents, each document being a list of string tokens
    vec_unig = CountVectorizer(
        min_df=50,
        analyzer='word',
        # punctuation placeholder tokens to ignore
        stop_words=['_dot_', '_comma_', '_voskl_'],
        ngram_range=(1, 2),
    )
    X = vec_unig.fit_transform([" ".join(i) for i in data])
    length = X.shape[0]
    array1 = []
    print("Collection:", length)

    print("Step 1:")
    print("Form Minhash")
    start = datetime.now()
    for i in range(len(data)):
        print(i)
        # one MinHash per document, updated with every token
        m = MinHash(num_perm=num_perm)
        for d in data[i]:
            m.update(d.encode('utf8'))
        array1.append(m)
    print(datetime.now() - start)

    print("Step 2")
    print("Form potential clusters")
    start = datetime.now()
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for i in range(len(array1)):
        # progress output every 100 inserted documents
        if i % 100 == 0:
            print(i)
        lsh.insert(i, array1[i])
    print(datetime.now() - start)
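
For context, here is a minimal pure-Python sketch of what I understand Steps 1 and 2 to be doing conceptually: build a MinHash signature per document, then insert each signature into one hash table per band. This is only an illustration under my own assumptions, not datasketch's actual implementation. The names `minhash_signature`, `build_band_index` and `candidates` are hypothetical helpers, and the salted SHA-1 hashing and the band layout are choices made just for this example.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_perm=16):
    """For each of num_perm salted hash functions, keep the minimum hash over the tokens."""
    unique_tokens = set(tokens)
    signature = []
    for seed in range(num_perm):
        salt = str(seed).encode("utf8") + b":"
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + tok.encode("utf8")).digest()[:8], "big")
            for tok in unique_tokens
        ))
    return signature


def build_band_index(signatures, bands=4):
    """Split every signature into bands and store the document key under each band."""
    rows = len(signatures[0]) // bands
    tables = [defaultdict(list) for _ in range(bands)]
    for key, sig in enumerate(signatures):
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            tables[b][band].append(key)
    return tables


def candidates(tables, sig):
    """Return keys of all documents that share at least one identical band with sig."""
    rows = len(sig) // len(tables)
    found = set()
    for b, table in enumerate(tables):
        band = tuple(sig[b * rows:(b + 1) * rows])
        found.update(table.get(band, []))
    return found


docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "quick", "brown", "fox"],               # exact duplicate of document 0
    ["completely", "different", "words", "here"],   # shares no tokens with document 0
]
signatures = [minhash_signature(d) for d in docs]   # analogous to Step 1
tables = build_band_index(signatures)               # analogous to Step 2
dupes = candidates(tables, signatures[0])
print(sorted(dupes))
```

In this sketch every inserted document adds one entry per band to in-memory tables, so the index grows with the number of documents. Whether that is related to the hang I see in Step 2 is something I have not confirmed.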