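I'm a beginner data scientist trying to write a fast duplicate search using the LSH implementation in datasketch. When I run my program on a large input (more than 250,000 documents), step 1 completes fine, but then the program hangs at step 2. With a small input, everything works. Is there any way to solve this problem?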

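from datetime import datetime

from datasketch import MinHash, MinHashLSH
from sklearn.feature_extraction.text import CountVectorizer


def LSH(data, num_perm = 128, threshold = 0.5, check_const = 0.9):
    # data: a list of token lists, one per document (inferred from the join/encode calls below)
    vec_unig = CountVectorizer(min_df=50, analyzer = 'word', stop_words = ['_dot_', '_comma_', '_voskl_'], ngram_range=(1,2))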
    X = vec_unig.fit_transform([" ".join(i) for i in data])
    length = X.shape[0]
    array1 = []
    print("Collection:" ,length)
    print("Step 1:")
    print("Form Minhash")
    start = datetime.now()
    # Build one MinHash signature per document from its tokens
    for i in range(len(data)):
        print(i)
        m = MinHash(num_perm = num_perm)
        for d in data[i]:
            m.update(d.encode('utf8'))
        array1.append(m)
    print(datetime.now()- start)
    print("Step 2")
    print("Form potential clusters")
    start = datetime.now()
    lsh = MinHashLSH(threshold = threshold, num_perm = num_perm)
    # Index every signature; the key is the document's position in data
    for i in range(len(array1)):
        if ((i % 100) == 0):
            print(i)
        lsh.insert(i, array1[i])
    print(datetime.now()- start)
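
For readers unsure what step 2 does per document, here is a toy, standard-library-only sketch of banded MinHash LSH. It is not datasketch's implementation: `toy_minhash`, `ToyLSH`, and the band/row sizes are hypothetical names and values chosen for illustration, and the salted `blake2b` hash merely stands in for a family of hash functions.

```python
import hashlib
from collections import defaultdict


def toy_minhash(tokens, num_perm=16):
    """Return a tuple of num_perm minimum hash values for a non-empty token list."""
    signature = []
    for seed in range(num_perm):
        # A different salt stands in for a different hash function (illustrative choice).
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return tuple(signature)


class ToyLSH:
    """Banded LSH index: one dict per band, keyed by that band's slice of the signature."""

    def __init__(self, bands=4, rows=4):
        self.bands = bands
        self.rows = rows
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        for b in range(self.bands):
            yield b, signature[b * self.rows:(b + 1) * self.rows]

    def insert(self, key, signature):
        # One dictionary update per band; no comparison against other documents.
        for b, band_key in self._band_keys(signature):
            self.tables[b][band_key].add(key)

    def query(self, signature):
        # Candidates are all keys that share at least one full band with the query.
        candidates = set()
        for b, band_key in self._band_keys(signature):
            candidates |= self.tables[b].get(band_key, set())
        return candidates


docs = [["the", "cat", "sat"], ["the", "cat", "sat"], ["totally", "different", "words"]]
signatures = [toy_minhash(d) for d in docs]
index = ToyLSH(bands=4, rows=4)
for i, sig in enumerate(signatures):
    index.insert(i, sig)
print(index.query(signatures[0]))
```

In this sketch each insert costs a fixed number of dictionary updates, so indexing time grows roughly linearly with the number of documents. If the real index behaves similarly, a stall that only appears on large inputs may be worth checking against memory use rather than the loop itself, though that is a guess and not a diagnosis.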