python - 稀疏numpy数组的局部敏感散列

Question

我有一个大的稀疏 numpy/scipy 矩阵，其中每一行对应于高维空间中的一个点。我想进行以下类型的查询：

给定一个点P（矩阵中的一行）和一个距离epsilon ，找到与P的距离最大为epsilon的所有点。

我使用的距离度量是 Jaccard-similarity，所以应该可以使用 Locality Sensitive Hashing 技巧，例如 MinHash。

在某处是否有用于稀疏 numpy 数组的 MinHash 实现（我似乎找不到），还是有一种简单的方法可以做到这一点？

我不只是从 Github 中提取为非稀疏数组构建的东西的原因是 scipy 中的稀疏数据结构可能会导致时间复杂度爆炸。

score 7 · Accepted Answer

如果你有非常大的稀疏数据集，太大而无法以非稀疏格式保存在内存中，我会尝试这个基于 Scipy 的 CSR 稀疏矩阵假设构建的 LSH 实现：

https://github.com/brandonrobertz/SparseLSH

如果您无法将表放入内存，它还支持基于磁盘的键值存储，例如 LevelDB。从文档：

from sparselsh import LSH
from scipy.sparse import csr_matrix

X = csr_matrix( [
    [ 3, 0, 0, 0, 0, 0, -1],
    [ 0, 1, 0, 0, 0, 0,  1],
    [ 1, 1, 1, 1, 1, 1,  1] ])

# One class number for each input point
y = [ 0, 3, 10]

X_sim = csr_matrix( [ [ 1, 1, 1, 1, 1, 1, 0]])

lsh = LSH( 4,
           X.shape[1],
           num_hashtables=1,
           storage_config={"dict":None})

for ix in xrange(X.shape[0]):
    x = X.getrow(ix)
    c = y[ix]
    lsh.index( x, extra_data=c)

# find points similar to X_sim
lsh.query(X_sim, num_results=1)

如果你肯定只想使用 MinHash，你可以试试https://github.com/go2starr/lshhdc，但我还没有亲自测试过它与稀疏矩阵的兼容性。

python - 稀疏numpy数组的局部敏感散列

1 回答 1

Related

Reference