vocabulary - Can I prune the parser's vocabulary in spaCy?

Question

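The following code uses spaCy word vectors to find the 20 words most similar to a given word. It first computes the cosine similarity against every word in the vocabulary (over a million), then sorts the resulting list by similarity.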

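from numpy import dot
from numpy.linalg import norm
from spacy.en import English  # spaCy 1.x import path

parser = English()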

# access known words from the parser's vocabulary
current_word = parser.vocab[word]

# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))

# gather all known words, take only the lowercased versions
allWords = list({w for w in parser.vocab if w.has_vector and w.orth_.islower() and w.lower_ != word})

# sort by similarity
allWords.sort(key=lambda w: cosine(w.vector, current_word.vector))
allWords.reverse()

# format the string before printing, and avoid reusing the name `word` in the loop
print("Top 20 most similar words to %s:" % word)
for similar in allWords[:20]:
    print(similar.orth_)

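What I would like to know is whether there is a way to restrict spaCy's vocabulary to only the words that appear in a given list, which I hope would greatly reduce the cost of the sort operation.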

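To be clear, I would like to pass in a list of just a few words, or just the words in a given text, and be able to quickly look up which of those words are nearest to each other in spaCy's vector space.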

Any help on this front is appreciated.

score 1 · Accepted Answer

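The spaCy documentation says:

The default English model installs vectors for one million vocabulary entries, using the 300-dimensional vectors trained on the Common Crawl corpus using the GloVe algorithm. The GloVe common crawl vectors have become a de facto standard for practical NLP.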

So you can load the GloVe vectors with Gensim. I am not sure whether you can load them directly or whether you have to use this script.
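If only a handful of words are needed, one option is to skip the full table and read just those rows from the GloVe text file. The following is a minimal sketch, assuming the plain-text GloVe format (one word per line, followed by space-separated floats). The function name load_glove_subset and its arguments are hypothetical and are not part of spaCy or Gensim.

```python
def load_glove_subset(path, wanted):
    """Return {word: vector} for the words in `wanted` found in a GloVe text file.

    Assumes each line has the form: word v1 v2 ... vn (space separated).
    """
    wanted = set(wanted)
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Split off the word first so unwanted rows are never parsed as floats.
            word, _, rest = line.rstrip("\n").partition(" ")
            if word in wanted:
                vectors[word] = [float(x) for x in rest.split()]
                if len(vectors) == len(wanted):
                    break  # every requested word has been found
    return vectors
```

Words that do not occur in the file are simply absent from the returned dict, so the caller can check which candidates actually have a vector.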

If you have loaded the word vectors in Gensim as model, you can simply use model.similarity('woman', 'man') to get the similarity between two words. If you have a list of words, you can do the following:

def most_similar(word, candidates, model, n=20):
    "Get N most similar words from a list of candidates"
    similarities = [(model.similarity(word,candidate), candidate) 
                    for candidate in candidates]
    most_similar_words = sorted(similarities, reverse=True)[:n]
    only_words = [w for sim,w in most_similar_words]
    return only_words
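The same idea can be tried without Gensim when the vectors are already held in a plain dict. This is only a sketch under that assumption: the names cosine, top_n_similar and the toy vectors are made up for illustration, and heapq.nlargest is swapped in for the full sort so that only the best n scores are kept.

```python
import heapq
import math

def cosine(v1, v2):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def top_n_similar(word, candidates, vectors, n=20):
    """Return the n candidates closest to `word`, most similar first.

    `vectors` is a plain dict {word: vector}; candidates without a vector
    (and the query word itself) are skipped.
    """
    query = vectors[word]
    scored = (
        (cosine(query, vectors[c]), c)
        for c in candidates
        if c in vectors and c != word
    )
    # nlargest keeps only n items at a time instead of sorting everything.
    return [c for _, c in heapq.nlargest(n, scored)]

# Toy 2-dimensional vectors, made up purely for illustration.
toy = {"apple": [1.0, 0.1], "pear": [0.9, 0.2], "car": [0.1, 1.0]}
print(top_n_similar("apple", ["pear", "car", "unknown"], toy, n=2))
```

With a short candidate list the cost is dominated by the similarity computations rather than the sort. Note that an all-zero vector would cause a division by zero in this sketch.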

score 0 · Accepted Answer

spaCy has a Vectors class with a most_similar method. You can define a wrapper function around it to avoid writing your own implementation:

import spacy
import numpy as np

def most_similar(word, model, n=20):
    nlp = spacy.load(model)
    doc = nlp(word)
    vecs = [token.vector for token in doc]
    queries = np.array(vecs)
    keys_arr, best_rows_arr, scores_arr = nlp.vocab.vectors.most_similar(queries, n=n)
    keys = keys_arr[0] # The array of keys is nested in another array from the previous step.
    similar_words_list = [nlp.vocab[key].text for key in keys]
    return similar_words_list

Call it like this: most_similar('apple', 'en_core_web_md', n=20). This finds the 20 words most similar to "apple" by cosine similarity, using the vectors in the spaCy model package "en_core_web_md".

Here is the result: ['BLACKBERRY', 'APPLE', 'apples', 'PRUNES', 'iPHone', '3g/3gs', 'fruit', 'FIG', 'CREAMSICLE', 'iPad', 'ipad4', 'LONGAN', 'CALVADOS', 'iPOD', 'iPod', 'SORBET', 'PERSICA', 'peach', 'juice', 'JUICE']
