python - Applying CountVectorizer to a column whose rows contain lists of words

Question

I wrote a preprocessing step for text analysis. It tokenizes each row, stems the tokens, and removes stop words, as follows:

test[col] = test[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])

train[col] = train[col].apply(
    lambda x: [ps.stem(item) for item in re.findall(r"[\w']+", x) if ps.stem(item) not in stop_words])

I now have a column that holds lists of "cleaned words". Here are 3 rows from that column:

['size']
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps']
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']

I now want to apply CountVectorizer to this column:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500, analyzer='word', lowercase=False) # will leave only 1500 words
X_train = cv.fit_transform(train[col])

But I get an error:

TypeError: expected string or bytes-like object

It would be a bit odd to build strings from the lists only to have CountVectorizer split them again.

score 5 · Accepted Answer

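To apply CountVectorizer to lists of words, you should disable the built-in analyzer by passing a callable that returns each row unchanged.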

x=[['ab','cd'], ['ab','de']]
vectorizer = CountVectorizer(analyzer=lambda x: x)
vectorizer.fit_transform(x).toarray()

Out:
array([[1, 1, 0],
       [1, 0, 1]], dtype=int64)

score 4 · Accepted Answer

Since I found no other way to avoid the error, I joined the lists in the column back into strings:

train[col]=train[col].apply(lambda x: " ".join(x) )
test[col]=test[col].apply(lambda x: " ".join(x) )

Only after that did I start getting results:

X_train = cv.fit_transform(train[col])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_train = pd.DataFrame(X_train.toarray(), columns=cv.get_feature_names_out())
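
One thing to watch with the join approach: the default token_pattern of CountVectorizer only keeps tokens of two or more word characters, so a one-letter token such as 'x' from the question's second row is dropped when the joined string is split again. The sketch below illustrates this on a shortened, made-up row. The token_pattern=r"\S+" setting is just one possible way to keep every whitespace-separated token, not something the original answer used.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened example row, joined with spaces as in the answer above.
tokens = ['pcs', 'new', 'x', 'kraft']
joined = [" ".join(tokens)]

# Default settings: the built-in pattern ignores single-character tokens.
default_cv = CountVectorizer(lowercase=False)
default_cv.fit(joined)
default_vocab = sorted(default_cv.vocabulary_)

# Assumed alternative: treat every run of non-whitespace as one token.
keep_all_cv = CountVectorizer(lowercase=False, token_pattern=r"\S+")
keep_all_cv.fit(joined)
keep_all_vocab = sorted(keep_all_cv.vocabulary_)

print(default_vocab)   # 'x' is missing here
print(keep_all_vocab)  # 'x' is kept here
```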

score 0 · Accepted Answer

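When you call fit_transform, the argument you pass must be an iterable of strings or bytes-like objects. It looks like you should apply it to your column instead.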

X_train = train[col].apply(lambda x: cv.fit_transform(x))

You can read the documentation for fit_transform here.

score 0 · Accepted Answer

Your input should be a list of strings or bytes; in this case you seem to be providing a list of lists.
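
To see where the TypeError likely comes from, here is a small standard-library sketch. It assumes the default word analyzer ends up calling a compiled regex's findall on each document, with a pattern equivalent to scikit-learn's documented default token_pattern. The variable names are made up for illustration.

```python
import re

# Pattern equivalent to CountVectorizer's documented default token_pattern.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

row = ['brand', 'new', 'coach']  # one already-tokenized row from the column

try:
    token_pattern.findall(row)   # a list is not a string, so the regex rejects it
    error_type = None
    message = ""
except TypeError as exc:
    error_type = type(exc).__name__
    message = str(exc)

# The same row passed as a single string is tokenized without complaint.
tokens_from_string = token_pattern.findall(" ".join(row))

print(error_type, message)
print(tokens_from_string)
```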

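It looks like you have already tokenized the strings, with each row's tokens in a separate list. What you can do is a hack like the following: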

inp = [['size'],
['pcs', 'new', 'x', 'kraft', 'bubble', 'mailers', 'lined', 'bubble', 'wrap', 
'protection', 'self', 'sealing', 'peelandseal', 'adhesive', 'keeps', 
'contents', 'secure', 'tamper', 'proof', 'durable', 'lightweight', 'kraft', 
'material', 'helps', 'save', 'postage', 'approved', 'ups', 'fedex', 'usps'],
['brand', 'new', 'coach', 'bag', 'bought', 'rm', 'coach', 'outlet']]


inp = ["<some_space>".join(x) for x in inp]

vectorizer = CountVectorizer(tokenizer = lambda x: x.split("<some_space>"), analyzer="word")

vectorizer.fit_transform(inp)
