python - Extract text counts from a list of elements

Question

I have a list of text elements.

text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']

I need to count the text that appears before the "=". I used CountVectorizer with a token pattern, but it does not give the expected result:

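from sklearn.feature_extraction.text import CountVectorizer

print(text)
# Token pattern: everything from the start of the string up to the first "="
vectorizer = CountVectorizer(token_pattern="^[^=]+")
vectorizer.fit(text)
print(vectorizer.vocabulary_)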

The output is:

{'a for': 2, 'b for': 3, 'd for': 4, 'e for': 5, '1.': 0, '2.': 1}

But the expected output is:

{'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1.': 1, '2.': 1}

I also need to remove the "." from "1." so that my output becomes:

 {'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}

Is there a way to do this?

score 0 · Accepted Answer

import re

# `text` is the list defined in the question
dictionary = {}

def remove_special_characters(value):
    # Items such as '1.=one': strip the '.', the '=' and the word after it
    if '.' in value:
        return re.sub(r'\.=\w+', '', value)
    # Otherwise keep only the part before '='
    return value.split('=')[0]

for value in text:
    new_value = remove_special_characters(value)
    if new_value in dictionary:
        dictionary[new_value] += 1
    else:
        dictionary[new_value] = 1
print(dictionary)
# Output: {'a for': 2, 'b for': 1, 'd for': 2, 'e for': 1, '1': 1, '2': 1}
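As a variation on the code above, the two branches can be folded into a single expression. This is only a sketch: the names LEFT_SIDE and left_side are made up for illustration, and it assumes that only a trailing "." should be dropped from the key.

```python
import re

text = ['a for=apple', 'b for=ball', 'd for=dog', 'e for=elephant',
        'a for=apple', 'd for=dog', '1.=one', '2.=two']

# Hypothetical helper pattern: everything before the first "=".
# "[^=]*" can match an empty string, so match() never returns None here.
LEFT_SIDE = re.compile(r'[^=]*')

def left_side(value):
    # Take the part before "=" and drop any trailing "." characters
    return LEFT_SIDE.match(value).group().rstrip('.')

dictionary = {}
for value in text:
    key = left_side(value)
    # dict.get with a default replaces the if/else membership test
    dictionary[key] = dictionary.get(key, 0) + 1
print(dictionary)
```

Unlike a blanket replace('.', ''), rstrip('.') leaves dots inside the key alone, so a key such as '1.5' would stay intact. Whether that is what you want depends on your data.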

score 0 · Accepted Answer

from collections import Counter

text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']

text = [i.split('=')[0] for i in text]      # keep only the part before the first '='
text = [i.split('.')[0] for i in text]      # drop everything from the first '.' onward
frequency = {}
for each in text:
    if each in frequency:
        frequency[each] += 1
    else:
        frequency[each] = 1
print(frequency)                        # if you want to use a plain dict

counts = list(Counter(text).items())    # if you want to use the collections module
print(counts)

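Note that this only works for data shaped like your text list, i.e. where each item contains exactly one =. For anything else, you will need to adjust it a bit.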
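One possible adjustment for items that contain no "=" or more than one is str.partition, which reports whether the separator was found. This is a sketch under assumptions: the helper name left_key and the sample items are invented for illustration, and skipping items without "=" is just one policy among several.

```python
from collections import Counter

def left_key(item):
    # partition splits on the FIRST "=" only; sep is '' when no "=" exists
    head, sep, _tail = item.partition('=')
    if not sep:
        return None  # assumed policy: ignore items that have no "="
    # Same rule as the answer above: keep the part before the first "."
    return head.split('.')[0]

# Hypothetical sample data with a double "=" and a missing "="
samples = ['a for=apple', 'x=y=z', 'no separator', '1.=one', 'a for=apple']

frequency = Counter(key for key in map(left_key, samples) if key is not None)
print(dict(frequency))
```

Here 'x=y=z' is counted under 'x' and 'no separator' is skipped. If you would rather count such items under their full text, return item instead of None.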

score 0 · Accepted Answer

You can do this without CountVectorizer:

text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two'] 
left_sides = [pair.split('=')[0].replace('.','') for pair in text]
uniques = set(left_sides)
counts = {i:left_sides.count(i) for i in uniques}
print(counts)

This produces:

{'d for': 2, 'b for': 1, '1': 1, 'a for': 2, '2': 1, 'e for': 1}
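For context on why the original attempt printed 0 through 5: vocabulary_ maps each token to its column index in the document-term matrix, not to a count. The sketch below imitates that numbering in plain Python, on the assumption that indices follow the sorted order of the distinct tokens, which is consistent with the output shown in the question. The names index_map and count_map are illustrative only.

```python
text = ['a for=apple', 'b for=ball', 'd for=dog', 'e for=elephant',
        'a for=apple', 'd for=dog', '1.=one', '2.=two']

# The tokens that the pattern "^[^=]+" extracts: everything before "="
tokens = [item.split('=')[0] for item in text]

# Assumed model of vocabulary_: position of each distinct token in sorted order
index_map = {token: i for i, token in enumerate(sorted(set(tokens)))}
print(index_map)

# What the question asks for instead: number of occurrences per token
count_map = {token: tokens.count(token) for token in index_map}
print(count_map)
```

The first dictionary holds the same values as the question's unexpected output; the second holds the counts, still with the "." that the question wants removed.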

score 0 · Accepted Answer

A simple way is to use collections.Counter():

>>> from collections import Counter
>>> text = ['a for=apple','b for=ball', 'd for=dog', 'e for=elephant', 'a for=apple', 'd for=dog', '1.=one', '2.=two']
>>> Counter(x.split('=')[0].replace('.', '') for x in text)
Counter({'a for': 2, 'd for': 2, 'b for': 1, 'e for': 1, '1': 1, '2': 1})

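This first splits each string in text on "=" into a list and takes the first element. It then calls replace() to replace every instance of "." with "". Finally, it returns a Counter() object holding the counts.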

Note: if you want a plain dictionary at the end, you can wrap the last line in dict().
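As a short sketch of that note, the Counter can be converted with dict(), and Counter.most_common() is available if you also want the entries ranked by count. The variable names plain and top_two are made up for illustration.

```python
from collections import Counter

text = ['a for=apple', 'b for=ball', 'd for=dog', 'e for=elephant',
        'a for=apple', 'd for=dog', '1.=one', '2.=two']

counts = Counter(x.split('=')[0].replace('.', '') for x in text)

# Plain dictionary with the same keys and values
plain = dict(counts)
print(plain)

# The two most frequent keys with their counts
top_two = counts.most_common(2)
print(top_two)
```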
