在字节对编码算法中,有一个替换步骤,它将由空格分隔的字符串更改为二元组。
即,给定str
这样的元组列表:
[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
还有一个字符串元组:('i', 's')
如何处理列表,使其遍历所有元组键并替换('i', 's')
为('is')
?,即输出Counter
将如下所示:
[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
我试过这个:
>>> cin
[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
>>> [tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]
[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
但是有没有比遍历每个单词更有效的方法,然后将它们更改为字符串以进行替换并再次拆分它们,然后将它们重新转换为元组?
正则表达式替换会更快吗?有没有办法在不处理字符串的情况下处理元组列表?
我已经尝试过了,似乎用替换字符串str.replace
不是问题。它真的在计算二元组并提取它们:
import io
from collections import Counter
import time
infile = 'big.txt' # comes from norvig.com/big.txt
n = 2
with io.open(infile, 'r', encoding='utf8') as fin:
text = fin.read().lower().replace(u' ', u"\uE000")
for j in range(1,6400):
unused_char = unichr(ord(u'\uE001') + j)
start = time.time()
char_bigrams = zip(*[text[i:] for i in range(n)])
bigram_time = time.time() - start
start = time.time()
most_freq_bigram = Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]
max_time = time.time() - start
start = time.time()
text = text.replace(''.join(most_freq_bigram), unused_char)
replace_time = time.time() - start
print j, ''.join(most_freq_bigram), most_freq_bigram, bigram_time, max_time, replace_time
print text
[出去]:
1 th (u't', u'h') 0.896255016327 3.28389787674 0.0253069400787
2 e (u'\ue002', u'e') 1.47053217888 3.16544914246 0.0280749797821
3 in (u'i', u'n') 1.13404297829 3.10529899597 0.0245559215546
4 an (u'a', u'n') 1.20013689995 3.63801002502 0.0242891311646
5 er (u'e', u'r') 1.41387891769 3.13376092911 0.0237591266632
6 on (u'o', u'n') 1.22826981544 3.06997895241 0.0227301120758
7 re (u'r', u'e') 1.21916294098 2.97599196434 0.0238041877747
8 at (u'a', u't') 1.14608097076 2.97988891602 0.0226521492004
9 en (u'e', u'n') 1.20747494698 2.88649988174 0.019054889679
10 ed (u'e', u'd') 1.16296696663 2.8995718956 0.0198271274567
11 is (u'i', u's') 1.17692494392 3.02292394638 0.0228500366211
12 d (u'\ue005', u'd') 1.13779211044 2.85169506073 0.0229239463806
我已经尝试过scikit-learn
CountVectorizer 并且我似乎没有使用的那么快zip
,请参阅Python 中的 Fast/Optimize N-gram implementations
此外,如果没有他们filter
在Counter
步骤中的操作,它会花费更长的时间。Counter 操作每次迭代需要 3 秒 =(
此操作还能如何优化?
Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]