python - 如何在 Python 中有效地进行字节对编码二元组计数和替换？

Question

在字节对编码算法中，有一个替换步骤，它将由空格分隔的字符串更改为二元组。

即，给定str这样的元组列表：

[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]

还有一个字符串元组：('i', 's')

如何处理列表，使其遍历所有元组键并替换('i', 's')为('is')？，即输出Counter将如下所示：

[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]

我试过这个：

>>> cin
[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
>>> [tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]
[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]

但是有没有比遍历每个单词更有效的方法，然后将它们更改为字符串以进行替换并再次拆分它们，然后将它们重新转换为元组？

正则表达式替换会更快吗？有没有办法在不处理字符串的情况下处理元组列表？

我已经尝试过了，似乎用替换字符串str.replace不是问题。它真的在计算二元组并提取它们：

import io
from collections import Counter

import time

infile = 'big.txt' # comes from norvig.com/big.txt

n = 2
with io.open(infile, 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")
    for j in range(1,6400):
        unused_char = unichr(ord(u'\uE001') + j)

        start = time.time()
        char_bigrams = zip(*[text[i:] for i in range(n)])
        bigram_time = time.time() - start

        start = time.time()
        most_freq_bigram = Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]
        max_time = time.time() - start

        start = time.time()
        text = text.replace(''.join(most_freq_bigram), unused_char)
        replace_time = time.time() - start
        print j, ''.join(most_freq_bigram), most_freq_bigram, bigram_time, max_time, replace_time
    print text

这是在norvig.com/big.txt上测试的

[出去]：

1 th (u't', u'h') 0.896255016327 3.28389787674 0.0253069400787
2 e (u'\ue002', u'e') 1.47053217888 3.16544914246 0.0280749797821
3 in (u'i', u'n') 1.13404297829 3.10529899597 0.0245559215546
4 an (u'a', u'n') 1.20013689995 3.63801002502 0.0242891311646
5 er (u'e', u'r') 1.41387891769 3.13376092911 0.0237591266632
6 on (u'o', u'n') 1.22826981544 3.06997895241 0.0227301120758
7 re (u'r', u'e') 1.21916294098 2.97599196434 0.0238041877747
8 at (u'a', u't') 1.14608097076 2.97988891602 0.0226521492004
9 en (u'e', u'n') 1.20747494698 2.88649988174 0.019054889679
10 ed (u'e', u'd') 1.16296696663 2.8995718956 0.0198271274567
11 is (u'i', u's') 1.17692494392 3.02292394638 0.0228500366211
12 d (u'\ue005', u'd') 1.13779211044 2.85169506073 0.0229239463806

我已经尝试过scikit-learnCountVectorizer 并且我似乎没有使用的那么快zip，请参阅Python 中的 Fast/Optimize N-gram implementations

此外，如果没有他们filter在Counter步骤中的操作，它会花费更长的时间。Counter 操作每次迭代需要 3 秒 =(

此操作还能如何优化？

Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]

score 5 · Accepted Answer

您的原始代码：

[tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]

我会扩展它，以便更容易看到发生了什么

result = []
qtuple = ("i", "s")
for i in cin:
    f = " ".join(qtuple)
    r = "".join(qtuple)
    word = ' '.join(i)
    word = word.replace(f, r)
    word = word.split()
    word = tuple(word)
    result.append(word)
print(result)

寻找可以移出循环的东西。我们可以预先计算替换而不是为每个单词计算它们

find = " ".join(qtuple)
replacement = "".join(qtuple)
result = []
# this will join and split each word once
for i in cin:
    word = " ".join(i)
    # if you had multiple replacements to do, they should be in an inner loop here
    word = word.replace(find, replacement)
    result.append(tuple(word.split(" ")))
print(result)

也许其他人可以谈论 str.replace 与 re.replace 的相对效率。就我个人而言，如果一个简单的替换可以做到这一点，我倾向于避免使用正则表达式，只是为了便于阅读。

通过更改输入的数据结构可以进一步提高效率。如果替换符号是单个字符，那么我们可以使用字符串而不是元组列表，并避免循环内的任何连接。

result = []
replacements = [("\ue000", "X"), ("is", "Z")]
s = "".join(["".join(t) for t in cin])
for f, r in replacements:
    s = s.replace(f,r)
print(s)

# output: thZXcorpusXinXtxtfileXtheXsentenceXbarXandXZXfooXfirstXaX.X

我认为这个问题需要添加一些要求来解释为什么选择的数据结构是有利的。从效率的角度来看，在字节对编码算法的上下文中，字符串对我来说更有意义。

score 4 · Accepted Answer

如果你保持你的字符串元组长度为 2，你可以像这样使用 reduce：

def cons_2(word_list, t):
    j = ''.join(t)
    f = lambda acc, e: acc[:-1] + (j,) if (acc[-1] == t[0] and e == t[1]) else acc + (e,)
    return [reduce(f, i[1:], (i[0],)) for i in word_list]

print cons_2(cin, ('i', 's'))

不涉及替换，f适用于每个元素i，cin不改变的值而是创建并返回一个新数组。

细节：

reduce适用f于每个数组元素i并向累加器返回一个值acc。
减少参数：
- f: 要应用的功能。
- i[1:]：要迭代除第一个元素之外的所有元素的数组。
- (i[0],)：累加器的初始值，它是一个带有输入元组第一个值的元组i。
f: 是一个lambda以累加器acc和当前元素e作为输入的函数：
- 如果累加器的最后一个元素等于字符串元组的第一个元素并且当前元素e等于字符串元组的第二个元素，则返回元组：acc[-1] + (j,)否则继续正常连接：acc + (e,)。

对于字符串元组 > 2 的想法是相同的，但我们必须管理元组的长度l。

def cons_n(word_list, t):
    l = len(t)
    j = ''.join(t)
    f = lambda acc, e: acc[:-l] + (j, e,) if acc[-l:] == t or acc[:l] == t else acc + (e,)
    return [reduce(f, i[l:], (i[:l])) for i in word_list]

print cons_n(cin, ('i', 's'))

这应该适用于 n 长度的字符串元组。

细节：

与上面相同的过程，但使用l: reduce 适用f于其余元素i[l:]，并且累加器的初始值是具有第一个l元素的元组：(i[:l])。
向后和向前检查l等于字符串 tuple 的元素t，如果为 true，则添加元组：acc[:-l] + (j, e,)否则继续正常连接：acc + (e,)。

这是一种功能性方法，没有数据被修改而是生成，因此同时拥有多个进程应该是安全的（理论上，我不是 Python 解释器的专家）。

如果上面的代码对于不熟悉函数式编程的人来说太奇怪了，这是另一种方法：

def cons_n_iter(tuple_list, tuple_seq):
     jnt = ''.join(tuple_seq)
     lnt = len(tuple_seq)
     res = []
     for word in tuple_list:
         acc = (word[:lnt])
         for letter in word[lnt:]:
             if acc[-lnt:] == tuple_seq or acc[:lnt] == tuple_seq:
                 acc = acc[:-lnt] + (jnt, letter,)
             else:
                 acc += (letter,)
         res += (acc,)
     return res

print cons_n_iter(cin, ('i', 's'))

逻辑与功能方法相同，累加器的使用相同。在这种情况下，res累加器是显式的，因为在上面的示例reduce中正在处理它。

score 3 · Accepted Answer

这是你需要的吗？使用重新。

import re,ast
cin = [('t','h',"i",'s', '\ue000'), ('c', 'i', 's', 'p')]
cin = re.sub(r"i'[,\s]+'s", r"is",str(cin))
cin = ast.literal_eval(cin)

python - 如何在 Python 中有效地进行字节对编码二元组计数和替换？

3 回答 3

您的原始代码：

Related

Reference