Say I have strings such as 'Bank of China', 'Embassy of China', and 'International China'.

I want to remove every country name except when it is preceded by 'of ' or 'of the '.

Clearly this can be done by iterating through a list of countries, checking whether the name contains a country, then checking whether 'of ' or 'of the ' appears directly before it.

If one of them does, we keep the country; otherwise we remove it. The examples become:

'Bank of China', or 'Embassy of China', and 'International'

However, iteration can be slow, particularly when you have a large list of countries and a large list of texts to process.

Is there a faster, more condition-based way of doing the replacement, so that I can still use a simple pattern match with the Python re library?

My function is along these lines:

def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name =  re.sub(country + '$', '', name).strip()
                return name
    return name

EDIT: I did find some info here. It describes how to do an "if", but what I really want is: if not 'of ' and if not 'of the ', then replace...

4 Answers

You could compile a couple of sets of regexes and then pass your input through them. Something like:

import re

countries = ['foo', 'bar', 'baz']
takes = [re.compile(r'of\s+(the)?\s*%s$' % (c), re.I) for c in countries]
subs = [re.compile(r'%s$' % (c), re.I) for c in countries]

def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s

print(remove_country('the bank of foo'))
print(remove_country('the bank of the baz'))
print(remove_country('the nation bar'))

''' Output:
    the bank of foo
    the bank of the baz
    the nation
'''

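It doesn't look like anything faster than linear time complexity is possible here. At least this way you avoid recompiling the regexes a million times and improve the constant factor.

Edit: I had a few typos, but the basic idea is sound and it works. I've added an example.

answered 2014-02-13T23:01:36.070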

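I think you could use the approach from "Python: how to determine if a list of words exist in a string" to find any countries mentioned, then do further processing from there.

Something like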

countries = [
    "Afghanistan",
    "Albania",
    "Algeria",
    "Andorra",
    "Angola",
    "Anguilla",
    "Antigua",
    "Arabia",
    "Argentina",
    "Armenia",
    "Aruba",
    "Australia",
    "Austria",
    "Azerbaijan",
    "Bahamas",
    "Bahrain",
    "China",
    "Russia"
    # etc
]

def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string

get_countries = find_words_from_set_in_string(countries)

then

get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")

returns

set(['Argentina', 'China', 'Russia'])

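...which obviously needs more post-processing, but very quickly tells you what you need to look for.

As pointed out in the linked post, you have to be wary of words that end in punctuation. This could be handled by splitting on punctuation as well as whitespace, e.g. re.split(r"[\s,.!?;:'\"]+", s) (a plain s.split(" \t\r\n,.!?;:'\"") would treat the whole argument as one separator string). You may also want to look for adjectival forms, i.e. "Russian", "Chinese", etc.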
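
A minimal sketch of that punctuation handling, assuming single-word country names only (multi-word names such as "New Zealand" would need a different approach). The function name find_countries is made up for illustration; it pulls out runs of word characters with re.findall instead of splitting on whitespace:

```python
import re

def find_countries(text, countries):
    """Return the set of country names that appear as whole tokens in text."""
    wanted = set(countries)
    # \w+ keeps only runs of letters/digits/underscores, so "China," yields "China".
    tokens = re.findall(r"\w+", text)
    return wanted.intersection(tokens)

found = find_countries(
    "The Embassy of China, near Argentina.",
    ["China", "Argentina", "Russia"],
)
```

Here found should contain 'China' and 'Argentina' even though both are followed by punctuation in the input.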

answered 2014-02-13T22:53:20.640

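The re.sub function accepts a function as the replacement; it is called to obtain the text that should be substituted for a given match. So you could do this: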

import re

def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()

regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'

The result might contain some spurious whitespace (in the case above the final strip() is needed). You can modify the regex to:

\s*(of(\sthe)?\s)?(?P<state>({}))

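to capture the spaces before the of or before the country name and avoid spurious spaces in the output.

Note that this solution can handle a whole text, not just strings of the form Something of Country and Something Country. For example: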
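
As a rough sketch of that modified pattern in use: make_regex_loose is a hypothetical name for a variant of make_regex that uses \s* at the front, and the callback here simply returns an empty string for a bare country, which is a simplification of remove_name above rather than the original code.

```python
import re

def make_regex_loose(countries):
    # Same idea as make_regex, but \s* lets a country match with no preceding space.
    states = '|'.join(re.escape(country) for country in countries)
    return re.compile(r'\s*(of(\sthe)?\s)?(?P<state>({}))'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name   # keep "of China" / "of the China" untouched
    return ''         # drop the country together with the whitespace before it

regex = make_regex_loose(['China', 'Italy', 'America'])
cleaned = regex.sub(remove_name, 'Embassy of China, International Italy')
```

With this input, cleaned comes out as 'Embassy of China, International' without an outer strip(). A country at the very start of a string is now matched too, although the space that followed it stays in the output.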

In [38]: regex = make_regex(['China'])
    ...: text = '''This is more complex than just "Embassy of China" and "International China"'''

In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'

Another example usage:

In [33]: countries = [
    ...:     'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
    ...:     'France', 'Italy', 'Australia', 'New Zealand', 'Brazil', 
    ...:     'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
    ...:     'Spain', 'Portugal', 'Argentina', 'San Marino'
    ...: ]

In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'

In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)

In [36]: regex = make_regex(countries)
    ...: result = regex.sub(remove_name, text)

In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'

answered 2014-02-13T23:03:22.247

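Untested:

def removeCountry(name):
    for country in countries:
        # Python's re only allows fixed-width lookbehinds, so use one per prefix.
        pattern = '(?<!of )(?<!of the )' + re.escape(country) + '$'
        name = re.sub(pattern, '', name).strip()
    return name

Using negative lookbehinds, re.sub only matches and replaces when country is not preceded by 'of ' or 'of the '. (A single '(?<!of (the )?)' is rejected by re because its width is not fixed.)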
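
To avoid one re.sub call per country, the same lookbehind idea can be folded into a single precompiled pattern. This is only a sketch: make_remover is a made-up helper, and it relies on two separate fixed-width lookbehinds because Python's re does not accept a variable-width one.

```python
import re

def make_remover(countries):
    # Longest names first, so a longer name is tried before any shorter overlapping one.
    ordered = sorted(countries, key=len, reverse=True)
    names = '|'.join(re.escape(c) for c in ordered)
    # Match a trailing country only if it is not preceded by "of " or "of the ".
    pattern = re.compile(r'(?<!of )(?<!of the )(?:%s)$' % names)

    def remove(name):
        return pattern.sub('', name).strip()
    return remove

remove_country = make_remover(['China', 'Italy', 'New Zealand'])
```

With this, remove_country('International China') gives 'International', while 'Bank of China' and 'Embassy of the China' are returned unchanged. Each input string is scanned once, however long the country list is.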

answered 2014-02-14T01:39:41.500