python - Python regex on wikitext templates

Question

I am trying to use Python to remove line breaks from wikitext templates of the following form:

{{cite web
|title=Testing
|url=Testing
|editor=Testing
}}

Using re.sub, this should become:

{{cite web|title=Testing|url=Testing|editor=Testing}}

I have been trying with Python regular expressions for hours without success. For example, I tried:

while(re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}')):
     textmodif=re.sub(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', r'{cite web\1\3}}', textmodif,re.DOTALL)

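But it does not work as expected (even without the while loop, it does not remove the first line break).

I found this similar question, but it did not help: Regex for MediaWiki wikitext templates. I am quite new to Python, so please don't be too hard on me :-)

Thank you in advance.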

score 1 · Accepted Answer

You need to switch on newline matching for `.`; otherwise it does not match a newline:

re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)

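The text you want to match has several newlines spread throughout it, so matching just one run of consecutive newlines is not enough.

From the re.DOTALL documentation:

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.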
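To make the effect of the flag concrete, here is a minimal sketch on a two-line string. The sample text and the variable names `without_flag` and `with_flag` are illustrative only and do not come from the question.

```python
import re

text = "a\nb"  # illustrative sample: two characters separated by a newline

# Without DOTALL, '.' refuses to match '\n', so the pattern cannot match.
without_flag = re.search(r"a.b", text)

# With DOTALL, '.' matches '\n' as well, so the whole string matches.
with_flag = re.search(r"a.b", text, flags=re.DOTALL)

print(without_flag)
print(with_flag.group(0) == text)
```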

You can remove all newlines within the cite stanza in one go with a single re.sub() call, without a loop:

re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)

This uses a nested regular expression to remove, from the matched text, every run of whitespace that contains at least one newline.
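The same nested-substitution idea could be wrapped in a small helper, sketched below. The names `CITE_WEB`, `NEWLINE_RUN` and `collapse_cite_web` are hypothetical, and the outer pattern is a slight variation on the one above: it anchors on both opening braces and does not require a newline inside the template. It assumes the template contains no nested `{{...}}`, because the non-greedy match stops at the first `}}`.

```python
import re

# Hypothetical helper names, for illustration only.
# Outer pattern: one whole {{cite web ...}} template, possibly spanning lines.
CITE_WEB = re.compile(r'\{\{cite web.*?\}\}', flags=re.DOTALL)
# Inner pattern: a run of whitespace that contains at least one line break.
NEWLINE_RUN = re.compile(r'\s*[\r\n]\s*')


def collapse_cite_web(text):
    """Remove line breaks inside each cite web template, leaving other text alone."""
    return CITE_WEB.sub(lambda m: NEWLINE_RUN.sub('', m.group(0)), text)


sample = "Intro line\n{{cite web\n|title=A\n|url=B\n}}\nOutro line\n"
result = collapse_cite_web(sample)
print(result)
```

Only the matched template is rewritten, so line breaks elsewhere in the page text are preserved.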

Demo:

>>> import re
>>> inputtext = '''\
... {{cite web
... |title=Testing
... |url=Testing
... |editor=Testing
... }}
... '''
>>> re.search(r'\{cite web(.*?)([\r\n]+)(.*?)\}\}', inputtext, flags=re.DOTALL)
<_sre.SRE_Match object at 0x10f335458>
>>> re.sub(r'\{cite web.*?[\r\n]+.*?\}\}', lambda m: re.sub(r'\s*[\r\n]\s*', '', m.group(0)), inputtext, flags=re.DOTALL)
'{{cite web|title=Testing|url=Testing|editor=Testing}}\n'
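One detail that may explain why the attempt in the question had no effect: the fourth positional parameter of re.sub() is count, not flags, so a re.DOTALL passed in that position is treated as a replacement limit and never reaches the regex engine. The sketch below illustrates this on a made-up string; `wrong` and `right` are illustrative names. Recent Python versions may emit a DeprecationWarning for the positional form, which is silenced here only to keep the output clean.

```python
import re
import warnings

text = "a\nb\na\nb"  # illustrative sample, not from the question

with warnings.catch_warnings():
    # Newer Pythons may warn about passing count positionally.
    warnings.simplefilter("ignore", DeprecationWarning)
    # re.DOTALL lands in the count parameter, so '.' still skips '\n'.
    wrong = re.sub(r"a.b", "X", text, re.DOTALL)

# Passing the flag by keyword enables newline matching as intended.
right = re.sub(r"a.b", "X", text, flags=re.DOTALL)

print(repr(wrong))
print(repr(right))
```

Passing `flags=` by keyword, as in the calls above, avoids the mix-up.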
