python - PLY lexmatch regular expression has different groups than a usual re

Question

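I am using PLY and noticed a strange discrepancy between the match object stored in t.lexer.lexmatch for a token and an SRE_Match created in the usual way with the re module. The group(x) indices seem to be off by 1.

I defined a simple lexer to illustrate the behavior I am seeing: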

import ply.lex as lex

tokens = ('CHAR',)

def t_CHAR(t):
    r'.'
    t.value = t.lexer.lexmatch
    return t

l = lex.lex()

(I get a warning about t_error, but ignore that for now.) Now I feed some input to the lexer and get a token:

l.input('hello')
l.token()

I get a LexToken(CHAR,<_sre.SRE_Match object at 0x100fb1eb8>,1,0). I want to look at the match object:

m = _.value

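So now I look at the groups:

m.group() => 'h', as I expect.

m.group(0) => 'h', as I expect.

m.group(1) => 'h', but I would expect there to be no such group.

Compare this with creating the same regular expression manually: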

import re
p = re.compile(r'.')
m2 = p.match('hello')

This gives different groups:

m2.group() => 'h', as I expect.

m2.group(0) => 'h', as I expect.

m2.group(1) raises IndexError: no such group, as I expect.

Does anyone know why this discrepancy exists?

score 5 · Accepted Answer

In PLY version 3.4, this happens because of how the expression is converted from the docstring into a pattern.

Looking at the source really does help. Line 746 of lex.py:

c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)

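Each rule's docstring is wrapped in a named group, (?P<name>...), so the compiled pattern always has at least one group and group(1) is the whole token. I would not recommend relying on something like this between versions; it is just part of how PLY works.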
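To see the effect without PLY, here is a minimal sketch that uses only the re module and wraps a pattern the same way as the line quoted above. The names fname and doc are stand-ins for a rule function's name and docstring, and self.reflags is left out for simplicity.

```python
import re

# Hypothetical stand-ins for a PLY rule: the function name and its docstring regex.
fname = "t_CHAR"
doc = r"."

# Wrap the rule in a named group, as in the quoted lex.py line
# (reflags omitted here for simplicity).
wrapped = re.compile("(?P<%s>%s)" % (fname, doc), re.VERBOSE)
plain = re.compile(doc)

m = wrapped.match("hello")
m2 = plain.match("hello")

print(m.group(0), m.group(1), m.group("t_CHAR"))  # h h h
print(wrapped.groups, plain.groups)               # 1 0
```

The wrapped pattern has one capturing group and the plain one has none, which is the off-by-one seen in the question.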

score 1 · Answer

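It seems to me that the matching group depends on the position of the token function in the file, as if the groups were actually accumulated across all the declared token regexes:

   def t_MYTOKEN1(t):
      r'matchit(\w+)'
      t.value = t.lexer.lexmatch.group(1)
      return t

   def t_MYTOKEN2(t):
      r'matchit(\w+)'
      t.value = t.lexer.lexmatch.group(2)
      return t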
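A rough sketch of why the numbers shift, using only the re module. It assumes the rules are joined into one alternation, each wrapped in a named group as in the other answer. The rule names, the patterns, and the helper inner_group are made up for illustration and are not part of PLY.

```python
import re

# Hypothetical rules (name, pattern); each pattern has one capturing group of its own.
rules = [
    ("t_WORD", r"w:(\w+)"),
    ("t_NUM", r"n:(\d+)"),
]

# Assumption: rules are joined into one alternation, each wrapped in a named group.
master = re.compile("|".join("(?P<%s>%s)" % (name, pat) for name, pat in rules))

def inner_group(match, k=1):
    """Return the k-th group of the rule that matched, counted from its named wrapper."""
    base = master.groupindex[match.lastgroup]  # absolute index of the wrapper group
    return match.group(base + k)

m = master.match("n:42")
print(m.lastgroup)                  # t_NUM
print(master.groupindex["t_NUM"])   # 3
print(m.group(1), m.group(2))       # None None
print(inner_group(m))               # 42
```

Looking a group up relative to the named wrapper, or simply giving the inner group its own name, avoids hard-coding an absolute index that changes when rules are added or reordered.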
