python - Python Lex-Yacc (PLY): Not recognizing start of line or start of string

Question

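I am very new to PLY and only a little more than a beginner in Python. I am trying to learn it using PLY-3.4 and Python 2.7. Please see the code below. I am trying to create a token QTAG, which is a string made of zero or more whitespace characters, followed by 'Q' or 'q', followed by '.', a positive integer, and one or more whitespace characters. For example, VALID QTAGs are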

"Q.11 "
"  Q.12 "
"q.13     "
'''
   Q.14 
'''

INVALID ones are

"asdf Q.15 "
"Q.  15 "

Here is my code:

import ply.lex as lex

class LqbLexer:
    # List of token names.   This is always required
    tokens = [
        'QTAG',
        'INT'
        ]


    # Regular expression rules for simple tokens

    def t_QTAG(self,t):
        r'^[ \t]*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t


    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore  = ' \t'

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print tok

# test it
q = LqbLexer()
q.build()
#VALID inputs
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test('''
   Q.14 
''')
# INVALID ones are
q.test("asdf Q.15 ")
q.test("Q.  15 ")

The output I get is as follows:

LexToken(QTAG,11,1,0)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,12,1,4)
LexToken(QTAG,13,1,0)
Newline found
Illegal character 'Q'
Illegal character '.'
LexToken(INT,14,2,6)
Newline found
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,7)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,4)

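Please note that only the first and third of the valid inputs are tokenized correctly. I cannot figure out why my other valid inputs are not tokenized properly. In the docstring for t_QTAG:

Replacing '^' with '\A' did not work.
I tried removing '^'. Then all the valid inputs are tokenized, but the second invalid input is tokenized as well.

Thanks in advance for any help!

PS: I joined the google-group ply-hack and tried posting there, but I could not post either directly in the forum or through email. I am not sure whether the group is still active. Prof. Beazley is not responding either. Any ideas?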

score 3 · Accepted Answer

I finally found the answer myself. Posting it so that others may find it useful.

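As @Tadgh rightly pointed out, t_ignore = ' \t' consumes the spaces and tabs, so the regex above for t_QTAG can never match after them, and as a result the second valid input is not tokenized. By reading the PLY documentation carefully, I learned that if the ordering of the token regexes is to be maintained, they must be defined as functions rather than as strings, which is how t_ignore was defined. If strings are used, PLY automatically orders them from longest to shortest regex length and appends them after the functions. I guess t_ignore is special here, in that it is somehow applied before anything else. This part is not clearly documented. The workaround is to define a function for a new token, e.g. t_SPACETAB, after t_QTAG, and simply not return anything from it. With this, all the valid inputs are now tokenized correctly, except the one with triple quotes (the multi-line string containing "Q.14"). Also, as per the specification, the invalid ones are not tokenized.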
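To illustrate why skipping leading blanks defeats a rule anchored with ^, here is a minimal sketch that uses only the re module. It is a simplified model of a lexer that discards ignored characters before trying a rule, not PLY's actual implementation, and the helper name try_qtag is hypothetical.

```python
import re

# Same pattern as the original t_QTAG rule.
QTAG = re.compile(r'^[ \t]*[Qq]\.[0-9]+\s+')

def try_qtag(data, ignore):
    """Skip leading characters found in `ignore`, then try QTAG there.

    Hypothetical helper: a simplified model of a lexer that discards
    ignored characters before attempting any rule.
    """
    pos = 0
    while pos < len(data) and data[pos] in ignore:
        pos += 1
    # Without the multi-line flag, ^ only matches at the real start of
    # the string, even when matching begins at a later offset.
    return pos, QTAG.match(data, pos)

# With ' \t' ignored, matching starts at offset 2, where ^ cannot match.
pos_ignored, m_ignored = try_qtag("  Q.12 ", " \t")
# pos_ignored == 2, m_ignored is None

# With nothing ignored, matching starts at offset 0 and the rule succeeds.
pos_kept, m_kept = try_qtag("  Q.12 ", "")
# pos_kept == 0, m_kept.group() == '  Q.12 '
```

This is consistent with the behavior seen in the question: once the blanks are gone, the lexer is looking at 'Q' at an offset other than 0, so the anchored rule is never a candidate.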

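The multi-line string problem: it turns out that PLY uses the re module internally. In that module, ^ is by default interpreted only at the beginning of the string, not at the beginning of every line. To change that behavior, I need to turn on the multi-line flag, which can be done inside the regex with (?m). So, to correctly handle all the valid and invalid strings in my tests, the correct regex is: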

r'(?m)^\s*[Qq]\.[0-9]+\s+'
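The effect of (?m) can be checked outside PLY with the re module alone. The sketch below is only an illustration with made-up input strings: it compares the pattern with and without the flag when matching starts just after a newline, and confirms that a mid-line 'Q.' is still rejected.

```python
import re

plain = re.compile(r'^\s*[Qq]\.[0-9]+\s+')      # ^ = start of string only
multi = re.compile(r'(?m)^\s*[Qq]\.[0-9]+\s+')  # ^ = start of every line

data = "dfhg\n   Q.15 asda\n"
line_start = data.index("\n") + 1   # offset just after the first newline

# Start matching at the beginning of the second line.
no_flag = plain.match(data, line_start)
with_flag = multi.match(data, line_start)
# no_flag is None; with_flag.group() == '   Q.15 '

# Offset 5 of this string is the 'Q', which is not at the start of a line,
# so even the multi-line pattern refuses to match there.
mid_line = multi.match("asdf Q.16 ", 5)
# mid_line is None
```

Writing re.compile(pattern, re.MULTILINE) is equivalent to putting (?m) at the start of the pattern.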

Here is the corrected code with some more tests added:

import ply.lex as lex

class LqbLexer:
    # List of token names.   This is always required

    tokens = [
        'QTAG',
        'INT',
        'SPACETAB'
        ]


    # Regular expression rules for simple tokens

    def t_QTAG(self,t):
        # corrected regex
        r'(?m)^\s*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)    
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    # Instead of t_ignore  = ' \t'
    def t_SPACETAB(self,t):
        r'[ \t]+'
        print "Space(s) and/or tab(s)"

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print tok

# test it
q = LqbLexer()
q.build()
print "-============Testing some VALID inputs===========-"
q.test("Q.11 ")
q.test("  Q.12 ")
q.test("q.13     ")
q.test("""


   Q.14
""")
q.test("""

qewr
dhdhg
dfhg
   Q.15 asda

""")

# INVALID ones are
print "-============Testing some INVALID inputs===========-"
q.test("asdf Q.16 ")
q.test("Q.  17 ")

Here is the output:

-============Testing some VALID inputs===========-
LexToken(QTAG,11,1,0)
LexToken(QTAG,12,1,0)
LexToken(QTAG,13,1,0)
LexToken(QTAG,14,1,0)
Newline found
Illegal character 'q'
Illegal character 'e'
Illegal character 'w'
Illegal character 'r'
Newline found
Illegal character 'd'
Illegal character 'h'
Illegal character 'd'
Illegal character 'h'
Illegal character 'g'
Newline found
Illegal character 'd'
Illegal character 'f'
Illegal character 'h'
Illegal character 'g'
Newline found
LexToken(QTAG,15,6,18)
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'a'
Newline found
-============Testing some INVALID inputs===========-
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,16,8,7)
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
Space(s) and/or tab(s)
LexToken(INT,17,8,4)
Space(s) and/or tab(s)
