parsing - NOT condition in the NLTK regexp parser

Question

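I need to create a NOT condition as part of my grammar in NLTK's regexp parser. I want to chunk words that have the structure 'Coffee & Tea', but the sequence should not be chunked if it is preceded by a word tagged <IN>. For example, 'in London and Paris' should not be chunked by the parser.

My code is as follows: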

grammar = r'''NP: {(^<IN>)<NNP>+<CC><NN.*>+}'''

I tried the grammar above to solve the problem, but it does not work. Can someone tell me what I am doing wrong?

Example:

import nltk

def parse_sentence(sentence):
    # Tokenize and POS-tag the sentence, then chunk it with the NP grammar
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = r'''NP: {<NNP>+<CC><NN.*>+}'''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)

Result for sentence 1 is:
(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)

Result for sentence 2 is:
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (NP London/NNP and/CC Paris/NNP)
  ?/.)

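As the results for sentence 1 and sentence 2 show, both Coffee & TV and London and Paris are chunked as a group, although I do not want London and Paris to be chunked. One way to achieve this would be to ignore patterns that are preceded by a word with the <IN> POS tag.

In short, I need to know how to add a NOT (negation) condition for a POS tag in the regexp parser's grammar. The standard syntax of '^' followed by the tag definition does not seem to work.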

score 3 · Accepted Answer

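What you need is a "negative lookbehind" expression. Unfortunately, it does not work in the chunk parser, so I suspect that what you want cannot be specified as a chunking regexp.

Here is an ordinary negative lookbehind: match "Paris", but not if it is preceded by "and".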

>>> re.findall(r"(?<!and) Paris", "Search in London and Paris etc.")
[]
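For contrast, here is a small sketch using only the standard `re` module that runs the same pattern against a string where "Paris" is not preceded by "and". The helper name `find_unpreceded` and the second sample string are made up for illustration.

```python
import re

def find_unpreceded(text):
    # " Paris" matches only if the three characters before the space are not "and"
    return re.findall(r"(?<!and) Paris", text)

blocked = find_unpreceded("Search in London and Paris etc.")
allowed = find_unpreceded("Search in London or Paris etc.")

print(blocked)  # []
print(allowed)  # [' Paris']
```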

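Unfortunately, the corresponding lookbehind chunking rule does not work. NLTK's regexp engine rewrites the regexp you pass it so that it can interpret POS tags, and it gets confused by the lookbehind operator. (I'm guessing that the < character in the lookbehind syntax is misinterpreted as a tag delimiter.)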

>>> parser = nltk.RegexpParser(r"NP: {(?<!<IN>)<NNP>+<CC><NN.*>+}")
...
ValueError: Illegal chunk pattern: {(?<!<IN>)<NNP>+<CC><NN.*>+}
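If you want to check a grammar programmatically rather than read the traceback, the following sketch catches the error at construction time. The helper name `try_grammar` is hypothetical, and it assumes (as the traceback above suggests) that `nltk.RegexpParser` validates the grammar in its constructor and raises `ValueError`.

```python
import nltk

def try_grammar(grammar):
    # Return None if the grammar compiles, otherwise the error message
    try:
        nltk.RegexpParser(grammar)
        return None
    except ValueError as err:
        return str(err)

plain = try_grammar(r"NP: {<NNP>+<CC><NN.*>+}")
lookbehind = try_grammar(r"NP: {(?<!<IN>)<NNP>+<CC><NN.*>+}")

print(plain)       # None when the grammar is accepted
print(lookbehind)  # the ValueError message for the rejected rule
```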

score 1 · Accepted Answer

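The NLTK tag chunking documentation is a bit confusing and not easy to find, so it took me a lot of effort to accomplish something similar.

Check the following links:

Following @Luda's answer, I found a simple solution:

1. Chunk what you want: <IN>*<other tags>. This creates chunks that begin with zero or more <IN>-tagged words.
2. Chink <IN><other tags> from the previous chunk expression. This removes every chunk that begins with an <IN>-tagged word. (Note that the asterisk is dropped.)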

Example (based on @Ram G Athreya's question):

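import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

    # First rule chunks (optionally with leading <IN> words);
    # second rule chinks every chunk that starts with an <IN> word
    grammar = r'''
        NP: {<IN>*<NNP>+<CC><NN.*>+}
            }<IN><NNP>+<CC><NN.*>+{
    '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)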

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)


(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  London/NNP
  and/CC
  Paris/NNP
  ?/.)

Now it chunks "Coffee & TV" but does not chunk "London and Paris".
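Because the tags assigned by `nltk.pos_tag` can vary between tagger versions, it can help to check the grammar on hand-tagged tokens. The sketch below does that with the same chunk and chink rules; the helper `np_chunks` and the two shortened token lists are made up for illustration and copy the tags from the output above.

```python
import nltk

grammar = r'''
    NP: {<IN>*<NNP>+<CC><NN.*>+}
        }<IN><NNP>+<CC><NN.*>+{
'''
parser = nltk.RegexpParser(grammar)

def np_chunks(tagged):
    # Collect the (word, tag) leaves of every NP subtree in the parse
    tree = parser.parse(tagged)
    return [st.leaves() for st in tree.subtrees() if st.label() == "NP"]

# Hand-tagged fragments, so no tagger model is needed
coffee = [("wrote", "VBD"), ("Coffee", "NNP"), ("&", "CC"), ("TV", "NN")]
london = [("set", "VBN"), ("in", "IN"), ("London", "NNP"),
          ("and", "CC"), ("Paris", "NNP")]

coffee_chunks = np_chunks(coffee)
london_chunks = np_chunks(london)
print(coffee_chunks)
print(london_chunks)
```

The first fragment should yield a single NP covering Coffee & TV, while the second should yield no NP at all, because the chink rule removes the whole chunk that starts with "in".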

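This approach is also useful for building lookbehind assertions. In a regular expression these are usually written ?<=, but that conflicts with the < and > symbols used in the chunk tag grammar.

So, to build a lookbehind, we can try the following:

1. Chunk what you want, including the leading <IN> tag, followed by the other tags you want. This creates chunks that begin with one or more <IN>-tagged words.
2. Chink the <IN> tag from the previous chunk expression. This removes all <IN>-tagged words from the chunk.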

Example 2 - chunk every word that is preceded by an <IN>-tagged word:

import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

    # Chunk <IN> words plus the following word, then chink the <IN> words
    grammar = r'''
        CHUNK: {<IN>+<.*>}
            }<IN>{
    '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)

(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  (CHUNK the/DT)
  band/NN
  that/WDT
  wrote/VBD
  Coffee/NNP
  &/CC
  TV/NN
  ?/.)
(S
  Who/WP
  of/IN
  (CHUNK those/DT)
  resting/VBG
  in/IN
  (CHUNK Westminster/NNP)
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (CHUNK London/NNP)
  and/CC
  Paris/NNP
  ?/.)

As we can see, it chunked "the" from sentence 1, and "those", "Westminster" and "London" from sentence 2.

score 0 · Accepted Answer

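Section 2.5, "Chinking"

"We can define a chink to be a sequence of tokens that is not included in a chunk"

http://www.nltk.org/book/ch07.html

See the inverted curly braces, which mark what to exclude: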

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""
