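I am trying to write a lexer for an IntelliJ language plugin. The JFlex manual has an example that lexes string literals. In that example, however, a StringBuffer is used to collect each piece of the lexed characters and build up a string incrementally. The problem with this approach is that it creates a copy of the characters being read, and I do not know how to integrate that example with IntelliJ. In IntelliJ, the lexer always returns an IElementType, and the associated text is then obtained through getTokenStart() and getTokenEnd(), so that the start and end of the whole token map directly onto the input string.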

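So I would like to be able to return a token whose associated yytext() spans the entire text since the previous token was returned. For instance, in the string literal example, I would read the \" that marks the start of the literal and switch to the STRING state; when I read \" again, I would switch back to the other state and return the string literal token. At that point I want yytext() to contain the whole string literal.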

Is this possible with JFlex? If not, what is the recommended way to pass the content of the StringBuffer to the IntelliJ API after matching a token that spans multiple actions?

1 Answer

You could write a regular expression that matches the entire string literal, so that you get it in a single yytext() call, but the matched text would still contain the escape sequences in unprocessed form.

From the JFlex Java example:

<STRING> {
  \"                             { yybegin(YYINITIAL); return symbol(STRING_LITERAL, string.toString()); }

  {StringCharacter}+             { string.append( yytext() ); }

  /* escape sequences */
  "\\b"                          { string.append( '\b' ); }
  "\\t"                          { string.append( '\t' ); }
  "\\n"                          { string.append( '\n' ); }
  "\\f"                          { string.append( '\f' ); }
  "\\r"                          { string.append( '\r' ); }
  "\\\""                         { string.append( '\"' ); }
  "\\'"                          { string.append( '\'' ); }
  "\\\\"                         { string.append( '\\' ); }
  \\[0-3]?{OctDigit}?{OctDigit}  { char val = (char) Integer.parseInt(yytext().substring(1),8);
                                           string.append( val ); }

  /* error cases */
  \\.                            { throw new RuntimeException("Illegal escape sequence \""+yytext()+"\""); }
  {LineTerminator}               { throw new RuntimeException("Unterminated string at end of line"); }
}

This code doesn't just match escape sequences like "\\t"; it also turns them into the single character '\t'. You could instead match the whole string with a single expression like this

\" ({StringCharacter} | \\[0-3]?{OctDigit}?{OctDigit} | "\\b" | "\\t" | .. | "\\\\") * \"

but yytext() will then contain the unprocessed two-character sequence matched by "\\t" (a backslash followed by t) instead of the single character '\t'.

If that is acceptable, then that's the easy solution. If the token is supposed to be an actual substring of the input, then it sounds like this is what you want.
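To illustrate what "unprocessed" means here, the following is a rough sketch in plain Java rather than JFlex. It assumes a simplified pattern written for java.util.regex (not JFlex syntax) that accepts any backslash-plus-character pair as an escape, which is looser than the rules in the JFlex example. The class and method names are made up for the demonstration. The point is only that the match is an exact slice of the input, so its start and end offsets are usable directly, and the backslash sequence is left as two characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RawStringLiteralDemo {
    // Simplified stand-in for the single JFlex expression: a quote, then any
    // number of ordinary characters or backslash+character pairs, then a quote.
    static final Pattern STRING_LITERAL =
            Pattern.compile("\"(?:[^\"\\\\\\r\\n]|\\\\.)*\"");

    /** Returns {start, end} offsets of the first string literal, or null if none. */
    static int[] firstLiteralRange(CharSequence input) {
        Matcher m = STRING_LITERAL.matcher(input);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        // The input text is: x = "a\tb";  (backslash and t are two separate characters)
        String input = "x = \"a\\tb\";";
        int[] range = firstLiteralRange(input);
        String raw = input.substring(range[0], range[1]);
        System.out.println("offsets: " + range[0] + ".." + range[1]);
        System.out.println("raw token text: " + raw);
        System.out.println("contains a real tab: " + raw.contains("\t"));
    }
}
```

Because the token text is just a range of the input, nothing is copied during lexing; any unescaping would have to happen later, when the literal's value is actually needed.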

If it's not, you'll need something more complicated: for instance, an intermediate accessor function, separate from yytext(), that returns the StringBuffer content when the last match was a string match (tracked by a flag you could set in the string action) and otherwise returns yytext().
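A minimal sketch of that flag-based accessor, written as a standalone Java class so it can be run without JFlex. Everything here is hypothetical: the class name, the method names, and the idea of calling them from lexer actions are assumptions for illustration, not part of JFlex or the IntelliJ API. In a real lexer these members would live in the %{ ... %} user-code section and be called from the rule actions.

```java
public class TokenTextSketch {
    // Buffer filled by the STRING-state actions (hypothetical helper, not JFlex API).
    private final StringBuilder string = new StringBuilder();
    // Set when the most recent token was a string literal.
    private boolean lastMatchWasString = false;
    // Stand-in for what yytext() would return for the most recent ordinary match.
    private String rawText = "";

    /** Called from ordinary rule actions, passing yytext(). */
    void onOrdinaryMatch(String yytext) {
        rawText = yytext;
        lastMatchWasString = false;
    }

    /** Called when the opening quote is matched. */
    void beginString() {
        string.setLength(0);
    }

    /** Called from the STRING-state actions with already-unescaped text. */
    void appendToString(CharSequence piece) {
        string.append(piece);
    }

    /** Called when the closing quote is matched. */
    void endString() {
        lastMatchWasString = true;
    }

    /** The intermediate accessor: processed buffer for strings, raw text otherwise. */
    String tokenText() {
        return lastMatchWasString ? string.toString() : rawText;
    }

    public static void main(String[] args) {
        TokenTextSketch lexer = new TokenTextSketch();
        lexer.beginString();
        lexer.appendToString("a");
        lexer.appendToString("\t"); // the action for "\\t" appends a real tab
        lexer.appendToString("b");
        lexer.endString();
        System.out.println(lexer.tokenText().length()); // length of the processed text

        lexer.onOrdinaryMatch("foo");
        System.out.println(lexer.tokenText());
    }
}
```

Note that this still copies characters into the buffer, so it does not address the copying concern; it only gives callers one place to ask for the token's text.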

answered 2015-04-16T20:26:02.440