当我第一次写答案时,我什至没有意识到现有答案之一是提问者本身。另一方面,我很少在相当小的 SO lex 社区中找到赏金。因此,在我看来,这似乎值得学习足够的 Java 和 jflex 来编写示例:
/* JFlex scanner: to recognize nested comments in Mathematica style
*/
%%
%{
/* counter for open (nested) comments */
int open = 0;
%}
%state IN_COMMENT
%%
/* any state */
"(*" { if (!open++) yybegin(IN_COMMENT); }
"*)" {
if (open) {
if (!--open) {
yybegin(YYINITIAL);
return MathematicaElementTypes.COMMENT;
}
} else {
/* or return MathematicaElementTypes.BAD_CHARACTER;
/* or: throw new Error("'*)' without '(*'!"); */
}
}
<IN_COMMENT> {
. |
\n { }
}
<<EOF>> {
if (open) {
/* This is obsolete if the scanner is instanced new for
* each invocation.
*/
open = 0; yybegin(IN_COMMENT);
/* Notify about syntax error, e.g. */
throw new Error("Premature end of file! ("
+ open + " open comments not closed.)");
}
return MathematicaElementTypes.EOF; /* just a guess */
}
尽管我尽量小心并尽力而为,但可能会出现拼写错误和愚蠢的错误。
作为“概念证明”,我将使用 flex 和 C/C++ 完成的原始实现留在这里。
这个扫描仪
break
我的解决方案本质上是基于 flex 规则可能以or结尾的事实return
。因此,只有在模式的规则与关闭最外层的注释匹配后,才会返回令牌。评论中的内容只是简单地“记录”在缓冲区中——在我的例子中是一个std::string
. (AFAIKstring
甚至是 Java 中的内置类型。因此,我决定将 C 和 C++ 混合使用,而我通常不会。)
我的来源scan-nested-comments.l
:
%{
#include <cstdio>
#include <string>
// counter for open (nested) comments
static int open = 0;
// buffer for collected comments
static std::string comment;
%}
/* make never interactive (prevent usage of certain C functions) */
%option never-interactive
/* force lexer to process 8 bit ASCIIs (unsigned characters) */
%option 8bit
/* prevent usage of yywrap */
%option noyywrap
%s IN_COMMENT
%%
"(*" {
if (!open++) BEGIN(IN_COMMENT);
comment += "(*";
}
"*)" {
if (open) {
comment += "*)";
if (!--open) {
BEGIN(INITIAL);
printf("EMIT TOKEN COMMENT(lexem: '%s')\n", comment.c_str());
comment.clear();
}
} else {
printf("ERROR: '*)' without '(*'!\n");
}
}
<IN_COMMENT>{
. |
"\n" { comment += *yytext; }
}
<<EOF>> {
if (open) {
printf("ERROR: Premature end of file!\n"
"(%d open comments not closed.)\n", open);
return 1;
}
return 0;
}
%%
int main(int argc, char **argv)
{
if (argc > 1) {
yyin = fopen(argv[1], "r");
if (!yyin) {
printf("Cannot open file '%s'!\n", argv[1]);
return 1;
}
} else yyin = stdin;
return yylex();
}
我在 Windows 10(64 位)的 cygwin 中使用 flex 和 g++ 编译它:
$ flex -oscan-nested-comments.cc scan-nested-comments.l ; g++ -o scan-nested-comments scan-nested-comments.cc
scan-nested-comments.cc:398:0: warning: "yywrap" redefined
^
scan-nested-comments.cc:74:0: note: this is the location of the previous definition
^
$
出现警告的原因是%option noyywrap
。我想这并不意味着任何伤害,可以忽略不计。
现在,我做了一些测试:
$ cat >good-text.txt <<EOF
> Test for nested comments.
> (* a comment *)
> (* a (* nested *) comment *)
> No comment.
> (* a
> (* nested
> (* multiline *)
> *)
> comment *)
> End of file.
> EOF
$ cat good-text | ./scan-nested-comments
Test for nested comments.
EMIT TOKEN COMMENT(lexem: '(* a comment *)')
EMIT TOKEN COMMENT(lexem: '(* a (* nested *) comment *)')
No comment.
EMIT TOKEN COMMENT(lexem: '(* a
(* nested
(* multiline *)
*)
comment *)')
End of file.
$ cat >bad-text-1.txt <<EOF
> Test for wrong comment.
> (* a comment *)
> with wrong nesting *)
> End of file.
> EOF
$ cat >bad-text-1.txt | ./scan-nested-comments
Test for wrong comment.
EMIT TOKEN COMMENT(lexem: '(* a comment *)')
with wrong nesting ERROR: '*)' without '(*'!
End of file.
$ cat >bad-text-2.txt <<EOF
> Test for wrong comment.
> (* a comment
> which is not closed.
> End of file.
> EOF
$ cat >bad-text-2.txt | ./scan-nested-comments
Test for wrong comment.
ERROR: Premature end of file!
(1 open comments not closed.)
$