108

What is the actual difference between LR, SLR, and LALR parsers? I know that SLR and LALR are types of LR parsers, but what is the actual difference as far as their parsing tables are concerned?

And how do we show whether a grammar is LR, SLR, or LALR? For an LL grammar we just have to show that no cell of the parsing table contains more than one production rule. Are there similar rules for LALR, SLR, and LR?

For example, how can we show that the grammar

S --> Aa | bAc | dc | bda
A --> d

is LALR(1) but not SLR(1)?


EDIT (ybungalobill): I haven't gotten a satisfying answer as to the difference between LALR and LR. So LALR's tables are smaller in size, but it can recognize only a subset of LR grammars. Can someone elaborate on the difference between LALR and LR? LALR(1) and LR(1) will be enough for an answer. Both of them use 1 token of look-ahead and both are table driven! How are they different?


9 Answers

70

SLR, LALR, and LR parsers can all be implemented using exactly the same table-driven machinery.

Fundamentally, the parsing algorithm collects the next input token T and consults the current state S (together with the associated lookahead, GOTO, and reduction tables) to decide what to do (a minimal sketch of this loop appears right after the list):

  • SHIFT: If the current table says SHIFT on token T, the pair (S,T) is pushed onto the parse stack, the state is changed according to what the GOTO table says for the current token (e.g., GOTO(T)), another input token T' is fetched, and the process repeats.
  • REDUCE: Every state has 0, 1, or many reductions that might occur in it. If the parser is LR or LALR, the token is checked against the lookahead sets for all the valid reductions for the state. If the token matches the lookahead set for a reduction by grammar rule G = R1 R2 .. Rn, a stack reduction and shift occur: the semantic action for G is called, the stack is popped n times (once per symbol, from Rn back), the pair (S,G) is pushed, the new state S' is set to GOTO(G), and the cycle repeats with the same token T. If the parser is an SLR parser, there is at most one reduce rule for the state, so the reduction action can be done blindly, without searching to find which reduction applies. It is useful for an SLR parser to know whether there is a reduction or not; this is easy to tell if each state explicitly records the number of reductions associated with it, and that count is needed for the L(AL)R versions in practice anyway.
  • ERROR: If neither SHIFT nor REDUCE is possible, a syntax error is declared.
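A minimal sketch of that driver loop in Python (my illustration, not code from this answer; the table encodings ('shift', s), ('reduce', lhs, n) and 'accept' are assumptions):

def lr_parse(tokens, action, goto):
    # The same loop serves SLR, LALR, and LR parsers; only the contents
    # of the action/goto tables differ between the three.
    # action[(state, terminal)] is ('shift', next_state), ('reduce', lhs, n) or 'accept';
    # goto[(state, nonterminal)] is the state entered after reducing to that nonterminal.
    stack = [0]                       # stack of states; 0 is the start state
    tokens = list(tokens) + ['$']     # '$' marks end of input
    i = 0
    while True:
        state, tok = stack[-1], tokens[i]
        act = action.get((state, tok))
        if act == 'accept':
            return True
        if act is None:
            raise SyntaxError(f'unexpected {tok!r} in state {state}')
        if act[0] == 'shift':
            stack.append(act[1])      # push the successor state, consume the token
            i += 1
        else:                         # ('reduce', lhs, n): pop n states, then GOTO
            _, lhs, n = act
            del stack[len(stack) - n:]
            stack.append(goto[(stack[-1], lhs)])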

So if they all use the same machinery, what is the point?

The purported value of SLR is its simplicity of implementation; you don't have to scan through the possible reductions checking lookahead sets, because there is at most one, and it is the only viable action if the state has no SHIFT exits. Which reduction applies can be attached specifically to the state, so the SLR parsing machinery doesn't have to hunt for it. In practice, L(AL)R parsers handle a usefully larger set of languages, and the extra implementation work is so small that nobody implements SLR except as an academic exercise.

The difference between LALR and LR has to do with the table generator. LR parser generators keep track of all the possible reductions from specific states and their precise lookahead sets; you end up with states in which every reduction is associated with the exact lookahead set determined by its left context. This tends to build rather large sets of states. LALR parser generators are willing to combine states if the GOTO tables and the lookahead sets for reductions are compatible and don't conflict; this produces considerably smaller numbers of states, at the price of not being able to distinguish certain symbol sequences that LR can distinguish. So LR parsers can parse a larger set of languages than LALR parsers, but have much bigger parser tables. In practice, one can find LALR grammars close enough to the target language that the size of the state machine is worth optimizing.
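A standard textbook illustration (my addition, not from this answer) of a grammar that is LR(1) but not LALR(1):

S → a E c | a F d | b F c | b E d
E → e
F → e

The LR(1) construction builds two distinct states with the core {E → e ·, F → e ·}: after reading a e the lookaheads are c for the E reduction and d for the F reduction, and after b e they are swapped. LALR merges the two states because their cores match, giving both reductions the lookahead set {c, d} and hence a reduce/reduce conflict that the full LR(1) tables avoid.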

So: all three use the same machinery. SLR is "simple" in the sense that you can ignore a tiny bit of the machinery, but it just isn't worth the trouble. LR parses a broader set of languages, but the state tables tend to be pretty big. That leaves LALR as the practical choice.

Having said all this, it is worth knowing that GLR parsers can parse any context-free language, using more complicated machinery but exactly the same tables (including the smaller version used by LALR). This means GLR is strictly more powerful than LR, LALR, and SLR; pretty much, if you can write a standard BNF grammar, GLR will parse according to it. The difference in the machinery is that GLR is willing to try multiple parses when there are conflicts between the GOTO table and/or the lookahead sets. (How GLR does this efficiently is sheer genius [not mine], but won't fit in this SO post.)

That fact is enormously useful to me. I build program analyzers and code transformers, and parsers are necessary but "uninteresting"; the interesting work is what you do with the parsed result, so the focus is on the post-parsing work. Using GLR means I can build working grammars relatively easily, compared to hacking a grammar into LALR-usable form. This matters a lot when trying to deal with non-academic languages such as C++ or Fortran, where you literally need thousands of rules to handle the entire language well, and you don't want to spend your life trying to hack the grammar rules to meet the limitations of LALR (or even LR).

As a sort of famous example, C++ is considered extremely hard to parse... by people doing LALR parsing. C++ is straightforward to parse using GLR machinery, with pretty much the rules provided in the back of the C++ reference manual. (I have precisely such a parser, and it handles not only vanilla C++, but also a variety of vendor dialects. This is only possible in practice because we are using a GLR parser, IMHO.)

[EDIT November 2011: we extended the parser to handle all of C++11. GLR made that a lot easier. EDIT August 2014: it now handles all of C++17. Nothing broke or got worse; GLR is still the cat's meow.]

Answered 2010-10-22T04:40:24.820
19

LALR parsers merge similar states within an LR grammar to produce parser state tables that are exactly the same size as those of the equivalent SLR grammar, usually an order of magnitude smaller than pure LR parsing tables. However, for LR grammars that are too complicated to be LALR, these merged states result in parser conflicts, or produce a parser that does not fully recognize the original LR grammar.

BTW, I mention a few things about this in my MLR(k) parsing table algorithm here.

Addendum

The short answer is that the LALR parsing tables are smaller, but the parser machinery is the same. A given LALR grammar will produce much larger parsing tables if all of the LR states are generated, with a lot of redundant (near-identical) states.

The LALR tables are smaller because the similar (redundant) states are merged together, effectively throwing away the context/lookahead info that the separate states encode. The advantage is that you get much smaller parsing tables for the same grammar.

The drawback is that not all LR grammars can be encoded as LALR tables, because more complex grammars have more complicated lookaheads, resulting in two or more states instead of a single merged state.

The main difference is that the algorithm that produces the LR tables carries more information around between the state-to-state transitions, while the LALR algorithm does not. So the LALR algorithm cannot tell whether a given merged state should really be left as two or more separate states.

Answered 2010-10-22T21:16:34.010
12

Yet another answer (YAA).

The parsing algorithms of SLR(1), LALR(1), and LR(1) are identical, as Ira Baxter said; however, the parser tables may be different because of the parser-generation algorithm.

An SLR parser generator creates an LR(0) state machine and computes the lookaheads from the grammar (the FIRST and FOLLOW sets). This is a simplified approach and may report conflicts that do not really exist in the LR(0) state machine.

An LALR parser generator creates an LR(0) state machine and computes the lookaheads from the LR(0) state machine (via the terminal transitions). This is a correct approach, but it occasionally reports conflicts that would not exist in an LR(1) state machine.

A canonical LR parser generator computes an LR(1) state machine, and the lookaheads are already part of that LR(1) state machine. These parser tables can be very large.

A minimal LR parser generator computes an LR(1) state machine, but merges compatible states during the process, and then computes the lookaheads from the minimal LR(1) state machine. These parser tables are the same size as or slightly larger than LALR parser tables, giving the best solution.

LRSTAR 10.0 can generate LALR(1), LR(1), CLR(1) or LR(*) parsers in C++, whatever your grammar needs. See this diagram, which shows the differences among LR parsers.

[Full disclosure: LRSTAR is my product]

Answered 2013-05-15T21:24:06.850
5

The basic difference between the parser tables generated with SLR vs. LR is that reduce actions in SLR tables are based on the Follow set. This can be too coarse an approximation, ultimately causing shift/reduce conflicts.

LR parsers, on the other hand, base reduce decisions only on the set of terminals that can actually follow the non-terminal being reduced in that context. This set of terminals is often a proper subset of the Follow set of that non-terminal, and therefore has less chance of conflicting with shift actions.

For this reason, LR parsers are more powerful. LR parsing tables can be extremely large, however.

An LALR parser starts with the idea of building an LR parsing table, but combines the generated states in a way that significantly reduces the table size. The downside is that a small chance of conflicts is introduced for some grammars that an LR table would otherwise have avoided.

LALR parsers are slightly less powerful than LR parsers, but still more powerful than SLR parsers. YACC and other such parser generators tend to use LALR for this reason.

P.S. For brevity, SLR, LALR, and LR above really mean SLR(1), LALR(1), and LR(1), so one-token lookahead is implied.
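To make the Follow-set computation concrete for the grammar from the question, here is a small sketch (my addition, not part of this answer; it relies on the fact that this grammar has no epsilon productions and no non-terminal directly followed by another non-terminal):

GRAMMAR = {
    'S': [['A', 'a'], ['b', 'A', 'c'], ['d', 'c'], ['b', 'd', 'a']],
    'A': [['d']],
}
NONTERMINALS = set(GRAMMAR)

def follow_sets(grammar, start='S'):
    follow = {nt: set() for nt in grammar}
    follow[start].add('$')               # the end marker follows the start symbol
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in NONTERMINALS:
                        continue
                    if i + 1 < len(rhs):
                        # the next symbol is always a terminal in this grammar;
                        # the general case would use FIRST(next symbol) here
                        add = {rhs[i + 1]}
                    else:
                        add = follow[lhs]   # at the end of a rule, inherit FOLLOW(lhs)
                    if not add <= follow[sym]:
                        follow[sym] |= add
                        changed = True
    return follow

print(follow_sets(GRAMMAR))   # {'S': {'$'}, 'A': {'a', 'c'}}

Since FOLLOW(A) = {a, c}, an SLR(1) parser must offer the reduction A → d on both tokens in every state containing the item A → d ·, which is exactly what collides with the shifts toward d c and b d a. The per-state LALR(1) lookaheads ({a} in one state, {c} in the other) stay out of the way.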

Answered 2011-08-30T03:13:44.627
4

Suppose a parser without a lookahead is happily parsing strings for your grammar.

Using your given example, suppose it comes across the string dc. What does it do? Does it reduce it to S, because dc is a valid string produced by this grammar? Or maybe we were trying to parse bdc, because even that is an acceptable string?

As humans, we know the answer is simple: we just need to remember whether we had just parsed a b or not. But computers are stupid :)

An SLR(1) parser has the additional power over LR(0) to perform a lookahead, but here we know that no amount of lookahead can tell us what to do; instead, we need to look back into our past. Thus the canonical LR parser comes to the rescue: it remembers the past context.

The way it remembers this context is that it disciplines itself: whenever it encounters a b, it starts walking down a path towards reading bdc, as one possibility. So when it sees a d, it knows whether it is already walking that path. Thus a CLR(1) parser can do things an SLR(1) parser cannot!

But now, since we had to define so many paths, the states of the machine get very large!

So we merge similar-looking paths; as expected, this can give rise to problems of confusion. However, we are willing to take that risk in exchange for a reduction in size.

This is your LALR(1) parser.


Now, how do we do it algorithmically?

When you draw the configurating sets for the above language, you will see a shift/reduce conflict in two states. To remove it, you might consider SLR(1), which makes its decisions by looking at the Follow set, but you will observe that it still cannot avoid the conflict. Thus you would draw the configurating sets again, but this time with the restriction that, whenever you compute the closure, the additional productions being added must have strict follow(s). Refer to any textbook for what these follows should be. The two conflicting states are worked out below.
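Concretely, for the grammar from the question (my worked sketch of the analysis described above):

State after "d":           State after "b d":
  S → d · c                  S → b d · a
  A → d ·                    A → d ·

SLR(1): reduce A → d whenever the next token is in FOLLOW(A) = {a, c},
        giving a shift/reduce conflict on c in the left state and on a in the right state.

LALR(1): the lookahead for A → d is {a} in the left state (context S → A a)
         and {c} in the right state (context S → b A c),
         so both conflicts disappear: the grammar is LALR(1) but not SLR(1).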

Answered 2015-09-16T03:48:01.703
3

In addition to the answers above, this diagram demonstrates how different parsers relate:

[diagram: how the different parser classes relate]

Answered 2019-06-19T06:48:07.153
0

Adding on top of the answers above, the difference among the individual parsers in the class of bottom-up LR parsers is whether they result in shift/reduce or reduce/reduce conflicts when generating the parsing tables. The fewer conflicts a construction produces, the more powerful it is (LR(0) < SLR(1) < LALR(1) < CLR(1)).

For example, consider the following expression grammar:

E → E + T
E → T
T → F
T → T * F
F → ( E )
F → id

It's not LR(0), but it is SLR(1). Using the following code, we can construct the LR(0) automaton and build the parsing table (we need to augment the grammar, compute the DFA with the closure operation, and compute the action and goto sets):

from copy import deepcopy
import pandas as pd

def update_items(I, C):
    # merge the item sets in C into I, deduplicating the rules per non-terminal
    if len(I) == 0:
        return C
    for nt in C:
        Int = I.get(nt, [])
        for r in C.get(nt, []):
            if not r in Int:
                Int.append(r)
        I[nt] = Int
    return I

def compute_action_goto(I, I0, sym, NTs):
    # advance the dot past `sym` in every item of I and take the closure:
    # the result is the successor item set GOTO(I, sym)
    I1 = {}
    for NT in I:
        C = {}
        for r in I[NT]:
            r = r.copy()
            ix = r.index('.')
            if ix >= len(r)-1 or r[ix+1] != sym: # the dot is not in front of sym
                continue
            r[ix:ix+2] = r[ix:ix+2][::-1]    # read the next symbol sym
            C = compute_closure(r, I0, NTs)
            cnt = C.get(NT, [])
            if not r in cnt:
                cnt.append(r)
            C[NT] = cnt
        I1 = update_items(I1, C)
    return I1

def construct_LR0_automaton(G, NTs, Ts):
    # get_start_state() and to_str() are helper functions omitted here:
    # the former returns the closure of the augmented start item, the latter
    # serializes an item set so it can be used as a dictionary key
    I0 = get_start_state(G, NTs, Ts)
    I = deepcopy(I0)
    queue = [0]
    states2items = {0: I}               # state id -> item set
    items2states = {str(to_str(I)): 0}  # serialized item set -> state id
    parse_table = {}
    cur = 0
    while len(queue) > 0:
        id = queue.pop(0)
        I = states2items[id]
        # compute the goto set for non-terminals
        for NT in NTs:
            I1 = compute_action_goto(I, I0, NT, NTs) 
            if len(I1) > 0:
                state = str(to_str(I1))
                if not state in items2states:   # a new state: enqueue it
                    cur += 1
                    queue.append(cur)
                    states2items[cur] = I1
                    items2states[state] = cur
                    parse_table[id, NT] = cur
                else:
                    parse_table[id, NT] = items2states[state]
        # compute the actions for terminals similarly
        # ... ... ...

    return states2items, items2states, parse_table

states2items, items2states, parse_table = construct_LR0_automaton(G, NTs, Ts)

where the grammar G and the non-terminal and terminal symbols are defined as below:

G = {}
NTs = ['E', 'T', 'F']
Ts = {'+', '*', '(', ')', 'id'}
G['E'] = [['E', '+', 'T'], ['T']]
G['T'] = [['T', '*', 'F'], ['F']]
G['F'] = [['(', 'E', ')'], ['id']]

Here are a few more useful functions I implemented along with the ones above for LR(0) parsing table generation:

def augment(G, NTs, S): # start symbol S; adds the production S1 -> S $
    G[S + '1'] = [[S, '$']]
    NTs.append(S + '1')
    return G, NTs

def compute_closure(r, G, NTs):
    # collect all the items reachable from item r by repeatedly expanding
    # the non-terminal that appears right after the dot
    S = {}
    queue = [r]
    seen = []
    while len(queue) > 0:
        r = queue.pop(0)
        seen.append(r)
        ix = r.index('.') + 1           # position of the symbol after the dot
        if ix < len(r) and r[ix] in NTs:
            S[r[ix]] = G[r[ix]]
            for rr in G[r[ix]]:
                if not rr in seen:
                    queue.append(rr)
    return S

The following figure shows the LR(0) DFA constructed for the grammar using the above code:

[figure: LR(0) DFA for the expression grammar]

The following table shows the LR(0) parsing table generated as a pandas DataFrame. Notice that there are a couple of shift/reduce conflicts, indicating that the grammar is not LR(0):

[table: LR(0) parsing table with shift/reduce conflicts]

An SLR(1) parser avoids the above shift/reduce conflicts by reducing only if the next input token is a member of the Follow set of the non-terminal being reduced. The following parse table is generated by SLR:

[table: SLR(1) parsing table without conflicts]

The following animation shows how an input expression is parsed by the above SLR(1) parser:

[animation: SLR(1) parse of an input expression]

The grammar from the question is not LR(0) either:

#S --> Aa | bAc | dc | bda
#A --> d    
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c', 'd'}
G['S'] = [['A', 'a'], ['b', 'A', 'c'], ['d', 'c'], ['b', 'd', 'a']]
G['A'] = [['d']]

as can be seen from the next LR(0) DFA and parsing table:

[figure: LR(0) DFA for the grammar from the question]

there is a shift/reduce conflict again:

[table: LR(0) parsing table with a shift/reduce conflict]

But the following grammar, which accepts the strings of the form a^n c b^n, n >= 1, is LR(0):

A → a A b
A → c
S → A

# S --> A 
# A --> a A b | c
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c'}
G['S'] = [['A']]
G['A'] = [['a', 'A', 'b'], ['c']]

[figure: LR(0) DFA for the a^n c b^n grammar]

As can be seen from the following figure, there is no conflict in the parsing table generated.

[table: conflict-free LR(0) parsing table]

Here is how the input string a^2 c b^2 can be parsed using the above LR(0) parse table, with the following code:

def parse(input, parse_table, rules):
    # rules: the numbered productions, each as {'l': lhs, 'r': rhs string},
    # matching the reduce entries ('r1', 'r2', ...) in the parse table (not shown here)
    stack = [0]   # alternating states and symbols, starting in state 0
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    i, accepted = 0, False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':   # shift: push the token and the next state
            stack.append(char)
            stack.append(int(action[-1]))  # assumes single-digit state ids
            i += 1
        elif action[0] == 'r': # reduce by rule number action[-1]
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):      # pop a (symbol, state) pair per RHS symbol
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == r:                  # push the LHS and the GOTO state
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(l)
                stack.append(int(goto[-1]))
        elif action == 'acc':  # accept
            accepted = True
        row = {'stack': ''.join(map(str, stack)), 'input': input[i:], 'action': action}
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
        if accepted:
            break

    return df

parse('aacbb$', parse_table, rules)   # parse the input string a^2 c b^2

The next animation shows how the input string a^2 c b^2 is parsed by the LR(0) parser using the above code:

[animation: LR(0) parse of the input a^2 c b^2]

Answered 2021-11-30T23:02:19.150
-3

One simple answer is that all LALR(1) grammars are LR(1) grammars, but not the other way around. Compared to LALR(1), LR(1) has more states in the associated finite-state machine (often more than double), and that extra context is also why an LR(1) parser can report syntax errors more precisely than an LALR(1) parser. One more important thing to know about these two classes: an LR(1) generator produces no conflicts for an LR(1) grammar, whereas the LALR(1) state merging can introduce reduce/reduce conflicts for that very same grammar.

Answered 2015-08-10T11:44:38.227