machine-learning - AlphaZero：在自我游戏过程中访问了哪些节点？

Question

阅读这篇文章有助于更好地理解 AlphaZero 背后的原理。不过，有些事情我并不完全确定。

下面是作者的UCT_search方法，可以在他在 Github 上的代码中查阅：https
://github.com/plkmo/AlphaZero_Connect4/tree/master/src 在这里，UCTNode.backup()将网络添加value_estimate到所有遍历的节点（另见此“备忘单”）。

def UCT_search(game_state, num_reads,net,temp):
    root = UCTNode(game_state, move=None, parent=DummyNode())
    for i in range(num_reads):
        leaf = root.select_leaf()
        encoded_s = ed.encode_board(leaf.game); encoded_s = encoded_s.transpose(2,0,1)
        encoded_s = torch.from_numpy(encoded_s).float().cuda()
        child_priors, value_estimate = net(encoded_s)
        child_priors = child_priors.detach().cpu().numpy().reshape(-1); value_estimate = value_estimate.item()
        if leaf.game.check_winner() == True or leaf.game.actions() == []: # if somebody won or draw
            leaf.backup(value_estimate); continue
        leaf.expand(child_priors) # need to make sure valid moves
        leaf.backup(value_estimate)
    return root

这种方法似乎只访问与根节点直接相连的节点。
然而，最初的 DeepMind 论文（关于 AlphaGo Zero）说：

每个模拟从根状态开始，并迭代地选择最大化置信上限 Q(s, a) + U(s, a) 的移动，其中 U(s, a) ∝ P(s, a)/(1 + N (s, a))，直到遇到叶节点 s'。

所以相反，我会期待类似的东西：

def UCT_search():
    for i in range(num_reads):
        current_node = root
        while current_node.is_expanded:
            …
            current_node = current_node.select_leaf()
        current_node.backup(value_estimate)

^{(UCTNode.is_expanded是False如果节点还没有被访问过（或者是结束状态，即游戏结束）}

你能解释一下为什么会这样吗？还是我忽略了什么？
提前致谢

score 0 · Accepted Answer

您提到的逻辑在select_leaf()方法内部，它选择最好的叶子，而不仅仅是直接连接的节点

machine-learning - AlphaZero：在自我游戏过程中访问了哪些节点？

1 回答 1

Related

Reference