I'm trying to parse some wikitext. Here is an example of the text I need to parse:

== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
=== other subtopic ===
*text_2
**text dependent on text_2
== other title ==
...

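The structure here is not complicated:
title: I believe there is at least one title in the whole document
subtopic: optional
element: each title/subtopic must have at least one
sub-element: optional, and may repeat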

If sub-elements repeat, I plan to join them with \ln.

What I want to do is parse it into a dictionary with the following structure:

{
"title": "title",
"subtopic": "subtopic",
"main_text": "text_1",
"sub_text": "text dependent on text_1 \ln text_2 dependent on text_1"}

Do you know of a pythonic way, or have any ideas, to parse it into what I want? I would really appreciate your time.

PS: This is the full file I'm trying to parse and extract quotes from: Woody Allen

1 Answer

You say "quotes", but you linked to Wikipedia. Did you mean Wikiquote?

In any case, you can't parse wikitext yourself. You can achieve your goal with the parse API, which is accessible through Python clients.
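
As a rough illustration of calling the parse API without a dedicated client, the sketch below uses only the standard library. The helper names `build_parse_url` and `fetch_parse` and the User-Agent string are made up for this example; `format=json` is added so the response comes back as JSON. Treat it as a starting point rather than a finished client.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikiquote.org/w/api.php"


def build_parse_url(page, prop="sections"):
    """Return a parse-API URL for `page` that asks for JSON output."""
    query = urllib.parse.urlencode(
        {"action": "parse", "page": page, "prop": prop, "format": "json"}
    )
    return f"{API_URL}?{query}"


def fetch_parse(page, prop="sections"):
    """Fetch and decode the parse-API response (performs a network request)."""
    request = urllib.request.Request(
        build_parse_url(page, prop),
        # Hypothetical User-Agent; replace it with one identifying your script.
        headers={"User-Agent": "wikiquote-sections-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `fetch_parse("Woody_Allen")` should return the decoded response as a Python dictionary, with the section list under `["parse"]["sections"]`.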

For example, here is the list of sections (that is, the works quoted) in his Wikiquote article, from https://en.wikiquote.org/w/api.php?action=parse&page=Woody_Allen&prop=sections :

{
    "parse": {
        "title": "Woody Allen",
        "pageid": 80,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes",
                "number": "1",
                "index": "1",
                "fromtitle": "Woody_Allen",
                "byteoffset": 657,
                "anchor": "Quotes"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Getting Even</i> (1971)",
                "number": "1.1",
                "index": "2",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11322,
                "anchor": "Getting_Even_.281971.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "<i>My Philosophy</i>",
                "number": "1.1.1",
                "index": "3",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11471,
                "anchor": "My_Philosophy"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Everything You Always Wanted to Know About Sex* (*But Were Afraid to Ask)</i> (1972)",
                "number": "1.2",
                "index": "4",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11814,
                "anchor": "Everything_You_Always_Wanted_to_Know_About_Sex.2A_.28.2ABut_Were_Afraid_to_Ask.29_.281972.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Sleeper</i> (1973)",
                "number": "1.3",
                "index": "5",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12364,
                "anchor": "Sleeper_.281973.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Love and Death</i> (1975)",
                "number": "1.4",
                "index": "6",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12858,
                "anchor": "Love_and_Death_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Without Feathers</i> (1975)",
                "number": "1.5",
                "index": "7",
                "fromtitle": "Woody_Allen",
                "byteoffset": 14090,
                "anchor": "Without_Feathers_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Annie Hall</i> (1977)",
                "number": "1.6",
                "index": "8",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16485,
                "anchor": "Annie_Hall_.281977.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Side Effects</i> (1980)",
                "number": "1.7",
                "index": "9",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16899,
                "anchor": "Side_Effects_.281980.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "My Apology",
                "number": "1.7.1",
                "index": "10",
                "fromtitle": "Woody_Allen",
                "byteoffset": 17529,
                "anchor": "My_Apology"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Manhattan Murder Mystery</i> (1993)",
                "number": "1.8",
                "index": "11",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18579,
                "anchor": "Manhattan_Murder_Mystery_.281993.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Don't Drink the Water</i> (1994)",
                "number": "1.9",
                "index": "12",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18960,
                "anchor": "Don.27t_Drink_the_Water_.281994.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Deconstructing Harry</i> (1997)",
                "number": "1.10",
                "index": "13",
                "fromtitle": "Woody_Allen",
                "byteoffset": 19228,
                "anchor": "Deconstructing_Harry_.281997.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Standup Comic</i> (1999)",
                "number": "1.11",
                "index": "14",
                "fromtitle": "Woody_Allen",
                "byteoffset": 21289,
                "anchor": "Standup_Comic_.281999.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Mere Anarchy</i> (2007)",
                "number": "1.12",
                "index": "15",
                "fromtitle": "Woody_Allen",
                "byteoffset": 22463,
                "anchor": "Mere_Anarchy_.282007.29"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Attributed",
                "number": "2",
                "index": "16",
                "fromtitle": "Woody_Allen",
                "byteoffset": 24181,
                "anchor": "Attributed"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Others",
                "number": "3",
                "index": "17",
                "fromtitle": "Woody_Allen",
                "byteoffset": 25045,
                "anchor": "Others"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes about Allen",
                "number": "4",
                "index": "18",
                "fromtitle": "Woody_Allen",
                "byteoffset": 27525,
                "anchor": "Quotes_about_Allen"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "External links",
                "number": "5",
                "index": "19",
                "fromtitle": "Woody_Allen",
                "byteoffset": 29106,
                "anchor": "External_links"
            }
        ]
    }
}
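
One possible way to turn a section list like this into the title/subtopic grouping the question asks for is sketched below. It relies only on the `toclevel` and `line` fields shown above; the function name `outline` and the `sample` data are illustrative, and stripping tags with a regular expression is a simplification that assumes titles contain only simple inline markup such as `<i>`.

```python
import re


def outline(sections):
    """Map each top-level section title to the titles nested under it."""
    result = {}
    current = None
    for section in sections:
        # Drop inline tags such as <i>...</i> from the section title.
        title = re.sub(r"<[^>]+>", "", section["line"])
        if section["toclevel"] == 1:
            current = result.setdefault(title, [])
        elif current is not None:
            # Deeper levels (toclevel 2, 3, ...) are flattened into one list.
            current.append(title)
    return result


# A trimmed-down copy of the response above, keeping only the fields used.
sample = [
    {"toclevel": 1, "line": "Quotes"},
    {"toclevel": 2, "line": "<i>Getting Even</i> (1971)"},
    {"toclevel": 3, "line": "<i>My Philosophy</i>"},
    {"toclevel": 1, "line": "Attributed"},
]

print(outline(sample))
```

The quotes themselves are not part of this response; they would need a further request for the content of each section.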
answered 2015-10-11T11:13:55.957