
I have a large CSV file in which one row looks like this:

id_85,
{
    "link": "some link",
    "icon": "hello.gif",
    "name": "Wall Photos",
    "comments": {
        "count": 0
    },
    "updated_time": "2012-03-12",
    "object_id": "400",
    "is_published": true,
    "properties": [
        {
            "text": "University",
            "name": "By",
            "href": "some link"
        }
    ],
    "from": {
        "id": "7778",
        "name": "Let"
    },
    "message": "Hello World! :D",
    "id": "id_85",
    "created_time": "2012-03-12",
    "to": {
        "data": [
            {
                "id": "100",
                "name": "March"
            }
        ]
    },
    "message_tags": {
        "0": [
            {
                "id": "100",
                "type": "user",
                "name": "Marcelo",
                "length": 7,
                "offset": 0
            }
        ]
    },
    "type": "photo",
    "caption": "Hello world!"
}

I am trying to extract the JSON part of it, i.e. everything between the first and the last curly brace.

Below is my Python regex code so far:

import re
# Single quotes around the literal, because the JSON itself contains double quotes.
str = 'id_85,{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"} '
m = re.match(r'.*,({.*}$)', str)
if m:
    print(m.group(1))

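In some cases it does not pick up the first and last curly braces, e.g. { ... }. How can I make sure that only the text between the first and last curly brace is captured, and nothing else?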

The desired output looks like this:

{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"}

Thanks!


3 Answers


I believe this works because .* is "greedy" in this case:

import re
str = 'id_85,{"link": "some link", "icon": "hello.gif", "name": "Wall Photos", "comments": {"count": 0}, "updated_time": "2012-03-12", "object_id": "400", "is_published": true, "properties": [{"text": "University", "name": "By", "href": "some link"}], "from": {"id": "777", "name": "Let"}, "message": "Hello World! :D", "id": "id_85", "created_time": "2012-03-12", "to": {"data": [{"id": "100", "name": "March"}]}, "message_tags": {"0": [{"id": "100", "type": "user", "name": "March", "length": 7, "offset": 0}]}, "type": "photo", "caption": "Hello world!"} '
m = re.search('({.*})', str)
if m:
    print(m.group(0))

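If your CSV has other JSON strings in it, this may consume too many } characters, i.e. it will be too greedy, because the final } in the pattern matches the last occurrence of } in str.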
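To illustrate that caveat, here is a minimal sketch. The `line` value is made up for illustration and is not from the question; it holds two JSON objects in one row. The greedy pattern runs from the first { to the last }, while the non-greedy variant `.*?` stops at the first }, which is too early for nested objects:

```python
import re

# Hypothetical row containing two JSON objects (illustration only).
line = 'id_1,{"a": {"b": 1}},{"c": 2}'

# Greedy: from the first '{' to the LAST '}' in the string.
greedy = re.search(r'\{.*\}', line).group(0)

# Non-greedy: from the first '{' to the FIRST '}' that follows it.
lazy = re.search(r'\{.*?\}', line).group(0)

print(greedy)
print(lazy)
```

Neither result is a single valid JSON object here, so a regex alone may not be enough when a row can contain more than one object.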

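Note that the notation re.search(r'someregex', string), i.e. putting an r in front of your regex, is called "raw string notation". It is generally used when you want backslashes to reach the regex engine literally instead of being processed as Python string escapes. E.g. r'\n' is the two characters \ and n, while '\n' is a single newline character.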
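A small sketch of what raw string notation changes in practice. The sample text 'a cat sat' is made up for illustration. The difference is most visible with \b, which is a backspace character in a normal string but a word boundary once it reaches the regex engine:

```python
import re

# As plain Python strings, the two literals differ in length.
raw_len = len(r'\n')    # backslash followed by 'n'
plain_len = len('\n')   # one newline character

# r'\b' reaches the regex engine as a word-boundary assertion.
raw_hit = re.search(r'\bcat\b', 'a cat sat')

# '\b' is turned into a backspace character before re ever sees it,
# so this pattern looks for backspace + 'cat' + backspace.
plain_hit = re.search('\bcat\b', 'a cat sat')

print(raw_len, plain_len, raw_hit is not None, plain_hit is not None)
```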

answered 2014-07-21T00:13:53.323

Assuming (as originally posted) that each line of the CSV contains one JSON element, then

re.match(r'^[^{]*({.*})[^}]*$',str).group(1)

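should do the trick. That is: discard everything that is not a { until the first one is found, then capture everything that follows into a group, up to the last }, i.e. a } after which no other } occurs.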
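As a rough check, the pattern can be run against a shortened, hypothetical version of the row from the question (trailing space included) and the captured group handed to json.loads to confirm it parses. The names `line`, `payload` and `data` are placeholders chosen for this sketch:

```python
import json
import re

# Shortened, hypothetical version of the row from the question;
# note the trailing space after the closing brace.
line = 'id_85,{"name": "Wall Photos", "comments": {"count": 0}, "is_published": true} '

m = re.match(r'^[^{]*({.*})[^}]*$', line)
payload = m.group(1) if m else None

# Parsing the capture is a cheap way to verify it really is valid JSON.
data = json.loads(payload) if payload is not None else None
print(data)
```

If the JSON value spans several physical lines, as in the pretty-printed sample, the dot will not cross newlines unless re.DOTALL is passed as a flag.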

answered 2014-07-21T01:53:21.647

This will match the entire JSON part after the first comma. Not sure if this is what you want. An example of the desired output would help.

re.match(r'[^,]*,(.*)', s).group(1)
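Because this approach only splits at the first comma, the same idea can be sketched without a regex by using str.partition instead. This is a swapped-in alternative, not the code above; the shortened `line` value is hypothetical, and it assumes the id column itself never contains a comma:

```python
import json

# Hypothetical, shortened row: id column, comma, then the JSON payload.
line = 'id_85,{"id": "id_85", "type": "photo"} '

# partition splits once, at the first comma only, so commas inside
# the JSON payload are left untouched.
row_id, _, rest = line.partition(',')

# strip() drops the trailing whitespace before parsing.
data = json.loads(rest.strip())
print(row_id, data)
```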
answered 2014-07-20T23:58:45.243