I am using re.search to replace URLs in a string with the placeholder "{url_object}". Here is my code:
import re

def url_detector(text):
    # Collect every URL-like substring in the text.
    urls = re.findall(r"(https?\:\/\/[\w\d:#@%\/;$()~_?\+-=\\\.&]*)", text)
    if len(urls) > 0:
        for url in urls:
            # Locate the URL again by using it as a regex pattern.
            span = re.search(url, text).span()
            text = text[:(span[0])] + '{url_object}' + text[span[1]:]
    return text
The texts with URLs that I am using are as follows:
text_list = ["google url is https://www.google.com/. Everyone frequently uses it",
"https://www.google.com/search?q=simple+search&oq=simple+search&aqs=chrome..69i57j0l9.2908j0j7&sourceid=chrome&ie=UTF-8 is the url for simplesearch",
"url for today's news : https://www.google.com/search?q=news+today&sxsrf=ALeKk00r1fVK6JeIaO1bhigZSu8IEGjgQw%3A1617353154494&ei=wtlmYIrXHdXez7sP6v-XwAw&oq=news+today&gs_lcp=Cgdnd3Mtd2l6EAMyCggAELEDEIMBEEMyCAgAELEDEIMBMggIABCxAxCDATIICAAQsQMQgwEyCAgAELEDEIMBMggIABCxAxCDATIICAAQsQMQgwEyCAgAELEDEIMBMggIABCxAxCDATICCAA6BwgAEEcQsAM6BwgAELADEEM6CgguELADEMgDEEM6BAgAEEM6BwgAELEDEEM6BQgAELEDSgUIOBIBMVDUBliRFGCzGGgBcAJ4AIABnAGIAacHkgEDMC43mAEAoAEBqgEHZ3dzLXdpesgBC8ABAQ&sclient=gws-wiz&ved=0ahUKEwiKwP-Blt_vAhVV73MBHer_BcgQ4dUDCA0&uact=5, date = 02/04/2021",
"sample url = https://www.google.com/search?q=sample&sxsrf=ALeKk02uixAiZMyqhMtSZZwbeYefHRutGQ%3A1617353222151&ei=BtpmYKfTCJKD4-EPlLWVeA&oq=sample&gs_lcp=Cgdnd3Mtd2l6EAMyBAgjECcyBQgAELEDMgUIABCxAzIECAAQQzIFCAAQsQMyBAgAEEMyAggAMgUIABCxAzIFCAAQsQMyBQgAELEDOgcIABBHELADOgcIABCwAxBDOgcIABCHAhAUUKERWN4UYMIYaAFwAngAgAGWAYgB9ASSAQMwLjWYAQCgAQGqAQdnd3Mtd2l6yAEKwAEB&sclient=gws-wiz&ved=0ahUKEwin7qCilt_vAhWSwTgGHZRaBQ8Q4dUDCA0&uact=5"]
I ran url_detector on the list above:
for text in text_list:
    print(url_detector(text))
I expected the output to look like this:
google url is {url_object}. Everyone frequently uses it
{url_object} is the url for simplesearch
url for today's news : {url_object}, date = 02/04/2021
sample url = {url_object}
Instead, I got:
google url is {url_object}. Everyone frequently uses it
'NoneType' object has no attribute 'span'
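It seems this happens because of the '?' in the URLs returned by re.findall.
This is probably because re gives '?' its special regex meaning when the URL is reused as a pattern. So I tried replacing '?' with '\?' to make it work, but '?' ends up being replaced with '\\?'. When that pattern is passed to re.search(), it raises an error: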
error: bad escape (end of pattern) at position 29.
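The suspected cause can be reproduced in isolation. The sketch below is only an illustration: the shortened URL and the surrounding sentence are made-up stand-ins for the long URLs above. It shows that a URL reused as a pattern stops matching itself once it contains regex metacharacters, and that re.escape() turns it back into a literal pattern.

```python
import re

# Hypothetical, shortened URL standing in for the long ones in text_list.
url = "https://www.google.com/search?q=news+today"
text = "url for today's news : " + url + ", date = 02/04/2021"

# As a pattern, "?" makes the preceding "h" optional and "+" means
# "one or more s", so the literal "?" and "+" in the text are never matched.
print(re.search(url, text))  # prints None

# re.escape() backslash-escapes the metacharacters so the URL is matched literally.
match = re.search(re.escape(url), text)
print(match.group() == url)  # prints True
```

The first text in the list still works without escaping because its URL contains only ".", which as a metacharacter happens to match a literal dot as well.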
Any ideas on how to solve this? Thanks in advance.
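One possible direction, sketched under assumptions rather than offered as a definitive fix: let re.sub do the replacement in a single pass, so the matched URL never has to be reused as a pattern. The function name replace_urls, the simplified pattern https?://\S+, and the choice to keep a trailing "." or "," outside the placeholder are all assumptions made for this illustration, and the example URLs are hypothetical.

```python
import re

# Assumed, simplified pattern: a scheme followed by any run of non-whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def replace_urls(text, placeholder="{url_object}"):
    """Replace each URL in text with the placeholder (hypothetical helper)."""
    def _substitute(match):
        url = match.group()
        # Keep sentence punctuation that was glued to the end of the URL.
        trailing = url[len(url.rstrip(".,")):]
        return placeholder + trailing
    return URL_PATTERN.sub(_substitute, text)

samples = [
    "google url is https://www.google.com/. Everyone frequently uses it",
    "url for today's news : https://www.google.com/search?q=news+today, date = 02/04/2021",
]
for sample in samples:
    print(replace_urls(sample))
```

If the original findall loop has to stay, wrapping the URL in re.escape() before calling re.search, or using str.replace, which does no regex interpretation, would avoid the same problem.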