python - Reading special characters from web page source in Python

Question

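I am reading the page source from a web page and then parsing a value out of that source. That is where I run into a problem with special characters.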

In my Python controller file I use # -*- coding: utf-8 -*-. But the web page source I am reading uses charset=iso-8859-1.

So when I read the page content without specifying any encoding, it throws the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 133: invalid start byte

When I use string.decode("iso-8859-1").encode("utf-8"), it parses the data without any errors. But it displays the value as "F\u00fcnke" instead of "Fünke".

Please let me know how I can solve this issue. I would greatly appreciate any suggestions.

score 0 · Accepted Answer

Encoding is definitely a PITA in Python 3 (and in some cases in Python 2 as well). Try checking these links, they might help you:

Python - Encoding string - Swedish Letters
Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)

http://docs.python.org/2/library/codecs.html

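My best guess regarding "So when I read the page content without specifying any encoding" is that your console doesn't use utf-8 (for instance, on Windows). Your # -*- coding: utf-8 -*- only tells Python what type of characters to expect in the source code itself, not how to decode the actual data that the code is going to parse or analyze. For instance, I write: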

# -*- coding: iso-8859-1 -*-
import time
# Här skriver jag ut tiden (Translation: Here, I print out the time)
# %M is minutes and %S is seconds (%m would be the month, and %s is not portable)
print(time.strftime('%H:%M:%S'))
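To make the distinction concrete, here is a minimal Python 3 sketch of decoding the page bytes with the charset the page declares. The sample bytes and the decode_page helper are made up for illustration; a real program would take the bytes from the HTTP response. The question doesn't say how the value was displayed, so the last part is only an assumption: one common way to end up with "F\u00fcnke" is serializing the value with json.dumps, which escapes non-ASCII characters by default.

```python
import json

# Hypothetical sample: bytes as a page served with charset=iso-8859-1
# would deliver them. 0xfc is "ü" in ISO-8859-1.
raw = b"Name: F\xfcnke"


def decode_page(data, charset="iso-8859-1"):
    """Decode raw page bytes using the charset the page declares."""
    return data.decode(charset)


# Decoding the same bytes as UTF-8 fails, as in the question,
# because 0xfc is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decode once into a text string and keep working with text.
text = decode_page(raw)
name = text.split(": ", 1)[1]

# Assumption: the value is shown via JSON. By default json.dumps
# escapes non-ASCII characters; ensure_ascii=False keeps them readable.
escaped = json.dumps(name)
readable = json.dumps(name, ensure_ascii=False)

print(escaped)  # ASCII-only, safe to print on any console
```

Here name already holds the correct text after the single decode step. Whether it shows up as "Fünke" then depends on how it is output: the console encoding when printing, or ensure_ascii=False if it goes through json.dumps.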
