python - Extracting text from an HTML file using Python

Question

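I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.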

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

score 14 · Accepted Answer

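Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates character references (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-html inverse converter.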

"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    Entities and char refs are decoded; the contents of script and style
    elements are dropped.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags
    and converting confusing chars into html entities. Newlines are left as-is.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
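The listing above relies on Python 2 module names (HTMLParser, htmlentitydefs, unichr). As a rough sketch of how the same HTML-to-text idea might look on Python 3, assuming the standard library's html.parser is acceptable: the class and function names below are made up for illustration, and the parser is asked to decode entities itself instead of using separate entity handlers.

```python
import re
from html.parser import HTMLParser


class HTMLToText(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True lets the parser decode &amp;, &#39; and
        # similar references before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self._buf = []
        self._hidden = False

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._hidden = True
        elif tag in ('p', 'br') and not self._hidden:
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag in ('script', 'style'):
            self._hidden = False
        elif tag == 'p' and not self._hidden:
            self._buf.append('\n')

    def handle_data(self, data):
        if not self._hidden:
            # Collapse runs of whitespace inside a text node.
            self._buf.append(re.sub(r'\s+', ' ', data))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))


def html_to_text_py3(html):
    """Hypothetical Python 3 counterpart of html_to_text() above."""
    parser = HTMLToText()
    parser.feed(html)
    parser.close()
    return parser.get_text()


print(html_to_text_py3("<p>Tom &amp; Jerry&#39;s</p><script>var x = 1;</script>"))
```

The self-closing form <br/> needs no extra handler here, because the default handle_startendtag calls handle_starttag and handle_endtag in turn.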

score 8 · Accepted Answer

You can also use the html2text method from the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram

score 7 · Accepted Answer

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

score 7 · Accepted Answer

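I know there are a lot of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task, extracting the text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.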

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

score 6 · Accepted Answer

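PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of the use of PyParsing (example link). One reason for investing a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and it is not that hard to deal with the entity issues: you can convert them before you run BeautifulSoup.

Good luck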
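As a rough illustration of converting entities by hand: in Python 3 the standard library's html.unescape decodes both named entities and numeric character references. This is only a sketch with a made-up sample string. Unescaping before parsing has a catch, since &lt; and &gt; turn into literal angle brackets that a parser may then treat as markup, so it can be safer to unescape the extracted text afterwards.

```python
import html

# Hypothetical sample containing a numeric reference and two named entities.
raw = "It&#39;s fish &amp; chips&nbsp;time"

decoded = html.unescape(raw)
# &nbsp; becomes U+00A0 (no-break space); swap it for a plain space if desired.
plain = decoded.replace("\xa0", " ")
print(plain)
```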

score 5 · Accepted Answer

If you need more speed and less accuracy, then you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

score 4 · Accepted Answer

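Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console: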

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try:
    p.feed('hello<br>there')
    p.close()  # calling close is not usually needed, but let's play it safe
except HTMLParseError:
    print ':('  # the html is badly malformed (or you found a bug)

score 4 · Accepted Answer

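This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like: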

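import subprocess
import tempfile

# Write the HTML to a temporary file so that links can read it
with tempfile.NamedTemporaryFile('w', suffix='.html') as f:
    f.write(html_source)
    f.flush()
    proc = subprocess.Popen(['links', '-dump', f.name],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    text = proc.communicate()[0]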

score 4 · Accepted Answer

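I recommend a Python package called goose-extractor. Goose will try to extract the following information:

Main text of an article
Main image of the article
Any YouTube/Vimeo movies embedded in the article
Meta description
Meta tags

More: https://pypi.python.org/pypi/goose-extractor/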

score 4 · Accepted Answer

Install html2text using

pip install html2text

then,

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
Hello, world!

score 3 · Accepted Answer

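Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text: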

import copy
import re

import BeautifulSoup
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

score 3 · Accepted Answer

Another non-Python solution: Libre Office:

soffice --headless --invisible --convert-to txt input1.html

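The reason I prefer this one over the other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, Libre Office can be used to convert from all sorts of formats...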

score 3 · Accepted Answer

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

answered 2017-01-16T14:10:24.890

score 3 · Accepted Answer

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good

score 3 · Accepted Answer

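I had a similar question and actually used one of the answers with BeautifulSoup. The problem was that it was really slow. I ended up using a library called selectolax. It's pretty limited, but it works for this task. The only issue was that I had to remove unnecessary whitespace manually. However, it seems to work much faster than the BeautifulSoup solutions.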

from selectolax.parser import HTMLParser

def get_text_selectolax(html):
    tree = HTMLParser(html)

    if tree.body is None:
        return None

    for tag in tree.css('script'):
        tag.decompose()
    for tag in tree.css('style'):
        tag.decompose()

    text = tree.body.text(separator='')
    text = " ".join(text.split())  # collapse every run of whitespace into a single space
    return text

score 2 · Accepted Answer

Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.

score 2 · Accepted Answer

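The answer by @PeYoTIL, which uses BeautifulSoup and eliminates style and script content, didn't work for me. I tried it using decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. It is available at this gist with a test doc embedded.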

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

score 2 · Accepted Answer

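I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, so the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run or deploy in a Docker container, and from there can be accessed via Python bindings.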

score 1 · Accepted Answer

In Python 3.x you can do it in a very easy way by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers to it.

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

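Now you can print the body variable and it will be in plain text format :) If it is good enough for you, it would be nice to select it as the accepted answer.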
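Note that this approach applies to e-mail messages that carry a text/plain alternative; it does not convert HTML itself. A self-contained sketch using only the standard library's email package, with a message built in memory instead of fetched over IMAP (the message contents and the helper name plain_body are invented for illustration):

```python
from email.message import EmailMessage

# Build a hypothetical multipart/alternative message: plain text plus HTML.
msg = EmailMessage()
msg.set_content("Hello in plain text")
msg.add_alternative("<p>Hello in <b>HTML</b></p>", subtype="html")


def plain_body(email_msg):
    """Return the first text/plain part as str, or None if there is none."""
    for part in email_msg.walk():
        if part.get_content_type() == "text/plain":
            charset = part.get_content_charset() or "utf-8"
            # decode=True undoes the transfer encoding (base64, quoted-printable)
            return part.get_payload(decode=True).decode(charset)
    return None


print(plain_body(msg))
```

If a message only has a text/html part, the helper returns None, and one of the HTML-to-text approaches from the other answers would still be needed.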

score 1 · Accepted Answer

In a simple way

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds all parts of html_text that start with '<' and end with '>' and replaces each of them with an empty string
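A small sketch of what that substitution does and does not handle, using a made-up sample string. The helper name strip_tags is hypothetical, and re.DOTALL is added here (it is not in the snippet above) so that tags spanning several lines are matched too. Script bodies and entities survive the tag removal, which is the weakness the question was worried about.

```python
import html
import re


def strip_tags(html_text):
    # Remove everything between '<' and '>' (non-greedy), across newlines.
    return re.sub(r'<.*?>', '', html_text, flags=re.DOTALL)


sample = "<p>Tom &amp; Jerry</p><script>var x = 1;</script>"

stripped = strip_tags(sample)
# The tags are gone, but the script body and the raw entity remain.
decoded = html.unescape(stripped)
# Entities can be decoded afterwards; the JavaScript text is still there.
print(stripped)
print(decoded)
```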

score 1 · Accepted Answer

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

score 1 · Accepted Answer

You can extract just the text from HTML with BeautifulSoup

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

score 1 · Accepted Answer

While a lot of people have mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

hello world
I love you

Here's a snippet I came up with; you can customize it to your specific needs, and it works like a charm

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

score 1 · Accepted Answer

Another example using BeautifulSoup4 in Python 2.7.9+

Includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract() 
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

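Explained:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break it into lines and remove leading and trailing space on each, then break multi-headlines into a line each with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join, drop blank lines, and finally return the result encoded as UTF-8.

Notes:

Some systems this is run on will fail with https:// connections because of an SSL issue; you can turn off verification to work around it. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
Python < 2.7.9 may have some issues running this
text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.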
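The whitespace clean-up step described above can be tried on its own, without any download or parsing. A minimal sketch, assuming the text has already been extracted; the function name tidy_lines and the sample string are invented for illustration:

```python
def tidy_lines(text):
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split each line on double spaces so run-together headlines separate.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop empty chunks and put the remaining ones on one line each.
    return '\n'.join(chunk for chunk in chunks if chunk)


raw = "  Headline one  Headline two \n\n\n   Body text.  \n"
print(tidy_lines(raw))
```

With this input, the two headlines end up on separate lines and the blank lines disappear.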

score 0 · Accepted Answer

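The LibreOffice Writer comment has merit, since the application can employ Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off implementation, rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.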

score 0 · Accepted Answer

The Perl way (sorry mom, I'll never do it in production).

import re

def html2text(html):
    res = re.sub('<.*?>', ' ', html, flags=re.DOTALL | re.MULTILINE)
    res = re.sub('\n+', '\n', res)
    res = re.sub('\r+', '', res)
    res = re.sub('[\t ]+', ' ', res)
    res = re.sub('\t+', '\t', res)
    res = re.sub('(\n )+', '\n ', res)
    return res

score 0 · Accepted Answer

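None of the methods here work well with some websites. Paragraphs generated by JS code are resistant to all of the above. Here is what eventually worked for me, inspired by this answer and this.

The idea is to load the page in webdriver and scroll to the end of the page, letting the JS do its thing to generate/load the rest of the page. Then insert keystroke commands to select all and copy/paste the entire page: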

import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pyperclip
import time

driver = webdriver.Chrome()
driver.get("https://www.lazada.com.ph/products/nike-womens-revolution-5-running-shoes-black-i1262506154-s4552606107.html?spm=a2o4l.seller.list.3.6f5d7b6cHO8G2Y&mp=1&freeshipping=1")

# Scroll down to end of the page to let all javascript code load its content
lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
match=False
while(match==False):
        lastCount = lenOfPage
        time.sleep(1)
        lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        if lastCount==lenOfPage:
            match=True

# copy from the webpage
element = driver.find_element("tag name", "body")  # find_element_by_tag_name was removed in Selenium 4.3
element.send_keys(Keys.CONTROL,'a')
element.send_keys(Keys.CONTROL,'c')
alltext = pyperclip.paste()
alltext = alltext.replace("\n", " ").replace("\r", " ")  # cleaning the copied text
print(alltext )

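It is slow. But none of the others worked.

UPDATE: A better method is to load the source of the page AFTER scrolling to the end, using the inscriptis library: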

from inscriptis import get_text
text = get_text(driver.page_source)

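This still does not work with a headless driver (the page somehow detects that it is not really being displayed, and scrolling to the end does not make the JS code load its content), but at least we no longer need the crazy copy/paste, which hinders us from running multiple scripts on a machine with a shared clipboard.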

score 0 · Accepted Answer

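Answer using Pandas to get table data from HTML.

If you want to extract table data quickly from HTML, you can use the read_html function; the docs are here. Before using this function, you should read the gotchas/issues surrounding the BeautifulSoup4/html5lib/lxml HTML parsing libraries.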

import pandas as pd

http = r'https://www.ibm.com/docs/en/cmofz/10.1.0?topic=SSQHWE_10.1.0/com.ibm.ondemand.mp.doc/arsa0257.htm'
table = pd.read_html(http)
df = table[0]
df

Output

There are a number of options that can be played with; see here and here.

score -1 · Accepted Answer

I am achieving it with something like this.

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text
