python - Extracting text with BeautifulSoup

Question

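Unfortunately, I have a series of web pages I'd like to scrape text from, and they all follow different patterns. I'm trying to write a scraper that extracts the text after a <br> tag, since that structure is common to all of the pages.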

As far as I can tell, the pages follow three basic patterns:

As it stands now, I'm using the following loop:

    for br in soup.find_all('br'):
        # Only the node immediately after each <br> is inspected
        text = br.next_sibling

        try:
            print(text.strip().replace("\t", " ").replace("\r", " ").replace("\n", " "))
        except AttributeError:
            print('...')

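While this script works for some pages, for others it grabs only part of the text or none at all. I've been struggling with this for the past few days, so any help would be greatly appreciated.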

I have also tried this technique, but could not get it to work for all pages.

score 1 · Accepted Answer

I would still rely on the style of the underlined span elements. Here is some sample code to get you started (using .next_siblings):

from bs4 import Tag

# `soup` is assumed to be an existing BeautifulSoup object
for span in soup.select('p > span[style*=underline]'):
    texts = []
    for sibling in span.next_siblings:
        # break upon reaching the next span
        if sibling.name == "span":
            break

        # tags expose get_text(); plain strings only need stripping
        text = sibling.get_text(strip=True) if isinstance(sibling, Tag) else sibling.strip()
        if text:
            texts.append(text.replace("\n", " "))

    if texts:
        text = " ".join(texts)
        print(span.text.strip(), text.strip())
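
To try the idea without the original pages, here is a minimal self-contained sketch of the same approach. The markup in SAMPLE_HTML and the function name extract_labelled_text are made up for illustration and are not taken from your pages; the sketch also assumes the built-in "html.parser" backend and collapses runs of whitespace, which goes slightly beyond the snippet above.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup (an assumption, not one of the asker's pages).
SAMPLE_HTML = """
<p>
<span style="text-decoration: underline">Name:</span> Alice<br/>
<span style="text-decoration: underline">Notes:</span> first line<br/>
second <b>bold</b> line
</p>
"""


def extract_labelled_text(html):
    """Return (label, text) pairs, one per underlined span that has text after it."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for span in soup.select('p > span[style*="underline"]'):
        texts = []
        for sibling in span.next_siblings:
            # Stop when the next label span begins.
            if isinstance(sibling, Tag) and sibling.name == "span":
                break
            if isinstance(sibling, Tag):
                # Nested tags such as <b>; a <br> yields an empty string.
                text = sibling.get_text(" ", strip=True)
            else:
                # Plain text node between tags.
                text = sibling.strip()
            if text:
                # Collapse newlines and repeated spaces into single spaces.
                texts.append(" ".join(text.split()))
        if texts:
            results.append((span.get_text(strip=True), " ".join(texts)))
    return results


if __name__ == "__main__":
    for label, text in extract_labelled_text(SAMPLE_HTML):
        print(label, text)
```

With the sample markup this prints "Name: Alice" and "Notes: first line second bold line". Iterating over every sibling, rather than looking only at the single node after each <br>, is what lets the text that continues past a <br> or sits inside a nested tag be picked up.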
