beautifulsoup - BS4: iterate through table elements with nextSibling or parent?

Question

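I am trying to extract all the text and links from a table, grouped by date. So far I can only get one entry, and even that one is wrong because the link is not named correctly. I think nextSibling might work here, but maybe that is not the right solution.

Here is the HTML: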

<ul class="indented">
  <br>
  <strong>May 15, 2019</strong>
  <ul>
    Sign up for more insight into FERC with our monthly news email, The FERC insight
    <a href="/media/insight.asp">Read More</a>
  </ul>
  <br><br>
  <strong>May 15, 2019</strong>
  <ul>
    FERC To Convene a Technical Conference regarding Columbia Gas Transmission, LLC on July 10, 2019
    <a href="/CalendarFiles/20190515104556-RP19-763-000%20TC.pdf">Notice</a> <img src="/images/icon_pdf.gif" alt="PDF"> | <a href="/EventCalendar/EventDetails.aspx?ID=13414&amp;CalType=%20&amp;CalendarID=116&amp;Date=07/10/2019&amp;View=Listview">Event Details</a>
  </ul>
  <br><br>

Here is my code:

import requests
from bs4 import BeautifulSoup


url1 = ('https://www.ferc.gov/media/headlines.asp')
r = requests.get(url1)
# Create a BeautifulSoup object
soup = BeautifulSoup(r.content, 'lxml')


# Pull headline  text from the ul class indented

headlines = soup.find_all("ul", class_="indented")
headline = headlines[0]

date  = headline.select_one('strong').text.strip()

print(date)

headline_text = headline.select_one('ul').text.strip()
print(headline_text)

headline_link = headline.select_one('ul a')["href"]
headline_link = 'https://www.ferc.gov' + headline_link 
print(headline_link)

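I only get the first date, text, and link because I am using select_one. I need to get all of the links and name them correctly for each date. Would find_next work here, or find_next_sibling?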

score 0 · Accepted Answer

I believe this is what you are looking for; it gets the date, the announcement, and the related links:

[start same as your code; thru soup declaration]

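dates = soup.find_all("strong")
for date in dates:
    # Keep only the <strong> tags whose text contains a 2019 date
    if '2019' in date.text:
        print(date.text)
        # In the markup from the question, the first sibling is a whitespace
        # text node and the second is the <ul> holding the announcement
        entry = date.next_sibling.next_sibling
        print(entry.text)
        for ref in entry.find_all('a'):
            new_link = "https://www.ferc.gov" + ref['href']
            print(new_link)
    print('=============================')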

A random part of the output:

May 15, 2019

            FERC To Convene a Technical Conference regarding Columbia Gas Transmission, LLC on July 10, 2019
Notice   

| Event Details

https://www.ferc.gov/CalendarFiles/20190515104556-RP19-763-000%20TC.pdf
https://www.ferc.gov/EventCalendar/EventDetails.aspx?ID=13414&CalType=%20&CalendarID=116&Date=07/10/2019&View=Listview
=============================
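
The loop above hops two siblings from each `<strong>` because, in the markup shown in the question, the first sibling is a whitespace text node. A possibly more robust variant is sketched below: it asks for the next `<ul>` sibling directly with `find_next_sibling`, and keeps each link's text next to its URL so the links stay named. This is only a sketch run against a trimmed copy of the question's HTML rather than the live page; `extract_headlines`, `BASE_URL`, and the dictionary keys are names made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base for relative links, taken from the URL in the question.
BASE_URL = "https://www.ferc.gov"

# Trimmed copy of the markup from the question (the <img> tag is omitted).
HTML = """
<ul class="indented">
  <br>
  <strong>May 15, 2019</strong>
  <ul>
    Sign up for more insight into FERC with our monthly news email, The FERC insight
    <a href="/media/insight.asp">Read More</a>
  </ul>
  <br><br>
  <strong>May 15, 2019</strong>
  <ul>
    FERC To Convene a Technical Conference regarding Columbia Gas Transmission, LLC on July 10, 2019
    <a href="/CalendarFiles/20190515104556-RP19-763-000%20TC.pdf">Notice</a> | <a href="/EventCalendar/EventDetails.aspx?ID=13414&amp;CalType=%20&amp;CalendarID=116&amp;Date=07/10/2019&amp;View=Listview">Event Details</a>
  </ul>
</ul>
"""


def extract_headlines(html, base_url=BASE_URL):
    """Return one dict per dated entry: date, headline text, named links."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for date_tag in soup.select("ul.indented strong"):
        # Skip whitespace nodes and <br> tags; go straight to the next <ul>.
        body = date_tag.find_next_sibling("ul")
        if body is None:
            continue
        # The first non-empty string inside the <ul> is the announcement text.
        headline = next(body.stripped_strings, "")
        # Pair each link's visible text with its absolute URL.
        links = [
            (a.get_text(strip=True), urljoin(base_url, a["href"]))
            for a in body.find_all("a", href=True)
        ]
        entries.append(
            {"date": date_tag.get_text(strip=True), "headline": headline, "links": links}
        )
    return entries


if __name__ == "__main__":
    for entry in extract_headlines(HTML):
        print(entry["date"])
        print(entry["headline"])
        for name, url in entry["links"]:
            print(f"  {name}: {url}")
        print("=============================")
```

To use it on the real page, pass `r.content` from the question's `requests.get` call instead of `HTML`; whether the live markup still matches this snippet is not something the sketch can guarantee.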
