python - Parsing href attribute value from element with Beautifulsoup and Mechanize

Question

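Can anyone help me traverse an HTML tree with Beautiful Soup?

I'm trying to parse HTML output and, after gathering each value, insert it into a table named Tld with python/django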

<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>

and parse only the value of the href attribute of the <a>, so only this part:

https://billing.anapp.com/

of:

<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>

I currently have:

for url in urls:
    mb.open(url)
    beautifulSoupObj = BeautifulSoup(mb.response().read())
    beautifulSoupObj.find_all('h3',attrs={'class': 'r'})

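The problem with the above is that find_all doesn't reach far enough down to get to the <a> element.

Any help is much appreciated. Thank you.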

score 14 · Accepted Answer

from bs4 import BeautifulSoup

html = """
<div class="rc" data-hveid="53">
<h3 class="r">
<a href="https://billing.anapp.com/" onmousedown="return rwt(this,'','','','2','AFQjCNGqpb38ftdxRdYvKwOsUv5EOJAlpQ','m3fly0i1VLOK9NJkV55hAQ','0CDYQFjAB','','',event)">Billing: Portal Home</a>
</h3>
"""

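# Name the parser explicitly so the result does not depend on which parsers are installed
bs = BeautifulSoup(html, "html.parser")
# Select every <a> nested inside an <h3 class="r">
elms = bs.select("h3.r a")
for i in elms:
    print(i.attrs["href"])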

Prints:

https://billing.anapp.com/

h3.r a is a CSS selector.

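You can use CSS selectors (I prefer them), XPath (through lxml, since Beautiful Soup itself does not support XPath), or find within elements. The selector h3.r a looks for every h3 with class r and gets the a elements inside them. It could be a more complicated selector, such as #an_id table tr.the_tr_class td.the_td_class, which finds every td with the given class that belongs to a tr with the given class, inside a table, inside the element with the given id.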
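To make the longer selector concrete, here is a minimal sketch. The markup is hypothetical and invented purely for illustration; only the id and class names come from the selector above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <td> satisfies every part of the selector.
html = """
<div id="an_id">
  <table>
    <tr class="the_tr_class">
      <td class="the_td_class">match</td>
      <td class="other">skipped: wrong td class</td>
    </tr>
    <tr class="other_tr">
      <td class="the_td_class">skipped: wrong tr class</td>
    </tr>
  </table>
</div>
<table>
  <tr class="the_tr_class"><td class="the_td_class">skipped: outside #an_id</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Each space in the selector means "descendant of", read left to right.
cells = soup.select("#an_id table tr.the_tr_class td.the_td_class")
texts = [td.get_text() for td in cells]
print(texts)
```

Only the cell that sits under the matching tr, inside a table, inside the element with id an_id is returned; the other three cells each fail one part of the selector.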

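The following will give you the same result. find_all returns a list of bs4.element.Tag objects. find_all also has a recursive argument; I'm not sure whether this can be done in one line. Personally I prefer CSS selectors because they are simple and clean.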

for elm in bs.find_all('h3', attrs={'class': 'r'}):
    for a_elm in elm.find_all("a"):
        print(a_elm.attrs["href"])
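As for doing it in one line, a list comprehension is one possible way. The sketch below is not part of the original answer. It assumes a trimmed-down copy of the question's markup plus a hypothetical second anchor without an href, to show how anchors lacking the attribute can be skipped instead of raising a KeyError.

```python
from bs4 import BeautifulSoup

# First <h3> is trimmed from the question; the second one is hypothetical.
html = """
<h3 class="r"><a href="https://billing.anapp.com/">Billing: Portal Home</a></h3>
<h3 class="r"><a name="no-href">anchor without an href</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector version: a[href] matches only anchors that carry an href attribute.
hrefs = [a["href"] for a in soup.select("h3.r a[href]")]

# find_all version: class_ filters on the CSS class, href=True requires the attribute.
hrefs_find = [
    a["href"]
    for h3 in soup.find_all("h3", class_="r")
    for a in h3.find_all("a", href=True)
]

print(hrefs)
print(hrefs_find)
```

Both lists contain the same single URL. In the question's loop, the same comprehension could be applied to the soup built from each mechanize response.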
