python - Scraping: how to get an attribute from a tag

Question

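I'm using lxml and Python to scrape a page. The link to the page is here. The problem I'm facing is how to get an attribute from a tag. For example, the 3 gold stars at the top of the page have this HTML: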

<abbr title="3" class="average rating large star3">★★★☆☆</abbr>

Here I want to get the title attribute, so that I know how many stars the location received.

I have tried a few things, including:

response = urllib.urlopen('http://www.insiderpages.com/b/3721895833/central-kia-of-irving-irving').read()
mo = re.search(r'<div class="rating_box">.*?</div>', response)
div = html.fromstring(mo.group(0))
title = div.find("abbr").attrib["title"]
print title

But it doesn't work for me. Any help would be appreciated.

score 4 · Accepted Answer

Don't use regular expressions to extract data from HTML. You have lxml, so use its power (XPath).

>>> import lxml.html as html
>>> page = html.parse("http://www.insiderpages.com/b/3721895833/central-kia-of-irving-irving")
>>> print page.xpath("//div[@class='rating_box']/abbr/@title")
['3']
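
For anyone on Python 3, here is a minimal sketch of the same XPath query run against an inline copy of the markup instead of the live page. The PAGE string and the star_rating helper are made up for illustration; the structure (an abbr inside a div with class rating_box) is assumed from the question and the query above, and the real page may differ.

```python
import lxml.html

# Hypothetical stand-in for the relevant part of the page.
PAGE = """
<html><body>
  <div class="rating_box">
    <abbr title="3" class="average rating large star3">★★★☆☆</abbr>
  </div>
</body></html>
"""


def star_rating(markup):
    """Return the abbr title inside the rating box as an int, or None if absent."""
    root = lxml.html.fromstring(markup)
    # An XPath that ends in /@title returns a list of attribute values (strings).
    titles = root.xpath("//div[@class='rating_box']/abbr/@title")
    return int(titles[0]) if titles else None


print(star_rating(PAGE))
```

The XPath result is always a list, so the helper checks for an empty list rather than indexing blindly.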

score 2 · Accepted Answer

Have you tried xpath?

In [38]: from lxml import etree

In [39]: import urllib2

In [40]: html = etree.fromstring(urllib2.urlopen('http://www.insiderpages.com/b/3721895833/central-kia-of-irving-irving').read(), etree.HTMLParser())

In [41]: html.xpath('//abbr')[0].xpath('./@title')
Out[41]: ['3']
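
Once you have the element, there is more than one way to read the attribute. A rough Python 3 sketch, assuming lxml is installed; MARKUP is a hypothetical inline stand-in for the downloaded page:

```python
from lxml import etree

# Hypothetical inline markup, modelled on the snippet in the question.
MARKUP = (
    '<div class="rating_box">'
    '<abbr title="3" class="average rating large star3">★★★☆☆</abbr>'
    '</div>'
)

# HTMLParser tolerates real-world HTML that is not well-formed XML.
parser = etree.HTMLParser()
root = etree.fromstring(MARKUP, parser)

abbr = root.xpath("//abbr")[0]

print(abbr.get("title"))       # get() returns None if the attribute is missing
print(abbr.attrib["title"])    # dict-style access raises KeyError if it is missing
print(abbr.xpath("./@title"))  # XPath returns a list of matching values
```

get() is usually the most forgiving of the three when the attribute may be absent.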
