python - WebCrawler, only a few items have a discounted price - IndexError

Question

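I am new to programming and am trying to build my first small web crawler in Python.

Goal: crawl a product listing page, scrape the brand name, article name, original price, and new price, and save them to a CSV file.

Status: I have managed to get the brand name, article name, and original price, and to put them into lists in the correct order (e.g. 10 products). Since every item has a brand name, a description, and a price, my code writes them to the CSV in the correct order.

Code: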

    import bs4 
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    myUrl = 'https://www.zalando.de/rucksaecke-herren/'

    #open connection, grabbing page, saving in page_html and closing connection 
    uClient = uReq(myUrl)
    page_html = uClient.read()
    uClient.close()

    #Datatype, html paser
    page_soup = soup(page_html, "html.parser")

    #grabbing information
    brand_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn"})
    articale_Names = page_soup.findAll ("div",{"class": "z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn"})
    original_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_originalPrice-2Oy4G"})
    new_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_promotionalPrice-3GRE7"})

    #opening a csv file and printing its header
    filename = "XXX.csv"
    file = open(filename, "w")
    headers = "BRAND, ARTICALE NAME, OLD PRICE, NEW PRICE\n"
    file.write(headers)

    #How many brands on page?
    products_on_page = len(brand_Names)

    #Looping through all brands, atricles, prices and writing the text into the CSV 
    for i in range(products_on_page): 
            brand = brand_Names[i].text
            articale_Name = articale_Names[i].text
            price = original_Prices[i].text
            new_Price = new_Prices[i].text
            file.write(brand + "," + articale_Name + "," + price.replace(",",".") + new_Price.replace(",",".") +"\n")

    #closing CSV
    file.close()

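Problem: I am struggling to get the discounted price into the right place in my CSV. Not every item has a discount, and I currently see two problems with my code:

I use .findAll to find the information on the site. Because there are fewer discounted products than products in total, my new_Prices list contains fewer prices (e.g. 3 prices for 10 products). If I were able to add them to the CSV, I assume they would show up in the first 3 rows. How can I make sure each new_Price is added to the right product?
I get an "IndexError: list index out of range" error, which I assume is caused by the fact that I am looping over 10 products, but for new_Prices I reach the end sooner than for my other lists. Does that make sense, and is my assumption correct?

I would really appreciate any help.

Thanks,

Thorsten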

score 0 · Accepted Answer

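Because some items don't have a 'div.z-nvg-cognac_promotionalPrice-3GRE7' tag, you can't reliably use list indexing.
However, you can select all the container tags ('div.z-nvg-cognac_infoContainer-MvytX') and use find to select the tags within each item.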

    from urllib.request import urlopen
    from bs4 import BeautifulSoup as soup
    import csv

    my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
    # Fetch the page; the with block closes the connection afterwards
    with urlopen(my_url) as client:
        page_html = client.read().decode(errors='ignore')
    page_soup = soup(page_html, "html.parser")

    headers = ["BRAND", "ARTICALE NAME", "OLD PRICE", "NEW PRICE"]
    filename = "test.csv"
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)

        # One container per product, so every lookup below stays within a single item
        items = page_soup.find_all(class_='z-nvg-cognac_infoContainer-MvytX')
        for item in items:
            brand_names = item.find(class_="z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn").text
            articale_names = item.find(class_="z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn").text
            original_prices = item.find(class_="z-nvg-cognac_originalPrice-2Oy4G").text
            # find returns None when the item has no promotional price
            new_prices = item.find(class_="z-nvg-cognac_promotionalPrice-3GRE7")
            if new_prices is not None:
                new_prices = new_prices.text
            # csv.writer writes None as an empty field
            writer.writerow([brand_names, articale_names, original_prices, new_prices])
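
To see the difference without hitting the live site, here is a minimal sketch on a made-up HTML snippet. The markup and the class names `info`, `brand`, `price` and `promo` are assumptions for illustration only, not Zalando's real classes. It shows why the page-wide lists go out of step and how per-container lookups keep each price with its product.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: three products, only the second is discounted.
SAMPLE_HTML = """
<div class="info"><div class="brand">Alpha</div><div class="price">10,00</div></div>
<div class="info"><div class="brand">Beta</div><div class="price">20,00</div><div class="promo">15,00</div></div>
<div class="info"><div class="brand">Gamma</div><div class="price">30,00</div></div>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Page-wide search: the two lists have different lengths, so index i
# does not refer to the same product in both of them.
all_brands = page_soup.find_all("div", class_="brand")
all_promos = page_soup.find_all("div", class_="promo")
print(len(all_brands), len(all_promos))

# Per-container search: each lookup is limited to one product.
rows = []
for item in page_soup.find_all("div", class_="info"):
    brand = item.find("div", class_="brand").text
    price = item.find("div", class_="price").text
    promo_tag = item.find("div", class_="promo")  # None if not discounted
    promo = promo_tag.text if promo_tag is not None else ""
    rows.append([brand, price, promo])

for row in rows:
    print(row)
```

Indexing `all_promos[1]` here would raise the same IndexError as in the question, while `rows` always has one entry per product, with an empty string where there is no discount.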

If you want to get more than 24 items per page, you have to use a client that runs JavaScript, such as selenium.

    from selenium import webdriver
    from bs4 import BeautifulSoup as soup
    import csv

    my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
    driver = webdriver.Firefox()
    driver.get(my_url)
    page_html = driver.page_source
    driver.quit()
    page_soup = soup(page_html, "html.parser")
    ...

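Footnotes:
The naming convention for functions and variables is lowercase with underscores.
When reading or writing CSV files, it is best to use the csv module.
When working with files, you can use the with statement.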
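
As a rough illustration of the last two points, the sketch below writes a few rows with the csv module inside with blocks and reads them back. The sample rows, the file name `products.csv` and the helper names `save_rows` and `load_rows` are made up for this example.

```python
import csv
import os
import tempfile

HEADERS = ["BRAND", "ARTICLE NAME", "OLD PRICE", "NEW PRICE"]

# Made-up sample data; note the commas inside some of the fields.
rows = [
    ["Alpha", "Backpack, 20 l", "10,00", ""],
    ["Beta", "Duffel bag", "20,00", "15,00"],
]


def save_rows(path, rows):
    """Write the header and rows; the with block closes the file even on errors."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADERS)
        writer.writerows(rows)


def load_rows(path):
    """Read every row back as a list of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "products.csv")
    save_rows(csv_path, rows)
    loaded = load_rows(csv_path)

print(loaded)
```

Because csv.writer quotes any field that contains a comma, prices such as 10,00 survive the round trip unchanged, so replacing "," with "." before writing should not be needed.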
