r - Web scraping Yelp restaurant information (error when running a loop to fetch multiple restaurants)

Question

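I am trying to scrape restaurant information from Yelp, such as price range ($$$$), price description, alcohol, phone, website, and health score. The code works for two restaurants, Dirty French and Uncle Boons, but when I use the same code for Legacy Records it starts throwing errors. This is because the XPath I use for alcohol (and for website, which is not shown in the code) is different for Legacy Records than it is for Dirty French and Uncle Boons. In addition, Legacy Records has no price range, yet one still shows up in the output.

Is there a way to loop over different restaurants and get the required information with a single XPath, or to have the XPath adapt to each restaurant automatically? I am collecting data for more than 1,000 restaurants, so changing the code by hand each time is not an option.

Am I heading in the right direction? Is there a better way?

The code below should reproduce the problem on your system.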

library(rvest)  # read_html(), html_node(), html_nodes(), html_text(), html_attr()
library(purrr)  # map_df()

# Assumed base URL: `initial` is not defined in the code as posted
initial <- "https://www.yelp.com/biz/"

actual_name <- data.frame(actual_name = c("dirty-french-new-york",
                                          "uncle-boons-new-york",
                                          "legacy-records-new-york"))

titles <- c()
urls <- c()

# Build one URL per restaurant slug
urls <- paste(initial, actual_name$actual_name, sep = "")

# Scrape each page and bind the per-restaurant rows into one data frame
map_df(urls, function(i){
  url <- read_html(i)

  data.frame(Title = url %>% html_node("title") %>% html_text(),
             HealthScore = url %>% html_node(".health-score-description") %>% html_text(),
             Rating = url %>%
               html_node(xpath = "//*[@id='wrap']/div[2]/div/div[1]/div/div[3]/div[1]/div[2]/div[1]/div[1]/div") %>%
               html_attr("title"),
             Phone = url %>% html_node(".biz-phone") %>% html_text(),
             Price = url %>% html_node(".price-range") %>% html_text(),
             PriceDescription = url %>% html_node(".price-description") %>% html_text(),
             Alcohol = url %>%
               html_nodes(xpath = "//*[@id='wrap']/div[2]/div/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[3]/span[2]/a") %>%
               html_text())
}) -> titles
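
One possible direction, offered as a sketch rather than a confirmed fix: as far as I can tell, `html_nodes()` (plural) returns a zero-length result when the XPath matches nothing, and `data.frame()` cannot combine that with the length-1 columns, whereas `html_node()` (singular) returns a missing node that `html_text()` turns into `NA`. The example below uses two made-up HTML strings instead of live Yelp pages; the markup and the names `pages`, `scrape_one`, and `safe_scrape` are hypothetical and only illustrate the pattern.

```r
library(rvest)
library(purrr)

# Two made-up pages standing in for downloaded restaurant pages (hypothetical markup).
# The second page has no price range, like Legacy Records in the question.
pages <- list(
  "<html><head><title>A</title></head><body><span class='price-range'>$$$$</span><span class='biz-phone'>555-0100</span></body></html>",
  "<html><head><title>B</title></head><body><span class='biz-phone'>555-0101</span></body></html>"
)

# Build exactly one row per page. html_node() (singular) yields a missing node
# when nothing matches, and html_text() returns NA for it, so every column
# always has length 1 and data.frame() does not fail.
scrape_one <- function(html) {
  page <- read_html(html)
  data.frame(
    Title = page %>% html_node("title") %>% html_text(),
    Phone = page %>% html_node(".biz-phone") %>% html_text(),
    Price = page %>% html_node(".price-range") %>% html_text(),
    stringsAsFactors = FALSE
  )
}

# possibly() makes a page that fails to load or parse contribute no rows
# instead of aborting the whole loop.
safe_scrape <- possibly(scrape_one, otherwise = NULL)

results <- map_df(pages, safe_scrape)
print(results)
```

With real pages, `scrape_one` would receive a URL instead of an HTML string, since `read_html()` accepts either. This does not address the position-based XPath for alcohol differing between pages; a selector keyed on a class or on the label text would probably be more stable, but that depends on markup not shown here.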
