html - Web scraping Google Scholar with RSelenium

Question

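I'm trying to build an academic network using the information available on Google Scholar. Part of this involves scraping data from the pop-up that appears when you click an article title on an individual scholar's page (I'm not actually sure what kind of window it is; it doesn't seem to be a regular window or an iframe).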

I have been using RSelenium for this task. Below is the code I have developed so far to interact with Google Scholar.

#Libraries----    
library(RSelenium)


#Functions----
#Convenience function for simplifying data generated from .$findElements()
unPack <- function(x, opt = "text"){
  unlist(sapply(x, function(x){x$getElementAttribute(opt)}))  
}


#Analysis----
#Start up the server for Chrome.
rD <- rsDriver(browser = "chrome")
#Start Chrome.
remDr <- rD[["client"]]
#Add a test URL.
siteAdd <- "http://scholar.google.com/citations?user=sc3TX6oAAAAJ&hl=en&oi=ao"
#Open the site.
remDr$navigate(siteAdd)

#Create a list of all the article titles
cite100Elem <- remDr$findElements(using = "css selector", value = "a.gsc_a_at")
cite100 <- unPack(cite100Elem)

#Start scraping the first article. I will create some kind of loop for all
# articles later.
#This opens the pop-up window with additional data I'm interested in.
citeTitle <- cite100[1]
citeElem <- remDr$findElement(using = 'link text', value = citeTitle)
citeElem$clickElement()

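This is where I get stuck. Looking at the underlying page with Chrome's Developer Tools, I can see that the first piece of information I'm interested in, the article's authors, is associated with the following HTML: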

<div class="gsc_vcd_value">TR Moore, NT Roulet, JM Waddington</div>

This suggests that I should be able to do the following:

#Extract all the information about the article.
articleElem <- remDr$findElements(value = '//*[@class="gsc_vcd_title"]')
articleInfo <- unPack(articleElem)

However, this does not seem to work; it returns the value "NULL".
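A few things may be worth checking at this point. The HTML shown above carries the class gsc_vcd_value, whereas the XPath queries gsc_vcd_title; unPack() reads a "text" attribute, which a div may not expose; and the pop-up may not yet be in the DOM immediately after the click. The following is a minimal sketch under those assumptions, not a confirmed fix. `elemText` and `waitForElements` are hypothetical helper names, and `remDr` is assumed to be the client created in the code above.

```r
# Hypothetical helper: read the visible text of each element with
# getElementText() rather than getElementAttribute("text").
elemText <- function(elems) {
  unlist(lapply(elems, function(e) e$getElementText()))
}

# Hypothetical helper: poll until at least one element matches the CSS
# selector, or until `timeout` seconds have passed. Returns the (possibly
# empty) list of matching elements.
waitForElements <- function(remDr, css, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    elems <- remDr$findElements(using = "css selector", value = css)
    if (length(elems) > 0 || Sys.time() >= deadline) {
      return(elems)
    }
    Sys.sleep(interval)
  }
}

# Usage, assuming the pop-up was opened with citeElem$clickElement():
# valueElems  <- waitForElements(remDr, "div.gsc_vcd_value")
# articleInfo <- elemText(valueElems)
```

Polling with findElements() is used here because it returns an empty list when nothing matches, so the loop can simply retry until the timeout.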

I'm hoping someone has an R-based solution, since I know very little about JavaScript.

Finally, if I search the text produced by the following code (which parses the page I am currently on):

htmlOut <- XML::htmlParse(remDr$getPageSource()[[1]])
htmlOut

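I cannot find the CSS class "gsc_vcd_title" anywhere in it, which suggests that the page I'm interested in has a more complicated structure that I haven't fully figured out yet.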
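As a quick diagnostic, it may help to list which class names with a given prefix actually occur in the page source once the pop-up is open. The sketch below uses only base R and a simple regular expression, so it is a rough check rather than a proper HTML parse. `listClasses` is a hypothetical helper name, and the code assumes class attributes are serialized with double quotes.

```r
# Hypothetical helper: return the sorted, unique class names in `pageSource`
# that start with `prefix`.
listClasses <- function(pageSource, prefix = "gsc_vcd") {
  # Extract every class="..." attribute (double-quoted values only).
  attrs <- regmatches(pageSource, gregexpr('class="[^"]*"', pageSource))[[1]]
  # Strip the surrounding class=" and closing quote.
  vals <- sub('"$', "", sub('^class="', "", attrs))
  # Split multi-class values on whitespace and de-duplicate.
  classes <- as.character(unique(unlist(strsplit(vals, "[[:space:]]+"))))
  sort(classes[startsWith(classes, prefix)])
}

# Usage, assuming remDr is the client from the code above and the pop-up
# has already been opened:
# listClasses(remDr$getPageSource()[[1]])
```

If the prefix never shows up, the pop-up content was probably not part of the DOM when the source was captured; if it shows up under a different name, the selector is likely the thing to adjust.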

Any insights would be welcome. Thanks!
