- html - 出于某种原因,IE8 对我的 Sass 文件中继承的 html5 CSS 不友好?
- JMeter 在响应断言中使用 span 标签的问题
- html - 在 :hover and :active? 上具有不同效果的 CSS 动画
- html - 相对于居中的 html 内容固定的 CSS 重复背景?
我编写了一些代码来解析来自 yell.com 的不同商店的名称、地址和电话号码。如果为我的爬虫提供了任何链接,它就会解析整个内容,而不管它分布在多少页面上。然而,我能发现的唯一问题是它总是跳过第一页的内容,就像如果有 10 页,我的爬虫抓取最后 9 页。一点点抽搐可能会让我找到解决方法。这是完整的代码。提前致谢。
Sub YellUK()
Const mlink = "https://www.yell.com"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
Dim post As HTMLHtmlElement, page As Object, newlink As String
With http
.Open "GET", "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=United+Kingdom&scrambleSeed=1426936001", False
.send
html.body.innerHTML = .responseText
End With
Set page = html.getElementsByClassName("row pagination")(0).getElementsByTagName("a")
For i = 0 To page.Length - 2
newlink = mlink & Replace(page(i).href, "about:", "")
With http
.Open "GET", newlink, False
.send
htm.body.innerHTML = .responseText
End With
For Each post In htm.getElementsByClassName("js-LocalBusiness")
x = x + 1
With post.getElementsByClassName("row businessCapsule--title")(0).getElementsByTagName("a")
If .Length Then Cells(x + 1, 1) = .Item(0).innerText
End With
With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
If .Length > 1 Then Cells(x + 1, 2) = .Item(1).innerText
End With
With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
If .Length > 2 Then Cells(x + 1, 3) = .Item(2).innerText
End With
With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
If .Length > 3 Then Cells(x + 1, 4) = .Item(3).innerText
End With
With post.getElementsByClassName("businessCapsule--tel")
If .Length > 1 Then Cells(x + 1, 5) = .Item(1).innerText
End With
Next post
Next i
End Sub
这是存储下一页页码的元素:
<div class="row pagination">
<div class="col-sm-24">
<span class="pagination--page is-selected">1</span>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=2" data-tracking="DISPLAY:PAGINATION:NUMBER">2</a>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=3" data-tracking="DISPLAY:PAGINATION:NUMBER">3</a>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=4" data-tracking="DISPLAY:PAGINATION:NUMBER">4</a>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=5" data-tracking="DISPLAY:PAGINATION:NUMBER">5</a>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=6" data-tracking="DISPLAY:PAGINATION:NUMBER">6</a>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=7" data-tracking="DISPLAY:PAGINATION:NUMBER">7</a>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=8" data-tracking="DISPLAY:PAGINATION:NUMBER">8</a>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=9" data-tracking="DISPLAY:PAGINATION:NUMBER">9</a>
<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=10" data-tracking="DISPLAY:PAGINATION:NUMBER">10</a>
<a rel="nofollow" class="pagination--next" href="/ucs/UcsSearchAction.do?location=United+Kingdom&keywords=pizza&scrambleSeed=721890588&pageNum=2" data-tracking="DISPLAY:PAGINATION:NEXT">Next</a>
</div>
</div>
最佳答案
这里的问题是第一页已经被选中,因此它在分页中没有 anchor 。解决方案是先处理第一页,然后使用分页处理其余页面。
Option Explicit
Sub YellUK()
Const mlink = "https://www.yell.com"
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim page As Object, newlink As String
With http
.Open "GET", "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=United+Kingdom&scrambleSeed=1426936001", False
.send
html.body.innerHTML = .responseText
End With
Set page = html.getElementsByClassName("row pagination")(0).getElementsByTagName("a")
Dim i, x
' First page first, is selected already, 'row pagination' doesn't have 'a' for it
GetPageData x, html
' Next pages then
Dim html2 As New HTMLDocument
For i = 0 To page.Length - 2
newlink = mlink & Replace(page(i).href, "about:", "")
With http
.Open "GET", newlink, False
.send
html2.body.innerHTML = .responseText
End With
GetPageData x, html2
Next i
End Sub
Private Sub GetPageData(ByRef x, ByRef html As HTMLDocument)
Dim post As HTMLHtmlElement
For Each post In html.getElementsByClassName("js-LocalBusiness")
x = x + 1
With post.getElementsByClassName("row businessCapsule--title")(0).getElementsByTagName("a")
If .Length Then Cells(x + 1, 1) = .Item(0).innerText
End With
With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
If .Length > 1 Then Cells(x + 1, 2) = .Item(1).innerText
End With
With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
If .Length > 2 Then Cells(x + 1, 3) = .Item(2).innerText
End With
With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
If .Length > 3 Then Cells(x + 1, 4) = .Item(3).innerText
End With
With post.getElementsByClassName("businessCapsule--tel")
If .Length > 1 Then Cells(x + 1, 5) = .Item(1).innerText
End With
Next post
End Sub
编辑:可能是这样的。为 i=-1
创建第一页链接,然后像往常一样创建下一页。
For i = -1 To page.Length - 2
If i = -1 Then
newlink = mlink & Replace(page(i + 1).href, "about:", "")
newlink = Left(newlink, Len(newlink) - 1) & "1"
Else
newlink = mlink & Replace(page(i).href, "about:", "")
End If
Debug.Print i & ", " & newlink ' Prints the links for all the pages
With http
.Open "GET", newlink, False
.send
htm.body.innerHTML = .responseText
End With
' Get page data here ...
Next i
关于vba - Scraper 无法解析第一页的内容,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/44247730/
SCENARIO 有两页,第一页是HomePage,它在flutter_bloc软件包的帮助下自动获取api数据。在首页(第一页)中,还有一个按钮,可在此代码Navigator.push(contex
我检查过类似的问题,但由其他人发布,但我仍然看不到我的代码有什么问题。 我刚刚从文档中复制了它 - https://symfony.com/doc/3.4/page_creation.html Luc
我已经编写了一段代码,使用Python中的Scrapy来抓取页面。下面我粘贴了 main.py 代码。但是,每当我运行我的蜘蛛时,它仅从第一页抓取(DEBUG:从抓取),这也是请求中的Referer标
我创建了一个 ios 图书阅读器应用程序。在这个应用程序中,我集成了 google drive 和 skydrive 。现在我可以从 google drive 和 skydrive 登录和检索数据了。
效果图: 功能简介:可使用上下键选中行,选中后点击修改,textbox获得gridview中的代码的数据。对你有帮助的话,请记得要点击“好文要顶”哦!!!不懂的,请留言。废话不多说了,贴码如下
我是一名优秀的程序员,十分优秀!