vba - Scraper can't parse the content of the first page

Reposted by: 行者123. Last updated: 2023-12-01 14:24:41

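I've written some code to parse the names, addresses, and phone numbers of different shops from yell.com. Given any search link, my scraper parses the full result set no matter how many pages it spans. The only problem I can find is that it always skips the content of the first page: if there are 10 pages, it scrapes only the last 9. A small tweak is probably all it needs. Here is the complete code. Thanks in advance.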

Sub YellUK()
Const mlink = "https://www.yell.com"
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
Dim post As HTMLHtmlElement, page As Object, newlink As String

With http
    .Open "GET", "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=United+Kingdom&scrambleSeed=1426936001", False
    .send
    html.body.innerHTML = .responseText
End With
Set page = html.getElementsByClassName("row pagination")(0).getElementsByTagName("a")
For i = 0 To page.Length - 2
    newlink = mlink & Replace(page(i).href, "about:", "")
    With http
        .Open "GET", newlink, False
        .send
        htm.body.innerHTML = .responseText
    End With

    For Each post In htm.getElementsByClassName("js-LocalBusiness")
        x = x + 1
        With post.getElementsByClassName("row businessCapsule--title")(0).getElementsByTagName("a")
            If .Length Then Cells(x + 1, 1) = .Item(0).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 1 Then Cells(x + 1, 2) = .Item(1).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 2 Then Cells(x + 1, 3) = .Item(2).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 3 Then Cells(x + 1, 4) = .Item(3).innerText
        End With
        With post.getElementsByClassName("businessCapsule--tel")
            If .Length > 1 Then Cells(x + 1, 5) = .Item(1).innerText
        End With
    Next post
Next i
End Sub

This is the element that holds the page numbers for the following pages:

<div class="row pagination">
<div class="col-sm-24">
&nbsp;<span class="pagination--page is-selected">1</span>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=2" data-tracking="DISPLAY:PAGINATION:NUMBER">2</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=3" data-tracking="DISPLAY:PAGINATION:NUMBER">3</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=4" data-tracking="DISPLAY:PAGINATION:NUMBER">4</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=5" data-tracking="DISPLAY:PAGINATION:NUMBER">5</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=6" data-tracking="DISPLAY:PAGINATION:NUMBER">6</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=7" data-tracking="DISPLAY:PAGINATION:NUMBER">7</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=8" data-tracking="DISPLAY:PAGINATION:NUMBER">8</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=9" data-tracking="DISPLAY:PAGINATION:NUMBER">9</a>
&nbsp;<a class="pagination--page" rel="nofollow" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=10" data-tracking="DISPLAY:PAGINATION:NUMBER">10</a>
&nbsp;<a rel="nofollow" class="pagination--next" href="/ucs/UcsSearchAction.do?location=United+Kingdom&amp;keywords=pizza&amp;scrambleSeed=721890588&amp;pageNum=2" data-tracking="DISPLAY:PAGINATION:NEXT">Next</a>
</div>
</div>

Best answer

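The problem is that the first page is already selected, so the pagination block renders it as a span and contains no anchor for it. The loop only ever visits the anchors, which start at page 2. The solution is to process the first page first, then use the pagination to process the remaining pages.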

Option Explicit

Sub YellUK()
Const mlink = "https://www.yell.com"
Dim http As New MSXML2.XMLHTTP60
Dim html As New HTMLDocument
Dim page As Object, newlink As String

With http
    .Open "GET", "https://www.yell.com/ucs/UcsSearchAction.do?keywords=pizza&location=United+Kingdom&scrambleSeed=1426936001", False
    .send
    html.body.innerHTML = .responseText
End With

Set page = html.getElementsByClassName("row pagination")(0).getElementsByTagName("a")

Dim i As Long, x As Long
' Process the first page first: it is already selected, so 'row pagination' has no 'a' element for it
GetPageData x, html

' Then process the remaining pages through the pagination anchors
Dim html2 As New HTMLDocument
For i = 0 To page.Length - 2
    newlink = mlink & Replace(page(i).href, "about:", "")
    With http
        .Open "GET", newlink, False
        .send
        html2.body.innerHTML = .responseText
    End With
    GetPageData x, html2
Next i
End Sub

Private Sub GetPageData(ByRef x, ByRef html As HTMLDocument)
    Dim post As HTMLHtmlElement
    For Each post In html.getElementsByClassName("js-LocalBusiness")
        x = x + 1
        With post.getElementsByClassName("row businessCapsule--title")(0).getElementsByTagName("a")
            If .Length Then Cells(x + 1, 1) = .Item(0).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 1 Then Cells(x + 1, 2) = .Item(1).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 2 Then Cells(x + 1, 3) = .Item(2).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 3 Then Cells(x + 1, 4) = .Item(3).innerText
        End With
        With post.getElementsByClassName("businessCapsule--tel")
            If .Length > 1 Then Cells(x + 1, 5) = .Item(1).innerText
        End With
    Next post
End Sub
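
A different way to cover every page, not part of the answer above, is to follow the "Next" link instead of the numbered anchors. The sketch below reuses GetPageData and the same library references (Microsoft XML v6.0, Microsoft HTML Object Library). The routine name YellUKFollowNext and the MAX_PAGES cap are made up for illustration, and it assumes the last results page carries no "pagination--next" anchor, which the HTML in the question does not confirm.

```vba
' Sketch only: walk the results by following the "Next" link.
' Assumption: the final page has no element with class "pagination--next".
Sub YellUKFollowNext()
    Const mlink = "https://www.yell.com"
    Const MAX_PAGES As Long = 50        ' hypothetical safety cap against endless loops
    Dim http As New MSXML2.XMLHTTP60
    Dim html As HTMLDocument
    Dim nextLinks As Object
    Dim url As String, pageCount As Long
    Dim x                               ' row counter, Variant to match GetPageData's ByRef x

    url = mlink & "/ucs/UcsSearchAction.do?keywords=pizza&location=United+Kingdom&scrambleSeed=1426936001"

    Do
        http.Open "GET", url, False
        http.send
        Set html = New HTMLDocument
        html.body.innerHTML = http.responseText

        ' The page just fetched is always processed, including the first one
        GetPageData x, html
        pageCount = pageCount + 1

        ' Stop when there is no "Next" link left to follow
        Set nextLinks = html.getElementsByClassName("pagination--next")
        If nextLinks.Length = 0 Then Exit Do
        url = mlink & Replace(nextLinks(0).href, "about:", "")
    Loop While pageCount < MAX_PAGES
End Sub
```

Because every fetched page is processed before the next link is looked up, the first page needs no special case. The trade-off is that the stop condition depends on how the site renders its last page.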

Edit: perhaps something like this. Build the link for the first page when i = -1, then build the links for the following pages as usual.

For i = -1 To page.Length - 2
    If i = -1 Then
        newlink = mlink & Replace(page(i + 1).href, "about:", "")
        newlink = Left(newlink, Len(newlink) - 1) & "1"
    Else
        newlink = mlink & Replace(page(i).href, "about:", "")
    End If
    Debug.Print i & ", " & newlink ' Prints the links for all the pages
    With http
        .Open "GET", newlink, False
        .send
        htm.body.innerHTML = .responseText
    End With
    ' Get page data here ...
Next i
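
The loop above builds the page-1 link by swapping the last character of the page-2 link, which works only while pageNum is the final query parameter and has a single digit. As a hedged alternative, the hypothetical helper WithPageNum below (not part of the original answer) rewrites the pageNum value wherever it sits in the URL. It is a sketch that assumes "pageNum=" appears at most once and is not the tail of a longer parameter name.

```vba
' Hypothetical helper: returns url with its pageNum query parameter set to n.
Private Function WithPageNum(ByVal url As String, ByVal n As Long) As String
    Const PAGE_KEY As String = "pageNum="
    Dim startPos As Long, endPos As Long

    startPos = InStr(1, url, PAGE_KEY, vbTextCompare)
    If startPos = 0 Then
        ' No pageNum yet: append it with the appropriate separator
        WithPageNum = url & IIf(InStr(url, "?") > 0, "&", "?") & PAGE_KEY & n
        Exit Function
    End If

    ' The old value ends at the next "&" or at the end of the string
    endPos = InStr(startPos, url, "&")
    If endPos = 0 Then endPos = Len(url) + 1

    ' Keep everything up to and including "pageNum=", then the new number, then the rest
    WithPageNum = Left$(url, startPos + Len(PAGE_KEY) - 1) & n & Mid$(url, endPos)
End Function
```

Inside the loop, the i = -1 branch could then read: newlink = WithPageNum(mlink & Replace(page(0).href, "about:", ""), 1)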

For "vba - Scraper can't parse the content of the first page", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/44247730/
