
html - Web scraping elements by class and tag name

Reposted. Author: 搜寻专家. Updated: 2023-10-31 08:22:41

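I am trying to copy data from the website referenced below. I need the full range of sizes, prices, amenities, specials, and reserve options. I put together the code below, but I am not able to copy the elements correctly. First, only three elements are processed, and they are duplicated. Second, I get no results for Amenities or Reserve. Could someone take a look at this?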

Sub text()

    Dim ie As New InternetExplorer, ws As Worksheet
    Set ws = ThisWorkbook.Worksheets("Unit Data")
    With ie
        .Visible = True
        .Navigate2 "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"

        While .Busy Or .readyState < 4: DoEvents: Wend

        Sheets("Unit Data").Select

        Dim listings As Object, listing As Object, headers(), results()
        Dim r As Long, list As Object, item As Object
        headers = Array("size", "features", "Specials", "Price", "Reserve")
        Set list = .document.getElementsByClassName("units_table")
        '.unit_size medium, .features, .Specials, .price, .Reserve
        Dim rowCount As Long
        rowCount = .document.querySelectorAll(".tab_container li").Length
        ReDim results(1 To rowCount, 1 To UBound(headers) + 1)
        For Each listing In list
            For Each item In listing.getElementsByClassName("unitinfo even")
                r = r + 1

                results(r, 1) = listing.getElementsByClassName("size secondary-color-text")(0).innerText
                results(r, 2) = listing.getElementsByClassName("amenities")(0).innerText
                results(r, 3) = listing.getElementsByClassName("offer1")(0).innerText
                results(r, 4) = listing.getElementsByClassName("rate_text primary-color-text rate_text--clear")(0).innerText
                results(r, 5) = listing.getElementsByClassName("reserve")(0).innerText
            Next
        Next
        ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
        .Quit
    End With

    Worksheets("Unit Data").Range("A:G").Columns.AutoFit
End Sub

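Best answer

tl;dr;
Apologies in advance (to some) for the length of this answer, but I thought I would take this as a teaching moment to detail what is going on.
The overall approach is the same as in your code: find a css selector that isolates the rows (although they sit under different tabs, the small, medium, and large units are all still present on the page):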

Set listings = html.querySelectorAll(".unitinfo")
The above generates the rows. As before, we dump each row into a new HTMLDocument so that we can use the querySelector/querySelectorAll methods.

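Rows:
Let's take a look at the html of the first row we retrieve. The sections that follow use this row as a case study to talk through how the information is retrieved: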

5x5</TD> <TD class=features>
<DIV id=a5x5-1 class="icon a5x5">
<DIV class=img><IMG src="about:/core/resources/images/units/5x5_icon.png"></DIV>
<DIV class=display>
<P>More Information</P></DIV></DIV>
<SCRIPT type=text/javascript>
// Refine Search
//
$(function() {
$("#a5x5-1").tooltip({
track: false,
delay: 0,
showURL: false,
left: 5,
top: 5,
bodyHandler: function () {
return " <div class=\"tooltip\"> <div class=\"tooltop\"></div> <div class=\"toolmid clearfix\"> <div class=\"toolcontent\"> <div style=\"text-align:center;width:100%\"> <img alt=\"5 x 5 storage unit\" src=\"/core/resources/images/units/5x5.png\" /> </div> <div class=\"display\">5 x 5</div> <div class=\"description\">Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle.</div> </div> <div class=\"clearfix\"></div> </div> <div class=\"toolfoot\"></div> <div class=\"clearfix\"></div> </div> "}
});
});
</SCRIPT>
</TD><TD class=rates>
<DIV class="discount_price secondary-color-text standard_price--left">
<DIV class=price_text>Web Rate: </DIV>
<DIV class="rate_text primary-color-text rate_text--clear">$39.00 </DIV></DIV>
<SCRIPT>
$( document ).ready(function() {
$('.units_table tr.unitinfo').each(function(index, el) {
if ($(this).find('.standard_price').length != '' && $(this).find('.discount_price').length != '') {
$(this).parents('.units_table').addClass('both');
$(this).addClass('also-both');
$(this).find('.rates').addClass('rates_two_column');
}
});
});
</SCRIPT>
</TD><TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD><TD class=offers>
<DIV class=offer1>Call for Specials </DIV>
<DIV class=offer2></DIV></TD><TD class=reserve><A id=5x5:39:00000000 class="facility_call_to_reserve cta_call primary-color primary-hover" href="about:blank#" rel=nofollow>Call </A></TD>

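Each row we work with will have similar html inside the html2 variable. If you are in doubt, look at the javascript in the function shown above:
$('.units_table tr.unitinfo').each(function(index, el)
It uses the same selector (but also specifies the parent table class and the element type, tr). Basically, that function is called for each row in the table.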

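Size:
For some reason the opening td tag is dropped (I think I have seen this happen when the parent <table> tag is missing), so for size, rather than grabbing by class, I look for the start of the closing tag and extract the string up to that point. I do this by passing the value returned by InStr (the position where < is found in the string), minus 1, to the Left$ (typed) function.
results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
This returns 5x5.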
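The Left$/InStr step can be wrapped in a small helper, sketched below. The function name TextBeforeFirstTag is hypothetical (it is not part of the answer's code), and the guard for a missing "<" is an added assumption: without it, InStr returns 0 and Left$ would be asked for a length of -1, which raises a run-time error.

```vba
Option Explicit

' Hypothetical helper (not in the original answer): returns the text that
' precedes the first "<" in an HTML fragment.
Public Function TextBeforeFirstTag(ByVal fragment As String) As String
    Dim pos As Long
    pos = InStr(fragment, "<")          ' 0 when no "<" is present
    If pos > 0 Then
        TextBeforeFirstTag = Trim$(Left$(fragment, pos - 1))
    Else
        TextBeforeFirstTag = Trim$(fragment) ' no tag found: return the whole string
    End If
End Function
```

For example, Debug.Print TextBeforeFirstTag("5x5</TD> <TD class=features>") prints 5x5.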

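Description:
The description column is populated by the function we saw above (remember, it is applied to each row).
This bit - $("#a5x5-1").tooltip - tells it what to target, and the function's return statement then supplies the html, which contains a div with class description holding the text we want. As we are not using a browser, and I am on 64-bit Windows, I cannot evaluate this script, but I can use Split to extract the string (the description) between "description\"> and the start of the associated closing div tag:
results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
This returns:
"Think of it like a standard closet. Approximately 25 square feet, this space is perfect for about a dozen boxes, a desk and chair, and a bicycle."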
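The nested Split call assumes both markers are always present; if the first marker is missing, indexing element (1) fails with a subscript error. A minimal sketch of the same technique with that case guarded follows. The name TextBetween is hypothetical and not part of the answer's code.

```vba
Option Explicit

' Hypothetical helper (not in the original answer): returns the text between
' the first occurrence of startMarker and the next occurrence of endMarker.
' Returns an empty string when startMarker is absent.
Public Function TextBetween(ByVal source As String, _
                            ByVal startMarker As String, _
                            ByVal endMarker As String) As String
    Dim parts() As String
    parts = Split(source, startMarker)
    If UBound(parts) < 1 Then Exit Function      ' start marker not found
    If Len(parts(1)) = 0 Then Exit Function      ' nothing follows the marker
    TextBetween = Split(parts(1), endMarker)(0)  ' text up to the end marker
End Function
```

With this helper, the description line could be written as TextBetween(html2.querySelector("SCRIPT").innerHTML, """description\"">", "</div>").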

Rate type and price:
These are straightforward and are targeted by class name:
results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)
These return (respectively):
Web Rate
$39.00

Amenities:
This is where things get a little trickier.
Let's re-examine the html shown above that relates to amenities for this row:

<TD class=amenities>
<DIV title="Temperature Controlled" class="amenity_icon icon_climate"></DIV>
<DIV title="Interior Storage" class="amenity_icon icon_interior"></DIV>
<DIV title="Ground Floor" class="amenity_icon icon_ground_floor"></DIV></TD>

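We can see that the parent td has a class of amenities and has child div elements with compound class names; in each case the latter serves as an identifier for the amenity type, e.g. icon_climate.
When you hover over these icons, tooltip information is displayed on the page.
We can trace the location of this tooltip in the html of the actual page.
As you hover over different amenities, this content is updated.
To cut a long story short (he says, halfway down the page!), this content is updated from a php file on the server. We can request that file and build a dictionary that maps each amenity's class name, e.g. amenity_icon icon_climate, to the associated description. (To turn a compound class name into the appropriate css selector, .amenity_icon.icon_climate, prefix it with "." and replace each space with ".".) You can view the php file at https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php .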
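The class-name-to-selector conversion can be sketched as a standalone function. The name ClassNameToSelector is hypothetical. Unlike a single Replace$ call, this version also tolerates leading, trailing, or repeated spaces in the class attribute, which is an added assumption about what the page might contain.

```vba
Option Explicit

' Hypothetical helper (not in the original answer): converts a compound class
' name such as "amenity_icon icon_climate" into the css selector
' ".amenity_icon.icon_climate".
Public Function ClassNameToSelector(ByVal className As String) As String
    Dim parts() As String, i As Long, result As String
    parts = Split(Trim$(className), " ")
    For i = LBound(parts) To UBound(parts)
        ' Skip empty entries produced by consecutive spaces.
        If Len(parts(i)) > 0 Then result = result & "." & parts(i)
    Next i
    ClassNameToSelector = result
End Function
```

The resulting string can then be used as the key when looking up a description in the dictionary built from the php file.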
php file:
Let's look at just the start of the file so we can dissect the basic unit of the repeating pattern:

function LoadTooltips() {
$(".units_table .amenity_icon.icon_climate").tooltip({
track: false,
delay: 0,
showURL: false,
left: -126,
top: -100,
bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"
}
});

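The function responsible for updating the tooltips is LoadTooltips. A css class selector is used to target each icon:
$(".units_table .amenity_icon.icon_climate").tooltip
and a bodyHandler specifies the text that is returned: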

bodyHandler: function () {
return "<div class=\"sidebar_tooltip\"><h4>Temperature Controlled</h4><p>Units are heated and/or cooled. See manager for details.</p></div>"

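Three useful pieces of information appear in each repeating group: the element's class name selector, a short description, and a long description, e.g.
  • CSS selector: .amenity_icon.icon_climate . We use this to map the php file descriptions to the class names of the amenity icons in our row.
  • Short description: Temperature Controlled , found inside the h4 tag of the text returned by the tooltip function.
  • Long description: Units are heated and/or cooled. See manager for details. , found inside the p tag of the text returned by the tooltip function.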

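I wrote 2 functions, GetMatches and GetAmenitiesDescriptions, which use a regular expression to extract all the repeating items for each icon and return a dictionary with the css selector as key and short description: long description as value.
When I collect all the icons in each row:
Set icons = html2.querySelectorAll(".amenity_icon")

I use the dictionary to return the tooltip description based on the icon's class name:
For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
    amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
Next
I then join the descriptions with vbNewLine so that each one appears on its own line within the output cell.
You can explore the regular expression at https://regex101.com/r/bII5AL/1 .
The regex uses | (OR) syntax, so I return all the matched patterns in a single list:
arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
Because I need a different submatch each time (0, 1, or 2, i.e. css class selector, short description, long description), I use Select Case i Mod 3, with counter variable i, to extract the appropriate submatch.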
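The alternation-plus-counter idea relies on the matches arriving in strict triples (selector, then h4, then p). The sketch below shows the same idea in a self-contained form that runs against a sample string instead of the live php file. BuildTooltipLookup and DemoTooltipLookup are hypothetical names, and the check that the match count is a multiple of 3 is an added assumption, not something the answer's code does.

```vba
Option Explicit

' Hypothetical sketch: builds a selector -> "short: long" dictionary from text
' laid out like the tooltip php file. Uses the same pattern as the answer.
Public Function BuildTooltipLookup(ByVal source As String) As Object
    Dim re As Object, matches As Object, dict As Object, i As Long

    Set dict = CreateObject("Scripting.Dictionary")
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.MultiLine = True
    re.Pattern = "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>"

    Set matches = re.Execute(source)
    ' Each match fills exactly one of the three groups; a count that is not a
    ' multiple of 3 means the selector/h4/p alignment has been lost.
    If matches.Count Mod 3 <> 0 Then
        Err.Raise vbObjectError + 513, "BuildTooltipLookup", "Matches are not in groups of three"
    End If

    For i = 0 To matches.Count - 3 Step 3
        dict(matches.Item(i).SubMatches.Item(0)) = _
            matches.Item(i + 1).SubMatches.Item(1) & ": " & matches.Item(i + 2).SubMatches.Item(2)
    Next i
    Set BuildTooltipLookup = dict
End Function

Public Sub DemoTooltipLookup()
    Dim sample As String, lookup As Object
    ' Sample text copied from the start of the php file shown earlier.
    sample = "$("".units_table .amenity_icon.icon_climate"").tooltip({" & vbNewLine & _
             "return ""<div class=\""sidebar_tooltip\""><h4>Temperature Controlled</h4>" & _
             "<p>Units are heated and/or cooled. See manager for details.</p></div>"""
    Set lookup = BuildTooltipLookup(sample)
    Debug.Print lookup(".amenity_icon.icon_climate")
End Sub
```

Running DemoTooltipLookup prints Temperature Controlled: Units are heated and/or cooled. See manager for details.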
(Screenshot in the original answer: example of the matches mapped from the php file.)

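Specials:
We are back to class selectors. offer2 is not populated, so you could remove it.
results(r, 6) = html2.querySelector(".offer1").innerText
results(r, 7) = html2.querySelector(".offer2").innerText
These return (respectively):
Call for Specials, empty string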

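Closing remarks:
So, the above walks you through a single row. It is then simply rinse and repeat in a loop over all rows. For efficiency, the data is added to an array, results, which is then written to Sheet1 in one go. I can see some minor improvements that could be made, but this is fast.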

VBA:
Option Explicit
Public Sub GetInfo()
    Dim ws As Worksheet, html As HTMLDocument, s As String, amenitiesDescriptions As Object
    Const URL As String = "https://www.safeandsecureselfstorage.com/self-storage-lake-villa-il-86955"

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    Set html = New HTMLDocument
    Set amenitiesDescriptions = GetAmenitiesDescriptions

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
    End With

    html.body.innerHTML = s

    Dim headers(), results(), listings As Object, amenities As String

    headers = Array("Size", "Description", "RateType", "Price", "Amenities", "Offer1", "Offer2")
    Set listings = html.querySelectorAll(".unitinfo")

    Dim rowCount As Long, numColumns As Long, r As Long, c As Long
    Dim icons As Object, icon As Long, amenitiesInfo(), i As Long, item As Long

    rowCount = listings.Length
    numColumns = UBound(headers) + 1

    ReDim results(1 To rowCount, 1 To numColumns)
    Dim html2 As HTMLDocument
    Set html2 = New HTMLDocument
    For item = 0 To listings.Length - 1
        r = r + 1
        html2.body.innerHTML = listings.item(item).innerHTML 'work on one row at a time
        results(r, 1) = Left$(html2.body.innerHTML, InStr(html2.body.innerHTML, "<") - 1)
        results(r, 2) = Split(Split(html2.querySelector("SCRIPT").innerHTML, """description\"">")(1), "</div>")(0)
        results(r, 3) = Replace$(html2.querySelector(".price_text").innerText, ":", vbNullString)
        results(r, 4) = Trim$(html2.querySelector(".rate_text").innerText)

        Set icons = html2.querySelectorAll(".amenity_icon")
        amenities = vbNullString
        If icons.Length > 0 Then 'a row without icons would otherwise fail on ReDim (0 To -1)
            ReDim amenitiesInfo(0 To icons.Length - 1)

            For icon = 0 To icons.Length - 1 'use class name of amenity to look up description
                amenitiesInfo(icon) = amenitiesDescriptions("." & Replace$(icons.item(icon).className, Chr$(32), "."))
            Next

            amenities = Join$(amenitiesInfo, vbNewLine) 'place each amenity description on a new line within cell when written out
        End If

        results(r, 5) = amenities
        results(r, 6) = html2.querySelector(".offer1").innerText
        results(r, 7) = html2.querySelector(".offer2").innerText
    Next

    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Public Function GetAmenitiesDescriptions() As Object 'retrieve amenities descriptions from php file on server
    Dim s As String, dict As Object, re As Object, i As Long, arr() 'keys based on classname, short desc, full desc
    ' view regex here: https://regex101.com/r/bII5AL/1
    Set dict = CreateObject("Scripting.Dictionary")
    Set re = CreateObject("vbscript.regexp")

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.safeandsecureselfstorage.com/core/resources/js/src/common.tooltip.php", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        s = .responseText
    End With

    arr = GetMatches(re, s, "(\.amenity_icon\..*)""|<h4>(.*)<\/h4>|<p>(.*)<\/p>")
    'stop 2 short of the upper bound so that arr(i + 2) always exists (also covers the no-match case)
    For i = LBound(arr) To UBound(arr) - 2 Step 3 'build up lookup dictionary for amenities descriptions
        dict(arr(i)) = arr(i + 1) & ": " & arr(i + 2)
    Next
    Set GetAmenitiesDescriptions = dict
End Function

Public Function GetMatches(ByVal re As Object, inputString As String, ByVal sPattern As String) As Variant
    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long

    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = sPattern
        If .test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                Select Case i Mod 3
                Case 0
                    arrMatches(i) = iMatch.SubMatches.item(0)
                Case 1
                    arrMatches(i) = iMatch.SubMatches.item(1)
                Case 2
                    arrMatches(i) = iMatch.SubMatches.item(2)
                End Select
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function


References (VBE > Tools > References):
  • Microsoft HTML Object Library

Regarding "html - Web scraping elements by class and tag name", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55761018/
