excel - Scraping a table from a local HTML file with Unicode characters

Reposted. Author: 行者123. Updated: 2023-12-02 19:01:52

I am trying to scrape a table from a local HTML file stored on my PC, using the following code:

Sub Test()
    Dim mtbl As Object
    Dim tableData As Object
    Dim tRow As Object
    Dim tcell As Object
    Dim trowNum As Integer
    Dim tcellNum As Integer
    Dim webpage As New HTMLDocument
    Dim fPath As String
    Dim strCnt As String
    Dim f As Integer

    fPath = Environ("USERPROFILE") & "\Desktop\LocalHTML.txt"
    f = FreeFile()
    Open fPath For Input As #f
    strCnt = Input(LOF(f), f)
    Close #f

    webpage.body.innerHTML = strCnt

    Set mtbl = webpage.getElementsByTagName("Table")(0)
    Set tableData = mtbl.getElementsByTagName("tr")
    Debug.Print tableData.Item(0).innerText

    On Error GoTo TryAgain:
    trowNum = 1

    For Each tRow In tableData
        For Each tcell In tRow.Children
            tcellNum = tcellNum + 1
            Sheet1.Cells(trowNum, tcellNum) = tcell.innerText
        Next tcell
        trowNum = trowNum + 1
        tcellNum = 0
    Next tRow
    Exit Sub

TryAgain:
    Application.Wait Now + TimeValue("00:00:02")
    Err.Clear
    Resume
End Sub

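The code runs without errors, but the result is wrong in two ways. First, the Arabic characters show up as question marks on the worksheet, which means the Unicode characters are not being read correctly. Second, the data ends up scattered across the worksheet with no organized structure.

Here is a link to the local HTML file: http://www.mediafire.com/file/oxpyzv4gc53kuwg/LocalHTML.txt

Thanks in advance for your help.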

Best answer

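So, maybe this will help. It is not the full answer I would like to give. Basically, the HTML is a mess (in my opinion). The data is not laid out in rows (tr) that contain table cells (td), which would have made it easy to isolate the individual text elements.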
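For contrast, here is a minimal sketch of what the loop could look like if the file did hold a regular table, with each record in a tr and each value in a td or th. It does not work on this particular file; the procedure name WriteRegularTable and its parameters are made up for illustration, and the objects are declared As Object so no library reference is assumed.

```vba
' Hypothetical helper: copy a well-formed HTML table to a worksheet.
' Assumes every record is a <tr> whose cells are <td>/<th> elements.
Public Sub WriteRegularTable(ByVal tbl As Object, ByVal ws As Worksheet)
    Dim tRow As Object, tCell As Object
    Dim r As Long, c As Long

    For Each tRow In tbl.Rows
        r = r + 1          ' next worksheet row
        c = 0              ' restart at the first column
        For Each tCell In tRow.Cells
            c = c + 1
            ws.Cells(r, c).Value = tCell.innerText
        Next tCell
    Next tRow
End Sub
```

A call could look like WriteRegularTable html.getElementsByTagName("table")(0), ActiveSheet, where html is an already loaded HTMLDocument.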

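I offer the following really just to demonstrate the quirks of isolating the individual text components and of reading and writing them while preserving the Arabic characters. I borrowed the ADODB Stream method from @whom to ensure the file is read as UTF-8.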
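The question marks most likely come from the Open ... For Input / Input() pair, which, as far as I know, decodes the bytes with the system ANSI code page rather than as UTF-8. A minimal sketch of the stream-based read, wrapped in a reusable function, could look like this. The name ReadUtf8File is hypothetical, and the late-bound CreateObject call is my own choice so that the sketch needs no ADO reference.

```vba
' Hypothetical helper: read a whole text file as UTF-8 and return it
' as an ordinary VBA string. Late-bound, so no ADO reference is needed.
Public Function ReadUtf8File(ByVal filePath As String) As String
    Dim stm As Object
    Set stm = CreateObject("ADODB.Stream")

    With stm
        .Charset = "UTF-8"         ' decode the bytes as UTF-8
        .Open
        .LoadFromFile filePath
        ReadUtf8File = .ReadText   ' with no argument, reads the whole stream
        .Close
    End With
End Function
```

The result can then be assigned directly, for example html.body.innerHTML = ReadUtf8File(fPath). Note that the VBA editor may still show Arabic text as question marks in the Immediate window even when the string itself is correct, so it is better to check the result on the worksheet.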

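The approach in the code below, looping over the table tags with hard-coded index numbers, is ugly and really is a sin. I rely on the fact that the later tables in the file store the individual components separately, and use that to rebuild the overall look of the table with rows and columns.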

But you might get something out of it:

Option Explicit

' Requires references (Tools > References) to:
'   Microsoft HTML Object Library
'   Microsoft ActiveX Data Objects x.x Library
Public Sub test()
    Dim fStream As ADODB.Stream, html As HTMLDocument
    Set html = New HTMLDocument
    Set fStream = New ADODB.Stream

    ' Read the file as UTF-8 so the Arabic text survives.
    With fStream
        .Charset = "UTF-8"
        .Open
        .LoadFromFile "C:\Users\User\Downloads\LocalHTML.html"
        html.body.innerHTML = .ReadText
        .Close
    End With

    Dim hTables As Object, startTableNumber As Long, i As Long, r As Long, c As Long
    Dim counter As Long, endTableNumber As Long, numColumns As Long

    ' Hard-coded for this particular file.
    startTableNumber = 43
    endTableNumber = 330
    numColumns = 9

    Set hTables = html.getElementsByTagName("table")
    r = 2: c = 1

    ' Step 2 visits every second table; start a new row after numColumns values.
    For i = startTableNumber To endTableNumber Step 2
        counter = counter + 1
        If counter > numColumns Then
            c = 1: r = r + 1: counter = 1
        End If
        ActiveSheet.Cells(r, c) = hTables(i).innerText
        c = c + 1
    Next i

End Sub
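The counter bookkeeping in the loop can also be written as plain integer arithmetic on the position of each value. This is only a sketch of that alternative, not part of the original answer; GridPosition and GridPositionDemo are hypothetical names.

```vba
' Hypothetical helper: map the n-th value (0-based) of a flat sequence
' to a worksheet row and column, filling numColumns cells per row.
Public Sub GridPosition(ByVal n As Long, ByVal numColumns As Long, _
                        ByVal firstRow As Long, _
                        ByRef r As Long, ByRef c As Long)
    r = firstRow + (n \ numColumns)    ' integer division picks the row
    c = 1 + (n Mod numColumns)         ' remainder picks the column
End Sub

' Quick check against the layout used above (9 columns, first row 2).
Public Sub GridPositionDemo()
    Dim r As Long, c As Long
    GridPosition 0, 9, 2, r, c: Debug.Assert r = 2 And c = 1
    GridPosition 8, 9, 2, r, c: Debug.Assert r = 2 And c = 9
    GridPosition 9, 9, 2, r, c: Debug.Assert r = 3 And c = 1
End Sub
```

Inside the loop, the position of the table at index i would be (i - startTableNumber) \ 2, so the call becomes GridPosition (i - startTableNumber) \ 2, numColumns, 2, r, c, followed by writing hTables(i).innerText to Cells(r, c).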

Regarding "excel - Scraping a table from a local HTML file with Unicode characters", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53033150/
