
python - How do I download the contents of an HTML table?

Reposted. Author: 行者123. Updated: 2023-12-01 00:39:42

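I want to download financial data ("konsernregnskap", not "morregnskap") from the following website, but I'm not sure how to download all of it: https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/

I tried to locate the tables with XPath, but without success.

I want to download everything into a single Excel sheet.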

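Best answer

The answer given by @rusu_ro1 is correct. However, I think Pandas is the right tool for the job here.

You can use pandas.read_html to get all the tables on the page, then use pandas.DataFrame.to_excel to write only the last 4 tables to an Excel workbook.

The following script scrapes the data and writes each table to a different sheet.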

import pandas as pd

# read_html returns one DataFrame per <table> element found on the page
all_tables = pd.read_html(
    "https://www.proff.no/regnskap/yara-international-asa/oslo/hovedkontortjenester/IGB6AV410NZ/"
)
with pd.ExcelWriter('output.xlsx') as writer:
    # The last 4 tables hold the 'konsernregnskap' data
    for idx, df in enumerate(all_tables[4:8]):
        # Remove the last column (empty)
        df = df.drop(df.columns[-1], axis=1)
        # Pass sheet_name by keyword; newer pandas versions deprecate passing it positionally
        df.to_excel(writer, sheet_name="Table {}".format(idx))
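
The question asks for everything in a single Excel sheet, whereas the script above writes one sheet per table. A minimal sketch of one way to get a single sheet is to stack the tables into one DataFrame first. The helper name `combine_tables` and the sample frames are made up for illustration and merely stand in for the frames returned by `pd.read_html`. The sketch also drops every all-empty column rather than assuming the empty column is always the last one.

```python
import pandas as pd


def combine_tables(tables):
    """Drop all-empty columns from each table and stack the results.

    Hypothetical helper, not part of pandas.
    """
    cleaned = [df.dropna(axis=1, how="all") for df in tables]
    labels = ["Table {}".format(i) for i in range(len(cleaned))]
    # keys= adds an outer index level recording which table each row came from
    return pd.concat(cleaned, keys=labels, names=["table", "row"])


# Made-up stand-ins for the DataFrames that pd.read_html would return
nan = float("nan")
tables = [
    pd.DataFrame({"Item": ["Revenue"], "2018": [100], "Unnamed: 2": [nan]}),
    pd.DataFrame(
        {"Item": ["Assets", "Debt"], "2018": [50, 20], "Unnamed: 2": [nan, nan]}
    ),
]

combined = combine_tables(tables)
print(combined)
```

Calling `combined.to_excel("output.xlsx")` on the result should then write all rows to one sheet, assuming an Excel writer engine such as openpyxl is installed. Stacking works cleanly only when the tables share column names; otherwise the missing cells are filled with NaN.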

Notes:

flavor : str or None, container of strings

The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.

From HTML Table Parsing Gotchas:

html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is “correct”, since the process of fixing markup does not have a single definition.

In your specific case, html5lib drops the 5th table (it returns only 7), perhaps because the 1st and 5th tables contain the same data.
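
To check whether the choice of parser changes how many tables come back, one could run read_html once per flavor and compare the counts. The sketch below uses a small made-up HTML snippet instead of the live page, and `count_tables` is a hypothetical helper; it returns None when the requested parser is not installed. On well-formed markup like this, both parsers would be expected to agree, and any difference on the real page would come from how each one repairs invalid markup.

```python
import io

import pandas as pd

# Made-up HTML with two small tables; this is NOT the markup of the page above
HTML = """
<table><tr><th>Item</th><th>2018</th></tr><tr><td>Revenue</td><td>100</td></tr></table>
<table><tr><th>Item</th><th>2018</th></tr><tr><td>Assets</td><td>50</td></tr></table>
"""


def count_tables(html, flavor):
    """Return how many tables read_html finds with the given parser,
    or None if that parser is not installed."""
    try:
        # StringIO avoids passing literal HTML, which newer pandas deprecates
        return len(pd.read_html(io.StringIO(html), flavor=flavor))
    except ImportError:
        return None


counts = {flavor: count_tables(HTML, flavor) for flavor in ("lxml", "html5lib")}
print(counts)
```

Replacing `io.StringIO(html)` with the page URL would run the same comparison against the real site, provided the page is still reachable and structured as it was when the answer was written.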

Regarding "python - How do I download the contents of an HTML table?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57449268/
