python - How do I extract HTML tables from a file using Pandas?

Reposted. Author: 行者123. Updated: 2023-12-04 15:17:09

I'm new to pandas, and I'm trying to extract some data from a set of HTML files.

How can I convert multiple HTML tables like the ones shown below:

PS4
Game Name | Price
GoW | 49.99
FF VII R | 59.99

XBX
Game Name | Price
Gears 5 | 49.99
Forza 5 | 59.99

<table>
<tr colspan="2">
<td>PS4</td>
</tr>
<tr>
<td>Game Name</td>
<td>Price</td>
</tr>
<tr>
<td>GoW</td>
<td>49.99</td>
</tr>
<tr>
<td>FF VII R</td>
<td>59.99</td>
</tr>
</table>

<table>
<tr colspan="2">
<td>XBX</td>
</tr>
<tr>
<td>Game Name</td>
<td>Price</td>
</tr>
<tr>
<td>Gears 5</td>
<td>49.99</td>
</tr>
<tr>
<td>Forza 5</td>
<td>59.99</td>
</tr>
</table>

into a JSON object like this:

[
{ "Game Name": "GoW", "Price": "49.99", "platform": "PS4"},
{ "Game Name": "FF VII R", "Price": "59.99", "platform": "PS4"},
{ "Game Name": "Gears 5", "Price": "49.99", "platform": "XBX"},
{ "Game Name": "Forza 5", "Price": "59.99", "platform": "XBX"}
]

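I tried loading the HTML file that contains the tables with pandas.read_html(path/to/file), and it does return a list of DataFrames, but I don't know how to extract the data from there, especially since the platform name sits in the table's title row rather than in a column of its own.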

I'm using pandas because I'm extracting these tables from local .htm files that also contain other kinds of tables and HTML code, so I use:

tables = pandas.read_html(file_path, match="Game Name")

The match parameter, keyed on that column name, quickly isolates the tables I need.
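To see why the platform name is awkward to reach, it can help to look at what read_html actually hands back. The following is a minimal sketch, assuming one of the sample tables is inlined as a string (the variable names html and raw are illustrative, not from the original post) and that an HTML parser pandas can use, such as lxml, is installed:

```python
from io import StringIO

import pandas as pd

# One of the sample tables from the question, inlined so the sketch is
# self-contained. The surrounding <html><body> wrapper is an assumption.
html = """
<html><body>
<table>
<tr colspan="2"><td>PS4</td></tr>
<tr><td>Game Name</td><td>Price</td></tr>
<tr><td>GoW</td><td>49.99</td></tr>
<tr><td>FF VII R</td><td>59.99</td></tr>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per matching <table>.
tables = pd.read_html(StringIO(html), match="Game Name")
raw = tables[0]

print(raw.shape)             # title and header rows are counted as data rows
print(raw.iloc[0, 0])        # the platform title cell
print(raw.iloc[1].tolist())  # the real column headers
```

Because the table has no th cells, pandas treats every row as data and numbers the columns itself, so the platform and the real headers both end up as ordinary rows that have to be pulled out by position.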

Best answer

import pandas as pd

# list to collect the cleaned DataFrames from all tables in all files
df_list = []

# list of files to load
list_of_files = ['test.html']

# iterate through the files
for file in list_of_files:

    # create a list of DataFrames from the matching tables in the file
    dfl = pd.read_html(file, match='Game Name')

    # fix the headers and columns
    for d in dfl:

        # use row 1 as the column headers
        d.columns = d.iloc[1]

        # use row 0, column 0 as the platform
        d['platform'] = d.iloc[0, 0]

        # keep row 2 and below as the data; rows 0 and 1 were the headers
        d = d.iloc[2:]

        # append the cleaned DataFrame to df_list
        df_list.append(d.copy())

# create a single DataFrame
df = pd.concat(df_list).reset_index(drop=True)

# create a list of dicts from df
records = df.to_dict('records')

print(records)
[out]:
[{'Game Name': 'GoW', 'Price': '49.99', 'platform': 'PS4'},
{'Game Name': 'FF VII R', 'Price': '59.99', 'platform': 'PS4'},
{'Game Name': 'Gears 5', 'Price': '49.99', 'platform': 'XBX'},
{'Game Name': 'Forza 5', 'Price': '59.99', 'platform': 'XBX'}]
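
The question asks for JSON, while the code above stops at a list of dicts. A minimal sketch of the remaining step, assuming the records list printed above and using only the standard-library json module:

```python
import json

# The list of dicts produced by df.to_dict('records') in the answer above,
# repeated here so the sketch runs on its own.
records = [
    {"Game Name": "GoW", "Price": "49.99", "platform": "PS4"},
    {"Game Name": "FF VII R", "Price": "59.99", "platform": "PS4"},
    {"Game Name": "Gears 5", "Price": "49.99", "platform": "XBX"},
    {"Game Name": "Forza 5", "Price": "59.99", "platform": "XBX"},
]

# Serialize to a JSON array of objects; indent=2 is only for readability.
json_text = json.dumps(records, indent=2)
print(json_text)
```

If the intermediate list of dicts is not needed, df.to_json(orient='records') on the combined DataFrame should give an equivalent JSON string directly.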

Regarding "python - How do I extract HTML tables from a file using Pandas?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64126294/
