
python - Fixing broken HTML tables extracted with BS4 in Python

Reposted. Author: 行者123. Updated: 2023-12-01 07:26:24

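I am parsing HTML tables from administrative filings. This is tricky because the HTML is often broken, which produces poorly structured tables. Here is an example of a table I loaded into a pandas DataFrame: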

                0   1    2     3   4         5  \
0             NaN NaN  NaN   NaN NaN       NaN
1            Name NaN  Age   NaN NaN  Position
2    Aylwin Lewis NaN  NaN  59.0 NaN       NaN
3    John Morlock NaN  NaN  58.0 NaN       NaN
4  Matthew Revord NaN  NaN  50.0 NaN       NaN
5  Charles Talbot NaN  NaN  48.0 NaN       NaN
6      Nancy Turk NaN  NaN  49.0 NaN       NaN
7      Anne Ewing NaN  NaN  49.0 NaN       NaN

                                                   6
0                                                NaN
1                                                NaN
2    Chairman, Chief Executive Officer and President
3    Senior Vice President, Chief Operations Officer
4  Senior Vice President, Chief Legal Officer, Ge...
5  Senior Vice President and Chief Financial Officer
6  Senior Vice President, Chief People Officer an...
7        Senior Vice President, New Shop Development

I wrote the following Python code to try to fix the table:

#dropping empty rows
df = df.dropna(how='all', axis=0)

#dropping mostly empty columns (thresh=2 keeps only columns with at least 2 non-null values)
df = df.dropna(thresh=2, axis=1)

#resetting dataframe index
df = df.reset_index(drop=True)

#set found_name variable to stop the loop once it finds the name column
found_name = 0

#looping through rows to find the first one that has the word "Name" in it
for row in df.itertuples():

    #only loop if we have not found a name column yet
    if found_name == 0:

        #convert the row to string
        text_row = str(row)

        #search if there is the word "Name" in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row", row.Index, " as header.")

            #changing column names
            df.columns = df.iloc[row.Index]

            #dropping first rows
            df = df.iloc[row.Index + 1:]

            #changing found_name to 1
            found_name = 1

            #reindex
            df = df.reset_index(drop=True)
            print("Attempted to clean dataframe:")
            print(df)

This is the table I get:

0            Name   NaN                                                NaN
0    Aylwin Lewis  59.0    Chairman, Chief Executive Officer and President
1    John Morlock  58.0    Senior Vice President, Chief Operations Officer
2  Matthew Revord  50.0  Senior Vice President, Chief Legal Officer, Ge...
3  Charles Talbot  48.0  Senior Vice President and Chief Financial Officer
4      Nancy Turk  49.0  Senior Vice President, Chief People Officer an...
5      Anne Ewing  49.0        Senior Vice President, New Shop Development

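My main problem is that the headers "Age" and "Position" have disappeared because they are not aligned with their columns. I use this script to parse many tables, so I cannot fix them by hand. What can I do to repair the data at this point?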

Best answer

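Don't drop the nearly empty columns at the start; we need them later. Once the header row containing "Name" is found, collect all of its non-null elements, and after dropping the empty columns, set them as the column headers of the remaining data.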
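To see why the early column drop loses the headers, here is a minimal sketch on a shortened copy of the table. The two data rows and the variable names `kept` and `lost_headers` are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd

# Shortened copy of the table in the question: the header cells sit in
# columns 2 and 5, while the matching data sits in columns 3 and 6.
df = pd.DataFrame([
    ["Name", np.nan, "Age", np.nan, np.nan, "Position", np.nan],
    ["Aylwin Lewis", np.nan, np.nan, 59.0, np.nan, np.nan, "Chairman"],
    ["John Morlock", np.nan, np.nan, 58.0, np.nan, np.nan, "Senior Vice President"],
])

# thresh=2 keeps only columns with at least two non-null values, so a
# column that holds nothing but a single header cell is discarded.
kept = df.dropna(thresh=2, axis=1)

# Header cells that lived in the discarded columns.
lost_headers = [
    df.iloc[0, c]
    for c in df.columns
    if c not in kept.columns and pd.notna(df.iloc[0, c])
]

print(list(kept.columns))
print(lost_headers)
```

Columns 2 and 5 contain only "Age" and "Position", so they fall below the threshold and are removed together with their header text. The code below therefore keeps those columns until the header row has been read.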

import pandas as pd

#dropping empty rows
df = df.dropna(how='all', axis=0)

#resetting dataframe index
df = df.reset_index(drop=True)

#set found_name variable to stop the loop once it finds the name column
found_name = 0

#looping through rows to find the first one that has the word "Name" in it
for row in df.itertuples():

    #only loop if we have not found a name column yet
    if found_name == 0:

        #convert the row to string
        text_row = str(row)

        #search if there is the word "Name" in that row
        if "Name" in text_row:
            print("Name found in text of rows. Investigating row", row.Index, " as header.")

            #collect column names (the first non-null element is the row index, so skip it)
            headers = [c for c in row if not pd.isnull(c)][1:]

            #dropping first rows
            df = df.iloc[row.Index + 1:]

            #dropping columns that still contain NaN (here: the empty and header-only columns)
            df = df.dropna(axis=1)

            #setting column names, padding with 'col' if there are more columns than headers
            df.columns = (headers + ['col'] * (len(df.columns) - len(headers)))[:len(df.columns)]

            #changing found_name to 1
            found_name = 1

            #reindex
            df = df.reset_index(drop=True)
            print("Attempted to clean dataframe:")
            print(df)

Result:

             Name   Age                                           Position
0    Aylwin Lewis  59.0    Chairman, Chief Executive Officer and President
1    John Morlock  58.0    Senior Vice President, Chief Operations Officer
2  Matthew Revord  50.0  Senior Vice President, Chief Legal Officer, Ge...
3  Charles Talbot  48.0  Senior Vice President and Chief Financial Officer
4      Nancy Turk  49.0  Senior Vice President, Chief People Officer an...
5      Anne Ewing  49.0        Senior Vice President, New Shop Development
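Since the script has to run over many tables, the same idea can be wrapped in a function. The following is only a sketch under stated assumptions: `promote_header_row` and its `keyword` parameter are hypothetical names, and it deviates from the code above in one place by dropping only columns that are entirely empty (`how="all"`), so that a single missing data cell does not remove a whole column.

```python
import numpy as np
import pandas as pd


def promote_header_row(df, keyword="Name"):
    """Use the first row that contains `keyword` as the table header.

    Hypothetical helper for illustration; not part of pandas.
    Returns the input (minus empty rows) if no such row is found.
    """
    df = df.dropna(how="all", axis=0).reset_index(drop=True)
    for pos in range(len(df)):
        # Non-null cells of the candidate header row, in column order.
        cells = df.iloc[pos].dropna().tolist()
        if any(keyword in str(cell) for cell in cells):
            headers = [str(cell) for cell in cells]
            # Keep the rows below the header and drop columns that are
            # entirely empty (assumption: differs from dropna(axis=1)).
            body = df.iloc[pos + 1:].dropna(how="all", axis=1)
            n = len(body.columns)
            # Pad with 'col' when there are more columns than headers.
            body.columns = (headers + ["col"] * (n - len(headers)))[:n]
            return body.reset_index(drop=True)
    return df


# Shortened sample in the shape of the broken table from the question.
raw = pd.DataFrame([
    [np.nan] * 7,
    ["Name", np.nan, "Age", np.nan, np.nan, "Position", np.nan],
    ["Aylwin Lewis", np.nan, np.nan, 59.0, np.nan, np.nan,
     "Chairman, Chief Executive Officer and President"],
    ["John Morlock", np.nan, np.nan, 58.0, np.nan, np.nan,
     "Senior Vice President, Chief Operations Officer"],
])
clean = promote_header_row(raw)
print(clean)
```

Headers are matched to columns purely by position, so this only works when the non-empty header cells appear in the same left-to-right order as the data columns, as they do in the example table.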

This post on fixing broken HTML tables extracted with BS4 in Python is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/57430821/
