
python - Scraping XML with Regex + BeautifulSoup and storing it in Pandas

Reposted · Author: 行者123 · Updated: 2023-12-04 10:03:31

I'm scraping some XML pages with BeautifulSoup and then storing the scraped data in a DataFrame. The XML is usually formatted consistently, so the scraping works fine. But in perhaps 15% of cases the data isn't saved to the DataFrame, because one of the tag prefixes is slightly different.

For example, when scraping these three URLs, the 2nd and 3rd are stored in the DataFrame without issue, while the first isn't.

from bs4 import BeautifulSoup
import requests
import pandas as pd

session = requests.Session()

# urls to loop through
form_urls = ['https://www.sec.gov/Archives/edgar/data/1418814/000141881220000017/vac13f021420.xml',
             'https://www.sec.gov/Archives/edgar/data/820124/000095012320003895/408.xml',
             'https://www.sec.gov/Archives/edgar/data/1067983/000095012320002466/form13fInfoTable.xml']

# Create dataframe and set columns to match the XML doc
cols = ['nameOfIssuer', 'titleOfClass', 'cusip', 'value', 'sshPrnamt',
        'sshPrnamtType', 'putCall', 'investmentDiscretion',
        'otherManager', 'Sole', 'Shared', 'None']

res_df = pd.DataFrame(columns=cols)


# Iterate over URLs
for form_url in form_urls:
    data = []
    soup = BeautifulSoup(session.get(form_url).content, 'lxml')
    print(soup)

    # the lxml parser lowercases tag names, hence the lowercase searches
    for info_table in soup.find_all(['ns1:infotable', 'infotable']):
        row = []
        for col in cols:
            d = info_table.find([col.lower(), 'ns1:' + col.lower()])
            row.append(d.text.strip() if d else 'NaN')
        data.append(row)
    url_df = pd.DataFrame(data, columns=cols)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    res_df = pd.concat([res_df, url_df], ignore_index=True)

print(res_df)

So how can I make the scraper more flexible when the prefix comes in an unexpected format (for example, it could be an empty string, or some other combination of upper- and lowercase letters and digits)?

Best Answer

The second line of the first link you provided uses n1:infoTable rather than ns1:infoTable, so to make your code work you need to account for that.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re


session = requests.Session()

# urls to loop through
form_urls = ['https://www.sec.gov/Archives/edgar/data/1418814/000141881220000017/vac13f021420.xml',
             'https://www.sec.gov/Archives/edgar/data/820124/000095012320003895/408.xml',
             'https://www.sec.gov/Archives/edgar/data/1067983/000095012320002466/form13fInfoTable.xml']

# Create dataframe and set columns to match the XML doc
cols = ['nameOfIssuer', 'titleOfClass', 'cusip', 'value', 'sshPrnamt',
        'sshPrnamtType', 'putCall', 'investmentDiscretion',
        'otherManager', 'Sole', 'Shared', 'None']

res_df = pd.DataFrame(columns=cols)


# Iterate over URLs
for form_url in form_urls:
    data = []
    soup = BeautifulSoup(session.get(form_url).content, 'lxml')

    # match the tag with any alphanumeric prefix ("ns1:", "n1:", ...) or none at all
    for info_table in soup.find_all(re.compile("([A-Za-z0-9]+:|)infotable")):
        row = []
        for col in cols:
            pattern = re.compile("([A-Za-z0-9]+:|)" + col.lower())
            d = info_table.find(pattern)
            row.append(d.text.strip() if d else 'NaN')
        data.append(row)
    url_df = pd.DataFrame(data, columns=cols)
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    res_df = pd.concat([res_df, url_df], ignore_index=True)

Edit: the prefix can now be absent (an empty string '') or any combination of lowercase letters, uppercase letters, and digits.
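As a quick sanity check (separate from the scraper itself), the prefix-tolerant pattern can be exercised against a few representative tag names as the lxml parser would report them (lowercased); `ab12:` is a hypothetical prefix added for illustration:

```python
import re

# optional alphanumeric prefix ending in ":", followed by the tag name
pattern = re.compile("([A-Za-z0-9]+:|)infotable")

# tag-name variants: no prefix, the two prefixes seen in these filings,
# and a made-up mixed-alphanumeric prefix
names = ["infotable", "ns1:infotable", "n1:infotable", "ab12:infotable"]
print([bool(pattern.match(n)) for n in names])  # [True, True, True, True]
```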

Regarding "python - Scraping XML with Regex + BeautifulSoup and storing it in Pandas", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61711393/
