python - Extracting correctly formatted text (with spaces in between) from <td> tags using Beautiful Soup

Reposted. Author: 行者123. Updated: 2023-11-27 22:54:46

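I am trying to extract the column headers from one of the tables in the ABBV 10-K SEC filing (the "Issuer Purchases of Equity Securities" table on page 25, below the graph).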

Inside the column-header <tr> tag, each <td> cell holds its text in separate <div> tags, as in the example below:

<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>

When I extract all the text from the tag, the pieces come out with no space between them (for the HTML above, the output is string1string2string3, whereas I expect string1 string2 string3).

I am using the code below to extract the column headers from the table:

from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
    row_data = []
    cells = tr.find_all(['td', 'th'], recursive=False)
    for cell in cells[1:4]:
        # .text concatenates the nested <div> strings with no separator
        row_data.append(cell.text.encode('utf-8'))
    table_data.append([x.decode('utf-8').strip() for x in row_data])

print(table_data)

Output: [['(a) TotalNumberof Shares(or Units)Purchased', '', '(b) AveragePricePaid per Share(or Unit)']]

Expected output: [['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']] (each word separated by a space)

Best answer

Use the separator parameter of .get_text():

html = '''<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>'''

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

td = soup.find('td')
td.get_text(separator=' ')
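
One caveat with the snippet above: the sample HTML has newlines between the tags, and those whitespace-only text nodes are joined in along with the real strings, so the result carries stray newlines. A minimal sketch of one way around that, assuming you want each fragment trimmed, is to also pass strip=True to get_text(). The variable names clean and same are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<tr>
<td>
<div>string1</div>
<div>string2</div>
<div>string3</div>
</td>
</tr>'''

soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')

# strip=True trims every text fragment and skips the ones that end up empty
# (here, the newlines between the <div> tags) before joining with the separator.
clean = td.get_text(separator=' ', strip=True)
print(clean)  # string1 string2 string3

# Equivalent spelling: join the stripped_strings generator yourself.
same = ' '.join(td.stripped_strings)
```

Note that strip=True trims only the ends of each fragment; whitespace inside a single fragment is left as it is.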

Here is how your code would look:

from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
    row_data = []
    cells = tr.find_all(['td', 'th'], recursive=False)
    for cell in cells[1:4]:
        row_data.append(cell.get_text(separator=' ').encode('utf-8'))
    table_data.append([x.decode('utf-8').strip() for x in row_data])

print(table_data)

Output:

print(table_data)
[['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']]
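
If the header cells also contain non-breaking spaces or runs of whitespace (an assumption about the filing's markup, not something verified here), the joined text can be normalized afterwards. The sketch below uses a hypothetical helper, cell_text, and a made-up sample row that only mimics the structure described in the question. It also drops the encode/decode round trip, which is not needed on Python 3 because get_text() already returns str.

```python
from bs4 import BeautifulSoup


def cell_text(cell):
    """Return a cell's text with fragments separated by single spaces."""
    # Join nested fragments with a space, then collapse every run of
    # whitespace (newlines, tabs, non-breaking spaces) into one space.
    raw = cell.get_text(separator=' ')
    return ' '.join(raw.split())


# Made-up header row for illustration; not copied from the actual filing.
sample = '''<table><tr>
<td><div>(a) Total</div><div>Number</div><div>of Shares</div></td>
<td></td>
<td><div>(b)&nbsp;Average</div><div>Price</div></td>
</tr></table>'''

soup = BeautifulSoup(sample, 'html.parser')
row = soup.find('tr')
headers = [cell_text(cell) for cell in row.find_all(['td', 'th'], recursive=False)]
print(headers)  # ['(a) Total Number of Shares', '', '(b) Average Price']
```

Because str.split() with no arguments splits on any Unicode whitespace, this also replaces non-breaking spaces with ordinary ones, which may or may not be what you want.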

This post on extracting correctly formatted text (with spaces in between) from <td> tags using Beautiful Soup in Python is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/56848931/
