
html - beautifulsoup4: does unwrapping an element affect the parent element's .string?

Reposted. Author: 行者123. Updated: 2023-11-28 03:15:06

I am web-scraping text data from a table like the one below, and I would like to get this result:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

    html = '''
<table>
<tr class="title last ">
<td>
Lorem ipsum
</td>
<td>
</td>
</tr>
<tr>
<td>
<span class="caps">dolor
</span>
sit amet
</td>
<td>
</td>
</tr>
<tr>
<td>
consectetur adipiscing elit,
</td>
<td>
</td>
</tr>
<tr>
<td>
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</td>
<td>
</td>
</tr>
<tr>
<td>
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</td>
<td>
</td>
</tr>
</table>
'''

I unwrapped the <span> elements with beautifulsoup4:

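from bs4 import BeautifulSoup
import re

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# remove <span> tags but keep their content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()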

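However, I end up either with blank lines for all the empty <td> elements, or with the 'dolor sit amet' line not being printed, even though I can see it when I print the HTML with prettify.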

# prints all the text, but with blank lines for the empty cells
for line in soup.find_all('td'):
    print(line.get_text().strip())
    print(line.string)  # the cell that held the <span> prints None

# the row that held the <span> is missing from this output
for line in soup.find_all('td', text=re.compile(r'\w')):
    print(line.get_text().strip())

print(soup.prettify())

Am I doing something wrong? How can I use unwrap() and still access all the text content without the blank lines?

Thanks for your help!

Best answer

As far as I tested, you are close. Apply strip() and then use the re module to replace each run of whitespace with a single space, for example:

from bs4 import BeautifulSoup
import re

html = '''
<table>
<tr class="title last ">
<td>
Lorem ipsum
</td>
<td>
</td>
</tr>
<tr>
<td>
<span class="caps">dolor
</span>
sit amet
</td>
<td>
</td>
</tr>
<tr>
<td>
consectetur adipiscing elit,
</td>
<td>
</td>
</tr>
<tr>
<td>
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
</td>
<td>
</td>
</tr>
<tr>
<td>
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
</td>
<td>
</td>
</tr>
</table>
'''

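# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# remove <span> tags but keep their content
spans = soup.find_all('span')
for tag in spans:
    tag.unwrap()

# skip empty cells, then collapse each run of whitespace to one space
print('\n'.join(
    re.sub(r'\s+', ' ', td.text.strip())
    for td in soup.find_all('td') if td.text.strip()))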

It yields:

Lorem ipsum
dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
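
As for why `.string` prints None for that one cell: `.string` only returns a value when a tag has exactly one child, and after `unwrap()` the cell holds several separate text nodes. The `text=` filter in `find_all('td', text=...)` is matched against `.string`, which is presumably why that row goes missing. The sketch below is not part of the original answer. It uses a trimmed-down copy of the second table row, and the variable names are made up for illustration. It assumes the `html.parser` backend and, for `smooth()`, Beautiful Soup 4.8.0 or later.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the second table row from the question.
html = ('<table><tr><td>\n<span class="caps">dolor\n</span>\n'
        'sit amet\n</td></tr></table>')

soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')

soup.find('span').unwrap()

# unwrap() leaves the span's text as its own node next to the td's other
# text nodes, so the td holds several strings rather than one.
n_children = len(td.contents)
string_before = td.string  # None, because .string needs exactly one child

# get_text() works regardless of the number of children. The separator
# plus strip=True joins the non-empty pieces with single spaces.
joined = td.get_text(' ', strip=True)

# smooth() merges adjacent strings, so .string works again. The merged
# string keeps the original newlines, hence the split/join.
td.smooth()
string_after = ' '.join(td.string.split())

print(n_children, string_before)
print(joined)
print(string_after)
```

If only the text is needed, `get_text(' ', strip=True)` looks like the simpler route, since it avoids both the regular expression and the dependence on `.string`.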

Regarding "html - beautifulsoup4: does unwrapping an element affect the parent element's .string?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28528594/
