
python - Remove content between <div> and <ahref> with Beautiful Soup

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:50:33

I have a piece of code that parses a web page. I want to remove everything between the div, a href, and h1 tags.

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
url = "http://en.wikipedia.org/wiki/Viscosity"
try:
    ourUrl = opener.open(url).read()
except Exception,err:
    pass
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p')

for i in dem:
    print i.text

I want to print the text without anything that sits between the h1 and a href tags, as I mentioned above.

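Best answer

Edit: from the comments, "I want to return the text that is not between any <div></div> tags." This should drop every block that has a div tag among its parents: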

raw = '''
<html>
Text <div> Avoid this </div>
<p> Nested <div> Don't get me either </div> </p>
</html>
'''

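import bs4

def check_for_div_parent(mark):
    # Walk up the tree one level at a time.
    mark = mark.parent
    # Stop at the top of the document (no parent left) or at <html>.
    if mark is None or 'html' == mark.name:
        return False
    if 'div' == mark.name:
        return True
    return check_for_div_parent(mark)

soup = bs4.BeautifulSoup(raw)

# text=True selects every text node in the document.
for text in soup.findAll(text=True):
    if not check_for_div_parent(text):
        print text.strip()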

This prints only the two text fragments that lie outside the div tags:

Text
Nested
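
For readers on Python 3, the same filter can be sketched with bs4's `find_parent`, which looks for an enclosing tag without hand-written recursion. This is only a sketch under assumptions: the helper name `text_outside_divs` is made up for illustration, and the `html.parser` backend is chosen explicitly so the result does not depend on which parser happens to be installed.

```python
from bs4 import BeautifulSoup

raw = """
<html>
Text <div> Avoid this </div>
<p> Nested <div> Don't get me either </div> </p>
</html>
"""

def text_outside_divs(html):
    """Return stripped text fragments that have no <div> ancestor.

    Hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    fragments = []
    for node in soup.find_all(string=True):
        # find_parent returns None when no enclosing <div> exists.
        # Whitespace-only nodes are skipped so the result has no blanks.
        if node.find_parent("div") is None and node.strip():
            fragments.append(node.strip())
    return fragments

print(text_outside_divs(raw))
```

Unlike the recursive version, this also skips whitespace-only text nodes, so no blank entries are collected.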

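Original reply

It is not clear exactly what you are trying to do. First, you should try to post a complete working example, since the import statements appear to be missing. Second, Wikipedia seems to take a stance against "bots" or automated downloaders: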

Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

This can be avoided with the following lines of code:

import urllib2, bs4

url = r"http://en.wikipedia.org/wiki/Viscosity"

req = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
con = urllib2.urlopen( req )
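
On Python 3, `urllib2` no longer exists and its request machinery lives in `urllib.request`. A rough equivalent of the lines above might look like the following. The helper names `build_request` and `fetch` are assumptions made for this sketch, and `fetch` is defined but not called here so that nothing touches the network.

```python
import urllib.request

URL = "http://en.wikipedia.org/wiki/Viscosity"

def build_request(url, user_agent="Magic Browser"):
    """Build a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url):
    """Download the page body as bytes (this performs a network call)."""
    with urllib.request.urlopen(build_request(url)) as con:
        return con.read()

req = build_request(URL)
# Request stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```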

Now that we have the page, I think you just want to extract the body text using bs4. I would do something like this:

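soup = bs4.BeautifulSoup(con.read())
# The article body shares a parent with the page's <h1> heading.
start_pos = soup.find('h1').parent

for p in start_pos.findAll('p'):
    # Join every text node inside the paragraph, dropping the markup.
    para = ''.join(p.findAll(text=True))
    print para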

This gives me text that looks like:

The viscosity of a fluid is a measure of its resistance to gradual deformation by shear stress or tensile stress. For liquids, it corresponds to the informal notion of "thickness". For example, honey has a higher viscosity than water.[1] Viscosity is due to friction between neighboring parcels of the fluid that are moving at different velocities. When fluid is forced through a tube, the fluid generally moves faster near the axis and very slowly near the walls, therefore some stress (such as a pressure difference between the two ends of the tube) is needed to overcome the friction between layers and keep the fluid moving. For the same velocity pattern, the stress required is proportional to the fluid's viscosity. A liquid's viscosity depends on the size and shape of its particles and the attractions between the particles.[citation needed]
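
The paragraph extraction can also be tried offline in Python 3. The sketch below runs the same idea (start from the parent of the `h1`, then collect each `p`) on a small inline page. The `sample` markup and the helper name `paragraphs_near_h1` are invented for illustration and do not reflect Wikipedia's real page structure; `get_text()` stands in for joining the text nodes by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded article page.
sample = """
<html><body>
<div id="content">
<h1>Viscosity</h1>
<p>The <b>viscosity</b> of a fluid is a measure of its resistance.</p>
<p>Honey has a higher viscosity than water.<sup>[1]</sup></p>
</div>
<div id="footer"><p>Not part of the article.</p></div>
</body></html>
"""

def paragraphs_near_h1(html):
    """Return the text of every <p> under the parent of the first <h1>."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    if h1 is None:
        # No heading to anchor on, so there is nothing to extract.
        return []
    # get_text() concatenates all text nodes inside the paragraph.
    return [p.get_text() for p in h1.parent.find_all("p")]

for para in paragraphs_near_h1(sample):
    print(para)
```

The footer paragraph is left out because it does not sit under the same parent as the heading.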

Regarding "python - Remove content between <div> and <ahref> with Beautiful Soup", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/18445389/
