
Parsing the text between two words in Python

Reposted. Author: 行者123. Updated: 2023-12-01 05:53:10

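I am using BeautifulSoup and want to extract all of the text that appears between two words on a web page.

For example, imagine the following website text: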

This is the text of the webpage. It is just a string of a bunch of stuff and maybe some tags in between.

I want to extract everything on the page that starts with text and ends with bunch.

In this case I would want only:

text of the webpage. It is just a string of a bunch 

However, a page may contain several such instances.

What is the best way to do this?

This is my current setup:

#!/usr/bin/env python
import re

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
urls = [
    "http://ca.news.yahoo.com/forget-phoning-business-app-sends-text-instead-100143774--sector.html",
]

for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    # Collect every text node in the parsed document
    texts = soup.findAll(text=True)

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        # Text under these parents is never rendered on the page, so skip it
        return False
    elif re.match('<!--.*-->', str(element)):
        # The element is an HTML comment, so skip it
        return False
    else:
        # Otherwise keep it: these are the elements we need
        return True

# filter() keeps only the items of texts for which visible() returns True.
# We use those to build our final list.
visible_texts = filter(visible, texts)

for line in visible_texts:
    print(line)

Best answer

Because you are only parsing text, a regular expression is all you need:

import re
result = re.findall("text.*?bunch", text_from_web_page)
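The non-greedy `.*?` is what keeps several instances on one page apart: each match stops at the nearest following end word instead of running to the last one. Below is a minimal sketch of the same idea applied to text gathered from several text nodes. The helper name `extract_between`, the `fragments` list, and its second sentence are hypothetical stand-ins, not part of the original question. It also assumes two refinements the one-liner does not make: `re.DOTALL`, so a match may cross line breaks, and `\b` word boundaries, so that `text` does not match inside a longer word such as `context`.

```python
import re


def extract_between(text, start_word, end_word):
    """Return each span that begins at start_word and runs to the
    nearest following end_word (both words included)."""
    pattern = re.compile(
        r"\b" + re.escape(start_word) + r"\b"   # whole-word start marker
        + r".*?"                                  # non-greedy: stop at the first end marker
        + r"\b" + re.escape(end_word) + r"\b",  # whole-word end marker
        re.DOTALL,                                # let "." match newlines too
    )
    return pattern.findall(text)


# Hypothetical stand-in for the visible text nodes collected from a page.
fragments = [
    "This is the text of the webpage. It is just a string of a bunch of stuff",
    "and more text that runs on\nuntil another bunch appears.",
]

# Join the nodes into one string so a match can span node boundaries.
page_text = " ".join(fragments)

matches = extract_between(page_text, "text", "bunch")
for match in matches:
    print(match)
```

With the sample input this yields two spans, one per instance. If the start word occurs twice before the end word, the match begins at the first occurrence, which may or may not be what you want.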

For more on parsing the text between two words in Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/13505322/
