
python - How to recombine words with BeautifulSoup


When I run this code, some of the words in the output come out split apart. For example, the word "tolerances" becomes "tole rances". I looked at the HTML source, and the page really does appear to be built that way.

There are many other places where words are split like this. How can I recombine them before writing the text to a file?

import requests, codecs
from bs4 import BeautifulSoup
from bs4.element import Comment

path='C:\\Users\\jason\\Google Drive\\python\\'

def tag_visible(element):
    if element.parent.name in ['sup']:
        return False
    if isinstance(element, Comment):
        return False
    return True

ticker = 'TSLA'
quarter = '18Q2'
mark1= 'ITEM 1A'
mark2= 'UNREGISTERED SALES'
url_new='https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-10q_20180630.htm'

def get_text(url,mark1,mark2):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')

    for hr in soup.select('hr'):
        hr.find_previous('p').extract()

    texts = soup.findAll(text=True)

    visible_texts = filter(tag_visible, texts)
    text=u" ".join(t.strip() for t in visible_texts)

    return text[text.find(mark1): text.find(mark2)]

text = get_text(url_new,mark1,mark2)

file=codecs.open(path + "test.txt", 'w', encoding='utf8')
file.write(text)
file.close()

Best Answer

You are dealing with HTML that was formatted by Microsoft Word. Don't extract the text without its context and then try to process it.

The section you want to process is clearly marked with <a name="..."> tags, so let's start by selecting all the elements from the <a name="ITEM_1A_RISK_FACTORS"> marker up to, but not including, the <a name="ITEM2_UNREGISTERED_SALES"> marker:

def sec_section(soup, item_name):
    """iterate over SEC document paragraphs for the section named item_name

    Item name must be a link target, starting with ITEM
    """

    # ask BS4 to find the section
    elem = soup.select_one('a[name={}]'.format(item_name))
    # scan up to the parent text element
    # html.parser does not support <text> but lxml does
    while elem.parent is not soup and elem.parent.name != 'text':
        elem = elem.parent

    yield elem
    # now we can yield all next siblings until we find one that
    # also contains an a[name^=ITEM] element:
    for elem in elem.next_siblings:
        if not isinstance(elem, str) and elem.select_one('a[name^=ITEM]'):
            return
        yield elem

This function gives us all the sibling nodes inside the <text> node of the HTML document, starting at the paragraph containing the given link target and running up to the next link target named ITEM.

Next comes the usual Word clean-up task: removing <font> tags and style attributes:

def clean_word(elem):
    if isinstance(elem, str):
        return elem
    # remove last-rendered break markers, non-rendering but messy
    for lastbrk in elem.select('a[name^=_AEIOULastRenderedPageBreakAEIOU]'):
        lastbrk.decompose()
    # remove font tags and styling from the document, leaving only the contents
    if 'style' in elem.attrs:
        del elem.attrs['style']
    for e in elem:  # recursively do the same for all child nodes
        clean_word(e)
    if elem.name == 'font':
        elem = elem.unwrap()
    return elem

The Tag.unwrap() method is what helps your case the most, because the text is divided almost arbitrarily by <font> tags.
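To see the effect in isolation, here is a minimal sketch (the snippet and variable names are made up for illustration, not taken from the SEC filing). Joining every text node with a space, as the code in the question does, inserts a space at every <font> boundary, while unwrapping the tags first keeps the fragments adjacent:

```python
from bs4 import BeautifulSoup

# Hypothetical Word-style markup: one word split across two <font> tags
snippet = '<p><font>tole</font><font>rances</font></p>'
soup = BeautifulSoup(snippet, 'html.parser')

# The question's approach: a space between every text node splits the word
broken = " ".join(t.strip() for t in soup.find_all(string=True))

# Unwrap the <font> tags first, then extract: the fragments stay adjacent
for font in soup.find_all('font'):
    font.unwrap()
fixed = soup.p.get_text()

print(broken)  # tole rances
print(fixed)   # tolerances
```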

Now extracting the text cleanly suddenly becomes trivial:

for elem in sec_section(soup, 'ITEM_1A_RISK_FACTORS'):
    clean_word(elem)
    if not isinstance(elem, str):
        elem = elem.get_text(strip=True)
    print(elem)

This outputs, among the rest of the text:

•that the equipment and processes which we have selected for Model 3 production will be able to accurately manufacture high volumes of Model 3 vehicles within specified design tolerances and with high quality;

The text is now joined correctly; no recombining is needed any more.

The whole section is still inside a table, but clean_word() has now cleaned it up into something much more reasonable:

<div align="left">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td valign="top">
<p> </p></td>
<td valign="top">
<p>•</p></td>
<td valign="top">
<p>that the equipment and processes which we have selected for Model 3 production will be able to accurately manufacture high volumes of Model 3 vehicles within specified design tolerances and with high quality;</p></td></tr></table></div>

so you could use smarter text extraction techniques to further ensure a clean text conversion here; you could convert such bullet tables into a * prefix, for example:

def convert_word_bullets(soup, text_bullet="*"):
    for table in soup.select('div[align=left] > table'):
        div = table.parent
        bullet = div.find(string='\u2022')
        if bullet is None:
            # not a bullet table, skip
            continue
        text_cell = bullet.find_next('td')
        div.clear()
        div.append(text_bullet + ' ')
        for i, elem in enumerate(text_cell.contents[:]):
            if i == 0 and elem == '\n':
                continue  # no need to include the first linebreak
            div.append(elem.extract())

In addition, you probably want to skip the page breaks (combinations of <p>[page number]</p> and <hr/> elements). If you run

for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
    pagebrk.find_previous_sibling('p').decompose()
    pagebrk.decompose()

this is more explicit than your own version, where you remove every <hr/> element and the preceding <p> element, regardless of whether they actually are siblings.
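A quick sketch of the difference (the markup here is hypothetical, not from the filing): the p ~ hr sibling selector only matches an <hr/> that actually has a <p> sibling before it, so an unrelated <hr/> elsewhere in the tree, and the text around it, survives:

```python
from bs4 import BeautifulSoup

html = ('<body>'
        '<p>12</p><hr style="page-break-after:always"/>'    # a real page-break pair
        '<p>Real text.</p>'
        '<div><hr style="page-break-after:always"/></div>'  # <hr/> with no <p> sibling
        '</body>')
soup = BeautifulSoup(html, 'html.parser')

# Only true sibling pairs are decomposed; the lone <hr/> in the <div> is left alone
for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
    pagebrk.find_previous_sibling('p').decompose()
    pagebrk.decompose()

print(soup.get_text(strip=True))  # Real text.
```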

Run both of these before cleaning up your Word HTML. Combined with your function, that together becomes:

import os
import requests
from bs4 import BeautifulSoup

def get_text(url, item_name):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
        pagebrk.find_previous_sibling('p').decompose()
        pagebrk.decompose()

    convert_word_bullets(soup)
    cleaned_section = map(clean_word, sec_section(soup, item_name))

    return ''.join([
        elem.get_text(strip=True) if elem.name else elem
        for elem in cleaned_section])


text = get_text(url_new, 'ITEM_1A_RISK_FACTORS')
with open(os.path.join(path, 'test.txt'), 'w', encoding='utf8') as f:
    f.write(text)

Regarding python - how to recombine words with BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52023692/
