
python - How do I convert HTML files into human-readable txt files?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 13:51:44

I have a lot of HTML files that look like this:

<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
SIC Code:
</b>
0000
<br />
<b>
Sector:
</b>
N/A
<br />
<b>
Industry:
</b>
N/A
<br />
</font>

What I want to do is pull out the text in the middle of the file and convert it into a human-readable format. In this example, that is:

 According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.

On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.

I know I have to do three things:

  1. Pull out the text in the middle of the file
  2. Replace "<br />" with "\n"
  3. Replace "&nbsp;" with " " (a single space)

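I know the last two are easy, since Python's replace method handles them, but I don't know how to do the first one.

I know a little about regular expressions and BeautifulSoup, but I don't know how to apply them to this problem.

Can anyone help me?

Thanks, and sorry for my poor English.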

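@Paul: I only want the Summary section. My teacher (who doesn't know much about computers) gave me a lot of HTML files and asked me to convert them into a format suitable for data mining (my teacher is trying to do this with SAS). I don't know SAS, but I think it is probably meant for processing large numbers of txt files, so I want to convert these HTML files into plain txt files.

@Owen: I need to process a lot of HTML files, and I don't think this problem is very hard, so I'd like to solve it directly in Python.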

Best answer

You can use Scrapely.

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

http://github.com/scrapy/scrapely
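
For files as regular as the sample in the question, a standard-library sketch may already be enough. Note that this does not use Scrapely: it is a plain regex approach that covers the three steps listed in the question. It assumes every file has a bold "Summary:" label and that the summary ends at the next <hr> tag, as in the sample. The names extract_summary and convert_folder are made up for this illustration, and the UTF-8 file encoding is also an assumption.

```python
import html
import re
from pathlib import Path

# Assumption: the summary sits between the bold "Summary:" label
# and the next <hr> tag, as in the sample file from the question.
SUMMARY_RE = re.compile(
    r"<b>\s*Summary:\s*</b>(.*?)<hr\b",
    re.IGNORECASE | re.DOTALL,
)


def extract_summary(page: str) -> str:
    """Return the Summary section of one HTML page as plain text."""
    match = SUMMARY_RE.search(page)
    if match is None:
        raise ValueError("no Summary section found")
    body = match.group(1)
    # Step 1 is done by the regex above. Collapse runs of whitespace
    # in the source so that only <br> tags produce line breaks.
    body = re.sub(r"\s+", " ", body)
    # Step 2: turn each <br> / <br /> into a newline.
    body = re.sub(r"\s*<br\s*/?>\s*", "\n", body, flags=re.IGNORECASE)
    # Drop any other tags that might remain inside the summary.
    body = re.sub(r"<[^>]+>", "", body)
    # Step 3: decode entities such as &nbsp; and &amp;, then turn the
    # non-breaking space that &nbsp; decodes to into a normal space.
    body = html.unescape(body).replace("\xa0", " ")
    return body.strip()


def convert_folder(src_dir: str, dst_dir: str) -> int:
    """Write one .txt file per .html file in src_dir; return the count."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.html")):
        page = path.read_text(encoding="utf-8", errors="replace")
        text = extract_summary(page)
        (dst / (path.stem + ".txt")).write_text(text + "\n", encoding="utf-8")
        count += 1
    return count
```

Calling convert_folder("html_files", "txt_files") (both folder names are placeholders) would write one txt file per input file. The final strip() also removes the leading space that &nbsp; leaves behind. A file without a Summary section raises ValueError, so irregular files are noticed rather than silently skipped. If the files turn out to be less uniform than the sample, a real HTML parser such as BeautifulSoup is likely the safer choice.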

Regarding "python - How do I convert HTML files into human-readable txt files?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/7155881/
