python - 从 smg 文件 Beautiful Soup 和 Python 中提取正文标签-6ren

python - 从 smg 文件 Beautiful Soup 和 Python 中提取正文标签

转载作者：太空宇宙更新时间：2023-11-04 01:28:24

我有一个 sgm 文件，格式如下:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="16321" NEWID="1001">
<DATE> 3-MAR-1987 09:18:21.26</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D><D>ussr</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;G T
&#22;&#22;&#1;f0288&#31;reute
d f BC-SANDOZ-PLANS-WEEDKILL   03-03 0095</UNKNOWN>
<TEXT>&#2;
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<DATELINE>    BASLE, March 3 - </DATELINE><BODY>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
    The company said it had signed a letter of intent with the
Soviet Ministry of Fertiliser Production to form the first
foreign joint venture the ministry had undertaken since the
Soviet Union allowed Western firms to enter into joint ventures
two months ago.
    The ministry and Sandoz will each have a 50 pct stake, but
a company spokeswoman was unable to give details of the size of
investment or planned output.
 Reuter
&#3;</BODY></TEXT>
</REUTERS>

同一个文件中有1000条根节点为RETURNS的记录。我想从每条记录中提取 body 标签并对其进行一些处理，但是我做不到。以下是我的代码

from bs4 import BeautifulSoup,SoupStrainer
f = open('dataset/reut2-001.sgm', 'r')
data= f.read()
soup = BeautifulSoup(data)
topics= soup.findAll('body') # find all body tags
print len(topics)  # print number of body tags in sgm file
i=0
for link in topics:         #loop through each body tag and print its content 
    children = link.findChildren()
    for child in children:
        if i==0:
            print child
        else:
            print "none"
            i=i+1

print i

问题是 for 循环不打印 body 标签的内容 - 而是打印记录本身。

最佳答案

正如我在评论中所说，出于未知(对我而言)的原因，您不应该将标签命名为 body。

因此，第一步:用例如 content 替换 body 标签名称:

<REUTERS TOPICS="NO" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="16321" NEWID="1001">
<DATE> 3-MAR-1987 09:18:21.26</DATE>
<TOPICS></TOPICS>
<PLACES><D>usa</D><D>ussr</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN> 
&#5;&#5;&#5;G T
&#22;&#22;&#1;f0288&#31;reute
d f BC-SANDOZ-PLANS-WEEDKILL   03-03 0095</UNKNOWN>
<TEXT>&#2;
<TITLE>SANDOZ PLANS WEEDKILLER JOINT VENTURE IN USSR</TITLE>
<DATELINE>    BASLE, March 3 - </DATELINE><CONTENT>Sandoz AG said it planned a joint venture
to produce herbicides in the Soviet Union.
    The company said it had signed a letter of intent with the
Soviet Ministry of Fertiliser Production to form the first
foreign joint venture the ministry had undertaken since the
Soviet Union allowed Western firms to enter into joint ventures
two months ago.
    The ministry and Sandoz will each have a 50 pct stake, but
a company spokeswoman was unable to give details of the size of
investment or planned output.
 Reuter
&#3;</CONTENT></TEXT>
</REUTERS>

这是代码:

from bs4 import BeautifulSoup,SoupStrainer
f = open('dataset/reut2-001.sgm', 'r')
data= f.read()
soup = BeautifulSoup(data)
contents = soup.findAll('content')
for content in contents:
    print content.text

关于python - 从 smg 文件 Beautiful Soup 和 Python 中提取正文标签，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/15863751/

文章推荐： html - CSS 背景不透明度变为灰色并仅覆盖移动设备上的图像

python - 从 smg 文件 Beautiful Soup 和 Python 中提取正文标签
我有一个 sgm 文件，格式如下: 3-MAR-1987 09:18:21.26 usaussr G T f0288reu

太空宇宙

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

python - 从 smg 文件 Beautiful Soup 和 Python 中提取正文标签