python - Incomplete XML data with Beautiful Soup

Reposted by: 太空宇宙 | Updated: 2023-11-03 16:53:41


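I'm using Python 3.4 and Beautiful Soup 4 to pull some data from an RSS XML feed. Everything seems to work, but sometimes it does not behave as expected: for at least one item in the list, not all of the data inside the <description> tag is retrieved.
For example, this is the item that is giving me trouble: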

<item>
    <title>Google&#8217;s first DeepMind AI health project is missing something</title>
    <link>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/</link>
    <comments>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/#respond</comments>
    <pubDate>Thu, 25 Feb 2016 11:36:56 +0000</pubDate>
    <dc:creator><![CDATA[Kirsty Styles]]></dc:creator>
            <category><![CDATA[Google]]></category>
    <category><![CDATA[Insider]]></category>
    <category><![CDATA[Deepmind]]></category>
    <category><![CDATA[doctor]]></category>
    <category><![CDATA[healthcare]]></category>
    <category><![CDATA[NHS]]></category>
    <category><![CDATA[UK]]></category>

    <guid isPermaLink="false">http://thenextweb.com/?p=957096</guid>
    <description><![CDATA[<img width="520" height="245" src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" alt="Doctors Seek Higher Fees From Health Insurers" title="Google&#039;s first DeepMind AI health project is missing something" data-id="750745" /><br />Having been down at Google’s DeepMind office earlier this week its man vs AI machine gaming competition preview, I was tipped off that a potentially-more-serious healthcare announcement would follow soon. That it has, but contrary to what the company’s remit might suggest, this project doesn’t actually contain any artificial intelligence at launch. “To date, no machine learning has been involved in these projects,” the company said. “While there is obvious potential in applying machine learning to these kinds of complex challenges, any decision to do so will led by clinicians.” DeepMind has announced an acquisition in the shape of an Imperial College London&#8230; <br><br><a href="http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/?utm_source=social&#038;utm_medium=feed&#038;utm_campaign=profeed">This story continues</a> at The Next Web]]></description>
    <wfw:commentRss>http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/feed/</wfw:commentRss>
    <slash:comments>0</slash:comments>
<enclosure url="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2014/04/doctor-crop-520x245.jpg" type="image/jpeg" length="0" />
</item>

I use this code to parse the data:

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.urlopen('http://thenextweb.com/feed/')

xml = BeautifulSoup(req, 'xml')

for item in xml.findAll('item'):
    string = item.description.string
    #new_string = string.split('/>', 1)
    #print(new_string[0]+'/><p>')
    print(string)

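When I run the script everything works, except for that particular item. The commented-out lines in the code are meant to split off the img tag and add a <p> tag to structure the content.

The result I get for that item is: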

’s DeepMind office earlier this week its man vs AI machine gaming competition preview, I was tipped off that a potentially-more-serious healthcare announcement would follow soon. That it has, but contrary to what the company’s remit might suggest, this project doesn’t actually contain any artificial intelligence at launch. “To date, no machine learning has been involved in these projects,” the company said. “While there is obvious potential in applying machine learning to these kinds of complex challenges, any decision to do so will led by clinicians.” DeepMind has announced an acquisition in the shape of an Imperial College London&#8230; <br><br><a href="http://thenextweb.com/google/2016/02/25/googles-first-deepmind-ai-health-project-is-missing-something/?utm_source=social&#038;utm_medium=feed&#038;utm_campaign=profeed">This story continues</a> at The Next Web

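I don't know what is going on. If anyone could help me, or point me to a way to extract the <img> tag exactly, I would be very grateful.

Best answer

Why not search for the description tag inside the for loop, like this: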

for item in xml.findAll('item'):
    s = item.find('description')
    print (s)
    >>> <description>&lt;img width="520" height="245" src="http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2016/02/shutterstock_366588536-520x245.jpg" alt="Fintech" title="5 British companies for FinTech Week" data-id="956789" /&gt;&lt;br /&gt;FinTech, financial technology, is about disrupting the stale financial sector with technology and innovation. Have you accepted the status quo of a bank-led dominance? The people in the flourishing FinTech field have rejected it. Last year, Eileen Burbidge, the UK government’s special envoy for FinTech stated: “London and the UK will lead the FinTech sector.” That’s not hard to believe. With a well-established financial sector, a cultivated tech scene and wide access to capital and talent, London is primed for FinTech. The industry generated over $9 billion in revenue last year. As the UK celebrates #FinTechWeek, we look at five British&amp;#8230; &lt;br&gt;&lt;br&gt;&lt;a href="http://thenextweb.com/insider/2016/02/25/5-british-companies-for-fintech-week/?utm_source=social&amp;#038;utm_medium=feed&amp;#038;utm_campaign=profeed"&gt;This story continues&lt;/a&gt; at The Next Web</description>
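
For the second half of the question (pulling out the <img> tag exactly), here is a minimal sketch that uses only the standard library rather than Beautiful Soup. It does not explain why the original script lost the start of the text; it only shows one way to read the CDATA content whole and split it the way the commented-out lines intended. The ITEM_XML sample, the FirstImage class, and the split_description helper are hypothetical names made up for illustration, and the sample item is a shortened stand-in, not the real feed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened, hypothetical stand-in for one feed item; the real feed is not fetched.
ITEM_XML = """<item>
  <title>Example item</title>
  <description><![CDATA[<img width="520" height="245" src="http://example.com/a.jpg" alt="Example" /><br />Body text with a &#8217; entity&#8230; <br><br><a href="http://example.com/story">This story continues</a> at The Next Web]]></description>
</item>"""


class FirstImage(HTMLParser):
    """Remember the attributes of the first <img> tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img" and self.attrs is None:
            self.attrs = dict(attrs)


def split_description(item_xml):
    """Return (img_attrs, remaining_html) for one <item> string."""
    item = ET.fromstring(item_xml)
    # findtext returns the CDATA content as one plain string.
    html = item.findtext("description") or ""
    # Same idea as the commented-out split in the question: cut after the first '/>'.
    head, sep, rest = html.partition("/>")
    parser = FirstImage()
    parser.feed(head + sep)
    parser.close()
    return parser.attrs, rest


img, rest = split_description(ITEM_XML)
print(img["src"])
print("<p>" + rest)
```

Splitting on the first '/>' assumes the description always starts with a self-closing <img> tag, as it does in the item shown in the question; if an item has no such tag, img comes back as None and should be checked before use.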

Regarding python - Incomplete XML data with Beautiful Soup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35629775/
