Common issues with real-world data
Of course, using a small example of hard-coded text makes it clear why certain calls to find and similar methods fail - the content simply isn't there, and that's immediately obvious just by reading a few lines of data. Any attempt to debug code should start by carefully checking for typos:
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> print(soup.find('a', class_='sistre')) # note the typo
None
>>> print(soup.find('a', class_='sister')) # corrected
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
In the real world, however, web pages can easily span many kilobytes or even megabytes of text, so that kind of visual inspection isn't practical. In general, for more complex tasks, it's worth taking the time first to check if a given webpage provides an API to access data, rather than scraping it out of page content. Many websites are happy to provide the data directly, in a format that's easier to work with (because it's specifically designed to be worked with as data, rather than to fill in the blanks of a "template" web page).
As a rough overview: an API consists of endpoints - URIs that can be directly accessed in the same way as web page URLs, but the response is something other than a web page. The most common format by far is JSON, although it's possible to use any data format depending on the exact use case - for example, a table of data might be returned as CSV. To use a standard JSON endpoint, write code that figures out the exact URI to use, load it normally, read and parse the JSON response, and proceed with that data. (In some cases, an "API key" will be necessary; a few companies use these to bill for premium data access, but it's usually just so that the information requests can be tied to a specific user.)
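As a minimal sketch of those steps: the endpoint https://api.example.com/v1/staff, its department and page parameters, and the sample response below are all made up for illustration and don't belong to any real service. The actual network call is left as a comment so that the example runs offline.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URI = 'https://api.example.com/v1/staff'

def build_uri(department, page=1):
    """Work out the exact URI to request."""
    return BASE_URI + '?' + urlencode({'department': department, 'page': page})

def parse_response(body):
    """Parse a JSON response body into ordinary Python objects."""
    return json.loads(body)

uri = build_uri('sales')
# With a real endpoint, the body would come from something like:
#     from urllib.request import urlopen
#     with urlopen(uri) as response:
#         body = response.read().decode('utf-8')
# A made-up sample response is used here instead.
body = '[{"name": "Alice A.", "email": "alice@example.com"}, {"name": "Cameron C."}]'

staff = parse_response(body)
names = [person['name'] for person in staff]
print(uri)
print(names)
```

After parsing, the data is just lists and dictionaries, so there is no searching through tags at all.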
Normally this is much easier than anything that could be done with BeautifulSoup, and will save on bandwidth as well. Companies that offer publicly documented APIs for their web pages want you to use them; it's generally better for everyone involved.
All of that said, here are some common reasons why the web response being parsed by BeautifulSoup either doesn't contain what it's expected to, or is otherwise not straightforward to process.
Dynamically (client-side) generated content
Keep in mind that BeautifulSoup processes static HTML, not JavaScript. It can only use data that would be seen when visiting the webpage with JavaScript disabled.
Modern webpages commonly generate a lot of the page data by running JavaScript in the client's web browser. In typical cases, this JavaScript code will make more HTTP requests to get data, format it, and effectively edit the page (alter the DOM) on the fly. BeautifulSoup cannot handle any of this. It sees the JavaScript code in the web page as just more text.
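A small, made-up page shows the effect. Here the div would be filled in by the script in a real browser, but BeautifulSoup only sees the HTML as the server sent it (the id and the script contents are invented for this sketch):

```python
from bs4 import BeautifulSoup

# Made-up page: the <div> is empty in the HTML that the server sends,
# and would only be filled in by the script when run in a browser.
html = """
<html><body>
<div id="prices"></div>
<script>
document.getElementById('prices').textContent = 'Widget: $5';
</script>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# The script never runs, so the div is still empty.
div_text = soup.find('div', id='prices').text
print(repr(div_text))

# The JavaScript source is available, but only as text inside the tag.
script_text = soup.find('script').string
print('Widget: $5' in script_text)
```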
To scrape a dynamic website, consider using Selenium to emulate interacting with the web page.
Alternately, investigate what happens when using the site normally. Typically, the JavaScript code on the page will make calls to API endpoints, which can be seen on the "Network" (or similarly-named) tab of a web browser's developer console. This can be a great hint for understanding the site's API, even if it isn't easy to find good documentation.
User-agent checks
Every HTTP request includes headers that provide information to the server to help the server handle the request. These include information about caches (so the server can decide whether it can use a cached version of the data), acceptable data formats (so the server can e.g. apply compression to the response to save on bandwidth), and about the client (so the server can tweak the output to look right in every web browser).
The last part is done using the "user-agent" part of the header. However, by default, HTTP libraries (like urllib and requests) will generally not claim to be any web browser at all - which, on the server end, is a big red flag for "this user is running a program to scrape web pages, and not actually using a web browser".
Most companies don't like that very much. They would rather have you see the actual web page (including ads). So, the server may simply generate some kind of dummy page (or an HTTP error) instead. (Note: this might include a "too many requests" error, that would otherwise point at a rate limit as described in the next section.)
To work around this, set the user-agent header explicitly, in whatever way is appropriate for the HTTP library being used.
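For the standard library, a sketch might look like the following. The URL and the user-agent string are placeholders, not recommendations. Note that urllib stores header names with only the first letter capitalized, which is why the lookup uses 'User-agent'.

```python
from urllib.request import Request

# Placeholder values: substitute the real page URL and whatever
# User-Agent string is appropriate for the situation.
url = 'https://example.com/some/page'
headers = {'User-Agent': 'Mozilla/5.0 (compatible; ExampleScraper/1.0)'}

# Standard library: attach the headers to a Request object, then pass
# that object (rather than the bare URL) to urllib.request.urlopen.
request = Request(url, headers=headers)
print(request.get_header('User-agent'))

# With the third-party requests library, the same dictionary would be
# passed as: requests.get(url, headers=headers)
```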
Rate limits
Another telltale sign of a "bot" is that the same user is requesting multiple web pages as fast as the internet connection will allow, or not even waiting for one page to finish loading before asking for another one. The server tracks who is making requests by IP (and possibly by other "fingerprinting" information) even when logins are not required, and may simply deny page content to someone who is requesting pages too quickly.
Limits like this will usually apply equally to an API (if available) - the server is protecting itself against denial of service attacks. So generally the only work-around will be to fix the code to make requests less frequently, for example by pausing the program between requests.
See, for example, "How to avoid HTTP error 429 (Too Many Requests) python".
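A rough sketch of pausing between requests: fetch_all and fake_fetch are names invented for this example, and the right delay depends entirely on the site (check its documentation or terms of use). The download function is passed in as a parameter so that the pacing logic doesn't depend on any particular HTTP library.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns the response;
    it is a parameter here so the pacing logic stays separate from
    the choice of HTTP library.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results

# Stand-in for a real download function, so the example runs offline.
def fake_fetch(url):
    return 'contents of ' + url

pages = fetch_all(['https://example.com/1', 'https://example.com/2'],
                  fake_fetch, delay_seconds=0.1)
print(pages)
```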
Login required
This is pretty straightforward: if the content is normally only available to logged-in users, then the scraping script will have to emulate whatever login procedure the site uses.
Server-side dynamic/randomized names
Keep in mind that the server decides what to send for every request. It doesn't have to be the same thing every time, and it doesn't have to correspond to any actual files in the server's permanent storage.
For example, it could include randomized class names or IDs generated on the fly, that could potentially be different every time the page is accessed. Trickier yet: because of caching, the name could appear to be consistent... until the cache expires.
If a class name or ID in the HTML source seems to have a bunch of meaningless junk characters in it, consider not relying on that name staying consistent - think of another way to identify the necessary data. Alternatively, it might be possible to figure out a tag ID dynamically, by seeing how some other tag in the HTML refers to it.
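As an illustration of the second idea, suppose (hypothetically) that a form field has a junk-looking id, but a label with stable, readable text points at it through its for attribute. The markup below is invented for this sketch:

```python
from bs4 import BeautifulSoup

# Made-up markup: the id looks randomly generated, but the <label>
# refers to it through its "for" attribute.
html = """
<form>
<label for="x7f3a9c">Email</label>
<input id="x7f3a9c" type="text" value="alice@example.com">
</form>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the label by its stable, human-readable text...
label = soup.find('label', string='Email')
# ...then follow its "for" attribute to the tag with the matching id.
field = soup.find(id=label['for'])
print(field['value'])
```

The code never mentions the generated id itself, so it keeps working if the server picks a different one next time.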
Irregularly structured data
Suppose for example that a company web site's "About" page displays contact information for several key staff members, with a <div class="staff"> tag wrapping each person's info. Some of them list an email address, and others do not; when the address isn't listed, the corresponding tag is completely absent, rather than just not having any text:
soup = BeautifulSoup("""<html>
<head><title>Company staff</title></head><body>
<div class="staff">Name: <span class="name">Alice A.</span> Email: <span class="email">[email protected]</span></div>
<div class="staff">Name: <span class="name">Bob B.</span> Email: <span class="email">[email protected]</span></div>
<div class="staff">Name: <span class="name">Cameron C.</span></div>
</body>
</html>""", 'html.parser')
Trying to iterate and print each name and email will fail, because of the missing email:
>>> for staff in soup.select('div.staff'):
...     print('Name:', staff.find('span', class_='name').text)
...     print('Email:', staff.find('span', class_='email').text)
...
Name: Alice A.
Email: [email protected]
Name: Bob B.
Email: [email protected]
Name: Cameron C.
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
AttributeError: 'NoneType' object has no attribute 'text'
This is simply an irregularity that has to be expected and handled.
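The most direct way to handle it is to check for None before using the result. This sketch uses made-up data with the same shape as the page above, and the '(not listed)' placeholder is an arbitrary choice:

```python
from bs4 import BeautifulSoup

# Made-up data with the same shape as the page described above.
html = """
<div class="staff"><span class="name">Alice A.</span> <span class="email">alice@example.com</span></div>
<div class="staff"><span class="name">Cameron C.</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')

records = []
for staff in soup.select('div.staff'):
    name = staff.find('span', class_='name').text
    email_tag = staff.find('span', class_='email')
    # find returns None when there is no match, so check before using .text
    email = email_tag.text if email_tag is not None else '(not listed)'
    records.append((name, email))

print(records)
```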
However, depending on the exact requirements, there may be more elegant approaches. If the goal is simply to collect all email addresses (without worrying about names), for example, we might first try code that processes the child tags with a list comprehension:
>>> [staff.find('span', class_='email').text for staff in soup.select('div.staff')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
AttributeError: 'NoneType' object has no attribute 'text'
We could work around the problem by instead getting a list of emails for each name (which will have either 0 or 1 element), and using a nested list comprehension designed for a flat result:
>>> [email.text for staff in soup.select('div.staff') for email in staff.find_all('span', class_='email')]
['[email protected]', '[email protected]']
Or we could simply use a better query:
>>> # maybe we don't need to check for the div tags at all?
>>> [email.text for email in soup.select('span.email')]
['[email protected]', '[email protected]']
>>> # Or if we do, use a fancy CSS selector:
>>> # look for the span anywhere inside the div
>>> [email.text for email in soup.select('div.staff span.email')]
['[email protected]', '[email protected]']
>>> # require the div as an immediate parent of the span
>>> [email.text for email in soup.select('div.staff > span.email')]
['[email protected]', '[email protected]']
Invalid HTML "corrected" by the browser
HTML is complicated, and real-world HTML is often riddled with typos and minor errors that browsers gloss over. Nobody would use a pedantic browser that just popped up an error message if the page source wasn't 100% perfectly standards-compliant (both to begin with, and after each JavaScript operation) - because such a huge fraction of the web would just disappear from view.
BeautifulSoup allows for this by letting the HTML parser handle it, and letting the user choose an HTML parser if there are others installed besides the standard library one. Web browsers, on the other hand, have their own HTML parsers built in, which might be far more lenient, and also take much more heavy-weight approaches to "correcting" errors.
In this example, the OP's browser showed a <tbody> tag inside a <table> in its "Inspect Element" view, even though that was not present in the actual page source. The HTML parser used by BeautifulSoup, on the other hand, did not add one; it simply accepted having <tr> tags nested directly within a <table>. Thus, the corresponding Tag element created by BeautifulSoup to represent the table reported None for its tbody attribute.
Typically, problems like this can be worked around by searching within a subsection of the soup (e.g. by using a CSS selector), rather than trying to "step into" each nested tag. This is analogous to the problem of irregularly structured data.
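A sketch of that work-around, using a made-up table and the standard library's html.parser (other parsers may behave differently, and some do insert a tbody):

```python
from bs4 import BeautifulSoup

# Made-up table with no <tbody> in the source, as in the case described.
html = """
<table id="scores">
<tr><td>Alice</td><td>10</td></tr>
<tr><td>Bob</td><td>7</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='scores')

# Stepping through the tag a browser would have inserted fails:
print(table.tbody)

# Searching for rows anywhere inside the table works whether or not
# a <tbody> is present.
rows = [[cell.text for cell in row.select('td')] for row in table.select('tr')]
print(rows)
```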
Not HTML at all
Since it comes up sometimes, and is also relevant to the caveat at the top: not every web request will produce a web page. An image, for example, can't be processed with BeautifulSoup; it doesn't even represent text, let alone HTML. Less obviously, a URL that has something like /api/v1/ in the middle is most likely intended as an API endpoint, not a web page; the response will most likely be JSON formatted data, not HTML. BeautifulSoup is not an appropriate tool for parsing this data.
Modern web browsers will commonly generate a "wrapper" HTML document for such data. For example, if I view an image on Imgur, with the direct image URL (not one of Imgur's own "gallery" pages), and open my browser's web-inspector view, I'll see something like (with some placeholders substituted in):
<html>
<head>
<meta name="viewport" content="width=device-width; height=device-height;">
<link rel="stylesheet" href="resource://content-accessible/ImageDocument.css">
<link rel="stylesheet" href="resource://content-accessible/TopLevelImageDocument.css">
<title>[image name] ([format] Image, [width]×[height] pixels) — Scaled ([scale factor])</title>
</head>
<body>
<img src="[url]" alt="[url]" class="transparent shrinkToFit" width="[width]" height="[height]">
</body>
</html>
For JSON, a much more complex wrapper is generated - which is actually part of how the browser's JSON viewer is implemented.
The important thing to note here is that BeautifulSoup will not see any such HTML when the Python code makes a web request - the request was never filtered through a web browser, and it's the local browser that creates this HTML, not the remote server.
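One way to avoid handing the wrong kind of data to BeautifulSoup is to look at the response's Content-Type header first. The function below is a hypothetical helper written for this sketch, and it assumes UTF-8 text for simplicity. With urllib the header value should be available as response.headers.get('Content-Type'), and with requests as response.headers['Content-Type'].

```python
import json

def describe_response(content_type, body):
    """Decide how to treat a response based on its Content-Type header.

    content_type is the header value as a string; body is the raw bytes.
    """
    # Drop any parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(';')[0].strip().lower()
    if media_type == 'application/json':
        return 'json', json.loads(body.decode('utf-8'))
    if media_type in ('text/html', 'application/xhtml+xml'):
        return 'html', body.decode('utf-8')  # suitable for BeautifulSoup
    return 'other', body  # e.g. an image: leave the bytes alone

kind, data = describe_response('application/json; charset=utf-8', b'{"id": 1}')
print(kind, data)
```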
Overview
In general, there are two kinds of queries offered by BeautifulSoup: ones that look for a single specific element (tag, attribute, text etc.), and those which look for each element that meets the requirements.
For the latter group - the ones like .find_all that can give multiple results - the return value will be a list. If there weren't any results, then the list is simply empty. Nice and simple.
However, for methods like .find and .select_one that can only give a single result, if nothing is found in the HTML, the result will be None. BeautifulSoup will not directly raise an exception to explain the problem. Instead, an AttributeError will commonly occur in later code that tries to use the None inappropriately (because it expected to receive something else - typically, an instance of the Tag class that BeautifulSoup defines). This happens because None simply doesn't support the operation; it's called an AttributeError because the . syntax means to access an attribute of whatever is on the left-hand side.
[TODO: once a proper canonical exists, link to an explanation of what attributes are and what AttributeError is.]
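The difference between the two kinds of query can be seen side by side on a tiny, made-up document that contains no a tags at all:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story">Once upon a time</p>', 'html.parser')

# Multiple-result query: no matches gives an empty list-like result.
all_links = soup.find_all('a')
print(len(all_links))

# Single-result queries: no match gives None.
first_link = soup.find('a')
selected = soup.select_one('a.sister')
print(first_link, selected)
```

Looping over the empty result simply does nothing, whereas any attribute access on first_link or selected would raise the AttributeError described above.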
Examples
Let's consider the non-working code examples in the question one by one:
>>> print(soup.sister)
None
This tries to look for a <sister> tag in the HTML (not a different tag that has a class, id or other such attribute equal to sister). There isn't one, so the result is None.
>>> print(soup.find('a', class_='brother'))
None
This tries to find an <a> tag that has a class attribute equal to brother, like <a href="https://example.com/bobby" class="brother">Bobby</a>. The document doesn't contain anything like that; none of the a tags have that class (they all have the sister class instead).
>>> print(soup.select_one('a.brother'))
None
This is another way to do the same thing as the previous example, with a different method. (Instead of passing a tag name and some attribute values, we pass a CSS query selector.) The result is the same.
>>> soup.select_one('a.brother').text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'text'
Since soup.select_one('a.brother') returned None, this is the same as trying to do None.text. The error means exactly what it says: None doesn't have a text attribute to access. In fact, it doesn't have any "ordinary" attributes; the NoneType class only defines special methods like __str__ (which converts None to the string 'None', so that it can look like the actual text None when it is printed).