
python - How to scrape the second <p> of a web page using Python and BeautifulSoup

Reposted. Author: 行者123. Updated: 2023-12-01 08:28:05

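I have been experimenting with BeautifulSoup because I want to scrape a web page ( https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1 ). So far I have managed to scrape some elements, but now I want to scrape the movie description and I am stuck. The description sits in the HTML like this: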

<div class="lister-item mode-advanced">
<div class="lister-item-content">
<p class="muted-text"> paragraph I don't need</p>
<p class="muted-text"> paragraph I need</p>
</div>
</div>

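I want to extract the second paragraph. That seems easy to do, but everything I have tried gives me None as output. I have been looking around for answers. In another Stack Overflow post I found that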

find('p:nth-of-type(1)')  

find_elements_by_css_selector('.lister-item-mode >p:nth-child(1)')

would solve this, but it still gives me

None  # as output

Below is a piece of my code. It is fairly basic, because I am just trying to learn:

import urllib2
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

first_movie = movie_containers[0]

first_title = first_movie.h3.a.text
print first_title

first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year

first_imdb = float(first_movie.strong.text)
print first_imdb

# !!!! problem zone ---------------------------------------------
first_description = first_movie.find('p', class_='muted-text')
#first_description = first_description.text
print first_description

The code above gives me this output:

$ python scrape.py
Logan
(2017)
8.1
None

I would like to learn the right way to select HTML tags, because it will be useful for future projects.

Best answer

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

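You can then index into the returned list to get the element you need. Indexing starts at 0, so 1 gives the second item.

Change first_description to the following. Note that it filters on the class text-muted, not muted-text; the original find() call looked for a class that does not occur on the page, which is why it returned None.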

first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
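To see the difference without touching the network, here is a minimal Python 3 sketch that runs against inline markup. The SAMPLE_HTML string is a hypothetical, simplified stand-in modelled on the snippet in the question, not the real IMDb page, and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the question's snippet;
# the real page contains many more elements.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <p class="text-muted">Runtime and genre line</p>
    <p class="text-muted">  A weary Logan cares for an ailing Professor X.  </p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find("div", class_="lister-item")

# A class name that does not occur in the markup matches nothing,
# so find() returns None.
wrong = container.find("p", class_="muted-text")

# find_all() returns a list-like ResultSet, so [1] picks the second match.
paragraphs = container.find_all("p", class_="text-muted")
second_description = paragraphs[1].text.strip()

print(wrong)
print(len(paragraphs))
print(second_description)
```

Indexing with [1] raises IndexError when fewer than two paragraphs match, so checking len(paragraphs) first is a reasonable safeguard when the page layout may vary.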

Full code

import urllib2
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

first_movie = movie_containers[0]

first_title = first_movie.h3.a.text
print first_title

first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year

first_imdb = float(first_movie.strong.text)
print first_imdb

# !!!! problem zone ---------------------------------------------
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
#first_description = first_description.text
print first_description

Output

Logan
(2017)
8.1
In the near future, a weary Logan cares for an ailing Professor X. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.

Read the BeautifulSoup documentation to learn the right way to select HTML tags.
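One thing the documentation makes clear is that find() expects a tag name, while CSS selectors such as the p:nth-of-type(...) form quoted in the question belong to select() and select_one(). (find_elements_by_css_selector is a Selenium method, not part of BeautifulSoup.) The sketch below assumes a reasonably recent bs4 release and uses hypothetical inline markup, so treat it as an illustration rather than a drop-in replacement.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; not the real page.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <p class="text-muted">Runtime and genre line</p>
    <p class="text-muted">Movie description line</p>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() treats the string as a tag name, so a CSS selector matches nothing.
not_found = soup.find("p:nth-of-type(2)")

# select_one() takes a CSS selector and returns the first match (or None).
# nth-of-type is 1-based, so 2 means the second <p> among its siblings.
css_match = soup.select_one(".lister-item-content > p:nth-of-type(2)")
css_text = css_match.get_text(strip=True)

# select() returns every match as a list.
all_texts = [p.get_text(strip=True) for p in soup.select("p.text-muted")]

print(not_found)
print(css_text)
print(all_texts)
```

Unlike list indexing in Python, the CSS nth-of-type counter starts at 1, which is an easy source of off-by-one mistakes when switching between the two styles.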

Also consider migrating to Python 3.
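As a rough idea of what a Python 3 version might look like, the sketch below wraps the same lookups in a function and uses print() as a function. The function name parse_first_movie and the SAMPLE_HTML markup are hypothetical; the sample only imitates the structure the code above relies on, and the live page layout may differ or change.

```python
from bs4 import BeautifulSoup


def parse_first_movie(html):
    """Return title, year, rating and description of the first listed movie."""
    soup = BeautifulSoup(html, "html.parser")
    first_movie = soup.find("div", class_="lister-item mode-advanced")
    title = first_movie.h3.a.text
    year = first_movie.h3.find("span", class_="lister-item-year").text
    rating = float(first_movie.strong.text)
    # Second <p class="text-muted"> holds the description.
    description = first_movie.find_all("p", class_="text-muted")[1].text.strip()
    return {"title": title, "year": year, "rating": rating,
            "description": description}


# Hypothetical stand-in for the downloaded page, used so the sketch
# runs without network access.
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <h3><a href="#">Logan</a>
      <span class="lister-item-year text-muted unbold">(2017)</span></h3>
    <p class="text-muted">Runtime and genre line</p>
    <div class="ratings-bar"><strong>8.1</strong></div>
    <p class="text-muted">A weary Logan cares for an ailing Professor X.</p>
  </div>
</div>
"""

movie = parse_first_movie(SAMPLE_HTML)
print(movie["title"], movie["year"], movie["rating"])
print(movie["description"])
```

To run it against the live page, pass the downloaded text instead of the sample, for example parse_first_movie(get(url).text) with get imported from requests as in the code above. The unused urllib2 import can simply be dropped, since that module does not exist in Python 3.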

Regarding "python - How to scrape the second <p> of a web page using Python and BeautifulSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54093253/
