python - How do I make my crawler parse data from the starting page?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:58:32

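I've written some Python code to scrape details from a torrent site. When I run it, the results are what I expect. The only problem with this crawler is that it skips the content of the first page (because the pagination URLs start from 2), and I haven't been able to fix that. Any help with this would be greatly appreciated.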

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"
b_link = "https://yts.ag"

def get_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            movie_details(b_link + item.attrib["href"])

def movie_details(link):
    response = requests.get(link).text
    tree = html.fromstring(response)
    for titles in tree.cssselect("div.browse-movie-wrap"):
        title = titles.cssselect('div.browse-movie-bottom a.browse-movie-title')[0].text
        link = titles.cssselect('div.browse-movie-year')[0].text
        rating = titles.cssselect('figcaption.hidden-xs h4.rating')[0].text
        genre = titles.cssselect('figcaption.hidden-xs h4')[0].text
        genre1 = titles.cssselect('figcaption.hidden-xs h4')[1].text
        print(title, link, rating, genre, genre1)

get_links(page_link)

Best answer

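Why not call movie_details() on main_link before the loop? The pagination links only cover page 2 onward, so the starting page has to be parsed explicitly.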

def get_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    movie_details(main_link)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            movie_details(b_link + item.attrib["href"])
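A pagination bar often lists the same page more than once (for example, a numbered link plus a "next" link), so a crawler like this may visit some pages twice. As a rough sketch, one way to handle both issues is to build the list of page URLs up front, with the starting page first and each pagination URL only once. The helper name collect_page_urls and the sample hrefs below are hypothetical and used only for illustration; the real href values on the site may look different.

```python
from urllib.parse import urljoin

def collect_page_urls(start_url, hrefs):
    """Return the start page followed by each distinct pagination URL, in order."""
    urls = [start_url]          # the starting page is always parsed first
    seen = {start_url}
    for href in hrefs:
        # Same filter as the original loop: keep only pagination links.
        if "page" not in href:
            continue
        # Resolve the site-relative href against the starting URL.
        url = urljoin(start_url, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

# Hypothetical href values, standing in for item.attrib["href"] from the page.
sample_hrefs = [
    "/browse-movies?page=2",
    "/browse-movies?page=3",
    "/browse-movies?page=2",   # duplicate link, e.g. from a "next" button
    "#",                       # not a pagination link
]

page_urls = collect_page_urls("https://yts.ag/browse-movies", sample_hrefs)
for url in page_urls:
    print(url)
```

In the crawler, the hrefs would come from tree.cssselect('ul.tsc_pagination a'), and each URL in the returned list would then be passed to movie_details().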

Regarding "python - How do I make my crawler parse data from the starting page?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45313617/
