
python - Awkward problem iterating over a list and extracting only the last link from each page [BS4]


I am trying to scrape a website; there are 12 pages with X links on them, and I just want to extract all of those links and store them for later use.
But there is an awkward problem with extracting the links from the pages. To be precise, my output contains only the last link from each page.
I know this description may sound confusing, so let me show you the code and an image:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time

#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"

archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)

#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"

paper_urls =[]

for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
    paper_urls.append(ahrefTags)
    print(paper_urls)
    time.sleep(5)
But the problem is that my output looks like [redacted].
I got this instead of roughly 80 links! I wonder what is going on; it looks like my script grabs only the last listed link from each generated URL (from the list called "issues" in the code)?! How can I fix this? I have no idea what the problem could be here.

Best Answer

Are you perhaps missing indentation when appending to paper_urls?
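To see why that produces only one link per page, here is a minimal standalone sketch (my own illustration, not part of the original answer): the loop variable is reassigned on every pass, and an append sitting outside the loop only runs after the loop has finished, so it captures just the final value.

hrefs = ['a.html', 'b.html', 'c.html']  # stand-in for the links found on one page

collected = []
for href in hrefs:
    last = href            # reassigned on every iteration
collected.append(last)     # not indented: runs once, after the loop has finished
print(collected)           # ['c.html'], only the last link survives

Indenting the append into the inner loop makes it run once per link rather than once per page: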

paper_urls =[]

for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags) # added missing indentation
    print(paper_urls)
    time.sleep(5)
After moving the print outside of the loop, the whole code would look like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time

#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"

archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)

#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"

paper_urls =[]

for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags)
        #print(ahrefTags) #uncomment if you wish to print each and every link by itself
    #time.sleep(5) #uncomment if you wish to add a delay between each request
print(paper_urls)
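Since the question mentions storing the links for later use and already imports csv, here is a minimal sketch of writing paper_urls to a file; the file name and single-column layout are my own assumptions, not part of the original answer.

import csv

# hypothetical output file name, adjust as needed
with open('paper_urls.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url'])                         # header row
    writer.writerows([link] for link in paper_urls)  # one link per row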

Regarding python - Awkward problem iterating over a list and extracting only the last link from each page [BS4], a similar question was found on Stack Overflow: https://stackoverflow.com/questions/67761311/
