python - In Python 3, how do I use the .append function to add a string to scraped links?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:11:40

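Thanks to stackoverflow.com, I was able to write a program that scrapes the links from any given web page. However, I need it to join the home page URL onto any relative link it encounters. (For example, "http://www.google.com/sitemap" is fine, but "/sitemap" on its own is not.)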

In the code below,

from bs4 import BeautifulSoup as mySoup
from urllib.parse import urljoin as myJoin
from urllib.request import urlopen as myRequest

base_url = "https://www.census.gov/programs-surveys/popest.html"

html_page = myRequest(base_url)
raw_html = html_page.read()
page_soup = mySoup(raw_html, "html.parser")
html_page.close()

f = open("census4-3.csv", "w")

all_links = page_soup.find_all('a', href=True)

def clean_links(tags, base_url):
    cleaned_links = set()
    for tag in tags:
        link = tag.get('href')
        if link is None:
            continue
        full_url = myJoin(base_url, link)
        cleaned_links.add(full_url)
    return cleaned_links

cleaned_links = clean_links(all_links, base_url)

for link in cleaned_links:
    f.write(str(link) + '\n')

f.close()
print("The CSV file is saved to your computer.")

how and where would I add something like this:

.append("http://www.google.com")

Best answer

You should store the base URL as base_url = 'https://www.census.gov'

Make the request like this:

html_page = myRequest(base_url + '/programs-surveys/popest.html')

Whenever you need the full URL for a root-relative link (one that starts with "/"), just do this:

full_url = base_url + link
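
Plain concatenation only gives a valid result when link is root-relative. If the href is already absolute, base_url + link glues two URLs together. The urljoin function that the question's code already imports handles both cases, so a minimal sketch of the difference might look like the following. The helper name to_absolute and the sample href values are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

base_url = "https://www.census.gov"  # site root, as suggested in the answer

def to_absolute(base, href):
    """Resolve href against base; an already-absolute href is returned unchanged."""
    return urljoin(base, href)

# Hypothetical href values, used only to illustrate the two cases
hrefs = ["/sitemap", "http://www.google.com/sitemap"]

for href in hrefs:
    joined = to_absolute(base_url, href)   # safe for relative and absolute hrefs
    concatenated = base_url + href         # only correct for root-relative hrefs
    print(href, "|", joined, "|", concatenated)
```

Under these assumptions, no .append call is needed: strings have no .append method, and the clean_links function in the question already resolves each href through urljoin before adding it to the set.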

Regarding "python - In Python 3, how do I use the .append function to add a string to scraped links?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48454799/
