
python - Creating a list of URLs for a specific website


This is my first attempt at using programming for something useful, so please bear with me. Constructive feedback is much appreciated :)

I am working on building a database of all press releases from the European Parliament. So far I have built a scraper that retrieves the data I want from one specific URL. However, after reading and watching several tutorials, I still cannot figure out how to create a list of the URLs of all the press releases on this particular site.

Maybe it has to do with how the website is built, or I am (probably) just missing something obvious that an experienced programmer would spot immediately, but I really don't know how to proceed from here.

This is the starting URL: http://www.europarl.europa.eu/news/en/press-room

Here is my code:

import re
import time
from random import randint
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

links = [] # Until now I have just manually pasted a few links
# into this list, but I need it to contain all the URLs to scrape

# Function for removing HTML tags from text
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(text):
    return TAG_RE.sub('', text)

# Regex to match dates with pattern DD-MM-YYYY
date_match = re.compile(r'\d\d-\d\d-\d\d\d\d')

# Output file for the scraped rows (opened here for completeness; the
# original snippet used f without showing where it was opened)
f = open("press_releases.csv", "w", encoding="utf-8")

# For-loop to scrape variables from each site
for link in links:

    # Opening up connection and grabbing page
    uClient = uReq(link)

    # Saves content of page in new variable (still in HTML!!)
    page_html = uClient.read()

    # Close connection
    uClient.close()

    # Parsing page with soup
    page_soup = soup(page_html, "html.parser")

    # Grabs the page container
    pr_container = page_soup.findAll("div", {"id": "website"})

    # Scrape date
    date_container = pr_container[0].time
    date = date_container.text
    date = date_match.search(date)
    date = date.group()

    # Scrape title
    title = page_soup.h1.text
    title_clean = title.replace("\n", " ")
    title_clean = title_clean.replace("\xa0", "")
    title_clean = ' '.join(title_clean.split())
    title = title_clean

    # Scrape institutions involved
    type_of_question_container = pr_container[0].findAll("div", {"class": "ep_subtitle"})
    text = type_of_question_container[0].text
    question_clean = text.replace("\n", " ")
    question_clean = question_clean.replace("\xa0", " ") # chain on question_clean, not text
    question_clean = re.sub(r"\d+", "", question_clean) # strips the date digits from the subtitle
    question_clean = question_clean.replace("-", "")
    question_clean = question_clean.replace(":", "")
    question_clean = question_clean.replace("Press Releases", " ")
    question_clean = ' '.join(question_clean.split())
    institutions_mentioned = question_clean

    # Scrape text
    text_container = pr_container[0].findAll("div", {"class": "ep-a_text"})
    text_with_tags = str(text_container)
    text_clean = remove_tags(text_with_tags)
    text_clean = text_clean.replace("\n", " ")
    text_clean = text_clean.replace(",", " ") # Removing commas to avoid trouble with the .csv format later on
    text_clean = text_clean.replace("\xa0", " ")
    text_clean = ' '.join(text_clean.split())

    # Calculate word count
    word_count = len(text_clean.split())
    word_count = str(word_count)

    print("Finished scraping: " + link)

    # Pause between requests to be polite to the server
    time.sleep(randint(1, 5))

    f.write(date + "," + title + "," + institutions_mentioned + "," + word_count + "," + text_clean + "\n")

f.close()
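As an aside on the write step above: rather than stripping commas out of the article text, the csv module can quote fields automatically. A minimal sketch under assumed names (the file name and example values are illustrative, not from the original):

import csv

# csv.writer quotes any field that contains commas, so text_clean could keep them
with open("press_releases.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    # Illustrative values; inside the loop above this would be
    # writer.writerow([date, title, institutions_mentioned, word_count, text_clean])
    writer.writerow(["14-11-2017", "Example title", "Plenary session", "250", "Text, with commas, kept intact."])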

Best Answer

Here is a simple way to get the list of links you need with python-requests and lxml:

from lxml import html
import requests

url = "http://www.europarl.europa.eu/news/en/press-room/page/"
list_of_links = []
for page in range(10):
    # The listing is paginated under .../press-room/page/<n>
    r = requests.get(url + str(page))
    source = r.content
    page_source = html.fromstring(source)
    # Each press release is linked through an anchor titled "Read more"
    list_of_links.extend(page_source.xpath('//a[@title="Read more"]/@href'))
print(list_of_links)
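If you would rather keep the BeautifulSoup stack from the question, the same harvest works with a CSS attribute selector. This is a minimal sketch, not a verified drop-in: it assumes the listing pages still mark each press release with an anchor titled "Read more", and it joins against the base URL in case the hrefs come back relative.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = "http://www.europarl.europa.eu/news/en/press-room/page/"
links = []
for page in range(10):
    r = requests.get(base + str(page))
    page_soup = BeautifulSoup(r.content, "html.parser")
    # Same idea as the XPath above: anchors whose title attribute is "Read more"
    for a in page_soup.select('a[title="Read more"]'):
        href = a.get("href")
        if href:
            links.append(urljoin(base, href)) # handles absolute and relative hrefs
print(len(links), "links collected")

Either way, the resulting list can be assigned to the links variable that the scraping loop in the question iterates over.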

Regarding "python - Creating a list of URLs for a specific website", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46768629/
