
python - BeautifulSoup - unable to create csv and text files after scraping


I am trying to extract the URLs of articles from all pages of a website. Instead, only the URLs from the first page are scraped, repeatedly, and stored in a csv file. The information from those links is then scraped in the same way and stored in a text file.

I need some help with this issue.

import requests
from bs4 import BeautifulSoup
import csv
import lxml
import urllib2

base_url = 'https://www.marketingweek.com/?s=big+data'
response = requests.get(base_url)
soup = BeautifulSoup(response.content, "lxml")

res = []

while 1:
    search_results = soup.find('div', class_='archive-constraint') #localizing search window with article links
    article_link_tags = search_results.findAll('a') #ordinary scheme goes further
    res.append([url['href'] for url in article_link_tags])
    #Automatically clicks next button to load other articles
    next_button = soup.find('a', text='>>')
    #Searches for articles till Next button is not found
    if not next_button:
        break
    res.append([url['href'] for url in article_link_tags])
    soup = BeautifulSoup(response.text, "lxml")

for i in res:
    for j in i:
        print(j)

####Storing scraped links in csv file###

with open('StoreUrl1.csv', 'w+') as f:
    f.seek(0)
    for i in res:
        for j in i:
            f.write('\n'.join(i))


#######Extracting info from URLs########

with open('StoreUrl1.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)

    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url), "lxml")

        with open('InfoOutput1.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
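
The likely cause of the repetition is that the loop above never fetches a new page: response is requested once before the loop, so every iteration re-parses the same first-page HTML. For reference, a minimal sketch (not part of the original question) of a loop that follows the '>>' link instead, assuming that anchor exists on each results page and carries an absolute href:

    import requests
    from bs4 import BeautifulSoup

    base_url = 'https://www.marketingweek.com/?s=big+data'
    response = requests.get(base_url)
    soup = BeautifulSoup(response.content, "lxml")

    res = []
    while True:
        search_results = soup.find('div', class_='archive-constraint')
        article_link_tags = search_results.findAll('a')
        res.append([a['href'] for a in article_link_tags])

        next_button = soup.find('a', text='>>')
        if not next_button:
            break
        # fetch the page the '>>' link points to and re-parse it,
        # so the next iteration works on fresh HTML (href assumed absolute)
        response = requests.get(next_button['href'])
        soup = BeautifulSoup(response.content, "lxml")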

Best Answer

A solution using lxml's html parser.

There are 361 pages in total, with 12 links on each page. We can iterate over every page and extract the links using xpath.

xpath helps with getting:

  • the text under a specific tag
  • the value of a specific attribute of a tag (here: the value of the 'href' attribute of the 'a' tag)
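
As a small illustration of those two cases (on a made-up HTML fragment shaped like the markup the script below targets, not the live site):

    from lxml import html

    snippet = html.fromstring(
        '<div class="archive-constraint">'
        '<h2 class="hentry-title entry-title">'
        '<a href="https://example.com/post">Big data post</a></h2></div>')

    # text under a specific tag
    print(snippet.xpath('//h2[@class = "hentry-title entry-title"]/a/text()'))
    # ['Big data post']

    # value of a specific attribute (here the 'href' of the 'a' tag)
    print(snippet.xpath('//h2[@class = "hentry-title entry-title"]/a/@href'))
    # ['https://example.com/post']

The full script below applies the same idea to every results page: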

    import csv
    from lxml import html
    from time import sleep
    import requests
    from random import randint

    outputFile = open("All_links.csv", 'wb')
    fileWriter = csv.writer(outputFile)

    fileWriter.writerow(["Sl. No.", "Page Number", "Link"])

    url1 = 'https://www.marketingweek.com/page/'
    url2 = '/?s=big+data'

    sl_no = 1

    #iterating from the 1st page through the 361st page
    for i in xrange(1, 362):

        #generating final url to be scraped using page number
        url = url1 + str(i) + url2

        #Fetching page
        response = requests.get(url)
        sleep(randint(10, 20))
        #using html parser
        htmlContent = html.fromstring(response.content)

        #Capturing all 'a' tags under h2 tag with class 'hentry-title entry-title'
        page_links = htmlContent.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href')
        for page_link in page_links:
            fileWriter.writerow([sl_no, i, page_link])
            sl_no += 1

    #close the file so all rows are flushed to disk
    outputFile.close()
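
The script above only collects the links. For the second half of the question (visiting each saved link and writing the article paragraphs to a text file), a minimal sketch in the same Python 2 style could look like this; the column index, file names, and use of requests are assumptions based on the question, not part of the answer:

    import csv
    import requests
    from bs4 import BeautifulSoup

    with open("All_links.csv", 'rb') as links_file, open("InfoOutput1.txt", 'a') as out_file:
        reader = csv.reader(links_file)
        next(reader)  # skip the ["Sl. No.", "Page Number", "Link"] header row

        for row in reader:
            article_url = row[2]  # the 'Link' column written by the script above
            article = BeautifulSoup(requests.get(article_url).content, "lxml")

            # dump every paragraph of the article, one per line
            for p in article.find_all('p'):
                out_file.write(p.get_text().encode('utf-8') + '\n')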

Regarding python - BeautifulSoup - unable to create csv and text files after scraping, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45477874/
