
python - Scraping Yellow Pages hrefs with Python

Reprinted. Author: 行者123. Updated: 2023-12-01 03:25:34

I recently posted asking how to scrape data from the Yellow Pages, and @alecxe showed me some new ways to extract the data, which helped a lot. But I'm stuck again: I want to scrape the data behind each link in the Yellow Pages results, so I can open each business's Yellow Pages page, which contains more data. I want to add a variable called "url" that holds the business's href — not the actual business website, but the business's Yellow Pages page. I've tried various approaches, but nothing seems to work. The href is under "class=business-name".

import csv
import requests
from bs4 import BeautifulSoup


with open('cities_louisiana.csv', 'r') as cities:
    lines = cities.read().splitlines()

for city in lines:
    print(city)
    for x in range(0, 50):
        url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton%rouge+LA&page=" + str(x)
        print(url)
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")
        for result in soup.select(".search-results .result"):
            try:
                name = result.select_one(".business-name").get_text(strip=True, separator=" ")
            except:
                pass
            try:
                streetAddress = result.select_one(".street-address").get_text(strip=True, separator=" ")
            except:
                pass
            try:
                city = result.select_one(".locality").get_text(strip=True, separator=" ")
                city = city.replace(",", "")
                state = "LA"
                zip = result.select_one('span[itemprop$="postalCode"]').get_text(strip=True, separator=" ")
            except:
                pass
            try:
                telephone = result.select_one(".phones").get_text(strip=True, separator=" ")
            except:
                telephone = "No Telephone"
            try:
                categories = result.select_one(".categories").get_text(strip=True, separator=" ")
            except:
                categories = "No Categories"
            completeData = name, streetAddress, city, state, zip, telephone, categories
            print(completeData)
            with open("yellowpages_businesses_louisiana.csv", "a", newline="") as write:
                wrt = csv.writer(write)
                wrt.writerow(completeData)

Best Answer

Several things you should implement:

  • Extract the business link from the href attribute of the element with the business-name class — in BeautifulSoup, this can be done by treating the element like a dictionary
  • Use urljoin() to make the link absolute
  • Make requests to the business pages while maintaining a web-scraping session
  • Parse the business pages with BeautifulSoup as well and extract the desired information
  • Add a time delay to avoid hitting the site too often

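The urljoin() step in the list above can be sketched in isolation; note that the relative path below is a made-up example of what a business-name href might look like, not a real listing:

```python
from urllib.parse import urljoin

# Base URL of a search-results page (page number is arbitrary here).
base = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton%rouge+LA&page=1"

# A site-relative href, as typically found inside a business-name link
# (this particular path is a hypothetical example).
relative = "/baton-rouge-la/mip/some-business-12345"

# urljoin() resolves the relative path against the base URL's scheme and host.
absolute = urljoin(base, relative)
print(absolute)  # http://www.yellowpages.com/baton-rouge-la/mip/some-business-12345
```

Because the href starts with "/", urljoin() keeps only the scheme and host from the base URL and replaces the path and query string.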
A complete working example that prints the business name from the search-results page and the business description from the business profile page:

from urllib.parse import urljoin  

import requests
import time
from bs4 import BeautifulSoup


url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=baton%rouge+LA&page=1"


with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'}

    page = session.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    for result in soup.select(".search-results .result"):
        business_name_element = result.select_one(".business-name")
        name = business_name_element.get_text(strip=True, separator=" ")

        link = urljoin(page.url, business_name_element["href"])

        # extract additional business information
        business_page = session.get(link)
        business_soup = BeautifulSoup(business_page.text, "html.parser")
        description = business_soup.select_one("dd.description").text

        print(name, description)

        time.sleep(1)  # time delay to not hit the site too often
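The dictionary-style attribute access used for business_name_element["href"] can be demonstrated offline; the HTML snippet below is hypothetical stand-in markup, not taken from the real site:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a search-results page, for illustration only.
html = '''
<div class="result">
  <a class="business-name" href="/baton-rouge-la/mip/some-business-12345">
    <span>Some Business</span>
  </a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one(".business-name")

# BeautifulSoup Tag objects support dictionary-style access to attributes:
href = link["href"]
print(href)  # /baton-rouge-la/mip/some-business-12345
```

The resulting href is site-relative, which is why the answer passes it through urljoin() before requesting it.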

Regarding "python - Scraping Yellow Pages hrefs with Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41417485/
