
python - Want to scrape an entire webpage that has a load-more button, using the POST method with BeautifulSoup in Python (don't want to use Selenium)


I want to scrape some details from the main_url, where each company has its own URL, and collect each company's credentials: company name, phone, fax, website, and so on. I have written code using beautifulsoup and requests, and I do get the credentials, but only for 52 companies. After that it fails, because the remaining companies sit behind a JavaScript-driven load-more button. I want to get the details of all the companies, including the ones revealed by clicking that load-more button, handling everything with requests and beautifulsoup only; I don't want to use Selenium for this. I would be glad and grateful for any help. Here is my code, which gives good results up to the point where the load-more button becomes the problem:

from bs4 import BeautifulSoup
import urllib.request
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("https://www.arabiantalks.com/category/1/advertising-gift-articles")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
cnt = 0

for links in soup.find_all('a', href=True):
    url = links['href']
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Only company profile pages have the 'rdetails' card.
    results = soup.find('div', class_='rdetails')
    if results is not None:
        name = results.find('h1')
        if name is not None:
            print(f"Company_name: {name.text}")
        else:
            print("Company_name: Notfound")

        address = results.find(attrs={'itemprop': 'address'})
        if address is not None:
            print(f"Company_address: {address.text[12:]}")  # slice off the leading "Address :" label
        else:
            print("Company_address: Notfound")

        phone = results.find(attrs={'itemprop': 'telephone'})
        if phone is not None:
            print(f"Company_phone: {phone.text[16:]}")  # slice off the leading "Phone Number :" label
        else:
            print("Company_phone: Notfound")

        fax = results.find(attrs={'itemprop': 'faxNumber'})
        if fax is not None:
            print(f"Company_fax: {fax.text[7:]}")  # slice off the leading "Fax :" label
        else:
            print("Company_fax: Notfound")

        email = results.find(attrs={'itemprop': 'email'})
        if email is not None:
            print(f"Company_email: {email.text[9:]}")  # slice off the leading "E-mail :" label
        else:
            print("Company_email: Notfound")

        website = results.find(attrs={'itemprop': 'url'})
        if website is not None:
            print(f"Company_website: {website.text}")
        else:
            print("Company_website: Notfound")

        cnt += 1
        print(cnt)
        print("=" * 100)

If anyone knows the approach, please copy this code, make the necessary modifications, and post it back as an answer, so I can see exactly what needs to change. I have tried many articles but am still struggling. This first answer is the last link I tried. Please help me solve this; I am quite new to web scraping. Thanks in advance...

Best answer

This question is (almost) a duplicate of this one: Scraping a website that has a "Load more" button doesn't return info of newly loaded items with Beautiful Soup and Selenium

The difference is that in this case the ajax response is not JSON but HTML. You need to check the Network tab in your browser's dev tools to see the network calls being made.
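
You can confirm this for yourself with a minimal probe of that endpoint before writing the full scraper. This is just a sketch; the endpoint URL and the start/cat form fields are the ones the full code below uses, as observed in the Network tab:

import requests

# Probe the load-more endpoint directly; it expects form data, not JSON.
r = requests.post(
    'https://www.arabiantalks.com/ajax/load_content',
    data={'start': 20, 'cat': 1},  # start offset and category id
)

print(r.status_code)
print(r.headers.get('Content-Type'))
# The body is an HTML fragment of <a> cards, so parse it with BeautifulSoup;
# calling r.json() here would raise a JSONDecodeError.
print(r.text[:300])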

The following code hits the ajax endpoint, pulls all the available data, collects every company profile url, scrapes the name, address, phone, fax, email and website, and saves everything to a csv file:

from bs4 import BeautifulSoup
import requests
import pandas as pd

item_list = []
counter = 20  # the initial page already carries the first items; ajax pagination starts at this offset

s = requests.Session()

# Collect the profile urls that are already present on the initial category page.
r = s.get('https://www.arabiantalks.com/category/1/advertising-gift-articles')
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('a', {'itemprop': 'item'})

for item in items:
    item_list.append(item.get('href'))

# Keep POSTing to the ajax endpoint (the same call the load-more button makes)
# until it returns an effectively empty fragment.
while True:
    payload = {
        'start': counter,
        'cat': 1
    }
    r = s.post('https://www.arabiantalks.com/ajax/load_content', data=payload)
    if len(r.text) < 25:  # no more items: the endpoint returns an (almost) empty body
        break
    soup = BeautifulSoup(r.text, 'html.parser')
    items = soup.find_all('a')
    for item in items:
        item_list.append(item.get('href'))
    counter = counter + 12  # each ajax response carries the next batch of 12 items

print('Total items:', len(set(item_list)))

# Visit every profile url and pull the details out of the 'rdetails' card.
full_comp_list = []
for x in item_list:
    r = s.get(x)
    soup = BeautifulSoup(r.text, 'html.parser')
    c_details_card = soup.select_one('div.rdetails')
    try:
        c_name = c_details_card.select_one('#hcname').text.strip()
    except Exception:
        c_name = 'Name unknown'
    try:
        c_address = c_details_card.find('h3', {'itemprop': 'address'}).text.strip()
    except Exception:
        c_address = 'Address unknown'
    try:
        c_phone = c_details_card.find('h3', {'itemprop': 'telephone'}).text.strip()
    except Exception:
        c_phone = 'Phone unknown'
    try:
        c_fax = c_details_card.find('h3', {'itemprop': 'faxNumber'}).text.strip()
    except Exception:
        c_fax = 'Fax unknown'
    try:
        c_email = c_details_card.find('h3', {'itemprop': 'email'}).text.strip()
    except Exception:
        c_email = 'Email unknown'
    try:
        c_website = c_details_card.find('a').get('href')
    except Exception:
        c_website = 'Website unknown'
    full_comp_list.append((c_name, c_address, c_phone, c_fax, c_email, c_website))
    print('Done', c_name)

# Deduplicate and save everything to csv.
full_df = pd.DataFrame(list(set(full_comp_list)), columns=['Name', 'Address', 'Phone', 'Fax', 'Email', 'Website'])
full_df.to_csv('full_arabian_advertising_companies.csv')
full_df


It also prints to the terminal as it goes, so you can see what it is doing:

Total items: 122
Done Ash & Sims Advertising LLC
Done Strings International Advertising LLC
Done Zaabeel Advertising LLC
Done Crystal Arc Factory LLC
Done Zone Group
Done Business Link General Trading
[....]

Name Address Phone Fax Email Website
0 Ash & Sims Advertising LLC Address : P.O.Box 50391,\nDubai - United Arab Emirates Phone Number : +971-4-8851366 , +9714 8851366 Fax : +971-4-8852499 E-mail : sales@ashandsims.com http://www.ashandsims.com
1 Strings International Advertising LLC Address : P O BOX 117617\n57, Al Kawakeb Property, Al Quoz\nDubai, U.A.E Phone Number : +971-4-3386567 , +971502503591 Fax : +971-4-3386569 E-mail : vinod@stringsinternational.org http://www.stringsinternational.org
2 Zaabeel Advertising LLC Address : Al Khabaisi, Phone Number : +971-4-2598444 Fax : +971-4-2598448 E-mail : info@zaabeeladv.com http://www.zaabeeladv.com
3 Crystal Arc Factory LLC Address : Dubai - P.O. Box 72282\nAl Quoz, Interchange 3, Al Manara, Phone Number : +971-4-3479191 , +971 4 3479191, Fax : +971-4-3475535 E-mail : info@crystalarc.net http://www.crystalarc.net
4 Zone Group Address : Al Khalidiya opp to Rak Bank, Kamala Tower,\nOffice no.1401, PO Box 129297, Abu Dhabi, UAE Phone Number : +97126339004 Fax : +97126339005 E-mail : info@zonegroupuae.ae http://www.zonegroupuae.ae
[....]
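
As the output above shows, the scraped fields keep the on-page labels ("Address :", "Phone Number :", "Fax :", "E-mail :"), which is what the slicing in your own code ([12:], [16:], and so on) was trying to remove. If you want the bare values, a small post-processing pass over the saved csv is enough. A minimal sketch, assuming the label format shown above (the regex and column list are assumptions, not part of the original answer):

import re
import pandas as pd

def strip_label(value: str) -> str:
    # Remove a leading "<Label> :" prefix, up to the first colon only.
    return re.sub(r'^\s*[\w\s-]+?\s*:\s*', '', value, count=1)

full_df = pd.read_csv('full_arabian_advertising_companies.csv', index_col=0)
for col in ['Address', 'Phone', 'Fax', 'Email']:  # assumed label-bearing columns
    full_df[col] = full_df[col].astype(str).map(strip_label)
print(full_df.head())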

Regarding "python - Want to scrape an entire webpage that has a load-more button, using the POST method with BeautifulSoup in Python (don't want to use Selenium)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73122219/
