For some reason this clutch.co scraper isn't clicking the "next" button and navigating to the next page. So when I run this code it'll only get information from the first page and then close itself.
I added waits to allow the page to load, but it hasn't helped. Watching the browser, you can see it scroll to the bottom of the page and then close itself.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

website = "https://clutch.co/us/web-developers"

options = webdriver.ChromeOptions()
options.add_experimental_option("detach", False)
driver = webdriver.Chrome(options=options)
driver.get(website)

wait = WebDriverWait(driver, 10)
company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))

# pagination
pagination = driver.find_element(By.XPATH, '//ul[@class="pagination justify-content-center"]')
pages = pagination.find_elements(By.TAG_NAME, 'li')
last_page = 250

company_names = []
taglines = []
locations = []
costs = []
ratings = []

current_page = 1
while current_page <= last_page:
    company_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
    for company_element in company_elements:
        company_name = company_element.find_element(By.CLASS_NAME, "company_info").text
        company_names.append(company_name)
        tagline = company_element.find_element(By.XPATH, './/p[@class="company_info__wrap tagline"]').text
        taglines.append(tagline)
        rating = company_element.find_element(By.XPATH, './/span[@class="rating sg-rating__number"]').text
        ratings.append(rating)
        location = company_element.find_element(By.XPATH, './/span[@class="locality"]').text
        locations.append(location)
        cost = company_element.find_element(By.XPATH, './/div[@class="list-item block_tag custom_popover"]').text
        costs.append(cost)
    current_page = current_page + 1
    try:
        next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]")')
        next_page.click()
        time.sleep(10)
    except:
        break

driver.close()

data = {'Company_Name': company_names, 'Tagline': taglines, 'location': locations, 'Ticket_Price': costs, 'Rating': ratings}
df = pd.DataFrame(data)
df.to_csv('companies_test1.csv', index=False)
print(df)
Answer:
Your XPath is wrong (note the extra ") at the end of the string); use:
next_page = driver.find_element(By.XPATH,'//li[@class="page-item next"]/a[@class="page-link"]')
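Incidentally, those stray characters also explain why the script closes silently: the bad XPath raises InvalidSelectorException, which the bare except swallows before breaking out of the loop. A minimal sketch of a more talkative version (not part of the original answer):

try:
    next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
    next_page.click()
    time.sleep(10)
except Exception as e:
    # print why pagination stopped instead of failing silently
    print(f"Pagination stopped on page {current_page}: {e}")
    break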
But even with that fixed, the website blocks the click. If you remove the try/except, you can read the error:
selenium.common.exceptions.ElementClickInterceptedException:
Message: element click intercepted: Element
<a class="page-link" data-page="1" href="/us/web-developers?pag e=1" data-link="?page=1">...</a>
is not clickable at point (622, 888).
Other element would receive the click:
<div id="CybotCookiebotDialogBodyButtons" style="padding-left: 0px;">...</div>
Better code, but my IP/settings trigger the Cloudflare captcha:
next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
np = next_page.get_attribute('href')  # navigate to the link's href instead of clicking it
driver.get(np)
time.sleep(6)
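Folded back into the question's loop, with the fixed sleep replaced by an explicit wait on the listings (a sketch that reuses only the selectors already in the question):

try:
    next_page = driver.find_element(By.XPATH, '//li[@class="page-item next"]/a[@class="page-link"]')
    driver.get(next_page.get_attribute('href'))
    # wait for the next page's listings instead of sleeping a fixed 6 seconds
    wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'provider-info')))
except Exception:
    # no "next" link on the last page, or the request was blocked
    break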
Yeah, it works, thanks! I also get the Cloudflare captcha; I'll find a workaround.
The way to say thanks here is to upvote/accept the answer. Can you tell me what the workaround is? If you don't want it to be public, you can find my email on the page linked in my profile.