python - 如何从推特上抓取所有主题-6ren

python - 如何从推特上抓取所有主题

转载作者：行者123 更新时间：2023-12-04 04:29:39

Twitter 中的所有主题都可以在此 link 中找到
我想用里面的每个子类别抓取所有这些。
BeautifulSoup 在这里似乎没有用。我尝试使用 selenium，但我不知道如何匹配单击主类别后出现的 Xpath。

from selenium import webdriver
from selenium.common import exceptions

url = 'https://twitter.com/i/flow/topics_selector'
driver = webdriver.Chrome('absolute path to chromedriver')
driver.get(url)
driver.maximize_window()

main_topics = driver.find_elements_by_xpath('/html/body/div[1]/div/div/div[1]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div/span')

topics = {}
for main_topic in main_topics[2:]:
    print(main_topic.text.strip())
    topics[main_topic.text.strip()] = {}

我知道我可以使用 main_topics[3].click() 单击主要类别，但我不知道如何才能递归地点击它们，直到我找到只有 Follow 的那些。在右边。

最佳答案

抓取所有主要主题，例如艺术文化 , 商业与金融 等使用 Selenium和 python你必须诱导WebDriverWait为 visibility_of_all_elements_located()您可以使用以下任一 Locator Strategies :

使用 XPATH和文本属性:

driver.get("https://twitter.com/i/flow/topics_selector")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])

使用 XPATH和 get_attribute() :

driver.get("https://twitter.com/i/flow/topics_selector")
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])

控制台输出:

['Arts & culture', 'Business & finance', 'Careers', 'Entertainment', 'Fashion & beauty', 'Food', 'Gaming', 'Lifestyle', 'Movies and TV', 'Music', 'News', 'Outdoors', 'Science', 'Sports', 'Technology', 'Travel']

刮所有主和 子主题使用 Selenium 和 WebDriver您可以使用以下定位器策略:

使用 XPATH和 get_attribute("textContent") :

driver.get("https://twitter.com/i/flow/topics_selector")
elements =  WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))
for element in elements:
    element.click()
print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@role='button']/div/span[text()]")))])
driver.quit()

控制台输出:

['Arts & culture', 'Animation', 'Art', 'Books', 'Dance', 'Horoscope', 'Theater', 'Writing', 'Business & finance', 'Business personalities', 'Business professions', 'Cryptocurrencies', 'Careers', 'Education', 'Fields of study', 'Entertainment', 'Celebrities', 'Comedy', 'Digital creators', 'Entertainment brands', 'Podcasts', 'Popular franchises', 'Theater', 'Fashion & beauty', 'Beauty', 'Fashion', 'Food', 'Cooking', 'Cuisines', 'Gaming', 'Esports', 'Game development', 'Gaming hardware', 'Gaming personalities', 'Tabletop gaming', 'Video games', 'Lifestyle', 'Animals', 'At home', 'Collectibles', 'Family', 'Fitness', 'Unexplained phenomena', 'Movies and TV', 'Movies', 'Television', 'Music', 'Alternative', 'Bollywood music', 'C-pop', 'Classical music', 'Country music', 'Dance music', 'Electronic music', 'Hip-hop & rap', 'J-pop', 'K-hip hop', 'K-pop', 'Metal', 'Musical instruments', 'Pop', 'R&B and soul', 'Radio stations', 'Reggae', 'Reggaeton', 'Rock', 'World music', 'News', 'COVID-19', 'Local news', 'Social movements', 'Outdoors', 'Science', 'Biology', 'Sports', 'American football', 'Australian rules football', 'Auto racing', 'Baseball', 'Basketball', 'Combat Sports', 'Cricket', 'Extreme sports', 'Fantasy sports', 'Football', 'Golf', 'Gymnastics', 'Hockey', 'Lacrosse', 'Pub sports', 'Rugby', 'Sports icons', 'Sports journalists & coaches', 'Tennis', 'Track & field', 'Water sports', 'Winter sports', 'Technology', 'Computer programming', 'Cryptocurrencies', 'Data science', 'Information security', 'Operating system', 'Tech brands', 'Tech personalities', 'Travel', 'Adventure travel', 'Destinations', 'Transportation']

备注 :您必须添加以下导入:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

关于python - 如何从推特上抓取所有主题，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/63035551/

文章推荐： angular - tsconfig.spec.json 有什么作用？

文章推荐： c - 检测执行字符集字母是否连续

文章推荐： objection.js - 为 objection.js 进行的所有查询打印完整的 SQL

行者123

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

python - 如何从推特上抓取所有主题