
python - Scraping YouTube comments with selenium and Google Colab is slow


I am scraping video comments from YouTube using selenium on Google Colab. Whether it is 1000 comments or 38 comments, the whole scraping run takes about an hour. What can I do to improve my code and speed up processing? Thanks!
Thanks to the following resources, which helped in building the code:
1:https://colab.research.google.com/drive/1GFJKhpOju_WLAgiVPCzCGTBVGMkyAjtk#scrollTo=4Ylzd_l6fXGv
2:https://www.tfzx.net/article/2719742.html
3:https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab
Output #1:

Completed scraping 1000 comments in 3089.1585 seconds from YouTube Entertainment Tonight channel.
Output #2:
Completed scraping 38 comments in 3011.5525 seconds from YouTube Anne Schmidt channel.
Input:
!apt-get update
!apt install chromium-chromedriver
%pip install selenium
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=chrome_options)
import time
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

def scrapecomments(url):
    tic = time.perf_counter()
    wait = WebDriverWait(wd, 15)
    wd.get(url)
    data1 = []
    data2 = []
    data3 = []
    # Scroll to the bottom of the page 200 times, pausing 15 seconds after each scroll
    # so more comments can load
    for item in range(200):
        wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
        time.sleep(15)
    # Collect up to 1000 comment author names
    for author in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#author-text"))):
        if len(data1) == 1000:
            break
        else:
            data1.append(author.text)
    # Collect comment text
    for comment in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content-text"))):
        data2.append(comment.text)
    # Collect like counts
    for likes in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#vote-count-middle"))):
        data3.append(likes.text)

    def merge(list1, list2, list3):
        merged_list = [(list1[i], list2[i], list3[i]) for i in range(0, len(list1))]
        return merged_list

    alldata = merge(data1, data2, data3)
    comments = pd.DataFrame(alldata, columns=['user_id', 'comment', 'likes'])
    comments['rank'] = comments.reset_index().index + 1
    channel_name = wd.find_element_by_id('channel-name').text
    comments['source'] = channel_name
    toc = time.perf_counter()
    print(f"Completed scraping {len(data1)} comments in {toc - tic:0.4f} seconds from YouTube {channel_name} channel.")
    return comments

Best Answer

It could also be that you are installing chromedriver and selenium every time you run the code.
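A minimal sketch of that idea, assuming the Colab setup cell shown in the question: wrap the installation commands in a check so they only run when chromedriver is missing from the VM, instead of on every execution. The guard below is a hypothetical illustration, not part of the original code.

import os
import shutil

# Hypothetical guard: only install chromium-chromedriver and selenium
# when chromedriver is not already present on this Colab runtime.
if shutil.which('chromedriver') is None:
    os.system('apt-get update')
    os.system('apt-get install -y chromium-chromedriver')
    os.system('pip install selenium')
    # Mirror the question's copy step so the driver is on PATH.
    shutil.copy('/usr/lib/chromium-browser/chromedriver', '/usr/bin')

Note that installation is a one-time cost per runtime; most of the measured time comes from the scraping loop itself, since 200 scroll iterations with a 15-second sleep already account for roughly 3000 of the reported ~3000-3100 seconds.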

For "python - Scraping YouTube comments with selenium and Google Colab is slow", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63608189/
