
python-3.x - Scraping a live chat with Selenium until the stream ends

Reposted. Author: 行者123. Last updated: 2023-12-03 06:06:12

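I am trying to scrape a YouTube live chat. I need to save all messages, both the old ones and the incoming ones. To do this I use a CSS selector inside an infinite loop, but this produces duplicate entries and skips earlier messages. What is the correct way to do it? The target URL is passed as the first command-line argument.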

from selenium import webdriver
import requests
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import pandas as pd
import os,re,sys

def parseyt():
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--executable_path="chromedriver.exe"')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-extensions')
    chrome_bin = os.getenv('GOOGLE_CHROME_SHIM', None)
    is_local = os.getenv('IS_LOCAL', None)
    chromedriver_path = r'chromedriver.exe'
    service_log_path = "{}/chromedriver.log".format('\.')
    service_args = ['"--verbose", "--log-path=scrape.log"']
    chromedriver_path = 'chromedriver.exe'
    chrome_options.binary_location = r'C:\Program Files (x86)\Chromium\Application\chrome.exe'
    browser = webdriver.Chrome(executable_path=chromedriver_path,chrome_options=chrome_options,service_args=service_args)
    url = sys.argv[1]
    url = url.replace(r'watch?',r'live_chat?')
    print(url)
    browser.get(url)
    browser.implicitly_wait(1)
    while True:
        innerHTML = browser.execute_script("return document.body.innerHTML")
        chats = []
        for chat in browser.find_elements_by_css_selector('yt-live-chat-text-message-renderer'):
            author_name = chat.find_element_by_css_selector("#author-name").get_attribute('innerHTML')
            message = chat.find_element_by_css_selector("#message").get_attribute('innerHTML')
            author_name_encoded = author_name.encode('utf-8').strip()
            message_encoded = message.encode('utf-8').strip()
            print(message+" "+author_name+"\n")
    browser.quit()
    return chats

Best answer

It is better to use the YouTube API instead.
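
As a rough illustration of that suggestion, here is a minimal sketch that polls the `liveChatMessages.list` endpoint of the YouTube Data API v3 instead of driving a browser. It is not from the original answer. The helper names `make_fetcher` and `iter_chat_messages` are made up for this example, and it assumes you already have an API key and the chat's `liveChatId` (which, as far as I know, can be read from `liveStreamingDetails.activeLiveChatId` in a `videos.list` response). Duplicates are avoided by remembering each message `id`, and the loop stops once the response carries `offlineAt` or no `nextPageToken`. Quota limits and HTTP errors are not handled here.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/liveChat/messages"


def make_fetcher(api_key, live_chat_id):
    """Return a function that fetches one page of chat messages (hypothetical helper)."""
    def fetch_page(page_token=None):
        params = {
            "liveChatId": live_chat_id,
            "part": "snippet,authorDetails",
            "key": api_key,
        }
        if page_token:
            params["pageToken"] = page_token
        url = API_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    return fetch_page


def iter_chat_messages(fetch_page, sleep=time.sleep):
    """Yield (author, text) once per message until the chat ends."""
    seen_ids = set()      # message ids already yielded, so nothing is repeated
    page_token = None     # None asks for the first page
    while True:
        page = fetch_page(page_token)
        for item in page.get("items", []):
            if item["id"] in seen_ids:
                continue
            seen_ids.add(item["id"])
            author = item["authorDetails"]["displayName"]
            text = item["snippet"].get("displayMessage", "")
            yield author, text
        # Assumption: offlineAt is present only after the stream has gone offline.
        if "offlineAt" in page or "nextPageToken" not in page:
            break
        page_token = page["nextPageToken"]
        # Wait as long as the API asks before polling again (default 5 s here).
        sleep(page.get("pollingIntervalMillis", 5000) / 1000)
```

With real credentials it would be used roughly as `for author, text in iter_chat_messages(make_fetcher(API_KEY, LIVE_CHAT_ID)): print(author, text)`, where `API_KEY` and `LIVE_CHAT_ID` are placeholders you supply. Passing the fetch function in as a parameter keeps the polling loop independent of the network, so it can be tried out with canned responses first.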

Regarding "python-3.x - Scraping a live chat with Selenium until the stream ends", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63537047/
