
python - How to resolve RuntimeError: can't start new thread using Selenium for web scraping?

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 02:08:42

I built a script to collect products and their details from many websites (~120). It does what I want, but after a while (usually around 70 pages) it gives me a MemoryError and a "RuntimeError: can't start new thread". I tried looking for solutions, such as calling .clear() on my lists, or using sys.getsizeof() to track down the memory leak, but without success. Do you have any idea what the problem might be?

Detailed error message:

Traceback (most recent call last):

File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\pydevd.py", line 1741, in <module>
main()

File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\pydevd.py", line 1735, in main
globals = debugger.run(setup['file'], None, None, is_module)

File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\pydevd.py", line 1135, in run
pydev_imports.execfile(file, globals, locals) # execute the script

File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)

File "C:/EGYÉB/PYTHON/Projects/WebScraping/Selenium_scraping.py", line 63, in <module>
soup1 = BeautifulSoup(driver.page_source, 'html.parser')

File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 679, in page_source
return self.execute(Command.GET_PAGE_SOURCE)['value']

File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 319, in execute
response = self.command_executor.execute(driver_command, params)

File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 374, in execute
return self._request(command_info[0], url, body=data)

File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 423, in _request
data = utils.load_json(data.strip())

File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\utils.py", line 37, in load_json
return json.loads(s)

File "C:\EGYÉB\PYTHON\Python Core\lib\json\__init__.py", line 348, in loads
return _default_decoder.decode(s)

File "C:\EGYÉB\PYTHON\Python Core\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())

File "C:\EGYÉB\PYTHON\Python Core\lib\json\decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
MemoryError

Traceback (most recent call last):
File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 1505, in do_it
t.start()

File "C:\EGYÉB\PYTHON\Python Core\lib\threading.py", line 847, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Code:

from selenium import webdriver
from bs4 import BeautifulSoup
from itertools import count
import pandas as pd
import os
import csv
import time
import re

os.chdir('C:\...')
price = []
prod_name = []
href_link = []
specs = []
item_specs1 = []
item_specs2 = []
url1 = 'https://login.aliexpress.com/'

driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(url1)
time.sleep(3)
driver.switch_to.frame('alibaba-login-box')
driver.find_element_by_id('fm-login-id').send_keys('..........')
driver.find_element_by_id('fm-login-password').send_keys('.........')
driver.find_element_by_id('fm-login-submit').click()
time.sleep(3)
driver.switch_to.default_content()

df = pd.read_csv('........csv', header=0)

for index, row in df.iterrows():
    page_nr = 1
    url = 'https://www.aliexpress.com/store/{}'.format(row['Link']) + '/search/{}'.format(page_nr) + '.html'
    driver.get(url)
    time.sleep(2)
    for page_number in count(start=1):
        time.sleep(5)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for div_b in soup.find_all('div', {'class': 'cost'}):
            price.append(div_b.text + 'Ł')
        for pr_name in soup.find_all('div', {'class': 'detail'}):
            for pr_h in pr_name.find_all('h3'):
                for pr_title in pr_h.find_all('a'):
                    prod_name_t = pr_title.get('title').strip()
                    prod_name_l = pr_title.get('href').strip()
                    href_link.append(prod_name_l + 'Ł')
                    prod_name.append(prod_name_t + 'Ł')
        links = [link.get_attribute('href') for link in driver.find_elements_by_xpath("//div[@id='node-gallery']/div[5]/div/div/ul/li/div[2]/h3/a")]
        for link in links:
            driver.get(link)
            time.sleep(2)
            soup1 = BeautifulSoup(driver.page_source, 'html.parser')
            for item1 in soup1.find_all('span', {'class': 'propery-title'}):
                item_specs1.append(item1.text)
            for item2 in soup1.find_all('span', {'class': 'propery-des'}):
                item_specs2.append(item2.text + 'Ł')
            item_specs = list(zip(item_specs1, item_specs2))
            item_specs_join = ''.join(str(item_specs))
            item_specs_replace = [re.sub('[^a-zA-Z0-9 \n.:Ł]', '', item_specs_join)]
            specs.append(item_specs_replace)
            item_specs1.clear()
            item_specs2.clear()
            soup1.clear()
            driver.back()
        links.clear()
        if len(prod_name) > 500:
            data_csv = list(zip(prod_name, price, href_link, specs))
            with open('........csv', 'a', newline='') as f:
                writer = csv.writer(f)
                for row0 in data_csv:
                    writer.writerow(row0)
            price.clear()
            prod_name.clear()
            href_link.clear()
            specs.clear()
            data_csv.clear()
        try:
            if soup.find_all('span', {'class': 'ui-pagination-next ui-pagination-disabled'}):
                print("Last page reached!")
                break
            else:
                driver.find_element_by_class_name('ui-pagination-next').click()
                time.sleep(1)
        except Exception:
            break

driver.quit()
data_csv = list(zip(prod_name, price, href_link, specs))
print(len(data_csv))
with open('.......csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for row1 in data_csv:
        writer.writerow(row1)

Best Answer

This error message...

RuntimeError: can't start new thread

...implies that the system "can't start new thread" because too many threads are already running within your Python process, and the request to create a new thread was denied due to a resource constraint.

Your main problem stems from the following line:

item_specs_join = ''.join(str(item_specs))

You need to compare the number of threads your program creates against the maximum number of threads your system can create in your environment. There is a limit to how many threads a single process can have active, and most likely your program is starting more threads than your system can handle.
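One way to watch this limit in practice is to track `threading.active_count()` while threads are running. A minimal sketch (the worker function and the event are placeholders, not part of the original script):

```python
import threading

def worker(done):
    # Block until the main thread signals completion.
    done.wait()

done = threading.Event()
threads = [threading.Thread(target=worker, args=(done,)) for _ in range(5)]
for t in threads:
    t.start()

# Counts the main thread plus the five workers (at least 6 while they live).
print(threading.active_count())

done.set()
for t in threads:
    t.join()
```

Logging this count periodically in a long-running scraper makes a thread leak visible long before the OS refuses to create new threads.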

Another factor may be that your program starts threads faster than they run to completion. If you need to start many threads, you should do so in a more controlled manner, for example with a thread pool.

Given that threads run asynchronously, redesigning the program flow would be a better approach: use a thread pool to acquire resources rather than starting a new thread for every request.
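A thread pool caps the number of live threads no matter how many tasks are queued. A minimal sketch with `concurrent.futures.ThreadPoolExecutor` (the `scrape` function here is a hypothetical stand-in for the per-link work the script does with Selenium):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape(link):
    # Placeholder for the real per-link scrape (driver.get + parsing).
    return len(link)

links = ['https://www.aliexpress.com/item/{}.html'.format(i) for i in range(100)]

# max_workers bounds the number of threads alive at once, so the process
# never asks the OS for more threads than it will grant.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scrape, links))

print(len(results))
```

With 100 queued tasks, at most 8 worker threads ever exist; the pool reuses them instead of spawning one thread per request.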

You can find a detailed discussion in error: can't start new thread.

Here you can also find a detailed discussion on Is there any way to kill a Thread?
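Python offers no direct way to kill a thread; the usual pattern from that discussion is cooperative cancellation, where the thread polls a shared flag. A minimal sketch assuming a simple polling worker:

```python
import threading
import time

stop = threading.Event()

def worker():
    # Check the event on every iteration instead of running unconditionally;
    # this is the standard substitute for forcibly killing a thread.
    while not stop.is_set():
        time.sleep(0.01)

t = threading.Thread(target=worker)
t.start()
stop.set()          # ask the thread to finish
t.join(timeout=2)   # wait for it to exit
print(t.is_alive())  # False
```

In a scraper this lets the main loop shut workers down cleanly (e.g. on MemoryError) instead of leaving orphaned threads behind.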

Regarding python - How to resolve RuntimeError: can't start new thread using Selenium for webscraping?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54108786/
