
python - How to safely perform multithreading over a list of URLs to scrape?


I am scraping multiple URLs from a list.

It seems to work, but the outputs are all mixed up and no longer correspond to one another.

Here is the code with threading:

import requests
import pandas
import json
import concurrent.futures

# our list with multiple profiles
profile = ['kaid_329989584305166460858587',
           'kaid_896965538702696832878421',
           'kaid_1016087245179855929335360',
           'kaid_107978685698667673890057',
           'kaid_797178279095652336786972',
           'kaid_1071597544417993409487377',
           'kaid_635504323514339937071278',
           'kaid_415838303653268882671828',
           'kaid_176050803424226087137783']

# two lists of the data that we are going to fill up with each profile
link = []
projects = []

############### SCRAPING PART ###############

# my scraping function that we are going to use for each item in profile
def scraper(kaid):
    link.append('https://www.khanacademy.org/profile/{}'.format(kaid))
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects.append(str(len(data['scratchpads'])))
    except json.decoder.JSONDecodeError:
        projects.append('NA')

# the threading part
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_kaid = {executor.submit(scraper, kaid): kaid for kaid in profile}
    for future in concurrent.futures.as_completed(future_kaid):
        kaid = future_kaid[future]

############### WRITING PART ##############

# Now we write everything into a dataframe object
d = {'link': link, 'projects': projects}
dataframe = pandas.DataFrame(data=d)
print(dataframe)

I was expecting this (the output I get without threading):

                                                link projects
0  https://www.khanacademy.org/profile/kaid_32998...        0
1  https://www.khanacademy.org/profile/kaid_89696...      219
2  https://www.khanacademy.org/profile/kaid_10160...       22
3  https://www.khanacademy.org/profile/kaid_10797...        0
4  https://www.khanacademy.org/profile/kaid_79717...        0
5  https://www.khanacademy.org/profile/kaid_10715...       12
6  https://www.khanacademy.org/profile/kaid_63550...      365
7  https://www.khanacademy.org/profile/kaid_41583...       NA
8  https://www.khanacademy.org/profile/kaid_17605...        2

However, I got this:

                                                link projects
0  https://www.khanacademy.org/profile/kaid_32998...        0
1  https://www.khanacademy.org/profile/kaid_89696...        0
2  https://www.khanacademy.org/profile/kaid_10160...        0
3  https://www.khanacademy.org/profile/kaid_10797...       22
4  https://www.khanacademy.org/profile/kaid_79717...       NA
5  https://www.khanacademy.org/profile/kaid_10715...       12
6  https://www.khanacademy.org/profile/kaid_63550...        2
7  https://www.khanacademy.org/profile/kaid_41583...      219
8  https://www.khanacademy.org/profile/kaid_17605...      365

It looks similar, but we can see that the links no longer correspond correctly to the projects. They have been shuffled.

Apart from the SCRAPING PART, the code without threading is identical:

# first part of the scraping
for kaid in profile:
    link.append('https://www.khanacademy.org/profile/{}'.format(kaid))

# second part of the scraping
for kaid in profile:
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects.append(str(len(data['scratchpads'])))
    except json.decoder.JSONDecodeError:
        projects.append('NA')

What is wrong with my threaded code? Why does everything get mixed up?

Best Answer

Try something like this? Instead of appending to link and then appending to projects only after some time-consuming code has run, append them back to back; that should solve the problem. But I'm thinking of a better approach atm...

d = {'link': [], 'projects': []}

############### SCRAPING PART ###############

# my scraping function that we are going to use for each item in profile
def scraper(kaid):
    link = 'https://www.khanacademy.org/profile/{}'.format(kaid)
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects = str(len(data['scratchpads']))
    except json.decoder.JSONDecodeError:
        projects = 'NA'
    d['link'].append(link)
    d['projects'].append(projects)
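
Even with back-to-back appends, another thread can still sneak in between the two append calls (thread A appends its link, thread B appends its link and its projects, then A appends its projects, and the rows are misaligned again). As a minimal sketch of how to close that window, one could make the pair of appends atomic with a lock; the threading.Lock here is my addition for illustration, not part of the original answer:

import threading
import json
import requests

d = {'link': [], 'projects': []}
append_lock = threading.Lock()  # my addition: serializes the paired appends

def scraper(kaid):
    link = 'https://www.khanacademy.org/profile/{}'.format(kaid)
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        projects = str(len(data.json()['scratchpads']))
    except json.decoder.JSONDecodeError:
        projects = 'NA'
    # hold the lock so no other thread can append between these two lines,
    # keeping d['link'][i] and d['projects'][i] paired
    with append_lock:
        d['link'].append(link)
        d['projects'].append(projects)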

A different solution (sort of, not really)

Or better yet: return link and projects at the end of the thread's execution, then append them... (I'm not sure whether it will work)

def scraper(kaid):
    link = 'https://www.khanacademy.org/profile/{}'.format(kaid)
    data = requests.get('https://www.khanacademy.org/api/internal/user/scratchpads?casing=camel&kaid={}&sort=1&page=0&limit=40000&subject=all&lang=en&_=190425-1456-9243a2c09af3_1556290764747'.format(kaid))
    try:
        data = data.json()
        projects = str(len(data['scratchpads']))
    except json.decoder.JSONDecodeError:
        projects = 'NA'
    return link, projects

# the threading part
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_kaid = {executor.submit(scraper, kaid): kaid for kaid in profile}
    for future in concurrent.futures.as_completed(future_kaid):
        kaid = future_kaid[future]
        data = future.result()
        link.append(data[0])
        projects.append(data[1])

I would say the second is the better solution, because it waits for each thread to finish executing before processing its data into the DataFrame. With the first, there is still a chance of a timing inconsistency (the window is tiny, since we are talking about single ticks of a gigahertz clock, but to rule out the possibility entirely, the second option is better).
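
One caveat worth noting: as_completed yields futures in completion order, so the rows in the second solution stay correctly paired but may not appear in the same order as profile. If the original row order matters, executor.map is a simple alternative, since it returns results in input order. A minimal sketch reusing the scraper from the second solution (my illustration, not part of the original answer):

import concurrent.futures
import pandas

# executor.map preserves input order: the i-th result corresponds to
# profile[i], even though the requests themselves still run concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(scraper, profile))

# each result is the (link, projects) tuple returned by scraper
link, projects = zip(*results)
dataframe = pandas.DataFrame(data={'link': list(link), 'projects': list(projects)})
print(dataframe)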

Regarding "python - How to safely perform multithreading over a list of URLs to scrape?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55941173/
