
python - Updating a database with callbacks in Parallel Python


I am trying to do some text processing on around 200,000 entries in a SQLite database, which I access via SQLAlchemy. I would like to parallelize it (I am looking at Parallel Python), but I am not sure exactly how to go about it.

I want to commit the session each time an entry is processed, so that if I need to stop the script I will not lose the work it has already done. However, when I try to pass the session.commit() call to the callback function, it does not seem to work.

from assignDB import *
from sqlalchemy.orm import sessionmaker
import pp, sys, fuzzy_substring

def matchIng(rawIng, ingreds):
    maxScore = 0
    choice = ""
    for (ingred, parentIng) in ingreds.iteritems():
        score = len(ingred)/(fuzzy_substring(ingred,rawIng)+1)
        if score > maxScore:
            maxScore = score
            choice = ingred
            refIng = parentIng
    return (refIng, choice, maxScore)

def callbackFunc(match, session, inputTuple):
    print inputTuple
    match.refIng_id = inputTuple[0]
    match.refIng_name = inputTuple[1]
    match.matchScore = inputTuple[2]
    session.commit()

# tuple of all parallel python servers to connect with
ppservers = ()
#ppservers = ("10.0.0.1",)

if len(sys.argv) > 1:
    ncpus = int(sys.argv[1])
    # Creates jobserver with ncpus workers
    job_server = pp.Server(ncpus, ppservers=ppservers)
else:
    # Creates jobserver with automatically detected number of workers
    job_server = pp.Server(ppservers=ppservers)

print "Starting pp with", job_server.get_ncpus(), "workers"

ingreds = {}
for synonym, parentIng in session.query(IngSyn.synonym, IngSyn.parentIng):
    ingreds[synonym] = parentIng

jobs = []
for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
    rawIng = match.ingredient
    jobs.append((match,
                 job_server.submit(matchIng, (rawIng, ingreds), (fuzzy_substring,),
                                   callback=callbackFunc, callbackargs=(match, session))))

session is imported from assignDB. I do not get any errors; the database just never gets updated.

Thanks for your help.

Update: here is the code for fuzzy_substring:

def fuzzy_substring(needle, haystack):
    """Calculates the fuzzy match of needle in haystack,
    using a modified version of the Levenshtein distance
    algorithm.
    The function is modified from the levenshtein function
    in the bktree module by Adam Hupp"""
    m, n = len(needle), len(haystack)

    # base cases
    if m == 1:
        return not needle in haystack
    if not n:
        return m

    row1 = [0] * (n+1)
    for i in range(0, m):
        row2 = [i+1]
        for j in range(0, n):
            cost = (needle[i] != haystack[j])

            row2.append(min(row1[j+1]+1,    # deletion
                            row2[j]+1,      # insertion
                            row1[j]+cost))  # substitution
        row1 = row2
    return min(row1)

I got it from here: Fuzzy Substring. In my case, "needle" is one of roughly 8,000 possible choices, while haystack is the raw string I am trying to match. I loop over all the possible "needles" and pick the one with the best score.

Best Answer

Without looking at your specific code, it is fair to say that:

  1. using serverless SQLite and
  2. seeking increased write performance through parallelism

are mutually incompatible desires. Quoting the SQLite FAQ:

… However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine. …

This is true even without any of the gating and ordering that SQLAlchemy performs. It is also not at all clear when, if ever, the Parallel Python jobs complete.
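One way to sidestep both issues is to keep every database write in the main process and wait for the jobs explicitly: submit without a callback, then call each pp job object to block for its result and commit from the single process that owns the session. Below is a minimal sketch along those lines, reusing job_server, ingreds, matchIng, fuzzy_substring, session, and Ingredient exactly as defined in the question; it is one possible restructuring, not the only one.

jobs = []
for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
    # No callback / callbackargs here: workers only compute, they never touch the DB.
    jobs.append((match,
                 job_server.submit(matchIng, (match.ingredient, ingreds),
                                   (fuzzy_substring,))))

for match, job in jobs:
    result = job()              # calling a pp job blocks until it finishes and returns its value
    if result is None:          # pp returns None when the worker raised an exception
        continue
    refIng, choice, score = result
    match.refIng_id = refIng
    match.refIng_name = choice
    match.matchScore = score
    session.commit()            # per-entry commit, issued by the one and only writer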

My advice: get it working correctly first, and then look for optimizations, especially since the pp secret sauce may not buy you much even if it works perfectly.
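As a concrete version of that advice, the baseline is simply the question's own loop run single-threaded, with no pp in the picture at all; a sketch assuming the same matchIng, ingreds, session, and Ingredient from above:

# Single-process baseline: no job server, no callbacks. Each entry is matched
# and committed immediately, so stopping the script never loses finished work.
for match in session.query(Ingredient).filter(Ingredient.refIng_id == None):
    refIng, choice, score = matchIng(match.ingredient, ingreds)
    match.refIng_id = refIng
    match.refIng_name = choice
    match.matchScore = score
    session.commit()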

Added in response to a comment:

If the fuzzy_substring matching is the bottleneck, it appears completely decoupled from the database access, and you should keep that in mind. Without seeing what fuzzy_substring is doing, a good starting assumption is that you can make algorithmic improvements, which may make single-threaded processing computationally feasible. Approximate string matching is a very well studied problem, and choosing the right algorithm is often far better than "throwing more processors at it".
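As one small example of that kind of algorithmic reasoning (not a substitute for picking a proper approximate-matching algorithm): the score in matchIng is len(ingred)/(distance+1), so a needle can never score higher than its own length, and any needle no longer than the best score found so far can be skipped without running fuzzy_substring at all. A hedged sketch of that pruning, with matchIng_pruned being a hypothetical name:

def matchIng_pruned(rawIng, ingreds):
    # Same scoring as matchIng above, but needles whose best possible score
    # (their own length, reached only at distance 0) cannot beat the current
    # best are skipped before the O(m*n) fuzzy_substring is computed.
    maxScore = 0
    choice = ""
    refIng = None
    for (ingred, parentIng) in ingreds.iteritems():
        if len(ingred) <= maxScore:   # upper bound on score; cannot win
            continue
        score = len(ingred)/(fuzzy_substring(ingred, rawIng)+1)
        if score > maxScore:
            maxScore = score
            choice = ingred
            refIng = parentIng
    return (refIng, choice, maxScore)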

Far better in the sense that you end up with cleaner code, do not waste the overhead of splitting up and reassembling the problem, and have a more extensible and debuggable program in the end.

Regarding python - Updating a database with callbacks in Parallel Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11488328/
