python - How can I speed up this file creation process?

Reposted. Author: 行者123. Updated: 2023-12-01 07:57:16

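I'm trying to create a large flat file with fixed-width columns that contains multiple layers, but processing seems to be very slow, most likely because I'm iterating over every row. For context, this is for transmitting insurance policy information.

The hierarchy looks like this: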

-Policy row
--Property on policy
---Coverage on property
--Property on policy
---Coverage on property
--Owner on policy
--Owner on policy
--Owner on policy

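Currently, I load the four record types into separate dataframes, then run a for loop over each type, pulling the child records by the parent record's ID and writing them to the file. I'm hoping for some kind of hierarchical dataframe merge that doesn't force me to scan the file every time I want a record.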

import re
import pandas as pd
import math


def MakeNumeric(instring):
    # Keep only the digits of the input
    output = re.sub('[^0-9]', '', str(instring))
    return str(output)

def Pad(instring, padchar, length, align):
    if instring is None:  # Takes care of NULL values
        instring = ''
    instring = str(instring).upper()
    instring = instring.replace(',', '').replace('\n', '').replace('\r', '')
    instring = instring[:length]  # Truncate to the field width
    if align == 'L':
        output = instring + (padchar * (length - len(instring)))
    elif align == 'R':
        output = (padchar * (length - len(instring))) + instring
    else:
        output = instring
    return output

def FileCreation():
    POLR = pd.read_parquet(r'POLR.parquet')
    PRP1 = pd.read_parquet(r'PRP1.parquet')
    PROP = pd.read_parquet(r'PROP.parquet')
    SUBJ = pd.read_parquet(r'SUBJ.parquet')
    rownum = 1
    totalrownum = 1
    POLRCt = 0
    size = 900000
    # Split the policies into chunks of `size` rows, one output file per chunk
    POLR = [POLR.iloc[i:i + size] for i in range(0, len(POLR), size)]
    FileCt = 0
    # POLR is now a list of chunks, so its length is the number of files
    print('Predicted File Count: ' + str(len(POLR)))
    for df in POLR:
        FileCt += 1
        filename = r'OutputFile.' + Pad(FileCt, '0', 2, 'R')
        with open(filename, 'a+') as outfile:
            # Policy rows
            for i, row in df.iterrows():
                row[0] = Pad(rownum, '0', 9, 'R')
                row[1] = Pad(row[1], ' ', 4, 'L')
                row[2] = Pad(row[2], '0', 5, 'R')
                # I do this for all 50 columns
                outfile.write((','.join(row[:51])).replace(',', '') + '\n')
                rownum += 1
                totalrownum += 1
                # Properties on this policy (scans all of PROP for every policy)
                for i2, row2 in PROP[PROP.ID == row[51]].iterrows():
                    row2[0] = Pad(rownum, '0', 9, 'R')
                    row2[1] = Pad(row2[1], ' ', 4, 'L')
                    row2[2] = Pad(row2[2], '0', 5, 'R')
                    # I do this for all 105 columns
                    outfile.write((','.join(row2[:106])).replace(',', '') + '\n')
                    rownum += 1
                    totalrownum += 1
                    # Coverages on this property (scans all of PRP1 for every property)
                    for i3, row3 in PRP1[(PRP1['id'] == row2['ID']) & (PRP1['VNum'] == row2['vnum'])].iterrows():
                        row3[0] = Pad(rownum, '0', 9, 'R')
                        row3[1] = Pad(row3[1], ' ', 4, 'L')
                        row3[2] = Pad(row3[2], '0', 5, 'R')
                        # I do this for all 72 columns
                        outfile.write((','.join(row3[:73])).replace(',', '') + '\n')
                        rownum += 1
                        totalrownum += 1
                # Owners on this policy (scans all of SUBJ for every policy)
                for i2, row2 in SUBJ[SUBJ['id'] == row['id']].iterrows():
                    row2[0] = Pad(rownum, '0', 9, 'R')
                    row2[1] = Pad(row2[1], ' ', 4, 'L')
                    row2[2] = Pad(row2[2], '0', 5, 'R')
                    # I do this for all 24 columns
                    outfile.write((','.join(row2[:25])).replace(',', '') + '\n')
                    rownum += 1
                    totalrownum += 1
                POLRCt += 1
                print('File {} of {} '.format(str(FileCt), str(len(POLR))) + str((POLRCt - 1) / len(df.index) * 100) + '% Finished\r')
                rownum += 1
        # Reset the per-file counters before starting the next file
        rownum = 1
        POLRCt = 1

Essentially, I'm looking for a script that doesn't take multiple days to create a 27M-record file.
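A note on the repeated scans: each filter such as `PROP[PROP.ID == ...]` in the loops above rescans the whole child table for every parent row. A minimal sketch of one alternative, assuming plain Python dicts stand in for the dataframes, is to group the child rows by parent key once and then look them up directly. The field names (`id`, `vnum`), the three-field record layout, and the `index_by` / `write_hierarchy` helpers are hypothetical and only illustrate the idea; they are not the asker's actual schema.

```python
from collections import defaultdict


def pad(value, padchar, length, align):
    """Fixed-width field: upper-case, strip commas/newlines, truncate, pad."""
    text = "" if value is None else str(value).upper()
    text = text.replace(",", "").replace("\n", "").replace("\r", "")[:length]
    if align == "L":
        return text.ljust(length, padchar)
    if align == "R":
        return text.rjust(length, padchar)
    return text


def index_by(rows, key_fields):
    """Group rows by a tuple of key fields in a single pass over the data."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row)
    return groups


def write_hierarchy(policies, properties, coverages, owners, out):
    """Write policy > property > coverage, then owners, with a running row number."""
    # Build each lookup once instead of filtering a table per parent row.
    props_by_policy = index_by(properties, ["id"])
    covs_by_property = index_by(coverages, ["id", "vnum"])
    owners_by_policy = index_by(owners, ["id"])
    rownum = 1

    def emit(rectype, row):
        # Hypothetical layout: row number, record type, parent id.
        nonlocal rownum
        out.write(pad(rownum, "0", 9, "R") + pad(rectype, " ", 4, "L")
                  + pad(row["id"], "0", 5, "R") + "\n")
        rownum += 1

    for pol in policies:
        emit("POLR", pol)
        for prop in props_by_policy.get((pol["id"],), []):
            emit("PROP", prop)
            for cov in covs_by_property.get((prop["id"], prop["vnum"]), []):
                emit("PRP1", cov)
        for owner in owners_by_policy.get((pol["id"],), []):
            emit("SUBJ", owner)
```

With this shape, each child table is read once to build its lookup, so the cost per policy depends on how many children it has rather than on the size of the child table.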

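Best answer

I ended up populating temp tables for each record level and creating keys, then inserting them into a permanent staging table and assigning clustered indexes to the keys. I then queried the results using OFFSET and FETCH NEXT %d ROWS ONLY to reduce memory usage. I then used the multiprocessing library to split the workload across the CPU's threads. Ultimately, the combination of these reduced the run time to about 20% of what it was when this question was originally posted.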
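The paging and multiprocessing steps described in the answer might look roughly like the following sketch. It assumes SQL Server style `OFFSET ... FETCH NEXT` paging; the table name `staging_policy`, the key column `policy_key`, the page size, and the helpers `page_query`, `page_offsets`, and `format_page` are all hypothetical, and the database call is replaced by in-memory rows so the example runs on its own.

```python
from multiprocessing import Pool

PAGE_SIZE = 3  # tiny on purpose; the answer does not state the real page size


def page_query(offset, page_size):
    """Build one paged query. SQL Server requires ORDER BY for OFFSET/FETCH."""
    return (
        "SELECT * FROM staging_policy ORDER BY policy_key "
        "OFFSET %d ROWS FETCH NEXT %d ROWS ONLY" % (offset, page_size)
    )


def page_offsets(total_rows, page_size):
    """Starting offset of every page needed to cover total_rows."""
    return list(range(0, total_rows, page_size))


def format_page(rows):
    """Worker: turn one page of (key, code) tuples into fixed-width lines."""
    return [str(key).rjust(9, "0") + str(code).upper().ljust(4) for key, code in rows]


if __name__ == "__main__":
    # Stand-in for the rows each paged query would return from the database.
    data = [(i, "polr") for i in range(1, 8)]
    pages = [data[o:o + PAGE_SIZE] for o in page_offsets(len(data), PAGE_SIZE)]
    with Pool() as pool:
        # map() preserves page order, so the lines can be written sequentially.
        for lines in pool.map(format_page, pages):
            print("\n".join(lines))
```

Because the output rows are numbered sequentially, each worker in a real run would also need to know its starting row number; that bookkeeping is left out of this sketch.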

Regarding "python - How can I speed up this file creation process?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55907611/
