Python.exe hangs when running a script that uses pandas and list implementations

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:09:10
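I developed a script that processes a CSV file and generates another result file. The script runs successfully with limited test data, but when I run it against the actual data file, which has 15 columns and 25 million rows, the same script hangs and then closes abruptly. Please see the attached error screenshot.

So, is there a maximum limit on how much pandas can read from a CSV file, or on how many records can be stored in a list?

Please share your ideas for optimizing the script below.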

[ Error Screen Shot ]

Here is the script.

    import csv
    import operator
    import pandas as pd
    import time

    print time.strftime('Script Start Time : ' + "%Y-%m-%d %H:%M:%S")
    sourceFile = raw_input('Enter file name along with path : ')
    searchParam1 = raw_input('Enter first column name containing MSISDN : ').lower()
    searchParam2 = raw_input('Enter second column name containing DATE-TIME : ').lower()
    searchParam3 = raw_input('Enter file seperator (,/#/|/:/;) : ')

    df = pd.read_csv(sourceFile, sep=searchParam3)
    df.columns = df.columns.str.lower()
    df = df.rename(columns={searchParam1 : 'msisdn', searchParam2 : 'datetime'})

    destFileWritter = csv.writer(open(sourceFile + ' - ProcessedFile.csv','wb'))
    destFileWritter.writerow(df.keys().tolist())
    sortedcsvList = df.sort_values(['msisdn','datetime']).values.tolist()

    rows = [row for row in sortedcsvList]
    col_1 = [row[df.columns.get_loc('msisdn')] for row in rows]
    col_2 = [row[df.columns.get_loc('datetime')] for row in rows]

    for i in range(0,len(col_1)-1):
        if col_1[i] == col_1[i+1]:
            #print('Inside If...')
            continue
        else:
            for row in rows:
                if col_1[i] in row:
                    if col_2[i] in row:
                        #print('Inside else...')
                        destFileWritter.writerow(row)
    destFileWritter.writerow(rows[len(rows)-1])
    print('Processing Completed, Kindly Check Response File On Same Location.')
    print time.strftime('Script End Time : ' + "%Y-%m-%d %H:%M:%S")
    raw_input('Press Enter to Exit...')

Updated script:

    import csv
    import operator
    import pandas as pd
    import time
    import sys

    print time.strftime('Script Start Time : ' + "%Y-%m-%d %H:%M:%S")
    sourceFile = raw_input('Enter file name along with path : ')
    searchParam1 = raw_input('Enter first column name containing MSISDN : ').lower()
    searchParam2 = raw_input('Enter second column name containing DATE-TIME : ').lower()
    searchParam3 = raw_input('Enter file seperator (,/#/|/:/;) : ')

    def csvSortingFunc(sourceFile, searchParam1, searchParam2, searchParam3):
        CHUNKSIZE = 10000
        for chunk in pd.read_csv(sourceFile, chunksize=CHUNKSIZE, sep=searchParam3):
            df = chunk
            #df = pd.read_csv(sourceFile, sep=searchParam3)
            df.columns = df.columns.str.lower()
            df = df.rename(columns={searchParam1 : 'msisdn', searchParam2 : 'datetime'})
            """destFileWritter = csv.writer(open(sourceFile + ' - ProcessedFile.csv','wb'))
            destFileWritter.writerow(df.keys().tolist()) """
            resultList = []
            resultList.append(df.keys().tolist())
            sortedcsvList = df.sort_values(['msisdn','datetime']).values.tolist()
            rows = [row for row in sortedcsvList]
            col_1 = [row[df.columns.get_loc('msisdn')] for row in rows]
            col_2 = [row[df.columns.get_loc('datetime')] for row in rows]
            for i in range(0,len(col_1)-1):
                if col_1[i] == col_1[i+1]:
                    #print('Inside If...')
                    continue
                else:
                    for row in rows:
                        if col_1[i] in row:
                            if col_2[i] in row:
                                #print('Inside else...')
                                #destFileWritter.writerow(row)
                                resultList.append(row)
            #destFileWritter.writerow(rows[len(rows)-1])
            resultList.append(rows[len(rows)-1])
            writedf = pd.DataFrame(resultList)
            writedf.to_csv(sourceFile + ' - ProcessedFile.csv', header=False, index=False)
            #print('Processing Completed, Kindly Check Response File On Same Location.')


    csvSortingFunc(sourceFile, searchParam1, searchParam2, searchParam3)
    print('Processing Completed, Kindly Check Response File On Same Location.')
    print time.strftime('Script End Time : ' + "%Y-%m-%d %H:%M:%S")
    raw_input('Press Enter to Exit...')

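Best answer

If you can easily aggregate the results, you should definitely consider using the chunksize parameter of pd.read_csv. It lets you read a large .csv file in chunks of, say, 100000 records at a time.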

    chunk_size = 10000
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        df = chunk
        # your code

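After that, you should append the result of each computation to the final result. Hope it helps; I used this approach when working with files of more than a few million rows.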
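As a rough illustration of "append each chunk's result to the final result", here is a minimal sketch. It assumes the goal is to keep the latest record per MSISDN, which is what the loop in the question appears to do, and that the datetime column sorts correctly as text (for example ISO-style timestamps). The function name `last_row_per_key`, its parameters, and the sample data are made up for this example and written for Python 3.

```python
import io
import pandas as pd


def last_row_per_key(source, key="msisdn", order="datetime", sep=",", chunksize=10000):
    """Return the latest row (by `order`) for each `key`, reading `source` in chunks."""
    partials = []
    for chunk in pd.read_csv(source, sep=sep, chunksize=chunksize):
        chunk.columns = chunk.columns.str.lower()
        # Reduce each chunk to one row per key before keeping it in memory.
        reduced = chunk.sort_values([key, order]).drop_duplicates(subset=key, keep="last")
        partials.append(reduced)
    # A key can span several chunks, so reduce the combined partial results once more.
    combined = pd.concat(partials, ignore_index=True)
    result = combined.sort_values([key, order]).drop_duplicates(subset=key, keep="last")
    return result.reset_index(drop=True)


# Tiny in-memory stand-in for the real file (made-up values).
sample = io.StringIO(
    "MSISDN,DATETIME,VALUE\n"
    "111,2017-01-01 10:00:00,a\n"
    "222,2017-01-01 09:00:00,b\n"
    "111,2017-01-02 08:00:00,c\n"
    "222,2017-01-01 11:00:00,d\n"
    "333,2017-01-03 12:00:00,e\n"
)
result = last_row_per_key(sample, chunksize=2)
print(result)
```

Unlike the nested loop in the question, this keeps exactly one row per MSISDN even when several rows share the same latest timestamp, and it avoids rescanning every row for each key. The final result could then be written with `result.to_csv(...)`.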

Follow-up:

    i = 0
    for chunk in pd.read_csv(sourceFile, chunksize=10):
        print('chunk_no', i)
        i += 1

Can you run these few lines? Do they print out some numbers?
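On the question of a maximum limit: as far as I know, neither pandas nor a Python list has a fixed row limit, and the practical ceiling is the memory available to the process. The sketch below is one hedged way to gauge that, by measuring a small sample and extrapolating to the 25 million rows mentioned in the question. The helper name `estimate_bytes_per_row` and the demo data are assumptions for illustration (Python 3), and the figure is only a rough estimate.

```python
import io
import pandas as pd


def estimate_bytes_per_row(source, sep=",", nrows=1000):
    """Average in-memory size of one row, measured on the first `nrows` rows."""
    sample = pd.read_csv(source, sep=sep, nrows=nrows)
    if len(sample) == 0:
        return 0.0
    # deep=True also counts the Python string objects held in object columns.
    return float(sample.memory_usage(deep=True).sum()) / len(sample)


# Made-up stand-in data; point `source` at the real file instead.
demo = io.StringIO("msisdn,datetime,value\n" + "111,2017-01-01 10:00:00,a\n" * 50)
per_row = estimate_bytes_per_row(demo)
total_rows = 25000000  # row count mentioned in the question
print("approx. GiB for the full DataFrame: %.2f" % (per_row * total_rows / 2.0 ** 30))
```

Note that the script in the question also converts the whole DataFrame to Python lists with `.values.tolist()`, which needs additional memory on top of this estimate.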

Regarding "Python.exe hangs when running a script that uses pandas and list implementations", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43541101/
