
python - Parsing a CSV file with commas in Pandas

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:50:10

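I need to create a pandas.DataFrame from a csv file. For that I am using the method pandas.read_csv(...). The problem with this file is that one or more columns contain commas in their values (I don't control the file format). I have been trying to implement the solution from another question, but I get the following error: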

pandas.errors.EmptyDataError: No columns to parse from file 

For some reason, after implementing this solution, the csv file I was trying to fix is blank.

Here is the code I am using:

# fix csv file
with open("/Users/username/works/test.csv", 'rb') as f,\
     open("/Users/username/works/test.csv", 'wb') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.split(',', 4)
        writer.writerow(row)
# Manipulate csv file
data = pd.read_csv(os.path.expanduser\
    ("/Users/username/works/test.csv"), error_bad_lines=False)

Any ideas?

Data overview:

Id0 Id 1 Id 2 Country Company Title Email
23 123 456 AR name cargador email@email.com
24 123 456 AR name Executive assistant email@email.com
25 123 456 AR name Asistente Administrativo email@email.com
26 123 456 AR name Atención al cliente vía telefónica vía online email@email.com
39 123 456 AR name Asesor de ventas email@email.com
40 123 456 AR name inc. International company representative email@email.com
41 123 456 AR name Vendedor de campo email@email.com
42 123 456 AR name PUBLICIDAD ATENCIÓN AL CLIENTE email@email.com
43 123 456 AR name Asistente de Marketing email@email.com
44 123 456 AR name SOLDADOR email@email.com
217 123 456 AR name Se requiere vendedores Loja Quevedo Guayas) email@email.com
218 123 456 AR name Ing. Civil recién graduado Yaruquí email@email.com
219 123 456 AR name ayudantes enfermeria email@email.com
220 123 456 AR name Trip Leader for International Youth Exchange email@email.com
221 123 456 AR name COUNTRY MANAGER / DIRECTOR COMERCIAL email@email.com
250 123 456 AR name Ayudante de Pasteleria email@email.com Asesor email@email.com email@email.com

CSV before parsing:

#,Id 1,Id 2,Country,Company,Title,Email,,,,
23,123,456,AR,name,cargador,email@email.com,,,,
24,123,456,AR,name,Executive assistant,email@email.com,,,,
25,123,456,AR,name,Asistente Administrativo,email@email.com,,,,
26,123,456,AR,name,Atención al cliente vía telefónica , vía online,email@email.com,,,
39,123,456,AR,name,Asesor de ventas,email@email.com,,,,
40,123,456,AR,name, inc.,International company representative,email@email.com,,,
41,123,456,AR,name,Vendedor de campo,email@email.com,,,,
42,123,456,AR,name,PUBLICIDAD, ATENCIÓN AL CLIENTE,email@email.com,,,
43,123,456,AR,name,Asistente de Marketing,email@email.com,,,,
44,123,456,AR,name,SOLDADOR,email@email.com,,,,
217,123,456,AR,name,Se requiere vendedores,, Loja , Quevedo, Guayas),email@email.com
218,123,456,AR,name,Ing. Civil recién graduado, Yaruquí,email@email.com,,,
219,123,456,AR,name,ayudantes enfermeria,email@email.com,,,,
220,123,456,AR,name,Trip Leader for International Youth Exchange,email@email.com,,,,
221,123,456,AR,name,COUNTRY MANAGER / DIRECTOR COMERCIAL,email@email.com,,,,
250,123,456,AR,name,Ayudante de Pasteleria,email@email.com, Asesor,email@email.com,email@email.com,
251,123,456,AR,name,Ejecutiva de Ventas,email@email.com,,,,

Best answer

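If you can assume that, for Company, any comma is followed by a space, and that all remaining stray commas are in the column just before the e-mail address, then a small parser can be written to handle it.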
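To make those two assumptions concrete, here is a minimal sketch of what the standard `csv` module does with two raw lines copied from the sample above. It only shows where the extra fields end up after a plain comma split; it is not part of the parser itself.

```python
import csv
import io

# Two raw lines copied from the sample CSV in the question.
raw = (
    "26,123,456,AR,name,Atención al cliente vía telefónica , vía online,email@email.com,,,\n"
    "40,123,456,AR,name, inc.,International company representative,email@email.com,,,\n"
)

rows = list(csv.reader(io.StringIO(raw)))

# Per the header, index 4 is Company, index 5 is Title and index 6 is Email.
for row in rows:
    print(row[4:8])
```

In the first row a leftover piece of the Title sits where the e-mail address should be. In the second row a leftover piece of the Company, recognizable by its leading space, sits where the Title should be. Those are the two cases the parser has to undo.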

Code:

import csv
import re

VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')

def read_my_csv(file_handle):
    # build csv reader
    reader = csv.reader(file_handle)

    # get the header, and find the e-mail and title columns
    header = next(reader)
    email_column = header.index('Email')
    title_column = header.index('Title')

    # yield the header up to the e-mail column
    yield header[:email_column+1]

    # for each row, rebuild the columns
    for row in reader:

        # put the Company column back together
        while row[title_column].startswith(' '):
            row[title_column-1] += ',' + row[title_column]
            del row[title_column]

        # put the Title column back together
        while not VALID_EMAIL.match(row[email_column]):
            row[email_column-1] += ',' + row[email_column]
            del row[email_column]
        yield row[:email_column+1]
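The `VALID_EMAIL` pattern above is intentionally loose, and the second loop relies on it only to tell an e-mail address apart from a Title fragment. As a rough sketch, this checks it against a few fields from the sample; the last string is a made-up value, not from the data, included to show that `match` anchors only at the start of the field.

```python
import re

VALID_EMAIL = re.compile(r'[^@]+@[^@]+\.[^@]+')

candidates = [
    'email@email.com',      # an e-mail field from the sample
    '',                     # empty field produced by a doubled comma
    ' vía online',          # Title fragment from row 26
    ' Guayas)',             # Title fragment from row 217
    'a@b.c trailing text',  # hypothetical value, not in the sample
]

# True means the parser would treat the field as the e-mail column.
results = {text: bool(VALID_EMAIL.match(text)) for text in candidates}
for text, accepted in results.items():
    print(repr(text), accepted)
```

Empty fields and fragments without an `@` are rejected, so they get merged back into Title. Anything that merely starts like an address is accepted, which is fine for this data but would need a stricter pattern if a Title could contain an `@`.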

Test code:

import pandas as pd

# newline='' is the mode the csv module expects for its input files
with open("test.csv", newline='') as f:
    generator = read_my_csv(f)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)

print(df)

Result:

      #  Id 1  Id 2  Country     Company  \
0    23   123   456       AR        name
1    24   123   456       AR        name
2    25   123   456       AR        name
3    26   123   456       AR        name
4    39   123   456       AR        name
5    40   123   456       AR  name, inc.
6    41   123   456       AR        name
7    42   123   456       AR        name
8    43   123   456       AR        name
9    44   123   456       AR        name
10  217   123   456       AR        name
11  218   123   456       AR        name
12  219   123   456       AR        name
13  220   123   456       AR        name
14  221   123   456       AR        name
15  250   123   456       AR        name
16  251   123   456       AR        name

                                               Title            Email
0                                           cargador  email@email.com
1                                Executive assistant  email@email.com
2                           Asistente Administrativo  email@email.com
3    Atención al cliente vía telefónica , vía online  email@email.com
4                                   Asesor de ventas  email@email.com
5               International company representative  email@email.com
6                                  Vendedor de campo  email@email.com
7                    PUBLICIDAD, ATENCIÓN AL CLIENTE  email@email.com
8                             Asistente de Marketing  email@email.com
9                                           SOLDADOR  email@email.com
10  Se requiere vendedores,, Loja , Quevedo, Guayas)  email@email.com
11               Ing. Civil recién graduado, Yaruquí  email@email.com
12                              ayudantes enfermeria  email@email.com
13      Trip Leader for International Youth Exchange  email@email.com
14              COUNTRY MANAGER / DIRECTOR COMERCIAL  email@email.com
15                            Ayudante de Pasteleria  email@email.com
16                               Ejecutiva de Ventas  email@email.com
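On the `EmptyDataError` from the question: opening a path in a write mode truncates that file immediately, so a script that opens the same path for reading and for writing at once has nothing left to read, and pandas then finds an empty file. A minimal sketch of the usual workaround is to write the repaired rows to a second file. The file names and the tiny sample below are hypothetical, and the repair step itself is left out.

```python
import csv
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "test.csv")        # hypothetical input file
dst = os.path.join(workdir, "test_fixed.csv")  # hypothetical output file

# Create a small input file to work with.
with open(src, "w", newline="") as f:
    f.write("#,Title,Email\n23,cargador,email@email.com\n")

# Read from one path and write to a different one, so the input survives.
with open(src, newline="") as f, open(dst, "w", newline="") as g:
    writer = csv.writer(g)
    for row in csv.reader(f):
        writer.writerow(row)  # a real script would repair `row` first

sizes_after_copy = (os.path.getsize(src), os.path.getsize(dst))

# Opening the input path itself for writing empties it straight away.
with open(src, "w"):
    pass
size_after_truncate = os.path.getsize(src)

print(sizes_after_copy, size_after_truncate)
```

With the generator-based parser above this intermediate file is not needed at all, since the rows go straight from the original file into the DataFrame.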

Regarding python - Parsing a CSV file with commas in Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44122091/
