
python - Trying to read a csv file in Python and create a separate table

Reposted. Author: 行者123. Updated: 2023-12-01 07:55:50

import numpy as np
import pandas as pd

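I am trying to read a csv file with pandas. This is data I scraped. Note that each line starts and ends with square brackets [] (maybe it is a list). How should I write the code so that the whole data set ends up in tabular form? I don't know how to separate the brackets from the data.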

[]
['Auburn University (Online Master of Business Administration with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' /Campus ', ' Raymond J. Harbert College of Business ']
['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']
['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']
['The University of Alabama (MS in Operations Management - Decision Analytics Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (M.S. degree in Applied Statistics, Data Mining Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (MBA with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Culverhouse College of Commerce ']
['Arkansas Tech University (Business Data Analytics)', ' Bachelors ', ' US', ' AR', ' /Campus ', ' Business ']
['University of Arkansas (Graduate Certificate in Business Analytics)', ' Certificate ', ' US', ' AR', ' Online/ ', ' Sam M. Walton College of Business ']
['University of Arkansas (Master of Information Systems with Business Analytics Concentration)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of Business ']
['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of

How should I read this CSV file? I want all of the data in tabular form. Please help.

Best answer

Your problem is exactly what the error message is telling you. The error occurs while parsing this line:

['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']

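The code ignores the quote characters and splits the line into fields, breaking wherever it finds the delimiter ",". You want this to be a single field: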

The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)

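But this "field" contains an instance of the delimiter ",", and the CSV parser honors it because it ignores the fact that the value is inside quotes. So this piece of data is split into two fields: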

['The University of Alabama (Master of Science in Marketing

Specialization in Marketing Analytics)'

This causes the line to be split into 7 fields, whereas your code expects only 6.
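To make the field count concrete, here is a small sketch that feeds the problem line to Python's standard `csv` module with its default settings. It is only an illustration: the question does not show the actual parsing code, so using `csv.reader` here is an assumption.

```python
import csv

line = ("['The University of Alabama (Master of Science in Marketing, "
        "Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', "
        "' Online/ ', ' Manderson Graduate School of Business ']")

# Default csv.reader settings: delimiter=',' and quotechar='"'.
# The single quotes in the data are treated as ordinary characters,
# so every comma splits the line, including the one inside the program name.
fields = next(csv.reader([line]))

print(len(fields))
for field in fields:
    print(repr(field))
```

The first two printed fields are the two halves of the program name, and the brackets and single quotes remain part of the field values.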

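Note also that your items will contain the quote characters, which is probably not what you expect either, and that the square brackets do not belong there. In short, this is not a well-formed CSV file.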
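Since each line looks like a Python list literal rather than a CSV record, one possible alternative (not the approach taken in this answer) is to parse each line with `ast.literal_eval` from the standard library. The sketch below assumes that no field contains an unescaped single quote; the function name `parse_list_lines` and the sample lines are illustrative only.

```python
import ast

def parse_list_lines(lines):
    """Parse lines that look like Python list literals into lists of stripped strings."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            value = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            # Malformed line, for example a truncated last row: skip it.
            continue
        # Keep only non-empty lists; the leading "[]" line is dropped.
        if isinstance(value, list) and value:
            rows.append([str(item).strip() for item in value])
    return rows

sample = [
    "[]",
    "['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']",
    "['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of",
]
rows = parse_list_lines(sample)
print(rows)
```

In this sketch the empty `[]` line and the truncated last line are both skipped, so only complete rows are returned.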

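Update: I'm a regex fan. I use regular expressions for everything and couldn't pass up a challenge like this. Here is a regex-based solution that reads exactly what you want from this data. If you want it to recognize the last line of the data, add "']" to the end of that line.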

import regex
from pprint import pprint

def parse_file(file):
    # One bracketed line: an optional first quoted field (group 2), then zero
    # or more comma-separated quoted fields (group 4 matches once per field).
    linepat = regex.compile(r"\[\s*('([^']*)')?(\s*,\s*'([^']*)')*\s*\]")
    with open(file) as f:
        r = []
        while True:
            line = f.readline()
            if not line:
                break
            line = line.strip()
            if len(line) == 0:
                continue
            m = linepat.match(line)
            # captures(4) returns every match of the repeated group, not just the last.
            if m and m.captures(4):
                fields = [m.group(2)] + [s.strip() for s in m.captures(4)]
                r.append(fields)
    return r

def main():
    r = parse_file("/tmp/blah.csv")
    pprint(r)

main()

Result:

[['Auburn University (Online Master of Business Administration with '
  'concentration in Business Analytics)',
  'Masters',
  'US',
  'AL',
  '/Campus',
  'Raymond J. Harbert College of Business'],
 ...
 ['University of Arkansas (Professional Master of Information Systems)',
  'Masters',
  'US',
  'AR',
  '/Campus',
  'Sam M. Walton College of']]
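Once rows like these have been parsed, they can be written back out as well-formed CSV, in which a field containing a comma is quoted properly. The following is a minimal sketch using the standard `csv` module. The column names in `COLUMNS` and the helper `rows_to_csv_text` are hypothetical, since the original data has no header row.

```python
import csv
import io

# Hypothetical column names; the scraped data does not define any.
COLUMNS = ["program", "degree", "country", "state", "delivery", "school"]

def rows_to_csv_text(rows, columns=COLUMNS):
    """Serialize parsed rows as well-formed CSV text with a header row."""
    buffer = io.StringIO()
    # The default quoting (csv.QUOTE_MINIMAL) quotes only fields that need it,
    # such as a program name containing a comma.
    writer = csv.writer(buffer)
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()

rows = [
    ["The University of Alabama (Master of Science in Marketing, "
     "Specialization in Marketing Analytics)",
     "Masters", "US", "AL", "Online/", "Manderson Graduate School of Business"],
]
csv_text = rows_to_csv_text(rows)
print(csv_text)

# Round trip: reading the text back yields the original six fields.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Text produced this way should load with `pandas.read_csv(io.StringIO(csv_text))`. Alternatively, the rows can be passed straight to `pandas.DataFrame(rows, columns=COLUMNS)` without going through CSV at all.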

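Note that this does not use the built-in "re" module. That module does not handle repeated groups (it keeps only the last match of a repeated group), which is what this approach needs. Also note that this does not involve Pandas. I know nothing about that module, but I assume that feeding the clean, parsed data from this code to Pandas is trivial, if that is where you really want it to end up.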
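For readers who prefer to stay with the standard library, a different technique is possible: instead of matching the whole line with a repeated group, extract each quoted field with `re.findall`. This is a sketch of that alternative, not the method used above. It assumes that fields never contain a single quote, and unlike the full-line pattern it does not validate the brackets. The names `FIELD` and `split_fields` are illustrative.

```python
import re

# Matches one single-quoted field and captures its contents.
FIELD = re.compile(r"'([^']*)'")

def split_fields(line):
    """Return the stripped contents of every single-quoted field in the line."""
    return [field.strip() for field in FIELD.findall(line)]

line = "['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']"
fields = split_fields(line)
print(fields)

# For comparison: with a repeated group, re keeps only the final repetition.
m = re.match(r"(?:'([^']*)',?\s*)+", "'a', 'b', 'c'")
last_only = m.group(1)
print(last_only)
```

Because this version never checks the overall shape of the line, a truncated row would silently yield only its complete fields, so the stricter full-line pattern may still be preferable when malformed input should be rejected.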

Regarding python - Trying to read a csv file in Python and create a separate table, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55986684/
