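I have large CSV files, each more than 10 MB in size, and there are 50+ such files. These inputs have more than 25 columns and more than 50K rows.
All of them have the same headers, and I am trying to merge them into one CSV with the headers mentioned only once.
Option: One. Code: works for small CSVs (25+ columns, but file sizes in KB).
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
But the code above does not work for the larger files and gives an error.
Error: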
Traceback (most recent call last):
File "merge_large.py", line 6, in <module>
all_files = glob.glob("*.csv", encoding='utf8', engine='python')
TypeError: glob() got an unexpected keyword argument 'encoding'
lakshmi@lakshmi-HP-15-Notebook-PC:~/Desktop/Twitter_Lat_lon/nasik_rain/rain_2$ python merge_large.py
Traceback (most recent call last):
File "merge_large.py", line 10, in <module>
df = pd.read_csv(file_,index_col=None, header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
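The "possible malformed input file" message suggests that some rows may not have the same number of fields as the header. A rough way to check this before involving pandas at all is sketched below, using only the standard library. The function name `find_ragged_rows` is hypothetical, and the sketch assumes Python 3, a comma delimiter, and UTF-8 input:

```python
import csv

def find_ragged_rows(path, encoding="utf-8"):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    bad = []
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return bad  # empty file, nothing to check
        for row in reader:
            # Skip completely blank lines; flag rows with too few or too many fields.
            if row and len(row) != len(header):
                # reader.line_num is the physical line on which the row ended
                bad.append((reader.line_num, len(row)))
    return bad
```

Running this over each input file and printing the non-empty results shows which files and lines to inspect by hand.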
Code: 25+ columns, but file sizes of more than 10 MB.
Option: Four
import pandas as pd
import glob
interesting_files = glob.glob("*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
Error:
Traceback (most recent call last):
File "merge_large.py", line 6, in <module>
allFiles = glob.glob("*.csv", sep=None)
TypeError: glob() got an unexpected keyword argument 'sep'
I have searched extensively, but I am not able to find a solution for concatenating large CSV files that share the same headers into one file.
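Both TypeErrors in the tracebacks above come from passing parser options (`encoding`, `engine`, `sep`) to `glob.glob`, which only finds file paths; those options are parameters of `pd.read_csv`. A minimal sketch of where they would go instead, assuming Python 3 and a reasonably recent pandas (the function name `merge_csvs` is hypothetical):

```python
import glob
import pandas as pd

def merge_csvs(pattern, output_path):
    """Read every CSV matching `pattern` and write one merged file."""
    frames = []
    for filename in sorted(glob.glob(pattern)):  # glob only returns matching paths
        # Parser options are arguments of read_csv, not of glob.glob.
        frames.append(
            pd.read_csv(filename, sep=None, engine="python", encoding="utf8")
        )
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(output_path, index=False)  # the header row is written once
    return merged
```

This still loads every file into memory at once, so it only addresses the TypeError, not the size problem. It is also safest to write the output somewhere the pattern does not match, so that a second run does not pick up the merged file as an input.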
Edit:
Code:
import dask.dataframe as dd
ddf = dd.read_csv('*.csv')
ddf.to_csv('master.csv',index=False)
Error:
Traceback (most recent call last):
File "merge_csv_dask.py", line 5, in <module>
ddf.to_csv('master.csv',index=False)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 792, in to_csv
return to_csv(self, filename, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io.py", line 762, in to_csv
compute(*values)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 179, in compute
results = get(dsk, keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 58, in get
**kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 481, in get_async
raise(remote_exception(res, tb))
dask.async.ValueError: could not convert string to float: {u'type': u'Point', u'coordinates': [4.34279, 50.8443]}
Traceback
---------
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 49, in bytes_read_csv
coerce_dtypes(df, dtypes)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/csv.py", line 73, in coerce_dtypes
df[c] = df[c].astype(dtypes[c])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2950, in astype
raise_on_error=raise_on_error, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2938, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 2890, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 434, in astype
values=values, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/internals.py", line 477, in _astype
values = com._astype_nansafe(values.ravel(), dtype, copy=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/common.py", line 1920, in _astype_nansafe
return arr.astype(dtype)
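Best answer
If I understand your problem, you have large CSV files with the same structure that you want to merge into one big CSV file.
My suggestion is to use dask from Continuum Analytics to handle this job. You can merge your files, and you can also perform out-of-core computations and data analysis much as you would in pandas.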
### make sure you include the [complete] tag
pip install dask[complete]
First, check your versions of dask and pandas. For me, dask = 0.11.0 and pandas = 0.18.1
import dask
import pandas as pd
print (dask.__version__)
print (pd.__version__)
Here's the code to read in all your CSVs. I had no errors using your Dropbox example data.
import glob
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
filenames = glob.glob('/Users/linwood/Downloads/stack_bundle/rio*.csv')
'''
The key to getting around the CParserError was using sep=None
Came from this post
http://stackoverflow.com/questions/37505577/cparsererror-error-tokenizing-data
'''
# custom reader function; sep=None makes pandas detect the delimiter,
# which requires the Python parsing engine
def reader(filename):
    return pd.read_csv(filename, sep=None, engine='python')
# build list of delayed pandas csv reads; then read in as dask dataframe
dfs = [delayed(reader)(fn) for fn in filenames]
df = dd.from_delayed(dfs)
# print the count of values in each column; perfect data would have the same count
# you have dirty data as the counts will show
print(df.count().compute())
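According to the pandas documentation, `sep=None` makes the Python engine detect the separator with the standard library's `csv.Sniffer`. The sketch below shows that detection step on its own, on made-up sample data; the helper name `detect_delimiter` and the list of candidate delimiters are assumptions for illustration:

```python
import csv

def detect_delimiter(sample_text, candidates=",;\t|"):
    """Guess the delimiter of a CSV sample with csv.Sniffer."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Made-up sample: a semicolon-delimited header and two rows.
sample = "id;text;favorites\n1;hello;3\n2;world;7\n"
print(detect_delimiter(sample))
```

If the sniffer cannot settle on a delimiter it raises `csv.Error`, which can be a useful hint that a file is not delimited the way the others are.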
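The next step is doing some pandas-like analysis. Here is some code where I first "clean" the data in the 'tweetFavoriteCt' column. Not all of the values are integers, so I replace the strings with 0 and convert everything else to an integer. Once the integer conversion is done, I show a simple analysis that filters the entire dataframe to include only the rows where the favorite count is greater than 3.
# function to convert numbers to integer and replace strings with 0; sample analytics in dask dataframe
# you can come up with your own..this is just for an example
def conversion(value):
    try:
        return int(value)
    except (ValueError, TypeError, OverflowError):
        return 0
# apply the function to the column, create a new column of cleaned data
# meta names the output column and declares its dtype (integers here)
clean = df['tweetFavoriteCt'].apply(conversion, meta=('cleanedFavoriteCt', 'int64'))
# set new column equal to our cleaning code above; your data is dirty :-(
df['cleanedFavoriteCt'] = clean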
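For data that already fits in memory, a similar clean-up can be written with vectorized pandas calls instead of a row-wise function. This is an alternative sketch rather than the approach used above, and the helper name `clean_counts` is hypothetical. Its behavior differs slightly from `conversion`: a float-like string such as "3.7" is truncated to 3 here, whereas `int("3.7")` would fail and give 0.

```python
import pandas as pd

def clean_counts(series):
    """Coerce a column to integers, mapping unparseable or missing values to 0."""
    numeric = pd.to_numeric(series, errors="coerce")  # bad values become NaN
    return numeric.fillna(0).astype("int64")

# Made-up sample values, including a stray string and a missing entry.
raw = pd.Series(["3", "7000", "not a number", None, "12"])
print(clean_counts(raw).tolist())
```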
This last bit of code shows the dask analysis, how to load the merged file into pandas, and how to write the merged file to disk. Be warned: if you have tons of CSVs, the .compute() call below loads the merged CSV into memory.
# retrieve the 50 tweets with the highest favorite count
print(df.nlargest(50,['cleanedFavoriteCt']).compute())
# only show me the tweets that have been favorited more than 3 times
# TweetID 763525237166268416, is VERRRRY popular....7000+ favorites
print(df[df.cleanedFavoriteCt.apply(lambda x: x > 3, meta=('cleanedFavoriteCt', bool))].compute())
'''
This is the final step. The .compute() code below turns the
dask dataframe into a single pandas dataframe with all your
files merged. If you don't need to write the merged file to
disk, I'd skip this step and do all the analysis in
dask. Get a subset of the data you want and save that.
'''
df = df.reset_index().compute()
df.to_csv('./test.csv')
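If the merged data is too large for the final `.compute()` to hold in memory, one alternative is to let pandas stream each file in chunks and append them to the output as it goes. The sketch below is a swapped-in technique, not part of the dask approach above; it assumes Python 3, that every file has the same columns in the same order, and the function name `merge_in_chunks` is hypothetical:

```python
import glob
import pandas as pd

def merge_in_chunks(pattern, output_path, chunksize=10_000):
    """Append CSVs matching `pattern` to `output_path` one chunk at a time."""
    first = True
    rows = 0
    # The file list is built before anything is written.
    for filename in sorted(glob.glob(pattern)):
        for chunk in pd.read_csv(filename, chunksize=chunksize):
            chunk.to_csv(
                output_path,
                mode="w" if first else "a",  # overwrite once, then append
                header=first,                # write the header only once
                index=False,
            )
            first = False
            rows += len(chunk)
    return rows
```

Only one chunk is in memory at a time, so the peak memory use depends on `chunksize` rather than on the total size of the inputs. Write the output to a path that the pattern does not match.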
Now, if you want to switch to pandas for the merged CSV file:
import pandas as pd
dff = pd.read_csv('./test.csv')
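Let me know if this works.
Stop here.
The first step is making sure you have dask installed. There are install instructions for dask in the documentation page, but the pip command shown earlier in this answer should work.
After dask is installed, it's easy to read in the files.
Some housekeeping first. Assume we have a directory of CSVs whose filenames are my18.csv, my19.csv, my20.csv, etc. Name standardization and a single directory location are key. This works if you put your CSV files in one directory and serialize the names in some way.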
Steps:
1. Read all the CSV files, using a wildcard, into a single dask.dataframe object. If you like, you can do pandas-like operations immediately after this step.
import dask.dataframe as dd
ddf = dd.read_csv('./daskTest/my*.csv')
ddf.describe().compute()
2. Write the merged dataframe to disk as master.csv.
ddf.to_csv('./daskTest/master.csv',index=False)
3. Read master.csv into a dask.dataframe object for computations. This can also be done right after step one above; dask can perform pandas-like operations on the staged files. This is one way to do "big data" in Python.
# reads in the merged file as one BIG out-of-core dataframe; can perform functions like pandas
newddf = dd.read_csv('./daskTest/master.csv')
# check the length; this is now the length of all merged files. in this example, 50,000 rows times 11 = 550000 rows.
len(newddf)
# perform pandas-like summary stats on entire dataframe
newddf.describe().compute()
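Hope this helps answer your question. In three steps, you read in all your files, merge them into a single dataframe, and write that massive dataframe to disk with only one header and all of your rows.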
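If the only goal is a single file with one header, and no analysis is needed along the way, the same result can be sketched without pandas or dask by copying lines directly. This is a different technique from the dask steps above. It assumes Python 3, that every file's header fits on its first line and is identical across files, and the function name `concat_csvs` is hypothetical:

```python
import glob
import os

def concat_csvs(pattern, output_path, encoding="utf-8"):
    """Concatenate CSVs with identical headers, keeping only the first header line."""
    target = os.path.abspath(output_path)
    # Never treat an existing output file as one of the inputs.
    paths = [p for p in sorted(glob.glob(pattern)) if os.path.abspath(p) != target]
    with open(output_path, "w", encoding=encoding, newline="") as out:
        for i, path in enumerate(paths):
            with open(path, encoding=encoding, newline="") as src:
                for lineno, line in enumerate(src):
                    if lineno == 0 and i > 0:
                        continue  # skip the header of every file after the first
                    # Make sure a missing final newline does not glue two rows together.
                    out.write(line if line.endswith("\n") else line + "\n")
    return len(paths)
```

Because it never parses the rows, this copies malformed lines through unchanged; it merges the files but does not clean them.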
Regarding "python - pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38757713/