
Dask equivalent of pd.to_numeric

Reposted. Author: 行者123. Updated: 2023-12-02 11:20:49

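I am trying to read multiple CSV files, each around 15 GB, with dask read_csv. While reading, dask infers a particular column as float, but the column also contains some string values, so later operations fail with an error saying a string could not be converted to float. To work around this, I read all columns as strings with the dtype=str parameter. Now I want to convert that column to numeric with errors='coerce', so that records containing strings become NaN and the rest are converted to float correctly. Can you suggest how to achieve this with dask?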

Already tried: astype conversion

import dask.dataframe as dd
df = dd.read_csv("./*.csv", encoding='utf8',
                 assume_missing=True,
                 usecols=col_names.values.tolist(),
                 dtype=str)
df["mycol"] = df["mycol"].astype(float)
search_df = df.query('mycol >0').compute()
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

+-----------------------------------+--------+----------+
| Column | Found | Expected |
+-----------------------------------+--------+----------+
| mycol | object | float64 |
+-----------------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- mycol
ValueError("could not convert string to float: 'cliqz.com/tracking'")
#Reproducible example
import dask.dataframe as dd
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True)
df.dtypes  # the count column appears as float, but it contains a couple of dirty string values
search_df = df.query('count >0').compute()  # this line raises the type conversion error
#Edit with one possible solution, but is this optimal while using dask?
import dask.dataframe as dd
import pandas as pd
to_n = lambda x: pd.to_numeric(x, errors="coerce")
df = dd.read_csv("mydata.csv", encoding='utf8',
                 assume_missing=True,
                 converters={"count": to_n})
df.dtypes
search_df = df.query('count >0').compute()

Best answer

I had a similar problem and solved it with .where (in the snippet, ddf is dask.dataframe and np is numpy).

p = ddf.from_pandas(pandas.Series(["1", "2", np.nan, "3", "4"]), 1)
p.where(~p.isna(), 999).astype("u4")

Or perhaps replace the second line with:

p.where(p.str.isnumeric(), 999).astype("u4")

In my case, the DataFrame (or Series) was the result of other operations, so I could not apply the conversion directly in read_csv.

Regarding a Dask equivalent of pd.to_numeric, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56771265/
