
python - Building a dataset from rows to columns with pandas

Reposted · Author: 行者123 · Updated: 2023-12-01 02:40:13

I have a dataframe like the one below. It contains many feature columns, but only 3 are shown here:

productid | feature1   | value1   | feature2     | value2  | feature3     | value3
100001    | weight     | 130g     |              |         | price        | $140.50
100002    | weight     | 200g     | pieces       | 12 pcs  | dimensions   | 150X75cm
100003    | dimensions | 70X30cm  | price        | $22.90  |              |
100004    | price      | $12.90   | manufacturer | ABC     | calories     | 556Kcal
100005    | calories   | 1320Kcal | dimensions   | 20X20cm | manufacturer | XYZ

I want to restructure it with pandas as follows:

productid   weight   dimensions   price     calories   no. of pieces   manufacturer
100001      130g                  $140.50
100002      200g     150X75cm                          12 pcs
100003               70X30cm      $22.90
100004                            $12.90    556Kcal                    ABC
100005               20X20cm                1320Kcal                   XYZ

I looked into various pandas methods such as reset_index, stack, etc., but could not get it to produce the desired shape.

Best Answer

You are looking for code to unpack the dataframe. The simplest approach (which handles many features and possibly duplicate product ids) is:

import pandas as pd
import numpy as np

def expand(frame):
    df = pd.DataFrame()
    for row in frame.iterrows():
        data = row[1]  # the row as a Series; position 0 is productid
        # Walk the alternating feature/value cells (positions 1, 3, 5, ... and 2, 4, 6, ...).
        for feature_name, feature_value in zip(data[1::2], data[2::2]):
            if feature_name:
                df.loc[data.productid, feature_name] = feature_value
    return df.replace(np.nan, '')
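To see what the `data[1::2]` / `data[2::2]` slicing picks out of each row: on a string-indexed row Series, an integer slice is positional, so it selects the alternating feature and value cells. A small sketch using one row of the sample data:

```python
import pandas as pd

# One row of the sample frame, as iterrows() would yield it.
row = pd.Series(['100001', 'weight', '130g', None, None, 'price', '$140.50'],
                index=['productid', 'feature1', 'value1', 'feature2', 'value2',
                       'feature3', 'value3'])

features = list(row[1::2])  # positions 1, 3, 5 -> the featureN cells
values = list(row[2::2])    # positions 2, 4, 6 -> the valueN cells
print(features, values)
```

The `if feature_name:` guard in `expand` then skips the `None` pair.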


df = pd.DataFrame([("100001", "weight", "130g", None, None, "price", "$140.50"),
                   ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
                   ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
                   ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
                   ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ")],
                  columns=["productid", "feature1", "value1", "feature2", "value2", "feature3", "value3"])

xdf = expand(df)
print(xdf)

Output:

        weight    price  pieces dimensions manufacturer  calories
100001    130g  $140.50
100002    200g           12 pcs   150X75cm
100003            $22.90           70X30cm
100004  $12.90                                     ABC   556Kcal
100005                             20X20cm         XYZ  1320Kcal
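For comparison, the same reshape can be done without row iteration by stacking the (feature, value) column pairs into long form and pivoting back to wide. This is a sketch assuming the sample frame above (with the short 100003 record padded to the full seven fields); it is not the answer's original method:

```python
import pandas as pd

# Same sample frame as above.
df = pd.DataFrame([("100001", "weight", "130g", None, None, "price", "$140.50"),
                   ("100002", "weight", "200g", "pieces", "12 pcs", "dimensions", "150X75cm"),
                   ("100003", "dimensions", "70X30cm", "price", "$22.90", None, None),
                   ("100004", "price", "$12.90", "manufacturer", "ABC", "calories", "556Kcal"),
                   ("100005", "calories", "1320Kcal", "dimensions", "20X20cm", "manufacturer", "XYZ")],
                  columns=["productid", "feature1", "value1",
                           "feature2", "value2", "feature3", "value3"])

# Gather each (featureN, valueN) pair into two long columns, then pivot back
# to one row per product.
parts = []
for i in (1, 2, 3):
    part = df[["productid", f"feature{i}", f"value{i}"]].copy()
    part.columns = ["productid", "feature", "value"]
    parts.append(part)
long_df = pd.concat(parts).dropna(subset=["feature"])
wide = (long_df.pivot(index="productid", columns="feature", values="value")
               .fillna(""))
print(wide)
```

`pivot` requires each (productid, feature) pair to be unique, which holds for this data; with duplicate pairs you would need `pivot_table` with an aggregation instead.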

EDIT1: A slightly more compact form (slow!):

def expand2(frame):
    return pd.DataFrame.from_dict(
        {data.productid: {f: v for f, v in zip(data[1::2], data[2::2]) if f}
         for _, data in frame.iterrows()},
        orient='index')

EDIT2: Using a generator expression:

import itertools

def expand3(frame):
    return pd.DataFrame.from_records(
        ({f: v for f, v in itertools.chain((('productid', data.productid),),
                                           zip(data[1::2], data[2::2])) if f}
         for _, data in frame.iterrows()),
        index='productid').replace(np.nan, '')

Some timings (with the functions decorated with @timeit):

import functools
import sys
import time

def timeit(f):
    @functools.wraps(f)
    def timed(*args, **kwargs):
        try:
            start_time = time.time()
            return f(*args, **kwargs)
        finally:
            end_time = time.time()
            function_invocation = "x"  # placeholder for the argument list
            sys.stdout.flush()
            print(f'Function {f.__name__}({function_invocation}), took: {end_time - start_time:2.4f} seconds.',
                  flush=True, file=sys.stderr)

    return timed
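As a usage sketch, the decorator wraps any callable and reports its wall-clock time to stderr while passing the return value through; `slow_square` here is a hypothetical toy function, not part of the original benchmark:

```python
import functools
import sys
import time

def timeit(f):
    @functools.wraps(f)
    def timed(*args, **kwargs):
        try:
            start_time = time.time()
            return f(*args, **kwargs)
        finally:
            end_time = time.time()
            print(f'Function {f.__name__}(...), took: {end_time - start_time:2.4f} seconds.',
                  flush=True, file=sys.stderr)
    return timed

# Hypothetical toy function to show the decorator in action.
@timeit
def slow_square(x):
    time.sleep(0.05)
    return x * x

result = slow_square(7)  # the timing line goes to stderr; result is 49
```

The `try`/`finally` layout means the timing line is printed even if the wrapped function raises.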

import random

def generate_wide_df(n_rows, n_features):
    possible_labels = [f'label_{i}' for i in range(n_features)]
    columns = ['productid']
    for i in range(1, n_features):
        columns.append(f'feature_{i}')
        columns.append(f'value_{i}')

    df = pd.DataFrame(columns=columns)
    for row_n in range(n_rows):
        df.loc[row_n, 'productid'] = int(1000000 + row_n)
        for _ in range(n_features):
            # Columns run feature_1 .. feature_{n_features - 1}, so stay in that range.
            feature_num = random.randint(1, n_features - 1)
            df.loc[row_n, f'feature_{feature_num}'] = random.choice(possible_labels)
            df.loc[row_n, f'value_{feature_num}'] = random.randint(1, 10000)
    return df.where(df.notnull(), None)


df = generate_wide_df(4000, 30)


expand(df)
expand3(df)
expand2(df)

Results:

Function expand(x), took: 1.1576 seconds.
Function expand3(x), took: 1.1185 seconds.
Function expand2(x), took: 16.3055 seconds.

Regarding "python - Building a dataset from rows to columns with pandas", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45793412/
