
python - Preventing row duplication when performing a merge

Reposted. Author: 行者123. Updated: 2023-12-02 17:32:58

I've run into trouble with a data analysis project I'm working on.

Essentially, if I have sample CSV 'A':

id   | item_num
A123 | 1
A123 | 2
B456 | 1

and sample CSV 'B':

id   | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...

and I perform a merge with Pandas, I end up with this:

id   | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | Mary had a...
A123 | 1 | ...little lamb.
A123 | 2 | ...little lamb.
B456 | 1 | Its fleece...
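
The multiplication can be reproduced without the CSV files. The following is a minimal sketch, assuming in-memory DataFrames named `a` and `b` (hypothetical stand-ins for the two CSVs, which are not shown in full). Because `id` is the only column the frames share, each `id` contributes (rows in A) x (rows in B) rows to the result:

```python
import pandas as pd

# Hypothetical in-memory stand-ins for CSV 'A' and CSV 'B'.
a = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
b = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# With no `on=` argument, merge joins on the columns both frames share: "id".
merged = pd.merge(a, b)

# Per id, the result holds (rows in a) * (rows in b) rows: 2*2 for A123, 1*1 for B456.
expected_rows = int((a.groupby("id").size() * b.groupby("id").size()).sum())
print(len(merged), expected_rows)
```

Both numbers agree, which suggests the extra rows come from the many-to-many join on `id` rather than from anything datetime-related.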

How can I get it to look like this instead:

id   | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb...
B456 | 1 | Its fleece...

Here is my code:

import pandas as pd

# Import CSVs
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))


# Merge the two DataFrames on the column(s) they share.
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))

# Let's do a "dedupe" to deal with an issue in how Pandas handles datetime merges.
# I read about an issue where, if datetime is involved, duplicate entries will be created.
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))

# Save to another CSV
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")

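I'd really appreciate any help; I'm quite stuck! I'm working with 20,000+ rows.

Thanks.

EDIT: My post was flagged as a potential duplicate. It isn't, because I'm not necessarily trying to add a column; I'm just trying to prevent the description from being multiplied by the number of item_num values attributed to a particular id.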


UPDATE, June 21:

How would I do the merge if the two DataFrames looked like this?

id   | item_num | other_col
A123 | 1 | lorem ipsum
A123 | 2 | dolor sit
A123 | 3 | amet, consectetur
B456 | 1 | lorem ipsum

and sample CSV 'B':

id   | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...

So that I end up with:

id   | item_num |  other_col  | description
A123 | 1 | lorem ipsum | Mary had a...
A123 | 2 | dolor sit | ...little lamb.
B456 | 1 | lorem ipsum | ...Its fleece...

Meaning, the third row, with "amet, consectetur" in 'other_col', would be ignored.
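
The answer below addresses the original case only. For this updated case, where both frames already carry `item_num`, one possible approach is an inner join on both key columns. This is a minimal sketch under that assumption, using hypothetical in-memory frames `a` and `b` in place of the CSVs:

```python
import pandas as pd

# Hypothetical in-memory stand-ins for the updated CSV 'A' and CSV 'B'.
a = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
b = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Inner join on both keys: a row survives only if its (id, item_num) pair
# exists in both frames, so (A123, 3) is dropped.
result = a.merge(b, on=["id", "item_num"], how="inner")
print(result)
```

Passing `how="left"` instead would keep the (A123, 3) row with a missing description, if dropping it is not actually desired.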

Best answer

I'd do it this way:

In [135]: result = A.merge(B.assign(item_num=B.groupby('id').cumcount()+1))

In [136]: result
Out[136]:
     id  item_num       description
0  A123         1     Mary had a...
1  A123         2   ...little lamb.
2  B456         1  ...Its fleece...

Explanation: we can create a "virtual" item_num column in the B DataFrame to join on:

In [137]: B.assign(item_num=B.groupby('id').cumcount()+1)
Out[137]:
     id       description  item_num
0  A123     Mary had a...         1
1  A123   ...little lamb.         2
2  B456  ...Its fleece...         1
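
The approach above can be tried end to end without the CSV files. The following is a minimal sketch, assuming in-memory frames `a` and `b` (hypothetical stand-ins for the answer's `A` and `B`). The `validate="one_to_one"` argument is an addition not present in the answer; it makes pandas raise a `MergeError` if an (id, item_num) pair is repeated on either side:

```python
import pandas as pd

# Hypothetical in-memory stand-ins for the answer's A and B.
a = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
b = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Number the rows within each id (1, 2, ...) in the order they appear in b.
b_numbered = b.assign(item_num=b.groupby("id").cumcount() + 1)

# Join on both columns; validate guards against repeated (id, item_num) pairs.
result = a.merge(b_numbered, on=["id", "item_num"], validate="one_to_one")
print(result)
```

Note that this pairs rows purely by position within each id, so it relies on the descriptions in B appearing in the same order as the item_num values in A.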

Regarding "python - Preventing row duplication when performing a merge", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43768700/
