
python - Reconciling Records (Date and Number Value): Given two datasets with multiple features, how do I get the most probable match?

Reposted. Author: 行者123. Updated: 2023-11-28 17:00:35

Suppose I have two datasets, base and payment.

base is:

[ id, timestamp, value]

payment is:

 [ payment_id, timestamp, value, gateway ]

I want to reconcile base and payment. The desired result is:

[id, timestamp, value, payment_id, gateway, probability]

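Basically it should tell me which payment_id is the most probable for a given base entry. The match should take both the datetime and the value into account. I'd be happy if it only gave me the one with the highest probability, but I wouldn't mind a second/third suggestion either.

So far I've read a bit about fuzzy matching, similarity learning, cosine similarity and so on, but I can't seem to apply them to my problem. I'm thinking of doing something manually, like: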

for base_entry in base:
    for payment_entry in payment:
        value_difference = abs(base_entry['value'] - payment_entry['value'])
        time_difference = abs(base_entry['timestamp'] - payment_entry['timestamp'])

        if value_difference <= 0.1 and time_difference <= 0.1:
            # if the difference is small, then tell me the payment_id
            print(payment_entry['payment_id'])

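The problem is that this looks like a really "dumb" approach: there can be conflicts if more than one payment entry meets the criteria, and I would have to hand-tune the parameters to get good results.

I'm hoping to find a smarter, more automated way to reconcile these two datasets.

Does anyone have suggestions on how to approach this?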


EDIT: my current state:

import datetime

import pandas as pd
from pandas import ExcelWriter

orders = pd.read_excel("Orders.xlsx")
pmts = pd.read_excel("Payments.xlsx")

pmts['date'] = pd.to_datetime(pmts.date)
orders['data'] = pd.to_datetime(orders.data)

# convert each payment row into a plain dict
payment_list = []
for index, row in pmts.iterrows():
    new_entry = {}
    ts = row['date']
    new_entry['id'] = row['id']
    new_entry['date'] = ts.to_pydatetime()
    new_entry['value'] = row['value']
    new_entry['types'] = row['pmt']
    new_entry['results'] = []
    payment_list.append(new_entry)

# convert each order row into a plain dict with the same keys
order_list = []
for index, row in orders.iterrows():
    new_entry = {}
    ts = row['data']
    new_entry['id'] = row['Id1']
    new_entry['date'] = ts.to_pydatetime()
    new_entry['value'] = row['valor']
    new_entry['types'] = row['nome']
    new_entry['results'] = []
    order_list.append(new_entry)

delta_ref = datetime.timedelta(minutes=60)  # maximum allowed time gap

for each_entry in order_list:
    for each_payment in payment_list:
        delta_value = each_entry['value'] - each_payment['value']
        try:
            delta_time = abs(each_entry['date'] - each_payment['date'])
        except TypeError:
            # skip pairs whose dates cannot be subtracted
            continue

        if delta_value == 0 and delta_time < delta_ref:
            # record the matching payment on the order ...
            results = [each_payment['types'], delta_time, each_payment['id']]
            each_entry['results'].append(results)

            # ... and the matching order on the payment
            each_payment['results'].append(each_entry['id'])



# the context manager saves and closes each workbook on exit
orders2 = pd.DataFrame(order_list)
with ExcelWriter('OrdersList.xlsx') as writer:
    orders2.to_excel(writer)

pmts2 = pd.DataFrame(payment_list)
with ExcelWriter('PaymentList.xlsx') as writer:
    pmts2.to_excel(writer)

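OK, now I'm getting somewhere. It returns every entry that has the same value and a time delta smaller than x (60 minutes in this case). It would be ideal if it gave me only the most probable result, together with the probability that the match is correct (same amount, small time window). I'll keep working on it.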
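One way to narrow the candidate list above down to a ranked shortlist is sketched below. It keeps the same filter (equal value, time gap under 60 minutes) and then weights each candidate by an exponential decay of the time gap, normalized per order. The function name rank_candidates, the tau_minutes parameter and the sample records are all hypothetical, and the resulting weight is only a heuristic score, not a calibrated probability.

```python
import math
from datetime import datetime, timedelta


def rank_candidates(order, payments, window=timedelta(minutes=60), tau_minutes=15.0):
    """Return (payment_id, weight) pairs for one order, best candidate first.

    A payment is a candidate if it has the same value as the order and lies
    inside the time window. Each candidate gets exp(-gap / tau), and the
    weights are normalized to sum to 1 over this order's candidates.
    tau_minutes is an assumed tuning knob, not something from the question.
    """
    scored = []
    for p in payments:
        delta = abs(order['date'] - p['date'])
        # exact equality mirrors the `delta_value == 0` test in the question
        if p['value'] == order['value'] and delta < window:
            minutes = delta.total_seconds() / 60.0
            scored.append((math.exp(-minutes / tau_minutes), p['id']))
    total = sum(w for w, _ in scored)
    ranked = sorted(scored, reverse=True)
    return [(pid, w / total) for w, pid in ranked]


# made-up sample data, for illustration only
order = {'id': 'A1', 'date': datetime(2019, 2, 24, 12, 0), 'value': 50.0}
payments = [
    {'id': 'P1', 'date': datetime(2019, 2, 24, 12, 5), 'value': 50.0},
    {'id': 'P2', 'date': datetime(2019, 2, 24, 12, 40), 'value': 50.0},
    {'id': 'P3', 'date': datetime(2019, 2, 24, 12, 1), 'value': 75.0},
]
ranked = rank_candidates(order, payments)
print(ranked)
```

Taking ranked[0] gives the single most likely payment, while ranked[:3] gives the second and third suggestions as well. A smaller tau_minutes favors the closest payment more strongly.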

Best answer

The simplest approach is probably to pick the base/payment pair with the smallest difference. For example:

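base_data = [...]  # all base data
payment_data = [...]  # all payment data


def prop_diff(a, b, props):
    # iterate through all specified properties and sum the absolute
    # differences; this assumes every property is numeric (for example,
    # timestamps stored as epoch seconds) so the terms can be added
    return sum(abs(a[prop] - b[prop]) for prop in props)


def join_data(base, payment):
    # you need to implement your merging strategy here; as a placeholder,
    # merge the two dicts and let the base fields win on name clashes
    return {**payment, **base}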


results = []  # where we will store our merged results
working_payment = payment_data.copy()
for base in base_data:
    if not working_payment:
        break  # no payments left to match

    # difference between this base entry and every remaining payment
    diffs = [prop_diff(base, payment, ['value', 'timestamp'])
             for payment in working_payment]

    # find the index of the payment with the minimum difference
    min_idx = min(range(len(diffs)), key=diffs.__getitem__)

    # append the joined result and remove the selected payment
    results.append(join_data(base, working_payment[min_idx]))
    del working_payment[min_idx]

print(results)

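The basic idea is to compute the total difference between entries and pick the pair with the smallest difference. In this example I copy payment_data so that we don't destroy it, and once a payment has been matched to a base entry and the result appended, that payment is deleted from the working copy.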
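Two details of this idea are easy to get wrong: a value gap and a time gap are in different units, and matching base entries one at a time makes the result depend on their order. The sketch below is a variation on the answer, not the answer's own code. It scales both gaps before adding them, then sorts all base/payment pairs by cost and accepts the cheapest pairs first. The names pair_cost, match_globally, value_scale and time_scale, and the sample records, are hypothetical. This is still a greedy heuristic and does not guarantee the minimum total cost.

```python
from datetime import datetime


def pair_cost(base, payment, value_scale=1.0, time_scale=3600.0):
    """Scaled distance between a base entry and a payment.

    value_scale (currency units) and time_scale (seconds) are assumed
    tuning knobs that make the two gaps comparable before adding them.
    """
    dv = abs(base['value'] - payment['value']) / value_scale
    dt = abs((base['timestamp'] - payment['timestamp']).total_seconds()) / time_scale
    return dv + dt


def match_globally(base_data, payment_data):
    """Greedy one-to-one matching over all pairs, cheapest pair first."""
    pairs = sorted(
        (pair_cost(b, p), i, j)
        for i, b in enumerate(base_data)
        for j, p in enumerate(payment_data)
    )
    used_base, used_payment, matches = set(), set(), []
    for cost, i, j in pairs:
        if i in used_base or j in used_payment:
            continue  # one side of this pair is already matched
        used_base.add(i)
        used_payment.add(j)
        matches.append((base_data[i]['id'], payment_data[j]['payment_id'], cost))
    return matches


# made-up sample data, for illustration only
base_data = [
    {'id': 1, 'timestamp': datetime(2019, 2, 24, 10, 0), 'value': 20.0},
    {'id': 2, 'timestamp': datetime(2019, 2, 24, 10, 30), 'value': 20.0},
]
payment_data = [
    {'payment_id': 'P1', 'timestamp': datetime(2019, 2, 24, 10, 28), 'value': 20.0, 'gateway': 'card'},
    {'payment_id': 'P2', 'timestamp': datetime(2019, 2, 24, 10, 2), 'value': 20.0, 'gateway': 'card'},
]
matches = match_globally(base_data, payment_data)
print(matches)
```

Each returned tuple carries the cost of the accepted pair, so a caller could discard matches above some threshold instead of forcing every base entry onto a payment.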

Regarding "python - Reconciling Records (Date and Number Value): Given two datasets with multiple features, how do I get the most probable match?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54849245/
