gpt4 book ai didi

sql - 查找每个客户组的最新帐户

转载 作者:行者123 更新时间:2023-12-04 11:19:16 25 4
gpt4 key购买 nike

我有一个包含客户信息的表。每个客户都被分配了一个客户 ID(他们的 SSN),他们在开设更多账户时会保留该 ID。两个客户可能在同一个帐户上,每个客户都有自己的 ID。帐号不按日期排序。

我想找到每个客户或客户组的最新帐户。如果两个客户曾经一起使用过一个帐户,我想返回任一客户最近使用过的帐户。

这是一个包含一些可能情况的示例表。

示例表 ACCT:

acctnumber  date            Cust1ID     Cust2ID 
10000 '2016-02-01' 1110 NULL --Case0-customer has only ever had
--one account

10001 '2016-02-01' 1111 NULL --Case1-one customer has multiple
10050 '2017-02-01' 1111 NULL --accounts
400050 '2017-06-01' 1111 NULL
10089 '2017-12-08' 1111 NULL

10008 '2016-02-01' 1120 NULL --Case2-customer has account(s) and later
10038 '2016-04-01' 1120 NULL
10058 '2017-02-03' 1120 1121 --gets account(s) with another customer

10002 '2016-02-01' 1112 NULL --Case3-customer has account(s) and later
10052 '2017-02-02' 1113 1112 --becomes the second customer on another
10152 '2017-05-02' 1113 1112 --account(s)

10003 '2016-02-02' 1114 1115 --Case4-customer and second customer
7060 '2017-02-04' 1115 1114 --switch which is first and second

10004 '2016-02-02' 1116 1117 --Case5-second customer later gets
10067 '2017-02-05' 1117 NULL --separate account(s)
10167 '2018-02-05' 1117 NULL

50013 '2016-01-01' 2008 NULL --Case5b -customer has account(s) & later
50014 '2017-02-02' 2008 2009 --gets account(s) with second customer &
50015 '2017-04-04' 2008 NULL --later still first customer gets
100015 '2018-05-05' 2008 NULL --separate account(s)

30005 '2015-02-01' 1118 NULL --Case6-customer has account(s)
10005 '2016-02-01' 1118 NULL
10054 '2017-02-02' 1118 1119 --gets account(s) with another
40055 '2017-03-03' 1118 1119
10101 '2017-04-04' 1119 NULL --who later gets separate account(s)
10201 '2017-05-05' 1119 NULL
30301 '2017-06-06' 1119 NULL
10322 '2018-01-01' 1119 NULL

10007 '2016-02-01' 1122 1123 --Case7-customers play musical chairs
10057 '2017-02-03' 1123 1124
10107 '2017-06-02' 1124 1125

50001 '2016-01-01' 2001 NULL --Case8a-customers with account(s)
50002 '2017-02-02' 2001 2002 --together each later get separate
50003 '2017-03-03' 2001 NULL --account(s)
50004 '2017-04-04' 2002 NULL

50005 '2016-01-01' 2003 NULL --Case8b-customers with account(s)
50006 '2017-02-02' 2003 2004 --together each later get separate
50007 '2017-03-03' 2004 NULL --account(s)
50008 '2017-04-04' 2003 NULL
50017 '2018-03-03' 2004 NULL
50018 '2018-04-04' 2003 NULL

50009 '2016-01-01' 2005 NULL --Case9a-customer has account(s) & later
50010 '2017-02-02' 2005 2006 --gets account(s) with a second customer
50011 '2017-03-03' 2005 2007 --& later still gets account(s) with a
--third customer

50109 '2016-01-01' 2015 NULL --Case9b starts the same as Case9a, but
50110 '2017-02-02' 2015 2016
50111 '2017-03-03' 2015 2017
50112 '2017-04-04' 2015 NULL --after all accounts with other customers
50122 '2017-05-05' 2015 NULL --are complete, the original primary
--customer begins opening individual
--accounts again

期望的结果:
acctnumber  date            Cust1ID     Cust2ID 
10000 '2016-02-01' 1110 NULL --Case0
10089 '2017-12-08' 1111 NULL --Case1
10058 '2017-02-03' 1120 1121 --Case2
10152 '2017-05-02' 1113 1112 --Case3
7060 '2017-02-04' 1115 1114 --Case4
10167 '2018-02-05' 1117 NULL --Case5
100015 '2018-05-05' 2008 NULL --Case5b
10322 '2018-01-01' 1119 NULL --Case6
10107 '2017-06-02' 1124 1125 --Case7
50003 '2017-03-03' 2001 NULL --Case8a result 1
50004 '2017-04-04' 2002 NULL --Case8a result 2
50017 '2018-03-03' 2004 NULL --Case8b result 1
50018 '2018-04-04' 2003 NULL --Case8b result 2
50011 '2017-03-03' 2005 2007 --Case9a
50122 '2017-05-05' 2015 NULL --Case9b

或者,我会接受案例 7 输出两个单独的客户组:
10007       '2016-02-01'    1122        1123    --Case7 result 1
10107 '2017-06-02' 1124 1125 --Case7 result 2

因为案例 8a 和 8b 代表公司承认客户值得持有单独的帐户,所以我们希望将他们的组视为拆分,因此它具有单独的结果集。

另外,在大多数场景下,客户的账户数量较多,上述情况混搭加类是很常见的。例如,一个客户可以拥有五个账户(案例 1),然后随后与另一个客户开立一个或多个账户(案例 3),有时会更换主要账户持有人(案例 4),然后第一个客户再次开始开设个人账户(案例 5b)。

只要帐户编号是唯一的并且任何客户 ID 匹配,我都会尝试将表加入到自身的副本中。但是,这会删除只有一个帐户的客户,因此我添加了一个与 custid 或帐号不匹配的 cust 联合,以及按 custid 分组的客户。

不幸的是,第二部分不仅包括案例 0 中的 custid,而且还有一些不应该被排除在外的 custid。
select
max(date1) as date,
cust1id1 as cust1id
from
(
select
acctnumber as [acctnumber1],
date as [date1],
cust1id as [cust1id1],
cust2id as [cust2id1]
from
acct
) t1
join
(
select
acctnumber as [acctnumber2],
date as [date2],
cust1id as [cust1id2],
cust2id as [cust2id2]
from
acct
) t2
on t1.date1 > t2.date2 and
(t1.cust1id1 = t2.cust1id2 or
t1.cust1id1 = t2.cust2id2 or
t1.cust2id1 = t2.cust2id2)
Group by
cust1id1
union
select
max(date1) as date,
cust1id1 as cust1id
from
(
select
acctnumber as [acctnumber1],
date as [date1],
cust1id as [cust1id1],
cust2id as [cust2id1]
from
acct
) t1
join
(
select
acctnumber as [acctnumber2],
date as [date2],
cust1id as [cust1id2],
cust2id as [cust2id2]
from
acct
) t2
on (t1.acctnumber1 != t2.acctnumber2 and
t1.cust1id1 != t2.cust1id2 and
t1.cust1id1 != t2.cust2id2 and
t1.cust2id1 != t2.cust2id2)
group by
cust1id1

更新

到目前为止,感谢您提供的所有出色的答案和评论。我一直在尝试查询并比较结果。

@VladimirBaranov 提出了一个我以前在对其他答案的评论中没有考虑过的罕见案例。

与案例 7 类似,如果处理了案例 8 将是一个奖励,但不是预期的。

案例 9 很重要,应处理 9a 和 9b 的结果。

更新 2

我注意到我原来的 7 个案例存在问题。

在最近的帐户中,当客户不再在帐户中时,始终是第二个借款人。这完全是无意的,您可以查看这些示例中的任何一个,并且任何一个客户都可能是最近帐户中的剩余客户。

此外,每个案例都有最少数量的帐户来准确显示案例正在测试的内容,但这并不常见。通常在每个案例的每个步骤中,在客户切换到添加第二个客户之前,可以有 5、10、15 个或更多帐户,然后这两个帐户可以一起拥有许多帐户。

回顾我看到的答案,很多人都有索引、创建、更新和其他特定于能够编辑数据库的子句。不幸的是,我在这个数据库的消费者端,所以我有只读访问权限,我可以用来与数据库交互的程序会自动拒绝它们。

最佳答案

我要感谢 Jeff Breadner 提供带有示例数据的 DDL。

您必须逐步运行以下查询,逐个 CTE 并检查中间结果以了解它的作用。它假设 AcctNumber在给定的表中是唯一的。

首先,我想为每个客户找到最新的帐户。这是一个简单的top-n-per-group查询,我正在使用 ROW_NUMBER接近这里。
CTE_Customers通过汇总 Cust1ID 列出所有个人客户的简单列表和 Cust2ID . CTE_RN分配他们的行号。 CTE_LatestAccounts为每个客户提供最新帐户:

+------------------+------------+--------+
| LatestAcctNumber | LatestDT | CustID |
+------------------+------------+--------+
| 10000 | 2016-02-01 | 1110 |
| 10050 | 2017-02-01 | 1111 |
| 10052 | 2017-02-02 | 1112 |
| 10052 | 2017-02-02 | 1113 |
| 7060 | 2017-02-04 | 1114 |
| 7060 | 2017-02-04 | 1115 |
| 10004 | 2016-02-02 | 1116 |
| 10067 | 2017-02-05 | 1117 |
| 10054 | 2017-02-03 | 1118 |
| 10101 | 2017-06-02 | 1119 |
| 10058 | 2017-02-03 | 1120 |
| 10058 | 2017-02-03 | 1121 |
| 10007 | 2016-02-01 | 1122 |
| 10057 | 2017-02-03 | 1123 |
| 10107 | 2017-06-02 | 1124 |
| 10107 | 2017-06-02 | 1125 |
+------------------+------------+--------+

由于拥有将最新帐户“传播”给另一个客户的客户对,因此任务变得复杂。

客户对在原始表中定义,所以 CTE_MaxLatestAccounts从原始表中取出每一行并将最新帐户加入到其中两次 - 对于 Cust1DCust2ID .对于每一对,我都会选择两个最新帐户中的一个——最近的一个。因此,属于一对的客户可能会从其合作伙伴那里获得一个帐户。
+---------+---------+-------------+---------------------+
| Cust1ID | Cust2ID | MaxLatestDT | MaxLatestAcctNumber |
+---------+---------+-------------+---------------------+
| 1110 | NULL | 2016-02-01 | 10000 |
| 1111 | NULL | 2017-02-01 | 10050 |
| 1111 | NULL | 2017-02-01 | 10050 |
| 1120 | NULL | 2017-02-03 | 10058 |
| 1120 | 1121 | 2017-02-03 | 10058 |
| 1112 | NULL | 2017-02-02 | 10052 |
| 1113 | 1112 | 2017-02-02 | 10052 |
| 1114 | 1115 | 2017-02-04 | 7060 |
| 1115 | 1114 | 2017-02-04 | 7060 |
| 1116 | 1117 | 2017-02-05 | 10067 |
| 1117 | NULL | 2017-02-05 | 10067 |
| 1118 | NULL | 2017-02-03 | 10054 |
| 1118 | 1119 | 2017-06-02 | 10101 |
| 1119 | NULL | 2017-06-02 | 10101 |
| 1122 | 1123 | 2017-02-03 | 10057 |
| 1123 | 1124 | 2017-06-02 | 10107 |
| 1124 | 1125 | 2017-06-02 | 10107 |
+---------+---------+-------------+---------------------+
MaxLatestAcctNumber这里适用于 Cust1IDCust2ID .同一客户可能在此处多次列出,我们需要使用最新的帐户再次选择条目。这是一对的最新帐户,不适用于个人客户。

方法与开始时相同。把两个 Cust1IDCust2ID列表中的客户: CTE_CustomersWithLatestAccountFromPair .在 CTE_CustomersWithLatestAccountFromPairRN 中分配行号并在 CTE_FinalAccounts 中选择最终帐户.
+---------------------+
| MaxLatestAcctNumber |
+---------------------+
| 10000 |
| 10050 |
| 10052 |
| 10052 |
| 7060 |
| 7060 |
| 10067 |
| 10067 |
| 10101 |
| 10101 |
| 10058 |
| 10058 |
| 10057 |
| 10107 |
| 10107 |
| 10107 |
+---------------------+

现在我们只需要过滤原始表并只保留出现在此列表中的那些行(帐户)。请参阅下面的最终结果。

样本数据
declare @ACCT table (
AcctNumber int,
dt date,
Cust1ID int,
Cust2ID int
);

insert into @ACCT values
(10000, '2016-02-01', 1110, null),
(10001, '2016-02-01', 1111, null),
(10050, '2017-02-01', 1111, null),
(10008, '2016-02-01', 1120, null),
(10058, '2017-02-03', 1120, 1121),
(10002, '2016-02-01', 1112, null),
(10052, '2017-02-02', 1113, 1112),
(10003, '2016-02-02', 1114, 1115),
(7060, '2017-02-04', 1115, 1114),
(10004, '2016-02-02', 1116, 1117),
(10067, '2017-02-05', 1117, null),
(10005, '2016-02-01', 1118, null),
(10054, '2017-02-03', 1118, 1119),
(10101, '2017-06-02', 1119, null),
(10007, '2016-02-01', 1122, 1123),
(10057, '2017-02-03', 1123, 1124),
(10107, '2017-06-02', 1124, 1125);

查询
WITH
CTE_Customers
AS
(
SELECT
AcctNumber
,dt
,Cust1ID AS CustID
FROM @ACCT
WHERE Cust1ID IS NOT NULL

UNION ALL

SELECT
AcctNumber
,dt
,Cust2ID AS CustID
FROM @ACCT
WHERE Cust2ID IS NOT NULL
)
,CTE_RN
AS
(
SELECT
AcctNumber
,dt
,CustID
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt DESC) AS rn
FROM CTE_Customers
)
,CTE_LatestAccounts
-- this gives one row per CustID
AS
(
SELECT
AcctNumber AS LatestAcctNumber
,dt AS LatestDT
,CustID
FROM CTE_RN
WHERE rn = 1
)
,CTE_MaxLatestAccounts
AS
(
SELECT
A.Cust1ID
,A.Cust2ID
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestDT ELSE A2.LatestDT END AS MaxLatestDT
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestAcctNumber ELSE A2.LatestAcctNumber END AS MaxLatestAcctNumber
FROM
@ACCT AS A
LEFT JOIN CTE_LatestAccounts AS A1 ON A1.CustID = A.Cust1ID
LEFT JOIN CTE_LatestAccounts AS A2 ON A2.CustID = A.Cust2ID
)
,CTE_CustomersWithLatestAccountFromPair
AS
(
SELECT
Cust1ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust1ID IS NOT NULL

UNION ALL

SELECT
Cust2ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust2ID IS NOT NULL
)
,CTE_CustomersWithLatestAccountFromPairRN
AS
(
SELECT
CustID
,MaxLatestDT
,MaxLatestAcctNumber
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY MaxLatestDT DESC) AS rn
FROM CTE_CustomersWithLatestAccountFromPair
)
,CTE_FinalAccounts
AS
(
SELECT MaxLatestAcctNumber
FROM CTE_CustomersWithLatestAccountFromPairRN
WHERE rn = 1
)
SELECT *
FROM @ACCT AS A
WHERE A.AcctNumber IN (SELECT MaxLatestAcctNumber FROM CTE_FinalAccounts)
;

结果
+------------+------------+---------+---------+
| AcctNumber | dt | Cust1ID | Cust2ID |
+------------+------------+---------+---------+
| 10000 | 2016-02-01 | 1110 | NULL |
| 10050 | 2017-02-01 | 1111 | NULL |
| 10058 | 2017-02-03 | 1120 | 1121 |
| 10052 | 2017-02-02 | 1113 | 1112 |
| 7060 | 2017-02-04 | 1115 | 1114 |
| 10067 | 2017-02-05 | 1117 | NULL |
| 10101 | 2017-06-02 | 1119 | NULL |
| 10057 | 2017-02-03 | 1123 | 1124 |
| 10107 | 2017-06-02 | 1124 | 1125 |
+------------+------------+---------+---------+

此结果与您想要的结果相匹配,除了最后一种情况 7。

我的查询不会尝试跟踪任意长度的链接客户链,并且仅限于一次处理一对。这就是案例 7 结果不是一行的原因。
查询将始终选择最后一个日期 ( 10107) 的行/帐户,它也可能选择链中间的帐户。在这种情况下,它选择了一行 10057 ,而不是 10007 ,因为这是客户的后期帐户 11221123 .

当我查看执行计划时,我看到 CTE_LatestAccounts 后面的查询基本上运行了四次。

如果您保存 CTE_LatestAccounts 的结果,则很可能放入具有适当索引的临时表中,整体性能会更好。

像这样的东西:
DECLARE @LatestAccounts TABLE 
(LatestAcctNumber int, LatestDT date, CustID int PRIMARY KEY);

WITH
CTE_Customers
AS
(
SELECT
AcctNumber
,dt
,Cust1ID AS CustID
FROM @ACCT
WHERE Cust1ID IS NOT NULL

UNION ALL

SELECT
AcctNumber
,dt
,Cust2ID AS CustID
FROM @ACCT
WHERE Cust2ID IS NOT NULL
)
,CTE_RN
AS
(
SELECT
AcctNumber
,dt
,CustID
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt DESC) AS rn
FROM CTE_Customers
)
,CTE_LatestAccounts
-- this gives one row per CustID
AS
(
SELECT
AcctNumber AS LatestAcctNumber
,dt AS LatestDT
,CustID
FROM CTE_RN
WHERE rn = 1
)
INSERT INTO @LatestAccounts (LatestAcctNumber, LatestDT, CustID)
SELECT LatestAcctNumber, LatestDT, CustID
FROM CTE_LatestAccounts;


WITH
CTE_MaxLatestAccounts
AS
(
SELECT
A.Cust1ID
,A.Cust2ID
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestDT ELSE A2.LatestDT END AS MaxLatestDT
,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
THEN A1.LatestAcctNumber ELSE A2.LatestAcctNumber END AS MaxLatestAcctNumber
FROM
@ACCT AS A
LEFT JOIN @LatestAccounts AS A1 ON A1.CustID = A.Cust1ID
LEFT JOIN @LatestAccounts AS A2 ON A2.CustID = A.Cust2ID
)
,CTE_CustomersWithLatestAccountFromPair
AS
(
SELECT
Cust1ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust1ID IS NOT NULL

UNION ALL

SELECT
Cust2ID AS CustID
,MaxLatestDT
,MaxLatestAcctNumber
FROM CTE_MaxLatestAccounts
WHERE Cust2ID IS NOT NULL
)
,CTE_CustomersWithLatestAccountFromPairRN
AS
(
SELECT
CustID
,MaxLatestDT
,MaxLatestAcctNumber
,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY MaxLatestDT DESC) AS rn
FROM CTE_CustomersWithLatestAccountFromPair
)
,CTE_FinalAccounts
AS
(
SELECT MaxLatestAcctNumber
FROM CTE_CustomersWithLatestAccountFromPairRN
WHERE rn = 1
)
SELECT *
FROM @ACCT AS A
WHERE A.AcctNumber IN (SELECT MaxLatestAcctNumber FROM CTE_FinalAccounts)
;

如果您确实需要在链的长度是任意的时将所有链接的客户合并/分组为一行,您可以使用递归查询来完成,例如,这里: How to find all connected subgraphs of an undirected graph

一旦你用一些 GroupID 标记了每个客户,就可以找到每个客户的最新帐户,就像在这个查询的开头一样。然后在组中找到最新的帐户(而不是像此查询中的简单对)。

在链接问题中查找无向图的所有子图的查询对于大型数据集可能非常慢,并且有有效的基于非集合的算法来执行此操作。

如果您知道链的最大长度不能超过某个数字,则可以使此递归查询更有效。

关于sql - 查找每个客户组的最新帐户,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42119006/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com