gpt4 book ai didi

sql - 比较来自两个不同表的两列的逗号分隔值

转载 作者:行者123 更新时间:2023-12-04 23:44:26 25 4
gpt4 key购买 nike

我想比较两个不同 Oracle 表的逗号分隔值的两个列(差异表)的值。我想找到与所有值匹配的行(NAME1 所有值 应与 NAME2 值匹配)。

注意:逗号分隔的值顺序不同。

例子:

T1:

ID_T1             NAME1
===================================


1 ASCORBIC ACID, PARACETAMOL, POTASSIUM HYDROGEN CARBONATE
2 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID
3 CAFFEINE, PARACETAMOL PH. EUR.
4 PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE

T2:

ID_T2          NAME2
=================================

4 POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
5 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS
6 PARACETAMOL PH. EUR.,CAFFEINE
7 CODEINE PHOSPHATE, PARACETAMOL DC
8 DEXCHLORPHENIRAMINE MALEATE, DEXTROMETHORPHAN HYDROBROMIDE
10 DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE

我的结果应该只显示基于两个表中所有名称匹配的匹配行。

    ID_T1    ID_T2    MATCHING NAME
==================================
1 4 POTASSIUM HYDROGEN CARBONATE, ASCORBIC ACID, PARACETAMOL
3 6 PARACETAMOL PH. EUR.,CAFFEINE
4 10 PSEUDOEPHEDRINE HYDROCHLORIDE,DEXCHLORPHENIRAMINE MALEATE

尝试使用 REGEXP_SUBST 但无法使其工作。

我使用下面的代码来解析这些值:

SELECT REGEXP_SUBSTR (NAME1, '[^,]+', 1, ROWNUM)
FROM T1
CONNECT BY ROWNUM <= LENGTH (NAME1) -
LENGTH (REPLACE (NAME, ',')) + 1

最佳答案

您可以将表格转换为第一范式,然后比较存储在每一行中的化合物。起点可以是:

{1} 标记每一行,并将标记写入新表。为每个标记赋予其原始 ID 加上 一个 3 字母前缀,指示该标记来自哪个表。{2} 按 ID 对新(“规范化”)表的行进行分组,并执行 LISTAGG()。执行自连接,并找到匹配的“ token 组”。

{1} 标记化,创建表作为选择 (CTAS)

create table tokens
as
select
ltrim( -- ltrim() and rtrim() remove leading/trailing spaces (blanks)
rtrim(
substr( N.wrapped
, instr( N.wrapped, ',', 1, T.pos ) + 1
, ( instr( N.wrapped, ',', 1, T.pos + 1 ) - instr( N.wrapped, ',', 1, T.pos ) ) - 1
)
)
) token
, N.id
from (
select ',' || name1 || ',' as wrapped, 'T1_' || to_char( id_t1 ) as id from t1 -- names wrapped in commas, (table)_id
union all
select ',' || name2 || ',' , 'T2_' || to_char( id_t2 ) from t2
) N join (
select level as pos -- (max) possible position of char in an existing token
from dual
connect by level <= (
select greatest( -- find the longest string ie max position (query T1 and T2)
( select max( length( name1 ) ) from t1 )
, ( select max( length( name2 ) ) from t2 )
) as pos
from dual
)
) T
on T.pos <= ( length( N.wrapped ) - length( replace( N.wrapped, ',') ) ) - 1
;

不使用 CONNECT BY 进行标记化的灵感来自 this SO answer .

TOKENS 表的内容如下所示:

SQL> select * from tokens ;
TOKEN ID
ASCORBIC ACID T1_1
SODIUM HYDROGEN CARBONATE T1_2
CAFFEINE T1_3
PSEUDOEPHEDRINE HYDROCHLORIDE T1_4
PARACETAMOL T1_100
sodium hydroxide T1_110
POTASSIUM HYDROGEN CARBONATE T2_4
SODIUM HYDROGEN CARBONATE T2_5
PARACETAMOL PH. EUR. T2_6
CODEINE PHOSPHATE T2_7
DEXCHLORPHENIRAMINE MALEATE T2_8
DEXCHLORPHENIRAMINE MALEATE T2_10
PARACETAMOL T2_200
...

{2} GROUP BY、LISTAGG、自加入

select
S1.id id1
, S2.id id2
, S1.tokengroup_T1
, S2.tokengroup_T2
from
(
select substr( id, 4, length( id ) - 3 ) id
, listagg( token, ' + ' ) within group ( order by token ) tokengroup_T1
from tokens
group by id
having substr( id, 1, 3 ) = 'T1_'
) S1
join
(
select substr( id, 4, length( id ) - 3 ) id
, listagg( token, ' + ' ) within group ( order by token ) tokengroup_T2
from tokens
group by id
having substr( id, 1, 3 ) = 'T2_'
) S2
on S1.tokengroup_T1 = S2.tokengroup_T2
;

-- result
ID1 ID2 TOKENGROUP_T1 TOKENGROUP_T2
4 10 DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE DEXCHLORPHENIRAMINE MALEATE + PSEUDOEPHEDRINE HYDROCHLORIDE
110 210 potassium carbonate + sodium hydroxide potassium carbonate + sodium hydroxide
1 4 ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE ASCORBIC ACID + PARACETAMOL + POTASSIUM HYDROGEN CARBONATE
3 6 CAFFEINE + PARACETAMOL PH. EUR. CAFFEINE + PARACETAMOL PH. EUR.

当这样做时,您可以将物质按(字母)排序,您还可以在这里选择一个您喜欢的“分隔符”(我们使用了“+”)。

备选方案

如果所有这些对您都没有用,或者您认为这太复杂了,那么您可以尝试使用 TRANSLATE()。在这种情况下,我建议从您的数据集中去除所有空格/空白(在查询中 - 改变原始数据!)像这样:

查询

select 
id1, id2
, name1, name2
from (
select
id_t1 id1
, id_t2 id2
, T1.name1 name1
, T2.name2 name2
from T1
join T2
on translate( replace( T1.name1, ' ', '' ), replace( T2.name2, ' ', '' ), '!' )
= translate( replace( T2.name2, ' ', '' ), replace( T1.name1, ' ', '' ), '!' )
) ;

结果

  ID1   ID2 NAME1                                                                NAME2                                                        
2 5 SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS, CITRIC ACID SODIUM HYDROGEN CARBONATE, SODIUM CARBONATE ANHYDROUS
3 6 CAFFEINE, PARACETAMOL PH. EUR. PARACETAMOL PH. EUR.,CAFFEINE
100 10 PARACETAMOL, DEXTROMETHORPHAN, PSEUDOEPHEDRINE, PYRILAMINE DEXCHLORPHENIRAMINE MALEATE, PSEUDOEPHEDRINE HYDROCHLORIDE
110 210 sodium hydroxide, potassium carbonate sodium hydroxide, potassium carbonate

注意:我已将以下行添加到您的示例数据中:

-- T1
110, 'sodium hydroxide, potassium carbonate'

-- T2
210, 'sodium hydroxide, potassium carbonate'
211, 'potassium hydroxide, sodium carbonate'

我发现使用 TRANSLATE() 很容易产生“误报”,即 ID 为 110、210 和 211 的物质看起来“匹配”。 (换句话说:我认为这不是这项工作的正确工具。)

DBFIDDLE here

(点击链接查看示例表和查询)。

关于sql - 比较来自两个不同表的两列的逗号分隔值,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58237799/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com