mysql - Splitting paragraph documents into sentences

Reposted. Author: 行者123. Updated: 2023-11-29 07:18:44

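I have a database of paragraph documents. I want to split each paragraph in the table "master_data" into sentences and store them in a separate table, "splittext".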

master_data table:

id | Title | Paragraph

splittext table:

id_sen | sentences | doc_id 

I tried this query to select each sentence from master_data.Paragraph:

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '[^\.\!\* 
[\.\!\?]';

But it produces a parenthesis error. So I tried adding parentheses, and that produces the error Incorrect Parameter Count:

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '([^\.\!\* 
[\.\!\?])';

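My expected result is that each paragraph is split into sentences and stored in the new table, with the paragraph's original id returned and stored in doc_id.

For example:

master_data: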

id | Title | Paragraph  |
1 | asds..| I want. Some. Coconut and Banana !! |
2 | wad...| Milkshake? some Nice milk. |

splittext table:

id| sentences | doc_id  |

1| I want | 1 |
2| Some | 1 |
.
.
.
5| Some Nice milk | 2 |

Best answer

For MySQL 8.0, you can use a recursive CTE, bearing its limitations in mind.

with recursive r as (
    -- anchor member: first sentence of every paragraph
    select
        1 id,
        cast(regexp_substr(
            Paragraph, '[^.!?]+(?:[.!?]+|$)'
        ) as char(256)) sentences,
        id doc_id, Title, Paragraph
    from master_data
    union all
    -- recursive member: fetch the next occurrence (id + 1) of the pattern
    select id + 1,
        regexp_substr(
            Paragraph, '[^.!?]+(?:[.!?]+|$)',
            1, id + 1
        ),
        doc_id, Title, Paragraph
    from r
    where sentences is not null
)
select id, sentences, doc_id, Title
from r
where sentences is not null or id = 1
order by doc_id, id;

Output:

| id | sentences             | doc_id | Title  |
+----+-----------------------+--------+--------+
| 1  | I want.               | 1      | asds.. |
| 2  | Some.                 | 1      | asds.. |
| 3  | Coconut and Banana !! | 1      | asds.. |
| 1  | Milkshake?            | 2      | wad... |
| 2  | some Nice milk.       | 2      | wad... |
| 1  | bar                   | 3      | foo    |
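
To see why the recursion stops, it can help to try the same pattern outside MySQL. The following is only a rough Python sketch: nth_sentence is a hypothetical helper, and it assumes Python's re module and MySQL's ICU regex engine behave the same way for this simple pattern. It returns the n-th match, or None once there are no more, which is the condition the recursive member tests with "sentences is not null".

```python
import re

# Same pattern as in the query: a run of non-terminators followed by
# one or more terminators (., !, ?) or the end of the string.
SENTENCE = re.compile(r"[^.!?]+(?:[.!?]+|$)")


def nth_sentence(paragraph, n):
    """Hypothetical helper: return the n-th match (1-based) or None,
    roughly what REGEXP_SUBSTR(paragraph, pattern, 1, n) is used for."""
    matches = SENTENCE.findall(paragraph)
    return matches[n - 1] if n <= len(matches) else None


paragraph = "I want. Some. Coconut and Banana !!"
n = 1
# Keep asking for the next occurrence until none is left.
while (sentence := nth_sentence(paragraph, n)) is not None:
    print(n, repr(sentence))
    n += 1
```

In this sketch, every match after the first keeps the space that follows the previous terminator, so trimming the value before storing it is probably worth considering.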

Demo: DB Fiddle.
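
The question also asks for the sentences to be stored with a running id and the original paragraph id as doc_id. If the split is done on the application side instead of in SQL, a minimal sketch could look like the one below. split_rows and the in-memory master_data list are hypothetical stand-ins for the real tables, and the pattern is the one from the query above, applied here with Python's re module rather than MySQL.

```python
import re

# Pattern taken from the answer's query.
SENTENCE = re.compile(r"[^.!?]+(?:[.!?]+|$)")


def split_rows(master_rows):
    """Hypothetical helper: turn (id, title, paragraph) rows into
    (id_sen, sentence, doc_id) rows for a splittext-like table."""
    rows = []
    for doc_id, _title, paragraph in master_rows:
        for match in SENTENCE.findall(paragraph):
            sentence = match.strip()
            if sentence:  # skip whitespace-only fragments
                rows.append((len(rows) + 1, sentence, doc_id))
    return rows


# Example data copied from the question.
master_data = [
    (1, "asds..", "I want. Some. Coconut and Banana !!"),
    (2, "wad...", "Milkshake? some Nice milk."),
]

for row in split_rows(master_data):
    print(row)
```

Unlike the expected output in the question, this sketch keeps the terminating punctuation on each sentence, as the query does; stripping it would be an extra step.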

Regarding mysql - splitting paragraph documents into sentences, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57686770/
