gpt4 book ai didi

sql - 从 SQL Server 列中删除 HTML

转载 作者:行者123 更新时间:2023-12-02 22:06:25 26 4
gpt4 key购买 nike

我们曾经使用所见即所得的编辑器,并且在 SQL Server 2008 R2 表中有四个文本类型的列。我想删除所有的 HTML 标签。这在 ASP(经典 ASP)页面中使用时有效:

Function RemoveHTML(strText )
Dim RegEx

Set RegEx = New RegExp

RegEx.Pattern = "<[^>]*>"
RegEx.Global = True

RemoveHTML = RegEx.Replace(strText, "")
End Function

但是我想要一个不同的解决方案,也许是 SQL 驱动的。非常想要一些帮助。正则表达式可以这样用在SQL语句中吗?有什么帮助建议会很好吗?我想我想保留 break 标签,但其余的都可以。

蒂亚

最佳答案

让我们创建一个示例表以供讨论

create table #tmp (id int identity primary key, sometext varchar(max));
insert #tmp values ('This <has some> HTML in it <and some more> yeah>');
insert #tmp values ('No HTML here');
insert #tmp values ('This is > than that');
insert #tmp values ('And This is < than that');
insert #tmp values ('');
insert #tmp values ('I have many blanks. Don''t lose them <y>');
insert #tmp values (null);
insert #tmp values ('<b>This is bold</b> and <i>this is in italics</i>');
insert #tmp values ('I <<<<stttuttter> a lot <>');
GO

接下来,如果您的数据库中还没有 Numbers(又名 Tally)表,请创建一个。

create table Numbers (number int identity primary key);
GO
insert Numbers default values
GO 1000

最好使用多语句 TSQL 来解决这个问题,它可以利用索引并具体化中间结果。这里的第一步是收集所有没有被 < 包围的字符。和 >标签。

create table #tmp2 (id int, number int, primary key(id,number), char char(1));

insert #tmp2
select u.id, N.Number, substring(u.sometext, N.number, 1) char
from #tmp u
join Numbers N on N.number <= len(u.sometext)
left join
(
select t.id, N.number lt, charindex('>', t.sometext, N.number+1) gt
from #tmp t
join Numbers N on substring(t.sometext, N.number, 1) = '<'
) exclusions on u.id = exclusions.id and n.number between exclusions.lt and exclusions.gt
where exclusions.id is null
GO

下一步只是使用 FOR XML 压缩它们.

update #tmp
set sometext = (select char+''
from #tmp2 b
where b.id = #tmp.id
order by Number
for xml path(''), type).value('/','nvarchar(max)')
where sometext like '%<%>%'

再次检查我们的表

select * from #tmp

id sometext
----------- --------------------------------------------------
1 This HTML in it yeah>
2 No HTML here
3 This is > than that
4 And This is < than that
5
6 I have many blanks. Don't lose them
7 NULL
8 This is bold and this is in italics
9 I a lot

对于一个小表,在单个查询中进行行内排除(如上所述)就足够了。对于更大的表,CTE 方法似乎效果更好,因为它只遍历数据一次。上面代码中的表可以通过在创建后对表运行这些来扩大:

-- replicate the table data to about 20K records
insert #tmp select sometext from #tmp
GO 11

-- expand each string by 8 times the original length
update #tmp set sometext = sometext + sometext
GO 3

并使用它来创建#tmp2 表

;with cte(id, pos, sometext, size, char, flag) as (
select id, 1, sometext, len(sometext), substring(sometext, 1, 1),
case when substring(sometext, 1, 1) = '<' then 1 else 0 end
from #tmp
where sometext like '%<%>%'
union all
select id, pos+1, sometext, size, substring(sometext, pos+1, 1),
case when substring(sometext, pos+1, 1) = '<' then 1 -- starts a new html tag section
when char = '>' then 0 -- closed in prior iteration
else flag end -- continue flag
from cte
where pos < len(sometext)
)
insert #tmp2 (id, Number, char)
select id, pos, char
from cte
where flag = 0
option (maxrecursion 500)
GO

这导致时间

  • 62 秒创建#tmp2
  • 1 秒使用#tmp2 更新#tmp

该算法具有线性复杂性,因此如果您的字符串较短,或者您的记录较少,只需进行近似缩放以了解该过程需要多长时间。

关于sql - 从 SQL Server 列中删除 HTML,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/16049537/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com