
sql-server - Special character (Hawaiian "Okina") leads to weird string behavior

Reposted. Author: 行者123. Updated: 2023-12-02 07:31:10

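The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. What's going on here? Am I missing something? Do other characters suffer from this same problem?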

SELECT UNICODE(N'ʻ') -- Returns 699 as expected.

SELECT REPLACE(N'"ʻ', '"', '_') -- Returns "ʻ, I expected _ʻ

SELECT REPLACE(N'aʻ', 'a', '_') -- Returns aʻ, I expected _ʻ

SELECT REPLACE(N'"ʻ', N'ʻ', '_') -- Returns __, I expected "_

SELECT REPLACE(N'-', N'ʻ', '_') -- Returns -, I expected -

It also behaves strangely when used in LIKE, for example:

DECLARE @table TABLE ([Name] NVARCHAR(MAX))
INSERT INTO
    @table
VALUES
    ('John'),
    ('Jane')

SELECT
    *
FROM
    @table
WHERE
    [Name] LIKE N'%ʻ%' -- This returns both records. I expected none.

Best answer

The Hawaiian quote has some weird behavior in T-SQL when using it in conjunction with string functions. ... Do other characters suffer from this same problem?

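A few things:

  1. This is not a Hawaiian "quote": it is a "glottal stop", which affects pronunciation.
  2. This is not "weird" behavior: it is just not what you were expecting.
  3. This behavior is not specifically a "problem", though yes, there are other characters that exhibit similar behavior. For example, the following character (U+02DA Ring Above) behaves slightly differently depending on which side of a character it is on: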

    SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'˚a',  N'_'); -- Returns a_a
    SELECT REPLACE(N'a˚aa' COLLATE Latin1_General_100_CI_AS, N'a˚', N'_'); -- Returns _aa

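Now, anyone using SQL Server 2008 or newer should be using a 100 (or newer) level collation. The 100 series added many sort weights and uppercase/lowercase mappings that are not in the 90 series, the non-numbered series, or the mostly obsolete SQL Server collations (those with names starting with SQL_).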
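To see which collation is currently in effect and which version 100 collations are available, something like the following might help. This is a minimal sketch: the column aliases are made up for illustration, and the filter on the Latin1_General family is only an assumption about which family is relevant to you.

```sql
-- Collation of the current database and of the instance.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS [DatabaseCollation],
       SERVERPROPERTY('Collation')                AS [InstanceCollation];

-- Version 100 collations in the Latin1_General family.
-- [_] matches a literal underscore inside a LIKE pattern.
SELECT [name], [description]
FROM   sys.fn_helpcollations()
WHERE  [name] LIKE N'Latin1[_]General[_]100[_]%'
ORDER BY [name];
```

If the first query reports a name starting with SQL_ or one without a version number, string literals and variables in that database use the older sort weights unless a COLLATE clause overrides them.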

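The issue here is not that it fails to equate to any other character (outside of a binary collation); in fact, it does equate to one other character (U+0312 Combining Turned Comma Above):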

;WITH nums AS
(
SELECT TOP (65536) (ROW_NUMBER() OVER (ORDER BY @@MICROSOFTVERSION) - 1) AS [num]
FROM [master].sys.all_columns ac1
CROSS JOIN [master].sys.all_columns ac2
)
SELECT nums.[num] AS [INTvalue],
CONVERT(BINARY(2), nums.[num]) AS [BINvalue],
NCHAR(nums.[num]) AS [Character]
FROM nums
WHERE NCHAR(nums.[num]) = NCHAR(0x02BB) COLLATE Latin1_General_100_CI_AS;
/*
INTvalue BINvalue Character
699 0x02BB ʻ
786 0x0312 ̒
*/
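As a quick cross-check of that result, the two code points can be compared directly under a linguistic collation and under a binary one. This is only a sketch, assuming both collations exist on the instance (SQL Server 2008 or newer); the column aliases are hypothetical.

```sql
-- U+02BB (Modifier Letter Turned Comma) vs. U+0312 (Combining Turned Comma Above).
SELECT CASE WHEN NCHAR(0x02BB) = NCHAR(0x0312) COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'different'
       END AS [LinguisticCompare], -- 'equal', consistent with the query result above
       CASE WHEN NCHAR(0x02BB) = NCHAR(0x0312) COLLATE Latin1_General_100_BIN2
            THEN 'equal' ELSE 'different'
       END AS [BinaryCompare];     -- 'different': the code point values differ
```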

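The problem is that this is a "spacing modifier" character, so it attaches to the character before or after it and modifies that character's meaning or pronunciation, depending on which modifier character you are dealing with.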

According to the Unicode Standard, Chapter 7 (Europe-I), Section 7.8 (Modifier Letters), page 323 (of the document, not of the PDF):

7.8 Modifier Letters

Modifier letters, in the sense used in the Unicode Standard, are letters or symbols that are typically written adjacent to other letters and which modify their usage in some way. They are not formally combining marks (gc = Mn or gc = Mc) and do not graphically combine with the base letter that they modify. They are base characters in their own right. The sense in which they modify other letters is more a matter of their semantics in usage; they often tend to function as if they were diacritics, indicating a change in pronunciation of a letter, or otherwise distinguishing a letter’s use. Typically this diacritic modification applies to the character preceding the modifier letter, but modifier letters may sometimes modify a following character. Occasionally a modifier letter may simply stand alone representing its own sound.
...

Spacing Modifier Letters: U+02B0–U+02FF

Phonetic Usage. The majority of the modifier letters in this block are phonetic modifiers, including the characters required for coverage of the International Phonetic Alphabet. In many cases, modifier letters are used to indicate that the pronunciation of an adjacent letter is different in some way—hence the name “modifier.” They are also used to mark stress or tone, or may simply represent their own sound.

 
The following examples should help illustrate. I am using a 100-level collation, and it needs to be accent-sensitive (i.e. the name contains _AS):

SELECT REPLACE(N'ʻ'    COLLATE Latin1_General_100_CI_AS, N'ʻ',   N'_'); -- Returns _
SELECT REPLACE(N'ʻa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _a
SELECT REPLACE(N'ʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻ', N'_'); -- Returns __aa

SELECT REPLACE(N'ʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa', N'_'); -- Returns ʻ__
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'ʻa', N'_'); -- Returns aʻ__

SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻ', N'_'); -- Returns _aa
SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'aʻa', N'_'); -- Returns _a

SELECT REPLACE(N'aʻaa' COLLATE Latin1_General_100_CI_AS, N'a', N'_'); -- Returns aʻ__
SELECT REPLACE(N'אʻaa' COLLATE Latin1_General_100_CI_AS, N'א', N'_'); -- Returns אʻaa
SELECT REPLACE(N'ffʻaa' COLLATE Latin1_General_100_CI_AS, N'ff', N'_'); -- Returns ffʻaa
SELECT REPLACE(N'ffaa' COLLATE Latin1_General_100_CI_AS, N'ff', N'_'); -- Returns _aa



SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AS); -- 3
SELECT CHARINDEX(N'a', N'aʻa' COLLATE Latin1_General_100_CI_AI); -- 1



SELECT 1 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AS; -- (0 rows returned)
SELECT 2 WHERE N'a' = N'aʻ' COLLATE Latin1_General_100_CI_AI; -- 2

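If you need to handle these characters in a way that ignores their intended linguistic behavior, then you must use a binary collation. In that case, use the most recent level of collation, and use BIN2 rather than BIN (assuming you are on SQL Server 2005 or newer). Meaning:

  • SQL Server 2000: Latin1_General_BIN
  • SQL Server 2005: Latin1_General_BIN2
  • SQL Server 2008, 2008 R2, 2012, 2014, and 2016: Latin1_General_100_BIN2
  • SQL Server 2017 and newer: Japanese_XJIS_140_BIN2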
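Applied to the statements from the question, a binary collation might look like the sketch below. It assumes SQL Server 2008 or newer, so that Latin1_General_100_BIN2 is available, and it forces the collation per expression with COLLATE instead of changing any column or database default.

```sql
-- Binary collation: code units are compared by value, so U+02BB
-- no longer attaches to the neighboring character.
SELECT REPLACE(N'"ʻ' COLLATE Latin1_General_100_BIN2, N'"', N'_'); -- _ʻ
SELECT REPLACE(N'aʻ' COLLATE Latin1_General_100_BIN2, N'a', N'_'); -- _ʻ
SELECT REPLACE(N'"ʻ' COLLATE Latin1_General_100_BIN2, N'ʻ', N'_'); -- "_

DECLARE @table TABLE ([Name] NVARCHAR(MAX));
INSERT INTO @table ([Name]) VALUES (N'John'), (N'Jane');

SELECT *
FROM   @table
WHERE  [Name] LIKE N'%ʻ%' COLLATE Latin1_General_100_BIN2; -- no rows
```

Keep in mind that a binary collation is also case-sensitive and accent-sensitive, so this trade-off only makes sense where exact code point matching is what you want.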

If you are curious why I make that recommendation, please see:

Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2)

And, for more info on collations / Unicode / encodings / etc., please visit: Collations Info

Regarding sql-server - Special character (Hawaiian "Okina") leads to weird string behavior, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55455166/
