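We are developing a system that performs fuzzy matching across more than 50 international languages, using the UTF-8, UTF-16, and UTF-32 Unicode encoding forms. So far, we have been able to use Levenshtein distance to detect misspellings of German words that contain extended Unicode characters.
We would like to extend this system to handle Mandarin Chinese ideographs represented in Unicode. How would we go about computing Levenshtein distance between similar Chinese characters?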
Best answer
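First, to clarify: a Chinese character is not equivalent to a German or English word. Most of the things you would consider words (using a semantic or syntactic definition of "word") consist of 1-3 characters. It is straightforward to apply Levenshtein distance to such character sequences by representing them as sequences of UCS-2 or UCS-4 code points. Because most words are short (especially words of 1 or 2 characters), however, this may be of limited use.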
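As a rough illustration of the code-point approach described above, here is a minimal Python sketch. It relies on the fact that a Python `str` is already a sequence of Unicode code points, so no explicit UCS-4 conversion is needed. The function name `levenshtein` is chosen for this example and does not come from the original discussion.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]


# Two-character words that differ in one character are one edit apart.
print(levenshtein("中国", "中文"))
# Any two distinct single characters are also exactly one edit apart,
# however similar or dissimilar they look.
print(levenshtein("好", "妈"), levenshtein("好", "吗"))
```

The last line shows why this is of limited use for short words: at the code-point level, every pair of different characters gets the same distance of 1.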
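However, since your question is specifically about the edit distance between individual characters, I believe a different approach is required, and it may be very difficult indeed.
First, you would have to represent each character as a sequence of the components or strokes it consists of. There are two problems: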
To expedite locating specific Han ideographic characters in the code charts, radical-stroke indices are provided on the Unicode web site. [...] The most influential authority for radical-stroke information is the eighteenth-century KangXi dictionary, which contains 214 radicals. The main problem in using KangXi radicals today is that many simplified characters are difficult to classify under any of the 214 KangXi radicals. As a result, various modern radical sets have been introduced. None, however, is in general use, and the 214 KangXi radicals remain the best known. [...] The Unicode radical-stroke charts are based on the KangXi radicals. The Unicode Standard follows a number of different sources for radical-stroke classification. Where two sources are at odds as to radical or stroke count for a given character, the character is shown in both positions in the radical-stroke charts.
In particular, Ideographic Description Sequences should not be used to provide alternative graphic representations of encoded ideographs in data interchange. Searching, collation, and other content-based text operations would then fail.
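To make the idea of comparing individual characters by their components more concrete, here is a small Python sketch. It is only an illustration under strong assumptions: the `COMPONENTS` table is hypothetical and hand-written for three characters, each character is split one level deep into left and right parts, and the left-to-right ordering is an arbitrary choice. Real decomposition data would have to come from an external source, and the choice of "atomic" components is not uniquely defined.

```python
# Hypothetical, hand-written decomposition table (illustration only).
# Each character maps to its components in left-to-right order.
COMPONENTS = {
    "好": ("女", "子"),
    "妈": ("女", "马"),
    "吗": ("口", "马"),
}


def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences of hashable items."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(
                previous[j] + 1,         # delete x
                current[j - 1] + 1,      # insert y
                previous[j - 1] + cost,  # substitute x with y (or match)
            ))
        previous = current
    return previous[-1]


def character_distance(x: str, y: str) -> int:
    """Edit distance between the component sequences of two characters.

    Characters missing from the table are treated as a single atomic
    component (an assumption made for this sketch).
    """
    return edit_distance(COMPONENTS.get(x, (x,)), COMPONENTS.get(y, (y,)))


print(character_distance("好", "妈"))  # the two share the left component 女
print(character_distance("好", "吗"))  # no shared components in this table
```

Unlike a comparison of code points, this distinguishes character pairs that share a component from pairs that share none, but the result depends entirely on the decomposition table and the ordering chosen.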
This text is based on a similar question found on Stack Overflow, "c++ - How to determine the Levenshtein distance for Mandarin characters?": https://stackoverflow.com/questions/12380619/