
firefox-addon - How does Firefox (Hunspell) sanitize text before spell checking words, and what rules does it use?

Reposted. Author: 行者123. Updated: 2023-12-04 13:35:31

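I'm trying to clean text in exactly the way Firefox does before it spell checks individual words, for a Firefox extension I'm building (my add-on uses nspell, a JavaScript implementation of Hunspell, because Firefox doesn't expose the Hunspell instance it uses through the extension API).

I've looked through the Firefox Gecko codebase, namely mozSpellChecker.h and other related files that turn up when searching for "spellcheck", but I can't work out how they clean the text.

Reverse engineering it has been a major PITA, and so far I have this: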

// cleans text and strips out unwanted symbols/patterns before we use it
// returns an empty string if content undefined
function cleanText (content, filter = true) {
  if (!content) {
    console.warn(`MultiDict: cannot clean falsy or undefined content: "${content}"`)
    return ''
  }

  // ToDo: first split string by spaces in order to properly ignore urls
  const rxUrls = /^(http|https|ftp|www)/
  const rxSeparators = /[\s\r\n.,:;!?_<>{}()[\]"`´^$°§½¼³%&¬+=*~#|/\\]/
  const rxSingleQuotes = /^'+|'+$/g

  // split all content by any character that should not form part of a word
  return content.split(rxSeparators)
    .reduce((acc, string) => {
      // remove any number of single quotes that do not form part of a word i.e. 'y'all' > y'all
      string = string.replace(rxSingleQuotes, '')
      // we never want empty strings, so skip them
      if (string.length < 1) {
        return acc
      }
      // for when we're just cleaning the text of punctuation (i.e. not filtering out emails, etc)
      if (!filter) {
        return acc.concat([string])
      }
      // filter out emails, URLs, numbers, and strings less than 2 characters in length
      if (!string.includes('@') && !rxUrls.test(string) && isNaN(string) && string.length > 1) {
        return acc.concat([string])
      }
      return acc
    }, [])
}

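But when I test it on content such as the textarea used to create this question, I still see large differences between my results and Firefox's.

To be clear: I'm looking for the exact methods, matches, and rules Firefox uses to clean text. Since it's open source, they should be in there somewhere, but I can't seem to find them!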

Best answer

I believe you want the functions in mozInlineSpellWordUtil.cpp.

From the header:

/**
 * This class extracts text from the DOM and builds it into a single string.
 * The string includes whitespace breaks whereever non-inline elements begin
 * and end. This string is broken into "real words", following somewhat
 * complex rules; for example substrings that look like URLs or
 * email addresses are treated as single words, but otherwise many kinds of
 * punctuation are treated as word separators. GetNextWord provides a way
 * to iterate over these "real words".
 *
 * The basic operation is:
 *
 * 1. Call Init with the weak pointer to the editor that you're using.
 * 2. Call SetPositionAndEnd to to initialize the current position inside the
 *    previously given range and set where you want to stop spellchecking.
 *    We'll stop at the word boundary after that. If SetEnd is not called,
 *    we'll stop at the end of the root element.
 * 3. Call GetNextWord over and over until it returns false.
 */
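
The iteration pattern the header describes (initialize, then call GetNextWord until it returns false) maps fairly naturally onto a JavaScript generator. The following is only a minimal sketch: `iterateChunks` is a hypothetical name, not a Firefox or nspell API, and it breaks on whitespace only, which is far simpler than Gecko's real word-breaking rules. It yields each chunk with its offsets so an add-on can map a misspelling back to its position in the text.

```javascript
// Hypothetical helper, not part of Firefox or nspell.
// Yields whitespace-delimited chunks together with their offsets in `text`.
function* iterateChunks (text) {
  const rxChunk = /\S+/g
  let match
  // exec() with the g flag resumes from the previous match until it returns null
  while ((match = rxChunk.exec(text)) !== null) {
    yield {
      text: match[0],
      start: match.index,
      end: match.index + match[0].length
    }
  }
}

for (const chunk of iterateChunks('Mail me@example.com now')) {
  console.log(chunk.start, chunk.end, chunk.text)
}
```

Keeping the offsets around is a design choice: the cleaned words alone are enough to query a dictionary, but not enough to highlight the misspelled range afterwards.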

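You can find the complete source here, but it is fairly complex. For example, here is the method used to classify part of the text as an email address or URL, and that method alone runs to more than 50 lines.

Writing a spell checker seems trivial in principle, but as you can see from the source, it is a major undertaking. I'm not saying you shouldn't try, but as you have probably discovered, the difficulty lies in the details of the edge cases.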
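
One way to approximate this in an add-on is to split on whitespace first, decide per chunk whether it looks like a URL or an email address, and only then split the remaining chunks on punctuation. This is a rough sketch under assumptions: the names `looksLikeUrlOrEmail` and `extractWords` and all three regular expressions are illustrative and are not Firefox's actual rules, which are considerably more involved.

```javascript
// Illustrative heuristics only; these are NOT the rules Gecko uses.
const rxUrlLike = /^(?:https?:\/\/|ftp:\/\/|www\.)/i
const rxEmailLike = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
const rxPunctuation = /[.,:;!?_<>{}()[\]"`^$%&+=*~#|\/\\]+/

// Decide on the whole whitespace-delimited chunk, before any punctuation split.
function looksLikeUrlOrEmail (chunk) {
  return rxUrlLike.test(chunk) || rxEmailLike.test(chunk)
}

function extractWords (text) {
  const words = []
  for (const chunk of text.split(/\s+/)) {
    // URL-like and email-like chunks are skipped as a unit
    if (chunk === '' || looksLikeUrlOrEmail(chunk)) continue
    for (const piece of chunk.split(rxPunctuation)) {
      // drop quotes that wrap a word but keep inner apostrophes (y'all)
      const word = piece.replace(/^'+|'+$/g, '')
      // skip numbers and strings shorter than 2 characters
      if (word.length > 1 && isNaN(word)) words.push(word)
    }
  }
  return words
}

console.log(extractWords("See https://example.com/a.b or mail me@example.com, y'all!"))
```

Deciding on the whole chunk before splitting matters: once a string has been split on `.`, `:` and `/`, a URL has already been broken into fragments that no longer start with a recognizable scheme.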

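As an example, when you decide what constitutes a word boundary, you have to decide which characters to ignore, including characters outside the ASCII range. For example, here you can see that MONGOLIAN TODO SOFT HYPHEN (U+1806) is handled the same way as the ordinary SOFT HYPHEN (U+00AD): both are ignored in the input.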
// IsIgnorableCharacter
//
// These characters are ones that we should ignore in input.

inline bool IsIgnorableCharacter(char ch) {
  return (ch == static_cast<char>(0xAD));  // SOFT HYPHEN
}

inline bool IsIgnorableCharacter(char16_t ch) {
  return (ch == 0xAD ||   // SOFT HYPHEN
          ch == 0x1806);  // MONGOLIAN TODO SOFT HYPHEN
}
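
An add-on that feeds words to nspell could mirror this by removing the same two code points before the lookup. A minimal sketch, assuming a hypothetical helper named `stripIgnorable`; it covers only the two code points shown in the C++ above, and Firefox may ignore or normalize other characters elsewhere in its code.

```javascript
// U+00AD SOFT HYPHEN and U+1806 MONGOLIAN TODO SOFT HYPHEN,
// the two code points listed in the C++ snippet above.
const rxIgnorable = /[\u00AD\u1806]/g

// Hypothetical helper: removes ignorable characters from a word
// before it is passed to the spell checker.
function stripIgnorable (word) {
  return word.replace(rxIgnorable, '')
}

console.log(stripIgnorable('co\u00ADoperate')) // prints "cooperate"
```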

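Again, I'm not trying to discourage you from working on this project, but tokenizing text into discrete words in a way that works in an HTML context and in a multilingual environment is a major effort.

Regarding "firefox-addon - How does Firefox (Hunspell) sanitize text before spell checking words, and what rules does it use?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62290285/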
