firefox-addon - How does Firefox (Hunspell) clean up text before spell checking words, and what does it clean?


Reposted by: 行者123 Updated: 2023-12-04 13:35:31

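I'm trying to clean text in exactly the way Firefox does before it spell checks individual words, for a Firefox extension I'm building (my add-on uses nspell, a JavaScript implementation of Hunspell, because Firefox does not expose the Hunspell instance it uses through the extension API).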

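I've looked through a clone of Firefox's Gecko codebase, namely mozSpellChecker.h and other related files I found by searching for "spellcheck", but I can't work out how they clean the text.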

Reverse engineering it has been a major PITA. So far I have this:

// cleans text and strips out unwanted symbols/patterns before we use it
// returns an empty string if content undefined
function cleanText (content, filter = true) {
  if (!content) {
    console.warn(`MultiDict: cannot clean falsy or undefined content: "${content}"`)
    return ''
  }

  // ToDo: first split string by spaces in order to properly ignore urls
  const rxUrls = /^(http|https|ftp|www)/
  const rxSeparators = /[\s\r\n.,:;!?_<>{}()[\]"`´^$°§½¼³%&¬+=*~#|/\\]/
  const rxSingleQuotes = /^'+|'+$/g

  // split all content by any character that should not form part of a word
  return content.split(rxSeparators)
    .reduce((acc, string) => {
      // remove any number of single quotes that do not form part of a word i.e. 'y'all' > y'all
      string = string.replace(rxSingleQuotes, '')
      // we never want empty strings, so skip them
      if (string.length < 1) {
        return acc
      }
      // for when we're just cleaning the text of punctuation (i.e. not filtering out emails, etc)
      if (!filter) {
        return acc.concat([string])
      }
      // filter out emails, URLs, numbers, and strings less than 2 characters in length
      if (!string.includes('@') && !rxUrls.test(string) && isNaN(string) && string.length > 1) {
        return acc.concat([string])
      }
      return acc
    }, [])
}

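But when I test it against content such as the textarea used to write this question, I still see big differences between my results and Firefox's.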

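To be clear: I'm looking for the exact methods, matches, and rules Firefox uses to clean text. Since it's open source, they should be in there somewhere, but I can't seem to find them!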

Best answer

I believe you want the functionality in mozInlineSpellWordUtil.cpp.

From the header:

/**
 *    This class extracts text from the DOM and builds it into a single string.
 *    The string includes whitespace breaks whereever non-inline elements begin
 *    and end. This string is broken into "real words", following somewhat
 *    complex rules; for example substrings that look like URLs or
 *    email addresses are treated as single words, but otherwise many kinds of
 *    punctuation are treated as word separators. GetNextWord provides a way
 *    to iterate over these "real words".
 *
 *    The basic operation is:
 *
 *    1. Call Init with the weak pointer to the editor that you're using.
 *    2. Call SetPositionAndEnd to to initialize the current position inside the
 *       previously given range and set where you want to stop spellchecking.
 *       We'll stop at the word boundary after that. If SetEnd is not called,
 *       we'll stop at the end of the root element.
 *    3. Call GetNextWord over and over until it returns false.
 */
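
The first step the header describes, flattening DOM text into one string with breaks wherever non-inline elements begin and end, might be sketched in JavaScript roughly as follows. This is only an illustration, not Gecko's code: the node shape (plain strings for text, `{ tag, children }` for elements), the `flattenText` helper, and the `BLOCK_TAGS` set are all assumptions made so the example runs outside a browser.

```javascript
// Hypothetical node shape: a string is a text node, an object is an element.
// BLOCK_TAGS is an illustrative subset, not Gecko's definition of "non-inline".
const BLOCK_TAGS = new Set(['p', 'div', 'li', 'br', 'blockquote'])

function flattenText (node) {
  if (typeof node === 'string') {
    return node
  }
  const inner = node.children.map(flattenText).join('')
  // insert a whitespace break where a non-inline element begins and ends
  return BLOCK_TAGS.has(node.tag) ? `\n${inner}\n` : inner
}

const tree = {
  tag: 'div',
  children: [
    { tag: 'p', children: ['spell', { tag: 'b', children: ['check'] }] },
    { tag: 'p', children: ['me'] }
  ]
}

const flat = flattenText(tree)
const flatWords = flat.split(/\s+/).filter(Boolean)
```

The inline `b` element does not break "spellcheck" apart, while the two paragraphs stay separate words even though no whitespace sits between them in the markup.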

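You can find the complete source here, but it is fairly complex. For example, here is the method used to classify parts of the text as email addresses or URLs, and that method alone runs to more than 50 lines.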
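
To illustrate the idea of treating URL-like and email-like substrings as single words, here is a minimal sketch that also addresses the ToDo in the question's code by splitting on whitespace before splitting on punctuation. The heuristics are assumptions and far simpler than Gecko's: `splitRealWords`, the regular expressions, and the `check` flag are hypothetical names, not anything taken from the Firefox source.

```javascript
// Assumed heuristics: a chunk "looks like" a URL if it starts with a scheme
// or "www.", and like an email if it has an @ between two non-empty parts.
const rxUrlLike = /^(?:[a-z][a-z0-9+.-]*:\/\/|www\.)\S+$/i
const rxEmailLike = /^[^\s@]+@[^\s@]+$/
const rxPunctuation = /[.,:;!?_<>{}()[\]"`^$%&+=*~#|\/\\]+/
const rxEdgeQuotes = /^'+|'+$/g

function splitRealWords (text) {
  const words = []
  // split on whitespace first so URLs and emails stay in one piece
  for (const chunk of text.split(/\s+/)) {
    if (chunk === '') continue
    // drop trailing sentence punctuation before classifying the chunk
    const trimmed = chunk.replace(/[.,;:!?)]+$/, '')
    if (rxUrlLike.test(trimmed) || rxEmailLike.test(trimmed)) {
      // kept as a single word, but flagged so it is not spell checked
      words.push({ text: trimmed, check: false })
      continue
    }
    // everything else is broken apart at punctuation
    for (const part of chunk.split(rxPunctuation)) {
      const word = part.replace(rxEdgeQuotes, '')
      if (word !== '') words.push({ text: word, check: true })
    }
  }
  return words
}

const sample = splitRealWords("See https://example.com/a.b, or mail me@example.org! 'y'all'")
const toCheck = sample.filter(w => w.check).map(w => w.text)
```

Only the entries with `check: true` would be handed to the dictionary. Keeping the skipped chunks in the result, rather than discarding them, makes it easier to map words back to positions in the original text later.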

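Writing a spell checker seems trivial in principle, but as you can see from the source, it is a substantial undertaking. I'm not saying you shouldn't try, but as you have probably discovered, the difficulty lies in the details of the edge cases.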

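As an example, when you decide what constitutes a word boundary, you have to decide which characters to ignore, including characters outside the ASCII range. For example, here you can see that MONGOLIAN TODO SOFT HYPHEN (U+1806) is treated the same way as the ordinary SOFT HYPHEN (U+00AD):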

// IsIgnorableCharacter
//
//    These characters are ones that we should ignore in input.

inline bool IsIgnorableCharacter(char ch) {
  return (ch == static_cast<char>(0xAD));  // SOFT HYPHEN
}

inline bool IsIgnorableCharacter(char16_t ch) {
  return (ch == 0xAD ||   // SOFT HYPHEN
          ch == 0x1806);  // MONGOLIAN TODO SOFT HYPHEN
}
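
In a JavaScript add-on, the same two code points could be stripped from a word before it is looked up in the dictionary. This is a sketch under assumptions: `stripIgnorable` is a hypothetical helper, and the excerpt above does not show how Gecko uses `IsIgnorableCharacter`, so removing the characters outright is my reading of "ignore".

```javascript
// U+00AD SOFT HYPHEN and U+1806 MONGOLIAN TODO SOFT HYPHEN, the two code
// points listed in the excerpt above
const rxIgnorable = /[\u00AD\u1806]/g

// Hypothetical helper: remove ignorable characters before a dictionary lookup
function stripIgnorable (word) {
  return word.replace(rxIgnorable, '')
}

const cleaned = stripIgnorable('co\u00ADoperate')
```

Soft hyphens are invisible in most rendered text, so a word copied from a web page can look correct and still fail a naive dictionary lookup unless they are removed first.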

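Again, I'm not trying to discourage you from this project, but tokenizing text into discrete words in a way that works in an HTML context and in multilingual environments is a major effort.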

Regarding "firefox-addon - How does Firefox (Hunspell) clean up text before spell checking words, and what does it clean?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62290285/
