amazon-web-services - Removing invalid characters from Amazon CloudSearch SDF

Reposted. Author: 行者123. Updated: 2023-12-05 05:28:06
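When I try to post data extracted from a PDF file to an Amazon CloudSearch domain for indexing, the indexing fails because the data contains invalid characters.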

How can I remove these invalid characters before posting to the search endpoint?

I have tried escaping and replacing the characters, without success.

Best answer

When uploading documents to CloudSearch (using the AWS SDK with JSON), I ran into an error like this:

Error with source for field content_stemmed: Validation error for field 'content_stemmed': Invalid codepoint B

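Following the AWS documentation (quoted in the reference below), my solution was to strip the invalid characters from the document before uploading: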

For example, this is what I did in JavaScript:

const cleaned = someFieldValue.replace(
/[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g,
''
)
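The same replacement can be applied to every string field of a batch rather than one value at a time. The following is a minimal sketch, not part of the original answer: the helper name cleanValue and the sample batch (its id, field names, and contents) are hypothetical, and only the regular expression comes from the AWS documentation.

```javascript
// Pattern quoted from the AWS documentation, with the global flag added
// so that every invalid character is removed, not just the first.
const INVALID_CHARS = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Hypothetical helper: recursively strips invalid characters from every
// string inside a value (plain strings, arrays, and nested objects).
function cleanValue(value) {
  if (typeof value === 'string') {
    return value.replace(INVALID_CHARS, '');
  }
  if (Array.isArray(value)) {
    return value.map(cleanValue);
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, cleanValue(inner)])
    );
  }
  return value; // numbers, booleans, null, undefined pass through unchanged
}

// Hypothetical batch with a vertical tab (U+000B) in the extracted text.
const batch = [
  {
    type: 'add',
    id: 'doc-1',
    fields: { title: 'Report', content_stemmed: 'page one\u000Bpage two' },
  },
];

const cleanedBatch = batch.map(cleanValue);
console.log(JSON.stringify(cleanedBatch));
```

Cleaning the whole batch object just before serializing it avoids having to remember which fields came from the PDF extraction step.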

Reference (AWS documentation):

Both JSON and XML batches can only contain UTF-8 characters that are valid in XML. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are invalid and will cause errors.

You can use the following regular expression to match invalid characters so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/
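One side effect worth knowing: without the u flag, a JavaScript regular expression works on UTF-16 code units, so the quoted pattern also removes both halves of every surrogate pair, which means characters outside the Basic Multilingual Plane (emoji, for instance) are dropped. The sketch below contrasts the quoted pattern with a Unicode-aware variant that keeps well-formed supplementary characters and still removes lone surrogates. The variant is an assumption on my part, not something the quoted documentation provides, and whether your CloudSearch domain accepts supplementary characters should be checked before relying on it.

```javascript
// Pattern quoted from the AWS documentation (global flag added).
const STRICT = /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g;

// Assumed variant: the u flag makes the regex match whole code points,
// and the extra range keeps U+10000 to U+10FFFF.
const KEEP_SUPPLEMENTARY =
  /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// Sample: an emoji (valid surrogate pair), a lone surrogate, a vertical tab.
const sample = 'ok \u{1F600} bad\uD800\u000B';

const strictResult = sample.replace(STRICT, '');
const lenientResult = sample.replace(KEEP_SUPPLEMENTARY, '');

console.log(JSON.stringify(strictResult));  // emoji removed along with the rest
console.log(JSON.stringify(lenientResult)); // emoji kept, invalid characters removed
```

If in doubt, the quoted pattern is the conservative choice, since it is the one the documentation itself suggests.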

Regarding "amazon-web-services - Removing invalid characters from Amazon CloudSearch SDF", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14258909/
