
javascript - Split a string into lines and sentences, but ignore abbreviations

Reposted. Author: 行者123. Updated: 2023-11-30 00:02:25

I have some string content that I need to split. First, I need to split the content into lines.

This is how I do that:

str.split('\n').forEach((item) => {
    if (item) {
        // TODO: split also each line into sentences

        let data = {
            type   : 'item',
            content: [{
                content  : item,
                timestamp: Math.floor(Date.now() / 1000)
            }]
        };

        // Save `data` to DB
    }
});

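But now I also need to split each line into sentences. The difficulty for me is splitting it correctly. My plan is to split each line on ". " (a dot followed by a space). However, there is also a set of abbreviations that must not cause a split: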

const abbr = ['vs.', 'min.', 'max.']; // Just an example; there are 70 abbreviations in that array

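...and there are some additional rules:

  1. Any number followed by a dot, or any single letter followed by a dot, must also be ignored as a split point: 1. 2. 30. A. b.
  2. Case must be ignored: "Max. Lorem ipsum" should not be split, and neither should "Lorem max. ipsum".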
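
The two rules above, together with the abbreviation list, amount to a single check on a dot-terminated token. A minimal sketch of that check follows; the names isProtectedToken and sampleAbbr are hypothetical and do not appear in the question, and the sketch assumes tokens are already separated by whitespace.

```javascript
// Hypothetical sample list; the question mentions about 70 real entries.
const sampleAbbr = ['vs.', 'min.', 'max.'];

// Hypothetical helper: returns true when a token such as "min." or "2."
// must NOT be treated as the end of a sentence.
function isProtectedToken(token, abbreviations) {
  const lower = token.toLowerCase();
  // Rule 2: abbreviations match regardless of case ("Max." equals "max.").
  if (abbreviations.some((a) => a.toLowerCase() === lower)) {
    return true;
  }
  // Rule 1: digits plus a dot ("1.", "30.") or a single letter plus a dot ("A.", "b.").
  return /^(\d+|[a-z])\.$/i.test(token);
}

console.log(isProtectedToken('Max.', sampleAbbr));
console.log(isProtectedToken('lines.', sampleAbbr));
```

A token that passes this check keeps its dot without ending the sentence; any other token ending in a dot is a candidate sentence boundary.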

Example

const str = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';

The result should be four data objects:

{ type: 'item', content: [{ content: 'Just some examples:', timestamp: 123 }] }
{ type: 'item', content: [{ content: 'This example has min. 2 lines.', timestamp: 123 }] }
{ type: 'item', content: [{ content: 'Max. 10 lines.', timestamp: 123 }] }
{ type: 'item', content: [{ content: 'There are some words: 1. Foo and 2. bar.', timestamp: 123 }] }

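Best answer

You could first detect the abbreviations and numberings in the string and replace the dot in each of them with a placeholder string. After splitting the string on the remaining dots, which mark the ends of sentences, you can restore the original dots. Once you have the sentences, you can split each one on newline characters as in your original code.

The updated code allows more than one dot in an abbreviation (as shown with p.o. and s.v.p.).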

var i, j, strRegex, regex, abbrParts;
const DOT = "_DOT_"; // placeholder; assumed not to occur in the input
const abbr = ["p.o.", "s.v.p.", "vs.", "min.", "max."];

var str = 'Just some examples:\nThis example s.v.p. has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar. And also A. p.o. professional letters.';

console.log("String: " + str);

// Replace dots in abbreviations found in the string.
// Abbreviation parts are used as regex source, so they must not contain regex metacharacters.
for (i = 0; i < abbr.length; i++) {
    abbrParts = abbr[i].split(".");
    strRegex = "(\\W*" + abbrParts[0] + ")";
    for (j = 1; j < abbrParts.length - 1; j++) {
        strRegex += "(\\.)(" + abbrParts[j] + ")";
    }
    strRegex += "(\\.)(" + abbrParts[abbrParts.length - 1] + "\\W*)";
    regex = new RegExp(strRegex, "gi");
    str = str.replace(regex, function () {
        // arguments: full match, capture groups, then offset and input string
        var groups = arguments;
        var result = groups[1];
        for (j = 2; j < groups.length; j += 2) {
            result += (groups[j] === "." ? DOT + groups[j + 1] : "");
        }
        return result;
    });
}

// Replace dots in numbers found in the string
str = str.replace(/(\W*\d+)(\.)/gi, "$1" + DOT);

// Replace dots in letter numbering found in the string
str = str.replace(/(\W+[a-zA-Z])(\.)/gi, "$1" + DOT);

// Split the string at the remaining dots
var parts = str.split(".");

// Restore dots in sentences
var sentences = [];
regex = new RegExp(DOT, "gi");
for (i = 0; i < parts.length; i++) {
    if (parts[i].trim().length > 0) {
        sentences.push(parts[i].replace(regex, ".").trim() + ".");
        // Index by the sentence count, since empty parts are skipped
        console.log("Sentence " + sentences.length + ": " + sentences[sentences.length - 1]);
    }
}
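
The snippet above leaves the first sentence spanning the newline and does not build the data objects from the question. As a rough sketch of how the two steps might be combined, the following splits on newlines first and then walks whitespace-separated tokens. Note that this is a token-based variant, not the placeholder technique used in the answer, and the names splitSentences and toItems are hypothetical. It assumes sentences are separated by whitespace, collapses runs of whitespace to a single space, and will not split after a sentence that happens to end with an abbreviation, a number, or a single letter.

```javascript
// Hypothetical helper: splits one line into sentences, skipping protected tokens.
function splitSentences(line, abbreviations) {
  const known = new Set(abbreviations.map((a) => a.toLowerCase()));
  const sentences = [];
  let current = [];
  for (const token of line.split(/\s+/).filter(Boolean)) {
    current.push(token);
    // Protected: a listed abbreviation (any case), digits + dot, or single letter + dot.
    const isProtected =
      known.has(token.toLowerCase()) || /^(\d+|[a-z])\.$/i.test(token);
    if (token.endsWith('.') && !isProtected) {
      sentences.push(current.join(' '));
      current = [];
    }
  }
  // Keep trailing text that has no closing dot (for example "Just some examples:").
  if (current.length > 0) {
    sentences.push(current.join(' '));
  }
  return sentences;
}

// Hypothetical helper: builds the data objects described in the question.
function toItems(str, abbreviations) {
  const timestamp = Math.floor(Date.now() / 1000);
  return str
    .split('\n')
    .flatMap((line) => splitSentences(line, abbreviations))
    .map((content) => ({ type: 'item', content: [{ content, timestamp }] }));
}

const example = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';
const items = toItems(example, ['vs.', 'min.', 'max.']);
console.log(items.map((item) => item.content[0].content));
```

For the example string from the question this yields the four expected content values; each resulting object can then be saved the same way as `data` in the original loop.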

Regarding "javascript - Split a string into lines and sentences, but ignore abbreviations", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39903391/
