
javascript - Simple phrase detection: splitting by phrase with a regular expression

Reposted. Author: 行者123. Last updated: 2023-12-04 09:15:27

I want to split a string such as:
Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.
Output:

Bangalore 
railway
line
Indian Railway
comes
under
Nagpur
division
Central Railway
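Note that compound nouns stay together because they are written in title case.
I am having trouble with the regex part: split(/(?=\s[a-z]|[A-Z]\s|\.)/)
How can I make it split at the marked position in the "water ꜜ Tor Museum" scenario?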
export function splitByPhrase(text: string) {
  const outputFreq = text
    .split(/(?=\s[a-z]|[A-Z]\s|\.)/)
    .filter(Boolean)
    .map((x) => x.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "").trim())
    .filter((x) => !stopWords.includes(x));

  return outputFreq;
}

describe("phrases", () => {
  it("no punctuation", () => {
    expect(splitByPhrase("test. Toronto")).toEqual(["test", "Toronto"]);
  });
  it("no spaces", () => {
    expect(splitByPhrase(" test Toronto ")).toEqual(["test", "Toronto"]);
  });
  it("simple phrase detection", () => {
    expect(splitByPhrase(" water Tor Museum wants")).toEqual(["water", "Tor Museum", "wants"]);
  });
  it("remove stop words", () => {
    expect(splitByPhrase("Toronto a Museum with")).toEqual(["Toronto", "Museum"]);
  });
});
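A quick way to see where the original pattern falls short is to run it on the failing test input. The sketch below is illustrative only, and its variable names are placeholders that do not come from the question. The alternative [A-Z]\s only matches a capital letter that is immediately followed by whitespace, so nothing matches in front of "Tor" and "water" stays attached to the title-case phrase.

```javascript
// The pattern from the question, applied to the failing test input.
const originalPattern = /(?=\s[a-z]|[A-Z]\s|\.)/;
const pieces = " water Tor Museum wants".split(originalPattern);

// No split happens before " Tor": \s[a-z] needs a lowercase letter after the
// space, and [A-Z]\s needs whitespace directly after the capital letter.
console.log(pieces); // → [" water Tor Museum", " wants"]
```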

Best answer

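You can add another alternative that splits on a space only when what is to its left is not an uppercase character followed by optional lowercase characters (that is, not a title-case word) and what is to its right is an uppercase character.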

(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))
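Read piece by piece, the pattern is a single lookahead with three alternatives. The sketch below rebuilds the same expression from named parts so that each alternative is visible. The constant names are made up for illustration, and the code assumes an engine with lookbehind support (ES2018 or later).

```javascript
// Each alternative of the answer's pattern, kept as a separate source string.
const beforeLowercaseWord = / [a-z]/.source; // a space followed by a lowercase letter
const beforePeriod = /\./.source; // a literal period
// A space that is followed by a capital letter but is not preceded by a
// title-case word, i.e. the start of a new title-case phrase.
const beforeNewPhrase = /(?<!\b[A-Z][a-z]*) (?=[A-Z])/.source;

// Wrapping everything in a lookahead makes the match zero-width, so split()
// cuts in front of the space or period instead of consuming it.
const splitPoint = new RegExp(
  `(?=${beforeLowercaseWord}|${beforePeriod}|${beforeNewPhrase})`
);

const parts = " water Tor Museum wants".split(splitPoint);
console.log(parts); // → [" water", " Tor Museum", " wants"]
```

Because the separators are kept at the front of each piece, the pieces still need to be trimmed and stripped of punctuation afterwards.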

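// Words to drop from the result.
const stopWords = [
  "of", "The", "It", "the", "a", "with"
];

function splitByPhrase(text) {
  return text
    // Split before a lowercase word, before a period, or before the first
    // word of a title-case phrase.
    .split(/(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/)
    // Strip punctuation and surrounding whitespace from each piece.
    .map((x) => x.replace(/[.,\/#!$%^&*;:{}=_`~()-]/g, "").trim())
    // Drop stop words, then drop the empty strings left over from punctuation.
    .filter((x) => !stopWords.includes(x)).filter(Boolean);
}

[
  "Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.",
  "test. Toronto",
  " test Toronto ",
  " water Tor Museum wants",
  "Toronto a Museum with"
].forEach(i => console.log(splitByPhrase(i)));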
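If lookbehind is not available in the target environment, a similar result can be obtained by matching the phrases themselves instead of splitting on the gaps between them. This is a different technique from the one in the answer and is offered only as a sketch. The name matchPhrases is hypothetical, the stop-word list is copied from the answer, and the pattern assumes words made of ASCII letters only, so digits, hyphens, and mixed-case words such as "iPhone" are treated differently than in the split-based version.

```javascript
const stopWords = ["of", "The", "It", "the", "a", "with"];

// Hypothetical alternative: match either a run of title-case words separated
// by single spaces, or a single lowercase word.
function matchPhrases(text) {
  const phrases = text.match(/[A-Z][a-z]*(?: [A-Z][a-z]*)*|[a-z]+/g) || [];
  return phrases.filter((x) => !stopWords.includes(x));
}

console.log(matchPhrases(" water Tor Museum wants")); // → ["water", "Tor Museum", "wants"]
console.log(matchPhrases("Toronto a Museum with")); // → ["Toronto", "Museum"]
```

Punctuation and whitespace never match either alternative, so no separate cleanup step is needed here.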

Regarding "javascript - Simple phrase detection: splitting by phrase with a regular expression", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63247490/
