javascript - Regex that extracts all subdomains + domain from a URL and is compatible with RFC 3490

Reposted. Author: 搜寻专家. Updated: 2023-11-01 00:34:39

I am looking for a regular expression that can extract all subdomains + the domain from a URL.

I have already found this one:

/([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/

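It can extract the subdomains + domain, but unfortunately it does not reject a hyphen (-) at the start of a subdomain/domain, and it does not support non-ASCII characters as specified in RFC 3490.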
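
To make the two shortcomings concrete, here is a small sketch that runs the expression quoted above against a few inputs. The helper name `firstMatch` and the test inputs `-bad.example.com` and `a|b.com` are illustrative assumptions, not part of the original question. It also shows that the `|` inside the character class is treated as a literal pipe rather than as alternation.

```javascript
// The expression from the question, unchanged.
const original = /([a-z0-9|-]+\.)*[a-z0-9|-]+\.[a-z]+/;

// Hypothetical helper: returns the first full match, or null if there is none.
const firstMatch = (input) => {
  const m = original.exec(input);
  return m ? m[0] : null;
};

// A label starting with a hyphen is accepted.
console.log(firstMatch('http://-bad.example.com/'));     // "-bad.example.com"

// The match stops at the first non-ASCII character.
console.log(firstMatch('https://www.fußballspiel.de/')); // "www.fu"

// "|" inside [...] is a literal character, so it is accepted too.
console.log(firstMatch('a|b.com'));                      // "a|b.com"
```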

Here are some examples I would like to capture:

http://www.例如.中国/
http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html
https://www.fußballspiel.de/
http://www.simulateur-prêt.fr

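Best answer

I put together the following regular expression and commented it heavily, which hopefully describes what is going on. It matches both ASCII and non-ASCII characters and successfully extracts the desired information from your examples.

Regex example: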

const regexp = new RegExp(
  "^" + // Anchors the match to the start of the string (or of a line, because of the "m" flag).
  "(?:\\w+:\\/\\/)?" + // Optionally matches a protocol such as "http://" at the start.
  "(" + // Start of the capture group that holds the full host name.
  "(?:" + // Start of the sub-domain non-capturing group.
  "(?!-)" + // Rejects the ASCII alternative if it would begin with a hyphen.
  "[\\w-]+" + // Matches one or more word characters (letters, digits, underscore) or hyphens.
  "|" + // OR
  "[^\\x00-\\x7F]+-*" + // Matches one or more non-ASCII characters, followed by zero or more hyphens.
  ")+" + // End of the sub-domain group; at least one match is required.
  "\\." + // An escaped dot that serves as the separator after the sub-domain.
  "(?:" + // Start of the domain non-capturing group, including the dot separator.
  "(?:" + // Start of the domain non-capturing group, excluding the dot separator.
  "(?!-)" + // Rejects the ASCII alternative if the domain label would begin with a hyphen.
  "[\\w-]+" + // Matches one or more word characters or hyphens.
  "|" + // OR
  "[^\\x00-\\x7F]+-*" + // Matches one or more non-ASCII characters, followed by zero or more hyphens.
  ")+" + // End of the domain group excluding the dot; at least one match is required.
  "\\." + // An escaped dot that serves as the separator after the domain label.
  ")*" + // End of the domain group including the dot; zero or more matches.
  "(?:" + // Start of the top-level domain non-capturing group.
  "(?!-)" + // Rejects the ASCII alternative if the top-level domain would begin with a hyphen.
  "[\\w-]+" + // Matches one or more word characters or hyphens.
  "|" + // OR
  "[^\\x00-\\x7F]+-*" + // Matches one or more non-ASCII characters, followed by zero or more hyphens.
  ")*" + // End of the top-level domain group; zero or more matches.
  ")", "im"); // End of the capture group and of the regex. The "i" flag makes it case-insensitive; "m" lets "^" also match at line starts.

const urls = [
  'http://www.例如.中国/',
  'http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html',
  'https://www.fußballspiel.de/',
  'http://www.simulateur-prêt.fr'
];

const hostnames = urls.map(url => {
  return regexp.exec(url)[1];
});

hostnames.forEach((hostname, index) => {
  console.log('Input:', urls[index], '\nOutput:', hostname);
});
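
As a possible variation on the approach above (not the answer's own pattern), the same idea can be sketched with Unicode property escapes. It assumes a JavaScript engine that supports the `u` flag with `\p{...}` and lookbehind (for example a recent Node.js). The names `label`, `hostPattern`, and `extractHost` are hypothetical. Unlike the commented regex, this sketch also rejects labels that end with a hyphen, and it returns `null` instead of throwing when nothing matches.

```javascript
// One host label: letters, digits, combining marks, or hyphens,
// but no hyphen at the start or at the end.
const label = '(?!-)[\\p{L}\\p{N}\\p{M}-]+(?<!-)';

const hostPattern = new RegExp(
  '^(?:[a-z][a-z0-9+.-]*:\\/\\/)?' +          // optional scheme, e.g. "http://"
  '((?:' + label + '\\.)+' + label + ')',     // two or more dot-separated labels
  'iu'                                        // case-insensitive, Unicode mode
);

// Hypothetical helper: returns the host name, or null if the input does not match.
function extractHost(url) {
  const match = hostPattern.exec(url);
  return match ? match[1] : null;
}

console.log(extractHost('http://www.例如.中国/'));         // "www.例如.中国"
console.log(extractHost('http://www.simulateur-prêt.fr')); // "www.simulateur-prêt.fr"
console.log(extractHost('http://-bad.example.com/'));      // null
```

This sketch does not handle user info (`user@host`), IP addresses, or single-label hosts such as `localhost`; it is meant only to illustrate the label rule.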

Hope this helps! All the best!
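
As a cross-check on any regex-based result, the built-in WHATWG `URL` class can also parse the host. This is a different technique from the answer above, sketched under the assumption that the runtime provides `URL` (modern browsers and Node.js do). Note that `hostname` comes back in its ASCII (Punycode, `xn--...`) form, which is the encoded form that IDNA (RFC 3490) is concerned with. The helper name `hostnameOf` is hypothetical.

```javascript
// Hypothetical helper: returns the ASCII (Punycode) host name of an
// absolute URL, or null if the input cannot be parsed as one.
function hostnameOf(input) {
  try {
    return new URL(input).hostname; // port, path, and query are excluded
  } catch {
    return null;
  }
}

const asciiHost = hostnameOf(
  'http://www.würstchen.mit.käsebrötchen.de:8080/news/index.html'
);

// Labels with non-ASCII characters are encoded as "xn--..." labels.
console.log(asciiHost);
console.log(hostnameOf('not a url')); // null
```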

Regarding "javascript - Regex that extracts all subdomains + domain from a URL and is compatible with RFC 3490", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57066708/
