
c# - Splitting HTML into words

Reposted. Author: 太空宇宙. Updated: 2023-11-03 22:27:23

Suppose I have the following string:

Hellotoevryone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsogladtoseeall.

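This string is a run of characters with no separating spaces, with an HTML image tag embedded in it. I want to split it into words of 10 characters each, so the output should be: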

1)Hellotoevr
2)yone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsog
3)ladtoseeal
4)l.

So the idea is to treat the content of any HTML tag as having zero length.

I have written the following method, but it does not take HTML tags into account:

public static string EnsureWordLength(this string target, int length)
{
    string[] words = target.Split(' ');
    for (int i = 0; i < words.Length; i++)
    {
        if (words[i].Length > length)
        {
            var possible = true;
            var ord = 1;
            do
            {
                var lengthTmp = length * ord + ord - 1;
                if (lengthTmp < words[i].Length) words[i] = words[i].Insert(lengthTmp, " ");
                else possible = false;
                ord++;
            } while (possible);
        }
    }

    return string.Join(" ", words);
}

I would like to see code that performs the split as I described. Thanks.

Best answer

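Here is a regex solution that does what you asked. Keep in mind that it may stop working if you change your requirements even slightly, which is true to the well-known quote linked in the original answer.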

using System;
using System.Text.RegularExpressions;

string[] samples = {
    @"Hellotoevryone<img height=""115"" width=""150"" alt="""" src=""/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg"" />Iamsogladtoseeall.",
    "Testing123Hello.World",
    @"Test<a href=""http://stackoverflow.com"">StackOverflow</a>",
    @"Blah<a href=""http://stackoverflow.com"">StackOverflow</a>Blah<a href=""http://serverfault.com"">ServerFault</a>",
    @"Test<a href=""http://serverfault.com"">Server Fault</a>", // has a space, not matched
    "Stack Overflow" // has a space, not matched
};

// use these 2 lines if you don't want to use regex comments
//string pattern = @"^((?:\S(?:\<[^>]+\>)?){1,10})+$";
//Regex rx = new Regex(pattern);

// regex comments spanning multiple lines require RegexOptions.IgnorePatternWhitespace
string pattern = @"^(       # match line/string start, begin group
    (?:\S                   # match (but don't capture) a non-whitespace char
        (?:\<[^>]+\>)?      # optionally match (doesn't capture) an html <...> tag
                            # to match img tags only change to (?:\<img[^>]+\>)?
    ){1,10}                 # match up to 10 chars (tags don't count per your example)
)+$                         # match at least once, and match end of line/string
";
Regex rx = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

foreach (string sample in samples)
{
    if (rx.IsMatch(sample))
    {
        foreach (Match m in rx.Matches(sample))
        {
            // using group index 1, group 0 is the entire match which I'm not interested in
            foreach (Capture c in m.Groups[1].Captures)
            {
                Console.WriteLine("Capture: {0} -- ({1})", c.Value, c.Value.Length);
            }
        }
    }
    else
    {
        Console.WriteLine("Not a match: {0}", sample);
    }

    Console.WriteLine();
}

Using the samples above, here is the output (the number in parentheses is the string length):

Capture: Hellotoevr -- (10)
Capture: yone<img height="115" width="150" alt="" src="/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg" />Iamsog -- (116)
Capture: ladtoseeal -- (10)
Capture: l. -- (2)

Capture: Testing123 -- (10)
Capture: Hello.Worl -- (10)
Capture: d -- (1)

Capture: Test<a href="http://stackoverflow.com">StackO -- (45)
Capture: verflow</a> -- (11)

Capture: Blah<a href="http://stackoverflow.com">StackO -- (45)
Capture: verflow</a>Bla -- (14)
Capture: h<a href="http://serverfault.com">ServerFau -- (43)
Capture: lt</a> -- (6)

Not a match: Test<a href="http://serverfault.com">Server Fault</a>

Not a match: Stack Overflow
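
As the last two samples show, the anchored pattern rejects any input that contains whitespace. In addition, a pattern of the form ^( ... {1,10})+$ may backtrack heavily on long inputs that fail to match. The following is a minimal sketch of one possible variation, not part of the original answer: it reuses the same inner expression but drops the anchors and the outer group, so each Match is one chunk and whitespace simply ends a chunk. The names HtmlChunker and SplitIgnoringTags are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Usage (top-level statements, C# 9 or later), with the first sample from the answer.
string sample = @"Hellotoevryone<img height=""115"" width=""150"" alt="""" src=""/Content/Edt/image/b4976875-8dfb-444c-8b32-cc b47b2d81e0.jpg"" />Iamsogladtoseeall.";
foreach (string chunk in HtmlChunker.SplitIgnoringTags(sample, 10))
{
    Console.WriteLine(chunk);
}

// Hypothetical helper; the class and method names are illustrative only.
public static class HtmlChunker
{
    // Splits input into chunks of at most `length` non-whitespace characters.
    // A <...> tag that directly follows a counted character is kept in the
    // chunk but does not count toward the length.
    public static List<string> SplitIgnoringTags(string input, int length)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (length < 1) throw new ArgumentOutOfRangeException(nameof(length));

        // Same inner expression as the answer, minus ^, $ and the outer group:
        // one non-whitespace char, optionally followed by one <...> tag,
        // repeated between 1 and `length` times.
        string pattern = @"(?:\S(?:<[^>]+>)?){1," + length + "}";

        var chunks = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            chunks.Add(m.Value);
        }
        return chunks;
    }
}
```

For the first sample this yields the same four chunks as the output above, while "Stack Overflow" is split into "Stack" and "Overflow" instead of being rejected. Like the original pattern, this sketch only skips a tag that immediately follows a counted character, so a tag at the very start of the input, or the second of two adjacent tags, is counted character by character.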

Regarding c# - splitting HTML into words, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/845375/
