
c# - Search and replace a regular expression in a large file without an OutOfMemoryException

Reposted. Author: 可可西里. Updated: 2023-11-01 09:11:49

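I wrote a small piece of code that searches for a regular expression, replaces each match with something else, and writes the result to a new output file. The code seems to work for smaller files, but for files of 100 MB or more I get a System.OutOfMemoryException error.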

Here is my code:

string foldername = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
    String.Format("FIXED_{0}.tmx",
        Path.GetFileNameWithoutExtension(textBox1.Text)));

string text = File.ReadAllText(textBox1.Text);
text = Regex.Replace(text, @"<seg\b[^>]*>", "<seg>", RegexOptions.Multiline);
text = Regex.Replace(text, @"<seg>
</tuv>", "<seg></seg></tuv>", RegexOptions.Multiline);

File.WriteAllText(foldername, text);

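Visual Studio highlights the line string text = File.ReadAllText(textBox1.Text);. I thought File.ReadAllLines might work better, but I could not get it to work with the regular expressions.

Can anyone help me solve this? I am new to C#, and my code is probably not the best.

Best answer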

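I'm afraid you will have to implement the replacement yourself. Below is sample code that uses a state machine to replace <seg\b[^>]*> with <seg>. Its only problem is that if the file ends with <seg attr="", only <seg is written to the output.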

enum TruncateSegState
{
    Idle,
    TagStart,
    TagStartS,
    TagStartSE,
    TagStartSEG,
    TagSEG
}

static void TruncateSeg(StreamReader input, StreamWriter output)
{
    TruncateSegState state = TruncateSegState.Idle;
    while (!input.EndOfStream)
    {
        char ch = (char)input.Read();
        switch (state)
        {
            case TruncateSegState.Idle:
                // Copy everything through; a '<' may begin a <seg ...> tag.
                if (ch == '<')
                    state = TruncateSegState.TagStart;
                output.Write(ch);
                break;
            case TruncateSegState.TagStart:
                if (ch == 's')
                    state = TruncateSegState.TagStartS;
                else
                    state = TruncateSegState.Idle;
                output.Write(ch);
                break;
            case TruncateSegState.TagStartS:
                if (ch == 'e')
                    state = TruncateSegState.TagStartSE;
                else
                    state = TruncateSegState.Idle;
                output.Write(ch);
                break;
            case TruncateSegState.TagStartSE:
                if (ch == 'g')
                    state = TruncateSegState.TagStartSEG;
                else
                    state = TruncateSegState.Idle;
                output.Write(ch);
                break;
            case TruncateSegState.TagStartSEG:
                // "<seg" has already been written. Whitespace means attributes
                // follow, so stop copying; anything else is copied as-is.
                if (char.IsWhiteSpace(ch))
                    state = TruncateSegState.TagSEG;
                else
                {
                    state = TruncateSegState.Idle;
                    output.Write(ch);
                }
                break;
            case TruncateSegState.TagSEG:
                // Skip the attributes and write only the closing '>'.
                if (ch == '>')
                {
                    state = TruncateSegState.Idle;
                    output.Write(ch);
                }
                break;
        }
    }
}

Usage:

using (StreamReader reader = new StreamReader("input.txt"))
using (StreamWriter writer = new StreamWriter("temp.txt"))
    TruncateSeg(reader, writer);
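
For this first replacement only, the File.ReadAllLines idea from the question can also be made to work if the file is read lazily with File.ReadLines. The following is a minimal sketch, not part of the original answer. It assumes that no <seg ...> opening tag is split across a line break and that the file is not one enormous line. SegLineFilter and StripSegAttributes are hypothetical names chosen for illustration.

```csharp
using System.IO;
using System.Text.RegularExpressions;

static class SegLineFilter
{
    // Same pattern as in the question; compiled once and reused for every line.
    private static readonly Regex SegOpenTag =
        new Regex(@"<seg\b[^>]*>", RegexOptions.Compiled);

    // Hypothetical helper: streams inputPath to outputPath one line at a time.
    // Assumption: a <seg ...> opening tag never spans more than one line.
    public static void StripSegAttributes(string inputPath, string outputPath)
    {
        using (StreamWriter writer = new StreamWriter(outputPath))
        {
            // File.ReadLines enumerates lazily, so the whole file is never
            // held in memory as a single string.
            foreach (string line in File.ReadLines(inputPath))
                writer.WriteLine(SegOpenTag.Replace(line, "<seg>"));
        }
    }
}
```

Note that this variant rewrites every line ending as Environment.NewLine and always ends the output with a newline. It does not cover the second replacement, whose pattern spans a line break and therefore cannot be matched one line at a time.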

Once temp.txt has been generated, use it as the input to the next method, which adds the missing </seg> tags.

enum ReplaceSegTuvState
{
    Idle,
    InsideSEG
}

// Requires: using System.IO; using System.Linq; using System.Text;
static void ReplaceSegTuv(StreamReader input, StreamWriter output)
{
    ReplaceSegTuvState state = ReplaceSegTuvState.Idle;
    // Holds whitespace seen after "<seg>" until we know what follows it.
    StringBuilder segBuffer = new StringBuilder();
    while (!input.EndOfStream)
    {
        char ch = (char)input.Read();
        switch (state)
        {
            case ReplaceSegTuvState.Idle:
                if (ch == '<')
                {
                    // Look at the next 4 characters to detect "<seg>".
                    char[] buffer = new char[4];
                    int bufferActualLength = input.ReadBlock(buffer, 0, buffer.Length);
                    output.Write('<');
                    output.Write(buffer, 0, bufferActualLength);
                    if (bufferActualLength == buffer.Length && "seg>".SequenceEqual(buffer))
                    {
                        segBuffer.Clear();
                        state = ReplaceSegTuvState.InsideSEG;
                    }
                }
                else
                    output.Write(ch);
                break;
            case ReplaceSegTuvState.InsideSEG:
                if (ch == '<')
                {
                    // Look at the next 5 characters to detect "</tuv>".
                    char[] buffer = new char[5];
                    int bufferActualLength = input.ReadBlock(buffer, 0, buffer.Length);
                    if (bufferActualLength == buffer.Length && "/tuv>".SequenceEqual(buffer))
                    {
                        // Only whitespace between <seg> and </tuv>: drop it
                        // and insert the missing closing tag.
                        output.Write("</seg>");
                        output.Write("</tuv>");
                        state = ReplaceSegTuvState.Idle;
                    }
                    else
                    {
                        output.Write(segBuffer.ToString());
                        output.Write('<');
                        output.Write(buffer, 0, bufferActualLength);
                        state = ReplaceSegTuvState.Idle;
                    }
                }
                else if (!char.IsWhiteSpace(ch))
                {
                    output.Write(segBuffer.ToString());
                    output.Write(ch);
                    state = ReplaceSegTuvState.Idle;
                }
                else
                    segBuffer.Append(ch);
                break;
        }
    }
    // If the input ends right after "<seg>" plus whitespace, write the
    // buffered whitespace instead of silently dropping it.
    if (state == ReplaceSegTuvState.InsideSEG)
        output.Write(segBuffer.ToString());
}

Usage:

using (StreamReader reader = new StreamReader("temp.txt"))
using (StreamWriter writer = new StreamWriter("output.txt"))
    ReplaceSegTuv(reader, writer);
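
The two usage snippets can be chained so that the intermediate file is created and cleaned up automatically. The following is a minimal sketch under stated assumptions, not part of the original answer. StreamPasses and Run are hypothetical names, each pass is any method with the same (StreamReader, StreamWriter) signature as the two methods above, and the input and output paths are assumed to be different files.

```csharp
using System;
using System.IO;

static class StreamPasses
{
    // Hypothetical helper: applies each pass in order, feeding the output of
    // one pass to the next through a temporary file that is deleted afterwards.
    // With no passes at all, nothing is written.
    public static void Run(string inputPath, string outputPath,
                           params Action<StreamReader, StreamWriter>[] passes)
    {
        string current = inputPath;
        for (int i = 0; i < passes.Length; i++)
        {
            bool last = i == passes.Length - 1;
            // Only the final pass writes to the caller's output path.
            string target = last ? outputPath : Path.GetTempFileName();

            using (StreamReader reader = new StreamReader(current))
            using (StreamWriter writer = new StreamWriter(target))
                passes[i](reader, writer);

            // Delete intermediate files we created, never the caller's input.
            if (current != inputPath)
                File.Delete(current);
            current = target;
        }
    }
}
```

With the two methods from this answer in scope, the call would look like StreamPasses.Run("input.txt", "output.txt", TruncateSeg, ReplaceSegTuv);. If a pass throws, the sketch leaves the current temporary file behind; a production version would likely wrap the loop in try/finally.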

For more on c# - Search and replace a regular expression in a large file without an OutOfMemoryException, see the similar question on Stack Overflow: https://stackoverflow.com/questions/24046121/
