c# - Out of memory when reading a large file


I am building a tool that analyzes the data quality of files, so I need to read every line of a file and analyze each one. I also need to keep all of the file's lines in memory, because the user can drill down into specific sections. For files with a few thousand lines everything basically works, but when I try a CSV file with more than 4 million lines I get an out-of-memory exception. I assumed C# could hold millions of items in its in-memory cache, but apparently it cannot, so I'm a bit stuck. Maybe my code is not the most efficient, so if you can show me a way to improve it, that would be great. Keep in mind that I need all of the file's lines in memory, because depending on what the user does I need to access specific lines and display them.

Here is the code that reads each line:

using (FileStream fs = File.Open(this.dlgInput.FileName.ToString(), FileMode.Open, FileAccess.Read, FileShare.Read))
using (BufferedStream bs = new BufferedStream(fs))
using (System.IO.StreamReader sr = new StreamReader(this.dlgInput.FileName.ToString(), Encoding.Default, false, 8192))
{
    string line;
    if (this.chkSkipHeader.Checked)
    {
        sr.ReadLine();
    }

    progressBar1.Visible = true;
    int nbOfLines = File.ReadLines(this.dlgInput.FileName.ToString()).Count();
    progressBar1.Maximum = nbOfLines;

    this.lines = new string[nbOfLines][];
    this.patternedLines = new string[nbOfLines][];
    for (int i = 0; i < nbOfLines; i++)
    {
        this.lines[i] = new string[this.dgvFields.Rows.Count];
        this.patternedLines[i] = new string[this.dgvFields.Rows.Count];
    }

    // Read and display lines from the file until the end of
    // the file is reached.
    while ((line = sr.ReadLine()) != null)
    {
        this.recordCount += 1;
        char[] c = new char[1] { ',' };
        System.Text.RegularExpressions.Regex CSVParser = new System.Text.RegularExpressions.Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
        String[] fields = CSVParser.Split(line);
        ParseLine(fields);
        this.lines[recordCount - 1] = fields;
        progressBar1.PerformStep();
    }
}
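For reference, the regular expression used above splits on commas that sit outside double-quoted fields; the quotes themselves are kept in the resulting fields. A minimal standalone sketch of its behavior, using made-up sample data:

using System;
using System.Text.RegularExpressions;

class RegexCsvSplitDemo
{
    static void Main()
    {
        // Split on commas that are followed by an even number of quote
        // characters, i.e. commas that are not inside a quoted field.
        Regex csvParser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

        string line = "1,\"Smith, John\",NY";
        string[] fields = csvParser.Split(line);

        // Prints: 1 | "Smith, John" | NY
        Console.WriteLine(string.Join(" | ", fields));
    }
}

Since the pattern never changes, the Regex could also be constructed once (for example as a static readonly field) instead of once per line; that mainly affects speed and allocation churn rather than the peak memory used by the stored lines.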

Here is the ParseLine function; it also keeps the data to analyze in memory through several arrays:

private void ParseLine(String[] fields2)
{
    for (int j = 0; j <= fields2.Length - 1; j++)
    {
        if ((int)this.dgvFields.Rows[j].Cells["colSelected"].Value == 1)
        {
            // ************************************************
            // Save Number of Counts by Value
            // ************************************************

            if (this.values[j].ContainsKey(fields2[j]))
            {
                //values[0] = Dictionary<"TEST", 1> (fields2[0 which is source code] = count])
                this.values[j][fields2[j]] += 1;
            }
            else
            {
                this.values[j].Add(fields2[j], 1);
            }

            // ************************************************
            // Save Pattern Values/Counts
            // ************************************************

            string tmp = System.Text.RegularExpressions.Regex.Replace(fields2[j], "\\p{Lu}", "X");
            tmp = System.Text.RegularExpressions.Regex.Replace(tmp, "\\p{Ll}", "x");
            tmp = System.Text.RegularExpressions.Regex.Replace(tmp, "[0-9]", "0");

            if (this.patterns[j].ContainsKey(tmp))
            {
                this.patterns[j][tmp] += 1;
            }
            else
            {
                this.patterns[j].Add(tmp, 1);
            }

            this.patternedLines[this.recordCount - 1][j] = tmp;

            // ************************************************
            // Count Blanks/Alpha/Numeric/Phone/Other
            // ************************************************

            if (String.IsNullOrWhiteSpace(fields2[j]))
            {
                this.blanks[j] += 1;
            }
            else if (System.Text.RegularExpressions.Regex.IsMatch(fields2[j], "^[0-9]+$"))
            {
                this.numeric[j] += 1;
            }
            else if (System.Text.RegularExpressions.Regex.IsMatch(fields2[j].ToUpper().Replace("EXTENSION", "").Replace("EXT", "").Replace("X", ""), "^[0-9()\\- ]+$"))
            {
                this.phone[j] += 1;
            }
            else if (System.Text.RegularExpressions.Regex.IsMatch(fields2[j], "^[a-zA-Z ]+$"))
            {
                this.alpha[j] += 1;
            }
            else
            {
                this.other[j] += 1;
            }

            if (this.recordCount == 1)
            {
                this.high[j] = fields2[j];
                this.low[j] = fields2[j];
            }
            else
            {
                if (fields2[j].CompareTo(this.high[j]) > 0)
                {
                    this.high[j] = fields2[j];
                }

                if (fields2[j].CompareTo(this.low[j]) < 0)
                {
                    this.low[j] = fields2[j];
                }
            }
        }
    }
}

Update: new code

int nbOfLines = File.ReadLines(this.dlgInput.FileName.ToString()).Count();

// Read file
using (System.IO.StreamReader sr = new StreamReader(this.dlgInput.FileName.ToString(), Encoding.Default, false, 8192))
{
    string line;
    if (this.chkSkipHeader.Checked)
    {
        sr.ReadLine();
    }

    progressBar1.Visible = true;
    progressBar1.Maximum = nbOfLines;

    this.lines = new string[nbOfLines][];
    this.patternedLines = new string[nbOfLines][];
    for (int i = 0; i < nbOfLines; i++)
    {
        this.lines[i] = new string[this.dgvFields.Rows.Count];
        this.patternedLines[i] = new string[this.dgvFields.Rows.Count];
    }

    // Read and display lines from the file until the end of
    // the file is reached.
    while ((line = sr.ReadLine()) != null)
    {
        this.recordCount += 1;
        char[] c = new char[1] { ',' };
        System.Text.RegularExpressions.Regex CSVParser = new System.Text.RegularExpressions.Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
        String[] fields = CSVParser.Split(line);
        ParseLine(fields);
        this.lines[recordCount - 1] = fields;
        progressBar1.PerformStep();
    }
}

Best answer

You need a helper class that caches the starting position of every line in the file:

// Byte offset of the start of each line in the file.
long[] cacheLineStartPos;

public string GetLine(int lineNumber)
{
    long linePositionInFile = cacheLineStartPos[lineNumber];

    // StreamReader has no Position property, so seek the underlying stream
    // and throw away whatever the reader has already buffered.
    reader.BaseStream.Seek(linePositionInFile, SeekOrigin.Begin);
    reader.DiscardBufferedData();

    return reader.ReadLine();
}

Of course, this is only an example; the logic can be more complex.
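As a rough, self-contained sketch of that approach (the class name LineIndexedReader and its members are illustrative, not from the original answer): the file is scanned once to record where each line starts, only those offsets are kept in memory, and individual lines are re-read from disk on demand when the user drills into them.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Keeps only one long (8 bytes) per line in memory instead of the line itself,
// and re-reads a single line from disk whenever it is requested.
public sealed class LineIndexedReader : IDisposable
{
    private readonly FileStream stream;
    private readonly StreamReader reader;
    private readonly List<long> lineOffsets = new List<long>();

    public LineIndexedReader(string path)
    {
        stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        reader = new StreamReader(stream, Encoding.Default, false, 8192);
        BuildIndex();
    }

    public int LineCount
    {
        get { return lineOffsets.Count; }
    }

    // One pass over the raw bytes to record the byte offset at which each
    // line starts ('\n' is unambiguous in single-byte encodings and UTF-8).
    private void BuildIndex()
    {
        lineOffsets.Add(0);
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == '\n' && stream.Position < stream.Length)
            {
                lineOffsets.Add(stream.Position);
            }
        }
    }

    public string GetLine(int lineNumber)
    {
        // Jump to the cached offset and discard the reader's stale buffer.
        stream.Seek(lineOffsets[lineNumber], SeekOrigin.Begin);
        reader.DiscardBufferedData();
        return reader.ReadLine();
    }

    public void Dispose()
    {
        reader.Dispose(); // also closes the underlying FileStream
    }
}

With something like this, the analysis pass (counts, patterns, high/low values) can still stream through the file line by line, while drill-down only needs GetLine(n); a 4-million-line file then costs about 32 MB of offsets instead of gigabytes of stored strings.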

Regarding c# - out of memory when reading a large file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39126126/
