powershell - Split a large file by a set number of CRLF line delimiters

Reposted. Author: 行者123. Updated: 2023-12-03 01:29:29

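I have a file of more than 1.5 GB that I want to split into smaller chunks, do some work on them, and then join them back together.

I have the following script, which splits every x lines. The file can contain lines with a mix of CRLF and LF delimiters.

What I'm looking for is a split on every x CRLF line delimiters, because the existing script can cut a single complete data entry in two. CRLF is the defined delimiter between records; LF occurs inside free-text fields.

Note: the code below also converts existing LF to CRLF. I want to keep the line delimiters in their original format.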
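The conversion mentioned in the note can be reproduced in isolation. The following is a minimal sketch using a made-up two-record sample (the sample text and variable names are illustrative assumptions, not taken from the real file). `ReadLine()` treats a bare LF as a line end just like CRLF, and `WriteLine()` always appends the writer's own newline:

```powershell
# Hypothetical sample: two CRLF-terminated records; the first has a bare LF
# inside a free-text field.
$text = "id1,free`ntext`r`nid2,more`r`n"

$reader = [System.IO.StringReader]::new($text)
$writer = [System.IO.StringWriter]::new()
$writer.NewLine = "`r`n"          # same newline a file writer uses on Windows

$lineCount = 0
while ($null -ne ($line = $reader.ReadLine())) {
    $writer.WriteLine($line)      # re-terminates every line with CRLF
    $lineCount++
}
$result = $writer.ToString()

# Number of real record delimiters in the input
$crlfCount = [regex]::Matches($text, "`r`n").Count

"Lines seen by ReadLine: $lineCount"
"CRLF delimiters in input: $crlfCount"
"Output identical to input: $($result -ceq $text)"
```

The line count is one higher than the record count, and the output no longer matches the input, which is why a line-based split can both cut a record in half and alter the delimiters.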

The PowerShell version is 5.1.

$sourceFolder_local = "D:\FileCleaning\"
# $file is set elsewhere in the script (not shown in this excerpt)
$raw = $sourceFolder_local + $file.name

# split test
$sw = New-Object System.Diagnostics.Stopwatch
$sw.Start()
$filename = $raw
$rootName = $raw.Replace(".csv", "")
$ext = ".csv"

$linesperFile = 100000
$filecount = 1
$reader = $null
$writer = $null
try {
    $reader = [io.file]::OpenText($filename)
    try {
        "Creating file number $filecount"
        # $ext already contains the leading dot
        $writer = [io.file]::CreateText("{0}{1}{2}" -f ($rootName, $filecount.ToString("000"), $ext))
        $filecount++
        $linecount = 0

        while ($reader.EndOfStream -ne $true) {
            "Reading $linesperFile"
            while (($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)) {
                # ReadLine() ends a line at LF as well as CRLF; WriteLine() always writes CRLF
                $writer.WriteLine($reader.ReadLine())
                $linecount++
            }

            if ($reader.EndOfStream -ne $true) {
                "Closing file"
                $writer.Dispose()

                "Creating file number $filecount"
                $writer = [io.file]::CreateText("{0}{1}{2}" -f ($rootName, $filecount.ToString("000"), $ext))
                $filecount++
                $linecount = 0
            }
        }
    } finally {
        if ($writer) { $writer.Dispose() }
    }
} finally {
    if ($reader) { $reader.Dispose() }
}
$sw.Stop()

Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

Best answer

This script splits the file on a string delimiter of your choice (e.g. CRLF):

Add-Type -AssemblyName System.Collections

$file = Get-Item 'D:\test\largefile.txt'
$delimiter = [environment]::NewLine   # delimiter to split file (CRLF on Windows)
$delimCounter = 5                     # split after X occurrences of delimiter


# Open the file once so StreamReader can detect the encoding from the byte order mark
$fileReader = [System.IO.StreamReader]::new( $file.FullName, [System.Text.Encoding]::Default, $true )
$peek = $fileReader.Peek()            # a read is needed before CurrentEncoding reflects the detection
$encoding = $fileReader.CurrentEncoding
$fileReader.Dispose()

switch( $encoding.BodyName ) {
    'utf-8' {
        $enc = [System.Text.Encoding]::UTF8
        break
    }
    'utf-7' {
        $enc = [System.Text.Encoding]::UTF7
        break
    }
    'utf-16' {
        $enc = [System.Text.Encoding]::Unicode
        break
    }
    'utf-32' {
        $enc = [System.Text.Encoding]::UTF32
        break
    }
    'bigendianunicode' {
        $enc = [System.Text.Encoding]::BigEndianUnicode
        break
    }
    'ascii' {
        $enc = [System.Text.Encoding]::ASCII
        break
    }
    default {
        $enc = $null
    }
}

# Re-express the delimiter as one character per encoded byte so it can be compared byte-wise
if( $enc ) {
    $delimiter = [string]::new( $enc.GetBytes($delimiter) )
}

$fileReader = [System.IO.FileStream]::new( $file.FullName, [System.IO.FileMode]::Open )
$delimBuffer = [System.Collections.Generic.List[byte]]::new()   # sliding window, as long as the delimiter
$fileBuffer = [System.Collections.Generic.List[byte]]::new()    # bytes of the chunk being collected
$fileCounter = 0
$delimCounter1 = $delimCounter

# start with a window of zero bytes
$delimBuffer.AddRange( [byte[]]::new( $delimiter.Length ) )

$byte = $fileReader.ReadByte()

while( $byte -ge 0 ) {

    # slide the window forward by one byte and collect the byte for the current chunk
    $delimBuffer.RemoveAt(0)
    $delimBuffer.Add( [byte]$byte )
    $fileBuffer.Add( [byte]$byte )

    if( [String]::new( $delimBuffer ) -eq $delimiter ) {
        $delimCounter1--
        if( !$delimCounter1 ) {
            # remove last CRLF (if not needed, remove next line)
            $fileBuffer.RemoveRange( $fileBuffer.Count - $delimiter.Length, $delimiter.Length )
            [System.IO.File]::WriteAllBytes( ($file.DirectoryName + '\' + $file.BaseName + $fileCounter + $file.Extension), $fileBuffer.ToArray() )
            $fileBuffer.Clear()
            $fileCounter++
            $delimCounter1 = $delimCounter
        }
    }

    $byte = $fileReader.ReadByte()
}

# write whatever is left after the last full group of delimiters
if( $fileBuffer.Count -gt 0 ) {
    [System.IO.File]::WriteAllBytes( ($file.DirectoryName + '\' + $file.BaseName + $fileCounter + $file.Extension), $fileBuffer.ToArray() )
    $fileBuffer.Clear()
}

$fileReader.Dispose()
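For a file of this size, a streaming variant may be worth considering. What follows is a minimal sketch and not part of the original answer: `Split-FileByCrLf`, `Join-FileChunk`, and their parameters are hypothetical names introduced here. It assumes an encoding in which CRLF is the byte pair 0x0D 0x0A (ASCII, UTF-8, single-byte code pages), reads the input in blocks, and writes each chunk straight to disk instead of buffering it in memory. Unlike the script above it keeps the trailing CRLF in every chunk, so concatenating the chunks should reproduce the original bytes.

```powershell
# Sketch only: function and parameter names are hypothetical.
# Assumes CRLF is the byte pair 0x0D 0x0A (not valid for UTF-16/UTF-32 files).
function Split-FileByCrLf {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $RecordsPerFile = 100000,
        [int] $BufferSize = 1MB
    )
    $item      = Get-Item -LiteralPath $Path
    $inFile    = [System.IO.File]::OpenRead($item.FullName)
    $outFile   = $null
    $outPaths  = [System.Collections.Generic.List[string]]::new()
    $buffer    = [byte[]]::new($BufferSize)
    $records   = 0
    $lastWasCR = $false                      # did the previous block end with CR?
    try {
        while (($read = $inFile.Read($buffer, 0, $buffer.Length)) -gt 0) {
            $start = 0                       # first byte of this block not yet written
            $pos   = 0                       # where the search for the next LF resumes
            while ($start -lt $read) {
                $end = $read
                $closeChunk = $false
                while ($pos -lt $read) {
                    $lf = [Array]::IndexOf($buffer, [byte]10, $pos, $read - $pos)
                    if ($lf -lt 0) { $pos = $read; break }
                    $pos = $lf + 1
                    # An LF counts only when the byte before it is CR
                    $isCrLf = if ($lf -gt 0) { $buffer[$lf - 1] -eq 13 } else { $lastWasCR }
                    if ($isCrLf) {
                        $records++
                        if ($records -eq $RecordsPerFile) {
                            $end = $pos; $closeChunk = $true; $records = 0
                            break
                        }
                    }
                }
                if ($null -eq $outFile) {
                    $name = '{0}{1:000}{2}' -f $item.BaseName, ($outPaths.Count + 1), $item.Extension
                    $full = Join-Path $item.DirectoryName $name
                    $outFile = [System.IO.File]::Create($full)
                    $outPaths.Add($full)
                }
                $outFile.Write($buffer, $start, $end - $start)
                if ($closeChunk) { $outFile.Dispose(); $outFile = $null }
                $start = $end
            }
            $lastWasCR = ($buffer[$read - 1] -eq 13)
        }
    } finally {
        if ($outFile) { $outFile.Dispose() }
        $inFile.Dispose()
    }
    $outPaths                                # full paths of the chunks, in order
}

# Concatenate chunks byte for byte; pass full paths for both parameters.
function Join-FileChunk {
    param([string[]] $ChunkPath, [string] $Destination)
    $out = [System.IO.File]::Create($Destination)
    try {
        foreach ($p in $ChunkPath) {
            $in = [System.IO.File]::OpenRead($p)
            try { $in.CopyTo($out) } finally { $in.Dispose() }
        }
    } finally {
        $out.Dispose()
    }
}
```

A call such as `Split-FileByCrLf -Path 'D:\test\largefile.txt' -RecordsPerFile 5` would write `largefile001.txt`, `largefile002.txt`, and so on next to the source file. Searching for LF with `[Array]::IndexOf` and then checking the preceding byte avoids a per-byte loop in script code, which tends to be slow in PowerShell; comparing `Get-FileHash` of the original and the rejoined file is one way to confirm that no delimiter was altered.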

Regarding "powershell - Split a large file by a set number of CRLF line delimiters", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59649215/
