
powershell - How can I make this PowerShell script parse large files faster?

Reposted. Author: 行者123. Updated: 2023-12-03 21:19:24

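I have the following PowerShell script that parses some very large files for ETL purposes. To start with, my test file is about 30 MB. Larger files of around 200 MB are expected. So I have a few questions.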

The script below works, but it takes a very long time to process even a 30 MB file.

PowerShell script:

$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
$array = @()

$content = gc $path\$infile |
select -skip 4 |
where {$_ -match "[|].*[|].*"} |
foreach {$_ -replace "^[|]","" -replace "[|]$",""}

$header = $content[0]

$array = $content[0]
for ($i = 1; $i -le $content.length; $i+=1) {
if ($array[$i] -ne $content[0]) {$array += $content[$i]}
}

$array | out-file $path\$outfile -encoding ASCII

Data file excerpt:
---------------------------
|Data statistics|Number of|
|-------------------------|
|Records passed | 93,118|
---------------------------
02/14/2012 Production Operations and Confirmations 2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Production Operations and Confirmations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|ProductionOrderNumber|MaterialNumber |ModifiedDate|Plant|OperationRoutingNumber|WorkCenter|OperationStatus|IsActive| WbsElement|SequenceNumber|OperationNumber|OperationDescription |OperationQty|ConfirmedYieldQty|StandardValueLabor|ActualDirectLaborHrs|ActualContractorLaborHrs|ActualOvertimeLaborHrs|ConfirmationNumber|
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|180849518 |011255486L1 |02/08/2012 |2101 | 9901123118|56B30 |I9902 | |SOC10MA2302SOCJ31| |0140 |Operation 1 | 1 | 0 | 0.0 | | 499.990 | | 9908651250|
|180849518 |011255486L1 |02/08/2012 |2101 | 9901123118|56B30 |I9902 | |SOC10MA2302SOCJ31|14 |9916 |Operation 2 | 1 | 0 | 499.0 | | | | 9908532289|
|181993564 |011255486L1 |02/09/2012 |2101 | 9901288820|56B30 |I9902 | |SOC10MD2302SOCJ31|14 |9916 |Operation 1 | 1 | 0 | 499.0 | | 399.599 | | 9908498544|
|180885825 |011255486L1 |02/08/2012 |2101 | 9901162239|56B30 |I9902 | |SOC10MG2302SOCJ31| |0150 |Operation 3 | 1 | 0 | 0.0 | | 882.499 | | 9908099659|
|180885825 |011255486L1 |02/08/2012 |2101 | 9901162239|56B30 |I9902 | |SOC10MG2302SOCJ31|14 |9916 |Operation 4 | 1 | 0 | 544.0 | | | | 9908858514|
|181638583 |990104460I0 |02/10/2012 |2101 | 9902123289|56G99 |I9902 | |SOC11MAR105SOCJ31| |0160 |Operation 5 | 1 | 0 | 1,160.0 | | | | 9914295010|
|181681218 |990104460B0 |02/08/2012 |2101 | 9902180981|56G99 |I9902 | |SOC11MAR328SOCJ31|0 |9910 |Operation 6 | 1 | 0 | 916.0 | | | | 9914621885|
|181681036 |990104460I0 |02/09/2012 |2101 | 9902180289|56G99 |I9902 | |SOC11MAR108SOCJ31| |0180 |Operation 8 | 1 | 0 | 1.0 | | | | 9914619196|
|189938054 |011255486A2 |02/10/2012 |2101 | 9999206805|5AD99 |I9902 | |RS08MJ2305SOCJ31 | |0599 |Operation 8 | 1 | 0 | 0.0 | | | | 9901316289|
|181919894 |012984532A3 |02/10/2012 |2101 | 9902511433|A199399Z |I9902 | |SOC12MCB101SOCJ31|0 |9935 |Operation 9 | 1 | 0 | 0.5 | | | | 9916914233|
|181919894 |012984532A3 |02/10/2012 |2101 | 9902511433|A199399Z |I9902 | |SOC12MCB101SOCJ31|22 |9951 |Operation 10 | 1 | 0 | 68.080 | | | | 9916914224|

Best answer

Your script reads one line at a time (slow!) and stores almost the entire file in memory (big!).
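As a rough illustration of the first point, the sketch below counts how many objects Get-Content sends down the pipeline, first with its default behavior and then with -ReadCount. The temp file, its 2500 lines, and the variable names are assumptions made up for this demo; they are not part of the original question or answer.

```powershell
# Hypothetical temp file with 2500 short lines, created only for this demo.
$tmp = [System.IO.Path]::GetTempFileName()
1..2500 | ForEach-Object { "|row $_|a|b|" } | Set-Content -Path $tmp -Encoding ASCII

# Default behavior: every line travels down the pipeline as its own string.
$perLine = (Get-Content -Path $tmp | Measure-Object).Count

# With -ReadCount: each pipeline object is an array holding up to 1000 lines.
$perBatch = (Get-Content -Path $tmp -ReadCount 1000 | Measure-Object).Count

"Objects sent line by line: $perLine"
"Objects sent in batches:   $perBatch"

# Clean up the demo file.
Remove-Item -Path $tmp
```

For this demo file the first count should equal the number of lines, while the second should be only a handful of batches. Fewer pipeline objects means less per-object overhead in every downstream command.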

Try this (not extensively tested):

$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"

$batch = 1000

[regex]$match_regex = '^\|.+\|.+\|.+'
[regex]$replace_regex = '^\|(.+)\|$'

$header_line = (Select-String -Path $path\$infile -Pattern $match_regex -list).line

[regex]$header_regex = [regex]::escape($header_line)

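# Write the header once, with the outer pipes trimmed.
# -Encoding ASCII on both writes keeps the output file in a single encoding.
$header_line.trim('|') | Set-Content $path\$outfile -Encoding ASCII

# Read $batch lines at a time; filter and clean each batch with array operators.
Get-Content $path\$infile -ReadCount $batch |
    ForEach {
        $_ -match $match_regex -NotMatch $header_regex -Replace $replace_regex, '$1' |
            Out-File $path\$outfile -Append -Encoding ASCII
    }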

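This is a trade-off between memory usage and speed. The -match and -replace operators work on arrays, so you can filter and replace an entire array at once without looping over each record. -ReadCount causes the file to be read in chunks of $batch records, so you are essentially reading 1000 records at a time, doing the match and replace on that batch, and then appending the result to the output file. Then it goes back for the next 1000 records. Increasing the size of $batch should speed it up, but it will use more memory. Adjust it to suit your resources.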
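The array behavior of these operators can be tried in isolation. The following is a minimal sketch that applies the same three regular expressions to a small in-memory array; the sample rows are shortened stand-ins for the data excerpt, and `$lines` and `$result` are names introduced here for illustration only.

```powershell
# Shortened, illustrative rows shaped like the data excerpt (assumed sample data).
$header = '|ProductionOrderNumber|MaterialNumber|ModifiedDate|'
$lines = @(
    '---------------------------',
    $header,
    '|180849518|011255486L1|02/08/2012|',
    $header,
    '|181993564|011255486L1|02/09/2012|'
)

[regex]$match_regex   = '^\|.+\|.+\|.+'
[regex]$replace_regex = '^\|(.+)\|$'
[regex]$header_regex  = [regex]::Escape($header)

# With an array on the left, each operator returns a new array:
#   -match    keeps only the pipe-delimited rows
#   -notmatch drops the repeated header rows
#   -replace  strips the leading and trailing pipe from each remaining row
$result = $lines -match $match_regex -notmatch $header_regex -replace $replace_regex, '$1'

$result
```

Because the three operators have equal precedence and are evaluated left to right, the chain behaves like a small filter pipeline over the whole batch, with no explicit loop over individual records.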

The original question, "powershell - How can I make this PowerShell script parse large files faster?", can be found on Stack Overflow: https://stackoverflow.com/questions/9439210/
