powershell - In PowerShell, what's the most efficient way to split a large text file by record type?

Reposted by: 行者123 · Updated: 2023-12-02 23:16:38

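I'm using PowerShell for some ETL work, reading compressed text files in and splitting them out depending on the first three characters of each line.

If I were just filtering the input file, I could pipe the filtered stream to Out-File and be done with it. But I need to redirect the output to more than one destination, and as far as I know this can't be done with a simple pipe. I'm already using a .NET StreamReader to read the compressed input files, and I'm wondering whether I need to use a StreamWriter to write the output files as well.

The naive version looks something like this: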

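# StreamReader exposes EndOfStream (there is no EndOfFile property).
while (!$reader.EndOfStream) {
  $line = $reader.ReadLine();
  # Route each line by its three-character record type.
  switch ($line.Substring(0,3)) {
    "001" {Add-Content "output001.txt" $line}
    "002" {Add-Content "output002.txt" $line}
    "003" {Add-Content "output003.txt" $line}
  }
}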

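That just looks like bad news: finding, opening, writing, and closing a file once per line. The input files are huge 500MB+ monsters.

Is there an idiomatic way to handle this efficiently with PowerShell constructs, or should I turn to the .NET StreamWriter?

Are there methods of a (New-Item "path" -type "file") object I could use for this?

EDIT for context:

I'm using the DotNetZip library to read ZIP files as streams; hence StreamReader rather than Get-Content/gc. Sample code: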

# Load DotNetZip and open the archive.
[System.Reflection.Assembly]::LoadFrom("\Path\To\Ionic.Zip.dll") 
$zipfile = [Ionic.Zip.ZipFile]::Read("\Path\To\File.zip")

foreach ($entry in $zipfile) {
  # Wrap each entry's stream in a StreamReader to read it line by line.
  $reader = new-object system.io.streamreader $entry.OpenReader();
  while (!$reader.EndOfStream) {
    $line = $reader.ReadLine();
    #do something here
  }
}

I should probably Dispose() of both the $zipfile and $reader, but that is for another question!

Best answer

Reading

As for reading the file and parsing, I would go with the switch statement:

switch -file c:\temp\stackoverflow.testfile2.txt -regex {
  "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
  "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
  "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
}

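I think this is the better approach, because

it supports regular expressions, so you don't have to take a substring (which might be expensive), and
the -file parameter is quite handy ;)

Writing

As for writing the output, I would test using a StreamWriter; however, if the performance of Add-Content is adequate for you, I would stick with it.

Added: Keith proposed using the >> operator; however, it seems to be very slow. Besides that, it writes the output in Unicode, which doubles the file size.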
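The doubling follows from the encoding: in Windows PowerShell the > and >> operators write UTF-16LE ("Unicode") by default, which uses two bytes for every ASCII character. (PowerShell 6 and later default to UTF-8 without a BOM, so the size penalty should not apply there.) A minimal sketch that compares the byte counts for a made-up sample line; the sample text is an assumption, not taken from the test file:

```powershell
# Made-up sample record; any ASCII-only line behaves the same way.
$line = '001 sample record'

# Number of bytes the line occupies in each encoding (excluding newline and BOM).
$asciiBytes   = [System.Text.Encoding]::ASCII.GetByteCount($line)
$unicodeBytes = [System.Text.Encoding]::Unicode.GetByteCount($line)   # UTF-16LE

"ASCII: $asciiBytes bytes, UTF-16LE: $unicodeBytes bytes"
```

For ASCII-only data the UTF-16LE count is exactly twice the ASCII count, which matches the doubled file size.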

Look at my test:

[1]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c >> c:\temp\stackoverflow.testfile.001.txt} `
>>             '002'{$c >> c:\temp\stackoverflow.testfile.002.txt} `
>>             '003'{$c >> c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
159,1585874
[2]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.txt} `
>>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.txt} `
>>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.txt}}}
>> }).TotalSeconds
>>
9,2696923

The difference is huge. (The timings are in seconds; the comma is a decimal separator.)

Just for comparison:

[3]: (measure-command {
>>     $reader = new-object io.streamreader c:\temp\stackoverflow.testfile2.txt
>>     while (!$reader.EndOfStream) {
>>         $line = $reader.ReadLine();
>>         switch ($line.substring(0,3)) {
>>             "001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $line}
>>             "002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $line}
>>             "003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $line}
>>             }
>>         }
>>     $reader.close()
>> }).TotalSeconds
>>
8,2454369
[4]: (measure-command {
>>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>>         "^001" {Add-Content c:\temp\stackoverflow.testfile.001.txt $_}
>>         "^002" {Add-Content c:\temp\stackoverflow.testfile.002.txt $_}
>>         "^003" {Add-Content c:\temp\stackoverflow.testfile.003.txt $_}
>>     }
>> }).TotalSeconds
8,6755565

Added: I was curious about the write performance, and I was a little bit surprised:

[8]: (measure-command {
>>     $sw1 = new-object io.streamwriter c:\temp\stackoverflow.testfile.001.txt3b
>>     $sw2 = new-object io.streamwriter c:\temp\stackoverflow.testfile.002.txt3b
>>     $sw3 = new-object io.streamwriter c:\temp\stackoverflow.testfile.003.txt3b
>>     switch -file c:\temp\stackoverflow.testfile2.txt -regex {
>>         "^001" {$sw1.WriteLine($_)}
>>         "^002" {$sw2.WriteLine($_)}
>>         "^003" {$sw3.WriteLine($_)}
>>     }
>>     $sw1.Close()
>>     $sw2.Close()
>>     $sw3.Close()
>>
>> }).TotalSeconds
>>
0,1062315

80 times faster. Now you have to decide: if speed is important, use StreamWriter. If code clarity is important, use Add-Content.
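The StreamWriter test above hard-codes one writer per record type. A minimal sketch of the same idea for record types that are not known in advance, keeping one open writer per three-character prefix in a hashtable. The function name Split-ByPrefix, its parameters, and the output file naming are hypothetical, and the sketch assumes absolute paths and prefixes that are valid in file names:

```powershell
# Hypothetical helper: split a text file by the first three characters of each
# line, opening one StreamWriter per record type on first use and keeping it
# open until the whole input has been read.
function Split-ByPrefix {
    param(
        [Parameter(Mandatory = $true)] [string] $InputPath,  # absolute path of the file to split
        [Parameter(Mandatory = $true)] [string] $OutputDir   # absolute path of an existing directory
    )

    $writers = @{}   # record type -> open StreamWriter
    $reader = New-Object System.IO.StreamReader $InputPath
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line.Length -lt 3) { continue }   # too short to classify; skipped
            $key = $line.Substring(0, 3)
            if (-not $writers.ContainsKey($key)) {
                # Assumption: output files are named output<prefix>.txt.
                $path = Join-Path $OutputDir "output$key.txt"
                $writers[$key] = New-Object System.IO.StreamWriter $path
            }
            $writers[$key].WriteLine($line)
        }
    }
    finally {
        # Dispose every writer even if reading fails, so buffered data is flushed.
        foreach ($w in $writers.Values) { $w.Dispose() }
        $reader.Dispose()
    }
}
```

It could be called as Split-ByPrefix -InputPath C:\temp\input.txt -OutputDir C:\temp\out (both paths made up). Absolute paths matter here because .NET resolves relative paths against the process working directory, which is not necessarily the current PowerShell location.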

Substring vs. regex

According to Keith, Substring is 20% faster. As always, it depends. In my case, however, the results look like this:

[102]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch ($_.Substring(0,3)) {
>>             '001'{$c | Add-content c:\temp\stackoverflow.testfile.001.s.txt} `
>>             '002'{$c | Add-content c:\temp\stackoverflow.testfile.002.s.txt} `
>>             '003'{$c | Add-content c:\temp\stackoverflow.testfile.003.s.txt}}}
>> }).TotalSeconds
>>
9,0654496
[103]: (measure-command {
>>     gc c:\temp\stackoverflow.testfile2.txt  | %{$c = $_; switch -regex ($_) {
>>             '^001'{$c | Add-content c:\temp\stackoverflow.testfile.001.r.txt} `
>>             '^002'{$c | Add-content c:\temp\stackoverflow.testfile.002.r.txt} `
>>             '^003'{$c | Add-content c:\temp\stackoverflow.testfile.003.r.txt}}}
>> }).TotalSeconds
>>
9,2563681

So the difference is not significant, and to me regexes are more readable.
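Both timings above include the cost of Add-Content, which probably dominates the run time and can hide any difference in the matching itself. A minimal sketch that times only the classification step on in-memory lines; the data is made up, and no particular result is implied, since the outcome depends on the machine and the PowerShell version:

```powershell
# Made-up in-memory data: 30,000 lines cycling through three record types.
$lines = foreach ($i in 1..30000) { '{0:000} payload {1}' -f (($i % 3) + 1), $i }

# Count the '001' records with each technique so both loops do comparable work.
$counts = @{ BySubstring = 0; ByRegex = 0 }

$tSubstring = (Measure-Command {
    foreach ($l in $lines) {
        switch ($l.Substring(0, 3)) {
            '001' { $counts['BySubstring']++ }
        }
    }
}).TotalSeconds

$tRegex = (Measure-Command {
    foreach ($l in $lines) {
        switch -regex ($l) {
            '^001' { $counts['ByRegex']++ }
        }
    }
}).TotalSeconds

"Substring: $tSubstring s, regex: $tRegex s"
```

Running it several times and comparing the two numbers gives a rough idea of whether the matching technique matters at all for a given data set.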

Regarding "powershell - In PowerShell, what's the most efficient way to split a large text file by record type?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/2159763/
