powershell - Searching many large text files in PowerShell

Reposted by: 行者123  Updated: 2023-12-02 23:15:36

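I regularly need to search server log files in a directory that may contain 50 or more files, each over 200 MB. I wrote a function in PowerShell to do this search. It finds and extracts all the values for a given query parameter. It works well on a single large file or on a collection of small files, but it chokes completely in the situation above: a directory full of large files.

The function takes one argument, which consists of the query parameter to search for.

In pseudocode: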

Take parameter (e.g. someParam or someParam=([^& ]+))
Create a regex (if one is not supplied)
Collect a directory list of *.log, pipe to Select-String
For each pipeline object, add the matchers to a hash as keys
Increment a match counter
Call GC
At the end of the pipelining: 
if (hash has keys) 
    enumerate the hash keys, 
    sort and append to string array
    set-content the string array to a file 
    print summary to console
    exit
else
    print summary to console
    exit

Here is a stripped-down version of the file processing.

$wtmatches = @{};
gci -Filter *.log | Select-String -Pattern $searcher |       
%{ $wtmatches[$_.Matches[0].Groups[1].Value]++; $items++; [GC]::Collect(); }

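I'm just using an old Perl trick of deduplicating the found items by making them keys of a hash. Perhaps that is a mistake, but typical output from a run is about 30,000 items at most. More commonly, the found items number in the low thousands. From what I can see, the number of keys in the hash does not affect processing time; it is the size and number of files that breaks it. I recently threw in the GC call out of desperation, and it does have some positive effect, but it is marginal.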
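The hash-key deduplication described here can also be written with a .NET `HashSet[string]`. The following is only a minimal sketch: the sample values are invented and stand in for regex captures, and it is not the code used in the question.

```powershell
# Hypothetical sample values standing in for regex captures.
$captures = 'alpha', 'beta', 'alpha', 'gamma', 'beta'

$unique = New-Object 'System.Collections.Generic.HashSet[string]'
$total = 0
foreach ($value in $captures) {
    # Add() returns $false when the value is already present; the result is discarded.
    [void]$unique.Add($value)
    $total++
}

# Enumerating the set through the pipeline lets Sort-Object order the values.
$sorted = $unique | Sort-Object
"Matched $total values, $($unique.Count) unique: $($sorted -join ', ')"
```

One difference to keep in mind: a PowerShell hashtable literal (`@{}`) compares string keys case-insensitively, whereas `HashSet[string]` defaults to a case-sensitive ordinal comparison, so the two can give different unique counts on mixed-case data.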

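The problem is that, for a large collection of large files, the processing sucks the RAM pool dry in about 60 seconds. Interestingly, it doesn't actually use much CPU, but there is a lot of volatile storage activity. Once RAM usage goes above 90%, I can punch out and go watch TV. Finishing a run that produces a file of 15,000 or 20,000 unique values can take hours.

I would like advice and/or suggestions for increasing efficiency, even if that means using a different paradigm to accomplish the processing. I went with what I know. I use this tool almost daily.

Oh, and I'm committed to using PowerShell. ;-) This function is part of a complete module I've written for my job, so suggestions of Python, Perl, or other useful languages are not useful in this case.

Thanks.

mp

Update:
Using latkin's ProcessFile function, I tested with the following wrapper. His function is orders of magnitude faster than my original.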

function Find-WtQuery {

<#
 .Synopsis
  Takes a parameter with a capture regex and a wildcard for files list.

 .Description
  This function is intended to be used on large collections of large files that have
  the potential to take an unacceptably long time to process using other methods. It
  requires that a regex capture group be passed in as the value to search for.

 .Parameter Target
  The parameter with capture group to find, e.g. WT.z_custom=([^ &]+).

 .Parameter Files
  The file wildcard to search, e.g. '*.log'

 .Outputs
  An object with an array of unique values and a count of total matched lines.
#>

        param(
        [Parameter(Mandatory = $true)] [string] $target,
        [Parameter(Mandatory = $false)] [string] $files
    )

    begin{
        $stime = Get-Date
    }
    process{
        $results = gci -Filter $files | ProcessFile -Pattern $target  -Group 1;
    }
    end{
        $etime = Get-Date;
        $ptime = $etime - $stime;
        Write-Host ("Processing time for {0} files was {1}:{2}:{3}." -f (gci -Filter $files).Count, $ptime.Hours, $ptime.Minutes, $ptime.Seconds);
        return $results;
    }
}

Output:

clients:\test\logs\global
{powem} [4] --> Find-WtQuery -target "WT.ets=([^ &]+)" -files "*.log"
Processing time for 53 files was 0:1:35.
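
The elapsed-time bookkeeping in the wrapper could also be done with `System.Diagnostics.Stopwatch`. This is a sketch only: `Start-Sleep` is a placeholder for the real search, and the zero-padded format is a choice made here, not part of the original wrapper.

```powershell
$sw = [System.Diagnostics.Stopwatch]::StartNew()
Start-Sleep -Milliseconds 50   # placeholder for the real search work
$sw.Stop()

$elapsed = $sw.Elapsed         # a TimeSpan, like the difference of two Get-Date values
# {1:00} and {2:00} zero-pad minutes and seconds, e.g. 0:01:35 instead of 0:1:35.
$msg = "Processing time was {0}:{1:00}:{2:00}." -f $elapsed.Hours, $elapsed.Minutes, $elapsed.Seconds
$msg
```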

Thanks to everyone for the comments and help.

Best answer

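Here is a function that will hopefully speed up the file-processing part and reduce its memory footprint. It returns an object with 2 properties: the total count of matched lines, and a sorted array of unique strings from the specified match group. (From your description, it sounds like you don't really care about the count per string, just the string values themselves.)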

function ProcessFile
{
   param(
      [Parameter(ValueFromPipeline = $true, Mandatory = $true)]
      [System.IO.FileInfo] $File,

      [Parameter(Mandatory = $true)]
      [string] $Pattern,

      [Parameter(Mandatory = $true)]
      [int] $Group
   )

   begin
   {
      $regex = new-object Regex @($pattern, 'Compiled')
      $set = new-object 'System.Collections.Generic.SortedDictionary[string, int]'
      $totalCount = 0
   }

   process
   {
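      # Reset per file so the finally block never closes a reader left over from a previous file.
      $reader = $null
      try
      {
        # Read line by line so only one line of the file is held in memory at a time.
        $reader = new-object IO.StreamReader $File.FullName

        while( $null -ne ($line = $reader.ReadLine()) )
        {
           $m = $regex.Match($line)
           if($m.Success)
           {
              # Using the captured value as a dictionary key deduplicates it.
              $set[$m.Groups[$group].Value] = 1
              $totalCount++
           }
        }
      }
      finally
      {
         if($null -ne $reader) { $reader.Close() }
      }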
   }

   end
   {
      new-object psobject -prop @{TotalCount = $totalCount; Unique = ([string[]]$set.Keys)}
   }
}

You can use it like this:

$results = dir *.log | ProcessFile -Pattern 'stuff (capturegroup)' -Group 1
"Total matches: $($results.TotalCount)"
$results.Unique | Out-File .\Results.txt
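
As a variation on the same streaming idea, the loop can be written with `[System.IO.File]::ReadLines` and a `SortedSet[string]`. This is a sketch under stated assumptions, not the answer's method: it assumes .NET 4 or later (PowerShell 3+), and the temp folder, file names, and log lines are invented for the demo.

```powershell
# Create two small hypothetical log files in a temp folder for the demo.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir 'a.log') -Value 'GET /x?WT.ets=111&y=1', 'GET /x?other=2'
Set-Content -Path (Join-Path $dir 'b.log') -Value 'GET /x?WT.ets=222 z', 'GET /x?WT.ets=111&y=3'

# Compile the regex once, outside the loops.
$regex = New-Object System.Text.RegularExpressions.Regex @('WT\.ets=([^ &]+)', 'Compiled')
$unique = New-Object 'System.Collections.Generic.SortedSet[string]'
$total = 0

foreach ($file in Get-ChildItem -Path $dir -Filter *.log) {
    # File.ReadLines enumerates lazily, so the whole file is never loaded at once.
    foreach ($line in [System.IO.File]::ReadLines($file.FullName)) {
        $m = $regex.Match($line)
        if ($m.Success) {
            # The set ignores duplicates and keeps its values sorted.
            [void]$unique.Add($m.Groups[1].Value)
            $total++
        }
    }
}

"Total matches: $total"
"Unique values: $(@($unique) -join ', ')"

# Clean up the demo files.
Remove-Item -Recurse -Force $dir
```

With the sample files above, three lines match and two distinct values (111 and 222) are collected.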

Regarding powershell - Searching many large text files in PowerShell, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12122690/
