
powershell - Using PowerShell, how do I split a large CSV into multiple CSVs by cherry-picking which fields go to which CSV?

Reposted. Author: 行者123. Updated: 2023-12-02 23:56:36

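I have a CSV with a few hundred million rows. It has no header, is tab-delimited, and has about 40 fields.
I need to split this CSV into 6 smaller CSVs.
The fields for each smaller CSV will be cherry-picked, as in this example: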

Output csv#1 uses fields 1, 4, 23, 16

Output csv#2 uses fields 2, 3, 5, 24, 34

Output csv#3 uses fields 6, 25, 20, 16

Output csv#4 uses fields 10, 29, 31, 32, 34, 35

Etc.


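Several of the original fields will not be used.
I have happily read many posts from other users about splitting a single CSV into multiple CSVs, but I have not yet stumbled upon one that hints at how to specify which fields go into which CSV.
I assume I should use StreamReader/StreamWriter and '`t' to detect the delimiter. I don't care whether I have to run over the source file once per output rather than doing everything in a single pass. I expect this to run overnight (or over a weekend!).
I just haven't figured out how to select the fields for each output CSV.
Thanks for any ideas!
-=Bruce
Here is my sample CSV. This sample is comma-delimited rather than tab-delimited. I can handle the change to tab-delimited in any code.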
12/22/2018 16:00,Something,,4503333883,137.11.21.251,113,,Tackov,SO,1968974515,192.168.4312.179,37468,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,RST,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,With Stackoverflow all things are possible,0,7,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something2,stackover,2018138499,120.43.212.240,34749,,Tackov,SO,1968982930,192.168.4345.146,4545,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something3,Overflows,3203409083,12.45.36.187,60170,,Tackov,SO,1968976181,192.168.4319.53,4545,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something4,,1390242054,82.221.105.6,59922,,Tackov,SO,1968986785,192.168.4360.161,88,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something7,Ackover,1245693210,11.63.197.26,60300,,Tackov,SO,1968982959,192.168.4345.175,99,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
12/22/2018 16:00,Something2,Koverflo,3026283045,199.97.106.37,47861,,Tackov,SO,1968982790,192.168.4345.6,443,QWE,,172166316,99.0.0.0-99.255.255.255 (STACK),3011,overflow_internet,stack-overflow002-stack.over.flow,stack_over_flow,1,,CISCO,ASA,inside,US/Eastern,STACK2 Type,,STACK1 Flow,,Forty4Fortnights,,QWE FourceScore,,ABC,,AndForty,,Fries,,With That,,Shake,,14012,Lunch ,with Google all things are possible ABC,0,5,/dot,/Access,/Dot/Com/Service,/Win,/Fail/Win
In case the data does not copy and paste well, every line ends with the characters "/Fail/Win".
=-=-=-=-=-=-=-=-==-
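7 hours later...
I think I have everything except one small detail. When reading and writing this way, I cannot separate my columns. My test file now reads and writes in 29.1 seconds. Yay! But all it does is read each line and write it straight back out, which is not very useful.
Per the "Tessalating solution", the variable $Header holds the headers I want to split out in my test code. I have not figured out how to specify the delimiter, or where to put the $Header variable.
Thanks for any help.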
clear
$sw = new-object System.Diagnostics.Stopwatch
$sw.Start()
$Header = "Select-Object -Property 'SourceGeoCountryName','SourceGeoLocationInfo','SourceAddress','SourceHostName','SourcePort','SourceDnsDomain'"
$MyPath = "C:\Users\myusername\SplitOuts\"
$InputFile ="mysourcefile.csv"
$OutputFileName = "table1.csv"
$FileName = $MyPath + $InputFile
$OutputFile = $MyPath + $OutputFileName
Set-Location $MyPath
$reader = $null
If(Test-path $OutputFile) {Clear-Content $OutputFile}
$stream=[System.IO.StreamWriter] $OutputFile
ForEach ($line in [System.IO.File]::ReadLines($FileName)) {
    $stream.WriteLine($line)
}
$stream.close()
$sw.Stop()
Write-Host "Completed in " $sw.Elapsed.TotalSeconds.ToString("00.0") "seconds" -ForegroundColor Yellow
-=Bruce

Best answer

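Since you are willing to run it over a weekend, i.e. speed does not matter, I see no need to avoid Import-Csv right away. Specify some meaningful headers and do the following: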

$Header = 'Date', 'IP', 'CustName',
          'Things', 'Code', 'AccountNum',
          'Whatever' # etc.

Import-Csv -Path c:\temp\input.tsv -Delimiter "`t" -Header $Header | ForEach-Object {

    # small CSV 1
    $_ | Select-Object -Property 'Date', 'Code', 'Whatever' |
        Export-Csv -Path c:\temp\output1.csv -Append -Delimiter "`t" -NoTypeInformation

    # small CSV 2
    $_ | Select-Object -Property 'AccountNum', 'IP', 'CustName' |
        Export-Csv -Path c:\temp\output2.csv -Append -Delimiter "`t" -NoTypeInformation

    #... etc.

}

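This is not the most efficient approach: the output files are opened and closed many times, and every row carries overhead and object churn. It is very straightforward, though. Name the fields, then select the ones you want in each file.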
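The test script in the question stores an entire `Select-Object ...` command in a string, which PowerShell does not run as a command. One possible alternative is to keep each output's property list as an array in a lookup table and make one pass over the input per output file, which the asker said would be acceptable. This also avoids reopening each output with `-Append` for every row. The sketch below is minimal, and its column names, sample rows, and temp-folder paths are all made up for illustration; none of them come from the original post.

```powershell
# Hypothetical column names (the real file has about 40 fields)
$Header = 'Date', 'IP', 'CustName', 'Things', 'Code', 'AccountNum', 'Whatever'

# Made-up demo input written to a temp folder so the sketch is self-contained
$tmp = Join-Path ([System.IO.Path]::GetTempPath()) 'split-demo'
New-Item -ItemType Directory -Path $tmp -Force | Out-Null
$inputPath = Join-Path $tmp 'input.tsv'
Set-Content -Path $inputPath -Value @(
    "2018-12-22`t10.0.0.1`tAlice`tfoo`tA1`t1001`tx"
    "2018-12-22`t10.0.0.2`tBob`tbar`tB2`t1002`ty"
)

# Output file name -> property names to keep (arrays, not a command string)
$outputs = [ordered]@{
    'output1.csv' = 'Date', 'Code', 'Whatever'
    'output2.csv' = 'AccountNum', 'IP', 'CustName'
}

# One full pass over the input per output file; each output is opened once
foreach ($name in $outputs.Keys) {
    Import-Csv -Path $inputPath -Delimiter "`t" -Header $Header |
        Select-Object -Property $outputs[$name] |
        Export-Csv -Path (Join-Path $tmp $name) -Delimiter "`t" -NoTypeInformation
}
```

Note that Export-Csv writes a header row and quotes values by default, so the outputs are not headerless like the source file.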

Edit: a StreamReader/StreamWriter approach, which will hopefully run faster:
$out1 = [System.IO.File]::CreateText('d:\test\out1.csv')
$out2 = [System.IO.File]::CreateText('d:\test\out2.csv')

$in = [System.IO.File]::OpenText('d:\test\input.csv')
while ($in.Peek() -ge 0) {

    # read and split next line by tab
    $fields = $in.ReadLine() -split "`t"

    # select some fields by number, join them into a new line
    # write to small csv1
    $f1 = $fields[0,3,5] -join "`t"
    $out1.WriteLine($f1)

    # same for csv2
    $f2 = $fields[1,2,7] -join "`t"
    $out2.WriteLine($f2)

    # ..
    # etc.

}
$out1.Close()
$out2.Close()
$in.Close()
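The question lists fields by 1-based number, while PowerShell array indexes start at 0, so with six outputs it may be easier to drive the same streaming loop from a lookup table. The following is a minimal sketch of that idea, not the answerer's code: the file names, the field picks, and the demo rows are hypothetical, and rows with fewer fields than expected are not validated.

```powershell
# Made-up demo input: two tab-separated rows with 8 fields each
$dir = Join-Path ([System.IO.Path]::GetTempPath()) 'split-demo-stream'
[void][System.IO.Directory]::CreateDirectory($dir)
$inputPath = Join-Path $dir 'input.tsv'
$rows = @(
    'a1','a2','a3','a4','a5','a6','a7','a8' -join "`t"
    'b1','b2','b3','b4','b5','b6','b7','b8' -join "`t"
)
[System.IO.File]::WriteAllLines($inputPath, [string[]]$rows)

# Output file name -> 1-based field numbers, in the order they should be written
$map = [ordered]@{
    'out1.tsv' = 1, 4, 6
    'out2.tsv' = 2, 3, 8
}

# Convert to 0-based indexes once and open one writer per output file
$targets = foreach ($name in $map.Keys) {
    [pscustomobject]@{
        Index  = [int[]]($map[$name] | ForEach-Object { $_ - 1 })
        Writer = [System.IO.File]::CreateText((Join-Path $dir $name))
    }
}

$reader = [System.IO.File]::OpenText($inputPath)
try {
    # Single pass: split each line once, then write the picked fields to every output
    while ($null -ne ($line = $reader.ReadLine())) {
        $fields = $line.Split([char]"`t")
        foreach ($t in $targets) {
            $t.Writer.WriteLine($fields[$t.Index] -join "`t")
        }
    }
}
finally {
    # Close everything even if the loop throws part-way through
    $reader.Dispose()
    foreach ($t in $targets) { $t.Writer.Dispose() }
}
```

Adding another output file is then a one-line change to `$map`, and the `finally` block ensures the writers are flushed and closed if a run is interrupted by an error.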

Regarding "powershell - Using PowerShell, how do I split a large CSV into multiple CSVs by cherry-picking which fields go to which CSV?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43524331/
