
sorting - Combining 2 large sorted CSV files into one file

Reposted. Author: IT王子. Updated: 2023-10-29 02:24:42

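I have an archive file that is about 50GB in size.

Every week, I have to take a new CSV file and merge it with the very large 50GB CSV file.

I'm new to Go and am hoping for a nice, resilient solution in Go.

The files look like: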

"a:123", 101010
"b:123", 101010
"some-key-here:123", 101010
"some-key-here:234", 101010

Best answer

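Although I haven't compiled this myself to check, it should do what you want once you implement the compare() function for your sort order. It is essentially the "merge" step of the Mergesort algorithm. Since both of your files are already in sorted order, the merge step is all you need, and it can be done in a streaming fashion, holding only one record from each file in memory at a time.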

package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"
)

const outFile = "your/output/file/path.ext"

func main() {
	// make sure there are exactly 2 file arguments
	if len(os.Args) != 3 {
		log.Panic("\nUsage: command file1 file2")
	}

	// open the first file
	f1, e := os.Open(os.Args[1])
	if e != nil {
		log.Panic("\nUnable to open first file: ", e)
	}
	defer f1.Close()

	// open the second file
	f2, e := os.Open(os.Args[2])
	if e != nil {
		log.Panic("\nUnable to open second file: ", e)
	}
	defer f2.Close()

	// create the output file
	w, e := os.Create(outFile)
	if e != nil {
		log.Panic("\nUnable to create new file: ", e)
	}
	defer w.Close()

	// wrap the file readers with CSV readers
	cr1 := csv.NewReader(f1)
	cr2 := csv.NewReader(f2)

	// wrap the out file writer with a CSV writer
	cw := csv.NewWriter(w)

	// initialize the lines
	line1, b := readline(cr1)
	if !b {
		log.Panic("\nNo CSV lines in file 1.")
	}
	line2, b := readline(cr2)
	if !b {
		log.Panic("\nNo CSV lines in file 2.")
	}

	// copy the records following the rules of the merge step in Mergesort
	for {
		if compare(line1, line2) {
			writeline(cw, line1)
			if line1, b = readline(cr1); !b {
				// file 1 is exhausted: write the pending line from
				// file 2, then copy whatever is left of file 2
				writeline(cw, line2)
				copyRest(cr2, cw)
				break
			}
		} else {
			writeline(cw, line2)
			if line2, b = readline(cr2); !b {
				// file 2 is exhausted: write the pending line from
				// file 1, then copy whatever is left of file 1
				writeline(cw, line1)
				copyRest(cr1, cw)
				break
			}
		}
	}

	// the CSV writer is buffered, so flush it before the files are closed
	cw.Flush()
	if e = cw.Error(); e != nil {
		log.Panic("\nError flushing output file: ", e)
	}

	// note the files will be closed here, since we deferred it above
}

// readline returns the next record, or false once the reader hits EOF
func readline(r *csv.Reader) ([]string, bool) {
	line, e := r.Read()
	if e != nil {
		if e == io.EOF {
			return nil, false
		}
		log.Panic("\nError reading file: ", e)
	}
	return line, true
}

// writeline writes one record to the CSV writer
func writeline(w *csv.Writer, line []string) {
	e := w.Write(line)
	if e != nil {
		log.Panic("\nError writing file: ", e)
	}
}

// copyRest copies every remaining record from r to w
// (named copyRest so it does not shadow the builtin copy)
func copyRest(r *csv.Reader, w *csv.Writer) {
	for line, b := readline(r); b; line, b = readline(r) {
		writeline(w, line)
	}
}

func compare(line1, line2 []string) bool {
	/* here, determine if line1 and line2 are in the correct order (line1 first)
	   if so, return true, otherwise false

	   Placeholder, assuming the files are sorted by the key in the first
	   column using plain byte-wise string order; adjust this to match how
	   your files are actually sorted.
	*/
	return line1[0] <= line2[0]
}

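Note: this answer was heavily edited to include the code inline rather than as a link. The code has also improved significantly since my first draft, but since there had been no activity here, I simply dropped the old version and rewrote my answer.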

Regarding "sorting - Combining 2 large sorted CSV files into one file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15669388/
