arrays - Reducing the processing time of a 'while read' loop


Reposted by: 行者123 Updated: 2023-12-04 13:05:28


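Shell scripting newbie here.

I have a huge CSV file in which field f11 varies in length, for example:

"000000aaad000000bhb200000uwwed..."
"000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew..."
..

After splitting the string into chunks of 10 characters, I need characters 6-9 of each chunk. Then I have to join them back together with the delimiter "|", like this:

0aaa|0bhb|uwwe...
0aba|bbrb|0wwq|caba|0bhb|0qwe...

and join the processed f11 back with the other fields.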
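As a worked example of the transformation described above, here is a minimal sketch for a single f11 value, built from the same fold and cut tools the question uses. The function name extract_chunks is hypothetical, and the sketch assumes the surrounding quotes are already removed and that the length is a multiple of 10.

```shell
# Hypothetical helper (not from the original post): split into 10-character
# chunks, keep characters 6-9 of each chunk, and join the pieces with "|".
extract_chunks() {
    # $1: the f11 string without its surrounding double quotes
    printf '%s\n' "$1" | fold -w10 | cut -c6-9 | paste -s -d '|' -
}

extract_chunks '000000aaad000000bhb200000uwwed'   # prints 0aaa|0bhb|uwwe
```

This runs one short pipeline per string and writes no temporary files, but it still starts three processes per record, so it is meant to show the expected result rather than to be fast.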

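This is the time taken to process 10K records ->

real 4m43.506s
user 0m12.366s
sys 0m12.131s

20K records ->

real 5m20.244s
user 2m21.591s
sys 3m20.042s

80K records (about 3.7 million f11 splits and joins with "|") ->

real 21m18.854s
user 9m41.944s
sys 13m29.019s

My expected time for processing 650K records is 30 minutes (about 56 million f11 splits and joins). Is there any way to optimize this?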

while read -r line1; do
    f10=$( echo $line1 | cut -d',' -f1,2,3,4,5,7,9,10)
    echo $f10 >> $path/other_fields
    
    f11=$( echo $line1 | cut -d',' -f11 )
    f11_trim=$(echo "$f11" | tr -d '"')
    echo $f11_trim | fold -w10 > $path/f11_extract 

    cat $path/f11_extract | awk '{print $1}' | cut -c6-9 >> $path/str_list_trim
    
    arr=($(cat $path/str_list_trim))
    printf "%s|" ${arr[@]} >> $path/str_list_serialized
    printf '\n' >> $path/str_list_serialized
    arr=()
    
    rm $path/f11_extract
    rm $path/str_list_trim

done < $input
sed -i 's/.$//' $path/str_list_serialized
sed -i 's/\(.*\)/"\1"/g' $path/str_list_serialized

paste -d "," $path/other_fields $path/str_list_serialized > $path/final_out

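Best answer

Your code is inefficient for the following reasons:

It invokes multiple external commands, including awk, inside the loop, so several processes are spawned for every record.
It generates many intermediate temporary files.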
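To illustrate the two points above, here is a minimal sketch of the same loop written with bash builtins only, so that no external command runs per record and no intermediate file is created. The function name process_csv is hypothetical. The sketch assumes bash, that no field contains an embedded comma, and that f11 is the 11th field. It will probably still be slower than a single awk pass on large inputs.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not from the original answer): reads CSV on stdin,
# writes fields 1,2,3,4,5,7,9,10 plus the converted f11 to stdout.
process_csv() {
    local f11 out i
    local -a f
    while IFS=, read -r -a f; do            # split each line on commas into array f
        f11=${f[10]//\"/}                   # 11th field with double quotes removed
        out=
        # Walk over complete 10-character chunks only
        for (( i = 0; i + 10 <= ${#f11}; i += 10 )); do
            out+=${out:+|}${f11:i+5:4}      # characters 6-9 of the chunk, "|" between pieces
        done
        printf '%s,%s,%s,%s,%s,%s,%s,%s,"%s"\n' \
            "${f[0]}" "${f[1]}" "${f[2]}" "${f[3]}" "${f[4]}" \
            "${f[6]}" "${f[8]}" "${f[9]}" "$out"
    done
}
```

It could be called as process_csv < "$input" > "$path/final_out", which opens the output file once instead of once per record.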

You can do the whole job with awk:

awk -F, -v OFS="," '                                    # assign input/output field separator to a comma
{
    len = length($11)                                   # length of the 11th field
    s = ""; d = ""                                      # clear output string and the delimiter
    for (i = 1; i <= len / 10; i++) {                   # iterate over the 11th field
        s = s d substr($11, (i - 1) * 10 + 6, 4)        # concatenate 6-9th substring of 10 characters long chunks
        d = "|"                                         # set the delimiter to a pipe character
    }
    $11 = "\"" s "\""                                   # assign the 11th field to the generated string
} 1' "$input"                                           # the final "1" tells awk to print all fields

Sample input:

1,2,3,4,5,6,7,8,9,10,000000aaad000000bhb200000uwwed
1,2,3,4,5,6,7,8,9,10,000000aba200000bbrb2000000wwqr00000caba2000000bhbd000000qwew

Output:

1,2,3,4,5,6,7,8,9,10,"0aaa|0bhb|uwwe"
1,2,3,4,5,6,7,8,9,10,"0aba|bbrb|0wwq|caba|0bhb|0qwe"
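The awk command above keeps all eleven fields and expects f11 without quotes, whereas the script in the question strips double quotes from f11 and keeps only fields 1,2,3,4,5,7,9,10. A possible variant that follows the question more closely is sketched below. The wrapper name convert_f11 is hypothetical, and the sketch assumes that no field contains an embedded comma.

```shell
# Hypothetical wrapper (not from the original answer): reads the named files,
# or stdin when no file is given, and prints the reduced records.
convert_f11() {
    awk -F, -v OFS="," '
    {
        f11 = $11
        gsub(/"/, "", f11)                              # remove double quotes, like tr -d in the question
        s = ""; d = ""
        for (i = 1; i <= int(length(f11) / 10); i++) {  # complete 10-character chunks only
            s = s d substr(f11, (i - 1) * 10 + 6, 4)    # characters 6-9 of chunk i
            d = "|"
        }
        # Same field selection as cut -f1,2,3,4,5,7,9,10 in the question
        print $1, $2, $3, $4, $5, $7, $9, $10, "\"" s "\""
    }' "$@"
}
```

For example, convert_f11 "$input" > "$path/final_out" would replace the loop, the two sed calls, and the paste call from the question in one pass.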

Regarding arrays - Reducing the processing time of a 'while read' loop, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69717681/
