
bash - How to remove whitespace between tags in a delimited file?

Reposted. Author: 行者123. Updated: 2023-12-02 18:18:53

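I dumped this table from a MySQL system, and although it follows the RFC standard, the dump seems to have added unwanted whitespace in the column that stores HTML text. For example: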

   "2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"
<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

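This is one of about 30K lines, so I am trying to find a smart way to remove the whitespace here between the " and the <div that follows it, and possibly others. I tried: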

awk '{$1=$1;printf $0}' 

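This works, but it merges everything into a single line, which is not what I want. I want to keep the line breaks between records in the CSV dump. I would love to hear your ideas on how to solve this.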

Best answer

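The following uses GNU awk for multi-char RS, RT, and gensub(). It will work even if your input file is huge, because it does not read the whole file into memory; it only reads one string at a time, each delimited by either "<spaces>< or a newline: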

$ awk -v RS='"\\s+<|\n' '{printf "%s%s", $0, gensub(/"\s+</,"\"<",1,RT)}' file
"2000","Something","Something,"Something","Something","Something","2017-11-15 15:12:51","115060","Something","Something","Something","Something","","Something","Something","Something","Tabuk","TKPR","999","Something","Something","103984","Something","Something","UTC+03:00","sameday","15","100","3","1443","1","Something","3","Something","<div style=""margin:1em;"">"<div lang=""en"dir=""ltr"style=""font-family: Microgramma;"">"

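I'm assuming that when you say "and possibly others" in your question you mean other cases like "<spaces><div>, where there is a ", then whitespace, then a tag starting with <, but that is obviously just a guess.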
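If GNU awk is not available, a similar result can be sketched in POSIX awk by holding one line back and gluing the next line onto it when the held line ends with " and the next one starts with <. This is only a rough alternative, not the command from the answer: it assumes the unwanted whitespace always contains a line break (as in the sample), and the function name join_tag_lines is made up for illustration.

```shell
#!/bin/sh
# join_tag_lines (hypothetical name): read a CSV dump on stdin and write it to
# stdout, joining any line that starts with "<" onto a previous line that ends
# with a double quote. Only one logical line is held in memory at a time.
join_tag_lines() {
    awk '
    NR > 1 {
        # Held line ends with a quote and this line starts with a tag: join them.
        if (prev ~ /"[ \t\r]*$/ && $0 ~ /^[ \t]*</) {
            sub(/[ \t\r]*$/, "", prev)   # drop trailing whitespace on the held line
            sub(/^[ \t]*/, "")           # drop leading whitespace on this line
            prev = prev $0
            next
        }
        print prev                       # held line is complete, emit it
    }
    { prev = $0 }                        # hold the current line back
    END { if (NR) print prev }           # flush the last held line
    '
}
```

It could be called as join_tag_lines < dump.csv > cleaned.csv, where both file names are placeholders. Unlike the gawk version, it does not touch " followed by spaces and < on the same line.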

Regarding "bash - How to remove whitespace between tags in a delimited file?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71141424/
