gpt4 book ai didi

Unix commands to fill gaps between dates in a file(用于填充文件中日期之间的间隙的Unix命令)

转载 作者:bug小助手 更新时间:2023-10-24 18:36:00 26 4
gpt4 key购买 nike



I want to find the records with same name and age but different date. Then if there is a gap between the prev date and next line date , i have to fill the gap.

我想找名字和年龄相同但日期不同的记录。然后,如果上一行日期和下一行日期之间有差距,我必须填补这一差距。


Sample data
file.txt

示例数据文件.txt


20230907,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230922,Daniel,28, Photographer

What I did so far:

到目前为止,我所做的是:


#create zero byte file for all filled gaps
cat /dev/null > fillGap.txt
For line in `awk -F"," '{print $2","$3}' file.txt`;do
#if name,age NOT found in fillGap.txt then grep everything that matches in file.txt
if [[ -z `grep -w ${line} fillGap.txt` ]];then
grep -w ${line} file.txt > MatchNameAge.txt
#this is the part of checking if there is a gap between dates and if so, gaps will be filled. I haven't figured it out on how I should do it. maybe you could help me
#after filling the gap, the transformed will be appended in fileGap.txt
else
#if name,age have already found in filledGap.txt, there's nothing to do.
fi
done

#my code lacks a lot of context cause I only have the idea and keep on trying piece by piece on how I code it.

#我的代码缺乏很多上下文,因为我只有一个想法,并不断尝试如何编写它。


This idea might works fine but, using for loop for large files would cause long running.

这个想法可能工作得很好,但是,对大文件使用for循环会导致长时间运行。


Can you help me achieve goal below with shorter runtime as possible?

你能帮我在尽可能短的时间内实现下面的目标吗?


Desired Output:

所需输出:


20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28, Photographer

I'd like yo add such cases in my data like below. What if I have a date that is end of the month and the next date is the next month

我希望你在我的数据中添加这样的案例,如下所示。如果我的日期是月底,下一个日期是下个月,该怎么办?


20230930,Allan,29,Programmer
20231004,Allan,29,Engineer

Output should look like this

输出应如下所示


20230930,Allan,29,Programmer
20231001,Allan,29,Programmer
20231002,Allan,29,Programmer
20231003,Allan,29,Programmer
20231004,Allan,29,Engineer

What matters is that filling gap between dates with same "name,age"

重要的是,用相同的“姓名、年龄”来填补日期之间的差距。


更多回答

is the input file guaranteed to already be sorted by date + name + age?

是否保证输入文件已按日期+名称+年龄排序?

@markp-fuso no. The file is not guaranteed to be already sorted

@markp-fuso no.不能保证该文件已排序

You should have included some interesting cases in your sample input/output such as same name but different age and, more importantly, dates that cross month and year boundaries, especially across Feb 29 in a leap year. What you currently have would only test the most basic sunny-day cases so a script that produces the expected output you provided from the sample input you provided could be completely wrong for other dates.

您应该在您的样例输入/输出中包含一些有趣的案例,比如同名但不同的年龄,更重要的是,日期跨越月和年的界限,特别是在闰年的2月29日。您目前所拥有的只会测试最基本的情况,因此从您提供的样例输入生成预期输出的脚本对于其他日期可能是完全错误的。

优秀答案推荐

Using any sort plus GNU awk for time functions and gensub():

对时间函数和genSub()使用Any Sort和GNU awk:


$ cat tst.sh
#!/usr/bin/env bash

sort -t, -k2,2 -k1,1n "${@:--}" |
awk '
BEGIN { FS=OFS="," }
($2 == p[2]) && ($3 == p[3]) {
for ( date=nextDate(p[1]); date<$1; date=nextDate(date) ) {
sub(/[^,]+/,date,p[0])
print p[0]
}
}
{
print
split($0,p)
p[0] = $0
}

function nextDate(date,secs) {
date = gensub(/(.{4})(..)(..)/,"\\1 \\2 \\3",1,date)
secs = mktime(date " 12 0 0") + (24 * 60 * 60)
return strftime("%Y%m%d",secs)
}
'


$ ./tst.sh file.txt
20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28, Photographer

Alternatively, using any awk just replace the nextDate() function above with this version:

或者,使用任何awk只需将上面的nextDate()函数替换为以下版本:


    function nextDate(date,     i,tmp,year,month,day,maxDays) {
year = substr(date,1,4)+0
month = substr(date,5,2)+0
day = substr(date,7,2)+0

# see https://www.timeanddate.com/date/leapyear.html
if ( (month == 2) && !(year % 4) && ((year % 100) || !(year % 400)) ) {
maxDays = 29
}
else {
if ( !(1 in daysIn) ) {
split("1 31 2 28 3 31 4 30 5 31 6 30 7 31 8 31 9 30 10 31 11 30 12 31",tmp," ")
for ( i=1; i in tmp; i+=2 ) {
daysIn[i] = tmp[i+1]
}
}
maxDays = daysIn[month]
}

if ( ++day > maxDays ) {
day = 1
if ( ++month > 12 ) {
month = 1
year++
}
}

return sprintf("%04d%02d%02d",year,month,day)
}


Dates are a bit tricky to deal with in awk, so I tend to write dedicated functions to deal with them.

在awk中处理日期有点麻烦,所以我倾向于编写专门的函数来处理它们。


Here is an example solution that assumes a sorted input (tested with gawk and mawk):

下面是一个示例解决方案,它假定输入是排序的(使用gawk和mawk进行了测试):


parse.awk

Parse.awk


BEGIN { 
sid = 60 * 60 * 24; # Seconds in a day
FS = OFS = ",";
}

# Convert date to seconds-since-the-unix-epoch (Jan 1, 1970)
function getdate(d) {
year = substr(d, 1, 4);
month = substr(d, 5, 2);
day = substr(d, 7, 2);

return mktime(year" "month" "day" 00 00 00");
}

# Add one day to date
function addday(d) {
return strftime("%Y%m%d", getdate(d) + sid);
}

($2 SUBSEP $3) in nameage {
d = nameage[$2,$3];

while (addday(d) < $1) {
print (addday(d), $2, $3, occupation[$2,$3]);
d = addday(d);
}
}

{
nameage[$2,$3] = $1;
occupation[$2,$3] = $4;
}

1;

Run it like this:

按如下方式运行:


awk -f parse.awk infile

Output:

产出:


20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28, Photographer


Solution in TXR:

TXR中的解决方案:


$ txr filldate.txr data
20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28, Photographer

Code in filldate.txr:

填充日期.txr中的代码:


@(repeat)
@ (cases)
@date0,@name,@age,@rest0
@ (trailer)
@date1,@name,@age,@rest1
@ (bind dates @[giterate (op nequal date1)
(opip
(time-parse "%Y%m%d")
.(time-utc)
(+ 86400)
(time-str-utc "%Y%m%d"))
date0])
@ (output)
@ (repeat)
@dates,@name,@age,@rest0
@ (end)
@ (end)
@ (or)
@line
@ (do (put-line line))
@ (end)
@(end)

Colorized with Vim:

用Vim上色:


Colorized with Vim



This might work for you (GNU sed and date):

这可能对你有用(GNU sed和date):


sed -E '/\n/!N
/(,[^,]*,[^,]*,).*\n.*\1/{/^([^,]*,).*\n\1/D
s/^([^,]*)(,.*\n)/echo "&$(date -d "\1 +1 day" +%Y%m%d)\2"/e}
P;D' file

Use the N;P;D sed idiom which appends the next line to the current line, prints/deletes the first line and repeats.

使用N;P;D sed成语,将下一行添加到当前行,打印/删除第一行,然后重复。


In this case, append the next line if one or no other lines in the pattern space.

在这种情况下,如果模式空间中只有一行或没有其他行,则追加下一行。


Compare the second and third fields of those lines and if the same:

比较这些行的第二个和第三个字段,如果相同:


Compare the dates of those lines and if the same then delete the first line and repeat.

比较这些行的日期,如果相同,则删除第一行并重复。


If the dates are not the same, insert a line with tomorrows date between the first and second lines.

如果日期不同,请在第一行和第二行之间插入带有明天日期的行。


Regardless, print/delete the first line and repeat.

无论如何,打印/删除第一行,然后重复。


N.B. The brunt of the work is done by the evaluated substitution which needs to interpolate an echo command.

注意:主要工作是由评估的替换完成的,该替换需要插入回声命令。




If the file is not sorted use:

如果未对文件进行排序,请使用:


sort -t, -k2,3 -k1,1 file |
sed -E '/\n/!N
/(,[^,]*,[^,]*,).*\n.*\1/{/^([^,]*,).*\n\1/D
s/^([^,]*)(,.*\n)/echo "&$(date -d "\1 +1 day" +%Y%m%d)\2"/e}
P;D'


Using :

使用Sqlite3:


#!/bin/bash

# create header in file.txtA
a=$(head -1 file.txt | sed -e 's/[^,]*,*/#/g')
b=abcdefgh;
c=$(echo ${b:0:${#a}} | sed -E 's/(.)/\1,/g')
c=${c%?}
sed -i -e "1i$c" file.txt

# create output
sql=".import file.txt file
update file set a=date(substr(a,1,4)||'-'||substr(a,5,2)||'-'||substr(a,7,2));
with cte as (
select a,b,c,d,lead(a) over (partition by b order by a) as m from file
union all
select date(a,'1 day'),b,c,d,m
from cte
where date(a,'1 day') < m
)
select replace(a,'-','') as a,b,c,d
from cte
order by a,b;
"
db=$(mktemp)
echo "$sql" | sqlite3 -csv $db

# strip header and remove db
sed -i -e '1d' file.txt
rm -f $db


  • A temporary file is created for the database (using: mktemp)

  • Headers are inserted into file.txt. This is done using the most simple name I could think of, so the first column will be named a, the second b, ... (see NOTE)

  • In the SQL part the dates are formatted so 20230907 becomes 2023-09-07. This is so SQLite can calculate tomorrow using date('2023-09-07','1 day').

  • When querying the - signs in the date are removed again.

  • The cte add records based on the value of lead(a) over (partition by b order by a) which is the next data available for a person.

  • Because I change original input file (adding a header), I also remove this line at the end of the script. (see NOTE)

  • Finally the temporary database is removed.


NOTE: I should not change the original input, but use another tempfile for that...

注意:我不应该更改原始输入,而是使用另一个临时文件...


NOTE2: Because CSV is a weird format, output is not what you might expect for the Photographer. Output will be " Photographer". For more reading on this see: Spaces between separator and value in csv file

注2:因为CSV是一种奇怪的格式,所以输出结果并不是你对摄影师的期望。输出将是“摄影师”。有关详细信息,请参阅:CSV文件中的分隔符和值之间的空格



Here's a way to do it without needing to call mktime() and strftime() nonstop at the expense of lots of regex (it should be able to handle all leap year calculations) ::

以下是一种不需要不停地调用mktime()和strftime()而以大量正则表达式为代价的方法(它应该能够处理所有的闰年计算):


echo ' 
20230907,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20240317,Daniel,28, Photographer' |



mawk '
function ______(__, ___, ____, _____, _) {
__ = (_ = "")__
____ = (_)____

while((__ = sprintf("%d",__ += __~/([01][1-9]|10|2[0-7])$/ ? 1 \
: __~ /1231$/ ? 8870 \
: __~/(11|[469])30$/ ? 71 : __~ /[013578]31$/ ? 70 \
: __~/22[89]$/ && __~((_ = \
__~"(([2468][048]|[13579][26]|(^|[0+-])[48])(00)?|"\
"0000|^[+-]?0*)....$") ? "9$" : "8$") ? 73-_ : 1)) < ____)
print __,
___, _____

} BEGIN { FS = (_ = "[ \t]*") (____ = OFS = ",")_ } {

if (___[_ = $2____$3] && __[_] + 1 < +$1)
______(__[_], _, $1, ___[_])

print __[_] = $1, _, ___[_] = $4 }'



20230907,Allan,29,Marketing
20230908,Allan,29,Marketing
20230909,Allan,29,Marketing
20230910,Allan,29,Marketing
20230911,Allan,29,Marketing
20230912,Allan,29,VirtualAssistant
20230913,Allan,29,Programmer
20230920,Daniel,28,Engineer
20230921,Daniel,28,Engineer
20230922,Daniel,28,Engineer
20230923,Daniel,28,Engineer
20230924,Daniel,28,Engineer
20230925,Daniel,28,Engineer
20230926,Daniel,28,Engineer
20230927,Daniel,28,Engineer
20230928,Daniel,28,Engineer
20230929,Daniel,28,Engineer
20230930,Daniel,28,Engineer
20231001,Daniel,28,Engineer
20231002,Daniel,28,Engineer
20231003,Daniel,28,Engineer
20231004,Daniel,28,Engineer
20231005,Daniel,28,Engineer
20231006,Daniel,28,Engineer
20231007,Daniel,28,Engineer
20231008,Daniel,28,Engineer
20231009,Daniel,28,Engineer
20231010,Daniel,28,Engineer
20231011,Daniel,28,Engineer
20231012,Daniel,28,Engineer
20231013,Daniel,28,Engineer
20231014,Daniel,28,Engineer
20231015,Daniel,28,Engineer
20231016,Daniel,28,Engineer
20231017,Daniel,28,Engineer
20231018,Daniel,28,Engineer
20231019,Daniel,28,Engineer
20231020,Daniel,28,Engineer
20231021,Daniel,28,Engineer
20231022,Daniel,28,Engineer
20231023,Daniel,28,Engineer
20231024,Daniel,28,Engineer
20231025,Daniel,28,Engineer
20231026,Daniel,28,Engineer
20231027,Daniel,28,Engineer
20231028,Daniel,28,Engineer
20231029,Daniel,28,Engineer
20231030,Daniel,28,Engineer
20231031,Daniel,28,Engineer
20231101,Daniel,28,Engineer
20231102,Daniel,28,Engineer
20231103,Daniel,28,Engineer
20231104,Daniel,28,Engineer
20231105,Daniel,28,Engineer
20231106,Daniel,28,Engineer
20231107,Daniel,28,Engineer
20231108,Daniel,28,Engineer
20231109,Daniel,28,Engineer
20231110,Daniel,28,Engineer
20231111,Daniel,28,Engineer
20231112,Daniel,28,Engineer
20231113,Daniel,28,Engineer
20231114,Daniel,28,Engineer
20231115,Daniel,28,Engineer
20231116,Daniel,28,Engineer
20231117,Daniel,28,Engineer
20231118,Daniel,28,Engineer
20231119,Daniel,28,Engineer
20231120,Daniel,28,Engineer
20231121,Daniel,28,Engineer
20231122,Daniel,28,Engineer
20231123,Daniel,28,Engineer
20231124,Daniel,28,Engineer
20231125,Daniel,28,Engineer
20231126,Daniel,28,Engineer
20231127,Daniel,28,Engineer
20231128,Daniel,28,Engineer
20231129,Daniel,28,Engineer
20231130,Daniel,28,Engineer
20231201,Daniel,28,Engineer
20231202,Daniel,28,Engineer
20231203,Daniel,28,Engineer
20231204,Daniel,28,Engineer
20231205,Daniel,28,Engineer
20231206,Daniel,28,Engineer
20231207,Daniel,28,Engineer
20231208,Daniel,28,Engineer
20231209,Daniel,28,Engineer
20231210,Daniel,28,Engineer
20231211,Daniel,28,Engineer
20231212,Daniel,28,Engineer
20231213,Daniel,28,Engineer
20231214,Daniel,28,Engineer
20231215,Daniel,28,Engineer
20231216,Daniel,28,Engineer
20231217,Daniel,28,Engineer
20231218,Daniel,28,Engineer
20231219,Daniel,28,Engineer
20231220,Daniel,28,Engineer
20231221,Daniel,28,Engineer
20231222,Daniel,28,Engineer
20231223,Daniel,28,Engineer
20231224,Daniel,28,Engineer
20231225,Daniel,28,Engineer
20231226,Daniel,28,Engineer
20231227,Daniel,28,Engineer
20231228,Daniel,28,Engineer
20231229,Daniel,28,Engineer
20231230,Daniel,28,Engineer
20231231,Daniel,28,Engineer
20240101,Daniel,28,Engineer
20240102,Daniel,28,Engineer
20240103,Daniel,28,Engineer
20240104,Daniel,28,Engineer
20240105,Daniel,28,Engineer
20240106,Daniel,28,Engineer
20240107,Daniel,28,Engineer
20240108,Daniel,28,Engineer
20240109,Daniel,28,Engineer
20240110,Daniel,28,Engineer
20240111,Daniel,28,Engineer
20240112,Daniel,28,Engineer
20240113,Daniel,28,Engineer
20240114,Daniel,28,Engineer
20240115,Daniel,28,Engineer
20240116,Daniel,28,Engineer
20240117,Daniel,28,Engineer
20240118,Daniel,28,Engineer
20240119,Daniel,28,Engineer
20240120,Daniel,28,Engineer
20240121,Daniel,28,Engineer
20240122,Daniel,28,Engineer
20240123,Daniel,28,Engineer
20240124,Daniel,28,Engineer
20240125,Daniel,28,Engineer
20240126,Daniel,28,Engineer
20240127,Daniel,28,Engineer
20240128,Daniel,28,Engineer
20240129,Daniel,28,Engineer
20240130,Daniel,28,Engineer
20240131,Daniel,28,Engineer
20240201,Daniel,28,Engineer
20240202,Daniel,28,Engineer
20240203,Daniel,28,Engineer
20240204,Daniel,28,Engineer
20240205,Daniel,28,Engineer
20240206,Daniel,28,Engineer
20240207,Daniel,28,Engineer
20240208,Daniel,28,Engineer
20240209,Daniel,28,Engineer
20240210,Daniel,28,Engineer
20240211,Daniel,28,Engineer
20240212,Daniel,28,Engineer
20240213,Daniel,28,Engineer
20240214,Daniel,28,Engineer
20240215,Daniel,28,Engineer
20240216,Daniel,28,Engineer
20240217,Daniel,28,Engineer
20240218,Daniel,28,Engineer
20240219,Daniel,28,Engineer
20240220,Daniel,28,Engineer
20240221,Daniel,28,Engineer
20240222,Daniel,28,Engineer
20240223,Daniel,28,Engineer
20240224,Daniel,28,Engineer
20240225,Daniel,28,Engineer
20240226,Daniel,28,Engineer
20240227,Daniel,28,Engineer
20240228,Daniel,28,Engineer
20240229,Daniel,28,Engineer
20240301,Daniel,28,Engineer
20240302,Daniel,28,Engineer
20240303,Daniel,28,Engineer
20240304,Daniel,28,Engineer
20240305,Daniel,28,Engineer
20240306,Daniel,28,Engineer
20240307,Daniel,28,Engineer
20240308,Daniel,28,Engineer
20240309,Daniel,28,Engineer
20240310,Daniel,28,Engineer
20240311,Daniel,28,Engineer
20240312,Daniel,28,Engineer
20240313,Daniel,28,Engineer
20240314,Daniel,28,Engineer
20240315,Daniel,28,Engineer
20240316,Daniel,28,Engineer
20240317,Daniel,28,Photographer

更多回答

Clever! I would have mindlessly implemented the POSIX versions of mktime and strftime when in fact you don't even need to convert the date at all. +1

聪明的!我会轻率地实现mktime和strftime的POSIX版本,而实际上您根本不需要转换日期。+1

@Fravadona If you implemented those anyway and put them under a BSD2 license, I could include them in cppawk, where they would be available via some new header file (which would suppress the definitions on GNU Awk).

@Fravadona如果您无论如何都要实现它们,并将它们放在BSD2许可证下,我可以将它们包含在cppawk中,在cppawk中可以通过一些新的头文件获得它们(这将抑制GNU Awk上的定义)。

@Kaz my implementation of mktime doesn't handle the timezones, and strftime doesn't return a formatted string but fills an array with the components of the date instead. Supporting the timezones and the format string would need a lot of code

@Kaz我的mktime实现不处理时区,strftime不返回格式化字符串,而是用日期的组成部分填充数组。支持时区和格式字符串需要大量代码

mktime and strftime are GNU specific, but OP's "UNIX" might have it. It's still possible to implement the same functionalities with a POSIX awk tho

Mktime和strftime是特定于GNU的,但是OP的“Unix”可能有它。它仍然可以使用POSIX awk来实现相同的功能

@Fravadona: This works at least with gawk, mawk and nawk

@Fravadona:这至少适用于Gawk、Mawk和Nawk

@Fravadona: I take it back, this does not work in nawk because of the lacking time functions. Adding a note

@Fravadona:我收回我的话,这在nawk不起作用,因为缺乏时间功能。添加便笺

CSV isn't particularly weird IMO. Per RFC4180 it's OK in CSV to have spaces at the start or end of a field as long as there aren't then double quotes inside the field with the spaces outside of the quotes. So ,<blank>Photographer as the OP has is fine and so is ,"<blank>Photographer" but not ,<blank>"Photographer".

CSV并不是特别奇怪的IMO。根据RFC4180,在CSV中,只要字段内没有双引号,引号外有空格,在字段的开始或结束处留有空格是可以的。因此,OP中的“摄影师”和“摄影师”都很好,但不是“摄影师”。

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com