bash - Removing redundancy with bash

Reposted. Author: 行者123. Updated: 2023-12-04 15:20:43

I have this problem. I have the following lines:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

I want to delete every line whose parameters all appear on another line. Take these two lines, for example:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

I want to keep only this one:

http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

Since it has more parameters, the first one is redundant.

I want to keep these:

http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre&ca=2412

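I want to delete every line whose parameters also appear on another line, keeping the line with more parameters rather than the one with fewer.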

Another example:

I want to convert this:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123

into this:

http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103

The same parameters on different resources must stay on separate lines.

If I get this:

http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/content2/index.cfm?ID=123

I want to keep both of them.

Edit, August 19:

Another set of example URLs and how I would like them handled:

https://es.answers.search.yahoo.com/search?p=mixmail+correo&fr2=piv-web
https://es.answers.search.yahoo.com/search?p=educastur+campus&fr2=piv-web
https://techvalidation.dell.com/Default.aspx?id=9d459f5c-8a26-4268-b37c-23980a6ba577&Key=%2fuKb2WS3da4lk%2f34VSXE4F02YqS5LfvbKFGcDXNQxgIvvbodU3o3lHoNm09M67Ut&SRC=QuoteCenter&newsession=true
https://techvalidation.dell.com/technicalvalidationlist.aspx?key=6ivAYJco9bouAJBNkQ8rgtGWPdfLVRumAScf7bIb6DMpj6SYVdWy6bd4ITEPF4tQMkNzNpGshERZndX3Ia%2bbqhJ3CnrC46qJkHJ4TdiyN78%3d&PartnerAffinityId=3341728904&SRC=QuoteCenter
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
https://wwwmg.pandacn.ford.com/forms/frmservlet?config=pandacn3
https://www.panda.ford.com/forms/frmservlet?config=pandain4
https://www.panda.ford.com/forms/frmservlet?config=pandain3

It should output:

https://es.answers.search.yahoo.com/search?p=mixmail+correo&fr2=piv-web
https://techvalidation.dell.com/Default.aspx?id=9d459f5c-8a26-4268-b37c-23980a6ba577&Key=%2fuKb2WS3da4lk%2f34VSXE4F02YqS5LfvbKFGcDXNQxgIvvbodU3o3lHoNm09M67Ut&SRC=QuoteCenter&newsession=true
https://techvalidation.dell.com/technicalvalidationlist.aspx?key=6ivAYJco9bouAJBNkQ8rgtGWPdfLVRumAScf7bIb6DMpj6SYVdWy6bd4ITEPF4tQMkNzNpGshERZndX3Ia%2bbqhJ3CnrC46qJkHJ4TdiyN78%3d&PartnerAffinityId=3341728904&SRC=QuoteCenter
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr
https://wwwmg.pandacn.ford.com/forms/frmservlet?config=pandacn3
https://www.panda.ford.com/forms/frmservlet?config=pandain4

My approach only works for URLs with a single parameter:

https://www.panda.ford.com/forms/frmservlet?config=pandain4
https://www.panda.ford.com/forms/frmservlet?config=pandain3

I run cat list.txt | sort -u -t "=" -k 1,1 and I get:

https://www.panda.ford.com/forms/frmservlet?config=pandain4

But it fails for these:

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF
https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr

There I get:

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_LOGOFF

from cat list.txt | sort -u -t "=" -k 1,1, but I want the other line:

https://web.xnet.ford.com/3.0/samlerror?faultreason=SSO_ERROR&language=fr

because it has the same parameter plus more.

Regards!


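Best answer

Doing this correctly requires a lot of internal sorting, which in a bash loop spawns a lot of processes and slows the work down considerably.

Switched to perl. Note that this reorders both the arguments and the lines; if you need the original lines unmodified and/or in their original order, we would have to add another step or three. You should also note that you have both Knowledge and knowledge in your paths. A URL is case-insensitive through the port, but the path after that is case-sensitive, so those lines will not register as the same even if they get the same arguments.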

#!/usr/bin/env perl

use strict;    # I ALWAYS use strict and warnings unless
use warnings;  # there is some compelling reason not to.

my $file = shift // 'urls';                           # input file from the command line; defaults to 'urls'
open my $fh, '<', $file or die "$file: $!";
my %urlsOUT;
while ( <$fh> ) { chomp;
  my %args;                                           # clean for each record
  m!^(https?://[^/]+)(/[^?]+)[?](.*)!i or next;       # catch the base in separate case sensitivities; skip lines that do not match
  my $base = lc($1).$2;                               # always lowercase the case insensitive part
  @args{ split /[?&]+/, $3 } = ();                    # removes duplicate args in a url
  my $args = join '&', reverse sort keys %args;       # reassemble ORDERED
  $urlsOUT{"$base?$args"} = '';                       # now a unique key
}
close $fh;

my $urlsOUT = '';
REC: foreach my $url ( reverse sort keys %urlsOUT ) { # ORDERED
  for ( split /[?&]/, $url ) {                        # for the base and each arg
    if ( $urlsOUT !~ /\b\Q$_\E\b/ ) {                 # if new (\Q..\E matches the text literally)
      $urlsOUT .= "$url\n";                           # keep this
      next REC;                                       # check next
    }
  }
}

print $urlsOUT;

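This consistently reorders and deduplicates all the arguments within each URL, deduplicates the resulting records, and then checks each remaining record (in descending order) to eliminate any that contain nothing an earlier record did not already have.

I named the program file tst and made two input files, tst1 and urls.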
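
The idea of "contains nothing new" can be illustrated with a small bash sketch. This is a simplification, not the script's exact check: it compares the name=value pairs of two query strings directly, whereas the Perl code tests each piece against all of the text kept so far. The function name params_subset is a hypothetical helper, and associative arrays assume bash 4 or later.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when every name=value pair in query $1
# also occurs in query $2, regardless of order.
params_subset() {
  local -A seen=()
  local -a small big
  local p
  IFS='&' read -r -a small <<< "$1"
  IFS='&' read -r -a big   <<< "$2"
  for p in "${big[@]}"; do
    [[ -z $p ]] && continue            # skip empties left by "&&"
    seen[$p]=1
  done
  for p in "${small[@]}"; do
    [[ -z $p ]] && continue
    [[ -n ${seen[$p]:-} ]] || return 1 # a pair that the other query lacks
  done
  return 0
}

params_subset 'cfid=11812682&cftoken=26157811&fa=conre' \
              'cfid=11812682&cftoken=26157811&fa=conre&ca=2412' && echo redundant
params_subset 'fa=PrtSlt&id=532&prTpID=5&' \
              'upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS' || echo distinct
```

The first call prints redundant and the second prints distinct, because fa=PrtSlt and id=532 are missing from the second query.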

$: cat tst1
http://test/foo?foo
http://test/foo?bar
http://test/foo?foo
http://test2/foo?foo
http://test2/foo?baz
http://test2/foo?foo&bar
http://test2/foo?baz
http://test/foo?foo&bar
http://test/foo?bar&foo
http://test2/foo?bar&foo
http://test3/foo?bar
http://test3/foo?foo&bar&baz
http://test2/foo?foo&bar&baz
http://test/foo?foo&bar&baz

$: ./tst tst1
http://test3/foo?foo&baz&bar
http://test2/foo?foo&baz&bar
http://test/foo?foo&baz&bar

$: cat urls
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/content/index.cfm?ID=123&foo=bar
http://grouplogic.com:80/store/index.cfm?cfid=11812682&cftoken=26157811&fa=conre
http://grouplogic.com:80/store/index.cfm?fa=PrtSlt&id=532&prTpID=5&
http://grouplogic.com:80/store/index.cfm?upTp=2&fa=upgrade&UpNewType=2&prTpID=5&&ptype=FS
http://grouplogic.com:80/news-events/index.cfm?fa=viewRelease&ID=21&prod=2
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&ProdID=1&archive=1
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?ID=123

$: ./tst urls
http://grouplogic.com:80/store/index.cfm?upTp=2&ptype=FS&prTpID=5&fa=upgrade&UpNewType=2
http://grouplogic.com:80/store/index.cfm?prTpID=5&id=532&fa=PrtSlt
http://grouplogic.com:80/store/index.cfm?fa=conre&cftoken=26157811&cfid=11812682
http://grouplogic.com:80/public/quickpoll/index.cfm?fuseaction=quickPollResults&QuestionID=8
http://grouplogic.com:80/news-events/index.cfm?prod=2&fa=viewRelease&ID=21
http://grouplogic.com:80/news-events/index.cfm?fa=viewNews&ID=390
http://grouplogic.com:80/knowledge/index.cfm?fuseaction=view&docID=10
http://grouplogic.com:80/content/index.cfm?fuseaction=faq_list&archive=1&ProdID=1
http://grouplogic.com:80/content/index.cfm?foo=bar&ID=123
http://grouplogic.com:80/content/index.cfm?ID=103
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&amp

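Note that the output is sorted in case-sensitive ASCII order (descending) and that trailing and duplicate/redundant ampersands have been cleaned up.

Using the internal read and sort in perl is also much faster.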
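
The Knowledge and knowledge lines both survive in the output above because only the part of the URL through the port is lowercased. A minimal bash sketch of that normalisation follows. The function name normalize_base and the sample URLs are assumptions for illustration, and the ${var,,} expansion needs bash 4 or later.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lowercase scheme://host:port and leave the path and query alone.
normalize_base() {
  local url=$1
  local re='^([A-Za-z]+://[^/]+)(.*)$'   # group 1: scheme, host, port; group 2: the rest
  if [[ $url =~ $re ]]; then
    local head=${BASH_REMATCH[1]} rest=${BASH_REMATCH[2]}
    printf '%s%s\n' "${head,,}" "$rest"
  else
    printf '%s\n' "$url"                 # not a URL this sketch recognises: pass through
  fi
}

normalize_base 'HTTP://GroupLogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111'
# http://grouplogic.com:80/Knowledge/index.cfm?fuseaction=view&docID=111
```

The path keeps its capital K, so it still differs from /knowledge/index.cfm.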

real    0m0.170s
user 0m0.046s
sys 0m0.092s

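Old version

Although you can at least eliminate redundant comparisons in the nested loops, I don't think there is a more elegant way than a brute-force double pass.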

mapfile -t lst < <( sort -ru x )              # unique reverse sort once to eliminate simple dups (x is the input file)
n=${#lst[@]}                                  # fix the count now: unset shrinks ${#lst[@]}

for (( ndx1=0; ndx1<n-1; ndx1++ ))            # walk thru once in outer loop
do [[ -n "${lst[ndx1]}" ]] || continue        # ignore removed
   for (( ndx2=ndx1+1; ndx2<n; ndx2++ ))      # inner skips prev, no redux
   do case "${lst[ndx1]}" in                  # case statement string match
        "${lst[ndx2]}"*) unset 'lst[ndx2]' ;; # remove shorter versions
        *) continue 2 ;;                      # no match, skip ahead
      esac
   done
done

printf "%s\n" "${lst[@]}"                     # print out what's left

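I do a unique sort in reverse order to eliminate simple duplicates and set up the comparisons, and store the result in an array to make the nested loops easy.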
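
To see why the reverse sort sets up the comparisons, here is a small sketch with made-up lines. The function name demo_sorted and the sample URLs are assumptions for illustration, and LC_ALL=C is added here only to make the byte ordering predictable. In byte order a line sorts after any of its own leading substrings, so after sort -ru each longer line comes before its shorter versions.

```shell
#!/usr/bin/env bash
# Hypothetical demo: made-up lines, sorted the way the script sorts its input.
demo_sorted() {
  printf '%s\n' \
    'http://h/p?a=1' \
    'http://h/q?a=1' \
    'http://h/p?a=1&b=2' \
    'http://h/p?a=1' |
    LC_ALL=C sort -ru      # -r: descending, -u: drop exact duplicates
}

demo_sorted
# http://h/q?a=1
# http://h/p?a=1&b=2
# http://h/p?a=1
```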

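The outer loop walks the array once; it does not bother with the last record because the inner loop handles it. The inner loop starts at the record after the outer loop's current record. There is no reason to check the earlier ones again, since the records are sorted.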

Since the inner loop removes records, the outer loop skips the check entirely when the record at its current index is empty.
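
A short sketch of what unset does to an indexed array, using a made-up three-element array. The element count drops but the indices are not renumbered, so a loop that has to reach every original index may want to record the length before anything is removed.

```shell
#!/usr/bin/env bash
lst=(c b a)            # made-up three-element array
unset 'lst[1]'         # remove the middle element

echo "${#lst[@]}"      # 2: the count drops
echo "${!lst[*]}"      # 0 2: the remaining indices keep their numbers
[[ -n ${lst[1]:-} ]] || echo "index 1 is empty"
```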

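The case statement checks each record after the outer loop's current record. If the inner record is a leading substring of the current outer record, the shorter version is removed from the array with unset, and the loop moves on to check the next record.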
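
The string match itself can be tried in isolation. A minimal sketch, where is_prefix is a hypothetical helper name and the URLs are made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when $1 is a leading substring of $2.
is_prefix() {
  case $2 in
    "$1"*) return 0 ;;   # the quoted part is literal, so "?" in a URL is not a glob
    *)     return 1 ;;
  esac
}

is_prefix 'http://h/p?a=1' 'http://h/p?a=1&b=2' && echo "shorter version"
is_prefix 'http://h/p?b=2' 'http://h/p?a=1&b=2' || echo "not a leading substring"
```

The second call shows the limit of this test: a line whose parameters appear in a different order is not a leading substring, so the loop keeps it. The perl version deals with that by sorting the arguments first.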

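Once an inner-loop record is no longer part of the outer-loop key, we know we have moved past the related records (since they are sorted), so we skip the pointless check of the rest of the list and move on to the next outer record with continue 2.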

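This moving window over related records should keep wasted work to a minimum.

Regarding bash - removing redundancy with bash, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/63373236/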
