Randomize txt file in Linux but guarantee no repetition of lines



I have a file called test.txt which looks like this:



Line 1
Line 2
Line 3
Line 3
Line 3
Line 4
Line 8

I need some code which will randomize these lines BUT GUARANTEE that the same text cannot appear on consecutive lines, i.e. "Line 3" must be split up and not appear twice or even three times in a row.


I've seen many variations of this problem answered on here but as yet, none that deal with the repetition of lines.



So far I have tested the following:



shuf test.txt

awk 'BEGIN{srand()}{print rand(), $0}' test.txt | sort -n -k 1 | awk 'sub(/\S /,"")'

awk 'BEGIN {srand()} {print rand(), $0}' test.txt | sort -n | cut -d ' ' -f2-

cat test.txt | while IFS= read -r f; do printf "%05d %s\n" "$RANDOM" "$f"; done | sort -n | cut -c7-

perl -e 'print rand()," $_" for <>;' test.txt | sort -n | cut -d ' ' -f2-

perl -MList::Util -e 'print List::Util::shuffle <>' test.txt

All of these randomize the lines within the file, but they often end up with the same line appearing on consecutive lines.
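
(A quick way to check any given shuffle: uniq with no options only compares adjacent lines, so

$ shuf test.txt | uniq -d

prints something exactly when identical lines ended up next to each other.)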


Is there any way I can do this?



This is the data before editing. You can see that account number 82576483 appears on consecutive lines:


REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>

NOTE: asterisks added to highlight lines of interest; asterisks do not exist in the data file
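
(To confirm which account numbers actually repeat in the file (the asterisks are display-only), something like this works:

$ grep -o '<CUST-ACNT-N>[0-9]*</CUST-ACNT-N>' test.txt | sort | uniq -cd
      4 <CUST-ACNT-N>82576483</CUST-ACNT-N>

where uniq -cd prints only the duplicated values with their counts.)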



This is what I need to happen: number 82576483 is spread out across the file rather than sitting on consecutive lines:


REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>

Answers

An efficient approach, at least compared to just trying at random repeatedly:




  1. Shuffle all the unique strings.

  2. For each duplicate:

    1. Identify the places in which it could be placed.

    2. Pick one at random.

    3. Insert the duplicate there.




use strict;
use warnings;

use List::Util qw( shuffle );

# Count occurrences of each line (newlines are included in the keys).
my %counts; ++$counts{ $_ } while <>;

# Start with one shuffled copy of each unique line.
my @strings = shuffle keys %counts;

for my $string ( keys %counts ) {
   my $count = $counts{ $string };
   for ( 2 .. $count ) {
      # Positions 0..@strings where inserting $string would not
      # place it next to an identical line.
      my @safe =
         grep { $_ == 0        || $strings[ $_ - 1 ] ne $string }
         grep { $_ == @strings || $strings[ $_ - 0 ] ne $string }
         0 .. @strings;

      # Fall back to a fully random position if no safe slot exists.
      my $pick = @safe ? $safe[ rand( @safe ) ] : rand( @strings + 1 );

      splice( @strings, $pick, 0, $string );
   }
}

print( @strings );

(Just wrap with perl -e'...' to run from the shell.)
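
For example, assuming the code above is saved as spread_dupes.pl (an arbitrary name):

$ perl spread_dupes.pl test.txt > shuffled.txt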


Tested. There may be an even better approach.




General approach:




  • use associative array (linecnt[]) to keep count of the number of times a line is seen

  • break linecnt[] into two separate normal arrays: single[1]=<lineX>; single[2]=<lineY> and multi[1]=<lineA_copy1>; multi[2]=<lineA_copy2>; multi[3]=<lineB_copy1>

  • while we have at least one entry in both arrays (single[] / multi[]) intersperse our printing (ie, print random(single[]), print random(multi[]), print random(single[]), print random(multi[])); NOTE: obviously not truly random but this allows us to maximize the chances of separating dupes while limiting cpu overhead (ie, no need to repeatedly shuffle hoping for a 'random' ordering that splits dupes)

  • if we have any single[] entries left then print random(single[])

  • if we have any multi[] entries left then print random(multi[]); NOTE: assumes OP's comment ("tough!!") means dupes can be printed consecutively if this is all that's left


One awk idea:



$ cat dupes.awk
function print_random(a, acnt,    ndx) {
    # print a random element of a[1..acnt], move the last element into
    # its slot, and return the reduced count
    ndx = int(1 + rand() * acnt)
    print a[ndx]
    if (acnt > 1) { a[ndx] = a[acnt]; delete a[acnt] }
    return --acnt
}

BEGIN { srand() }

{ linecnt[$0]++ }

END {
    for (line in linecnt) {
        if (linecnt[line] == 1)
            single[++scnt] = line
        else
            for (i = 1; i <= linecnt[line]; i++)
                multi[++mcnt] = line
        delete linecnt[line]
    }

    # alternate singles and dupes while both remain
    while (scnt > 0 && mcnt > 0) {
        scnt = print_random(single, scnt)
        mcnt = print_random(multi, mcnt)
    }

    while (scnt > 0)
        scnt = print_random(single, scnt)

    while (mcnt > 0)
        mcnt = print_random(multi, mcnt)
}

NOTES:




  • srand() isn't truly random (eg, two quick, successive runs can generate the exact same output)

  • additional steps could be added to ensure quick, successive runs don't generate the exact same output (eg, providing an OS-level seed for use in srand())
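
For example, one way to provide such a seed, assuming GNU date for nanosecond resolution, is to pass it in with -v and change the BEGIN block to srand(seed):

$ awk -v seed="$(date +%s%N)" -f dupes.awk test.txt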


Running against OP's sample set of data:



$ awk -f dupes.awk test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N>

NOTES:




  • data lines cut for brevity

  • blank line added to highlight a) 1st block of interleaved single[] / multi[] entries and b) 2nd block finishing off the rest of the single[] entries

  • repeated runs will generate different results




An example of processing duplicates ...



$ cat test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>

Result of running our awk script:



$ awk -f dupes.awk test.txt
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>

REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**99999999**</CUST-ACNT-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N>

NOTES:




  • blank line added to highlight a) 1st block of interleaved single[] / multi[] entries and b) 2nd block finishing off the rest of the multi[] entries

  • repeated runs will generate different results



Ruby has some nice syntax for a concise approach.



https://stackoverflow.com/a/65843200 is easily modified for your data:



ruby -e '
  regex = /<CUST-ACNT-N>\d+<\/CUST-ACNT-N>/

  arr = readlines.map {|line| {:k => line[regex], :v => line}}
  arr = arr.sort_by {|kv| kv[:k]}
  mid = arr.size.succ / 2
  arr = arr[0..mid-1].zip(arr[mid..-1]).flatten.compact.map {|kv| kv[:v]}
  idx = (1..arr.size-1).find { |i| arr[i] == arr[i-1] }

  puts idx ? arr.rotate(idx) : arr
' file.txt



Another approach: first shuffle the lines, then go line by line, collecting dupes as they come. For each line, check the existing dupes and slip them in where possible. After the input has been processed this way, go over the result from the beginning to try to place the remaining dupes.


use warnings;
use strict;
use feature 'say';
use List::Util qw(shuffle any);

# Push dupes to data unless same as last element or added already
sub add_dupes {
    my ($data, $dupes, $mask) = @_;

    for my $idx (0 .. $#$dupes) {
        next if $dupes->[$idx] eq $data->[-1];
        next if any { $idx == $_ } @$mask;

        push @$data, $dupes->[$idx];
        push @$mask, $idx;
    }
}

my @lines = <>;
chomp @lines;

my (@dupes, @mask_dupes);    # dupes found so far, and indices of dupes already placed

my @res = shift @lines;

foreach my $line (shuffle @lines) {
    if ($line eq $res[-1]) { push @dupes, $line }
    else                   { push @res,   $line }

    # Redistribute dupes found so far if possible
    add_dupes(\@res, \@dupes, \@mask_dupes);
}

# Redistribute remaining (unused) dupes
my @final;
foreach my $line (@res) {
    if (@final and $line eq $final[-1]) { push @dupes, $line }
    else                                { push @final, $line }

    add_dupes(\@final, \@dupes, \@mask_dupes);
}

say "\nFinal (", scalar @final, " items):";
say for @final;

This stores dupes on an array as they are found, and for each line checks whether it can slip in existing dupe(s). It uses an ancillary mask array to mark indices of dupes that have been used.



Notes




  • Shuffling first helps since, with overwhelming probability, many of the consecutive duplicate lines will get moved apart



  • The duplicates array is searched for each line of data, so in principle the worst case is O(N^2) (or, rather, O(NM)). This, I think, has to be done in some way in any approach, but it should be possible to minimize these cross searches.


    However, the array of dupes is expected to be rather short and most of the time not the whole array is searched. So if the input isn't gigantic with a lot of dupes this should perform well.




  • If there happen to be no duplicates in the end we are copying an array needlessly. But that's not a terrible sin, if it's once.





Tested with various input, with many duplicates of multiple lines, but needs more testing. (At least, add basic diagnostic prints and run repeatedly -- it shuffles each time so repeated runs help -- and examine the output.)

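For example, a rough harness, assuming the script above is saved as slip_dupes.pl (an arbitrary name): since uniq -d only reports adjacent duplicate lines, any output at all from

$ for i in $(seq 1 100); do perl slip_dupes.pl test.txt | uniq -d; done

means some run left identical lines next to each other.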



Using any awk:



$ cat tst.awk
match($0,/<CUST-ACNT-N>[^<]+<\/CUST-ACNT-N>/) {
    key = substr($0,RSTART,RLENGTH)
    gsub(/^<CUST-ACNT-N>|<\/CUST-ACNT-N>$/,"",key)
    keys[NR]  = key
    lines[NR] = $0
}
END {
    srand()
    maxAttempts = 1000
    while ( (output == "") && (++attempts <= maxAttempts) ) {
        output = distribute()
    }
    printf "%s", output
    if ( output == "" ) {
        print "Error: Failed to distribute the input." | "cat>&2"
        exit 1
    }
}

function distribute(    iters,numLines,maxIters,tmpLines,tmpKeys,idx,i,ret) {
    # work on copies of keys[] / lines[] so a failed attempt can be retried
    for ( idx in keys ) {
        tmpKeys[idx]  = keys[idx]
        tmpLines[idx] = lines[idx]
        numLines++
    }

    maxIters = 1000
    while ( (numLines > 0) && (++iters <= maxIters) ) {
        # pick a random remaining line; emit it only if its key differs
        # from the key of the line just emitted
        idx = int(1+rand()*numLines)

        if ( tmpKeys[idx] != prev ) {
            ret = ret tmpLines[idx] ORS
            prev = tmpKeys[idx]
            for ( i=idx; i<numLines; i++ ) {
                tmpKeys[i]  = tmpKeys[i+1]
                tmpLines[i] = tmpLines[i+1]
            }
            numLines--
        }
    }

    if ( numLines ) {
        ret = ""
    }
    return ret
}


$ awk -f tst.awk file
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>

So, in one attempt to produce output, it tries up to 1000 times (maxIters) to find, at random from the set of unprocessed lines, a next line to output that isn't the same as the line it just added to the output. An attempt could still fail, so it makes up to 1000 attempts (maxAttempts) to produce output. Even that could fail - increase those values if you like, but some inputs simply can't be organized as you like (e.g. only 2 lines of input where both lines are identical).
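
That degenerate case is easy to reproduce: with two identical keyed lines there is no valid arrangement, so the script exhausts its attempts and reports the failure on stderr:

$ printf '<CUST-ACNT-N>1</CUST-ACNT-N>\n<CUST-ACNT-N>1</CUST-ACNT-N>\n' | awk -f tst.awk
Error: Failed to distribute the input.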


You could make it more efficient and increase its chances of success by changing this code:


            ret = ret tmpLines[idx] ORS
            prev = tmpKeys[idx]
            for ( i=idx; i<numLines; i++ ) {
                tmpKeys[i]  = tmpKeys[i+1]
                tmpLines[i] = tmpLines[i+1]
            }
            numLines--

to create/use secondary arrays consisting of only the keys+lines that do not have the same key as the one just processed. Then we wouldn't need the if ( tmpKeys[idx] != prev ) test above it, and we wouldn't run the risk of idx = int(1+rand()*numLines) randomly finding the same key 1000 times when there were others to choose from. That enhancement is left as an exercise :-).



Using TXR Lisp:



$ txr spread-sort.tl < data
Line 2
Line 4
Line 3
Line 1
Line 3
Line 8
Line 3
$ txr spread-sort.tl < data
Line 4
Line 3
Line 1
Line 3
Line 8
Line 3
Line 2
$ txr spread-sort.tl < data
Line 4
Line 3
Line 8
Line 3
Line 1
Line 3
Line 2

The code:



(set *random-state* (make-random-state))

(let ((dupstack (vec)))
  (labels ((distrib (single)
             (build
               (pend single)
               (each ((i 0..(len dupstack)))
                 (iflet ((item (pop [dupstack i])))
                   (add item)))
               (upd dupstack (remq nil))))
           (distrib-push (dupes)
             (prog1
               (distrib nil)
               (vec-push dupstack dupes))))
    (flow (get-lines)
      sort-group
      shuffle
      (mapcar [iff cdr distrib-push distrib])
      (mapcar distrib)
      tprint)))

This is not a correct algorithm, in that if the input has a high ratio of duplicates, such that there is essentially one correct ordering, such as:


1
2
2

it will not consistently produce the two 2 1 2 orders that separate the duplicates.


The main flow of the algorithm is in the flow form. Lines are obtained from standard input and passed through sort-group, which will group the duplicates and sort, resulting in a list of lists of strings. Lines which aren't duplicates are lists of length 1. We shuffle this list of lists randomly, which means that the duplicates stay together.



We then distribute the duplicates using two passes, which use a vector called dupstack.



In the first pass, we map the list of lists such that the singletons are passed through distrib and the duplicates are passed through distrib-push. This moves around the duplicates in the way described below. After this pass, some items remain in the dupstack; so the list-of-lists does not have all the items. We make another pass, this time just passing every list through distrib, which distributes the items out of dupstack.



The dupstack is a vector of lists, which are lists of duplicate lines. E.g. [dupstack 0] might contain ("Line 3" "Line 3" "Line 3") and such.



How distrib works is: it sweeps through dupstack, pops one element off the front of each element and appends it to the input list, returning that input list. If we map using this operation, it means that to each list we visit, we add one item from each duplicate set. After each sweep through this stack, we condense it using (upd dupstack (remq nil)) to purge it of lists that have become empty.



The function distrib-push is used in the first pass when processing lists that have more than one element (indicated by the Lisp cdr function returning nonempty). What distrib-push does is call distrib with an empty list, just to collect any available duplicates, one of each. These items cherry-picked from dupstack then replace the current items. Those items, identical strings, are pushed into a new slot in the duplicate stack.




Here is a sample file:



$ cat file
A Line 0
A Line 1
A Line 2
A Line 3
A Line 4
B Line 5
B Line 6
B Line 7
B Line 8
C Line 9
C Line 10
C Line 11
D Line 12

With the first column defined as the key and the constraint that no key can be next to the same key, randomize the file. Given that constraint, the result will only be randomish since there are more A's than B's and an A will have to be at the start of the sequence. (An odd total of items has more even indexes than odd ones, since the parity of 0 is even.)
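
For example, with the 13 lines above, the indexes 0..12 split seven even to six odd:

$ seq 0 12 | awk '{ print ($1 % 2 ? "odd" : "even") }' | sort | uniq -c
      7 even
      6 odd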


The general approach would be:




  1. Group all lines with the same key together;

  2. Randomly assign the groups of duplicate lines to odd or even indexes so that each group is distributed;

  3. Randomly choose the remaining lines for output.


This is easily done in Ruby:



ruby -e '
BEGIN{ keys = Hash.new { |h, k| h[k] = [] } }
data = $<.read.split(/\R/)
data.each.with_index{|s,i|
  s.match(/^(\S+)/); keys[$1] << i
}                                          # regex for key goes here
olines = (0..data.length-1).to_a; nlines = Hash.new()
grp_cnt = keys.values.map{|sa| sa.length if sa.length>1}.compact.sum
keys.sort_by{|k,v| [-v.length, v[0]]}.each{|k, grp|
  if grp.length>1 then
    evens, odds = olines.partition{|n| n.even?}
    if grp_cnt.to_f/data.length > 0.6 then
      pool = evens.length>odds.length ? evens[0...grp.length] : odds.reverse[0...grp.length]
    else
      pool = evens.length>odds.length ? evens : odds
    end
    if pool.length<grp.length then pool = olines end
  else
    pool = olines
  end

  this_grp = pool.sample(grp.length)
  grp.zip(this_grp).each{|ks, vs| nlines[ks] = vs}

  olines.reject!{|line| this_grp.include?(line) }   # remove the used lines
}
nlines.sort_by{|k,v| v}.each{|v,k| puts "Line #{v} in => Line #{k} out; \"#{data[k]}\" => \"#{data[v]}\""}
' file

Prints:



Line 0 in => Line 0 out; "A Line 0" => "A Line 4"
Line 12 in => Line 1 out; "A Line 1" => "D Line 12"
Line 2 in => Line 2 out; "A Line 2" => "A Line 2"
Line 10 in => Line 3 out; "A Line 3" => "C Line 10"
Line 3 in => Line 4 out; "A Line 4" => "A Line 3"
Line 7 in => Line 5 out; "B Line 5" => "B Line 7"
Line 1 in => Line 6 out; "B Line 6" => "A Line 1"
Line 5 in => Line 7 out; "B Line 7" => "B Line 5"
Line 4 in => Line 8 out; "B Line 8" => "A Line 0"
Line 6 in => Line 9 out; "C Line 9" => "B Line 6"
Line 11 in => Line 10 out; "C Line 10" => "C Line 11"
Line 8 in => Line 11 out; "C Line 11" => "B Line 8"
Line 9 in => Line 12 out; "D Line 12" => "C Line 9"

This is trivial to change to accommodate the OP example input. Only the regex for the key and the output line are changed:


ruby -e '
BEGIN{ keys = Hash.new { |h, k| h[k] = [] } }
data = $<.read.split(/\R/)
data.each.with_index{|s,i|
  s.match(/<CUST-ACNT-N>([^<]+)</); keys[$1] << i
}                                          # regex for key goes here
olines = (0..data.length-1).to_a; nlines = Hash.new()
grp_cnt = keys.values.map{|sa| sa.length if sa.length>1}.compact.sum
keys.sort_by{|k,v| [-v.length, v[0]]}.each{|k, grp|
  if grp.length>1 then
    evens, odds = olines.partition{|n| n.even?}
    if grp_cnt.to_f/data.length > 0.6 then
      pool = evens.length>odds.length ? evens[0...grp.length] : odds.reverse[0...grp.length]
    else
      pool = evens.length>odds.length ? evens : odds
    end
    if pool.length<grp.length then pool = olines end
  else
    pool = olines
  end

  this_grp = pool.sample(grp.length)
  grp.zip(this_grp).each{|ks, vs| nlines[ks] = vs}

  olines.reject!{|line| this_grp.include?(line) }   # remove the used lines
}
nlines.sort_by{|k,v| v}.each{|v,k| puts "#{data[v]}"}
' file

Prints:



REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82590483</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>25.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>83476098</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576324</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>98.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576883</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>17.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441754</ORD-AUTH-C><ORD-AUTH-V>94.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5759148</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576786</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>24.79</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>82576113</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>28.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>
REC-TYPE-C>CHARGE INVOICE</REC-TYPE-C><CUST-ACNT-N>**82576483**</CUST-ACNT-N><CUST-NAME-T>TEST TEN</CUST-NAME-T><ORD-AUTH-C>0044441552</ORD-AUTH-C><ORD-AUTH-V>21.99</ORD-AUTH-V><OUT-DOCM-D>01/09/2023</OUT-DOCM-D><ORD-N>5758655</ORD-N>

Comments

Thank you so much for this. Does the trick exactly as intended & I'm familiar with most of what's going on in here...Thanks for everyone's efforts in helping me in all this. It's much appreciated


Suppose you have the input seq 0 10 | awk 'BEGIN{ ff[0]="A"; ff[1]="B"; ff[2]="C"} {print ff[$1%(FNR>3?3:2)]}' | sort | awk '{print $1, "Line", FNR-1}' >file Your approach does not work (using first column as key.)


@dawg not sure what you're getting at; your code generates a bunch of lines like A line 0; A line 1; B line 2; ..., with none of the lines being duplicated; OP's question refers to the whole line being the 'key'; at no point does anyone (OP, me) suggest this approach would work for some other data set where we're looking at duplicate columns (as opposed to duplicate rows)

Delete the ` Line \d` part then. seq 0 10 | awk 'BEGIN{ ff[0]="A"; ff[1]="B"; ff[2]="C"} {print ff[$1%(FNR>3?3:2)]}' | sort >file does not work either...


Define "does not work": your code generates nothing but duplicates; OP hasn't defined how to process an excess number of duplicates; my previous answer (see the edit history) went a bit further to randomize excessive duplicates but still had limitations (eg, how to randomize 3 lines that are all A)

Thanks Ed. It's working most of the time but I'm getting the odd "Failed to distribute the input" error when I try to run it with the "file" in your example. Is this just a quirk of the random nature of what we're trying to do?

You're welcome. I explained that in the paragraph at the bottom of my answer. It'll try up to 1000*1000 = 1,000,000 times to produce the desired output before giving up.

