> On 9 Oct 2016, at 11:05 pm, John Hearns <john.hea...@xma.co.uk> wrote:
> 
> This sounds interesting.
> 
> I would question this step though, as it seems intrinsically a bottleneck:
>> Removes duplicate lines:
>> $ sort filename.txt | uniq
> 
> First off - do you need a sorted list?  If not, do not perform the sort...
> 
> Secondly - how efficient is the Unix 'uniq' utility?

The command could be collapsed to

sort -u

GNU sort is quite efficient, I think.
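For what it's worth, a quick sketch (filename.txt and the sample lines are stand-ins, not real data):

```shell
# Stand-in input with a duplicate line
printf 'cat\ndog\ncat\nant\n' > filename.txt

# Same result as 'sort filename.txt | uniq', but in a single process
sort -u filename.txt
# ant
# cat
# dog

# GNU sort can also be told to use more threads and a bigger buffer, e.g.:
#   sort -u --parallel=8 -S 2G filename.txt
```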

> How about, every time you generate a string, check if it has already been 
> generated.  I guess this then starts to depend on the efficiency of searching 
> through a file.

Assuming you're using a binary tree to store those, you're effectively sorting 
the list anyway. Probably with O(N log N) execution time. 
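If sorted output isn't actually required, a hash table beats the tree: awk keeps an associative array of lines seen so far, so duplicates drop out in roughly O(N) expected time and the original generation order is preserved. A minimal sketch (generated.txt is a made-up name):

```shell
# Stand-in input
printf 'dog\ncat\ndog\nant\n' > generated.txt

# Print a line only the first time it appears; 'seen' is a hash table
awk '!seen[$0]++' generated.txt
# dog
# cat
# ant
```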

> Thinking about this problem, would it be better to allocate each MPI rank a 
> unique set of string to generate?
> Say there are N characters in the set A-Z a-z 0-9 plus symbols. This is 
> likely to be 255 as I'm assuming you are talking the ASCII character set.
> So then if you have 256 MPI ranks then allocate rank 1 to generate all 
> strings commencing with 'A' then rank 2 'B' etc.
> If you are lucky enough to have 256x256 cores to allocate then it goes 'AA' 
> 'AB' 'AC'  etc. etc.
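On the deduplication side, that layout might look something like this hypothetical sketch: strings are routed into per-rank buckets keyed on their first character, and each rank then deduplicates only its own bucket (the 4-rank count, file names, and lowercase-only alphabet are all assumptions for illustration):

```shell
# Stand-in input
printf 'apple\nbanana\napple\ncarrot\n' > strings.txt

# Route each string to a bucket file by its first character;
# rank r would then run 'sort -u' on bucket_r.txt alone
awk -v n=4 '{
  r = index("abcdefghijklmnopqrstuvwxyz", substr($0, 1, 1)) % n
  print > ("bucket_" r ".txt")
}' strings.txt

cat bucket_1.txt
# apple
# apple
```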

That's no longer random though, is it? You're absolutely fixing the frequency of 
the first letter. The more you do that, the less random your strings will 
become. 

Tim

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
