How about this:

The uniq command is actually not that powerful. It only checks that the current string against the previous string, to see if they match, which is why you almost always see uniq preceded by sort. For example:

$ echo -e "alpha
> alpha
> bravo
> alpha
> charlie
> bravo
> delta
> alpha" | uniq
alpha
bravo
alpha
charlie
bravo
delta
alpha

$ echo -e "alpha
alpha
bravo
alpha
charlie
bravo
delta
alpha" | sort | uniq
alpha
bravo
charlie
delta

In the first case, the word that was eliminated was the first 'alpha' because it was followed directly by another alpha. In order to remove all duplicates, you do need to do a sort first, if you're using the standard Unix/Linux utilities. If you're writing a program to remove duplicates, you'd want to sort the words into a tree structure as you read them, so that when to search to see if a word has already been read, it will be much quicker to test for that.

As an MPI program it might be interesting to have each MPI rank store the list of previously read words for just a subsection of the alphabet (rank 1 stores words starting with a-c, rank 2 d - f, and so on) and then have rank 0 read the file and send the requests to the appropriate rank to see if it's been seen there, and add it to it's local list if it hasn't been seen yet. If asynchronous MPI functions are used, you could get a parallel speed up, but I magine it would be noticeable only for for very large lists of words. For even reasonable test cases the MPI over head and latencies would override the benefit of parallelism in this case, but it would be a good learning exercise.

When rank 0 reaches the end of the file, it send a signal to the other ranks, which then sends their sorted sublist back to rank 0 to display the results.


On 10/09/2016 06:04 PM, John Hearns wrote:

This sounds interesting.

I would question this step though, as it seems intrinically a bottlneck:
Removes duplicate lines:
$ sort filename.txt | uniq
First off - do you need a sorted list?  If not do not perform the sort...

Secondly - how efficient is the Unix 'uniq' utility?
How about, every time you generate a string, check if it has already been 
generated.  I guess this then starts to depend on the efficiency of searching 
through a file.
Thinking about this problem, would it be better to allocate each MPI rank a 
unique set of string to generate?
Say there are N characters in the set A-Z a-z 0-9 plus symbols . This is likely 
to be 255 as I'm assuming you are talking the ASCII character set.
So then if you have 256  MPI ranks then allocate rank 1 to generate all strings 
commencing with 'A'  then rank 2 'B' etc.
If you are lucky enough to have 256x256 cores to allocate then it goes 'AA' 
'AB' 'AC'  etc. etc.

Which further makes me think - why generate random strings in the first place? 
If you want a sorted list, then set the MPI ranks off to generate sequential 
strings,
and as I say make sure there is no duplicated effort somehow by allocating 
unique subsections to each rank.
All you have to do is gather in the already sorted lists at the end, starting 
at AA , AB etc etc
This is quite thought provoking - maybe the random approach IS more efficient 
???

Thinking further though, if you have processors with AVX externsions then is 
there an efficient way to bring those to bear on the problem - 256 and 512 
vector lengths mapping rather well to ASCII 255 characters.

I know this is an exercise in learning MPI, but I also idly winder if the 
Micron Automata could do this rather well
http://www.micronautomata.com/




________________________________________
From: Beowulf [beowulf-boun...@beowulf.org] on behalf of Skylar Thompson 
[skylar.thomp...@gmail.com]
Sent: 08 October 2016 15:20
To: dar...@wisecorp.co.uk
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] Generation of strings MPI fashion..

If you haven't used MPI before, I would break this up into chunks:

1. Write a MPI program that gathers a fixed amount of /dev/urandom
(Brian's suggestion is wise) data and sends it back to the master. This
will get you used to the basic MPI_Send/MPI_Recv commands.

2. Use the same program, but use MPI_Gather rather than MPI_Recv to
assemble the final array. That will get you used to collective
communication.

3. Find a parallel sorting algorithm and implement it. You'll still need
to have rank 0 do some work, but the point of the algorithm would be to
reduce the amount of work it has to do. I found a good description of
parallel merge sort here:

http://penguin.ewu.edu/~trolfe/ParallelMerge/ParallelMerge.html

Good luck!

Skylar

On 10/07/2016 10:19 AM, Darren Wise wrote:
Heya folks,

This may seem really simple to some and it is fairly simple from the
terminal using bash of generation and sorting (examples below)

What I would like to do, to get started with making a program to run in
MPI on my cluster.. I thought of a fairly simple bash script and this
works fine for a single machine but what is the best way to go around it
or converting this simple notion in to an MPI runable command.

Generates random strings of chars:
$ tr -cd '[upper:]' < /dev/random | fold -w9 | head -c${1:-1000} | tee
somefile.txt

Removes duplicate lines:
$ sort filename.txt | uniq

This is fine for generation as I mention for a single machine, but
what's the best way to turn this around and use MPI with a shared NFS
mounted filesystem..

While this is an example im going to generate a bash script to allow me
to generate every possible permutation of a desired string length along
with types of chars a-z, A-Z, 0-9 along with symbols..

So any advice is welcome really and it's just an educational way for me
to transpire into generating code that can be used within my small beowulf.

Kind regards,
Darren Wise Esq.

www.wisecorp.co.uk <http://www.wisecorp.co.uk>
www.wisecorp.co.uk/babywise <http://www.wisecorp.co.uk/babywise>
www.darrenwise.co.uk <http://www.darrenwise.co.uk>

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
Any views or opinions presented in this email are solely those of the author 
and do not necessarily represent those of the company. Employees of XMA Ltd are 
expressly required not to make defamatory statements and not to infringe or 
authorise any infringement of copyright or any other legal right by email 
communications. Any such communication is contrary to company policy and 
outside the scope of the employment of the individual concerned. The company 
will not accept any liability in respect of such communication, and the 
employee responsible will be personally liable for any damages or other 
liability arising. XMA Limited is registered in England and Wales (registered 
no. 2051703). Registered Office: Wilford Industrial Estate, Ruddington Lane, 
Wilford, Nottingham, NG11 7EP
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

Reply via email to