How about this:
The uniq command is actually not that powerful. It only checks that the
current string against the previous string, to see if they match, which
is why you almost always see uniq preceded by sort. For example:
$ echo -e "alpha
> alpha
> bravo
> alpha
> charlie
> bravo
> delta
> alpha" | uniq
alpha
bravo
alpha
charlie
bravo
delta
alpha
$ echo -e "alpha
alpha
bravo
alpha
charlie
bravo
delta
alpha" | sort | uniq
alpha
bravo
charlie
delta
In the first case, the word that was eliminated was the first 'alpha'
because it was followed directly by another alpha. In order to remove
all duplicates, you do need to do a sort first, if you're using the
standard Unix/Linux utilities. If you're writing a program to remove
duplicates, you'd want to sort the words into a tree structure as you
read them, so that when to search to see if a word has already been
read, it will be much quicker to test for that.
As an MPI program it might be interesting to have each MPI rank store
the list of previously read words for just a subsection of the alphabet
(rank 1 stores words starting with a-c, rank 2 d - f, and so on) and
then have rank 0 read the file and send the requests to the appropriate
rank to see if it's been seen there, and add it to it's local list if it
hasn't been seen yet. If asynchronous MPI functions are used, you could
get a parallel speed up, but I magine it would be noticeable only for
for very large lists of words. For even reasonable test cases the MPI
over head and latencies would override the benefit of parallelism in
this case, but it would be a good learning exercise.
When rank 0 reaches the end of the file, it send a signal to the other
ranks, which then sends their sorted sublist back to rank 0 to display
the results.
On 10/09/2016 06:04 PM, John Hearns wrote:
This sounds interesting.
I would question this step though, as it seems intrinically a bottlneck:
Removes duplicate lines:
$ sort filename.txt | uniq
First off - do you need a sorted list? If not do not perform the sort...
Secondly - how efficient is the Unix 'uniq' utility?
How about, every time you generate a string, check if it has already been
generated. I guess this then starts to depend on the efficiency of searching
through a file.
Thinking about this problem, would it be better to allocate each MPI rank a
unique set of string to generate?
Say there are N characters in the set A-Z a-z 0-9 plus symbols . This is likely
to be 255 as I'm assuming you are talking the ASCII character set.
So then if you have 256 MPI ranks then allocate rank 1 to generate all strings
commencing with 'A' then rank 2 'B' etc.
If you are lucky enough to have 256x256 cores to allocate then it goes 'AA'
'AB' 'AC' etc. etc.
Which further makes me think - why generate random strings in the first place?
If you want a sorted list, then set the MPI ranks off to generate sequential
strings,
and as I say make sure there is no duplicated effort somehow by allocating
unique subsections to each rank.
All you have to do is gather in the already sorted lists at the end, starting
at AA , AB etc etc
This is quite thought provoking - maybe the random approach IS more efficient
???
Thinking further though, if you have processors with AVX externsions then is
there an efficient way to bring those to bear on the problem - 256 and 512
vector lengths mapping rather well to ASCII 255 characters.
I know this is an exercise in learning MPI, but I also idly winder if the
Micron Automata could do this rather well
http://www.micronautomata.com/
________________________________________
From: Beowulf [beowulf-boun...@beowulf.org] on behalf of Skylar Thompson
[skylar.thomp...@gmail.com]
Sent: 08 October 2016 15:20
To: dar...@wisecorp.co.uk
Cc: beowulf@beowulf.org
Subject: Re: [Beowulf] Generation of strings MPI fashion..
If you haven't used MPI before, I would break this up into chunks:
1. Write a MPI program that gathers a fixed amount of /dev/urandom
(Brian's suggestion is wise) data and sends it back to the master. This
will get you used to the basic MPI_Send/MPI_Recv commands.
2. Use the same program, but use MPI_Gather rather than MPI_Recv to
assemble the final array. That will get you used to collective
communication.
3. Find a parallel sorting algorithm and implement it. You'll still need
to have rank 0 do some work, but the point of the algorithm would be to
reduce the amount of work it has to do. I found a good description of
parallel merge sort here:
http://penguin.ewu.edu/~trolfe/ParallelMerge/ParallelMerge.html
Good luck!
Skylar
On 10/07/2016 10:19 AM, Darren Wise wrote:
Heya folks,
This may seem really simple to some and it is fairly simple from the
terminal using bash of generation and sorting (examples below)
What I would like to do, to get started with making a program to run in
MPI on my cluster.. I thought of a fairly simple bash script and this
works fine for a single machine but what is the best way to go around it
or converting this simple notion in to an MPI runable command.
Generates random strings of chars:
$ tr -cd '[upper:]' < /dev/random | fold -w9 | head -c${1:-1000} | tee
somefile.txt
Removes duplicate lines:
$ sort filename.txt | uniq
This is fine for generation as I mention for a single machine, but
what's the best way to turn this around and use MPI with a shared NFS
mounted filesystem..
While this is an example im going to generate a bash script to allow me
to generate every possible permutation of a desired string length along
with types of chars a-z, A-Z, 0-9 along with symbols..
So any advice is welcome really and it's just an educational way for me
to transpire into generating code that can be used within my small beowulf.
Kind regards,
Darren Wise Esq.
www.wisecorp.co.uk <http://www.wisecorp.co.uk>
www.wisecorp.co.uk/babywise <http://www.wisecorp.co.uk/babywise>
www.darrenwise.co.uk <http://www.darrenwise.co.uk>
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
Any views or opinions presented in this email are solely those of the author
and do not necessarily represent those of the company. Employees of XMA Ltd are
expressly required not to make defamatory statements and not to infringe or
authorise any infringement of copyright or any other legal right by email
communications. Any such communication is contrary to company policy and
outside the scope of the employment of the individual concerned. The company
will not accept any liability in respect of such communication, and the
employee responsible will be personally liable for any damages or other
liability arising. XMA Limited is registered in England and Wales (registered
no. 2051703). Registered Office: Wilford Industrial Estate, Ruddington Lane,
Wilford, Nottingham, NG11 7EP
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf