On Tue, 22 Aug 2006, Joe Landman wrote:
All OO interfaces (and GSL is OO in its interface) really want you to go through their accessors. Basically, where I got stuck was that I had to write the retrieval dereferencer for a Perl object in C, and then store that in GSL. Thought I could do that quickly; half an hour in, I punted :)
Ouch. Half a week maybe. Unless you are already familiar with perl's back-end magic...or can find a really good example somewhere, or a library that does it for you. I think that man perlguts describes the C interface (all through SV/AV/HV accessors that are quite primitive). A world of pain, in other words...
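For a taste of it, here is roughly what walking a perl array from C looks like through those accessors -- a sketch from memory of the perlguts API in an XS context, so caveat emptor; sum_av is my name, not perl's:

  #include "EXTERN.h"
  #include "perl.h"

  /* sum the numeric values of a perl array, one accessor at a time */
  double sum_av(pTHX_ AV *av)
  {
      double sum = 0.0;
      I32 last = av_len(av);               /* index of the last element */
      for (I32 i = 0; i <= last; i++) {
          SV **elem = av_fetch(av, i, 0);  /* NULL if the slot is empty */
          if (elem)
              sum += SvNV(*elem);          /* coerce the SV to a double */
      }
      return sum;
  }

And that's the easy direction -- you're still paying for an av_fetch and an SvNV per element instead of a bare pointer dereference.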
The beauty of OO is that you could in theory change the underlying basic types and not break the method. The danger is that it adds so much abstraction that data manipulation gets really slow -- like going through an accessor function for every element access.
This particular theory is IMO laughably untrue in nine cases out of ten in actual code, at least numerical code. Places where it IS true are often places where the data can be well (sensibly) encoded by XML. Places where it is untrue are nearly everywhere else. I'm including in "untrue" a large number of cases where one "can" encode in XML but the encoding is trivial or obviously needless overhead, and hence silly -- which covers most cases where the data types are standard programming forms: scalars, vectors/arrays, and so on.

OO makes sense when your data can be sensibly represented as objects and not as calculational variables, and when there is a distinct need for the objects to be self-documenting (contain metadata), extensible (maybe we don't know yet just what the objects should be and want to be able to glue in additional stuff without necessarily breaking what is there), and so on. Graphics is very OO -- the entire concept of the "widget" is of a reusable object with "properties" that can be manipulated in common ways. DB applications are certainly very OO, which is why (I think) OO languages became very popular with business, especially when tightly integrated with graphics.

Numerical simulations, linear algebra, and so on are not, really, terribly good candidates for "objectizing". Of course one can do it, but in most cases it is a matter of "why bother?", or at the very least of using "lite" objects (lazy objects, i.e. structs) without worrying much about protection/inheritance, accessors, constructors/destructors, "methods", or full encapsulation of same.
Something I was thinking about yesterday was ways to mitigate this. Basically, for arrays, you can (carefully) construct a good C array, layer metadata in a separate array, and create a "high speed" accessor interface to provide effectively the normal (*..*) array semantics. I think it is a solvable problem.
It is a solved problem. That's exactly the way to efficiently handle the GSL objects, for example, and the GSL is even structured to facilitate this if you plumb the depths of its #include files and learn how a block is a chunk of memory with a length, and a vector is a view onto a block: a (separately represented) size, a stride, a pointer into the block's data, and an "owner" field that controls the freeing of the underlying memory block in case the vector dereferences a block of memory shared by more than one vector.

Since it is C, one can therefore access a gsl_vector* either as (casttype) vector->data[i] or as (casttype) gsl_vector_get(vector,i). The latter dereferences the vector through an obvious displacement/stride computation (an inline function or macro). The former does the same thing, only it uses the compiler's dereferencing code, and almost certainly is considerably faster (in addition to being a hell of a lot more readable in code). C lovers will also see that since they have the base address of the memory block handy, they can repack memory locations at will into a **array (or ***tensor, etc.) with a cast, and dereference the vector as array[i][j] in actual code (in, say, a derivs function) while STILL being able to pass the vector on to e.g. an ODE solver as a vector. Accessing it via the get, however, permits error handling and so on, provided that you are willing to live with the restriction that now the vector MUST start at 0 and run through size-1 in units of stride, and that the code is incredibly ugly and unnatural (compared to, say, fortran). Accessing the same data via pointers permits one to write totally natural and readable code, but of course gives up error checking.

This is really the FUNDAMENTAL issue between strongly typed, error-checking languages and C. C makes you face the tradeoffs directly -- you cannot just ignore the cost of providing a safety net for the programmer, because you have to directly code any such safety net and in the process can see what it costs. Fortran and C++ live on opposite poles of this. For fortran the tradeoff is to live with a very restricted standard data representation that is highly tuned to numerical programming (and hence highly optimizable by a good compiler, and strongly typed if one doesn't cheat with pointers and structs in a way that defeats the optimizer). Code is fairly readable, except for the eternal silliness over not permitting starting-index displacement. C++ provides one with a TOTALLY flexible data representation and encourages you to use it for "everything", but provides strong typing and error checking by requiring that your data representation be rigorously defined according to a standard prescription that clearly slows down nearly everything and again makes certain classes of numerical code very difficult to read.

In C you have a choice, and can even have your cake and eat it too, up to a point -- you can deliberately code in a fortran-like style and end up with very readable numerical code (but without much error protection from the compiler itself), you can code in a very OO style and add your own protection and error testing in accessor functions (as the GSL does), or anything in between, or all mixed up, or (as shown above) even both at the same time. To me, the rigorous encapsulation of the simple concept of a C vector in the GSL is a major PITA and a real obstacle to efficient programming, which is obviously not desirable in a numerical library.
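To make the two access styles concrete, here is a minimal sketch (assuming GSL is installed; compile with something like gcc foo.c -lgsl -lgslcblas):

  #include <stdio.h>
  #include <gsl/gsl_vector.h>

  int main(void)
  {
      gsl_vector *v = gsl_vector_alloc(9);

      /* OO-style access through the accessors, with range checking */
      for (size_t i = 0; i < v->size; i++)
          gsl_vector_set(v, i, (double) i);

      /* direct access through the struct fields, no checking -- note
         the stride, which is 1 for a freshly allocated vector */
      double sum = 0.0;
      for (size_t i = 0; i < v->size; i++)
          sum += v->data[i * v->stride];
      printf("sum = %g\n", sum);

      /* repack the same block as a 3x3 matrix of row pointers, so it
         can be dereferenced as m[i][j] in e.g. a derivs routine */
      double *m[3];
      for (int i = 0; i < 3; i++)
          m[i] = v->data + 3 * i;
      printf("m[1][2] = %g\n", m[1][2]);   /* same element as index 5 */

      gsl_vector_free(v);
      return 0;
  }

The m[i][j] repack at the end is the have-your-cake trick from above -- the same memory is still a perfectly good gsl_vector the whole time.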
One of the most attractive features of C is the ease with which one can add an offset to the base address so that a vector can start and end on any index. I don't even think about this anymore in actual code -- if my indices do in fact run from 0 to 99 I use an ordinary e.g. int vec[100]; declaration. In all other cases I use int *vec and malloc/displace the indices so that they run from -50 to 50, if that's what they actually do in the data I wish to represent.

This is something that Numerical Recipes got right, for all that they got wrong -- they have a much simpler and more natural set of routines for creating and destroying vectors and matrices, alas without metadata that might have made it possible to write robust library routines that at least are guaranteed not to go out of bounds on the primary data objects, provided that one uses the metadata to set loop limits (although in C obviously NOTHING can prevent you from referencing past the end of a block of anonymous memory if you choose to access it that way via a pointer). Alas, as I said when I gently suggested that it would be desirable to implement a more hierarchically extensible view of memory that enables a more fortran-like dereferencing of tensors of at least a rank high enough to hold common objects from physics -- fourth rank at an absolute minimum, sixth to eighth even better -- well, Brian etc. didn't exactly fall over themselves to agree. And I'm not sure that I blame them. It would break the hell out of their otherwise excellent code to convert (because god HELP you if you ever have to change the underlying objects, no matter HOW OO your interface was, in a major code revision :-) and I suspect that they coded the way they did to make it relatively easy to use the library in C++. And nothing prevents one from writing one's own gsl-like tensor object for use in one's own code, so I did. ;-)
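For the record, the malloc/displace idiom is a one-liner; a minimal sketch (the names are mine, and since the displaced pointer stays inside the allocated block, this is well-defined C):

  #include <stdlib.h>

  int main(void)
  {
      int lo = -50, hi = 50;

      /* allocate hi-lo+1 doubles, then displace the pointer so that
         vec[lo] .. vec[hi] are exactly the valid elements */
      double *base = malloc((size_t)(hi - lo + 1) * sizeof(double));
      double *vec = base - lo;

      for (int i = lo; i <= hi; i++)
          vec[i] = (double) i;

      free(base);   /* free the undisplaced pointer, not vec */
      return 0;
  }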
Well, you can create typemaps for perl which tell it what to do with the C types. This is where you would need some metadata magic on the SV/AV types to make sense of this in perl. And you are right, this might be where you could exploit SWIG to do the heavy lifting.
Or do the SV/AV stuff by hand. But it all just gives me a headache thinking about it. The GSL types are well-documented and transparent (for all that the provided interface is opaque), so any problem can be fairly easily solved by the bold and skillful. In perl, you have no hope of pulling off the same trick without reading the source and then some.
Hmmm. Perl (and the other dynamic languages) are smart enough to deal with arrays/hashes of anything, including other arrays and hashes. So if you wanted to create what you are talking about there in pure perl:

  # tensor.pm
  package tensor;
  use base qw(Class::Accessor);
  1;

  # and then in the script:
  use tensor;
  my $tensor = tensor->new({ xdim => 10, ydim => 10, zdim => 10 });
  for my $i (0 .. $tensor->{xdim} - 1) {
    for my $j (0 .. $tensor->{ydim} - 1) {
      for my $k (0 .. $tensor->{zdim} - 1) {
        # autovivification builds the nested anonymous arrays on the
        # fly: element ($i,$j,$k) under the "tensor" key is set to the
        # result of calling epsilon($i,$j,$k)
        $tensor->{tensor}[$i][$j][$k] = epsilon($i, $j, $k);
      }
    }
  }

or something like that. Coding before coffee, always dangerous ... Probably some bugs in my tensor.
No, you miss my point. I know how to do the same thing in perl a couple of different ways, even -- one hash element could be an array, for example, to be a bit closer to the C. I mean: how do you build a block of plain memory suitable for passing to a C routine that expects to get a mytensor? As the carefully constructed contents of a string? However you create and fill a tensor in perl, you have to pass the tensor in (by reference, at a guess) and then be able to dereference it in C, OR be able to create a block of memory on the perl side that just happens to be exactly what the C routine interprets as "a tensor". Coding this pretty much requires a lot of fairly ugly code in a translation layer using accessors, or a really DEEP knowledge of where and how perl stores the actual numbers. Alas, I think that the core perl data storage routines are pretty much all hashes pointing to anonymous memory blocks, so that getting to the data except by means of an accessor, or in the context of the interpreter itself, is pretty much impossible. But I don't really know, and don't have the time or energy to work it out.
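To be concrete about what the C side wants, it is just something like the following (mytensor and its fields are purely illustrative here, not any real library's type):

  #include <stddef.h>

  /* a contiguous block of doubles plus its dimensions, row-major */
  typedef struct {
      size_t nx, ny, nz;
      double *data;     /* nx*ny*nz doubles */
  } mytensor;

  /* dereference element (i,j,k) of the packed block */
  double mytensor_get(const mytensor *t, size_t i, size_t j, size_t k)
  {
      return t->data[(i * t->ny + j) * t->nz + k];
  }

The whole problem is getting perl to hand you exactly that packed block of doubles, rather than its own hash-of-anonymous-SVs representation.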
Actually you have far more control over what gets emitted in the instruction sequence. Sometimes this is a very good thing. There are some loops in C that are really better off in SSE, though nothing you can do to the compilers will convince them of this. So you rewrite those loops in assembler.
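For instance, something like this saxpy-style loop (sketched here with SSE intrinsics rather than raw assembler, which get at the same instructions; the name saxpy4 is illustrative):

  #include <xmmintrin.h>   /* SSE intrinsics, x86 only */

  /* y[i] += a*x[i], four floats per iteration */
  void saxpy4(float *y, const float *x, float a, int n)
  {
      __m128 va = _mm_set1_ps(a);
      int i;
      for (i = 0; i + 4 <= n; i += 4) {
          __m128 vx = _mm_loadu_ps(x + i);
          __m128 vy = _mm_loadu_ps(y + i);
          _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(va, vx)));
      }
      for (; i < n; i++)   /* scalar cleanup for the remainder */
          y[i] += a * x[i];
  }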
Sure. I was simply referring to how one views and manages memory blocks. For the actual assembler code itself one is still at the mercy of the compiler. However, inlining assembler in C is really pretty easy, if the payoff is worth it. The C/assembler boundary is a thin one, and for MANY (routine) things the compiler produces assembler that isn't far away from what you'd do by hand. Using relatively new things like SSE, or tuning code to a given cache or instruction architecture (e.g. i386 generic vs i686), is something that you can only hope is done well in the compiler.

But that's why companies like Pathscale exist, right? One CAN write a better/smarter compiler -- it is just a hell of a lot of work, especially since it requires extremely rigorous and demanding testing, as coders absolutely hate it when the programming environment they use has deep bugs in it. It's hard enough for me to debug my OWN code with a programming toolset I have absolute confidence in (so that I can be pretty damn certain any error I encounter is my own damn fault). I've encountered compiler/CPU errors a handful of times over the years, and they can easily leave one gibbering and looking around to see if the walls of reality itself are similarly crumbling.

It's also something I'm very interested in regarding beowulfery and numerical code in general (there, and you thought this whole thread had turned INFINITELY OT, didn't you :-). Code tuning and optimization ultimately requires a very accurate knowledge of the timings of all sorts of elementary and structured composite operations, in context. Generally speaking, these timings are hidden from the programmer so thoroughly that it becomes very difficult indeed to estimate whether a given chunk of code will be faster coded this way or that way, any way but empirically (and then only in a very coarse-grained way, by timing whole blocks of code). Yet significant optimizations are sometimes just a matter of reordering instructions or writing loops a certain way.

Alas, measuring timings has a sort of Heisenberg uncertainty to it (seriously). You have to insert timing code around the blocks to be timed, and this of course CHANGES the environment in which those blocks are being executed, in addition to adding the times required to perform the timings, which then must themselves be timed and compensated for. Microbenchmarking is the art of making accurate measurements of timings of very small blocks of code (ideally down to atomic instructions, although timing a single general-purpose instruction turns out to be VERY difficult, as the timing code takes longer to execute than the code being timed and really does destroy its execution context in a real program).

Still, I dream of (and have started work on) a microbenchmarking daemon that produces an XML file containing timings of all sorts of low-level operations in a generic context, which information can be used by the programmer in assessing how best to write their code, or IN the program to do things like autotune loops (stride, blocking, algorithm) or code partitioning in a parallel environment. This is a real job of work, however, and I'll probably need grant support to finish and maintain it. Which is on the agenda real soon now (as soon as I get dieharder back to par as a library + UI, maybe sometime today, as I'm more than halfway there and the remaining work is all tedious boilerplate implementation of existing code).
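As a minimal sketch of the compensation problem (using POSIX clock_gettime, link with -lrt on older systems; the structure, not the absolute numbers, is the point):

  #include <stdio.h>
  #include <time.h>

  /* wall-clock seconds from the monotonic POSIX clock */
  static double now(void)
  {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

  int main(void)
  {
      /* first, time the timing calls themselves... */
      double t0 = now();
      double t1 = now();
      double overhead = t1 - t0;

      /* ...then time the block of interest and subtract the overhead */
      volatile double a = 0.0;
      double t2 = now();
      for (int i = 0; i < 1000000; i++)
          a += i * 1e-9;        /* stand-in for the code under test */
      double t3 = now();

      printf("block: %g s (timer overhead ~%g s)\n",
             t3 - t2 - overhead, overhead);
      return 0;
  }

And of course even this is only a first-order correction -- the calls to now() still perturb the cache and pipeline state around the block being measured.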
This is ... fun (i.e. best left to grad students who like the challenge and aren't focused on the drudgery of writing hand-coded assembly).
I have no such grad students, although that may change subject to the availability of grant money. And I will say from direct experience that one has to get the RIGHT grad student to make this a win-win situation. It is like the situation with kids: which is faster, getting a teenage son to wash the dishes or washing them yourself? Which one is EASIER, and less stressful? Washing them yourself, by far and away. I can write and debug several programs on my own in the time it would require to mentor a student through just one of them, unless the student were a born ubercoder. Which exist, don't get me wrong. I've even had a few as (undergrad) students in my physics classes. We don't seem to get them in physics grad school here, though -- go figure.
Unfortunately, my time to do R&D and pursue fun projects is bounded, as my time needs to be spent on those things that generate revenue. This is either a blessing or a curse; I prefer to think of it as the former. If we get enough revenue, then we have time to "fund" research (i.e. spend time on things other than revenue projects and product R&D).
This is no different, really, from life in academia. Life is a struggle always: to exist first, to live well second, to live well and get useful and important things done third, where a few selfless folks might invert the order of the last two to at least some extent.

But anyway, I think this is enough semi-OT digression and I have real work to do, so I hereby abandon this thread. (sounds of cheering in the background from offices all over the world... ;-)

   rgb

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525  email: [EMAIL PROTECTED]