optimizing calling conventions for function returns

2006-05-23 Thread Jon Smirl

Looking at assembly listings of the Linux kernel I see thousands of
places where function return values are checked for non-zero to
indicate errors. For example, something like this:

    mov ebx, 0
.L1:
    call foo
    test eax, eax
    jnz .Lerror
    inc ebx
    cmp ebx, 10
    jne .L1
    ...

.Lerror:
    ; process the error

A new calling convention could push two return addresses for functions
that return their status in EAX. On EAX=0 you take the first return,
EAX != 0 you take the second.

So the above code becomes:

push .Lerror
mov ebx, 0
.L1:
    call foo
    inc ebx
    cmp ebx, 10
    jne .L1
    add esp, 4
    ...

.Lerror:
    ; process the error

The called function then does a plain 'ret' when EAX is zero. When EAX
is non-zero it pops the normal return address and returns to the error
address instead ('add esp, 4' then 'ret').

Of course there are many further optimizations that can be done, but
this illustrates the concept.
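
For reference, the C-level idiom that generates the call/test/jnz
sequence above is roughly this sketch; foo() and process_error() are
just placeholders, not real kernel functions:

/* Placeholders, not real kernel functions. */
extern int foo(void);
extern void process_error(void);

void caller(void)
{
        int i;

        for (i = 0; i < 10; i++) {
                /* compiles to: call foo; test eax,eax; jnz .Lerror */
                if (foo() != 0) {
                        process_error();
                        return;
                }
        }
}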

Has work been done to evaluate a calling convention that takes error
checks like this into account? Are there size/performance wins? Or am
I just reinventing a variation on exception handling?

--
Jon Smirl
[EMAIL PROTECTED]


Re: optimizing calling conventions for function returns

2006-05-23 Thread Jon Smirl

On 5/23/06, Paul Brook <[EMAIL PROTECTED]> wrote:

> > Has work been done to evaluate a calling convention that takes error
> > checks like this into account? Are there size/performance wins? Or am
> > I just reinventing a variation on exception handling?
>
> This introduces an extra stack push and will confuse a call-stack branch
> predictor. If both the call stack and the test are normally predicted
> correctly I'd guess this would be a performance loss on modern cpus.


Note that the error return is above the normal return and not placed
there by a call, so it should look like data to the predictor. The
normal return is placed on the stack by a call, which should continue
to be correctly predicted. I would expect the error return path to be
mispredicted, but it is supposed to be the unlikely case. Is the
call-stack branch correctly predicted if the routine being called is
complex?

This does eliminate the test/jmp after every function call.

Further branches could be eliminated by having multiple returns from
the called function at the expense of increasing code size.



> Paul




--
Jon Smirl
[EMAIL PROTECTED]


Re: optimizing calling conventions for function returns

2006-05-23 Thread Jon Smirl

On 5/23/06, Florian Weimer <[EMAIL PROTECTED]> wrote:

> Yes, but the test/jump now happens in the callee, and you need to
> maintain an additional stack slot.  I wouldn't be surprised if the
> change isn't a win.

The callee already had to implement the test/jmp in order to decide to
return the error, so this shouldn't introduce another one.

> Some form of exception handling for truly
> exceptional situations would probably be better (and might have helped
> to avoid quite a few of the last CVEs 8-).




--
Jon Smirl
[EMAIL PROTECTED]


Re: optimizing calling conventions for function returns

2006-05-23 Thread Jon Smirl

On 5/23/06, Gabriel Paubert <[EMAIL PROTECTED]> wrote:

> On Tue, May 23, 2006 at 11:21:46AM -0400, Jon Smirl wrote:
> > Has work been done to evaluate a calling convention that takes error
> > checks like this into account? Are there size/performance wins? Or am
> > I just reinventing a variation on exception handling?
>
> It's fairly close to Fortran alternate return labels, which
> were standard in Fortran 77 but have been declared obsolescent
> in later revisions of the standard.


I like this method since it can be implemented transparently in C
code. That means the Linux kernel could use it without rewriting
everything.
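
For example, typical kernel-style error handling like this hypothetical
snippet would compile unchanged, and the compiler alone would pick the
call sites where the two-return convention pays off (all names below
are made up):

/* All names here are hypothetical; the point is that this common
 * error handling style needs no source changes under the new ABI. */
struct device;
extern int request_resources(struct device *dev);
extern int enable_hardware(struct device *dev);
extern void release_resources(struct device *dev);

int setup(struct device *dev)
{
        int err;

        err = request_resources(dev);
        if (err)
                goto fail;

        err = enable_hardware(dev);
        if (err)
                goto fail_resources;

        return 0;

fail_resources:
        release_resources(dev);
fail:
        return err;
}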



> Regards,
>         Gabriel




--
Jon Smirl
[EMAIL PROTECTED]


Re: optimizing calling conventions for function returns

2006-05-24 Thread Jon Smirl

On 5/23/06, Paul Brook <[EMAIL PROTECTED]> wrote:

> > Has work been done to evaluate a calling convention that takes error
> > checks like this into account? Are there size/performance wins? Or am
> > I just reinventing a variation on exception handling?
>
> This introduces an extra stack push and will confuse a call-stack branch
> predictor. If both the call stack and the test are normally predicted
> correctly I'd guess this would be a performance loss on modern cpus.


I just finished writing a bunch of test cases to explore the idea. My
conclusion is that if the error returns are very infrequent (<<1%)
then this is a win. But if there are a significant number of error
returns this is a major loss.

These two instructions on the error return path are the killer:
addl    $4, %esp
ret     /* Return to error return */

Apparently the CPU has zero expectation that the address being jumped
to is code. In the calling routine I pushed the error return as data.

pushl   $.L11   /* push return address */

So for the non-error path there is a win by removing the error
test/jmp on the function return. But taking the error path is very
expensive.

I'm experimenting with 50-line assembly programs on a P4. I do wonder
if these micro results would apply in a macro program. My test is
losing because the return destination had been predicted and the
introduction of the addl messed up the prediction. But in a large
program with many levels of calls, would the return always be
predicted on the error path?


--
Jon Smirl
[EMAIL PROTECTED]


Re: optimizing calling conventions for function returns

2006-05-25 Thread Jon Smirl

On 5/25/06, Geert Bosch <[EMAIL PROTECTED]> wrote:


> On May 23, 2006, at 11:21, Jon Smirl wrote:
>
> > A new calling convention could push two return addresses for functions
> > that return their status in EAX. On EAX=0 you take the first return,
> > EAX != 0 you take the second.
>
> This seems the same as passing an extra function pointer
> argument and calling that instead of doing a regular return.
> Tail-call optimization should turn the call into a jump.
>
> Why do you think a custom ABI is necessary?


The new ABI may not be necessary, but adding an extra parameter would
require changing source everywhere. The ABI scheme is source
transparent and lets the compiler locate the places where it would be
a win. The ABI scheme would also let the alternative return be pushed
on the stack once no matter how many calls are made, whereas a
parameter has to be pushed on each call.

I ran into another snag: taking the alternative return on a P4 has
really bad performance impact since it messes up prefetch. This
sequence is the killer.

  addl    $4, %esp
  ret     /* Return to error return */

I can try coding this as a parameter and see how the compiler
generates code differently.
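
Roughly, the parameter version would look like this sketch, assuming
the error continuation never returns to the success path (every name
here is invented):

#include <stdio.h>
#include <stdlib.h>

/* Sketch of the extra-parameter variant; all names are made up. The
 * callee either returns normally (success) or tail-calls the error
 * continuation, which must never return to the success path. */
static int counter;

static void on_error(int err)           /* plays the role of .Lerror */
{
        fprintf(stderr, "error %d\n", err);
        exit(1);
}

static void foo(void (*fail)(int))
{
        if (++counter == 5)             /* pretend the 5th call fails */
                fail(counter);          /* tail call: can compile to a jmp */
        /* success: ordinary return */
}

int main(void)
{
        for (int i = 0; i < 10; i++)
                foo(on_error);
        return 0;
}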

The sequence call, test, jne (or slight variations) occurs in
thousands of places; if a better alternative can be found there could
be significant performance gains. I haven't found a good solution yet,
so any help would be appreciated.




>    -Geert




--
Jon Smirl
[EMAIL PROTECTED]


Re: optimizing calling conventions for function returns

2006-05-25 Thread Jon Smirl

On 5/25/06, Jon Smirl <[EMAIL PROTECTED]> wrote:

> I ran into another snag: taking the alternative return on a P4 has
> really bad performance impact since it messes up prefetch. This
> sequence is the killer.
>
>    addl    $4, %esp
>    ret     /* Return to error return */
>
> I can try coding this as a parameter and see how the compiler
> generates code differently.

   jmp   *4(%esp)

This is slightly faster than addl, ret.

But my micro scale benchmarks are extremely influenced by changes in
branch prediction. I still wonder how this would perform in large
programs.

It seems that the sequence

  ret
  test
  jne

is very fast compared to

  jmp   *4(%esp)

Even when they both end up at the same place. It looks to me like the
call stack predictor is controlling everything.  The only way to make
this work would be to figure out some way to get the alternative
return address into the call stack predictor.


--
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-05 Thread Jon Smirl
On 12/6/07, Daniel Berlin <[EMAIL PROTECTED]> wrote:
> > While you won't get the git svn metadata if you clone the infradead
> > repo, it can be recreated on the fly by git svn if you want to start
> > commiting directly to gcc svn.
> >
> I will give this a try :)

Back when I was working on the Mozilla repository we were able to
convert the full 4GB CVS repository, complete with all history, into a
450MB pack file. That work is where the git fast-import tool came
from. But it took a month of messing with the import tools to achieve
this, and Mozilla still chose another VCS (mainly because of poor
Windows support in git).

Like Linus says, this type of command will yield the smallest pack file:
 git repack -a -d --depth=250 --window=250

I do agree that importing multi-gigabyte repositories is not a daily
occurrence nor a turn-key operation. There are significant issues when
translating from one VCS to another. The lack of global branch
tracking in CVS causes extreme problems on import. Hand editing of CVS
files also caused endless trouble.

The key to converting repositories of this size is RAM: 4GB minimum,
more would be better. git-repack is not multi-threaded. There were a
few attempts at making it multi-threaded but none were too successful.
If I remember right, with loads of RAM, a repack of a 450MB repository
was taking about five hours on a 2.8GHz Core 2. But this is something
you only have to do once for the import. Later repacks will reuse the
original deltas.

-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>
> On Thu, 6 Dec 2007, Jeff King wrote:
> >
> > What is really disappointing is that we saved only about 20% of the
> > time. I didn't sit around watching the stages, but my guess is that we
> > spent a long time in the single threaded "writing objects" stage with a
> > thrashing delta cache.
>
> I don't think you spent all that much time writing the objects. That part
> isn't very intensive, it's mostly about the IO.
>
> I suspect you may simply be dominated by memory-throughput issues. The
> delta matching doesn't cache all that well, and using two or more cores
> isn't going to help all that much if they are largely waiting for memory
> (and quite possibly also perhaps fighting each other for a shared cache?
> Is this a Core 2 with the shared L2?)

When I last looked at the code, the problem was in evenly dividing
the work. I was using a four core machine and most of the time one
core would end up with 3-5x the work of the lightest loaded core.
Setting pack.threads up to 20 fixed the problem. With a high number of
threads I was able to get a 4hr pack to finish in something like
1:15.

A scheme where each core could work for a minute without communicating
with the other cores would be best. It would also be more efficient if
the cores could avoid sync points between them.
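
Something along these lines, as a C/pthreads sketch (illustrative
only, not git's pack code; the object count comes from the runs below
and the chunk size is made up): a shared cursor hands out coarse
chunks, so each thread only touches the lock once per chunk.

#include <pthread.h>
#include <stdio.h>

#define NOBJECTS 648366          /* object count from the repacks below */
#define CHUNK    4096            /* coarse grain: one lock hit per chunk */
#define NTHREADS 4

static long cursor;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void process(long obj)
{
        (void)obj;               /* delta search would happen here */
}

static void *worker(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_mutex_lock(&lock);
                long start = cursor;
                cursor += CHUNK;
                pthread_mutex_unlock(&lock);

                if (start >= NOBJECTS)
                        return NULL;
                long end = start + CHUNK < NOBJECTS ? start + CHUNK : NOBJECTS;
                for (long i = start; i < end; i++)
                        process(i);
        }
}

int main(void)
{
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);
        puts("done");
        return 0;
}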

-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> > When I last looked at the code, the problem was in evenly dividing
> > the work. I was using a four core machine and most of the time one
> > core would end up with 3-5x the work of the lightest loaded core.
> > Setting pack.threads up to 20 fixed the problem. With a high number of
> > threads I was able to get a 4hr pack to finish in something like
> > 1:15.
>
> But as far as I know you didn't try my latest incarnation which has been
> available in Git's master branch for a few months already.

I've deleted all my giant packs. Using the kernel pack:
4GB Q6600

Using the current thread pack code I get these results.

The interesting case is the last one. I set it to 15 threads and
monitored with 'top'. For 0-60% compression I was at 300% CPU, 60-74%
was 200% CPU, and 74-100% was 100% CPU. It never used all four cores.
The only other things running were top and my desktop. This is the
same load balancing problem I observed earlier. Much more clock time
was spent in the 2/1 core phases than in the 3 core one.

Threaded, threads = 5

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 528994), reused 0 (delta 0)

real    1m31.395s
user    2m59.239s
sys     0m3.048s
[EMAIL PROTECTED]:/home/linux$

12 seconds counting
53 seconds compressing
38 seconds writing

Without threads,

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f
warning: no threads support, ignoring pack.threads
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 528999), reused 0 (delta 0)

real    2m54.849s
user    2m51.267s
sys     0m1.412s
[EMAIL PROTECTED]:/home/linux$

Threaded, threads = 5

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f --depth=250 --window=250
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 539080), reused 0 (delta 0)

real    9m18.032s
user    19m7.484s
sys     0m3.880s
[EMAIL PROTECTED]:/home/linux$

[EMAIL PROTECTED]:/home/linux/.git/objects/pack$ ls -l
total 182156
-r--r--r-- 1 jonsmirl jonsmirl  15561848 2007-12-06 16:15
pack-f1f8637d2c68eb1c964ec7c1877196c0c7513412.idx
-r--r--r-- 1 jonsmirl jonsmirl 170768761 2007-12-06 16:15
pack-f1f8637d2c68eb1c964ec7c1877196c0c7513412.pack
[EMAIL PROTECTED]:/home/linux/.git/objects/pack$

Non-threaded:

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f --depth=250 --window=250
warning: no threads support, ignoring pack.threads
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 539080), reused 0 (delta 0)

real    18m51.183s
user    18m46.538s
sys     0m1.604s
[EMAIL PROTECTED]:/home/linux$


[EMAIL PROTECTED]:/home/linux/.git/objects/pack$ ls -l
total 182156
-r--r--r-- 1 jonsmirl jonsmirl  15561848 2007-12-06 15:33
pack-f1f8637d2c68eb1c964ec7c1877196c0c7513412.idx
-r--r--r-- 1 jonsmirl jonsmirl 170768761 2007-12-06 15:33
pack-f1f8637d2c68eb1c964ec7c1877196c0c7513412.pack
[EMAIL PROTECTED]:/home/linux/.git/objects/pack$

Threaded, threads = 15

[EMAIL PROTECTED]:/home/linux$ time git repack -a -d -f --depth=250 --window=250
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 539080), reused 0 (delta 0)

real    9m18.325s
user    19m14.340s
sys     0m3.996s
[EMAIL PROTECTED]:/home/linux$

-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> On Thu, 6 Dec 2007, Jon Smirl wrote:
>
> > On 12/6/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> > > > When I last looked at the code, the problem was in evenly dividing
> > > > the work. I was using a four core machine and most of the time one
> > > > core would end up with 3-5x the work of the lightest loaded core.
> > > > Setting pack.threads up to 20 fixed the problem. With a high number of
> > > > threads I was able to get a 4hr pack to finish in something like
> > > > 1:15.
> > >
> > > But as far as I know you didn't try my latest incarnation which has been
> > > available in Git's master branch for a few months already.
> >
> > I've deleted all my giant packs. Using the kernel pack:
> > 4GB Q6600
> >
> > Using the current thread pack code I get these results.
> >
> > The interesting case is the last one. I set it to 15 threads and
> > monitored with 'top'.
> > For 0-60% compression I was at 300% CPU, 60-74% was 200% CPU and
> > 74-100% was 100% CPU. It never used all four cores. The only other
> > things running were top and my desktop. This is the same load
> > balancing problem I observed earlier.
>
> Well, that's possible with a window 25 times larger than the default.

Why did it never use more than three cores?

>
> The load balancing is solved with a master thread serving relatively
> small object list segments to any work thread that finished with its
> previous segment.  But the size for those segments is currently fixed to
> window * 1000 which is way too large when window == 250.
>
> I have to find a way to auto-tune that segment size somehow.
>
> But with the default window size there should not be any such noticeable
> load balancing problem.
>
> Note that threading only happens in the compression phase.  The count
> and write phases are hardly parallelized.
>
>
> Nicolas
>


-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> On Thu, 6 Dec 2007, Jon Smirl wrote:
>
> > On 12/6/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> > > > When I last looked at the code, the problem was in evenly dividing
> > > > the work. I was using a four core machine and most of the time one
> > > > core would end up with 3-5x the work of the lightest loaded core.
> > > > Setting pack.threads up to 20 fixed the problem. With a high number of
> > > > threads I was able to get a 4hr pack to finish in something like
> > > > 1:15.
> > >
> > > But as far as I know you didn't try my latest incarnation which has been
> > > available in Git's master branch for a few months already.
> >
> > I've deleted all my giant packs. Using the kernel pack:
> > 4GB Q6600
> >
> > Using the current thread pack code I get these results.
> >
> > The interesting case is the last one. I set it to 15 threads and
> > monitored with 'top'.
> > For 0-60% compression I was at 300% CPU, 60-74% was 200% CPU and
> > 74-100% was 100% CPU. It never used all four cores. The only other
> > things running were top and my desktop. This is the same load
> > balancing problem I observed earlier.
>
> Well, that's possible with a window 25 times larger than the default.
>
> The load balancing is solved with a master thread serving relatively
> small object list segments to any work thread that finished with its
> previous segment.  But the size for those segments is currently fixed to
> window * 1000 which is way too large when window == 250.
>
> I have to find a way to auto-tune that segment size somehow.

That would be nice. Threading is most important on the giant
pack/window combinations. The normal case is fast enough that I don't
really notice it. These giant pack/window combos can run 8-10 hours.

>
> But with the default window size there should not be any such noticeable
> load balancing problem.

I only spend 30 seconds in the compression phase without making the
window larger. It's not long enough to really see what is going on.

>
> Note that threading only happens in the compression phase.  The count
> and write phases are hardly parallelized.
>
>
> Nicolas
>


-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/6/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> > > Well, that's possible with a window 25 times larger than the default.
> >
> > Why did it never use more than three cores?
>
> You have 648366 objects total, and only 647457 of them are subject to
> delta compression.
>
> With a window size of 250 and a default thread segment of window * 1000
> that means only 3 segments will be distributed to threads, hence only 3
> threads with work to do.

One little tweak and the clock time drops from 9.5 to 6 minutes. The
tweak makes all four cores work.

[EMAIL PROTECTED]:/home/apps/git$ git diff
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 4f44658..e0dd12e 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1645,7 +1645,7 @@ static void ll_find_deltas(struct object_entry
**list, unsigned list_size,
}

/* this should be auto-tuned somehow */
-   chunk_size = window * 1000;
+   chunk_size = window * 50;

do {
unsigned sublist_size = chunk_size;


[EMAIL PROTECTED]:/home/linux/.git$ time git repack -a -d -f --depth=250 --window=250
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 539043), reused 0 (delta 0)

real    6m2.109s
user    20m0.491s
sys     0m4.608s
[EMAIL PROTECTED]:/home/linux/.git$



>
>
> Nicolas
>


-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/7/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>
> On Thu, 6 Dec 2007, Jon Smirl wrote:
> > >
> > > time git blame -C gcc/regclass.c > /dev/null
> >
> > [EMAIL PROTECTED]:/video/gcc$ time git blame -C gcc/regclass.c > /dev/null
> >
> > real    1m21.967s
> > user    1m21.329s
>
> Well, I was also hoping for a "compared to not-so-aggressive packing"
> number on the same machine.. IOW, what I was wondering is whether there is
> a visible performance downside to the deeper delta chains in the 300MB
> pack vs the (less aggressive) 500MB pack.

Same machine with a default pack

[EMAIL PROTECTED]:/video/gcc/.git/objects/pack$ ls -l
total 2145716
-r--r--r-- 1 jonsmirl jonsmirl   23667932 2007-12-07 02:03
pack-bd163555ea9240a7fdd07d2708a293872665f48b.idx
-r--r--r-- 1 jonsmirl jonsmirl 2171385413 2007-12-07 02:03
pack-bd163555ea9240a7fdd07d2708a293872665f48b.pack
[EMAIL PROTECTED]:/video/gcc/.git/objects/pack$

Delta chain lengths have virtually no impact. The bigger pack file
causes more IO, which offsets the increased delta processing time.

One of my rules is that smaller is almost always better. Smaller
eliminates IO and helps the CPU cache. It's like the kernel being
optimized for size instead of speed and ending up faster.

time git blame -C gcc/regclass.c > /dev/null
real    1m19.289s
user    1m17.853s
sys     0m0.952s



>
> Linus
>


-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-06 Thread Jon Smirl
On 12/7/07, Jeff King <[EMAIL PROTECTED]> wrote:
> On Thu, Dec 06, 2007 at 07:31:21PM -0800, David Miller wrote:
>
> > > So it is about 5% bigger. What is really disappointing is that we saved
> > > only about 20% of the time. I didn't sit around watching the stages, but
> > > my guess is that we spent a long time in the single threaded "writing
> > > objects" stage with a thrashing delta cache.
> >
> > If someone can give me a good way to run this test case I can
> > have my 64-cpu Niagara-2 box crunch on this and see how fast
> > it goes and how much larger the resulting pack file is.
>
> That would be fun to see. The procedure I am using is this:
>
> # compile recent git master with threaded delta
> cd git
> echo THREADED_DELTA_SEARCH = 1 >>config.mak
> make install
>
> # get the gcc pack
> mkdir gcc && cd gcc
> git --bare init
> git config remote.gcc.url git://git.infradead.org/gcc.git
> git config remote.gcc.fetch \
>   '+refs/remotes/gcc.gnu.org/*:refs/remotes/gcc.gnu.org/*'
> git remote update
>
> # make a copy, so we can run further tests from a known point
> cd ..
> cp -a gcc test
>
> # and test multithreaded large depth/window repacking
> cd test
> git config pack.threads 4

Use 64 threads with 64 CPUs; if they are multicore you want even more.
You also need to adjust chunk_size as mentioned in the other mail.


> time git repack -a -d -f --window=250 --depth=250
>
> -Peff
>


-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Git and GCC

2007-12-07 Thread Jon Smirl
On 12/6/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>
> On Thu, 6 Dec 2007, Harvey Harrison wrote:
> >
> > I've updated the public mirror repo with the very-packed version.
>
> Side note: it might be interesting to compare timings for
> history-intensive stuff with and without this kind of very-packed
> situation.
>
> The very density of a smaller pack-file might be enough to overcome the
> downsides (more CPU time to apply longer delta-chains), but regardless,
> real numbers talks, bullshit walks. So wouldn't it be nice to have real
> numbers?
>
> One easy way to get real numbers for history would be to just time some
> reasonably costly operation that uses lots of history. Ie just do a
>
> time git blame -C gcc/regclass.c > /dev/null
>
> and see if the deeper delta chains are very expensive.

[EMAIL PROTECTED]:/video/gcc$ time git blame -C gcc/regclass.c > /dev/null

real    1m21.967s
user    1m21.329s
sys     0m0.640s

The Mozilla repo is at least 50% larger than the gcc one. It took me
23 minutes to repack the gcc one on my $800 Dell. The trick to this is
lots of RAM and 64-bit. There is little disk IO during the compression
phase; everything is cached.

I had a 4.8GB git process with 4GB of physical memory. Everything
started slowing down a lot when the process got that big. Does git
really need 4.8GB to repack? I could only keep 3.4GB resident. Luckily
this happened at 95% completion. With 8GB of memory you should be able
to do this repack in under 20 minutes.

[EMAIL PROTECTED]:/video/gcc$ time git repack -a -d -f --depth=250 --window=250
real    22m54.380s
user    69m18.948s
sys     0m23.773s


> (Yeah, the above is pretty much designed to be the worst possible case for
> this kind of aggressive history packing, but I don't know if that choice
> of file to try to annotate is a good choice or not. I suspect that "git
> blame -C" with a CVS import is just horrid, because CVS commits tend to be
> pretty big and nasty and not as localized as we've tried to make things in
> the kernel, so doing the code copy detection is probably horrendously
> expensive)
>
> Linus


-- 
Jon Smirl
[EMAIL PROTECTED]



Re: Something is broken in repack

2007-12-10 Thread Jon Smirl
I added the gcc people to the CC; it's their repository. Maybe they
can help us sort this out.

On 12/11/07, Jon Smirl <[EMAIL PROTECTED]> wrote:
> On 12/10/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> > On Mon, 10 Dec 2007, Jon Smirl wrote:
> >
> > > New run using same configuration. With the addition of the more
> > > efficient load balancing patches and delta cache accounting.
> > >
> > > Seconds are wall clock time. They are lower since the patch made
> > > threading better at using all four cores. I am stuck at 380-390% CPU
> > > utilization for the git process.
> > >
> > > complete  seconds  RAM
> > > 10%         60     900M (includes counting)
> > > 20%         15     900M
> > > 30%         15     900M
> > > 40%         50     1.2G
> > > 50%         80     1.3G
> > > 60%         70     1.7G
> > > 70%        140     1.8G
> > > 80%        180     2.0G
> > > 90%        280     2.2G
> > > 95%        530     2.8G - 1,420 total to here, previous was 1,983
> > > 100%      1390     2.85G
> > > During the writing phase RAM fell to 1.6G
> > > What is being freed in the writing phase??
> >
> > The cached delta results, but you put a cap of 256MB for them.
> >
> > Could you try again with that cache disabled entirely, with
> > pack.deltacachesize = 1 (don't use 0 as that means unbounded).
> >
> > And then, while still keeping the delta cache disabled, could you try
> > with pack.threads = 2, and pack.threads = 1 ?
> >
> > I'm sorry to ask you to do this but I don't have enough ram to even
> > complete a repack with threads=2 so I'm reattempting single threaded at
> > the moment.  But I really wonder if the threading has such an effect on
> > memory usage.
>
> I already have a threads = 1 running with this config. Binary and
> config were same from threads=4 run.
>
> 10% 28min 950M
> 40% 135min 950M
> 50% 157min 900M
> 60% 160min 830M
> 100% 170min 830M
>
> Something is hurting bad with threads. 170 CPU minutes with one
> thread, versus 195 CPU minutes with four threads.
>
> Is there a different memory allocator that can be used when
> multithreaded on gcc? This whole problem may be coming from the memory
> allocation function. git is hardly interacting at all on the thread
> level so it's likely a problem in the C run-time.
>
> [core]
> repositoryformatversion = 0
> filemode = true
> bare = false
> logallrefupdates = true
> [pack]
> threads = 1
> deltacachesize = 256M
> windowmemory = 256M
> deltacachelimit = 0
> [remote "origin"]
> url = git://git.infradead.org/gcc.git
> fetch = +refs/heads/*:refs/remotes/origin/*
> [branch "trunk"]
> remote = origin
> merge = refs/heads/trunk
>
>
>
>
> >
> >
> >
> > >
> > > I have no explanation for the change in RAM usage. Two guesses come to
> > > mind. Memory fragmentation. Or the change in the way the work was
> > > split up altered RAM usage.
> > >
> > > Total CPU time was 195 minutes in 70 minutes clock time. About 70%
> > > efficient. During the compress phase all four cores were active until
> > > the last 90 seconds. Writing the objects took over 23 minutes CPU
> > > bound on one core.
> > >
> > > New pack file is: 270,594,853
> > > Old one was: 344,543,752
> > > It still has 828,660 objects
> >
> > You mean the pack for the gcc repo is now less than 300MB?  Wow.
> >
> >
> > Nicolas
> >
>
>
> --
> Jon Smirl
> [EMAIL PROTECTED]
>


-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Something is broken in repack

2007-12-10 Thread Jon Smirl
Switching to the Google perftools malloc
http://goog-perftools.sourceforge.net/

10%   30  828M
20%   15  831M
30%   10  834M
40%   50  1014M
50%   80  1086M
60%   80  1500M
70% 200  1.53G
80% 200  1.85G
90% 260  1.87G
95% 520  1.97G
100% 1335 2.24G

The Google allocator knocked 600MB off the memory use. Memory
consumption did not fall during the write-out phase like it did with
the glibc allocator.

Since all of this is the same code except for the change to the
threading split, those runs where memory consumption went to 4.5GB
with the glibc allocator must have triggered an extreme fragmentation
problem.

Total CPU time was 196 CPU minutes vs 190 for glibc, so Google's claim
of being faster does not hold here.

So why does our threaded code take 20 CPU minutes (12%) longer to run
than the same code with a single thread? Clock time is obviously
faster. Are the threads working too close to each other in memory and
bouncing cache lines between the cores? The Q6600 is just two E6600s
in the same package; the caches are not shared.
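
If cache line bouncing is the problem, one cheap experiment would be
padding any per-thread state out to a full line. A sketch, purely an
assumption on my part, not a known git problem:

/* Assumes per-thread counters currently share cache lines; 64 bytes
 * is the line size on these cores. Padding keeps each worker's writes
 * on its own line so the caches never ping-pong the data. */
#define CACHE_LINE 64

struct worker_stats {
        unsigned long objects_done;
        unsigned long bytes_out;
        char pad[CACHE_LINE - 2 * sizeof(unsigned long)];
};

struct worker_stats stats[4] __attribute__((aligned(CACHE_LINE)));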

Why does the threaded code need 2.24GB (Google allocator; 2.85GB with
glibc) with 4 threads, but only 950MB with one thread? Where's the
extra gigabyte going?

Is there another allocator to try? One that combines Google's memory
efficiency with glibc's speed?


On 12/11/07, Jon Smirl <[EMAIL PROTECTED]> wrote:
> I added the gcc people to the CC; it's their repository. Maybe they
> can help us sort this out.
>
> On 12/11/07, Jon Smirl <[EMAIL PROTECTED]> wrote:
> > On 12/10/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> > > On Mon, 10 Dec 2007, Jon Smirl wrote:
> > >
> > > > New run using same configuration. With the addition of the more
> > > > efficient load balancing patches and delta cache accounting.
> > > >
> > > > Seconds are wall clock time. They are lower since the patch made
> > > > threading better at using all four cores. I am stuck at 380-390% CPU
> > > > utilization for the git process.
> > > >
> > > > complete  seconds  RAM
> > > > 10%         60     900M (includes counting)
> > > > 20%         15     900M
> > > > 30%         15     900M
> > > > 40%         50     1.2G
> > > > 50%         80     1.3G
> > > > 60%         70     1.7G
> > > > 70%        140     1.8G
> > > > 80%        180     2.0G
> > > > 90%        280     2.2G
> > > > 95%        530     2.8G - 1,420 total to here, previous was 1,983
> > > > 100%      1390     2.85G
> > > > During the writing phase RAM fell to 1.6G
> > > > What is being freed in the writing phase??
> > >
> > > The cached delta results, but you put a cap of 256MB for them.
> > >
> > > Could you try again with that cache disabled entirely, with
> > > pack.deltacachesize = 1 (don't use 0 as that means unbounded).
> > >
> > > And then, while still keeping the delta cache disabled, could you try
> > > with pack.threads = 2, and pack.threads = 1 ?
> > >
> > > I'm sorry to ask you to do this but I don't have enough ram to even
> > > complete a repack with threads=2 so I'm reattempting single threaded at
> > > the moment.  But I really wonder if the threading has such an effect on
> > > memory usage.
> >
> > I already have a threads = 1 running with this config. Binary and
> > config were same from threads=4 run.
> >
> > 10% 28min 950M
> > 40% 135min 950M
> > 50% 157min 900M
> > 60% 160min 830M
> > 100% 170min 830M
> >
> > Something is hurting bad with threads. 170 CPU minutes with one
> > thread, versus 195 CPU minutes with four threads.
> >
> > Is there a different memory allocator that can be used when
> > multithreaded on gcc? This whole problem may be coming from the memory
> > allocation function. git is hardly interacting at all on the thread
> > level so it's likely a problem in the C run-time.
> >
> > [core]
> > repositoryformatversion = 0
> > filemode = true
> > bare = false
> > logallrefupdates = true
> > [pack]
> > threads = 1
> > deltacachesize = 256M
> > windowmemory = 256M
> > deltacachelimit = 0
> > [remote "origin"]
> > url = git://git.infradead.org/gcc.git
> > fetch = +refs/heads/*:refs/remotes/origin/*
> > [branch "trunk"]
> > remote = origin
> > merge = refs/heads/trunk
> >
> >
> >
> >
> > >
> > >
> > >
> > > >
> > > > I have no explanation for the change in RAM usage. Two guesses come to
> > > > mind. Memory fragmentation. Or the change in the way the work was
> > > > split up altered RAM usage.
> > > >
> > > > Total CPU time was 195 minutes in 70 minutes clock time. About 70%
> > > > efficient. During the compress phase all four cores were active until
> > > > the last 90 seconds. Writing the objects took over 23 minutes CPU
> > > > bound on one core.
> > > >
> > > > New pack file is: 270,594,853
> > > > Old one was: 344,543,752
> > > > It still has 828,660 objects
> > >
> > > You mean the pack for the gcc repo is now less than 300MB?  Wow.
> > >
> > >
> > > Nicolas
> > >
> >
> >
> > --
> > Jon Smirl
> > [EMAIL PROTECTED]
> >
>
>
> --
> Jon Smirl
> [EMAIL PROTECTED]
>


-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Something is broken in repack

2007-12-11 Thread Jon Smirl
On 12/11/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> On Tue, 11 Dec 2007, Nicolas Pitre wrote:
>
> > And yet, this is still missing the actual issue.  The issue being that
> > the 2.1GB pack as a _source_ doesn't cause as much memory to be
> > allocated even if the _result_ pack ends up being the same.
> >
> > I was able to repack the 2.1GB pack on my machine which has 1GB of ram.
> > Now that it has been repacked, I can't repack it anymore, even when
> single threaded, as it starts crawling into swap fairly quickly.  It is
> > really non intuitive and actually senseless that Git would require twice
> > as much RAM to deal with a pack that is 7 times smaller.
>
> OK, here's something else for you to try:
>
> core.deltabasecachelimit=0
> pack.threads=2
> pack.deltacachesize=1
>
> With that I'm able to repack the small gcc pack on my machine with 1GB
> of ram using:
>
> git repack -a -f -d --window=250 --depth=250
>
> and top reports a ~700m virt and ~500m res without hitting swap at all.
> It is only at 25% so far, but I was unable to get that far before.
>
> Would be curious to know what you get with 4 threads on your machine.

Changing those parameters really slowed down counting the objects. I
used to be able to count in 45 seconds; now it takes 130 seconds. I
still have the Google allocator linked in.

4 threads, cumulative clock time
25% 200 seconds, 820/627M
55% 510 seconds, 1240/1000M - little late recording
75% 15 minutes, 1658/1500M
90%  22 minutes, 1974/1800M
It's still running but there is no significant change.

Are two types of allocations being mixed?
1) long term, global objects kept until the end of everything
2) volatile, private objects allocated only while the object is being
compressed and then freed

Separating these would make a big difference to the fragmentation
problem. Single threading probably wouldn't see a fragmentation
problem from mixing the allocation types.

When a thread is created it could allocate a private 20MB (or
whatever) pool. The volatile, private objects would come from that
pool. Long term objects would stay in the global pool. Since they are
long term they will just get laid down sequentially in memory.
Separating these allocation types makes things way easier for malloc.
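
A rough sketch of the pool idea (my invention, not git's code): the
short-lived compression buffers come from a private bump arena, so
they never interleave with the long-lived global allocations, and
freeing the whole batch is a single pointer reset.

#include <stddef.h>
#include <stdlib.h>

struct arena {
        char   *base;
        size_t  size;
        size_t  used;
};

int arena_init(struct arena *a, size_t size)
{
        a->base = malloc(size);
        a->size = size;
        a->used = 0;
        return a->base ? 0 : -1;
}

void *arena_alloc(struct arena *a, size_t n)
{
        n = (n + 15) & ~(size_t)15;     /* keep returned blocks aligned */
        if (a->used + n > a->size)
                return NULL;            /* real code would fall back to malloc */
        void *p = a->base + a->used;
        a->used += n;
        return p;
}

void arena_reset(struct arena *a)
{
        a->used = 0;                    /* "frees" every object at once */
}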

CPU time would be helped by removing some of the locking if possible.

-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Something is broken in repack

2007-12-11 Thread Jon Smirl
On 12/11/07, Nicolas Pitre <[EMAIL PROTECTED]> wrote:
> On Tue, 11 Dec 2007, Nicolas Pitre wrote:
>
> > OK, here's something else for you to try:
> >
> >   core.deltabasecachelimit=0
> >   pack.threads=2
> >   pack.deltacachesize=1
> >
> > With that I'm able to repack the small gcc pack on my machine with 1GB
> > of ram using:
> >
> >   git repack -a -f -d --window=250 --depth=250
> >
> > and top reports a ~700m virt and ~500m res without hitting swap at all.
> > It is only at 25% so far, but I was unable to get that far before.
>
> Well, around 55% memory usage skyrocketed to 1.6GB and the system went
> deep into swap.  So I restarted it with no threads.
>
> Nicolas (even more puzzled)

On the plus side you are seeing what I see, so it proves I am not imagining it.


-- 
Jon Smirl
[EMAIL PROTECTED]


Re: Something is broken in repack

2007-12-11 Thread Jon Smirl
On 12/11/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>
> On Tue, 11 Dec 2007, Jon Smirl wrote:
> >
> > So why does our threaded code take 20 CPU minutes longer (12%) to run
> > than the same code with a single thread?
>
> Threaded code *always* takes more CPU time. The only thing you can hope
> for is a wall-clock reduction. You're seeing probably a combination of
>  (a) more cache misses
>  (b) bigger dataset active at a time
> and a probably fairly miniscule
>  (c) threading itself tends to have some overheads.
>
> > Q6600 is just two E6600s in the same package, the caches are not shared.
>
> Sure they are shared. They're just not *entirely* shared. But they are
> shared between each two cores, so each thread essentially has only half
> the cache they had with the non-threaded version.
>
> Threading is *not* a magic solution to all problems. It gives you
> potentially twice the CPU power, but there are real downsides that you
> should keep in mind.
>
> > Why does the threaded code need 2.24GB (google allocator, 2.85GB gcc)
> > with 4 threads? But only need 950MB with one thread? Where's the extra
> > gigabyte going?
>
> I suspect that it's really simple: you have a few rather big files in the
> gcc history, with deep delta chains. And what happens when you have four
> threads running at the same time is that they all need to keep all those
> objects that they are working on - and their hash state - in memory at the
> same time!
>
> So if you want to use more threads, that _forces_ you to have a bigger
> memory footprint, simply because you have more "live" objects that you
> work on. Normally, that isn't much of a problem, since most source files
> are small, but if you have a few deep delta chains on big files, both the
> delta chain itself is going to use memory (you may have limited the size
> of the cache, but it's still needed for the actual delta generation, so
> it's not like the memory usage went away).

This makes sense. Those runs that blew up to 4.5GB were a combination
of this effect and fragmentation in the glibc allocator. The Google
allocator appears to be much better at controlling fragmentation.

Is there a reasonable scheme to force the chains to only be loaded
once and then shared between worker threads? The memory blow up
appears to be directly correlated with chain length.

>
> That said, I suspect there are a few things fighting you:
>
>  - threading is hard. I haven't looked a lot at the changes Nico did to do
>a threaded object packer, but what I've seen does not convince me it is
>correct. The "trg_entry" accesses are *mostly* protected with
>"cache_lock", but nothing else really seems to be, so quite frankly, I
>wouldn't trust the threaded version very much. It's off by default, and
>for a good reason, I think.
>
>For example: the packing code does this:
>
>         if (!src->data) {
>                 read_lock();
>                 src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz);
>                 read_unlock();
>         ...
>
>and that's racy. If two threads come in at roughly the same time and
>see a NULL src->data, they'll both get the lock, and they'll both
>(serially) try to fill it in. It will all *work*, but one of them will
>have done unnecessary work, and one of them will have their result
>thrown away and leaked.

That may account for the threaded version needing an extra 20 minutes
of CPU time. An extra 12% of CPU seems like too much overhead for
threading, but just letting a couple of those long chain compressions
be done twice could account for it.
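
For what it's worth, the usual fix is to re-check under the lock so
the loser never does the work. A sketch with stand-in types, not git's
real ones (strictly, the unlocked first read should also be an atomic
load):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t read_mutex = PTHREAD_MUTEX_INITIALIZER;

struct entry {
        void *data;
};

static void *load_object(void)          /* stands in for read_sha1_file() */
{
        return strdup("object body");
}

void *get_data(struct entry *e)
{
        if (!e->data) {                 /* cheap unlocked peek */
                pthread_mutex_lock(&read_mutex);
                if (!e->data)           /* re-check: another thread may have won */
                        e->data = load_object();
                pthread_mutex_unlock(&read_mutex);
        }
        return e->data;
}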

>
>Are you hitting issues like this? I dunno. The object sorting means
>that different threads normally shouldn't look at the same objects (not
>even the sources), so probably not, but basically, I wouldn't trust the
>threading 100%. It needs work, and it needs to stay off by default.
>
>  - you're working on a problem that isn't really even worth optimizing
>that much. The *normal* case is to re-use old deltas, which makes all
>of the issues you are fighting basically go away (because you only have
>a few _incremental_ objects that need deltaing).

I agree, this problem only occurs when people import giant
repositories. But every time someone hits these problems they declare
git to be screwed up and proceed to trash it in their blogs.

>In other words: the _real_ optimizations have already been done, and
>are done elsewhere, and are much smarter (the best way to optimize X is
>not to make X run fast, but to avoid doing X in the first place!). The
>

Re: Something is broken in repack

2007-12-12 Thread Jon Smirl
On 12/12/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
>
> On Wed, 12 Dec 2007, Nicolas Pitre wrote:
> >
> > So... my conclusion is that the glibc allocator has fragmentation issues
> > with this work load, given the notable difference with the Google
> > allocator, which itself might not be completely immune to fragmentation
> > issues of its own.
>
> Yes.
>
> Note that delta following involves patterns something like
>
>    allocate (small) space for delta
>    for i in (1..depth) {
>        allocate large space for base
>        allocate large space for result
>        .. apply delta ..
>        free large space for base
>        free small space for delta
>    }

Is it hard to hack up something that statically allocates a big block
of memory per thread for these two and then just reuses it?
   allocate (small) space for delta
   allocate large space for base

The alternating between long term and short term allocations
definitely aggravates fragmentation.
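
Something like this sketch, assuming a per-thread maximum object size
is known up front (apply_delta() is a placeholder, not git's real
function): grab the two large buffers once per thread and swap them
while walking the chain, so the loop above does no allocator traffic
at all.

#include <stdlib.h>

struct delta_scratch {
        void   *buf[2];                 /* base and result, reused each step */
        size_t  size;
};

extern size_t apply_delta(const void *base, size_t base_len,
                          const void *delta, size_t delta_len,
                          void *result);

int scratch_init(struct delta_scratch *s, size_t max_object_size)
{
        s->size = max_object_size;
        s->buf[0] = malloc(max_object_size);
        s->buf[1] = malloc(max_object_size);
        return (s->buf[0] && s->buf[1]) ? 0 : -1;
}

size_t resolve_chain(struct delta_scratch *s, const void **deltas,
                     const size_t *delta_len, int depth, size_t base_len)
{
        int cur = 0;                    /* index of the current base buffer */

        for (int i = 0; i < depth; i++) {
                base_len = apply_delta(s->buf[cur], base_len,
                                       deltas[i], delta_len[i],
                                       s->buf[1 - cur]);
                cur = 1 - cur;          /* result becomes the next base */
        }
        return base_len;                /* final object is in s->buf[cur] */
}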

>
> so if you have some stupid heap algorithm that doesn't try to merge and
> re-use free'd spaces very aggressively (because that takes CPU time!), you
> might have memory usage be horribly inflated by the heap having all those
> holes for all the objects that got free'd in the chain that don't get
> aggressively re-used.
>
> Threaded memory allocators then make this worse by probably using totally
> different heaps for different threads (in order to avoid locking), so they
> will *all* have the fragmentation issue.
>
> And if you *really* want to cause trouble for a memory allocator, what you
> should try to do is to allocate the memory in one thread, and free it in
> another, and then things can really explode (the freeing thread notices
> that the allocation is not in its thread-local heap, so instead of really
> freeing it, it puts it on a separate list of areas to be freed later by
> the original thread when it needs memory - or worse, it adds it to the
> local thread list, and makes it effectively totally impossible to then
> ever merge different free'd allocations ever again because the freed
> things will be on different heap lists!).
>
> I'm not saying that particular case happens in git, I'm just saying that
> it's not unheard of. And with the delta cache and the object lookup, it's
> not at _all_ impossible that we hit the "allocate in one thread, free in
> another" case!
>
> Linus
>


-- 
Jon Smirl
[EMAIL PROTECTED]