Martin Fick <[email protected]> writes:
> Sorry for the long winded rant. I suspect that some variation of all
> my suggestions have already been suggested, but maybe they will
> rekindle some older, now useful thoughts, or inspire some new ones.
> And maybe some of these are better to pursue than more parallelism?
We avoid doing a grand design document without having some prototype
implementation, but I think the limitation of the current protocol
has become apparent enough that we should do something about it, and
we should do it in a way that different implementations of Git can
all implement.
I think "multi-threaded clone" is the wrong title for this discussion,
in that the user does not care if it is done by multi-threading the
current logic or in any other way. The user just wants a faster
clone.
In addition, the current "fetch" protocol has the following problems
that limit us:
- It is not easy to make it resumable, because we recompute every
time. This is especially problematic for the initial fetch aka
"clone" as we will be talking about a large transfer [*1*].
- The protocol extension has a fairly low length limit [*2*].
- Because the protocol exchange starts by the server side
advertising all its refs, even when the fetcher is interested in
a single ref, the initial overhead is nontrivial, especially when
you are doing a small incremental update. The worst case is an
auto-builder that polls every five minutes, even when there are no
new commits to be fetched [*3*].
- Because we recompute every time, taking into account what the
fetcher has, in addition to what the fetcher obtained earlier
from us in order to reduce the transferred bytes, the payload for
incremental updates becomes tailor-made for each fetch and cannot
be easily reused [*4*].
I'd like to see a new protocol that lets us overcome the above
limitations (did I miss others? I am sure people can help here)
sometime this year.
[Footnotes]
*1* The "first fetch this bundle from elsewhere and then come back
here for incremental updates" idea raised earlier in this thread may
be a way to alleviate this, as the large bundle can be served
from a static file.
*2* An earlier "this symbolic ref points at that concrete ref"
attempt failed because of this and we only talk about HEAD.
*3* A new "fetch" protocol must avoid this "one side blindly gives a
large message as the first thing". I have been toying with the
idea of making the fetcher talk first, by declaring "I am
interested in your refs that match refs/heads/* or refs/tags/*,
and I have a superset of objects that are reachable from the
set of refs' values X you gave me earlier", where X is a small
token generated by hashing the output from "git ls-remote $there
refs/heads/* refs/tags/*". In the best case where the server
understands what X is and has cached pack data, it can then
send:
- differences in the refs that match the wildcards (e.g. "Back
then at X I did not have refs/heads/next but now I do and it
points at this commit. My refs/heads/master is now at that
commit. I no longer have refs/heads/pu. Everything else in
the refs/ hierarchy you are interested in is the same as state
X").
- The new name of the state Y (again, the hashed value of the
output from "git ls-remote $there refs/heads/* refs/tags/*")
to make sure the above differences can be verified at the
receiving end.
- the cached pack data that contains all necessary objects
between X and Y.
Note that the above would work if and only if we accept that it
is OK to send objects between the remote tracking branches the
fetcher has (i.e. the objects it last fetched from the server)
and the current tips of branches the server has, without
optimizing by taking into account that some commits in that set
may have already been obtained by the fetcher from a
third party.
If the server does not recognize state X (after all it is just a
SHA-1 hash value, so the server cannot recreate the set of refs
and their values from it unless it remembers), the exchange
would have to degenerate to the traditional transfer.
The server would want to recognize the result of hashing an
empty string, though. The fetcher is saying "I have nothing"
in that case.
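
A minimal Python sketch of the state token and the ref differences described in *3*. The serialization that gets hashed (one "<sha>", TAB, "<refname>" line per ref, sorted by ref name, imitating "git ls-remote" output) is an assumption, and state_token, diff_refs and apply_diff are hypothetical helper names, not part of Git. The object names are placeholders.

```python
import hashlib

def state_token(refs):
    """Hash a {refname: sha} mapping serialized like ls-remote output.

    Assumption: one "<sha>\t<refname>\n" line per ref, sorted by name.
    """
    lines = [f"{sha}\t{name}\n" for name, sha in sorted(refs.items())]
    return hashlib.sha1("".join(lines).encode()).hexdigest()

def diff_refs(old, new):
    """Return (updated, deleted) describing how to turn old into new."""
    updated = {n: s for n, s in new.items() if old.get(n) != s}
    deleted = sorted(n for n in old if n not in new)
    return updated, deleted

def apply_diff(old, updated, deleted):
    """Apply a ref difference on the receiving end."""
    refs = {n: s for n, s in old.items() if n not in deleted}
    refs.update(updated)
    return refs

# State X: what the fetcher saw last time (placeholder object names).
state_x = {
    "refs/heads/master": "a" * 40,
    "refs/heads/pu": "b" * 40,
    "refs/tags/v1.0": "c" * 40,
}
# State Y: master moved, next appeared, pu went away, the tag is unchanged.
state_y = {
    "refs/heads/master": "d" * 40,
    "refs/heads/next": "e" * 40,
    "refs/tags/v1.0": "c" * 40,
}

updated, deleted = diff_refs(state_x, state_y)
result = apply_diff(state_x, updated, deleted)

# The fetcher checks that the differences lead to the advertised state Y.
verified = state_token(result) == state_token(state_y)

# "I have nothing" is the hash of the empty string.
empty_token = state_token({})
```

With this serialization an empty set of refs hashes to the SHA-1 of the empty string, which is the value the server would want to recognize as "I have nothing".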
*4* The scheme in *3* can be extended to bring the fetcher up to date
step-wise. If the server's state was X when the fetcher last
contacted it, and since then the server received multiple pushes
and has two snapshots of states, Y and Z, then the exchange may
go like this:
fetcher: I am interested in refs/heads/* and refs/tags/* and I
         have your state X.
server:  Here is the incremental difference to the refs and the
         end result should hash to Y. Here comes the pack data
         to bring you up to date.
fetcher: (after receiving, unpacking and updating the
         remote-tracking refs) Thanks. Do you have more?
server:  Yes, here is the incremental difference to the refs and the
         end result should hash to Z. Here comes the pack data
         to bring you up to date.
fetcher: (after receiving, unpacking and updating the
         remote-tracking refs) Thanks. Do you have more?
server:  No, you are now fully up to date with me. Bye.
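
The exchange above can be simulated with a small Python sketch. Everything here is hypothetical (the Server class, next_step, fetch, and the hashed serialization are assumptions for illustration), and the pack data is left out: only the ref differences and the state tokens are modeled. An unknown token raises LookupError, standing in for the fallback to the traditional transfer.

```python
import hashlib

def state_token(refs):
    """Hash a {refname: sha} mapping (assumed ls-remote-like serialization)."""
    text = "".join(f"{sha}\t{name}\n" for name, sha in sorted(refs.items()))
    return hashlib.sha1(text.encode()).hexdigest()

class Server:
    def __init__(self, snapshots):
        # snapshots: ref states the server remembers, oldest first.
        self.snapshots = list(snapshots)
        self.index = {state_token(s): i for i, s in enumerate(self.snapshots)}

    def next_step(self, token):
        """Return (updated, deleted, new_token), or None when up to date."""
        if token not in self.index:
            raise LookupError("unknown state; use the traditional transfer")
        i = self.index[token]
        if i + 1 == len(self.snapshots):
            return None  # "No, you are now fully up to date with me."
        old, new = self.snapshots[i], self.snapshots[i + 1]
        updated = {n: s for n, s in new.items() if old.get(n) != s}
        deleted = [n for n in old if n not in new]
        return updated, deleted, state_token(new)

def fetch(server, refs):
    """Step through the server's snapshots; return (final_refs, steps)."""
    steps = 0
    while True:
        reply = server.next_step(state_token(refs))
        if reply is None:
            return refs, steps
        updated, deleted, expected = reply
        # Update the remote-tracking refs, then verify the advertised hash.
        refs = {n: s for n, s in refs.items() if n not in deleted}
        refs.update(updated)
        if state_token(refs) != expected:
            raise ValueError("refs do not hash to the advertised state")
        steps += 1

# Placeholder states X, Y and Z with made-up object names.
x = {"refs/heads/master": "1" * 40}
y = {"refs/heads/master": "2" * 40, "refs/heads/next": "3" * 40}
z = {"refs/heads/master": "4" * 40, "refs/heads/next": "3" * 40}

server = Server([x, y, z])
final_refs, steps = fetch(server, dict(x))
```

Starting from X, the fetcher takes two steps (X to Y, then Y to Z) and ends with the refs of state Z. Because each step depends only on the pair of snapshots and not on the individual fetcher, the matching pack data could be cached and reused.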