Now that we seem to be converging on an acceptable Git model, only one doubt remains: how the trigger that updates the sequential ID will work. I've been in contact with the GitHub folks, and what follows is in line with their suggestions...
Given the nature of our project's repository structure, triggers in each repository can't just update their own sequential ID (as Gerrit does), because we want a single sequence for the whole project, not one per component. But it's clear to me that we have to do something similar to Gerrit, as that has been proven to work on a larger infrastructure.

Adding an incremental "Change-ID" to the commit message should suffice, in the same way we have SVN revisions now, if we can guarantee that:

1. The ID will be unique across *all* projects.
2. Earlier pushes will get lower IDs than later ones.

Other things are not important:

3. We don't need the ID space to be complete (i.e. we can jump from 123 to 125 if some error happens).
4. We don't need an ID for every commit, but for every push. A multi-commit push is a single feature, and treating it as one change will help buildbots build the whole set together. Reverts should also be done in one go.

What's left for the near future:

5. We don't yet handle multi-repository patch-sets. One way to implement this is via manual Change-ID manipulation (explained below). Not hard, but not a priority.

Design decisions

This could be a pre/post-commit trigger on each repository that receives an ID from somewhere (TBD) and updates the commit message. When the umbrella project synchronises, it will already have the sequential number in. In this case, the umbrella project is not necessary for anything other than bisect, buildbots and releases. I personally believe that having the trigger in the umbrella project would be harder to implement and more error-prone.

The server has to have some kind of locking mechanism. Web services normally spawn dozens of "listeners", so multiple simultaneous pushes will all get a response from the web server; the lock has to sit further down, behind it. Therefore, the lock for the unique incrementing ID has to live elsewhere. The easiest thing I can think of is an SQL database with an auto-increment ID.
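On the hook side, the rewrite is plain string manipulation. A minimal Python sketch of what the hook could do once it has an ID back from the server; the function name and exact trailer layout are my own assumptions, not part of the proposal:

```python
import re


def set_change_id(message: str, change_id: int) -> str:
    """Insert or replace a Change-ID trailer in a commit message."""
    lines = message.rstrip("\n").split("\n")
    for i, line in enumerate(lines):
        if re.match(r"^Change-ID:", line):
            # An existing trailer (e.g. a temporary hash) gets replaced.
            lines[i] = f"Change-ID: {change_id}"
            break
    else:
        # No trailer yet: append one after a blank separator line.
        lines += ["", f"Change-ID: {change_id}"]
    return "\n".join(lines) + "\n"


print(set_change_id("Fix a crash in the parser", 300123))
```

The same helper covers both cases below: adding a fresh trailer on a plain push, and replacing a temporary Commit-ID-style hash with the server-assigned number.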
Example:

Initially:

  sql> CREATE TABLE LLVM_ID (
         id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
         repository VARCHAR(255) NOT NULL,
         hash VARCHAR(255) NOT NULL
       );
  sql> ALTER TABLE LLVM_ID AUTO_INCREMENT = 300000;

On every request:

  sql> INSERT INTO LLVM_ID (repository, hash) VALUES ("$repo_name", "$hash");
  sql> SELECT LAST_INSERT_ID();

and then print the "last insert id" back to the user in the body of the page, so the hook can update the Change-ID in the commit message. The repo/hash info is there mostly for logging, debugging and conflict-resolution purposes.

We must also limit the web server to only accept connections from GitHub's servers, to avoid abuse. Other repos on GitHub could still abuse it, and we can go further if that becomes a problem, but given point (3) above, we may fix that only if it actually happens.

This solution doesn't scale to multiple servers, nor does it help BCP (business continuity) planning, but given the size of our needs, that's not relevant.

Problems

If the server goes down, given point (3), we may not be able to reproduce locally the same sequence the server would have produced. That means SVN-based bisects and releases would not be possible during downtime, but Git bisect and everything else would still work. Furthermore, even if a local script can't reproduce exactly what the server would do, it can still linearise the history for bisect purposes, fixing the local problem. I can't see a situation in which we need the sequence for any other purpose. Upstream and downstream releases can easily wait a day or two in the unlucky event that the server goes down at the exact time a release is branched.

Migrations and backups also work well, and if we use a cloud server, we can easily take snapshots every week or so, migrate images across the world, etc. We don't need duplication, read-only scaling, multi-master, etc., since only the web service will be writing to and reading from the database. All in all, a "robust enough" solution for our needs.
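The whole allocation scheme can be sketched end-to-end in a few lines, with SQLite's AUTOINCREMENT standing in for MySQL's auto_increment (an assumption for the sake of a self-contained example; the real service would sit behind the web server and use a persistent database):

```python
import sqlite3

# In-memory database for illustration only.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE LLVM_ID (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        repository TEXT NOT NULL,
        hash       TEXT NOT NULL
    )
""")
# Seed the sequence so the first ID lands above the last SVN revision,
# mirroring "ALTER TABLE LLVM_ID AUTO_INCREMENT = 300000" above.
db.execute("INSERT INTO sqlite_sequence (name, seq) VALUES ('LLVM_ID', 299999)")


def next_id(repo: str, commit_hash: str) -> int:
    """Allocate the next sequential ID for one push. The repo/hash pair
    is stored purely for logging, debugging and conflict resolution."""
    cur = db.execute(
        "INSERT INTO LLVM_ID (repository, hash) VALUES (?, ?)",
        (repo, commit_hash),
    )
    db.commit()
    return cur.lastrowid  # equivalent of SELECT LAST_INSERT_ID()


first = next_id("llvm", "8a2c3fe")
second = next_id("clang", "9b4d102")
print(first, second)  # sequential, starting at 300000
```

Concurrent pushes are serialised by the database's own write lock, which is exactly the point: the uniqueness guarantee lives in the database, not in the web listeners.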
Bundle commits

Just FYI, here's a proposal that appeared in the "commit message format" round of emails a few months ago, which can work well for bundling commits together, but will need more complicated SQL handling.

The current proposal is to have one ID per push. That's easy with auto_increment. But if we want one ID spanning multiple pushes to different repositories, we'll need the same ID on two or more "repo/hash" pairs.

At the commit level, the developer adds a temporary hash, possibly generated by a local script in 'utils'. Example:

  Commit-ID: 68bd83f69b0609942a0c7dc409fd3428

This ID will have to be the same on both (say) the LLVM and Clang commits. The server script will then take that hash and generate an ID; if it receives two or more pushes with the same hash, it'll return the *same* ID, say 123456, in which case the Git hooks on all projects will update the commit message by replacing the original Commit-ID with:

  Commit-ID: 123456

To avoid hash clashes in the future, the server script can refuse existing hashes that are more than a few hours old and return an error, in which case the developer generates a new hash, updates all commit messages and re-pushes. If there is no Commit-ID, or if it's empty, we just insert a new empty record, get the auto-increment ID and return it. That way, empty Commit-IDs won't "match" any other.

To solve this on the server side, a few ways are possible:

A. We stop using primary_key auto_increment, handle the increment in the script and use SQL transactions. This would be feasible, but more complex and error-prone. I suggest we go down that route only if keeping the repo/hash information is really important.

B. We ditch the record of repo/hash and just re-use the ID, but record the original string so we can match it later. This keeps things simple and will work for our purposes, but we'll lose the ability to debug problems if they happen in the future.

C.
We improve the SQL design to have two tables:

  LLVM_ID:
    * ID:  int, primary key, auto_increment
    * Key: varchar, null

  LLVM_PUSH:
    * LLVM_ID: int, foreign key (LLVM_ID:ID)
    * Repo:    varchar, not null
    * Push:    varchar, not null

Every new push updates both tables and returns the ID. Pushes with the same Key re-use the ID, update only LLVM_PUSH, and return the same ID. This is slightly more complicated, and we'll need to write scripts to gather information (for logging and debugging), but it gives us both benefits (debugging + auto_increment) in one package. As a start, I'd recommend we take this route even before the script supports bundling, but it may be simple enough that we add support for it right from the beginning.

I vote for option C.

Deployment

I recommend we code this, set up a server, and leave it running for a while on our current mirrors *before* we do the move. A simple plan is to:

* Develop the server and hooks, and set them running without updating the commit message.
* Follow the logs and make sure everything is sane.
* Change the hook to start updating the commit message.
* Follow the commit messages and move some buildbots to track GitHub (SVN still master).
* When all bots are live tracking GitHub and all developers have moved, we flip.

Sounds good?

cheers,
--renato

_______________________________________________
lldb-dev mailing list
lldb-dev@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev