Gate Pull Model
From Genunix
Here is a straw design for running the ON gate using a pull model rather than a push model.
Introduction
What is the pull model?
The current approach that ON uses with Mercurial is a "push" model: a developer makes some changes, merges with a clone of the gate, flattens the changeset graph using "hg recommit", then pushes to the gate. If another developer has added a changeset after the first developer merged from the clone, the push fails. The first developer must cycle back to the "merge" step and try again.
An alternative approach would be to use a "pull" model. With this approach, the developer makes some changes and posts them in a public location. The gate pulls the change that the developer posted and does any necessary merging/flattening. If problems are detected, the changes are dropped and the developer is notified.
Essentially, integrating a changegroup carries a certain amount of overhead work to keep the gate compliant with its policies (well-formed changeset comments, no branching, and no merge changesets). In the push model, this overhead is handled in a distributed fashion, by each individual developer. In the pull model, it is handled in a centralized fashion, by the gatekeeping team.
Why use the pull model?
The main reason to use the pull model is that it scales better.
With a push model, developers may run through the merge/flatten/push cycle multiple times before finally getting a successful push. Because the conflicts often are in unrelated areas of ON, this is not a good use of developer time. Furthermore, because there is no ordering to the pushes, starvation is possible.
With a pull model, the gate automation can handle the common case where the code changes don't actually conflict. This should eliminate, or at least greatly reduce, the need for multiple merge/flatten/push cycles. It may also be possible to introduce queueing heuristics to reduce the likelihood of starvation when a changeset does need to be resubmitted.
There is a chance that two changesets will conflict semantically and that the gate software will fail to detect the conflict. With the push model, the second developer can notice the conflict and update his changes to accommodate the first changeset. But for this to happen, the second developer must carefully scrutinize all changesets that he pulls as part of the final merge/flatten/push cycle. Informal surveys of ON developers suggest that this is not happening, so moving to a pull model isn't likely to introduce additional risk.
Another reason for introducing the pull model is that it's a better enabler for external development. That is, we want to incrementally build up infrastructure that lets external developers submit changes directly, without having to go through an internal sponsor. If we stay with the current push model, it will be harder to make incremental progress: the gate will have to move outside the SWAN. This means the RTI infrastructure must exist off-SWAN. The RTI infrastructure depends on the bug database, so that will have to exist off-SWAN too. That then forces us to deal with the problems associated with having multiple bug trackers. These issues will have to be solved eventually, but a pull model makes it easier for us to attack them serially, rather than all at once.
Basic Approach
Consider a setup like the following. The build repo is owned by the gate automation/staff.
  -----------        --------------
  |  gate   |        | build repo |
  | (golden |        --------------
  | source) |
  -----------

  ----------
  | queue  |
  |  mgr   |
  ----------

  ------------
  | dev repo |
  ------------
A simple putback would look something like this:
1. The developer submits an RTI (Request To Integrate).
2. The RTI advocate approves the RTI. As per current practice, this approval can be subject to some minor changes that the developer must make before submitting the changes.
3. The developer enters the changegroup into the queueing system.
4. The gate software pulls from the dev repo to the build repo. If a new head is introduced, the gate software does a hands-free merge and flattens the changeset graph, using either the rebase extension (preferred) or Cadmium.
5. The gate software pulls from the build repo into the gate.
6. The gate software removes the changegroup from the queue.
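Steps 4 through 6 can be sketched as a small loop over the queue. This is only a toy model under stated assumptions: the `Repo` class, `process_next`, and the changegroup names are hypothetical, repositories are modeled as plain lists of changeset IDs, and the merge/flatten step is reduced to appending new changesets after the gate's tip.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy stand-in for a Mercurial repository: an ordered list of changeset IDs."""
    changesets: list = field(default_factory=list)


def process_next(queue, gate, build, dev_repos):
    """Run steps 4-6 for the changegroup at the head of the queue."""
    name = queue[0]
    # Step 4: pull from the dev repo into the build repo. A real gate would
    # merge and flatten here if the pull introduced a new head; this toy
    # model just appends the new changesets after the current tip.
    new = [c for c in dev_repos[name].changesets if c not in build.changesets]
    build.changesets.extend(new)
    # Step 5: pull from the build repo into the gate.
    gate.changesets = list(build.changesets)
    # Step 6: remove the changegroup from the queue.
    queue.popleft()
    return name


# Two developers started from changeset "a" and each added one changeset.
gate = Repo(["a"])
build = Repo(["a"])
dev_repos = {"fix-1": Repo(["a", "b"]), "fix-2": Repo(["a", "c"])}
queue = deque(["fix-1", "fix-2"])
while queue:
    process_next(queue, gate, build, dev_repos)
print(gate.changesets)  # ['a', 'b', 'c']
```

Neither developer had to re-merge: the second changegroup is simply placed after the first, which is the common case the gate automation is meant to absorb.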
Queueing the changegroup
The user interface for step 3 (queueing the changegroup) is TBD. The proposal on the table is a Mercurial extension. The advantage of this approach is that it allows for various underlying submission mechanisms, without changing the user interface.
The mechanism for step 3 (queueing the changegroup) is TBD. One possible mechanism would be to submit a patch ("hg export --git") or a Mercurial bundle. We still consider this a "pull" approach because the gate pulls changegroups off the queue one at a time. The use of --git for patches is to ensure that renames and copies are handled correctly.
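As a rough illustration of the two candidate formats, the sketch below builds the corresponding `hg` command lines without running them. The function `submission_command`, its parameters, the file names, and the `gate-tip` revision name are all hypothetical; only the `hg export` and `hg bundle` commands themselves are real.

```python
def submission_command(kind, rev, output, base=None):
    """Return the hg command line (as an argument list) for one submission format."""
    if kind == "patch":
        # --git keeps rename and copy information in the exported patch.
        return ["hg", "export", "--git", "--output", output, "--rev", rev]
    if kind == "bundle":
        # --base names a changeset the gate is assumed to have already,
        # so the bundle contains only what lies between base and rev.
        return ["hg", "bundle", "--base", base, "--rev", rev, output]
    raise ValueError(f"unknown submission kind: {kind}")


patch_cmd = submission_command("patch", "tip", "fix.patch")
bundle_cmd = submission_command("bundle", "tip", "fix.hg", base="gate-tip")
print(" ".join(patch_cmd))
print(" ".join(bundle_cmd))
```

Either list could be handed to `subprocess.run` by a submission extension; which format the queue would actually accept is still open.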
Small changegroups could be submitted in their entirety using email. Large changegroups may need an alternate mechanism (e.g., submit a URL that the queueing system pulls from).
Note, though, that using email as a transport can add latency, which could cause user frustration (depending on how severe the latency is). We could start out using email and see if it's a problem in practice.
Errors and Special Cases
This section discusses complications to the simple flow described above.
Manual Backout
The gatekeeper needs to be able to back out one or more changesets. This could be done by letting the gatekeeper add changegroups to the front of the queue.
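One way to picture this is a double-ended queue in which ordinary submissions join at the back and gatekeeper backouts jump to the front. The `enqueue` function, its `gatekeeper` flag, and the changegroup names are hypothetical.

```python
from collections import deque


def enqueue(queue, changegroup, gatekeeper=False):
    """Add a changegroup to the queue; gatekeeper entries go to the front."""
    if gatekeeper:
        queue.appendleft(changegroup)
    else:
        queue.append(changegroup)


queue = deque(["rti-101", "rti-102"])
enqueue(queue, "rti-103")
enqueue(queue, "backout-of-rti-100", gatekeeper=True)
print(list(queue))
# ['backout-of-rti-100', 'rti-101', 'rti-102', 'rti-103']
```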
Automated Rejection
The changeset will be rejected in the following circumstances:
- changeset merge is necessary and the auto-merge fails
If this happens, the following steps are taken:
- send email notification to the changeset author and RTI submitter (if different)
- remove the changegroup from the queue
- flush build repo (reset to gate)
Email is sent before removing the changegroup in case the system crashes immediately after removing the changegroup.
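The ordering matters more than the details, so here is a minimal sketch of it. The `reject` function, the dictionary keys, and the `send_mail` callback are assumptions made for illustration; the queue and repositories are plain lists.

```python
def reject(changegroup, queue, build_repo, gate_repo, send_mail):
    """Reject a changegroup whose auto-merge failed, in the order given above."""
    # 1. Notify first. A set collapses author and RTI submitter when they match.
    for address in sorted({changegroup["author"], changegroup["rti_submitter"]}):
        send_mail(address, f"changegroup {changegroup['id']} rejected: auto-merge failed")
    # 2. Only then drop the changegroup from the queue.
    queue.remove(changegroup)
    # 3. Flush the build repo by resetting it to the gate's contents.
    build_repo[:] = gate_repo


sent = []
gate_repo = ["a", "b"]
build_repo = ["a", "b", "half-merged"]
cg = {"id": "rti-7", "author": "dev@example.com", "rti_submitter": "dev@example.com"}
queue = [cg]
reject(cg, queue, build_repo, gate_repo, lambda to, msg: sent.append((to, msg)))
print(len(sent), queue, build_repo)
```

If the system crashes between steps 1 and 2, the changegroup is still queued and will be rejected again after restart, so the worst case is a duplicate email rather than a silently lost submission.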
Locking the Gate
The design could use some fleshing out in this area.
A crude way to lock the gate would be to set a flag that tells the daemon to ignore non-gatekeeper changegroups. More sophisticated approaches are possible, such as having an override mechanism that only allows some changesets through (e.g., based on changeset author or RTI submitter).
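Both variants fit in one small predicate. The sketch below is hypothetical throughout: `GateLock`, `allowed_authors`, and `accepts` are names invented here, and a real daemon would need to decide where the flag and the override list are stored.

```python
from dataclasses import dataclass, field


@dataclass
class GateLock:
    """Lock flag plus an optional override list of authors."""
    locked: bool = False
    allowed_authors: set = field(default_factory=set)

    def accepts(self, author, is_gatekeeper=False):
        """Decide whether the daemon should process a changegroup."""
        if not self.locked or is_gatekeeper:
            return True
        # Override mechanism: let selected authors through while locked.
        return author in self.allowed_authors


lock = GateLock(locked=True, allowed_authors={"alice"})
print(lock.accepts("alice"))                    # True
print(lock.accepts("bob"))                      # False
print(lock.accepts("bob", is_gatekeeper=True))  # True
```

With an empty `allowed_authors` set this reduces to the crude flag.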
The current gate locking script is based on the HGUSER, which is set via the .ssh/authorized_keys file. That mechanism should continue to function for direct integrations (i.e., from gatekeepers), if we assume that queued integrations are coming through a separate channel.
Ultimately, though, this probably comes down to giving gatekeepers the ability to manipulate the integration queue, including privileges to remove, reorder, insert at head, and enable or disable processing.
Overlapping Trains
At the end of a release, there is frequently a window during which stopper fixes are directed to release N and everything else is directed to release N+1. The gatekeeper for release N+1 has the responsibility of merging in the fixes from release N. (Or at least that's how it worked at the start of Nevada development: the Nevada gatekeeper pulled in the stopper fixes that were going into the S10 gate.)
In this situation, the gatekeeper can have a merge workspace, similar to the backout workspace that is used for backouts. The gatekeeper can signal that a changeset is ready using the same mechanism as for backouts (see above).
It's not really clear what this will look like in the future. In the current update release world, the gatekeepers no longer manage the foldback. Now individual developers must prepare integration to both patch and feature gates as needed.
It's also possible that N+1 development could be throttled during the stopper builds of N.
Whatever solution is implemented needs to be sufficiently flexible to handle multiple release trains, but should not be optimized for them.
Cleanup After Gate System Crash
The new infrastructure must be able to automatically recover after a system crash. The necessary operations are:
- clear the build repository (reset it to be a copy of the gate)
- check the status of the gate's tip changeset. If it's still in the queue, do the appropriate cleanup processing (remove from queue, invoke post-putback checks, etc.)
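A minimal sketch of these two operations follows. It assumes, purely for illustration, that queue entries can be matched against the gate's tip by changeset ID; `recover` and `post_putback` are hypothetical names, and repositories are again modeled as lists.

```python
def recover(gate, build, queue, post_putback):
    """Restore a consistent state after a crash; return the tip that was cleaned up, if any."""
    # Reset the build repo to be a copy of the gate.
    build[:] = gate
    # If the gate's tip is still queued, the crash happened after the pull
    # into the gate but before the queue was updated, so finish that work now.
    tip = gate[-1] if gate else None
    if tip is not None and tip in queue:
        queue.remove(tip)
        post_putback(tip)
        return tip
    return None


checked = []
gate = ["a", "b"]
build = ["a", "b", "junk"]
queue = ["b", "c"]
finished = recover(gate, build, queue, checked.append)
print(finished, build, queue, checked)
```

Running `recover` a second time is harmless, since the tip is no longer in the queue.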
Open Issues
Hands-Free Merge
We'll want to think some more about what sorts of conflicts we're willing to mechanically merge. One possible approach is to fail only if the Mercurial auto-merge is unable to resolve the conflict. But auto-merge has bitten people in the past, and today we recommend that people disable it.
A more conservative approach would be to fail if the same file is modified in both branches. This is essentially the TeamWare model, which has worked well for ON in past releases. In the Mercurial world, this would essentially mean that a rebased patch would still apply cleanly at the tip of the repository. Any fuzz in the application would result in changegroup rejection.
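The file-level check amounts to a set intersection. In this sketch the function names and file lists are hypothetical; a real implementation would obtain the two lists from Mercurial, for the changesets added on each side since the common ancestor.

```python
def overlapping_files(gate_side_files, dev_side_files):
    """Return the files modified on both sides, sorted for stable reporting."""
    return sorted(set(gate_side_files) & set(dev_side_files))


def can_auto_merge(gate_side_files, dev_side_files):
    """Conservative rule: reject if any file was touched on both sides."""
    return not overlapping_files(gate_side_files, dev_side_files)


gate_side = ["usr/src/cmd/ls/ls.c", "usr/src/Makefile"]
dev_side = ["usr/src/cmd/cat/cat.c", "usr/src/Makefile"]
print(overlapping_files(gate_side, dev_side))  # ['usr/src/Makefile']
print(can_auto_merge(gate_side, dev_side))     # False
```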
A yet more conservative, but much more complicated, approach would be to set up some sort of rule system to describe what constitutes a conflict. For example, you might want the merge to fail if both branches modify ZFS code. (We would probably not want to do this unless experience shows that a simpler system doesn't work.)
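To make the idea concrete, the simplest conceivable rule system is a list of path prefixes, where a conflict is declared if both branches touch anything under the same prefix. Everything here is an assumption for illustration: the `rule_conflicts` function, the rule format, and the choice of prefix.

```python
def rule_conflicts(gate_side_files, dev_side_files, protected_prefixes):
    """Return the prefixes under which both branches modified at least one file."""
    hits = []
    for prefix in protected_prefixes:
        gate_touches = any(f.startswith(prefix) for f in gate_side_files)
        dev_touches = any(f.startswith(prefix) for f in dev_side_files)
        if gate_touches and dev_touches:
            hits.append(prefix)
    return hits


# Illustrative rule: treat the ZFS kernel sources as one conflict domain.
rules = ["usr/src/uts/common/fs/zfs/"]
gate_side = ["usr/src/uts/common/fs/zfs/arc.c"]
dev_side = ["usr/src/uts/common/fs/zfs/spa.c", "usr/src/cmd/ls/ls.c"]
print(rule_conflicts(gate_side, dev_side, rules))
# ['usr/src/uts/common/fs/zfs/']
```

Note that the two branches modify different files here, so the file-level check would pass while this rule rejects the merge.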
Options for the Future
This section describes some possible approaches that we might want to include in the future.
Integration with RTI System
We might want to use OpenRTI for some of the bookkeeping/process steps. This might make it easier for project managers to track the status of changes near the end of a release. The drawback of this approach is that it requires OpenRTI changes, and those changes might only apply to ON.
Incremental Build Prior to Acceptance
The queueing system could do an incremental build before accepting the change. This approach could help avoid breakage in the gate. On the other hand, this sort of breakage is not common today, and the build adds a delay in the main processing path. An obvious problem with this delay is that it can reduce the overall throughput of the system. It might also make for a less pleasant user experience, because merge-related rejections would be delayed while waiting for the changegroups that are earlier in the queue to build. This could be especially troublesome near a release cutoff, when there are likely to be a lot of changesets coming in.
An additional complication is that some changes might break an incremental build but work okay for a full (clobber) build.
State Transition Table
Here is a state transition table to show the flow of operations. It assumes that if a changeset is backed out, it remains marked as "put back".
(to be finished)
