On Fri, Aug 17, 2012 at 8:07 AM, Ryosuke Niwa <[email protected]> wrote:
> On Thu, Aug 16, 2012 at 6:11 PM, Dirk Pranke <[email protected]> wrote:
>>
>> On Thu, Aug 16, 2012 at 5:41 PM, Ryosuke Niwa <[email protected]> wrote:
>> > On Thu, Aug 16, 2012 at 5:18 PM, Dirk Pranke <[email protected]> wrote:
>> >>
>> >> On Thu, Aug 16, 2012 at 3:50 PM, Stephen Chenney <[email protected]> wrote:
>> >> > I agree with the priorities above, at least. I also agree with the overall goal of making our implementation match our philosophy on testing.
>> >> >
>> >> > Ryosuke has raised a very valid point: it is not clear what a gardener should do with a test that has changed behavior. Personally, the area I deal with is very susceptible to the "single pixel differences matter" issue, and I would prefer gardeners to flag updates in some way so that I can look at it myself (or one of the very few other experts can look at it). Maybe I'm just a control freak.
>> >> >
>> >> > In the current situation, gardeners are not likely to roll out a patch short of build failures, obvious test or perf regressions, or security issues. For the bulk of cases where a test result has changed, the gardener will either add it to expectations or go ahead and update the result.
>> >> >
>> >> > The result is a mixture of incorrect expected results and files listed in TestExpectations that may or may not be correct. The way I deal with this right now is to maintain a snapshot of TestExpectations and occasionally go through and look for newly added results, and check if they are in fact correct. And every now and then go look at older entries to see if they have been updated. Or I get lucky and notice that a new expected result has been checked in that is incorrect (I can mostly catch that by following checkins in the code).
>> >> >
>> >> > The proposed change does not alter this fundamental dynamic. Depending on what policy gardeners adopt, they might now rename the result as failing and remove the expected, or maybe they'll just update the expected. I'm still in a situation where I don't know the exact status of any given result, and where I have no easy way of knowing whether a gardener has correctly updated an expected or failing image.
>> >>
>> >> So, if we had -passing/-failing, then people who knew what the results were supposed to be in a given directory (call them "owners") could rename the existing -expected files into one category or another, and then a gardener could add newly-failing tests as -expected; it would then be trivial for the owner to look for new -expected files in that directory.
>> >>
>> >> Right now, an owner would have to periodically scan the directory for files changed since the last time they scanned it, and would have no way of knowing whether an updated file was "correct" or not (perhaps you could filter by who committed the change, but it's certainly harder than ls *-expected).
>> >>
>> >> So, I would contend that my proposal would make it easier for us to have a process by which gardeners could punt on changes they're unsure about and for owners to subsequently review them. I think you could have a similar model if we just checked in *all* new results into -expected, but it would be harder to implement (perhaps not impossible, though; I haven't given it much thought yet).
>> >
>> > That seems like a much better model.
>>
>> To clarify: what does "That" refer to? My idea for gardeners to check in "-expected" and owners to move to "-passing/-failing", or the idea that we check in "new results into -expected" and have some other way of reviewing new baselines? (Or something else?)
>
> The former. In particular, the idea that only experts of a given area can check in -correct/-failing is promising, because there are many cases where pixel tests fail due to completely unrelated bugs and we want to consider them still correct. But we can implement this policy without the said proposal. All we need to do is have gardeners add test expectations and wait for experts to come in and decide whether we need a rebaseline, a bug fix, or a rollout.
A problem with just adding test expectations is that we lose visibility into whether the test is failing the same way or a different way. That is one of the motivations (if not the primary one) for my proposal ...

> On the other hand, the pixel test output that's correct to one expert may not be correct to another expert. For example, I might think that one editing test's output is correct because it shows that the feature we're testing is working properly, but Stephen might realize that the -expected.png contains an off-by-one Skia bug. So categorizing -correct.png and -failure.png may require multiple experts looking at the output, which may or may not be practical.

Perhaps. Obviously (a) there's a limit to what you can do here, and (b) a test that requires multiple experts to verify its correctness is, IMO, a bad test :).

> I think we should just check in whatever result we're currently seeing as -expected.png, because then at least we wouldn't have any ambiguity in the process. We just check in whatever we're currently seeing, file a bug if we see a problem with the new result, and possibly roll out the patch after talking with the author/reviewer.

This is basically saying we should just follow the "existing non-Chromium" process, right? This would seem to bring us back to step 1: it doesn't address the problem I identified with the "existing non-Chromium" process, namely that a non-expert can't tell by looking at the checkout which tests are believed to be passing and which are not. I don't think searching bugzilla (as it is currently used) is a workable alternative.

> The new result we check in may not be 100% right, but experts (e.g. me for editing and Stephen for Skia) can come in and look at recent changes to triage any new failures.
>
> In fact, it might be worthwhile for us to invest our time in improving tools to do this. For example, could we have a portal where I can see new rebaselines that happened in LayoutTests/editing and LayoutTests/platform/*/editing since the last time I visited? E.g. it could show a chronological timeline of baselines along with a possible cause (a list of changesets, maybe?) for each one.

We could build such a portal, sure. I would be interested to hear from others whether such a thing would be more or less useful than my proposal.

Of course, you could just set up a watchlist for new expectations today. Probably not quite as polished as we could get with a portal, but dirt simple ...

-- Dirk
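
[Editor's note: for concreteness, a rough sketch of the owner review loop described earlier in the thread, assuming the -passing/-failing renaming were in place. The directory and suffixes are illustrative; "editing" stands in for any directory with an owner, and none of the paths below come from the thread itself.]

    # Owners rename baselines they have verified to -passing/-failing, so any
    # file still carrying the plain -expected suffix is one a gardener checked
    # in and that is still awaiting expert review:
    ls LayoutTests/editing/*-expected.*

    # Or, since most test directories have subdirectories, recursively:
    find LayoutTests/editing -name '*-expected.*'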
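
[Editor's note: by contrast, without the renaming (and as a crude stand-in for the "portal" Ryosuke sketches), an owner has to scan for baselines that changed since the last review. A purely illustrative version using plain git on a checkout of the tree; the date and path patterns are placeholders, not anything prescribed in the thread.]

    # Chronological timeline of baseline changes under the editing directories,
    # with the commit (the possible cause) listed above each batch of files:
    git log --since="2012-08-01" --date=short --name-status --diff-filter=ACM \
        --pretty='format:%h %ad %an %s' -- \
        'LayoutTests/editing/*-expected.*' \
        'LayoutTests/platform/*/editing/*-expected.*'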

