Deferred constraint checking causing problems during test runs

Jim D. Sat, 09 Jul 2011 21:35:44 -0700

Howdy,

I'd like to share some findings I made recently with regards to deferred 
constraint checks during test runs. There's a bit of background here 
(https://code.djangoproject.com/ticket/3615) and 
here(https://code.djangoproject.com/ticket/11665), but this is the gist of 
what's going on now:


As you all know, when the Django test suite runs it executes each test 
within a DB transaction, which is opened during setUp and rolled back during 
tearDown. This works great for testing, but there is a problem, which Ramiro 
brought up the other day on IRC and I observed as well.

The problem is that by default in most of our backends, a transaction runs 
with constraint checks deferred. This means that foreign keys are *not* 
checked for integrity until a transaction is committed. But here's the rub: 
we never commit a transaction during tests, we roll it back instead. 
Therefore, none of the queries against the DB are being checked for 
integrity. (The exception is MySQL InnoDB, which I'll get to in a sec.)

As part of the work I was doing recently to disable and re-enable constraint 
checks during fixture loading, I realized that we were handling this sort of 
incorrectly across our backends.  Postgresql in fact does provide a facility 
for enabling constraint checks: SET CONSTRAINT CHECKS ALL IMMEDIATE. This 
statement causes Postgresql to behave as it normally would, when it was 
executing queries outside of a transaction.

So I tried this out on Postgresql, and what I got when I executed the test 
suite was:

Ran 3947 tests in 570.399s
FAILED (failures=53, errors=233, skipped=16, expected failures=2)

Interestingly, this looks familiar. Here's the result of the test suite on 
MySQL with Innodb:

Ran 3948 tests in 5171.412s
FAILED (failures=67, errors=213, skipped=21, expected failures=2)

Pretty darn close.

Anyhow, as I started to dig a bit deeper I realized something I considered 
to be quite important: MySQL, which has sort of been dismissed as being 
unable to handle constraint checks like a "real" database has in fact been 
running like a real database all along. The bugs that have accumulated 
during MySQL tests are in fact real bugs. When Postgresql is configured with 
constraint checks set to immediate -- and as a result behaves as it would in 
production -- these very same bugs are revealed.

The good news is there are really just a few minor problems triggering all 
these errors, and most of them are bugs in the test suite rather than 
production code. As a matter of fact, with a bit of hacking yesterday and 
today, I managed to get most of this under control. As of right now, when 
running the test suite in a fixed branch I have, I get:

Postgresql:

Ran 3993 tests in 946.555s
FAILED (failures=3, errors=6, skipped=17, expected failures=3)

MySQL:

Ran 3994 tests in 4857.777s
FAILED (failures=8, errors=7, skipped=25, expected failures=3)

In other words, we're in striking distance from getting all tests to pass on 
all backends, and I uncovered a handful of issues (again, mostly bad test 
code) which were hidden from sight because constraint checks weren't being 
run.

Right now I've got everything on an experimental git branch, which is still 
somewhat a work in progress: 
https://github.com/jsdalton/django/tree/experimental_check_constraints_fixed 
. As I said, I was mostly hacking away to kill the bugs and cut through the 
fog, so at this point I don't think what i have is ready to be proposed as a 
patch. Note that this is a branch off of the work I did on #3615 to fix 
fixture loading, which is sort of an important underlying piece that had to 
come together to be able to flush everything out.

In brief, here are the issues I uncovered, running with Postgresql.

* Content types get flushed from the DB after each test. However, they 
remain cached in the manager. When constraint checks are working, certain 
tests were failing because the a Content type wasn't being created in the DB 
in get_for_model(). Fix was to clear the contenttype cache at the start of 
every test run.

* Some deletion tests were erroring because constraint checks needed to be 
temporarily disabled during batch deletes.

* Big sneaky one that I pulled most of my hair out on until I finally found 
it...this bad boy: 
https://github.com/jsdalton/django/commit/b452248a0d9a025c7aa387091fa6d6b1f51113df
 
 . This is the reason why there are hundreds of errors on MySQL when the 
test suite is run on master right now. Basically the monkey patched routers 
were not being restored properly. As a result, the 'other' database was no 
longer being used. Tests were failing all over the place for every test that 
ran after this one. Fixing this single type caused about 200 tests to start 
passing again in MySQL, and Postgresql as well with constraint checks 
immediate.

The above are the issues I was able to patch up. Other fixes might be 
desired on the above, but again my goal was just to get tests passing again.

There were some additional issues that came up that I was not able to patch 
so easily. Here's what's behind the remaining issues that are not causing 
tests to fail.

* A bit hard to explain but this test: 
regressiontests.multiple_database.tests.RouterTestCase.test_m2m_managers and 
a few others like it don't work right. This is the code that's causing a 
problem:

        pro = Book.objects.using('other').create(pk=1, title="Pro Django",
                                                
 published=datetime.date(2008, 12, 16))
        marty = Person.objects.using('other').create(pk=1, name="Marty 
Alchin")
        pro.authors = [marty]

authors is a join. The problem is that the join database is in the default 
db, but the records created are in 'other'. The join record raises an 
integrity error because the records it's trying to relate to aren't in its 
db. I don't know if this is a bug or shouldn't work in the first place, but 
I didn't have an easy fix, since I'm not familiar with this part of the code 
base. This is a great example of an issue that has been obscured until now 
and probably requires a ticket and a fix of some kind from someone who is 
familiar with it.

* Something funky is going on with a few of the serializer_regress tests. 
That's a big mess of a test structure and I didn't have it in me to try to 
sort through what was going on there. Could be a real bug, probably just a 
test code problem.

* assertNumQueries is responsible for the rest of the failures. This is a 
bit of a chin scratcher for me at the moment. The failures are purely an 
accounting issue: the new constraint check disable/enable code runs two 
extra queries when it is executed, thus there are 2 more queries than 
expected. The problem here is that only a certain backends (right now, 
SQLite is missing them) support this, and as a result there are inconsistent 
query counts based on the backend you are using. I'm sure there's a way 
around it but it's not really my call. Either find a way to exclude certain 
queries from the count or handle it in test code. 

* There were a handful of other MySQL specific errors relating to collation 
mix ups and a few things that at some point I might try to look at.

Anyhow, that's pretty much a detailed list of what i uncovered. Where to go 
from here? These are my recommendations, though obviously a core dev or two 
need to step in here and make the real decisions.

* The fixture loading and constraint checking code might need a bit of 
cleanup and tweaking here and there but is for the most part ready. Getting 
this committed and in place is a good first step.

* A decision needs to be made about whether constraint checks should be set 
to immediate at the start of every test. MySQL InnoDB already does this, and 
Oracle and Postgresql are capable of doing it now by explicitly setting this 
in fixtureSetUp(). SQLite cannot (and will not, as far as I can tell) 
support this. In other words, without a new creative solution SQLite will 
continue to allow bad data to be entered during testing, as it does 
presently. MySQL MyISAM is in the same boat. I am going to come out very 
strongly in favor of this. It's required to make tests be more realistic, as 
Ramiro also discovered and pointed out separately from me. A big concern 
that he brought up over IRC is that it could cause some user tests to break, 
if these tests had previously been inserting bad data during testing.

* Assuming the above goes forward, that means we have test failures in the 
main test suites. Some of them do not have an immediate fixes. The fixes I 
put together in my hacking may not fly or may require independent discussion 
before implementing. The other issues I couldn't fix need work. My 
recommendation -- though I don't know if this is unacceptable -- would be to 
commit the change that enables immediate constraint checks and then file 
blocker-level tickets on the issues that are prevented from passing. It 
seems like people will need that code in place in order to be able to 
exercise the code properly and get to the root of those issues. I guess an 
alternative would be a new branch but I know we're on SVN and that can be 
unpleasant.

Anyhow, that's about it. I think there are probably a number of other 
tickets floating around out there (Ramiro pointed out a few to me) that 
relate to all this mess.

Again the good news is I didn't uncover anything horrific, I think it's 
mostly an issue of cleaning up under the covers a bit and moving forward 
with improved realism in DB testing. Also great news is that we are 
definitely in striking distance of all backends passing the entire test 
suite, for real. As someone whose company uses MySQL in production this 
would be pretty awesome for me, especially since it would help continue to 
address some other MySQL issues that have been hard to tackle because 
running those tests is such a nightmare.

Next week I have a ton of other work I need to attend to (the kind I get 
paid to do that is :) ) so i won't have a lot of time I can spend on this. 
But I hope I've done at least a passing job at uncovering the issue and 
presenting it clearly enough that you guys can all agree on a clear path 
forward. I know Ramiro has been pretty deeply involved in exploring this 
issue as well, so he might have some perspective or opinion of his own he'd 
like to share.

If i flubbed any of the above or anyone requires clarification please do let 
me know. Apologies for the lengthy post but hopefully it was all pertinent 
to the issue at hand.

Cheers,
Jim Dalton

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/django-developers/-/kA7LoSB12MIJ.
To post to this group, send email to django-developers@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.

Deferred constraint checking causing problems during test runs

Reply via email to