I spent the better part of yesterday mucking around in the dregs of Django's 
cache middleware and related modules, and in doing so I've come to the 
conclusion that, due to an accumulation of hindrances and minor bugs, the 
per-site and per-view caching mechanisms are effectively broken for many 
fairly typical usage patterns.

Let me demonstrate by fictional example, with what I would consider to be a 
pretty typical configuration and use case for the per-site cache:

Let's pretend I'm developing a blog powered by Django. I'm using memcached, 
and I would like to cache pages on that blog for anonymous users, who are 
going to make up the vast majority of my site's visitors. Ideally, I will 
serve the exact same cached version of a blog post to every single anonymous 
visitor to my site, which will help keep server load under control, 
particularly when I get slashdotted/reddited/what-have-you.

Like any blog, a typical page view features the content primarily (e.g. a 
blog post). It also has some "auth" stuff at the top right, which will say 
"Log in / Register" for non logged in users but show a username and welcome 
message for logged in users. Each blog post also has an empty comment form 
at the bottom of it where users can leave comments on the post. Like 99% of 
the websites out there, I will be using Google Analytics to track my 
visitors etc.

Pretty straightforward, right?

Let me count the ways that Django's cache middleware will muck up my goals 
in the above scenario.

First, I'm going to try to use the per-site cache. Here's what's going to go 
wrong for me:

* It's going to be virtually impossible for me to avoid my cache varying by 
cookie and thus by visitor. Because my templates check whether the current 
user is logged in, I'm touching the session, which sets the Vary: Cookie 
header on the response. That means if there is any difference in the cookies 
users send with their requests, each user gets a separate cached page. The 
session cookie (named by SESSION_COOKIE_NAME) carries a value that is unique 
for every visitor, so no two visitors ever share a cache entry.

* Even if I avoid touching the request user somehow, the CSRF middleware 
presents the same issue. Because I have a comment form on every page, I have 
a unique CSRF token for each visitor. Thankfully Django doesn't let me 
completely shoot myself in the foot by caching the page with one user's 
token and serving it to everybody else. At least it helpfully sets a CSRF 
token cookie and varies on it to prevent this. However, that cookie is 
different for every unique user. That triggers the same problem as 
above. I again cannot avoid caching a unique page for each unique visitor.

* Unfortunately, my troubles are not over, even if I resign myself to having 
a cache that varies per visitor. You see, Google Analytics actually sets a 
handful of other cookies with each page request. And guess what? The values 
for those cookies are unique *for each request*. This means... I'm actually 
not caching at all. Cookies are unique for each and every page request 
thanks to Google Analytics. My per-site cache configuration is totally and 
completely inoperable, all because I'm using a tracking service that pretty 
much *everybody* uses.
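
To make the mechanics of the three points above concrete, here is a rough 
sketch of how a cache key that honours the Vary header ends up different for 
every visitor. This is a simplified model I wrote for illustration, not 
Django's actual key-generation code; the function name, the key layout and 
the cookie values are all made up.

```python
import hashlib


def cache_key(url, request_headers, vary_on):
    """Hypothetical key builder: URL plus the value of every header named in Vary."""
    digest = hashlib.md5()
    for name in sorted(vary_on):
        # A missing header contributes an empty string to the key.
        digest.update(request_headers.get(name, "").encode("utf-8"))
    url_hash = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "page.%s.%s" % (url_hash, digest.hexdigest())


url = "/blog/hello-world/"

# Two anonymous visitors whose only difference is their analytics cookie
# values (the values are invented for this example).
visitor_a = {"Cookie": "__utma=111.222; __utmz=333"}
visitor_b = {"Cookie": "__utma=444.555; __utmz=666"}

# With no Vary header, both visitors map to the same cache entry.
shared = cache_key(url, visitor_a, []) == cache_key(url, visitor_b, [])

# Once the response carries "Vary: Cookie", the entries diverge.
split = cache_key(url, visitor_a, ["Cookie"]) != cache_key(url, visitor_b, ["Cookie"])
```

In this model any cookie difference at all, whether it comes from the 
session, the CSRF token or the analytics script, is enough to split the 
cache.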

Since that didn't work, I wonder if it'll work if I do per-view caching? It 
shouldn't work at all, should it, since it's not like any of the factors I 
outlined above are different if I'm using the @cache_page decorator to do my 
caching vs the per-site cache.

Well, the sad news is caching does "work" when I use cache_page, and that's 
not a good thing:

* @cache_page caches the direct output of the view/render function. It skips 
over the middleware that might have very good reason to introduce vary 
headers and doesn't introduce any vary headers of its own. So now, with 
this applied, I *am* serving a cached version of this page even though I 
absolutely should not be. Some poor user's token is now being sent to 
everybody. My only chance of redemption is if I happen to have read the docs 
and discovered that this incantation is required to prevent having 
cache_page improperly cache the page:

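   from django.views.decorators.cache import cache_page
   from django.views.decorators.csrf import csrf_protect

   @cache_page(60 * 15)
   @csrf_protect
   def my_view(request):
       # ...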
       
Of course, the above just puts me right back where I started at the per-site 
level. There was never any chance of making cache_page work any different 
from the per-site cache, but it certainly proved to be a temptation if I'm a 
hurried developer, frustrated by why my per site cache wasn't working and 
"thankful" for the fact that I could get the cache to start "working" with 
the cache_page decorator.

Hopefully the above example really makes it clear to you guys how all of the 
seemingly minor bugs and imperfections really do add up to a broken 
situation for someone coming to this with a pretty standard set of 
expectations and requirements.

Anyhow, the good news is that a good portion of what I have written about 
already has open tickets which in some cases are close to being ready for 
checkin:

* Google Analytics is a known issue with a proposed patch: 
https://code.djangoproject.com/ticket/9249

* CSRF is known to not play nicely with caching, though it's at least documented: 
https://docs.djangoproject.com/en/dev/ref/contrib/csrf/#caching

* The actual underlying cache_page issue is ticketed: 
https://code.djangoproject.com/ticket/15855

Still, I can't help but feel that, to an extent, these are band aids. There 
is still an exceptionally narrow set of circumstances that would allow me to 
serve a single cached page to all anonymous visitors to my site: namely, I 
can't touch request.user and I can't use CSRF. Quite honestly, I'm not even 
sure you should be using a framework like Django if most of your pages don't 
have logic pertaining to a logged in vs. anonymous user, or have some kind 
of form on them which requires CSRF protection. Even if all of the above 
tickets got fixed, it seems like we're still in kind of a bad place.

I don't know that I have good solutions to any of this (though I am very 
much willing to contribute work toward such a solution). I do have a few 
ideas/questions to pose to conclude with here:

* Is it reasonable to set as a goal that Django should attempt to support 
per site caching for the scenario I described above? I mean, am I wrong in 
thinking that in an ideal world, it should be possible to serve the same 
cached page to all anonymous users most of the time, even if there are forms 
or anonymous vs. logged in user logic on it?

* Is an embedded token the only form in which CSRF protection can come from? 
Why can't the token be set as a cookie and the value of that cookie serve as 
the CSRF verification (without varying on it in the cache, obviously)? Or 
perhaps there's a way to dynamically generate a CSRF token via ajax after 
the page load? I'm certain someone much smarter and more knowledgeable than I 
will point out why these are dreadfully horrible, unworkable ideas, but the 
embedded token is sort of a deal breaker for effective caching, and these 
days many, many sites have forms on almost every page (e.g. a hidden login 
form that's revealed when you press login, comment form, etc.).

* Why does the cookie have to vary if the request user object is touched on 
the template even though it's not authenticated? If the sessionid isn't even 
in the request cookie (i.e. for a first time visitor), then it doesn't 
require a real "check" of the session. And correct me if I'm wrong, but 
doesn't the session key get cycled when a user logs in anyway? In other 
words, a session key that represents an anonymous user will *always* 
represent an anonymous user. Perhaps there's a way to keep track of those 
anonymous session ids so the same anonymous cached view can be served to 
them all. What a waste to generate the entire page dynamically for each 
individual anonymous user all because of one simple key lookup. Again, this 
is probably a hopelessly naive idea with a sensible, obvious rebuttal, but 
perhaps there is some merit in coming up with a creative solution?
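
To make the last question a little more concrete, here is a minimal sketch 
of what I have in mind, written as plain Python rather than against Django's 
real middleware. Everything in it is an assumption for illustration: the 
anonymous_aware_key function, the authenticated_sessions set standing in for 
a session lookup, and the key format. The only real name is "sessionid", 
which is the default value of SESSION_COOKIE_NAME.

```python
import hashlib


def anonymous_aware_key(url, cookies, authenticated_sessions,
                        session_cookie_name="sessionid"):
    """Hypothetical key builder: one shared bucket for all anonymous visitors."""
    session_id = cookies.get(session_cookie_name)
    if session_id is not None and session_id in authenticated_sessions:
        # Logged-in users still get a cache entry of their own.
        bucket = "user:" + session_id
    else:
        # No session cookie, or a session that never logged in.
        bucket = "anonymous"
    raw = "%s|%s" % (url, bucket)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


url = "/blog/hello-world/"
logged_in = {"abc123"}  # stand-in for "sessions with an authenticated user"

first_time_visitor = {}
returning_anonymous = {"sessionid": "zzz999", "__utma": "111.222"}
member = {"sessionid": "abc123"}

anon_shared = (anonymous_aware_key(url, first_time_visitor, logged_in)
               == anonymous_aware_key(url, returning_anonymous, logged_in))
member_separate = (anonymous_aware_key(url, member, logged_in)
                   != anonymous_aware_key(url, first_time_visitor, logged_in))
```

This ignores the CSRF token entirely, so it only addresses the session half 
of the problem, and it assumes that checking whether a session is 
authenticated is cheap compared with rendering the page.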

I have to guess some of you have already spent some brain cycles thinking 
about the above issues I've raised, in whole or in part, and I apologize if 
I'm re-hashing an old debate or am so totally off-base that I've wasted your 
time if you made it this far. My intent, again, is not to complain, but to 
see if others agree that the current state of the per-site cache is not so 
great, and if so, to elicit some ideas on how to best address it. It also 
seems to me that there is more than just one problem standing in the way of 
things, so "success" might require something of a coordinated effort.

Please do let me know if my concerns make sense, if my goal is a legitimate 
one, if I'm wrong in part or in whole, etc. etc. As I said earlier, if 
there's a path forward on any of the above I am happy to contribute to the 
effort.

Thanks for listening.
