Re: NoSQL Support for the ORM

Waldemar Kornewald Wed, 07 Apr 2010 01:43:36 -0700

Hey Alex,

On Apr 7, 2:11 am, Alex Gaynor <alex.gay...@gmail.com> wrote:
> Non-relational database support for the Django ORM
> ==================================================
>
> Note:  I am withdrawing my proposal on template compilation.  Another
> student has expressed some interest in working on it, and in any event
> I am now more interested in working on this project.


It's great that you want to work on this project. Since I want to see
this feature in Django, I'm offering mentoring help with the NoSQL
part. You know Django's ORM better than me, so I probably can't really
help you there, but I can help to make sure that your modifications
will work well on NoSQL DBs. Just in case this is necessary, I'll
apply as a GSoC mentor before it's too late (if I remember correctly,
in 2007 new mentors could still be added even at this late stage).

> Method
> ~~~~~~
>
> The ORM architecture currently has a ``QuerySet`` which is backend
> agnostic, a ``Query`` which is SQL specific, and a ``SQLCompiler`` which
> is backend specific (i.e. Oracle vs. MySQL vs. generic).  The plan is to
> change ``Query`` to be backend agnostic by delaying the creation of
> structures that are SQL specific, specifically join/alias data.  Instead
> of structures like ``self.where``, ``self.join_aliases``, or
> ``self.select`` all working in terms of joins and table aliases the
> composition of a query would be stored in terms of a tree containing the
> "raw" filters, as passed to the filter calls, with things like
> ``Field.get_prep_value`` called appropriately.  The ``SQLCompiler`` will
> be responsible for computing the joins for all of these data-structures.

Could you please elaborate on the data structures? In the end, non-
relational backends shouldn't have to reproduce large parts of the
SQLQuery code just to emulate a JOIN. When we tried to do a similar
refactoring we quickly faced the problem that we needed something
similar to setup_joins() and other SQLQuery features. We'd also have
to create code for grouping filters into individual queries on tables.
The Query class should take care of as much of the common stuff as
possible, so nonrel backends can potentially emulate every single SQL
feature (e.g., via MapReduce or whatever) with the least effort.
Otherwise this refactoring would actually have more disadvantages than
our current SQLCompiler-based approach in Django-nonrel (as ridiculous
as that sounds).
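
To illustrate what "grouping filters into individual queries on tables"
could mean, here is a minimal sketch. It is not Django code: the function
name group_filters, the LOOKUP_TYPES subset, and the output shape are all
assumptions made for illustration only.

```python
from collections import defaultdict

# Illustrative subset of lookup types; the real ORM knows many more.
LOOKUP_TYPES = {"exact", "startswith", "gt", "lt", "in"}


def group_filters(raw_filters):
    """Group raw filter lookups by the relation path they touch.

    Hypothetical helper: "author__name__startswith" is split into the
    relation "author", the field "name" and the lookup "startswith".
    The empty relation "" stands for the base model itself.
    """
    groups = defaultdict(dict)
    for lookup, value in raw_filters.items():
        parts = lookup.split("__")
        # Strip a trailing lookup type if present, otherwise assume "exact".
        if len(parts) > 1 and parts[-1] in LOOKUP_TYPES:
            lookup_type = parts.pop()
        else:
            lookup_type = "exact"
        field = parts.pop()
        relation = "__".join(parts)
        groups[relation][(field, lookup_type)] = value
    return dict(groups)


grouped = group_filters({
    "title": "NoSQL",
    "author__name__startswith": "W",
    "author__age__gt": 30,
})
```

Each key of the result corresponds to one table that a nonrel backend
would have to query separately; shared code in Query could produce such
a grouping once for all backends.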

However, it's important that all of the emulated features are handled
not by the backend, but by a reusable code layer which sits on top of
the nonrel backends. It would be wasteful to let every backend
developer write his own JOIN emulation and denormalization and
aggregate code, etc. The refactored ORM should at least still allow
for writing some kind of "proxy" backend that sits on top of the
actual nonrel backend and takes care of SQL features emulation. I'm
not sure if it's a good idea to integrate the emulation into Django
itself because then progress will be slowed down.
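
A rough sketch of such a "proxy" layer, under stated assumptions: the
classes DictBackend and JoinEmulatingProxy and the methods fetch() and
fetch_related() are hypothetical and do not exist in Django or
Django-nonrel. The stand-in backend can only filter one table by
equality; the proxy adds a naive two-step JOIN emulation on top of it.

```python
class DictBackend:
    """Stand-in for a nonrel backend that can only filter a single table."""

    def __init__(self, tables):
        self.tables = tables  # {table_name: [row_dict, ...]}

    def fetch(self, table, **equals):
        return [row for row in self.tables[table]
                if all(row.get(k) == v for k, v in equals.items())]


class JoinEmulatingProxy:
    """Wraps any object exposing fetch() and emulates a simple JOIN."""

    def __init__(self, backend):
        self.backend = backend

    def fetch(self, table, **equals):
        # Plain queries are passed straight through to the real backend.
        return self.backend.fetch(table, **equals)

    def fetch_related(self, table, fk_field, related_table, **related_equals):
        # Step 1: resolve the related side to a set of primary keys.
        related = self.backend.fetch(related_table, **related_equals)
        keys = {row["id"] for row in related}
        # Step 2: one query per key on the base table (no IN filter assumed).
        results = []
        for key in sorted(keys):
            results.extend(self.backend.fetch(table, **{fk_field: key}))
        return results


backend = DictBackend({
    "author": [{"id": 1, "name": "W"}, {"id": 2, "name": "A"}],
    "post": [{"id": 10, "author_id": 1},
             {"id": 11, "author_id": 2},
             {"id": 12, "author_id": 1}],
})
proxy = JoinEmulatingProxy(backend)
posts = proxy.fetch_related("post", "author_id", "author", name="W")
```

The point of the design is that the emulation lives entirely in the
proxy, so it can be reused with any backend that implements fetch().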

Ideally, we should provide a simplified API for nonrel backends,
similar to the one that we recently published for Django-nonrel, so a
backend could be written in two days instead of two weeks. We can port
our work over to the refactored ORM, so you don't have to deal with
this (except if it should be officially integrated into Django).

In addition to these changes you'll also need to take care of a few
other things:

Many NoSQL DBs provide a simple "upsert"-like behavior where on save()
they either create a new entity if none exists with that primary key
or update the existing entity if one exists. However, on save() Django
first checks if an entity exists. This would be inefficient and
unnecessary, so the backend should be able to turn that behavior off.
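
As a minimal sketch of how a backend might turn that check off: the
supports_upsert flag, the backend classes, and the save() function below
are hypothetical names invented for this example, not existing Django
API. The reads counter only exists to show how many existence checks
were issued.

```python
class Backend:
    """Toy in-memory backend that needs an existence check before saving."""

    supports_upsert = False  # hypothetical feature flag

    def __init__(self):
        self.rows = {}
        self.reads = 0  # number of existence checks performed

    def exists(self, pk):
        self.reads += 1
        return pk in self.rows

    def insert(self, pk, data):
        self.rows[pk] = dict(data)

    def update(self, pk, data):
        self.rows[pk].update(data)

    def put(self, pk, data):
        # Upsert: create or overwrite in one operation.
        self.rows[pk] = dict(data)


class UpsertBackend(Backend):
    supports_upsert = True


def save(backend, pk, data):
    """Save an entity, skipping the existence check when the backend allows."""
    if backend.supports_upsert:
        backend.put(pk, data)
        return
    if backend.exists(pk):
        backend.update(pk, data)
    else:
        backend.insert(pk, data)


sql_like = Backend()
nosql_like = UpsertBackend()
for b in (sql_like, nosql_like):
    save(b, 1, {"title": "first"})
    save(b, 1, {"title": "second"})
```

After both saves the stored data is identical, but only the backend
without the flag has paid for the two extra reads.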

On delete() Django also deletes related objects. This can be a costly
operation, especially if you have a large number of entities. Also,
the queries that collect the related entities can conflict with
transaction support at least on App Engine and it might also be very
inefficient on HBase. IOW, it's not sufficient to let the user handle
the deletion for large datasets. So, non-relational (and maybe also
relational) DBs should be able to defer and split up the deletion
process into background tasks - which would also simplify the
developer's job because he doesn't have to take care of manually
writing background tasks for large datasets, so it's a good addition
in general.
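
A minimal sketch of deferring and splitting up a deletion, assuming an
in-process queue as a stand-in for a real background task system. The
names plan_deletion and run_next_task are made up for this example, and
the "store" is just a dict.

```python
from collections import deque


def plan_deletion(related_keys, batch_size):
    """Split the keys to delete into batches, one per background task."""
    keys = list(related_keys)
    return deque(keys[i:i + batch_size]
                 for i in range(0, len(keys), batch_size))


def run_next_task(queue, store):
    """Run one deferred task: delete a single batch from the store.

    Returns False when no work is left, True otherwise.
    """
    if not queue:
        return False
    for key in queue.popleft():
        store.pop(key, None)  # tolerate keys that are already gone
    return True


store = {n: "comment %d" % n for n in range(7)}
queue = plan_deletion(store.keys(), batch_size=3)
batches = len(queue)

tasks_run = 0
while run_next_task(queue, store):
    tasks_run += 1
```

Each call to run_next_task() touches at most batch_size entities, so a
single task stays small even when the related set is large.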

I'm not sure how to handle multi-table inheritance. It could be done
with JOIN emulation, but this would be very inefficient.
Denormalization is IMHO not the answer to this problem, either. Should
Django simply fail to execute such a query on those backends or should
the user make sure that he doesn't use multi-table inheritance
unnecessarily in his code?

Bye,
Waldemar Kornewald

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-develop...@googlegroups.com.
To unsubscribe from this group, send email to 
django-developers+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en.
