Adding generated common table expressions

2017-03-17 Thread Ashley Waite


Hey all,


I'd like to suggest adding Common Table Expression (CTE) query generation 
as a feature to Django.

I've been working on a project that required manipulation of many records 
at once, and as with many ORMs found that this wasn't an ideal use-case in 
Django. As the rest of our code base and related projects are in Django, 
there was a strong preference to find a way to do it and keep to the same 
model-is-the-truth design.

I first did this by writing some hackish functions using raw querysets and 
generating my own CTE based queries, but it lacked ideal flexibility and 
maintainability. So I've now written some modifications into my Django to 
do this in a more Django-esque way and think that this functionality would 
be beneficial within the project itself, but am unsure exactly where to 
start the conversation about that.


*Why generate CTE based queries from querysets?*

By allowing querysets to be attached to each other, and setting appropriate 
WHERE clauses, arbitrary and nested SQL queries can be generated. Where the 
results of the queries are only necessary for the execution of following 
queries this saves a very substantial amount of time and database work. 
Once these features exist, other functionality can also transparently use 
these to generate more efficient queries (such as large IN clauses).

This allows several powerful use cases I think Django would benefit from:


*Large 'IN' clauses*, can be implemented as CTEs reducing expensive lookups 
to a single CTE INNER JOIN. For sets of thousands to match from tables of 
millions of records this can be a very substantial gain.


*Composite 'IN' conditions,* where multiple fields must match and you're 
matching against a large set of condition rows. In my usage this was "where 
the md5/sha hashes match one of the million md5/sha tuples in my match 
set". This is simply a CTE JOIN with two clauses in the WHERE.


*Nested data creation*, where the parent doesn't yet exist. Django doesn't 
currently do this as the primary keys are needed, and this makes normalised 
data structures unappealing. Using INSERTs as CTEs that supply those keys 
to following statements means that entire nested data structures of new 
information can be recreated in the database at once, efficiently and 
atomically.


*Non-uniform UPDATE*s, such that a modified set of objects can all be 
updated with different data at the same time by utilising a CTE values 
statement JOINed to the UPDATE statement. As there's currently no way to do 
this kind of bulk update the alternative is to update each instance 
individually, and this doesn't scale well.

These could also be used with aggregations and other calculated fields to 
create complex queries that aren't possible at the moment.


*What my PoC looks like*

With another mildly hackish PoC that creates a VALUEs set from a 
dict/namedtuple which can be used to provide large input data, my present 
modified version syntax looks a bit like this (not perfect queries):

class Hashes(models.Model):
md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 hash 
(base64)")

# Mock QuerySet of values
q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
# A big IN query
q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))

# Matched existing values with composite 'IN' (where md5 and sha2 match, or md5 
matches and existing record lacks sha2)
q_ex = 
Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160"))
 | Q(sha160=None))

# Create new records that don't exist
q_cr = Hashes.objects.attach(q_mo, 
q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", 
"sha2").as_insert()

Returning the newly created records.

SQL can be generated that looks something like this:

WITH cte_1_0 (md5, sha2) AS (
VALUES ('2d30243bfe9d06673765c432c2bd'::uuid, 
'fsA8okuCuq9KybxqcAzNdjlIyAx1QJjTPdf1ZFK/hDI='::varchar(44)),
('f20a46e4e60338697948a0917423', 
'6bVZgpYZtit1E32BlANWXoKnFFFDNierDSIi0SraND4=')),
cte_1 AS (
SELECT "hashes"."id", "hashes"."md5", "hashes"."sha2" 
FROM "hashes" , "cte_1_0" 
WHERE ("hashes"."md5" = (cte_1_0.md5) AND ("hashes"."sha2" = 
(cte_1_0.sha2) OR "hashes"."sha2" IS NULL) )) 
SELECT "hashes"."md5" 
FROM "hashes" , "cte_1_0" , "cte_1" 
WHERE ("hashes"."md5" = (cte_1_0.md5) AND NOT ("hashes"."md5" = (cte_1.md5)))

That is:

   - A qs.as_insert() and qs.as_update() on queryset to create *lazy* 
   insert and update queries.
   - A qs.attach() that allows querysets to be attached to other querysets, 
   and will generate them as CTE statements.
   - A qs.ref() that returns an expression that when the query is compiled 
   will be a field reference on the CTE that represents that queryset.
   - Additional compilers on the QuerySet subclasses that these return (so 
   no changes to base compilers meaning no fu

Re: SQLAlchemy integration - More details

2017-03-17 Thread stamatis . iliass
Hello David.
I will check it for sure, thank you!

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/39627708-a313-41c6-92a3-880561f1a92e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Add ability to choose a different secret for PasswordResetToken

2017-03-17 Thread jann.haber via Django developers (Contributions to Django itself)
Hi everybody,

I have two sites in my project. One is for internal use to add members of 
our club to our database, one is a self-service page for our members.
On registration of a member using our internal site, I want to send a 
password reset link to the member for easy registration to our self-service 
page.
Since the two sites run on different servers and different SECRET_KEYs, I 
need to specify a different secret for the token generated with our 
internal site to work on the self-service page.
I managed to do so by overwriting the method _make_token_with_timestamp of 
PasswordResetToken, however, I think this is quite an ugly solution.

How would you think about the secret being an attribute of the 
PasswordResetToken class, which is then passed to salted_hmac? The default 
could be settings.SECRET_KEY for backwards compatibility.

I have so far not contributed to django, but I would like to start and I 
believe, this small change might be a good start. If you agree, I could 
open a ticket in trac.

Best Regards,
Jann

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/ca9e1cde-002d-491b-899a-02bfd50fbed6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Introduction for GSoC 2017

2017-03-17 Thread Tim Graham
Welcome! It's a bit late to be starting on a successful proposal, 
especially if you haven't contributed to Django before, but good luck. I'm 
not sure if you were looking for anything specific in reply to your intro?

On Monday, March 13, 2017 at 9:52:34 AM UTC-4, Gunpreet Ahuja wrote:
>
> Greetings,
> I am a Masters in Engineering student at Concordia University, Canada. I 
> wish to participate in GSoC 2017. This is my first time and I want to 
> contribute in this project.
> I have worked on various projects which include three android applications 
> named a Textile app, Dawn-to-Dusk (a weather application) and Styllax (a 
> wallpaper application). As far as python/django is concerned, I have 
> contributed to the project LibreHatti during our undergraduate training. 
> The project is entirely based on django framework for python. Besides 
> this, I have also worked a data mining project named SocioDAvIS that 
> utilizes Facebook data set and Twitter streaming API data to predict 
> different trends using R language. The database systems utilized in these 
> projects being MySQL and SQLite3.
> In addition to this, I have participated in Google Student Ambassador program 
> (being listed among the top 400 selected students) and have been a core 
> team member of Google Student Club, co-convener at Linux User Group, 
> Ludhiana and a member of Android Student Club at my college during my 
> undergraduate course. I have worked on Git, Vim, Latex, Doxygen and have 
> had a hand on Windows, Linux and Mac Operating Systems. During my 
> undergraduate studies only, I won a C++/JAVA event that involved building 
> an algorithm to solve a given problem.
> As for now, I have working on a JAVA project for metrics analysis of source 
> codes. I am currently on the way to polish my python/django skills 
> further.
>
> Regards,
> Gunpreet Ahuja
> gunpreetahuja.wordpress.com
>

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/4044c631-d6a2-441c-907e-2e62d1efc232%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Adding generated common table expressions

2017-03-17 Thread Tim Graham
Hi, I don't know anything about CTE, but did you see this third-party 
package? https://github.com/petrounias/django-cte-trees -- It seems to be 
PostgreSQL only.

I was going to write, "Considering that not all databases support CTE 
(MySQL doesn't), a third-party app might be the way to go rather than 
having it built-in to Django." -- however, then I noticed that supported is 
added in MySQL 8 [0], so maybe any database differences could be abstracted 
away with an appropriate API.

I'll leave it to others to comment on the technical details of your 
proposal. Meanwhile, if you have a chance to contribute smaller patches to 
Django, I think it's helpful to start with smaller patches before tackling 
something larger like this.

[0] 
http://mysqlserverteam.com/mysql-8-0-labs-recursive-common-table-expressions-in-mysql-ctes/

On Friday, March 17, 2017 at 7:28:17 AM UTC-4, Ashley Waite wrote:
>
> Hey all,
>
>
> I'd like to suggest adding Common Table Expression (CTE) query generation 
> as a feature to Django.
>
> I've been working on a project that required manipulation of many records 
> at once, and as with many ORMs found that this wasn't an ideal use-case in 
> Django. As the rest of our code base and related projects are in Django, 
> there was a strong preference to find a way to do it and keep to the same 
> model-is-the-truth design.
>
> I first did this by writing some hackish functions using raw querysets and 
> generating my own CTE based queries, but it lacked ideal flexibility and 
> maintainability. So I've now written some modifications into my Django to 
> do this in a more Django-esque way and think that this functionality would 
> be beneficial within the project itself, but am unsure exactly where to 
> start the conversation about that.
>
>
> *Why generate CTE based queries from querysets?*
>
> By allowing querysets to be attached to each other, and setting 
> appropriate WHERE clauses, arbitrary and nested SQL queries can be 
> generated. Where the results of the queries are only necessary for the 
> execution of following queries this saves a very substantial amount of time 
> and database work. Once these features exist, other functionality can also 
> transparently use these to generate more efficient queries (such as large 
> IN clauses).
>
> This allows several powerful use cases I think Django would benefit from:
>
>
> *Large 'IN' clauses*, can be implemented as CTEs reducing expensive 
> lookups to a single CTE INNER JOIN. For sets of thousands to match from 
> tables of millions of records this can be a very substantial gain.
>
>
> *Composite 'IN' conditions,* where multiple fields must match and you're 
> matching against a large set of condition rows. In my usage this was "where 
> the md5/sha hashes match one of the million md5/sha tuples in my match 
> set". This is simply a CTE JOIN with two clauses in the WHERE.
>
>
> *Nested data creation*, where the parent doesn't yet exist. Django 
> doesn't currently do this as the primary keys are needed, and this makes 
> normalised data structures unappealing. Using INSERTs as CTEs that supply 
> those keys to following statements means that entire nested data structures 
> of new information can be recreated in the database at once, efficiently 
> and atomically.
>
>
> *Non-uniform UPDATE*s, such that a modified set of objects can all be 
> updated with different data at the same time by utilising a CTE values 
> statement JOINed to the UPDATE statement. As there's currently no way to do 
> this kind of bulk update the alternative is to update each instance 
> individually, and this doesn't scale well.
>
> These could also be used with aggregations and other calculated fields to 
> create complex queries that aren't possible at the moment.
>
>
> *What my PoC looks like*
>
> With another mildly hackish PoC that creates a VALUEs set from a 
> dict/namedtuple which can be used to provide large input data, my present 
> modified version syntax looks a bit like this (not perfect queries):
>
> class Hashes(models.Model):
> md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
> sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 
> hash (base64)")
>
> # Mock QuerySet of values
> q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
> # A big IN query
> q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))
>
> # Matched existing values with composite 'IN' (where md5 and sha2 match, or 
> md5 matches and existing record lacks sha2)
> q_ex = 
> Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160"))
>  | Q(sha160=None))
>
> # Create new records that don't exist
> q_cr = Hashes.objects.attach(q_mo, 
> q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", 
> "sha2").as_insert()
>
> Returning the newly created records.
>
> SQL can be generated that looks something like this:
>
> WITH cte_1_0 (md5, sha2) AS (
>   VALUES