Re: Fast lookup of bulky "table"

2023-01-16 Thread Dino



Just wanted to take a moment to express my gratitude to everyone who 
responded here. You have all been so incredibly helpful. Thank you.


Dino

On 1/14/2023 11:26 PM, Dino wrote:


> Hello, I have built a PoC service in Python Flask for my work, and - now 
> that the point is made - I need to make it a little more performant (to 
> be honest, chances are that someone else will pick up from where I left 
> off, and implement the same service from scratch in a different language 
> (GoLang? .Net? Java?) but I am digressing).



Re: Fast lookup of bulky "table"

2023-01-16 Thread Dino

On 1/16/2023 2:53 AM, David wrote:

> See here:
>    https://docs.python.org/3/reference/expressions.html#assignment-expressions
>    https://realpython.com/python-walrus-operator/
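[For anyone following the thread, a minimal sketch of the assignment
expression (the "walrus", Python 3.8+) that those pages describe; the
names here are made up for illustration:]

import re

line = "name=Dino"
# := binds m and tests it in a single expression:
if (m := re.match(r"name=(\w+)", line)) is not None:
    print(m.group(1))   # -> Dino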


Thank you, brother.





Re: Fast lookup of bulky "table"

2023-01-16 Thread rbowman
On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:


>   When none of those reasons matter, one can use dictionaries in Python
>   as well. And then what Chandler Carruth showed us applies:

I am missing something. Where is the data in your dictionary coming from?


Re: Fast lookup of bulky "table"

2023-01-16 Thread Thomas Passin

On 1/16/2023 10:14 AM, Stefan Ram wrote:

> However, operating systems and databases also try to cache
> information in main memory that is estimated to be accessed often.
Yes, and you can only know by testing, when that's possible. Also, if 
you know that you have the same queries repeated over and over, you can 
intern (basically, cache) their results in Python and get a performance 
boost that way.
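[A minimal sketch of that interning idea, assuming each query can be
reduced to a hashable signature; TABLE and matches() are hypothetical
stand-ins for the OP's data and filter logic:]

import functools

@functools.lru_cache(maxsize=1024)   # maxsize bounds the cache's RAM use
def run_query(signature):
    # signature: a hashable summary of the query, e.g. a tuple of
    # (column, operator, value) triples. Each distinct signature is
    # computed once; repeats are served from the cache.
    return tuple(row for row in TABLE if matches(row, signature))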




Re: Fast lookup of bulky "table"

2023-01-16 Thread Edmondo Giovannozzi
Il giorno domenica 15 gennaio 2023 alle 05:26:50 UTC+1 Dino ha scritto:
> Hello, I have built a PoC service in Python Flask for my work, and - now 
> that the point is made - I need to make it a little more performant (to 
> be honest, chances are that someone else will pick up from where I left 
> off, and implement the same service from scratch in a different language 
> (GoLang? .Net? Java?) but I am digressing). 
> 
> Anyway, my Flask service initializes by loading a big "table" of 100k 
> rows and 40 columns or so (memory footprint: order of 300 MB) and then 
> accepts queries through a REST endpoint. Columns are strings, enums, and 
> numbers. Once initialized, the table is read only. The endpoint will 
> parse the query and match it against column values (equality, 
> inequality, greater than, etc.) Finally, it will return a (JSON) list of 
> all rows that satisfy all conditions in the query. 
> 
> As you can imagine, this is not very performant in its current form, but 
> performance was not the point of the PoC - at least initially. 
> 
> Before I deliver the PoC to a more experienced software architect who 
> will look at my code, though, I wouldn't mind looking a bit less lame 
> and doing something about performance in my own code first, possibly by 
> bringing the average time for queries down from where it is now (order 
> of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds on 
> average. 
> 
> To be honest, I was already able to bring the time down to a handful of 
> microseconds thanks to a rudimentary cache that will associate the 
> "signature" of a query to its result, and serve it the next time the 
> same query is received, but this may not be good enough: 1) queries 
> might be many and very different from one another each time, AND 2) I am 
> not sure the server will have a ton of RAM if/when this thing - or 
> whatever is derived from it - is placed into production. 
> 
> How can I make my queries generally more performant, ideally also in 
> case of a new query? 
> 
> Here's what I have been considering: 
> 
> 1. Making my cache more "modular", i.e. caching the result of certain 
> (wide) queries. When a complex query comes in, I may be able to restrict 
> my search to a subset of the rows (as determined by a previously cached 
> partial query). This should keep the memory footprint under control. 
> 
> 2. Load my data into a numpy.array and use numpy.array operations to 
> slice and dice my data. 
> 
> 3. Load my data into sqlite3 and use SELECT statements to query my table. 
> I have never used sqlite, plus there's some extra complexity as 
> comparing certain columns requires custom logic, but I wonder if this 
> architecture would work well also when dealing with a 300 MB database. 
> 
> 4. Other ideas? 
> 
> Hopefully I made sense. Thank you for your attention 
> 
> Dino

As a comparison with numpy, given the following lines:

import numpy as np
a = np.random.randn(400, 100_000)   # random numeric table, 400 x 100,000
ia = np.argsort(a[0, :])            # sort order of the first row
a_elem = a[56, ia[0]]               # pick one element, effectively at random

I have just taken an element at random from a numeric table of 400x100,000
elements. To find it with numpy:

%timeit isel = a == a_elem
35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And 
%timeit a[isel]
9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As the data are not ordered, the search is element by element, but at C
level. Of course it depends on a lot of things...
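[Applied to the OP's table, the same vectorised approach extends to
compound conditions; a rough sketch, where the row indices and thresholds
are arbitrary:]

import numpy as np

a = np.random.randn(400, 100_000)   # 400 attributes x 100,000 records

# Each comparison yields a boolean mask over all 100,000 records;
# & combines the masks element-wise, still at C speed:
mask = (a[0] > 0.5) & (a[1] < 0.0)
hits = a[:, mask]                   # every record satisfying both conditions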




Re: Fast lookup of bulky "table"

2023-01-16 Thread Peter J. Holzer
On 2023-01-16 09:12:30 +1300, dn via Python-list wrote:
> On 16/01/2023 08.36, Weatherby,Gerard wrote:
> > I think any performance improvements would have to come from a language 
> > change or better indexing of the data.
> Expanding on @Peter's post: databases (relational or not) are best organised
> according to use.

I probably wasn't very clear here. I think one of the main advantages of
relational databases over earlier models (especially hierarchical
databases but also network databases) is that you *don't* have to do
that.

In a hierarchical database you have to decide what's higher or lower in
the hierarchy (e.g., does a department have employees, or do employees
have a department?), and that has a dramatic effect on performance, so
you must structure the database according to which queries are expected
to be more common.

In a relational database employees and departments are at the same level
and you have a relationship between them. You don't need to know whether
you'll have to look up all employees of a given department or the
department of a given employee more often. Both will be similarly fast
with appropriate indexes.
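[A sketch of that point with sqlite3; the schema is invented for
illustration:]

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        name TEXT,
        dept_id INTEGER REFERENCES department(id)
    );
    CREATE INDEX emp_by_dept ON employee(dept_id);
""")

# Both directions are index-assisted lookups; neither is baked into
# the structure of the data:
con.execute("SELECT name FROM employee WHERE dept_id = ?", (1,))
con.execute("""SELECT d.name FROM department d
               JOIN employee e ON e.dept_id = d.id
               WHERE e.id = ?""", (1,))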

(Of course you should still have an idea what the data is used for when
designing your data model. But semantics are much more important than
use cases at this stage and you don't have to redesign your entire
database just because you need a new query.)

This flexibility comes with a cost, though: A relational database is
almost always good enough, but almost never optimal for any given use.


> Postgres and MySQL (for example) enable the establishment of multiple and
> sophisticated indices/indexes, and the aptly-named "views" of data.
> 
> If the queries can be grouped according to the manner in which the data must
> be accessed, a view could be built for each. At which time, even if every
> row must be accessed, the retrieval will be made more efficient and/or the
> response better-organised.

Nitpick: A view will not be more efficient (unless it's a materialized
view which is basically just a table). To retrieve that data, the RDBMS
has to perform exactly the same operations as for the underlying query.
Views make life simpler for the programmer (or analyst) by letting them
*write* simpler queries, but under the surface nothing changes. In fact
the performance may be worse, since the perceived simplicity of the view
may cause the human to write queries which are hard to optimize.
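[To make the nitpick concrete, continuing the toy schema from the sketch
above (the view name is invented): a view is just a named query, and
SQLite typically produces the same plan either way.]

con.execute("""CREATE VIEW sales_staff AS
               SELECT * FROM employee WHERE dept_id = 1""")

# The view saves typing, not work -- both plans should come out the same:
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM sales_staff").fetchall())
print(con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employee WHERE dept_id = 1").fetchall())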

(There is another important use case for views: Access control and
enforcement of business rules. But I fear we are straying far from the
original topic now.)

> Thus, if we have a DB of people. Many retrievals are likely to utilise an
> index on 'name'. However, if at times interest is limited to place or
> suburb, an index and view of such will speed things up from O(n). Similarly, if
> a query is only to return people with a dog license.

I don't see where a view would help here - but maybe you are thinking of
a more complex database than you describe.

> Some programmers don't realise that SQL can also be used for calculations,
> eg the eponymous COUNT(), which saves (CPU-time and coding-effort) over
> post-processing in Python.

Agreed. Aggregate and window functions can do a lot of work in the
database.
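[For example, still with the toy employee table from the first sketch: a
per-department head count never has to leave the database.]

# One aggregate query instead of fetching every employee row into Python:
rows = con.execute("""SELECT dept_id, COUNT(*) AS n
                      FROM employee
                      GROUP BY dept_id""").fetchall()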

hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected]  | -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |    challenge!"




Re: Fast lookup of bulky "table"

2023-01-16 Thread Peter J. Holzer
On 2023-01-15 18:06:36 -0500, Thomas Passin wrote:
> You especially want to avoid letting the database engine do full-table
> scans over and over. And you never want to send a lot of rows to
> Python and do post-filtering on them if you can avoid it.

Another thing to avoid: Lots of small queries. If you have the choice
between a single query which returns 5000 rows and 1000 queries which
return on average 5 rows each, the former is almost always much faster.
Each query carries some overhead. This may not be very noticeable with
SQLite (as everything is in the same process), but it becomes more and
more pronounced the farther the database is from the client. For
complicated queries the parsing and planning overhead can also become
significant.
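[A sketch of the contrast with sqlite3; the table and sizes are toy values
chosen for illustration:]

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                ((i, str(i)) for i in range(100_000)))
wanted = list(range(0, 5_000, 10))          # 500 ids to look up

# Slow: one parse/plan/execute round trip per id.
rows = [con.execute("SELECT * FROM t WHERE id = ?", (i,)).fetchone()
        for i in wanted]

# Usually much faster: a single query returning all of them at once.
marks = ",".join("?" * len(wanted))
rows = con.execute(f"SELECT * FROM t WHERE id IN ({marks})", wanted).fetchall()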

hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected]  | -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |    challenge!"




Re: Fast lookup of bulky "table"

2023-01-16 Thread Thomas Passin

On 1/16/2023 11:56 AM, rbowman wrote:

> On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
>
>>    When none of those reasons matter, one can use dictionaries in Python
>>    as well. And then what Chandler Carruth showed us applies:
>
> I am missing something. Where is the data in your dictionary coming from?


It would get imported on startup.  This is assuming that the data does 
not get changed during operation, which the OP said is the case.
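[I.e., something along these lines at service start-up; the file name and
key column are invented for illustration:]

import csv

# Build the lookup structure once, when the process starts:
with open("table.csv", newline="") as f:
    rows = list(csv.DictReader(f))
by_id = {row["id"]: row for row in rows}    # O(1) exact-match lookups

# Later, per request:
hit = by_id.get("12345")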




Re: Fast lookup of bulky "table"

2023-01-16 Thread Albert-Jan Roskam
   On Jan 15, 2023 05:26, Dino wrote:

 Hello, I have built a PoC service in Python Flask for my work, and - now
 that the point is made - I need to make it a little more performant (to
 be honest, chances are that someone else will pick up from where I left
 off, and implement the same service from scratch in a different language
 (GoLang? .Net? Java?) but I am digressing).

   ===
   Hi,
   * I'd start by measuring where your program spends its
   time: https://docs.python.org/3/library/profile.html (a minimal
   invocation is sketched after this list)
   * It might be useful to try DuckDB instead of
   Sqlite. https://duckdb.org/why_duckdb.html
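
   [A minimal sketch of that profiling step; handle_query and sample_query
   are placeholders for the OP's own code:]

   import cProfile

   # Profile one representative request and sort by cumulative time:
   cProfile.run("handle_query(sample_query)", sort="cumtime")

   # Or profile the whole service from the command line:
   #   python -m cProfile -s cumtime app.py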
   Best wishes,
   AJ


Re: Fast lookup of bulky "table"

2023-01-16 Thread Dino

On 1/16/2023 1:18 PM, Edmondo Giovannozzi wrote:


> As a comparison with numpy, given the following lines:
>
> import numpy as np
> a = np.random.randn(400, 100_000)   # random numeric table, 400 x 100,000
> ia = np.argsort(a[0, :])            # sort order of the first row
> a_elem = a[56, ia[0]]               # pick one element, effectively at random
>
> I have just taken an element at random from a numeric table of
> 400x100,000 elements. To find it with numpy:
>
> %timeit isel = a == a_elem
> 35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> And
> %timeit a[isel]
> 9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> As the data are not ordered, the search is element by element, but at C
> level. Of course it depends on a lot of things...


thank you for this. It's probably my lack of experience with Numpy, 
but... can you explain what is going on here in more detail?


Thank you

Dino


Re: Fast lookup of bulky "table"

2023-01-16 Thread rbowman
On Mon, 16 Jan 2023 12:28:37 -0500, Thomas Passin wrote:

> On 1/16/2023 11:56 AM, rbowman wrote:
>> On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
>> 
>> 
>>>When none of those reasons matter, one can use dictionaries in
>>>Python as well. And then what Chandler Carruth showed us applies:
>> 
>> I am missing something. Where is the data in your dictionary coming
>> from?
> 
> It would get imported on startup.  This is assuming that the data does
> not get changed during operation, which the OP said is the case.

Okay, so there is no data loss if the program or machine crashes. That 
makes sense. In the cases where I'm using SQLite, the input data consists 
of relatively expensive spatial queries that are processed and written out 
to persist the results for use by other applications in the suite. 