Re: Fast lookup of bulky "table"
Just wanted to take a moment to express my gratitude to everyone who responded here. You have all been so incredibly helpful. Thank you

Dino

On 1/14/2023 11:26 PM, Dino wrote:
> Hello, I have built a PoC service in Python Flask for my work, and - now
> that the point is made - I need to make it a little more performant (to be
> honest, chances are that someone else will pick up from where I left off,
> and implement the same service from scratch in a different language
> (GoLang? .Net? Java?), but I am digressing).
Re: Fast lookup of bulky "table"
On 1/16/2023 2:53 AM, David wrote:
> See here:
> https://docs.python.org/3/reference/expressions.html#assignment-expressions
> https://realpython.com/python-walrus-operator/

Thank you, brother.
Re: Fast lookup of bulky "table"
On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:

> When none of those reasons matter, one can use dictionaries in Python
> as well. And then what Chandler Carruth showed us applies:

I am missing something. Where is the data in your dictionary coming from?
Re: Fast lookup of bulky "table"
On 1/16/2023 10:14 AM, Stefan Ram wrote:
> However, operating systems and databases also try to cache information
> in main memory that is estimated to be accessed often.

Yes, and you can only know by testing, when that's possible.

Also, if you know that you have the same queries repeated over and over, you can intern (basically, cache) their results in Python and get a performance boost that way.
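A minimal sketch of that kind of result caching, assuming a hypothetical in-memory TABLE and a matches() helper (both invented here for illustration); functools.lru_cache keys the cache on the hashable query signature:

    from functools import lru_cache

    # Hypothetical in-memory table: a tuple of row dicts loaded at startup.
    TABLE = (
        {"color": "red", "price": 120},
        {"color": "blue", "price": 80},
        {"color": "red", "price": 95},
    )

    def matches(row, conditions):
        # conditions is a tuple of (column, operator, value) triples;
        # only a few operators are shown here for illustration.
        ops = {"==": lambda a, b: a == b,
               ">": lambda a, b: a > b,
               "<": lambda a, b: a < b}
        return all(ops[op](row[col], val) for col, op, val in conditions)

    @lru_cache(maxsize=1024)   # keep up to 1024 distinct query results cached
    def run_query(conditions):
        # conditions must be hashable (a tuple of tuples) so it can serve as
        # the cache key; repeated identical queries are served from the cache.
        return tuple(r for r in TABLE if matches(r, conditions))

    print(run_query((("color", "==", "red"), ("price", ">", 100))))

The maxsize argument bounds the memory the cache can take, which addresses the "not a ton of RAM in production" concern at the cost of evicting the least recently used results.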
Re: Fast lookup of bulky "table"
On Sunday 15 January 2023 at 05:26:50 UTC+1, Dino wrote:
> Hello, I have built a PoC service in Python Flask for my work, and - now
> that the point is made - I need to make it a little more performant (to
> be honest, chances are that someone else will pick up from where I left
> off, and implement the same service from scratch in a different language
> (GoLang? .Net? Java?), but I am digressing).
>
> Anyway, my Flask service initializes by loading a big "table" of 100k
> rows and 40 columns or so (memory footprint: order of 300 MB) and then
> accepts queries through a REST endpoint. Columns are strings, enums, and
> numbers. Once initialized, the table is read only. The endpoint will
> parse the query and match it against column values (equality,
> inequality, greater than, etc.). Finally, it will return a (JSON) list
> of all rows that satisfy all conditions in the query.
>
> As you can imagine, this is not very performant in its current form, but
> performance was not the point of the PoC - at least initially.
>
> Before I deliver the PoC to a more experienced software architect who
> will look at my code, though, I wouldn't mind looking a bit less lame
> and doing something about performance in my own code first, possibly by
> bringing the average time for queries down from where it is now (order
> of 1 to 4 seconds per query on my laptop) to 1 or 2 milliseconds on
> average.
>
> To be honest, I was already able to bring the time down to a handful of
> microseconds thanks to a rudimentary cache that will associate the
> "signature" of a query to its result, and serve it the next time the
> same query is received, but this may not be good enough: 1) queries
> might be many and very different from one another each time, AND 2) I am
> not sure the server will have a ton of RAM if/when this thing - or
> whatever is derived from it - is placed into production.
>
> How can I make my queries generally more performant, ideally also in
> case of a new query?
>
> Here's what I have been considering:
>
> 1. Making my cache more "modular", i.e. cache the result of certain
> (wide) queries. When a complex query comes in, I may be able to restrict
> my search to a subset of the rows (as determined by a previously cached
> partial query). This should keep the memory footprint under control.
>
> 2. Load my data into a numpy.array and use numpy.array operations to
> slice and dice my data.
>
> 3. Load my data into sqlite3 and use SELECT statements to query my
> table. I have never used sqlite, plus there's some extra complexity as
> comparing certain columns requires custom logic, but I wonder if this
> architecture would work well also when dealing with a 300 MB database.
>
> 4. Other ideas?
>
> Hopefully I made sense. Thank you for your attention
>
> Dino

As a comparison with numpy, given the following lines:

    import numpy as np
    a = np.random.randn(400, 100_000)
    ia = np.argsort(a[0, :])
    a_elem = a[56, ia[0]]

I have just taken an element randomly from a numeric table of 400 x 100,000 elements. To find it with numpy:

    %timeit isel = a == a_elem
    35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

And

    %timeit a[isel]
    9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As the data are not ordered, it searches element by element, but at C level. Of course it depends on a lot of things...
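A rough sketch of how option 2 above could look for the OP's use case; the column names, dtypes and the example conditions are made up for illustration. Each condition becomes a boolean mask over its column, the masks are AND-ed in a single vectorized pass, and the matching rows are pulled out once at the end:

    import numpy as np

    n_rows = 100_000
    # Hypothetical columns; a real table would be loaded once at startup.
    price = np.random.uniform(0, 500, n_rows)                    # numeric column
    color = np.random.choice(["red", "blue", "green"], n_rows)   # enum-like column

    # Each comparison produces a boolean mask; combining them with & stays at
    # C speed, and np.nonzero yields the indices of rows matching everything.
    mask = (price > 100) & (price < 200) & (color == "red")
    matching_rows = np.nonzero(mask)[0]
    print(len(matching_rows))

The indices can then be used to assemble the JSON response rows, so only one pass over the 100k rows is needed per query regardless of how many conditions it contains.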
Re: Fast lookup of bulky "table"
On 2023-01-16 09:12:30 +1300, dn via Python-list wrote:
> On 16/01/2023 08.36, Weatherby,Gerard wrote:
> > I think any performance improvements would have to come from a language
> > change or better indexing of the data.
>
> Expanding on @Peter's post: databases (relational or not) are best
> organised according to use.

I probably wasn't very clear here. I think one of the main advantages of relational databases over earlier models (especially hierarchical databases, but also network databases) is that you *don't* have to do that.

In a hierarchical database you have to decide what's higher or lower in the hierarchy (e.g., does a department have employees, or do employees have a department?), and that has a dramatic effect on performance, so you must structure the database around which queries are expected to be more common. In a relational database, employees and departments are at the same level and you have a relationship between them. You don't need to know whether you'll have to look up all employees of a given department or the department of a given employee more often. Both will be similarly fast with appropriate indexes.

(Of course you should still have an idea what the data is used for when designing your data model. But semantics are much more important than use cases at this stage, and you don't have to redesign your entire database just because you need a new query.)

This flexibility comes with a cost, though: a relational database is almost always good enough, but almost never optimal for any given use.

> Postgres and MySQL (for example) enable the establishment of multiple and
> sophisticated indices/indexes, and the aptly-named "views" of data.
>
> If the queries can be grouped according to the manner in which the data
> must be accessed, a view could be built for each. At which time, even if
> every row must be accessed, the retrieval will be made more efficient
> and/or the response better-organised.

Nitpick: a view will not be more efficient (unless it's a materialized view, which is basically just a table). To retrieve that data, the RDBMS has to perform exactly the same operations as for the underlying query. Views make life simpler for the programmer (or analyst) by letting them *write* simpler queries, but under the surface nothing changes. In fact the performance may be worse, since the perceived simplicity of the view may cause the human to write queries which are hard to optimize.

(There is another important use case for views: access control and enforcement of business rules. But I fear we are straying far from the original topic now.)

> Thus, if we have a DB of people. Many retrievals are likely to utilise an
> index on 'name'. However, if at times interest is limited to place or
> suburb, an index and view of such will speed things up from O(n).
> Similarly, if a query is only to return people with a dog license.

I don't see where a view would help here - but maybe you are thinking of a more complex database than you describe.

> Some programmers don't realise that SQL can also be used for calculations,
> eg the eponymous COUNT(), which saves (CPU-time and coding-effort) over
> post-processing in Python.

Agreed. Aggregate and window functions can do a lot of work in the database.

hp
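A small sqlite3 sketch of that last point, pushing an aggregate into SQL instead of fetching rows and counting them in Python; the table, columns and sample data are made up for illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, suburb TEXT, has_dog_license INTEGER)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?)",
        [("Ann", "Newtown", 1), ("Bob", "Oldtown", 0), ("Cat", "Newtown", 1)],
    )
    # An index lets the lookup avoid a full-table scan on larger tables.
    conn.execute("CREATE INDEX idx_people_suburb ON people (suburb)")

    # The database does the counting; Python only receives a single number.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM people WHERE suburb = ? AND has_dog_license = 1",
        ("Newtown",),
    ).fetchone()
    print(count)  # -> 2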
Re: Fast lookup of bulky "table"
On 2023-01-15 18:06:36 -0500, Thomas Passin wrote:
> You especially want to avoid letting the database engine do full-table
> scans over and over. And you never want to send a lot of rows to
> Python and do post-filtering on them if you can avoid it.

Another thing to avoid: lots of small queries. If you have the choice between a single query which returns 5000 rows and 1000 queries which return on average 5 rows each, the former is almost always much faster.

Each query carries some overhead. This may not be very noticeable with SQLite (as everything is in the same process), but it becomes more and more pronounced the farther the database is from the client. For complicated queries the parsing and planning overhead can also become significant.

hp
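A toy sqlite3 illustration of the two patterns (table name and ids invented for the example); with SQLite in-process the gap is small, but the shape of the code is what matters:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(i, f"item-{i}") for i in range(1000)])

    wanted = list(range(0, 1000, 3))

    # Slow pattern: one round trip (parse, plan, execute) per id.
    rows_slow = [conn.execute("SELECT id, name FROM items WHERE id = ?", (i,)).fetchone()
                 for i in wanted]

    # Faster pattern: one query that returns all matching rows at once.
    placeholders = ",".join("?" * len(wanted))
    rows_fast = conn.execute(
        f"SELECT id, name FROM items WHERE id IN ({placeholders})", wanted
    ).fetchall()

    assert len(rows_slow) == len(rows_fast)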
Re: Fast lookup of bulky "table"
On 1/16/2023 11:56 AM, rbowman wrote:
> On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
>
>> When none of those reasons matter, one can use dictionaries in Python
>> as well. And then what Chandler Carruth showed us applies:
>
> I am missing something. Where is the data in your dictionary coming from?

It would get imported on startup. This is assuming that the data does not get changed during operation, which the OP said is the case.
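A minimal sketch of that startup import, assuming a hypothetical rows.csv with a "color" column; indexing the rows by one column at load time turns later lookups on that column into O(1) dict access instead of a scan:

    import csv
    from collections import defaultdict

    # Hypothetical startup step: read the file once and index rows by one column.
    by_color = defaultdict(list)
    with open("rows.csv", newline="") as f:
        for row in csv.DictReader(f):
            by_color[row["color"]].append(row)

    # Later, at query time: constant-time lookup instead of scanning 100k rows.
    red_rows = by_color.get("red", [])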
Re: Fast lookup of bulky "table"
On Jan 15, 2023 05:26, Dino wrote:
> Hello, I have built a PoC service in Python Flask for my work, and - now
> that the point is made - I need to make it a little more performant (to be
> honest, chances are that someone else will pick up from where I left off,
> and implement the same service from scratch in a different language
> (GoLang? .Net? Java?), but I am digressing).

Hi,

* I'd start by measuring where your program spends its time:
  https://docs.python.org/3/library/profile.html

* It might be useful to try DuckDB instead of SQLite.
  https://duckdb.org/why_duckdb.html

Best wishes,
AJ
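A minimal sketch of that first suggestion, profiling a hypothetical query function with the stdlib cProfile module and listing the call sites that take the most cumulative time:

    import cProfile
    import pstats

    def handle_query(conditions):
        # Placeholder for the real Flask endpoint logic being measured.
        return [c for c in conditions if c]

    cProfile.run("handle_query([True, False] * 100_000)", "query.prof")
    stats = pstats.Stats("query.prof")
    stats.sort_stats("cumulative").print_stats(10)  # top 10 call sites by time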
Re: Fast lookup of bulky "table"
On 1/16/2023 1:18 PM, Edmondo Giovannozzi wrote:
> As a comparison with numpy, given the following lines:
>
> import numpy as np
> a = np.random.randn(400, 100_000)
> ia = np.argsort(a[0, :])
> a_elem = a[56, ia[0]]
>
> I have just taken an element randomly from a numeric table of
> 400 x 100,000 elements. To find it with numpy:
>
> %timeit isel = a == a_elem
> 35.5 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> And
>
> %timeit a[isel]
> 9.18 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> As the data are not ordered, it searches element by element, but at
> C level. Of course it depends on a lot of things...

thank you for this. It's probably my lack of experience with Numpy, but... can you explain what is going on here in more detail?

Thank you

Dino
Re: Fast lookup of bulky "table"
On Mon, 16 Jan 2023 12:28:37 -0500, Thomas Passin wrote:

> On 1/16/2023 11:56 AM, rbowman wrote:
>> On 16 Jan 2023 15:14:06 GMT, Stefan Ram wrote:
>>
>>> When none of those reasons matter, one can use dictionaries in
>>> Python as well. And then what Chandler Carruth showed us applies:
>>
>> I am missing something. Where is the data in your dictionary coming
>> from?
>
> It would get imported on startup. This is assuming that the data does
> not get changed during operation, which the OP said is the case.

Okay, so there is no data loss if the program or machine crashes. That makes sense. In the cases where I'm using SQLite, the input data consists of relatively expensive spatial queries that are processed and written out to persist the results for use by other applications in the suite.
