Re: How to perform a long running dry run transaction without blocking

2025-02-07 Thread Robert Leach
>> Anyway, thanks so much for your help.  This discussion has been very useful, 
>> and I think I will proceed at first, exactly how you suggested, by queuing 
>> every validation job (using celery).  Then I will explore whether or not I 
>> can apply the "on timeout" strategy in a small patch.
>> Incidentally, during our Wednesday meeting this week, we actually opened our 
>> public instance to the world for the first time, in preparation for the 
>> upcoming publication.  This discussion is about the data submission 
>> interface, but that interface is actually disabled on the public-facing 
>> instance.  The other part of the codebase that I was primarily responsible 
>> for was the advanced search.  Everything else was primarily by other team 
>> members.  If you would like to check it out, let me know what you think: 
>> http://tracebase.princeton.edu 
> 
> I would have to hit the books again to understand all of what is going on 
> here.

It's a mass spec tracing database.  Animals are infused with radiolabeled 
compounds, and mass spec is used to see what the animal's biochemistry turns 
those compounds into.  (My undergrad was biochem, so I've been resurrecting my 
biochem knowledge as needed for this project.  I've mostly been doing RNA and 
DNA sequence analysis since undergrad, and most of that was prokaryotic.)

> One quibble with the Download tab: there is no indication of the size of the 
> datasets. I generally like to know what I am getting into before I start a 
> download. Also, is there explicit throttling going on? I am seeing 
> 10.2kb/sec, whereas from here 
> https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page I downloaded a 
> 47.65M file at 41.9MB/s.

Thank you!  Not knowing the download size is exactly a complaint I've had 
myself.  That download actually uses my advanced search interface (in browse 
mode), and the download buttons on the advanced search have the same issue.  
With streaming, we're not dealing with temp files, which is nice, at least for 
the advanced search, but we can't know the download size that way.  So I had 
wanted a progress bar to at least show progress (current record out of total).  
I could even have estimated the size (an option I explored for a few days).  
Eventually, I proposed a celery solution for that and was overruled.
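
The gist of why the size is unknown up front: the response body is a generator 
fed to Django's StreamingHttpResponse, so there's never a full payload to 
measure for a Content-Length header.  A minimal sketch (the `PeakData` model 
and `stream_tsv_rows` helper are made-up names, not the real code):

    from django.http import StreamingHttpResponse

    def stream_tsv_rows(queryset):
        # Rows are produced one at a time, so the total size is never known
        # before the response starts going out.
        yield "sample\tcompound\tvalue\n"
        for rec in queryset.iterator():
            yield f"{rec.sample}\t{rec.compound}\t{rec.value}\n"

    def download_view(request):
        qs = PeakData.objects.all()  # hypothetical model, for illustration only
        response = StreamingHttpResponse(
            stream_tsv_rows(qs), content_type="text/tab-separated-values"
        )
        response["Content-Disposition"] = 'attachment; filename="results.tsv"'
        return response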

As for the download in the nav bar, we have an issue to change that to a 
listing of actual files broken down by study (3 files per study).  There's not 
much utility, from a user's perspective, in downloading everything anyway.  
We've just been focused on other things.  In fact, we have a request from a 
user for that specific feature, done in a way that's compatible with curl/scp.  
We just have to figure out how to avoid CAS-authenticating each command, 
something I don't have experience with.



Re: How to perform a long running dry run transaction without blocking

2025-02-07 Thread Robert Leach
>> I'd run across a stackexchange answer that suggested setting a statement 
>> timeout (specific to postgres) in a django view.  If I did that for the 
>> validation view, it wouldn't necessarily 100% correspond to a blocking 
>> issue, but since the overall goal is to prevent a gateway timeout, I'm fine 
>> with that.
> 
> When you do a SET, statement_timeout is session/connection specific.
> 
> For what that means in Django see:
> 
> https://docs.djangoproject.com/en/5.1/ref/databases/
> 
> Personally I think this will introduce more complexity than it's worth.
> 
> Further down the page at above link it shows how you can use a psycopg(2) 
> pool, but that is another layer added to an already complex setup.

TBF, adding a celery strategy to the codebase to queue jobs is already 
comparatively complex to start with.  It's a pretty heavy-handed solution to 
what seems to be a relatively rare occurrence (though a recurrent one when it 
does happen).  By that metric, I would think that queuing jobs at all is too 
complex to be worth it.  So I'm not sure that catching a timeout exception and 
queuing the job at that point, instead of queuing it at the outset, is that 
much more complex.  I can just create a view that inherits from the current 
view (to which the statement timeout is applied) but is *not* subject to the 
statement timeout, and use it when I want to queue a longer-running validation 
due to the block.  I think I should be able to redirect the request to that 
derived class view.  Not sure exactly how ATM, but it shouldn't be more than a 
line of code/statement, I would think.

Regarding statement timeouts in general, I found a pretty great article that 
suggests setting site-wide statement timeouts as a general rule, which I think 
makes a lot of sense.  In fact, some of the pages on our site (that I didn't 
write) are not going to scale well as the DB grows.  I already feel like the 
samples page is way too slow.  They should probably all gracefully handle 
timeouts so that users don't encounter raw 500 errors.  So whether or not I 
decide to "queue on statement timeout", I think using statement timeouts is 
worthwhile.
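
For reference, if we go that route, a site-wide default can be applied from 
Django settings through the libpq `options` connection parameter, along these 
lines (a sketch with example values; the 30 seconds is arbitrary):

    # settings.py (sketch): every connection Django opens gets this default
    # statement_timeout; individual sessions can still override it with SET.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "tracebase",  # example value
            "OPTIONS": {
                "options": "-c statement_timeout=30000",  # milliseconds
            },
        },
    }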

I take your point, though.  How many additional lines of code do you think it 
would take before the strategy stops being worthwhile?  I'd be willing to bet 
I could keep it under 20 compared to straight-up queuing every validation.  In 
fact, I'm leaning toward implementing your suggestion of queuing everything as 
"phase 1" of this endeavor, and then implementing my proposed "queue on 
timeout" idea as phase 2 (perhaps as a configurable option, at least until I'm 
satisfied it works well).
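
The phase 2 fallback I'm picturing is roughly the following sketch.  
`validate_submission` and `validate_submission_task` are placeholder names for 
the existing validation entry point and its celery task:

    from django.db.utils import OperationalError
    from psycopg2 import errors as pg_errors

    def validate_or_queue(submission_id):
        try:
            # Run the validation inline, subject to the statement timeout.
            return {"status": "done",
                    "results": validate_submission(submission_id)}
        except OperationalError as exc:
            # A statement_timeout cancellation surfaces as QueryCanceled
            # underneath Django's wrapped exception.
            if isinstance(exc.__cause__, pg_errors.QueryCanceled):
                task = validate_submission_task.delay(submission_id)
                return {"status": "queued", "task_id": task.id}
            raise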

>> I don't know if that stack exchange suggestion does what it claims to do or 
>> not.  I haven't tried it.  But the key claim of interest here is that the 
>> timeout is specific to the view.  I.e. any other views or background 
>> processes wouldn't be subject to the timeout.  I'm suspicious as to whether 
>> this is actually true.  My novice understanding of Django's cursors is that 
>> it maintains a pool of connections, and I'm not
> 
> Actually cursors are attributes of connections.

Yeah, thus my suspicion that it may not apply only to the view in the example.  
I suspected their "proof" that it didn't affect other views was due to randomly 
getting a connection to which that statement timeout had not yet been applied.  
*However*, I just read an article describing Django's connection management, 
and I realized that Django's default behavior (which is what we use) is to 
create a new connection for every request.  So I think that, with the caveat 
that it wouldn't work with persistent connections, the stack exchange claim is 
actually correct.

I would just have to figure out how best to apply the timeout to everything 
except those queued validations.  I think a context manager might be the 
simplest way to do it.
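
Something along these lines is what I have in mind, assuming the default 
one-connection-per-request behavior (a sketch, not tested):

    from contextlib import contextmanager
    from django.db import connection

    @contextmanager
    def statement_timeout(ms):
        """Apply a Postgres statement_timeout to the current connection."""
        with connection.cursor() as cursor:
            # set_config() accepts bind parameters, unlike a bare SET statement
            cursor.execute(
                "SELECT set_config('statement_timeout', %s, false)", [str(ms)]
            )
        try:
            yield
        finally:
            with connection.cursor() as cursor:
                # Fall back to whatever the server/site-wide default is
                cursor.execute("RESET statement_timeout")

The queued (celery) validations would simply never enter this context manager, 
so they'd run under the default (or no) timeout.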

Anyway, thanks so much for your help.  This discussion has been very useful, 
and I think I will proceed at first, exactly how you suggested, by queuing 
every validation job (using celery).  Then I will explore whether or not I can 
apply the "on timeout" strategy in a small patch.

Incidentally, during our Wednesday meeting this week, we actually opened our 
public instance to the world for the first time, in preparation for the 
upcoming publication.  This discussion is about the data submission interface, 
but that interface is actually disabled on the public-facing instance.  The 
other part of the codebase that I was primarily responsible for was the 
advanced search.  Everything else was primarily by other team members.  If you 
would like to check it out, let me know what you think: 
http://tracebase.princeton.edu 

Cheers,
Rob


Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544

Re: How to perform a long running dry run transaction without blocking

2025-02-06 Thread Robert Leach
> Have you considered a validation app? Have it read the inputs and look in the 
> db for conflicts, rather than attempt the insert. Zero transactions necessary.


I did consider that about a year or two ago when I first conceived the data 
validation interface.  Doing that now would solve the problem of concurrent 
validations blocking one another, but selects would still get blocked if their 
result is pending the outcome of a load running in the background.  If I 
mitigate that issue by running those loads overnight on a schedule, I still 
lose out on the benefits of having the loading code do the validation for me...

I would have to explicitly find and report on problems that the load exceptions 
currently report for me without any extra code.

So what I'm saying is that the data being validated is interdependent.  There 
are about a dozen different loading scripts (one for each sheet of the uploaded 
excel file) whose runs are orchestrated by a master load script that ensures 
they are run in the right order so that the interdependent data can be checked.  
For example, these are some of the relative orders of what needs to be loaded 
so that data can be validated:

Study > Animals
Tracers > Infusates > Animals
Treatments > Animals
Tissues > Samples

The Animal load script would fail if the new data in (the tables) Study, 
Tracers, Infusates, and Treatments weren't inserted, because it links to those 
newly created records.  And there's no way to detect problems in those new 
relationships against the unchanged database if they aren't inserted.  That's 
what doing this all in a transaction, and actually performing the inserts (for 
which I use Django `get_or_create` method calls), provides.

In other words, I would have to save and explicitly check the inter-related 
sheet data in data structures independent of the database in order to find the 
equivalent of (for example) `ObjectDoesNotExist` errors that currently 
originate from the database.  Right now, I get those errors caught "for free".  
All I have to do is tell the user what sheet/row/column is related to the 
error.  And it saves me the overhead of having to maintain synchronicity 
between separate validation code and loading code whenever the loading code 
changes.
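
To illustrate what the transaction buys (with simplified, made-up models and 
row handling, not the real TraceBase loaders):

    from django.core.exceptions import ObjectDoesNotExist
    from django.db import transaction

    def load_animal_row(row, errors):
        # Study and Animal stand in for the real models here.  Because the
        # Study sheet was loaded earlier in this same transaction, a bad study
        # reference surfaces as a plain ObjectDoesNotExist from the ORM -- no
        # separate validation code required.
        try:
            study = Study.objects.get(name=row["study"])
            Animal.objects.get_or_create(
                name=row["animal"], defaults={"study": study}
            )
        except ObjectDoesNotExist as e:
            errors.append((row["sheet"], row["row_num"], e))  # buffered, not raised

    def validate(rows):
        errors = []
        with transaction.atomic():
            for row in rows:
                load_animal_row(row, errors)
            transaction.set_rollback(True)  # dry run: discard all inserts at exit
        return errors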

Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544





Re: How to perform a long running dry run transaction without blocking

2025-02-06 Thread Robert Leach
>>> The load to the development server does no validation?
>>> 
>>> If so what is the purpose?
>>> 
>>> The background processes are other validation runs?
>> It's the same code that executes in both cases (with or without the 
>> `--validate` flag).  All that flag does is (effectively) raise the dry run 
>> exception before leaving the transaction block, so it always validates 
>> (whether the flag is supplied or not).
> 
> More for my sake than anything else, why do the load to the development 
> server at all if the production load is the only one that counts?

The software is still in beta for a new major version.  We're adding features 
and fixing bugs.  It's not unusual to encounter a new bug, fix it on dev to get 
the load to work, then deploy a point release on prod.  And that means repeated 
load attempts that interfere with the validation interface.  Beyond this, we're 
planning on a separate staging database, which is effectively what dev is now.  
Sometimes, a curator only finds a technical data issue after the initial load, 
while browsing the newly loaded data on the dev site.

>> So the load doesn't fail until the end of the run, which is inefficient from 
>> a maintenance perspective.  I've been thinking of adding a `--failfast` 
>> option for use on the back end.  Haven't done it yet.  I started a load 
>> yesterday in fact that ran 2 hours before it buffered an exception related 
>> to a newly introduced bug.  I fixed the bug and ran the load again.  It 
>> finished sometime between COB yesterday and this morning (successfully!).
> 
> Alright, I am trying to reconcile this with this statement from below: 'The 
> largest studies take just under a minute'.

The context of the 'The largest studies take just under a minute' statement is 
that it's not loading the hefty/time-consuming raw data.  It's only validating 
the metadata.  That's fast (5-60s).  And that metadata is only a portion of the 
transaction in the back-end load.  There are errors that validation can miss 
because it doesn't touch the raw data, and in fact, those errors are addressed 
by curators editing the excel sheets.  That's why it's all in the load 
transaction instead of loaded separately, but those problems are somewhat rare 
(and we currently have a new feature in the design phase that should almost 
completely eliminate those issues).

>>> Seems you are looking for some sort of queuing system.
>>> 
>>> What are the time constraints for getting the validation turned around?
>> I have considered a queuing system, though when I previously floated a proof 
>> of concept using celery, I was informed it was too much.  Though, at the 
>> time, all I was trying to do was a progress bar for a query stats feature.  
>> So proposing celery in this instance may get more traction with the rest of 
>> the team.
>> Most of the small validation processes finish in under a dozen seconds.   
>> The largest studies take just under a minute.  I have plans to optimize the 
>> loading scripts that hopefully could get the largest studies down to a dozen 
>> seconds.  If I could do that, and do the back end loads in off-peak hours, 
>> then I'd be willing to suffer the rare timeouts from concurrent validations. 
>>  The raw data loads will still likely take a much longer time.
> 
> This is where I get confused, probably because I am not exactly sure what 
> constitutes validation. My sense is that it involves a load of data into live 
> tables and seeing what fails PK, FK or other constraints.
> 
> If that is the case, I am not seeing how the 'for real' data load would take 
> longer?

The validation skips the time-consuming raw data load.  That raw data is 
collectively hundreds of gigs in size and could not be uploaded on the 
validation page anyway.  The feature I alluded to above, which would almost 
completely eliminate errors associated with the raw data, is one where the 
researcher can drop the raw data folder into the form and it just walks the 
directory to get all the raw data file names and relative paths.  It's those 
data relationships whose validations are currently skipped.

> At any rate, I can't see how loading multiple sets of data into a live 
> database while operations are going on in the database can be made conflict 
> free. To me it seems the best that can be done is:
> 
> 1) Reduce the chance for conflict by spreading the actions out.
> 
> 2) Have retry logic that deals with conflicts.

I'm unfamiliar with retry functionality, but those options sound logical to me 
as a good path forward, particularly using celery to spread out validations and 
doing the back-end loads at night (or using some sort of fast dump/load).  The 
thing that bothers me about the celery solution is that most of the time, 2 
users validating different data will not block each other, so I would be making 
users wait for no reason.  Ideally, I could anticipate the block and only at 
that point separate those validations.
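
If I understand the retry suggestion, it would look something like this sketch 
(retrying the whole atomic block on Django's OperationalError, which is what a 
Postgres serialization failure gets wrapped in; the attempt count and delay are 
arbitrary):

    import time
    from django.db import transaction
    from django.db.utils import OperationalError

    def run_with_retries(fn, attempts=3, delay=0.5):
        # A real version would probably inspect the underlying pgcode and only
        # retry serialization failures (SQLSTATE 40001), not every
        # OperationalError.
        for attempt in range(1, attempts + 1):
            try:
                with transaction.atomic():
                    return fn()
            except OperationalError:
                if attempt == attempts:
                    raise
                time.sleep(delay * attempt)  # crude linear backoff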

This brings up a question though about a possibility I sus

Re: How to perform a long running dry run transaction without blocking

2025-02-06 Thread Robert Leach
> Please reply to list also.
> Ccing list

👍🏻

>> They enter the data in an excel spreadsheet containing about a dozen 
>> inter-related worksheets, named, for example: Study, Animals, Samples, 
>> Tissues, Treatments, Sequences
> 
> Long term, is there a thought to have them enter directly into the database, 
> where validation can happen in real time?

No, not really.  Sample metadata tends to be an afterthought to researchers.  
They have it in their notebooks and getting them to enter it at all is like 
pulling teeth.  The validation interface actually has a bunch of other features 
I haven't mentioned that streamline the process for them.  Before it gets to 
actually validating the data, it tries to lighten the manual burden on the 
researchers (and help with consistent nomenclature) by pulling sample names out 
of the raw files, massaging them, and filling those in along with a mass of 
common data that is used to populate drop-downs in the excel columns to avoid 
researcher typos and value variants.

Having everything work with excel actually made the site more attractive to the 
researchers, because they're comfortable with it and use it already, so it 
lowered the bar for using our software.

Besides, we don't trust the users enough to enter data unsupervised.  There are 
a lot of aspects of the data that cannot be automatically validated and involve 
experimental parameters that are adjacent to the purpose of our site.  We have 
curators that need to look at everything to ensure consistency, and looking at 
all the data in context is necessary before any of it is entered.

That said, back in the aughts, I wrote a perl cgi site for a toxin and 
virulence factor database that used a web interface for data entry and achieved 
the curation goal by saving a form of all inter-related data.  The submit 
button sent that form to a list of curators who could approve the insert/update 
and make it actually happen.  I think I had actually suggested that form of 
data entry when this current project first started, but I was overruled.  
However, in this project, the equivalent procedure would be per-sample, and 
you'd lose out on the overall context.  It's an interesting challenge, but I 
think we're pretty committed to this file-load path now.

>>> Where are the background processes loading data to?
>> We first run our loads on a development server with a separate copy of the 
>> database, but then we re-run those same loads on the production server, 
>> where users perform their validations.
> 
> The load to the development server does no validation?
> 
> If so what is the purpose?
> 
> The background processes are other validation runs?

It's the same code that executes in both cases (with or without the 
`--validate` flag).  All that flag does is (effectively) raise the dry run 
exception before leaving the transaction block, so it always validates (whether 
the flag is supplied or not).

So the load doesn't fail until the end of the run, which is inefficient from a 
maintenance perspective.  I've been thinking of adding a `--failfast` option 
for use on the back end.  Haven't done it yet.  In fact, I started a load 
yesterday that ran 2 hours before it buffered an exception related to a newly 
introduced bug.  I fixed the bug and ran the load again.  It finished sometime 
between COB yesterday and this morning (successfully!).
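
The `--failfast` idea is just a switch on the existing error buffering, roughly 
like this sketch (not the actual loader code):

    class ErrorBuffer:
        """Collect exceptions for the end-of-run report, or abort immediately."""

        def __init__(self, failfast=False):
            self.failfast = failfast
            self.exceptions = []

        def buffer(self, exc):
            if self.failfast:
                raise exc  # surface the first problem instead of running for hours
            self.exceptions.append(exc)

    # The management command would expose it as:
    #     parser.add_argument("--failfast", action="store_true")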

>> One of the thoughts I'd had to work around the problem was to somehow dump 
>> the data from the development load and load it onto production in some sort 
>> of scheduled downtime or something.  However, even if we do that, I'm 
>> concerned that multiple users concurrently validating different submissions 
>> would encounter this blocking issue, and since those validations can take 
>> (on the upper end) just under a minute, it's enough for at least 1 user to 
>> encounter a timeout.  I have not yet proven that can happen, but based on my 
>> reading of the postgres documentation, it seems logical.
> 
> Seems you are looking for some sort of queuing system.
> 
> What are the time constraints for getting the validation turned around?

I have considered a queuing system, though when I previously floated a proof of 
concept using celery, I was informed it was too much.  Though, at the time, all 
I was trying to do was a progress bar for a query stats feature.  So proposing 
celery in this instance may get more traction with the rest of the team.

Most of the small validation processes finish in under a dozen seconds.  The 
largest studies take just under a minute.  I have plans to optimize the loading 
scripts that hopefully could get the largest studies down to a dozen seconds.  
If I could do that, and do the back end loads in off-peak hours, then I'd be 
willing to suffer the rare timeouts from concurrent validations.  The raw data 
loads will still likely take a much longer time.
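
If we do go the queuing route, the minimal shape of it would be something like 
the sketch below (task and function names are placeholders for the existing 
validation entry point):

    # tasks.py (sketch): every validation request gets handed to celery.
    from celery import shared_task

    @shared_task(soft_time_limit=600)  # 10-minute cap, chosen arbitrarily
    def validate_submission_task(submission_id):
        # validate_study_submission is a placeholder for the loader entry point
        # run in --validate (dry-run rollback) mode.
        return validate_study_submission(submission_id)

    # views.py (sketch): the view only enqueues and returns an id to poll.
    from django.http import JsonResponse

    def validate_view(request):
        result = validate_submission_task.delay(request.POST["submission_id"])
        return JsonResponse({"task_id": result.id, "status": "queued"})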


Robert William Leach
Research Software Engineer
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544

How to perform a long running dry run transaction without blocking

2025-02-06 Thread Robert Leach
I've been trying to solve this problem in Django, and after over a year of 
going down this path, I've finally decided that there's no way to solve it in 
Django (in a database-agnostic fashion). So I'm thinking that I need to explore 
a Postgres-specific solution. FYI, I'm not a database expert. I just know 
enough to get myself into trouble like this.

Let me try and distill all the back story down to an oversimplified explanation:

I created a scientific data submission validation interface that helps 
researchers compile their data submissions to our database. To do this, I 
decided to add a `--validate` option to the load scripts that essentially 
raises a dry run exception at the end of a load run so that the data is all 
rolled back before leaving the atomic transaction block.
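
The pattern is roughly the following (a simplified sketch; the real loaders 
also buffer per-row errors in smaller nested transactions, and 
`load_everything` is a placeholder):

    from django.core.management.base import BaseCommand
    from django.db import transaction

    class DryRun(Exception):
        """Raised after a successful validation pass to force a rollback."""

    class Command(BaseCommand):
        def add_arguments(self, parser):
            parser.add_argument("--validate", action="store_true")

        def handle(self, *args, **options):
            try:
                with transaction.atomic():
                    self.load_everything(options)  # placeholder for the real loaders
                    if options["validate"]:
                        raise DryRun()  # roll back everything just inserted
            except DryRun:
                self.stdout.write("Validation passed; all changes rolled back.")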

This validation interface skips the raw data load step (the heftiest, 
longest-running part), so it generally finishes in well under a minute.

The interface works fantastically well. It rolls back problematic data in 
smaller transactions and buffers those errors for a final report. The user can 
then fix their data files and retry validation until all their problems are 
solved and the data is ready to load.

The problem is that if we have a long running load going on in the background 
(which can take hours), and during that load a user tries to validate a new 
submission that contains some overlapping common data (e.g. a new tissue type 
or protocol that is currently being loaded), or perhaps tries to validate data 
added to an existing submission that is being partially loaded, that validation 
process gets blocked and the validation interface encounters a gateway timeout.

I had tried changing the isolation level to both repeatable read and 
serializable, but the hang can still occur (though serializable seems to avoid 
hangs in some cases that repeatable read does not).
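
(For reference, switching the isolation level is just a settings change in 
Django, roughly like this sketch with psycopg2 constants and example values; 
the default is otherwise READ COMMITTED.)

    # settings.py (sketch)
    import psycopg2.extensions

    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "tracebase",  # example value
            "OPTIONS": {
                "isolation_level": psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE,
            },
        },
    }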

My initial interpretation of the isolation level documentation was that 
transactions would not be aware of what is happening in other transactions, and 
that if there was a problem, you would only ever get a serialization error at 
the end, when the data is committed.  But after encountering the hang 
yesterday, I realized that even with its own snapshot, the second transaction 
still has to wait on row locks already established by the first transaction.

I had hoped that since my validation interface would never commit, it would 
never have a serialization error or hang.

What options are available that might allow this all to work as I intended? Is 
there some way to tell the validation process transaction to assume that any 
pending changes from another transaction will succeed?

I have lots of ideas on how to mitigate these occurrences, but I don't want to 
bias any potentially helpful responses. Am I screwed from my early decision to 
use the loading code to validate data submissions?