I'm a little late to this party, but here is the approach I'd take:

1. Use a batch size of N (in your case this is 500, but it is really DB-specific as to what can be handled with an IN clause).
2. Read in N (or however many remain) records from your data file and call your inserter method with those records (unless you've finished reading the file, of course).
3. In your inserter method, put together a SelectQuery using an IN expression (ExpressionFactory.inExp or ExpressionFactory.inDbExp) based upon your unique constraint values.
4. Perform that query in a new DataContext. If you get back ANY results, those are your duplicates. Handle the duplicates right there (you have a List of them now -- no need to parse exception strings, etc).
5. After handling the duplicates, do your batch insert.
6. Continue at step 2 until finished.

I'd expect this approach to be pretty quick, as it gives you the advantage of working with larger batches of data and also makes it easy to find the problem records.

mrg

On Fri, Sep 28, 2018 at 12:40 PM Tony Giaccone <[email protected]> wrote:

> This answer is probably way off topic for this group as it's not really
> Cayenne related. But I want to give some closure to the solution so that
> anyone else who ends up reading this can see the totality of the problem
> and solution.
>
> My original thought was: I don't have to test for uniqueness, because the
> database will tell me if I violate the unique key constraint. The problem
> with doing it on the Java side of persistence is that as the number of
> inserted records grows, so does the time a query takes to test for
> uniqueness. If you're inserting a million records, that starts to
> accumulate. The database has to enforce this constraint anyway, so putting
> it in your Java code is just a duplication of effort.
>
> That process works fine as long as you're inserting one record at a time.
> In that case the lack of identifying information isn't a problem, because
> you know that the object you're inserting caused the error. The solution
> is to log the offending row and off you go, to the next record. Simple and
> direct.
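[Editor's note] The batch-and-pre-check loop mrg describes in steps 1-6 can be sketched in plain Java. This is a minimal, self-contained simulation, not Cayenne API: a HashSet stands in for the IN-expression SelectQuery against the unique column, and the names (BatchInserter, insertBatch, BATCH_SIZE) are illustrative.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the batch/duplicate-check loop. A Set stands in for the
// database's unique index (in real code, step 3's IN-expression query);
// all names here are illustrative, not from Cayenne.
public class BatchInserter {
    static final int BATCH_SIZE = 500;

    final Set<String> existingKeys = new HashSet<>(); // stand-in for the DB table
    final List<String> inserted = new ArrayList<>();
    final List<String> duplicates = new ArrayList<>();

    // Steps 1-2 and 6: walk the records in batches of BATCH_SIZE.
    void insertAll(List<String> records) {
        for (int i = 0; i < records.size(); i += BATCH_SIZE) {
            insertBatch(records.subList(i, Math.min(i + BATCH_SIZE, records.size())));
        }
    }

    void insertBatch(List<String> batch) {
        // Steps 3-4: one lookup for the whole batch; any hits are duplicates,
        // available as a List right here -- no exception parsing needed.
        List<String> dupes = new ArrayList<>();
        for (String key : batch) {
            if (existingKeys.contains(key)) {
                dupes.add(key);
            }
        }
        duplicates.addAll(dupes);

        // Step 5: batch-insert everything that was not a duplicate.
        for (String key : batch) {
            if (!dupes.contains(key) && existingKeys.add(key)) {
                inserted.add(key);
            }
        }
    }
}
```

The point of the shape is that duplicates surface as data (a List) before the commit, rather than as a constraint-violation exception after it.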
>
> Where that ability fails is when you start "batching" your objects and
> commit multiple objects at once. At that point, relying on the database
> becomes problematic in a few ways. First, the transaction fails on the
> first failure. Which is fine: the DB only cares about the transaction, and
> it failed. What to do about it is, from the DB's perspective, someone
> else's problem. From the Java app's perspective, it's possible there are
> more problems in that batch, and solving that is going to be difficult
> and/or time consuming. The second issue is that, short of parsing the
> error string, there's no easy way to figure out which insert caused the
> problem. All these inserts are to the same table, so just finding the ID
> that caused the problem would be enough. If I had to do this, I could. But
> change the database and I need a new implementation. I hate that idea.
>
> In the end this was a case of premature optimization. Running a sample job
> of one million lines showed that without batching the app only takes about
> 4 minutes to insert those rows. Would I like that to be faster? I would.
> But is it acceptable? Yes.
>
> Tony Giaccone
>
> On Fri, Sep 28, 2018 at 11:17 AM Bob Schellink <[email protected]> wrote:
>
> > Would be nice if Cayenne could pinpoint the problematic record, but I
> > think that will depend on what error info the database provides.
> >
> > Thinking of a strategy to get speed and correctness, how about an
> > optimistic strategy:
> >
> > Process a batch of 500 records, and if the commit fails, only then
> > process that batch's records one by one (or chunk the 500 batch into
> > smaller batches) until you find the guilty object, then continue with
> > the 500 batches again.
> >
> > You pay a price, though, with more complex code. ;-)
> >
> > Kind regards
> >
> > Bob
> >
> > On Fri, Sep 28, 2018 at 4:28 PM Tony Giaccone <[email protected]> wrote:
> >
> > > Yeah, that's pretty much what I ended up doing.
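[Editor's note] Bob's optimistic strategy above amounts to a bisecting retry: commit the whole batch, and only on failure split it until the guilty record is isolated. Here is a hedged sketch; the Consumer-based committer is a hypothetical stand-in for an ObjectContext commit that throws on a constraint violation (a real Cayenne context would also need rollbackChanges() before retrying).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the optimistic commit strategy: try the whole batch at once,
// and only on failure bisect it to isolate the guilty record(s). The
// committer is a hypothetical stand-in for an ObjectContext commit that
// throws on a unique-constraint violation.
public class OptimisticCommitter {
    private final Consumer<List<String>> committer;
    final List<String> rejected = new ArrayList<>();

    OptimisticCommitter(Consumer<List<String>> committer) {
        this.committer = committer;
    }

    void commit(List<String> batch) {
        if (batch.isEmpty()) {
            return;
        }
        try {
            committer.accept(batch); // fast path: one commit for the whole batch
        } catch (RuntimeException e) {
            if (batch.size() == 1) {
                rejected.add(batch.get(0)); // found the guilty object
                return;
            }
            int mid = batch.size() / 2; // split and retry each half
            commit(batch.subList(0, mid));
            commit(batch.subList(mid, batch.size()));
        }
    }
}
```

With clean data this costs one commit per 500 records; each bad record adds only O(log 500) extra commits, which is Bob's speed/correctness trade-off in code form.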
> > > Even reading the file line by line and doing an insert after each
> > > object is created only made the run time go to 4 minutes, and I can
> > > live with that. What I really wanted to do was find a way to recover
> > > from a larger commit. It seems that's not really possible. The one
> > > feature that would make that failure easier to deal with would be some
> > > kind of data value in the commit error that would identify the class
> > > and key value of the object that caused the commit exception. I
> > > recognize that the value is there in the text, but parsing through that
> > > text message to find the value is a serious hack. It would be better if
> > > the framework included in the commit exception the class type and the
> > > key value of the entity that caused the problem.
> > >
> > > Now maybe in the larger scheme of things, it doesn't make sense to
> > > identify which item in the set of items being committed caused the
> > > problem. It's clear it makes sense in my use case, but in the general
> > > use case, maybe not.
> > >
> > > Tony
> > >
> > > On Thu, Sep 27, 2018 at 5:10 PM John Huss <[email protected]> wrote:
> > >
> > > > Commit the ObjectContext after each object/row and roll back the
> > > > ObjectContext on failure.
> > > >
> > > > On Thu, Sep 27, 2018 at 3:57 PM Tony Giaccone <[email protected]> wrote:
> > > >
> > > > > So the question isn't as much about how to manage the transaction.
> > > > > It's more about how to recover and eliminate the offending object
> > > > > so that the commit can be made again.
> > > > >
> > > > > On Thu, Sep 27, 2018 at 3:52 PM John Huss <[email protected]> wrote:
> > > > >
> > > > > > I'd just wrap the whole thing in a database transaction. Then
> > > > > > commit your ObjectContexts as often as you want to, but the real
> > > > > > DB commit won't happen until the end.
> > > > > >
> > > > > > TransactionManager transactionManager = CayenneRuntime
> > > > > >     .getThreadInjector()
> > > > > >     .getInstance(TransactionManager.class);
> > > > > >
> > > > > > transactionManager.performInTransaction(new TransactionalOperation<Void>() {
> > > > > >     @Override
> > > > > >     public Void perform() {
> > > > > >         // commit ObjectContexts here
> > > > > >         return null;
> > > > > >     }
> > > > > > });
> > > > > >
> > > > > > On Thu, Sep 27, 2018 at 2:36 PM Tony Giaccone <[email protected]> wrote:
> > > > > >
> > > > > > > I'm processing a large number of rows, over 600,000. The key
> > > > > > > value should be unique in this file, but I'd like to ensure
> > > > > > > that. I also want this to happen with some rapidity. To speed
> > > > > > > this process up I'm going to read lines from the file, create
> > > > > > > objects, and commit the changes after 500 have been created.
> > > > > > >
> > > > > > > The problem with this is that if I have a duplicate value I
> > > > > > > won't catch it till I do the commit.
> > > > > > >
> > > > > > > When I insert a second copy of a key value, the first exception
> > > > > > > is a DB-level org.postgresql.util.PSQLException; eventually this
> > > > > > > gets wrapped in a Cayenne commit error.
> > > > > > >
> > > > > > > So I'd like to get a sense of what folks think, given that I
> > > > > > > want to balance these conflicting goals of speed and accuracy.
> > > > > > >
> > > > > > > Can I easily figure out what object or objects caused the
> > > > > > > error, and can I exclude them from the context and redo the
> > > > > > > commit?
> > > > > > >
> > > > > > > Is this a reasonable path to follow?
> > > > > > >
> > > > > > > Tony Giaccone
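[Editor's note] For completeness, the row-at-a-time fallback the thread converges on (commit each record; on failure, roll back and log the offending row) can be sketched like this. The Predicate is a hypothetical stand-in for a per-row ObjectContext commit; in real Cayenne code a failed commitChanges() would be followed by rollbackChanges() so the next row starts from a clean context. All names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the row-at-a-time approach: commit after every record, and on
// failure log the offending row and move on. The Predicate (true = commit
// succeeded) stands in for ObjectContext.commitChanges() plus
// rollbackChanges() on failure.
public class RowAtATimeLoader {
    final List<String> loaded = new ArrayList<>();
    final List<String> failed = new ArrayList<>();
    private final Predicate<String> commitRow;

    RowAtATimeLoader(Predicate<String> commitRow) {
        this.commitRow = commitRow;
    }

    void load(Iterable<String> lines) {
        for (String line : lines) {
            if (commitRow.test(line)) {
                loaded.add(line);  // row committed
            } else {
                failed.add(line);  // rolled back; the guilty row is known exactly
            }
        }
    }
}
```

This is the slowest variant (one round trip per row), but, as Tony reports above, it identified every bad row directly and still finished a million lines in about 4 minutes.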
