Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Scott Bigelow
I've been using Solr for a while now, indexing 2-4 million records using the DIH to pull data from MySQL, which has been working great. For a new project, I need to index about 20M records (30 fields) and I have been running into issues with MySQL disconnects, right around 15M. I've tried several r

Re: Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Scott Bigelow
> the data.
>
> We've run into some troubles for the first 2 attempts, but setting
> batchSize="-1" for the dataSource resolved the issues.
>
> Do you need a lot of complex joins to import the data from mysql?
>
> -robert
>
> On 4/2
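For reference, the batchSize="-1" setting mentioned in the quoted reply goes on the DIH dataSource element. Below is a minimal sketch of a data-config.xml, assuming a hypothetical database (exampledb), table (records), columns, and credentials, none of which come from the thread. With JdbcDataSource, a batch size of -1 is passed to the driver as a fetch size of Integer.MIN_VALUE, which asks MySQL Connector/J to stream rows rather than buffer the whole result set in memory.

```xml
<dataConfig>
  <!-- Hypothetical connection details; only batchSize="-1" comes from the thread. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="solr"
              password="secret"
              batchSize="-1"/>
  <document>
    <!-- One large SELECT feeding the index; table and column names are placeholders. -->
    <entity name="record" query="SELECT id, title, body FROM records">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="body"/>
    </entity>
  </document>
</dataConfig>
```

Streaming keeps memory use on the Solr side flat, but the connection then stays open for the whole import, so server-side timeouts can still matter on very long runs.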

Re: Indexing 20M documents from MySQL with DIH

2011-04-21 Thread Scott Bigelow
Thanks for the e-mail. I probably should have provided more details, but I was more interested in making sure I was approaching the problem correctly (using DIH, with one big SELECT statement for millions of rows) instead of solving this specific problem. Here's a partial stacktrace from this speci

Re: Indexing 20M documents from MySQL with DIH

2011-04-24 Thread Scott Bigelow
> erational difference between a newly-rebuilt index
> and one that's been optimized. If you don't delete/update, there's not
> much reason to optimize either
>
> I'll leave the DIH to others..
>
> Best
> Erick
>
> On Thu, Apr 21, 2011 at 8:09 PM, Scot

DIH: Using MAX(PrimaryKey) as delta identifier instead of dataimporter.last_index_time

2011-04-24 Thread Scott Bigelow
In DataImportHandler, is it possible to use the prior maximum value of the PrimaryKey in the delta query, as opposed to (or in addition to) using "dataimporter.last_index_time"? We already have Created_On and Updated_On fields, but we've only indexed the Updated_On fields. I was hoping for somethin

Re: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler'

2011-04-26 Thread Scott Bigelow
I experienced the same issue. With Solr 1.x, I was copying out the 'example' directory to make my solr installation. However, for the Solr 3.x distributions, the DataImportHandler class exists in a directory that is at the same level as example: "dist", not a directory within. You'll either want t
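One way to make the class loadable from "dist" is a lib directive in solrconfig.xml. The fragment below is only a sketch, assuming an unpacked Solr 3.x distribution where the core's instance directory sits two levels below the distribution root; the relative path, the handler name /dataimport, and the file name data-config.xml are assumptions to adjust for a given layout.

```xml
<!-- Fragment of solrconfig.xml; the dir path is an assumption about the install layout. -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />

<!-- Register the handler once its jar is on the classpath. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
```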

DataImportHandler in Solr 3.1.0: not updating dataimport.properties last_index_time on delta-import?

2011-04-26 Thread Scott Bigelow
Title pretty much says it all; I've configured the DIH in 3.1.0, and it works great, except the delta-imports are always from the last time a full-import happened, not a delta-import. After a delta-import, dataimport.properties is completely untouched. The documentation implies that the delta-impor

Re: Indexing 20M documents from MySQL with DIH

2011-05-05 Thread Scott Bigelow
> s
> net_write_timeout, so it kills the connection.
> {quote}
>
> I was thinking about some hackish solution to paginate results
>
> Or something along those lines (you'd need to calculate offset in
> pages query)
>
> But unfortunately MySQL does not provi