On 9/1/2014 7:19 AM, Jack Krupansky wrote:
> It would be great to have a "standalone DIH" that runs as a separate
> server and then sends standard Solr update requests to a Solr cluster.

This has been discussed, and I thought we had an issue in Jira, but I can't find it. A completely standalone DIH app would be REALLY nice.

I already know that the JDBC ResultSet is not the bottleneck for indexing, at least for me. I once built a simple single-threaded SolrJ application that pulls data from JDBC and indexes it in Solr. It works in batches, typically 500 or 1000 docs at a time. When I comment out the "solr.add(docs)" line (so input object manipulation, casting, and building of the SolrInputDocument objects still happen), it can read and manipulate our entire database (99.8 million documents) in about 20 minutes, but if I leave that line in, it takes many hours.

The bottleneck is that each DIH has only a single thread indexing to Solr. I've theorized that it should be *relatively* easy for me to write an application that pulls records off the JDBC ResultSet and hands them to multiple threads (say 10-20), with each thread figuring out which shard its document lands on and sending it there with SolrJ. It might even be possible for the threads to collect several documents for each shard before indexing them in the same request.

As with most multithreaded apps, the hard part is getting the thread synchronization right, so that the threads cooperate correctly without unnecessary delays.

If I can figure out a generic approach (with a few configurable bells and whistles available), it might be something suitable for inclusion in the project, followed by improvements from all the smart people in our community.

Thanks,
Shawn
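
P.S. A rough sketch of the multithreaded idea above, using only the Java standard library so it runs without Solr. Everything named here is hypothetical: ShardedIndexer, ShardSender, and shardFor are made up for illustration, document IDs (plain strings) stand in for SolrInputDocument objects, and the hash-mod routing is an assumption that does NOT match SolrCloud's real document router. In a real application, ShardSender.send would wrap a SolrJ add call against the client for that shard. One reader thread feeds a bounded queue; each worker keeps its own per-shard buffers, so the only shared state is the queue itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

final class ShardedIndexer {
    /** Hypothetical stand-in for "send this batch to that shard"; not a SolrJ type. */
    interface ShardSender {
        void send(int shard, List<String> docIds) throws Exception;
    }

    // Unique end-of-input marker, compared by identity so no real ID can match it.
    private static final String POISON = new String("POISON");

    /** Assumption: simple hash-mod routing, for illustration only. */
    public static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void index(Iterable<String> docIds, int numShards, int numThreads,
                             int batchSize, ShardSender sender) throws Exception {
        // Bounded queue: the reader blocks instead of buffering the whole database.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(2 * numThreads * batchSize);
        AtomicReference<Exception> failure = new AtomicReference<>();

        Callable<Void> worker = () -> {
            // Per-thread, per-shard buffers: no locking needed while batching.
            List<List<String>> buffers = new ArrayList<>();
            for (int s = 0; s < numShards; s++) buffers.add(new ArrayList<>());
            for (String id = queue.take(); id != POISON; id = queue.take()) {
                if (failure.get() != null) continue; // keep draining so the reader never blocks
                int shard = shardFor(id, numShards);
                List<String> buf = buffers.get(shard);
                buf.add(id);
                if (buf.size() >= batchSize) flush(shard, buf, sender, failure);
            }
            // Send whatever is left in the partially filled buffers.
            for (int s = 0; s < numShards; s++) flush(s, buffers.get(s), sender, failure);
            return null;
        };

        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<Void>> workers = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) workers.add(pool.submit(worker));
        try {
            for (String id : docIds) queue.put(id);
        } finally {
            // One marker per worker so every thread terminates.
            for (int t = 0; t < numThreads; t++) queue.put(POISON);
            pool.shutdown();
        }
        for (Future<Void> w : workers) w.get();
        if (failure.get() != null) throw failure.get();
    }

    private static void flush(int shard, List<String> buf, ShardSender sender,
                              AtomicReference<Exception> failure) {
        if (buf.isEmpty() || failure.get() != null) return;
        try {
            sender.send(shard, new ArrayList<>(buf)); // copy, since buf is reused
        } catch (Exception e) {
            failure.compareAndSet(null, e); // remember only the first error
        }
        buf.clear();
    }
}
```

Calling ShardedIndexer.index(ids, 4, 10, 500, sender) would route IDs across 4 shards with 10 worker threads and batches of up to 500. Keeping the buffers private to each thread trades slightly smaller final batches for much simpler synchronization, which seemed like the right default for a first cut.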