On 9/1/2014 7:19 AM, Jack Krupansky wrote:
> It would be great to have a "standalone DIH" that runs as a separate
> server and then sends standard Solr update requests to a Solr cluster.

This has been discussed, and I thought we had an issue in Jira, but I
can't find it.

A completely standalone DIH app would be REALLY nice.  I already know
that the JDBC ResultSet is not the bottleneck for indexing, at least for
me.  I once built a simple single-threaded SolrJ application that pulls
data from JDBC and indexes it in Solr.  It works in batches, typically
500 or 1000 docs at a time.  When I comment out the "solr.add(docs)"
line (so input object manipulation, casting, and building of the
SolrInputDocument objects are all still happening), it can read and
manipulate our entire database (99.8 million documents) in about 20
minutes, but if I leave that line in, it takes many hours.
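
To make the batching pattern concrete, here is a minimal sketch, not the
actual application. The Solr call is hidden behind a plain Consumer so the
example has no SolrJ dependency; in a real SolrJ program that is where the
"solr.add(docs)" line would sit. The class name BatchIndexer, the method
indexInBatches, and the generic row type are all assumptions made up for
illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class BatchIndexer {
    /**
     * Pulls rows from the iterator (a stand-in for walking a JDBC ResultSet),
     * groups them into batches of at most batchSize, and hands each batch to
     * the sink. Returns the number of batches sent.
     */
    static <T> int indexInBatches(Iterator<T> rows, int batchSize,
                                  Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int sent = 0;
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);   // the hypothetical solr.add(docs) step
                sent++;
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {       // send the final partial batch
            sink.accept(batch);
            sent++;
        }
        return sent;
    }
}
```

Passing a no-op sink corresponds to commenting out the add call: the read
and document-building work still runs, which is how the two costs can be
timed separately.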

The bottleneck is that each DIH has only a single thread indexing to
Solr.  I've theorized that it should be *relatively* easy for me to
write an application that pulls records off the JDBC ResultSet with
multiple threads (say 10-20), have each thread figure out which shard
its document lands on, and send it there with SolrJ.  It might even be
possible for the threads to collect several documents for each shard
before indexing them in the same request.
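
One possible shape for that idea, as a rough sketch only. It assumes a
single reader feeding a queue (a JDBC ResultSet is generally not safe to
share between threads), worker threads that each keep their own per-shard
buffers, and a thread-safe sink standing in for the SolrJ request to a
shard. The shardFor method is a hypothetical placeholder (hash modulo shard
count), not Solr's actual routing; ShardedIndexer, run, and the sink
signature are likewise invented for illustration, and error handling is
left out.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.BiConsumer;

class ShardedIndexer {
    // Sentinel telling a worker to stop; compared by identity, not equals().
    private static final String POISON = new String("__END__");

    // Placeholder routing: NOT Solr's real hash-range routing.
    static int shardFor(String id, int numShards) {
        return Math.floorMod(id.hashCode(), numShards);
    }

    static void run(Iterator<String> ids, int numShards, int numThreads,
                    int batchSize, BiConsumer<Integer, List<String>> sink) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(batchSize * numThreads);
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (int t = 0; t < numThreads; t++) {
            pool.execute(() -> {
                // Buffers are local to this worker, so they need no locking.
                Map<Integer, List<String>> buffers = new HashMap<>();
                try {
                    while (true) {
                        String id = queue.take();
                        if (id == POISON) break;
                        int shard = shardFor(id, numShards);
                        List<String> buf =
                            buffers.computeIfAbsent(shard, k -> new ArrayList<>());
                        buf.add(id);
                        if (buf.size() >= batchSize) {
                            sink.accept(shard, new ArrayList<>(buf)); // one request per shard batch
                            buf.clear();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                // Flush whatever is left in each shard's buffer.
                buffers.forEach((shard, buf) -> {
                    if (!buf.isEmpty()) sink.accept(shard, buf);
                });
            });
        }
        try {
            while (ids.hasNext()) queue.put(ids.next());                  // single reader
            for (int t = 0; t < numThreads; t++) queue.put(POISON);       // one pill per worker
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }
}
```

The bounded queue is the only shared structure, so the reader simply
blocks when the workers fall behind. A worker that dies on an exception
is not replaced here, which a real tool would have to handle.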

As with most multithreaded apps, the hard part is figuring out all the
thread synchronization, making absolutely certain that thread timing is
perfect without unnecessary delays.  If I can figure out a generic
approach (with a few configurable bells and whistles available), it
might be something suitable for inclusion in the project, followed by
improvements from all the smart people in our community.

Thanks,
Shawn
