Hi,

@Jack, the final goal is to generate the index outside of Solr Cloud, but running DIH externally is not bad.
@Shawn, it sounds great to build a new application that works with multiple threads and sends documents to their shards.
Please let me know the logic: how can I decide which document should go to which shard (i.e. the matching rule between document and shard)?

Thanks,
Chunki.

On Sep 2, 2014, at 1:15 AM, Siegfried Goeschl <sgoes...@gmx.at> wrote:

> Hi folks,
>
> we are using Apache Camel but could use Spring Integration with the option to
> upgrade to Apache BatchEE or Spring Batch later on - especially Tika
> document extraction can kill your server due to CPU consumption, memory usage
> and plain memory leaks
>
> AFAIK Doug Turnbull also improved the Camel Solr Integration
>
> http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/99739
>
> Cheers,
>
> Siegfried Goeschl
>
> On 01.09.14 18:05, Jack Krupansky wrote:
>> Packaging SolrCell in the same manner, with parallel threads and able to
>> talk to multiple SolrCloud servers in parallel would have a lot of the
>> same benefits as well.
>>
>> And maybe there could be some more generic Java framework for indexing
>> as well, that "external indexers" in general could use.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Shawn Heisey
>> Sent: Monday, September 1, 2014 11:42 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: external indexer for Solr Cloud
>>
>> On 9/1/2014 7:19 AM, Jack Krupansky wrote:
>>> It would be great to have a "standalone DIH" that runs as a separate
>>> server and then sends standard Solr update requests to a Solr cluster.
>>
>> This has been discussed, and I thought we had an issue in Jira, but I
>> can't find it.
>>
>> A completely standalone DIH app would be REALLY nice. I already know
>> that the JDBC ResultSet is not the bottleneck for indexing, at least for
>> me. I once built a simple single-threaded SolrJ application that pulls
>> data from JDBC and indexes it in Solr. It works in batches, typically
>> 500 or 1000 docs at a time. When I comment out the "solr.add(docs)"
>> line (so input object manipulation, casting, and building of the
>> SolrInputDocument objects is still happening), it can read and
>> manipulate our entire database (99.8 million documents) in about 20
>> minutes, but if I leave that in, it takes many hours.
>>
>> The bottleneck is that each DIH has only a single thread indexing to
>> Solr. I've theorized that it should be *relatively* easy for me to
>> write an application that pulls records off the JDBC ResultSet with
>> multiple threads (say 10-20), have each thread figure out which shard
>> its document lands on, and send it there with SolrJ. It might even be
>> possible for the threads to collect several documents for each shard
>> before indexing them in the same request.
>>
>> As with most multithreaded apps, the hard part is figuring out all the
>> thread synchronization, making absolutely certain that thread timing is
>> perfect without unnecessary delays. If I can figure out a generic
>> approach (with a few configurable bells and whistles available), it
>> might be something suitable for inclusion in the project, followed with
>> improvements by all the smart people in our community.
>>
>> Thanks,
>> Shawn
>
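
For reference, the multithreaded idea quoted above (several worker threads, each deciding which shard a document belongs to and collecting a few documents per shard before sending) could look roughly like the sketch below. It is only an illustration under stated assumptions, not Shawn's code and not Solr's API: the names ShardBatchingSketch, ShardSender, shardFor and index are hypothetical, the list of ids stands in for rows read from the JDBC ResultSet, and ShardSender stands in for a per-shard SolrJ client. Most importantly, the routing rule here (CRC32 of the id modulo the shard count) is a placeholder. In a real SolrCloud collection the document-to-shard rule comes from the collection's router and the hash ranges in the cluster state, so the placeholder would have to be replaced by that logic, or by a cluster-aware SolrJ client if it already routes for you in your version.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.CRC32;

class ShardBatchingSketch {

    /** Hypothetical stand-in for "send this batch to that shard" (e.g. a SolrJ add call). */
    interface ShardSender {
        void send(int shard, List<String> docIds);
    }

    /** Placeholder routing rule (ASSUMPTION): CRC32 of the id modulo shard count. Not Solr's real rule. */
    static int shardFor(String id, int numShards) {
        CRC32 crc = new CRC32();
        crc.update(id.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numShards); // getValue() is non-negative
    }

    /** Splits the ids across worker threads; each worker batches per shard and flushes full batches. */
    static void index(List<String> ids, int numShards, int numThreads, int batchSize, ShardSender sender) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<?>> futures = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            final int worker = t;
            futures.add(pool.submit(() -> {
                // Per-thread buffers, one per shard, so workers never share mutable state.
                List<List<String>> buffers = new ArrayList<>();
                for (int s = 0; s < numShards; s++) {
                    buffers.add(new ArrayList<>());
                }
                // Each worker takes every numThreads-th id, starting at its own offset.
                for (int i = worker; i < ids.size(); i += numThreads) {
                    String id = ids.get(i);
                    int shard = shardFor(id, numShards);
                    List<String> buf = buffers.get(shard);
                    buf.add(id);
                    if (buf.size() >= batchSize) {
                        sender.send(shard, new ArrayList<>(buf));
                        buf.clear();
                    }
                }
                // Flush whatever is left in the partially filled buffers.
                for (int s = 0; s < numShards; s++) {
                    if (!buffers.get(s).isEmpty()) {
                        sender.send(s, new ArrayList<>(buffers.get(s)));
                    }
                }
            }));
        }
        pool.shutdown();
        try {
            for (Future<?> f : futures) {
                f.get(); // wait for the worker and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while indexing", e);
        } catch (ExecutionException e) {
            throw new RuntimeException("a worker failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            ids.add("doc-" + i);
        }
        // The sender only prints here; a real one would call the shard's Solr client.
        index(ids, 3, 4, 5, (shard, batch) ->
                System.out.println("shard " + shard + " <- " + batch));
    }
}
```

Giving each worker its own per-shard buffers avoids locking on the buffers, at the cost of somewhat smaller batches per request. The sender is called from several threads at once, so whatever replaces it has to be safe for concurrent use.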