If you like Java instead of Python, here’s a skeletal program: https://lucidworks.com/post/indexing-with-solrj/
It’s simple and single-threaded, but could serve as a basis for something along the lines that Walter suggests.

And I absolutely agree with Walter that the DB is often where the bottleneck lies. You might be able to use multiple threads and/or processes to query the DB if that’s the case and you can find some kind of partition key.

You also might (and it depends on the Solr version) be able to wrap a jdbc stream in an update decorator.

https://lucene.apache.org/solr/guide/8_0/stream-source-reference.html

https://lucene.apache.org/solr/guide/8_0/stream-decorator-reference.html

Best,
Erick

> On Nov 29, 2020, at 3:04 AM, Walter Underwood <wun...@wunderwood.org> wrote:
>
> I recommend building an outboard loader, like I did a dozen years ago for
> Solr 1.3 (before DIH) and did again recently. I’m glad to send you my Python
> program, though it reads from a JSONL file, not a database.
>
> Run a loop fetching records from a database. Put each record into a
> synchronized (thread-safe) queue. Run multiple worker threads, each pulling
> records from the queue, batching them up, and sending them to Solr. For
> maximum indexing speed (at the expense of query performance), count the
> number of CPUs per shard leader and run two worker threads per CPU.
>
> Adjust the batch size to be maybe 10k to 50k bytes. That might be 20 to 1000
> documents, depending on the content.
>
> With this setup, your database will probably be your bottleneck. I’ve had
> this index a million (small) documents per minute to a multi-shard cluster,
> from a JSONL file on local disk.
>
> Also, don’t worry about finding the leaders and sending the right document
> to the right shard. I just throw the batches at the load balancer and let
> Solr figure it out. That is super simple and amazingly fast.
>
> If you are doing big batches, building a dumb ETL system with JSONL files in
> Amazon S3 has some real advantages. It allows loading prod data into a test
> cluster for load benchmarks, for example. Also good for disaster recovery,
> just load the recent batches from S3. Want to know exactly which documents
> were in the index in October? Look at the batches in S3.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Nov 28, 2020, at 6:23 PM, matthew sporleder <msporle...@gmail.com> wrote:
>>
>> I went through the same stages of grief that you are about to start
>> but (luckily?) my core dataset grew some weird cousins and we ended up
>> writing our own indexer to join them all together/do partial
>> updates/other stuff beyond DIH. It's not difficult to upload docs but
>> is definitely slower so far. I think there is a bit of a 'clean core'
>> focus going on in solr-land right now and DIH is easy(!) but it's also
>> easy to hit its limits (atomic/partial updates? wtf is an "entity?"
>> etc) so anyway try to be happy that you are aware of it now.
>>
>> On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
>>>
>>> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>>>
>>>> ... The bottom of
>>>> that github page isn't hopeful however :)
>>>
>>> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
>>> JAR" :)
>>>
>>> It's a more general question though, what is the path forward for users
>>> with data in two places? Hope that a community-maintained plugin
>>> will still be there tomorrow? Dump our tables to CSV (and POST them) and
>>> roll our own delta-updates logic? Or are we to choose one datastore and
>>> drop the other?
>>>
>>> Dima
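
For anyone reading this thread later, here is a minimal Python sketch of the loader design Walter describes above: one producer loop feeding a thread-safe queue, and several worker threads that batch records by size and send each batch to Solr. This is not Walter's program. The names `run_loader`, `batch_worker`, and `make_solr_sender` are made up for illustration, the update URL you pass in is whatever your load balancer exposes, and the sketch assumes commits are handled elsewhere (for example by autoCommit). There is no retry or error handling.

```python
import json
import queue
import threading
import urllib.request

_STOP = object()  # sentinel telling a worker that no more records are coming


def make_solr_sender(update_url):
    """Return a function that POSTs a list of docs as a JSON array.

    update_url is an assumption, e.g. the /update handler of a collection
    behind a load balancer. No commit is requested here.
    """
    def send(batch):
        req = urllib.request.Request(
            update_url,
            data=json.dumps(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            resp.read()
    return send


def batch_worker(q, send_batch, max_batch_bytes):
    """Pull records off the queue, group them by encoded size, send them."""
    batch, size = [], 0
    while True:
        doc = q.get()
        if doc is _STOP:
            break
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        # Flush before this doc would push the batch past the size limit.
        if batch and size + doc_bytes > max_batch_bytes:
            send_batch(batch)
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:  # send whatever is left over
        send_batch(batch)


def run_loader(records, send_batch, num_workers=4, max_batch_bytes=50_000):
    """Feed records (any iterable of dicts) to num_workers batching threads."""
    q = queue.Queue(maxsize=10_000)  # bounded, so a fast producer blocks
    workers = [
        threading.Thread(target=batch_worker,
                         args=(q, send_batch, max_batch_bytes))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    for rec in records:  # in practice: rows from a DB cursor or JSONL lines
        q.put(rec)
    for _ in workers:  # one stop marker per worker
        q.put(_STOP)
    for w in workers:
        w.join()
```

To try it without a Solr instance, pass any callable as `send_batch`, for example `run_loader(rows, print, num_workers=2)`. With a real cluster, `send_batch` would be `make_solr_sender(...)` pointed at your own update URL, and `num_workers` would follow Walter's rule of thumb of two threads per CPU on the shard leaders.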