On 2/15/2018 12:34 AM, LOPEZ-CORTES Mariano-ext wrote: > We've done the following test: From a java program, we read chunks of data > from Oracle and inject to Solr (via Solrj). > > The problem : It is really really slow (1'5 nights). > > Is there one faster method to do that ?
Are you indexing with a single thread? The way to speed up indexing is to index with many threads or processes simultaneously. Using ConcurrentUpdateSolrClient as Michal mentioned is one way to get multi-threading, but if you go this route, your program will never know about any indexing errors. Errors will be logged, but your program won't know about them. This client is good for initial bulk indexing, but when things become more automated, you're probably going to want automatic detection and notification when there's a problem. If you care about error handling, then you're going to have to handle multiple threads or processes in your own program. If you don't care about error handling, then go ahead and use ConcurrentUpdateSolrClient. But if your database is the bottleneck, that will not make things faster. For something later in the thread: One common problem with dataimport and large tables is that almost every JDBC driver will read the entire result of the SELECT statement into memory before providing that information to the program that did the SELECT. For large tables, this information can be larger than the Java heap, and that will cause the program to encounter an OutOfMemoryError. To solve this, you will need to ask Oracle how to disable this behavior with their JDBC driver. For MySQL, the solution is to set the batchSize parameter in DIH to -1, which results in DIH setting a JDBC fetch size of Integer.MIN_VALUE ... which tells the MySQL driver to stream results instead of putting them all into memory. For Microsoft SQL server, you need a URL parameter for the JDBC url, or simply to upgrade the JDBC driver to a newer version that doesn't do this by default. I have not been able to figure out how to get the Oracle driver to do it. Chances are that it will be a JDBC url parameter. https://wiki.apache.org/solr/DataImportHandlerFaq Thanks, Shawn