Hi Everyone,
We are indexing quite a lot of data using the update/csv handler. For
reasons I can't get into right now, I can't use the DataImportHandler
(DIH): I can only access the DB through stored procedures, and stored
procedure support in DIH is not yet available. Indexing takes about
3 hours, and I don't want to tax the server too much during indexing,
so I came up with a two-server solution: an indexing server indexes
the file every night and then copies the index to the search server.
Maintaining a full-fledged Tomcat/Jetty just for indexing is too much
of a pain, so I wrote a small Java utility class that starts an
embedded Solr server, indexes the CSV, and shuts the server down.
I would like the community's input on this solution.
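For the nightly copy step described above, here is a minimal sketch, assuming Java 8 or later and that both index directories are reachable as file system paths from the indexing machine. IndexCopier and copyIndex are hypothetical names used for illustration and are not part of Solr. The sketch copies into whatever target directory it is given, so pointing it at a staging directory rather than the directory the search server is reading from seems the safer choice.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

/** Hypothetical helper (not part of Solr): recursively copies an index directory. */
final class IndexCopier {

    /** Copies every file and sub-directory under source into target, overwriting existing files. */
    static void copyIndex(Path source, Path target) throws IOException {
        Files.createDirectories(target);
        // Files.walk visits a directory before its contents, so parents exist before children are copied.
        try (Stream<Path> paths = Files.walk(source)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path dest = target.resolve(source.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(dest);
                } else {
                    Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

A call would look like IndexCopier.copyIndex(Paths.get("/tmp/solr/data"), stagingDir), where stagingDir is an assumed location on the search server's disk or a mounted share.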
Is this okay to do?
Is there a better way to do this without running two separate servers?
Is my class safe enough to run every night in a production environment?
Here's my utility class. This is just a POC, and before I productionize
it, I would like some input from the Solr czars here.
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.core.SolrCore;
import java.io.File;

public class StandaloneSolrIndexer {
    public static void main(String[] args) throws Exception {
        SolrCore core = null;
        CoreContainer container = null;
        try {
            // Build a single core by hand from /tmp/solr and register it with the container
            container = new CoreContainer();
            SolrConfig config = new SolrConfig("/tmp/solr", "solrconfig.xml", null);
            CoreDescriptor descriptor = new CoreDescriptor(container, "core1", "/tmp/solr");
            core = new SolrCore("core1", "/tmp/solr/data", config, null, descriptor);
            container.register("core1", core, false);
            SolrServer server = new EmbeddedSolrServer(container, "core1");

            // Start by deleting everything
            server.deleteByQuery("*:*");

            // Load the tab-separated file through the CSV handler and commit when done
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
            req.addFile(new File("/tmp/product-5k.tsv"));
            req.setParam("commit", "true");
            req.setParam("stream.contentType", "text/plain;charset=utf-8");
            req.setParam("escape", "\\");
            req.setParam("separator", "\t");
            // Field names are supplied explicitly, so the file's header line is skipped
            req.setParam("fieldnames",
                    "product_id,account_id,name,category_tags,short_desc,upc,manu_mdl_num,ext_prd_id,brand,long_desc,sku,seller,seller_email,vertical,cat,subcat");
            req.setParam("skipLines", "1");

            NamedList<Object> result = server.request(req);
            System.out.println("Result "
                    + "====================================================================================: \n"
                    + result);
        } finally {
            // Release the core and shut the container down even if indexing fails
            if (core != null) core.close();
            if (container != null) container.shutdown();
        }
    }
}
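Since the run starts with deleteByQuery("*:*"), a pre-flight check on the input file before anything is deleted seems worthwhile for a nightly job. The following is only a sketch under stated assumptions: Java 7 or later, a UTF-8 file whose first line is the header (matching skipLines=1), and a hypothetical helper named TsvPreflight that is not part of Solr. It counts non-blank data lines only; it does not parse columns or honor the escape character.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Solr): sanity-checks the input file before the index is touched. */
final class TsvPreflight {

    /**
     * Returns the number of non-blank lines after the header line.
     * Throws IOException if the file is missing or has fewer than minRows data lines.
     */
    static long requireDataRows(Path tsv, long minRows) throws IOException {
        if (!Files.isRegularFile(tsv)) {
            throw new IOException("Input file not found: " + tsv);
        }
        long rows = 0;
        try (BufferedReader in = Files.newBufferedReader(tsv, StandardCharsets.UTF_8)) {
            String line = in.readLine(); // header line, skipped just as skipLines=1 does
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    rows++;
                }
            }
        }
        if (rows < minRows) {
            throw new IOException("Expected at least " + minRows + " data rows but found " + rows);
        }
        return rows;
    }
}
```

Calling TsvPreflight.requireDataRows(Paths.get("/tmp/product-5k.tsv"), 1) as the first statement in main would abort the run before the delete if the file were missing or empty; the threshold of 1 is an arbitrary assumption.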
Thanks,
Rohit