Hi Everyone,

We are indexing quite a lot of data using the /update/csv handler. For
reasons I can't get into right now, I can't use DIH: I can only access
the DB through stored procedures, and DIH does not support stored
procedures yet. Indexing takes about 3 hours, and I don't want to tax
the search server during that time, so I came up with a two-server
solution: an indexing server builds the index from the file every
night, and the index is then copied to the search server. Maintaining a
full-fledged Tomcat/Jetty just for indexing is too much of a pain, so I
wrote a small utility Java class that starts an embedded Solr server,
indexes the CSV, and shuts the server down. I would like the
community's input on this solution.

Is this okay to do?
Is there a better way to do this without running two separate servers?
Is my class safe enough to run every night in a production environment?

Here's my utility class. This is just a POC, and before I productionize
it, I would like some input from the Solr czars here.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.CoreDescriptor;
import org.apache.solr.core.SolrConfig;
import org.apache.solr.core.SolrCore;

import java.io.File;

public class StandaloneSolrIndexer {

    public static void main(String args[]) throws Exception {

        SolrCore core = null;
        CoreContainer container = null;
        try {
            container = new CoreContainer();

            SolrConfig config = new SolrConfig("/tmp/solr", "solrconfig.xml", null);
            CoreDescriptor descriptor = new CoreDescriptor(container, "core1", "/tmp/solr");

            core = new SolrCore("core1", "/tmp/solr/data", config, null, descriptor);
            container.register("core1", core, false);

            SolrServer server = new EmbeddedSolrServer(container, "core1");

            //Start by deleting everything
            server.deleteByQuery("*:*");

            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
            req.addFile(new File("/tmp/product-5k.tsv"));

            req.setParam("commit", "true");
            req.setParam("stream.contentType", "text/plain;charset=utf-8");
            req.setParam("escape", "\\");
            req.setParam("separator", "\t");
            req.setParam("fieldnames",
                    "product_id,account_id,name,category_tags,short_desc,upc,manu_mdl_num,ext_prd_id,brand,long_desc,sku,seller,seller_email,vertical,cat,subcat");
            req.setParam("skipLines", "1");

            NamedList<Object> result = server.request(req);
            System.out.println("Result ====================================================================================: \n" + result);

        } finally {
            if (core != null) core.close();
            if (container != null) container.shutdown();
        }
    }
}
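
The copy step to the search server is not part of the class above. A minimal sketch of what it could look like, assuming the index is a plain directory on disk, the indexer has already shut down, and the search server is not reading the target while the copy runs. The class name IndexCopier and the example paths are hypothetical, not from the POC:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Hypothetical helper, not part of the POC above.
final class IndexCopier {

    // Recursively copies everything under source into target.
    // Directories are created as needed; existing files are overwritten.
    static void copyIndex(Path source, Path target) throws IOException {
        // Files.walk visits a directory before its contents, so each
        // parent directory exists by the time its files are copied.
        try (Stream<Path> paths = Files.walk(source)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path dest = target.resolve(source.relativize(p).toString());
                if (Files.isDirectory(p)) {
                    Files.createDirectories(dest);
                } else {
                    Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

It would be called after the finally block has run, for example IndexCopier.copyIndex(Paths.get("/tmp/solr/data"), Paths.get("/mnt/search/solr/data")), where the second path is a placeholder for wherever the search server's data directory is mounted. The search server would still need to reopen its index after the copy.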


Thanks,
Rohit
