Re: FileDataSource vs JdbcDataSouce (speed) Solr 3.5

Shawn Heisey Mon, 01 Jul 2013 12:31:53 -0700

On 7/1/2013 12:56 PM, Mike L. wrote:

  Hey Ahmet / Solr User Group,


    I tried using the built-in UpdateCSV and it runs A LOT faster than a 
FileDataSource DIH as illustrated below. However, I am a bit confused about the 
numDocs/maxDoc values when doing an import this way. Here's my GET command 
against a tab-delimited file: (I removed server info and additional fields.. 
everything else is the same)

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields


My response from solr

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int 
name="QTime">591</int></lst>
</response>

I am experimenting with 2 csv files (1 with 10 records, the other with 1000) to 
see if I can get this to run correctly before running my entire collection of 
data. I initially loaded the first 1000 records to an empty core and that 
seemed to work. However, when running the above with a csv file that has 10 
records, I would like to see only 10 active records in my core. What I get 
instead, when looking at my stats page:

numDocs 1000
maxDoc 1010

If I run the same url above while appending an 'optimize=true', I get:

numDocs 1000,
maxDoc 1000.

A discrepancy between numDocs and maxDoc indicates that there are deleted documents in your index. You might already know this, so here's an answer to what I think might be your actual question:

If you want to delete the 1000 existing documents before adding the 10 documents, then you have to actually do that deletion. The CSV update handler works at a lower level than the DataImport handler, and doesn't have a "clean" option or a "full-import" command (full-import defaults to clean=true). The DIH is like a full application embedded inside Solr, one that uses an update handler -- it is not itself an update handler. When clean=true is set, or when full-import is run without a clean option, DIH itself sends a "delete all documents" update request.
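As a rough sketch of doing that deletion yourself, the snippet below builds two request URLs: a delete-by-query for *:* followed by the CSV load. The helper names are made up for illustration, and it assumes your core exposes the standard /update handler and accepts the stream.body parameter; adjust to match your solrconfig.xml.

```python
from urllib.parse import urlencode


def delete_all_url(core_url):
    """Build a URL that asks Solr to delete every document and commit.

    Assumption: the core has an XML update handler at /update that
    accepts the request body through the stream.body parameter.
    """
    params = {
        "stream.body": "<delete><query>*:*</query></delete>",
        "commit": "true",
    }
    return core_url + "/update?" + urlencode(params)


def csv_update_url(core_url, csv_path, fieldnames):
    """Build the CSV update URL with the same parameters as the original request."""
    params = {
        "commit": "true",
        "header": "false",
        "separator": "\t",   # encoded as %09
        "escape": "\\",      # encoded as %5C
        "stream.file": csv_path,
        "fieldnames": ",".join(fieldnames),
    }
    return core_url + "/update/csv?" + urlencode(params)


if __name__ == "__main__":
    core = "http://server:port/appname/solrcore"  # placeholder from the original post
    # Request these in order (browser, curl, or urllib.request.urlopen):
    print(delete_all_url(core))
    print(csv_update_url(core, "/location/of/file/on/server/file.csv",
                         ["id", "otherfields"]))
```

Sending the delete first and the CSV load second should leave only the documents from the CSV file in the core once the commit completes.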

If you didn't already know the bit about the deleted documents, then read this:

It can be normal for indexing "new" documents to cause deleted documents. This happens when the new documents have the same value in your uniqueKey field as documents that are already in your index. Solr knows by the config you gave it that they are the same document, so it deletes the old one before adding the new one. Solr has no way to know whether the document it already had or the document you are adding is more current, so it assumes you know what you are doing and takes care of the deletion for you.

When you optimize your index, deleted documents are purged, which is why the numbers match there.
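To make the counting concrete, here is a toy model of the behavior described above. It is only an illustration, not how Solr or Lucene store documents internally; the ToyIndex class and its method names are invented for this example.

```python
class ToyIndex:
    """Toy model of numDocs/maxDoc bookkeeping; not Solr's real implementation."""

    def __init__(self):
        self.live = {}     # uniqueKey value -> current document
        self.deleted = 0   # documents marked deleted but not yet purged

    def add(self, doc):
        # Same uniqueKey as an existing document: the old copy is marked
        # deleted and the new copy takes its place.
        if doc["id"] in self.live:
            self.deleted += 1
        self.live[doc["id"]] = doc

    def optimize(self):
        # Optimizing purges the documents that were marked deleted.
        self.deleted = 0

    @property
    def num_docs(self):
        return len(self.live)

    @property
    def max_doc(self):
        return len(self.live) + self.deleted


if __name__ == "__main__":
    index = ToyIndex()
    for i in range(1000):          # first load: 1000 records
        index.add({"id": i})
    for i in range(10):            # second load: 10 records with existing ids
        index.add({"id": i})
    print(index.num_docs, index.max_doc)   # 1000 1010
    index.optimize()
    print(index.num_docs, index.max_doc)   # 1000 1000
```

The two printed pairs line up with the stats page values in the question: the 10 re-added records replace 10 existing ones, so numDocs stays at 1000 while maxDoc grows until the optimize.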


Thanks,
Shawn
