On Fri, Oct 24, 2008 at 5:14 PM, <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have some questions about DataImportHandler and Solr statistics...
>
>
> 1.)
> I'm using the DataImportHandler for creating my Lucene index from XML files:
>
> ###
> $ cat data-config.xml
> <dataConfig>
>   <dataSource type="FileDataSource" />
>   <document>
>     <entity name="xmlFile"
>             processor="FileListEntityProcessor"
>             baseDir="/tmp/files"
>             fileName="myDoc_.*\.xml"
>             newerThan="'NOW-30DAYS'"
>             recursive="false"
>             rootEntity="false"
>             dataSource="null">
>       <entity name="myDoc"
>               url="${xmlFile.fileAbsolutePath}"
>               processor="XPathEntityProcessor"
>               forEach="/myDoc">
>       ...
> </dataConfig>
> ###
>
> No problems with this configuration - All works fine for full-imports, but...
>
> ===> What means 'rootEntity="false"' and 'dataSource="null"'?

It is a menace caused by 'sensible defaults'. An entity directly under <document> is a root entity. That means that for each row emitted by the root entity, one document is created in Solr/Lucene. In this case, however, we do not want one document per file; we want one document per row emitted by the entity 'myDoc'. Because the entity 'xmlFile' has rootEntity="false", the entity directly under it automatically becomes a root entity, and each row it emits becomes a document.

In most cases there is only one data source (a JdbcDataSource) and all entities use it, so it would be overkill to ask users to write the dataSource attribute on every entity. We therefore chose to implicitly assign the data source with no name to such an entity. FileListEntityProcessor, however, does not need a data source. It won't hurt if you leave out dataSource="null"; setting it just means that we won't create a DataSource instance for that entity.

>
>
>
> 2.)
> The documentation from DataImportHandler describes the index update process
> for SQL databases only...
>
> My scenario:
> - My application creates, deletes and modifies files from /tmp/files every
> night.
> - delta-import / DataImportHandler should "mirror" _all_ this changes to my
> lucene index (=> create, delete, update documents).

The only EntityProcessor that supports delta imports is SqlEntityProcessor. XPathEntityProcessor has not implemented it, because we do not know of a consistent way of finding deltas for XML. So, unfortunately, there is no delta support for XML. That said, you can implement those methods in XPathEntityProcessor; they are explained in EntityProcessor.java. If you have questions specific to this, I can help. We can probably contribute it back.

>
> ===> Is this possible with delta-import / DataImportHandler?
> ===> If not: Do you have any suggestions on how to do this?
>
>
>
> 3.)
> My scenario:
> - /tmp/files contains 682 'myDoc_.*\.xml' XML files.
> - Each XML file contains 12 XML elements (e.g.
<title>foo</title>).
> - DataImportHandler transfer only 5 from this 12 elements to the lucene index.
>
>
> I don't understand the output from 'solr/dataimport' (=> status):
>
> ###
> <response>
> ...
> <lst name="statusMessages">
>   <str name="Total Requests made to DataSource">0</str>
>   <str name="Total Rows Fetched">1363</str>
>   <str name="Total Documents Skipped">0</str>
>   <str name="Full Dump Started">2008-10-24 13:19:03</str>
>   <str name="">
>     Indexing completed. Added/Updated: 681 documents. Deleted 0 documents.
>   </str>
>   <str name="Committed">2008-10-24 13:19:05</str>
>   <str name="Optimized">2008-10-24 13:19:05</str>
>   <str name="Time taken ">0:0:2.648</str>
> </lst>
> ...
> </response>
>
> ===> What is "Total Rows Fetched" rsp. what is a "row" in a XML file? An
> element? Why 1363?
> ===> Why shows the "Added/Updated" counter 681 and not 682?

"Rows fetched" makes the most sense with SqlEntityProcessor, where it is the number of rows fetched from the DB. In general it is the cumulative number of rows emitted by all entities put together. In your case it is the total number of files plus the total number of rows emitted from the XML. "Added/Updated" is the number of documents. How do you know the number is not accurate?

>
>
>
> 4.)
> And my last questions about Solr statistics/informations...
>
> ===> Is it possible to get informations (number of indexed documents, stored
> values from documents etc.) from the current lucene index?
> ===> The admin webinterface shows 'numDocs' and 'maxDoc' in
> 'statistics/core'. Is 'numDocs' = number of indexed documents? What means
> 'maxDocs'?
>
>
> Thanks a lot!
> gisto

--
--Noble Paul
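
A note on the counters in the status output quoted above: they are consistent with "Total Rows Fetched" being the sum of the rows of both entities. The sketch below is only a sanity check of that arithmetic. It assumes that the outer 'xmlFile' entity emitted one row per file (682) and that the inner 'myDoc' entity emitted the remaining rows; the variable names are made up for illustration and nothing here is DataImportHandler API.

```python
# Sanity check of the status counters, under the assumptions stated above.
files_listed = 682         # assumed rows from 'xmlFile' (one per matching file)
docs_added = 681           # "Added/Updated" from the status output
total_rows_fetched = 1363  # "Total Rows Fetched" from the status output

# Rows are counted cumulatively across all entities, so whatever is not
# a file row is attributed to the inner 'myDoc' entity.
rows_from_my_doc = total_rows_fetched - files_listed

print(rows_from_my_doc)                 # rows attributed to 'myDoc'
print(rows_from_my_doc == docs_added)   # do rows and documents agree?
print(files_listed - rows_from_my_doc)  # files that yielded no 'myDoc' row
```

If the assumption holds, one of the 682 files produced no 'myDoc' row, which would account for 681 documents rather than 682. The arithmetic cannot say which file that is or why it produced no row.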