Is there a <uniqueKey> in your schema? Are you returning a value corresponding to that key name?
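For example, if sid is meant to be the key (a guess from your log output, not something I can see from here), schema.xml would need something along these lines, and every document would have to end up with a non-null value for it:

<uniqueKey>sid</uniqueKey>
<field name="sid" type="string" indexed="true" stored="true" required="true" />

The field name and type are only a sketch based on your example, not taken from your actual setup.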
Probably you can paste the whole data-config.xml.

On Thu, Jul 23, 2009 at 4:59 PM, Chantal Ackermann<chantal.ackerm...@btelligent.de> wrote:
> Hi Paul, hi Glen, hi all,
>
> thank you for your answers.
>
> I have followed Paul's solution (as I received it earlier). (I'll keep your
> suggestion in mind, though, Glen.)
>
> It looks good, except that it's not creating any documents... ;-)
> It is most probably some misunderstanding on my side, and maybe you can
> help me correct that?
>
> So, I have subclassed the SqlEntityProcessor, basically overriding
> nextRow() as Paul suggested:
>
> public Map<String, Object> nextRow() {
>     if (rowcache != null)
>         return getFromRowCache();
>     if (rowIterator == null) {
>         String q = getQuery();
>         initQuery(resolver.replaceTokens(q));
>     }
>     Map<String, Object> pivottedRow = new HashMap<String, Object>();
>     Map<String, Object> fieldRow = getNext();
>     while (fieldRow != null) {
>         // populate pivottedRow
>         fieldRow = getNext();
>     }
>     pivottedRow = applyTransformer(pivottedRow);
>     log.info("Returning: " + pivottedRow);
>     return pivottedRow;
> }
>
> This seems to work as intended. From the log output, I can see that I get
> only the rows that I expect for one iteration, in the correct key-value
> structure. I can also see that the returned pivottedRow is what I want it
> to be: a map containing columns, where each column contains what was
> previously input as a row.
>
> Example (shortened):
> INFO: Next fieldRow: {value=2, name=audio, id=1}
> INFO: Next fieldRow: {value=773, name=cat, id=23}
> INFO: Next fieldRow: {value=642058, name=sid, id=17}
>
> INFO: Returning: {sid=642058, cat=[773], audio=2}
>
> The entity declaration in the DIH config (db_data_config.xml) looks like
> this (shortened):
>
> <entity name="my_value" processor="PivotSqlEntityProcessor"
>         columnValue="value" columnName="name"
>         query="select id, name, value from datamart where
>                parent_id=${id_definition.ID} and id in (1,23,17)">
>     <field column="sid" name="sid" />
>     <field column="audio" name="audio" />
>     <field column="cat" name="cat" />
> </entity>
>
> id_definition is the root entity. Per parent_id there are several rows in
> the datamart table which describe one data set (=> Lucene document).
>
> The object type of "value" is either String, String[] or List. I am not
> handling that explicitly, yet. If that were the problem, it would throw an
> exception, wouldn't it?
>
> But it is not creating any documents at all, although the data seems to be
> returned correctly from the processor, so it's probably something far more
> fundamental.
>
> <str name="Total Requests made to DataSource">1069</str>
> <str name="Total Rows Fetched">1069</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2009-07-23 12:57:07</str>
> <str name="">
> Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
> </str>
>
> Any help / hint on what the root cause is or how to debug it would be
> greatly appreciated.
>
> Thank you!
> Chantal
>
>
> Noble Paul നോബിള് नोब्ळ् wrote:
>>
>> alternately, you can write your own EntityProcessor and just override
>> the nextRow(). I guess you can still use the JdbcDataSource
>>
>> On Wed, Jul 22, 2009 at 10:05 PM, Chantal
>> Ackermann<chantal.ackerm...@btelligent.de> wrote:
>>>
>>> Hi all,
>>>
>>> this is my first post, as I am new to SOLR (some Lucene exp).
>>>
>>> I am trying to load data from an existing datamart into SOLR using the
>>> DataImportHandler, but in my opinion it is too slow due to the special
>>> structure of the datamart I have to use.
>>>
>>> Root Cause:
>>> This datamart uses a row-based approach (pivot) to present its data. It
>>> was done this way to allow adding more attributes to a data set without
>>> having to change the table structure.
>>>
>>> Impact:
>>> To use the DataImportHandler, I have to pivot the data to create one row
>>> per data set again. Unfortunately, this results in more queries, and
>>> less performant ones. Moreover, there are sometimes multiple rows for a
>>> single attribute, which require separate queries - or trickier
>>> subselects that probably don't speed things up.
>>>
>>> Here is an example of the relation between DB requests, row fetches and
>>> the actual number of documents created:
>>>
>>> <lst name="statusMessages">
>>>   <str name="Total Requests made to DataSource">3737</str>
>>>   <str name="Total Rows Fetched">5380</str>
>>>   <str name="Total Documents Skipped">0</str>
>>>   <str name="Full Dump Started">2009-07-22 18:19:06</str>
>>>   <str name="">
>>>   Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
>>>   </str>
>>>   <str name="Committed">2009-07-22 18:22:29</str>
>>>   <str name="Optimized">2009-07-22 18:22:29</str>
>>>   <str name="Time taken ">0:3:22.484</str>
>>> </lst>
>>>
>>> (Full index creation.)
>>> There are about half a million data sets in total. That would require
>>> about 30h for indexing? My feeling is that there are far too many row
>>> fetches per data set.
>>>
>>> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
>>> around 680MB RAM, Java6. I haven't changed the Lucene configuration
>>> (merge factor 10, ram buffer size 32).
>>>
>>> Possible solutions?
>>> A) Write my own DataImportHandler?
>>> B) Write my own "MultiRowTransformer" that accepts several rows as input
>>>    argument (not sure this is a valid option)?
>>> C) Approach the DB developers to add a flat table with one data set per
>>>    row?
>>> D) ...?
>>>
>>> If someone would like to share their experiences, that would be great!
>>>
>>> Thanks a lot!
>>> Chantal
>>>
>>> --
>>> Chantal Ackermann
>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer | AOL | http://aol.com
>

--
-----------------------------------------------------
Noble Paul | Principal Engineer | AOL | http://aol.com
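A footnote for the archives: the "// populate pivottedRow" placeholder in the nextRow() above could be filled in along the lines of the sketch below. addPivotedField is a hypothetical helper, and the hard-coded "name"/"value" keys stand in for the entity's columnName/columnValue attributes; none of this is taken from the actual code in the thread.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merge one (name, value) pair into the pivoted
// row, promoting repeated names to Lists so they become multi-valued
// fields in the resulting document.
@SuppressWarnings("unchecked")
static void addPivotedField(Map<String, Object> pivottedRow,
                            Map<String, Object> fieldRow) {
    String name = (String) fieldRow.get("name");   // columnName, e.g. "cat"
    Object value = fieldRow.get("value");          // columnValue, e.g. 773
    Object existing = pivottedRow.get(name);
    if (existing == null) {
        pivottedRow.put(name, value);              // first occurrence
    } else if (existing instanceof List) {
        ((List<Object>) existing).add(value);      // already multi-valued
    } else {
        List<Object> values = new ArrayList<Object>();
        values.add(existing);                      // promote to multi-valued
        values.add(value);
        pivottedRow.put(name, values);
    }
}

The while-loop body in nextRow() would then be addPivotedField(pivottedRow, fieldRow);. Note that, as Chantal says, the values may already arrive as String[] or List from the database, in which case the promotion step would need adjusting.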