Is there a <uniqueKey> in your schema? Are you returning a value corresponding to that key name?
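For example, if sid is meant to be the key (a guess from your log output, not something I can see from here), schema.xml would need something along these lines, and every document would have to end up with a non-null value for it:

<uniqueKey>sid</uniqueKey>
<field name="sid" type="string" indexed="true" stored="true" required="true" />

The field name and type are only a sketch based on your example, not taken from your actual setup.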
Probably you can paste the whole data-config.xml.

On Thu, Jul 23, 2009 at 4:59 PM, Chantal Ackermann<chantal.ackerm...@btelligent.de> wrote:
> Hi Paul, hi Glen, hi all,
>
> thank you for your answers.
>
> I have followed Paul's solution (as I received it earlier). (I'll keep your
> suggestion in mind, though, Glen.)
>
> It looks good, except that it's not creating any documents... ;-)
> It is most probably some misunderstanding on my side, and maybe you can
> help me correct that?
>
> So, I have subclassed the SqlEntityProcessor, basically overriding
> nextRow() as Paul suggested:
>
> public Map<String, Object> nextRow() {
>     if (rowcache != null)
>         return getFromRowCache();
>     if (rowIterator == null) {
>         String q = getQuery();
>         initQuery(resolver.replaceTokens(q));
>     }
>     Map<String, Object> pivottedRow = new HashMap<String, Object>();
>     Map<String, Object> fieldRow = getNext();
>     while (fieldRow != null) {
>         // populate pivottedRow
>         fieldRow = getNext();
>     }
>     pivottedRow = applyTransformer(pivottedRow);
>     log.info("Returning: " + pivottedRow);
>     return pivottedRow;
> }
>
> This seems to work as intended. From the log output, I can see that I get
> only the rows that I expect for one iteration, in the correct key-value
> structure. I can also see that the returned pivottedRow is what I want it
> to be: a map containing columns, where each column contains what was
> previously input as a row.
>
> Example (shortened):
> INFO: Next fieldRow: {value=2, name=audio, id=1}
> INFO: Next fieldRow: {value=773, name=cat, id=23}
> INFO: Next fieldRow: {value=642058, name=sid, id=17}
>
> INFO: Returning: {sid=642058, cat=[773], audio=2}
>
> The entity declaration in the DIH config (db_data_config.xml) looks like
> this (shortened):
>
> <entity name="my_value" processor="PivotSqlEntityProcessor"
>         columnValue="value" columnName="name"
>         query="select id, name, value from datamart where
>                parent_id=${id_definition.ID} and id in (1,23,17)">
>     <field column="sid" name="sid" />
>     <field column="audio" name="audio" />
>     <field column="cat" name="cat" />
> </entity>
>
> id_definition is the root entity. Per parent_id there are several rows in
> the datamart table which describe one data set (=> Lucene document).
>
> The object type of "value" is either String, String[] or List. I am not
> handling that explicitly, yet. If that were the problem, it would throw an
> exception, wouldn't it?
>
> But it is not creating any documents at all, although the data seems to be
> returned correctly from the processor, so it's probably something far more
> fundamental.
>
> <str name="Total Requests made to DataSource">1069</str>
> <str name="Total Rows Fetched">1069</str>
> <str name="Total Documents Skipped">0</str>
> <str name="Full Dump Started">2009-07-23 12:57:07</str>
> <str name="">
> Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
> </str>
>
> Any help / hint on what the root cause is or how to debug it would be
> greatly appreciated.
>
> Thank you!
> Chantal
>
>
> Noble Paul നോബിള് नोब्ळ् wrote:
>>
>> alternately, you can write your own EntityProcessor and just override
>> the nextRow(). I guess you can still use the JdbcDataSource
>>
>> On Wed, Jul 22, 2009 at 10:05 PM, Chantal
>> Ackermann<chantal.ackerm...@btelligent.de> wrote:
>>>
>>> Hi all,
>>>
>>> this is my first post, as I am new to SOLR (some Lucene exp).
>>>
>>> I am trying to load data from an existing datamart into SOLR using the
>>> DataImportHandler, but in my opinion it is too slow due to the special
>>> structure of the datamart I have to use.
>>>
>>> Root Cause:
>>> This datamart uses a row-based approach (pivot) to present its data. It
>>> was done this way to allow adding more attributes to a data set without
>>> having to change the table structure.
>>>
>>> Impact:
>>> To use the DataImportHandler, I have to pivot the data to create one row
>>> per data set again. Unfortunately, this results in more queries, and
>>> less performant ones. Moreover, there are sometimes multiple rows for a
>>> single attribute, which require separate queries - or trickier
>>> subselects that probably don't speed things up.
>>>
>>> Here is an example of the relation between DB requests, row fetches and
>>> the actual number of documents created:
>>>
>>> <lst name="statusMessages">
>>>   <str name="Total Requests made to DataSource">3737</str>
>>>   <str name="Total Rows Fetched">5380</str>
>>>   <str name="Total Documents Skipped">0</str>
>>>   <str name="Full Dump Started">2009-07-22 18:19:06</str>
>>>   <str name="">
>>>   Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
>>>   </str>
>>>   <str name="Committed">2009-07-22 18:22:29</str>
>>>   <str name="Optimized">2009-07-22 18:22:29</str>
>>>   <str name="Time taken ">0:3:22.484</str>
>>> </lst>
>>>
>>> (Full index creation.)
>>> There are about half a million data sets in total. That would require
>>> about 30h for indexing? My feeling is that there are far too many row
>>> fetches per data set.
>>>
>>> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
>>> around 680MB RAM, Java6. I haven't changed the Lucene configuration
>>> (merge factor 10, ram buffer size 32).
>>>
>>> Possible solutions?
>>> A) Write my own DataImportHandler?
>>> B) Write my own "MultiRowTransformer" that accepts several rows as input
>>>    argument (not sure this is a valid option)?
>>> C) Approach the DB developers to add a flat table with one data set per
>>>    row?
>>> D) ...?
>>>
>>> If someone would like to share their experiences, that would be great!
>>>
>>> Thanks a lot!
>>> Chantal
>>>
>>> --
>>> Chantal Ackermann
>>
>>
>> --
>> -----------------------------------------------------
>> Noble Paul | Principal Engineer | AOL | http://aol.com
>

--
-----------------------------------------------------
Noble Paul | Principal Engineer | AOL | http://aol.com
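A footnote for the archives: the "// populate pivottedRow" placeholder in the nextRow() above could be filled in along the lines of the sketch below. addPivotedField is a hypothetical helper, and the hard-coded "name"/"value" keys stand in for the entity's columnName/columnValue attributes; none of this is taken from the actual code in the thread.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: merge one (name, value) pair into the pivoted
// row, promoting repeated names to Lists so they become multi-valued
// fields in the resulting document.
@SuppressWarnings("unchecked")
static void addPivotedField(Map<String, Object> pivottedRow,
                            Map<String, Object> fieldRow) {
    String name = (String) fieldRow.get("name");   // columnName, e.g. "cat"
    Object value = fieldRow.get("value");          // columnValue, e.g. 773
    Object existing = pivottedRow.get(name);
    if (existing == null) {
        pivottedRow.put(name, value);              // first occurrence
    } else if (existing instanceof List) {
        ((List<Object>) existing).add(value);      // already multi-valued
    } else {
        List<Object> values = new ArrayList<Object>();
        values.add(existing);                      // promote to multi-valued
        values.add(value);
        pivottedRow.put(name, values);
    }
}

The while-loop body in nextRow() would then be addPivotedField(pivottedRow, fieldRow);. Note that, as Chantal says, the values may already arrive as String[] or List from the database, in which case the promotion step would need adjusting.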