Re: DataImportHandler / Import from DB : one data set comes in multiple rows

Chantal Ackermann Thu, 23 Jul 2009 06:08:09 -0700

Hi Paul,

no, I didn't return the unique key, though there is one defined. I added that to the nextRow() implementation, and I am now returning it as part of the map.

But it is still not creating any documents, and now that I can see the ID I have realized that it is always processing the same - the first - data set. It's like it tries to create the first document but does not, then reiterates over that same data, fails again, and so on. I mean, it doesn't even create one document. So it cannot be a simple iteration that updates the same document over and over again (as there is none).

I haven't changed the log level. I see no error message in the output (catalina.log in my case).


The complete entity definition:

<dataConfig>

    <dataSource type="JdbcDataSource" driver="oracle.jdbc.driver.OracleDriver" ... />

    <document name="doc">
        <entity name="epg_definition" pk="ID"
                query="select ID from DEFINITION">

            <entity name="value" pk="DEF_ID"
                    processor="PivotSqlEntityProcessor"
                    query="select DEF_ID, id, name, value from datamart where parent_id=${id_definition.ID} and id in (1,23,17)">

                <field column="DEF_ID" name="id" />
                <field column="sid" name="sid" />
                <field column="audio" name="audio" />
                <field column="cat" name="cat" />
            </entity>
        </entity>
    </document>
</dataConfig>

schema:
<field name="id" type="long" indexed="true" stored="true" required="true" />

<field name="sid" type="long" indexed="true" stored="true" required="true" />
<field name="audio" type="text_ws" indexed="true" stored="false" omitNorms="true" multiValued="true"/>
<field name="cat" type="text_ws" indexed="true" stored="true" omitNorms="true" multiValued="true"/>

I am using more fields, but I removed them to make it easier to read. I am thinking about removing them from my test to be sure they don't interfere.


Thanks for your help!
Chantal


Noble Paul നോബിള്‍ नोब्ळ् wrote:

Is there a <uniqueKey> in your schema? Are you returning a value
corresponding to that key name?

probably you can paste the whole data-config.xml



On Thu, Jul 23, 2009 at 4:59 PM, Chantal
Ackermann<chantal.ackerm...@btelligent.de> wrote:

Hi Paul, hi Glen, hi all,

thank you for your answers.

I have followed Paul's solution (as I received it earlier). (I'll keep your
suggestion in mind, though, Glen.)

It looks good, except that it's not creating any documents... ;-)
It is most probably some misunderstanding on my side, and maybe you can help
me correct that?

So, I have subclassed the SqlEntityProcessor, basically overriding
nextRow() as Paul suggested:

public Map<String, Object> nextRow() {
       if (rowcache != null)
               return getFromRowCache();
       if (rowIterator == null) {
               String q = getQuery();
               initQuery(resolver.replaceTokens(q));
       }
       Map<String, Object> pivottedRow = new HashMap<String, Object>();
       Map<String, Object> fieldRow = getNext();
       while (fieldRow != null) {
               // populate pivottedRow
               fieldRow = getNext();
       }
       pivottedRow = applyTransformer(pivottedRow);
       log.info("Returning: " + pivottedRow);
       return pivottedRow;
}

This seems to work as intended. From the log output, I can see that I get
only the rows that I expect for one iteration in the correct key-value
structure. I can also see that the returned pivottedRow is what I want it
to be: a map containing columns where each column contains what previously
was input as a row.

Example (shortened):
INFO: Next fieldRow: {value=2, name=audio, id=1}
INFO: Next fieldRow: {value=773, name=cat, id=23}
INFO: Next fieldRow: {value=642058, name=sid, id=17}

INFO: Returning: {sid=642058, cat=[773], audio=2}
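
The pivot shown in these log lines can be sketched without any Solr classes. This is only a minimal illustration, not the PivotSqlEntityProcessor from this thread: the class PivotSketch, its constructor and the row() helper are hypothetical names. It also returns null once the input is used up, on the assumption that this is how an EntityProcessor signals the end of its rows; the nextRow() quoted above appears to always return a map, which may be worth checking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a pivoting processor; it uses no Solr classes.
class PivotSketch {
    private final Iterator<Map<String, Object>> rows;
    private final String nameColumn;   // column holding the attribute name, e.g. "name"
    private final String valueColumn;  // column holding the attribute value, e.g. "value"

    PivotSketch(List<Map<String, Object>> rows, String nameColumn, String valueColumn) {
        this.rows = rows.iterator();
        this.nameColumn = nameColumn;
        this.valueColumn = valueColumn;
    }

    // Folds all remaining input rows into one map.
    // Returns null when no input is left (assumed end-of-rows signal).
    @SuppressWarnings("unchecked")
    Map<String, Object> nextRow() {
        if (!rows.hasNext()) {
            return null;
        }
        Map<String, Object> pivoted = new HashMap<>();
        while (rows.hasNext()) {
            Map<String, Object> row = rows.next();
            String name = String.valueOf(row.get(nameColumn));
            Object value = row.get(valueColumn);
            Object existing = pivoted.get(name);
            if (existing == null) {
                pivoted.put(name, value);              // first value for this attribute
            } else if (existing instanceof List) {
                ((List<Object>) existing).add(value);  // third or later value
            } else {
                List<Object> values = new ArrayList<>();
                values.add(existing);                  // second value: switch to a list
                values.add(value);
                pivoted.put(name, values);
            }
        }
        return pivoted;
    }

    // Builds one input row in the shape {id, name, value}.
    static Map<String, Object> row(Object id, String name, Object value) {
        Map<String, Object> r = new HashMap<>();
        r.put("id", id);
        r.put("name", name);
        r.put("value", value);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> input = new ArrayList<>();
        input.add(row(1, "audio", 2));
        input.add(row(23, "cat", 773));
        input.add(row(17, "sid", 642058));
        PivotSketch processor = new PivotSketch(input, "name", "value");
        System.out.println(processor.nextRow()); // one map with the keys audio, cat and sid
        System.out.println(processor.nextRow()); // null: input exhausted
    }
}
```

In this sketch a single value stays a scalar and repeated attribute names are collected into a list, so a multiValued field like cat would receive all its values.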

The entity declaration in the dih config (db_data_config.xml) looks like
this (shortened):
<entity name="my_value" processor="PivotSqlEntityProcessor"
       columnValue="value" columnName="name"
       query="select id, name, value from datamart where
parent_id=${id_definition.ID} and id in (1,23,17)">
       <field column="sid" name="sid" />
       <field column="audio" name="audio" />
       <field column="cat" name="cat" />
</entity>

id_definition is the root entity. Per parent_id there are several rows in
the datamart table which describe one data set (=>lucene document).

The object type of "value" is either String, String[] or List. I am not
handling that explicitly, yet. If that'd be the problem it would throw an
exception, wouldn't it?

But it is not creating any documents at all, although the data seems to be
returned correctly from the processor, so it's probably something far more
fundamental.
<str name="Total Requests made to DataSource">1069</str>
<str name="Total Rows Fetched">1069</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2009-07-23 12:57:07</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>

Any help / hint on what the root cause is or how to debug it would be
greatly appreciated.

Thank you!
Chantal


Noble Paul നോബിള്‍ नोब्ळ् wrote:

Alternatively, you can write your own EntityProcessor and just override
nextRow(). I guess you can still use the JdbcDataSource.

On Wed, Jul 22, 2009 at 10:05 PM, Chantal
Ackermann<chantal.ackerm...@btelligent.de> wrote:

Hi all,

this is my first post, as I am new to SOLR (some Lucene exp).

I am trying to load data from an existing datamart into SOLR using the
DataImportHandler but in my opinion it is too slow due to the special
structure of the datamart I have to use.

Root Cause:
This datamart uses a row based approach (pivot) to present its data. This
was done to allow adding more attributes to the data set without having to
change the table structure.
Impact:
To use the DataImportHandler, I have to pivot the data to create again one
row per data set. Unfortunately, this results in more queries that are also
less performant. Moreover, there are sometimes multiple rows for a single
attribute that require separate queries - or more tricky subselects that
probably don't speed things up.

Here is an example of the relation between DB requests, row fetches and
actual number of documents created:

<lst name="statusMessages">
<str name="Total Requests made to DataSource">3737</str>
<str name="Total Rows Fetched">5380</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2009-07-22 18:19:06</str>
<str name="">
Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
</str>
<str name="Committed">2009-07-22 18:22:29</str>
<str name="Optimized">2009-07-22 18:22:29</str>
<str name="Time taken ">0:3:22.484</str>
</lst>

(Full index creation.)
There are about half a million data sets in total. That would require about
30h for indexing? My feeling is that there are far too many row fetches per
data set.
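
The 30h figure can be checked with a small linear extrapolation from the run above (934 documents in 0:3:22.484). This is a rough sketch only: the class and method names are hypothetical, "half a million" is taken as exactly 500,000, and it assumes indexing time grows linearly with the number of documents.

```java
// Hypothetical helper that scales a measured indexing run to a larger data set.
class IndexTimeEstimate {
    // Linear extrapolation: measured seconds per document times the total count, in hours.
    static double estimateHours(long docsMeasured, double secondsMeasured, long docsTotal) {
        return secondsMeasured * docsTotal / docsMeasured / 3600.0;
    }

    public static void main(String[] args) {
        double measuredSeconds = 3 * 60 + 22.484;   // "Time taken" 0:3:22.484
        double hours = estimateHours(934, measuredSeconds, 500_000);
        System.out.printf("estimated hours: %.1f%n", hours);
    }
}
```

Under these assumptions the estimate comes out at roughly 30 hours, which matches the figure above.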

I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 using
around 680MB RAM, Java6. I haven't changed the Lucene configuration (merge
factor 10, ram buffer size 32).

Possible solutions?
A) Write my own DataImportHandler?
B) Write my own "MultiRowTransformer" that accepts several rows as input
argument (not sure this is a valid option)?
C) Approach the DB developers to add a flat table with one data set per
row?
D) ...?

If someone would like to share their experiences, that would be great!

Thanks a lot!
Chantal



--
Chantal Ackermann



--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com






--
Chantal Ackermann
Consultant

mobil    +49 (176) 10 00 09 45
email    chantal.ackerm...@btelligent.de

--------------------------------------------------------------------------------------------------------

b.telligent GmbH & Co. KG
Lichtenbergstraße 8
D-85748 Garching / München

fon       +49 (89) 54 84 25 60
fax        +49 (89) 54 84 25 69
web      www.btelligent.de

Registered in Munich: HRA 84393

Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented by Sebastian Amtage and Klaus Blaschek

USt.Id.-Nr. DE814054803



Confidentiality Note

This email is intended only for the use of the individual or entity to which it is addressed, and may contain information that is privileged, confidential and exempt from disclosure under applicable law. If the reader of this email message is not the intended recipient, or the employee or agent responsible for delivery of the message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is prohibited. If you have received this email in error, please notify us immediately by telephone at +49 (0) 89 54 84 25 60. Thank you.
