On Fri, Jul 24, 2009 at 2:04 PM, Chantal Ackermann<chantal.ackerm...@btelligent.de> wrote: > oh, I didn't know that. That did it! (Jippieh!) > I've pasted the relevant code below in case you're interested. > > Thank you for your help!! > > Another little observation that I made using the debugger interface: > it threw an exception trying to cast Long / String[] etc. to String. I guess > that happened because I filled that Map that is declared as > Map<String,Object> with different object types as values. > Maybe it should be changed to Map<String,String> if you have a cast in > there? Or maybe it's just a bug in the debugger. (Casting to String instead > of using toString()?) > I haven't tried changing it back to non-String objects, yet. I'll play > around with it - and keep you posted, if you like. > > If there is any interest I'll ask my boss whether I can contribute that > class to your code? I'd rework it to be generic as good as I can, of course.
If the usecase is generic enough you can raise an issue and provide a patch for others to use. Just ensure that the patch is for the 1.4 trunk > > Cheers, > Chantal > > ***************************************************************** > Here is the construct (I've changed it to use the rowIterator instead of > getNext() ): > > // return null once the end of that data set is reached > if (!rowIterator.hasNext()) { > rowIterator = null; > return null; > } > // as long as there is data, iterate over the rows and pivot them > // return the pivotted row after the last row of data has been reached > do { > fieldRow = rowIterator.next(); > attName = fieldRow.get(...); > newValue = fieldRow.get(...); > pivottedRow.put(attName, newValue); > } while (rowIterator.hasNext()); > > pivottedRow = applyTransformer(pivottedRow); > log.info("Returning: " + pivottedRow); > return pivottedRow; > ****************************************************************** > > > Noble Paul നോബിള് नोब्ळ् schrieb: >> >> The API of Entityprocessor is as follows >> >> the nextRow() is invoked again and again till it returns null . if >> getNext() returns null you also should return null; >> >> are you doing that? >> >> fieldRow = getNext(); >> if(fieldRow == null) return null; >> >> open the DIH debug page and you should see the debug output >> >> On Thu, Jul 23, 2009 at 9:11 PM, Chantal >> Ackermann<chantal.ackerm...@btelligent.de> wrote: >>> >>> Hi Paul, >>> >>> thanks for your patience. Before, I omitted parts of the code to make it >>> easier to read. Below is the complete code for the Processor. >>> >>> I tried starting the dataimporthandler with parameter debug=true (in the >>> URL >>> call: e.g. >>> >>> http:://localhost:8080/solr/de/dataimport?command=full-import&debug=true), >>> but this doesn't change the output in any way at all. It seems as if the >>> parameter would be ignored - do I use it incorrectly? >>> >>> Changing to FINE in the SOLR admin outputs loads from Tomcat, on top. The >>> "interesting" observation in that case is, that there is no additional >>> log >>> output between the nextRow() calls. The "Returning: " line is output, so >>> the >>> method exits but is called immediately afterwards, again. (Or so it >>> seems?) >>> There is no log output from any code that would try to create a lucene >>> document. >>> >>> Regards, >>> Chantal >>> >>> ****************** C O D E ******************************************* >>> public class PivotSqlEntityProcessor extends SqlEntityProcessor { >>> >>> private static final Logger log = Logger >>> .getLogger(EpgValueEntityProcessor.class.getName()); >>> private static final String ATTR_ID_DEFINITION = "columnIdDefinition"; >>> private static final String ATTR_COLUMN_ATT_NAME = "columnAttName"; >>> private static final String ATTR_COLUMN_VALUE = "columnValue"; >>> private static final String ATTR_COLUMN_SUBVALUE = "columnSubvalue"; >>> // default column names >>> private static final String DEF_ATT_NAME = "ATT_NAME"; >>> private static final String DEF_VALUE = "VALUE"; >>> private static final String DEF_SUBVALUE = "SUBVALUE"; >>> private static final String DEF_ID_DEFINITION = "ID_DEFINITION"; >>> private String defId; >>> private String colIdDef = DEF_ID_DEFINITION; >>> private String colAttName = DEF_ATT_NAME; >>> private String colValue = DEF_VALUE; >>> private String colSubvalue = DEF_SUBVALUE; >>> >>> >>> @SuppressWarnings("unchecked") >>> public void init(Context context) { >>> super.init(context); >>> // load column names from entity element attributes >>> colIdDef = context.getEntityAttribute(ATTR_ID_DEFINITION); >>> epgdefId = colIdEpgDef; >>> colAttName = context.getEntityAttribute(ATTR_COLUMN_ATT_NAME); >>> colValue = context.getEntityAttribute(ATTR_COLUMN_VALUE); >>> colSubvalue = context.getEntityAttribute(ATTR_COLUMN_SUBVALUE); >>> log.info("Using the column names: " + >>> colAttName + ", " + colValue + ", " + colSubvalue); >>> } >>> >>> >>> public Map<String, Object> nextRow() { >>> if (rowcache != null) >>> return getFromRowCache(); >>> if (rowIterator == null) { >>> String q = getQuery(); >>> initQuery(resolver.replaceTokens(q)); >>> } >>> Map<String, Object> pivottedRow = new HashMap<String, Object>(); >>> Map<String, Object> fieldRow = getNext(); >>> String attName, value, subvalue; >>> Object existingValue, newValue; >>> Long id = null; >>> while (fieldRow != null) { >>> id = ((Number)fieldRow.get(colIdDef)).longValue(); >>> assert id != null; >>> if (pivottedRow.containsKey(colIdDef)) { >>> assert id.equals(pivottedRow.get(colIdDef)); >>> } else { >>> pivottedRow.put(colIdDef, id); >>> } >>> log.info("Next fieldRow: " + fieldRow); >>> attName = (String) fieldRow.get(colAttName); >>> if (attName == null) { >>> log.warning("No value returned for attribute name >>> column " + colAttName); >>> } >>> value = (String) fieldRow.get(colValue); >>> subvalue = (String) fieldRow.get(colSubvalue); >>> >>> // create a single object for value and subvalue >>> // if subvalue is not set, use value only, >>> // otherwise create string array >>> if (subvalue == null || subvalue.trim().length() == 0) { >>> newValue = value; >>> } else { >>> newValue = new String[] { value, subvalue }; >>> } >>> >>> // if there is already an entry for that attribute, >>> // extend the existing value >>> if (pivottedRow.containsKey(attName)) { >>> existingValue = pivottedRow.get(attName); >>> if (existingValue instanceof List) { >>> ((List) existingValue).add(newValue); >>> } else { >>> ArrayList v = new ArrayList(); >>> Collections.addAll(v, existingValue, >>> newValue); >>> pivottedRow.put(attName, v); >>> } >>> } else { >>> pivottedRow.put(attName, newValue); >>> } >>> >>> fieldRow = getNext(); >>> } >>> pivottedRow = applyTransformer(pivottedRow); >>> log.info("Returning: " + pivottedRow); >>> return pivottedRow; >>> } >>> } >>> >>> ****************** C O D E E N D ********************************** >>> >>> >>> Noble Paul നോബിള് नोब्ळ् schrieb: >>>> >>>> apparently there is something wrong with the code or it is incomplete >>>> >>>> Map<String, Object> pivottedRow = new HashMap<String, Object>(); >>>> Map<String, Object> fieldRow = getNext(); >>>> while (fieldRow != null) { >>>> // populate pivottedRow >>>> fieldRow = getNext(); >>>> } >>>> pivottedRow = applyTransformer(pivottedRow); >>>> log.info("Returning: " + pivottedRow); >>>> return pivottedRow; >>>> >>>> >>>> here your 'pivottedRow ' is empty and I don't see it getting filled. >>>> fieldRow is contains the data but it is not used at all >>>> >>>> On Thu, Jul 23, 2009 at 6:37 PM, Chantal >>>> Ackermann<chantal.ackerm...@btelligent.de> wrote: >>>>> >>>>> Hi Paul, >>>>> >>>>> no, I didn't return the unique key, though there is one defined. I >>>>> added >>>>> that to the nextRow() implementation, and I am now returning it as part >>>>> of >>>>> the map. >>>>> >>>>> But it is still not creating any documents, and now that I can see the >>>>> ID >>>>> I >>>>> have realized that it is always processing the same - the first - data >>>>> set. >>>>> It's like it tries to create the first document but does not, then >>>>> reiterates over that same data, fails again, and so on. I mean, it >>>>> doesn't >>>>> even create one document. So it cannot be a simple iteration that >>>>> updates >>>>> the same document over and over again (as there is none). >>>>> >>>>> I haven't changed the log level. I see no error message in the output >>>>> (catalina.log in my case). >>>>> >>>>> The complete entity definition: >>>>> >>>>> <dataConfig> >>>>> <dataSource type="JdbcDataSource" >>>>> driver="oracle.jdbc.driver.OracleDriver" ... /> >>>>> <document name="doc"> >>>>> <entity name="epg_definition" pk="ID" >>>>> query="select ID from DEFINITION"> >>>>> <!-- originally I would set the field "id" (unique key) on this level, >>>>> doesn't work neither --> >>>>> <entity name="value" pk="DEF_ID" >>>>> processor="PivotSqlEntityProcessor" >>>>> query="select DEF_ID, id, name, value from datamart where >>>>> parent_id=${id_definition.ID} and id in (1,23,17)"> >>>>> <field column="DEF_ID" name="id" /> >>>>> <field column="sid" name="sid" /> >>>>> <field column="audio" name="audio" /> >>>>> <field column="cat" name="cat" /> >>>>> </entity> >>>>> </entity> >>>>> </document> >>>>> </dataConfig> >>>>> >>>>> schema: >>>>> <field name="id" type="long" indexed="true" stored="true" >>>>> required="true" >>>>> /> >>>>> <field name="sid" type="long" indexed="true" stored="true" >>>>> required="true" >>>>> /> >>>>> <field name="audio" type="text_ws" indexed="true" stored="false" >>>>> omitNorms="true" multiValued="true"/> >>>>> <field name="cat" type="text_ws" indexed="true" stored="true" >>>>> omitNorms="true" multiValued="true"/> >>>>> >>>>> I am using more fields, but I removed them to make it easier to read. I >>>>> am >>>>> thinking about removing them from my test to be sure they don't >>>>> interfere. >>>>> >>>>> Thanks for your help! >>>>> Chantal >>>>> >>>>> >>>>> Noble Paul നോബിള് नोब्ळ् schrieb: >>>>>> >>>>>> Is there a <uniqueKey> in your schema ? are you returning a value >>>>>> corresponding to that key name? >>>>>> >>>>>> probably you can paste the whole data-config.xml >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Jul 23, 2009 at 4:59 PM, Chantal >>>>>> Ackermann<chantal.ackerm...@btelligent.de> wrote: >>>>>>> >>>>>>> Hi Paul, hi Glen, hi all, >>>>>>> >>>>>>> thank you for your answers. >>>>>>> >>>>>>> I have followed Paul's solution (as I received it earlier). (I'll >>>>>>> keep >>>>>>> your >>>>>>> suggestion in mind, though, Glen.) >>>>>>> >>>>>>> It looks good, except that it's not creating any documents... ;-) >>>>>>> It is most probably some misunderstanding on my side, and maybe you >>>>>>> can >>>>>>> help >>>>>>> me correct that? >>>>>>> >>>>>>> So, I have subclassed the SqlEntityProcessor by overwriting basically >>>>>>> nextRow() as Paul suggested: >>>>>>> >>>>>>> public Map<String, Object> nextRow() { >>>>>>> if (rowcache != null) >>>>>>> return getFromRowCache(); >>>>>>> if (rowIterator == null) { >>>>>>> String q = getQuery(); >>>>>>> initQuery(resolver.replaceTokens(q)); >>>>>>> } >>>>>>> Map<String, Object> pivottedRow = new HashMap<String, Object>(); >>>>>>> Map<String, Object> fieldRow = getNext(); >>>>>>> while (fieldRow != null) { >>>>>>> // populate pivottedRow >>>>>>> fieldRow = getNext(); >>>>>>> } >>>>>>> pivottedRow = applyTransformer(pivottedRow); >>>>>>> log.info("Returning: " + pivottedRow); >>>>>>> return pivottedRow; >>>>>>> } >>>>>>> >>>>>>> This seems to work as intended. From the log output, I can see that I >>>>>>> get >>>>>>> only the rows that I expect for one iteration in the correct >>>>>>> key-value >>>>>>> structure. I can also see, that the returned pivottedRow is what I >>>>>>> want >>>>>>> it >>>>>>> to be: a map containing columns where each column contains what >>>>>>> previously >>>>>>> was input as a row. >>>>>>> >>>>>>> Example (shortened): >>>>>>> INFO: Next fieldRow: {value=2, name=audio, id=1} >>>>>>> INFO: Next fieldRow: {value=773, name=cat, id=23} >>>>>>> INFO: Next fieldRow: {value=642058, name=sid, id=17} >>>>>>> >>>>>>> INFO: Returning: {sid=642058, cat=[773], audio=2} >>>>>>> >>>>>>> The entity declaration in the dih config (db_data_config.xml) looks >>>>>>> like >>>>>>> this (shortened): >>>>>>> <entity name="my_value" processor="PivotSqlEntityProcessor" >>>>>>> columnValue="value" columnName="name" >>>>>>> query="select id, name, value from datamart where >>>>>>> parent_id=${id_definition.ID} and id in (1,23,17)"> >>>>>>> <field column="sid" name="sid" /> >>>>>>> <field column="audio" name="audio" /> >>>>>>> <field column="cat" name="cat" /> >>>>>>> </entity> >>>>>>> >>>>>>> id_definition is the root entity. Per parent_id there are several >>>>>>> rows >>>>>>> in >>>>>>> the datamart table which describe one data set (=>lucene document). >>>>>>> >>>>>>> The object type of "value" is either String, String[] or List. I am >>>>>>> not >>>>>>> handling that explicitly, yet. If that'd be the problem it would >>>>>>> throw >>>>>>> an >>>>>>> exception, wouldn't it? >>>>>>> >>>>>>> But it is not creating any documents at all, although the data seems >>>>>>> to >>>>>>> be >>>>>>> returned correctly from the processor, so it's pobably something far >>>>>>> more >>>>>>> fundamental. >>>>>>> <str name="Total Requests made to DataSource">1069</str> >>>>>>> <str name="Total Rows Fetched">1069</str> >>>>>>> <str name="Total Documents Skipped">0</str> >>>>>>> <str name="Full Dump Started">2009-07-23 12:57:07</str> >>>>>>> − >>>>>>> <str name=""> >>>>>>> Indexing completed. Added/Updated: 0 documents. Deleted 0 documents. >>>>>>> </str> >>>>>>> >>>>>>> Any help / hint on what the root cause is or how to debug it would be >>>>>>> greatly appreciated. >>>>>>> >>>>>>> Thank you! >>>>>>> Chantal >>>>>>> >>>>>>> >>>>>>> Noble Paul നോബിള് नोब्ळ् schrieb: >>>>>>>> >>>>>>>> alternately, you can write your own EntityProcessor and just >>>>>>>> override >>>>>>>> the nextRow() . I guess you can still use the JdbcDataSource >>>>>>>> >>>>>>>> On Wed, Jul 22, 2009 at 10:05 PM, Chantal >>>>>>>> Ackermann<chantal.ackerm...@btelligent.de> wrote: >>>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> this is my first post, as I am new to SOLR (some Lucene exp). >>>>>>>>> >>>>>>>>> I am trying to load data from an existing datamart into SOLR using >>>>>>>>> the >>>>>>>>> DataImportHandler but in my opinion it is too slow due to the >>>>>>>>> special >>>>>>>>> structure of the datamart I have to use. >>>>>>>>> >>>>>>>>> Root Cause: >>>>>>>>> This datamart uses a row based approach (pivot) to present its >>>>>>>>> data. >>>>>>>>> It >>>>>>>>> was >>>>>>>>> so done to allow adding more attributes to the data set without >>>>>>>>> having >>>>>>>>> to >>>>>>>>> change the table structure. >>>>>>>>> >>>>>>>>> Impact: >>>>>>>>> To use the DataImportHandler, i have to pivot the data to create >>>>>>>>> again >>>>>>>>> one >>>>>>>>> row per data set. Unfortunately, this results in more and less >>>>>>>>> performant >>>>>>>>> queries. Moreover, there are sometimes multiple rows for a single >>>>>>>>> attribute, >>>>>>>>> that require separate queries - or more tricky subselects that >>>>>>>>> probably >>>>>>>>> don't speed things up. >>>>>>>>> >>>>>>>>> Here is an example of the relation between DB requests, row fetches >>>>>>>>> and >>>>>>>>> actual number of documents created: >>>>>>>>> >>>>>>>>> <lst name="statusMessages"> >>>>>>>>> <str name="Total Requests made to DataSource">3737</str> >>>>>>>>> <str name="Total Rows Fetched">5380</str> >>>>>>>>> <str name="Total Documents Skipped">0</str> >>>>>>>>> <str name="Full Dump Started">2009-07-22 18:19:06</str> >>>>>>>>> - >>>>>>>>> <str name=""> >>>>>>>>> Indexing completed. Added/Updated: 934 documents. Deleted 0 >>>>>>>>> documents. >>>>>>>>> </str> >>>>>>>>> <str name="Committed">2009-07-22 18:22:29</str> >>>>>>>>> <str name="Optimized">2009-07-22 18:22:29</str> >>>>>>>>> <str name="Time taken ">0:3:22.484</str> >>>>>>>>> </lst> >>>>>>>>> >>>>>>>>> (Full index creation.) >>>>>>>>> There are about half a million data sets, in total. That would >>>>>>>>> require >>>>>>>>> about >>>>>>>>> 30h for indexing? My feeling is that there are far too many row >>>>>>>>> fetches >>>>>>>>> per >>>>>>>>> data set. >>>>>>>>> >>>>>>>>> I am testing it on a smaller machine (2GB, Windows :-( ), Tomcat6 >>>>>>>>> using >>>>>>>>> around 680MB RAM, Java6. I haven't changed the Lucene configuration >>>>>>>>> (merge >>>>>>>>> factor 10, ram buffer size 32). >>>>>>>>> >>>>>>>>> Possible solutions? >>>>>>>>> A) Write my own DataImportHandler? >>>>>>>>> B) Write my own "MultiRowTransformer" that accepts several rows as >>>>>>>>> input >>>>>>>>> argument (not sure this is a valid option)? >>>>>>>>> C) Approach the DB developers to add a flat table with one data set >>>>>>>>> per >>>>>>>>> row? >>>>>>>>> D) ...? >>>>>>>>> >>>>>>>>> If someone would like to share their experiences, that would be >>>>>>>>> great! >>>>>>>>> >>>>>>>>> Thanks a lot! >>>>>>>>> Chantal >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Chantal Ackermann >>>>>>>>> >>>>>>>> -- >>>>>>>> ----------------------------------------------------- >>>>>>>> Noble Paul | Principal Engineer| AOL | http://aol.com >>>>>> >>>>>> -- >>>>>> ----------------------------------------------------- >>>>>> Noble Paul | Principal Engineer| AOL | http://aol.com >>>>> >>>>> -- >>>>> Chantal Ackermann >>>>> Consultant >>>>> >>>>> mobil +49 (176) 10 00 09 45 >>>>> email chantal.ackerm...@btelligent.de >>>>> >>>>> >>>>> >>>>> -------------------------------------------------------------------------------------------------------- >>>>> >>>>> b.telligent GmbH & Co. KG >>>>> Lichtenbergstraße 8 >>>>> D-85748 Garching / München >>>>> >>>>> fon +49 (89) 54 84 25 60 >>>>> fax +49 (89) 54 84 25 69 >>>>> web www.btelligent.de >>>>> >>>>> Registered in Munich: HRA 84393 >>>>> Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented >>>>> by >>>>> Sebastian Amtage and Klaus Blaschek >>>>> USt.Id.-Nr. DE814054803 >>>>> >>>>> >>>>> >>>>> Confidentiality Note >>>>> This email is intended only for the use of the individual or entity to >>>>> which >>>>> it is addressed, and may contain information that is privileged, >>>>> confidential and exempt from disclosure under applicable law. If the >>>>> reader >>>>> of this email message is not the intended recipient, or the employee or >>>>> agent responsible for delivery of the message to the intended >>>>> recipient, >>>>> you >>>>> are hereby notified that any dissemination, distribution or copying of >>>>> this >>>>> communication is prohibited. If you have received this email in error, >>>>> please notify us immediately by telephone at +49 (0) 89 54 84 25 60. >>>>> Thank >>>>> you. >>>>> >>>> >>>> >>>> -- >>>> ----------------------------------------------------- >>>> Noble Paul | Principal Engineer| AOL | http://aol.com >>> >>> -- >>> Chantal Ackermann >>> Consultant >>> >>> mobil +49 (176) 10 00 09 45 >>> email chantal.ackerm...@btelligent.de >>> >>> >>> -------------------------------------------------------------------------------------------------------- >>> >>> b.telligent GmbH & Co. KG >>> Lichtenbergstraße 8 >>> D-85748 Garching / München >>> >>> fon +49 (89) 54 84 25 60 >>> fax +49 (89) 54 84 25 69 >>> web www.btelligent.de >>> >>> Registered in Munich: HRA 84393 >>> Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented >>> by >>> Sebastian Amtage and Klaus Blaschek >>> USt.Id.-Nr. DE814054803 >>> >>> >>> >>> Confidentiality Note >>> This email is intended only for the use of the individual or entity to >>> which >>> it is addressed, and may contain information that is privileged, >>> confidential and exempt from disclosure under applicable law. If the >>> reader >>> of this email message is not the intended recipient, or the employee or >>> agent responsible for delivery of the message to the intended recipient, >>> you >>> are hereby notified that any dissemination, distribution or copying of >>> this >>> communication is prohibited. If you have received this email in error, >>> please notify us immediately by telephone at +49 (0) 89 54 84 25 60. >>> Thank >>> you. >>> >> >> >> >> -- >> ----------------------------------------------------- >> Noble Paul | Principal Engineer| AOL | http://aol.com > > -- > Chantal Ackermann > Consultant > > mobil +49 (176) 10 00 09 45 > email chantal.ackerm...@btelligent.de > > -------------------------------------------------------------------------------------------------------- > > b.telligent GmbH & Co. KG > Lichtenbergstraße 8 > D-85748 Garching / München > > fon +49 (89) 54 84 25 60 > fax +49 (89) 54 84 25 69 > web www.btelligent.de > > Registered in Munich: HRA 84393 > Managing Director: b.telligent Verwaltungs GmbH, HRB 153164 represented by > Sebastian Amtage and Klaus Blaschek > USt.Id.-Nr. DE814054803 > > > > Confidentiality Note > This email is intended only for the use of the individual or entity to which > it is addressed, and may contain information that is privileged, > confidential and exempt from disclosure under applicable law. If the reader > of this email message is not the intended recipient, or the employee or > agent responsible for delivery of the message to the intended recipient, you > are hereby notified that any dissemination, distribution or copying of this > communication is prohibited. If you have received this email in error, > please notify us immediately by telephone at +49 (0) 89 54 84 25 60. Thank > you. > -- ----------------------------------------------------- Noble Paul | Principal Engineer| AOL | http://aol.com