Thanks Alexandre,

I solved the problem using the XSLT transform and the /update handler.
I attach the XSL that I put in conf/xslt/ (for documentation).

Then the command:

    curl "http://192.168.99.100:8999/solr/solrexchange/update?commit=true&tr=updateXmlSolrExchange.xsl" -H "Content-Type: text/xml" --data-binary @./solr/data/search/dih/data_search.xml

It is a shame that DIH cannot be used with the schemaless config.
I hope this will be possible in the future.

Thanks,
Pierre

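
For readers of the archive who cannot see the attachment: a stylesheet of this kind might look roughly like the sketch below. It is an illustration only, not the attached updateXmlSolrExchange.xsl. It assumes the input is the /posts/row layout with the attribute names used in the data-config.xml quoted further down, and it maps only a handful of attributes. The HTML stripping and tag splitting that the DIH transformers performed are not reproduced here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: turns <posts><row .../></posts> into Solr's
     <add><doc><field/></doc></add> update format. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Wrap all rows in the <add> envelope that the /update handler expects. -->
  <xsl:template match="/posts">
    <add>
      <xsl:apply-templates select="row"/>
    </add>
  </xsl:template>

  <!-- One <doc> per <row>; the id is always emitted, the other
       fields only when the attribute is present on the row. -->
  <xsl:template match="row">
    <doc>
      <field name="id"><xsl:value-of select="@Id"/></field>
      <xsl:apply-templates select="@Title | @Body | @Score"/>
    </doc>
  </xsl:template>

  <!-- Field names below follow the column names in data-config.xml. -->
  <xsl:template match="@Title">
    <field name="title"><xsl:value-of select="."/></field>
  </xsl:template>

  <xsl:template match="@Body">
    <field name="body"><xsl:value-of select="."/></field>
  </xsl:template>

  <xsl:template match="@Score">
    <field name="postScore"><xsl:value-of select="."/></field>
  </xsl:template>
</xsl:stylesheet>
```

Because the transformed documents go through /update rather than /dataimport, the add-unknown-fields-to-the-schema chain configured for /update/** applies, so fields such as title and body should be created on the fly with guessed types.
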
> On 10 Aug 2016, at 19:02, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
> Seems you might be right, according to the source:
> https://github.com/apache/lucene-solr/blob/master/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java#L662
>
> Sometimes, the magic (and schemaless is rather magical) fails when
> combined with older assumptions (and DIH is kind of legacy).
>
> You can still declare dynamic fields and use a prefix/suffix to map to
> the types. That would work just fine and avoid guessing.
>
> Or you could use the API to predefine the fields in the schema.
>
> Or use the POST method with the XSLT preprocessor (yes, Solr has that too
> somewhere).
>
> Regards,
> Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
> On 10 August 2016 at 18:42, Pierre Caserta <pierre.case...@gmail.com> wrote:
>> I am rebuilding a new docker image with each change to the config file, so
>> Solr starts fresh every time.
>>
>> <requestHandler name="/dataimport" initParams="myInitParams"
>>                 class="solr.DataImportHandler">
>>   <lst name="defaults">
>>     <str name="update.chain">add-unknown-fields-to-the-schema</str>
>>     <str name="config">solr-data-config.xml</str>
>>   </lst>
>> </requestHandler>
>>
>> I am still getting documents like this:
>>
>> "response":{"numFound":8,"start":0,"docs":[
>>   {
>>     "id":"38822",
>>     "_version_":1542264667720646656},
>>   {
>>
>> If I add the Body field using the Schema section of the Admin UI, this
>> field gets indexed during the dataimport.
>> It seems that solr.DataImportHandler does not allow the
>> add-unknown-fields-to-the-schema update.chain.
>>
>> Pierre
>>
>>> On 10 Aug 2016, at 18:33, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>
>>> Ok, to reduce the magic, you can just stick the "update.chain" parameter
>>> inside the defaults of the dataimport handler directly.
>>>
>>> You can also pass it just as a URL parameter. That's what the 'defaults'
>>> section means.
>>>
>>> And, just to be paranoid, you did reload the core after each of those
>>> changes to test it? These are not picked up automatically.
>>>
>>> Regards,
>>> Alex.
>>>
>>> On 10 August 2016 at 18:28, Pierre Caserta <pierre.case...@gmail.com> wrote:
>>>> It did not work.
>>>> I tried many things and ended up trying this:
>>>>
>>>> <requestHandler name="/dataimport" initParams="myInitParams"
>>>>                 class="solr.DataImportHandler">
>>>>   <lst name="defaults">
>>>>     <str name="config">solr-data-config.xml</str>
>>>>   </lst>
>>>> </requestHandler>
>>>> <initParams name="myInitParams" path="/update/**,/dataimport">
>>>>   <lst name="defaults">
>>>>     <str name="update.chain">add-unknown-fields-to-the-schema</str>
>>>>   </lst>
>>>> </initParams>
>>>>
>>>> Regards,
>>>> Pierre
>>>>
>>>>> On 10 Aug 2016, at 18:08, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>
>>>>> Your initParams section does not apply to the /dataimport handler as
>>>>> defined. Try modifying it to say:
>>>>> path="/update/**,/dataimport"
>>>>>
>>>>> Hopefully, that's all it takes.
>>>>>
>>>>> Managed schema is enabled by default, but schemaless mode is the next
>>>>> layer on top. With managed schema, you can use the API to add your
>>>>> fields (or the new Admin UI in the Schema screen). With schemaless mode,
>>>>> it tries to guess the field type as it adds it automatically.
>>>>>
>>>>> Regards,
>>>>> Alex.
>>>>>
>>>>> On 10 August 2016 at 18:04, Pierre Caserta <pierre.case...@gmail.com> wrote:
>>>>>> Hi Alex,
>>>>>> thanks for your answer.
>>>>>>
>>>>>> Yes, my solrconfig.xml contains the add-unknown-fields-to-the-schema chain.
>>>>>>
>>>>>> <initParams path="/update/**">
>>>>>>   <lst name="defaults">
>>>>>>     <str name="update.chain">add-unknown-fields-to-the-schema</str>
>>>>>>   </lst>
>>>>>> </initParams>
>>>>>>
>>>>>> I created my core using this command:
>>>>>>
>>>>>> curl "http://192.168.99.100:8999/solr/admin/cores?action=CREATE&name=solrexchange&instanceDir=/opt/solr/server/solr/solrexchange&configSet=data_driven_schema_configs_custom"
>>>>>>
>>>>>> I am using the example configset data_driven_schema_configs and I simply
>>>>>> added:
>>>>>>
>>>>>> <lib dir="${solr.install.dir:../../../..}/dist/"
>>>>>>      regex="solr-dataimporthandler-.*\.jar" />
>>>>>> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>>>>>>   <lst name="defaults">
>>>>>>     <str name="config">data-config.xml</str>
>>>>>>   </lst>
>>>>>> </requestHandler>
>>>>>>
>>>>>> I thought schemaless mode was enabled by default, but I also tried
>>>>>> adding this config and I get the same result.
>>>>>>
>>>>>> <schemaFactory class="ManagedIndexSchemaFactory">
>>>>>>   <bool name="mutable">true</bool>
>>>>>>   <str name="managedSchemaResourceName">managed-schema</str>
>>>>>> </schemaFactory>
>>>>>>
>>>>>> How can I update my schemaless URP chain and add the parameter to call
>>>>>> it to DIH?
>>>>>>
>>>>>>> On 10 Aug 2016, at 17:43, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>>>
>>>>>>> Do you have the actual fields defined? If not, then I am guessing that
>>>>>>> your 'post' test was against a different collection that had
>>>>>>> schemaless mode enabled and your DIH one is against one where
>>>>>>> schemaless mode is not enabled (look for
>>>>>>> 'add-unknown-fields-to-the-schema' in the solrconfig.xml to confirm).
>>>>>>> Solr examples for DIH do not have schemaless mode enabled.
>>>>>>>
>>>>>>> I _believe_ you can copy the schemaless URP chain and add the
>>>>>>> parameter to call it to the DIH handler and it _should_ work. But I am not
>>>>>>> betting on it without testing it, as DIH also has some magic code to
>>>>>>> ignore fields not defined in the schema, because it is designed to work
>>>>>>> with only extracting relevant fields from the database even with a
>>>>>>> 'select *' statement.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Alex.
>>>>>>>
>>>>>>> On 10 August 2016 at 17:12, Pierre Caserta <pierre.case...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>> It seems that using the DataImportHandler with an XPathEntityProcessor
>>>>>>>> config and a managed-schema setup only imports the id and _version_
>>>>>>>> fields.
>>>>>>>>
>>>>>>>> data-config.xml
>>>>>>>>
>>>>>>>> <dataConfig>
>>>>>>>>   <dataSource type="FileDataSource" encoding="UTF-8" />
>>>>>>>>   <document>
>>>>>>>>     <entity name="post"
>>>>>>>>             processor="XPathEntityProcessor"
>>>>>>>>             stream="true"
>>>>>>>>             forEach="/posts/row/"
>>>>>>>>             url="${dataimporter.request.dataurl}"
>>>>>>>>             transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
>>>>>>>>       <field column="id" xpath="/posts/row/@Id" />
>>>>>>>>       <field column="postTypeId" xpath="/posts/row/@PostTypeId" />
>>>>>>>>       <field column="acceptedAnswerId" xpath="/posts/row/@AcceptedAnswerId" />
>>>>>>>>       <field column="creationDate" xpath="/posts/row/@CreationDate"
>>>>>>>>              dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>>       <field column="postScore" xpath="/posts/row/@Score" />
>>>>>>>>       <field column="viewCount" xpath="/posts/row/@ViewCount" />
>>>>>>>>       <field column="body" xpath="/posts/row/@Body" stripHTML="true" />
>>>>>>>>       <field column="ownerUserId" xpath="/posts/row/@OwnerUserId" />
>>>>>>>>       <field column="lastEditorUserId" xpath="/posts/row/@LastEditorUserId" />
>>>>>>>>       <field column="lastEditorDisplayName" xpath="/posts/row/@LastEditorDisplayName" />
>>>>>>>>       <field column="lastActivityDate" xpath="/posts/row/@LastActivityDate"
>>>>>>>>              dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>>       <field column="title" xpath="/posts/row/@Title" />
>>>>>>>>       <field column="trimmedTags" xpath="/posts/row/@Tags" regex="&lt;(.*)&gt;" />
>>>>>>>>       <field column="tags" sourceColName="trimmedTags" splitBy="&gt;&lt;" />
>>>>>>>>       <field column="answerCount" xpath="/posts/row/@AnswerCount" />
>>>>>>>>       <field column="commentCount" xpath="/posts/row/@CommentCount" />
>>>>>>>>       <field column="favoriteCount" xpath="/posts/row/@FavoriteCount" />
>>>>>>>>       <field column="communityOwnedDate" xpath="/posts/row/@CommunityOwnedDate"
>>>>>>>>              dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>>     </entity>
>>>>>>>>   </document>
>>>>>>>> </dataConfig>
>>>>>>>>
>>>>>>>> http://192.168.99.100:8999/solr/solrexchange/select?indent=on&q=*:*&wt=json
>>>>>>>> {
>>>>>>>>   "responseHeader":{
>>>>>>>>     "status":0,
>>>>>>>>     "QTime":0,
>>>>>>>>     "params":{
>>>>>>>>       "q":"*:*",
>>>>>>>>       "indent":"on",
>>>>>>>>       "wt":"json",
>>>>>>>>       "_":"1470811193595"}},
>>>>>>>>   "response":{"numFound":8,"start":0,"docs":[
>>>>>>>>       {
>>>>>>>>         "id":"38822",
>>>>>>>>         "_version_":1542258196375142400},
>>>>>>>>       {
>>>>>>>>         "id":"38836",
>>>>>>>>         "_version_":1542258196387725312},
>>>>>>>>       {
>>>>>>>>         "id":"63896",
>>>>>>>>         "_version_":1542258196388773888},
>>>>>>>>       {
>>>>>>>>         "id":"65406",
>>>>>>>>         "_version_":1542258196391919616},
>>>>>>>>       {
>>>>>>>>         "id":"1357173",
>>>>>>>>         "_version_":1542258196391919617},
>>>>>>>>       {
>>>>>>>>         "id":"5339763",
>>>>>>>>         "_version_":1542258196392968192},
>>>>>>>>       {
>>>>>>>>         "id":"9932722",
>>>>>>>>         "_version_":1542258196392968193},
>>>>>>>>       {
>>>>>>>>         "id":"9217299",
>>>>>>>>         "_version_":1542258196392968194}]
>>>>>>>>   }}
>>>>>>>>
>>>>>>>> data_search.xml (8 rows)
>>>>>>>>
>>>>>>>> The URL I am hitting (with a custom dataurl parameter):
>>>>>>>>
>>>>>>>> curl 'http://192.168.99.100:8999/solr/solrexchange/dataimport?command=full-import&commit=true&dataurl=/code/solr/data/search/dih/data_search.xml'
>>>>>>>>
>>>>>>>> I changed my data to use <add> <doc> <field> and used the bin/post tool,
>>>>>>>> and this is working as expected.
>>>>>>>> Now I am interested in making it work with the DataImportHandler.
>>>>>>>> How can I use the DataImportHandler to import my documents?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Pierre Caserta