Thanks Alexandre,

I solved the problem using the xslt transform and the /update handler.

I have attached the XSL that I put in conf/xslt/, for documentation.

Then the command:
curl "http://192.168.99.100:8999/solr/solrexchange/update?commit=true&tr=updateXmlSolrExchange.xsl" \
  -H "Content-Type: text/xml" \
  --data-binary @./solr/data/search/dih/data_search.xml
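
For anyone who cannot see the attachment, here is a minimal sketch of what such a stylesheet could look like. It is not the attached file, only an illustration. It assumes the input has the /posts/row shape with attributes (Id, Title, Body, ...) used in the data-config.xml quoted further down, and that the unique key field is called id. Every other attribute is copied to a field named after the attribute, and Tags is not split.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch, not the attached updateXmlSolrExchange.xsl. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- The root element becomes a Solr <add> command. -->
  <xsl:template match="/posts">
    <add>
      <xsl:apply-templates select="row"/>
    </add>
  </xsl:template>

  <!-- Each <row> becomes one <doc>; @Id is mapped to the assumed unique key "id". -->
  <xsl:template match="row">
    <doc>
      <field name="id"><xsl:value-of select="@Id"/></field>
      <xsl:apply-templates select="@*[name() != 'Id']"/>
    </doc>
  </xsl:template>

  <!-- Every remaining attribute becomes a field with the attribute's name. -->
  <xsl:template match="row/@*">
    <field name="{name()}"><xsl:value-of select="."/></field>
  </xsl:template>
</xsl:stylesheet>
```

Because the result goes through the normal /update handler, the schemaless update chain gets a chance to create the unknown fields, which is the part that did not happen with DIH.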

It is a shame that DIH cannot be used with the schemaless config. I hope this 
will be possible in the future.
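
For anyone who wants to stay with DIH, the dynamic-field route Alex suggested below would presumably look something like this in data-config.xml. I have not tested it. It assumes the configset already declares the usual dynamic field patterns (*_i, *_t, *_ss, *_dt); the suffixed column names are my own choice for illustration.

```xml
<!-- Untested sketch: column names carry a suffix so that they match
     dynamic field patterns assumed to exist in the schema. -->
<field column="id"              xpath="/posts/row/@Id" />
<field column="title_t"         xpath="/posts/row/@Title" />
<field column="body_t"          xpath="/posts/row/@Body" stripHTML="true" />
<field column="postScore_i"     xpath="/posts/row/@Score" />
<field column="viewCount_i"     xpath="/posts/row/@ViewCount" />
<field column="creationDate_dt" xpath="/posts/row/@CreationDate"
       dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
<field column="trimmedTags"     xpath="/posts/row/@Tags" regex="&lt;(.*)&gt;" />
<field column="tags_ss"         sourceColName="trimmedTags" splitBy="&gt;&lt;" />
```

Since the types come from the suffix, nothing has to be guessed, and DIH should no longer drop the fields as unknown.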

Thanks,
Pierre

> On 10 Aug 2016, at 19:02, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> 
> Seems you might be right, according to the source:
> https://github.com/apache/lucene-solr/blob/master/solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/DocBuilder.java#L662
> 
> Sometimes, the magic (and schemaless is rather magical) fails when
> combined with older assumptions (and DIH is kind of legacy).
> 
> You can still declare dynamic fields and use prefix/suffix to map to
> the types. That would work just fine and avoid guessing.
> 
> Or you could use API to predefine the fields in the schema.
> 
> Or use the POST method with XSLT preprocessor (yes, Solr has that too
> somewhere).
> 
> Regards,
>   Alex.
> ----
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
> 
> 
> On 10 August 2016 at 18:42, Pierre Caserta <pierre.case...@gmail.com> wrote:
>> I am rebuilding a new docker image with each change on the config file so 
>> solr starts fresh every time.
>> 
>>  <requestHandler name="/dataimport" initParams="myInitParams" 
>> class="solr.DataImportHandler">
>>      <lst name="defaults">
>>        <str name="update.chain">add-unknown-fields-to-the-schema</str>
>>        <str name="config">solr-data-config.xml</str>
>>      </lst>
>>  </requestHandler>
>> 
>> I am still getting documents like this:
>> 
>> "response":{"numFound":8,"start":0,"docs":[
>>      {
>>        "id":"38822",
>>        "_version_":1542264667720646656},
>>      {
>> 
>> If I add the Body field using the Schema section of the Admin UI, this 
>> field gets indexed during the dataimport.
>> It seems that solr.DataImportHandler does not allow the 
>> add-unknown-fields-to-the-schema update.chain.
>> 
>> Pierre
>> 
>>> On 10 Aug 2016, at 18:33, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>> 
>>> Ok, to reduce the magic, you can just stick "update.chain" parameter
>>> inside the defaults of the dataimport handler directly.
>>> 
>>> You can also pass it just as a URL parameter. That's what the 'defaults'
>>> section means.
>>> 
>>> And, just to be paranoid, you did reload the core after each of those
>>> changes to test it? These are not picked up automatically.
>>> 
>>> Regards,
>>>   Alex.
>>> ----
>>> Newsletter and resources for Solr beginners and intermediates:
>>> http://www.solr-start.com/
>>> 
>>> 
>>> On 10 August 2016 at 18:28, Pierre Caserta <pierre.case...@gmail.com> wrote:
>>>> It did not work,
>>>> I tried many things and ended up trying this:
>>>> 
>>>> <requestHandler name="/dataimport" initParams="myInitParams" 
>>>> class="solr.DataImportHandler">
>>>>     <lst name="defaults">
>>>>       <str name="config">solr-data-config.xml</str>
>>>>     </lst>
>>>> </requestHandler>
>>>> <initParams name="myInitParams" path="/update/**,/dataimport">
>>>>   <lst name="defaults">
>>>>     <str name="update.chain">add-unknown-fields-to-the-schema</str>
>>>>   </lst>
>>>> </initParams>
>>>> 
>>>> Regards,
>>>> Pierre
>>>> 
>>>>> On 10 Aug 2016, at 18:08, Alexandre Rafalovitch <arafa...@gmail.com> 
>>>>> wrote:
>>>>> 
>>>>> Your initParams section does not apply to /dataimport handler as
>>>>> defined. Try modifying it to say:
>>>>> path="/update/**,/dataimport"
>>>>> 
>>>>> Hopefully, that's all it takes.
>>>>> 
>>>>> Managed schema is enabled by default, but schemaless mode is the next
>>>>> layer on top. With managed schema, you can use the API to add your
>>>>> fields (or new Admin UI in the Schema screen). With schemaless mode,
>>>>> it tries to guess the field type as it adds it automatically.
>>>>> 
>>>>> 
>>>>> Regards,
>>>>>  Alex.
>>>>> 
>>>>> ----
>>>>> Newsletter and resources for Solr beginners and intermediates:
>>>>> http://www.solr-start.com/
>>>>> 
>>>>> 
>>>>> On 10 August 2016 at 18:04, Pierre Caserta <pierre.case...@gmail.com> 
>>>>> wrote:
>>>>>> Hi Alex,
>>>>>> thanks for your answer.
>>>>>> 
>>>>>> Yes my solrconfig.xml contains the add-unknown-fields-to-the-schema.
>>>>>> 
>>>>>> <initParams path="/update/**">
>>>>>>  <lst name="defaults">
>>>>>>    <str name="update.chain">add-unknown-fields-to-the-schema</str>
>>>>>>  </lst>
>>>>>> </initParams>
>>>>>> 
>>>>>> I created my core using this command:
>>>>>> 
>>>>>> curl 
>>>>>> "http://192.168.99.100:8999/solr/admin/cores?action=CREATE&name=solrexchange&instanceDir=/opt/solr/server/solr/solrexchange&configSet=data_driven_schema_configs_custom"
>>>>>> 
>>>>>> I am using the example configset data_driven_schema_configs and I simply 
>>>>>> added:
>>>>>> 
>>>>>> <lib dir="${solr.install.dir:../../../..}/dist/" 
>>>>>> regex="solr-dataimporthandler-.*\.jar" />
>>>>>> <requestHandler name="/dataimport" class="solr.DataImportHandler">
>>>>>>    <lst name="defaults">
>>>>>>      <str name="config">data-config.xml</str>
>>>>>>    </lst>
>>>>>> </requestHandler>
>>>>>> 
>>>>>> I thought the schemaless mode was enabled by default, but I also tried 
>>>>>> adding this config and got the same result.
>>>>>> 
>>>>>> <schemaFactory class="ManagedIndexSchemaFactory">
>>>>>>  <bool name="mutable">true</bool>
>>>>>>  <str name="managedSchemaResourceName">managed-schema</str>
>>>>>> </schemaFactory>
>>>>>> 
>>>>>> How can I update my schemaless URP chain and add the parameter to call 
>>>>>> it to DIH?
>>>>>> 
>>>>>> 
>>>>>>> On 10 Aug 2016, at 17:43, Alexandre Rafalovitch <arafa...@gmail.com> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Do you have the actual fields defined? If not, then I am guessing that
>>>>>>> your 'post' test was against a different collection that had
>>>>>>> schemaless mode enabled and your DIH one is against one where
>>>>>>> schemaless mode is not enabled (look for
>>>>>>> 'add-unknown-fields-to-the-schema' in the solrconfig.xml to confirm).
>>>>>>> Solr examples for DIH do not have schemaless mode enabled.
>>>>>>> 
>>>>>>> I _believe_ you can copy the schemaless URP chain and add the
>>>>>>> parameter to call it to DIH handler and it _should_ work. But I am not
>>>>>>> betting on it without testing it, as DIH also has some magic code to
>>>>>>> ignore fields not defined in schema because it is designed to work
>>>>>>> with only extracting relevant fields from the database even with
>>>>>>> 'select *' statement.
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Alex.
>>>>>>> ----
>>>>>>> Newsletter and resources for Solr beginners and intermediates:
>>>>>>> http://www.solr-start.com/
>>>>>>> 
>>>>>>> 
>>>>>>> On 10 August 2016 at 17:12, Pierre Caserta <pierre.case...@gmail.com> 
>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>> It seems that using the DataImportHandler with an XPathEntityProcessor 
>>>>>>>> config
>>>>>>>> and a managed-schema setup only imports the id and _version_ fields.
>>>>>>>> 
>>>>>>>> data-config.xml
>>>>>>>> 
>>>>>>>> <dataConfig>
>>>>>>>> <dataSource type="FileDataSource" encoding="UTF-8" />
>>>>>>>> <document>
>>>>>>>>     <entity name="post"
>>>>>>>>         processor="XPathEntityProcessor"
>>>>>>>>         stream="true"
>>>>>>>>         forEach="/posts/row/"
>>>>>>>>         url="${dataimporter.request.dataurl}"
>>>>>>>>         transformer="RegexTransformer,DateFormatTransformer,HTMLStripTransformer">
>>>>>>>>         <field column="id" xpath="/posts/row/@Id" />
>>>>>>>>         <field column="postTypeId" xpath="/posts/row/@PostTypeId" />
>>>>>>>>         <field column="acceptedAnswerId" xpath="/posts/row/@AcceptedAnswerId" />
>>>>>>>>         <field column="creationDate" xpath="/posts/row/@CreationDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>>         <field column="postScore" xpath="/posts/row/@Score" />
>>>>>>>>         <field column="viewCount" xpath="/posts/row/@ViewCount" />
>>>>>>>>         <field column="body" xpath="/posts/row/@Body" stripHTML="true" />
>>>>>>>>         <field column="ownerUserId" xpath="/posts/row/@OwnerUserId" />
>>>>>>>>         <field column="lastEditorUserId" xpath="/posts/row/@LastEditorUserId" />
>>>>>>>>         <field column="lastEditorDisplayName" xpath="/posts/row/@LastEditorDisplayName" />
>>>>>>>>         <field column="lastActivityDate" xpath="/posts/row/@LastActivityDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>>         <field column="title" xpath="/posts/row/@Title" />
>>>>>>>>         <field column="trimmedTags" xpath="/posts/row/@Tags" regex="&lt;(.*)&gt;" />
>>>>>>>>         <field column="tags" sourceColName="trimmedTags" splitBy="&gt;&lt;" />
>>>>>>>>         <field column="answerCount" xpath="/posts/row/@AnswerCount" />
>>>>>>>>         <field column="commentCount" xpath="/posts/row/@CommentCount" />
>>>>>>>>         <field column="favoriteCount" xpath="/posts/row/@FavoriteCount" />
>>>>>>>>         <field column="communityOwnedDate" xpath="/posts/row/@CommunityOwnedDate" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" />
>>>>>>>>     </entity>
>>>>>>>> </document>
>>>>>>>> </dataConfig>
>>>>>>>> 
>>>>>>>> 
>>>>>>>> http://192.168.99.100:8999/solr/solrexchange/select?indent=on&q=*:*&wt=json
>>>>>>>> {
>>>>>>>> "responseHeader":{
>>>>>>>> "status":0,
>>>>>>>> "QTime":0,
>>>>>>>> "params":{
>>>>>>>>   "q":"*:*",
>>>>>>>>   "indent":"on",
>>>>>>>>   "wt":"json",
>>>>>>>>   "_":"1470811193595"}},
>>>>>>>> "response":{"numFound":8,"start":0,"docs":[
>>>>>>>>   {
>>>>>>>>     "id":"38822",
>>>>>>>>     "_version_":1542258196375142400},
>>>>>>>>   {
>>>>>>>>     "id":"38836",
>>>>>>>>     "_version_":1542258196387725312},
>>>>>>>>   {
>>>>>>>>     "id":"63896",
>>>>>>>>     "_version_":1542258196388773888},
>>>>>>>>   {
>>>>>>>>     "id":"65406",
>>>>>>>>     "_version_":1542258196391919616},
>>>>>>>>   {
>>>>>>>>     "id":"1357173",
>>>>>>>>     "_version_":1542258196391919617},
>>>>>>>>   {
>>>>>>>>     "id":"5339763",
>>>>>>>>     "_version_":1542258196392968192},
>>>>>>>>   {
>>>>>>>>     "id":"9932722",
>>>>>>>>     "_version_":1542258196392968193},
>>>>>>>>   {
>>>>>>>>     "id":"9217299",
>>>>>>>>     "_version_":1542258196392968194}]
>>>>>>>> }}
>>>>>>>> 
>>>>>>>> data_search.xml (8 rows)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> the url I am hitting (with custom dataurl parameter)
>>>>>>>> 
>>>>>>>> curl
>>>>>>>> 'http://192.168.99.100:8999/solr/solrexchange/dataimport?command=full-import&commit=true&dataurl=/code/solr/data/search/dih/data_search.xml'
>>>>>>>> 
>>>>>>>> I changed my data to use <add> <doc> <field> and use the bin/post tool 
>>>>>>>> and
>>>>>>>> this is working as expected.
>>>>>>>> Now I am interested to make it work with the DataImportHandler.
>>>>>>>> How can I use the DataImportHandler to import my document ?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Pierre Caserta
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
>> 
