Re: loading XML docbook files into solr

Sujit Pal Sat, 26 Feb 2011 10:34:07 -0800

Hi Derek,

The XML files you post to Solr need to be in the correct Solr-specific
XML format.
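
For reference, here is a minimal sketch of that format, built with Python's
standard library. Solr's XML update handler expects documents wrapped as
<add><doc><field name="...">value</field></doc></add>; it does not read
arbitrary element or attribute names from a docbook file as fields. The helper
name build_add_xml is hypothetical, and the field names id and titleabbrev are
taken from the schema.xml quoted below.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build a Solr <add><doc>...</doc></add> message from (name, value) pairs.

    build_add_xml is an illustrative helper, not part of Solr.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        # Each Solr field is a <field> element whose name attribute must
        # match a field declared in schema.xml.
        field = ET.SubElement(doc, "field", attrib={"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml([
        ("id", "proi"),
        ("titleabbrev",
         "Advancing Return on Investment Analysis for Government IT: "
         "A Public Value Framework"),
    ]))
```

The resulting string is what you would post to the update handler with curl,
followed by a <commit/> so that the documents become searchable.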


One way to "preserve" the original structure would be to "flatten" the
document into field names indicating the position of the text, for
example:
book_titleabbrev: Advancing Return on Investment Analysis for Government IT:\
 A Public Value Framework
... etc.

But you will still have to parse your docbook XML into the appropriate schema
that you want to use for Solr. I believe DIH (DataImportHandler) also allows
XSLT-based preprocessors, so you don't have to write parsing code, but I
haven't used them.
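
A rough sketch of the flattening idea, assuming you parse the docbook file
yourself with Python's standard library. The function name flatten and the
underscore-joined field names (book_titleabbrev, book_chapter_title, ...) are
illustrative only; your schema.xml would need matching field definitions, and
the sample is a trimmed copy of the document quoted below.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the docbook sample from the original message.
SAMPLE = """<book label="issuebriefs" id="proi">
  <docid>245</docid>
  <titleabbrev>Advancing Return on Investment Analysis for Government IT:
A Public Value Framework </titleabbrev>
  <chapter>
    <title>Advancing Return on Investment Analysis for Government IT: A Public Value Framework</title>
  </chapter>
</book>"""


def flatten(elem, prefix=""):
    """Yield (field_name, text) pairs for elem and its descendants.

    The field name joins the element path with '_', so the name records
    where in the document the text came from. Text that follows a child
    element (tail text) is ignored in this sketch.
    """
    name = f"{prefix}_{elem.tag}" if prefix else elem.tag
    # Collapse line breaks and runs of spaces inside the element text.
    text = " ".join((elem.text or "").split())
    if text:
        yield name, text
    for child in elem:
        yield from flatten(child, name)


book = ET.fromstring(SAMPLE)
# The id attribute on <book> becomes the uniqueKey field.
fields = [("id", book.get("id"))] + list(flatten(book))

if __name__ == "__main__":
    for name, value in fields:
        print(f"{name}: {value}")
```

Each (name, value) pair would then become one <field name="..."> element in
the Solr-specific XML that you post.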

-sujit

On Sat, 2011-02-26 at 10:40 -0500, Derek Werthmuller wrote:
> I've been working on this for a while and seem to have hit a wall.  The error
> messages aren't complete enough to give guidance on why importing a sample
> docbook document into solr is not working.
> I'm using the curl tool to post the xml file and receive a non-error response,
> but the document count doesn't increase and a *:* query still returns no
> results.
> The docbook document has an id attribute, and this is mapped to the uniqueKey
> in the schema.xml file.  But it seems this may still be the issue.  It's not
> clear how the field names map to the XML.  Do they only map to attributes, or
> do they map to elements?  How do you differentiate?
> Can field names in the schema.xml file have xpath statements?
> 
> Are there other important sections of the solrconfig that could be keeping
> this from working?
> 
> We want to maintain much of the document structure so we have more control
> over the searching.
> 
> Here is what the docbook XML looks like:  (tried setting the uniquekey to id
> and docid but no go either way)
> 
> <book label="issuebriefs" id="proi">
>       <docid>245</docid>
>     <titleabbrev>Advancing Return on Investment Analysis for Government IT:
> A Public Value Framework </titleabbrev>
>     <chapter>
>         <title>Advancing Return on Investment Analysis for Government IT: A
> Public Value Framework</title>
>         <para>
>             <mediaobject>
>                 <imageobject>
>                     <imagedata
> fileref="/publications/annualreports/ar2006/images/public-value.jpg"
> format="jpg" contentdepth="157" contentwidth="216" align="left"/>
>                 </imageobject>
>                 <textobject>
>                     <phrase>Public Value Illustration</phrase>
>                 </textobject>
>             </mediaobject>
> ....
> ..
> 
> Here is the section of the schema.xml  
>       <field name="id" type="string" indexed="true" stored="true"
>              multiValued="false" required="true" />
>       <field name="titleabbrev" type="text" indexed="true" stored="true" />
>       <field name="title" type="text" indexed="true" stored="true" />
>       
>       <field name="para" type="text" indexed="true" stored="true" />
>       <field name="ulink" type="string" indexed="true" stored="true" />
>       <field name="listitem" type="text" indexed="true" stored="true" />
>       
>       <field name="all_text" type="text" indexed="true" stored="false"
> multiValued="true" />
> 
>        <copyField source="title" dest="all_text" />
>       <copyField source="para" dest="all_text" />
>       <copyField source="listitem" dest="all_text" />
>       <copyField source="titleabbrev" dest="all_text" />
> 
> 
>  </fields>
> 
>  <!-- Field to use to determine and enforce document uniqueness. 
>       Unless this field is marked with required="false", it will be a
> required field
>    -->
>  <uniqueKey>id</uniqueKey>
> 
>  <!-- field for the QueryParser to use when an explicit fieldname is absent
> -->
>  <defaultSearchField>all_text</defaultSearchField>
> 
>  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
>  <solrQueryParser defaultOperator="OR"/>
> 
> </schema>
> 
> Load command results.
> 
> $ ./postfile.sh 
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int
> name="QTime">56</int></lst>
> </response>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int
> name="QTime">15</int></lst>
> </response>
> 
> 
> Thanks
>       Derek
