Hi Grant
Thanks for the reply. My response below.
The data is stored as XML files; each record/entity corresponds to one
XML document of the form

<record>
  <CoreInfo a="xyz" b="456" c="123" ... />
  <AdditionalInfo t="xyz" y="333" ... />
  <addressinfo a="1" b="CA" ac="94087" ... />
  ...
</record>
I have currently set this up with the following schema.xml and DIH
config:

schema.xml

<field name="id" type="string" ... />
<field name="rectype" type="long" ... />

data-import.xml

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            baseDir="/Users/guna/Applications/solr-apache/xml"
            fileName=".*xml"
            rootEntity="false"
            dataSource="null">
      <entity name="rec"
              processor="XPathEntityProcessor"
              stream="false"
              forEach="/record"
              url="${f.fileAbsolutePath}">
        <field column="id" xpath="/record/CoreInfo/@a" />
        <field column="type" xpath="/record/CoreInfo/@b" />
        <field column="streetname" xpath="/record/addressinfo/@c" />
        <!-- ... and so on; note the xpaths are case-sensitive, so
             the element names must match the source XML exactly -->
      </entity>
    </entity>
  </document>
</dataConfig>
I don't need all the fields in the XML indexed or stored; I just
include the ones I need in schema.xml and data-import.xml.

Architecturally these XMLs are created, updated, and stored in a
separate system. Currently I am dumping the files into a directory and
invoking the DIH.
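For reference, the handler itself is registered in solrconfig.xml
along these lines (the handler path and config-file name are whatever
you choose; this is just the stock pattern):

<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <!-- points the handler at the mapping file shown above -->
  <lst name="defaults">
    <str name="config">data-import.xml</str>
  </lst>
</requestHandler>

and a full import is then triggered with
http://localhost:8983/solr/dataimport?command=full-import (assuming
the default port).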
Actually we have a publishing channel that publishes each XML whenever
it is created or updated. I'd really like to tap into this channel and
post the XML directly to SOLR instead of saving it to a file and then
invoking DIH. I'd also like to do it by leveraging definitions like
the ones in the data-config XML, so that every time I can just post
the original XML and the XPath configuration takes care of extracting
the relevant fields.
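As a stopgap I've been sketching a small XSLT that turns one published
record into SolrXML, so the channel could post the transformed
document straight to /update. This is untested, the field names just
mirror my example above, and it unfortunately duplicates the mapping
that already lives in data-import.xml:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- turn one <record> into a SolrXML add command -->
  <xsl:template match="/record">
    <add>
      <doc>
        <field name="id"><xsl:value-of select="CoreInfo/@a"/></field>
        <field name="type"><xsl:value-of select="CoreInfo/@b"/></field>
        <field name="streetname">
          <xsl:value-of select="addressinfo/@c"/>
        </field>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>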
I did take a look at Solr Cell at the link below. It seems to be
available only in 1.4, and 1.3 is currently the stable release.
Guna
On Jan 20, 2009, at 7:50 PM, Grant Ingersoll wrote:
On Jan 20, 2009, at 6:45 PM, Gunaranjan Chandraraju wrote:
Hi All
We are considering SOLR for a large database of XMLs. I have some
newbie questions - if there is a place I can go read about them do
let me know and I will go read up :)
1. Currently we are able to pull the XMLs from a file system using
FileDataSource. The DIH is convenient since I can map my XML fields
using the XPathEntityProcessor. This works for an initial load.
However, after the initial load we would like to 'post' changed XMLs
to SOLR whenever an XML is updated in the separate system. I know we
can post XMLs with 'add'; however, I was not sure how to do this
while maintaining the DIH mapping I use in data-config.xml. I don't
want to save the file to disk and then call the DIH - I would prefer
to post it directly. Do I need to use SolrJ for this?
You can likely use SolrJ, but then you probably need to parse the
XML an extra time. You may also be able to use Solr Cell, which is
the Tika integration such that you can send the XML straight to Solr
and have it deal with it. See http://wiki.apache.org/solr/ExtractingRequestHandler
Solr Cell is a push technology, whereas DIH is a pull technology.
I don't know how compatible this would be w/ DIH. Ideally, in the
future, they will cooperate as much as possible, but we are not
there yet.
As for your initial load, what if you ran a one-time XSLT transform
over all the files to convert them to SolrXML and then just posted
them the normal way? Then, going forward, any new files could just be
written out as SolrXML as well.
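For illustration, the SolrXML for one of your records would look
roughly like this (values taken from your sample record):

<add>
  <doc>
    <field name="id">xyz</field>
    <field name="type">456</field>
    <!-- ... one <field> element per mapped column ... -->
  </doc>
</add>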
If you can give some more info about your content, I think it would
be helpful.
2. If my Solr schema.xml changes, do I HAVE to reindex all the old
documents? Suppose in the future we have newer XML documents that
contain an additional XML field. The old documents that are already
indexed don't have this field and (so) I don't need to search on them
with this field. However, the new ones need to be searchable on this
new field. Can I just add the new field to the SOLR schema, restart
the servers, and post only the new documents, or do I need to reindex
everything?
Yes, you should be able to add new fields w/o problems. Where you
can run into problems is renaming, removing, etc.
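For example, adding an optional field to schema.xml (the field name
here is hypothetical) and restarting is enough; documents indexed
before the change simply have no value for it:

<field name="newfield" type="string" indexed="true" stored="true"
       required="false" />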
3. Can I back up the index directory, so that in case of a disk crash
I can restore the directory and bring Solr up? I realize that any
documents indexed after the backup would be lost - I can however keep
track of these outside of Solr and simply reindex the documents
'newer' than the backup date. This question is really important to me
in the context of using a master server with a replicated index; I
would like to run this backup on the 'Master'.
Yes, just use the master/slave replication approach for doing backups.
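In 1.3 that means the collection distribution scripts: a postCommit
listener in solrconfig.xml runs snapshooter after each commit, and
each snapshot is a consistent point-in-time copy of the index that
you can archive off the master. Something like the stock example
configuration (adjust the paths for your install):

<listener event="postCommit" class="solr.RunExecutableListener">
  <!-- snapshooter creates a hard-link snapshot of the index dir -->
  <str name="exe">solr/bin/snapshooter</str>
  <str name="dir">.</str>
  <bool name="wait">true</bool>
</listener>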
4. In general, what happens when the Solr application is bounced? Is
the index affected (is anything maintained in memory)?
I would recommend doing a commit before bouncing and letting all
indexing operations complete. Worst case, assuming you are using
Solr 1.3 or later, is that you may lose what is in memory.
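E.g., before shutting down, you can post a bare commit message to the
update handler (default URL assumed) so that anything buffered in
memory is flushed to the index:

<commit waitFlush="true" waitSearcher="true" />

sent as the body of a POST to http://localhost:8983/solr/update.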
-Grant
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ