On Jan 20, 2009, at 6:45 PM, Gunaranjan Chandraraju wrote:
Hi All
We are considering Solr for a large database of XMLs. I have some
newbie questions - if there is a place I can go read about them, do
let me know and I will go read up :)
1. Currently we are able to pull the XMLs from a file system using
FileDataSource. The DIH is convenient since I can map my XML fields
using the XPathEntityProcessor. This works for an initial load.
However, after the initial load we would like to 'post' changed XMLs
to Solr whenever the XML is updated in a separate system. I know we
can post XMLs with 'add', but I am not sure how to do this and still
maintain the DIH mapping I use in data-config.xml. I don't want to
save the file to disk and then call the DIH - I would prefer to post
it directly. Do I need to use SolrJ for this?
You can likely use SolrJ, but then you probably need to parse the XML
an extra time. You may also be able to use Solr Cell, which is the
Tika integration, so you can send the XML straight to Solr and
have it deal with it. See http://wiki.apache.org/solr/ExtractingRequestHandler
Solr Cell is a push technology, whereas DIH is a pull technology.
I don't know how compatible this would be with DIH. Ideally, in the
future, they will cooperate as much as possible, but we are not there
yet.
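If you do go the SolrJ route, a minimal sketch might look like the
following (the URL and field names are just assumptions for
illustration - you would fill in whatever fields your data-config.xml
currently maps):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class PostChangedDoc {
      public static void main(String[] args) throws Exception {
        // Point SolrJ at your Solr instance (URL is an assumption)
        SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Build a document from fields you have already parsed out of
        // the changed XML (this is the "extra parse" mentioned above;
        // the field names are illustrative)
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-123");
        doc.addField("title", "Title pulled from the changed XML");

        server.add(doc);
        server.commit();
      }
    }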
As for your initial load, what if you ran a one-time XSLT transform
over all the files, converted them to SolrXML, and then just posted
them the normal way? Then, going forward, any new files could just be
written out as SolrXML as well.
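For reference, SolrXML here just means Solr's <add> update format, so
the XSLT output for each file would look roughly like this (field
names are made up - they would mirror whatever your data-config.xml
maps today):

    <add>
      <doc>
        <field name="id">doc-123</field>
        <field name="title">Title pulled from the source XML</field>
        <field name="body">Body text pulled from the source XML</field>
      </doc>
    </add>

You can then POST those files to the /update handler, e.g. with the
post.jar that ships with the example, or with curl.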
If you can give some more info about your content, I think it would be
helpful.
2. If my Solr schema.xml changes, do I HAVE to reindex all the old
documents? Suppose in the future we have newer XML documents that
contain an additional XML field. The old documents that are already
indexed don't have this field, and so I don't need to search on them
with this field. However, the new ones need to be searchable on this
new field. Can I just add this new field to the Solr schema, restart
the servers, and post only the new documents, or do I need to reindex
everything?
Yes, you should be able to add new fields without problems. Where you
can run into problems is with renaming, removing, etc.
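For example, something like this added to schema.xml (the name and
type here are placeholders), followed by a restart, is enough; the
already-indexed documents simply won't have a value for it:

    <field name="new_field" type="text" indexed="true" stored="true"
           required="false"/>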
3. Can I back up the index directory, so that in case of a disk crash
I can restore it and bring Solr back up? I realize that any documents
indexed after the backup would be lost - I can, however, keep track of
those outside Solr and simply re-index documents 'newer' than the
backup date. This question is really important to me in the context of
using a master server with a replicated index. I would like to run
this backup on the 'master'.
Yes, just use the master/slave replication approach for doing backups.
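With the script-based replication in Solr 1.3, the snapshooter script
on the master takes a cheap hard-link snapshot of the index after a
commit, and that snapshot is also exactly what you want as a backup.
Conceptually it boils down to something like this (paths are
assumptions; the real script handles naming and locking for you):

    # run after a commit on the master; hard links make the copy cheap
    cp -lr /path/to/solr/data/index /path/to/solr/data/snapshot.20090120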
4. In general, what happens when the Solr application is bounced? Is
the index affected (is anything maintained in memory)?
I would recommend letting all indexing operations complete and doing a
commit before bouncing. The worst case, assuming you are using Solr
1.3 or later, is that you may lose what is still in memory.
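If the application doing the indexing can't issue that commit itself,
you can always send one to the update handler right before the bounce,
e.g. (the URL is an assumption for a default example setup):

    curl http://localhost:8983/solr/update \
         -H "Content-Type: text/xml" --data-binary "<commit/>"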
-Grant
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ