RE: Questions regarding re-index when using Solr as a data source

Hui Liu Thu, 09 Jun 2016 09:52:59 -0700

Hi Walter,

Thank you for the reply. Sorry, I need to clarify what I meant by 'migrate 
tables' from Oracle to Solr. We are not literally moving existing records from 
Oracle to Solr. Instead, we are building a new application that feeds data 
directly into Solr as documents and fields, in parallel with an existing 
application that feeds the same data into Oracle tables and columns. Of 
course, the Solr schema will be somewhat different from the Oracle one. We 
also keep this data for only 90 days for users to search on. Once both systems 
have run in parallel for some time (> 90 days), we expect to have built up 
enough new data in Solr that we no longer need any of the old data in Oracle, 
and by then we will be able to use Solr as our only data store.

It sounds to me that we may need to consider saving the data to either a file 
system or another database in case we need to rebuild the indexes. The reason 
I mentioned saving the data into another Solr system is the passage from 
https://wiki.apache.org/solr/HowToReindex quoted below. Is there any update on 
this approach? And is there a better way to minimize the downtime caused by a 
schema change and re-index? For example, in Oracle we can add a new column or 
a new index online without any impact on existing queries, because the 
existing indexes stay intact.

Alternatives when a traditional reindex isn't possible

Sometimes the option of "do your indexing again" is difficult. Perhaps the 
original data is very slow to access, or it may be difficult to get in the 
first place.

Here's where we go against our own advice that we just gave you. Above we said 
"don't use Solr itself as a datasource" ... but one way to deal with data 
availability problems is to set up a completely separate Solr instance (not 
distributed, which for SolrCloud means numShards=1) whose only job is to store 
the data, then use the SolrEntityProcessor in the DataImportHandler to index 
from that instance to your real Solr install. If you need to reindex, just run 
the import again on your real installation. Your schema for the intermediate 
Solr install would have stored="true" and indexed="false" for all fields, and 
would only use basic types like int, long, and string. It would not have any 
copyFields.

This is the approach used by the Smithsonian for their Solr installation, 
because getting access to the source databases for the individual entities 
within the organization is very difficult. This way they can reindex the online 
Solr at any time without having to get special permission from all those 
entities. When they index new content, it goes into a copy of Solr configured 
for storage only, not in-depth searching. Their main Solr instance uses 
SolrEntityProcessor to import from the intermediate Solr servers, so they can 
always reindex.
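
For reference, the storage-only setup described in the wiki text above could 
be sketched roughly as follows. This is only an illustration in Python that 
generates the two config fragments. The field names, the host name 
storage-host and the core name storage are made up, and the entity attributes 
(url, query, rows) should be checked against the DataImportHandler 
documentation for the Solr version in use.

```python
import xml.etree.ElementTree as ET

# Hypothetical field names and basic types, for illustration only.
STORAGE_FIELDS = {
    "customer": "string",
    "order_count": "int",
    "amount": "long",
}


def storage_only_fields(fields):
    """Return <field> elements that are stored but not indexed."""
    elements = []
    for name, field_type in fields.items():
        elements.append(ET.Element("field", {
            "name": name,
            "type": field_type,
            "indexed": "false",
            "stored": "true",
        }))
    return elements


def solr_entity_config(source_url, query="*:*", rows=500):
    """Build a DataImportHandler config that reads from another Solr."""
    root = ET.Element("dataConfig")
    document = ET.SubElement(root, "document")
    ET.SubElement(document, "entity", {
        "name": "storage",
        "processor": "SolrEntityProcessor",
        "url": source_url,   # assumed address of the storage-only instance
        "query": query,
        "rows": str(rows),
    })
    return root


if __name__ == "__main__":
    for element in storage_only_fields(STORAGE_FIELDS):
        print(ET.tostring(element, encoding="unicode"))
    config = solr_entity_config("http://storage-host:8983/solr/storage")
    print(ET.tostring(config, encoding="unicode"))
```

The printed field lines would go into the schema of the storage-only 
instance, and the dataConfig output into the DataImportHandler config of the 
searchable instance.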

Regards,
Hui

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Thursday, June 09, 2016 12:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions regarding re-index when using Solr as a data source

First, using Solr as a repository is pretty risky. I would keep the official 
copy of the data in a database, not in Solr.

Second, you can’t “migrate tables” because Solr doesn’t have tables. You need 
to turn the tables into documents, then index the documents. It can take a lot 
of joins to flatten a relational schema into Solr documents.

Solr does not support schema migration, so yes, you will need to save off all 
the documents, then reload them. I would save them to files. It makes no sense 
to put them in another copy of Solr.
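
A minimal sketch of the save-to-files idea, assuming the documents are already 
available as Python dicts (how they are fetched from Solr or from the feeding 
application is not shown). The function names and the one-JSON-document-per-line 
layout are illustrative choices, not something Solr prescribes.

```python
import json
import tempfile
from pathlib import Path


def save_documents(docs, path):
    """Write one JSON document per line; return the number written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for doc in docs:
            # json.dumps escapes embedded newlines, so each doc stays on one line.
            handle.write(json.dumps(doc) + "\n")
            count += 1
    return count


def load_documents(path):
    """Yield documents back from the file, one per non-empty line."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical sample documents.
    sample = [{"id": "1", "title": "first"}, {"id": "2", "title": "second"}]
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "docs.jsonl"
        written = save_documents(sample, target)
        reloaded = list(load_documents(target))
        print(written, reloaded == sample)
```

After a schema change, the saved files can be read back with load_documents 
and sent to the indexer again.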

Changing the schema will be difficult and time-consuming, but you’ll probably 
run into much worse problems trying to use Solr as a repository.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 9, 2016, at 8:50 AM, Hui Liu <h...@opentext.com> wrote:
>
> Hi,
>
> We are porting an application currently hosted in Oracle 11g to SolrCloud 
> 6.x, i.e. we plan to migrate all tables in Oracle to collections in Solr, 
> index them, and build search tools on top of this. The goal is that we won't 
> be using Oracle at all once this has been implemented. Every field in Solr 
> will have 'stored=true', and a selected subset of searchable fields will 
> also have 'indexed=true'. The question is what steps we should follow if we 
> need to re-index a collection after making schema changes. Mostly we only 
> add new fields to store, or change a non-indexed field to indexed; we 
> normally do not delete or rename existing fields. According to 
> https://wiki.apache.org/solr/HowToReindex it seems we need to set up an 
> 'intermediate' Solr1 that only stores the data without any indexing, then 
> set up another Solr2 to hold the indexed data. In case of a re-index, we 
> would delete all the documents in Solr2 for the collection and re-import the 
> data from Solr1 into Solr2 using SolrEntityProcessor (from the 
> DataImportHandler). Is this still the recommended approach? The downside I 
> can see is that if we have a tremendous amount of data in a collection (some 
> of our collections could have several billion documents), re-importing it 
> from Solr1 to Solr2 may take hours or even days, and during this time users 
> cannot query the data. Is there any better way to do this and avoid this 
> kind of downtime? Any feedback is appreciated!
>
> Regards,
> Hui Liu
> Opentext, Inc.
