> I just tried to answer your many questions; I like the type of questions
> you asked. Answers are attached to the questions.

Thank you, Rajini, for your interest :)

> > A) The data for every user is totally unrelated to every other user. This
> > gives us a few advantages:
> >
> > 1. We can keep our indexes small in size.
> (using cores)
> > 2. Merging/compacting a fragmented index will take less time.
> (merging is simple, one query)
> > 3. If some indexes become inaccessible for whatever reason
> > (corruption?), only those users get affected. Other users are unaffected
> > and the service is available for them.
> Yes, it affects only that index; the others are unaffected.

How many cores can we safely have on a machine? How much is "too much" in this case?

> > B) Each user can have a few different types of data.
> >
> > So, our index hierarchy will look something like:
> > /user1/type1/<index files>
> > /user1/type2/<index files>
> > /user2/type1/<index files>
> > /user3/type3/<index files>
>
> I am not clear on the point here.
> Example: say you have 2 users:
> user1
> types: Name, Email address, Phone number
> user2
> types: Name, Email address, ID
> So you want user1 to have 3 indexes plus user2 to have 3 indexes, total = 6 indexes?
> If user1's type "phone number" is the only type in the data index, then the schema
> will have only one data type, "number type".

I just meant to say, like this:

/myself/docs/index_docs
/myself/spreadsheets/index_spreads
/yourself/docs/index_docs
/yourself/spreadsheets/index_spreads

You get the idea, right?

> > C) Often, probably with every iteration, we'll add "types" of data that can
> > be indexed.
> > So we want to have an efficient/programmatic way to add schemas for
> > different "types". We would like to avoid having a fixed schema for indexing.
>
> Say you added a type, DATE.
> Before you start indexing for this "date" type, you need to update your
> schema with this data type to enable indexing, correct?
> So this won't need a fixed schema defined beforehand; you can add it only when
> you want to add this data type. But this requires a service restart.
> This won't affect the current index other than adding to it.

Today I am adding only docs and spreadsheets; tomorrow I may want to add
something else, something from an RDBMS for example. Then I don't want to sit
tinkering with schema.xml, and I wouldn't like a service restart either.

--

On Fri, Mar 4, 2011 at 7:16 PM, Shrinath M <shrinat...@webyog.com> wrote:

> We are building an application which will require us to index data for each
> of our users so that we can provide full-text search on their data. Here are
> some notable things about the application:
>
> A) The data for every user is totally unrelated to every other user. This
> gives us a few advantages:
>
> 1. We can keep our indexes small in size.
> 2. Merging/compacting a fragmented index will take less time.
> 3. If some indexes become inaccessible for whatever reason
> (corruption?), only those users get affected. Other users are unaffected
> and the service is available for them.
>
> B) Each user can have a few different types of data. We want to keep each
> type in separate folders, for the same reasons as above.
>
> So, our index hierarchy will look something like:
> /user1/type1/<index files>
> /user1/type2/<index files>
> /user2/type1/<index files>
> /user3/type3/<index files>
>
> C) Often, probably with every iteration, we'll add "types" of data that can
> be indexed.
> So we want to have an efficient/programmatic way to add schemas for
> different "types". We would like to avoid having a fixed schema for indexing.
> I like Lucene's schema-less way of indexing stuff.
>
> D) The users can fire search queries which will search either:
> - Within a specific "type" for that user
> - Across all types for that user: in this case we want to fire a parallel
> query like Lucene has.
> (ParallelMultiSearcher
> <http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html>)
>
> E) We require real-time updates for the index. *This is a must.*
>
> F) We are planning to shard our index across multiple machines.
> For this also, we want:
> if a shard becomes inaccessible, only those users whose data resides in
> that shard get affected. Other users get uninterrupted service.
>
> We were considering Lucene, Sphinx and Solr to do this. This is what we
> found:
>
> - Sphinx: No efficient way to do A, B, C, F. Or is there?
> - Lucene: Everything looks possible, as it is very low level. But we have
> to write wrappers to do F and build a communication layer between the web
> server and the search server.
> - Solr: Not sure if we can do A, B, C easily. Can we?
>
> So, my question is: what is the best software for the above requirements? I
> am inclined more towards Solr, and then Lucene, if we get all the
> requirements.
>
> --
> Regards
> Shrinath.M
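
For readers following requirements B, D and F, here is a minimal sketch of the idea in Python. It is not Lucene or Solr code. The root directory `/indexes`, the shard count, the in-memory `FAKE_INDEXES` table and all function names are assumptions made up for illustration. It shows the `<user>/<type>` directory layout, a parallel fan-out over all of one user's types, and routing by user so that each user's data lives on exactly one shard.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath

# Hypothetical values; the thread names neither a root directory nor a shard count.
INDEX_ROOT = PurePosixPath("/indexes")
NUM_SHARDS = 4


def index_path(user, data_type):
    """Directory of one user's index for one type: <root>/<user>/<type>."""
    return INDEX_ROOT / user / data_type


def shard_for(user):
    """Pick a shard from the user name only, so all of a user's types share a shard."""
    return zlib.crc32(user.encode("utf-8")) % NUM_SHARDS


# Stand-in for real index files: maps an index directory to its documents.
# A real deployment would open a Lucene index or a Solr core at that path.
FAKE_INDEXES = {
    index_path("myself", "docs"): ["quarterly report", "meeting notes"],
    index_path("myself", "spreadsheets"): ["budget report"],
    index_path("yourself", "docs"): ["travel report"],
}


def search_one(user, data_type, term):
    """Search a single (user, type) index with a naive substring match."""
    docs = FAKE_INDEXES.get(index_path(user, data_type), [])
    return [(data_type, doc) for doc in docs if term in doc]


def search_all_types(user, term):
    """Query every type index of `user` in parallel and merge the hits."""
    types = sorted(p.name for p in FAKE_INDEXES if p.parent.name == user)
    with ThreadPoolExecutor() as pool:
        per_type = pool.map(lambda t: search_one(user, t, term), types)
        return [hit for hits in per_type for hit in hits]


if __name__ == "__main__":
    print(index_path("user1", "type1"))
    print(search_all_types("myself", "report"))
```

Because the shard is derived from the user name alone, losing one shard affects only the users mapped to it, which is the isolation asked for in F. Searching within one type is just `search_one`; searching across types reuses it once per type.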