On Mon, Mar 7, 2011 at 9:56 AM, rajini maski <rajinima...@gmail.com> wrote:
> I just tried to answer your many questions; I like the type of questions you asked.
> Answers are attached to the questions.

Thank you Rajini, for your interest :)

> A) The data for every user is totally unrelated to every other user. This
> gives us a few advantages:
>
> 1. We can keep our indexes small in size.
>    (using cores)
> 2. Merging/compacting a fragmented index will take less time.
>    (merging is simple, one query)
> 3. If some indexes become inaccessible for whatever reason
>    (corruption?), only those users get affected. Other users are unaffected
>    and the service is available for them.
>    Yes, it affects only that index; others are unaffected.

How many cores can we safely have on a machine? How much is "too much" in this case?

> B) Each user can have a few different types of data.
>
> So, our index hierarchy will look something like:
> /user1/type1/<index files>
> /user1/type2/<index files>
> /user2/type1/<index files>
> /user3/type3/<index files>
>
> I am not clear on the point here.
> For example, say you have 2 users:
> user1
> types: Name, Email address, Phone number
> user2
> types: Name, Email address, ID
> So you want user1 with 3 indexes plus user2 with 3 indexes, a total of 6 indexes?
> If the user1 type "phone number" is the only type in the data index, then the schema
> will have only one data type, "number type".

I just meant to say, like this:

/myself/docs/index_docs
/myself/spreadsheets/index_spreads
/yourself/docs/index_docs
/yourself/spreadsheets/index_spreads

You get the idea, right?

> C) Often, probably with every iteration, we'll add "types" of data that can
> be indexed.
> So we want to have an efficient/programmatic way to add schemas for
> different "types". We would like to avoid having a fixed schema for indexing.
>
> Say you added a type, DATE.
> Before you start indexing this "date" type, you need to update your
> schema with this data type to enable indexing, correct?
> So this won't need a fixed schema defined beforehand; we can add it only when
> you want to add this data type. But this requires a service restart.
> This won't affect the current index other than adding to it.

Today I am adding only docs and spreadsheets; tomorrow I may want to add something else, something from an RDBMS for example. Then I don't want to sit tinkering with schema.xml, and I wouldn't like a service restart either...

> D) The users can fire search queries which will search either:
> - Within a specific "type" for that user
> - Across all types for that user: in this case
>   we want to fire a parallel query like Lucene has.
> (ParallelMultiSearcher<
> http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html
> >)
>
> Sharding in Solr works like this:
> You have phone number details in one index and phone number details
> only in another index too.
> You can search across both indexes by firing a query such as Ph:9999 across index1
> and index2.
> You cannot fire one search query such as Name:xyz and Ph:9999 across
> index1 and index2 when index1 has a data type defined only for name and
> index2 has one only for phone number. This can only be done if you define
> the data types for both in the schema (this creates the problem of having the same/fixed
> schema).
>
> E) We require real time update for the index. *This is a must.*
> This is possible. Indexing must be run every minute:
> check whether updates were made; if so, re-index and maintain uniqueness with the
> userid.
>
> We were considering Lucene, Sphinx and Solr to do this. This is what we
> found:
>
> - Sphinx: No efficient way to do A, B, C, F. Or is there?
> - Lucene: Everything looks possible, as it is very low level. But we have
>   to write wrappers to do F and build a communication layer between the web
>   server and the search server.
> - Solr: Not sure if we can do A, B, C easily. Can we?
>
> So, my question is what is the best software for the above requirements? I
> am inclined more towards Solr and then Lucene if we get all the
> requirements.
>
> Regards,
> Rajani Maski
>
> On Fri, Mar 4, 2011 at 7:16 PM, Shrinath M <shrinat...@webyog.com> wrote:
>
>> We are building an application which will require us to index data for each
>> of our users so that we can provide full text search on their data. Here are
>> some notable things about the application:
>>
>> A) The data for every user is totally unrelated to every other user. This
>> gives us a few advantages:
>>
>> 1. We can keep our indexes small in size.
>> 2. Merging/compacting a fragmented index will take less time.
>> 3. If some indexes become inaccessible for whatever reason
>>    (corruption?), only those users get affected. Other users are
>>    unaffected and the service is available for them.
>>
>> B) Each user can have a few different types of data. We want to keep each
>> type in separate folders, for the same reasons as above.
>>
>> So, our index hierarchy will look something like:
>> /user1/type1/<index files>
>> /user1/type2/<index files>
>> /user2/type1/<index files>
>> /user3/type3/<index files>
>>
>> C) Often, probably with every iteration, we'll add "types" of data that
>> can be indexed.
>> So we want to have an efficient/programmatic way to add schemas for
>> different "types". We would like to avoid having a fixed schema for
>> indexing.
>> I like Lucene's schema-less way of indexing stuff.
>>
>> D) The users can fire search queries which will search either:
>> - Within a specific "type" for that user
>> - Across all types for that user: in this case
>>   we want to fire a parallel query like Lucene has.
>> (ParallelMultiSearcher<
>> http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html
>> >)
>>
>> E) We require real time update for the index.
>> *This is a must.*
>>
>> F) We are planning to shard our index across multiple machines. For this
>> also, we want:
>> if a shard becomes inaccessible, only those users whose data resides in
>> that shard get affected. Other users get uninterrupted service.
>>
>> We were considering Lucene, Sphinx and Solr to do this. This is what we
>> found:
>>
>> - Sphinx: No efficient way to do A, B, C, F. Or is there?
>> - Lucene: Everything looks possible, as it is very low level. But we have
>>   to write wrappers to do F and build a communication layer between the
>>   web server and the search server.
>> - Solr: Not sure if we can do A, B, C easily. Can we?
>>
>> So, my question is what is the best software for the above requirements? I
>> am inclined more towards Solr and then Lucene if we get all the
>> requirements.
>>
>> --
>> Regards
>> Shrinath.M

--
Regards
Shrinath.M
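
To make requirements B and D in the thread above a bit more concrete (one small index per user and type, searched either within one type or fanned out across all of a user's types and merged), here is a rough Python sketch. It is only an illustration under stated assumptions: `ToyIndex`, `UserIndexes` and `index_path` are hypothetical names made up for this example, the scoring is deliberately naive, and nothing here is Lucene or Solr API. A real setup would put a Lucene index or a Solr core behind each (user, type) pair.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def index_path(user, doc_type):
    """Return the per-user, per-type index directory, e.g. /user1/type1."""
    return PurePosixPath("/") / user / doc_type


class ToyIndex:
    """Hypothetical stand-in for one small index (not a Lucene or Solr class)."""

    def __init__(self):
        self.docs = {}  # doc id -> text

    def add(self, doc_id, text):
        self.docs[doc_id] = text

    def search(self, term):
        # Naive score: how many times the term occurs in the document.
        term = term.lower()
        hits = []
        for doc_id, text in self.docs.items():
            score = text.lower().split().count(term)
            if score:
                hits.append((score, doc_id))
        return hits


class UserIndexes:
    """Registry keyed by (user, type), so no two users ever share an index."""

    def __init__(self):
        self.indexes = {}

    def get(self, user, doc_type):
        # Created on first use: a new "type" needs no central schema change here.
        return self.indexes.setdefault((user, doc_type), ToyIndex())

    def search_type(self, user, doc_type, term):
        # Search inside a single type for one user.
        index = self.indexes.get((user, doc_type))
        hits = index.search(term) if index is not None else []
        return sorted(hits, key=lambda h: (-h[0], h[1]))

    def search_all_types(self, user, term):
        # Fan out to every index of this user in parallel, then merge by score.
        targets = [(t, ix) for (u, t), ix in self.indexes.items() if u == user]
        if not targets:
            return []

        def run(pair):
            doc_type, index = pair
            return [(score, doc_type, doc_id) for score, doc_id in index.search(term)]

        with ThreadPoolExecutor(max_workers=len(targets)) as pool:
            merged = [hit for hits in pool.map(run, targets) for hit in hits]
        return sorted(merged, key=lambda h: (-h[0], h[1], h[2]))


registry = UserIndexes()
registry.get("myself", "docs").add("d1", "quarterly report draft")
registry.get("myself", "spreadsheets").add("s1", "report totals report summary")
registry.get("yourself", "docs").add("d2", "report for another user")

print(index_path("myself", "docs"))
print(registry.search_all_types("myself", "report"))
```

Because every lookup goes through the (user, type) key, a missing or broken index only affects the user who owns it, which is the isolation property asked for in A and F. Whether this many small indexes is practical on one machine is exactly the open "how many cores is too much" question, and the sketch does not answer it.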