>
> I just tried to answer your many questions; I like the kind of questions
> you ask. My answers are attached to the questions.
>
> Thank you Rajini, for your interest :)

>
> A) The data for every user is totally unrelated to every other user. This
> gives us a few advantages:
>
>   1. we can keep our indexes small in size.
>  (using cores)
>   2. merging/compacting a fragmented index will take less time.
> (merging is simple, one query)
>   3. if some indexes become inaccessible for whatever reason
>   (corruption?), only those users get affected. Other users are unaffected
>   and the service is available for them.
> Yes, it affects only that index; the others are unaffected.
>
>
How many cores can we safely have on a machine? How many is "too many" in
this case?


> B) Each user can have a few different types of data.
>
> So, our index hierarchy will look something like:
> /user1/type1/<index files>
> /user1/type2/<index files>
> /user2/type1/<index files>
> /user3/type3/<index files>
>
> I am not clear on this point.
> For example, say you have 2 users:
> user1
>  types - Name, Email address, Phone number
> user2
>  types - Name, Email address, ID
> So you want 3 indexes for user1 plus 3 indexes for user2, 6 indexes in total?
> If "phone number" is the only type in user1's data index, then the schema
> will have only one data type, a number type.
>
>
>
>
I just meant something like this:

/myself/docs/index_docs
/myself/spreadsheets/index_spreads
/yourself/docs/index_docs
/yourself/spreadsheets/index_spreads

You get the idea, right?
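
A rough sketch of that layout in Python. The helper name `index_dir` and the root directory are assumptions made up for illustration; nothing here comes from Solr or Lucene:

```python
from pathlib import PurePosixPath

def index_dir(root, user, doc_type):
    """Return the directory holding one user's index for one type of data."""
    for part in (user, doc_type):
        # Reject empty names and anything that could escape the hierarchy.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"bad path component: {part!r}")
    return PurePosixPath(root) / user / doc_type

# Print the directory for every (user, type) pair in the example above.
for user in ("myself", "yourself"):
    for doc_type in ("docs", "spreadsheets"):
        print(index_dir("/", user, doc_type))
```

Each (user, type) pair gets its own directory, so a damaged index stays confined to that one pair.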

> C) Often, probably with every iteration, we'll add "types" of data that can
> be indexed.
> So we want to have an efficient/programmatic way to add schemas for
> different "types". We would like to avoid having fixed schema for indexing.
>
> Say you added a type, DATE.
> Before you start indexing this "date" type, you need to update your
> schema with the data type to enable indexing, correct?
> So this won't need a fixed schema defined beforehand; we can add a data
> type only when you want it. But this requires a service restart.
> This won't affect the current index other than adding to it.
>
>
Today I am adding only docs and spreadsheets. Tomorrow I may want to add
something else, something from an RDBMS for example, and then I don't want
to sit tinkering with schema.xml, and I wouldn't like a service restart
either...
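
For reference, Solr's CoreAdmin handler has a CREATE action that adds a core over HTTP while the server is running. The sketch below only builds such a request URL and does not send it. The base URL, the `user_type` core naming and the directory layout are assumptions for illustration, and the instance directory would still need its own config and schema in place, so this addresses the restart but not the schema question:

```python
from urllib.parse import urlencode

def create_core_url(solr_base, user, doc_type):
    """Build a CoreAdmin CREATE URL for one user's index of one type."""
    params = {
        "action": "CREATE",
        "name": f"{user}_{doc_type}",          # assumed naming scheme
        "instanceDir": f"/{user}/{doc_type}",  # assumed directory layout
    }
    return f"{solr_base.rstrip('/')}/admin/cores?{urlencode(params)}"

print(create_core_url("http://localhost:8983/solr", "myself", "docs"))
```

Whether this behaves as wanted depends on the Solr version and on how solr.xml is set up, so treat it as a starting point to check against the CoreAdmin documentation.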




-- 

On Fri, Mar 4, 2011 at 7:16 PM, Shrinath M <shrinat...@webyog.com> wrote:

> We are building an application which will require us to index data for each
> of our users so that we can provide full text search on their data. Here are
> some notable things about the application:
>
> A) The data for every user is totally unrelated to every other user. This
> gives us a few advantages:
>
>   1. we can keep our indexes small in size.
>   2. merging/compacting a fragmented index will take less time.
>   3. if some indexes become inaccessible for whatever reason
>   (corruption?), only those users get affected. Other users are unaffected
>   and the service is available for them.
>
> B) Each user can have a few different types of data. We want to keep each
> type in separate folders, for the same reasons as above.
>
> So, our index hierarchy will look something like:
> /user1/type1/<index files>
> /user1/type2/<index files>
> /user2/type1/<index files>
> /user3/type3/<index files>
>
> C) Often, probably with every iteration, we'll add "types" of data that can
> be indexed.
> So we want to have an efficient/programmatic way to add schemas for
> different "types". We would like to avoid having a fixed schema for indexing.
> I like Lucene's schema-less way of indexing stuff.
>
> D) The users can fire search queries which will search either:
>
>   - Within a specific "type" for that user
>   - Across all types for that user: in this case we want to fire a parallel
>   query like Lucene has (ParallelMultiSearcher
>   <http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html>).
>
> E) We require real time update for the index. *This is a must.*
>
> F) We are planning to shard our index across multiple machines. For this
> also, we want: if a shard becomes inaccessible, only those users whose data
> resides in that shard get affected. Other users get uninterrupted service.
>
> We were considering Lucene, Sphinx and Solr to do this. This is what we found:
>
>   - Sphinx: No efficient way to do A, B, C, F. Or is there?
>   - Lucene: Everything looks possible, as it is very low level. But we have
>   to write wrappers to do F and build a communication layer between the web
>   server and the search server.
>   - Solr: Not sure if we can do A, B, C easily. Can we?
>
> So, my question is: what is the best software for the above requirements? I
> am inclined more towards Solr, and then Lucene, if they meet all the
> requirements.
>
> --
> Regards
> Shrinath.M
>
