Re: Full Text Search with multiple index and complex requirements

Shrinath M Sun, 06 Mar 2011 20:49:33 -0800

On Mon, Mar 7, 2011 at 9:56 AM, rajini maski <rajinima...@gmail.com> wrote:


> I just tried to answer your many questions; I like the type of questions
> you ask. Answers are attached to the questions.
>
Thank you Rajini, for your interest :)

>
> A) The data for every user is totally unrelated to every other user. This
> gives us a few advantages:
>
>   1. we can keep our indexes small in size.
>  (using cores)
>   2. merging/compacting fragmented indexes will take less time.
> (merging is simple, one query)
>   3. if some indexes become inaccessible for whatever reason
>   (corruption?), only those users are affected. Other users are unaffected
>   and the service is available for them.
> Yes, it affects only that index; others are unaffected.
>
>
How many cores can we safely have on a machine? How many is "too many" in
this case?


> B) Each user can have a few different types of data.
>
> So, our index hierarchy will look something like:
> /user1/type1/<index files>
> /user1/type2/<index files>
> /user2/type1/<index files>
> /user3/type3/<index files>
>
> I am not clear on the point here.
> For example, say you have 2 users:
> user1
>  types - Name, Email address, Phone number
> user2
>  types - Name, Email address, ID
> So you want 3 indexes for user1 plus 3 indexes for user2, 6 indexes in total?
> If user1's type "phone number" is the only type in the data index, then the
> schema will have only one data type, "number type".
>
>
>
I just meant something like this:

/myself/docs/index_docs
/myself/spreadsheets/index_spreads
/yourself/docs/index_docs
/yourself/spreadsheets/index_spreads

You get the idea, right?
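
A rough sketch of that layout in Python, just to make the mapping concrete.
The function name index_dir and the root argument are made up for
illustration; nothing here is Lucene or Solr API, it only builds the
directory path for one user and one type.

```python
from pathlib import PurePosixPath


def index_dir(root: str, user: str, data_type: str) -> PurePosixPath:
    """Return the directory that holds one user's index for one data type."""
    for part in (user, data_type):
        # Reject anything that could escape the per-user directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return PurePosixPath(root) / user / data_type


print(index_dir("/", "myself", "docs"))            # /myself/docs
print(index_dir("/", "yourself", "spreadsheets"))  # /yourself/spreadsheets
```

Keeping the mapping in one function means the on-disk layout can change
later without touching the callers.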

> C) Often, probably with every iteration, we'll add "types" of data that can
> be indexed.
> So we want to have an efficient/programmatic way to add schemas for
> different "types". We would like to avoid having a fixed schema for indexing.
>
> Say you added a type, DATE.
> Before you start indexing this "date" type, you need to update your
> schema with this data type to enable indexing. Correct?
> So this won't need a fixed schema defined in advance; you can add the type
> only when you need it. But this requires a service restart.
> This won't affect the current index other than adding to it.
>
>
Today I am adding only docs and spreadsheets; tomorrow I may want to add
something else, something from an RDBMS for example, and then I don't want
to sit tinkering with schema.xml, and I wouldn't like a service restart
either...
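
To show what I mean by not having a fixed schema, here is a toy sketch in
Python. It is not Lucene or Solr code: the class SchemalessIndex and its
methods are invented for this example. The point is only that a document is
a plain field-to-value map, and a new field becomes searchable the first
time it shows up, with no config edit and no restart.

```python
from collections import defaultdict


class SchemalessIndex:
    """Toy inverted index: any field name is accepted without prior declaration."""

    def __init__(self):
        # (field, term) -> set of document ids containing that term in that field
        self._postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Index every field of the document, whatever the field names are."""
        for field, value in fields.items():
            for term in str(value).lower().split():
                self._postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Return the sorted ids of documents whose field contains the term."""
        return sorted(self._postings.get((field, term.lower()), set()))


idx = SchemalessIndex()
idx.add(1, {"title": "Quarterly report", "body": "sales figures"})
# A record pulled from an RDBMS brings fields nobody declared beforehand.
idx.add(2, {"customer": "Acme Corp", "city": "Pune"})
print(idx.search("customer", "acme"))  # [2]
```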


>
> D) The users can fire search queries which will search either:
>   - Within a specific "type" for that user
>   - Across all types for that user: in this case we want to fire a
>     parallel query like Lucene has (ParallelMultiSearcher:
>     http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html)
>
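
The "across all types" case is a fan-out and merge: send the same query to
every per-type index of one user at the same time, then collect the hits. A
minimal Python sketch of that idea follows. It does not use
ParallelMultiSearcher or any Lucene class; the per-type indexes are plain
dicts, and search_one and search_all_types are names invented here.

```python
from concurrent.futures import ThreadPoolExecutor


def search_one(docs, term):
    """Return ids of documents in one index whose text contains the term."""
    term = term.lower()
    return [doc_id for doc_id, text in docs.items()
            if term in text.lower().split()]


def search_all_types(indexes_by_type, term):
    """Query every per-type index of one user concurrently and merge the hits."""
    with ThreadPoolExecutor() as pool:
        # One task per data type; all of them run before results are gathered.
        futures = {data_type: pool.submit(search_one, docs, term)
                   for data_type, docs in indexes_by_type.items()}
        return {data_type: f.result() for data_type, f in futures.items()}


user_indexes = {
    "docs": {"d1": "budget plan", "d2": "meeting notes"},
    "spreadsheets": {"s1": "budget 2011"},
}
print(search_all_types(user_indexes, "budget"))
# {'docs': ['d1'], 'spreadsheets': ['s1']}
```

Searching within a single type is then just search_one on that one index.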
>
> Sharding in Solr works like this:
> You have phone number details in one index, and phone number details
> in another index too.
> You can search across both indexes by firing a query such as Ph:9999 across
> index1 and index2.
> You cannot fire one search query such as Name:xyz and Ph:9999 across index1
> and index2 when index1 has a data type defined only for name and
> index2 only for phone number. This can only be done if you define the data
> types for both in the schema (which brings back the problem of having the
> same/fixed schema).
>
>
> E) We require real time update for the index. *This is a must.*
> This is possible. Indexing must be scheduled to run every minute:
> check whether updates were made and, if so, re-index and maintain
> uniqueness with the userid.
>
>
>
> We were considering Lucene, Sphinx and Solr to do this. This is what we
> found:
>
>   - Sphinx: No efficient way to do A, B, C, F. Or is there?
>   - Lucene: Everything looks possible, as it is very low level. But we have
>   to write wrappers to do F and build a communication layer between the web
>   server and the search server.
>   - Solr: Not sure if we can do A, B, C easily. Can we?
>
> So, my question is what is the best software for the above requirements? I
> am inclined more towards Solr and then Lucene if we get all the
> requirements.
>
>
> Regards,
> Rajani Maski
>
>
> On Fri, Mar 4, 2011 at 7:16 PM, Shrinath M <shrinat...@webyog.com> wrote:
>
>> We are building an application which will require us to index data for
>> each of our users so that we can provide full text search on their data.
>> Here are some notable things about the application:
>>
>> A) The data for every user is totally unrelated to every other user. This
>> gives us a few advantages:
>>
>>   1. we can keep our indexes small in size.
>>   2. merging/compacting fragmented indexes will take less time.
>>   3. if some indexes become inaccessible for whatever reason
>>   (corruption?), only those users are affected. Other users are
>>   unaffected and the service is available for them.
>>
>> B) Each user can have a few different types of data. We want to keep each
>> type in separate folders, for the same reasons as above.
>>
>> So, our index hierarchy will look something like:
>> /user1/type1/<index files>
>> /user1/type2/<index files>
>> /user2/type1/<index files>
>> /user3/type3/<index files>
>>
>> C) Often, probably with every iteration, we'll add "types" of data that
>> can be indexed.
>> So we want to have an efficient/programmatic way to add schemas for
>> different "types". We would like to avoid having a fixed schema for
>> indexing. I like Lucene's schema-less way of indexing stuff.
>>
>> D) The users can fire search queries which will search either:
>>   - Within a specific "type" for that user
>>   - Across all types for that user: in this case we want to fire a
>>     parallel query like Lucene has (ParallelMultiSearcher:
>>     http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html)
>>
>> E) We require real time update for the index. *This is a must.*
>>
>> F) We are planning to shard our index across multiple machines. For
>> this also, we want:
>> if a shard becomes inaccessible, only those users whose data resides
>> in that shard are affected. Other users get uninterrupted service.
>>
>> We were considering Lucene, Sphinx and Solr to do this. This is what we
>> found:
>>
>>   - Sphinx: No efficient way to do A, B, C, F. Or is there?
>>   - Lucene: Everything looks possible, as it is very low level. But we
>>   have to write wrappers to do F and build a communication layer between
>>   the web server and the search server.
>>   - Solr: Not sure if we can do A, B, C easily. Can we?
>>
>> So, my question is what is the best software for the above requirements? I
>> am inclined more towards Solr and then Lucene if we get all the
>> requirements.
>>
>> --
>> Regards
>> Shrinath.M
>>
>
>


-- 
Regards
Shrinath.M
