estimating memory needed for solr instances...

2008-07-09 Thread Preetam Rao
Hi,

Since we plan to share the same box among multiple Solr instances on a 16GB
RAM multi-core box, we need to estimate how much memory we need for our
application.

The index is 2.4GB on disk, with close to 3 million documents. The plan
is to use dismax queries with some fqs.
Since we do not sort the results, the sort will be by score, which rules out
the useFilterForSortedQuery option.
Thus, assuming all q's will use the query result cache and all fqs will use
the filter cache, the below is what I am thinking.

I would like to know how to relate the index size on disk to its in-memory
size.
Would it be safe to assume, given the disk size of 2.4GB, that we can budget
RAM for the whole index, plus 1GB for any other overhead, plus the cache size,
which comes to 150MB (calculation below)? That makes it around 4GB.

Cache size calculation -

query result cache - size = 50k;
since we paginate the results and each page has 10 items, and assuming each
user will at most see 3 pages per query,
we will set queryResultWindowSize to 30. Given this, 50k queries will
use up 50k * 30 bits = 187K, assuming results are stored in a bitset.

We use a few common fqs, let's say 200. Assuming each returns around 30k
documents, that adds 200 * 30k bits = 750K.

If we use a document cache of size 20k, assuming each document is around
5k at most, it will take up 20k * 5k = 100MB.

Thus we can grow the caches quite a bit more and still use up
only 150MB or less.

Is this reasoning on the caches correct?

Thanks
Preetam
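
For reference, the caches discussed above are configured in the <query> section
of solrconfig.xml. A rough sketch using the sizes assumed in this mail (the
element and class names follow the stock example config; the initialSize and
autowarmCount values here are only illustrative):

  <query>
    <filterCache      class="solr.LRUCache" size="200"   initialSize="200"   autowarmCount="50"/>
    <queryResultCache class="solr.LRUCache" size="50000" initialSize="10000" autowarmCount="1000"/>
    <documentCache    class="solr.LRUCache" size="20000" initialSize="20000"/>
    <queryResultWindowSize>30</queryResultWindowSize>
    <useFilterForSortedQuery>false</useFilterForSortedQuery>
  </query>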


Re: Implementing MoreLikeThis in Ruby

2008-07-09 Thread Neeti Raj
Hi Koji

Thanks for clarifying my understanding of MoreLikeThis.

After your response and reading the Solr wiki, I am now successfully using
MoreLikeThis as follows -

   - Using StandardRequestHandler. Removed MoreLikeThisHandler in
   solrconfig.xml.


   - Modified the query to be -
   http://localhost:8983/solr/select?q=BBT&mlt=true&mlt.fl=channel_name_t&mlt.mindf=1&mlt.mintf=1


   - as I am using acts_as_solr to talk to Solr, which generates dynamic
   fields, hence mlt.fl=channel_name_t


Thanks again for the guidance
Neeti

On Mon, Jul 7, 2008 at 8:40 PM, Koji Sekiguchi <[EMAIL PROTECTED]> wrote:

> Neeti,
>
> Do you know:
>
> There are two ways to access MoreLikeThis from solr: from the
> MoreLikeThisHandler
> or with the StandardRequestHandler.
> http://wiki.apache.org/solr/MoreLikeThis
>
> You set MoreLikeThisHandler in your solrconfig.xml:
>
> > 
> > 
> > channel_name
> > 1
> > 
> > 
>
> but you were using StandardRequestHandler in your request:
>
> >
> http://localhost:8983/solr/select?q=BBT&mlt=true&mlt.fl=channel_name&mlt.mindf=1&mlt.mintf=1
>
> If you want to use MoreLikeThisHandler you set in your solrconfig.xml,
> specify /mlt
> instead of /select in your request url.
>
> Koji
>
>
>
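
Koji's config snippet above lost its XML markup in the archive; a typical
MoreLikeThisHandler registration for the parameters he mentions would look
roughly like the following (a sketch, not necessarily his exact config):

  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
    <lst name="defaults">
      <str name="mlt.fl">channel_name</str>
      <int name="mlt.mindf">1</int>
    </lst>
  </requestHandler>

With a handler registered at /mlt, requests go to
http://localhost:8983/solr/mlt?q=BBT rather than /select, as Koji notes.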


Re: Solrj Exception

2008-07-09 Thread Akeel
Hi,
We found that some of the jars (like stax-api-xxx.jar, stax-utils.jar,
stax-xxx-dev.jar, commons-codec-xxx.jar) were missing from our application's
lib directory; adding these libraries resolved the exception.


On Tue, May 27, 2008 at 12:56 PM, Akeel <[EMAIL PROTECTED]> wrote:

> yes i have all the jar files (dependencies) in my classpath as there is not
> compile time exception neither there is any NoCalssDefExceptionError but
> there is *NoSuchMethodError* as shown below.
>
>
> On Tue, May 27, 2008 at 11:11 AM, Shalin Shekhar Mangar <
> [EMAIL PROTECTED]> wrote:
>
>> Make sure that you've included all dependencies for SolrJ in your
>> classpath. You can find all the dependencies inside dist/solrj-lib
>> folder in the binary distribution.
>>
>> On Mon, May 26, 2008 at 7:28 PM, Akeel <[EMAIL PROTECTED]> wrote:
>> > Following exception occurred while executing the line from within my
>> > application:
>> > *CommonsHttpSolrServer server = new CommonsHttpSolrServer("
>> > http://192.168.1.34:8983/solr/";);
>> > *(The class CommonsHttpSolrServer is defined in the solrj client)*
>> > *
>> > java.lang.NoSuchMethodError:
>> >
>> org.apache.commons.httpclient.HttpConnectionManager.getParams()Lorg/apache/commons/httpclient/params/HttpConnectionManagerParams;
>> >  at
>> >
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.setDefaultMaxConnectionsPerHost(CommonsHttpSolrServer.java:420)
>> >  at
>> >
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.(CommonsHttpSolrServer.java:123)
>> >  at
>> >
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.(CommonsHttpSolrServer.java:103)
>> >  at
>> >
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.(CommonsHttpSolrServer.java:79)
>> >
>> > Note: I am using nighlty build of 05-21-08 (because of solrj)
>> > Someone help me
>> >
>> > --
>> > Thanks and Regards,
>> > Akeel ur Rehman Faridee
>> > http://riseofpakistan.blogspot.com
>> > cell: 0321-4714151
>> > 
>> > When there is injustice in society, then everyone will go to politics
>> > Except the two kinds: those who are timid and those who are materialist
>> > (Aristotle)
>> > 
>> >
>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>
>
>
> --
> Thanks and Regards,
> Akeel ur Rehman Faridee
> http://riseofpakistan.blogspot.com
> cell: 0321-4714151
> 
> When there is injustice in society, then everyone will go to politics
> Except the two kinds: those who are timid and those who are materialist
> (Aristotle)
> 
>


Re: Automated Index Creation

2008-07-09 Thread Norberto Meijome
On Wed, 9 Jul 2008 08:48:35 +0530
"Shalin Shekhar Mangar" <[EMAIL PROTECTED]> wrote:

> Yes, SOLR-350 added that capability. Look at
> http://wiki.apache.org/solr/MultiCore for details.

ahh loving SOLR more every day :P

thx

_
{Beto|Norberto|Numard} Meijome

I used to hate weddings; all the Grandmas would poke me and
say, "You're next sonny!" They stopped doing that when i
started to do it to them at funerals.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.
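
For anyone following up, the MultiCore setup referenced above is driven by an
XML file in the Solr home directory (the file name and exact elements have
shifted during 1.3 development, so check the wiki page for the current form).
A minimal sketch of the solr.xml variant, with invented core names:

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0"/>
      <core name="core1" instanceDir="core1"/>
    </cores>
  </solr>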


Indexing xml data

2008-07-09 Thread Alexander Ramos Jardim
I need to put big xml files into a string field in one of my projects. Does
Solr accept it automatically, or should I put a <![CDATA[ ]]> around my xml
before putting it in the index?
-- 
Alexander Ramos Jardim


Re: Indexing xml data

2008-07-09 Thread Noble Paul നോബിള്‍ नोब्ळ्
You can put it into a 'string' field directly


On Wed, Jul 9, 2008 at 7:41 PM, Alexander Ramos Jardim
<[EMAIL PROTECTED]> wrote:
> I need to put big xml files on a string field in one of my projects. Does
> Solr accept it automatically or should I put a  on my xml before
> putting on the index?
> --
> Alexander Ramos Jardim
>



-- 
--Noble Paul
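
To illustrate what putting the XML in directly looks like when posting through
the XML update handler: the embedded markup has to be XML-escaped (or wrapped
in CDATA) so it doesn't break the <add> document. A sketch with made-up field
names; client libraries such as SolrJ normally do this escaping for you:

  <add>
    <doc>
      <field name="id">doc-1</field>
      <field name="payload">&lt;order&gt;&lt;item qty="2"&gt;wine&lt;/item&gt;&lt;/order&gt;</field>
    </doc>
  </add>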


Re: Indexing xml data

2008-07-09 Thread Norberto Meijome
On Wed, 9 Jul 2008 19:51:45 +0530
"Noble Paul _ __" <[EMAIL PROTECTED]> wrote:

> You can put it into a 'string' field directly

if we refer to the default string field, you won't be able to search the 
contents of the XML (unless you search for the whole thing), right? 

_
{Beto|Norberto|Numard} Meijome

Law of Conservation of Perversity: 
  we can't make something simpler without making something else more complex

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Indexing xml data

2008-07-09 Thread Noble Paul നോബിള്‍ नोब्ळ्
Yep, you can't search it. It is better to extract the data out and index
it if you want to search.

On Wed, Jul 9, 2008 at 8:37 PM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
> On Wed, 9 Jul 2008 19:51:45 +0530
> "Noble Paul _ __" <[EMAIL PROTECTED]> 
> wrote:
>
>> You can put it into a 'string' field directly
>
> if we refer to the  default string field , you won't be able to search for 
> the contents of the XML (unless you search for the whole thing),right?
>
> _
> {Beto|Norberto|Numard} Meijome
>
> Law of Conservation of Perversity:
>  we can't make something simpler without making something else more complex
>
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
> Reading disclaimers makes you go blind. Writing them is worse. You have been 
> Warned.
>



-- 
--Noble Paul


Re: Indexing xml data

2008-07-09 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Jul 9, 2008 at 8:46 PM, Noble Paul നോബിള്‍ नोब्ळ्
<[EMAIL PROTECTED]> wrote:
> yep. you cant search. It is better to extract the data out and index
> it if you want to search
>
> On Wed, Jul 9, 2008 at 8:37 PM, Norberto Meijome <[EMAIL PROTECTED]> wrote:
>> On Wed, 9 Jul 2008 19:51:45 +0530
>> "Noble Paul _ __" <[EMAIL PROTECTED]> 
>> wrote:
>>
>>> You can put it into a 'string' field directly
>>
>> if we refer to the  default string field , you won't be able to search for 
>> the contents of the XML (unless you search for the whole thing),right?
>>
>> _
>> {Beto|Norberto|Numard} Meijome
>>
>> Law of Conservation of Perversity:
>>  we can't make something simpler without making something else more complex
>>
>> I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
>> Reading disclaimers makes you go blind. Writing them is worse. You have been 
>> Warned.
>>
>
>
>
> --
> --Noble Paul
>



-- 
--Noble Paul


SOLR Timeout

2008-07-09 Thread McBride, John
Hello All,
 
Prior to SOLR 1.3 and the nutch patch integration - what actually is the effect of
a SOLR (non-)timeout?  Do the threads eventually die?  Does a new request cause a
new query thread to open, or is the system locked?
 
What causes a timeout - a complex query?
 
Is SOLR 1.2 open to DoS attacks by submitting complex queries?
 
Thanks,
John
 
 


Re: Indexing xml data

2008-07-09 Thread Alexander Ramos Jardim
Oh thanks.

I don't want to search on that. I will have a name field that contains the
unique identifier of the document.

2008/7/9 Noble Paul നോബിള്‍ नोब्ळ् <[EMAIL PROTECTED]>:

> On Wed, Jul 9, 2008 at 8:46 PM, Noble Paul നോബിള്‍ नोब्ळ्
> <[EMAIL PROTECTED]> wrote:
> > yep. you cant search. It is better to extract the data out and index
> > it if you want to search
> >
> > On Wed, Jul 9, 2008 at 8:37 PM, Norberto Meijome <[EMAIL PROTECTED]>
> wrote:
> >> On Wed, 9 Jul 2008 19:51:45 +0530
> >> "Noble Paul _ __" <
> [EMAIL PROTECTED]> wrote:
> >>
> >>> You can put it into a 'string' field directly
> >>
> >> if we refer to the  default string field , you won't be able to search
> for the contents of the XML (unless you search for the whole thing),right?
> >>
> >> _
> >> {Beto|Norberto|Numard} Meijome
> >>
> >> Law of Conservation of Perversity:
> >>  we can't make something simpler without making something else more
> complex
> >>
> >> I speak for myself, not my employer. Contents may be hot. Slippery when
> wet. Reading disclaimers makes you go blind. Writing them is worse. You have
> been Warned.
> >>
> >
> >
> >
> > --
> > --Noble Paul
> >
>
>
>
> --
> --Noble Paul
>



-- 
Alexander Ramos Jardim


Re: estimating memory needed for solr instances...

2008-07-09 Thread Ian Connor
There was a thread a while ago that suggested you just need to factor in
the index's total size (Mike Klaas, I think, was the author). It was
suggested that having the RAM is enough and the OS will cache the files as
needed to give you the performance boost needed.

If I misread the thread, please chime in - but it seems having enough
RAM is the key to performance.

On Wed, Jul 9, 2008 at 3:00 AM, Preetam Rao <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Since we plan to share the same box among multiple solr instances on a 16gb
> RAM multi core box, Need to estimate how much memory we need for our
> application.
>
> The index size is on disk  2.4G with close to 3 million documents. The plan
> is to use dismax query with some fqs.
> Since we do not sort the results, the sort will be by score which eliminates
> the option "userFiterFprSortedQuerries".
> Thus assuming all q's will use query result cache and all fqs will use
> filter caches the below is what i am thinking.
>
> I would like to know how to relate the index size on disk to its memory size
> ?
> Would it be safe to assume gven the disk size of 2.4g, that we can have ram
> size for whole index plus 1g for any other overhead plus the cache size
> which comes to 150MB  (calculation below). Thus making it around 4g.
>
> cache size calculation -
> 
> query result cache - size = 50K;
> since we paginate the results and each page has 10 items and assuming each
> user will at the max see 3 pages, per query
> we will set queryResultWindowSize to 30. Assuming this, for 50k querries we
> will use up 5* 30 bits = 187K asuming results are stored in bitset.
>
> we use few common fqs, lets say 200. Assuming each returns around 30k
> documents, it adds to 200 * 3 bits  = 750K.
>
> If we use document cache of size 20K, assuming each document size is around
> 5k at the max, it will take up 2 * 5= 100MB.
>
> Thus we can increase the cache more drastically and still it will use up
> only 150MB or less.
>
> Is this reasoning on cache's correct ?
>
> Thanks
> Preetam
>



-- 
Regards,

Ian Connor
82 Fellsway W #2
Somerville, MA 02145
Direct Line: +1 (978) 672
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Mobile Phone: +1 (312) 218 3209
Fax: +1(770) 818 5697
Suisse Phone: +41 (0) 22 548 1664
Skype: ian.connor


schema.xml compatibility

2008-07-09 Thread Teruhiko Kurosaka
I've noticed that schema.xml in the dev version of Solr spells
what used to be fieldtype as fieldType, with a capital T.

Are there any other compatibility issues between the would-be 
Solr 1.3 and Solr 1.2?

How soon will Solr 1.3 be available, by the way?


Basis Technology Corporation, San Francisco
T. "Kuro" Kurosaka


Re: estimating memory needed for solr instances...

2008-07-09 Thread Jacob Singh
My total guess is that indexing is CPU bound, and searching is RAM bound.

Best,
Jacob
Ian Connor wrote:
> There was a thread a while ago, that suggested just need to factor in
> the index's total size (Mike Klaas I think was the author). It was
> suggested having the RAM is enough and the OS will cache the files as
> needed to give you the performance boost needed.
> 
> If I misread the thread, please chime in - but it seems having enough
> RAM is the key to performance.
> 
> On Wed, Jul 9, 2008 at 3:00 AM, Preetam Rao <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> Since we plan to share the same box among multiple solr instances on a 16gb
>> RAM multi core box, Need to estimate how much memory we need for our
>> application.
>>
>> The index size is on disk  2.4G with close to 3 million documents. The plan
>> is to use dismax query with some fqs.
>> Since we do not sort the results, the sort will be by score which eliminates
>> the option "userFiterFprSortedQuerries".
>> Thus assuming all q's will use query result cache and all fqs will use
>> filter caches the below is what i am thinking.
>>
>> I would like to know how to relate the index size on disk to its memory size
>> ?
>> Would it be safe to assume gven the disk size of 2.4g, that we can have ram
>> size for whole index plus 1g for any other overhead plus the cache size
>> which comes to 150MB  (calculation below). Thus making it around 4g.
>>
>> cache size calculation -
>> 
>> query result cache - size = 50K;
>> since we paginate the results and each page has 10 items and assuming each
>> user will at the max see 3 pages, per query
>> we will set queryResultWindowSize to 30. Assuming this, for 50k querries we
>> will use up 5* 30 bits = 187K asuming results are stored in bitset.
>>
>> we use few common fqs, lets say 200. Assuming each returns around 30k
>> documents, it adds to 200 * 3 bits  = 750K.
>>
>> If we use document cache of size 20K, assuming each document size is around
>> 5k at the max, it will take up 2 * 5= 100MB.
>>
>> Thus we can increase the cache more drastically and still it will use up
>> only 150MB or less.
>>
>> Is this reasoning on cache's correct ?
>>
>> Thanks
>> Preetam
>>
> 
> 
> 



Re: schema.xml compatibility

2008-07-09 Thread Yonik Seeley
On Wed, Jul 9, 2008 at 7:13 PM, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:
> I've noticed that schema.xml in the dev version of Solr spells
> what used to be fieldtype as fieldType with capital T.
>
> Are there any other compatibility issues between the would-be
> Solr 1.3 and Solr 1.2?

It shouldn't be a compatibility issue since both will be accepted.
The xpath used to select fieldType nodes is
"/schema/types/fieldtype | /schema/types/fieldType"

> How soon Solr 1.3 will be available, by the way?

Hopefully soon... perhaps the end of the month.

-Yonik
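
In other words, a schema written either way keeps parsing; both of the
declarations below are accepted (illustrative field types only):

  <types>
    <!-- spelling used by the 1.2-era example schema -->
    <fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>
    <!-- spelling used by the current example schema -->
    <fieldType name="text_ws" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
  </types>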


nagios scripts for solr? other monitoring links?

2008-07-09 Thread Ryan McKinley
Is anyone out there using nagios to monitor solr?

I remember some discussion of this in the past around exposing
response handler timing info so it could play nice with nagios... did
anyone get anywhere with this?  want to share :)

Any other pointers to solr monitoring tools would be good too.

thanks
ryan


Re: estimating memory needed for solr instances...

2008-07-09 Thread Ian Connor
I would guess so too, up to a point. After you run out of RAM, indexing
also takes a hit. I have noticed on a 2GB machine that when the index gets
over 2GB, my indexing rate went down from 100/s to 40/s. After
reaching 4GB it was down to 10/s. I am trying now with an 8GB machine
to see how far I get through my data before slowing down.

On Wed, Jul 9, 2008 at 7:56 PM, Jacob Singh <[EMAIL PROTECTED]> wrote:
> My total guess is that indexing is CPU bound, and searching is RAM bound.
>
> Best,
> Jacob
> Ian Connor wrote:
>> There was a thread a while ago, that suggested just need to factor in
>> the index's total size (Mike Klaas I think was the author). It was
>> suggested having the RAM is enough and the OS will cache the files as
>> needed to give you the performance boost needed.
>>
>> If I misread the thread, please chime in - but it seems having enough
>> RAM is the key to performance.
>>
>> On Wed, Jul 9, 2008 at 3:00 AM, Preetam Rao <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> Since we plan to share the same box among multiple solr instances on a 16gb
>>> RAM multi core box, Need to estimate how much memory we need for our
>>> application.
>>>
>>> The index size is on disk  2.4G with close to 3 million documents. The plan
>>> is to use dismax query with some fqs.
>>> Since we do not sort the results, the sort will be by score which eliminates
>>> the option "userFiterFprSortedQuerries".
>>> Thus assuming all q's will use query result cache and all fqs will use
>>> filter caches the below is what i am thinking.
>>>
>>> I would like to know how to relate the index size on disk to its memory size
>>> ?
>>> Would it be safe to assume gven the disk size of 2.4g, that we can have ram
>>> size for whole index plus 1g for any other overhead plus the cache size
>>> which comes to 150MB  (calculation below). Thus making it around 4g.
>>>
>>> cache size calculation -
>>> 
>>> query result cache - size = 50K;
>>> since we paginate the results and each page has 10 items and assuming each
>>> user will at the max see 3 pages, per query
>>> we will set queryResultWindowSize to 30. Assuming this, for 50k querries we
>>> will use up 5* 30 bits = 187K asuming results are stored in bitset.
>>>
>>> we use few common fqs, lets say 200. Assuming each returns around 30k
>>> documents, it adds to 200 * 3 bits  = 750K.
>>>
>>> If we use document cache of size 20K, assuming each document size is around
>>> 5k at the max, it will take up 2 * 5= 100MB.
>>>
>>> Thus we can increase the cache more drastically and still it will use up
>>> only 150MB or less.
>>>
>>> Is this reasoning on cache's correct ?
>>>
>>> Thanks
>>> Preetam
>>>
>>
>>
>>
>
>


tagging application, best way to architect?

2008-07-09 Thread aris buinevicius
We're trying to implement a large-scale, domain-specific web email
application, and so far Solr's performance on the search side is really doing
well for us.

There are two limitations that I can't seem to get around however, and was
hoping for some advice.

1. We would like to do bulk tagging on large query result sets (i.e., if you
have 1M emails, do a search, and then wish to apply a tag to the result
set of, say, 250k results).   I've tried many approaches, but the closest
support I could see was the update-field functionality in SOLR-139.   Is
there any other way to keep the very dynamic metadata (tags and other
fields) abstracted away from the static documents themselves?   I've
researched joining against a metadata database, but unfortunately the join
logic for large results is just too heavy to perform well at scale.
We have also looked at postgres tsearch2, but that also breaks down with a
large number of emails.

2. We're assuming we'll have thousands of users with independent data; is
there any good way to partition multiple indexes with Solr?   With Lucene we
could just save those in independent directories and cache the index while the
user session is active.   I saw some configurations on tomcat that would
allow multiple instances, but that's probably not practical for lots of
concurrent users.

Thanks for any tips; we would love to use Solr (or Lucene), but haven't been
able to get around issue 1 yet for large numbers of emails with reasonable
response times.   We've really looked at the gamut here, including solr, lucene,
postgres (tsearch2), sphinx, xapian, couchdb(!), and more.

ab


Re: tagging application, best way to architect?

2008-07-09 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Thu, Jul 10, 2008 at 7:53 AM, aris buinevicius <[EMAIL PROTECTED]> wrote:
> We're trying to implement a large scale domain specific web email
> application, and so far solr performance on the search side is really doing
> well for us.
>
> There are two limitations that I can't seem to get around however, and was
> hoping for some advice.
>
> 1. We would like to do bulk tagging on large query result sets (ie, if you
> have 1M emails, do a search, and then you wish to apply a tag to the result
> set of, say, 250k results).   I've tried many approaches, but the closest
> support I could see was the update field functionality in SOLR-139.   Is
> there any other way to separate the very dynamic metadata (tags and other
> fields) abstracted away from the static documents themselves?   I've
> researched joining against a metadata database, but unfortunately the join
> logic for large results is just too bulky to be perform well at scale.
> Also have even looked at postgres tsearch2, but that also breaks down with a
> large number of emails.
Updating a large number of docs in one go is a bit expensive. SOLR-139 is
trying to achieve that, but it is still expensive. If the users do not
tag the docs too often then it may be OK.
>
> 2. We're assuming we'll have thousands of users with independent data; any
> good way to partition multiple indexes with solr?   With Lucene we could
> just save those in independent directories, and cache the index while the
> user session is active.   I saw some configurations on tomcat that would
> allow multiple instances, but that's probably not practical for lots of
> concurrent users.
Maintaining multiple indices is not a good idea. Add an extra
attribute 'userid' to each document and search with user id as a 'fq'.
The caches in Solr will automatically take care of the rest.
>
> Thanks for any tips; would love to use Solr (or Lucene), but haven't been
> able to get around issue 1 yet for large numbers of emails in a timely
> response.   We've really looked at the gamut here, including solr, lucene,
> postgres (tsearch2), sphinx, xapian, couchdb(!), and more.
>
> ab
>



-- 
--Noble Paul
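
A sketch of the single-index approach Noble describes, with invented field
names and values: declare a userid field in schema.xml,

  <field name="userid" type="string" indexed="true" stored="true" required="true"/>

and constrain every search with a filter query on it, e.g.

  http://localhost:8983/solr/select?q=quarterly+report&fq=userid:12345

Since fq results are cached in the filterCache, repeated searches by the same
user reuse the cached document set.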


Re: schema.xml compatibility

2008-07-09 Thread Chris Hostetter

: > Are there any other compatibility issues between the would-be
: > Solr 1.3 and Solr 1.2?
: 
: It shouldn't be a compatibility issue since both will be accepted.

Note that the example configs tend to represent the latest/greatest syntax 
& features, but existing configs should generally continue to work as is 
when upgrading -- if any changes are made in Solr that might impact 
people with old configs when upgrading, a special note will be found in the 
"Upgrading from Solr 1.XXX" section of the release notes (such as the 
comment about the  directive in the "Upgrading from Solr 1.2" 
section, and about the json.nl=map option for JSON in "Upgrading from Solr 
1.1").



-Hoss



Re: tagging application, best way to architect?

2008-07-09 Thread Norberto Meijome
On Thu, 10 Jul 2008 09:36:01 +0530
"Noble Paul _ __" <[EMAIL PROTECTED]> wrote:

> > 2. We're assuming we'll have thousands of users with independent data; any
> > good way to partition multiple indexes with solr?   With Lucene we could
> > just save those in independent directories, and cache the index while the
> > user session is active.   I saw some configurations on tomcat that would
> > allow multiple instances, but that's probably not practical for lots of
> > concurrent users.  
> Maintaining multiple indices is not a good idea. Add an extra
> attribute 'userid' to each document and search with user id as a 'fq'.
> The caches in Solr will automatically take care of the rest.
> >

I have been pondering something similar to this for some of the stuff I'm
working on.

Intuitively, keeping independent indices doesn't look too good. But if you
split your setup (i.e., 2 different clusters if need be), having one index for
the information that doesn't change often (email body, from, to, date,
headers?) + message id (or id = concat(message_id, userid)), then you can
have a separate index for the metadata of the documents in the first index.

Every time you have updates to the mail metadata you handle them in the
second index (not sure if this 2nd index would be the definitive storage of
metadata for mails, or if it's stored in your mail app and you extract and
index into SOLR afterwards).

There is of course the new issue of scrubbing the 2nd index when emails are
removed from your system, but I don't imagine it being terribly complex.

This way, you can do away with SOLR-139 until it is stable enough and scales as
needed, or altogether - not sure how well -139 will progress.

Regarding the OP's question about how to partition the data across thousands of
users, you should be able to use
http://wiki.apache.org/solr/DistributedSearch, or set up different clusters,
each with distributed searchers, using the userid to decide which
cluster you'll search in (hash(userid) would give you an even distribution
across all clusters).

Thoughts? 
B
_
{Beto|Norberto|Numard} Meijome

Q. How do you make God laugh?
A. Tell him your plans.

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.
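
For the partitioning idea above, the DistributedSearch page boils down to
adding a shards parameter listing the Solr instances to query; a sketch with
invented hosts:

  http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q=report&fq=userid:12345

The hash(userid) scheme Norberto mentions would then pick which cluster (which
set of shards) a given user's requests are sent to.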


Re: estimating memory needed for solr instances...

2008-07-09 Thread Preetam Rao
Thanks for the responses, Ian, Jacob.

While I could not locate the previous thread, this is what I understand:

While we can fine-tune the cache parameters and other things which we can
directly control, with respect to the index files the key is to give enough RAM
and let the OS do its best at keeping the index files in memory.

--
Preetam

On Thu, Jul 10, 2008 at 7:12 AM, Ian Connor <[EMAIL PROTECTED]> wrote:

> I would guess so also to a point. After you run out of RAM, indexing
> also takes a hit. I have noticed on a 2Gb machine when the index gets
> over 2Gb, my indexing rate when down from 100/s to 40/s. After
> reaching 4Gb it was down to 10/s. I am trying now with a 8Gb machine
> to see how far I get through my data before slowing down.
>
> On Wed, Jul 9, 2008 at 7:56 PM, Jacob Singh <[EMAIL PROTECTED]> wrote:
> > My total guess is that indexing is CPU bound, and searching is RAM bound.
> >
> > Best,
> > Jacob
> > Ian Connor wrote:
> >> There was a thread a while ago, that suggested just need to factor in
> >> the index's total size (Mike Klaas I think was the author). It was
> >> suggested having the RAM is enough and the OS will cache the files as
> >> needed to give you the performance boost needed.
> >>
> >> If I misread the thread, please chime in - but it seems having enough
> >> RAM is the key to performance.
> >>
> >> On Wed, Jul 9, 2008 at 3:00 AM, Preetam Rao <[EMAIL PROTECTED]>
> wrote:
> >>> Hi,
> >>>
> >>> Since we plan to share the same box among multiple solr instances on a
> 16gb
> >>> RAM multi core box, Need to estimate how much memory we need for our
> >>> application.
> >>>
> >>> The index size is on disk  2.4G with close to 3 million documents. The
> plan
> >>> is to use dismax query with some fqs.
> >>> Since we do not sort the results, the sort will be by score which
> eliminates
> >>> the option "userFiterFprSortedQuerries".
> >>> Thus assuming all q's will use query result cache and all fqs will use
> >>> filter caches the below is what i am thinking.
> >>>
> >>> I would like to know how to relate the index size on disk to its memory
> size
> >>> ?
> >>> Would it be safe to assume gven the disk size of 2.4g, that we can have
> ram
> >>> size for whole index plus 1g for any other overhead plus the cache size
> >>> which comes to 150MB  (calculation below). Thus making it around 4g.
> >>>
> >>> cache size calculation -
> >>> 
> >>> query result cache - size = 50K;
> >>> since we paginate the results and each page has 10 items and assuming
> each
> >>> user will at the max see 3 pages, per query
> >>> we will set queryResultWindowSize to 30. Assuming this, for 50k
> querries we
> >>> will use up 5* 30 bits = 187K asuming results are stored in bitset.
> >>>
> >>> we use few common fqs, lets say 200. Assuming each returns around 30k
> >>> documents, it adds to 200 * 3 bits  = 750K.
> >>>
> >>> If we use document cache of size 20K, assuming each document size is
> around
> >>> 5k at the max, it will take up 2 * 5= 100MB.
> >>>
> >>> Thus we can increase the cache more drastically and still it will use
> up
> >>> only 150MB or less.
> >>>
> >>> Is this reasoning on cache's correct ?
> >>>
> >>> Thanks
> >>> Preetam
> >>>
> >>
> >>
> >>
> >
> >
>


Re: Certain form of autocomplete (like Google Suggest)

2008-07-09 Thread Chris Hostetter

: Now I'd like to know what would be the best way to implement a search
: term autocompletion in the way of Google Suggest
: (http://www.google.com/webhp?complete=1&hl=en).
: 
: Most autocomplete implementations aim to display search result entries
: during input. What Suggest does, and what I'd like to accomplish, is
: an automatic suggestion of relevant index terms. This would help users

you'll find a few discussions about this in the archives...

http://www.nabble.com/forum/Search.jtp?forum=14479&local=y&query=autocomplete

It's not something I've personally built, but as I recall the general consensus 
in the past has been to use a custom index where each doc corresponds to a 
word/phrase you want to suggest.



-Hoss



Re: Certain form of autocomplete (like Google Suggest)

2008-07-09 Thread Yonik Seeley
Would facet.prefix work for you?

-Yonik

On Fri, Jul 4, 2008 at 4:58 AM, Marian Steinbach <[EMAIL PROTECTED]> wrote:
> Hi all!
>
> I just startet evaluating Solr a few days ago and I'm quite happy with
> the way it works. The test project I am using on is a product search
> for a wine shop with 2500 articles and about 20 fields, with faceted
> search.
>
> Now I'd like to know what would be the best way to implement a search
> term autocompletion in the way of Google Suggest
> (http://www.google.com/webhp?complete=1&hl=en).
>
> Most autocomplete implementations aim to display search result entries
> during input. What Suggest does, and what I'd like to accomplish, is
> an automatic suggestion of relevant index terms. This would help users
> to prevent spelling problems, which are a huge issue in the domain of
> wine, where almost every other term is french.
>
> Szenario:
>
> 1) The user types "sa" into a query input field.
> 2) The system searches for the 10 most frequent index terms starting
> with "sa" and displays the result in a menu.
> 3) The user adds a 3rd character => input is now "sau"
> 4) The system searches for the 10 most frequent index terms starting
> with "sa" and displays the result in a menu.
> 5) The user clicks on "sauvignon" in the menu and the term in the
> input field is completed to "sauvignon".
>
> So, what I need technically is the (web service) query that delivers
> all index terms (for specific index fields) starting with a certain
> prefix. The result should be ordered by frequency and limited to a
> certain amount of entries.
>
> Is this functionality already available in the Solr core?
>
> It seems as if "Schema Browser" functionality of the luke webapp (part
> of the nightly build) does something similar, but I can't find out how
> to limit the term lists to match the requirements above.
>
> I have to mention that I'm not an experienced Java developer. :)
>
> Thanks for your help!
>
> Marian
>
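
Spelled out, Yonik's facet.prefix suggestion would be a request along these
lines (a sketch; the field name and prefix are invented, and the suggestions
come back as facet counts ordered by frequency):

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=suggest_terms&facet.prefix=sau&facet.limit=10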


Re: Certain form of autocomplete (like Google Suggest)

2008-07-09 Thread Walter Underwood
For capacity planning, our autocomplete gets more than 10X as many
requests as our search. Solr can handle our search just fine, but
I wrote an in-memory prefix match to handle the 25-30M autocomplete
matches each day. I load that by doing Solr queries, so the two
stay in sync.

wunder

On 7/9/08 9:59 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : Now I'd like to know what would be the best way to implement a search
> : term autocompletion in the way of Google Suggest
> : (http://www.google.com/webhp?complete=1&hl=en).
> : 
> : Most autocomplete implementations aim to display search result entries
> : during input. What Suggest does, and what I'd like to accomplish, is
> : an automatic suggestion of relevant index terms. This would help users
> 
> you'll find a few discussions about this in the archives...
> 
> http://www.nabble.com/forum/Search.jtp?forum=14479&local=y&query=autocomplete
> 
> It's not something i've personally built, but as I recal general concensus
> in the past has been to use a custom index where each doc corrisponds to a
> word/phrase you want to suggest.
> 
> -Hoss




How / Does commit work?

2008-07-09 Thread Jacob Singh
Hi,

I'm trying to get replication working, and it's failing because commit
refuses to work (at least as I understand it).

I run commit and point it to the update URL.  I know the URL is correct,
because solr returns something to me:

commit request to Solr at http://solr.solrflare.com:8080/solr/ai5/update
failed:
  04 


Oddly, the code in commit which generates this error is:

echo $rs | grep '<result.*status="0"' > /dev/null 2>&1

Is the code wrong?  Should it grep for the int name="status" node being 0?

Also, from my understanding, commit should also generate a snapshot, but
this doesn't happen.  That is, I update nodes, but I don't get any
snapshots.  If I run snapshooter manually, it works fine (other than that I
can't install it that way, because the slave calls commit on the master via
SSH).

Thanks for your help as always.  Please let me know if I should write a
patch for the first thing.

Best,
Jacob
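
For what it's worth, the status check in the 1.2-era commit script expects the
older update response format, while a 1.3 nightly answers in the newer
responseHeader style, so the grep can report failure even when the commit
itself succeeded. Roughly, the two shapes are (values illustrative):

  <result status="0"></result>

versus

  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">4</int>
    </lst>
  </response>

The "04" in the output quoted above looks like the text content of those two
ints with the markup stripped, which would suggest the commit itself returned
status 0.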