Question: index performance

2007-04-13 Thread James liu

I find I get an OutOfMemoryError when I index more than 10k records.

So now I index 10k records at a time (5k per record).

If I use a loop to index more data, it always throws OutOfMemoryError.


I use top to monitor, and after the index finishes, free memory is 125m,
and sometimes it is 218m.

Does Solr still use some of the free memory after it finishes indexing?


How can I index more than 10k records without being stopped by OutOfMemoryError?

I have given Tomcat 512m of memory.


--
regards
jl


Re: Results per user

2007-04-13 Thread Grant Ingersoll
I don't use Filters very much so this might be a dumb question, but I
could overcome the main drawback by hooking into the filter and updating
its bits without affecting the caching, right?


I kind of think I have scaling issues no matter what.  If you do it the
post-processing way, then you may have to make repeated fetches to Solr
in order to get enough results to display.


I think I may have to dig a bit deeper into both approaches.

On Apr 12, 2007, at 7:41 PM, Chris Hostetter wrote:



: > results that are filtered on a per user basis, for instance to remove
: > results that have already been viewed.  I know I could post process
: > the results from Solr to do this, but am wondering if a better
: > solution is to implement my own request handler that takes in user id
: > info and manages a cache of Filters that maintains the bit set info
: > on the search side.  Is this a good approach?

: One issue with your approach would be scaling... if you have multiple
: searchers, how do you communicate this user data between them?

If the filtering logic can be implemented in a Filter class, you might
just want to rely on the built-in filterCache (you'd still need a custom
request handler that knows about your custom Filter).

The plus side is you'd get all the benefits of Solr's filter cache (cached
as long as the same searcher is used, autowarmed when a new searcher is
opened).

The down side is you'd get all the benefits of Solr's filter cache (cached
as long as the same searcher is used -- so it wouldn't notice if you'd
updated your datastore to remove a bunch of files from their filter).





-Hoss






Re: Sort on multiple fields not working?

2007-04-13 Thread karl wettin


On 12 Apr 2007, at 17.06, Yonik Seeley wrote:


Sorting works on indexed tokens, and hence doesn't really work on
analyzed fields that produce more than one token per document.  I
suspect your title field falls into that category.  You could also
index the title field into another field that is indexed as a string
(non-tokenized), but that might take up a lot of memory if you have
long titles.


It just hit me (and I did not consider it any further) that perhaps one
could store String.valueOf(theTitle.hashCode()) in an alternative field
and sort by that instead? It will not be 100% accurate, but in most
cases it will be. However, I'm not sure how negative values would be
handled. If that were a problem, one could convert the integer to an
alphanumeric form. That should also save a bunch of memory.


--
karl


Re: Question: index performance

2007-04-13 Thread Yonik Seeley

On 4/13/07, James liu <[EMAIL PROTECTED]> wrote:

I find I get an OutOfMemoryError when I index more than 10k records.

So now I index 10k records at a time (5k per record).


In one request?  There's really no reason to put more than hundreds of
documents in a single add request.

If you are indexing using multiple requests, and always run into
problems at 10k records, you are probably hitting memory issues with
Lucene merging.  If that's the case, try lowering the mergeFactor so
fewer segments will be merged at the same time.

Some other things to be careful of:
- don't call commit after you add every batch of documents
- don't set maxBufferedDocs too high if you don't have the memory
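
Both of those settings (mergeFactor, maxBufferedDocs) live in the
<indexDefaults> section of solrconfig.xml; a minimal sketch, with
illustrative values:

  <indexDefaults>
    <mergeFactor>10</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
  </indexDefaults>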

-Yonik


Re: Sort on multiple fields not working?

2007-04-13 Thread Yonik Seeley

On 4/13/07, karl wettin <[EMAIL PROTECTED]> wrote:

It just hit me (and I did not consider it any further) that perhaps
one could store String.valueOf(theTitle.hashCode()) in an alternative
field and sort by that instead? It will not be 100% accurate, but in
most cases it will be.


That would only mostly work for titles around 5 characters long,
right?  It seems like after that, the correlation between hashCode and
sort order breaks down almost immediately since you lose the leftmost
hash bits.
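
A quick way to see this (the titles are made up, and the printed values
are whatever your JVM computes -- the point is only that the ordering of
the two hash codes is unrelated to the ordering of the strings):

  public class HashOrderCheck {
      public static void main(String[] args) {
          // Differences in the leading characters are multiplied by large
          // powers of 31, which wrap around in 32-bit int arithmetic, so
          // hashCode order stops tracking lexicographic order for all but
          // very short strings.
          String a = "alpha: some fairly long title";
          String b = "bravo: some fairly long title";
          System.out.println(a.hashCode() + "  " + a);
          System.out.println(b.hashCode() + "  " + b);
          // a.compareTo(b) < 0 is guaranteed; a.hashCode() < b.hashCode()
          // is not -- the sign depends on how the multiplications wrapped.
      }
  }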

-Yonik


Re: Sort on multiple fields not working?

2007-04-13 Thread karl wettin


On 13 Apr 2007, at 15.48, Yonik Seeley wrote:


On 4/13/07, karl wettin <[EMAIL PROTECTED]> wrote:

It just hit me (and I did not consider it any further) that perhaps
one could store String.valueOf(theTitle.hashCode()) in an alternative
field and sort by that instead? It will not be 100% accurate, but in
most cases it will be.


That would only mostly work for titles around 5 characters long,
right?  It seems like after that, the correlation between hashCode and
sort order breaks down almost immediately since you lose the leftmost
hash bits.


That might be true; as I said, I didn't really think about it for too
long. But some alternative hashCode could probably be implemented, one
that uses all available bits of the string rather than being limited to
the 32 bits of an integer.


--
karl


Re: Question: index performance

2007-04-13 Thread galo

Hi there,

I'm building an index to which I'm sending a few hundred thousand
entries. I pull them off the database in batches of 25k and send them to
Solr, 100 documents at a time. I was doing a commit after each group of
100, but after what Yonik says I will remove that and commit only after
each 25k batch.
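
For reference, each such post is a plain XML add command against /update
(the field names here are made up), with a single <commit/> posted
separately once the whole batch is in:

  <add>
    <doc>
      <field name="id">1001</field>
      <field name="title">first entry</field>
    </doc>
    <doc>
      <field name="id">1002</field>
      <field name="title">second entry</field>
    </doc>
    <!-- ... up to ~100 docs per request ... -->
  </add>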


Q1: I now have autocommit set to 1000 in solrconfig.xml. Should I
disable it in this scenario?


Q2: To decide which of those 25k are going to be indexed, we need to do
a query for each (this is the main reason to optimize before a new DB
batch is indexed). Each of these 25k queries takes around 30ms, which is
good enough for us, but I've observed that every ~30 queries the time of
one search jumps to 150ms or even 1200ms; then another ~30 run fast, and
so on. I guess something happens inside the server at regular intervals
that causes it. Any clues what it could be and how I can minimize that
time?


Q3: The 25k searches run without any cumulative effect on performance
(avg/search is ~30ms from start to end). But if, immediately after, I
start posting documents to the index, Tomcat's CPU usage spikes. If I
restart Tomcat and then post the 25k documents without first running
those searches, they index very quickly. Is there any reason the
searches would affect Tomcat like this? Just to clarify, searches are
NOT run at the same time as indexing.


My tomcat is running with -server -Xmx512m -Xms512m

Cheers,

galo

Yonik Seeley wrote:

On 4/13/07, James liu <[EMAIL PROTECTED]> wrote:

I find I get an OutOfMemoryError when I index more than 10k records.

So now I index 10k records at a time (5k per record).


In one request?  There's really no reason to put more than hundreds of
documents in a single add request.

If you are indexing using multiple requests, and always run into
problems at 10k records, you are probably hitting memory issues with
Lucene merging.  If that's the case, try lowering the mergeFactor so
fewer segments will be merged at the same time.

Some other things to be careful of:
- don't call commit after you add every batch of documents
- don't set maxBufferedDocs too high if you don't have the memory

-Yonik



Re: Schema validator/debugger

2007-04-13 Thread Andrew Nagy

Yonik Seeley wrote:

Oh wait... Andrew, were you always testing via "ping"?

Check out what the ping query is configured as in solrconfig.xml:

   
   <pingQuery>
    qt=dismax&amp;q=solr&amp;start=3&amp;fq=id:[* TO *]&amp;fq=cat:[* TO *]
   </pingQuery>

Perhaps we should change it to something simple by default???  "q=solr"?

That solves the Jetty failure mystery... so it looks like you either
have a tomcat setup problem, or a Solr bug that only shows under
tomcat.


Yes, this is the problem!  Good catch :)  I have been testing via ping.

However, this still does not solve my original problem ... I will dig a
bit more and see what I can find.


Thanks
Andrew


Embedding Solr vs Lucene, multiple Solr cores?

2007-04-13 Thread Henrib

I'm trying to choose between embedding Lucene versus embedding Solr in one
webapp.

In Solr terms, functional requirements would more or less lead to multiple
schema & conf (need CRUD/generation on those) and deployment constraints
imply one webapp instance. The choice I'm trying to make is thus:
-Embed Lucene and (attempt to) recode a lot of what Solr provides... (the
straw man)
-Embed Solr but refactor 'some' of its core, assuming it is correct to see
one Solr core as the association of one schema & one conf.

There have been a few threads about multiple indexes and/or
multiple/reloading schemas.
From what I gathered, one solution stems from the 'multiple webapp instances
deployment' and implies 'extracting' the static instance (at least the
SolrCore) & thus host multiple Solr cores in one webapp.

Obviously, the operations (queries/add/delete doc) would need to carry which
core they are targeting (one 'core' being set as the 'default' for
compatibility purposes).
What will be the other big hurdles, the ones that could even preclude the
very idea? (cache handling, updater threads, HA features...)
Are there any easier routes (class-loaders, a 'provisional' schema...)?

Any advice welcome. Thanks.
Henri





Deploying Solr with Jetty

2007-04-13 Thread Cody Caughlan

First off, I am aware that the bulk of this question has to do with
Jetty, but please have kindness...

My end goal is to have a handful of Solr instances running under Jetty
all accessible at

/app1
/app2
...etc...

I have taken the .war file in the Solr dist/ directory, unpacked it
and added in a solr/ dir with the bin/, conf/ sub-dirs, etc. I then
zipped this back up, with a .war extension.

I placed this .war in my jetty webapps/

However, when I start Jetty, it unpacks the war and then tries to start
the app, but it bitches about not finding solrconfig.xml and ends up not
starting that app.

So my question is, where do I place the solr/ directory so it will be
found by the app? If I place it in my root Jetty dir, that config will
apply to all my Solr instances, which is not what I want because they
all need to have different indexes, etc.

The "Multiple Solr instances for Jetty" on the Solr wiki uses outdated
syntax. But it is a step in the right direction because it specifies a
JNDI entry for solr/home for each web-app.
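
That wiki approach boils down to a per-context JNDI EnvEntry; roughly,
for Jetty 6 (the paths and context name are made up, and the exact class
name and argument list vary across Jetty versions, so treat this as a
sketch):

  <Configure class="org.mortbay.jetty.webapp.WebAppContext">
    <Set name="contextPath">/app1</Set>
    <Set name="war"><SystemProperty name="jetty.home"/>/webapps/app1.war</Set>
    <!-- a per-webapp solr/home, so each instance gets its own index -->
    <New id="solrHome" class="org.mortbay.jetty.plus.naming.EnvEntry">
      <Arg>solr/home</Arg>
      <Arg type="java.lang.String">/opt/solr/app1</Arg>
      <Arg type="boolean">true</Arg>
    </New>
  </Configure>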

I guess I am just having basic Jetty config issues.

Like I said, I know this question has more to do with Jetty than Solr,
but could someone point me in the right direction (besides saying
"join the Jetty list")?

Thanks
/cody


Re: Results per user

2007-04-13 Thread Chris Hostetter

: I don't use Filters very much so this might be a dumb question, but I
: could overcome the main drawback by hooking into the filter and
: updating its bits without affecting the caching, right?

Not really ... Solr doesn't use Filters the same way
CachingWrapperFilter does ... it builds DocSets out of them and caches
those for the life of the IndexSearcher (or until the cache gets full and
it needs to expunge something). When a new IndexSearcher is opened, it
auto-warms the new filterCache by executing the existing Filters against
the new IndexSearcher.
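
That filterCache is sized and autowarmed via solrconfig.xml; the stock
entry looks roughly like this (sizes illustrative):

  <filterCache
    class="solr.LRUCache"
    size="512"
    initialSize="512"
    autowarmCount="256"/>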

: I kind of think I have scaling issues no matter what.  If you do the
: post processing way, then you may have to make repeated fetches to
: Solr in order to get enough results to display.

anything you can do on the client side you can do in a custom request
handler (assuming you can do it in Java) so that will at least save you
the overhead of HTTP back and forth with the Solr server ... I was just
trying to think of ways that existing features available to
SolrRequestHandlers could help you more.




-Hoss



Re: Sort on multiple fields not working?

2007-04-13 Thread Chris Hostetter

: That might be true; as I said, I didn't really think about it for too
: long. But some alternative hashCode could probably be implemented, one
: that uses all available bits of the string rather than being limited to
: the 32 bits of an integer.

if you're going to use all the bits in the string, and not confine
yourself to an integer, how is that different from sorting on the string
itself?

(either way you still need a single value per doc per sort field -- and
can't use a tokenized field, so you use copyField)
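
In schema.xml terms that pattern is (field names made up):

  <field name="title"      type="text"   indexed="true" stored="true"/>
  <field name="title_sort" type="string" indexed="true" stored="false"/>
  <copyField source="title" dest="title_sort"/>

and queries then sort on title_sort rather than title.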


-Hoss



Re: Embedding Solr vs Lucene, multiple Solr cores?

2007-04-13 Thread Ryan McKinley

On 4/13/07, Henrib <[EMAIL PROTECTED]> wrote:


I'm trying to choose between embedding Lucene versus embedding Solr in one
webapp.

In Solr terms, functional requirements would more or less lead to multiple
schema & conf (need CRUD/generation on those) and deployment constraints
imply one webapp instance.


Do you really need multiple schemas?  Multiple indexes?  This has
been posted many times (I thought I needed it too!) - it turns out
most cases can easily be taken care of by putting multiple document
types in the same index and including a "type" field.  You could have
a single schema with common names for shared fields, or one that has a
prefix for each type: either "title" or "typeA_title", "typeB_title".
The common-name approach makes it easier to search across types.
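
A sketch of the common-field variant (all names made up):

  <field name="id"    type="string" indexed="true" stored="true"/>
  <field name="type"  type="string" indexed="true" stored="true"/>
  <field name="title" type="text"   indexed="true" stored="true"/>

Queries are then restricted to one document type with a filter such as
fq=type:typeA.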



-Embed Solr but refactor 'some' of its core, assuming it is correct to see
one Solr core as the association of one schema & one conf.


If you absolutely need multiple indexes, it will probably be easier to
fudge the single-webapp requirement than to refactor Solr to remove
the static singleton SolrCore.getSolrCore().

ryan


Re: Embedding Solr vs Lucene, multiple Solr cores?

2007-04-13 Thread Tom Hill

Hi -

Of the various approaches that you could take, the one I'd work on first is:


deployment constraints imply one webapp instance.


In most environments, it's going to cost a lot less to change this than
to roll your own or extensively modify Solr.

I know I'm sidestepping your stated requirements, but I'd take a long look
at that one.

BTW, we cut over from an embedded Lucene instance to Solr about 4 months
ago, and are very happy that we did.

Tom

On 4/13/07, Henrib <[EMAIL PROTECTED]> wrote:



I'm trying to choose between embedding Lucene versus embedding Solr in one
webapp.

In Solr terms, functional requirements would more or less lead to multiple
schema & conf (need CRUD/generation on those) and deployment constraints
imply one webapp instance. The choice I'm trying to make is thus:
-Embed Lucene and (attempt to) recode a lot of what Solr provides... (the
straw man)
-Embed Solr but refactor 'some' of its core, assuming it is correct to see
one Solr core as the association of one schema & one conf.

There have been a few threads about multiple indexes and/or
multiple/reloading schemas.
From what I gathered, one solution stems from the 'multiple webapp instances
deployment' and implies 'extracting' the static instance (at least the
SolrCore) & thus host multiple Solr cores in one webapp.

Obviously, the operations (queries/add/delete doc) would need to carry which
core they are targeting (one 'core' being set as the 'default' for
compatibility purposes).
What will be the other big hurdles, the ones that could even preclude the
very idea? (cache handling, updater threads, HA features...)
Are there any easier routes (class-loaders, a 'provisional' schema...)?

Any advice welcome. Thanks.
Henri






Re: Sort on multiple fields not working?

2007-04-13 Thread karl wettin


On 13 Apr 2007, at 20.11, Chris Hostetter wrote:



: That might be true; as I said, I didn't really think about it for too
: long. But some alternative hashCode could probably be implemented, one
: that uses all available bits of the string rather than being limited to
: the 32 bits of an integer.

if you're going to use all the bits in the string, and not confine
yourself to an integer, how is that different from sorting on the
string itself?


Smaller string values do not consume as much memory?

I might not understand your question.

--
karl




Re: Embedding Solr vs Lucene, multiple Solr cores?

2007-04-13 Thread Henrib

Thank you both for your quick answers.

The one webapp constraint comes from the main 'embedding' application so I
don't have much leeway there. The direct approach was to map the
main/hosting application document collection & types to one schema/conf.
Since the host collections & types can be dynamically created, this seemed
the natural route (albeit hard).

The longer story is that in our typical customer environments, IT deploys &
monitors webapps (provisioning space et al., replicating for disaster
recovery, etc.) but does not want to deal with the application itself,
leaving the 'business users' side to administer it. Even if there is a
dedicated Tomcat for the main app, IT will not let the 'business users'
install other applications (scope of responsibility, code versus data,
validation procedures, etc.). Thus the 'one application' constraint.

Anyway, it seems a 'provisional' schema where most fields would be dynamic,
plus some notational convention to map them, would be the easiest route,
replacing the separate indexes with equivalent filters. I gather from your
input that the potential functionality loss and/or performance hit is not
something I should be afraid of.

For the sake of completeness: instead of embedding Solr in that single
instance, I thought about using several Solr instances running in different
webapp instances & using them as 'coprocessors' for the main application;
this would imply serializing/deserializing/redirecting queries & results
between webapps, which is not the most efficient way on a single host/VM
environment (maybe Tomcat cross-context could help alleviate that). But this
would also require dynamically deploying webapps for that purpose, which is
a no-no from IT...

For the sake of argument :-), besides the SolrCore singleton, which is easy
to circumvent (a map of cores & at least a pointer from the instantiated
schema to the core handling it), are there other singletons hiding
(Config.config, caches...) that would preclude the multiple-core track?

Thanks again
Henri


Tom Hill-6 wrote:
> 
> Hi -
> 
> Of the various approaches that you could take, the one I'd work on first is:
> 
>> deployment constraints imply one webapp instance.
> 
> In most environments, it's going to cost a lot less to change this than
> to roll your own or extensively modify Solr.
> 
> I know I'm sidestepping your stated requirements, but I'd take a long look
> at that one.
> 
> BTW, we cut over from an embedded Lucene instance to Solr about 4 months
> ago, and are very happy that we did.
> 
> Tom
> 
> On 4/13/07, Henrib <[EMAIL PROTECTED]> wrote:
>>
>>
>> I'm trying to choose between embedding Lucene versus embedding Solr in one
>> webapp.
>>
>> In Solr terms, functional requirements would more or less lead to multiple
>> schema & conf (need CRUD/generation on those) and deployment constraints
>> imply one webapp instance. The choice I'm trying to make is thus:
>> -Embed Lucene and (attempt to) recode a lot of what Solr provides... (the
>> straw man)
>> -Embed Solr but refactor 'some' of its core, assuming it is correct to see
>> one Solr core as the association of one schema & one conf.
>>
>> There have been a few threads about multiple indexes and/or
>> multiple/reloading schemas.
>> From what I gathered, one solution stems from the 'multiple webapp instances
>> deployment' and implies 'extracting' the static instance (at least the
>> SolrCore) & thus host multiple Solr cores in one webapp.
>>
>> Obviously, the operations (queries/add/delete doc) would need to carry which
>> core they are targeting (one 'core' being set as the 'default' for
>> compatibility purposes).
>> What will be the other big hurdles, the ones that could even preclude the
>> very idea? (cache handling, updater threads, HA features...)
>> Are there any easier routes (class-loaders, a 'provisional' schema...)?
>>
>> Any advice welcome. Thanks.
>> Henri
>>
>>
>>
>>
> 
> 




Re: Results per user

2007-04-13 Thread J.J. Larrea
I wrote the following after hurriedly reading Grant Ingersoll's 
question, and I completely missed the "to remove results that have 
already been viewed" bit.  Which leads me to think what I wrote may 
have no bearing on this issue...  but perhaps it may have bearing on 
someone else's issue?


- J.J.

-
Under the assumption that there is an untokenized field, say 
UserAccess, with user names or IDs that for each document indicate 
which users can access them...


If you could trust the requesting client to modify the request based 
on the user name or ID, it could either


 - Add an fq=UserAccess:userName argument to every request

 - Create a RequestHandler configuration for each user, putting such 
a fq (with a hardwired username) in an 'appends' section, along with 
any other needed customization:


  

  <requestHandler name="/tony" class="solr.StandardRequestHandler">
    <lst name="appends">
      <str name="fq">UserAccess:tony</str>
      ...
    </lst>
  </requestHandler>

But if you cannot trust the requesting client and need to do the
filtering on the Solr side of the divide, then I think you can simply
subclass and deploy org.apache.solr.servlet.SolrDispatchFilter, such
that in the execute() method you take the user (e.g. from
request.getRemoteUser() or some other means), format a fq argument as
above, and explicitly add it to the params in the SolrQueryRequest.
While users can add filters to their queries, they would not be able
to remove the servlet-supplied filter query.
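
A sketch of the idea -- not the SolrDispatchFilter subclass itself (its
internals varied across Solr versions), but a plain servlet filter placed
in front of it that forces the per-user fq onto every request; the
UserAccess field and filter name are assumptions:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import javax.servlet.*;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletRequestWrapper;

  public class UserAccessFilter implements Filter {
      public void init(FilterConfig cfg) {}
      public void destroy() {}

      public void doFilter(ServletRequest req, ServletResponse rsp, FilterChain chain)
              throws IOException, ServletException {
          HttpServletRequest http = (HttpServletRequest) req;
          final String user = http.getRemoteUser();  // or some other means
          if (user == null) {
              chain.doFilter(req, rsp);
              return;
          }
          // Append fq=UserAccess:<user> to whatever the client sent;
          // clients may add their own filters but cannot remove this one.
          // (A complete version would override getParameterMap() and
          // getParameter() consistently as well.)
          HttpServletRequestWrapper wrapped = new HttpServletRequestWrapper(http) {
              public String[] getParameterValues(String name) {
                  String[] vals = super.getParameterValues(name);
                  if (!"fq".equals(name)) {
                      return vals;
                  }
                  List<String> out = vals == null
                          ? new ArrayList<String>()
                          : new ArrayList<String>(Arrays.asList(vals));
                  out.add("UserAccess:" + user);
                  return out.toArray(new String[out.size()]);
              }
          };
          chain.doFilter(wrapped, rsp);
      }
  }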


Regardless of how fq is specified, it would create a cached filter for
each user. Obviously the filter cache size should be greater than the
number of simultaneously active users plus the filters they use in their
queries; inactive users' filters will simply be evicted until the next
time those users show up.

-



Re: Embedding Solr vs Lucene, multiple Solr cores?

2007-04-13 Thread Chris Hostetter

: but does not want to deal with the application itself, leaving the 'business
: users' side to administer it. Even if there is a dedicated Tomcat for the main
: app, IT will not let the 'business users' install other applications (scope
: of responsibility, code versus data, validation procedures, etc.). Thus the
: 'one application' constraint.

There tend to be a lot of devils in the details of policy discussions
like this, but perhaps you could broaden the definition of an
"application" from your ops/biz standpoint beyond the definition from a
servlet container standpoint (ie: let the "application" be the entire
Tomcat setup running several webapps).

Alternately, I've heard people mention in past discussions issues
regarding service-provider-run servlet containers with self-serve WAR hot
deployment, and the issues involved with only being able to change your
WAR and not having any control over the container itself, and I've always
wondered: how hard would it be to wrap Tomcat (or Jetty) so that it is a
war that can run inside of another servlet container ... then you could
have multiple wars embedded in that war and control the Tomcat configs to
your heart's content -- treating the ISP's servlet container like an OS.

: For the sake of argument :-), besides the SolrCore singletons which is easy
: to circumvent (a map of cores & at least a pointer from the instantiated
: schema to the core handling it, are there others that are hiding
: (Config.config, caches...) that would preclude the multiple core track?

There are lots of places in the code where class instances use static refs
to find the Core/Config/IndexSchema which would have to know about
your Map and keys ... it would be a lot of non-trivial changes and
refactoring, I believe.

That said: if anyone is interested in tackling a patch to eliminate all of
the static singletons, I (and many others, I suspect) would be
extremely grateful ... both for how much it would improve the reusability
of Solr in embedded situations like this, and for how it would
(hopefully) make the code easier to follow for future developers.


-Hoss



Re: Solr Scripts.conf Parsing Error

2007-04-13 Thread realw5

I think you're on to something; here's the output:

# Licensed to the Apache Software Foundation (ASF) under one or more^M$
# contributor license agreements.  See the NOTICE file distributed with^M$
# this work for additional information regarding copyright ownership.^M$
# The ASF licenses this file to You under the Apache License, Version 2.0^M$
# (the "License"); you may not use this file except in compliance with^M$
# the License.  You may obtain a copy of the License at^M$
#^M$
# http://www.apache.org/licenses/LICENSE-2.0^M$
#^M$
# Unless required by applicable law or agreed to in writing, software^M$
# distributed under the License is distributed on an "AS IS" BASIS,^M$
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.^M$
# See the License for the specific language governing permissions and^M$
# limitations under the License.^M$
user=solr^M$
solr_hostname=localhost^M$
solr_port=8080^M$
rsyncd_port=18080^M$
data_dir=^M$
webapp_name=solr^M$
master_host=^M$
master_data_dir=^M$
master_status_dir=^M$

Question now is, what's the best way to remove those characters?

Dan

Chris Hostetter wrote:
> 
> 
> : all the debug output. Here is a snip of that. Note the "solr\r", yet in my
> : .conf file I have only "user=solr". If I run the script using this command
> 
> what does "cat -vet scripts.conf" tell you?
> 
> 
> 
> -Hoss
> 
> 
> 
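
The usual fix is to strip the carriage returns (the file was saved with
DOS line endings); for example:

  tr -d '\r' < scripts.conf > scripts.conf.unix && mv scripts.conf.unix scripts.conf

(dos2unix scripts.conf does the same thing where it is installed.)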
