Newbie question about memory allocation between solr and OS

2008-08-11 Thread Dallan Quass
Sorry for the newbie question.  When running solr under tomcat I notice that
the amount of memory tomcat uses increases over time until it reaches the
maximum limit set (with the -Xms and -Xmx switches) for the jvm.

Is it better to give all available physical memory to the jvm, or to allocate
just enough so that solr doesn't run out of memory and let the OS use the rest
for disk buffers?  That is, will lucene take good advantage of extra memory, or
does the extra memory end up holding data structures that are no longer in use
but haven't been garbage-collected by the jvm yet?

Thank you,

--dallan



RE: Newbie question about memory allocation between solr and OS

2008-08-11 Thread Dallan Quass
Thanks Yonik!

In case anyone monitoring this list isn't sold already on solr, my use of
solr is pretty non-standard -- I've written nearly a dozen plugins to
customize it for my particular needs.  Yet I've been able to do everything I
need using plugins and without modifying the core code.  It works like a
charm.

--dallan

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf 
> Of Yonik Seeley
> Sent: Monday, August 11, 2008 10:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Newbie question about memory allocation between 
> solr and OS
> 
> On Mon, Aug 11, 2008 at 10:52 AM, Dallan Quass <[EMAIL PROTECTED]> wrote:
> > Sorry for the newbie question.  When running solr under tomcat I
> > notice that the amount of memory tomcat uses increases over time
> > until it reaches the maximum limit set (with the -Xms and -Xmx
> > switches) for the jvm.
> >
> > Is it better to give all available physical memory to the jvm, or to
> > allocate just enough so that solr doesn't run out of memory and let
> > the OS use the rest for disk buffers?
> 
> The latter... let the OS have as much as you can for disk buffers.
> 
> -Yonik
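
For illustration only (the numbers are hypothetical, not from this thread): on
a machine with, say, 8 GB of RAM, that advice means capping the heap well below
physical memory, e.g. in the Tomcat startup environment:

  export CATALINA_OPTS="-Xms2g -Xmx2g"

which leaves roughly 6 GB for the OS to use as page cache for the Lucene index
files.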



How to send a parsed Query to shards?

2009-04-03 Thread Dallan Quass
I want to use distributed search with some search components that I would
like to execute only on the main server, not on the shards, because they
reference some large in-memory lookup tables.  After the search components
finish processing the original query, the resulting query may contain
SpanNearQueries and DisjunctionMaxQueries.  I'd like to send that rewritten
query to the shards, not the original query string.

I've come up with the following idea for doing this:

* Subclass QueryComponent to base64-encode the serialized form of the rewritten
query and send that to the shards in place of the original query string.

* Set the query parser on the shard servers to a custom class that decodes
and deserializes the encoded query and returns it.
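
For concreteness, a minimal sketch of the shard-side half of this idea, written
against a recent Solr API rather than the 2009-era one (the class name, the
plugin wiring, and the assumption that the query trees are java.io.Serializable
are mine, not from this thread):

import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.util.Base64;

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

// Hypothetical plugin: registered in solrconfig.xml on each shard and selected
// with defType, it decodes a base64-encoded, Java-serialized Query produced by
// a QueryComponent subclass on the coordinating server.
public class SerializedQueryParserPlugin extends QParserPlugin {

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        try {
          byte[] bytes = Base64.getDecoder().decode(qstr);
          try (ObjectInputStream in =
                   new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            // Only works if the SpanNearQuery/DisjunctionMaxQuery trees being
            // shipped are serializable, as they were in the Lucene of that era.
            return (Query) in.readObject();
          }
        } catch (Exception e) {
          throw new SyntaxError("could not decode serialized query: " + e);
        }
      }
    };
  }
}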

Thoughts on this approach, or is there a better one?

Thanks,

-dallan



How to store a dense field value efficiently

2010-01-21 Thread Dallan Quass
Hi,

I want to issue queries where queried fields have a specified value or are
"missing".  I know that I can query missing values using a negated
full-range query, but it doesn't seem like that's very efficient (the fields
in question have a lot of possible values).  So I've opted to store a special
"missing" value for each field that isn't found in a document, and issue
queries like "+(field1:value field1:missing) +(field2:value
field2:missing)".

The issue is that storing the missing values increases the size of the index
by 30%, because a lot of documents don't have values for all fields.  I'd
like to keep the index as small as possible so it can be cached in memory.

Any ideas on an alternative approach?  Is there a way to convince lucene to
store the doc-id list for the "missing" field value as a bitmap?  What if I
added some boolean fields to my schema; e.g., field1_missing and
field2_missing and stored a true in those fields for documents that were
missing the corresponding fields?  Does lucene store BoolFields as bitmaps?

-dallan



RE: How to store a dense field value efficiently

2010-01-22 Thread Dallan Quass
Sorry - I meant indexed.  I don't store the fields.

--dallan
 

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Friday, January 22, 2010 9:30 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to store a dense field value efficiently
> 
> Oops, that's a Lucene bit (got confused which list I was on).
> 
> You can still control storing the raw text in SOLR, so my 
> question is still relevant, but the solution may be 
> different. Do you store the fields?
> 
> Erick
> 
> On Fri, Jan 22, 2010 at 10:27 AM, Erick Erickson 
> wrote:
> 
> > I'm surprised by a 30% increase. The approach of adding a special
> > token for "not present" is one of the standard ones.
> >
> > So just to check, when you say "stored", are you really storing the
> > missing value? As in Field.Store.YES? As opposed to Field.Index.###?
> > Because there's no need to Store this value.
> >
> > Erick
> >
> > On Thu, Jan 21, 2010 at 11:22 PM, Dallan Quass wrote:
> >
> >> Hi,
> >>
> >> I want to issue queries where queried fields have a specified value
> >> or are "missing".  I know that I can query missing values using a
> >> negated full-range query, but it doesn't seem like that's very
> >> efficient (the fields in question have a lot of possible values).  So
> >> I've opted to store a special "missing" value for each field that isn't
> >> found in a document, and issue queries like "+(field1:value
> >> field1:missing) +(field2:value field2:missing)".
> >>
> >> The issue is that storing the missing values increases the size of
> >> the index by 30%, because a lot of documents don't have values for
> >> all fields.  I'd like to keep the index as small as possible so it
> >> can be cached in memory.
> >>
> >> Any ideas on an alternative approach?  Is there a way to convince
> >> lucene to store the doc-id list for the "missing" field value as a
> >> bitmap?  What if I added some boolean fields to my schema; e.g.,
> >> field1_missing and field2_missing and stored a true in those fields
> >> for documents that were missing the corresponding fields?  Does
> >> lucene store BoolFields as bitmaps?
> >>
> >> -dallan
> >>
> >>
> >
> 



RE: SOLR Index or database

2010-03-04 Thread Dallan Quass
FWIW, I just implemented a system that stores the index in SOLR but the
records in a partitioned set of MySQL databases.  The only stored field in
SOLR is an ID field, which is the key to a table in the MySQL database.  I
had to modify SOLR a tiny bit and write a "database" search component so
that search results are read from the database instead of the SOLR index
partitions, but it works really well. 

The system indexes around 750M records partitioned across 10 SOLR servers
and 4 MySQL servers.  Storing the records in MySQL kept the indexes small
enough to be cached entirely in memory.  The MySQL databases require one
disk IO for each record displayed in the search results.
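
A rough sketch of what such a "database" search component can look like, written
against a recent Solr API (the class name, the "id" field, the table layout, and
the JDBC wiring are all illustrative, and the part that suppresses the normal
stored-field response is omitted):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class DatabaseFetchComponent extends SearchComponent {

  // Illustrative connection string; the real system partitioned records
  // across several MySQL servers.
  private static final String JDBC_URL = "jdbc:mysql://localhost/records";

  @Override
  public void prepare(ResponseBuilder rb) {
    // nothing to do before the query runs
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Collect the stored ID of each hit; "id" is the only stored field.
    SolrIndexSearcher searcher = rb.req.getSearcher();
    DocList hits = rb.getResults().docList;
    List<String> ids = new ArrayList<>();
    DocIterator it = hits.iterator();
    while (it.hasNext()) {
      Document d = searcher.doc(it.nextDoc());
      ids.add(d.get("id"));
    }

    // Fetch the display records from MySQL and attach them to the response.
    NamedList<Object> records = new SimpleOrderedMap<>();
    try (Connection conn = DriverManager.getConnection(JDBC_URL);
         PreparedStatement ps = conn.prepareStatement(
             "SELECT body FROM records WHERE id = ?")) {
      for (String id : ids) {
        ps.setString(1, id);
        try (ResultSet rs = ps.executeQuery()) {
          if (rs.next()) {
            records.add(id, rs.getString("body"));
          }
        }
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
    rb.rsp.add("records", records);
  }

  @Override
  public String getDescription() {
    return "fetches display records from MySQL by stored id";
  }
}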

--dallan

> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Wednesday, March 03, 2010 1:20 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR Index or database
> 
> You need two, maybe three things that Solr doesn't do (or 
> doesn't do well):
> 
> * field updating
> * storing content
> * real time search and/or simple transactions
> 
> I would seriously look at Mark Logic for that. It does all of 
> those, plus full-text search, gracefully, plus it scales. 
> There is also a version for Amazon EC2.  www.marklogic.com
> 
> Note: I work at Mark Logic, but I chose Solr for Netflix when 
> I worked there.
> 
> wunder
> 
> On Mar 3, 2010, at 11:08 AM, caman wrote:
> 
> > 
> > Hello All,
> > 
> > Just struggling with a thought about whether SOLR or a database would
> > be a good option for me. Here are my requirements.
> > We index about 600+ news/blogs into our system. The only information we
> > store locally is the title, link, and article snippet. We are able to
> > index all these sources into the SOLR index and it works perfectly.
> > This is where it gets tricky:
> > We need to store certain meta information as well, e.g.
> > 1. Rating/popularity of an article
> > 2. Sharing of articles between users
> > 3. How many times an article is viewed
> > 4. Comments on each article
> >
> > So far, we have decided to store the meta-information in the database
> > and link this data with a document in the index. When a user opens the
> > page, results are combined from the index and the database to render
> > the view.
> >
> > Any reservations about using the above architecture?
> > Is SOLR the right fit in this case? We do need full-text search, so
> > SOLR is a no-brainer IMHO, but I would love to hear the community's view.
> > 
> > Any feedback appreciated
> > 
> > thanks
> 



EnableLazyFieldLoading?

2008-05-28 Thread Dallan Quass
If I'm loading, say, 80-90% of the fields 80-90% of the time, and I don't have
any large compressed text fields, is it safe to say that I'm probably better
off turning off lazy field loading?
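
For reference, the switch in question lives in the <query> section of
solrconfig.xml:

  <enableLazyFieldLoading>false</enableLazyFieldLoading>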

Thanks,

--dallan



Issuing queries during analysis?

2008-05-28 Thread Dallan Quass
I have a situation where it would be beneficial to issue queries in a filter
that is called during analysis.  In a nutshell, I have an index of places
that includes possible abbreviations.  And I want to query this index during
analysis to convert user-entered places to "standardized" places.  So if
someone enters "Chicago, IL" into a "place" field, I want to write a filter
that first issues a query on "IL" to find that the standardized name for IL
is Illinois, and then issues a query on places named "Chicago" located in
"Illinois" to find that the standardized name is "Chicago, Cook, Illinois",
and then returns this string in a token.  

I've tried having the filter factory implement SolrCoreAware, but that isn't
allowed for filter factories.  I've considered calling
SolrCore.getSolrCore(), but this function has been deprecated with a comment
that "if you are using multiple cores, this is not a function to use", and
I'd like to use multiple cores someday.  I looked at MultiCore.java but
couldn't find a way to return a specific core.  

Any ideas?  I could issue the queries to standardize the place fields in
each document before indexing it, and then send SOLR documents with
pre-standardized place fields, but it would sure be more convenient (and
probably better-performing) to issue the queries during analysis.

I'd appreciate suggestions!

--dallan



RE: Issuing queries during analysis?

2008-05-30 Thread Dallan Quass
> this may sound a bit too KISS - but another approach could be
> based on synonyms, i.e. if the number of abbreviations is
> limited and well defined ("all US states"), you can simply define
> the complete state name for each abbreviation; this way
> "Chicago, IL" will be "translated" (...) into "Chicago,
> Illinois" during indexing and/or querying... but this may
> depend on the Tokenizer you use and how your index is defined
> (does a search for "Chicago, Illinois" on a field give you a
> doc with "Chicago, Cook, Illinois" in some (other/same) field?)

Thanks for the suggestion!  The problem is there are over 1M places (it's a
database of historic places worldwide), most with multiple variations in the
way that they're written.  A complete synonym file would be pretty large.
Issuing queries before indexing the docs would be preferable to a
~100-megabyte synonym file, especially because it's a wiki and people can
add new places at any time, so I'd have to rebuild the synonym file on a
regular basis.

I sure wish I could figure out how to access the solr core object in my
token filter class though.

-dallan



RE: Issuing queries during analysis?

2008-05-30 Thread Dallan Quass
> Can you describe your indexing process a bit more?  Do you
> just have one or two tokens that you have to "translate", or is
> it that you are going to query on every token in your text?
> I just don't see how that will perform at all to look up
> every token in some index, so maybe if we have some more
> info, something more obvious will arise.

It's a recursive problem.  Given a user-entered place name like "Salem,
Oregon", you work from right to left, first looking up Oregon, which returns
places like the state of Oregon and also Oregon County, Missouri.  You then
look for places named Salem located in the places returned by the first
query.  This approach finds "Atlanta, Georgia" as well as "Gori, Georgia" (a
city in the Republic of Georgia).  I've already written an efficient lookup
function.  I just don't know how to call it during analysis because I don't
know how to access an instance of SolrCore from within a token filter
object.
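
Purely as an illustration of that right-to-left lookup (not the poster's actual
code), here is a client-side sketch against the modern SolrJ API; the "places"
core and the "name"/"locatedIn"/"id" fields are hypothetical, and query escaping
is omitted:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PlaceResolver {

  private final SolrClient places =
      new HttpSolrClient.Builder("http://localhost:8983/solr/places").build();

  /** Resolve e.g. "Salem, Oregon" to the ids of candidate standardized places. */
  public List<String> resolve(String userEntered) throws Exception {
    String[] levels = userEntered.split("\\s*,\\s*");
    List<String> candidates = new ArrayList<>();   // parent ids found so far

    // Work from the rightmost (most general) level to the leftmost.
    for (int i = levels.length - 1; i >= 0; i--) {
      StringBuilder q = new StringBuilder("name:\"" + levels[i] + "\"");
      if (!candidates.isEmpty()) {
        // Restrict to places located in one of the parents found so far.
        q.append(" AND locatedIn:(")
         .append(String.join(" OR ", candidates))
         .append(")");
      }
      QueryResponse rsp = places.query(new SolrQuery(q.toString()));
      candidates = new ArrayList<>();
      for (SolrDocument d : rsp.getResults()) {
        candidates.add((String) d.getFieldValue("id"));
      }
      if (candidates.isEmpty()) {
        break;   // no match at this level; give up (or fall back)
      }
    }
    return candidates;
  }
}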

-dallan



RE: Issuing queries during analysis?

2008-05-30 Thread Dallan Quass
> Dallan, got money to spend on solving this problem?  I 
> believe this is something that tools like LingPipe can solve 
> through language model training and named entity extraction.

Hi Otis,

Thank-you for your reply.  I'm familiar with tools like LingPipe, but this
problem is actually *much* simpler.  The places have already been entered
into Place fields.  (I don't have to try to identify place names in running
text, which is what tools like LingPipe are for.)  All I have to do is
convert user-entered place names that may omit levels or contain
abbreviations like "Chicago, IL", into complete place names like "Chicago,
Cook, Illinois, United States".  I have an algorithm already written to do
this; I just don't know how to call it from a token filter because I don't
know how to access SolrCore from within a token filter object.

If I cannot (or for some reason should not) access SolrCore from a token
filter, my alternative is, before indexing a document, to issue a query to
convert the place fields associated with that document into complete place
names, and then pass the document with the complete place names to SOLR for
indexing.  By issuing this query from the token filter instead, if possible,
I was hoping to avoid the extra query-response round trip between the
indexing process and the SOLR server that this would entail.

Thanks again!

-dallan



RE: Issuing queries during analysis?

2008-05-30 Thread Dallan Quass
Hi Grant,

> Can you describe your indexing process a bit more?  Do you
> just have one or two tokens that you have to "translate", or is
> it that you are going to query on every token in your text?
> I just don't see how that will perform at all to look up
> every token in some index, so maybe if we have some more
> info, something more obvious will arise.

One more clarification -- I don't need to do this for every token in the
text; just for "place" fields in the document.  Each document has 1-3 place
fields that need to be converted to standard form when the document is
indexed.

There is a special set of (~1M) "Place" documents that contain information
about alternative/abbreviated place names, how places are nested inside each
other, etc.  Either before or during tokenization of the regular documents I
want to query these "Place" documents to determine how to standardize the
place fields in the regular documents.

Thank-you again!

-dallan



RE: Issuing queries during analysis?

2008-06-03 Thread Dallan Quass
> Grant Ingersoll wrote:
> 
> How often does your collection change or get updated?
> 
> You could also have a slight alternative, which is to create 
> a real small and simple Lucene index that contains your 
> translations and then do it pre-indexing.  The code for such 
> a searcher is quite simple, albeit it isn't Solr.
> 
> Otherwise, you'd have to hack the SolrResourceLoader to 
> recognize your Analyzer as being SolrCoreAware, but, geez, I 
> don't know what the full ramifications of that would be, so 
> caveat emptor.


> Mike Klaas wrote:
>
> Perhaps you could separate the problem, putting this info in 
> separate index or solr core.

This sounds like the best approach.  I've written a special searcher that
handles standardization requests for multiple places in one HTTP call, and it
was pretty straightforward.  That's what I love about SOLR: it's *so* easy
to write plugins for.
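
To make the shape of such a plugin concrete, here is a hedged sketch (the
handler name, the "place" parameter, and the lookup stub are invented, and it
is written against a recent Solr API) of a request handler that standardizes
several places in one HTTP call:

import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

public class StandardizePlacesHandler extends RequestHandlerBase {

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    NamedList<Object> out = new SimpleOrderedMap<>();
    String[] places = req.getParams().getParams("place");
    if (places != null) {
      for (String p : places) {
        out.add(p, standardize(req, p));
      }
    }
    rsp.add("standardized", out);
  }

  // Placeholder for the real lookup, which would run the right-to-left
  // queries against the ~1M place documents via req.getSearcher().
  private String standardize(SolrQueryRequest req, String place) {
    return place;
  }

  @Override
  public String getDescription() {
    return "standardizes user-entered place names";
  }
}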

Thank-you for your suggestions!

--dallan