Newbie question about memory allocation between solr and OS
Sorry for the newbie question. When running solr under tomcat I notice that the amount of memory tomcat uses increases over time until it reaches the maximum limit set for the jvm (with the Xms and Xmx switches).

Is it better to give all available physical memory to the jvm, or to allocate just enough that solr doesn't run out of memory and let the OS use the rest for disk buffers? That is, will lucene take good advantage of extra memory, or does the extra memory end up holding data structures that are no longer in use but haven't been garbage-collected by the jvm yet?

Thank you,

--dallan
RE: Newbie question about memory allocation between solr and OS
Thanks Yonik! In case anyone monitoring this list isn't sold already on solr, my use of solr is pretty non-standard -- I've written nearly a dozen plugins to customize it for my particular needs. Yet I've been able to do everything I need using plugins and without modifying the core code. It works like a charm.

--dallan

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
> Sent: Monday, August 11, 2008 10:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Newbie question about memory allocation between solr and OS
>
> On Mon, Aug 11, 2008 at 10:52 AM, Dallan Quass <[EMAIL PROTECTED]> wrote:
> > Sorry for the newbie question. When running solr under tomcat I
> > notice that the amount of memory tomcat uses increases over time until
> > it reaches the maximum limit set (with the Xms and Xmx switches) for the jvm.
> >
> > Is it better to give all available physical memory to the jvm, or to
> > allocate enough so that solr doesn't run out of memory and let the OS
> > use the rest for disk buffers?
>
> The latter... let the OS have as much as you can for disk buffers.
>
> -Yonik
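A side note on my original observation: the heap climbing toward -Xmx mostly reflects memory the JVM has reserved and not yet garbage-collected, not live data. A minimal sketch (plain Java, nothing Solr-specific) that shows the difference between reserved and in-use heap:

public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                 // roughly the -Xmx limit
        long committed = rt.totalMemory();         // heap currently reserved from the OS
        long used = committed - rt.freeMemory();   // heap actually holding objects
        System.out.printf("max=%dMB committed=%dMB used=%dMB%n",
                max >> 20, committed >> 20, used >> 20);
    }
}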
How to send a parsed Query to shards?
I want to use distributed search with some search components that I would like to execute only on the main server, not on the shards, because they reference some large in-memory lookup tables. After the search components finish processing the original query, the query may contain SpanNearQueries and DisjunctionMaxQueries. I'd like to send that rewritten query to the shards, not the original query.

I've come up with the following idea for doing this:

* Subclass QueryComponent to base64-encode the serialized form of the query and send that in place of the original query.
* Set the queryParser on the shard servers to a custom class that decodes and deserializes the encoded query and returns it.

Thoughts on this approach, or is there a better one?

Thanks,

-dallan
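To make the idea concrete, a rough sketch of the shard-side parser and the encoding step follows. It assumes Lucene's Query is java.io.Serializable (true in the Lucene 2.x/3.x era), uses java.util.Base64 for brevity, and the class and parser names are made up; treat it as a sketch, not working code.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Base64;

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

// Registered on the shards in solrconfig.xml, e.g. under the name "serialized"
// (hypothetical), and selected on shard requests.
public class SerializedQParserPlugin extends QParserPlugin {

    public void init(NamedList args) {}

    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new QParser(qstr, localParams, params, req) {
            public Query parse() {
                try {
                    byte[] bytes = Base64.getDecoder().decode(qstr);
                    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes));
                    return (Query) in.readObject();   // assumes Query is Serializable
                } catch (Exception e) {
                    throw new RuntimeException("could not deserialize query", e);
                }
            }
        };
    }

    // Main-server side (e.g. in the QueryComponent subclass): encode the parsed
    // query before substituting it into the shard request.
    public static String encode(Query q) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buf);
        out.writeObject(q);
        out.close();
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }
}

The QueryComponent subclass on the main server would call encode(...) on the rewritten query and send the resulting string to the shards in place of the original q parameter.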
How to store a dense field value efficiently
Hi,

I want to issue queries where queried fields have a specified value or are "missing". I know that I can query missing values using a negated full-range query, but that doesn't seem very efficient (the fields in question have a lot of possible values). So I've opted to store a special "missing" value for each field that isn't found in a document, and issue queries like "+(field1:value field1:missing) +(field2:value field2:missing)".

The issue is that storing the missing values increases the size of the index by 30%, because a lot of documents don't have values for all fields. I'd like to keep the index as small as possible so it can be cached in memory.

Any ideas on an alternative approach? Is there a way to convince lucene to store the doc-id list for the "missing" field value as a bitmap? What if I added some boolean fields to my schema, e.g. field1_missing and field2_missing, and stored a true in those fields for documents that were missing the corresponding fields? Does lucene store BoolFields as bitmaps?

-dallan
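For concreteness, the indexing side of the sentinel approach is roughly this (a SolrJ-style sketch; the field name and the "missing" token are just the ones from the example above):

import org.apache.solr.common.SolrInputDocument;

public class SentinelDocBuilder {
    // Every document gets a value for field1: either the real value or the
    // literal token "missing", so +(field1:value field1:missing) also matches
    // documents that have no real value.
    public static SolrInputDocument build(String id, String field1Value) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("field1", field1Value != null ? field1Value : "missing");
        return doc;
    }
}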
RE: How to store a dense field value efficiently
Sorry - I meant indexed. I don't store the fields.

--dallan

> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, January 22, 2010 9:30 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to store a dense field value efficiently
>
> Oops, that's a Lucene bit (got confused which list I was on).
>
> You can still control storing the raw text in SOLR, so my question is
> still relevant, but the solution may be different. Do you store the fields?
>
> Erick
>
> On Fri, Jan 22, 2010 at 10:27 AM, Erick Erickson wrote:
>
> > I'm surprised by a 30% increase. The approach of adding a special
> > token for "not present" is one of the standard ones.
> >
> > So just to check, when you say "stored", are you really storing the
> > missing value? As in Field.Store.YES? As opposed to Field.Index.###?
> > Because there's no need to Store this value.
> >
> > Erick
> >
> > On Thu, Jan 21, 2010 at 11:22 PM, Dallan Quass wrote:
> >
> >> Hi,
> >>
> >> I want to issue queries where queried fields have a specified value
> >> or are "missing". I know that I can query missing values using a
> >> negated full-range query, but that doesn't seem very efficient (the
> >> fields in question have a lot of possible values). So I've opted to
> >> store a special "missing" value for each field that isn't found in a
> >> document, and issue queries like "+(field1:value field1:missing)
> >> +(field2:value field2:missing)".
> >>
> >> The issue is that storing the missing values increases the size of
> >> the index by 30%, because a lot of documents don't have values for
> >> all fields. I'd like to keep the index as small as possible so it
> >> can be cached in memory.
> >>
> >> Any ideas on an alternative approach? Is there a way to convince
> >> lucene to store the doc-id list for the "missing" field value as a
> >> bitmap? What if I added some boolean fields to my schema, e.g.
> >> field1_missing and field2_missing, and stored a true in those fields
> >> for documents that were missing the corresponding fields? Does
> >> lucene store BoolFields as bitmaps?
> >>
> >> -dallan
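For anyone following along, the distinction Erick is drawing looks like this at the Lucene level (pre-4.0 API; in Solr's schema.xml it corresponds to stored="false" indexed="true" on the field):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MissingSentinelField {
    // The sentinel only needs to be indexed (searchable), not stored
    // (retrievable), so it adds postings but no stored-field bytes.
    static Document docWithSentinel() {
        Document doc = new Document();
        doc.add(new Field("field1", "missing",
                Field.Store.NO,              // don't keep the raw value
                Field.Index.NOT_ANALYZED));  // index it as a single token
        return doc;
    }
}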
RE: SOLR Index or database
FWIW, I just implemented a system that stores the index in SOLR but the records in a partitioned set of MySQL databases. The only stored field in SOLR is an ID field, which is the key to a table in the MySQL database. I had to modify SOLR a tiny bit and write a "database" search component so that search results are read from the database instead of the SOLR index partitions, but it works really well.

The system indexes around 750M records partitioned across 10 SOLR servers and 4 MySQL servers. Storing the records in MySQL kept the indexes small enough to be cached entirely in memory. The MySQL databases require one disk IO for each record displayed in the search results.

--dallan

> -----Original Message-----
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Wednesday, March 03, 2010 1:20 PM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR Index or database
>
> You need two, maybe three things that Solr doesn't do (or doesn't do well):
>
> * field updating
> * storing content
> * real time search and/or simple transactions
>
> I would seriously look at Mark Logic for that. It does all of those,
> plus full-text search, gracefully, plus it scales. There is also a
> version for Amazon EC2. www.marklogic.com
>
> Note: I work at Mark Logic, but I chose Solr for Netflix when I worked there.
>
> wunder
>
> On Mar 3, 2010, at 11:08 AM, caman wrote:
>
> > Hello All,
> >
> > Just struggling with a thought on whether SOLR or a database would be
> > a good option for me. Here are my requirements.
> > We index about 600+ news/blogs into our system. The only information
> > we store locally is the title, link, and article snippet. We are able
> > to index all these sources into the SOLR index and it works perfectly.
> > This is where it gets tricky:
> > We need to store certain meta information as well, e.g.
> > 1. Rating/popularity of an article
> > 2. Sharing of articles between users
> > 3. How many times an article is viewed
> > 4. Comments on each article
> >
> > So far, we are planning to store the meta-information in the database
> > and link this data with a document in the index. When a user opens the
> > page, results are combined from the index and the database to render the view.
> >
> > Any reservations about the above architecture?
> > Is SOLR the right fit in this case? We do need full-text search, so
> > SOLR is a no-brainer imho, but would love to hear the community's view.
> >
> > Any feedback appreciated
> >
> > thanks
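In case it helps anyone building something similar, the shape of that "database" component is roughly the following. It's heavily simplified: the JDBC URL, table, and column names are made up, there's no connection pooling, and older Solr releases also require the other SolrInfoMBean methods (getSource(), getVersion(), etc.).

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;

// Solr stores only an "id" field; the fields shown in search results come from MySQL.
public class DatabaseComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) { /* nothing to do */ }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Collect the stored ids for the current page of results.
        DocList docList = rb.getResults().docList;
        List<String> ids = new ArrayList<String>();
        DocIterator it = docList.iterator();
        while (it.hasNext()) {
            Document d = rb.req.getSearcher().doc(it.nextDoc());  // only "id" is stored
            ids.add(d.get("id"));
        }
        try {
            rb.rsp.add("records", fetchRecords(ids));  // attach the DB rows to the response
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    // One lookup per displayed record; roughly one disk IO each on the MySQL side.
    private List<String> fetchRecords(List<String> ids) throws SQLException {
        List<String> records = new ArrayList<String>();
        Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/records", "user", "pass");
        try {
            PreparedStatement ps = conn.prepareStatement("SELECT body FROM record WHERE id = ?");
            for (String id : ids) {
                ps.setString(1, id);
                ResultSet rs = ps.executeQuery();
                if (rs.next()) {
                    records.add(rs.getString(1));
                }
                rs.close();
            }
        } finally {
            conn.close();
        }
        return records;
    }

    @Override
    public String getDescription() { return "reads display fields from MySQL"; }
}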
EnableLazyFieldLoading?
If I'm loading say 80-90% of the fields 80-90% of the time, and I don't have any large compressed text fields, is it safe to say that I'm probably better off turning off lazy field loading?

Thanks,

--dallan
Issuing queries during analysis?
I have a situation where it would be beneficial to issue queries in a filter that is called during analysis. In a nutshell, I have an index of places that includes possible abbreviations, and I want to query this index during analysis to convert user-entered places to "standardized" places. So if someone enters "Chicago, IL" into a "place" field, I want to write a filter that first issues a query on "IL" to find that the standardized name for IL is Illinois, then issues a query on places named "Chicago" located in "Illinois" to find that the standardized name is "Chicago, Cook, Illinois", and then returns this string in a token.

I've tried having the filter factory implement SolrCoreAware, but that isn't allowed for filter factories. I've considered calling SolrCore.getSolrCore(), but that function has been deprecated with a comment that "if you are using multiple cores, this is not a function to use", and I'd like to use multiple cores someday. I looked at MultiCore.java but couldn't find a way to return a specific core. Any ideas?

I could issue the queries to standardize the place fields in each document before indexing it, and then send SOLR documents with pre-standardized place fields, but it would sure be more convenient (and probably better-performing) to issue the queries during analysis. I'd appreciate suggestions!

--dallan
RE: Issuing queries during analysis?
> This may sound a bit too KISS - but another approach could be based on
> synonyms, i.e. if the number of abbreviations is limited and defined
> ("all US states"), you can simply define the complete state name for
> each abbreviation. This way "Chicago, IL" will be "translated" (...)
> into "Chicago, Illinois" during indexing and/or querying... but this
> may depend on the Tokenizer you use and how your index is defined
> (does a search for "Chicago, Illinois" on a field give you a doc with
> "Chicago, Cook, Illinois" in some (other/same) field?)

Thanks for the suggestion! The problem is there are over 1M places (it's a database of historic places worldwide), most with multiple variations in the way they're written, so a complete synonym file would be pretty large. Issuing queries before indexing the docs would be preferable to a ~100-megabyte synonym file, especially because it's a wiki and people can add new places at any time, so I'd have to rebuild the synonym file on a regular basis.

I sure wish I could figure out how to access the solr core object in my token filter class though.

-dallan
RE: Issuing queries during analysis?
> Can you describe your indexing process a bit more? Do you just have one
> or two tokens that you have to "translate", or is it that you are going
> to query on every token in your text? I just don't see how that will
> perform at all to look up every token in some index, so maybe if we
> have some more info, something more obvious will arise.

It's a recursive problem. Given a user-entered place name like "Salem, Oregon", you work from right to left, first looking up Oregon, which returns places like the state of Oregon and also Oregon County, Missouri. You then look for places named Salem located in the places returned by the first query. This approach finds "Atlanta, Georgia" as well as "Gori, Georgia" (a city in the Republic of Georgia).

I've already written an efficient lookup function. I just don't know how to call it during analysis, because I don't know how to access an instance of SolrCore from within a token filter object.

-dallan
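To make the right-to-left idea concrete, here is the rough shape of the lookup in Java. PlaceIndex is a hypothetical stand-in for the lookup function I already have; it isn't a real Solr or Lucene API:

import java.util.HashSet;
import java.util.Set;

public class PlaceResolver {

    // Hypothetical stand-in for the existing lookup function: returns the places
    // whose name or abbreviation matches, optionally restricted to the given parents.
    public interface PlaceIndex {
        Set<String> lookup(String name, Set<String> parents);
    }

    // "Salem, Oregon" -> look up "Oregon" first (state of Oregon, Oregon County, ...),
    // then look up "Salem" restricted to the places found at the previous level.
    public static Set<String> resolve(String userEntered, PlaceIndex index) {
        String[] levels = userEntered.split("\\s*,\\s*");
        Set<String> candidates = null;                  // null = no restriction yet
        for (int i = levels.length - 1; i >= 0; i--) {  // right to left
            candidates = index.lookup(levels[i], candidates);
            if (candidates.isEmpty()) {
                break;                                  // nothing matched at this level
            }
        }
        return candidates == null ? new HashSet<String>() : candidates;
    }
}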
RE: Issuing queries during analysis?
> Dallan, got money to spend on solving this problem? I believe this is
> something that tools like LingPipe can solve through language model
> training and named entity extraction.

Hi Otis,

Thank-you for your reply. I'm familiar with tools like LingPipe, but this problem is actually *much* simpler. The places have already been entered into Place fields. (I don't have to try to identify place names in running text, which is what tools like LingPipe are for.) All I have to do is convert user-entered place names that may omit levels or contain abbreviations, like "Chicago, IL", into complete place names like "Chicago, Cook, Illinois, United States". I have an algorithm already written to do this; I just don't know how to call it from a token filter, because I don't know how to access SolrCore from within a token filter object.

If I cannot (or for some reason should not) access SolrCore from a token filter, my alternative is, before indexing a document, to issue a query that converts the place fields associated with that document into complete place names, and then pass the document with the complete place names to SOLR for indexing. I was hoping to issue the query from the token filter instead, to avoid the extra round trip between the indexing process and the SOLR server.

Thanks again!

-dallan
RE: Issuing queries during analysis?
Hi Grant,

> Can you describe your indexing process a bit more? Do you just have one
> or two tokens that you have to "translate", or is it that you are going
> to query on every token in your text? I just don't see how that will
> perform at all to look up every token in some index, so maybe if we
> have some more info, something more obvious will arise.

One more clarification -- I don't need to do this for every token in the text, just for "place" fields in the document. Each document has 1-3 place fields that need to be converted to standard form when the document is indexed. There is a special set of (~1M) "Place" documents that contain information about alternative/abbreviated place names, how places are nested inside each other, etc. Either before or during tokenization of the regular documents, I want to query these "Place" documents to determine how to standardize the place fields in the regular documents.

Thank-you again!

-dallan
RE: Issuing queries during analysis?
Grant Ingersoll wrote:
> How often does your collection change or get updated?
>
> You could also have a slight alternative, which is to create a real
> small and simple Lucene index that contains your translations and then
> do it pre-indexing. The code for such a searcher is quite simple,
> albeit it isn't Solr.
>
> Otherwise, you'd have to hack the SolrResourceLoader to recognize your
> Analyzer as being SolrCoreAware, but, geez, I don't know what the full
> ramifications of that would be, so caveat emptor.

Mike Klaas wrote:
> Perhaps you could separate the problem, putting this info in a separate
> index or solr core.

This sounds like the best approach. I've written a special searcher that handles standardization requests for multiple places in one http call, and it was pretty straightforward. That's what I love about SOLR: it's *so* easy to write plugins for.

Thank-you for your suggestions!

--dallan
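For anyone who wants to do something similar, the skeleton of that standardization searcher looks roughly like this. The handler and parameter names are made up, the actual matching logic is reduced to a placeholder, and the import paths plus the extra SolrInfoMBean methods vary between Solr versions:

import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

// Handles e.g. /standardize?place=Chicago,+IL&place=Salem,+Oregon and returns a
// standardized form for each input in a single HTTP call.
public class PlaceStandardizeHandler extends RequestHandlerBase {

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        String[] places = req.getParams().getParams("place");  // one or more inputs
        NamedList<String> results = new SimpleOrderedMap<String>();
        if (places != null) {
            for (String place : places) {
                results.add(place, standardize(req, place));
            }
        }
        rsp.add("standardized", results);
    }

    // Placeholder: the real version does the right-to-left lookups against
    // req.getSearcher() (the place index lives in this core).
    private String standardize(SolrQueryRequest req, String place) {
        return place;
    }

    @Override
    public String getDescription() { return "batch place-name standardization"; }
}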