Filter queries taking a long time, even with cache disabled
On a Solr 4.1 install I see that queries that use the fq parameter take a long time (upwards of 120 seconds), both with the standard Lucene query parser and also with edismax. I have added the {!cache=false} localparam to the filter query, but this does not speed up the query. Putting all the search terms in the main query returns results in milliseconds. Note that I am not using any wildcard queries; in each case I am specifying the field to search and the terms to search on. Where should I start to debug? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Filter queries taking a long time, even with cache disabled
On Thu, Jun 27, 2013 at 12:14 PM, Upayavira wrote: > can you give an example? > Thank you. This is an example query: select ?q=search_field:iraq &fq={!cache=false}search_field:love%20obama &defType=edismax -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
No date.gap on pivoted facets
Consider the following query: select?q=*:* &facet=true &facet.date=added &facet.date.start=2013-04-01T00:00:00Z &facet.date.end=2013-06-30T00:00:00Z &facet.date.gap=%2b7DAYS &rows=0 &facet.pivot=added,provider In this query, the facet.date.gap is ignored and each individual second is faceted on. The issue remains the same even when reversing the order of the pivot: &facet.pivot=provider,added Is this a Solr bug, or am I pivoting wrong? This is on Solr 4.1.0 running on OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) on Ubuntu Server 12.04. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: No date.gap on pivoted facets
On Sun, Jun 30, 2013 at 5:33 PM, Jack Krupansky wrote: > Sorry, but Solr pivot faceting is based solely on "field" facets, not > "range" (or "date") facets. > Thank you. I tried adding that information to the SimpleFacetParameters wiki page, but that page seems to be defined as an "Immutable Page". > You can approximate date gaps by making a copy of your raw date field and > then manually "gap" (truncate) the date values so that their discrete > values correspond to your date gap. > Thank you, this is what I have done. > In the next release of my book I have a script for a > StatelessScriptUpdateProcessor (with examples) that supports truncation of > dates to a desired resolution, copying or modifying the input date as > desired. > Terrific, I anticipate the release. Next release? Did I miss the release? http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957/ -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
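P.S. For anyone searching the archives later: the truncation I do before indexing is roughly the following. This is only a sketch (not the script from the book); the 'added_week' destination field name and the bucket start date are my own choices, and it assumes the input date is never earlier than the bucket start.

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.TimeZone;

    public class DateGapTruncator {

        // Round a timestamp down to the start of its 7-day bucket, counting
        // buckets from the same start date used as facet.date.start.
        public static String truncateToGap(Date added, Date gapStart) {
            final long gapMillis = 7L * 24 * 60 * 60 * 1000;   // +7DAYS
            long offset = added.getTime() - gapStart.getTime(); // assumes added >= gapStart
            long bucketStart = gapStart.getTime() + (offset / gapMillis) * gapMillis;

            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            return fmt.format(new Date(bucketStart));
        }
    }

The returned string goes into the copy field (I call it 'added_week'), and that field is what facet.pivot is run on instead of 'added'.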
Re: How to improve the Solr "OR" query performance
On Wed, Jul 3, 2013 at 6:48 AM, huasanyelao wrote: > Nowadays, I've got an urgent task to improve the "OR" query performance with > solr. > I have deployed 9 shards with solr-cloud on two servers (each server: 16 > cores, 32G RAM). > The total document count: 60,000,000, total index size: 9G. > According to the requirement, I have to use the "OR" query to get results. > The average number of query terms is about 15. > The response time for an "OR" query is around 1-2 seconds (the "AND" query is just > about 30ms-40ms). > Our target: improve by 50%, that is, at most 500ms-1s per query. > The document count will soar to 80,000,000; however, performance should stay within > 500ms-1s per query. > Any advice or approach is appreciated. Thanks in advance. > What size documents? I've currently got stats like this, only a few more documents but 5s searches on 15 ORs: q=love%20OR%20hate%20OR%20beer%20OR%20sex%20OR%20peace%20OR%20war%20OR%20up%20OR%20down%20OR%20this%20OR%20that%20OR%20left%20OR%20right%20OR%20north%20OR%20south%20OR%20east%20OR%20west status=0, QTime=5604, echoed query: love OR hate OR beer OR sex OR peace OR war OR up OR down OR this OR that OR left OR right OR north OR south OR east OR west My index currently has 77461952 documents, most under 1 KiB each but with upwards of ten fields. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Find related words
How might one find the top related words for a given word in a Solr index? For instance, given the following single-field documents: 1: I love chocolate 2: I love Solr 3: I eat chocolate cake 4: You will eat chocolate candy Thus, given the word "Chocolate" Solr might find these top words: I, eat (2 matches each); love, cake, you, will, candy (1 match each) Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Find related words
Thank you Jack and Koji. I will take a look at MLT and also at the .zip files from LUCENE-474. Koji, did you have to modify the code for the latest Solr? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
How might one search for dupe IDs other than faceting on the ID field?
To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving OutOfMemoryError errors instead of the desired facet: java.lang.OutOfMemoryError: Java heap space java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at ... Might there be a less resource-intensive way to get this information? This is Solr 4.3 running on Ubuntu Server 12.04 in Jetty. The index has over 100,000,000 small records, for a total of about 95 GiB of disk space, with Solr running on its own disk. Actually, the 'disk' is an Amazon Web Service EBS volume. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal wrote: > Does adding facet.mincount=2 help? > > In fact, when adding facet.mincount=20 (I know that some dupes are in the hundreds) I got the OutOfMemoryError in seconds instead of minutes. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:23 PM, Michael Della Bitta wrote: > Are you talking about the document's ID field? > > If so, you can't have duplicates... the latter document would overwrite the > earlier. > > If not, sorry for asking irrelevant questions. :) > In Solr 4.1 we were using overwrite=false&allowDups=false in order to discard the new document, not overwrite the extant document. We knew at the time that the features were deprecated, and apparently allowDups=false stopped working in 4.3. We are testing new solutions, but we need to identify the dupes to get them out. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:24 PM, Shawn Heisey wrote: > Add &facet.method=enum to the query URL. This will cause Solr to enumerate > the facet information on every query rather than load it into the field > cache, which takes a lot of memory. Solr 4.1 was probably very close to > running out of memory as well. > > If you have enough OS disk cache for your index, the enum method should not > cause an enormous slowdown. If you don't have enough OS disk cache, then it > can make the facets run very slowly. > > Thanks, > Shawn > Thanks, the query ran for almost 2 full minutes but it returned results! I'll google for how to increase the disk cache for queries like this. Other than the Qtime, is there no way to judge the amount of memory required for a particular query to run? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
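P.S. For anyone who finds this thread later, the query that finally worked for me translates to roughly the following in SolrJ. This is a sketch only: the core URL is an assumption, and 'id' is our uniqueKey field.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FindDuplicateIds {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://127.0.0.1:8983/solr/core");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);                  // only the facet counts are needed
            q.setFacet(true);
            q.addFacetField("id");
            q.setFacetMinCount(2);         // only ids that appear more than once
            q.setFacetLimit(-1);           // no limit on the number of buckets
            q.set("facet.method", "enum"); // enumerate terms instead of building a field cache

            QueryResponse rsp = server.query(q);
            for (FacetField.Count dupe : rsp.getFacetField("id").getValues()) {
                System.out.println(dupe.getName() + " occurs " + dupe.getCount() + " times");
            }
        }
    }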
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:43 PM, Michael Della Bitta wrote: > Since this is a one-time problem, have you thought of just dumping all the > IDs and looking for dupes using sort and awk or something similar to that? > All 100,000,000 of them :) That would take even longer! Also, I fear that this is not a one-time problem, rather that I should learn now how to tune Solr for intensive queries such as this. I learn by the problems encountered! Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:56 PM, Shawn Heisey wrote: > On 7/30/2013 12:49 PM, Dotan Cohen wrote: >> >> Thanks, the query ran for almost 2 full minutes but it returned >> results! I'll google for how to increase the disk cache for queries >> like this. Other than the Qtime, is there no way to judge the amount >> of memory required for a particular query to run? > > > The way you increase disk cache is to add memory to the server. Any memory > that's not being used by programs (OS, Solr, or anything else) is > automatically part of the disk cache. > > Thanks, > Shawn > I see, thanks. I thought that 'disk cache' was something on disk, such as swap space. The server is already maxed out on RAM:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         14980      14906         73          0        167       5293
-/+ buffers/cache:       9444       5535
Swap:            0          0          0
-- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 11:00 PM, Mikhail Khludnev wrote: > Dotan, > > Could you please provide more lines of the stack trace? Sure, thanks:
java.lang.OutOfMemoryError: Java heap space
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
 at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
 at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:365)
 at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
 at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
 at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.OutOfMemoryError: Java heap space (returned as an HTTP 500 response)
> I have no idea why it got worse in 4.3. I know that 4.3 can use facets > backed on DocValues, which are modest for the heap. But from what I saw, > though I can be wrong, it's disabled for numeric facets. Hence, I can suggest to > reindex id as string docvalues and hope for them. However, it's doubtful to > reindex everything without strong guarantees. We also had issues with 4.2, though I really don't remember the details. Some simple queries such as 'q=ubuntu' would take tens of seconds whereas on 4.1 it was almost instantaneous. In fact, even in 4.3 I feel that things have slowed down terribly (3000 ms on simple queries whereas 4.1 would do it in tens of milliseconds or at most a few hundred). Of course, the index is constantly growing so that may be a factor. Note that in both cases the index and configuration were carried over from 4.1, so that may have been an issue.
Moving back from 4.2 to 4.1 I bit the bullet and deleted the extant documents. I no longer have that luxury now. > Also, I checked the source code of > http://wiki.apache.org/solr/TermsComponent and found that it can be > really memory modest (i.e. without sort or limit). > Be aware that the df-s returned by that component are unaware of deleted > documents, hence expungeDeletes before. > Thank you, I will look into that. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 11:14 PM, Jack Krupansky wrote: > The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe... > any particular reason you did not use it? > > See: > http://wiki.apache.org/solr/Deduplication > > and > > https://cwiki.apache.org/confluence/display/solr/De-Duplication > Actually, the guy who made the changes (a coworker) did in fact write an alternative UpdateHandler. I've just noticed that there are a bunch of dupes right now, though.

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.DirectUpdateHandler2;
import org.apache.solr.util.RefCounted;

public class DiscoAPIUpdateHandler extends DirectUpdateHandler2 {

    public DiscoAPIUpdateHandler(SolrCore core) {
        super(core);
    }

    @Override
    public int addDoc(AddUpdateCommand cmd) throws IOException {
        // If overwrite is set to false, fall back to the stock DirectUpdateHandler2
        // behaviour; this is done for debugging, to insert duplicates into Solr.
        if (!cmd.overwrite) {
            return super.addDoc(cmd);
        }

        // When using ref-counted objects you must decrement the ref count when you are done.
        RefCounted<SolrIndexSearcher> indexSearcher = this.core.getNewestSearcher(false);

        // The idea: run an internal Lucene query and check whether that id already exists.
        Term updateTerm;
        if (cmd.updateTerm != null) {
            updateTerm = cmd.updateTerm;
        } else {
            updateTerm = new Term("id", cmd.getIndexedId());
        }
        Query query = new TermQuery(updateTerm);
        TopDocs docs = indexSearcher.get().search(query, 2);
        if (docs.totalHits > 0) {
            // The index searcher is no longer needed.
            indexSearcher.decref();
            // Don't add the new document.
            return 0;
        }

        // The index searcher is no longer needed.
        indexSearcher.decref();
        // If we got here then it's a new document.
        return super.addDoc(cmd);
    }
}

> And I give a bunch of examples in my book. > I anticipate the book with esteem! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Wed, Jul 31, 2013 at 12:48 AM, Jack Krupansky wrote: > You could also try the terms component which provides a very efficient > facet-like feature - counting the terms. And you can set a minimum term > frequency of 2, so only the dups would come back: > > curl "http://localhost:8983/solr/terms?terms.fl=id&terms.mincount=2"; > Thanks, Jack. This returns results with comparable Qtimes to the faceting on enum. Good to know! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Wed, Jul 31, 2013 at 4:56 AM, Bill Bell wrote: > On Jul 30, 2013, at 12:34 PM, Dotan Cohen wrote: >> On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal wrote: >>> Does adding facet.mincount=2 help? >> >> In fact, when adding facet.mincount=20 (I know that some dupes are in >> the hundreds) I got the OutOfMemoryError in seconds instead of >> minutes. >> >> Dotan Cohen > > This seems like a fairly large issue. Can you create a Jira issue ? > > Bill Bell I'll file an issue, but on what? What information should I include? How is this different from what you would expect? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Don't cache filter queries
I need to use the filter query feature to filter my results, but I don't want the results cached, as documents are added to the index several times per second and the results will be stale immediately. Is there any way to disable filter query caching? This is on Solr 4.1 running in Jetty on Ubuntu Server. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Don't cache filter queries
On Thu, Mar 21, 2013 at 6:22 PM, Chris Hostetter wrote: > > : Just add {!cache=false} to the filter in your query > : (http://wiki.apache.org/solr/SolrCaching#filterCache). > ... > : > I need to use the filter query feature to filter my results, but I > : > don't want the results cached as documents are added to the index > : > several times per second and the results will be stale immediately. Is > : > there any way to disable filter query caching? > > Or remove the filterCache config option from your solrconfig.xml if you > really don't want any caching of any filter queries. > > Frankly though: that's throwing the baby out with the bath water -- just > because you are updating your index super-fast-like doesn't mean you > aren't getting benefits from the caches, particularly from commonly > reused filters which are applied to many queries which might get > executed concurrently -- not to mention that a single filter might be > reused multiple times within a single request to solr. > > disabling cache *warming* can make a lot of sense in NRT cases, but > eliminating caching altogether rarely does. > Thanks. The problem is that the queries with filter queries are taking much longer to run (~60-80 ms) than the queries without (~1-4 ms). I figured that the problem may have been with the caching. In fact, running a query with a filter query and caching disabled is running in the range of 16-30 ms, which is quite an improvement. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Out of memory on some faceting queries
.handleRequest(BlockingHttpConnection.java:53)
 at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
 at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
","code":500}}
I notice that this only occurs on queries that run facets. I start Solr with the following command: sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar /opt/solr-4.1.0/example/start.jar & The server seems to have enough memory:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         14980      10604       4375          0        472       8078
-/+ buffers/cache:       2054      12925
Swap:            0          0          0
The server is 64-bit Ubuntu Server 12.04 LTS running Solr 4.1 and the following Java: $ java -version java version "1.6.0_27" OpenJDK Runtime Environment (IcedTea6 1.12.3) (6b27-1.12.3-0ubuntu1~12.04.1) OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 12:59 PM, Toke Eskildsen wrote: > How many documents does your index have, how many fields do you facet on > and approximately how many unique values does your facet fields have? > 8971763 documents, growing at a rate of about 500 per minute. We actually expect that to be ~5 per minute once we get out of testing. Most documents are less than a KiB in the 'text' field, and they have a few other fields which store short strings, dates, or ints. You can think of these documents like tweets: short general purpose text messages. >> I notice that this only occurs on queries that run facets. I start >> Solr with the following command: >> sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC >> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled >> -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar >> /opt/solr-4.1.0/example/start.jar & > > You are not specifying any maximum heap size (-Xmx), which you should do > in order to avoid unpleasant surprises. Facets and sorting are often > memory hungry, but your system seems to have 13GB free RAM so the easy > solution attempt would be to increase the heap until Solr serves the > facets without OOM. > Thanks, I will start with "-Xmx8g" and test. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 2:41 PM, Toke Eskildsen wrote: > 9M documents in a heavily updated index with faceting. Maybe you are > committing faster than the faceting can be prepared? > https://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F > Thank you Toke, this is exactly on my "list of things to learn about Solr". We do get the error mentioned and we cannot reduce the amount of commits. Also, I do believe that we have the necessary server resources (16 GiB RAM). I have increased maxWarmingSearchers to 4, let's see how this goes. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 5:33 PM, Toke Eskildsen wrote: > On Tue, 2013-04-02 at 15:55 +0200, Dotan Cohen wrote: > > [Toke: maxWarmingSearchers limit exceeded?] > >> Thank you Toke, this is exactly on my "list of things to learn about >> Solr". We do get the error mentioned and we cannot reduce the amount >> of commits. Also, I do believe that we have the necessary server >> resources (16 GiB RAM). > > Memory does not help you if you commit too frequently. If you commit > each X seconds and warming takes X+Y seconds, then you will run out of > memory at some point. > >> I have increased maxWarmingSearchers to 4, let's see how this goes. > > If you still get the error with 4 concurrent searchers, you will have to > either speed up warmup time or commit less frequently. You should be > able to reduce facet startup time by switching to segment based faceting > (at the cost of worse search-time performance) or maybe by using > DocValues. Some of the current threads on the solr-user list are about > these topics. > > How often do you commit and how many unique values do your facet > fields have? > > Regards, > Toke Eskildsen > -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 5:33 PM, Toke Eskildsen wrote: > Memory does not help you if you commit too frequently. If you commit > each X seconds and warming takes X+Y seconds, then you will run out of > memory at some point. > How might I time the warming? I've been googling warming since your earlier message but there does not seem to be any really good documentation on the subject. If there is anything that you feel I should be reading I would appreciate a link or a keyword to search on. I've read the Solr wiki on caching and performance, but other than that I don't see the issue addressed. >> I have increased maxWarmingSearchers to 4, let's see how this goes. > > If you still get the error with 4 concurrent searchers, you will have to > either speed up warmup time or commit less frequently. You should be > able to reduce facet startup time by switching to segment based faceting > (at the cost of worse search-time performance) or maybe by using > DocValues. Some of the current threads on the solr-user list are about > these topics. > > How often do you commit and how many unique values do your facet > fields have? > Batches of 20-50 results are added to Solr a few times a minute, and a commit is done after each batch since I'm calling Solr as such: http://127.0.0.1:8983/solr/core/update/json?commit=true Should I remove commit=true and run a cron job to commit once per minute? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
> How often do you commit and how many unique values does your facet > fields have? > Most of the time I facet on one field that has about twenty unique values. However, once per day I would like to facet on the text field, which is a free-text field usually around 1 KiB (about 100 words), in order to determine what the top keywords / topics are. That query would take up to 200 seconds to run, but it does not have to return the results in real-time (the output goes to another process, not to a waiting user). -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
maxWarmingSearchers in Solr 4.
I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to 4.1, with no customization (bad, bad me!). I'm now looking into customizing it and I see that the Solr 4.1 solrconfig.xml is much simpler and shorter. Is this simply because many of the examples have been removed? In particular, I notice that there is no mention of maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I can simply add it in, are there any other critical config options that are missing that I should be looking into as well? Would I be better off using the old Solr 3.x solrconfig.xml in Solr 4.1 as it contains so many examples? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 6:26 PM, Andre Bois-Crettez wrote: > warmupTime is available on the admin page for each type of cache (in > milliseconds) : > http://solr-box:8983/solr/#/core1/plugins/cache > > Or if you are only interested in the total : > http://solr-box:8983/solr/core1/admin/mbeans?stats=true&key=searcher > Thanks. >> Batches of 20-50 results are added to solr a few times a minute, and a >> commit is done after each batch since I'm calling Solr as such: >> http://127.0.0.1:8983/solr/core/update/json?commit=true Should I >> remove commit=true and run a cron job to commit once per minute? > > > Even better, it sounds like a job for CommitWithin : > http://wiki.apache.org/solr/CommitWithin > I'll look into that. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
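P.S. For the archives: switching from commit=true on every batch to CommitWithin would look roughly like the following from SolrJ. This is just a sketch; the 60-second window is only a value I am considering, and the core URL is an assumption.

    import java.io.IOException;
    import java.util.Collection;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        private final HttpSolrServer server =
                new HttpSolrServer("http://127.0.0.1:8983/solr/core");

        // Index a batch of 20-50 documents and let Solr commit them itself
        // within 60 seconds, instead of sending commit=true on every batch.
        public void indexBatch(Collection<SolrInputDocument> batch)
                throws SolrServerException, IOException {
            server.add(batch, 60000);  // commitWithin = 60,000 ms
        }
    }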
Re: Out of memory on some faceting queries
On Wed, Apr 3, 2013 at 10:11 AM, Toke Eskildsen wrote: >> However, once per day I would like to facet on the text field, >> which is a free-text field usually around 1 KiB (about 100 words), in >> order to determine what the top keywords / topics are. That query >> would take up to 200 seconds to run, [...] > > If that query is somehow part of your warming, then I am surprised that > search has worked at all with your commit frequency. That would however > explain your OOM if you have multiple warmups running at the same time. > No, the 'heavy facet' is not part of the warming. I run it at most once per day, at the end of the day. Solr is not shut down daily. > It sounds like TermsComponent would be a better fit for getting top > topics: https://wiki.apache.org/solr/TermsComponent > I had once looked at TermsComponent, but I think that I eliminated it as a possibility because I actually need the top keywords related to a specific keyword. For instance, I need to know which words are most commonly used with the word "coffee". -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
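P.S. For completeness, the once-a-day 'heavy facet' I described is roughly the following in SolrJ. A sketch only: 'text' is my free-text field, the limit of 50 is arbitrary, and the core URL is an assumption.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class RelatedKeywords {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://127.0.0.1:8983/solr/core");

            // Facet on the free-text field, restricted to documents mentioning "coffee",
            // to see which words co-occur with it most often.
            SolrQuery q = new SolrQuery("text:coffee");
            q.setRows(0);            // only the facet counts are needed
            q.setFacet(true);
            q.addFacetField("text");
            q.setFacetLimit(50);     // top 50 co-occurring terms
            q.setFacetMinCount(1);

            QueryResponse rsp = server.query(q);
            for (FacetField.Count c : rsp.getFacetField("text").getValues()) {
                System.out.println(c.getName() + "\t" + c.getCount());
            }
        }
    }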
Re: maxWarmingSearchers in Solr 4.
On Wed, Apr 3, 2013 at 7:55 PM, Shawn Heisey wrote: > In situations where I don't want to change the default value, I prefer > to leave config elements out of the solrconfig. It makes the config > smaller, and it also makes it so that I will automatically see benefits > from the default changing in new versions. > Thanks. This makes sense. I take it, then, that you update (or at least review) solrconfig for each new Solr version. As I become more familiar with that file I will begin doing the same. > In the case of maxWarmingSearchers, I would hope that you have your > system set up so that you would never need more than 1 warming searcher > at a time. If you do a commit while a previous commit is still warming, > Solr will try to create a second warming searcher. > How would I set the system up for that? We have very many commits (every few seconds) and each commit contains a few tens of documents (mostly smaller than 1 KiB per document). Right now we get about 200-300 searches per minute. Note that I expect both the commit rate and the search rate to increase 2-3 times in the next month, and ideally I should be able to scale it beyond that. I'm right now looking into sharding as a possible solution. > I went poking in the code, and it seems that maxWarmingSearchers > defaults to Integer.MAX_VALUE. I'm not sure whether this is a bad > default or not. It does mean that a pathological setup without > maxWarmingSearchers in the config will probably blow up with an > OutOfMemory exception, but is that better or worse than commits that > don't make new documents searchable? I can see arguments either way. > This is interesting: what you found is that the value in the stock solrconfig.xml file differs from the Solr default value. I think that this is bad practice: a single default should be decided upon, Solr should use this value when nothing is specified in solrconfig.xml, and that _same_value_ should be specified in the stock solrconfig.xml. Is it not a reasonable assumption that this would be the case? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Understanding the Solr Admin page
I am expanding my Solr skills and would like to understand the Admin page better. I understand that understanding Java memory management and Java memory options will help me, and I am reading and experimenting on that front, but if there are any concise resources that are especially pertinent to Solr I would love to know about them. Everything that I've found is either a "do this" one-liner or expects Java experience which I don't have, and I don't know what I need to learn. I notice that some of the Args presented are in black text, and others in grey. Why are they presented differently? Where would I have found this information in the fine manual? When I start Solr with nohup, the resulting nohup.out file is _huge_. How might I start Solr such that INFO is not output, but only WARNINGs and SEVEREs are? In particular, I'd rather not log every query, even the invalid queries which also log as SEVERE. I thought that this would be easy to Google for, but it is not! If there is a concise document that examines this issue, I would love to know where on the wild wild web it exists. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Wed, Apr 3, 2013 at 8:47 PM, Shawn Heisey wrote: > On 4/2/2013 3:09 AM, Dotan Cohen wrote: >> I notice that this only occurs on queries that run facets. I start >> Solr with the following command: >> sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC >> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled >> -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar >> /opt/solr-4.1.0/example/start.jar & > > It looks like you've followed some advice that I gave previously on how > to tune java. I have since learned that this advice is bad, it results > in long GC pauses, even with heaps that aren't huge. > I see, thanks. > As others have pointed out, you don't have a max heap setting, which > would mean that you're using whatever Java chooses for its default, > which might not be enough. If you can get Solr to successfully run for > a while with queries and updates happening, the heap should eventually > max out and the admin UI will show you what Java is choosing by default. > > Here is what I would now recommend for a beginning point on your Solr > startup command. You may need to increase the heap beyond 4GB, but be > careful that you still have enough free memory to be able to do > effective caching of your index. > > sudo nohup java -Xms4096M -Xmx4096M -XX:+UseConcMarkSweepGC > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3 > -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled > -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts > -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar > /opt/solr-4.1.0/example/start.jar & > Thank you, I will experiment with that. > If you are running a really old build of java (latest versions on > Oracle's website are 1.6 build 43 and 1.7 build 17), you might want to > leave AggressiveOpts out. Some people would argue that you should never > use that option. > Great, thanks for the warning. This is what we're running; I'll see about updating it through my distro's package manager: $ java -version java version "1.6.0_27" OpenJDK Runtime Environment (IcedTea6 1.12.3) (6b27-1.12.3-0ubuntu1~12.04.1) OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: maxWarmingSearchers in Solr 4.
On Thu, Apr 4, 2013 at 10:54 PM, Shawn Heisey wrote: > You'll want to ensure that your autowarmCount value on Solr's caches is low > enough that each commit happens quickly. If it takes 5000 milliseconds to > warm the caches when you commit, then you want to be sure that you are > committing less often than that, or you'll quickly reach your > maxWarmingSearchers config value. If the commits are happening VERY > quickly, you may need to set autowarmCount to 0, and possibly disable caches > entirely. > I see. This seems to be the opposite of the approach that I was taking. >>> I went poking in the code, and it seems that maxWarmingSearchers >>> defaults to Integer.MAX_VALUE. I'm not sure whether this is a bad >>> default or not. It does mean that a pathological setup without >>> maxWarmingSearchers in the config will probably blow up with an >>> OutOfMemory exception, but is that better or worse than commits that >>> don't make new documents searchable? I can see arguments either way. >> >> >> This is interesting, what you found is that the value in the stock >> solrconfig.xml file differs from the Solr default value. I think that >> this is bad practice: a single default should be decided upon and Solr >> should use this value when nothing is specified in solrconfig.xml, and >> that _same_value_ should be specified in the stock solrconfig.xml. Is >> it not a reasonable assumption that this would be the case? > > > That was directed more at the other committers. I would argue that either a > low number or a relatively high number should be the default, but not > MAX_VALUE. The example config should have a commented out section for > maxWarmingSearchers that mentions the default. I'm having the same > discussion about maxBooleanClauses on SOLR-4586. > Right. > It's possible that this has already been discussed, and that everyone > prefers that a badly configured setup will eventually have a spectacular > blow up with OutOfMemory, rather than semi-silently ignoring commits. A > searcher object contains caches and uses a lot of memory, so having lots of > them around will eventually use up the entire heap. > Silently dropping data is by far the worse choice, I agree, especially as a default setting. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Why would one not use RemoveDuplicatesTokenFilterFactory?
I am looking through the schema of a Solr installation that I inherited last year. The original dev, who is unavailable for comment, has two types of text fields: one with RemoveDuplicatesTokenFilterFactory and one without. These fields are intended for full-text search. Why would someone _not_ use RemoveDuplicatesTokenFilterFactory on a field intended for full-text search? What are the drawbacks to using it? This application is very, very write heavy (hundreds of writes per minute) if that matters. It was running on websolr.com at the time, I've now moved it to Amazon Web Services. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Why would one not use RemoveDuplicatesTokenFilterFactory?
On Fri, May 24, 2013 at 4:04 PM, Jack Krupansky wrote: > The primary purpose of this filter is in conjunction with the > KeywordRepeatFilterFactory and a stemmer, to remove the tokens that did not > produce a stem from the original token, so the keyword duplicate is no > longer needed. The goal is to index both the stemmed and unstemmed terms at > the same position. > > Whether your app is using the filter for that purpose remains to be seen. > > Removing duplicates from the raw input token stream would impact the term > frequency. > > -- Jack Krupansky > Thank you Jack. I thought that the filter only removed tokens with both identical position and identical text: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory Are stemmed terms considered the same text as the original word, such that they will show as a dupe for the RemoveDuplicatesTokenFilterFactory? That seems odd. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Why would one not use RemoveDuplicatesTokenFilterFactory?
On Sun, May 26, 2013 at 8:16 PM, Jack Krupansky wrote: > The only comment I was trying to make here is the relationship between the > RemoveDuplicatesTokenFilterFactory and the KeywordRepeatFilterFactory. > > No, stemmed terms are not considered the same text as the original word. By > definition, they are a new value for the term text. > > I see, for some reason I did not concentrate on this key quote of yours: "...to remove the tokens that did not produce a stem ..." Now it makes perfect sense. Thank you, Jack! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
What exactly happens to extant documents when the schema changes?
When adding or removing a text field to/from the schema and then restarting Solr, what exactly happens to extant documents? Is the schema only consulted when Solr writes a document, so that extant documents are unaffected? Considering that Solr supports dynamic fields, my experimentation with removing and adding fields to the schema has shown almost no change in the results returned from the extant index. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: What exactly happens to extant documents when the schema changes?
On Tue, May 28, 2013 at 2:20 PM, Upayavira wrote: > The schema provides Solr with a description of what it will find in the > Lucene indexes. If you, for example, changed a string field to an > integer in your schema, that'd mess things up bigtime. I recently had to > upgrade a date field from the 1.4.1 date field format to the newer > TrieDateField. Given I had to do it on a live index, I had to add a new > field (just using copyfield) and re-index over the top, as the old field > was still in use. I guess, given my app now uses the new date field > only, I could presumably reindex the old date field with the new > TrieDateField format, but I'd want to try that before I do it for real. > Thank you for the insight. Unfortunately, with 20 million records and growing by hundreds each minute (social media posts) I don't see that I could ever reindex the data in a timely way. > However, if you changed a single valued field to a multi-valued one, > that's not an issue, as a field with a single value is still valid for a > multi-valued field. > > Also, if you add a new field, existing documents will be considered to > have no value in that field. If that is acceptable, then you're fine. > > I guess if you remove a field, then those fields will be ignored by > Solr, and thus not impact anything. But I have to say, I've never tried > that. > > Thus - changing the schema will only impact on future indexing. Whether > your existing index will still be valid depends upon the changes you are > making. > > Upayavira Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: What exactly happens to extant documents when the schema changes?
On Tue, May 28, 2013 at 3:58 PM, Jack Krupansky wrote: > The technical answer: Undefined and not guaranteed. > I was afraid of that! > Sure, you can experiment and see what the effects "happen" to be in any > given release, and maybe they don't tend to change (too much) between most > releases, but there is no guarantee that any given "change schema but keep > existing data without a delete of directory contents and full reindex" will > actually be benign or what you expect. > > As a general proposition, when it comes to changing the schema and not > deleting the directory and doing a full reindex, don't do it! Of course, we > all know not to try to walk on thin ice, but a lot of people will try to do > it anyway - and maybe it happens that most of the time the results are > benign. > In the case of this particular application, reindexing really is overly burdensome as the application is performing hundreds of writes to the index per minute. How might I gauge how much spare I/O Solr could commit to a reindex? All the data that I need is in fact in stored fields. Note that because the social media application that feeds our Solr index is global, there are no 'off hours'. > OTOH, you could file a Jira to propose that the effects of changing the > schema but keeping the existing data should be precisely defined and > documented, but, that could still change from release to release. > Seems like a lot of effort to document, for little benefit. I'm not going to file it. I would like to know, though: is the schema consulted at index time, query time, or both? > From a practical perspective for your original question: If you suddenly add > a field, there is no guarantee what will happen when you try to access that > field for existing documents, or what will happen if you "update" existing > documents. Sure, people can talk about what "happens to be true today", but > there is no guarantee for the future. Similarly for deleting a field from > the schema, there is no guarantee about the status of existing data, even > though people can chatter about "what it seems to do today." > > Generally, you should design your application around contracts and what is > guaranteed to be true, not what happens to be true from experiments or even > experience. Granted, that is the theory and sometimes you do need to rely on > experimentation and folklore and spotty or ambiguous documentation, but to > the extent possible, it is best to avoid explicitly trying to rely on > undocumented, uncontracted behavior. > Thanks. The application does change (added features) and we do not want to lose old data. > One question I asked long ago and never received an answer: what is the best > practice for doing a full reindex - is it sufficient to first do a delete of > "*:*", or does the Solr index directory contents or even the directory > itself need to be explicitly deleted first? I believe it is the latter, but > the former "seems" to work, most of the time. Deleting the directory itself > "seems" to be the best answer, to date - but no guarantees! > I don't have an answer for that, sorry! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Reindexing strategy
I see that I do need to reindex my Solr index. The index consists of 20 million documents with a few hundred new documents added per minute (social media data). The documents are mostly smaller than 1 KiB of data, but some may go as large as 10 KiB. All the data is text, and all indexed fields are stored. To reindex, I am considering adding a 'last_indexed' field, and having a Python or Java application pull out N results every T seconds when sorting on "last_indexed asc". How might I determine good values for N and T? I would like to know when the Solr index is 'overloaded', or whatever happens to Solr when it is being pushed beyond the limits of its hardware. What should I be looking at to know if Solr is over stressed? Is looking at CPU and memory good enough? Is there a way to measure I/O to the disk on which the Solr index is stored? Bear in mind that while the reindex is happening, clients will be performing searches and a few hundred documents will be written per minute. Note that the machine running Solr is an EC2 instance running on Amazon Web Services, and that the 'disk' on which the Solr index is stored is an EBS volume. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Reindexing strategy
On Wed, May 29, 2013 at 2:41 PM, Upayavira wrote: > I presume you are running Solr on a multi-core/CPU server. If you kept a > single process hitting Solr to re-index, you'd be using just one of > those cores. It would take as long as it takes, I can't see how you > would 'overload' it that way. > I mean 'overload' Solr in the sense that it cannot read, process, and write data fast enough because too much data is being handled. I remind you that this system is writing hundreds of documents per minute. Certainly there is a limit to what Solr can handle. I ask how to know how close I am to this limit. > I guess you could have a strategy that pulls 100 documents with an old > last_indexed, and push them for re-indexing. If you get the full 100 > docs, you make a subsequent request immediately. If you get less than > 100 back, you know you're up-to-date and can wait, say, 30s before > making another request. > Actually, I would add a filter query for documents whose last_indexed value is before the last schema change, and stop when fewer documents were returned than were requested. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
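P.S. Sketching that strategy in Java (SolrJ), the loop would look roughly like the following. The cutoff date, batch size N, pause T, core URL, and the 'last_indexed' field are assumptions that still need tuning; note the commitWithin window is kept shorter than the pause so the next batch query does not pick up the same documents again.

    import java.util.Date;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.common.SolrInputDocument;

    public class Reindexer {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://127.0.0.1:8983/solr/core");
            final String cutoff = "2013-05-29T00:00:00Z"; // time of the last schema change
            final int batchSize = 100;                    // N
            final long pauseMillis = 30000;               // T

            while (true) {
                SolrQuery q = new SolrQuery("*:*");
                q.addFilterQuery("last_indexed:[* TO " + cutoff + "]");
                q.setSort("last_indexed", SolrQuery.ORDER.asc);
                q.setRows(batchSize);

                QueryResponse rsp = solr.query(q);
                SolrDocumentList docs = rsp.getResults();

                for (SolrDocument d : docs) {
                    // All fields are stored, so each document can be rebuilt from the index itself.
                    SolrInputDocument in = ClientUtils.toSolrInputDocument(d);
                    in.removeField("_version_");
                    in.setField("last_indexed", new Date());
                    solr.add(in, 10000);  // commitWithin 10s, shorter than the pause below
                }

                if (docs.size() < batchSize) {
                    break;  // everything older than the cutoff has been reindexed
                }
                Thread.sleep(pauseMillis);  // throttle so searches and new documents are not starved
            }
        }
    }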
Removing a single value from a multiValue field
I have a Solr application with a multiValued field 'tags'. All fields are indexed in this application. There exists a uniqueKey field 'id' and a '_version_' field. This is running on Solr 4.x. In order to add a tag, the application retrieves the full document, creates a PHP array from the document structure, removes the '_version_' field, and then adds the appropriate tag to the 'tags' array. This is all then sent to Solr's update method via HTTP with 'overwrite=true'. Solr correctly replaces the extant document with the new document, which is identical with the exception of a new value for the '_version_' field and an additional value in the multiValued field 'tags'. This all works correctly. I am now adding a feature where one can remove tags. I am using the same business logic, however instead of adding a value to the 'tags' array I am removing one. I can confirm that the data being sent to Solr does not contain the removed tag. However, it seems that the old value for the multiValued field is persisted, that is, the old tag stays. I can see that the '_version_' field has a new value, so I see that the change was properly committed. Is there a known bug such that overwriting a document whose 'tags' field contains the values a and b with an otherwise-identical document whose 'tags' field contains only a has no effect? Can values in multiValued fields only be added, but not removed? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Reindexing strategy
On Wed, May 29, 2013 at 5:37 PM, Shawn Heisey wrote: > It's impossible for us to give you hard numbers. You'll have to > experiment to know how fast you can reindex without killing your > servers. A basic tenet for such experimentation, and something you > hopefully already know: You'll want to get baseline measurements before > you begin testing for comparison. > Thanks. I wasn't looking for hard numbers, but rather am looking for what the signs of problems are. I know to keep my eye on memory and CPU, but I have no idea how to check disk I/O, and I'm not sure how to determine whether it has become saturated. > One of the most reliable Solr-specific indicators of pushing your > hardware too hard is that the QTime on your queries will start to > increase dramatically. Solr 4.1 and later has more granular query time > statistics in the UI - the median and 95% numbers are much more > important than the average. > Thank you, this will help. At least I now have a hard metric to see when Solr is getting overburdened (QTime). > Outside of that, if your overall IOwait CPU percentage starts getting > near (or above) 30-50%, your server is struggling. If all of your CPU > cores are staying near 100% usage, then it's REALLY struggling. > I see, thanks. > Assuming you have plenty of CPU cores, using fast storage and having > plenty of extra RAM will alleviate much of the I/O bottleneck. The > usual rule of thumb for good query performance is that you need enough > RAM to put 50-100% of your index in the OS disk cache. For blazing > performance during a rebuild, that becomes 100-200%. If you had 150%, > that would probably keep most indexes well-cached even during a rebuild. > > A rebuild will always lower performance, even with lots of RAM. > Considering that the Solr index is the only place that the data is stored, and that users are actively using the system, I was not planning on a rebuild but rather to iteratively reindex the extant documents, even as new documents are being pushed in. > My earlier reply to your other message has some other ideas that will > hopefully help. > Thank you Shawn! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: What exactly happens to extant documents when the schema changes?
On Wed, May 29, 2013 at 5:09 PM, Shawn Heisey wrote: > I handle this in a very specific way with my sharded index. This won't > work for all designs, and the precise procedure won't work for SolrCloud. > > There is a 'live' and a 'build' core for each of my shards. When I want > to reindex, the program makes a note of my current position for deletes, > reinserts, and new documents. Then I use a DIH full-import from mysql > into the build cores. Once the import is done, I run the update cycle > of deletes, reinserts, and new documents on those build cores, using the > position information noted earlier. Then I swap the cores so the new > index is online. > I do need to examine sharding and multiple cores. I'll look into that, thank you. By the way, don't google for DIH! It took me some time to figure out that it is DataImportHandler, as some people use the acronym for something completely different. > To adapt this for SolrCloud, I would need to use two collections, and > update a collection alias for what is considered live. > > To control the I/O and CPU usage, you might need some kind of throttling > in your update/rebuild application. > > I don't need any throttling in my design. Because I'm using DIH, the > import only uses a single thread for each shard on the server. I've got > RAID10 for storage and half of the CPU cores are still available for > queries, so it doesn't overwhelm the server. > > The rebuild does lower performance, so I have the other copy of the > index handle queries while the rebuild is underway. When the rebuild is > done on one copy, I run it again on the other copy. Right now I'm > half-upgraded -- one copy of my index is version 3.5.0, the other is > 4.2.1. Switching to SolrCloud with sharding and replication would > eliminate this flexibility, unless I maintained two separate clouds. > Thank you. I am not using Solr Cloud but if I ever consider it, then I will keep this in mind. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Removing a single value from a multiValue field
On Thu, May 30, 2013 at 3:42 PM, Jack Krupansky wrote: > First, you cannot do any internal editing of a multi-valued list, other > than: > > 1. Replace the entire list. > 2. Add values on to the end of the list. > Thank you. I meant that I am actually editing the entire document. Reading it, changing the values that I need, and then 'updating' it. I will look into updating only the single multiValued field. > But you can do both of those operations on a single multivalued field with > "atomic update" without reading and writing the entire document. > > Second, there is no "remove" operation in the Solr Update XML format. Only > "set" and "add". > > To simply replace the full, current value of one multi-valued field: > > <add> >   <doc> >     <field name="id">doc-id</field> >     <field name="tags" update="set">a</field> >     <field name="tags" update="set">b</field> >   </doc> > </add> > > If you simply want to append a couple of values: > > <add> >   <doc> >     <field name="id">doc-id</field> >     <field name="tags" update="add">a</field> >     <field name="tags" update="add">b</field> >   </doc> > </add> > > To empty out a multivalued field: > > <add> >   <doc> >     <field name="id">doc-id</field> >     <field name="tags" update="set" null="true"/> >   </doc> > </add> > > Thank you. I will see about translating that into the JSON format that I work with. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Removing a single value from a multiValue field
On Thu, May 30, 2013 at 5:01 PM, Jack Krupansky wrote: > You gave an XML example, so I assumed you were working with XML! > Right, I did give the output as XML. I find XML to be a great document markup language, but a terrible command format! Mostly due to (mis-)use of the attributes. > In JSON... > > [{"id": "doc-id", "tags": {"add": ["a", "b"]}}] > > and > > [{"id": "doc-id", "tags": {"set": null}}] > Thank you! That is much more intuitive and less ambiguous than the XML, would you not agree? > BTW, this kind of stuff is covered in the book, separate chapters for XML > and JSON, each with dozens of examples like this. > I have not posted on the book postings, but I will definitely order one. My vote is for spiral bound, though I know that the perfect-bound will look more professional on a bookshelf. I don't even care what the book costs, within reason. Any resource that compiles in a single package the wonderful methods that you and other contributors mention here and in other places online will pay for itself in short order. Apache Solr is an amazing product, but it is often obtuse and unintuitive. Other times one does not even know what Solr is capable of, such as the case in this thread, where I was parsing entire documents just to change a value in a multiValued field. Thank you very much! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
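P.S. For anyone who prefers a client library to raw JSON, the same "set" update is roughly the following in SolrJ. A sketch only: 'tags', 'doc-id' and the core URL are just the values from the example above.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class RemoveTagExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://127.0.0.1:8983/solr/core");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-id");

            // Atomic update: replace the whole multiValued field with the surviving
            // values, which is how a single tag is effectively removed.
            Map<String, Object> op = new HashMap<String, Object>();
            op.put("set", Arrays.asList("a"));
            doc.addField("tags", op);

            server.add(doc);
            server.commit();
        }
    }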
Re: Reindexing strategy
On Fri, May 31, 2013 at 3:57 AM, Michael Sokolov wrote: > On UNIX platforms, take a look at vmstat for basic I/O measurement, and > iostat for more detailed stats. One coarse measurement is the number of > blocked/waiting processes - usually this is due to I/O contention, and you > will want to look at the paging and swapping numbers - you don't want any > swapping at all. But the best single number to look at is overall disk > activity, which is the I/O percentage utilized number Shawn was mentioning. > > -Mike Great, thanks! I've got some terms to google. For those who follow in my footsteps, on Ubuntu the package 'sysstat' needs to be installed to use iostat. Here are my reference stats before starting to experiment, both for my own use later to compare and also so that if anybody sees anything amiss here then I would love to know about it. If there is any fine manual that is particularly urgent that I should read, please do mention it. Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Receiving unexpected Faceting results.
Consider the following Solr query: select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags&rows=0 The 'tags' field is a multiValued field. I would expect the previous query to return only tags that begin with the string 'dotan-' such as: dotan-home dotan-work ...but not strings which do not begin with (or even contain) the string in question. However, I am getting results like these: the expected dotan-* tags with counts of 14 and 13, but also entries such as beatles (0) and beer (0). It _may_ be that the 'beer' and 'beatles' tags were once attached to the same documents as are attached the 'dotan-home' and/or 'dotan-work' tags. I've done a bit of experimenting on this Solr install, so I cannot be sure. However, considering that there are in fact 0 results for those two, I would not expect them to show up at all, even if they ever were attached to (i.e. once a value in the multiValued field of) any of the documents that match the filter query. So, the questions are: 1) How can I check whether the multiValued field of a particular document (given its uniqueKey id) ever contained a specific value? Alternatively, how can I see all the values that the document ever had for the field? I don't expect this to actually be possible, but I ask if it is, i.e. by examining certain aspects of the Solr index with a text editor. 2) If those spurious results are appearing, does that necessarily mean that those values of the multiValued field were in fact once in the multiValued field of documents matching the filter query? Thus, the answer to the previous question would be to simply run a query for the id of the document in question, and facet on the multiValued field with a large limit. 3) How can I have Solr return only those facet values for the field that in fact begin with 'dotan-', even if a document has other tags such as 'beatles'? 4) How can I have Solr return only those facet values which are larger than 0? Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Receiving unexpected Faceting results.
On Wed, Jun 5, 2013 at 3:38 PM, Raymond Wiker wrote: > 3) Use the parameter facet.prefix, e.g., facet.prefix=dotan-. Note: this > particular case will not work if the field you're faceting on is tokenised > (with "-" being used as a token separator). > > 4) Use the parameter facet.mincount - looks like you want to set it to 1, > instead of the default which is 0. Perfect, thank you Raymond! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
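Putting Raymond's two suggestions together with the original query, the request would look roughly like the following. It is sketched in Python only so the parameters are easy to read; whether facet.prefix works here still depends on the 'tags' field not being tokenised on '-':

    from urllib.parse import urlencode

    params = {
        "q": "*:*",
        "fq": "tags:dotan-*",
        "rows": 0,
        "facet": "true",
        "facet.field": "tags",
        "facet.prefix": "dotan-",  # only return facet values that start with 'dotan-'
        "facet.mincount": 1,       # drop facet values with a count of 0
    }
    print("select?" + urlencode(params))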
Re: Receiving unexpected Faceting results.
On Wed, Jun 5, 2013 at 3:41 PM, Brendan Grainger wrote: > Hi Dotan, > > I think all you need to do is add: > > facet.mincount=1 > > i.e. > > select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags& > rows=0&facet.mincount=1 > > Note that you can do it per field as well: > > select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags& > rows=0&f.tags.facet.mincount=1 > > http://wiki.apache.org/solr/SimpleFacetParameters#facet.mincount > Thanks, Brendan. I will review the available Facet Parameters, which I really should have thought to do before posting as it is already bookmarked!
Phrase matching with set union as opposed to set intersection on query terms
How would one write a query which should perform set union on the search terms (term1 OR term2 OR term3), and yet also perform phrase matching if both terms are found? I tried a few variants of the following, but in every case I am getting set intersection on the search terms: select?q={!q.op=OR}text:"term1 term2"~10 Thus, if term1 matches 10 documents and term2 matches 20 documents, then SET UNION would include all of the documents that have either term1 and/or term2. That means that between 20-30 results should be returned. Conversely, SET INTERSECTION would return only results with _both_ term1 _and_ term2, which could be between 0-10 documents. Note that in the application, users will be searching for any arbitrary number of terms, in fact they will be entering phrases. I can limit these phrases to 140 characters if needed. Thank you in advance! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Phrase matching with set union as opposed to set intersection on query terms
On Wed, Jun 5, 2013 at 6:10 PM, Shawn Heisey wrote: > On 6/5/2013 9:03 AM, Dotan Cohen wrote: >> How would one write a query which should perform set union on the >> search terms (term1 OR term2 OR term3), and yet also perform phrase >> matching if both terms are found? I tried a few variants of the >> following, but in every case I am getting set intersection on the >> search terms: >> >> select?q={!q.op=OR}text:"term1 term2"~10 > > A phrase search by definition will require all terms to be present. > Even though it is multiple terms, conceptually it is treated as a single > term. > > It sounds like what you are after is what edismax can do. If you define > the pf field in addition to the qf field, Solr will do something pretty > amazing - it will automatically construct a phrase query from a > non-phrase query and search with it against multiple fields. Done > correctly, this means that an exact match will be listed first in the > results. > > http://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29 > > Thanks, > Shawn > Thank you Shawn, this pretty much does what I need it to do: select?defType=edismax&q={!q.op=OR}search_field:term1 term2&pf=search_field I'm reviewing the Edismax page now. Is there any other documentation that I should review? I have found the Edismax page at the wonderful lucidworks site, but if there is any other documentation that would help me squeeze the most out of Edismax then I would love to know about it. http://docs.lucidworks.com/display/solr/The+Extended+DisMax+Query+Parser Thank you very much! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Phrase matching with set union as opposed to set intersection on query terms
On Wed, Jun 5, 2013 at 6:23 PM, Jack Krupansky wrote: > term1 OR term2 OR "term1 term2"^2 > > term1 OR term2 OR "term1 term2"~10^2 > > The latter would rank documents with the terms nearby higher, and the > adjacent terms highest. > > term1 OR term2 OR "term1 term2"~10^2 OR "term1 term2"^20 OR "term2 term1"^20 > > To further boost adjacent terms. > > But the edismax pf/pf2/pf3 options might be good enough for you. > Thank you Jack. I suppose that I could write a script in PHP to create such a query string from an arbitrary-length phrase, but it wouldn't be pretty! Edismax does in fact meet my need, though. Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
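A rough sketch of the query-building script mentioned above, following Jack's pattern. This is illustrative only: the slop and boost values are simply the ones from his example, and a real version would need to escape Lucene special characters in the user input:

    def build_query(phrase, slop=10, near_boost=2, exact_boost=20):
        # Build 'term1 OR term2 OR "term1 term2"~slop^boost OR "term1 term2"^boost'
        # in the style described above.
        terms = phrase.split()
        clauses = list(terms)
        if len(terms) > 1:
            quoted = '"%s"' % " ".join(terms)
            clauses.append('%s~%d^%d' % (quoted, slop, near_boost))  # terms near each other
            clauses.append('%s^%d' % (quoted, exact_boost))          # exact phrase
        return " OR ".join(clauses)

    print(build_query("term1 term2"))
    # term1 OR term2 OR "term1 term2"~10^2 OR "term1 term2"^20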
Re: Phrase matching with set union as opposed to set intersection on query terms
> select?defType=edismax&q={!q.op=OR}search_field:term1 term2&pf=search_field > Is there any way to perform a fuzzy search with this method? I have tried appending "~1" to every term in the search like so: select?defType=edismax&q={!q.op=OR}search_field:term1~1%20term2~1&pf=search_field However, there are two issues: 1) It doesn't work! The results are identical to the results given when not appending "~1" to every term (or "~3"). 2) If at all possible, I would rather define the 'fuzziness' elsewhere. Right now I would have to mangle the user input in order to add the "~1" to the end of each term. Note that the ExtendedDisMax page does in fact mention that fuzziness is supported: http://wiki.apache.org/solr/ExtendedDisMax#Query_Syntax -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Phrase matching with set union as opposed to set intersection on query terms
On Wed, Jun 5, 2013 at 9:04 PM, Eustache Felenc wrote: > There is also http://wiki.apache.org/solr/SolrRelevancyCookbook with nice > examples. > Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Filtering on results with more than N words.
Is there any way to restrict the search results to only those documents with more than N words / tokens in the searched field? I thought that this would be an easy one to Google for, but I cannot figure it out or find any references. There are many references to word size in characters, but not to field size in words. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
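One approach that should work, though it is an assumption on my part rather than something Solr offers out of the box: compute the word count client-side at index time, store it in an integer field (here called word_count, which would have to be added to the schema), and filter on it with a range query. A sketch:

    def with_word_count(doc, text_field="comment", count_field="word_count"):
        # Attach a whitespace-token count to the document before indexing it.
        doc = dict(doc)
        doc[count_field] = len(doc.get(text_field, "").split())
        return doc

    doc = with_word_count({"id": "42", "comment": "all that glitters is gold"})
    # doc["word_count"] == 5; at query time, filter with a range query such as
    #   fq=word_count:[6 TO *]   (only documents with more than 5 words)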
Re: Best way to retrieve 20 specific documents
On Tue, Nov 20, 2012 at 12:45 AM, Shawn Heisey wrote: > You can also use this query format: > > id:(123 OR 456 OR 789) > > This does get expanded internally by the query parser to the format that has > the field name on every clause, but it is sometimes easier to write code > that produces the above form. > Thank you Shawn, that is much cleaner and will be easier to debug when / if things go wrong. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
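For completeness, producing that form from a list of ids in client code is a one-liner (a sketch; 'id' is the uniqueKey field from Shawn's example):

    ids = ["123", "456", "789"]
    query = "id:(%s)" % " OR ".join(ids)
    # id:(123 OR 456 OR 789)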
Re: Error: _version_field must exist in schema
On Thu, Nov 22, 2012 at 9:26 PM, Nick Zadrozny wrote: > Belated reply, but this is probably something you should let us know about > directly at supp...@onemorecloud.com if it happens again. Cheers. > Hi Nick. This particular issue was on a Solr 4 instance on AWS, not on the Websolr account. But I commend you taking notice and taking an interest. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: What should focus be on hardware for solr servers?
On Thu, Feb 14, 2013 at 5:54 PM, Michael Della Bitta wrote: > My dual-core, HT-enabled Dell Latitude from last year has this CPU: > model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz > bogomips: 4988.65 > > An m3.xlarge reports: > model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz > bogomips : 4000.14 > > I tried running geekbench and phoronix-test-suite and failed at both... > Anybody have a favorite, free, CLI benchmarking suite? > I'll suggest to the Phoronix team that they include some Solr tests in their suite. Solr does seem to be a perfect test for Phoronix, and much more relevant for some readers than John the Ripper or Quake. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Solr 4 Spatial: NoClassDefFoundError: com/vividsolutions/jts/geom/Geometry
Note that the issue is present in Solr 4.1 as well. I did find this post, which is not very encouraging: http://grokbase.com/t/lucene/solr-user/128sz03jdk/recursiveprefixtreestrategy-class-not-found Might the name of the class be simply a typo that is easily rectified? How might one go about checking which classes are available? Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Solr 4 Spatial: NoClassDefFoundError: com/vividsolutions/jts/geom/Geometry
On Wed, Feb 27, 2013 at 10:24 AM, Smiley, David W. wrote: > Dotan, > > http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4#Configuration > You need to put the JTS jar within Solr's WEB-INF/lib; unfortunately you > can't simply reference it via a <lib> entry and put it wherever. FWIW you > can find the same question and my response on Stackoverflow. > > ~ David > Thank you David. In fact I do frequent Stack Overflow. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Can't search words in quotes
On Thu, Feb 28, 2013 at 8:14 AM, Alex Cougarman wrote: > Thanks, Oussama. That was very useful information and we have added the > double quotes. One interesting trick: we had to change the way we did it to > wrap the pattern value in single quotes so we could have double quotes inside. > Hi Alex. Would you mind posting the new analyzers? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Returning to Solr 4.0 from 4.1
Solr 4.1 has been giving me much trouble, rejecting documents on indexing. While I try to work my way through this, I would like to move our application back to Solr 4.0. However, now when I try to start Solr with the same index that was created with Solr 4.0 but has been running on 4.1 for a few days, I get this error chain: org.apache.solr.common.SolrException: Error opening new searcher Caused by: org.apache.solr.common.SolrException: Error opening new searcher Caused by: java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene41' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Lucene40, Lucene3x] Obviously I'll not be installing Lucene41 in Solr 4.0, but is there any way to work around this? Note that neither solrconfig.xml nor schema.xml have changed. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Returning to Solr 4.0 from 4.1
On Fri, Mar 1, 2013 at 11:28 AM, Rafał Kuć wrote: > Hello! > > I suppose the only way to make this work will be reindexing the data. > Solr 4.1 uses Lucene 4.1 as you know, which introduced new default > codec with stored fields compression and this is one of the reasons > you can't read that index with 4.0. > Thank you. My first inclination is to "reindex" the documents, but the only store of these documents is the Solr index itself. I am trying to find solutions to create a new core and to index the data in the old core into the new core. I'm not finding any good ways of going about this. Note that we are talking about ~18,000,000 (yes, 18 million) small documents similar to 'tweets' (mostly under 1 KiB each, very very few over 5 KiB). -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Returning to Solr 4.0 from 4.1
On Fri, Mar 1, 2013 at 11:59 AM, Rafał Kuć wrote: > Hello! > > I assumed that re-indexing can be painful in your case, if it wouldn't > you probably would re-index by now :) I guess (didn't test it myself), > that you can create another collection inside your cluster, use the > old codec for Lucene 4.0 (setting the version in solrconfig.xml should > be enough) and re-indexing, but still re-indexing will have to be > done. Or maybe someone knows a better way ? > Will I have to reindex via an external bridging script, such as a Python script which requests N documents at a time, indexes them into Solr 4.1, then requests the next N documents to index? Or is there an internal Solr / Lucene facility for this? I have actually looked for such a facility, but as I am unable to find one, I ask. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
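A minimal sketch of that kind of bridging script, assuming every field needed to rebuild a document is stored and that plain start/rows paging is acceptable. The core URLs, sort field, and page size are placeholders, and deep paging this way gets slow on very large result sets:

    import json
    import urllib.request
    from urllib.parse import urlencode

    OLD = "http://localhost:8983/solr/old_core"  # placeholder source core URL
    NEW = "http://localhost:8983/solr/new_core"  # placeholder destination core URL
    ROWS = 500                                   # documents fetched per request

    def fetch(start):
        # Page through the old core; requires that all needed fields are stored.
        params = urlencode({"q": "*:*", "sort": "id asc", "start": start,
                            "rows": ROWS, "wt": "json"})
        with urllib.request.urlopen("%s/select?%s" % (OLD, params)) as resp:
            return json.load(resp)["response"]["docs"]

    def index(docs):
        req = urllib.request.Request("%s/update?commit=true" % NEW,
                                     data=json.dumps(docs).encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

    start = 0
    while True:
        docs = fetch(start)
        if not docs:
            break
        for doc in docs:
            doc.pop("_version_", None)  # do not re-send Solr's internal field
        index(docs)
        start += ROWS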
Re: Returning to Solr 4.0 from 4.1
On Fri, Mar 1, 2013 at 12:22 PM, Rafał Kuć wrote: > Hello! > > As far as I know you have to re-index using external tool. > Thank you Rafał. That is what I figured. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Returning to Solr 4.0 from 4.1
On Fri, Mar 1, 2013 at 1:37 PM, Upayavira wrote: > Can you use a checkout from SVN? Does that resolve your issues? That is > what will become 4.2 when it is released soon: > > https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/ > > Upayavira > Thank you. Which feature of 4.2 are you suggesting for this issue? Can Solr 4.2 natively import from a Solr index? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Returning to Solr 4.0 from 4.1
On Sat, Mar 2, 2013 at 9:32 PM, Upayavira wrote: > What I'm questioning is whether the issue you see in 4.1 has been > resolved in Subversion. While I would not expect 4.0 to read a 4.1 > index, the SVN branch/4.2 should be able to do so effortlessly. > > Upayavira > I see, thanks. Actually, running a clean 4.1 with no previous index does not have the issues. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Get results only from the last hour
On Mon, Aug 20, 2012 at 3:00 PM, Markus Jelsma wrote: > Date queries are described here: http://wiki.apache.org/solr/SolrQuerySyntax > Terrific, thank you! > You must first make sure your dates end up in a Date fieldType and are in the > proper format. > Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
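For reference, once the dates are in a proper date field, "last hour" becomes a single range filter using Solr date math. The 'added' field name below is a placeholder and must be a Solr date field:

    from urllib.parse import urlencode

    # Only documents whose date field falls within the last hour.
    params = {"q": "*:*", "fq": "added:[NOW-1HOUR TO NOW]"}
    print("select?" + urlencode(params))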
Re: Faceting Facets
On Mon, Sep 3, 2012 at 5:50 PM, Alexey Serba wrote: > http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting > Thank you, that does seem to be only available on Solr 4.0. Luckily, we're using Websolr so upgrading is rather easy! Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Faceting Facets
On Mon, Sep 3, 2012 at 6:07 PM, Tanguy Moal wrote: > I think it's not possible to combine pivots with facet queries, nor with > facet ranges (or facet dates), please someone correct me if I'm wrong... > > I think only "standard" fields are "pivotable" :) > > That said, if you always use the same ranges for your DateTime field, you > *could* have a "string" version of the time field that only outputs the > hour of the day of the date contained in your time field, and then you'll > be able to use facet.pivots with those two text fields. > > You could still use the original date time field to constrain the results > set to return docs within the last 24 hours... > > Would that make sense to you ? > I think that I understand you! Actually, the DateTime is currently being stored as a UNIX timestamp for compatibility with other software. I had planned on converting it all over to the internal Solr Datetime type, but I now see that I should leave it as a timestamp. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
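A sketch of the index-time bucketing Tanguy suggests: deriving a string field truncated to the hour from the UNIX timestamp so that it can participate in facet.pivot. The field names here are placeholders:

    from datetime import datetime, timezone

    def add_hour_bucket(doc, ts_field="timestamp", bucket_field="added_hour"):
        # Derive a string field truncated to the hour from a UNIX timestamp,
        # suitable for facet.pivot alongside other plain fields.
        dt = datetime.fromtimestamp(doc[ts_field], tz=timezone.utc)
        doc[bucket_field] = dt.strftime("%Y-%m-%dT%H:00:00Z")
        return doc

    doc = add_hour_bucket({"id": "42", "timestamp": 1346684567})
    # doc["added_hour"] == "2012-09-03T15:00:00Z"
    # The original timestamp field can still be used in an fq to limit
    # results to the last 24 hours.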
Re: Unexpected results in Solr 4 Pivot Faceting
On Fri, Sep 7, 2012 at 12:23 PM, Erik Hatcher wrote: > Pivot facets currently only work with individual terms, not ranges. > > The response you provided does look odd in that there are duplicate > timestamps listed, but pivots were only implemented for textual (string being > the most common type) fields initially. > I see, thanks. Other than creating an additional rounded-off timestamp field, are there any other solutions? Might ranges work if instead of a timestamp we used a real DateTime field? In any case, in order to pivot on the timestamp, will I have to change its type to string? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Unexpected results in Solr 4 Pivot Faceting
On Fri, Sep 7, 2012 at 4:05 PM, Erik Hatcher wrote: > Ranges won't work at all pivots are purely by individual term currently. > > If you want to pivot by ranges, and you can define those ranges during > indexing, then you could make a field that represented which range each > document is in. > > doc: > id: 1234 > category: History > date_range_buckets: 2004/March->June > > or something like that. Then you could pivot on category and > date_range_buckets. It's a hacky workaround, but might just be sufficient > for some cases. > Thanks. As there are other applications using the index I was hoping to avoid adding a redundant work-around field. But it looks like the best solution. Just to be clear, as I'm not logged onto the dev server at the moment but it was implied in an earlier mail: Any field that is to be pivoted on needs to be a string field? Is that documented, as I cannot find that in the docs. Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Unexpected results in Solr 4 Pivot Faceting
On Fri, Sep 7, 2012 at 4:39 PM, Erik Hatcher wrote: >> Just to be clear, as I'm not logged onto the dev server at the moment >> but it was implied in an earlier mail: Any field that is to be pivoted >> on needs to be a string field? Is that documented, as I cannot find >> that in the docs. > > No, it doesn't need to be a string field but whatever terms come out of > the analysis process are what gets faceted upon. If it was a "text" field, > each word in the field would be a facet value. A "trie" field probably > doesn't work properly, as it indexes multiple terms per value and you'd get > odd values. Pivot faceting was initially implemented only with textual > terms in mind, and string is generally the desired type. > Thanks for the insight. I'll see how much time for experimentation I might afford. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Unexpected results in Solr 4 Pivot Faceting
On Fri, Sep 7, 2012 at 5:04 PM, Yonik Seeley wrote: > On Fri, Sep 7, 2012 at 9:39 AM, Erik Hatcher wrote: >> A "trie" field probably doesn't work properly, as it indexes multiple terms >> per value and you'd get odd values. > > I don't know about pivot faceting, but all of the other types of > faceting take this into account (hence faceting works fine on trie > fields). > Thanks. I am not familiar with the trie field, but I'll look into it. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Return only matched multiValued field
Assuming a multivalued, stored and indexed field with name "comment". When performing a search, I would like to return only the values of "comment" which contain the match. For example, when searching for "gold", instead of getting this result: <arr name="comment"> <str>Theres a lady whos sure</str> <str>all that glitters is gold</str> <str>and shes buying a stairway to heaven</str> </arr> I would prefer to get this result: <arr name="comment"> <str>all that glitters is gold</str> </arr> (pseudo-XML from memory, may not be accurate but illustrates the point) Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Solr unique key can't be blank
On Wed, Sep 12, 2012 at 5:27 PM, Ahmet Arslan wrote: > Hi Dotan, > > Did you define the following update processor chain in solrconfig.xml? > And did you reference it in an update handler? > > [updateRequestProcessorChain XML stripped by the list archive; only the field name "id" survives] > Thank you Ahmet! In fact, I did not know that the updateRequestProcessorChain needed to be defined in solrconfig.xml and I had tried to define it in schema.xml. I don't have access to solrconfig.xml (I am using Websolr) but I will contact them about adding it. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Cannot insert text into solr.StrField field
On Fri, Sep 14, 2012 at 1:00 AM, Jack Krupansky wrote: > Did you check the log file? > > How are you adding data to Solr? Show us the actual input document or code. > The Solr instance is on Websolr, so I cannot check the log file directly. I will put in a feature request for that, though. I am adding the documents with Solr-PHP-Client. In fact, prefixing the variable with an (int) cast does resolve the issue I had found. This looks like an issue with PHP being weakly typed. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Return only matched multiValued field
Assuming a multivalued, stored and indexed field with name "comment". When performing a search, I would like to return only the values of "comment" which contain the match. For example, when searching for "gold", instead of getting this result: <arr name="comment"> <str>Theres a lady whos sure</str> <str>all that glitters is gold</str> <str>and shes buying a stairway to heaven</str> </arr> I would prefer to get this result: <arr name="comment"> <str>all that glitters is gold</str> </arr> (pseudo-XML from memory, may not be accurate but illustrates the point) Is there any way to do this with a Solr 4 index? The client accessing Solr is on a dial-up connection (no provision for DSL or other high-speed internet) so I'd like to move as little data over the wire as possible. In reality, the array will have tens of values, so returning only the relevant ones may reduce the data transferred by an order of magnitude. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Return only matched multiValued field
> <field name="doctest" ... indexed="true" multiValued="true" /> > Note that in anonymizing the information, I introduced a typo. The above "doctest" should be "doctext". In any case, the field names in the production application and in the production schema do in fact match! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Return only matched multiValued field
On Mon, Sep 24, 2012 at 2:16 PM, Erick Erickson wrote: > Hmmm, works for me. What is your entire response packet? > > And you've covered the bases with indexed and stored so this > seems like it _should_ work. > I'm sorry, reducing the output to rows=1 helped me notice that the highlighted sections come after the main results. The highlighting feature works as expected. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Return only matched multiValued field
On Mon, Sep 24, 2012 at 9:47 AM, Mikhail Khludnev wrote: > Hi > It seems like highlighting feature. Thank you Mikhail. I actually do need the entire matched single entry, not a snippet of it. Looking at the example in the OP, with highlighting on "gold" I would get "glitters is gold", whereas I need "all that glitters is gold". Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
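If the whole matched value is what is needed rather than a snippet, the highlighter can usually be told not to fragment at all: setting hl.fragsize=0 should make it return the entire field value (worth verifying on the version in use). A sketch of the parameters, reusing the 'comment' field from the example:

    from urllib.parse import urlencode

    params = {
        "q": "comment:gold",
        "fl": "id",            # keep the full stored values out of the main response
        "hl": "true",
        "hl.fl": "comment",
        "hl.fragsize": 0,      # 0 = do not fragment: return the whole field value
        "hl.snippets": 10,     # allow several matching values per document
    }
    print("select?" + urlencode(params))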
How does Solr know which relative paths to use?
I have just installed Solr 4.0 on a test server. I start it like so:

$ pwd
/some/dir
$ java -jar start.jar

The Solr instance now looks like this:

CWD /some/dir
Instance /some/dir/solr/collection1
Data /some/dir/solr/collection1/data
Index /some/dir/solr/collection1/data/index

Where did the additional relative paths 'collection1', 'collection1/data', and 'collection1/data/index' come from? I know that I can change the value of CWD with the -Dsolr.solr.home flag, but what affects the relative paths mentioned? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How does Solr know which relative paths to use?
On Wed, Oct 17, 2012 at 12:16 AM, P Williams wrote: > Hi Dotan, > > It seems that the examples now use Multiple > Cores<http://wiki.apache.org/solr/CoreAdmin>by default. If your test > server is based on the stock example, you should > see a solr.xml file in your CWD path which is how Solr knows about the > relative paths. There should also be a README.txt file that will tell you > more about how the directory is expected to be organized. > > Cheers, > Tricia > Thanks. I read the top-level README.txt but now I see that the answer is in the solr/README.txt file. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Error: _version_field must exist in schema
On Thu, Oct 18, 2012 at 12:09 AM, Rafał Kuć wrote: > Hello! > > The _version_ field is needed by some of Solr 4.0 functionality like > transaction log or partial documents update. If you want to use them, > just update your schema.xml and put the _version_ field definition > there. > > However if you don't want those, you can remove the transaction log > configuration in your solrconfig.xml. However please remember that > when using SolrCloud you'll need that field. > Thanks. Where is that bit documented? I don't see it on the Solr wiki: http://wiki.apache.org/solr/SchemaXml I do have a Solr 4 Beta index running on Websolr that does not have such a field. It works, but throws many "Service Unavailable" and "Communication Error" errors. Might the lack of the _version_ field be the reason? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Error: _version_field must exist in schema
On Thu, Oct 18, 2012 at 12:25 AM, Rafał Kuć wrote: > Hello! > > You can find some information about the requirements of SolrCloud at > http://wiki.apache.org/solr/SolrCloud . I don't know if _version_ is > mentioned elsewhere. > > As for Websolr - I'm afraid I can't say anything about the cause of > those errors without seeing the exception. > I see, thanks. I don't think that I'm using the SolrCloud feature. Is it enabled because there exist both "solr/collection1" and "multicore/core0"? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Error: _version_field must exist in schema
On Thu, Oct 18, 2012 at 9:21 AM, Rafał Kuć wrote: > Hello! > > Look at your solrconfig.xml file, you should see something like that: > > <updateLog> <str name="dir">${solr.data.dir:}</str> </updateLog> > > Just remove it and Solr shouldn't bother you with the version field > information. However remember that some features won't work (like the > real time get or partial documents update). > Thank you. Is there any place where this is documented? It certainly does not appear in the relevant wiki page: http://wiki.apache.org/solr/SolrConfigXml > You can also add _version_ field to your schema and forget about it. > You don't need to do anything with it as it is used internally by > Solr. > That is exactly my plan, but I would also like to understand more about what is going on. I don't like cut-and-paste programming. Thank you very much! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Error: _version_field must exist in schema
On Thu, Oct 18, 2012 at 1:06 PM, Erick Erickson wrote: > I've updated the schema.xml page, see > http://wiki.apache.org/solr/SchemaXml#Recommended_fields > Great, thanks! > Care to change the schema.xml file to warn about this too and > submit a patch? > If you are referring to the example schema.xml file provided with Solr, then I'd love to. I'm signing up for the dev list now. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
When Solr is slow, I'm seeing these in the logs: [collection1] Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. [collection1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2 Googling, I found this in the FAQ: "Typically the way to avoid this error is to either reduce the frequency of commits, or reduce the amount of warming a searcher does while it's on deck (by reducing the work in newSearcher listeners, and/or reducing the autowarmCount on your caches)" http://wiki.apache.org/solr/FAQ#What_does_.22PERFORMANCE_WARNING:_Overlapping_onDeckSearchers.3DX.22_mean_in_my_logs.3F I happen to know that the script will try to commit once every 60 seconds. How does one "reduce the work in newSearcher listeners"? What effect will this have? What effect will reducing the autowarmCount on caches have? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 5:02 PM, Rafał Kuć wrote: > Hello! > > You can check if the long warming is causing the overlapping > searchers. Check Solr admin panel and look at cache statistics, there > should be warmupTime property. > Thank you, I have gone over the Solr admin panel twice and I cannot find the cache statistics. Where are they? > Lowering the autowarmCount should lower the time needed to warm up, > howere you can also look at your warming queries (if you have such) > and see how long they take. > Thank you, I will look at that! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 5:27 PM, Mark Miller wrote: > Are you using Solr 3X? The occasional long commit should no longer > show up in Solr 4. > Thank you Mark. In fact, this is the production release of Solr 4. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 7:29 PM, Shawn Heisey wrote: > On 10/22/2012 9:58 AM, Dotan Cohen wrote: >> >> Thank you, I have gone over the Solr admin panel twice and I cannot find >> the cache statistics. Where are they? > > > If you are running Solr4, you can see individual cache autowarming times > here, assuming your core is named collection1: > > http://server:port/solr/#/collection1/plugins/cache?entry=queryResultCache > http://server:port/solr/#/collection1/plugins/cache?entry=filterCache > > The warmup time for the entire searcher can be found here: > > http://server:port/solr/#/collection1/plugins/core?entry=searcher > > Thank you Shawn! I can see how I missed that data. I'm reviewing it now. Solr has a low barrier to entry, but quite a learning curve. I'm loving it! I see that the server is using less than 2 GiB of memory, whereas it is a dedicated Solr server with 16 GiB of memory. I understand that I can increase the query and document caches to increase performance, but I worry that this will increase the warm-up time to unacceptable levels. What is a good strategy for increasing the caches yet preserving performance after an optimize operation? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 9:22 PM, Mark Miller wrote: > Perhaps you can grab a snapshot of the stack traces when the 60 second > delay is occurring? > > You can get the stack traces right in the admin ui, or you can use > another tool (jconsole, visualvm, jstack cmd line, etc) > Thanks. I've refactored so that the index is optimized once per hour, instead of after each dump of commits. But when I need to increase the optimize frequency in the future, I will go through the stack traces. Thanks! In any case, the server has an extra 14 GiB of memory available; how might I make the best use of that for Solr, assuming both heavy reads and writes? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 10:01 PM, Walter Underwood wrote: > First, stop optimizing. You do not need to manually force merges. The system > does a great job. Forcing merges (optimize) uses a lot of CPU and disk IO and > might be the cause of your problem. > Thanks. Looking at the index statistics, I see that within minutes after running optimize, the stats say the index needs to be reoptimized. Though, the index still reads and writes fine even in that state. > Second, the OS will use the "extra" memory for file buffers, which really > helps performance, so you might not need to do anything. This will work > better after you stop forcing merges. A forced merge replaces every file, so > the OS needs to reload everything into file buffers. > I don't see that the memory is being used:

$ free -g
                   total     used     free   shared  buffers   cached
Mem:                  14        2       12        0        0        1
-/+ buffers/cache:              0       14
Swap:                  0        0        0

-- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 10:44 PM, Walter Underwood wrote: > Lucene already did that: > > https://issues.apache.org/jira/browse/LUCENE-3454 > > Here is the Solr issue: > > https://issues.apache.org/jira/browse/SOLR-3141 > > People over-use this regardless of the name. In Ultraseek Server, it was > called "force merge" and we had to tell people to stop doing that nearly > every month. > Thank you for those links. I commented on the Solr bug. There are some very insightful comments in there. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Tue, Oct 23, 2012 at 3:52 AM, Shawn Heisey wrote: > As soon as you make any change at all to an index, it's no longer > "optimized." Delete one document, add one document, anything. Most of the > time you will not see a performance increase from optimizing an index that > consists of one large segment and a bunch of very tiny segments or deleted > documents. > I've since realized that by experimentation. I've probably saved quite a few minutes of reading time by investing hours of experiment time! > How big is your index, and did you run this right after a reboot? If you > did, then the cache will be fairly empty, and Solr has only read enough from > the index files to open the searcher. The number is probably too small to > show up on a gigabyte scale. As you issue queries, the cached amount will > get bigger. If your index is small enough to fit in the 14GB of free RAM > that you have, you can manually populate the disk cache by going to your > index directory and doing 'cat * > /dev/null' from the commandline or a > script. The first time you do it, it may go slowly, but if you immediately > do it again, it will complete VERY fast -- the data will all be in RAM. > The cat trick to get the files in RAM is great. I would not have thought that would work for binary files. The index is small, much less than the available RAM, for the time being. Therefore, there was nothing to fill it with, I now understand. Both 'free' outputs were after the system had been running for some time. > The 'free -m' command in your first email shows cache usage of 1243MB, which > suggests that maybe your index is considerably smaller than your available > RAM. Having loads of free RAM is a good thing for just about any workload, > but especially for Solr. Try running the free command without the -g so you > can see those numbers in kilobytes. > > I have seen a tendency towards creating huge caches in Solr because people > have lots of memory. It's important to realize that the OS is far better at > the overall job of caching the index files than Solr itself is. Solr caches > are meant to cache result sets from queries and filters, not large sections > of the actual index contents. Make the caches big enough that you see some > benefit, but not big enough to suck up all your RAM. > I see, thanks. > If you are having warm time problems, make the autowarm counts low. I have > run into problems with warming on my filter cache, because we have filters > that are extremely hairy and slow to run. I had to reduce my autowarm count > on the filter cache to FOUR, with a cache size of 512. When it is 8 or > higher, it can take over a minute to autowarm. > I will have to experiment with the warming. Thank you for the tips. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Tue, Oct 23, 2012 at 3:07 PM, Erick Erickson wrote: > Maybe you've been looking at it but one thing that I didn't see on a fast > scan was that maybe the commit bit is the problem. When you commit, > eventually the segments will be merged and a new searcher will be opened > (this is true even if you're NOT optimizing). So you're effectively committing > every 1-2 seconds, creating many segments which get merged, but more > importantly opening new searchers (which you are getting since you pasted > the message: Overlapping onDeckSearchers=2). > > You could pinpoint this by NOT committing explicitly, just set your autocommit > parameters (or specify commitWithin in your indexing program, which is > preferred). Try setting it at a minute or so and see if your problem goes away > perhaps? > > The NRT stuff happens on soft commits, so you have that option to have the > documents immediately available for search. > Thanks, Erick. I'll play around with different configurations. So far just removing the periodic optimize command worked wonders. I'll see how much it helps or hurts to run that daily or more or less frequent. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
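A sketch of what the commitWithin suggestion looks like from the indexing side: the commit deadline rides along with each update request instead of the client issuing explicit commit or optimize calls. The URL and field names are placeholders, and 60000 ms matches the roughly once-a-minute target mentioned earlier in the thread:

    import json
    import urllib.request

    def index_docs(docs, commit_within_ms=60000):
        # Send documents and let Solr commit them within the given window,
        # instead of the client issuing explicit commit/optimize calls.
        url = "http://localhost:8983/solr/update?commitWithin=%d" % commit_within_ms
        req = urllib.request.Request(url,
                                     data=json.dumps(docs).encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

    index_docs([{"id": "42", "comment": "all that glitters is gold"}])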
Re: Occasional Solr performance issues
On Wed, Oct 24, 2012 at 4:33 PM, Walter Underwood wrote: > Please consider never running "optimize". That should be called "force merge". > Thanks. I have been letting the system run for about two days already without an optimize. I will let it run a week, then merge to see the effect. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
I spoke too soon! Whereas three days ago when the index was new 500 records could be written to it in under 3 seconds, now that operation is taking a minute and a half, sometimes longer. I ran optimize() but that did not help the writes. What can I do to improve the write performance? Even opening the Logging tab of the Solr instance is taking quite a long time. In fact, I just left it for 20 minutes and it still hasn't come back with anything. I do have an SSH window open on the server hosting Solr and it doesn't look overloaded at all:

$ date && du -sh data/ && uptime && free -m
Fri Oct 26 13:15:59 UTC 2012
578M    data/
 13:15:59 up 4 days, 17:59, 1 user, load average: 0.06, 0.12, 0.22
                   total     used     free   shared  buffers   cached
Mem:               14980     3237    11743        0      284
-/+ buffers/cache:           729    14250
Swap:                  0        0        0

-- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Fri, Oct 26, 2012 at 4:02 PM, Shawn Heisey wrote: > > Taking all the information I've seen so far, my bet is on either cache > warming or heap/GC trouble as the source of your problem. It's now specific > information gathering time. Can you gather all the following information > and put it into a web paste page, such as pastie.org, and reply with the > link? I have gathered the same information from my test server and created > a pastie example. http://pastie.org/5118979 > > On the dashboard of the GUI, it lists all the jvm arguments. Include those. > > Click Java Properties and gather the "java.runtime.version" and > "java.specification.vendor" information. > > After one of the long update times, pause/stop your indexing application. > Click on your core in the GUI, open Plugins/Stats, and paste the following > bits with a header to indicate what each section is: > CACHE->filterCache > CACHE->queryResultCache > CORE->searcher > > Thanks, > Shawn > Thank you Shawn. The information is here: http://pastebin.com/aqEfeYVA -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Fri, Oct 26, 2012 at 11:04 PM, Shawn Heisey wrote: > Warming doesn't seem to be a problem here -- all your warm times are zero, > so I am going to take a guess that it may be a heap/GC issue. I would > recommend starting with the following additional arguments to your JVM. > Since I have no idea how solr gets started on your server, I don't know > where you would add these: > > -Xmx4096M -Xms4096M -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > Thanks. I've added those flags to the line that I use to start Solr. Those are Java flags, not Solr flags, correct? I'm googling the flags now, but I find it interesting that I cannot find a canonical reference for them. > This allocates 4GB of RAM to java, sets up a larger than normal Eden space > in the heap, and uses garbage collection options that usually fare better in > a server environment than the default. Java memory management options are > like religion to some people ... I may start a flamewar with these > recommendations. ;) The best I can tell you about these choices: They made > a big difference for me. > Thanks. I will experiment with them empirically. The first step is to learn to read the debug info, though. I've been googling for days, but I must be missing something. Where is the information that I pasted in pastebin documented? > I would also recommend switching to a Sun/Oracle jvm. I have heard that > previous versions of Solr were not happy on variants like OpenJDK, I have no > idea whether that might still be the case with 4.0. If you choose to do > this, you probably have package choices in Ubuntu. I know that in Debian, > the package is called sun-java6-jre ... Ubuntu is probably something > similar. Debian has a CLI command 'update-java-alternatives' that will > quickly switch between different java implementations that are installed. > Hopefully Ubuntu also has this. If not, you might need the following > command instead to switch the main java executable: > > update-alternatives --config java > Thanks, I will take a look at the current Oracle JVM. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
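Assuming Solr is still being started with the bundled jetty start.jar as earlier in this thread, those arguments would go on that java command line itself, for example: java -Xmx4096M -Xms4096M -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -jar start.jar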