Solr 1.4 Replication index directories
Hi,

We're using the new replication and it's working pretty well. There's one detail I'd like to get some more information about.

As the replication works, it creates versions of the index in the data directory. Originally we had index/, but now there are dated versions such as index.20100127044500/, which are the replicated versions. Each copy is sized in the vicinity of 65G. With our current hard drive it's fine to have two around, but three gets a little dicey. Sometimes we're finding that the replication doesn't always clean up after itself. I would like to understand this better, or to not have this happen. It could be a configuration issue.

Some more specific questions:

- Is it safe to remove the index/ directory (the one without the date on it)? I think I tried this once and the whole thing broke, but maybe something else was wrong at the time.
- Is there a way to know which one is the current one? (I'm looking at the file index.properties, and it seems to be correct, but sometimes there's a newer version in the directory, which is later removed.)
- Could it be that the index does not finish replicating within the poll interval I give it? What happens if there's a poll interval X and replicating the index sometimes takes longer than X? (Our current poll interval is 45 minutes, and every time I'm watching it, it completes in time.)

Thanks in advance,
Mark
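For context, the slave-side polling Mark mentions lives in the ReplicationHandler section of solrconfig.xml. A minimal sketch, with a placeholder master host and the 45-minute interval from the post:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <!-- placeholder URL; point this at the master core's /replication handler -->
        <str name="masterUrl">http://master-host:8983/solr/replication</str>
        <!-- poll every 45 minutes, matching the interval described above -->
        <str name="pollInterval">00:45:00</str>
      </lst>
    </requestHandler>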
Re: Solr 1.4 Replication index directories
Thanks, Otis. Responses inline.

>> Hi, We're using the new replication and it's working pretty well. There's one detail I'd like to get some more information about. As the replication works, it creates versions of the index in the data directory. Originally we had index/, but now there are dated versions such as index.20100127044500/, which are the replicated versions. Each copy is sized in the vicinity of 65G. With our current hard drive it's fine to have two around, but 3 gets a little dicey. Sometimes we're finding that the replication doesn't always clean up after itself. I would like to understand this better, or to not have this happen. It could be a configuration issue. Some more specific questions:
>>
>> - Is it safe to remove the index/ directory (that doesn't have the date on it)? I think I tried this once and the whole thing broke, however maybe something else was wrong at the time.
>
> No, that's the real, live index, you don't want to remove that one.

Yeah... I tried it once and remember things breaking. However, nothing in this directory has been modified for over a week (since the last replication initialization), and I'm still sitting on 130GB of data for what is only 65GB on the master.

>> - Is there a way to know which one is the current one? (I'm looking at the file index.properties, and it seems to be correct, but sometimes there's a newer version in the directory, which later is removed)
>
> I think the "index" one is always current, no? If not, I imagine the admin replication page will tell you, or even the Statistics page.

e.g.

reader : SolrIndexReader{this=46a55e,r=readonlysegmentrea...@46a55e,segments=1}
readerDir : org.apache.lucene.store.NIOFSDirectory@/mnt/solrhome/cores/foo/data/index

reader : SolrIndexReader{this=5c3aef1,r=readonlydirectoryrea...@5c3aef1,refCnt=1,segments=9}
readerDir : org.apache.lucene.store.NIOFSDirectory@/home/solr/solr_1.4/solr/data/index.20100127044500

>> - Could it be that the index does not finish replicating in the poll interval I give it? What happens if, say, there's a poll interval X and replicating the index happens to take longer than X sometimes? (Our current poll interval is 45 minutes, and every time I'm watching it, it completes in time.)
>
> I think only 1 replication will/should be happening at a time.

Whew, that's comforting.
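For readers following along: on a 1.4 slave, the data directory's index.properties records which copy of the index is live, and the replication handler can report the same thing over HTTP. A sketch of what to look at (the directory name and host below are illustrative):

    # data/index.properties on the slave; the named directory is the one in use
    index=index.20100127044500

    # the replication handler's details command also reports the index path currently in use
    http://slave-host:8983/solr/replication?command=details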
Filter by Group
Hey all,

Let's say I have an index of one hundred documents, and these documents are grouped into 4 groups A, B, C, and D. The groups do in fact overlap.

What would people recommend as the best way to apply a search query and return only the documents that are in group A? Also, how about if we run the same search query but return only those documents in groups A, C and D?

I imagine that I could do this by indexing a text field populated with the group names and adding something like "groups:A" to the query, but I'm wondering if there's a better solution.

Thanks in advance,
Mark

mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.7 million ratings and counting...
Re: Filter by Group
Thanks, Pieter. I'll go for that then.

Mark

On Sep 19, 2007, at 10:15 PM, Pieter Berkel wrote:

> Sounds like you're on the right track. If your groups overlap (i.e. a document can be in group A and B), then you should ensure your "groups" field is multivalued. If you are searching for "foo" in documents contained in group "A", then it might be more efficient to use a filter query (fq) like:
>
> q=foo&fq=groups:A
>
> See the wiki page on common query parameters for more info:
> http://wiki.apache.org/solr/CommonQueryParameters#head-6522ef80f22d0e50d2f12ec487758577506d6002
>
> cheers,
> Pieter
>
> On 20/09/2007, mark angelillo <[EMAIL PROTECTED]> wrote:
>> Hey all,
>>
>> Let's say I have an index of one hundred documents, and these documents are grouped into 4 groups A, B, C, and D. The groups do in fact overlap.
>>
>> What would people recommend as the best way to apply a search query and return only the documents that are in group A? Also, how about if we run the same search query but return only those documents in groups A, C and D?
>>
>> I imagine that I could do this by indexing a text field populated with the group names and adding something like "groups:A" to the query, but I'm wondering if there's a better solution.
>>
>> Thanks in advance,
>> Mark
>>
>> mark angelillo
>> snooth inc.
>> o: 646.723.4328
>> c: 484.437.9915
>> [EMAIL PROTECTED]
>> snooth -- 1.7 million ratings and counting...

mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.7 million ratings and counting...
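To make the schema side of this concrete, a minimal sketch (field and group names are illustrative, not from Mark's actual schema):

    <!-- schema.xml: a multivalued string field holding each document's group names -->
    <field name="groups" type="string" indexed="true" stored="true" multiValued="true"/>

    Restrict a search for "foo" to group A:
    q=foo&fq=groups:A

    Restrict it to groups A, C and D:
    q=foo&fq=groups:(A OR C OR D)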
Forced Top Document
Hi all,

Is there a way to get a specific document to appear on top of search results even if a sorting parameter would push it further down?

Thanks in advance,
Mark

mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.8 million ratings and counting...
Re: Forced Top Document
Charlie,

That's interesting. I did try something like this. Did you try your query with a sorting parameter? What I've read suggests that all the results are returned based on the query specified, but then resorted as specified. Boosting (which modifies the document's score) should not change the order unless the results are sorted by score.

Mark

On Oct 24, 2007, at 1:05 PM, Charlie Jackson wrote:

> Do you know which document you want at the top? If so, I believe you could just add an "OR" clause to your query to boost that document very high, such as:
>
> ?q=foo OR id:bar^1000
>
> Tried this on my installation and it did, indeed, push the specified document to the top.
>
> -----Original Message-----
> From: Matthew Runo [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 24, 2007 10:17 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Forced Top Document
>
> I'd love to know this, as I just got a development request for this very feature. I'd rather not spend time on it if it already exists.
>
> Matthew Runo
> Zappos Development
> [EMAIL PROTECTED]
> 702-943-7833
>
> On Oct 23, 2007, at 10:12 PM, mark angelillo wrote:
>
>> Hi all,
>>
>> Is there a way to get a specific document to appear on top of search results even if a sorting parameter would push it further down?
>>
>> Thanks in advance,
>> Mark
>>
>> mark angelillo
>> snooth inc.
>> o: 646.723.4328
>> c: 484.437.9915
>> [EMAIL PROTECTED]
>> snooth -- 1.8 million ratings and counting...

mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.8 million ratings and counting...
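A small illustration of Mark's point, using Charlie's example query (the field names are made up): the boost only wins when score is what's being sorted on.

    q=foo OR id:bar^1000                    (default relevancy sort: the boosted doc rises to the top)
    q=foo OR id:bar^1000&sort=price asc     (explicit field sort: the boost changes scores, not the ordering)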
Re: Forced Top Document
That's the ticket exactly, Kyle.

What I have is the ID of my document, so I indexed a dynamic field with the name id_*. Then I just set that field for each document with the proper ID. So, for example, to pop one document to the top of the index, I just run:

&q=field: value; id_700390+desc, date+desc

Works like a charm, even with multiple documents:

&q=field: value; id_700390+desc, id_604030+desc, date+desc

Mark

On Oct 24, 2007, at 4:15 PM, Kyle Banerjee wrote:

> The typical use case, though, is for the featured document to be on top only for certain queries. Like in an intranet where someone queries 401K or retirement or similar, you want to feature a document about benefits that would otherwise rank really low for that query. I have not been able to make sorting strategies work very well.
>
> Depending on how many of these certain queries you have, it seems like you could still use some variation of the strategy based on a bogus tag sort. If you place a dynamic field for each query term (e.g. foo_s, bar_s, etc.) relevant to a document, then when one of the special query terms is detected, you can still sort on the appropriate dynamic field before applying the rest of the sort.
>
> kyle

mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.8 million ratings and counting...
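For anyone reproducing this, a sketch of the schema side of Mark's approach (the field type here is a guess, not his actual definition):

    <!-- matches id_700390, id_604030, etc.; set a value only on documents that should be pinned -->
    <dynamicField name="id_*" type="string" indexed="true" stored="false"/>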
Re: Forced Top Document
Thanks for your thoughts, Chris. I agree with you about the user's experience.

Snooth doesn't serve any ads/sponsored results -- the goal here is to make sure that the most recent document the user has acted on shows up at the top in searches for recent activity. My aim is to forcibly preserve the sort order until the document can be reindexed/updated.

Since the dynamic field is too memory intensive, I'll try boosting on the date field -- and boosting more on the date field for the document that needs to be up top. If that doesn't end up working, I'll just perform two queries and be done with it.

Mark

On Oct 25, 2007, at 3:11 AM, Chris Hostetter wrote:

> : The typical use case, though, is for the featured document to be on top only
> : for certain queries. Like in an intranet where someone queries 401K or
> : retirement or similar, you want to feature a document about benefits that
> : would otherwise rank really low for that query. I have not been able to make
> : sorting strategies work very well.
>
> this type of question typically falls into two use cases:
>
> 1) "targeted ads"
> 2) "sponsored results"
>
> in the targeted ads case, the "special" matches aren't part of the normal flow of results, and don't fit into pagination -- they always appear at the top, or to the right, on every page, no matter what the sort. this kind of usage doesn't really need any special logic; it can be solved as easily by a second Solr hit as it can by custom request handler logic.
>
> in the "sponsored results" use case, the "special" matches should appear in the normal flow of results as the #1 (2, 3, etc) matches, so that they don't appear on page #2 ... but that also means that it's extremely disconcerting for users if those matches are still at the top when the users resort. if a user is looking at product listings sorted by "relevancy" and the top 3 results all say they are "sponsored", that's fine ... but if the user sorts by "price" and those 3 results are still at the top of the list, even though they clearly aren't the cheapest, that's just going to piss the user off.
>
> in my professional opinion: don't fuck with your users. default to whatever order you want, but if the user specifically requests to sort the results by some option, do it.
>
> assuming you follow my professional opinion, then "boosting" docs to have an artificially high score will work fine.
>
> if you absolutely *MUST* have certain docs "sorting" before others, regardless of which sort option the user picks, then it is still possible to do ... i'm hesitant to even say how, but if people insist on knowing...
>
> always sort by score first, then by whatever field the user wants to sort by ... but when the user wants to sort on a specific field, move the user's main query input into an "fq" (so it doesn't influence the score) ... and use an extremely low boost matchalldocs query along with your "special doc matching query" as the main (scoring) query param. the key being that even though your primary sort is on score, every doc except your special matches has an identical score.
>
> (this may not be possible with dismax because it's not trivial to move the query into an fq; it might work if you can use "0" as the boost on fields in the qf so it still dictates the matches but doesn't influence the score enough to throw off the sort)
>
> -Hoss

mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.8 million ratings and counting...
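A rough sketch of the kind of request Hoss describes, with invented field names, IDs and boosts (exact parameter spelling depends on your handler and Solr version):

    q=*:*^0.001 OR id:700390^100       (scoring query: a tiny uniform score for everything, a big boost for the pinned doc)
    fq=wine                            (the user's real query, moved to a filter so it doesn't influence scores)
    sort=score desc, price asc         (score first pins the special doc; everything else ties on score and falls through to price)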
dynamicField Scaling
Hello,

I've got a Solr index running, and I want to use a dynamicField to store n different sorting fields. The field that is used to actually sort the results will be determined by the application that is querying the index.

I'm wondering if anyone has done something similar to this, or if anyone has an idea of how Solr will perform as the number n of sorting fields grows larger. Is there a way to make sure this doesn't start to slow the index down? Is there any information out there about the number of dynamicFields that can be declared in this way before the entire index suffers? Is there such a limit? (I'm assuming the number of documents in the index will eventually be around 500k -- perhaps more in the future.)

TIA,
Mark Angelillo
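For reference, the kind of declaration being discussed might look roughly like this (the name pattern and type are illustrative; sfloat is the sortable float type from the 1.x example schema):

    <!-- one pattern covering all of the per-application sort fields -->
    <dynamicField name="sort_*" type="sfloat" indexed="true" stored="false"/>

Each application would then pick its own field at query time, e.g. by sorting on sort_price or sort_popularity.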
Re: dynamicField Scaling
On Mar 7, 2007, at 2:17 PM, Mike Klaas wrote:

> On 3/7/07, mark angelillo <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> I've got a Solr index running, and I want to use a dynamicField to store n different sorting fields. The field that is used to actually sort the results will be determined by the application that is querying the index.
>>
>> I'm wondering if anyone has done something similar to this, or if anyone has an idea of how Solr will perform as the number n of sorting fields grows larger. Is there a way to make sure this doesn't start to slow the index down? Is there any information out there about the number of dynamicFields that can be declared in this way before the entire index suffers? Is there such a limit?
>
> It's not really about the number of dynamic fields. The key variable is the number of sort fields. To sort efficiently, Solr needs to maintain a cache of field values. This consumes memory per field on the order of
>
> D x S + U
>
> where D is the document count, S is the size of the data type (e.g. 4 bytes for ints, 8 bytes for doubles, 4/8 bytes for anything else [pointers]), and U is the cumulative size of the unique field values (if sorting on a non-primitive type, like Strings).
>
> If you have sufficient memory to store this data for each field you are sorting on, you shouldn't have any problems.
>
> best,
> -Mike

Okay, makes sense. Thanks,

Mark
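To put very rough numbers on Mike's formula (back-of-the-envelope figures, not measurements): with D = 500,000 documents, an int sort field costs about 500,000 x 4 bytes, roughly 2 MB; a double about 4 MB; and a String sort field roughly 2-4 MB of pointers plus the cumulative size of its unique values (U). So a few dozen numeric sort fields stay in the tens of megabytes, while many large String sort fields are what would start to hurt.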
Error loading custom similarity class
Hiya,

I'm currently trying to compile and load my own similarity class in Solr, and I'm having a bit of a problem. Here's what I've done so far:

1) Create the .java for the class, using SweetSpotSimilarity as a model. I'm using the code below to make sure I can get this working -- my real class will do something a bit different.

package org.apache.lucene.misc;

import org.apache.lucene.search.Similarity;
import org.apache.lucene.search.DefaultSimilarity;

public class CustomSimilarity extends DefaultSimilarity {

    public CustomSimilarity() {
        super();
    }

    public float lengthNorm(String fieldName, int numTerms) {
        return (float)1.0;
    }

    public float tf(int freq) {
        return (float)1.0;
    }
}

2) Create the .jar file. (Maybe I'm doing this wrong?)

> javac -classpath lucene-core-nightly.jar CustomSimilarity.java
> jar -cvf CustomSimilarity.jar CustomSimilarity.class

3) Put the .jar file in my Solr home /lib directory. (/var/solr/lib for me.)

4) Edit schema.xml with this line:

<similarity class="org.apache.lucene.misc.CustomSimilarity"/>

5) I'm using Jetty, and I read that I may need to ensure the .jar is in the classpath, so I added this to start.config (I've tried with and without this):

# solr specific jars
/var/solr/lib/CustomSimilarity.jar always

Then, when I fire up Jetty, I get the following error:

10:59:01.885 WARN!! [main] org.mortbay.jetty.Server.main(Server.java:465) >08> EXCEPTION
org.mortbay.util.MultiException[org.apache.solr.core.SolrException: Error loading class 'org.apache.lucene.misc.CustomSimilarity']
        at org.mortbay.http.HttpServer.doStart(HttpServer.java:686)
        at org.mortbay.util.Container.start(Container.java:72)
        at org.mortbay.jetty.Server.main(Server.java:460)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.mortbay.start.Main.invokeMain(Main.java:151)
        at org.mortbay.start.Main.start(Main.java:476)
        at org.mortbay.start.Main.main(Main.java:94)
org.apache.solr.core.SolrException: Error loading class 'org.apache.lucene.misc.CustomSimilarity'
        at org.apache.solr.core.Config.findClass(Config.java:208)
        at org.apache.solr.core.Config.newInstance(Config.java:213)
        at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:363)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:69)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:191)
        at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:172)
        at org.apache.solr.servlet.SolrServlet.init(SolrServlet.java:72)
        at javax.servlet.GenericServlet.init(GenericServlet.java:211)
        at org.mortbay.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:383)
        at org.mortbay.jetty.servlet.ServletHolder.start(ServletHolder.java:243)
        at org.mortbay.jetty.servlet.ServletHandler.initializeServlets(ServletHandler.java:446)
        at org.mortbay.jetty.servlet.WebApplicationHandler.initializeServlets(WebApplicationHandler.java:321)
        at org.mortbay.jetty.servlet.WebApplicationContext.doStart(WebApplicationContext.java:509)
        at org.mortbay.util.Container.start(Container.java:72)
        at org.mortbay.http.HttpServer.doStart(HttpServer.java:708)
        at org.mortbay.util.Container.start(Container.java:72)
        at org.mortbay.jetty.Server.main(Server.java:460)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.mortbay.start.Main.invokeMain(Main.java:151)
        at org.mortbay.start.Main.start(Main.java:476)
        at org.mortbay.start.Main.main(Main.java:94)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.misc.CustomSimilarity
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:580)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:242)
        at org.apache.solr.core.Config.findClass(Config.java:192)
        ... 23 more
[0]=org.apache.solr.core.Solr
Re: Error loading custom similarity class
Thanks, Yonik. I was definitely missing that.

On Apr 9, 2007, at 2:08 PM, Yonik Seeley wrote:

> On 4/9/07, mark angelillo <[EMAIL PROTECTED]> wrote:
>> package org.apache.lucene.misc;
>> [...]
>> 2) Create the .jar file. (Maybe I'm doing this wrong?)
>>
>> > javac -classpath lucene-core-nightly.jar CustomSimilarity.java
>> > jar -cvf CustomSimilarity.jar CustomSimilarity.class
>
> This may be the problem. The path in the jar file needs to reflect the package. So the CustomSimilarity.class file needs to be in the org/apache/lucene/misc/ directory.
>
> -Yonik
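For anyone hitting the same error, the fix boils down to compiling with -d so the class file lands in a directory tree matching the package, and jarring that tree. A sketch, using the file names from the original post:

    javac -classpath lucene-core-nightly.jar -d . CustomSimilarity.java
    # -d . writes org/apache/lucene/misc/CustomSimilarity.class based on the package declaration
    jar -cvf CustomSimilarity.jar org/apache/lucene/misc/CustomSimilarity.class
    # verify the entry path inside the jar
    jar -tf CustomSimilarity.jar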