Interpretation of Solr log messages
Hello all, I'm not sure if this is a Solr question, but any pointers would be helpful. I am seeing this kind of thing in the stdout logs (I believe it to be entirely normal):

[10:58:26.507] /select qt=relatedinstructables&q=music%0awall%0amount%0aguitar%0adiy%0astand%0amusicianhome+NOT+E7Z1HY8HQ5ES9J4QIQ&version=2.2&rows=8&wt=json 0 341

I don't know how to interpret the last items on the line, though (0 341). It appears that Solr uses the Java Logging API (as opposed to log4j or another library), and I've looked through the docs for that but have not found anywhere that explains how to interpret the output. The initial timestamp is placed there by the servlet container, Resin, as configured in its conf file; that's the only item of logging configuration there, though. If this were log4j output I could tell from its configuration file what those numbers were, but I don't see any kind of log configuration file with our Solr release.

My guess is that they are times; possibly the first is the request time in seconds and the second is the request time in millis or microseconds, but I haven't been able to verify this.

Any thoughts much appreciated.

Thanks,
Rachel
Re: Interpretation of Solr log messages
Thanks Hoss, so then it's actually the same values as returned in the query response header, e.g. (JSON format):

{"responseHeader":{"status":0,"QTime":209},"response":{"numFound":2574, ... (omitted)

thx,
Rachel

On 1/28/08, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : [10:58:26.507] /select
> : qt=relatedinstructables&q=music%0awall%0amount%0aguitar%0adiy%0astand%0amusicianhome+NOT+E7Z1HY8HQ5ES9J4QIQ&version=2.2&rows=8&wt=json
> : 0 341
> :
> : I don't know how to interpret the last items on the line, though (0
> : 341). It appears that Solr uses the Java Logging API (as opposed to
> : log4j or another library) and I've looked through the docs for that
> : but have not found anyplace that provides output interpretation.
>
> Those numbers are actually part of that particular log message -- not part
> of the format (i.e. they aren't on a separate line, they are at the end of
> the line). Basically, whenever Solr logs the processing of a request, it
> includes two numbers. The first was originally an indication of
> success/failure, but since Solr started using the HTTP status codes I'm
> not sure if that number is still used or if "0" is a constant now. The
> second number is the amount of time (in ms) spent processing the logic of
> the "solr request" (i.e. the request handler's handleRequest method),
> independent of response writing.
>
> The servlet container's request log can give you the total time for
> handling the "http request", including writing the response out over the
> network.
>
> -Hoss
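(To make the mapping concrete: for the request logged earlier in this thread, the trailing "0 341" is exactly the pair that would appear in that response's header. The JSON line below is a sketch reconstructed from the log line, not a captured response.)

  [10:58:26.507] /select qt=relatedinstructables&...&wt=json 0 341
  {"responseHeader":{"status":0,"QTime":341},"response":{...}}

The first number is responseHeader.status (0 meaning success) and the second is responseHeader.QTime, the time in milliseconds the request handler spent processing the request.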
duplicate entries being returned, possible caching issue?
We have just started seeing an intermittent problem in our production Solr instances, where the same document is returned twice in one request. Most of the content of the response consists of duplicates. It's not consistent; maybe 1/3 of the time this happens, and the rest of the time one document is returned per actual Solr document.

We recently made some changes to our caching strategy, basically increasing the values across the board. This is the only change we have made to our Solr instance in quite some time.

Our production system consists of the following:

* 'write', a Solr server used as the master index, optimized for writes. All 3 application servers use this.
* 'read1' & 'read2', Solr servers optimized for reads, which sync from the master every 20 minutes. These two are behind a Pound load balancer; two application servers use them for searching.
* 'read3', a Solr server identical to read1 & read2, but which is not load balanced and is used by only one application server.

Has anyone any ideas how to start debugging this? What information should I be looking for that could shed some light on this?

Thanks for any advice,
Rachel
Re: duplicate entries being returned, possible caching issue?
We are using Solr's replication scripts. They are set to run every 20 minutes, via a cron job on the slave servers. Any further useful info I can give regarding them?

R

On 2/3/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> I would guess you are seeing a view of the index after adding some
> documents but before the duplicates have been removed. Are you using
> Solr's replication scripts?
>
> -Yonik
>
> On Feb 1, 2008 6:01 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > We have just started seeing an intermittent problem in our production
> > Solr instances, where the same document is returned twice in one
> > request. Most of the content of the response consists of duplicates.
> > It's not consistent; maybe 1/3 of the time this is happening and the
> > rest of the time, one return document is sent per actual Solr
> > document.
> >
> > We recently made some changes to our caching strategy, basically to
> > increase the values across the board. This is the only change to our
> > Solr instance for quite some time.
> >
> > Our production system consists of the following:
> >
> > * 'write', a Solr server used as the master index, optimized for
> > writes. all 3 application servers use this
> > * 'read1' & 'read2', Solr servers optimized for reads, which synch
> > from the master every 20 minutes. these two are behind a pound load
> > balancer. Two application servers use these for searching.
> > * 'read3', a Solr server identical to read1 & read2, but which is not
> > load balanced, and used by only one application server.
> >
> > Has anyone any ideas how to start debugging this? What information
> > should I be looking for that could shed some light on this?
> >
> > Thanks for any advice,
> > Rachel
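(For reference, a crontab sketch of the every-20-minutes schedule described above; the install paths are assumptions, not taken from this thread. In the documented collection-distribution setup, snappuller fetches the master's latest snapshot onto the slave and snapinstaller makes it live.)

  # on each slave: pull the newest snapshot from the master, then install it
  0,20,40 * * * * /opt/solr/bin/snappuller && /opt/solr/bin/snapinstaller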
Re: duplicate entries being returned, possible caching issue?
On 2/4/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Feb 4, 2008 1:48 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > On 2/4/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > > On Feb 4, 2008 1:15 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > > > We are using Solr's replication scripts. They are set to run every 20
> > > > minutes, via a cron job on the slave servers. Any further useful info
> > > > I can give regarding them?
> > >
> > > Are you using the postCommit hook in solrconfig.xml to call snapshooter?
> >
> > No, just the crontab. We have only one master server on which commits
> > are made, and the servers on which requests are made run the
> > snapshooter periodically.
>
> If you are running snapshooter asynchronously, this would be the cause.
> It's designed to be run from solr (via a postCommit or postOptimize
> hook) at specific points where a consistent view of the index is
> available.

So our cron job might be running DURING an update, for example, and get duplicate values that way? I'd have thought that in that case, the dupe values would stick around until the next update, 20 minutes later, and we have not observed that to happen. Or do you mean something else?

thanks,
Rachel
Re: duplicate entries being returned, possible caching issue?
On 2/4/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Feb 4, 2008 1:15 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > We are using Solr's replication scripts. They are set to run every 20
> > minutes, via a cron job on the slave servers. Any further useful info
> > I can give regarding them?
>
> Are you using the postCommit hook in solrconfig.xml to call snapshooter?

No, just the crontab. We have only one master server on which commits are made, and the servers on which requests are made run the snapshooter periodically. No data changes are made on the read servers, so postCommit would never be called anyway (I believe).

> The other possibility is a JVM crash happening before Solr removes
> deleted documents.

This would crash the appserver, which isn't happening. Also, the duplicates don't seem to be returned often; we see a case of duplicate results, but within a minute or less it goes away and the correct set of results is returned again. This seems to point to a problem with the cache, to me, but I don't have a good sense of how to debug it.

We tried changing the autowarming settings to not pull anything from the cache (see the sketch below). I'll write again if this seems to fix the problem - by which I mean, if we don't see it at all for a day or two.

thanks,
Rachel
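(For reference, a minimal sketch of what "not pull anything from the cache" looks like in solrconfig.xml: autowarming is configured per cache, and autowarmCount="0" disables carrying entries over to the new searcher. The sizes here are illustrative, not the poster's actual values.)

  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>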
Re: duplicate entries being returned, possible caching issue?
On 2/4/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Feb 4, 2008 2:20 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > > If you are running snapshooter asynchronously, this would be the cause.
> > > It's designed to be run from solr (via a postCommit or postOptimize
> > > hook) at specific points where a consistent view of the index is
> > > available.
> >
> > So our cron job might be running DURING an update, for example, and
> > get duplicate values that way?
>
> Right. Duplicates are removed on a commit(), so if a snapshot is
> being taken at any other time than right after a commit, those deletes
> will not have been performed.

I've reviewed the wiki pages about snappuller (http://wiki.apache.org/solr/SolrCollectionDistributionScripts) and solrconfig.xml (http://wiki.apache.org/solr/SolrConfigXml), and it seems that the snappuller is intended to be used on the slave server. In our case, the slave servers do no updating and never commit; the master is the only one that commits. Is there a standard way for the just-committed, consistent index to be pushed from the master server out to the slaves?

In fact I don't see how this is supposed to work in any environment where the master and slave Solr servers are on different physical machines. The postCommit handler should run after a commit, which only happens on the master server; yet it runs snappuller, which should run on a slave. I am probably missing something here; is there any more documentation you can point me to?

Rachel
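(For reference, the division of labor described on those wiki pages: snapshooter runs on the master, triggered at each commit by a postCommit listener in the master's solrconfig.xml, so every snapshot captures a consistent post-commit view. A sketch along the lines of the wiki example; the dir value is illustrative.)

  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
  </listener>

snappuller then runs on each slave, typically from cron as in this thread, copying the latest snapshot from the master via rsync, and snapinstaller makes it live. Nothing is pushed from the master; the slaves pull, but because every snapshot they can pull was taken at a commit boundary, the timing of the pull no longer matters.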
Re: negation
We do something similar in a different context. I don't know if our way is necessarily better, but it would work like this:

1. add a field to campaign called something like enteredUsers
2. once a user joins a campaign, update the campaign, adding a value unique to that user to enteredUsers
3. the negation can now be done by excluding the user's unique id from the enteredUsers field, instead of excluding all the user's campaigns (see the query sketch after the quoted message below)

The downside is it will increase the number of your commits, which may or may not be OK.

Rachel

On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:
> Hi all
>
> Say that I have a solr index with 5000 documents, each representing a
> campaign that users of my site can join. The user can search and find
> these campaigns in various ways, which is not a problem, but once a
> user has found a campaign and joined it, I don't want that campaign to
> ever show up again for that particular user.
>
> After a while, a user can have built up a list of say 200 campaigns
> that he has joined, and hence should never see in any search results
> again.
>
> I know this functionality could be achieved by simply building a
> longer and longer negation query negating all the campaigns that a
> user already has joined. I would assume that this would become slow
> and ineffective eventually.
>
> My question is: is there a better way to do this?
>
> Thanks
> Alec
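(A sketch of the query from step 3, for a hypothetical user whose unique id is u12345; the field name enteredUsers comes from the steps above, everything else is illustrative.)

  q=gardening -enteredUsers:u12345

With the standard query parser, the leading '-' marks a prohibited clause, so any campaign whose enteredUsers field contains u12345 is excluded - and the exclusion stays a single clause no matter how many campaigns the user has joined.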
Re: negation
We've been using this in production for at least six months. I have never stress-tested this particular feature, but we usually do over 100k unique hits a day. Of those, most hit Solr for one thing or another, but a much smaller percentage use this specific bit. It isn't the fastest query, but as we use it there are some additional complexities, so YMMV.

We aren't at risk for data loss from Solr, as we maintain all data in our database backend; Solr is essentially a slave to that. So we have a db field, enteredUsers, which has the usual JDBC failure checking, and any error is handled gracefully. The Solr index is then updated from the db periodically (we're optimized for faster search results, over up-to-date-ness).

R

On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:
> Have you done any stress tests on this setup? Is it working well for
> you?
> It sounds like something that could work quite well for me too, but I
> would be a little worried that a commit could time out, and a unique
> value could be lost for that user.
>
> Thank you
> Alec
>
> On Feb 13, 2008, at 1:10 PM, Rachel McConnell wrote:
>
> > We do something similar in a different context. I don't know if our
> > way is necessarily better, but it would work like this:
> >
> > 1. add a field to campaign called something like enteredUsers
> > 2. once a user adds a campaign, update the campaign, adding a value
> > unique to that user to enteredUsers
> > 3. the negation can now be done by excluding the user's unique id from
> > the enteredUsers field, instead of excluding all the user's campaigns
> >
> > The downside is it will increase the number of your commits, which may
> > or may not be OK.
> >
> > Rachel
> >
> > On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:
> >> Hi all
> >>
> >> Say that I have a solr index with 5000 documents, each representing a
> >> campaign that users of my site can join. The user can search and find
> >> these campaigns in various ways, which is not a problem, but once a
> >> user has found a campaign and joined it, I don't want that campaign
> >> to
> >> ever show up again for that particular user.
> >>
> >> After a while, a user can have built up a list of say 200 campaigns
> >> that he has joined, and hence should never see in any search results
> >> again.
> >>
> >> I know this functionality could be achieved by simply building a
> >> longer and longer negation query negating all the campaigns that a
> >> user already has joined. I would assume that this would become slow
> >> and ineffective eventually.
> >>
> >> My question is: is there a better way to do this?
> >>
> >> Thanks
> >> Alec
Re: Shared index base
We tried this architecture for our initial rollout of Solr/Lucene to our production application. We ran into a problem with it, which may or may not apply to you.

Our production software servers are all monitored for uptime by a daemon which pings them periodically and restarts them if a response is not received within a configurable period of time. We found that under some orderings of restarts, the Lucene appservers would not come up correctly. I don't recall the exact details, and I don't think it ever corrupted the index. As I recall, we had to restart in a particular order to avoid freezes on the read-only servers, and of course the automated monitor, separate for each server, could not do that. YMMV of course, but this would be something to test thoroughly in a shared index situation.

We moved a while ago to each server (even on the same machine) having its own index files, and using the snapshot puller/shooter processes for replication.

Rachel

On 2/26/08, Matthew Runo <[EMAIL PROTECTED]> wrote:
> We're about to do the same thing here, but have not tried yet. We
> currently run Solr with replication across several servers. So long as
> only one server is doing updates to the index, I think it should work
> fine.
>
> Thanks!
>
> Matthew Runo
> Software Developer
> Zappos.com
> 702.943.7833
>
> On Feb 26, 2008, at 7:51 AM, Evgeniy Strokin wrote:
>
> > I know there was such discussions about the subject, but I want to
> > ask again if somebody could share more information.
> > We are planning to have several separate servers for our search
> > engine. One of them will be index/search server, and all others are
> > search only.
> > We want to use SAN (BTW: should we consider something else?) and
> > give access to it from all servers. So all servers will use the same
> > index base, without any replication, same files.
> > Is this a good practice? Did somebody do the same? Any problems
> > noticed? Or any suggestions, even about different configurations are
> > highly appreciated.
> >
> > Thanks,
> > Gene
Re: schema help
Our Solr use consists of several rather different data types, some of which have one-to-many relationships with other types. We don't need to do any searching of quite the kind you describe, but I have an idea about it, depending on what you need to do with the book data. It is rather hacky, but maybe you can improve it.

If you only need to present a list of books, possibly with links to fuller data, you could do this:

* store only Authors in solr
* create a field, stored but not indexed (I may be using slightly wrong terms here), which contains the short text representation of all their books
* search on authors however you want, make sure you return this field, and just display it as is

For example, if Jane Doe has written 2 books, How To Garden and Fields Of Maine, your special field might contain this:

  Fields of Maine published on DATE. A brief overview of Maine's woods
  and fields with special attention to wildflowers

If your 'authors' 'write' 'books' with great frequency, you'd need to update a lot...

Another possibility is to do two searches, with this kind of structure, which sort of mimics an RDBMS (a concrete sketch of the two queries appears after the quoted thread below):

* everything in Solr has a field, type (book, author, library, etc). these can be filtered on a search by search basis
* books have a field, authorId, uniquely referencing the author
* your first search will be restricted to just authors, from which you will extract the IDs
* your second search will be restricted to just books, whose authorId field is exactly one of the IDs from the first search

As you have noticed, Lucene is not an RDBMS. Searching through all the text of all the books is more the use it was designed around; of course the analogy might not be THAT strong with your need!

Rachel

On 3/11/08, Geoffrey Young <[EMAIL PROTECTED]> wrote:
>
> Otis Gospodnetic wrote:
> > Geoff,
> >
> > I'm not sure if I understood your problem correctly, but it sounds
> > like you want your search to be restricted to authors, but then you
> > want to list all of his/her books when displaying results.
>
> that's about right. add that I may also want to search on libraries and
> show all the books (and authors) stored there.
>
> in real life, it's not books or authors, of course, but the parallels
> are close enough :) in fact, the library example is a good one for
> me... or at least a network of public libraries linked together.
>
> > The easiest thing to do would be to create an index where each
> > "row"/Document has the author name, the book title, etc. For each
> > author-matching Document you'd pull his/her books out of the result
> > set. Yes, this means the author name would be denormalized in
> > RDBMS-speak.
>
> I think I can live with the denormalization - it seems lucene is flat
> and very different conceptually than a database :)
>
> the trouble I'm having is one of dimension. an author has many, many
> attributes (name, birthdate, biography in $language, etc). as does each
> book (title in $language, summary in $language, genre, etc). as does
> each library (name, address, directions in $language, etc). so an
> author with N books doesn't seem to scale very well in the flat
> representations I'm finding in all the lucene/solr docs and examples...
> at least not in some way I can wrap my head around.
>
> part of what seemed really appealing about lucene in general was that
> you could stuff all this (unindexed) information into a document and
> retrieve it all based on some search criteria. but it's seeming very
> difficult for me to wrap my head around the data I need to represent.
>
> > Another option is not to index/store book titles, but
> > rather have only an author index to search against. The book data
> > (mapped to author identities) would then be pulled from an external
> > source (e.g. RDBMS: select title from books where author_id in
> > (1,2,3)) at search results display time.
>
> eew :) seriously, though, that's what we have now - all rdbms driven.
> if solr could only conceptually handle the initial lookup there wouldn't
> be much point.
>
> maybe I'm thinking about this all wrong (as is to be expected :), but I
> just can't believe that nobody is using solr to represent data a bit
> more complex than the examples out there.
>
> thanks for the feedback.
>
> --Geoff
>
> > Otis
> >
> > -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message -----
> > From: Geoffrey Young <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Tuesday, March 11, 2008 12:17:32 PM
> > Subject: schema help
> >
> > hi :)
> >
> > I'm trying to work out a schema for our widgets. more than "just
> > coming up with something" I'd like something idiomatic in solr terms.
> > any help is much appreciated. here's a similar problem space to what
> > I'm working with...
> >
> > lets say we're talking books. books are written by authors and held
> > in libraries. a sister
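(A sketch of the two-search approach described above, using the type and authorId fields from the message; the field values are illustrative.)

  1) q=+type:author +name:"jane doe"    -> collect the matching author IDs, say 17 and 42
  2) q=+type:book +authorId:(17 OR 42)  -> fetch the books for exactly those authors

The authorId clause in the second query is built from the IDs returned by the first, which is the client-side join that "sort of mimics an RDBMS".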
but it's seeming very > difficult for me to wrap my head around the data I need to represent. > > > > Another option is not to index/store book titles, but > > rather have only an author index to search against. The book data > > (mapped to author identities) would then be pulled from an external > > source (e.g. RDBMS: select title from books where author_id in > > (1,2,3)) at search results display time. > > > eew :) seriously, though, that's what we have now - all rdbms driven. > if solr could only conceptually handle the initial lookup there wouldn't > be much point. > > maybe I'm thinking about this all wrong (as is to be expected :), but I > just can't believe that nobody is using solr to represent data a bit > more complex than the examples out there. > > thanks for the feedback. > > --Geoff > > > > > > Otis > > > > -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message From: Geoffrey Young > > <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: > > Tuesday, March 11, 2008 12:17:32 PM Subject: schema help > > > > hi :) > > > > I'm trying to work out a schema for our widgets. more than "just > > coming up with something" I'd like something idiomatic in solr terms. > > any help is much appreciated. here's a similar problem space to what > > I'm working with... > > > > lets say we're talking books. books are written by authors and held > > in libraries. a sister