Interpretation of Solr log messages

2008-01-28 Thread Rachel McConnell
Hello all,

I'm not sure if this is a Solr question, but any pointers would be
helpful.  I am seeing this kind of thing in the stdout logs (I believe
it to be entirely normal):

[10:58:26.507] /select
qt=relatedinstructables&q=music%0awall%0amount%0aguitar%0adiy%0astand%0amusicianhome+NOT+E7Z1HY8HQ5ES9J4QIQ&version=2.2&rows=8&wt=json
0 341

I don't know how to interpret the last items on the line, though (0
341).  It appears that Solr uses the Java Logging API (as opposed to
log4j or another library) and I've looked through the docs for that
but have not found anything that explains how to interpret the output.  The
initial timestamp is placed there by the servlet container, Resin, as
configured in its conf file; that's the only item of logging
configuration there, though.

If this were log4j output I could tell from its configuration file
what those numbers were, but I don't see any kind of log configuration
file with our Solr release.  My guess is that they are times, possibly
the first one is the request time in seconds and the second one is
the request time in millis or microseconds, but I haven't been able to
verify this.

Any thoughts much appreciated.

Thanks,
Rachel


Re: Interpretation of Solr log messages

2008-01-28 Thread Rachel McConnell
Thanks Hoss, so they're actually the same values as returned in the
query response header, e.g. (JSON format):

{"responseHeader":{"status":0,"QTime":209},"response":{"numFound":2574,
... (omitted)
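
(For anyone else digging through these logs, here's a minimal sketch in
Python for pulling the two numbers out.  It assumes each request entry is
a single line in the raw log, with status and QTime as the last two
whitespace-separated tokens -- that matches what we see, but verify
against your own logs:

def parse_request_entry(line):
    # e.g. "/select qt=...&wt=json 0 341" -> (0, 341)
    tokens = line.split()
    status, qtime_ms = int(tokens[-2]), int(tokens[-1])
    return status, qtime_ms
)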

thx,
Rachel

On 1/28/08, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : [10:58:26.507] /select
> : 
> qt=relatedinstructables&q=music%0awall%0amount%0aguitar%0adiy%0astand%0amusicianhome+NOT+E7Z1HY8HQ5ES9J4QIQ&version=2.2&rows=8&wt=json
> : 0 341
> :
> : I don't know how to interpret the last items on the line, though (0
> : 341).  It appears that Solr uses the Java Logging API (as opposed to
> : log4j or another library) and I've looked through the docs for that
> : but have not found anyplace that provides output interpretation.  The
>
> those numbers are actually part of that particular log message -- not part
> of the format (ie: they aren't on a separate line, they are at the end of
> the line) .. basically whenever Solr logs the processing of a request, it
> includes two numbers ... the first was originally an indication of
> success/failure, but since Solr started using the HTTP status codes I'm
> not sure if that number is still used or if "0" is a constant now ...  the
> second number is the amount of time (in ms) spent processing the logic of
> the "solr request" (ie: the request handler's handleRequest method)
> independent of response writing.
>
> the servlet container's request log can give you the total time for
> handling the "http request", including writing the response out over the
> network.
>
>
> -Hoss
>
>


duplicate entries being returned, possible caching issue?

2008-02-01 Thread Rachel McConnell
We have just started seeing an intermittent problem in our production
Solr instances, where the same document is returned twice in one
request.  Most of the content of the response then consists of duplicates.
It's not consistent; this happens maybe 1/3 of the time, and the
rest of the time one document is returned per actual Solr
document.

We recently made some changes to our caching strategy, basically to
increase the values across the board.  This is the only change we have
made to our Solr instance in quite some time.

Our production system consists of the following:

* 'write', a Solr server used as the master index, optimized for
writes.  All 3 application servers use this.
* 'read1' & 'read2', Solr servers optimized for reads, which sync
from the master every 20 minutes.  These two are behind a Pound load
balancer.  Two application servers use these for searching.
* 'read3', a Solr server identical to read1 & read2, but which is not
load balanced and is used by only one application server.

Does anyone have any ideas on how to start debugging this?  What
information should I be looking for that could shed some light on this?

Thanks for any advice,
Rachel


Re: duplicate entries being returned, possible caching issue?

2008-02-04 Thread Rachel McConnell
We are using Solr's replication scripts.  They are set to run every 20
minutes, via a cron job on the slave servers.  Any further useful info
I can give regarding them?
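
For concreteness, the slave-side crontab entries look roughly like this
(paths illustrative, not our real ones):

# every 20 minutes: pull the latest snapshot from the master,
# then install it so the slave opens a new searcher
*/20 * * * * /opt/solr/bin/snappuller && /opt/solr/bin/snapinstaller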

R

On 2/3/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> I would guess you are seeing a view of the index after adding some
> documents but before the duplicates have been removed.  Are you using
> Solr's replication scripts?
>
> -Yonik
>
> On Feb 1, 2008 6:01 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > We have just started seeing an intermittent problem in our production
> > Solr instances, where the same document is returned twice in one
> > request.  Most of the content of the response consists of duplicates.
> > It's not consistent; maybe 1/3 of the time this is happening and the
> > rest of the time, one return document is sent per actual Solr
> > document.
> >
> > We recently made some changes to our caching strategy, basically to
> > increase the values across the board.  This is the only change to our
> > Solr instance for quite some time.
> >
> > Our production system consists of the following:
> >
> > * 'write', a Solr server used as the master index, optimized for
> > writes.  all 3 application servers use this
> > * 'read1' & 'read2', Solr servers optimized for reads, which synch
> > from the master every 20 minutes.  these two are behind a pound load
> > balancer.  Two application servers use these for searching.
> > * 'read3', a Solr server identical to read1 & read2, but which is not
> > load balanced, and used by only one application server.
> >
> > Has anyone any ideas how to start debugging this?  What information
> > should I be looking for that could shed some light on this?
> >
> > Thanks for any advice,
> > Rachel
> >
>


Re: duplicate entries being returned, possible caching issue?

2008-02-04 Thread Rachel McConnell
On 2/4/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Feb 4, 2008 1:48 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > On 2/4/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > > On Feb 4, 2008 1:15 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > > > We are using Solr's replication scripts.  They are set to run every 20
> > > > minutes, via a cron job on the slave servers.  Any further useful info
> > > > I can give regarding them?
> > >
> > > Are you using the postCommit hook in solrconfig.xml to call snapshooter?
> >
> > No, just the crontab.  We have only one master server on which commits
> > are made, and the servers on which requests are made run the
> > snapshooter periodically.
>
> If you are running snapshooter asynchronously, this would be the cause.
> It's designed to be run from solr (via a postCommit or postOptimize
> hook) at specific points where a consistent view of the index is
> available.
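
(For reference, the postCommit hook Yonik mentions is configured as a
listener in the master's solrconfig.xml; per the example config shipped
with Solr it looks something like this, though the exe/dir values will
vary per install:

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
</listener>

i.e. snapshooter fires on the master right after each commit, rather
than on a timer.)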

So our cron job might be running DURING an update, for example, and
pick up duplicate values that way?  I'd have thought that in that case
the dupe values would stick around until the next update, 20 minutes
later, but we have not observed that happening.  Or do you mean
something else?

thanks,
Rachel


Re: duplicate entries being returned, possible caching issue?

2008-02-04 Thread Rachel McConnell
On 2/4/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Feb 4, 2008 1:15 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > We are using Solr's replication scripts.  They are set to run every 20
> > minutes, via a cron job on the slave servers.  Any further useful info
> > I can give regarding them?
>
> Are you using the postCommit hook in solrconfig.xml to call snapshooter?

No, just the crontab.  We have only one master server on which commits
are made, and the servers on which requests are made run the
snapshooter periodically.  No data changes are made on the read
servers, so postCommit would never be called anyway (I believe).

> The other possibility is a JVM crash happening before Solr removes
> deleted documents.

This would crash the appserver, which isn't happening.  Also the
duplicates don't seem to be returned often; we see a case of duplicate
results, but within a minute or less it goes away and the correct set
of results is returned again.  To me, this seems to point to a problem
with the cache, but I don't have a good sense of how to debug it...

We tried changing the autowarming settings to not pull anything from
the old cache.  I'll write again if this seems to fix the problem - by
which I mean, if we don't see it at all for a day or two.
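
(Concretely, "not pull anything from the cache" means setting
autowarmCount to 0 on the caches in solrconfig.xml, along these lines --
the sizes here are illustrative, not our real values:

<filterCache class="solr.LRUCache"
             size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache"
                  size="512" initialSize="512" autowarmCount="0"/>
)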

thanks,
Rachel


Re: duplicate entries being returned, possible caching issue?

2008-02-04 Thread Rachel McConnell
On 2/4/08, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On Feb 4, 2008 2:20 PM, Rachel McConnell <[EMAIL PROTECTED]> wrote:
> > > If you are running snapshooter asynchronously, this would be the cause.
> > > It's designed to be run from solr (via a postCommit or postOptimize
> > > hook) at specific points where a consistent view of the index is
> > > available.
> >
> > So our cron job might be running DURING an update, for example, and
> > get duplicate values that way?
>
> Right.  Duplicates are removed on a commit(), so if a snapshot is
> being taken at any other time than right after a commit, those deletes
> will not have been performed.

I've reviewed the wiki pages about snappuller
(http://wiki.apache.org/solr/SolrCollectionDistributionScripts) and
solrconfig.xml (http://wiki.apache.org/solr/SolrConfigXml) and it
seems that the snappuller is intended to be used on the slave server.
In our case, the slave servers do no updating and never commit; the
master is the only one that commits.  Is there a standard way for the
just-committed, consistent index to be pushed from the master server
out to the slaves?

In fact I don't see how this is supposed to work in any environment
where the master and slave Solr servers are on different physical
machines.  The postCommit handler should run after a commit, which
only happens on the master server; yet it runs snappuller, which should
run on a slave.  I am probably missing something here; is there any
more documentation you can point me to?

Rachel


Re: negation

2008-02-13 Thread Rachel McConnell
We do something similar in a different context.  I don't know if our
way is necessarily better, but it would work like this:

1. add a field to campaign called something like enteredUsers
2. once a user joins a campaign, update the campaign document, adding
a value unique to that user to enteredUsers
3. the negation can now be done by excluding the user's unique id from
the enteredUsers field, instead of excluding all the user's campaigns

The downside is it will increase the number of your commits, which may
or may not be OK.
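
As a sketch, the exclusion then becomes a single extra clause on every
query, something like this (field name and user id are made up):

q=<user's search terms>&fq=-enteredUsers:u12345

so the query stays the same size no matter how many campaigns the user
has joined.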

Rachel

On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:
> Hi all
>
> Say that I have a solr index with 5000 documents, each representing a
> campaign that users of my site can join. The user can search and find
> these campaigns in various ways, which is not a problem, but once a
> user has found a campaign and joined it, I don't want that campaign to
> ever show up again for that particular user.
>
> After a while, a user can have built up a list of say 200 campaigns
> that he has joined, and hence should never see in any search results
> again.
>
> I know this functionality could be achieved by simply building a
> longer and longer negation query negating all the campaigns that a
> user already has joined. I would assume that this would become slow
> and ineffective eventually.
>
> My question is: is there a better way to do this?
>
> Thanks
> Alec
>


Re: negation

2008-02-13 Thread Rachel McConnell
We've been using this in production for at least six months.  I have
never stress-tested this particular feature, but we usually do over
100k unique hits a day.  Of those, most hit Solr for one thing or
another, though a much smaller percentage use this specific bit.  It
isn't the fastest query, but as we use it there are some additional
complexities, so YMMV.

We aren't at risk of data loss from Solr, as we maintain all data in
our database backend; Solr is essentially a slave to that.  So we have
a db field, enteredUsers, with the usual JDBC failure checking, and
any error is handled gracefully.  The Solr index is then
updated from the db periodically (we're optimized for faster search
results over up-to-date-ness).

R

On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:
> Have you done any stress tests on this setup? Is it working well for
> you?
> It sounds like something that could work quite well for me too, but I
> would be a little worried that a commit could time out, and a unique
> value could be lost for that user.
>
> Thank you
> Alec
>
> On Feb 13, 2008, at 1:10 PM, Rachel McConnell wrote:
>
> > We do something similar in a different context.  I don't know if our
> > way is necessarily better, but it would work like this:
> >
> > 1. add a field to campaign called something like enteredUsers
> > 2. once a user adds a campaign, update the campaign, adding a value
> > unique to that user to enteredUsers
> > 3. the negation can now be done by excluding the user's unique id from
> > the enteredUsers field, instead of excluding all the user's campaigns
> >
> > The downside is it will increase the number of your commits, which may
> > or may not be OK.
> >
> > Rachel
> >
> > On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:
> >> Hi all
> >>
> >> Say that I have a solr index with 5000 documents, each representing a
> >> campaign that users of my site can join. The user can search and find
> >> these campaigns in various ways, which is not a problem, but once a
> >> user has found a campaign and joined it, I don't want that campaign
> >> to
> >> ever show up again for that particular user.
> >>
> >> After a while, a user can have built up a list of say 200 campaigns
> >> that he has joined, and hence should never see in any search results
> >> again.
> >>
> >> I know this functionality could be achieved by simply building a
> >> longer and longer negation query negating all the campaigns that a
> >> user already has joined. I would assume that this would become slow
> >> and ineffective eventually.
> >>
> >> My question is: is there a better way to do this?
> >>
> >> Thanks
> >> Alec
> >>
>
>


Re: Shared index base

2008-02-26 Thread Rachel McConnell
We tried this architecture for our initial rollout of Solr/Lucene to
our production application.  We ran into a problem with it, which may
or may not apply to you.  Our production software servers are all
monitored for uptime by a daemon which pings them periodically and
restarts them if a response is not received within a configurable
period of time.

We found that under some orderings of restarts, the Lucene appservers
would not come up correctly.  I don't recall the exact details, and I
don't think it ever corrupted the index.  As I recall, we had to
restart in a particular order to avoid freezes on the read-only
servers, and of course the automated monitor, separate for each
server, could not do that.

YMMV of course, but this would be something to test thoroughly in a
shared index situation.  We moved a while ago to each server (even on
the same machine) having its own index files, and using the snapshot
puller/shooter processes for replication.

Rachel

On 2/26/08, Matthew Runo <[EMAIL PROTECTED]> wrote:
> We're about to do the same thing here, but have not tried yet. We
>  currently run Solr with replication across several servers. So long as
>  only one server is doing updates to the index, I think it should work
>  fine.
>
>
>  Thanks!
>
>
>  Matthew Runo
>  Software Developer
>  Zappos.com
>  702.943.7833
>
>
>  On Feb 26, 2008, at 7:51 AM, Evgeniy Strokin wrote:
>
>  > I know there was such discussions about the subject, but I want to
>  > ask again if somebody could share more information.
>  > We are planning to have several separate servers for our search
>  > engine. One of them will be index/search server, and all others are
>  > search only.
>  > We want to use SAN (BTW: should we consider something else?) and
>  > give access to it from all servers. So all servers will use the same
>  > index base, without any replication, same files.
>  > Is this a good practice? Did somebody do the same? Any problems
>  > noticed? Or any suggestions, even about different configurations are
>  > highly appreciated.
>  >
>  > Thanks,
>  > Gene
>
>


Re: schema help

2008-03-11 Thread Rachel McConnell
Our Solr use consists of several rather different data types, some of
which have one-to-many relationships with other types.  We don't need
to do any searching of quite the kind you describe, but I have an idea
about it, depending on what you need to do with the book data.  It is
rather hacky, but maybe you can improve it.

If you only need to present a list of books, possibly with links to
fuller data, you could do this:
* store only Authors in solr
* create a field, stored but not indexed (I may be using slightly
wrong terms here), which contains a short text representation of all
their books
* search on authors however you want and make sure you return this
field, and just display it as is

For example, if Jane Doe has written 2 books, How To Garden and
Fields Of Maine, your special field might contain this:

Fields of Maine published on
DATE.  A brief overview of Maine's woods and fields with special
attention to wildflowers

If your 'authors' 'write' 'books' with great frequency, you'd need to
update a lot...


Another possibility is to do two searches, with this kind of
structure, which sort of mimics an RDBMS:
* everything in Solr has a field, type (book, author, library, etc).
these can be filtered on a search by search basis
* books have a field, authorId, uniquely referencing the author
* your first search will be restricted to just authors, from which you
will extract the IDs
* your second search will be restricted to just books, whose authorId
field is exactly one of the IDs from the first search (see the sketch below)
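
A sketch of the two requests, with hypothetical field names and ids:

1) q=doe&fq=type:author&fl=id             (say this returns ids A17 and A42)
2) q=authorId:(A17 OR A42)&fq=type:book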


As you have noticed, Lucene is not an RDBMS.  Searching through all
the text of all the books is closer to the use it was designed for; of
course the analogy might not be THAT strong with your need!

Rachel

On 3/11/08, Geoffrey Young <[EMAIL PROTECTED]> wrote:
>
>
>  Otis Gospodnetic wrote:
>  > Geoff,
>  >
>  > I'm not sure if I understood your problem correctly, but it sounds
>  > like you want your search to be restricted to authors, but then you
>  > want to list all of his/her books when displaying results.
>
>
> that's about right.  add that I may also want to search on libraries and
>  show all the books (and authors) stored there.
>
>  in real life, it's not books or authors, of course, but the parallels
>  are close enough :)  in fact, the library example is a good one for
>  me... or at least a network of public libraries linked together.
>
>
>  > The
>  > easiest thing to do would be to create an index where each
>  > "row"/Document has the author name, the book title, etc.  For each
>  > author-matching Document you'd pull his/her books out of the result
>  > set.  Yes, this means the author name would be denormalized in
>  > RDBMS-speak.
>
>
> I think I can live with the denormalization - it seems lucene is flat
>  and very different conceptually than a database :)
>
>  the trouble I'm having is one of dimension.  an author has many, many
>  attributes (name, birthdate, biography in $language, etc).  as does each
>  book (title in $language, summary in $language, genre, etc).  as does
>  each library (name, address, directions in $language, etc).  so an
>  author with N books doesn't seem to scale very well in the flat
>  representations I'm finding in all the lucene/solr docs and examples...
>  at least not in some way I can wrap my head around.
>
>  part of what seemed really appealing about lucene in general was that
>  you could stuff all this (unindexed) information into a document and
>  retrieve it all based on some search criteria.  but it's seeming very
>  difficult for me to wrap my head around the data I need to represent.
>
>
>  > Another option is not to index/store book titles, but
>  > rather have only an author index to search against.  The book data
>  > (mapped to author identities) would then be pulled from an external
>  > source (e.g. RDBMS: select title from books where author_id in
>  > (1,2,3)) at search results display time.
>
>
> eew :)  seriously, though, that's what we have now - all rdbms driven.
>  if solr could only conceptually handle the initial lookup there wouldn't
>  be much point.
>
>  maybe I'm thinking about this all wrong (as is to be expected :), but I
>  just can't believe that nobody is using solr to represent data a bit
>  more complex than the examples out there.
>
>  thanks for the feedback.
>
>  --Geoff
>
>
>  >
>  > Otis
>  >
>  > -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>  >
>  > ----- Original Message ----- From: Geoffrey Young
>  > <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent:
>  > Tuesday, March 11, 2008 12:17:32 PM Subject: schema help
>  >
>  > hi :)
>  >
>  > I'm trying to work out a schema for our widgets.  more than "just
>  > coming up with something" I'd like something idiomatic in solr terms.
>  > any help is much appreciated.  here's a similar problem space to what
>  > I'm working with...
>  >
>  > lets say we're talking books.  books are written by authors and held
>  > in libraries.  a sister