Combining negative queries and OR

2008-11-03 Thread Joe Pollard
I am trying to decide if this is a solr or a lucene problem, using solr
1.3:

take this example --  

(-productName:"whatever") OR (anotherField:"Johnny")

I would expect to get back records that have anotherField=Johnny, but
also any records that don't have 'whatever' as the productName.

However, it seems that the productName clause is not being used,
because I get back only records that match anotherField:"Johnny".




Not sure if this is a bug or expected behavior, and I am able to work
around it, but I'd certainly expect the above query to work.

Thanks!
-Joe
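
A widely cited explanation, not confirmed in this thread, is that a Lucene
boolean clause containing only prohibited terms matches nothing on its own;
the usual workaround is to anchor the negation to the match-all query *:*.
A minimal sketch of that rewrite (the helper name is made up):

```python
def fix_pure_negative(clause: str) -> str:
    """If a parenthesized clause contains only a negated term, pair the
    negation with the match-all query so the clause can match on its own."""
    inner = clause.strip().lstrip("(").rstrip(")").strip()
    if inner.startswith("-"):
        return "(*:* %s)" % inner
    return clause

query = " OR ".join([
    fix_pure_negative('(-productName:"whatever")'),
    fix_pure_negative('(anotherField:"Johnny")'),
])
print(query)  # (*:* -productName:"whatever") OR (anotherField:"Johnny")
```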



Best way to unit test solr integration

2009-03-27 Thread Joe Pollard
Hello,

On our project, we have quite a bit of code used to generate Solr queries, and 
I need to create some unit tests to ensure that these continue to work.  In 
addition, I need to generate some unit tests that will test indexing and 
retrieval of certain documents, based on our current schema and the application 
logic that generates the indexable documents as well as generates the Solr 
queries.

My question is - what's the best way for me to unit test our Solr integration?

I'd like to be able to spin up an embedded/in-memory Solr or, failing that, just 
start one up as part of my test case setup, fill it with interesting documents, 
and run some queries, comparing the results against expected results.

Are there wiki pages or other documented examples of doing this?  It seems 
rather straightforward, but who knows, it may be dead simple with some unknown 
feature.

Thanks!
-Joe


RE: Best way to unit test solr integration

2009-03-27 Thread Joe Pollard
Thanks for the tips; I like the suggestion of testing the document and query 
generation without having Solr involved.  That seems like a more bite-sized 
unit; I think I'll do that.

However, here's the test case that I'm considering where I'd like to have a 
live solr instance:

During an exercise of optimizing our schema, I'm going to be making wholesale 
changes that I'd like to ensure don't break some portion of our app.  It seems 
like a good method for this would be to write a test with the following steps: 
(arguably not a unit test, but a very valuable test indeed in our application)
* take some defined model object generated at test time, store it in db
* run it through our document creation code
* submit it into solr
* generate a query using our custom criteria-based generation code
* ensure that the query returns the results as expected
* flesh out the new model objects from the db using only the id fields returned 
from Solr
* In the end, it would be expected to have model objects retrieved from the db 
that match model objects at the beginning of the test.
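
The steps above could be sketched with in-memory stand-ins for both the db and
Solr; every name below is hypothetical, and the fakes only mimic the shape of
the real round trip:

```python
# In-memory stand-ins for the db and Solr (all names are hypothetical).
class FakeSolr:
    def __init__(self):
        self.docs = {}

    def add(self, doc):
        self.docs[doc["id"]] = doc

    def search(self, field, value):
        # Return only ids, mirroring a query with fl=id.
        return [d["id"] for d in self.docs.values() if d.get(field) == value]

db = {}                                       # id -> model object
model = {"id": "42", "productName": "whatever"}
db[model["id"]] = model                       # store the model in the db
doc = dict(model)                             # "document creation code"
solr = FakeSolr()
solr.add(doc)                                 # submit it into solr
ids = solr.search("productName", "whatever")  # run the generated query
retrieved = [db[i] for i in ids]              # flesh out models by id
assert retrieved == [model]                   # matches the starting model
```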

These building blocks could be stacked in numerous ways to test almost all the 
different scenarios in which we use Solr.

Also, when/if we start making solr config changes, I can ensure that they 
change nothing from my app's functional point of view (with the exception of 
ridding us of dreaded OOMs).

Thanks,
-Joe

-Original Message-
From: Eric Pugh [mailto:ep...@opensourceconnections.com]
Sent: Friday, March 27, 2009 11:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Best way to unit test solr integration

So my first thought is that "unit test + solr integration" is an
oxymoron, in the sense that a unit test implies the smallest functional
unit, while solr integration implies multiple units working together.

It sounds like you have two different tasks.  The code that generates
queries you can test without Solr.  If you need to parse some sort of
solr document to generate a query based on it, then mock up the query.
A lot of folks will just use Solr to build a result set, save it on the
filesystem as "my_big_result1.xml", and then read it in and feed it to
your code.
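
As a concrete (and entirely hypothetical) sketch of testing query generation
with no Solr in the loop:

```python
# Hypothetical criteria-based query builder, unit tested without Solr.
def build_query(criteria):
    parts = ['%s:"%s"' % (field, value)
             for field, value in sorted(criteria.items())]
    return " AND ".join(parts)

# The test is a pure string comparison; no index or server is involved.
assert build_query({"productName": "whatever",
                    "anotherField": "Johnny"}) == \
    'anotherField:"Johnny" AND productName:"whatever"'
```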

For your code testing indexing and retrieval, again, you can use the
same approach to decouple what solr does from your code.  Unless you've
patched Solr, you shouldn't need to unit test Solr; Solr has very nice
unit testing built in.

On the other hand, if you are doing integration testing, where you
want a more end-to-end view of your application, then you probably
already have a "test" solr setup in your environment somewhere that
you can rely on.

Spinning up and shutting down Solr for tests can be done, and I can
think of use cases for why you might want to do it, but it does incur
a penalty of being more work.  And you still need to validate that
your "embedded/unit test" solr works the same as your integration/test
environment Solr.

Eric




-
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal






Coming up with a model of memory usage

2009-04-06 Thread Joe Pollard
To combat our frequent OutOfMemoryErrors, I'm attempting to come up
with a model so that we can determine how much memory to give Solr based
on how much data we have (this becomes more important as we expand the
set of data types eligible to be supported).

Are there any published guidelines on how much memory a particular
document takes up in memory, based on the data types, etc?

I have several stored fields, numerous other non-stored fields, a
largish copyField destination, and I am doing some sorting on indexed,
non-stored fields.

Any pointers would be appreciated!

Thanks,
-Joe



Re: Coming up with a model of memory usage

2009-04-07 Thread Joe Pollard
It doesn't seem to matter whether fields are stored or not, but I've
found a rather striking difference in the memory requirements during
sorting.  Sorting on a string field representing a datetime like
'2008-08-12T12:18:26.510' is about twice as memory-intensive as sorting
first by '2008-08-12' and then by '121826'.
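
The gap can be illustrated with a quick simulation on synthetic data (not
measured against Solr): Lucene's string sort cache holds every distinct term
in the sorted field, so the number of unique values is what drives memory use.

```python
import random
from datetime import datetime, timedelta

# 10,000 synthetic docs with random millisecond timestamps in one month.
random.seed(0)
base = datetime(2008, 8, 1)
stamps = [base + timedelta(seconds=random.randrange(30 * 86400),
                           milliseconds=random.randrange(1000))
          for _ in range(10_000)]

# Unique sort terms for one combined field vs. split date/time fields.
full = {t.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] for t in stamps}
dates = {t.strftime("%Y-%m-%d") for t in stamps}
times = {t.strftime("%H%M%S") for t in stamps}

# Nearly every combined timestamp is unique, while the split fields are
# capped at 30 distinct dates and 86,400 distinct seconds-of-day.
print(len(full), len(dates), len(times))
```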

Any other tips/guidance like this would be great!

Thanks,
-Joe




RE: Coming up with a model of memory usage

2009-04-07 Thread Joe Pollard
Cool, great resource, thanks.

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Tuesday, April 07, 2009 10:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Coming up with a model of memory usage


There are a lot of threads on memory usage on the mailing list. Searching on
the mailing list will give you a lot of information.

http://lucene.markmail.org/search/solr+sorting+memory

--
Regards,
Shalin Shekhar Mangar.


RE: Coming up with a model of memory usage

2009-04-07 Thread Joe Pollard
Good info to have.  Thanks Erick.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, April 07, 2009 10:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Coming up with a model of memory usage

Your observations about date sorting are probably correct.  The
issue is that the sort caches in Lucene look at the unique terms.
There are many more unique terms (nearly one per document) in
2008-08-12T12:18:26.510 than when the field is split.  You can reduce
memory consumption when sorting even further by splitting into more
fields, but it's up to you to decide whether or not that's worth the
effort.

Best
Erick



RE: Coming up with a model of memory usage

2009-04-07 Thread Joe Pollard
It does end up in the right order (sorted), but it's very expensive.  Sorting 
by a couple of fields that each have fewer unique indexed values seems to limit 
the memory consumption greatly.

-Original Message-
From: Walter Underwood [mailto:wunderw...@netflix.com]
Sent: Tuesday, April 07, 2009 11:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Coming up with a model of memory usage

Why tokenize the date? It sorts just fine as a string. --wunder




_val:ord(field) (from wiki LargeIndexes)

2009-04-07 Thread Joe Pollard
I see this interesting line in the wiki page LargeIndexes
(http://wiki.apache.org/solr/LargeIndexes, sorting section towards the bottom):

Using _val:ord(field) as a search term will sort the results without incurring 
the memory cost.

I'd like to know what this means, but I'm having a bit of trouble parsing 
it.  What is _val:ord(field), exactly?  Does this just mean that I should 
pass in the ordinal of the field instead of the field name in the query?  Which 
portion of the memory cost is avoided by doing this?
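
For what it's worth, the wiki line appears to refer to Solr's FunctionQuery
hook, normally written _val_ (with underscores on both sides); embedding
_val_:"ord(field)" folds the field's term ordinal into the relevancy score,
so score ordering can stand in for a sort without populating the string sort
cache.  A sketch of assembling such a request, assuming that reading is
correct:

```python
from urllib.parse import urlencode

# Hypothetical request: the real match criteria go in the query as usual,
# and the _val_ hook contributes the ord() value to the score.
params = urlencode({
    "q": 'anotherField:"Johnny" _val_:"ord(productName)"',
    "fl": "id,score",
})
print("/select?" + params)
```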


Distributed Search - only get ids

2009-04-29 Thread Joe Pollard
Solr 1.3: If I am only getting back the document ids from a distributed search 
(e.g., the uniqueKey is 'id' and the fl parameter contains only 'id'), there 
seems to be some room for optimization in the current code path:


1)  On each shard, grab the top N sorted documents (document id & sort fields).

2)  Merge these into one sorted list of N ids.

3)  Query each shard for the details of these documents (by id), getting 
back a field list of id only.

It seems to me that step 3 is overhead that can be skipped.
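
A sketch of why step 3 is redundant when only ids are requested (the shard
data and N here are made up): after the merge in step 2, the coordinator
already holds every id in final sort order.

```python
import heapq

# Step 1: each shard returns its top-N (sort_value, doc_id) pairs.
shard_a = [(0.9, "doc3"), (0.7, "doc1")]
shard_b = [(0.8, "doc5"), (0.6, "doc2")]

# Step 2: merge by sort value, keeping the global top N.
n = 3
merged = heapq.nlargest(n, shard_a + shard_b)
ids = [doc_id for _, doc_id in merged]

# With fl=id, this list is already the final answer; the per-document
# fetch in step 3 would add no new information.
print(ids)  # ['doc3', 'doc5', 'doc1']
```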

Any thoughts on this/known patches?

Thanks,
-Joe