Combining negative queries and OR
I am trying to decide whether this is a Solr or a Lucene problem, using Solr 1.3. Take this example:

    (-productName:"whatever") OR (anotherField:"Johnny")

I would expect to get back records that have anotherField=Johnny, but also any records that don't have 'whatever' as the productName. However, the productName clause seems to be ignored: I get back only records that match anotherField:"Johnny". I'm not sure if this is a bug or expected behavior, and I am able to work around it, but I'd certainly expect the above query to work.

Thanks!
-Joe
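A likely explanation, for anyone hitting the same behavior: a Lucene BooleanQuery consisting only of prohibited clauses matches nothing, so the parenthesized sub-query (-productName:"whatever") contributes no documents on its own. Solr special-cases a purely negative query at the top level, but not one nested inside parentheses. The commonly suggested workaround is to make the starting set explicit with the match-all query:

    (*:* -productName:"whatever") OR anotherField:"Johnny"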
Best way to unit test solr integration
Hello,

On our project, we have quite a bit of code used to generate Solr queries, and I need to create some unit tests to ensure that these continue to work. In addition, I need to generate some unit tests that will test indexing and retrieval of certain documents, based on our current schema and the application logic that generates the indexable documents as well as the Solr queries.

My question is - what's the best way for me to unit test our Solr integration?

I'd like to be able to spin up an embedded/in-memory Solr, or, failing that, start one up as part of my test case setup, fill it with interesting documents, and do some queries, comparing the results to expected results.

Are there wiki pages or other documented examples of doing this? It seems rather straightforward, but who knows, there may be some feature that makes it dead simple.

Thanks!
-Joe
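One approach worth sketching here (not from the thread, and the API names follow the Solr 1.3-era SolrJ wiki, so treat it as a starting point rather than a definitive recipe): SolrJ ships an EmbeddedSolrServer that runs a core inside the test JVM, pointed at a solr home directory containing your schema.xml and solrconfig.xml.

    import junit.framework.TestCase;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedSolrSmokeTest extends TestCase {
        public void testIndexAndQuery() throws Exception {
            // Hypothetical path: a solr home with schema.xml/solrconfig.xml under test resources.
            System.setProperty("solr.solr.home", "src/test/resources/solr");
            CoreContainer container = new CoreContainer.Initializer().initialize();
            SolrServer server = new EmbeddedSolrServer(container, "");  // "" selects the default core

            // Index one interesting document and make it visible to searches.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("anotherField", "Johnny");
            server.add(doc);
            server.commit();

            // Query and compare against the expected result.
            QueryResponse rsp = server.query(new SolrQuery("anotherField:Johnny"));
            assertEquals(1, rsp.getResults().getNumFound());

            container.shutdown();
        }
    }

Solr's own test framework also has an AbstractSolrTestCase base class that handles this kind of setup and teardown; it may be worth a look before rolling your own.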
RE: Best way to unit test solr integration
Thanks for the tips, I like the suggestion of testing the document and query generation without having Solr involved. That seems like a more bite-sized unit; I think I'll do that.

However, here's the test case I'm considering where I'd like to have a live Solr instance: during an exercise of optimizing our schema, I'm going to be making wholesale changes, and I'd like to ensure they don't break some portion of our app. It seems like a good method for this would be to write a test with the following steps (arguably not a unit test, but a very valuable test indeed in our application; a sketch appears after the quoted thread below):

* take some defined model object generated at test time, and store it in the db
* run it through our document creation code
* submit it into Solr
* generate a query using our custom criteria-based generation code
* ensure that the query returns the results as expected
* flesh out new model objects from the db using only the id fields returned from Solr

In the end, the model objects retrieved from the db would be expected to match the model objects from the beginning of the test. These building blocks could be stacked in numerous ways to test almost all the different scenarios in which we use Solr. Also, when/if we start making Solr config changes, I can ensure that they change nothing from my app's functional point of view (with the exception of ridding us of dreaded OOMs).

Thanks,
-Joe

-----Original Message-----
From: Eric Pugh [mailto:ep...@opensourceconnections.com]
Sent: Friday, March 27, 2009 11:27 AM
To: solr-user@lucene.apache.org
Subject: Re: Best way to unit test solr integration

So my first thought is that "unit test + solr integration" is an oxymoron, in the sense that a unit test implies the smallest functional unit, while solr integration implies multiple units working together. It sounds like you have two different tasks.

For the code that generates queries, you can test that without Solr. If you need to parse some sort of solr document to generate a query based on it, then mock up the query. A lot of folks will just use Solr to build a result set, save it on the filesystem as "my_big_result1.xml", then read it in and feed it to your code.

On the other hand, for your code testing indexing and retrieval, you can again use the same approach to decouple what solr does from your code. Unless you've patched Solr, you shouldn't need to unit test Solr; Solr has very nice unit testing built in.

On the other hand, if you are doing integration testing, where you want a more end-to-end view of your application, then you probably already have a "test" solr setup in your environment somewhere that you can rely on. Spinning up and shutting down Solr for tests can be done, and I can think of use cases for why you might want to do it, but it does incur the penalty of being more work. And you still need to validate that your "embedded/unit test" solr works the same as your integration/test environment Solr.

Eric

On Mar 27, 2009, at 11:59 AM, Joe Pollard wrote:

> Hello,
>
> On our project, we have quite a bit of code used to generate Solr
> queries, and I need to create some unit tests to ensure that these
> continue to work. In addition, I need to generate some unit tests
> that will test indexing and retrieval of certain documents, based on
> our current schema and the application logic that generates the
> indexable documents as well as the Solr queries.
>
> My question is - what's the best way for me to unit test our Solr
> integration?
> I'd like to be able to spin up an embedded/in-memory Solr, or,
> failing that, start one up as part of my test case setup, fill it
> with interesting documents, and do some queries, comparing the
> results to expected results.
>
> Are there wiki pages or other documented examples of doing this? It
> seems rather straightforward, but who knows, there may be some
> feature that makes it dead simple.
>
> Thanks!
> -Joe

-----
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal
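A skeleton of the round-trip test described above might look like the following (every helper here - widgetDao, documentBuilder, queryBuilder, Criteria, Widget - is a hypothetical stand-in for the application's own code):

    public void testSchemaRoundTrip() throws Exception {
        // widgetDao, documentBuilder, queryBuilder, solrServer: hypothetical fixtures
        // wired up in setUp(); Widget is a placeholder model class.

        // 1. Create a known model object at test time and store it in the db.
        Widget original = new Widget("w-1", "Johnny");
        widgetDao.save(original);

        // 2-3. Run it through the document-creation code and submit it to Solr.
        solrServer.add(documentBuilder.toDocument(original));
        solrServer.commit();

        // 4-5. Generate a query with the custom criteria-based code and check the results.
        SolrQuery query = queryBuilder.fromCriteria(Criteria.nameEquals("Johnny"));
        QueryResponse rsp = solrServer.query(query);
        assertEquals(1, rsp.getResults().getNumFound());

        // 6-7. Flesh out the model object from the db using only the id Solr returned,
        //      and verify it matches what the test started with.
        String id = (String) rsp.getResults().get(0).getFieldValue("id");
        assertEquals(original, widgetDao.findById(id));
    }

Swapping in schema or solrconfig changes and re-running a suite of these would then catch any app-visible regressions.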
Coming up with a model of memory usage
To combat our frequent OutOfMemory exceptions, I'm attempting to come up with a model so that we can determine how much memory to give Solr based on how much data we have (as we expand to more data types eligible to be supported, this becomes more important).

Are there any published guidelines on how much memory a particular document takes up in memory, based on the data types, etc.?

I have several stored fields, numerous other non-stored fields, a largish copyField destination, and I am doing some sorting on indexed, non-stored fields.

Any pointers would be appreciated!

Thanks,
-Joe
Re: Coming up with a model of memory usage
It doesn't seem to matter whether fields are stored or not, but I've found a rather striking difference in the memory requirements during sorting: sorting on a string field representing a datetime like '2008-08-12T12:18:26.510' is about twice as memory-intensive as sorting first by '2008-08-12' and then by '121826'.

Any other tips/guidance like this would be great!

Thanks,
-Joe

On Mon, 2009-04-06 at 15:43 -0500, Joe Pollard wrote:
> To combat our frequent OutOfMemory exceptions, I'm attempting to come up
> with a model so that we can determine how much memory to give Solr based
> on how much data we have (as we expand to more data types eligible to be
> supported, this becomes more important).
>
> Are there any published guidelines on how much memory a particular
> document takes up in memory, based on the data types, etc.?
>
> I have several stored fields, numerous other non-stored fields, a
> largish copyField destination, and I am doing some sorting on indexed,
> non-stored fields.
>
> Any pointers would be appreciated!
>
> Thanks,
> -Joe
RE: Coming up with a model of memory usage
Cool, great resource, thanks.

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Tuesday, April 07, 2009 10:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Coming up with a model of memory usage

On Tue, Apr 7, 2009 at 8:25 PM, Joe Pollard wrote:

> It doesn't seem to matter whether fields are stored or not, but I've
> found a rather striking difference in the memory requirements during
> sorting: sorting on a string field representing a datetime like
> '2008-08-12T12:18:26.510' is about twice as memory-intensive as sorting
> first by '2008-08-12' and then by '121826'.
>
> Any other tips/guidance like this would be great!

There are a lot of threads on memory usage on the mailing list. Searching the mailing list will give you a lot of information.

http://lucene.markmail.org/search/solr+sorting+memory

--
Regards,
Shalin Shekhar Mangar.
RE: Coming up with a model of memory usage
Good info to have. Thanks Erick.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, April 07, 2009 10:51 AM
To: solr-user@lucene.apache.org
Subject: Re: Coming up with a model of memory usage

Your observations about date sorting are probably correct. The issue is that the sort caches in Lucene look at the unique terms: there are many more unique terms (nearly one per document) in 2008-08-12T12:18:26.510 than when the field is split. You can reduce memory consumption even further by splitting into more fields, but it's up to you to decide whether or not that's worth the effort.

Best
Erick

On Tue, Apr 7, 2009 at 10:55 AM, Joe Pollard wrote:

> It doesn't seem to matter whether fields are stored or not, but I've
> found a rather striking difference in the memory requirements during
> sorting: sorting on a string field representing a datetime like
> '2008-08-12T12:18:26.510' is about twice as memory-intensive as sorting
> first by '2008-08-12' and then by '121826'.
>
> Any other tips/guidance like this would be great!
>
> Thanks,
> -Joe
>
> On Mon, 2009-04-06 at 15:43 -0500, Joe Pollard wrote:
> > To combat our frequent OutOfMemory exceptions, I'm attempting to come up
> > with a model so that we can determine how much memory to give Solr based
> > on how much data we have (as we expand to more data types eligible to be
> > supported, this becomes more important).
> >
> > Are there any published guidelines on how much memory a particular
> > document takes up in memory, based on the data types, etc.?
> >
> > I have several stored fields, numerous other non-stored fields, a
> > largish copyField destination, and I am doing some sorting on indexed,
> > non-stored fields.
> >
> > Any pointers would be appreciated!
> >
> > Thanks,
> > -Joe
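To make the unique-terms point concrete, here is a rough back-of-the-envelope estimate (all figures are illustrative assumptions, not measurements from this thread). Lucene's string sort cache holds one ordinal entry per document plus a copy of every unique term, so term storage dominates when nearly every document has a distinct value:

    10M docs, one unique 23-char timestamp per doc:
      ordinals:     10,000,000 * 4 bytes             ~  40 MB
      unique terms: 10,000,000 * (~40 + 23*2) bytes  ~ 860 MB

    10M docs, split into a date field and a time field:
      ordinals:     2 * 10,000,000 * 4 bytes         ~  80 MB
      unique terms: ~1,000 dates + ~86,400 times     -> negligible

The ~40 bytes is an assumed per-String JVM overhead; exact numbers vary by JVM and Lucene version, but the shape of the saving is the point: the split fields pay mostly the fixed per-document cost, while the full timestamp also pays per-document term storage.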
RE: Coming up with a model of memory usage
It does end up in the right order (sorted), but it's very expensive. Sorting by a couple of fields that each have fewer unique indexed values seems to limit the memory consumption greatly.

-----Original Message-----
From: Walter Underwood [mailto:wunderw...@netflix.com]
Sent: Tuesday, April 07, 2009 11:12 AM
To: solr-user@lucene.apache.org
Subject: Re: Coming up with a model of memory usage

Why tokenize the date? It sorts just fine as a string. --wunder

On 4/7/09 8:50 AM, "Erick Erickson" wrote:

> Your observations about date sorting are probably correct. The
> issue is that the sort caches in Lucene look at the unique terms:
> there are many more unique terms (nearly one per document) in
> 2008-08-12T12:18:26.510 than when the field is split. You can reduce
> memory consumption even further by splitting into more fields, but
> it's up to you to decide whether or not that's worth the effort.
>
> Best
> Erick
>
> On Tue, Apr 7, 2009 at 10:55 AM, Joe Pollard wrote:
>
>> It doesn't seem to matter whether fields are stored or not, but I've
>> found a rather striking difference in the memory requirements during
>> sorting: sorting on a string field representing a datetime like
>> '2008-08-12T12:18:26.510' is about twice as memory-intensive as sorting
>> first by '2008-08-12' and then by '121826'.
>>
>> Any other tips/guidance like this would be great!
>>
>> Thanks,
>> -Joe
>>
>> On Mon, 2009-04-06 at 15:43 -0500, Joe Pollard wrote:
>>> To combat our frequent OutOfMemory exceptions, I'm attempting to come up
>>> with a model so that we can determine how much memory to give Solr based
>>> on how much data we have (as we expand to more data types eligible to be
>>> supported, this becomes more important).
>>>
>>> Are there any published guidelines on how much memory a particular
>>> document takes up in memory, based on the data types, etc.?
>>>
>>> I have several stored fields, numerous other non-stored fields, a
>>> largish copyField destination, and I am doing some sorting on indexed,
>>> non-stored fields.
>>>
>>> Any pointers would be appreciated!
>>>
>>> Thanks,
>>> -Joe
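For reference, a minimal sketch of the split-field setup being discussed (field names are hypothetical, and the split values have to be produced by the indexing code, since the schema can't derive them from the full timestamp on its own):

    <!-- schema.xml: two low-cardinality sort fields instead of one full timestamp -->
    <field name="sortDate" type="string" indexed="true" stored="false"/>  <!-- e.g. 2008-08-12 -->
    <field name="sortTime" type="string" indexed="true" stored="false"/>  <!-- e.g. 121826 -->

with queries then sorting on both, in order: &sort=sortDate asc, sortTime asc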
_val_:ord(field) (from wiki LargeIndexes)
I see this interesting line in the wiki page LargeIndexes, http://wiki.apache.org/solr/LargeIndexes (sorting section, towards the bottom):

    Using _val_:ord(field) as a search term will sort the results without incurring the memory cost.

I'd like to know what this means, but I'm having a bit of trouble parsing it. What is _val_:ord(field), exactly? Does this just mean that I should pass in the ordinal of the field instead of the field name in the query? Which portion of the memory cost is avoided by doing this?
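For anyone else puzzling over the syntax: _val_ is a magic field name understood by Solr's query parser. Instead of matching against a field, it injects a FunctionQuery whose computed value becomes the document's score, and ord(field) evaluates to the position of a document's term in that field's sorted list of unique values. So a query like the following (field name hypothetical) scores each document by its ordinal, and the default sort-by-score then returns documents in field order:

    q=_val_:"ord(startDate)"

It isn't about passing an ordinal in yourself; Solr computes the ordinal per document at query time.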
Distributed Search - only get ids
Solr 1.3: if I am only getting back the document ids from a distributed search (e.g., the uniqueKey field is 'id' and the fl parameter contains only 'id'), there seems to be some room for optimization in the current code path:

1) On each shard, grab the top N sorted documents (ids & sort field values).
2) Merge these into one sorted list of N ids.
3) Query each shard for the details of those documents (by id), getting back a field list of id only.

It seems to me that step 3 is overhead that could be skipped, since the ids are already known after step 2. Any thoughts on this / known patches?

Thanks,
-Joe
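For illustration, the kind of request in question might look like this (hosts, ports, and field names are hypothetical):

    http://shard-a:8983/solr/select?q=anotherField:Johnny
        &shards=shard-a:8983/solr,shard-b:8983/solr
        &fl=id&rows=10

With fl=id, everything the second fetch phase (step 3) returns is already known after the merge in step 2, so for this case the extra round trip to each shard buys nothing.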