substituting a db
Hi List

I have a pretty big app in the works, and in short it will need to index a lot of items, with some core attributes and hundreds of optional attributes for each item. The app then needs to be able to make queries like 'find all items with attributes attribute_1=yes, attribute_5=10, attribute_8 > 50, attribute_12 != green' etc. I need exact counts, no approximations. I could do this using a regular database like MySQL, but I know it will become rather slow at around 400-500k items with 100 or so attributes each. My question is: would Solr be able to handle this better? I was thinking perhaps I could use the faceted searches for this?

Thanks
Alec
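As a rough sketch, an attribute query like the one above could be expressed against Solr as a set of filter queries, with only the match count returned. The field names, the Solr URL, and the use of Python's requests library below are illustrative assumptions, not something from the original message:

    import requests  # any HTTP client would do

    SOLR_URL = "http://localhost:8983/solr/select"  # assumed local Solr instance

    # Each attribute constraint becomes its own filter query (fq); Solr caches
    # filter queries independently, which helps when constraints are reused.
    params = [
        ("q", "*:*"),                     # match everything, then filter
        ("fq", "attribute_1:yes"),
        ("fq", "attribute_5:10"),
        ("fq", "attribute_8:[51 TO *]"),  # "attribute_8 > 50" as a range query,
                                          # assuming the field is indexed so ranges behave numerically
        ("fq", "-attribute_12:green"),    # "attribute_12 != green" as a negation
        ("rows", "0"),                    # only the count is wanted, no documents
        ("wt", "json"),
    ]

    resp = requests.get(SOLR_URL, params=params).json()
    print(resp["response"]["numFound"])   # exact count of matching items

numFound here is an exact count of matching documents, not an estimate.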
Re: substituting a db
On Dec 21, 2007, at 8:16 AM, Erik Hatcher wrote:

On Dec 21, 2007, at 1:33 AM, alexander lind wrote: I have a pretty big app in the works, and in short it will need to index a lot of items, with some core attributes and hundreds of optional attributes for each item. The app then needs to be able to make queries like 'find all items with attributes attribute_1=yes, attribute_5=10, attribute_8 > 50, attribute_12 != green' etc. I need exact counts, no approximations. I could do this using a regular database like MySQL, but I know it will become rather slow at about 400-500k items with 100 or so attributes each. My question is: would Solr be able to handle this better? I was thinking perhaps I could use the faceted searches for this?

I think Solr would handle this better personally, and you'd get full-text search as an added bonus! :) But, of course, it is advisable to give it a try and see. Solr is quite easy to get rolling with, so it'd be well worth a try.

The one part I am most worried about is that Solr would start doing approximations when the amount of data grows big, like in the 500k range, or even once I reach the 1M range. Do you know if there is a way to make sure that all facet counts are exact when dealing with this many records, and is it feasible to do this, i.e. not making it too slow by forcing the exact counts?

Thanks
Alec
Re: substituting a db
On Dec 21, 2007, at 9:56 AM, Ryan McKinley wrote:

Solr does not do approximations. Faceting with large indexes (500K is not particularly large) just requires RAM for reasonable performance. Give it a try, and see what you think.

Excellent, happy to hear that. Give it a try I will, pretty sure I will like it :)

Alec
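For what it's worth, a faceted request over those attribute fields might look roughly like the sketch below; the counts Solr returns are computed exactly from the index, and facet.limit / facet.mincount only control which values come back. The URL and field names are again assumptions:

    import requests

    SOLR_URL = "http://localhost:8983/solr/select"  # assumed local Solr instance

    params = [
        ("q", "*:*"),
        ("rows", "0"),                    # only counts are needed, no documents
        ("facet", "true"),
        ("facet.field", "attribute_1"),   # repeat facet.field for each attribute of interest
        ("facet.field", "attribute_12"),
        ("facet.limit", "-1"),            # return every value, not just the top N
        ("facet.mincount", "1"),          # drop values with zero matches
        ("wt", "json"),
    ]

    facets = requests.get(SOLR_URL, params=params).json()["facet_counts"]["facet_fields"]
    # Each entry is a flat [value, count, value, count, ...] list in the JSON response.
    print(facets["attribute_12"])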
negation
Hi all

Say that I have a Solr index with 5000 documents, each representing a campaign that users of my site can join. The user can search and find these campaigns in various ways, which is not a problem, but once a user has found a campaign and joined it, I don't want that campaign to ever show up again for that particular user. After a while, a user can have built up a list of say 200 campaigns that he has joined, and hence should never see in any search results again. I know this functionality could be achieved by simply building a longer and longer negation query that excludes all the campaigns a user has already joined, but I would assume that this would become slow and inefficient eventually. My question is: is there a better way to do this?

Thanks
Alec
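For reference, the growing-negation approach described above would look something like the following (campaign IDs, field names, and the Solr URL are made up); the exclusion string grows linearly with the number of joined campaigns, which is the worry:

    import requests

    SOLR_URL = "http://localhost:8983/solr/select"  # assumed local Solr instance

    joined_campaign_ids = ["c101", "c205", "c377"]  # grows over time, e.g. to 200 entries

    # Exclude every campaign the user has already joined.
    exclusion = "-id:(" + " OR ".join(joined_campaign_ids) + ")"

    params = {
        "q": "environment",   # whatever the user searched for
        "fq": exclusion,      # this filter gets longer with every join
        "wt": "json",
    }
    results = requests.get(SOLR_URL, params=params).json()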
Re: negation
Have you done any stress tests on this setup? Is it working well for you? It sounds like something that could work quite well for me too, but I would be a little worried that a commit could time out and a unique value could be lost for that user.

Thank you
Alec

On Feb 13, 2008, at 1:10 PM, Rachel McConnell wrote:

We do something similar in a different context. I don't know if our way is necessarily better, but it would work like this:

1. add a field to campaign called something like enteredUsers
2. once a user adds a campaign, update the campaign, adding a value unique to that user to enteredUsers
3. the negation can now be done by excluding the user's unique id from the enteredUsers field, instead of excluding all the user's campaigns

The downside is it will increase the number of your commits, which may or may not be OK.

Rachel

On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:

Hi all

Say that I have a Solr index with 5000 documents, each representing a campaign that users of my site can join. The user can search and find these campaigns in various ways, which is not a problem, but once a user has found a campaign and joined it, I don't want that campaign to ever show up again for that particular user. After a while, a user can have built up a list of say 200 campaigns that he has joined, and hence should never see in any search results again. I know this functionality could be achieved by simply building a longer and longer negation query that excludes all the campaigns a user has already joined, but I would assume that this would become slow and inefficient eventually. My question is: is there a better way to do this?

Thanks
Alec
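A minimal sketch of what steps 2 and 3 could look like in practice, assuming a multivalued enteredUsers field and the standard XML update handler; Solr of this vintage has no partial updates, so "updating the campaign" means re-posting the whole document. The helper names, URLs, and use of the requests library are assumptions, and field values are taken to be XML-safe for brevity:

    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/update"  # assumed local Solr instance
    SOLR_SELECT = "http://localhost:8983/solr/select"

    def join_campaign(campaign_doc, user_id):
        # Step 2: re-index the campaign with the joining user's id appended to
        # enteredUsers. campaign_doc is assumed to come from the system of record.
        campaign_doc.setdefault("enteredUsers", []).append(user_id)
        fields = []
        for name, value in campaign_doc.items():
            values = value if isinstance(value, list) else [value]
            fields.extend('<field name="%s">%s</field>' % (name, v) for v in values)
        xml = "<add><doc>%s</doc></add>" % "".join(fields)
        headers = {"Content-Type": "text/xml"}
        requests.post(SOLR_UPDATE, data=xml, headers=headers)
        requests.post(SOLR_UPDATE, data="<commit/>", headers=headers)

    # Step 3: searching now needs only a single, constant-size exclusion.
    params = {"q": "environment", "fq": "-enteredUsers:user_42", "wt": "json"}
    results = requests.get(SOLR_SELECT, params=params).json()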
Re: negation
I think I will try a hybrid version: one that uses my simple negation for newly joined campaigns, and uses your method to filter out campaigns joined longer ago. A nightly cron job will run and add all the new user IDs to the appropriate campaigns. That way I don't have to re-index on the fly during the daytime, when the server is going to be busiest, and there should be fewer commits to the Solr instance too: one per campaign at most, instead of one per join.

Thanks for your input on this Rachel!

Alec

On Feb 13, 2008, at 2:01 PM, Rachel McConnell wrote:

We've been using this in production for at least six months. I have never stress-tested this particular feature, but we usually do over 100k unique hits a day. Of those, most hit Solr for one thing or another, but a much smaller percentage use this specific bit. It isn't the fastest query, but as we use it there are some additional complexities, so YMMV.

We aren't at risk of data loss from Solr, as we maintain all data in our database backend; Solr is essentially a slave to that. So we have a db field, enteredUsers, which has the usual JDBC failure checking, and any error is handled gracefully. The Solr index is then updated from the db periodically (we're optimized for faster search results over up-to-date-ness).

R

On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:

Have you done any stress tests on this setup? Is it working well for you? It sounds like something that could work quite well for me too, but I would be a little worried that a commit could time out and a unique value could be lost for that user.

Thank you
Alec

On Feb 13, 2008, at 1:10 PM, Rachel McConnell wrote:

We do something similar in a different context. I don't know if our way is necessarily better, but it would work like this:

1. add a field to campaign called something like enteredUsers
2. once a user adds a campaign, update the campaign, adding a value unique to that user to enteredUsers
3. the negation can now be done by excluding the user's unique id from the enteredUsers field, instead of excluding all the user's campaigns

The downside is it will increase the number of your commits, which may or may not be OK.

Rachel

On 2/13/08, alexander lind <[EMAIL PROTECTED]> wrote:

Hi all

Say that I have a Solr index with 5000 documents, each representing a campaign that users of my site can join. The user can search and find these campaigns in various ways, which is not a problem, but once a user has found a campaign and joined it, I don't want that campaign to ever show up again for that particular user. After a while, a user can have built up a list of say 200 campaigns that he has joined, and hence should never see in any search results again. I know this functionality could be achieved by simply building a longer and longer negation query that excludes all the campaigns a user has already joined, but I would assume that this would become slow and inefficient eventually. My question is: is there a better way to do this?

Thanks
Alec
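A sketch of what that nightly job could amount to, assuming the database can report the joins recorded since the last run grouped by campaign, and reusing a re-posting helper like the one sketched earlier; all names here are hypothetical:

    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/update"  # assumed local Solr instance

    def nightly_sync(db):
        # Fold the day's joins into each affected campaign in one pass, so every
        # campaign is re-indexed at most once regardless of how many users joined it.
        for campaign_id, new_user_ids in db.new_joins_by_campaign():  # hypothetical helper
            doc = db.load_campaign(campaign_id)                       # hypothetical helper
            doc["enteredUsers"] = sorted(set(doc.get("enteredUsers", [])) | set(new_user_ids))
            post_campaign_doc(doc)  # hypothetical: re-post the full document, as in the earlier sketch
        # A single commit for the whole batch instead of one per join.
        requests.post(SOLR_UPDATE, data="<commit/>", headers={"Content-Type": "text/xml"})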
Re: YAML update request handler
On Feb 20, 2008, at 9:31 AM, Doug Steigerwald wrote:

A few months back I wrote a YAML update request handler to see if we could post documents faster than with XML. We did see some small speed improvements (didn't write down the numbers), but the hacked-together code was probably making it slower as well. Not sure if there are faster YAML libraries out there either. We're not actually using it, since it was just a small proof-of-concept type of project, but is this anything people might be interested in?

Out of simple preference I would love to see a YAML request handler, just because I like the YAML format. If it's also faster than XML, then all the better.

Cheers
Alec
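The YAML handler itself isn't shown in the thread, but for comparison, this is roughly the stock XML update format it would be standing in for, i.e. the baseline any alternative payload format gets measured against. The URL and document fields are illustrative:

    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/update"  # assumed local Solr instance

    # The standard XML update payload: one <doc> per document inside an <add>.
    xml = (
        "<add>"
        "<doc>"
        '<field name="id">doc-1</field>'
        '<field name="title">An example document</field>'
        "</doc>"
        "</add>"
    )
    headers = {"Content-Type": "text/xml"}
    requests.post(SOLR_UPDATE, data=xml, headers=headers)
    requests.post(SOLR_UPDATE, data="<commit/>", headers=headers)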