substituting a db

2007-12-20 Thread alexander lind

Hi List

I have a pretty big app in the works, and in short it will need to
index a lot of items, each with some core attributes and hundreds of
optional attributes per item.


The app then needs to be able to make queries like
'find all items with attributes attribute_1=yes, attribute_5=10,
attribute_8 > 50, attribute_12 != green'

etc.
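In Solr terms a query like that could be expressed as a stack of filter
queries. A rough sketch in Python, with the field names, URL, and range
syntax all assumed for illustration rather than taken from a real schema:

# Rough sketch only: the attribute query above expressed as Solr
# filter queries. Field names and the server URL are assumptions.
import urllib.parse
import urllib.request

params = [
    ("q", "*:*"),
    ("fq", "attribute_1:yes"),
    ("fq", "attribute_5:10"),
    ("fq", "attribute_8:[51 TO *]"),  # attribute_8 > 50, assuming an integer field
    ("fq", "-attribute_12:green"),    # attribute_12 != green
    ("rows", "0"),                    # only the count is needed, not the docs
]
url = "http://localhost:8983/solr/select?" + urllib.parse.urlencode(params)
print(urllib.request.urlopen(url).read())

The numFound value in the response carries the exact match count.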

I need exact counts, no approximations.

I could do this using a regular database like MySQL, but I know it
will become rather slow at about 400-500k items with 100 or so
attributes each.


My question is, would Solr be able to handle this better? I was
thinking perhaps I could use the faceted searches for this?


Thanks
Alec


Re: substituting a db

2007-12-21 Thread alexander lind


On Dec 21, 2007, at 8:16 AM, Erik Hatcher wrote:





I think Solr would handle this better personally, and you'd get
full-text search as an added bonus! :) But, of course, it is advisable
to give it a try and see. Solr is quite easy to get rolling with, so
it'd be well worth a try.


The one part I am most worried about is that Solr would start doing
approximations once the amount of data grows big, like in the 500k
range, or even once I reach the 1M range.
Do you know if there is a way to make sure that all facet counts are
exact when dealing with this many records, and is it feasible to do
this, i.e. not making it too slow by forcing the exact counts?


Thanks
Alec


Re: substituting a db

2007-12-21 Thread alexander lind

On Dec 21, 2007, at 9:56 AM, Ryan McKinley wrote:

Solr does not do approximations. Faceting with large indexes (500K
is not particularly large) just requires RAM for reasonable
performance.
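The facet counts are exact tallies over the matching documents. A
minimal sketch of the parameters involved (field names and URL assumed):

# Sketch: exact per-value counts for a couple of attribute fields.
# Field names and the Solr URL are made up for illustration.
import urllib.parse
import urllib.request

params = [
    ("q", "*:*"),
    ("rows", "0"),                    # skip document bodies, keep only counts
    ("facet", "true"),
    ("facet.field", "attribute_1"),
    ("facet.field", "attribute_12"),
]
url = "http://localhost:8983/solr/select?" + urllib.parse.urlencode(params)
print(urllib.request.urlopen(url).read())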


Give it a try, and see what you think.


Excellent, happy to hear that. Give it a try I will; pretty sure I
will like it :)


Alec



negation

2008-02-13 Thread alexander lind

Hi all

Say that I have a Solr index with 5000 documents, each representing a
campaign that users of my site can join. The user can search and find  
these campaigns in various ways, which is not a problem, but once a  
user has found a campaign and joined it, I don't want that campaign to  
ever show up again for that particular user.


After a while, a user can have built up a list of say 200 campaigns  
that he has joined, and hence should never see in any search results  
again.


I know this functionality could be achieved by simply building a
longer and longer negation query, negating all the campaigns that a
user has already joined. I would assume that this would become slow
and inefficient eventually.
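Concretely, the growing filter would look something like this (the
campaign IDs are invented):

# Sketch of the naive approach: exclude each joined campaign by ID.
# The exclusion list grows with every join.
joined = ["c101", "c102", "c203"]        # ...eventually ~200 entries
fq = "-id:(%s)" % " OR ".join(joined)    # -id:(c101 OR c102 OR c203)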


My question is: is there a better way to do this?

Thanks
Alec


Re: negation

2008-02-13 Thread alexander lind
Have you done any stress tests on this setup? Is it working well for  
you?
It sounds like something that could work quite well for me too, but I  
would be a little worried that a commit could time out, and a unique  
value could be lost for that user.


Thank you
Alec

On Feb 13, 2008, at 1:10 PM, Rachel McConnell wrote:


We do something similar in a different context.  I don't know if our
way is necessarily better, but it would work like this:

1. add a field to campaign called something like enteredUsers
2. once a user adds a campaign, update the campaign, adding a value
unique to that user to enteredUsers
3. the negation can now be done by excluding the user's unique id from
the enteredUsers field, instead of excluding all the user's campaigns

The downside is it will increase the number of your commits, which may
or may not be OK.
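Very roughly, both halves might look like this. All field names, IDs,
and URLs here are invented, and note that Solr of this vintage has no
partial updates, so changing one field means re-posting the whole
document:

# Sketch with invented names, not our actual code: re-post the campaign
# with the user's id appended to a multi-valued enteredUsers field,
# then filter per user at query time.
import urllib.request

SOLR = "http://localhost:8983/solr"

def post(xml):
    req = urllib.request.Request(
        SOLR + "/update", xml.encode(), {"Content-Type": "text/xml"}
    )
    urllib.request.urlopen(req)

def mark_joined(campaign, user_id):
    # campaign is the full document as a dict; the whole doc is
    # re-posted with the new enteredUsers value added.
    campaign.setdefault("enteredUsers", []).append(user_id)
    fields = []
    for name, value in campaign.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            fields.append('<field name="%s">%s</field>' % (name, v))
    post("<add><doc>%s</doc></add>" % "".join(fields))
    post("<commit/>")

# Query side: one stable clause instead of a growing ID list.
fq = "-enteredUsers:12345"    # hide everything user 12345 already joined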

Rachel






Re: negation

2008-02-13 Thread alexander lind
I think I will try a hybrid version: one that uses my simple negation
for newly joined campaigns, and your method to filter out campaigns
joined longer ago. A cron'd script will run every night and add all
new user_ids to the appropriate campaigns. That way I don't have to
re-index on the fly during the day when the server is busiest, and
there should be fewer commits to the Solr instance too, one per
campaign max instead of one per join.
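Roughly, the combined per-user filter would come out like this (IDs
and field names invented):

# Sketch of the hybrid filter: campaigns joined since the last nightly
# run are excluded by ID, older joins by the indexed user id.
recent = ["c311", "c542"]                  # joined since last night's run
user_id = "12345"

clauses = ["-enteredUsers:%s" % user_id]
if recent:
    clauses.append("-id:(%s)" % " OR ".join(recent))
fq = " ".join(clauses)    # -enteredUsers:12345 -id:(c311 OR c542)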


Thanks for your input on this Rachel!

Alec

On Feb 13, 2008, at 2:01 PM, Rachel McConnell wrote:


We've been using this in production for at least six months. I have
never stress-tested this particular feature, but we usually do over
100k unique hits a day. Of those, most hit Solr for one thing or
another, but a much smaller percentage use this specific bit. It
isn't the fastest query, but as we use it there are some additional
complexities, so YMMV.

We aren't at risk of data loss from Solr, as we maintain all data in
our database backend; Solr is essentially a slave to that. So we have
a db field, enteredUsers, which has the usual JDBC failure checking,
and any error is handled gracefully. The Solr index is then updated
from the db periodically (we optimize for fast search results over
up-to-dateness).
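In outline, the periodic update amounts to something like this, much
simplified and with all names invented:

# Much-simplified shape of the periodic sync: the database stays the
# source of truth, and Solr is rebuilt from it on a schedule.
import sqlite3            # stand-in for the real backend database
import urllib.request

def sync(db, solr="http://localhost:8983/solr"):
    docs = []
    for cid, users in sqlite3.connect(db).execute(
        "SELECT id, entered_users FROM campaigns"
    ):
        fields = ['<field name="id">%s</field>' % cid]
        fields += [
            '<field name="enteredUsers">%s</field>' % u
            for u in (users or "").split(",") if u
        ]
        docs.append("<doc>%s</doc>" % "".join(fields))
    for payload in ("<add>%s</add>" % "".join(docs), "<commit/>"):
        urllib.request.urlopen(
            urllib.request.Request(
                solr + "/update", payload.encode(),
                {"Content-Type": "text/xml"},
            )
        )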

R




Re: YAML update request handler

2008-02-20 Thread alexander lind

On Feb 20, 2008, at 9:31 AM, Doug Steigerwald wrote:

A few months back I wrote a YAML update request handler to see if we
could post documents faster than with XML. We did see some small
speed improvements (didn't write down the numbers), but the
hacked-together code was probably slowing it down as well. Not sure if
there are faster YAML libraries out there either.


We're not actually using it, since it was just a small proof of  
concept type of project, but is this anything people might be  
interested in?
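For a sense of the idea, here is the same update in Solr's real XML
format next to a hypothetical YAML shape; the thread never shows the
format the handler actually accepted:

# Illustrative only: Solr's XML update format alongside a hypothetical
# YAML equivalent (not necessarily what the handler used).
xml_update = """\
<add>
  <doc>
    <field name="id">42</field>
    <field name="title">example</field>
  </doc>
</add>"""

yaml_update = """\
add:
  - id: 42
    title: example"""

# The appeal: less markup to parse per document, at the cost of a
# YAML parser dependency on the Solr side.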




Out of simple preference I would love to see a YAML request handler,
just because I like the YAML format. If it's also faster than XML, then
all the better.


Cheers
Alec