Can Solr solve this simple problem?
Hi everyone :)

Our company is very interested in the Solr engine for searching people.
I have 3 questions below about Solr's extended capabilities, but first I'd like to present the problem.

Let's say we have ~100 mln users with many characteristics, some of which are described below. We want to search users by any set of these characteristics (of course we would use index clustering, replication, and query distribution):

- country - text (ISO alpha-3 country code)
- language - text (ISO alpha-3 language code)
- has_photo - boolean
- has_video - boolean
- lastvisit - date
- gender - int
- age - int
- latitude - float
- longitude - float
- height - int
- updated - date
- 100+ other boolean fields to store and search by (whether the profile has some property or not)

A prototype SQL query looks like this:

SELECT user_id FROM users
WHERE country = 'USA'
  AND language = 'SPA'
  AND gender = 1
  AND age BETWEEN 30 AND 40
  AND latitude BETWEEN 39.0 AND 41.0
  AND longitude BETWEEN 73.0 AND 75.0
  AND height BETWEEN 170 AND 180
  AND has_photo = 1
  AND has_video = 0
  AND (bool_field1 = 1 OR bool_field2 = 1)
  AND (bool_fieldN = 0 OR bool_fieldM = 1 OR bool_fieldK = 0)
  ...
ORDER BY
  IF(has_photo = 1, 100, 0)
    + IF(language = 'FRA', 50, 0)
    + IF(has_description = 1, 150, 0)
    + IF(has_video = 1, 50, 0)
    + IF(lastvisit > NOW() - interval 1 month, 200, 0) DESC,
  IF(age > 35, 20, 0)
    + IF(gender = 2, 30, 0)
    + IF(bool_field1 = 1, 50, 0)
    + IF(bool_fieldN = 0, 100, 0) ASC
LIMIT 200;

So, these are my 3 questions:

1. Can Solr search across a varying number of fields of different types, as in the WHERE condition?
2. Can Solr sort by an expression that depends on other fields (like the sums in ORDER BY)? In other words, does it provide any kind of function that can be used to sort the results from question 1?
3. Does Solr provide real-time index updating, or updating every N minutes?

What advice can you give for implementing this search scheme with Solr?

Best regards.
Re: Can Solr solve this simple problem?
Thanks for your reply :)

I have some new questions now:

1. How stable is the trunk version? Has anyone used it on any kind of high-load project in production?
2. Does version 3.6 support near-real-time index updates?
3. How does Solr store its index? Is it all in memory for each shard, or on disk with in-memory caching for frequently asked queries?
4. Is the best practice for index updating to do delta imports every 5 minutes, for example, and a full index rebuild once a day? Does a rebuild take a long time for ~100 mln users? Am I right?
5. Do sharding and replication have native support in Solr, so that all I need to care about is the config file for the nodes? Are there any limitations on using such sorting if we use sharding?

The reasons why we want to move from our DB search scheme (data is sharded into small tables on several servers and managed in code) are:

1. the response time of our search isn't what we need (3-5 s now in production, we want <1 s)
2. the amount of data is growing
3. we want automatic clustering of any amount of data and search over it, without having to care about how the data is stored and whether it is durable or not

That's why we are also looking at other solutions that autoshard huge amounts of data and can run these types of queries and sorting (we are thinking about MySQL Cluster, but it's not stable yet, or Oracle Cluster). If anyone can give advice on such technology, I'll be glad to hear it.

2012/4/17 Jan Høydahl

> > Hi everyone :)
>
> Hi :)
>
> > So, these are my 3 questions:
> > 1. Does Solr provide searching among different count fields with different
> > types like in WHERE condition?
>
> Yes. As long as these are not full-text you should use filter queries for
> these, e.g.
> &q=*:*
> &fq=country:USA
> &fq=language:SPA
> &fq=age:[30 TO 40]
> &fq=(bool_field1:1 OR bool_field2:1)
>
> The reason why I put multiple "fq" instead of one long one is to optimize for
> caching of filters
>
> > 2. Does Solr provide such sorting, that depends on other fields (like sums
> > in ORDER BY), other words - does it provide any kind of function, which is
> > used to sort results from q1?
>
> Yes. In trunk version you can sort by function which can do sums and all
> crazy things
> &sort=sum(product(has_photo,10),if(exists(query($agequery)),50,0)) asc&agequery=age:[53 TO *]
> See http://wiki.apache.org/solr/FunctionQuery for more functions
>
> But you could also do much of this through boost queries
> &sort=score desc
> &bq=language:FRA^50
> &bq=age:[53 TO *]^20
>
> > 3. Does Solr provide realtime index updating or updating every N minutes?
>
> Sure, there is Near Real-time indexing in TRUNK (coming 4.0)
>
> Jan
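For illustration, the filter-query and boost-query parameters quoted above could be assembled from client code roughly like this. It is only a sketch: the host, port, and `users` core name are assumptions, and `build_search_url` is a made-up helper, not part of any Solr client library. It uses defType=dismax with q.alt=*:* because bq is a dismax parameter.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/users/select"


def build_search_url(filters, boosts, rows=200):
    """Return a select URL with one fq per filter and one bq per boost.

    Each filter gets its own fq so that Solr can cache it separately.
    """
    params = [
        ("defType", "dismax"),
        ("q.alt", "*:*"),        # match everything; the fq clauses do the filtering
        ("sort", "score desc"),  # score is driven by the boost queries
        ("rows", str(rows)),
    ]
    params += [("fq", clause) for clause in filters]
    params += [("bq", f"{clause}^{weight}") for clause, weight in boosts]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url(
    filters=[
        "country:USA",
        "language:SPA",
        "age:[30 TO 40]",
        "(bool_field1:1 OR bool_field2:1)",
    ],
    boosts=[("language:FRA", 50), ("age:[53 TO *]", 20)],
)
print(url)
```

Passing the parameters as a list of pairs (rather than a dict) is what lets fq and bq repeat.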
Re: Can Solr solve this simple problem?
Thanks for your replies, you're a good expert :)

I've read the basic Solr documentation; I've been familiar with Solr for around 2 days. The documentation looks huge at first sight :). My company and I are deciding whether to use Solr or another solution.
Maybe you're right about re-implementing our sorting functions as something new.

1. If the index is stored on disk, how is good performance achieved when the index changes frequently (~50,000 - 100,000 records are updated every 10 minutes, so maybe caching won't be effective)?
2. What can you say about Solr's semantic search capabilities? Are there any examples of it in production?
3. Can you please give some examples of projects/sites using Solr 4.0 in production?

2012/4/17 Jan Høydahl

> Hi,
>
> You have many basic questions about search. Can I recommend one of the
> books? http://lucene.apache.org/solr/books.html
> Also, you'll find a lot of answers on the Solr WIKI:
> http://wiki.apache.org/solr/ if you're not aware of it.
>
> I think Solr may solve your performance problems well.
> Whether it's the right tool for the job depends on several factors.
> Also, sometimes it is useful to step back and think fresh. Perhaps the
> reason why you implemented things like you did was technical reasons driven
> by your DB capabilities.
> When re-implementing on top of Solr, perhaps there are better ways to do
> what you REALLY wanted instead of limiting yourself to the ORDER BY syntax
> etc.
> One of Solr's strengths is relevancy and FunctionQueries and it can do
> amazing things :)
>
> Further answers below..
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 17. apr. 2012, at 07:20, Alexandr Bocharov wrote:
>
> > Thanks for your reply :)
> > I have some new questions now:
> > 1. How stable is trunk version? Has anyone used it on any kind of highload
> > project in production?
> It's stable. Used in production in many places. An alpha or beta release is
> expected soon
>
> > 2. Does version 3.6 support near real time index update?
> No
>
> > 3. What is scheme of Solr index storing? Is it all in memory for each shard
> > or in disk with caching for frequently asked queries in memory?
> On disk but with many caching optimizations
>
> > 4. The best practice for index updating is - to do delta imports each 5
> > minutes for example, and once a day - full rebuild index, does it take long
> > time for ~100 mln users? Am I right?
> You can do deltas only, as often as you choose. Solr will handle the
> backend details
>
> > 5. Does sharding and replications have native support in Solr, so everyting
> > I need to care about is config file for nodes? Are there any limitations of
> > usage such sorting if we use sharding?
> Yes, sharding and replication are natively supported. See the Wiki
>
> > The reason why we want to move from our DB search scheme (data is sharded
> > into small tables at several servers and managed in code) is that:
> > 1. response time of our search isn't what we need (3-5 s now in production,
> > we want <1 s)
> > 2. growing amount of data
> > 3. we want automatically clustering any amount of data and search by it,
> > without need to care about how data stores and does it has durability or not
> >
> > That's why we also looking other solutions with autosharding of huge amount
> > of data with ability to make such types of query and sorting (thinking
> > about Mysql Cluster, but it's not stable yet, or Oracle Cluster). If anyone
> > can give advice for such technology, I'll be glad to hear it.
> What do you expect from "Autosharding"?
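As a rough sketch of the "deltas only" approach mentioned in the reply, a client could periodically send just the changed user records and let Solr decide when to commit. Everything specific here is an assumption: the host, port, and `users` core name are hypothetical, `build_delta_request` is a made-up helper, and the code assumes a Solr version whose /update handler accepts JSON documents and the commitWithin request parameter (in milliseconds). The request is only built, not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/users/update"


def build_delta_request(changed_users, commit_within_ms=60000):
    """Build (but do not send) a POST carrying only the changed records.

    commitWithin asks Solr to make the changes searchable within the given
    number of milliseconds, instead of the client forcing a commit per batch.
    """
    body = json.dumps(changed_users).encode("utf-8")
    url = f"{SOLR_UPDATE_URL}?commitWithin={commit_within_ms}"
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_delta_request(
    [{"user_id": 42, "has_photo": True, "age": 31}]
)
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`, once the endpoint and schema are known to match.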
Re: Solr PHP highload search
Thank you for your help :)

I'm giving the JVM 2048M for each node.
CPU load jumps to 70-90%.
Memory usage increases to the maximum during testing (probably the cache is filling up).
I didn't monitor I/O.

I'd like to see answers to my other questions.

2012/6/13 Erick Erickson

> How much memory are you giving the JVM? Have you put a performance
> monitor on the running process to see what resources have been
> exhausted (i.e. are you I/O bound? CPU bound?)
>
> Best
> Erick
>
> On Tue, Jun 12, 2012 at 3:40 AM, Alexandr Bocharov wrote:
> > Hi, all.
> >
> > I need advice on configuring Solr search for use in high-load production.
> >
> > I've written a user search engine (PHP class) that uses over 70 parameters
> > for searching users.
> > The user database is over 30 million records.
> > Index total size is 6.4G when I use 1 node and 3.2G when 2 nodes.
> > The previous search engine can handle 700,000 user-search queries per day,
> > which is ~8 queries/sec (4 MySQL servers with manual sharding via
> > Gearman)
> >
> > An example query is:
> >
> > [responseHeader] => SolrObject Object
> >     (
> >         [status] => 0
> >         [QTime] => 517
> >         [params] => SolrObject Object
> >             (
> >                 [bq] => Array
> >                     (
> >                         [0] => bool_field1:1^30
> >                         [1] => str_field1:str_value1^15
> >                         [2] => tint_field1:tint_field1^5
> >                         [3] => bool_field2:1^6
> >                         [4] => date_field1:[NOW-14DAYS TO NOW]^20
> >                         [5] => date_field2:[NOW-14DAYS TO NOW]^5
> >                     )
> >
> >                 [indent] => on
> >                 [start] => 0
> >                 [q.alt] => *:*
> >                 [wt] => xml
> >                 [fq] => Array
> >                     (
> >                         [0] => tint_field2:[tint_value2 TO tint_value22]
> >                         [1] => str_field1:str_value1
> >                         [2] => str_field2:str_value2
> >                         [3] => tint_field3:(tint_value3 OR tint_value32 OR tint_value33 OR tint_value34 OR tint_value5)
> >                         [4] => tint_field4:tint_value4
> >                         [5] => -bool_field1:[* TO *]
> >                     )
> >
> >                 [version] => 2.2
> >                 [defType] => dismax
> >                 [rows] => 10
> >             )
> >
> >     )
> >
> > I tested my PHP search API and found that concurrent random queries, for
> > example 10 queries at a time, increase QTime from an average of 500 ms to
> > 3000 ms on 2 nodes.
> >
> > 1. How can I tweak my queries, parameters, or Solr's config to decrease
> > QTime?
> > 2. If I put my index data in an emulated RAM directory, can it greatly
> > increase performance?
> > 3. Sorting by boost queries has a great influence on QTime; how can I
> > optimize boost queries?
> > 4. If I split my 2 nodes on 2 machines into 6 nodes on 2 machines, 3 nodes
> > per machine, will it increase performance?
> > 5. What is a "multi-core query", how can I configure it, and will it
> > increase performance?
> >
> > Thank you!
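To make the "1 query at a time vs. 10 at a time" comparison repeatable, a small timing harness along these lines may help. It is a sketch under stated assumptions: `measure` and `time_call` are made-up helpers, and `fake_query` is a stand-in that only sleeps; a real test would replace it with a function that sends one search request. Note that it measures client-side wall-clock latency, which is not the same number as the QTime that Solr reports.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def time_call(run_query):
    """Run one query and return its wall-clock latency in milliseconds."""
    start = time.perf_counter()
    run_query()
    return (time.perf_counter() - start) * 1000.0


def measure(run_query, concurrency, total_requests):
    """Issue total_requests calls using `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(
            pool.map(lambda _: time_call(run_query), range(total_requests))
        )
    return {"avg_ms": mean(latencies), "max_ms": max(latencies)}


def fake_query():
    # Hypothetical stand-in for a real HTTP request to a Solr node.
    time.sleep(0.01)


for level in (1, 10):
    print(level, measure(fake_query, concurrency=level, total_requests=20))
```

Running the same harness while watching CPU and disk I/O on the Solr nodes would help show which resource is exhausted as the concurrency level goes up.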