Re: Ideas

2015-09-21 Thread Walter Underwood
I have put a limit in the front end at a couple of sites. Nobody gets more than 50 pages of results. Show page 50 if they request beyond that. First got hit by this at Netflix, years ago. Solr 4 is much better about deep paging, but here at Chegg we got deep paging plus a stupid, long query. Th
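
A minimal sketch of the front-end cap described in this post, assuming 1-based page numbers and a fixed page size. The names clamp_page and start_offset and the default of 10 rows per page are illustrative assumptions, not taken from the thread.

```python
MAX_PAGE = 50  # cap from the post: nobody gets more than 50 pages of results


def clamp_page(requested_page, max_page=MAX_PAGE):
    """Return a page in [1, max_page]; requests beyond the cap get the last page."""
    return max(1, min(int(requested_page), max_page))


def start_offset(requested_page, rows=10, max_page=MAX_PAGE):
    """Translate the clamped page number into a Solr `start` offset."""
    return (clamp_page(requested_page, max_page) - 1) * rows
```

Clamping silently, rather than returning an error, matches the "show page 50 if they request beyond that" behavior in the post.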

Re: Ideas

2015-09-21 Thread Doug Turnbull
The nginx reverse proxy we use blocks ridiculous start and rows values https://github.com/o19s/solr_nginx Another silly thing I've noticed is you can pass sleep() as a function query. It's not documented, but I think a big hole. I wonder if I could DoS your Solr by sleeping and hogging all the av

Re: Ideas

2015-09-21 Thread DVT
Hi Bill, the classical way would be to have a reverse proxy in front of the application that catches such cases. A decent reverse proxy or even application firewall router will allow you to define limits on bandwidth and sessions per time unit. Some even recognize specific denial-of-service patte

Re: Ideas

2015-09-21 Thread Paul Libbrecht
Writing a query component would be pretty easy, wouldn't it? It would throw an exception if crazy numbers are requested... I can provide a simple example of a maven project for a query component. Paul William Bell wrote: > We have some Denial of service attacks on our web site. SOLR threads are > going c

Re: Ideas for debugging poor SolrCloud scalability

2014-11-07 Thread Erick Erickson
Ian: Thanks much for the writeup! It's always good to have real-world documentation! Best, Erick On Fri, Nov 7, 2014 at 8:26 AM, Shawn Heisey wrote: > On 11/7/2014 7:17 AM, Ian Rose wrote: >> *tl;dr: *Routing updates to a random Solr node (and then letting it forward >> the docs to where they n

Re: Ideas for debugging poor SolrCloud scalability

2014-11-07 Thread Shawn Heisey
On 11/7/2014 7:17 AM, Ian Rose wrote: > *tl;dr: *Routing updates to a random Solr node (and then letting it forward > the docs to where they need to go) is very expensive, more than I > expected. Using a "smart" router that uses the cluster config to route > documents directly to their shard resul

Re: Ideas for debugging poor SolrCloud scalability

2014-11-07 Thread Ian Rose
Hi again, all - Since several people were kind enough to jump in to offer advice on this thread, I wanted to follow up in case anyone finds this useful in the future. *tl;dr: *Routing updates to a random Solr node (and then letting it forward the docs to where they need to go) is very expensive,

Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Erick Erickson
bq: but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Not really. You've stated that you're not driving Solr very hard in your tests. Therefore you're waiting on I/O. Therefore your tests just aren't going to scale linearly with the number of shard

Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Shawn Heisey
On 11/1/2014 9:52 AM, Ian Rose wrote: > Just to make sure I am thinking about this right: batching will certainly > make a big difference in performance, but it should be more or less a > constant factor no matter how many Solr nodes you are using, right? Right > now in my load tests, I'm not actu

Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Ian Rose
Erick, Just to make sure I am thinking about this right: batching will certainly make a big difference in performance, but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Right now in my load tests, I'm not actually that concerned about the absolute

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Peter Keegan
Yes, I was inadvertently sending them to a replica. When I sent them to the leader, the leader reported (1000 adds) and the replica reported only 1 add per document. So, it looks like the leader forwards the batched jobs individually to the replicas. On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Erick Erickson
Internally, the docs are batched up into smaller buckets (10 as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing. Erick On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan wrote: > Regarding batch indexing: > When I send batches of 1000 docs to a standalone S

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Peter Keegan
Regarding batch indexing: When I send batches of 1000 docs to a standalone Solr server, the log file reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually "(12 adds)". Why do the batches appea

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Erick Erickson
NP, just making sure. I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations if 1> you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. Sending docs one at a time is something of
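
A minimal sketch of the batching advice in this post: group documents on the client and send each group in one update request instead of one request per document. The function name batches and the default size of 1000 are assumptions for illustration; sending the batch to Solr is left out.

```python
def batches(docs, size=1000):
    """Yield lists of at most `size` documents from any iterable of docs."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch
```

Each yielded list would then be posted as a single update, by whatever HTTP client the indexing code already uses.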

Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Ian Rose
Hi Erick - Thanks for the detailed response and apologies for my confusing terminology. I should have said "WPS" (writes per second) instead of QPS but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes

Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Erick Erickson
I'm really confused: bq: I am not issuing any queries, only writes (document inserts) bq: It's clear that once the load test client has ~40 simulated users bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right QPS is usually used to mea

Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Thanks for the suggestions so far, all. 1) We are not using SolrJ on the client (not using Java at all) but I am working on writing a "smart" router so that we can always send to the correct node. I am certainly curious to see how that changes things. Nonetheless even with the overhead of extra r

Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Erick Erickson
Your indexing client, if written in SolrJ, should use CloudSolrServer which is, in Matt's terms "leader aware". It divides up the documents to be indexed into packets where each doc in the packet belongs on the same shard, and then sends the packet to the shard leader. This avoids a lot of re-
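
A rough sketch of the "packets per shard" idea for a non-Java client, under stated assumptions. The shard_for argument is a hypothetical caller-supplied function that maps a document to its shard name. Real SolrCloud routing depends on the collection's router and the cluster state, and none of that is reproduced here.

```python
from collections import defaultdict


def packets_by_shard(docs, shard_for):
    """Group docs so that every packet holds documents for a single shard.

    `shard_for` is a hypothetical callable: doc -> shard name.
    """
    packets = defaultdict(list)
    for doc in docs:
        packets[shard_for(doc)].append(doc)
    return dict(packets)
```

Each packet would then be sent directly to the leader of its shard, which avoids the forwarding hop the post describes.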

Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Shawn Heisey
On 10/30/2014 2:56 PM, Ian Rose wrote: > I think this is true only for actual queries, right? I am not issuing > any queries, only writes (document inserts). In the case of writes, > increasing the number of shards should increase my throughput (in > ops/sec) more or less linearly, right? No, that

Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Matt Hilt
If you are issuing writes to shard non-leaders, then there is a large overhead for the eventual redirect to the leader. I noticed a 3-5 times performance increase by making my write client leader aware. On Oct 30, 2014, at 2:56 PM, Ian Rose wrote: >> >> If you want to increase QPS, you shoul

Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
> > If you want to increase QPS, you should not be increasing numShards. > You need to increase replicationFactor. When your numShards matches the > number of servers, every single server will be doing part of the work > for every query. I think this is true only for actual queries, right? I a

Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Shawn Heisey
On 10/30/2014 2:23 PM, Ian Rose wrote: > My methodology is as follows. > 1. Start up a K solr servers. > 2. Remove all existing collections. > 3. Create N collections, with numShards=K for each. > 4. Start load testing. Every minute, print the number of successful > updates and the number of faile

RE: ideas for indexing large amount of pdf docs

2011-08-16 Thread Rode González
ae...@dot.wi.gov] > Sent: Monday, August 15, 2011 14:54 > To: solr-user@lucene.apache.org > Subject: RE: ideas for indexing large amount of pdf docs > > Note on i: Solr replication provides pretty good clustering support > out-of-the-box, including replication of m

RE: ideas for indexing large amount of pdf docs

2011-08-15 Thread Jaeger, Jay - DOT
Note on i: Solr replication provides pretty good clustering support out-of-the-box, including replication of multiple cores. Read the Wiki on replication (Google +solr +replication if you don't know where it is). In my experience, the problem with indexing PDFs is it takes a lot of CPU on t

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
t, 13 Aug 2011 15:34:19 -0400 Subject: Re: ideas for indexing large amount of pdf docs Ahhh, ok, my reply was irrelevant ... Here's a good write-up on this problem: http://www.lucidimagination.com/content/scaling-lucene-and-solr [http://www.lucidimagination.com/content/scaling-lucen

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
tering in production time. > > Best, > > Rode. > > > -Original Message- > > From: Erick Erickson > > To: solr-user@lucene.apache.org > > Date: Sat, 13 Aug 2011 12:13:27 -0400 > > Subject: Re: ideas for indexing large amount of pdf docs > >

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Bill Bell
You could send PDF for processing using a queue solution like Amazon SQS. Kick off Amazon instances to process the queue. Once you process with Tika to text just send the update to Solr. Bill Bell Sent from mobile On Aug 13, 2011, at 10:13 AM, Erick Erickson wrote: > Yeah, parsing PDF files

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Rode Gonzalez (libnova)
dea to minimize this time all as possible when we entering in production time. Best, Rode. -Original Message- From: Erick Erickson To: solr-user@lucene.apache.org Date: Sat, 13 Aug 2011 12:13:27 -0400 Subject: Re: ideas for indexing large amount of pdf docs Yeah, parsing PDF

Re: ideas for indexing large amount of pdf docs

2011-08-13 Thread Erick Erickson
Yeah, parsing PDF files can be pretty resource-intensive, so one solution is to offload it somewhere else. You can use the Tika libraries in SolrJ to parse the PDFs on as many clients as you want, just transmitting the results to Solr for indexing. How are all these docs being submitted? Is this s

Re: ideas for versioning query?

2011-08-01 Thread Mike Sokolov
I think a 30% increase is acceptable. Yes, I think we'll try it. Although our case is more like # groups ~ # documents / N, where N is a smallish number (~1-5?). We are planning for a variety of different index sizes, but aiming for a sweet spot around a few M docs. -Mike On 08/01/2011 11:

Re: ideas for versioning query?

2011-08-01 Thread Martijn v Groningen
Hi Mike, how many docs and groups do you have in your index? I think the group.sort option fits your requirements. If I remember correctly group.ngroups=true adds something like 30% extra time on top of the search request with grouping, but that was on my local test dataset (~30M docs, ~8000 groups

Re: ideas for versioning query?

2011-08-01 Thread Mike Sokolov
Thanks, Tomas. Yes we are planning to keep a "current" flag in the most current document. But there are cases where, for a given user, the most current document is not that one, because they only have access to some older documents. I took a look at http://wiki.apache.org/solr/FieldCollapsin

Re: ideas for versioning query?

2011-08-01 Thread Tomás Fernández Löbbe
Hi Michael, I guess this could be solved using grouping as you said. Documents inside a group can be sorted on a field (in your case, the version field, see parameter group.sort), and you can show only the first one. It will be more complex to show facets (post grouping faceting is work in progress
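
A minimal sketch of the grouping request suggested in this post: group on a document identifier, sort inside each group by version, and keep only the first document per group. The parameters group, group.field, group.sort and group.limit are Solr result-grouping parameters. The field names doc_id and version and the function name are assumptions for illustration.

```python
from urllib.parse import urlencode


def latest_version_params(query, group_field="doc_id", version_field="version"):
    """Build a query string that returns only the newest version in each group."""
    return urlencode({
        "q": query,
        "group": "true",
        "group.field": group_field,              # hypothetical field name
        "group.sort": f"{version_field} desc",   # newest version first in each group
        "group.limit": 1,                        # show only the first doc per group
    })
```

The resulting string would be appended to the select handler URL of whichever core or collection holds the versioned documents.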

Re: Ideas on how to implement "sponsored results"

2008-06-04 Thread Alexander Ramos Jardim
Cuong, I think you will need some manipulation beyond solr queries. You should separate the results by your site criteria after retrieving them. After that, you could cache the results on your application and randomize the lists every time you render a page. I don't know if solr has collapsin

Re: Ideas on how to implement "sponsored results"

2008-06-03 Thread climbingrose
Hi Alexander, Thanks for your suggestion. I think my problem is a bit different from yours. We don't have any sponsored words but we have to retrieve sponsored results directly from the index. This is because a site can have 60,000 products which is hard to insert/update keywords. I can live with

Re: Ideas on how to implement "sponsored results"

2008-06-03 Thread Alexander Ramos Jardim
Cuong, I have implemented sponsored words for a client. I don't know if my work can help you, but I will describe it and let you decide. I have an index containing product entries, in which I created a field called sponsored words. What I do is boost this field, so when these words are matched in

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-30 Thread Yonik Seeley
On 5/30/07, Daniel Einspanjer <[EMAIL PROTECTED]> wrote: What I quickly found I could do without though was the HTTP overhead. I implemented the EmbeddedSolr class found on the Solr wiki that let me interact with the Solr engine directly. This is important since I'm doing thousands of queries in

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-30 Thread Daniel Einspanjer
On 4/11/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: : Not really. The explain scores aren't normalized and I also couldn't : find a way to get the explain data as anything other than a whitespace : formatted text blob from Solr. Keep in mind that they need confidence the default way Solr du

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-09 Thread Sean Timm
Yes, for good (hopefully) or bad. -Sean Shridhar Venkatraman wrote on 5/7/2007, 12:37 AM: Interesting.. Surrogates can also bring the searcher's subjectivity (opinion and context) into it by the learning process ? shridhar Sean Timm wrote: It may not be easy or even possible withou

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-06 Thread Shridhar Venkatraman
Interesting.. Surrogates can also bring the searcher's subjectivity (opinion and context) into it by the learning process ? shridhar Sean Timm wrote: It may not be easy or even possible without major changes, but having global collection statistics would allow scores to be compared across

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-05 Thread Sean Timm
It may not be easy or even possible without major changes, but having global collection statistics would allow scores to be compared across searchers. To do this, the master indexes would need to be able to communicate with each other. Another approach to merging across searchers is describe

Re: Ideas for a relevance score that could be considered stable across multiple searches with the same query structure?

2007-05-05 Thread Daniel Einspanjer
On 4/11/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: A custom Similarity class with simplified tf, idf, and queryNorm functions might also help you get scores from the Explain method that are more easily manageable since you'll have predictable query structures hard coded into your application