Thank you Joel - I'm using a ModifiableSolrParams object to build the parameters for Solr (hope this is what you want)

toString() returns:

expr=classify(model(models,id%3D"MODEL1014",cacheMillis%3D5000),search(COL,df%3D"FULL_DOCUMENT",q%3D"Collection:(COLLECT2000)+AND+DocTimestamp:[2017-08-14T04:00:00Z+TO+2017-08-16T03:59:00Z]",fl%3D"id,score",sort%3D"id+asc"),field%3D"ClusterText")&qt=/stream&explain=true&fl=id&sort=id+asc&rows=100

This collection has 100 shards with 3 replicas each, so I would expect 100*20 = 2000 results? Although I'm classifying on ClusterText, I only need the ID in the results. At present, I can build a model and classify a single document, or a set of documents, as they come into the system. However, if I want to use a model as a search, then I'm asking Solr to classify a lot of docs, when I actually only want to return docs that have a probability of n or higher.
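
The "probability of n or higher" requirement can at least be applied on the client once the tuples are read. A minimal sketch, assuming each tuple has been copied into a plain Map carrying the id and probability_d fields used elsewhere in this thread; ProbabilityFilter and its methods are hypothetical names, not SolrJ API, and this does not reduce the classification work Solr itself performs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of SolrJ): tuples are modeled as plain Maps.
final class ProbabilityFilter {

    // Returns the ids of tuples whose "probability_d" value is >= threshold.
    static List<String> idsAtOrAbove(List<Map<String, Object>> tuples, double threshold) {
        List<String> ids = new ArrayList<>();
        for (Map<String, Object> tuple : tuples) {
            Object p = tuple.get("probability_d");
            // Tuples without a numeric probability are skipped rather than failing.
            if (p instanceof Number && ((Number) p).doubleValue() >= threshold) {
                ids.add(String.valueOf(tuple.get("id")));
            }
        }
        return ids;
    }

    // Builds a stand-in tuple for illustration only.
    static Map<String, Object> tuple(String id, double probability) {
        Map<String, Object> t = new HashMap<>();
        t.put("id", id);
        t.put("probability_d", probability);
        return t;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> tuples = new ArrayList<>();
        tuples.add(tuple("doc1", 0.91));
        tuples.add(tuple("doc2", 0.12));
        tuples.add(tuple("doc3", 0.50));
        System.out.println(idsAtOrAbove(tuples, 0.5));
    }
}
```

The threshold comparison is inclusive, matching "n or higher".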

-Joe


On 8/14/2017 10:46 PM, Joel Bernstein wrote:
My math was off again ... If you have 20 results from 50 shards that would
produce the 1000 results.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 14, 2017 at 10:17 PM, Joel Bernstein <joels...@gmail.com> wrote:

Actually my math was off. You would need 200 shards to get to 1000 results.
How many shards do you have?

The expression you provided also didn't include the ClusterText field in
the field list of the search. So perhaps it's missing other parameters.

If you include all the parameters I may be able to spot the issue.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 14, 2017 at 10:10 PM, Joel Bernstein <joels...@gmail.com> wrote:

It looks like you just need to set the rows parameter in the search
expression. If you don't set rows, the default will be 20 I believe, which
will pull the top 20 docs from each shard. If you have 5 shards then the
1000 results would make sense.
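
As a rough illustration of setting rows inside the search expression, the sketch below assembles the expression string in Java. SearchExpressionBuilder is a hypothetical helper, not a SolrJ class; the quoted form rows="N" is an assumption about style, and the value 100 is only an example:

```java
// Hypothetical helper (not SolrJ): builds a search(...) stream expression
// string that carries an explicit rows parameter.
final class SearchExpressionBuilder {

    // Note: values are inserted verbatim; embedded double quotes are not escaped.
    static String searchWithRows(String collection, String df, String q,
                                 String fl, String sort, int rows) {
        if (rows <= 0) {
            throw new IllegalArgumentException("rows must be positive");
        }
        return "search(" + collection
                + ",df=\"" + df + "\""
                + ",q=\"" + q + "\""
                + ",fl=\"" + fl + "\""
                + ",sort=\"" + sort + "\""
                + ",rows=\"" + rows + "\")";
    }

    public static void main(String[] args) {
        // Collection, field, and query values are taken from the expression in this thread.
        System.out.println(searchWithRows(
                "COL", "FULL_DOCUMENT", "Collection:(COLLECT2000)",
                "id,score", "id asc", 100));
    }
}
```

The resulting string would take the place of the search(...) portion inside the classify(...) expression.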

You can parallelize the whole expression by wrapping it in a parallel
expression. You'll need to set the partitionKeys in the search expression
to do this.
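
A sketch of what that wrapping could look like when the expression is built as a string in Java. ParallelExpression is a hypothetical helper; the worker collection name "workers", the worker count 4, and partitioning on id are all assumptions for illustration, not values from this thread:

```java
// Hypothetical helper (not SolrJ): wraps an inner stream expression in parallel(...).
final class ParallelExpression {

    // The inner expression is expected to contain a search(...) with partitionKeys set.
    static String wrap(String workerCollection, String innerExpr, int workers, String sort) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        return "parallel(" + workerCollection + "," + innerExpr
                + ",workers=\"" + workers + "\""
                + ",sort=\"" + sort + "\")";
    }

    public static void main(String[] args) {
        // partitionKeys="id" is an assumed choice so each worker sees a disjoint slice.
        String search = "search(COL,q=\"*:*\",fl=\"id,score\",sort=\"id asc\",partitionKeys=\"id\")";
        String classify = "classify(model(models,id=\"MODEL1014\",cacheMillis=5000),"
                + search + ",field=\"ClusterText\")";
        System.out.println(wrap("workers", classify, 4, "id asc"));
    }
}
```

The sort passed to parallel should match the sort of the inner search so the merged stream stays ordered.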

If you have a large number of records to process I would recommend batch
processing. This blog explains the parallel batch framework:

http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 14, 2017 at 7:53 PM, Joe Obernberger <joseph.obernber...@gmail.com> wrote:

Hi All - I'm using the classify stream expression and the results
returned are always limited to 1,000.  Where do I specify the number to
return?  The stream expression that I'm using looks like:

classify(model(models,id="MODEL1014",cacheMillis=5000),search(COL,df="FULL_DOCUMENT",q="Collection:(COLLECT2000) AND DocTimestamp:[2017-08-14T04:00:00Z TO 2017-08-15T03:59:00Z]",fl="id,score",sort="id asc"),field="ClusterText")

When I read this (code snippet):

    stream.open();
    while (true) {
        Tuple tuple = stream.read();
        if (tuple.EOF) {
            break;
        }
        // Each tuple carries the classifier's probability and the document id.
        Double probability = (Double) tuple.fields.get("probability_d");
        String docID = (String) tuple.fields.get("id");
    }
    stream.close();

I get back 1,000 results.  Another question is if there is a way to
parallelize the classify call to other worker nodes?  Thank you!

-Joe


