fetch streaming expression multiple collections problem

2020-09-24 Thread uyilmaz


Hello all,

When I use the "search" streaming expression with multiple collections, it 
works without any problems, for example:

search(
"collection1,collection2",
q="*:*",
fl="field1,field2",
qt="/export",
sort="field1 desc"
)

but when I try to use the "fetch" expression similarly:

fetch(
"collection1,collection2"


It gives me an error saying: "EXCEPTION": "java.io.IOException: Slices not 
found for \"collection1,collection2\""

When I use it without quotes, that error goes away, but another problem arises:

fetch(
collection1,collection2


This fetches fields only from collection1 and returns nothing for documents 
residing in collection2.

I took a look at the source code of the fetch and search expressions; they both 
read the collection parameter in exactly the same way, using:

String collectionName = factory.getValueOperand(expression, 0)

I'm lost. When I use an alias in place of multiple collections it works as 
desired, but we have many collections and queries are generated dynamically, so 
we would need many combinations of aliases.
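
A minimal sketch of how the alias workaround could be kept manageable when queries are generated dynamically: derive the alias name deterministically from the set of collections, so each combination maps to exactly one alias that can be created on demand. It assumes the Collections API CREATEALIAS action; the helper names, the alias prefix and the base URL are hypothetical, and nothing is actually sent to Solr here.

```python
import hashlib
from urllib.parse import urlencode


def alias_name(collections, prefix="auto_alias_"):
    """Derive a stable alias name from a set of collection names.

    Sorting and de-duplicating first means the same combination always
    maps to the same alias, regardless of the order it was requested in.
    The prefix is an arbitrary, hypothetical naming choice.
    """
    canonical = ",".join(sorted(set(collections)))
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]
    return prefix + digest


def create_alias_url(base_url, collections):
    """Build a Collections API CREATEALIAS URL (the request is not sent)."""
    params = {
        "action": "CREATEALIAS",
        "name": alias_name(collections),
        "collections": ",".join(sorted(set(collections))),
    }
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)


def fetch_expression(collections, inner, fl, on):
    """Build a fetch(...) expression string that targets the derived alias."""
    return 'fetch({}, {}, fl="{}", on="{}")'.format(
        alias_name(collections), inner, fl, on
    )
```

The query generator would then call create_alias_url once per new combination and use fetch_expression for the expression itself.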

Need help.
Regards

-- 
uyilmaz 


Worker node / collection creation, parallelized streams

2020-09-28 Thread uyilmaz


Hi all,

Today I was fiddling with a streaming expression that takes too long to finish 
and times out. First of all, is it normal for it to time out, rather than just 
taking too long?

Then I read about parallelized streaming expressions, which take the number of 
workers as a parameter. We have 10 nodes in our cluster.

My first question: if I want to run it on 10 worker nodes, should I provide a 
partition key that takes exactly 10 different values, or does Solr itself derive 
10 partitions from it? The "mod" function query with modulus 10 came to mind, 
but I got various errors when using it as a partition key.

My second question: how do I correctly create a worker collection? Should it be 
an empty collection with 10 shards with 1 replica each, or 1 shard with 10 
replicas? When I used the latter, I got array index out of bounds errors with 
the workers parameter set to a value greater than 1.
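
To illustrate the first question: if tuples are routed to workers by hashing the partition key and taking the result modulo the number of workers (an assumption made for this sketch, not something confirmed in this thread), the key does not need to take exactly 10 distinct values. The hash function below is only a stand-in, not Solr's, and the function names are hypothetical.

```python
import zlib
from collections import Counter


def worker_for(key, workers):
    """Assign a partition key value to a worker index by hashing it.

    zlib.crc32 is only a stand-in for whatever hash the server uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % workers


def partition(keys, workers):
    """Count how many tuples each worker would receive."""
    counts = Counter(worker_for(k, workers) for k in keys)
    return [counts.get(w, 0) for w in range(workers)]
```

Under this model every tuple with the same key value lands on the same worker, which is what grouping operations on the workers need, and a high-cardinality key spreads the load more evenly than one with only a handful of values.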

~Regards


-- 
uyilmaz 


Solr Web UI

2020-09-29 Thread uyilmaz


Hello all,

Our Solr web UI (/solr/#/) doesn't show query results if the query takes longer 
than, say, 3-4 seconds. When I look at the browser console, I see the request is 
getting cancelled. I went through the JavaScript code but didn't find a part 
that cancels the request after a couple of seconds. Do you see this behavior 
too? Is it intentional?

I usually use Postman for querying so this is not a problem most of the time, 
but I just wanted to see streaming expression explanation diagrams.

Have a nice day~~

-- 
uyilmaz 


Strange fetch streaming expression doesn't fetch fields sometimes?

2020-10-13 Thread uyilmaz


Hi all,

I have a streaming expression looking like:

fetch(
  myAlias,
  top(
    n=3,
    various expressions here
    sort="count(*) desc"
  ),
  fl="username", on="userid=userid", batchSize=3
)

which fails to fetch username field for the 1st result:

{
 "result-set":{
  "docs":[{
    "userid":"123123",
    "count(*)":58}
   ,{
    "userid":"123123123",
    "count(*)":32,
    "username":"Ayha"}
   ,{
    "userid":"12432423321323",
    "count(*)":30,
    "username":"MEHM"}
   ,{
    "EOF":true,
    "RESPONSE_TIME":34889}]}}

But strangely, when I change n and batchSize both to 2 and touch nothing else, 
fetch fetches the first username correctly:

fetch(
  myAlias,
  top(
    n=2,
    various expressions here
    sort="count(*) desc"
  ),
  fl="username", on="userid=userid", batchSize=2
)

Result is:

{
 "result-set":{
  "docs":[{
    "userid":"123123",
    "count(*)":58,
    "username":"mura"}
   ,{
    "userid":"123123123",
    "count(*)":32,
    "username":"Ayha"}
   ,{
    "EOF":true,
    "RESPONSE_TIME":34889}]}}

What can be the problem?

Regards

~~ufuk

-- 
uyilmaz 


Re: Strange fetch streaming expression doesn't fetch fields sometimes?

2020-10-13 Thread uyilmaz
I think I found the reason right after asking (facepalm), but it took me days 
to realize this.

I think fetch performs a naive "in" query, something like:

q="userid:(123123 123123123 12432423321323)&rows={batchSize}"

When the userid-to-document relation is one-to-many, the query above can return 
{batchSize} rows that all belong to the last two userids, so the first userid is 
left out and its username comes back empty. The docs state that one-to-many is 
not supported with fetch, but I hadn't stumbled onto this issue until recently, 
so I just assumed it would work.
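
A minimal Python simulation of the behavior guessed at above. It models the assumed "in" query with rows={batchSize}, not the actual fetch implementation; the function names and the sample index are made up for illustration.

```python
def in_query(index, field, values, rows):
    """Return at most `rows` documents whose `field` is in `values`, in index order."""
    wanted = set(values)
    hits = [doc for doc in index if doc[field] in wanted]
    return hits[:rows]


def naive_fetch(index, tuples, key, fl, batch_size):
    """Join field `fl` onto each tuple using one in-query per batch of tuples."""
    out = []
    for start in range(0, len(tuples), batch_size):
        batch = tuples[start:start + batch_size]
        hits = in_query(index, key, [t[key] for t in batch], rows=batch_size)
        lookup = {doc[key]: doc[fl] for doc in hits}
        for t in batch:
            joined = dict(t)
            if t[key] in lookup:
                joined[fl] = lookup[t[key]]
            out.append(joined)
    return out


# One-to-many: several documents share the same userid (sample data only).
index = [
    {"userid": "123123123", "username": "Ayha"},
    {"userid": "123123123", "username": "Ayha"},
    {"userid": "12432423321323", "username": "MEHM"},
    {"userid": "123123", "username": "mura"},
]
tuples = [
    {"userid": "123123"},
    {"userid": "123123123"},
    {"userid": "12432423321323"},
]

result = naive_fetch(index, tuples, "userid", "username", batch_size=3)
```

Here the three available rows are used up by documents of the other two userids, so result[0] (userid 123123) comes back without a username, which matches the symptom described in the original message.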

Sorry to take your time, I hope this helps somebody later.

Have a nice day.



-- 
uyilmaz 


just testing if my emails are reaching the mailing list

2020-10-14 Thread uyilmaz
Hello all,

I have not yet received an answer to any of my questions on this mailing list, 
and my mail client shows INVALID next to my mail address, so I thought I should 
check whether my emails are reaching you.

Can anyone reply?

Regards

-- 
uyilmaz 


Re: just testing if my emails are reaching the mailing list

2020-10-14 Thread uyilmaz
Thank you!

On Wed, 14 Oct 2020 09:41:16 +0200
Szűcs Roland  wrote:

> Hi,
> I got it from the solr user list.
> 
> 
> Roland


-- 
uyilmaz 


Re: Strange fetch streaming expression doesn't fetch fields sometimes?

2020-10-15 Thread uyilmaz
Is it possible to duplicate its functionality using existing expressions?

In SQL, while grouping you can just say first(column) to get one of the 
one-to-many values if you don't care which one you get. Solr usually only has 
aggregation functions such as min, max and avg. If it had a "first" function I 
could just get userid and first(username) in one expression. I sometimes use 
min(username) as a trick while faceting to get extra fields alongside faceted 
results, but max and min only accept numbers in streaming expressions.
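
A minimal sketch of the first(column) idea, done client-side in Python over tuples that have already been read from a stream. This is not an existing Solr aggregation; the function and field names are only illustrative.

```python
def first_per_group(docs, group_field, value_field):
    """Return ({group: first value seen}, {group: count}) in a single pass."""
    first = {}
    counts = {}
    for doc in docs:
        group = doc[group_field]
        counts[group] = counts.get(group, 0) + 1
        # setdefault keeps the value from the first document of each group
        first.setdefault(group, doc.get(value_field))
    return first, counts
```

Which value counts as "first" depends entirely on the order in which the tuples arrive, the same caveat that applies to first(column) in SQL when no ordering is specified.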

On Wed, 14 Oct 2020 20:47:28 -0400
Joel Bernstein  wrote:

> Yes, the docs mention one-to-one and many-to-one fetches, but one-to-many
> is not supported currently. I've never really been happy with fetch. It
> really needs to be replaced with a standard nested loop join that handles
> all scenarios.
> 
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/


-- 
uyilmaz 


Very high disk read rate with an idle solr

2020-10-16 Thread uyilmaz


What can cause a very high (1G/s, which is the max our disks can provide) disk 
read rate that goes on for hours, with a Solr instance not being indexed or 
queried?

Over the last few days our SolrCloud cluster has repeatedly stopped responding 
to queries. Today we stopped indexing and querying it to find out what is going 
on. Two collections seem to be in recovery; can recovery cause this behavior?

Regards and have a nice day

-- 
uyilmaz 


Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread uyilmaz


Hey all,

From my little experiments, I see that (if I didn't make a stupid mistake) we 
can facet on fields marked as both indexed and stored being false:

<field name="..." type="..." indexed="false" stored="false" docValues="true"/>

I'm surprised by this; I thought I would need to index it. Can you confirm this?

Regards

-- 
uyilmaz 


Re: Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread uyilmaz
Thanks! This also contributed to my confusion:

https://lucene.apache.org/solr/guide/8_4/faceting.html#field-value-faceting-parameters

"If you want Solr to perform both analysis (for searching) and faceting on the 
full literal strings, use the copyField directive in your Schema to create two 
versions of the field: one Text and one String. Make sure both are 
indexed="true"."

On Mon, 19 Oct 2020 13:08:00 -0400
Alexandre Rafalovitch  wrote:

> I think this is all explained quite well in the Ref Guide:
> https://lucene.apache.org/solr/guide/8_6/docvalues.html
> 
> DocValues is a different way to index/store values. Faceting is a
> primary use case where docValues are better than what 'indexed=true'
> gives you.
> 
> Regards,
>Alex.


-- 
uyilmaz 


Re: Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread uyilmaz
Thanks for taking time to write a detailed answer.

We use Solr to both store our data and to perform aggregations, using faceting 
or streaming expressions. When required analysis is too complex to do in Solr, 
we export large query results from Solr to a more capable analysis tool.

So I guess all fields need to be docValues="true", because the export handler 
and streaming both require fields to have docValues, and even if I won't use a 
field in queries or facets, it should be available to read in the result set. 
Fields that won't be searched or faceted can be (indexed=false stored=false 
docValues=true), right?

--uyilmaz


On Mon, 19 Oct 2020 14:14:27 -0400
Michael Gibney  wrote:

> As you've observed, it is indeed possible to facet on fields with
> docValues=true, indexed=false; but in almost all cases you should
> probably set indexed=true. 1. for distributed facet count refinement,
> the "indexed" approach is used to look up counts by value; 2. assuming
> you're wanting to do something usual, e.g. allow users to apply
> filters based on facet counts, the filter application would use the
> "indexed" approach as well. Where indexed=false, if either filtering
> or distributed refinement is attempted, I'm not 100% sure what
> happens. It might fail, or lead to inconsistent results, or attempt to
> look up results via the equivalent of a "table scan" over docValues (I
> think the last of these is what actually happens, fwiw) ... but none
> of these options is likely desirable.
> 
> Michael


-- 
uyilmaz 


Re: Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread uyilmaz
Sorry, correction, taking "the" time



-- 
uyilmaz 


Solr tag cloud - words and counts

2020-11-03 Thread uyilmaz


I have been trying to find a way to do this in Solr for a while: perform a 
query and, for a text_general field in the result set, find each term's number 
of occurrences.

- I tried the Terms Component, but it doesn't have the ability to restrict the 
result set with a query.

- I tried faceting on the field. Since it's a text_general field it doesn't have 
docValues, and cardinality is very high (millions of documents * tens of words 
in each field), so it works but it's very slow and sometimes times out.

- I tried the significantTerms streaming expression, but it's logically not the 
same as what I'm looking for. It gives the words that occur frequently in the 
result set but not as frequently outside it, so it is better suited to finding 
frequency anomalies than to simply counting terms.
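
For reference, a minimal sketch of the plain per-term counts being asked for, computed client-side in Python once the matching documents have been retrieved. It assumes the documents are available as dicts; the function name is hypothetical, and the lowercase-and-split tokenization is only a rough stand-in for the text_general analysis chain, so counts can differ from the indexed terms.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+", re.UNICODE)


def term_counts(docs, field, top_n=None):
    """Count term occurrences of `field` across the given documents.

    Documents without the field contribute nothing. Returns a list of
    (term, count) pairs, most frequent first; top_n=None returns all terms.
    """
    counts = Counter()
    for doc in docs:
        counts.update(TOKEN.findall(str(doc.get(field, "")).lower()))
    return counts.most_common(top_n)
```

This moves the counting cost to the client, so it only helps when the result set is small enough to pull out of Solr in the first place.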

Do you have any suggestions?

Regards

-- 
uyilmaz 


Re: docValues usage

2020-11-04 Thread uyilmaz
Hi,

I'm by no means an expert on this, so if anyone sees a mistake please correct me.

I think you need to index this field, since boost functions are added to the 
query as optional clauses 
(https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter).
It's like boosting a regular field by putting ^2 next to it in a query. 
Storing or enabling docValues will unnecessarily consume space/memory.

On Tue, 3 Nov 2020 16:10:50 -0800
Wei  wrote:

> Hi,
> 
> I have a couple of primitive single value numeric type fields,  their
> values are used in boosting functions, but not used in sort/facet. or in
> returned response.   Should I use docValues for them in the schema?  I can
> think of the following options:
> 
>  1)   indexed=true,  stored=true, docValues=false
>  2)   indexed=true, stored=false, docValues=true
>  3)   indexed=false,  stored=false,  docValues=true
> 
> What would be the performance implications for these options?
> 
> Best,
> Wei


-- 
uyilmaz 


when to use stored over docValues and useDocValuesAsStored

2020-11-04 Thread uyilmaz
Hi,

I heavily use streaming expressions and facets, or export large amounts of data 
from Solr to Spark to run analyses.

Please correct me if I'm wrong:

+ requesting a non-docValues field in a response causes the whole document to be 
decompressed and read from disk
+ streaming expressions and the export handler require every field they read to 
have docValues

- docValues increase index size and therefore the memory requirement, while 
stored only uses disk space
- stored preserves the order of multivalued fields

It seems stored is only useful when I have a multivalued field and care about 
the index-time order of its values, and since I will be using the export 
handler, it will use docValues anyway and lose the order.
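
A small Python model of the ordering difference mentioned above. It assumes that a multivalued string docValues field behaves like a sorted set (values sorted, duplicates dropped), while stored returns values as indexed; this is a model for illustration, not Solr code, and the function names are made up.

```python
def as_stored(values):
    """Stored fields return the values exactly as they were indexed."""
    return list(values)


def as_sorted_set_doc_values(values):
    """Model of a multivalued string docValues field: sorted, duplicates dropped.

    This reflects an assumption about the sorted-set representation.
    """
    return sorted(set(values))


indexed = ["b", "a", "b"]
stored_view = as_stored(indexed)                    # order and duplicates kept
doc_values_view = as_sorted_set_doc_values(indexed)  # order and duplicates lost
```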

So is there any case that I need stored=true?

Best,
ufuk

-- 
uyilmaz