Ranking issue when combining sorting and re-ranking on SolrCloud (multiple shards)

2020-05-11 Thread Spyros Kapnissis
Hi all,

On our current master/slave setup (no cloud), we use a custom sorting
function to get the first pass results (using the sort param), and then we
use LTR for re-ranking. This works fine, i.e. re-ranking is applied on the
topN, after sorting has completed and the order is correct.

However, as we are migrating to SolrCloud (version 7.3.1) with multiple
shards, this does not seem to work as expected. To my understanding, Solr
collects the reranked results from the shards back on a single node to
merge them, and then tries to re-apply sorting.

We would expect the results to at least follow the sorting formula, even if
this is not what we want. But this is not even the case, as the
combination of the two (sorting + reranking) results in erratic ordering.
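
For reference, our requests are roughly of this shape (the sort function and
model names here are placeholders, not our real ones):

/select?q=some+query
  &sort=sum(field_a,field_b) desc
  &rq={!ltr model=myModel reRankDocs=200}
  &fl=id,score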

Example result, where $sort_score is the sorting formula output, and score
is the LTR re-ranked output:

{"id": "152",
"$sort_score": 17.38543,
"score": 0.22140852
},
{"id": "2016",
"$sort_score": 14.612957,
"score": 0.19214153
},
{ "id": "1523",
"$sort_score": 14.4093275,
"score": 0.26738763
},
{ "id": "6704",
"$sort_score": 13.956842,
"score": 0.17357588
},
{ "id": "6512",
"$sort_score": 14.43907,
"score": 0.11575622
},

We also tried with other simple re-rank queries apart from LTR, and the
issue persisted.

Could someone please help troubleshoot? Ideally, we would want to have the
re-rank results merged on the single node, and not re-apply sorting.

Thank you!


Re: solr core metrics & prometheus exporter - indexreader is closed

2020-05-11 Thread Richard Goodman
Hey Dwane,

Thanks for your email, gah, I should have mentioned that I had applied the
patches from the 8.x branches onto the exporter already (such as the fixed
thread pooling that you mentioned). I still haven't gotten to the bottom
of the "IndexReader is closed" issue. I found that if that was present on an
instance, even calling just http://ip.address:port/solr/admin/metrics would
return that and 0 metrics. If I added the following parameter to the
call: &regex=^(?!SEARCHER).*
It was all fine. I'm trying to wrap my head around the relationship between
a Solr core and an index searcher/reader in the code, but it's quite
complicated; similarly, I'm trying to understand how I could replicate this
for testing purposes. So if you have any guidance/advice on that area, it
would be greatly appreciated.
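
For reference, the working call with that parameter looks like this (the regex
just excludes every metric whose name starts with SEARCHER; mind the shell
quoting):

curl 'http://ip.address:port/solr/admin/metrics?regex=^(?!SEARCHER).*'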

Cheers,

On Wed, 6 May 2020 at 21:36, Dwane Hall  wrote:

> Hey Richard,
>
> I noticed this issue with the exporter in the 7.x branch. If you look
> through the release notes for Solr since then there have been quite a few
> improvements to the exporter particularly around thread safety and
> concurrency (and the number of nodes it can monitor).  The version of the
> exporter can run independently of your Solr version, so my advice would be
> to download the most recent Solr version, check and modify the exporter
> start script for its library dependencies, extract these files to a
> separate location, and run this version against your 7.x instance. If you
> have the capacity to upgrade your Solr version this will save you having to
> maintain the exporter separately. Since making this change the exporter has
> not missed a beat and we monitor around 100 Solr nodes.
>
> Good luck,
>
> Dwane
> --
> *From:* Richard Goodman 
> *Sent:* Tuesday, 5 May 2020 10:22 PM
> *To:* solr-user@lucene.apache.org 
> *Subject:* solr core metrics & prometheus exporter - indexreader is closed
>
> Hi there,
>
> I've been playing with the prometheus exporter for Solr, and have created
> my config and deployed it. So far, all groups were running fine (node,
> jetty, jvm); however, I'm repeatedly getting an issue with the core group:
>
> WARN  - 2020-05-05 12:01:24.812; org.apache.solr.prometheus.scraper.Async;
> Error occurred during metrics collection
> java.util.concurrent.ExecutionException:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at http://127.0.0.1:8083/solr: Server Error
>
> request:
> http://127.0.0.1:8083/solr/admin/metrics?group=core&wt=json&version=2.2
> at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_141]
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_141]
> at org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
> at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_141]
> at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_141]
> at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[?:1.8.0_141]
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]
> at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]
> at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_141]
> at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_141]
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]
> at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_141]
> at org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43) ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
> at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870) ~[?:1.8.0_141]
> at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852) ~[?:1.8.0_141]
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_141]
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_141]
> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
> at java.util.concurrent.ThreadPoolExecutor.runW

Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Jan Høydahl
Sounds like an anti pattern. Can you explain what search problem you are trying 
to solve with this many unique fields?

Jan Høydahl

> On 11 May 2020 at 07:51, Vignan Malyala wrote:
> 
> Hi
> Is it a good idea to create 100000 dynamic fields of type pint in solr?
> I have that many fields to search on actually, which come up based on
> users.
> 
> Thanks in advance!
> And I'm using Solr Cloud in real-time.
> 
> Regards,
> Sai Vignan M


Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Vignan Malyala
I have around 1M products used by my clients.
Client need a filter of these 1M products by their cost filters.

Just like:
User1 has 5 products (A,B,C,D,E)
User2 has 3 products (D,E,F)
User3 has 10 products (A,B,C,H,I,J,K,L,M,N,O)

...every customer has different sets.

Now they want to search users by filter of product costs:
Product_A_cost :  50 TO 100
Product_D_cost :  0 TO 40

it should return all the users who use products in this filter range.

As I have 1M products, do I need to create dynamic fields for all users
with field names as Product_A_cost and product_B_cost etc. to make a
search by them? If I should, then I have to create 1M dynamic fields.
Or is there any other way?

Hope I'm clear here!


On Mon, May 11, 2020 at 1:47 PM Jan Høydahl  wrote:

> Sounds like an anti pattern. Can you explain what search problem you are
> trying to solve with this many unique fields?
>
> Jan Høydahl
>
> > On 11 May 2020 at 07:51, Vignan Malyala wrote:
> > 
> > Hi
> > Is it a good idea to create 100000 dynamic fields of type pint in solr?
> > I have that many fields to search on actually, which come up based on
> > users.
> >
> > Thanks in advance!
> > And I'm using Solr Cloud in real-time.
> >
> > Regards,
> > Sai Vignan M
>


Re: Response Time Diff between Collection with low deletes

2020-05-11 Thread Ganesh Sethuraman
As detailed below, the collection where we have issues has 16 shards with
2 replicas each.

On Sun, May 10, 2020, 9:10 PM matthew sporleder 
wrote:

> Why so many shards?
>
> > On May 10, 2020, at 9:09 PM, Ganesh Sethuraman 
> wrote:
> >
> > We are using dedicated hosts, CentOS on EC2 r5.12xlarge (48 CPU, ~360GB
> > RAM), 2 nodes. Swappiness set to 1. With general purpose 2TB EBS SSD
> > volumes. JVM size of 18GB, with G1 GC enabled. About 92 collections with
> > an average of 8 shards and 2 replicas each. Most of the updates are daily
> > batch updates.
> >
> > While we have Solr disk utilization of about ~800GB, most of the
> > collection space is for real-time GET (/get) calls. The issue we are
> > having is for a few collections where we have a query use case/need.
> > These have 32 replicas (16 shards, 2 replicas each). During performance
> > tests, the issue is a few calls with high response times; it is
> > noticeable when the test duration is short, and the response times
> > improve when the test runs longer.
> >
> > Hope this information helps.
> >
> > Regards
> > Ganesh
> >
> > Regards
> > Ganesh
> >
> >
> >> On Sun, May 10, 2020, 8:14 PM Shawn Heisey  wrote:
> >>
> >>> On 5/10/2020 4:48 PM, Ganesh Sethuraman wrote:
> >>> The additional info is that when we execute the test for longer (20mins)
> >>> we are seeing better response time, however for a short test (5mins) and
> >>> rerun the test after an hour or so we are seeing slow response times again.
> >>> Note that we don't update the collection during the test or in between the
> >>> test. Does this help to identify the issue?
> >>
> >> Assuming Solr is the only software that is running, most operating
> >> systems would not remove Solr data from the disk cache, so unless you
> >> have other software running on the machine, it's a little weird that
> >> performance drops back down after waiting an hour.  Windows is an
> >> example of an OS that *does* proactively change data in the disk cache,
> >> and on that OS, I would not be surprised by such behavior.  You haven't
> >> mentioned which OS you're running on.
> >>
> >>> 3. We have designed our test to mimic reality where filter cache is not
> >>> hit at all. From solr, we are seeing that there is ZERO filter cache hit.
> >>> There is about 4% query and document cache hit in prod and we are seeing
> >>> no filter cache hit in both QA and PROD.
> >>
> >> If you're getting zero cache hits, you should disable the cache that is
> >> getting zero hits.  There is no reason to waste the memory that the
> >> cache uses, because there is no benefit.
> >>
> >>> Given that, could this be some warming-up related issue to keep the Solr /
> >>> Lucene memory-mapped files in RAM? Is there any way to measure which
> >>> collection is using memory? We do have 350GB RAM, but we see it full with
> >>> buffer cache, not really sure what is really using this memory.
> >>
> >> You would have to ask the OS which files are contained by the OS disk
> >> cache, and it's possible that even if the information is available, that
> >> it is very difficult to get.  There is no way Solr can report this.
> >>
> >> Thanks,
> >> Shawn
> >>
>


Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Jan Høydahl
Sounds like you are looking for parent/child docs here, see 
https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html

{
  "type": "user",
  "name": "user1",
  "products": [
    { "id": "prod_A", "cost": 50},
    { "id": "prod_B", "cost": 200},
    { "id": "prod_D", "cost": 25}
  ]
}

This will index 4 documents - one user document and three product-cost child 
documents.

You can then search the child docs and return matching parents with e.g. 
q=*:*&fq={!parent which="type:user"}((id:prod_A AND cost:[50 TO 100]) OR 
(id:prod_D AND cost:[0 TO 40]))&fl=[child]
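
(A note on the last part: if you also want the matching children inlined under
each parent, the [child] transformer accepts parameters, e.g.
fl=*,[child parentFilter=type:user limit=10]; the exact parameter set here
assumes the 8.x syntax.)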

Hope this helps.

Jan

> On 11 May 2020 at 11:35, Vignan Malyala wrote:
> 
> I have around 1M products used by my clients.
> Client need a filter of these 1M products by their cost filters.
> 
> Just like:
> User1 has 5 products (A,B,C,D,E)
> User2 has 3 products (D,E,F)
> User3 has 10 products (A,B,C,H,I,J,K,L,M,N,O)
> 
> ...every customer has different sets.
> 
> Now they want to search users by filter of product costs:
> Product_A_cost :  50 TO 100
> Product_D_cost :  0 TO 40
> 
> it should return all the users who use products in this filter range.
> 
> As I have 1M products, do I need to create dynamic fields for all users
> with field names as Product_A_cost and product_B_cost etc. to make a
> search by them? If I should, then I have to create 1M dynamic fields.
> Or is there any other way?
> 
> Hope I'm clear here!
> 
> 
> On Mon, May 11, 2020 at 1:47 PM Jan Høydahl  wrote:
> 
>> Sounds like an anti pattern. Can you explain what search problem you are
>> trying to solve with this many unique fields?
>> 
>> Jan Høydahl
>> 
>>> On 11 May 2020 at 07:51, Vignan Malyala wrote:
>>> 
>>> Hi
>>> Is it a good idea to create 100000 dynamic fields of type pint in solr?
>>> I have that many fields to search on actually, which come up based on
>>> users.
>>> 
>>> Thanks in advance!
>>> And I'm using Solr Cloud in real-time.
>>> 
>>> Regards,
>>> Sai Vignan M
>> 



Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Erick Erickson
Creating that many dynamic fields is a bad idea, Solr isn’t
built to handle that many fields. It works, but performance
will decline and I’d guess that this app is sensitive
to response time.

So try Jan’s approach or find another would be my advice.

Best,
Erick

> On May 11, 2020, at 7:37 AM, Jan Høydahl  wrote:
> 
> Sounds like you are looking for parent/child docs here, see 
> https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html
> 
> {
>   "type": "user",
>   "name": "user1",
>   "products": [
>   { "id": "prod_A", "cost": 50},
>   { "id": "prod_B", "cost": 200},
>   { "id": "prod_D", "cost": 25}
>   ]
> }
> 
> This will index 4 documents - one user document and three product-cost child 
> documents.
> 
> You can then search the child docs and return matching parents with e.g. 
> q=*:*&fq={!parent which="type:user"}((id:prod_A AND cost:[50 TO 100]) OR 
> (id:prod_D AND cost:[0 TO 40]))&fl=[child]
> 
> Hope this helps.
> 
> Jan
> 
>> On 11 May 2020 at 11:35, Vignan Malyala wrote:
>> 
>> I have around 1M products used by my clients.
>> Client need a filter of these 1M products by their cost filters.
>> 
>> Just like:
>> User1 has 5 products (A,B,C,D,E)
>> User2 has 3 products (D,E,F)
>> User3 has 10 products (A,B,C,H,I,J,K,L,M,N,O)
>> 
>> ...every customer has different sets.
>> 
>> Now they want to search users by filter of product costs:
>> Product_A_cost :  50 TO 100
>> Product_D_cost :  0 TO 40
>> 
>> it should return all the users who use products in this filter range.
>> 
>> As I have 1M products, do I need to create dynamic fields for all users
>> with field names as Product_A_cost and product_B_cost etc. to make a
>> search by them? If I should, then I have to create 1M dynamic fields.
>> Or is there any other way?
>> 
>> Hope I'm clear here!
>> 
>> 
>> On Mon, May 11, 2020 at 1:47 PM Jan Høydahl  wrote:
>> 
>>> Sounds like an anti pattern. Can you explain what search problem you are
>>> trying to solve with this many unique fields?
>>> 
>>> Jan Høydahl
>>> 
 On 11 May 2020 at 07:51, Vignan Malyala wrote:
 
 Hi
 Is it a good idea to create 100000 dynamic fields of type pint in solr?
 I have that many fields to search on actually, which come up based on
 users.
 
 Thanks in advance!
 And I'm using Solr Cloud in real-time.
 
 Regards,
 Sai Vignan M
>>> 
> 



Re: Max docs and num docs are not matching after optimization

2020-05-11 Thread Erick Erickson
That’s odd, are you absolutely sure that there’s no indexing going on while the 
optimize is running?

Optimizing only works on the closed segments that exist when the process 
_starts_. Any updates that come in while the optimize is running will result in 
new segments that are not optimized, and if any of the new updates are of docs 
that exist when the optimize starts, you'll have deleted docs at the end.

BTW, optimizing is generally unnecessary, be sure you need it. Although 7.7 is 
much
better about some of the consequences than previous versions, see: 
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

If you are not adding docs to the index and still see deleted docs after 
optimizing, it would be helpful to have the exact steps you use so we can try 
to track down the problem.
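
For instance, the before/after counts from the Luke request handler would be 
useful (assuming the default handler path; replace the core name):

curl "http://localhost:8983/solr/your_core/admin/luke?numTerms=0&wt=json"

That reports numDocs, maxDoc and deletedDocs, so you can see exactly when the 
deleted docs reappear.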

Best,
Erick

> On May 11, 2020, at 2:21 AM, Rajdeep Sahoo  wrote:
> 
> Hi all,
> We are using Solr 7.7.2. After optimization, the deleted docs count is
> still showing as part of max docs.
> As per my knowledge, after optimization max docs and num docs count should
> match. It is not happening here. Is there any way to troubleshoot this?



Re: Max docs and num docs are not matching after optimization

2020-05-11 Thread Rajdeep Sahoo
Please help

On Mon, 11 May, 2020, 11:51 AM Rajdeep Sahoo, 
wrote:

> Hi all,
> We are using Solr 7.7.2. After optimization, the deleted docs count is
> still showing as part of max docs.
> As per my knowledge, after optimization max docs and num docs count
> should match. It is not happening here. Is there any way to troubleshoot
> this?
>


Problems when Upgrading from Solr 7.7.1 to 8.5.0

2020-05-11 Thread Ludger Steens
Hi all,

we recently upgraded our SolrCloud cluster from version 7.7.1 to version
8.5.0 and ran into multiple problems.
In the end we had to revert the upgrade and went back to Solr 7.7.1.

In our company we have been using Solr since version 4, and so far upgrading
Solr to a newer version has been possible without any problems.
We are curious if others are experiencing the same kind of problems and if
these are known issues. Or maybe we did something wrong and missed
something when upgrading?


1. Network issues when indexing documents
===

Our collection contains roughly 150 million documents.  When we re-created
the collection and re-indexed all documents, we regularly experienced
network problems that caused our loader application to fail.
The Solr log always contains an IOException:

ERROR (updateExecutor-5-thread-1338-processing-x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22
r:core_node25 null n:solr2:8983_solr c:PSMG_CI_2020_04_15_10_07_04 s:shard6)
[c:PSMG_CI_2020_04_15_10_07_04 s:shard6 r:core_node25 x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22]
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode:
http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ to
http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ =>
java.io.IOException: java.io.IOException: cancel_stream_error
 at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197)
 java.io.IOException: java.io.IOException: cancel_stream_error
 at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
 at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.flush(OutputStreamContentProvider.java:151) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
 at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.write(OutputStreamContentProvider.java:145) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
 at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:216) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
 at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:209) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
 at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:172) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]

After the exception the collection usually was in a degraded state for
some time, with shards trying to recover and sync with the leader.

In the Solr changelog we saw that one major change from 7.x to 8.x was
that Solr now uses HTTP/2 instead of HTTP/1.1. So we tried to disable
HTTP/2 by setting the system property solr.http1=true.
That did make the indexing process a LOT more stable but we still saw
IOExceptions from time to time. Disabling HTTP/2 did not completely fix
the problem.
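
We set the property at JVM startup, i.e. something like this in solr.in.sh
(adjust for your install; on Windows it is solr.in.cmd):

SOLR_OPTS="$SOLR_OPTS -Dsolr.http1=true"

The remaining failures then looked like this: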

ERROR (updateExecutor-5-thread-9310-processing-x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24
r:core_node27 null n:solr3:8983_solr c:PSMG_BOM_2020_04_28_05_00_11 s:shard7)
[c:PSMG_BOM_2020_04_28_05_00_11 s:shard7 r:core_node27 x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24]
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
SolrCmdDistributor$Req: cmd=add{,id=5141653a-e33a-4b60-856d-7aa2ce73dee7};
node=ForwardNode:
http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ to
http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ =>
java.io.IOException: java.io.EOFException:
HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/60}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <-> r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchange=HttpExchange@6ffd260f req=PENDING/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=IDLE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOverHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
 at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197)
 java.io.IOException: java.io.EOFException:
HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/60}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <-> r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchange=HttpExchange@6ffd260f req=PENDING/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=ID

Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Vincenzo D'Amore
But keep in mind that "With the exception of in-place updates, the whole
block must be updated or deleted together, not separately. For some
applications this may result in tons of extra indexing and thus may be a
deal-breaker."

On Mon, May 11, 2020 at 1:37 PM Jan Høydahl  wrote:

> Sounds like you are looking for parent/child docs here, see
> https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html
>
> {
>   "type": "user",
>   "name": "user1",
>   "products": [
>     { "id": "prod_A", "cost": 50},
>     { "id": "prod_B", "cost": 200},
>     { "id": "prod_D", "cost": 25}
>   ]
> }
>
> This will index 4 documents - one user document and three product-cost
> child documents.
>
> You can then search the child docs and return matching parents with e.g.
> q=*:*&fq={!parent which="type:user"}((id:prod_A AND cost:[50 TO 100]) OR
> (id:prod_D AND cost:[0 TO 40]))&fl=[child]
>
> Hope this helps.
>
> Jan
>
> > On 11 May 2020 at 11:35, Vignan Malyala wrote:
> >
> > I have around 1M products used by my clients.
> > Client need a filter of these 1M products by their cost filters.
> >
> > Just like:
> > User1 has 5 products (A,B,C,D,E)
> > User2 has 3 products (D,E,F)
> > User3 has 10 products (A,B,C,H,I,J,K,L,M,N,O)
> >
> > ...every customer has different sets.
> >
> > Now they want to search users by filter of product costs:
> > Product_A_cost :  50 TO 100
> > Product_D_cost :  0 TO 40
> >
> > it should return all the users who use products in this filter range.
> >
> > As I have 1M products, do I need to create dynamic fields for all users
> > with field names as Product_A_cost and product_B_cost etc. to make a
> > search by them? If I should, then I have to create 1M dynamic fields.
> > Or is there any other way?
> >
> > Hope I'm clear here!
> >
> >
> > On Mon, May 11, 2020 at 1:47 PM Jan Høydahl 
> wrote:
> >
> >> Sounds like an anti pattern. Can you explain what search problem you are
> >> trying to solve with this many unique fields?
> >>
> >> Jan Høydahl
> >>
> >>> On 11 May 2020 at 07:51, Vignan Malyala wrote:
> >>>
> >>> Hi
> >>> Is it a good idea to create 100000 dynamic fields of type pint in solr?
> >>> I have that many fields to search on actually, which come up based on
> >>> users.
> >>>
> >>> Thanks in advance!
> >>> And I'm using Solr Cloud in real-time.
> >>>
> >>> Regards,
> >>> Sai Vignan M
> >>
>
>

-- 
Vincenzo D'Amore


Re: solr payloads performance

2020-05-11 Thread Emir Arnautović
Hi Wei,
In order to use payloads you have to use functions, and that's not cheap. To 
make it work fast, you could use the payload function as a post filter and 
filter on some summary field like minPrice/maxPrice/defaultPrice.
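
For example, something along these lines (field and key names borrowed from 
your option 1; cache=false plus a high cost is what makes the frange run as a 
post filter):

fq=minPrice:[50 TO 100]
fq={!frange cache=false cost=200 l=50 u=100}payload(store_prices_payload,store1)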

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 9 May 2020, at 01:26, Wei  wrote:
> 
> Hi everyone,
> 
> Have a question regarding typical  e-commerce scenario: each item may have
> different price in different store. suppose there are 10 million items and
> 1000 stores.
> 
> Option 1:  use solr payloads, each document have
> store_prices_payload:store1|price1 store2|price2  .
> store1000|price1000
> 
> Option 2: use dynamic fields and have 1000 fields in each document, i.e.
>   field1:  store1_price:  price1
>   field2:  store2_price:  price2
>   ...
>   field1000:  store1000_price: price1000
> 
> Option 2 doesn't look elegant,  but is there any performance benchmark on
> solr payloads? In terms of filtering, sorting or faceting, how would query
> performance compare between the two?
> 
> Thanks,
> Wei



Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Vincenzo D'Amore
For in-place updates you should read this:
https://lucene.apache.org/solr/guide/8_1/updating-parts-of-documents.html

On Mon, May 11, 2020 at 2:49 PM Vincenzo D'Amore  wrote:

> But keep in mind that "With the exception of in-place updates, the whole
> block must be updated or deleted together, not separately. For some
> applications this may result in tons of extra indexing and thus may be a
> deal-breaker."
>
> On Mon, May 11, 2020 at 1:37 PM Jan Høydahl  wrote:
>
>> Sounds like you are looking for parent/child docs here, see
>> https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html
>>
>> {
>>   "type": "user",
>>   "name": "user1",
>>   "products": [
>>     { "id": "prod_A", "cost": 50},
>>     { "id": "prod_B", "cost": 200},
>>     { "id": "prod_D", "cost": 25}
>>   ]
>> }
>>
>> This will index 4 documents - one user document and three product-cost
>> child documents.
>>
>> You can then search the child docs and return matching parents with e.g.
>> q=*:*&fq={!parent which="type:user"}((id:prod_A AND cost:[50 TO 100]) OR
>> (id:prod_D AND cost:[0 TO 40]))&fl=[child]
>>
>> Hope this helps.
>>
>> Jan
>>
>> > On 11 May 2020 at 11:35, Vignan Malyala wrote:
>> >
>> > I have around 1M products used by my clients.
>> > Client need a filter of these 1M products by their cost filters.
>> >
>> > Just like:
>> > User1 has 5 products (A,B,C,D,E)
>> > User2 has 3 products (D,E,F)
>> > User3 has 10 products (A,B,C,H,I,J,K,L,M,N,O)
>> >
>> > ...every customer has different sets.
>> >
>> > Now they want to search users by filter of product costs:
>> > Product_A_cost :  50 TO 100
>> > Product_D_cost :  0 TO 40
>> >
>> > it should return all the users who use products in this filter range.
>> >
>> > As I have 1M products, do I need to create dynamic fields for all users
>> > with field names as Product_A_cost and product_B_cost etc. to make a
>> > search by them? If I should, then I have to create 1M dynamic fields.
>> > Or is there any other way?
>> >
>> > Hope I'm clear here!
>> >
>> >
>> > On Mon, May 11, 2020 at 1:47 PM Jan Høydahl 
>> wrote:
>> >
>> >> Sounds like an anti pattern. Can you explain what search problem you
>> are
>> >> trying to solve with this many unique fields?
>> >>
>> >> Jan Høydahl
>> >>
>> >>> On 11 May 2020 at 07:51, Vignan Malyala wrote:
>> >>>
>> >>> Hi
>> >>> Is it a good idea to create 100000 dynamic fields of type pint in solr?
>> >>> I have that many fields to search on actually, which come up based on
>> >>> users.
>> >>>
>> >>> Thanks in advance!
>> >>> And I'm using Solr Cloud in real-time.
>> >>>
>> >>> Regards,
>> >>> Sai Vignan M
>> >>
>>
>>
>
> --
> Vincenzo D'Amore
>
>

-- 
Vincenzo D'Amore


Unified highlighter- unable to get results - can get results with original and termvector highlighters

2020-05-11 Thread Warren, David [USA]
I am running Solr 8.4 and am attempting to use its highlighting feature. It 
appears to work well when I use the original highlighter or the term vector 
highlighter, but when I try to use the unified highlighter, I get no results 
returned.  My Google searches so far have not revealed anybody having this same 
problem (perhaps user error on my part), hence my question to the Solr mailing 
list.

I am running a query which searches the “title_text” field for a term and 
highlights it.
The configuration for title_text is this:


The query looks like this:
https://solr-server/index/c1/select?hl.fl=title_text&hl.method=unified&hl=true&q=title_text%3Azelda

If hl.method=original or hl.method=termvector, I get back results in the 
highlighting section with “Zelda” surrounded by <em> tags.
If hl.method=unified, all results in the highlighting section are blank.

I’ve attached a remote debugger to my Solr server and verified that the unified 
highlighter class (org/apache/solr/highlight/UnifiedSolrHighlighter.java) is 
being invoked when I set hl.method=unified.  And I do not see any errors in the 
Solr logs.

Any idea what I’m doing wrong? In looking at the Solr highlighting 
documentation, I didn’t see any additional configuration which needs to be done 
to get the unified highlighter to work.

I realize I have not provided a bunch of information here, but obviously can 
provide more if needed.

Thank you,
David Warren
Booz | Allen | Hamilton
703-625-0311 mobile



Re: solr payloads performance

2020-05-11 Thread Erik Hatcher
Wei -

Here are some details on the various payload capabilities and shortcomings: 
https://lucidworks.com/post/solr-payloads/

SOLR-10541 is the main functional constraint (range faceting over functions).

Erik

> On May 8, 2020, at 7:26 PM, Wei  wrote:
> 
> Hi everyone,
> 
> Have a question regarding typical  e-commerce scenario: each item may have
> different price in different store. suppose there are 10 million items and
> 1000 stores.
> 
> Option 1:  use solr payloads, each document have
> store_prices_payload:store1|price1 store2|price2  .
> store1000|price1000
> 
> Option 2: use dynamic fields and have 1000 fields in each document, i.e.
>   field1:  store1_price:  price1
>   field2:  store2_price:  price2
>   ...
>   field1000:  store1000_price: price1000
> 
> Option 2 doesn't look elegant,  but is there any performance benchmark on
> solr payloads? In terms of filtering, sorting or faceting, how would query
> performance compare between the two?
> 
> Thanks,
> Wei



Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Joe Obernberger

Could you use a multi-valued field for user in each of your products?

So productA and a field User that is a list of all the users that have 
productA.  Then you could do a search like:


user:User1 AND Product_A_cost:[5 TO 10]
user:(User1 User5...) AND Product_B_cost[0 TO 40]

-Joe

On 5/11/2020 5:35 AM, Vignan Malyala wrote:

I have around 1M products used by my clients.
Client need a filter of these 1M products by their cost filters.

Just like:
User1 has 5 products (A,B,C,D,E)
User2 has 3 products (D,E,F)
User3 has 10 products (A,B,C,H,I,J,K,L,M,N,O)

...every customer has different sets.

Now they want to search users by filter of product costs:
Product_A_cost :  50 TO 100
Product_D_cost :  0 TO 40

it should return all the users who use products in this filter range.

As I have 1M products, do I need to create dynamic fields for all users
with field names as Product_A_cost and product_B_cost etc. to make a
search by them? If I should, then I have to create 1M dynamic fields.
Or is there any other way?

Hope I'm clear here!


On Mon, May 11, 2020 at 1:47 PM Jan Høydahl  wrote:


Sounds like an anti pattern. Can you explain what search problem you are
trying to solve with this many unique fields?

Jan Høydahl


On 11 May 2020 at 07:51, Vignan Malyala wrote:

Hi
Is it a good idea to create 100000 dynamic fields of type pint in solr?
I have that many fields to search on actually, which come up based on
users.

Thanks in advance!
And I'm using Solr Cloud in real-time.

Regards,
Sai Vignan M




8.5.1 LogReplayer extremely slow

2020-05-11 Thread Markus Jelsma
Hello,

Our main Solr text search collection broke down last night (search was still 
working fine); every indexing action timed out, with the Solr master spending 
most of its time in Java regex. One shard has only one replica left for queries 
and it stays like that. I have copied both shards' leaders to a local machine 
to see what is going on.

One shard is fine, but the other has a replica with about 600MB of data to 
replay, and it is extremely slow. Using the VisualVM sampler I find that the 
replayer is also spending almost all of its time dealing with Java regex (stack 
trace below). Is this to be expected? And what is it actually doing? Where do 
the TokenFilters come from?

I had an old but clean collection on the same cluster and started indexing to 
it to see what is going on, but it too timed out due to Java regex. This is 
weird, because locally I have no problem indexing a million records in an 8.5.1 
collection, and the broken-down cluster has been running fine for over a month.

A note: this index uses PreAnalyzedField, so I would expect no analysis 
whatsoever, certainly no regex.

Thanks,
Markus

"replayUpdatesExecutor-3-thread-1-processing-n:127.0.1.1:8983_solr 
x:sitesearch_shard2_replica_t2 c:sitesearch s:shard2 r:core_node4" #222 prio=5 
os_prio=0 cpu=239207,44ms elapsed=239,50s tid=0x7ffde0057000 nid=0x24f5 
runnable  [0x7ffeedd0f000]
   java.lang.Thread.State: RUNNABLE
    at java.util.regex.Pattern$GroupTail.match(java.base@11.0.7/Pattern.java:4863)
    at java.util.regex.Pattern$CharPropertyGreedy.match(java.base@11.0.7/Pattern.java:4306)
    at java.util.regex.Pattern$GroupHead.match(java.base@11.0.7/Pattern.java:4804)
    at java.util.regex.Pattern$CharPropertyGreedy.match(java.base@11.0.7/Pattern.java:4306)
    at java.util.regex.Pattern$Start.match(java.base@11.0.7/Pattern.java:3619)
    at java.util.regex.Matcher.search(java.base@11.0.7/Matcher.java:1729)
    at java.util.regex.Matcher.find(java.base@11.0.7/Matcher.java:746)
    at org.apache.lucene.analysis.pattern.PatternReplaceFilter.incrementToken(PatternReplaceFilter.java:71)
    at org.apache.lucene.analysis.miscellaneous.TrimFilter.incrementToken(TrimFilter.java:42)
    at org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:812)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
    at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:979)
    at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:345)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:292)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:239)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:259)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:489)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:339)
    at org.apache.solr.update.processor.DistributedUpdateProcessor$$Lambda$631/0x000840670c40.apply(Unknown Source)
    at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
    - locked <0xa7df5620> (a org.apache.solr.update.VersionBucket)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:339)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
    at org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
    at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
    at org.apache.solr.update.UpdateLog$LogReplayer.lambda$execute$1(UpdateLog.java:2025)
    at org.apache.solr.update.UpdateLog$LogReplayer$$Lambda$629/0x000840672c40.run(Unknown S

Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
Hi Wei,

In considering this problem, I'm stumbling a bit on terminology
(particularly, where you mention "nodes", I think you're referring to
"replicas"?). Could you confirm that you have 10 TLOG replicas per
shard, for each of 6 shards? How many *nodes* (i.e., running solr
server instances) do you have, and what is the replica placement like
across those nodes? What, if any, non-TLOG replicas do you have per
shard (not that it's necessarily relevant, but just to get a complete
picture of the situation)?

If you're able without too much trouble, can you determine what the
behavior is like on Solr 8.3? (there were different changes introduced
to potentially relevant code in 8.3 and 8.4, and knowing whether the
behavior you're observing manifests on 8.3 would help narrow down
where to look for an explanation).

Michael

On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
>
> Update:  after I remove the shards.preference parameter from
> solrconfig.xml,  issue is gone and internal shard requests are now
> balanced. The same parameter works fine with solr 7.6.  Still not sure of
> the root cause, but I observed a strange coincidence: the nodes that are
> most frequently picked for shard requests are the first node in each shard
> returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> equally compared nodes when shards.preference is set.  Will report back if
> I find more.
>
> On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
>
> > Hi Eric,
> >
> > I am measuring the number of shard requests, and it's for query only, no
> > indexing requests.  I have an external load balancer and see each node
> > received about the equal number of external queries. However for the
> > internal shard queries,  the distribution is uneven:6 nodes (one in
> > each shard,  some of them are leaders and some are non-leaders ) gets about
> > 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> > requests.   I checked a few other parameters set:
> >
> > -Dsolr.disable.shardsWhitelist=true
> > shards.preference=replica.location:local,replica.type:TLOG
> >
> > Nothing seems to cause the strange behavior.  Any suggestions how to
> > debug this?
> >
> > -Wei
> >
> >
> > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson 
> > wrote:
> >
> >> Wei:
> >>
> >> How are you measuring utilization here? The number of incoming requests
> >> or CPU?
> >>
> >> The leader for each shard are certainly handling all of the indexing
> >> requests since they’re TLOG replicas, so that’s one thing that might
> >> skewing your measurements.
> >>
> >> Best,
> >> Erick
> >>
> >> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
> >> >
> >> > Hi everyone,
> >> >
> >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
> >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
> >> one
> >> > of the replicas in each shard is handling most of the distributed shard
> >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> >> There
> >> > is no change in shard handler configuration:
> >> >
> >> >  >> > "HttpShardHandlerFactory">
> >> >
> >> >3
> >> >
> >> >3
> >> >
> >> >500
> >> >
> >> > 
> >> >
> >> >
> >> > What could cause the unbalanced internal distributed request?
> >> >
> >> >
> >> > Thanks in advance.
> >> >
> >> >
> >> >
> >> > Wei
> >>
> >>


Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
Wei, probably no need to answer my earlier questions; I think I see
the problem here, and believe it is indeed a bug, introduced in 8.3.
Will file an issue and submit a patch shortly.
Michael

On Mon, May 11, 2020 at 12:49 PM Michael Gibney
 wrote:
>
> Hi Wei,
>
> In considering this problem, I'm stumbling a bit on terminology
> (particularly, where you mention "nodes", I think you're referring to
> "replicas"?). Could you confirm that you have 10 TLOG replicas per
> shard, for each of 6 shards? How many *nodes* (i.e., running solr
> server instances) do you have, and what is the replica placement like
> across those nodes? What, if any, non-TLOG replicas do you have per
> shard (not that it's necessarily relevant, but just to get a complete
> picture of the situation)?
>
> If you're able without too much trouble, can you determine what the
> behavior is like on Solr 8.3? (there were different changes introduced
> to potentially relevant code in 8.3 and 8.4, and knowing whether the
> behavior you're observing manifests on 8.3 would help narrow down
> where to look for an explanation).
>
> Michael
>
> On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
> >
> > Update:  after I remove the shards.preference parameter from
> > solrconfig.xml,  issue is gone and internal shard requests are now
> > balanced. The same parameter works fine with solr 7.6.  Still not sure of
> > the root cause, but I observed a strange coincidence: the nodes that are
> > most frequently picked for shard requests are the first node in each shard
> > returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> > equally compared nodes when shards.preference is set.  Will report back if
> > I find more.
> >
> > On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
> >
> > > Hi Eric,
> > >
> > > I am measuring the number of shard requests, and it's for query only, no
> > > indexing requests.  I have an external load balancer and see each node
> > > received about the equal number of external queries. However for the
> > > internal shard queries,  the distribution is uneven:6 nodes (one in
> > > each shard,  some of them are leaders and some are non-leaders ) gets 
> > > about
> > > 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> > > requests.   I checked a few other parameters set:
> > >
> > > -Dsolr.disable.shardsWhitelist=true
> > > shards.preference=replica.location:local,replica.type:TLOG
> > >
> > > Nothing seems to cause the strange behavior.  Any suggestions how to
> > > debug this?
> > >
> > > -Wei
> > >
> > >
> > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson 
> > > wrote:
> > >
> > >> Wei:
> > >>
> > >> How are you measuring utilization here? The number of incoming requests
> > >> or CPU?
> > >>
> > >> The leader for each shard are certainly handling all of the indexing
> > >> requests since they’re TLOG replicas, so that’s one thing that might
> > >> skewing your measurements.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
> > >> >
> > >> > Hi everyone,
> > >> >
> > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 
> > >> > 6
> > >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
> > >> one
> > >> > of the replicas in each shard is handling most of the distributed shard
> > >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> > >> There
> > >> > is no change in shard handler configuration:
> > >> >
> > >> >  > >> > "HttpShardHandlerFactory">
> > >> >
> > >> >3
> > >> >
> > >> >3
> > >> >
> > >> >500
> > >> >
> > >> > 
> > >> >
> > >> >
> > >> > What could cause the unbalanced internal distributed request?
> > >> >
> > >> >
> > >> > Thanks in advance.
> > >> >
> > >> >
> > >> >
> > >> > Wei
> > >>
> > >>


What is the logical order of applying sorts in SOLR?

2020-05-11 Thread Stephen Lewis Bianamara
Hi SOLR Community,

What is the order of operations which SOLR applies to sorting? I've
observed many times and across SOLR versions that a restrictive filter with
a sort takes an extremely long time to return, suggesting to me that the
SORT is applied before the filter.

An example situation is querying with fq=Foo:Bar vs querying with fq=Foo:Bar
plus sort=Id desc. I've observed over many SOLR versions and collections that
the former is orders of magnitude cheaper and quicker to respond, even when
the result set is tiny (10-100).

Does anyone in this forum know whether this is the default behavior and
whether there is any way through the API or SOLR configuration to apply
sorts after filters?

Thanks,
Stephen


Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
FYI: https://issues.apache.org/jira/browse/SOLR-14471
Wei, assuming you have only TLOG replicas, your "last place" matches
(to which the random fallback ordering would not be applied -- see
above issue) would be the same as the "first place" matches selected
for executing distributed requests.


On Mon, May 11, 2020 at 1:49 PM Michael Gibney
 wrote:
>
> Wei, probably no need to answer my earlier questions; I think I see
> the problem here, and believe it is indeed a bug, introduced in 8.3.
> Will file an issue and submit a patch shortly.
> Michael
>
> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
>  wrote:
> >
> > Hi Wei,
> >
> > In considering this problem, I'm stumbling a bit on terminology
> > (particularly, where you mention "nodes", I think you're referring to
> > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > server instances) do you have, and what is the replica placement like
> > across those nodes? What, if any, non-TLOG replicas do you have per
> > shard (not that it's necessarily relevant, but just to get a complete
> > picture of the situation)?
> >
> > If you're able without too much trouble, can you determine what the
> > behavior is like on Solr 8.3? (there were different changes introduced
> > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > behavior you're observing manifests on 8.3 would help narrow down
> > where to look for an explanation).
> >
> > Michael
> >
> > On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
> > >
> > > Update:  after I remove the shards.preference parameter from
> > > solrconfig.xml,  issue is gone and internal shard requests are now
> > > balanced. The same parameter works fine with solr 7.6.  Still not sure of
> > > the root cause, but I observed a strange coincidence: the nodes that are
> > > most frequently picked for shard requests are the first node in each shard
> > > returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> > > equally compared nodes when shards.preference is set.  Will report back if
> > > I find more.
> > >
> > > On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
> > >
> > > > Hi Eric,
> > > >
> > > > I am measuring the number of shard requests, and it's for query only, no
> > > > indexing requests.  I have an external load balancer and see each node
> > > > received about the equal number of external queries. However for the
> > > > internal shard queries,  the distribution is uneven:6 nodes (one in
> > > > each shard,  some of them are leaders and some are non-leaders ) gets 
> > > > about
> > > > 80% of the shard requests, the other 54 nodes gets about 20% of the 
> > > > shard
> > > > requests.   I checked a few other parameters set:
> > > >
> > > > -Dsolr.disable.shardsWhitelist=true
> > > > shards.preference=replica.location:local,replica.type:TLOG
> > > >
> > > > Nothing seems to cause the strange behavior.  Any suggestions how to
> > > > debug this?
> > > >
> > > > -Wei
> > > >
> > > >
> > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson 
> > > > wrote:
> > > >
> > > >> Wei:
> > > >>
> > > >> How are you measuring utilization here? The number of incoming requests
> > > >> or CPU?
> > > >>
> > > >> The leader for each shard are certainly handling all of the indexing
> > > >> requests since they’re TLOG replicas, so that’s one thing that might
> > > >> skewing your measurements.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
> > > >> >
> > > >> > Hi everyone,
> > > >> >
> > > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud 
> > > >> > has 6
> > > >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed 
> > > >> > that
> > > >> one
> > > >> > of the replicas in each shard is handling most of the distributed 
> > > >> > shard
> > > >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> > > >> There
> > > >> > is no change in shard handler configuration:
> > > >> >
> > > >> >  > > >> > "HttpShardHandlerFactory">
> > > >> >
> > > >> >3
> > > >> >
> > > >> >3
> > > >> >
> > > >> >500
> > > >> >
> > > >> > 
> > > >> >
> > > >> >
> > > >> > What could cause the unbalanced internal distributed request?
> > > >> >
> > > >> >
> > > >> > Thanks in advance.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Wei
> > > >>
> > > >>


Limiting random results set with facets.

2020-05-11 Thread David Lukowski
I'm looking for a way if possible to run a query with random results, where
I limit the number of results I want back, yet still have the facets
accurately reflect the results I'm searching.

When I run a search I use a filter query to randomize the results based on
a modulo of a random seed. This returns a results set with the associated
facets for each documentType.

"response":{"numFound":377895,"start":0,"docs":[]
  },
  "facet_counts":{
"facet_queries":{},
"facet_fields":{
  "documentType":[
"78",374015,
"3",3021,
"2",736,
"1",41,
"34",41,
"35",32,
"72",8,
"7",1]},

How do I limit the number of results returned to N and have the facets
accurately reflect the number of messages?  I cannot simply say rows=N
because the facets will always reflect the total numFound and not the
limited results set I'm looking for.


Solr 8.1.5 Postlogs - Basic Authentication Error

2020-05-11 Thread Waheed, Imran
Is there a way to use bin/postlogs with basic authentication on? I am getting 
an error if I do not give a username/password.

bin/postlogs http://localhost:8983/solr/logs server/logs/ server/logs

Exception in thread "main" 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://localhost:8983/solr/logs: Expected mime type 
application/octet-stream but got text/html. 


Error 401 require authentication

HTTP ERROR 401 require authentication

URI:/solr/logs/update
STATUS:401
MESSAGE:require authentication
SERVLET:default


I get a different error if I try
bin/postlogs -u user:@password http://localhost:8983/solr/logs server/logs/


SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.solr.util.SolrLogPostTool.gatherFiles(SolrLogPostTool.java:127)
at 
org.apache.solr.util.SolrLogPostTool.main(SolrLogPostTool.java:65)

thank you,
Imran




Re: Limiting random results set with facets.

2020-05-11 Thread Srijan
If you can tag your filter query, you can exclude it when faceting. Your
results will honor the filter query and you will get the N results back,
and since faceting will exclude the filter, it will still give you facet
count for the base query.

https://lucene.apache.org/solr/guide/8_5/faceting.html#tagging-and-excluding-filters
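
For example (a sketch; replace the tagged fq with your actual randomizing
filter):

q=your query
&fq={!tag=rnd}your_random_filter
&rows=N
&facet=true
&facet.field={!ex=rnd}documentType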


On Mon, May 11, 2020 at 3:36 PM David Lukowski 
wrote:

> I'm looking for a way if possible to run a query with random results, where
> I limit the number of results I want back, yet still have the facets
> accurately reflect the results I'm searching.
>
> When I run a search I use a filter query to randomize the results based on
> a modulo of a random seed. This returns a results set with the associated
> facets for each documentType.
>
> "response":{"numFound":377895,"start":0,"docs":[]
>   },
>   "facet_counts":{
> "facet_queries":{},
> "facet_fields":{
>   "documentType":[
> "78",374015,
> "3",3021,
> "2",736,
> "1",41,
> "34",41,
> "35",32,
> "72",8,
> "7",1]},
>
> How do I limit the number of results returned to N and have the facets
> accurately reflect the number of messages?  I cannot simply say rows=N
> because the facets will always reflect the total numFound and not the
> limited results set I'm looking for.
>


Re: Unbalanced shard requests

2020-05-11 Thread Wei
Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other type
of replicas, and each Tlog replica is an individual solr instance on its
own physical machine.  In the jira you mentioned 'when "last place matches"
== "first place matches" – e.g. when shards.preference specified matches
*all* available replicas'.  My setting is
shards.preference=replica.location:local,replica.type:TLOG.
I also tried just shards.preference=replica.location:local and it still has
the issue. Can you explain a bit more?

On Mon, May 11, 2020 at 12:26 PM Michael Gibney 
wrote:

> FYI: https://issues.apache.org/jira/browse/SOLR-14471
> Wei, assuming you have only TLOG replicas, your "last place" matches
> (to which the random fallback ordering would not be applied -- see
> above issue) would be the same as the "first place" matches selected
> for executing distributed requests.
>
>
> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
>  wrote:
> >
> > Wei, probably no need to answer my earlier questions; I think I see
> > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > Will file an issue and submit a patch shortly.
> > Michael
> >
> > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> >  wrote:
> > >
> > > Hi Wei,
> > >
> > > In considering this problem, I'm stumbling a bit on terminology
> > > (particularly, where you mention "nodes", I think you're referring to
> > > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > > server instances) do you have, and what is the replica placement like
> > > across those nodes? What, if any, non-TLOG replicas do you have per
> > > shard (not that it's necessarily relevant, but just to get a complete
> > > picture of the situation)?
> > >
> > > If you're able without too much trouble, can you determine what the
> > > behavior is like on Solr 8.3? (there were different changes introduced
> > > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > > behavior you're observing manifests on 8.3 would help narrow down
> > > where to look for an explanation).
> > >
> > > Michael
> > >
> > > On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
> > > >
> > > > Update:  after I remove the shards.preference parameter from
> > > > solrconfig.xml,  issue is gone and internal shard requests are now
> > > > balanced. The same parameter works fine with solr 7.6.  Still not
> sure of
> > > > the root cause, but I observed a strange coincidence: the nodes that
> are
> > > > most frequently picked for shard requests are the first node in each
> shard
> > > > returned from the CLUSTERSTATUS api.  Seems something wrong with
> shuffling
> > > > equally compared nodes when shards.preference is set.  Will report
> back if
> > > > I find more.
> > > >
> > > > On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
> > > >
> > > > > Hi Eric,
> > > > >
> > > > > I am measuring the number of shard requests, and it's for query
> only, no
> > > > > indexing requests.  I have an external load balancer and see each
> node
> > > > > received about the equal number of external queries. However for
> the
> > > > > internal shard queries,  the distribution is uneven:6 nodes
> (one in
> > > > > each shard,  some of them are leaders and some are non-leaders )
> gets about
> > > > > 80% of the shard requests, the other 54 nodes gets about 20% of
> the shard
> > > > > requests.   I checked a few other parameters set:
> > > > >
> > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > >
> > > > > Nothing seems to cause the strange behavior.  Any suggestions how
> to
> > > > > debug this?
> > > > >
> > > > > -Wei
> > > > >
> > > > >
> > > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
> erickerick...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Wei:
> > > > >>
> > > > >> How are you measuring utilization here? The number of incoming
> requests
> > > > >> or CPU?
> > > > >>
> > > > >> The leader for each shard are certainly handling all of the
> indexing
> > > > >> requests since they’re TLOG replicas, so that’s one thing that
> might
> > > > >> skewing your measurements.
> > > > >>
> > > > >> Best,
> > > > >> Erick
> > > > >>
> > > > >> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
> > > > >> >
> > > > >> > Hi everyone,
> > > > >> >
> > > > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
> cloud has 6
> > > > >> > shards with 10 TLOG replicas each shard.  After upgrade I
> noticed that
> > > > >> one
> > > > >> > of the replicas in each shard is handling most of the
> distributed shard
> > > > >> > requests, so 6 nodes are heavily loaded while other nodes are
> idle.
> > > > >> There
> > > > >> > is no change in shard handler configuration:
> > > > >> >
> > > > >> >  > > > >> > "HttpShardHandlerFactory">
> > > > >> >
> > > > >> >3
> > > > >> >
> > > > >> >3
> > > > >> >
> > > > >> >500
> > > > >> >
> > > 

Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Vignan Malyala
Thank you Jan, Vincenzo and Joe.
This helps us a lot.

On Mon, May 11, 2020 at 10:03 PM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Could you use a multi-valued field for user in each of your products?
>
> So productA and a field User that is a list of all the users that have
> productA.  Then you could do a search like:
>
> user:User1 AND Product_A_cost:[5 TO 10]
> user:(User1 User5...) AND Product_B_cost[0 TO 40]
>
> -Joe
>
> On 5/11/2020 5:35 AM, Vignan Malyala wrote:
> > I have around 1M products used by my clients.
> > Client need a filter of these 1M products by their cost filters.
> >
> > Just like:
> > User1 has 5 products (A,B,C,D,E)
> > User2 has 3 products (D,E,F)
> > User3 has 10 products (A,B,C,H,I,J,K,L,M,N,O)
> >
> > ...every customer has different sets.
> >
> > Now they want to search users by filter of product costs:
> > Product_A_cost :  50 TO 100
> > Product_D_cost :  0 TO 40
> >
> > it should return all the users who use products in this filter range.
> >
> > As I have 1M products, do I need to create dynamic fields for all users
> > with field names as Product_A_cost and product_B_cost etc. to make a
> > search by them? If I should, then I have to create 1M dynamic fields.
> > Or is there any other way?
> >
> > Hope I'm clear here!
> >
> >
> > On Mon, May 11, 2020 at 1:47 PM Jan Høydahl 
> wrote:
> >
> >> Sounds like an anti pattern. Can you explain what search problem you are
> >> trying to solve with this many unique fields?
> >>
> >> Jan Høydahl
> >>
> >>> On 11 May 2020 at 07:51, Vignan Malyala wrote:
> >>>
> >>> Hi
> >>> Is it a good idea to create 100000 dynamic fields of type pint in solr?
> >>> I have that many fields to search on actually, which come up based on
> >>> users.
> >>>
> >>> Thanks in advance!
> >>> And I'm using Solr Cloud in real-time.
> >>>
> >>> Regards,
> >>> Sai Vignan M
> >
>


Re: Creating 100000 dynamic fields in solr

2020-05-11 Thread Vignan Malyala
Thanks Jan! This helps a lot!

Sai Vignan Malyala

On Mon, May 11, 2020 at 5:07 PM Jan Høydahl  wrote:

> Sounds like you are looking for parent/child docs here, see
> https://lucene.apache.org/solr/guide/8_5/indexing-nested-documents.html
>
> {
>   "type": "user",
>   "name": "user1",
>   "products": [
>     { "id": "prod_A", "cost": 50},
>     { "id": "prod_B", "cost": 200},
>     { "id": "prod_D", "cost": 25}
>   ]
> }
>
> This will index 4 documents - one user document and three product-cost
> child documents.
>
> You can then search the child docs and return matching parents with e.g.
> q=*:*&fq={!parent which="type:user"}((id:prod_A AND cost:[50 TO 100]) OR
> (id:prod_D AND cost:[0 TO 40]))&fl=[child]
>
> Hope this helps.
>
> Jan
>
> > On 11 May 2020 at 11:35, Vignan Malyala wrote:
> >
> > I have around 1M products used by my clients.
> > Client need a filter of these 1M products by their cost filters.
> >
> > Just like:
> > User1 has 5 products (A,B,C,D,E)
> > User2 has 3 products (D,E,F)
> > User3 has 10 products (A,B,C,H,I,J,K,L,M,N,O)
> >
> > ...every customer has different sets.
> >
> > Now they want to search users by filter of product costs:
> > Product_A_cost :  50 TO 100
> > Product_D_cost :  0 TO 40
> >
> > it should return all the users who use products in this filter range.
> >
> > As I have 1M products, do I need to create dynamic fields for all users
> > with field names as Product_A_cost and product_B_cost etc. to make a
> > search by them? If I should, then I have to create 1M dynamic fields.
> > Or is there any other way?
> >
> > Hope I'm clear here!
> >
> >
> > On Mon, May 11, 2020 at 1:47 PM Jan Høydahl 
> wrote:
> >
> >> Sounds like an anti pattern. Can you explain what search problem you are
> >> trying to solve with this many unique fields?
> >>
> >> Jan Høydahl
> >>
> >>> On 11 May 2020 at 07:51, Vignan Malyala wrote:
> >>>
> >>> Hi
> >>> Is it a good idea to create 100000 dynamic fields of type pint in solr?
> >>> I have that many fields to search on actually, which come up based on
> >>> users.
> >>>
> >>> Thanks in advance!
> >>> And I'm using Solr Cloud in real-time.
> >>>
> >>> Regards,
> >>> Sai Vignan M
> >>
>
>