Re: DataImportHandler SolrEntityProcessor configuration for local copy

2020-02-06 Thread Mikhail Khludnev
Hello, Karl.
Please check these:
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#constraints-when-using-cursors

https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#solrentityprocessor
 cursorMark="true"
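For illustration, a cursor-enabled entity might look roughly like the sketch below (the sort attribute value and the surrounding entity layout are assumptions, not taken from the original config; the cursor requires a sort on the uniqueKey field):

<dataConfig>
  <document>
    <entity processor="SolrEntityProcessor"
            query="*:*"
            sort="id asc"
            cursorMark="true"
            rows="100"
            fl="*,old_version:_version_"
            wt="javabin"
            url="http://127.0.0.1/solr/at-uk"/>
  </document>
</dataConfig>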
Good luck.


On Wed, Feb 5, 2020 at 10:06 PM Karl Stoney
 wrote:

> Hey All,
> I'm trying to implement a simplistic reindex strategy to copy all of the
> data out of one collection, into another, on a single node (no distributed
> queries).
>
> It's approx 4 million documents, with an index size of 26gig.  Based on
> your experience, I'm wondering what people feel sensible values for the
> SolrEntityProcessor are (to give me a sensible starting point, to save me
> iterating over loads of them).
>
> This is where I'm at right now.  I know `rows` would increase memory
> pressure but speed up the copy, I can't really find anywhere online where
> people have benchmarked different values for rows and the default (50)
> seems quite low.
>
> <dataConfig>
>   <document>
>     <entity processor="SolrEntityProcessor"
>             query="*:*"
>             rows="100"
>             fl="*,old_version:_version_"
>             wt="javabin"
>             url="http://127.0.0.1/solr/at-uk">
>     </entity>
>   </document>
> </dataConfig>
>
> Any suggestions are welcome.
> Thanks
>


-- 
Sincerely yours
Mikhail Khludnev


Re: DataImportHandler SolrEntityProcessor configuration for local copy

2020-02-06 Thread Karl Stoney
I cannot believe how much of a difference cursorMark and the sort order made.
Previously it died at about 800k docs; now we're at 1.2m without any slowdown.

Thank you so much

On 06/02/2020, 08:14, "Mikhail Khludnev"  wrote:

Hello, Karl.
Please check these:

https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#constraints-when-using-cursors


https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#solrentityprocessor
 cursorMark="true"
Good luck.


On Wed, Feb 5, 2020 at 10:06 PM Karl Stoney
 wrote:

> Hey All,
> I'm trying to implement a simplistic reindex strategy to copy all of the
> data out of one collection, into another, on a single node (no distributed
> queries).
>
> It's approx 4 million documents, with an index size of 26gig.  Based on
> your experience, I'm wondering what people feel sensible values for the
> SolrEntityProcessor are (to give me a sensible starting point, to save me
> iterating over loads of them).
>
> This is where I'm at right now.  I know `rows` would increase memory
> pressure but speed up the copy, I can't really find anywhere online where
> people have benchmarked different values for rows and the default (50)
> seems quite low.
>
> 
> 
>  query="*:*"
>  rows="100"
>  fl="*,old_version:_version_"
>  wt="javabin"
>  url="http://127.0.0.1/solr/at-uk">
>
> 
> 
>
> Any suggestions are welcome.
> Thanks
>


--
Sincerely yours
Mikhail Khludnev




Re: DataImportHandler SolrEntityProcessor configuration for local copy

2020-02-06 Thread Karl Stoney
Spoke too soon - it looks like it leaks memory.  After about 1.3m docs the old GC times
went through the roof and Solr was almost unresponsive; we had to abort.  We're
going to write our own implementation, running outside of Solr, to copy data from
one core to another.

On 06/02/2020, 09:57, "Karl Stoney"  wrote:

I cannot believe how much of a difference that cursorMark and sort order 
made.
Previously it died about 800k docs, now we're at 1.2m without any slowdown.

Thank you so much

On 06/02/2020, 08:14, "Mikhail Khludnev"  wrote:

Hello, Karl.
Please check these:

https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#constraints-when-using-cursors


https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#solrentityprocessor
 cursorMark="true"
Good luck.


On Wed, Feb 5, 2020 at 10:06 PM Karl Stoney
 wrote:

> Hey All,
> I'm trying to implement a simplistic reindex strategy to copy all of 
the
> data out of one collection, into another, on a single node (no 
distributed
> queries).
>
> It's approx 4 million documents, with an index size of 26gig.  Based 
on
> your experience, I'm wondering what people feel sensible values for 
the
> SolrEntityProcessor are (to give me a sensible starting point, to 
save me
> iterating over loads of them).
>
> This is where I'm at right now.  I know `rows` would increase memory
> pressure but speed up the copy, I can't really find anywhere online 
where
> people have benchmarked different values for rows and the default (50)
> seems quite low.
>
> 
> 
>  query="*:*"
>  rows="100"
>  fl="*,old_version:_version_"
>  wt="javabin"
>  url="http://127.0.0.1/solr/at-uk">
>
> 
> 
>
> Any suggestions are welcome.
> Thanks
>


--
Sincerely yours
Mikhail Khludnev






Re: Bug? Documents not visible after successful commit - chaos testing

2020-02-06 Thread Michael Frank
Hi Chris,
thank you for your detailed answer!

We are aware that Solr Cloud is eventually consistent and in our
application that's fine in most cases.
However, what is really important for us is that we get "Read Your
Writes" for a clear point in time - which, in our understanding, should be after
a hard commit with waitSearcher=true returns successfully from all replicas. Is
that correct?
The client that indexes new documents performs a hard commit with
waitSearcher=true, and after that was successful we expect the documents to
be visible on all replicas.
This seems to work as expected if the cluster is in a healthy state.
If we shut down nodes while updating documents and committing we observe
that commits somehow get lost.
The documents are neither visible on the leader nor on any replica! Even
after all nodes and replicas are up again.
And we don't get any error or exception from the Solrj client.
Is there any way to make sure that a commit is executed successfully on
_every_ replica (and fail if the replica is currently down or recovering)?
Or to get notified that the commit could not be executed because the
cluster is in an unhealthy state?
If we can confirm and verify this in our Indexing client, we could detect
failures and recover.

I don't think the /get request handler is an option for us, because it
only accepts document IDs and not the search queries we rely on heavily.
Is that correct?
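
For reference, a check along those lines via the real-time get handler could look roughly like this (host, collection and document ID are placeholders, not from the thread):

import requests

BASE = "http://localhost:8983/solr/mycollection"  # hypothetical collection

# Hard commit, waiting for the new searcher to be opened before returning.
requests.get(f"{BASE}/update", params={"commit": "true", "waitSearcher": "true"})

# /get consults the transaction log, so it reflects the latest accepted update
# even while a replica is still recovering; it only works by document ID.
doc = requests.get(f"{BASE}/get", params={"id": "doc-42"}).json().get("doc")
print("visible" if doc else "missing")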


: FYI: there is no need to send a softCommit after a hardCommit
Agreed, that was just us experimenting and trying stuff.

: So to be clear: 'rf=2' means a total of 2 replicas confirmed the update
-- that includes the leader replica.  'rf=1' means the leader accepted the
doc, but all other replicas are down.
: if you want to be 100% certain that every replica received the update,
then you should be confirming rf=3
Agreed, should have been more clear. We have multiple test scenarios. Some
with 2 replicas (1 leader 1 reps) and some with 3 (1 leader, 2 reps). In
the first mail i just picked the simplest test setup that failed,
consisting of one leader and one replica - so technically we could
reproduce the error in a two node cluster.
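
As an aside, the achieved replication factor mentioned above can be checked per update request; a rough sketch (host, collection and field names are placeholders, and exactly where 'rf' appears in the response can vary by version):

import json
import requests

BASE = "http://localhost:8983/solr/mycollection"  # hypothetical collection

docs = [{"id": "doc-42", "title_s": "chaos test"}]
resp = requests.post(
    f"{BASE}/update",
    params={"min_rf": 3, "wt": "json"},  # ask Solr to report the achieved replication factor
    data=json.dumps(docs),
    headers={"Content-Type": "application/json"},
).json()

rf = resp.get("responseHeader", {}).get("rf", resp.get("rf"))
print("achieved rf:", rf)  # fail the batch in the indexing client if this is below 3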

Cheers,
Michael

On Thu, Feb 6, 2020 at 01:42, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> I may be misunderstanding something in your setup, and/or I may be
> misremembering things about Solr, but I think the behavior you are
> seeing is because *search* in solr is "eventually consistent" -- while
> "RTG" (ie: using the /get" handler) is (IIRC) "strongly consistent"
>
> ie: there's a reason it's called "Near Real Time Searching" and "NRT
> Replica" ... not "RT Replica"
>
> When you kill a node hosting a replica, then send an update which a leader
> accepts but can't send to that replica, that replica is now "out of sync"
> and will continue to be out of sync when it comes back online and starts
> responding to search requests as it recovers from the leader/tlog --
> eventually the search will have consistent results across all replicas,
> but during the recovery period this isn't guaranteed.
>
> If however you use the /get request handler, then it (again, IIRC)
> consults the tlog for the latest version of the doc even if it's
> mid-recovery and the index itself isn't yet up to date.
>
> So for the purposes of testing solr as a "strongly consistent" document
> store, using /get?id=foo to check the "current" data in the document is
> more appropriate than /select?q=id:foo
>
> Some more info here...
>
> https://lucene.apache.org/solr/guide/8_4/solrcloud-resilience.html
> https://lucene.apache.org/solr/guide/8_4/realtime-get.html
>
>
> A few other things that jumped out at me in your email that seemed weird
> or worthy of comment...
>
> : Accordung to solrs documentation, a commit with openSearcher=true and
> : waitSearcher=true and waitFlush=true only returns once everything is
> : presisted AND the new searcher is visible.
> :
> : To me this sounds like that any subsequent request after a successful
> : commit MUST hit the new searcher and is guaranteed to see the commit
> : changes, regardless of node failures or restarts.
>
> that is true for *single* node Solr, or a "healthy" cluster, but as I
> mentioned if a node is down when the "commit" happens it won't have the
> document yet -- nor is it alive to process the commit.  the document
> update -- and the commit -- are in the tlog that still needs to replay
> when the replica comes back online
>
> :- A test-collection with 1 Shard and 2 NRT Replicas.
>
> I'm guessing since you said you were using 3 nodes, that what you
> mean here is a single shard with a total of 3 replicas which are all NRT
> -- remember the "leader" is still itself an NRT  replica.
>
> (i know, i know ... i hate the terminology)
>
> This is a really important point to clarify in your testing because of how
> you are using 'rf' ... seeing exactly how you create your collection is
> important to make sure we're talking about the same thing.
>

Re: DataImportHandler SolrEntityProcessor configuration for local copy

2020-02-06 Thread Mikhail Khludnev
Egor, would you mind sharing some best practices regarding cursorMark in
SolrEntityProcessor?

On Thu, Feb 6, 2020 at 1:04 PM Karl Stoney
 wrote:

> Spoke too soon, looks like it memory leaks.  After about 1.3m the old gc
> times went through the root and solr was almost unresponsive, had to
> abort.  We're going to write our own implementation to copy data from one
> core to another that runs outside of solr.
>
> On 06/02/2020, 09:57, "Karl Stoney"  wrote:
>
> I cannot believe how much of a difference that cursorMark and sort
> order made.
> Previously it died about 800k docs, now we're at 1.2m without any
> slowdown.
>
> Thank you so much
>
> On 06/02/2020, 08:14, "Mikhail Khludnev"  wrote:
>
> Hello, Karl.
> Please check these:
>
> https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#constraints-when-using-cursors
>
>
> https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#solrentityprocessor
>  cursorMark="true"
> Good luck.
>
>
> On Wed, Feb 5, 2020 at 10:06 PM Karl Stoney
>  wrote:
>
> > Hey All,
> > I'm trying to implement a simplistic reindex strategy to copy
> all of the
> > data out of one collection, into another, on a single node (no
> distributed
> > queries).
> >
> > It's approx 4 million documents, with an index size of 26gig.
> Based on
> > your experience, I'm wondering what people feel sensible values
> for the
> > SolrEntityProcessor are (to give me a sensible starting point,
> to save me
> > iterating over loads of them).
> >
> > This is where I'm at right now.  I know `rows` would increase
> memory
> > pressure but speed up the copy, I can't really find anywhere
> online where
> > people have benchmarked different values for rows and the
> default (50)
> > seems quite low.
> >
> > 
> > 
> > >  query="*:*"
> >  rows="100"
> >  fl="*,old_version:_version_"
> >  wt="javabin"
> >  url="http://127.0.0.1/solr/at-uk">
> >
> > 
> > 
> >
> > Any suggestions are welcome.
> > Thanks
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
>
>
>
>


-- 
Sincerely yours
Mikhail Khludnev


migrating my application

2020-02-06 Thread Carmen Márquez Vázquez
Hello, I am migrating my application that uses Solr 4.4.0 to use Solr 8.2.0.
I have the following code that I am unable to migrate.
Can you help me?
new ChainedFilter(filters.toArray(new Filter[filters.size()]), ChainedFilter.OR);
Thanks in advance.
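
For context (not part of the original question): the Filter API was removed in later Lucene versions, so an OR-chained filter is usually rebuilt as a BooleanQuery over the equivalent queries. A rough sketch, assuming the old Filter instances have already been rewritten as Query objects:

import java.util.List;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;

public class FilterMigration {
    // Roughly what new ChainedFilter(filters.toArray(...), ChainedFilter.OR) did:
    // match documents satisfying any clause, without contributing to the score.
    static Query orFilters(List<Query> filterQueries) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (Query q : filterQueries) {
            builder.add(q, Occur.SHOULD); // OR semantics
        }
        return new ConstantScoreQuery(builder.build());
    }
}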


Storage/Volume type for Kubernetes Solr POD?

2020-02-06 Thread Susheel Kumar
Hello,

What type of storage/volume is recommended to run Solr in a Kubernetes pod?
I know that in the past Solr had issues with storing its indexes on NFS, and it
was not recommended.

https://kubernetes.io/docs/concepts/storage/volumes/

Thanks,
Susheel


Re: Checking in on Solr Progress

2020-02-06 Thread Erick Erickson
When you say “look”, where are you looking from? Http requests? SolrJ? The 
admin UI?

ZooKeeper is always the keeper of the state, so when the replica is “active” _AND_
the replica’s node is in the “live_nodes” list, it’s up.

The Collections API CLUSTERSTATUS can help here if you’re not using Solrj.
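
For illustration, a minimal CLUSTERSTATUS check from outside SolrJ might look like this (host and collection name are placeholders, not from the thread):

import requests

# CLUSTERSTATUS returns replica states plus the live_nodes list.
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={"action": "CLUSTERSTATUS", "collection": "mycollection", "wt": "json"},
).json()

live_nodes = set(resp["cluster"]["live_nodes"])
shards = resp["cluster"]["collections"]["mycollection"]["shards"]
for shard_name, shard in shards.items():
    for replica_name, replica in shard["replicas"].items():
        up = replica["state"] == "active" and replica["node_name"] in live_nodes
        print(shard_name, replica_name, replica["state"], "UP" if up else "NOT UP")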

Best,
Erick

> On Feb 5, 2020, at 5:59 PM, dj-manning  wrote:
> 
> Hi - I'm wondering if you would be able to point me in the right direction -
> I'm looking for the best way to check Solr recovery progress and status.
> 
> I've seen a replica fall into recovery and I was wondering where I should
> look to monitor progress.
> 
> Thank you in advance.
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: NoClassDefFoundError - Faceting on 8.2.0

2020-02-06 Thread Erick Erickson
My first guess is that you have multiple or out-of-date jars in your classpath 
on those machines.

Best,
Erick

> On Feb 5, 2020, at 5:53 PM, Joe Obernberger  
> wrote:
> 
> Hi All - I'm getting this error intermittently on a SolrCloud cluster.
> Sometimes the heatmap generation works, sometimes not.  I tracked it down to
> some of the nodes reporting this error:
> 
> null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: Could not 
> initialize class org.apache.solr.search.facet.FacetHeatmap$PngHelper
>   at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:733)
>   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:591)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:423)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
>   at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
>   at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>   at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>   at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
>   at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
>   at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
>   at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>   at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>   at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>   at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>   at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>   at org.eclipse.jetty.server.Server.handle(Server.java:505)
>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>   at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:311)
>   at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
>   at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
>   at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
>   at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
>   at 
> org.eclipse.jetty.http2.HTTP2Connection.produce(HTTP2Connection.java:170)
>   at 
> org.eclipse.jetty.http2.HTTP2Connection.onFillable(HTTP2Connection.java:125)
>   at 
> org.eclipse.jetty.http2.HTTP2Connection$FillableCallback.succeeded(HTTP2Connection.java:348)
>   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>   at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>   at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
>   at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
>   at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
>   at 
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
>   at 
> org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
>   at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.solr.search.facet.FacetHeatmap$PngHelper
>   at 
> org.apache.solr.search.facet.FacetHeatmap.asPngBytes(FacetHeatmap.java:406)
>   at 
> org.apache.solr.search.facet.FacetHeatmap.formatCountsVal(FacetHeatmap.java:295)
>   at 
> org.apache.solr.search.facet.FacetHeatmap.access$500(F

Re: StatelessScriptUpdateProcessorFactory causing OOM errors?

2020-02-06 Thread Erick Erickson
How many fields do you wind up having? It looks on a quick glance like
it depends on the values of fields. While I’ve seen Solr/Lucene handle
indexes with over 1M different fields, it’s unsatisfactory.

What I’m wondering is if you are adding a zillion different fields to your
docs as time passes and eventually the structures that are needed to
maintain your field mappings are blowing up memory.

If that’s that case, you need an alternative design because your
performance will be unacceptable.

May be off base, if so we can dig further.
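
As an aside, one quick way to see how many fields have actually accumulated in an index is the Luke request handler; a rough sketch with placeholder host and collection:

import requests

# /admin/luke reports the fields present in the index (numTerms=0 keeps it cheap).
resp = requests.get(
    "http://localhost:8983/solr/mycollection/admin/luke",
    params={"numTerms": 0, "wt": "json"},
)
fields = resp.json().get("fields", {})
print(f"{len(fields)} fields in the index")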

Best,
Erick

> On Feb 5, 2020, at 3:41 PM, Haschart, Robert J (rh9ec)  
> wrote:
> 
> StatelessScriptUpdateProcessorFactory



JSON from Term Vectors Component

2020-02-06 Thread Doug Turnbull
Hi all,

I was curious if anyone had any tips on parsing the JSON response of the
term vectors component? Or any way to force it to be more standard JSON? It
appears to be very heavily nested and idiosyncratic JSON, such as below.

Notice the lists within lists within lists, where the keys and values are
adjacent items in the list. Is there a reason this isn't a JSON dictionary? Instead
you have to build a stateful list parser that just seems prone to errors...

Any thoughts or ideas are very welcome, I probably just need to do
something rather simple here...

"termVectors": [
"D10", [
"uniqueKey", "D10",
"body", [
"1", [
"positions", [
"position", 92,
"position", 113
]
],
"10", [ ...

-- 
*Doug Turnbull **| CTO* | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: JSON from Term Vectors Component

2020-02-06 Thread Munendra S N
>
> Notice the lists, within lists, within lists. Where the keys are adjacent
> items in the list. Is there a reason this isn't a JSON dictionary?
>
I think this is because of NamedList. Have you tried using json.nl=map as a
query parameter for this case?
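
For illustration, the parameter just goes on the term vectors request; a hypothetical example (host, collection and the /tvrh handler path are assumptions):

import requests

params = {
    "q": "id:D10",
    "tv": "true",
    "tv.positions": "true",
    "wt": "json",
    "json.nl": "map",  # render NamedLists as JSON objects instead of flat lists
}
resp = requests.get("http://localhost:8983/solr/mycollection/tvrh", params=params)
print(resp.json()["termVectors"])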

Regards,
Munendra S N



On Thu, Feb 6, 2020 at 10:01 PM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Hi all,
>
> I was curious if anyone had any tips on parsing the JSON response of the
> term vectors component? Or anyway to force it to be more standard JSON? It
> appears to be very heavily nested and idiosyncratic JSON, such as below.
>
> Notice the lists, within lists, within lists. Where the keys are adjacent
> items in the list. Is there a reason this isn't a JSON dictionary? Instead
> you have to build a stateful list parser that just seems prone to errors...
>
> Any thoughts or ideas are very welcome, I probably just need to do
> something rather simple here...
>
> "termVectors": [
> "D10", [
> "uniqueKey", "D10",
> "body", [
> "1", [
> "positions", [
> "position", 92,
> "position", 113
> ]
> ],
> "10", [ ...
>
> --
> *Doug Turnbull **| CTO* | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>


Re: JSON from Term Vectors Component

2020-02-06 Thread Doug Turnbull
Thanks for the tip,

The issue is json.nl=map produces non-standard JSON with duplicate keys. Solr
generates the following, which fails JSON lint because of the repeated keys:

{
  "positions": {
    "position": 155,
    "position": 844,
    "position": 1726
  }
}

On Thu, Feb 6, 2020 at 11:36 AM Munendra S N 
wrote:

> >
> > Notice the lists, within lists, within lists. Where the keys are adjacent
> > items in the list. Is there a reason this isn't a JSON dictionary?
> >
> I think this is because of NamedList. Have you tried using json.nl=map as
> a
> query parameter for this case?
>
> Regards,
> Munendra S N
>
>
>
> On Thu, Feb 6, 2020 at 10:01 PM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
> > Hi all,
> >
> > I was curious if anyone had any tips on parsing the JSON response of the
> > term vectors component? Or anyway to force it to be more standard JSON?
> It
> > appears to be very heavily nested and idiosyncratic JSON, such as below.
> >
> > Notice the lists, within lists, within lists. Where the keys are adjacent
> > items in the list. Is there a reason this isn't a JSON dictionary?
> Instead
> > you have to build a stateful list parser that just seems prone to
> errors...
> >
> > Any thoughts or ideas are very welcome, I probably just need to do
> > something rather simple here...
> >
> > "termVectors": [
> > "D10", [
> > "uniqueKey", "D10",
> > "body", [
> > "1", [
> > "positions", [
> > "position", 92,
> > "position", 113
> > ]
> > ],
> > "10", [ ...
> >
> > --
> > *Doug Turnbull **| CTO* | OpenSource Connections
> > , LLC | 240.476.9983
> > Author: Relevant Search 
> > This e-mail and all contents, including attachments, is considered to be
> > Company Confidential unless explicitly stated otherwise, regardless
> > of whether attachments are marked as such.
> >
>


-- 
*Doug Turnbull **| CTO* | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: JSON from Term Vectors Component

2020-02-06 Thread Walter Underwood
Repeated keys are quite legal in JSON, but many libraries don’t support that.

It does look like that data layout could be redesigned to be more portable.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 6, 2020, at 8:38 AM, Doug Turnbull 
>  wrote:
> 
> Thanks for the tip,
> 
> The issue is json.nl produces non-standard json with duplicate keys. Solr
> generates the following, which json lint fails given multiple keys
> 
> {
> "positions": {
> "position": 155,
> "position": 844,
> "position": 1726
> }
> }
> 
> On Thu, Feb 6, 2020 at 11:36 AM Munendra S N 
> wrote:
> 
>>> 
>>> Notice the lists, within lists, within lists. Where the keys are adjacent
>>> items in the list. Is there a reason this isn't a JSON dictionary?
>>> 
>> I think this is because of NamedList. Have you tried using json.nl=map as
>> a
>> query parameter for this case?
>> 
>> Regards,
>> Munendra S N
>> 
>> 
>> 
>> On Thu, Feb 6, 2020 at 10:01 PM Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>> 
>>> Hi all,
>>> 
>>> I was curious if anyone had any tips on parsing the JSON response of the
>>> term vectors component? Or anyway to force it to be more standard JSON?
>> It
>>> appears to be very heavily nested and idiosyncratic JSON, such as below.
>>> 
>>> Notice the lists, within lists, within lists. Where the keys are adjacent
>>> items in the list. Is there a reason this isn't a JSON dictionary?
>> Instead
>>> you have to build a stateful list parser that just seems prone to
>> errors...
>>> 
>>> Any thoughts or ideas are very welcome, I probably just need to do
>>> something rather simple here...
>>> 
>>> "termVectors": [
>>> "D10", [
>>> "uniqueKey", "D10",
>>> "body", [
>>> "1", [
>>> "positions", [
>>> "position", 92,
>>> "position", 113
>>> ]
>>> ],
>>> "10", [ ...
>>> 
>>> --
>>> *Doug Turnbull **| CTO* | OpenSource Connections
>>> , LLC | 240.476.9983
>>> Author: Relevant Search 
>>> This e-mail and all contents, including attachments, is considered to be
>>> Company Confidential unless explicitly stated otherwise, regardless
>>> of whether attachments are marked as such.
>>> 
>> 
> 
> 
> -- 
> *Doug Turnbull **| CTO* | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.



Re: JSON from Term Vectors Component

2020-02-06 Thread Doug Turnbull
Well that is interesting, I did not know that! Thanks Walter...

https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object

I gave it a go in Python (what I'm using) to see what would happen, indeed
it gives some odd behavior

In [4]: jsonStr = ' {"test": 1, "test": 2} '


In [5]: json.loads(jsonStr)

Out[5]: {'test': 2}

On Thu, Feb 6, 2020 at 11:49 AM Walter Underwood 
wrote:

> Repeated keys are quite legal in JSON, but many libraries don’t support
> that.
>
> It does look like that data layout could be redesigned to be more portable.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 6, 2020, at 8:38 AM, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> >
> > Thanks for the tip,
> >
> > The issue is json.nl produces non-standard json with duplicate keys.
> Solr
> > generates the following, which json lint fails given multiple keys
> >
> > {
> > "positions": {
> > "position": 155,
> > "position": 844,
> > "position": 1726
> > }
> > }
> >
> > On Thu, Feb 6, 2020 at 11:36 AM Munendra S N 
> > wrote:
> >
> >>>
> >>> Notice the lists, within lists, within lists. Where the keys are
> adjacent
> >>> items in the list. Is there a reason this isn't a JSON dictionary?
> >>>
> >> I think this is because of NamedList. Have you tried using json.nl=map
> as
> >> a
> >> query parameter for this case?
> >>
> >> Regards,
> >> Munendra S N
> >>
> >>
> >>
> >> On Thu, Feb 6, 2020 at 10:01 PM Doug Turnbull <
> >> dturnb...@opensourceconnections.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I was curious if anyone had any tips on parsing the JSON response of
> the
> >>> term vectors component? Or anyway to force it to be more standard JSON?
> >> It
> >>> appears to be very heavily nested and idiosyncratic JSON, such as
> below.
> >>>
> >>> Notice the lists, within lists, within lists. Where the keys are
> adjacent
> >>> items in the list. Is there a reason this isn't a JSON dictionary?
> >> Instead
> >>> you have to build a stateful list parser that just seems prone to
> >> errors...
> >>>
> >>> Any thoughts or ideas are very welcome, I probably just need to do
> >>> something rather simple here...
> >>>
> >>> "termVectors": [
> >>> "D10", [
> >>> "uniqueKey", "D10",
> >>> "body", [
> >>> "1", [
> >>> "positions", [
> >>> "position", 92,
> >>> "position", 113
> >>> ]
> >>> ],
> >>> "10", [ ...
> >>>
> >>> --
> >>> *Doug Turnbull **| CTO* | OpenSource Connections
> >>> , LLC | 240.476.9983
> >>> Author: Relevant Search 
> >>> This e-mail and all contents, including attachments, is considered to
> be
> >>> Company Confidential unless explicitly stated otherwise, regardless
> >>> of whether attachments are marked as such.
> >>>
> >>
> >
> >
> > --
> > *Doug Turnbull **| CTO* | OpenSource Connections
> > , LLC | 240.476.9983
> > Author: Relevant Search 
> > This e-mail and all contents, including attachments, is considered to be
> > Company Confidential unless explicitly stated otherwise, regardless
> > of whether attachments are marked as such.
>
>

-- 
*Doug Turnbull **| CTO* | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: JSON from Term Vectors Component

2020-02-06 Thread Walter Underwood
It is one of those things that happens when you don’t have a working group beat 
on a spec for six months. With an IETF process, I bet JSON would disallow 
duplicate keys and have comments. It might even have a datetime data type or at 
least recommend ISO8601 in a string.

I was on the Atom working group. That is still a solid spec.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 6, 2020, at 8:56 AM, Doug Turnbull 
>  wrote:
> 
> Well that is interesting, I did not know that! Thanks Walter...
> 
> https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object
> 
> I gave it a go in Python (what I'm using) to see what would happen, indeed
> it gives some odd behavior
> 
> In [4]: jsonStr = ' {"test": 1, "test": 2} '
> 
> 
> In [5]: json.loads(jsonStr)
> 
> Out[5]: {'test': 2}
> 
> On Thu, Feb 6, 2020 at 11:49 AM Walter Underwood 
> wrote:
> 
>> Repeated keys are quite legal in JSON, but many libraries don’t support
>> that.
>> 
>> It does look like that data layout could be redesigned to be more portable.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 6, 2020, at 8:38 AM, Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>> 
>>> Thanks for the tip,
>>> 
>>> The issue is json.nl produces non-standard json with duplicate keys.
>> Solr
>>> generates the following, which json lint fails given multiple keys
>>> 
>>> {
>>> "positions": {
>>> "position": 155,
>>> "position": 844,
>>> "position": 1726
>>> }
>>> }
>>> 
>>> On Thu, Feb 6, 2020 at 11:36 AM Munendra S N 
>>> wrote:
>>> 
> 
> Notice the lists, within lists, within lists. Where the keys are
>> adjacent
> items in the list. Is there a reason this isn't a JSON dictionary?
> 
 I think this is because of NamedList. Have you tried using json.nl=map
>> as
 a
 query parameter for this case?
 
 Regards,
 Munendra S N
 
 
 
 On Thu, Feb 6, 2020 at 10:01 PM Doug Turnbull <
 dturnb...@opensourceconnections.com> wrote:
 
> Hi all,
> 
> I was curious if anyone had any tips on parsing the JSON response of
>> the
> term vectors component? Or anyway to force it to be more standard JSON?
 It
> appears to be very heavily nested and idiosyncratic JSON, such as
>> below.
> 
> Notice the lists, within lists, within lists. Where the keys are
>> adjacent
> items in the list. Is there a reason this isn't a JSON dictionary?
 Instead
> you have to build a stateful list parser that just seems prone to
 errors...
> 
> Any thoughts or ideas are very welcome, I probably just need to do
> something rather simple here...
> 
> "termVectors": [
> "D10", [
> "uniqueKey", "D10",
> "body", [
> "1", [
> "positions", [
> "position", 92,
> "position", 113
> ]
> ],
> "10", [ ...
> 
> --
> *Doug Turnbull **| CTO* | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 
> This e-mail and all contents, including attachments, is considered to
>> be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
> 
 
>>> 
>>> 
>>> --
>>> *Doug Turnbull **| CTO* | OpenSource Connections
>>> , LLC | 240.476.9983
>>> Author: Relevant Search 
>>> This e-mail and all contents, including attachments, is considered to be
>>> Company Confidential unless explicitly stated otherwise, regardless
>>> of whether attachments are marked as such.
>> 
>> 
> 
> -- 
> *Doug Turnbull **| CTO* | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.



Re: Checking in on Solr Progress

2020-02-06 Thread dj-manning
Erick Erickson wrote
> When you say “look”, where are you looking from? Http requests? SolrJ? The
> admin UI?

I'm open to looking from anywhere - an HTTP request, the admin UI, or
following a log if possible.

My objective would be to interactively follow/watch
Solr's recovery progress - if that's even possible.

A stretch goal would be to report on recovery progress autonomously.

The question stems from seeing recovery in the log or the admin UI, then
wondering what the progress actually is.

Appreciation.




--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: JSON from Term Vectors Component

2020-02-06 Thread Edward Ribeiro
Python's json lib will convert text such as '{"id": 1, "id": 2}' to a dict, which
doesn't allow duplicate keys. The solution in this case is to inject your
own parsing logic, as explained here:
https://stackoverflow.com/questions/29321677/python-json-parser-allow-duplicate-keys

One possible solution (below) is to turn the duplicate keys into key-list
pairs:

from json import JSONDecoder

jsonStr = '{"positions": {"position": 155,"position": 844,"position": 1726}}'

def dict_treat_duplicates(ordered_pairs):
    d = {}
    for k, v in ordered_pairs:
        if k in d:
            # duplicate keys
            prev_v = d.get(k)
            if isinstance(prev_v, list):
                # append to list
                prev_v.append(v)
            else:
                # turn into list
                new_v = [prev_v, v]
                d[k] = new_v
        else:
            d[k] = v
    return d

decoder = JSONDecoder(object_pairs_hook=dict_treat_duplicates)
decoder.decode(jsonStr)

will give you {'positions': {'position': [155, 844, 1726]}}, while

def dict_raise_on_duplicates(ordered_pairs):
  return ordered_pairs

will give you [('positions', [('position', 155), ('position', 844),
('position', 1726)])]

Best,
Edward

On Thu, Feb 6, 2020 at 1:57 PM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:
>
> Well that is interesting, I did not know that! Thanks Walter...
>
>
https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object
>
> I gave it a go in Python (what I'm using) to see what would happen, indeed
> it gives some odd behavior
>
> In [4]: jsonStr = ' {"test": 1, "test": 2} '
>
>
> In [5]: json.loads(jsonStr)
>
> Out[5]: {'test': 2}
>
> On Thu, Feb 6, 2020 at 11:49 AM Walter Underwood 
> wrote:
>
> > Repeated keys are quite legal in JSON, but many libraries don’t support
> > that.
> >
> > It does look like that data layout could be redesigned to be more
portable.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Feb 6, 2020, at 8:38 AM, Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> > >
> > > Thanks for the tip,
> > >
> > > The issue is json.nl produces non-standard json with duplicate keys.
> > Solr
> > > generates the following, which json lint fails given multiple keys
> > >
> > > {
> > > "positions": {
> > > "position": 155,
> > > "position": 844,
> > > "position": 1726
> > > }
> > > }
> > >
> > > On Thu, Feb 6, 2020 at 11:36 AM Munendra S N 
> > > wrote:
> > >
> > >>>
> > >>> Notice the lists, within lists, within lists. Where the keys are
> > adjacent
> > >>> items in the list. Is there a reason this isn't a JSON dictionary?
> > >>>
> > >> I think this is because of NamedList. Have you tried using json.nl
=map
> > as
> > >> a
> > >> query parameter for this case?
> > >>
> > >> Regards,
> > >> Munendra S N
> > >>
> > >>
> > >>
> > >> On Thu, Feb 6, 2020 at 10:01 PM Doug Turnbull <
> > >> dturnb...@opensourceconnections.com> wrote:
> > >>
> > >>> Hi all,
> > >>>
> > >>> I was curious if anyone had any tips on parsing the JSON response of
> > the
> > >>> term vectors component? Or anyway to force it to be more standard
JSON?
> > >> It
> > >>> appears to be very heavily nested and idiosyncratic JSON, such as
> > below.
> > >>>
> > >>> Notice the lists, within lists, within lists. Where the keys are
> > adjacent
> > >>> items in the list. Is there a reason this isn't a JSON dictionary?
> > >> Instead
> > >>> you have to build a stateful list parser that just seems prone to
> > >> errors...
> > >>>
> > >>> Any thoughts or ideas are very welcome, I probably just need to do
> > >>> something rather simple here...
> > >>>
> > >>> "termVectors": [
> > >>> "D10", [
> > >>> "uniqueKey", "D10",
> > >>> "body", [
> > >>> "1", [
> > >>> "positions", [
> > >>> "position", 92,
> > >>> "position", 113
> > >>> ]
> > >>> ],
> > >>> "10", [ ...
> > >>>
> > >>> --
> > >>> *Doug Turnbull **| CTO* | OpenSource Connections
> > >>> , LLC | 240.476.9983
> > >>> Author: Relevant Search 
> > >>> This e-mail and all contents, including attachments, is considered
to
> > be
> > >>> Company Confidential unless explicitly stated otherwise, regardless
> > >>> of whether attachments are marked as such.
> > >>>
> > >>
> > >
> > >
> > > --
> > > *Doug Turnbull **| CTO* | OpenSource Connections
> > > , LLC | 240.476.9983
> > > Author: Relevant Search 
> > > This e-mail and all contents, including attachments, is considered to
be
> > > Company Confidential unless explicitly stated otherwise, regardless
> > > of whether attachments are marked as such.
> >
> >
>
> --
> *Doug Turnbull **| CTO* | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search 
> This e-mail and all conte

Re: Checking in on Solr Progress

2020-02-06 Thread Erick Erickson
There’s actually a crying need for this, but nothing is there yet; basically you have to
look at the log files and try to figure it out.

Actually I think this would be a great thing to work on, but it’d be pretty 
much all new. If you’d like, you can create a Solr Improvement Proposal here: 
https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to flesh out what 
this would look like.

A couple of thoughts off the top of my head:

I really think what would be most useful would be a collections API command, 
something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS. Currently a 
replica can be stuck in recovery and never get out. There are several scenarios 
that’d have to be considered:

1> normal startup. The replica briefly goes from down->recovering->active which 
should be quite brief. 
1a> Waiting for a leader to be elected before continuing

2> “peer sync” where another replica is replaying documents from the tlog.

3> situations where the replica is replaying documents from its own tlog. This 
can be very, very, very long too.

4> full sync where it’s copying the entire index from a leader.

5> knickers in a knot, it’s given up even trying to recover.

In any case, you’d want to be able to report “all OK” if nothing is in recovery, “just
the ones having trouble”, or “everything, because I want to look”.

But like I said, there’s nothing really built into the system to accomplish 
this now that I know of.

Best,
Erick

> On Feb 6, 2020, at 12:15 PM, dj-manning  wrote:
> 
> Erick Erickson wrote
>> When you say “look”, where are you looking from? Http requests? SolrJ? The
>> admin UI?
> 
> I'm open to looking form anywhere  - http request, or the admin UI, or
> following a log if possible. 
> 
> My objective for this ask would be to human interactively follow/watch
> solr's recovery progress - if that's even possible.
> 
> Stretch goal would be to autonomously report on recovery progress.
> 
> The question stems from seeing recovery in log or the admin UI, then
> wondering what progress is.  
> 
> Appreciation.
> 
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: JSON from Term Vectors Component

2020-02-06 Thread Doug Turnbull
FWIW, I ended up writing some code that does a best-effort conversion of the
named list into a dict representation; where it can't, it keeps the entries as
Python tuples.

def every_other_zipped(lst):
    return zip(lst[0::2], lst[1::2])

def dictify(nl_tups):
    """ Return dict if all keys unique, otherwise
    dont modify """
    as_dict = dict(nl_tups)
    if len(as_dict) == len(nl_tups):
        return as_dict
    return nl_tups

def parse_named_list(lst):
    shallow_tups = [tup for tup in every_other_zipped(lst)]

    nl_as_tups = []

    for tup in shallow_tups:
        if isinstance(tup[1], list):
            tup = (tup[0], parse_named_list(tup[1]))
        nl_as_tups.append(tup)
    return dictify(nl_as_tups)



if __name__ == "__main__":
    solr_nl = [
        "D10", [
            "uniqueKey", "D10",
            "body", [
                "1", [
                    "positions", [
                        "position", 92,
                        "position", 113
                    ]
                ],
                "2", [
                    "positions", [
                        "position", 22,
                        "position", 413
                    ]
                ]
            ]
        ]
    ]
    print(repr(parse_named_list(solr_nl)))



Outputs

{
    'D10': {
        'uniqueKey': 'D10',
        'body': {
            '1': {
                'positions': [('position', 92), ('position', 113)]
            },
            '2': {
                'positions': [('position', 22), ('position', 413)]
            }
        }
    }
}


On Thu, Feb 6, 2020 at 12:59 PM Edward Ribeiro 
wrote:

> Python's json lib will convert text as '{"id": 1, "id": 2}' to a dict, that
> doesn't allow duplicate keys. The solution in this case is to inject your
> own parsing logic as explained here:
>
> https://stackoverflow.com/questions/29321677/python-json-parser-allow-duplicate-keys
>
> One possible solution (below) is to turn the duplicate keys into key-list
> pairs
>
> from json import JSONDecoder
>
> jsonStr = '{"positions": {"position": 155,"position": 844,"position":
> 1726}}'
>
> def dict_treat_duplicates(ordered_pairs):
>  d = {}
>  for k,v in ordered_pairs:
>  if k in d:
> # duplicate keys
> prev_v = d.get(k)
> if isinstance(prev_v, list):
> # append to list
> prev_v.append(v)
> else:
> # turn into list
> new_v = [prev_v, v]
> d[k] = new_v
>  else:
> d[k] = v
>  return d
> decoder = JSONDecoder(object_pairs_hook=dict_treat_duplicates)
> decoder.decode(jsonStr)
>
> will give you {'positions': {'position': [155, 844, 1726]}}, while
>
> def dict_raise_on_duplicates(ordered_pairs):
>   return ordered_pairs
>
> will give you [('positions', [('position', 155), ('position', 844),
> ('position', 1726)])]
>
> Best,
> Edward
>
> On Thu, Feb 6, 2020 at 1:57 PM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> >
> > Well that is interesting, I did not know that! Thanks Walter...
> >
> >
>
> https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object
> >
> > I gave it a go in Python (what I'm using) to see what would happen,
> indeed
> > it gives some odd behavior
> >
> > In [4]: jsonStr = ' {"test": 1, "test": 2} '
> >
> >
> > In [5]: json.loads(jsonStr)
> >
> > Out[5]: {'test': 2}
> >
> > On Thu, Feb 6, 2020 at 11:49 AM Walter Underwood 
> > wrote:
> >
> > > Repeated keys are quite legal in JSON, but many libraries don’t support
> > > that.
> > >
> > > It does look like that data layout could be redesigned to be more
> portable.
> > >
> > > wunder
> > > Walter Underwood
> > > wun...@wunderwood.org
> > > http://observer.wunderwood.org/  (my blog)
> > >
> > > > On Feb 6, 2020, at 8:38 AM, Doug Turnbull <
> > > dturnb...@opensourceconnections.com> wrote:
> > > >
> > > > Thanks for the tip,
> > > >
> > > > The issue is json.nl produces non-standard json with duplicate keys.
> > > Solr
> > > > generates the following, which json lint fails given multiple keys
> > > >
> > > > {
> > > > "positions": {
> > > > "position": 155,
> > > > "position": 844,
> > > > "position": 1726
> > > > }
> > > > }
> > > >
> > > > On Thu, Feb 6, 2020 at 11:36 AM Munendra S N <
> sn.munendr...@gmail.com>
> > > > wrote:
> > > >
> > > >>>
> > > >>> Notice the lists, within lists, within lists. Where the keys are
> > > adjacent
> > > >>> items in the list. Is there a reason this isn't a JSON dictionary?
> > > >>>
> > > >> I think this is because of NamedList. Have you tried using json.nl
> =map
> > > as
> > > >> a
> > > >> query parameter for this case?
> > > >>
> > > >> Regards,
> > > >> Munendra S N
> > > >>
> > > >>
> > > >>
> > > >> On Thu, Feb 6, 2020 at 10:01 PM Doug Turnbull <
> > > >> dturnb...@opensourceconnections.com> wrote:
> > > >>
> > > >>> Hi all,
> > > >>>
> > > >>> I was curious if anyone had any tips on parsing the JSON response
> of
> > > the
> > > >>> term vectors component? Or anyway to force it to be more standard
> JSON?
> > > >> It
> > > >>> appears to be very heavily nested and idiosyncratic JSON, such as
> > > below.
> > > >>>
> > > >>> Notice the lists, within lists, within lists. Where the keys are
> > > adjacent
> > > >>> items in the list. Is there a reason this isn't a JSON dicti

Re: DataImportHandler SolrEntityProcessor configuration for local copy

2020-02-06 Thread Mikhail Khludnev
Karl, what would you do if that home-grown implementation stalls in GC, or knocks
Solr over?

On Thu, Feb 6, 2020 at 1:04 PM Karl Stoney
 wrote:

> Spoke too soon, looks like it memory leaks.  After about 1.3m the old gc
> times went through the root and solr was almost unresponsive, had to
> abort.  We're going to write our own implementation to copy data from one
> core to another that runs outside of solr.
>
> On 06/02/2020, 09:57, "Karl Stoney"  wrote:
>
> I cannot believe how much of a difference that cursorMark and sort
> order made.
> Previously it died about 800k docs, now we're at 1.2m without any
> slowdown.
>
> Thank you so much
>
> On 06/02/2020, 08:14, "Mikhail Khludnev"  wrote:
>
> Hello, Karl.
> Please check these:
>
> https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#constraints-when-using-cursors
>
>
> https://lucene.apache.org/solr/guide/6_6/uploading-structured-data-store-data-with-the-data-import-handler.html#solrentityprocessor
>  cursorMark="true"
> Good luck.
>
>
> On Wed, Feb 5, 2020 at 10:06 PM Karl Stoney
>  wrote:
>
> > Hey All,
> > I'm trying to implement a simplistic reindex strategy to copy
> all of the
> > data out of one collection, into another, on a single node (no
> distributed
> > queries).
> >
> > It's approx 4 million documents, with an index size of 26gig.
> Based on
> > your experience, I'm wondering what people feel sensible values
> for the
> > SolrEntityProcessor are (to give me a sensible starting point,
> to save me
> > iterating over loads of them).
> >
> > This is where I'm at right now.  I know `rows` would increase
> memory
> > pressure but speed up the copy, I can't really find anywhere
> online where
> > people have benchmarked different values for rows and the
> default (50)
> > seems quite low.
> >
> > 
> > 
> > >  query="*:*"
> >  rows="100"
> >  fl="*,old_version:_version_"
> >  wt="javabin"
> >  url="http://127.0.0.1/solr/at-uk">
> >
> > 
> > 
> >
> > Any suggestions are welcome.
> > Thanks
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
>
>
>
>


-- 
Sincerely yours
Mikhail Khludnev


Re: Solr 7.7 heap space is getting full

2020-02-06 Thread Rajdeep Sahoo
If we reduce the number of threads, is that going to help?
Is there any other way to debug this?


On Mon, 3 Feb, 2020, 2:52 AM Walter Underwood, 
wrote:

> The only time I’ve ever had an OOM is when Solr gets a huge load
> spike and fires up 2000 threads. Then it runs out of space for stacks.
>
> I’ve never run anything other than an 8GB heap, starting with Solr 1.3
> at Netflix.
>
> Agreed about filter cache, though I’d expect heavy use of that to most
> often be part of a faceted search system.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 2, 2020, at 12:36 PM, Erick Erickson 
> wrote:
> >
> > Mostly I was reacting to the statement that the number
> > of docs increased by over 4x and then there were
> > memory problems.
> >
> > Hmmm, that said, what does “heap space is getting full”
> > mean anyway? If you’re hitting OOMs, that’s one thing. If
> > you’re measuring the amount of heap consumed and
> > noticing that it fills up, that’s totally normal. Java will
> > collect garbage when it needs to. If you attach something
> > like jconsole to Solr you’ll see memory grow and shrink
> > quite regularly. Take a look at your garbage collection logs
> > with something like GCViewer to see how much memory is
> > still required after a GC cycle. If that number is reasonable
> > then there’s no problem.
> >
> > Walter:
> >
> > Well, the expectation that one can keep adding docs without
> > considering heap size is simply naive. The filterCache
> > for instance grows linearly with the number of documents
> > (OK, if it it stores the full bitset). Real Time Get requires
> > on-heap structures to keep track of changed docs between
> > commits. Etc.
> >
> > The OP hasn’t even told us whether docValues are enabled
> > appropriately, which if not set for fields needing it will also
> > grow heap requirements linearly with the number of docs.
> >
> > I’ll totally agree that the relationship between the size of
> > the index on disk and heap is iffy at best. But if more heap is
> > _not_ needed for bigger indexes then we’d never hit OOMs
> > no matter how many docs we put in 4G.
> >
> > Best,
> > Erick
> >
> >
> >
> >> On Feb 2, 2020, at 11:18 AM, Walter Underwood 
> wrote:
> >>
> >> We CANNOT diagnose anything until you tell us the error message!
> >>
> >> Erick, I strongly disagree that more heap is needed for bigger indexes.
> >> Except for faceting, Lucene was designed to stream index data and
> >> work regardless of the size of the index. Indexing is in RAM buffer
> >> sized chunks, so large updates also don’t need extra RAM.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Feb 2, 2020, at 7:52 AM, Rajdeep Sahoo 
> wrote:
> >>>
> >>> We have allocated 16 GB of heap space out of 24 GB.
> >>> There are 3 Solr cores here; for one core, when the number of documents
> >>> increases to around 4.5 lakh (450k), this scenario is happening.
> >>>
> >>>
> >>> On Sun, 2 Feb, 2020, 9:02 PM Erick Erickson, 
> >>> wrote:
> >>>
>  Allocate more heap and possibly add more RAM.
> 
>  What are you expectations? You can't continue to
>  add documents to your Solr instance without regard to
>  how much heap you’ve allocated. You’ve put over 4x
>  the number of docs on the node. There’s no magic here.
>  You can’t continue to add docs to a Solr instance without
>  increasing the heap at some point.
> 
> And as far as I know, you’ve never told us how much heap you
> _are_ allocating. The default for Java processes is 512M, which
> is quite small. So perhaps it’s a simple matter of starting Solr
> with the -Xmx parameter set to something larger.
> 
>  Best,
>  Erick
> 
> > On Feb 2, 2020, at 10:19 AM, Rajdeep Sahoo <
> rajdeepsahoo2...@gmail.com>
>  wrote:
> >
> > What can we do in this scenario as the solr master node is going
> down and
> > the indexing is failing.
> > Please provide some workaround for this issue.
> >
> > On Sat, 1 Feb, 2020, 11:51 PM Walter Underwood, <
> wun...@wunderwood.org>
> > wrote:
> >
> >> What message do you get about the heap space.
> >>
> >> It is completely normal for Java to use all of heap before running a
>  major
> >> GC. That
> >> is how the JVM works.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Feb 1, 2020, at 6:35 AM, Rajdeep Sahoo <
> rajdeepsahoo2...@gmail.com>
> >> wrote:
> >>>
> >>> Please reply anyone
> >>>
> >>> On Fri, 31 Jan, 2020, 11:37 PM Rajdeep Sahoo, <
> >> rajdeepsahoo2...@gmail.com>
> >>> wrote:
> >>>
>  This is happening when the no of indexed document count is
> increasing.
>  With 1 million docs it's working fine but when it's crossing 4.5
> >

Re: How can shards distributed evenly among nodes

2020-02-06 Thread Radar Lei
This is weird. When we create an index, Solr makes sure the shards of the
index are distributed evenly across all the existing nodes. But after you use
'UTILIZENODE' from autoscaling, Solr will try to put all the shards of an index
onto one or a few nodes. Is this intentional or a bug?

For example, we have a four-node Solr cluster, and my index 'demo' has 4
shards; Solr assigned one shard to each node evenly by default. But after
we used 'UTILIZENODE' against a new node, all the shards were put on
Node1. This leaves one node with a heavy workload while the other nodes
have no work to do.

So the problem is that 'UTILIZENODE' only cares whether each node has the same
number of replicas, but it won't try to distribute each index's
replicas/shards across as many nodes as possible.
Any thoughts? Thanks.
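
(For reference, the call discussed here is the Collections API UTILIZENODE action; a minimal invocation with a placeholder host and node name might look like this:)

import requests

# UTILIZENODE moves replicas onto the named node according to the cluster policy;
# as described above, it does not spread each collection's shards per node.
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={"action": "UTILIZENODE", "node": "new-host:8983_solr"},
)
print(resp.json())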

Regards,
Radar


On Tue, Feb 4, 2020 at 5:20 PM Yuan Zhao  wrote:

> Hi Team,
>
> We are using an autoscaling policy and make use of the UTILIZENODE feature to
> move replicas to a new node.
> But we found that after replicas are moved, Solr can make sure that replicas
> belonging to the same shard are located on different nodes, but it cannot make
> sure that shards are distributed evenly across all the nodes.
> That means a node might contain all the shards of an index.
> And, more remarkably, the shards were distributed evenly before the UTILIZENODE
> command was executed.
>
>  index_name    | replica_name | shard_name | node_name             | replica_state
> ----------------+--------------+------------+-----------------------+---------------
>  test_index.t2 | core_node6   | shard1     | test-server:8983_solr | active
>  test_index.t4 | core_node7   | shard2     | test-server:8983_solr | active
>  test_index.t4 | core_node5   | shard1     | test-server:8983_solr | active
>  test_index.t2 | core_node4   | shard0     | test-server:8983_solr | active
>  test_index.t1 | core_node3   | shard1     | test-server:8984_solr | active
>  test_index.t4 | core_node8   | shard2     | test-server:8984_solr | active
>  test_index.t3 | core_node8   | shard1     | test-server:8984_solr | active
>  test_index.t2 | core_node2   | shard0     | test-server:8984_solr | active
>  test_index.t2 | core_node10  | shard1     | test-server:8985_solr | active
>  test_index.t1 | core_node18  | shard2     | test-server:8985_solr | active
>  test_index.t4 | core_node10  | shard1     | test-server:8985_solr | active
>  test_index.t3 | core_node10  | shard0     | test-server:8985_solr | active
>  test_index.t1 | core_node14  | shard2     | test-server:8987_solr | active
>  test_index.t3 | core_node14  | shard0     | test-server:8987_solr | active
>  test_index.t3 | core_node12  | shard1     | test-server:8987_solr | active
>  test_index.t1 | core_node16  | shard1     | test-server:8987_solr | active
>
>  Do you have any good solution to this problem?
>  The Solr version we are using is 7.4.
>  The cluster policy is like:
>  {
> "set-cluster-policy" : [{
>  "replica" : "<2",
>  "shard" : "#EACH",
>  "node" : "#ANY",
>  "strict" : false
> }]
> }
>
> --
> Thanks & regards,
> Yuan
>