Re: Indexing 700 docs per second

2016-04-19 Thread Tim Robertson
Hi Mark,

We were inserting and updating docs of around 20-25 indexed fields (mainly
ints, but some strings and multivalued fields) at >1000/sec on far more
modest hardware, with a total of 600 million docs (batched updates, of
course), while also serving live queries for a website with about 30
concurrent users at steady state (not all hitting Solr, though).

It seems realistic with that kind of hardware in my experience, but you
didn't mention what else was going on that might affect it (e.g. reads).
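
For reference, the batching itself was plain SolrJ, along the lines of the
sketch below (the collection name, field names, and batch size are
illustrative, not our actual setup):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchLoader {
  public static void main(String[] args) throws Exception {
    // One client, reused for the whole run
    SolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
    List<SolrInputDocument> batch = new ArrayList<>();
    for (int i = 0; i < 1_000_000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      doc.addField("count_i", i % 1000);
      batch.add(doc);
      if (batch.size() == 1000) { // send batches, never one doc per request
        client.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      client.add(batch); // flush the final partial batch
    }
    client.commit(); // one commit at the end, not per batch
    client.close();
  }
}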

HTH,
Tim


On Tue, Apr 19, 2016 at 7:12 PM, Erick Erickson wrote:

> Make very sure you batch updates though.
> Here's a benchmark I ran:
> https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>
> NOTE: it's not entirely clear that you want to
> put 122M docs on a single shard. Depending on the queries
> you'll run you may want 2 or more shards, but that depends
> on the query pattern and your SLAs. Here's the long version
> of "you really have to load test this":
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick
>
> On Tue, Apr 19, 2016 at 6:48 AM, Susheel Kumar wrote:
> >  It sounds achievable with your machine configuration, and I would suggest
> > trying atomic updates. Use SolrJ with multi-threaded indexing for a
> > higher indexing rate.
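
For what it's worth, an atomic update in SolrJ is just an ordinary
SolrInputDocument whose changed fields carry a modifier map - roughly like
this sketch (field names are made up, and the non-updated fields must be
stored so Solr can rebuild the rest of the doc):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
  public static void main(String[] args) throws Exception {
    SolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-42"); // unique key of the existing doc
    // "set" replaces just this field; the other fields are left untouched
    doc.addField("price_i", Collections.singletonMap("set", 1995));
    // "inc" adjusts a numeric field by a delta
    doc.addField("stock_i", Collections.singletonMap("inc", -1));
    client.add(doc);
    client.commit();
    client.close();
  }
}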
> >
> > Thanks,
> > Susheel
> >
> >
> >
> > On Tue, Apr 19, 2016 at 9:27 AM, Tom Evans wrote:
> >
> >> On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson <mark123lea...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I have a requirement to index (mainly update) 700 docs per second.
> >> > Suppose I have a 128GB RAM, 32-CPU machine, with each doc around 260
> >> > bytes in size (6 fields, of which only 2 will be updated at the above
> >> > rate). This collection has around 122 million docs, and that count is
> >> > pretty much constant.
> >> >
> >> > 1. Can I manage this update rate with a non-sharded, i.e. single, Solr
> >> > instance setup?
> >> > 2. Also, is an atomic update or a full update (the whole doc) of the
> >> > changed records the better approach in this case?
> >> >
> >> > Could someone please share their views/experience?
> >>
> >> Try it and see - everyone's data/schemas are different and can affect
> >> indexing speed. It certainly sounds achievable enough - presumably you
> >> can at least produce the documents at that rate?
> >>
> >> Cheers
> >>
> >> Tom
> >>
>


Newbie CSV problem

2008-03-16 Thread tim robertson
Hi All,
I have today installed Solr and am trying to get CSV files indexed, but I
can't seem to get any hits.

Using a fresh 1.2 install, I am using the schema shipped with it and the
books.csv in the example.

It seems to upload ok:

[EMAIL PROTECTED]//Users/timrobertson/dev/apache-solr-nightly/example-tim/exampledocs$
curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H
'Content-type:text/plain; charset=utf-8'

<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">17</int></lst>
</response>

But a search for Black returns no results - this is the URL I am using:

http://localhost:8983/solr/select/?q=Black&version=2.2&start=0&rows=10&indent=on


I am a complete newbie, but looking at the schema I thought the Name column
would end up indexed.


Could someone please tell me what I am missing?


Many Thanks


Tim


Re: Newbie CSV problem

2008-03-16 Thread tim robertson
Ah - perfect
Thanks!


On Sun, Mar 16, 2008 at 1:26 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> You won't see anything until the results are committed.
> try something like http://localhost:8983/solr/update/csv?commit=true
> to commit after adding all the docs.
>
> post.sh in exampledocs also has an example at the end of how to send a
> commit command separately.
>
> -Yonik
>
> On Sun, Mar 16, 2008 at 6:56 AM, tim robertson
> <[EMAIL PROTECTED]> wrote:
> > Hi All,
> >  I have today installed Solr and am trying to get CSV files indexed, but I
> >  can't seem to get any hits.
> >
> >  Using a fresh 1.2 install, I am using the schema shipped with it and the
> >  books.csv in the example.
> >
> >  It seems to upload ok:
> >
> >  [EMAIL PROTECTED]//Users/timrobertson/dev/apache-solr-nightly/example-tim/exampledocs$
> >  curl http://localhost:8983/solr/update/csv --data-binary @books.csv -H
> >  'Content-type:text/plain; charset=utf-8'
> >
> >  <response>
> >  <lst name="responseHeader"><int name="status">0</int><int name="QTime">17</int></lst>
> >  </response>
> >
> >  But a search for Black returns no results - this is the URL I am using:
> >
> >
> http://localhost:8983/solr/select/?q=Black&version=2.2&start=0&rows=10&indent=on
> >
> >
> >  I am a complete newbie, but looking at the schema I thought the Name
> >  column would end up indexed.
> >
> >
> >  Could someone please tell me what I am missing?
> >
> >
> >  Many Thanks
> >
> >
> >  Tim
> >
>


missing content stream - simple tab file

2008-03-24 Thread tim robertson
Hi all,
I am a newbie with Solr, trying to index a very simple tab-delimited file
(using a nightly build from a couple of days ago).
Any help would be greatly appreciated!

My test tab file has only 3 lines:

Passer domesticus 1787248
Passer domesticus (Linnaeus, 1758) 694
Passer domesticus (Linnaeus,1758) 8

My schema:

[the schema XML was stripped by the mail archive; going by the upload
command below, it defined "name" and "count" fields, with "name" as the
default search field]

And I am uploading using this command:
curl
http://localhost:8983/solr/update/csv?fieldnames=name,count&separator=%09&escape=\&header=false--data-binary
@test -H 'Content-type:text/plain; charset=utf-8'

It gives a missing content stream error with the stack trace at the bottom
of this email.

Any help greatly appreciated!!!

Thanks

Tim


Mar 24, 2008 8:35:56 PM org.apache.solr.core.SolrCore execute
INFO: /update/csv fieldnames=id,name 0 3
Mar 24, 2008 8:36:16 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing content stream
    at org.apache.solr.handler.CSVRequestHandler.handleRequestBody(CSVRequestHandler.java:49)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:118)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:228)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:948)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:326)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:280)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

Mar 24, 2008 8:36:16 PM org.apache.solr.core.SolrCore execute
INFO: /update/csv fieldnames=name,count 0 2


Fwd: missing content stream - simple tab file

2008-03-24 Thread tim robertson
Ah, for some reason I am not receiving SOLR-user messages even though I am
subscribed.
If anyone has any ideas, can you please copy me in on the reply?

Thanks



Re: missing content stream - simple tab file

2008-03-24 Thread tim robertson
Thanks,
You are correct...  putting quotes around the URL solved it - schoolboy error

thanks

Tim


On Mon, Mar 24, 2008 at 9:48 PM, Chris Hostetter <[EMAIL PROTECTED]> wrote:

>
> Tim: double check that solr-user mail isn't showing up in your spam
> folder, you may need to whitelist it since it identifies itself as bulk
> mail.
>
> : And I am uploading using this command:
> : curl
> :
> http://localhost:8983/solr/update/csv?fieldnames=name,count&separator=%09&escape=\&header=false--data-binary
> : @test -H 'Content-type:text/plain; charset=utf-8'
>
> It looks like you aren't quoting the URL so that your shell knows it's a
> single string .. the "&" characters are getting treated specially .. you can
> tell because the URL with params that Solr says is getting hit ends with
> "...,count" ...
>
> : Mar 24, 2008 8:36:16 PM org.apache.solr.core.SolrCore execute
> : INFO: /update/csv fieldnames=name,count 0 2
>
> everything after that "&" is probably getting interpreted by your shell as
> additional commands (don't you see any errors in your terminal where you
> run this command?)
>
> Also: it looks like you are missing a space between "false" and
> "--data-binary"
>
>
> -Hoss
>
>


What are the limits? Billions of records anyone?

2008-03-24 Thread tim robertson
Hi all,
I have just got a Solr index working for the first time on a few hundred
thousand records from a custom database dump, and the results are very
impressive, both in the speed it indexes (even on my MacBook) and the
response times.

If I want to index "what, where (grid-based to 0.1 degree cells), when, who"
type information (let's say a schema of 10 strings, 2 dates, 4 ints), what
are the limitations going to be?

Is there any documentation on whether indexes can be partitioned easily, so
scaling is somewhat linear?

My reasoning for asking is that our current searchable "index" is a MySQL
database with 2 main fact tables of 150,000,000 and 15,000,000 records,
which are joined for most queries.  We are looking to increase to 10x that
size, so I am looking at billions of records...

How likely is this to scale on Solr?
What's the biggest number of items people have indexed?
How complicated do the queries have to get before things get slow? This is
the kind of thing I am looking for:
(name:"Passer domesticus*" AND cell:[36543 TO 43324] AND mod360Cell:[45 TO
65] AND year:[1950 TO *])
- if you care, this is a search for house sparrows (Passer domesticus) in a
geo bounding box, collected/observed after 1950...

I'm going to be trying anyway, but any pointers appreciated (Hadoop
perhaps?)

Thanks,

Tim
PS - This is an open source open access project to create an index of
biodiversity data (http://data.gbif.org) so your help is going towards a
worthwhile cause!


Re: What are the limits? Billions of records anyone?

2008-03-25 Thread tim robertson
Thanks Yonik,
I will give it a play when I get some time and write back.
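
If I follow the filter-query suggestion, the common ranges move out of q and
into fq parameters so they are cached as filters. In SolrJ terms that would
look roughly like the sketch below (field names are from my example; the
exact client API differs between Solr versions, and the commented shards
line is the distributed-search option, with made-up hostnames):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilterQuerySketch {
  public static void main(String[] args) throws Exception {
    SolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
    SolrQuery q = new SolrQuery();
    q.setQuery("name:\"Passer domesticus*\""); // the part that varies per search
    q.addFilterQuery("cell:[36543 TO 43324]"); // common ranges go in fq so their
    q.addFilterQuery("year:[1950 TO *]");      // results can be cached and reused
    // q.set("shards", "host1:8983/solr,host2:8983/solr"); // distributed search
    QueryResponse rsp = client.query(q);
    System.out.println("hits: " + rsp.getResults().getNumFound());
    client.close();
  }
}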

Tim


On Tue, Mar 25, 2008 at 1:21 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> On Mon, Mar 24, 2008 at 5:30 PM, tim robertson
> <[EMAIL PROTECTED]> wrote:
> >  Is there any documentation on whether indexes can be partitioned easily,
> >  so scaling is somewhat linear?
>
> http://wiki.apache.org/solr/DistributedSearch
>
> It's very new, so you would need a recent nightly build.
> If you try it, let us know how it works (or what issues you run into).
>
> >  (name:"Passer domesticus*" AND cell:[36543 TO 43324] AND mod360Cell[45
> TO
> >  65] AND year:[1950 TO *])
>
> Range queries can be slow if the number of terms in the range is large.
> If the range query is common, it can be pulled out into a separate
> filter query (fq param) and cached separately.  If it's rather unique
> (different endpoint values each time), then there is currently no
> quick fix.  But due to some basic work being done in Lucene, I predict
> some relief not too far in the future.
>
> -Yonik
>


Multiple schemas?

2008-03-27 Thread tim robertson
Hi,
Would I be correct in thinking that for each schema I want, I need a new
SOLR instance running?

Thanks

Tim


Re: size limitation when adding document?

2008-03-27 Thread tim robertson
Today I added a single 9GB tab file into Solr, with the resulting index
being 16GB.  It took 3 hours to load and is performing mightily fine (JVM -Xmx3G).


On Thu, Mar 27, 2008 at 7:08 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> On Thu, Mar 27, 2008 at 1:21 PM, Andrew Tan <[EMAIL PROTECTED]> wrote:
> >  I am new to Solr and am just trying it out. I copied solr.war (from the
> >  1.2.0 distribution) into Tomcat 5.5.26's webapps directory and started
> >  Tomcat. Then I used the Java SimplePostTool to add documents. When the
> >  document is small, things are fine. However, when I tried to add a
> >  document (greater than 8KB) into the Solr server, I got the following
> >  error message:
> >
> >  java.io.EOFException: no more data available -
> >  expected end tags </field></doc></add> to close start
> >  tag <field> from line 35 and start tag <doc> from line 3 and
> >  start tag <add> from line 2, parser stopped on START_TAG seen
> >  ...uiror Termination Fee if the preceding relates to antitrust or com...
> >  @35:859
>
> This looks like the XML may not be valid.
> I just tried a big document (120K) and a big single field (75K).  Both
> worked fine.
> Are you sure the field values have reserved XML chars escaped
> correctly?  Perhaps try opening the file in an XML editor.
>
> -Yonik
>


Re: Multiple schemas?

2008-03-27 Thread tim robertson
Thanks all, for the answers

On Thu, Mar 27, 2008 at 10:04 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> On Thu, Mar 27, 2008 at 4:56 PM, Otis Gospodnetic
> <[EMAIL PROTECTED]> wrote:
> > Or use the JNDI approach that's described on the Wiki.  I've used it
> with Jetty and it works nicely.  Multiple webapp contexts, multiple Solr
> indices, but a single JVM.
>
> With multiple smaller collections, one might want this (or multicore).
> If the collections are big, it's best to use a separate JVM.  Among
> other benefits, GC pause times will be shorter for a smaller heap.
>
> -Yonik
>


Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
Hi all

We load Solr (8.4.1) from Spark and are trying to grow the schema with some
dynamic fields that will result in around 500-600 indexed fields per doc.

Currently, we see ~300 fields/doc work very well into an 8-node Solr
cluster with CPU nicely balanced across a cluster and we saturate our
network.

However, growing to ~500-600 fields we see incoming network traffic drop to
around a quarter, and in the Solr cluster we see low CPU on most machines
but always one machine with high load (it is the Solr process). That
machine will stay high for many minutes, and then another will go high -
see the CPU graph [1]. I've played with changing shard counts, but beyond
32 I didn't see any gains. There is only one replica per shard; each machine
runs on AWS with an EFS-mounted disk running only Solr 8, and ZK is on a
different set of machines.

Can anyone please throw out ideas of what you would do to tune Solr for
large amounts of dynamic fields?

Does anyone have a guess on what the single high CPU node is doing (some
kind of metrics aggregation maybe?).

Thank you all,
Tim

[1] CPU utilization graph (image attachment stripped by the mailing list)


Re: Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
Thank you Erick

I should have been clearer that this is a bulk load job into a write-only
cluster (until loaded, when it becomes read-only), and it is the write
throughput I was chasing.

I made some changes and have managed to get it working more closely to what
I expect.  I'll summarise them here in case anyone stumbles on
this thread but please note this was just the result of a few tuning
experiments and is not definitive:

- Increased the shard count, so there were as many shards as virtual CPU
cores on each machine
- Set ramBufferSizeMB to 2048
- Increased the parallelization in the loading job (i.e. ran the job across
more Spark cores concurrently)
- Dropped to batches of 500 docs sent instead of 1000
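
For completeness, the client side of the load boiled down to something like
the sketch below (the hostname, queue size, thread count, and field names
are illustrative; in our real job this runs inside Spark tasks, and
ramBufferSizeMB is set in solrconfig.xml, not in client code):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoadWorker {
  public static void main(String[] args) throws Exception {
    // Buffers docs and streams them to Solr from background threads
    ConcurrentUpdateSolrClient client =
        new ConcurrentUpdateSolrClient.Builder("http://solr-host:8983/solr/mycollection")
            .withQueueSize(10000)
            .withThreadCount(8)
            .build();
    List<SolrInputDocument> batch = new ArrayList<>(500);
    for (int i = 0; i < 100_000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", Integer.toString(i));
      doc.addField("field_" + (i % 600) + "_s", "value"); // dynamic fields
      batch.add(doc);
      if (batch.size() == 500) { // batches of 500 worked better than 1000 for us
        client.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      client.add(batch);
    }
    client.blockUntilFinished(); // drain the internal queue before closing
    client.close();
  }
}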


On Wed, Mar 18, 2020 at 1:19 PM Erick Erickson wrote:

> The Apache mail server strips attachments pretty aggressively, so I can’t
> see your attachment.
>
> About the only way to diagnose would be to take a thread dump of the
> machine that’s running hot.
>
> There are a couple of places I’d look:
>
> 1> what happens if you don’t return any non-docValue fields? To return
> stored fields, the doc must be fetched and decompressed. That doesn’t fit
> very well with your observation that only one node runs hot, but it’s worth
> checking.
>
> 2> Return one doc-value=true field and search only on a single field (with
> different values of course). Does that follow this pattern? What I’m
> wondering about here is whether the delays are because you’re swapping
> index files in and out of memory. Again, that doesn’t really explain high
> CPU utilization, if that were the case I’d expect you to be I/O bound.
>
> 3> I’ve seen indexes with this many fields perform reasonably well BTW.
>
> How many fields are you returning? One thing that happens is that when a
> query comes in to a node, sub-queries are sent out to one replica of each
> shard, and the results from each shard are sorted by one node and returned
> to the client. Unless you’re returning lots and lots of fields and/or many
> rows, this shouldn’t run “for many minutes”, but it’s something to look for.
>
> When this happens, what is your query response time like? I’m assuming
> it’s very slow.
>
> But these are all shots in the dark, some thread dumps would be where I’d
> start.
>
> Best,
> Erick
>
> > On Mar 18, 2020, at 6:55 AM, Tim Robertson wrote:
> >
> > Hi all
> >
> > We load Solr (8.4.1) from Spark and are trying to grow the schema with
> some dynamic fields that will result in around 500-600 indexed fields per
> doc.
> >
> > Currently, we see ~300 fields/doc work very well into an 8-node Solr
> cluster with CPU nicely balanced across a cluster and we saturate our
> network.
> >
> > However, growing to ~500-600 fields we see incoming network traffic drop
> to around a quarter and in the Solr cluster we see low CPU on most
> machines, but always one machine with high load (it is the Solr process).
> That machine will stay high for many minutes, and then another will go high
> - see CPU graph [1]. I've played with changing shard counts but beyond 32
> didn't see any gains. There is only one replica on each shard, each machine
> runs on AWS with an EFS mounted disk only running Solr 8, ZK is on a
> different set of machines.
> >
> > Can anyone please throw out ideas of what you would do to tune Solr for
> large amounts of dynamic fields?
> >
> > Does anyone have a guess on what the single high CPU node is doing (some
> kind of metrics aggregation maybe?).
> >
> > Thank you all,
> > Tim
> >
> > [1] CPU utilization graph (image attachment stripped by the mailing list)
> >
>
>


Re: Tuning for 500+ field schemas

2020-03-18 Thread Tim Robertson
Thank you Edward, Erick,

In this environment, hard commits are at 60s without openSearcher, and soft
commits are off.
We have the luxury of building the index, then opening searchers and adding
replicas afterward.

We'll monitor the segment merging and lengthen the commit time as suggested
- thank you!




On Wed, Mar 18, 2020 at 5:45 PM Erick Erickson wrote:

> Ah, ok. Then your spikes were probably being caused by segment merging,
> which would account for it being on different machines and running for a
> long time. Segment merging is a very expensive operation...
>
> As Edward mentioned, your commit settings come into play. You could easily
> be creating much smaller segments due to commits. I'd check the indexes to
> see how small the smallest segments are while indexing; you can do that
> through the admin UI. If they're much smaller than your ramBufferSizeMB,
> lengthen the commit interval...
>
> The default merge policy caps segments at 5GB, btw.
>
> Finally, indexing throughput should scale roughly linearly with the number
> of shards. You should be able to saturate the CPUs with enough client threads.
>
> Best,
> Erick
>
> On Wed, Mar 18, 2020, 12:04 Edward Ribeiro wrote:
>
> > What are your hard and soft commit settings? This can have a large
> > impact on the writing throughput.
> >
> > Best,
> > Edward
> >
> > On Wed, Mar 18, 2020 at 11:43 AM Tim Robertson
> >  wrote:
> > >
> > > Thank you Erick
> > >
> > > I should have been clearer that this is a bulk load job into a
> write-only
> > > cluster (until loaded when it becomes read-only) and it is the write
> > > throughput I was chasing.
> > >
> > > I made some changes and have managed to get it working more closely to
> > what
> > > I expect.  I'll summarise them here in case anyone stumbles on
> > > this thread but please note this was just the result of a few tuning
> > > experiments and is not definitive:
> > >
> > > - Increased shard count, so there were the same number of shards as
> > virtual
> > > CPU cores on each machine
> > > - Set the ramBufferSizeMB to 2048
> > > - Increased the parallelization in the loading job (i.e. ran the job
> > across
> > > more spark cores concurrently)
> > > - Dropped to batches of 500 docs sent instead of 1000
> > >
> > >
> > > On Wed, Mar 18, 2020 at 1:19 PM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > > > The Apache mail server strips attachments pretty aggressively, so I
> > can’t
> > > > see your attachment.
> > > >
> > > > About the only way to diagnose would be to take a thread dump of the
> > > > machine that’s running hot.
> > > >
> > > > There are a couple of places I’d look:
> > > >
> > > > 1> what happens if you don’t return any non-docValue fields? To
> return
> > > > stored fields, the doc must be fetched and decompressed. That doesn’t
> > fit
> > > > very well with your observation that only one node runs hot, but it’s
> > worth
> > > > checking.
> > > >
> > > > 2> Return one doc-value=true field and search only on a single field
> > (with
> > > > different values of course). Does that follow this pattern? What I’m
> > > > wondering about here is whether the delays are because you’re
> swapping
> > > > index files in and out of memory. Again, that doesn’t really explain
> > high
> > > > CPU utilization, if that were the case I’d expect you to be I/O
> bound.
> > > >
> > > > 3> I’ve seen indexes with this many fields perform reasonably well
> BTW.
> > > >
> > > > How many fields are you returning? One thing that happens is that
> when
> > a
> > > > query comes in to a node, sub-queries are sent out to one replica of
> > each
> > > > shard, and the results from each shard are sorted by one node and
> > returned
> > > > to the client. Unless you’re returning lots and lots of fields and/or
> > many
> > > > rows, this shouldn’t run “for many minutes”, but it’s something to
> > look for.
> > > >
> > > > When this happens, what is your query response time like? I’m
> assuming
> > > > it’s very slow.
> > > >
> > > > But these are all shots in the dark, some thread dumps would be where
> > I’d
> > > > start.
> > > >
> > > > Best,