DocValues changed with schema version 1.6
(https://issues.apache.org/jira/browse/SOLR-8220). Have you checked that the
same number of fields are returned for the two setups?
- Toke Eskildsen
03/docvalues-vs-stored-fields-apache-solr-features-and-performance-smackdown.html
BTW: The documentation should definitely mention that stored preserves
order & duplicates. It is not obvious.
- Toke Eskildsen, Royal Danish Library
to be processed, it indicates that
the cluster is overloaded. Increasing the timeout is just a band-aid.
- Toke Eskildsen, Royal Danish Library
hash:00* OR hash:01* OR hash:02* OR hash:03* OR hash:04*
-> Facets for 1950K documents (100M/256 * 5)
Prefix queries might prove to be too expensive, so you could also
create fields with random values from 0-9, 0-99, 0-999 etc. and do
exact match filtering on those to get the number of hits down.
- Toke Eskildsen, Royal Danish Library
eeding-up-core-search/
and there is https://issues.apache.org/jira/browse/LUCENE-8875 which
takes care of the Sentinel thing in Solr 8.2.
- Toke Eskildsen, Royal Danish Library
e problem.
- Toke Eskildsen, Royal Danish Library
On Mon, 2019-10-07 at 10:18 -0700, Wei wrote:
> /solr/mycollection/select?stats=true&stats.field=unique_ids&stats.calcdistinct=true
...
> Is there a way to block certain solr queries based on url pattern?
> i.e. ignore the stats.calcdistinct request in this case.
It sounds like it is possible f
e shard? Single shard
indexes maximize throughput at the possible cost of latency, so that
seems fitting for your requirements.
- Toke Eskildsen, Royal Danish Library
cessorFactory that is mentioned:
http://lucene.apache.org/solr/7_2_1/solr-core/org/apache/solr/schema/DatePointField.html
- Toke Eskildsen
te to read up on that and respond
in that thread, to avoid hi-jacking this one. It probably won't be this
week as Real Work is heating up.
- Toke Eskildsen, Royal Danish Library
last year we experienced similar
> problems.
The iterator-based DocValues implementation in Solr 7 has a performance issue
with large segments, with symptoms akin to SOLR-8096. If you have not already
solved your problems, Solr 8 (with an upgraded index) might help.
- Toke Eskildsen
try disabling grouping fully.
It does not explain the difference between Solr 4 & 8, but I agree with David
that we need to isolate what causes the overall slowdown first, before we can
attempt to fix the Solr 4 vs 8 thing.
- Toke Eskildsen
a very large
result set requires more CPU power to uncompress in Solr 8 (but less
IO))
* Do you have any response-related defaults in your solrconfig.xml,
such as faceting or grouping?
(You might be doing heavy aggregation even if you don't explicitly ask
for it)
- Toke Eskildsen, Royal Danish Library
ome special
high-performance setup with a budget for tuning: Matching terms and joining
filters is core Solr (Lucene really) functionality. Plain query &
filter-matching time tends to be dwarfed by aggregations (grouping, faceting,
stats).
- Toke Eskildsen
very large documents? How big is your index in bytes?
- Toke Eskildsen
now and keep
> going back and forth on whether we should preserve accent marks.
Going with what we do, my answer would be: Yes, do preserve and also remove
:-). You could even have 3 or more levels of normalisation, depending on how
much time you have for polishing.
- Toke Eskildsen
getSorted for
each collect call? Could you share your code somewhere?
- Toke Eskildsen
lues in Solr, so
the safe (best performance) solution would be to implement something
like the pseudo code I wrote earlier.
- Toke Eskildsen, Royal Danish Library
&&
isValid(dv.binaryValue().utf8ToString())
in your collect method.
https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/index/DocValues.html#getSorted-org.apache.lucene.index.LeafReader-java.lang.String
- Toke Eskildsen
If you want to speed it up further, you can use BytesRefs as keys in
your c
ts knows which cluster to use? Can it be divided
further?
- Toke Eskildsen
obs. Scaling this specialized setup to your corpus size would require
about 3TB of SSD, 64GB RAM and 4 CPU-cores, divided among 4 shards. You are
likely to need quite a lot more than that, so this is just to say that at this
scale the use of the index matters _a lot_.
- Toke Eskildsen
as worst case
for storage usage during optimize is a total of 3*index size.
- Toke Eskildsen, Royal Danish Library
fference between
evaluating a graph query (any query really) and asking for 1M results
to be returned. With that in mind, what do you set rows to?
- Toke Eskildsen, Royal Danish Library
ld-value-faceting-parameters
- Toke Eskildsen, Royal Danish Library
pache.org/jira/browse/SOLR-13013
If it is easy for you to test, you could try Solr 8 as that should work
better for random access of DocValues.
- Toke Eskildsen, Royal Danish Library
e indexes and/or setups where performance
is very important.
- Toke Eskildsen, Royal Danish Library
regression for
DocValues that is very visible when using export. See
https://issues.apache.org/jira/browse/SOLR-13013), so I would expect it to be
slower than Solr 5. You could try with Solr 8 where this regression should be
mitigated somewhat.
- Toke Eskildsen
t well with
that. Instead you can look at Common Grams, where your high-frequency
words get concatenated with surrounding words. This only works with
phrases though. There's a nice article at
https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
- Toke Eskildsen, Royal Danish Library
sparing me the work.
- Toke Eskildsen, Royal Danish Library
f possible) will be for a later Solr version. Currently it is
not possible to tweak the docValues indexing parameters outside of code
changes.
Do note that we're still operating on guesses here. The cause for your
regression might easily be elsewhere.
- Toke Eskildsen, Royal Danish Library
e tiny, but mistakes happen. With that in mind, do you have
DocValues enabled for a lot of your fields?
Performance issues like this one are notoriously hard to debug remotely.
Is it possible for you to share your setup and your test data?
- Toke Eskildsen, Royal Danish Library
, the query
> doesn't fetch results.
You need to tell Solr which fields it should search: df=cfield
https://lucene.apache.org/solr/guide/7_7/the-standard-query-parser.html#standard-query-parser-parameters
- Toke Eskildsen, Royal Danish Library
due to stop-the-world garbage collections.
Try dialing Xmx _way_ down: If your batches are only 5MB each, try
Xmx=20g or less. I know that the stats above say that Solr uses 111GB,
but the JVM has a tendency to expand the heap quite a lot when it is
getting hammered. If you want to check beforehand, you can see how much
memory is freed from full GCs in the GC-log.
- Toke Eskildsen, Royal Danish Library
On Thu, 2019-03-14 at 13:16 +0100, jim ferenczi wrote:
> http://lucene.apache.org/solr/8_0_0/changes/Changes.html
Thank you for the hard work of rolling the release!
Looking forward to upgrading.
- Toke Eskildsen, Royal Danish Library
ent retrieval) doc values performance for indexes with many
documents.
- Toke Eskildsen, Royal Danish Library
hat you are unsure
of.
- Toke Eskildsen
hey do add up.
For most practical purposes (URL-lookup & grouping, following links between
archived pages, resolving embedded resources from pages) we use the heavily
normalised URL.
- Toke Eskildsen
Arunan Sugunakumar wrote:
> https://lucene.apache.org/solr/guide/6_6/making-and-restoring-backups.html
We (also?) prefer to keep our stage/build setup separate from production.
Backup + restore works well for us. It is very fast, as it is basically just
copying the segment files.
- Toke Eskildsen
dea: Issue a query with debug=sanity and
get a report from checks on both the underlying index and the issued
query for indicators of problems:
https://github.com/tokee/lucene-solr/issues/54
- Toke Eskildsen, Royal Danish Library
I'll just note that faceting on a
DocValues=true indexed=false field on a multi-shard index also has a
performance penalty as the field will be slow-searched (using the
DocValues) in the secondary fine-counting phase.
- Toke Eskildsen, Royal Danish Library
On Wed, 2018-11-14 at 17:53 +0530, Anil wrote:
> I don't see median aggregation in JSON facet api documentation.
It's the 50th percentile:
https://lucene.apache.org/solr/guide/7_5/json-facet-api.html#metrics-example
- Toke Eskildsen, Royal Danish Library
-8374
* Experiment with different amounts of concurrent requests to see what
gives the optimum throughput. This also tells you how much extra
hardware you need, if you decide you need to expand.
- Toke Eskildsen, Royal Danish Library
Hopefully that would unearth very
few problematic parts, such as regexp, function or prefix-wildcard
queries. There might be ways to replace or tune those.
- Toke Eskildsen, Royal Danish Library
k check if it is the resolving of
specific field values that is the problem. If using fl=id speeds up
substantially, the next step would be to add fields gradually until
(hopefully) there is a sharp performance decrease.
- Toke Eskildsen, Royal Danish Library
e bottleneck.
Are you looking at overall CPU usage or single-core? When we run force
merge, we have a single core at 100% while the rest are idle.
NB: There is currently a thread "Static index, fastest way to do
forceMerge" in the Lucene users mailinglist, which seem to be quite
parallel t
t runs,
where you change different components and tell us roughly how that
affects performance?
1) Only request simple sorting by score
2) Reduce rows to 0
3) Increase rows to 100
4) Set fl=id only
- Toke Eskildsen, Royal Danish Library
measuring (which of course also takes resources, this
time in the form of work hours). My rough suggestion of a factor 10 for
your system is guesswork erring on the side of a high number.
- Toke Eskildsen, Royal Danish Library
with a max amount of concurrent connections
and a sensible queue. Preferably after a bit of testing to locate where the
highest throughput is. It won't make you hit your overall goal, but it can move
you closer to it.
- Toke Eskildsen
lues=true, Solr
treats all existing documents as having docValues enabled for that field. As
there is no docValue content, DocValues-aware functionality such as sorting and
faceting will not work for that field, until the documents have been re-indexed.
- Toke Eskildsen
d help with the patch.
- Toke Eskildsen
-8374).
With that in mind, could you tell me
* How many documents you have in your index?
* Whether you use stored or docValues for the fields that you retrieve
as part of the search result?
* If you perform heavy faceting, grouping or stats?
Maybe provide a sample query, if you are able?
Thanks.
simple answer there. If you have an index that you update very rarely, it
can save memory and processing power. If you have a live index where you add
and delete documents, it will probably be a bad idea. One strategy used with
time series data is to have old and immutable data in dedicated collections,
which can then be optimized.
- Toke Eskildsen
up, so I would
expect streaming to do the same. I would not expect a 30% increase to
cause something serious on that account though. How many documents in
your index?
- Toke Eskildsen, Royal Danish Library
for excessive traffic in short bursts, not
for a sustained high traffic level.
This advice is independent of Shawn's BTW. You could increase your
server capabilities 10-fold and it would still apply.
- Toke Eskildsen, Royal Danish Library
ind, have you
considered posting a write-up of your hard work somewhere? It seems a
shame only to have it as an input on this mailing list.
- Toke Eskildsen, Royal Danish Library
fields is an outlier
in Solr Land and as such warrants caution and consideration.
- Toke Eskildsen, Royal Danish Library
the result set.
I would argue your OOM with small result sets and huge rows is a good
thing: You encounter the problem immediately, instead of hitting it at
some random time when a match-a-lot query is issued by a user.
- Toke Eskildsen, Royal Danish Library
The relevant JIRA seems to be https://issues.apache.org/jira/browse/SOLR-8988
Try setting facet.distrib.mco=true
- Toke Eskildsen, Royal Danish Library
rCache or
faceting on a high-cardinality field.
If the query above is representative of your general queries, I'll guess it's
the many docs + large filterCache one. It's fairly easy to check:
* What is your Xmx?
* How many documents in your index?
* What is your filterCache size?
- Toke Eskildsen
ring queries from the same
user and then blacklisting the user? But what if the query is a link
shared on a forum? And so forth.
Hardening by blacklisting is a game that is hard to win. So to
paraphrase Shawn: Make sure your users cannot issue OOMing queries.
- Toke Eskildsen, Royal Danish Library - Aarhus
Dominique Bejean wrote:
> Hi, Thank you for the explanations about faceting. I was thinking the hit
> count had a bigger impact on facet memory lifecycle.
Only if you have a very high facet.limit. Could you provide us with a typical
query, including all the parameters?
- Toke Eskildsen
0-12:00.
If you cannot share, please check if you have excessive traffic around that
time or if there is a lot of UnInverting going on (triggered by faceting on
non-DocValues String fields). I know your post implies that you have already
done so, so this is more of a sanity check.
- Toke Eskildsen
ll use Threads (wrapped as Futures) as they are easy to work with. Getting
into thousands of connections in Solr seems like a danger sign to me, whether
they are done async or not.
- Toke Eskildsen
ve ~20 shards in your cloud?
The issue of the default 10K limit is an old one:
https://issues.apache.org/jira/browse/SOLR-7344
I suggest you put a proxy in front of your Solr-cloud to handle queueing of
incoming requests.
- Toke Eskildsen
lter-queries for all the different groups so that the
users do not pay the first-call penalty. This requires your filter-
cache to be large enough to hold all the author lists.
- Toke Eskildsen, Royal Danish Library
check if you have any "Overlapping onDeckSearchers" in your solr.log?
- Toke Eskildsen
ueries have
finished or not? If it is the latter, one explanation could be that your Solr 7
setup is simply slower on average to respond than your Solr 4 setup, to the
point where it cannot keep up with the influx of queries.
- Toke Eskildsen
ncurrent requests and a queue to hold
the rest? Even with an overprovisioning of 4 requests/CPU-core to get them
running close to 100% we're talking 1000 CPU-cores in your system.
- Toke Eskildsen
ent search criteria:
Do they all take ~1 minute or just the first?
- Toke Eskildsen, Royal Danish Library
idea at
https://sbdevel.wordpress.com/2014/03/17/fast-faceting-with-high-cardinality-and-small-result-set/
- Toke Eskildsen
r, it is a design decision. In order to provide
pagination without recomputing the result set, you would need a
guaranteed server-side state. Solr does not implement that pattern and
thanks for that.
- Toke Eskildsen, Royal Danish Library
false since they are multi-valued). Debug info below.
docValues works fine with multi-values (at least for Strings).
- Toke Eskildsen
't find it very usable for
observing and tweaking heap size. The GC-log is better.
- Toke Eskildsen, Royal Danish Library
complicated syntax Solr
> uses. I think V2 APIs are coming to address this, but they did come a
> bit late in the game.
I guess you mean JSON APIs? Anyway, I fully agree that the old Solr
syntax is extremely clunky as soon as we move beyond the simple "just
supply a few search terms"
h-dates.html#WorkingwithDates-DateMath
Your query would be something like
mydate:[* TO NOW/DAY] AND mydate:[NOW+1DAY/DAY TO *]
- Toke Eskildsen, Royal Danish Library
g much further ahead, the whole caching system would benefit from
having constraints that encompass all the shards & collections served
in the same Solr. Unfortunately it is a daunting task just to figure
out the overall principles in this.
- Toke Eskildsen, Royal Danish Library
ed to 32.
Best solution: Use maxSizeMB (if it works)
Second best solution: Reduce to 32 or less
Third best, but often used, solution: Hope that most of the entries are
sparse and will remain so
- Toke Eskildsen, Royal Danish Library
Are you indexing while you search? If so, you need to set auto-warm or
state a few explicit warmup-queries. If not, your measuring will not be
representative as it will be on first-searches, which are always slower
than warmed-searches.
- Toke Eskildsen, Royal Danish Library
ly ask for the number you need.
Same goes for rows BTW.
- Toke Eskildsen
> I hope the heap size will continue to sustain for the index size.
You can check the memory usage in the admin GUI.
- Toke Eskildsen, Royal Danish Library
ith a 2GB JVM or something like that.
One of the symptoms for having too large a memory allocation for the
JVM is occasional long pauses due to garbage collection. However, you
should not lose anything - it is just a pause. Can you describe in more
detail what you mean by freeze and losing data
aking a shot at that. A fairly easy optimization would
be to replace the BytesRef[] indexedTermsArray with a BytesRefArray.
- Toke Eskildsen, Royal Danish Library
memory? What I am aiming at is whether this is primarily a "many
relatively slow random access"-thing or more due to the way DocValues
are represented in the segments (the codec).
- Toke Eskildsen, Royal Danish Library
n-trivial overhead going from 1 to more than 1 shard. If
your collections are not too large, chances are that you will lower
your hardware requirements (and/or improve response times) by using
only 1 shard/collection.
- Toke Eskildsen, Royal Danish Library
d for Solr, but even then
you might want to have a hard limit, just to avoid the occasional "cat
steps on F5 and the browser issues a gazillion requests"-scenario.
--
Toke Eskildsen, Royal Danish Library
garbage
collections can take a long time.
We have a setup with 25 nodes per physical server, each with 8GB of heap.
Running that as a single node per physical machine would mean ~200GB heap. I am
sure it is possible to wrangle such a beast, but I'd rather spend my energy on
Solr instead.
- Toke Eskildsen
could
say.
Out-of-the-box Solr is pure relevance ranked. By the definition in the
Wikipedia-article, it is already Organic Search. I think you need to go
back to your client and ask what the client thinks "Organic Search" is.
--
Toke Eskildsen, Royal Danish Library
single physical machine that could be an explanation.
What is your hardware-setup?
--
Toke Eskildsen, Royal Danish Library
an expert in
segment merge mechanics).
We're also using a 1 Solr/shard setup, but with SolrCloud. Our initial
rationale for 1 Solr/shard was to avoid long GC-pauses due to large
heaps, but that does not seem to be a problem here. Now we stick to it
as it works fine and makes for simple lo
Nawab Zada Asad Iqbal wrote:
> @Toke, I stumbled upon your page last week but it seems that your huge
> index doesn't receive a lot of query traffic.
It switches between two kinds of usage:
Everyday use is very low traffic by researchers using it interactively: 1-2
simultaneous queries, with fa
Shawn Heisey wrote:
> On 5/24/2017 3:44 AM, Toke Eskildsen wrote:
>> It is relatively easy to downgrade to an earlier release within the
>> same major version. We have not switched to 6.5.1 simply because we
>> have no pressing need for it - Solr 6.3 works well for us.
&
works well for us.
I guess it depends quite a bit on your need for stability. We are a
library and uptime is only "best effort".
--
Toke Eskildsen, Royal Danish Library
he filter-cache (secondarily the other caches, but the filter-cache tends to
be the large one). A heap of 10GB might very well be fine for handling your
whole 50GB index. If that is on a 64GB machine, the remaining 54GB of RAM
(minus the other stuff that is running) ought to ensure a fully cached
Shawn Heisey wrote:
> Adding more shards as Toke suggested *might* help,[...]
I seem to have phrased my suggestion poorly. What I meant to suggest was a
switch to a single shard (with 4 replicas) setup, instead of the current 2
shards (with 2 replicas).
- Toke
Why don't you use q instead of fq for the
part of your request that changes?
--
Toke Eskildsen, Royal Danish Library
Chetas Joshi wrote:
> Thanks for the insights into the memory requirements. Looks like cursor
> approach is going to require a lot of memory for millions of documents.
Sorry, that is a premature conclusion from your observations.
> If I run a query that returns only 500K documents still keeping
? Does it mean solr will serve stale data( i.e.
> send stale data to the slaves) ignoring the changes from the second
> commit? [...]
Sorry, I am not that familiar with the details of master-slave-setups.
--
Toke Eskildsen, Royal Danish Library
wo problems may be linked.
Quick sanity check: Look for "Overlapping onDeckSearchers" in your
solr.log to see if your memory problems are caused by multiple open
searchers:
https://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F
--
Toke Eskildsen, Royal Danish Library
tand the expected gain of adding replicas, if the data are
remote. Why can't the replica Solrs run on the nodes with the data? Do you have
very CPU-intensive search?
- Toke Eskildsen
e.
You can get a detailed breakdown by doing VisualVM profiling with a
snapshot instead of sampling, but be prepared to restart your Solr
afterwards as that is quite intrusive.
Another (and simpler) option would be to check how much IO-wait there
is with 'top' from a shell.
- Toke Eskildsen