What I'd say is that there are *substantial* optimisations done already
when indexing terms, especially numerical ones, e.g. looking for common
divisors. Look out for a talk by Adrien Grand at Berlin Buzzwords
earlier this year for a taste of it.
I don't know how much of this kind of optimisation
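The common-divisor trick mentioned above can be sketched in a few lines. This is my own toy illustration of the idea (Lucene's actual numeric DocValues encoding is considerably more involved): when all values in a block share a divisor, you store the divisor once and the much smaller quotients.

```python
from functools import reduce
from math import gcd

def gcd_compress(values):
    """Store values sharing a common divisor as (divisor, quotients).
    The quotients need far fewer bits than the raw values."""
    d = reduce(gcd, values)
    return d, [v // d for v in values]

def gcd_decompress(divisor, quotients):
    """Reverse the transform: multiply each quotient back out."""
    return [q * divisor for q in quotients]

# Daily timestamps in milliseconds all share a divisor of 86400000,
# so only the tiny quotients 1, 2, 3 need to be stored per value.
timestamps = [86400000, 172800000, 259200000]
d, qs = gcd_compress(timestamps)
assert gcd_decompress(d, qs) == timestamps
```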
Thanks, everyone, for your answers. I will probably set up a simple
parametric test, pumping a Solr index full of those very-limited-range
integers and then sorting by vector distances, to see what the performance
characteristics are.
On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev <mkhlud...@gr
Robert,
From what I know, both the inverted index and docvalues compress content
heavily, and even stored fields are compressed. So I think you have a good
chance of experimenting successfully. You might need to tweak the schema to
disable storing unnecessary info in the index.
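For instance (field name hypothetical, assuming a classic Solr 5.x schema.xml with Trie fields), a numeric metric field used only for sorting and function queries can keep just docValues and skip both the inverted index and the stored copy:

```xml
<!-- Hypothetical metric field: docValues alone is enough for sorting
     and function queries; indexed="false" and stored="false" save the
     space the inverted index and stored copy would otherwise take. -->
<field name="color_metric_0" type="int" indexed="false" stored="false"
       docValues="true"/>
```

Whether you can drop `indexed` depends on whether you ever need to filter or search on the field, not just sort and compute on it.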
On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüge
I'd still like to see a very clear statement of how data is stored in
Lucene. For example, is there any increase in index size if you place your
32-bit integers in a long field? Could somebody make a clear statement
about what the index packing/compression would actually do - not the actual
algorithms.
On the surface this seems like something of a distraction.
10M docs x 100 values/doc = 1B integers, assuming all
need to be held in memory at once. My straw-man proposal:
it would be much cheaper to just provision each JVM
with an additional couple of GB of memory and forget about it.
Feel free to
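The back-of-the-envelope numbers above work out as follows (my own arithmetic, using the 10M x 100 figures from this thread):

```python
# Worst-case heap needed if every value is held in memory at once.
docs = 10_000_000
values_per_doc = 100
total_ints = docs * values_per_doc       # 1 billion values

as_int32 = total_ints * 4                # naive 32-bit storage
as_bytes = total_ints * 1                # values fit in 0-255, one byte each

print(as_int32 / 2**30)                  # roughly 3.7 GiB as raw 32-bit ints
print(as_bytes / 2**30)                  # roughly 0.9 GiB packed to a byte
```

So even the naive 32-bit layout fits in "an additional couple of GB", which is the point being made.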
Thanks for the feedback.
What I am trying to do is "abuse" integers to store 8-bit (or even
lower-precision) values of metrics I use for content-based image/video
search (such as statistical values describing color distribution) and then
implement similarity calculations based on formulas using vector distances.
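As a sketch of that "abuse" (my own illustration, not anything Solr does for you): four 0-255 metric values fit exactly into one 32-bit int, and a squared Euclidean distance can be computed after unpacking.

```python
# Pack four 8-bit metric values into one 32-bit int and compute a
# squared Euclidean distance between two packed vectors.
def pack4(a, b, c, d):
    assert all(0 <= x <= 255 for x in (a, b, c, d))
    return (a << 24) | (b << 16) | (c << 8) | d

def unpack4(n):
    return ((n >> 24) & 0xFF, (n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF)

def dist2(u, v):
    """Squared Euclidean distance between two packed 4-vectors."""
    return sum((x - y) ** 2 for x, y in zip(unpack4(u), unpack4(v)))

a = pack4(10, 20, 30, 40)
b = pack4(12, 20, 30, 40)
print(dist2(a, b))  # (10-12)^2 = 4
```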
Under the covers, Lucene stores ints in a packed format, so I'd just count
on that for a first pass.
What is "a lot of integer values"? Hundreds of millions? Billions? Trillions?
Unless you give us some indication of scale, it's hard to say anything
helpful. But unless you have some evidence that
Hi Robert,
Current Solr compression will work really well, both for Stored and
DocValues content.
Regarding the index term dictionaries, I'll ask some other experts for
help, as I have never checked how the actual compression works there, but I
assume it is quite efficient.
Usually the field type aff
Hi,
I have a data model where I would store and index a lot of integer values
with a very restricted range (e.g. 0-255), so theoretically the 32 bits of
Solr's integer fields are complete overkill. I want to be able to do things
like vector distance calculations on those fields. Should I worry abo
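For what it's worth, a Euclidean distance over individual numeric fields can already be expressed with stock Solr function queries (field names here are hypothetical; `dist` is a standard Solr function taking a power, a list of fields, and a list of constants), e.g. sorting by distance to a query point (10, 20):

```
q=*:*&sort=dist(2, metric_0, metric_1, 10, 20) asc
```

Whether this is fast enough over 100 fields per document is exactly what a parametric test like the one discussed above would tell you.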