Re: Efficiency of integer storage/use

Jack Krupansky Sun, 18 Oct 2015 07:42:00 -0700

I'd still like to see a very clear statement of how data is stored in
Lucene. For example, is there any increase in index size if you placed your
32-bit integers in a long field? Could somebody make a clear statement
about what the index packing/compression would actually do - not the actual
algorithm, but the effect for a bunch of the common use cases?
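
As a rough illustration of what "packed" usually means, here is a toy model of fixed-width bit packing. It is an assumption for discussion only and not Lucene's actual encoding; the function names are made up. In this model the width per value depends on the largest value present, not on whether the field was declared as int or long.

```python
def bits_required(max_value: int) -> int:
    """Bits needed to represent any value in 0..max_value (at least 1)."""
    return max(1, max_value.bit_length())


def packed_size_bytes(values) -> int:
    """Size of a block where every value uses the width of the largest one."""
    width = bits_required(max(values))
    # Round the total bit count up to whole bytes.
    return (len(values) * width + 7) // 8


# One million values restricted to 0..255, as in the original question.
values = [v % 256 for v in range(1_000_000)]

as_int_bytes = len(values) * 4    # naive 32-bit storage
as_long_bytes = len(values) * 8   # naive 64-bit storage
packed_bytes = packed_size_bytes(values)

print(as_int_bytes, as_long_bytes, packed_bytes)
```

Under this model the packed size is the same whether the values started life as ints or longs. Whether a given Lucene structure (doc values, indexed terms, stored fields) behaves this way is exactly the question being asked here, and this thread does not settle it.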


-- Jack Krupansky

On Sun, Oct 18, 2015 at 10:18 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> On the surface this seems like something of a distraction.
>
> 10M docs x 100 values/doc = 1B integers, assuming all
> need to be held in memory at once. My straw-man proposal:
> it would be much cheaper to just provision each JVM
> with an additional couple of gig memory and forget about it.
> Feel free to disagree of course, I'm really asking whether
> the engineering effort/debugging/whatever is worth it, effort
> that could be put towards adding some killer feature....
>
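The estimate above can be worked through as a back-of-the-envelope calculation. This sketch assumes raw, uncompressed storage with no per-document overhead, which is a simplification; the names are illustrative.

```python
DOCS = 10_000_000       # 10M documents, from the estimate above
VALUES_PER_DOC = 100    # upper end of the 50-100 values per document


def raw_size_gib(bytes_per_value: int) -> float:
    """Raw size in GiB if every value takes bytes_per_value bytes."""
    total_values = DOCS * VALUES_PER_DOC
    return total_values * bytes_per_value / 2**30


as_ints = raw_size_gib(4)    # 32-bit ints: roughly 3.73 GiB
as_bytes = raw_size_gib(1)   # 8-bit values: roughly 0.93 GiB

print(f"ints:  {as_ints:.2f} GiB")
print(f"bytes: {as_bytes:.2f} GiB")
print(f"saved: {as_ints - as_bytes:.2f} GiB")
```

So the most that packing to 8 bits could save in this worst case is a little under 3 GiB, which is the amount being weighed against the engineering effort.
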
> Assuming the answer is that it _is_ worth the effort, I'd
> think about a custom ValueSource or FieldType
> that just packed standard int (or long) values with bytes and
> then just a multiValued int (maybe long) field in the schema.
> Then you'd have to do some bit twiddling to manipulate individual
> values. Mind you, I'm waving my hands here a _lot_...
>
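The bit twiddling mentioned above might look like the following minimal sketch, which packs four 8-bit values into one 32-bit word. It is not a Solr FieldType or ValueSource, just the arithmetic; the helper names are hypothetical. Since Java ints are signed, a conversion to the signed 32-bit range is included.

```python
def pack4(a: int, b: int, c: int, d: int) -> int:
    """Pack four values in 0..255 into one unsigned 32-bit word."""
    for v in (a, b, c, d):
        if not 0 <= v <= 255:
            raise ValueError(f"value out of 8-bit range: {v}")
    return (a << 24) | (b << 16) | (c << 8) | d


def unpack4(word: int) -> tuple:
    """Recover the four 8-bit values from an unsigned 32-bit word."""
    return (
        (word >> 24) & 0xFF,
        (word >> 16) & 0xFF,
        (word >> 8) & 0xFF,
        word & 0xFF,
    )


def to_signed32(word: int) -> int:
    """Reinterpret an unsigned 32-bit word as a signed (Java-style) int."""
    return word - (1 << 32) if word >= (1 << 31) else word


def from_signed32(value: int) -> int:
    """Inverse of to_signed32."""
    return value & 0xFFFFFFFF


packed = pack4(12, 200, 7, 255)
stored = to_signed32(packed)               # what would go into an int field
print(unpack4(from_signed32(stored)))      # (12, 200, 7, 255)
```

One consequence of this design: a range query on the packed integer no longer maps directly onto a range over one of the packed values, so fields used for range filtering would probably need to stay unpacked.
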
> Best,
> Erick
>
> On Sat, Oct 17, 2015 at 3:15 AM, Robert Krüger <krue...@lesspain.de>
> wrote:
> > Thanks for the feedback.
> >
> > What I am trying to do is to "abuse" integers to store 8-bit (or even
> > lower) values of metrics I use for content-based image/video search
> > (such as statistical values regarding color distribution) and then
> > implement similarity calculations based on formulas using vector
> > distances. The index can become large (tens of millions of documents,
> > each with say 50-100 integers describing the image metrics). I am
> > looking at using a part of those metrics for selecting a subset of
> > images using range queries and then more for sorting the result set by
> > relevance.
> >
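The filter-then-rank approach described above can be sketched outside Solr as follows. This only illustrates the logic (range filter on one metric, then sort by vector distance); the document ids, the metric layout, and the choice of Euclidean distance are assumptions, not details from the thread.

```python
import math


def euclidean(a, b) -> float:
    """Euclidean distance between two equal-length metric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query, docs, filter_index, lo, hi):
    """Keep docs whose metric at filter_index lies in [lo, hi],
    then order the remaining ids by distance to the query vector."""
    candidates = {
        doc_id: vec
        for doc_id, vec in docs.items()
        if lo <= vec[filter_index] <= hi
    }
    return sorted(candidates, key=lambda doc_id: euclidean(query, candidates[doc_id]))


# Hypothetical documents, each with three 8-bit metrics.
docs = {
    "a": [10, 200, 30],
    "b": [12, 190, 35],
    "c": [200, 10, 5],
}
print(rank([11, 195, 32], docs, filter_index=0, lo=0, hi=50))
```

In Solr terms the filter step would correspond to a range query and the ranking step to sorting by a function of the metric fields.
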
> > I was first looking at implementing those metrics as binary fields (see
> > other posting) and then using a custom function for the distance
> > calculation, but so far I got the impression that this approach is not
> > supported really well by Solr. Base64 encoding/decoding would kill
> > performance, and implementing a custom field type with all that is
> > probably required for that to work properly is currently beyond my
> > Solr knowledge. Besides, using built-in Solr features makes it easier
> > to fine-tune/experiment with different approaches, because I can just
> > play around with different queries and see what works best, without
> > adjusting a custom function each time.
> >
> > I hope that provides a better picture of what I am trying to achieve.
> >
> > Best,
> >
> > Robert
> >
> > On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Under the covers, Lucene stores ints in a packed format, so I'd just
> >> count on that for a first pass.
> >>
> >> What is "a lot of integer values"? Hundreds of millions? Billions?
> >> Trillions?
> >>
> >> Unless you give us some indication of scale, it's hard to say anything
> >> helpful. But unless you have some evidence that you're going to blow out
> >> memory, I'd just ignore the "wasted" bits, especially if you can use
> >> docValues: that option holds much of the underlying data in
> >> MMapDirectory, which uses swappable OS memory....
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de>
> >> wrote:
> >> > Hi,
> >> >
> >> > I have a data model where I would store and index a lot of integer
> >> > values with a very restricted range (e.g. 0-255), so theoretically the
> >> > 32 bits of Solr's integer fields are complete overkill. I want to be
> >> > able to do things like vector distance calculations on those fields.
> >> > Should I worry about the "wasted" bits or will Solr compress/organize
> >> > the index in a way that compensates for this if there are only 256 (or
> >> > even fewer) distinct values?
> >> >
> >> > Any recommendations on how my fields should be defined to make things
> >> > like numeric functions work as fast as technically possible?
> >> >
> >> > Thanks in advance,
> >> >
> >> > Robert
> >>
> >
> >
> >
> > --
> > Robert Krüger
> > Managing Partner
> > Lesspain GmbH & Co. KG
> >
> > www.lesspain-software.com
>
