Thanks for the feedback.

What I am trying to do is "abuse" integer fields to store 8-bit (or even
smaller) values of metrics I use for content-based image/video search (such
as statistical values describing color distribution) and then implement
similarity calculations based on vector-distance formulas. The index can
become large (tens of millions of documents, each with, say, 50-100
integers describing the image metrics). I plan to use some of those metrics
to select a subset of images via range queries and then a larger set of
them to sort the result set by relevance.
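
To give a rough sense of scale, here is a back-of-the-envelope sketch of
the raw size of the metric values alone. The figure of 50 million documents
is just an assumed stand-in for "tens of millions", and the function name
is made up for illustration. The result ignores all index overhead and any
packing Lucene applies, so it is not a prediction of actual index size.

```python
def raw_size_bytes(num_docs, values_per_doc, bits_per_value):
    """Uncompressed size of the metric values alone, ignoring index overhead."""
    return num_docs * values_per_doc * bits_per_value // 8

# Assumed figures: 50 million documents, 100 metrics per document.
docs = 50_000_000
values = 100

as_int32 = raw_size_bytes(docs, values, 32)  # every value stored as a full 32-bit int
as_8bit = raw_size_bytes(docs, values, 8)    # every value stored in 8 bits

print(f"32-bit: {as_int32 / 1e9:.0f} GB, 8-bit: {as_8bit / 1e9:.0f} GB")
```

Under these assumptions the difference is roughly 20 GB versus 5 GB of raw
values, which is why I am asking about the "wasted" bits.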

I first looked at implementing those metrics as binary fields (see my
other posting) and then using a custom function for the distance
calculation, but so far I have the impression that this approach is not
really well supported by Solr. Base64 encoding/decoding would kill
performance, and implementing a custom field type with everything that is
probably required for it to work properly is currently beyond my Solr
knowledge. Besides, using built-in Solr features makes it easier to
fine-tune and experiment with different approaches, because I can just play
around with different queries and see what works best, without adjusting a
custom function each time.
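
As an illustration of the kind of query I have in mind, here is a minimal
sketch that builds the request parameters: range filters on some metrics
plus a sort on Solr's built-in dist() function query. The field names
(metric_1 etc.), the helper name and the sample values are hypothetical,
and I have not verified how well dist() performs with 50-100 arguments.

```python
from urllib.parse import urlencode

def build_similarity_query(query_vector, filter_ranges, power=2):
    """Build Solr request parameters for a range-filtered, distance-sorted search.

    query_vector:  dict of field name -> metric value of the reference image
    filter_ranges: dict of field name -> (low, high) inclusive bounds
    """
    fields = list(query_vector)
    # dist(power, f1, ..., fn, v1, ..., vn): distance between the document's
    # field values and the constant reference vector (power=2 is Euclidean).
    args = [str(power)] + fields + [str(query_vector[f]) for f in fields]
    distance = "dist(" + ",".join(args) + ")"
    return {
        "q": "*:*",
        # One filter query per range-restricted metric.
        "fq": [f"{name}:[{lo} TO {hi}]" for name, (lo, hi) in filter_ranges.items()],
        # Smallest distance first, i.e. most similar images first.
        "sort": distance + " asc",
    }

params = build_similarity_query(
    {"metric_1": 120, "metric_2": 35, "metric_3": 200},
    {"metric_1": (100, 140)},
)
query_string = urlencode(params, doseq=True)
```

Trying a different distance or a different subset of metrics then only
means changing the arguments, not redeploying a custom function.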

I hope that provides a better picture of what I am trying to achieve.

Best,

Robert

On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Under the covers, Lucene stores ints in a packed format, so I'd just count
> on that for a first pass.
>
> What is "a lot of integer values"? Hundreds of millions? Billions?
> Trillions?
>
> Unless you give us some indication of scale, it's hard to say anything
> helpful. But unless you have some evidence that you're going to blow out
> memory I'd just ignore the "wasted" bits. Especially if you can use
> docValues, that option holds much of the underlying data in MMapDirectory
> that uses swappable OS memory....
>
> Best,
> Erick
>
> On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de>
> wrote:
> > Hi,
> >
> > I have a data model where I would store and index a lot of integer values
> > with a very restricted range (e.g. 0-255), so theoretically the 32 bits of
> > Solr's integer fields are complete overkill. I want to be able to do things
> > like vector distance calculations on those fields. Should I worry about the
> > "wasted" bits or will Solr compress/organize the index in a way that
> > compensates for this if there are only 256 (or even fewer) distinct values?
> >
> > Any recommendations on how my fields should be defined to make things like
> > numeric functions work as fast as technically possible?
> >
> > Thanks in advance,
> >
> > Robert
>



-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
