Robert,

From what I know, both the inverted index and docValues compress their content considerably, and stored fields are compressed too. So I think you have a good chance of experimenting successfully. You might need to tweak the schema to disable storing unnecessary information in the index.
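For a rough sense of what is at stake, here is a back-of-the-envelope sketch in Python. It is not a model of Lucene's actual on-disk format; it only compares the raw size of the values at a full 32 bits with the minimum number of bits a 0-255 range needs. The function names are made up for illustration, and the document and field counts are assumptions taken from the upper end of the figures in your mail.

```python
def bits_per_value(distinct_values: int) -> int:
    """Minimum number of bits needed to tell `distinct_values` values apart."""
    if distinct_values < 1:
        raise ValueError("need at least one distinct value")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 2; use at least 1 bit.
    return max(1, (distinct_values - 1).bit_length())


def raw_size_bytes(docs: int, fields_per_doc: int, bits: int) -> int:
    """Raw payload size in bytes, ignoring all index overhead and compression."""
    return docs * fields_per_doc * bits // 8


# Assumed scale: 50 million documents, 100 metric fields each.
DOCS = 50_000_000
FIELDS = 100

full_width = raw_size_bytes(DOCS, FIELDS, 32)
packed = raw_size_bytes(DOCS, FIELDS, bits_per_value(256))

print(f"32-bit values: {full_width / 1e9:.1f} GB")
print(f"packed values: {packed / 1e9:.1f} GB")
```

The real numbers depend on the codec, on which of indexed/stored/docValues you enable per field, and on how the values are distributed, so measuring an actual index on a sample of your data is the only reliable check.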
On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüger <krue...@lesspain.de> wrote:

> Thanks for the feedback.
>
> What I am trying to do is to "abuse" integers to store 8-bit (or even
> lower) values of metrics I use for content-based image/video search (such
> as statistical values regarding color distribution) and then implement
> similarity calculations based on formulas using vector distances. The
> index can become large (tens of millions of documents, each with, say,
> 50-100 integers describing the image metrics). I am looking at using a
> part of those metrics for selecting a subset of images using range
> queries and then more for sorting the result set by relevance.
>
> I was first looking at implementing those metrics as binary fields (see
> other posting) and then using a custom function for the distance
> calculation, but so far I got the impression that this way is not
> supported really well by Solr. Base64 encoding/decoding would kill
> performance, and implementing a custom field type with all that is
> probably required for that to work properly is currently beyond my Solr
> knowledge. Besides, using built-in Solr features makes it easier to
> fine-tune/experiment with different approaches, because I can just play
> around with different queries and see what works best, without adjusting
> a custom function each time.
>
> I hope that provides a better picture of what I am trying to achieve.
>
> Best,
>
> Robert
>
> On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Under the covers, Lucene stores ints in a packed format, so I'd just
> > count on that for a first pass.
> >
> > What is "a lot of integer values"? Hundreds of millions? Billions?
> > Trillions?
> >
> > Unless you give us some indication of scale, it's hard to say anything
> > helpful. But unless you have some evidence that you're going to blow
> > out memory, I'd just ignore the "wasted" bits. Especially if you can
> > use docValues: that option holds much of the underlying data in
> > MMapDirectory, which uses swappable OS memory....
> >
> > Best,
> > Erick
> >
> > On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de>
> > wrote:
> >
> > > Hi,
> > >
> > > I have a data model where I would store and index a lot of integer
> > > values with a very restricted range (e.g. 0-255), so theoretically
> > > the 32 bits of Solr's integer fields are complete overkill. I want to
> > > be able to do things like vector distance calculations on those
> > > fields. Should I worry about the "wasted" bits or will Solr
> > > compress/organize the index in a way that compensates for this if
> > > there are only 256 (or even fewer) distinct values?
> > >
> > > Any recommendations on how my fields should be defined to make things
> > > like numeric functions work as fast as technically possible?
> > >
> > > Thanks in advance,
> > >
> > > Robert
> >
>
> --
> Robert Krüger
> Managing Partner
> Lesspain GmbH & Co. KG
>
> www.lesspain-software.com

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>