I'd still like to see a very clear statement of how data is stored in Lucene. For example, is there any increase in index size if you placed your 32-bit integers in a long field? Could somebody make a clear statement about what the index packing/compression would actually do - not the actual algorithm, but the effect for a bunch of the common use cases.
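To make the question concrete, here is a rough sketch of the arithmetic behind bit-packing in general. It is only an illustration, not Lucene's code: `BitsPerValue` and `bitsRequired` are made-up names, and it assumes each value is stored as an offset from the smallest value in a block, which is how packing schemes commonly work. Under that assumption the cost depends on the range of values actually present, not on the declared field width, so 0-255 data would need 8 bits whether the field is an int or a long. Whether and where Lucene does this is exactly what I would like to see confirmed.

```java
final class BitsPerValue {
    // Smallest number of bits that can hold every value in [min, max]
    // when each value is stored as (value - min).
    // Assumes max >= min and that max - min does not overflow a long.
    static int bitsRequired(long min, long max) {
        long range = max - min;
        return range == 0 ? 0 : 64 - Long.numberOfLeadingZeros(range);
    }

    public static void main(String[] args) {
        // Values restricted to 0-255, as in the original question.
        System.out.println(bitsRequired(0, 255));
        // Values spanning the whole 32-bit integer range.
        System.out.println(bitsRequired(Integer.MIN_VALUE, Integer.MAX_VALUE));
    }
}
```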
-- Jack Krupansky

On Sun, Oct 18, 2015 at 10:18 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> On the surface this seems like something of a distraction.
>
> 10M docs x 100 values/doc = 1B integers, assuming all
> need to be held in memory at once. My straw-man proposal:
> it would be much cheaper to just provision each JVM
> with an additional couple of gigs of memory and forget about it.
> Feel free to disagree of course, I'm really asking whether
> the engineering effort/debugging/whatever is worth it, effort
> that could be put towards adding some killer feature....
>
> Assuming the answer is that it _is_ worth the effort, I'd
> think about a custom ValueSource or FieldType
> that just packed standard int (or long) values with bytes and
> then just a multiValued int (maybe long) field in the schema.
> Then you'd have to do some bit twiddling to manipulate individual
> values. Mind you I'm waving my hands here a _lot_..
>
> Best,
> Erick
>
> On Sat, Oct 17, 2015 at 3:15 AM, Robert Krüger <krue...@lesspain.de> wrote:
>
> > Thanks for the feedback.
> >
> > What I am trying to do is to "abuse" integers to store 8-bit (or even lower)
> > values of metrics I use for content-based image/video search (such as
> > statistical values regarding color distribution) and then implement
> > similarity calculations based on formulas using vector distances. The index
> > can become large (tens of millions of documents each with say 50-100
> > integers describing the image metrics). I am looking at using a part of
> > those metrics for selecting a subset of images using range queries and then
> > more for sorting the result set by relevance.
> >
> > I was first looking at implementing those metrics as binary fields (see
> > other posting) and then use a custom function for the distance calculation
> > but so far I got the impression that way is not supported really well by
> > Solr. Base64-En/Decoding would kill performance and implementing a custom
> > field type with all that is probably required for that to work properly is
> > currently beyond my Solr knowledge. Besides, using built-in Solr features
> > makes it easier to fine-tune/experiment with different approaches, because I
> > can just play around with different queries and see what works best,
> > without each time adjusting a custom function.
> >
> > I hope that provides a better picture of what I am trying to achieve.
> >
> > Best,
> >
> > Robert
> >
> > On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> Under the covers, Lucene stores ints in a packed format, so I'd just count
> >> on that for a first pass.
> >>
> >> What is "a lot of integer values"? Hundreds of millions? Billions?
> >> Trillions?
> >>
> >> Unless you give us some indication of scale, it's hard to say anything
> >> helpful. But unless you have some evidence that you're going to blow out
> >> memory I'd just ignore the "wasted" bits. Especially if you can use
> >> docValues, that option holds much of the underlying data in MMapDirectory
> >> that uses swappable OS memory....
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de> wrote:
> >> > Hi,
> >> >
> >> > I have a data model where I would store and index a lot of integer values
> >> > with a very restricted range (e.g. 0-255), so theoretically the 32 bits of
> >> > Solr's integer fields are complete overkill. I want to be able to do things
> >> > like vector distance calculations on those fields. Should I worry about the
> >> > "wasted" bits or will Solr compress/organize the index in a way that
> >> > compensates for this if there are only 256 (or even fewer) distinct values?
> >> >
> >> > Any recommendations on how my fields should be defined to make things like
> >> > numeric functions work as fast as technically possible?
> >> >
> >> > Thanks in advance,
> >> >
> >> > Robert
> >> >
> >
> > --
> > Robert Krüger
> > Managing Partner
> > Lesspain GmbH & Co. KG
> >
> > www.lesspain-software.com
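
The "bit twiddling" mentioned in the thread can be sketched roughly as follows. This is a minimal illustration in plain Java, assuming eight metrics in the range 0-255 are packed into one long; `PackedMetrics`, `pack`, `unpack` and `squaredDistance` are hypothetical names, not part of Solr or Lucene, and wiring something like this into a custom ValueSource or FieldType is not shown.

```java
final class PackedMetrics {
    // Packs up to eight values in the range 0-255 into one long,
    // with values[0] in the lowest byte. Hypothetical helper, not a Solr API.
    static long pack(int[] values) {
        if (values.length > 8) {
            throw new IllegalArgumentException("at most 8 values fit in a long");
        }
        long packed = 0L;
        for (int i = 0; i < values.length; i++) {
            if (values[i] < 0 || values[i] > 255) {
                throw new IllegalArgumentException("value out of range 0-255: " + values[i]);
            }
            packed |= ((long) values[i]) << (8 * i);
        }
        return packed;
    }

    // Extracts the 8-bit value stored at the given byte position (0-7).
    static int unpack(long packed, int index) {
        return (int) ((packed >>> (8 * index)) & 0xFF);
    }

    // Squared Euclidean distance between two packed vectors of eight metrics.
    // The maximum result is 8 * 255 * 255, which fits comfortably in an int.
    static int squaredDistance(long a, long b) {
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            int d = unpack(a, i) - unpack(b, i);
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        long a = pack(new int[] {0, 0, 10});
        long b = pack(new int[] {3, 4, 10});
        System.out.println(squaredDistance(a, b));
    }
}
```

Note the trade-off: once several metrics share one stored value, a range query on a single metric no longer works directly against the field, so packing like this only suits the metrics used purely for distance-based sorting.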