On the surface this seems like something of a distraction.

10M docs x 100 values/doc = 1B integers, assuming all of them
need to be held in memory at once. My straw-man proposal:
it would be much cheaper to just provision each JVM
with an additional couple of gigs of memory and forget about it.
Feel free to disagree of course; I'm really asking whether
the engineering/debugging effort is worth it, effort
that could be put towards adding some killer feature....

Assuming the answer is that it _is_ worth the effort, I'd
think about a custom ValueSource or FieldType
that packs standard int (or long) values from individual bytes,
plus a multiValued int (maybe long) field in the schema.
Then you'd have to do some bit twiddling to manipulate individual
values. Mind you, I'm waving my hands here a _lot_...
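
To make that slightly less hand-wavy, here's a minimal sketch of the
packing/unpacking I have in mind, assuming four 0-255 values per int
(the class name and layout are made up for illustration):

    // Sketch only: pack four 8-bit metric values (0-255) into one int
    // and pull individual values back out with shifts and masks.
    public final class PackedMetrics {

        // Pack values[0..3] into a single int; values[0] ends up in the
        // high byte.
        static int pack(int[] values) {
            int packed = 0;
            for (int i = 0; i < 4; i++) {
                packed = (packed << 8) | (values[i] & 0xFF);
            }
            return packed;
        }

        // Extract the i-th 8-bit value (i = 0..3) from a packed int.
        static int unpack(int packed, int i) {
            return (packed >>> (8 * (3 - i))) & 0xFF;
        }

        public static void main(String[] args) {
            int packed = pack(new int[] {12, 200, 0, 255});
            System.out.println(unpack(packed, 1)); // prints 200
        }
    }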

Best,
Erick

On Sat, Oct 17, 2015 at 3:15 AM, Robert Krüger <krue...@lesspain.de> wrote:
> Thanks for the feedback.
>
> What I am trying to do is to "abuse" integers to store 8-bit (or even
> lower) values of metrics I use for content-based image/video search (such
> as statistical values regarding color distribution) and then implement
> similarity calculations based on formulas using vector distances. The
> index can become large (tens of millions of documents, each with say
> 50-100 integers describing the image metrics). I am looking at using a
> part of those metrics for selecting a subset of images using range
> queries, and then more of them for sorting the result set by relevance.
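>
> To make "vector distances" concrete: the formulas I have in mind are
> plain distances over the metric values, e.g. Euclidean (a sketch; the
> array representation is just for illustration):
>
>     // Euclidean distance between two metric vectors of 0-255 ints.
>     static double distance(int[] a, int[] b) {
>         long sum = 0;
>         for (int i = 0; i < a.length; i++) {
>             long d = a[i] - b[i];
>             sum += d * d;
>         }
>         return Math.sqrt(sum);
>     }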
>
> I was first looking at implementing those metrics as binary fields (see
> other posting) and then using a custom function for the distance
> calculation, but so far I have the impression that approach is not really
> well supported by Solr. Base64 encoding/decoding would kill performance,
> and implementing a custom field type with everything required for it to
> work properly is currently beyond my Solr knowledge. Besides, using
> built-in Solr features makes it easier to fine-tune/experiment with
> different approaches, because I can just play around with different
> queries and see what works best, without adjusting a custom function each
> time.
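>
> For example (field names made up, assuming one single-valued int field
> per metric), I imagine queries along the lines of:
>
>     q=*:*&fq=m1:[100 TO 200]&sort=dist(2,m1,m2,m3,128,64,32) asc
>
> i.e. select a subset by range on one metric, then sort by Euclidean
> distance to a reference vector. sqedist(...) would order results the
> same way while skipping the sqrt.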
>
> I hope that provides a better picture of what I am trying to achieve.
>
> Best,
>
> Robert
>
> On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Under the covers, Lucene stores ints in a packed format, so I'd just count
>> on that for a first pass.
>>
>> What is "a lot of integer values"? Hundreds of millions? Billions?
>> Trillions?
>>
>> Unless you give us some indication of scale, it's hard to say anything
>> helpful. But unless you have some evidence that you're going to blow out
>> memory, I'd just ignore the "wasted" bits. Especially if you can use
>> docValues, since that option holds much of the underlying data in
>> MMapDirectory, which uses swappable OS memory....
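>>
>> For reference, getting docValues is just a schema attribute; something
>> like this (field name made up):
>>
>>     <field name="metric_1" type="int" indexed="true" stored="false"
>>            docValues="true"/>
>>
>> where "int" is the stock solr.TrieIntField type from the example schema.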
>>
>> Best,
>> Erick
>>
>> On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de>
>> wrote:
>> > Hi,
>> >
>> > I have a data model where I would store and index a lot of integer
>> > values with a very restricted range (e.g. 0-255), so theoretically the
>> > 32 bits of Solr's integer fields are complete overkill. I want to be
>> > able to do things like vector distance calculations on those fields.
>> > Should I worry about the "wasted" bits, or will Solr compress/organize
>> > the index in a way that compensates for this if there are only 256 (or
>> > even fewer) distinct values?
>> >
>> > Any recommendations on how my fields should be defined to make things
>> > like numeric functions work as fast as technically possible?
>> >
>> > Thanks in advance,
>> >
>> > Robert
>>
>
>
>
> --
> Robert Krüger
> Managing Partner
> Lesspain GmbH & Co. KG
>
> www.lesspain-software.com
