Would Solr use multithreading to process the records of a function query as described above? In my scenario, concurrent searches are not the issue; rather, the speed of a single query is the optimization target. Or will I have to set up distributed search to achieve that?
Thanks,
Robert

On Tue, Jun 10, 2014 at 10:11 AM, Robert Krüger <krue...@lesspain.de> wrote:
> Great, I was hoping for that. In my case I will have to deal with the
> worst-case scenario, i.e. all documents matching the query, because
> the only criterion is the fingerprint and the result of the
> distance/similarity function, which will have to be executed for every
> document. However, I am dealing with a scenario where there will not
> be many concurrent users.
>
> Thank you.
>
> On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein <joels...@gmail.com> wrote:
>> You only need fast access to the fingerprint field, so only that
>> field needs to be in memory. You'll want to review how Lucene DocValues
>> and the FieldCache work. Sorting is done with a PriorityQueue, so only
>> the top N docs are kept in memory.
>>
>> You'll only need to access the fingerprint field values for documents
>> that match the query, so it won't be a full table scan unless all the
>> docs match the query.
>>
>> Sounds like an interesting project. Please keep us posted.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>>
>> On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger <krue...@lesspain.de> wrote:
>>
>>> Hi,
>>>
>>> let's say I have an index that contains a field of type BinaryField
>>> called "fingerprint" that stores a few (let's say 100) bytes that are
>>> some kind of digital fingerprint-like thing.
>>>
>>> Let's say I want to perform queries on that field to achieve sorting
>>> or filtering based on a custom distance function "customDistance",
>>> i.e. I supply a reference fingerprint and Solr either returns all
>>> documents sorted by
>>> customDistance(<referenceFingerprint>, <documentFingerprint>) or uses
>>> that value in an frange expression for filtering.
>>>
>>> I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
>>> understand that using function queries with a custom function is
>>> expensive, as it results in what would be called a "full table scan"
>>> in the SQL world, i.e. the data from all documents needs to be touched
>>> to select the matching documents or to sort by the function's result.
>>>
>>> Given all that, and provided I have to use a custom function for my
>>> needs, I would like to know a few more details about the Solr
>>> architecture to understand what I have to look out for.
>>>
>>> I will potentially have millions of records. Does the data contained
>>> in other index fields play a role for RAM usage when I only use the
>>> "fingerprint" field for sorting and searching? I am hoping that my RAM
>>> only needs to accommodate the fingerprint data of all documents for
>>> queries to be fast, not the fingerprint data plus all other indexed or
>>> stored data.
>>>
>>> Example: my fingerprint data needs 100 bytes per document, my other
>>> indexed field data needs 900 bytes per document. Will I need 100 MB or
>>> 1 GB to fit all the data needed to process one query in memory?
>>>
>>> Are there other things to be aware of?
>>>
>>> Thanks,
>>>
>>> Robert
>
>
>
> --
> Robert Krüger
> Managing Partner
> Lesspain GmbH & Co. KG
>
> www.lesspain-software.com

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
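A minimal sketch of the kind of plugin discussed in this thread, written against the Lucene/Solr 4.x APIs that were current at the time. The names (fpdist, FingerprintDistParser, com.example) and the Hamming-style distance are illustrative assumptions, not code from the thread; the standard ValueSourceParser plugin mechanism is real, but note that Solr's stock BinaryField does not expose docValues, so in practice the fingerprint would need a custom field type or an encoding into a docValues-enabled field, and the exact BinaryDocValues signature changed in later Lucene versions.

package com.example;

import java.io.IOException;
import java.util.Arrays;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.DoubleDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

/**
 * Hypothetical parser for a function query of the form
 * fpdist(fingerprint, <hex-encoded reference fingerprint>).
 */
public class FingerprintDistParser extends ValueSourceParser {

  @Override
  public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    String field = fp.parseArg();            // e.g. "fingerprint"
    byte[] ref = hexToBytes(fp.parseArg());  // reference fingerprint as hex
    return new FingerprintDistValueSource(field, ref);
  }

  static byte[] hexToBytes(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
  }

  /** Computes the per-document distance to the reference fingerprint. */
  static class FingerprintDistValueSource extends ValueSource {
    private final String field;
    private final byte[] ref;

    FingerprintDistValueSource(String field, byte[] ref) {
      this.field = field;
      this.ref = ref;
    }

    @Override
    public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
        throws IOException {
      // Assumes the fingerprint bytes are available as binary doc values,
      // which keeps only this field's data hot in memory, as discussed above.
      final BinaryDocValues dv = readerContext.reader().getBinaryDocValues(field);
      if (dv == null) {
        throw new IllegalStateException(field + " has no binary doc values");
      }
      return new DoubleDocValues(this) {
        private final BytesRef scratch = new BytesRef();

        @Override
        public double doubleVal(int doc) {
          dv.get(doc, scratch); // 4.x signature; later versions return a BytesRef
          return hammingDistance(scratch, ref);
        }
      };
    }

    /** Illustrative distance: count of differing bits (assumes equal lengths). */
    static double hammingDistance(BytesRef stored, byte[] ref) {
      int dist = 0;
      for (int i = 0; i < ref.length; i++) {
        dist += Integer.bitCount((stored.bytes[stored.offset + i] ^ ref[i]) & 0xFF);
      }
      return dist;
    }

    @Override
    public String description() {
      return "fpdist(" + field + ",...)";
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof FingerprintDistValueSource
          && field.equals(((FingerprintDistValueSource) o).field)
          && Arrays.equals(ref, ((FingerprintDistValueSource) o).ref);
    }

    @Override
    public int hashCode() {
      return 31 * field.hashCode() + Arrays.hashCode(ref);
    }
  }
}

Registered in solrconfig.xml via

  <valueSourceParser name="fpdist" class="com.example.FingerprintDistParser"/>

it could then be used both for sorting and for frange filtering, e.g. (hex value abbreviated):

  sort=fpdist(fingerprint,3af1...) asc
  fq={!frange u=10}fpdist(fingerprint,3af1...)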