Would Solr use multithreading to process the records of a function query as described above? In my scenario, concurrent searches are not the issue; rather, the speed of a single query is the optimization target. Or will I have to set up distributed search to achieve that?
Thanks,
Robert

On Tue, Jun 10, 2014 at 10:11 AM, Robert Krüger <krue...@lesspain.de> wrote:
> Great, I was hoping for that. In my case I will have to deal with the
> worst-case scenario, i.e. all documents matching the query, because
> the only criterion is the fingerprint and the result of the
> distance/similarity function, which will have to be executed for every
> document. However, I am dealing with a scenario where there will not
> be many concurrent users.
>
> Thank you.
>
> On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein <joels...@gmail.com> wrote:
>> You only need fast access to the fingerprint field, so only that
>> field needs to be in memory. You'll want to review how Lucene DocValues
>> and the FieldCache work. Sorting is done with a PriorityQueue, so only
>> the top N docs are kept in memory.
>>
>> You'll only need to access the fingerprint field values for documents
>> that match the query, so it won't be a full table scan unless all the
>> docs match the query.
>>
>> Sounds like an interesting project. Please keep us posted.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>>
>> On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger <krue...@lesspain.de> wrote:
>>
>>> Hi,
>>>
>>> let's say I have an index that contains a field of type BinaryField
>>> called "fingerprint" that stores a few (let's say 100) bytes that are
>>> some kind of digital fingerprint-like thing.
>>>
>>> Let's say I want to perform queries on that field to achieve sorting
>>> or filtering based on a custom distance function "customDistance",
>>> i.e. I supply a reference fingerprint and Solr either returns all
>>> documents sorted by
>>> customDistance(<referenceFingerprint>, <documentFingerprint>) or uses
>>> that value in an frange expression for filtering.
>>>
>>> I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
>>> understand that using function queries with a custom function is
>>> expensive, as it results in what would be called a "full table scan"
>>> in the SQL world, i.e. the data from all documents needs to be touched
>>> to select the matching documents or to sort by the function's result.
>>>
>>> Given all that, and provided I have to use a custom function for my
>>> needs, I would like to know a few more details about the Solr
>>> architecture to understand what I have to look out for.
>>>
>>> I will potentially have millions of records. Does the data contained
>>> in other index fields play a role for RAM usage when I only use the
>>> "fingerprint" field for sorting and searching? I am hoping that my RAM
>>> only needs to accommodate the fingerprint data of all documents for
>>> queries to be fast, not the fingerprint data plus all other indexed or
>>> stored data.
>>>
>>> Example: my fingerprint data needs 100 bytes per document, my other
>>> indexed field data needs 900 bytes per document. Will I need 100 MB or
>>> 1 GB to fit all the data needed to process one query in memory?
>>>
>>> Are there other things to be aware of?
>>>
>>> Thanks,
>>>
>>> Robert
>
>
>
> --
> Robert Krüger
> Managing Partner
> Lesspain GmbH & Co. KG
>
> www.lesspain-software.com

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
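A minimal sketch of the kind of plugin discussed in this thread, written against the Lucene/Solr 4.x APIs that were current at the time. The names (fpdist, FingerprintDistParser, com.example) and the Hamming-style distance are illustrative assumptions, not code from the thread; the standard ValueSourceParser plugin mechanism is real, but note that Solr's stock BinaryField does not expose docValues, so in practice the fingerprint would need a custom field type or an encoding into a docValues-enabled field, and the exact BinaryDocValues signature changed in later Lucene versions.

package com.example;

import java.io.IOException;
import java.util.Arrays;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.DoubleDocValues;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

/**
 * Hypothetical parser for a function query of the form
 * fpdist(fingerprint, <hex-encoded reference fingerprint>).
 */
public class FingerprintDistParser extends ValueSourceParser {

  @Override
  public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    String field = fp.parseArg();            // e.g. "fingerprint"
    byte[] ref = hexToBytes(fp.parseArg());  // reference fingerprint as hex
    return new FingerprintDistValueSource(field, ref);
  }

  static byte[] hexToBytes(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
  }

  /** Computes the per-document distance to the reference fingerprint. */
  static class FingerprintDistValueSource extends ValueSource {
    private final String field;
    private final byte[] ref;

    FingerprintDistValueSource(String field, byte[] ref) {
      this.field = field;
      this.ref = ref;
    }

    @Override
    public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
        throws IOException {
      // Assumes the fingerprint bytes are available as binary doc values,
      // which keeps only this field's data hot in memory, as discussed above.
      final BinaryDocValues dv = readerContext.reader().getBinaryDocValues(field);
      if (dv == null) {
        throw new IllegalStateException(field + " has no binary doc values");
      }
      return new DoubleDocValues(this) {
        private final BytesRef scratch = new BytesRef();

        @Override
        public double doubleVal(int doc) {
          dv.get(doc, scratch); // 4.x signature; later versions return a BytesRef
          return hammingDistance(scratch, ref);
        }
      };
    }

    /** Illustrative distance: count of differing bits (assumes equal lengths). */
    static double hammingDistance(BytesRef stored, byte[] ref) {
      int dist = 0;
      for (int i = 0; i < ref.length; i++) {
        dist += Integer.bitCount((stored.bytes[stored.offset + i] ^ ref[i]) & 0xFF);
      }
      return dist;
    }

    @Override
    public String description() {
      return "fpdist(" + field + ",...)";
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof FingerprintDistValueSource
          && field.equals(((FingerprintDistValueSource) o).field)
          && Arrays.equals(ref, ((FingerprintDistValueSource) o).ref);
    }

    @Override
    public int hashCode() {
      return 31 * field.hashCode() + Arrays.hashCode(ref);
    }
  }
}

Registered in solrconfig.xml via

  <valueSourceParser name="fpdist" class="com.example.FingerprintDistParser"/>

it could then be used both for sorting and for frange filtering, e.g. (hex value abbreviated):

  sort=fpdist(fingerprint,3af1...) asc
  fq={!frange u=10}fpdist(fingerprint,3af1...)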