Re: Percolate feature?

Roman Chyla Fri, 09 Aug 2013 13:51:26 -0700

On Fri, Aug 9, 2013 at 2:56 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:


>
> : I'll look into this. Thanks for the concrete example as I don't even
> : know which classes to start to look at to implement such a feature.
>
> Either roman isn't understanding what you are asking for, or i'm not --
> but i don't think what roman described will work for you...
>
> : > so if your query contains no duplicates and all terms must match, you can
> : > be sure that you are collecting docs only when the number of terms matches
> : > number of clauses in the query
>
> several of the examples you gave did not match what Roman is describing,
> as i understand it.  Most people on this thread seem to be getting
> confused by having their perceptions "flipped" about what your "data known
> in advance is" vs the "data you get at request time".
>
> You described this...
>
> : >>>>> Product keyword:  "Sony"
> : >>>>> Product keyword:  "Samsung Galaxy"
> : >>>>>
> : >>>>> We would like to be able to detect given a product title whether or not it
> : >>>>> matches any known keywords. For a keyword to be matched all of its terms
> : >>>>> must be present in the product title given.
> : >>>>>
> : >>>>> Product Title: "Sony Experia"
> : >>>>> Matches and returns a highlight: "<em>Sony</em> Experia"
>
> ...suggesting that what you call "product keywords" are the "data you know
> about in advance" and "product titles" are the data you get at request
> time.
>
> So your example of the "request time" input (ie: query) "Sony Experia"
> matching "data known in advance" (ie: indexed document) "Sony" would not
> work with Roman's example.
>
> To rephrase (what i think i understand is) your goal...
>
>  * you have many (10*3+) documents known in advance
>  * any document D contains a set of words W(D) of varying sizes
>  * any request Q contains a set of words W(Q) of varying sizes
>  * you want a given request Q to match a document D if and only if:
>    - W(D) is a subset of W(Q)
>

aha! this was not what i was understanding! i was assuming W(Q) is a subset
of W(D) - or rather, W(Q) === W(D)

so now i finally see the reasoning behind it and the use case, which is a
VERY interesting one.

roman



>    - ie: no item exists in W(D) that does not exist in W(Q)
>    - ie: any number of items may exist in W(Q) that are not in W(D)
>



>
> So to reiterate your examples from before, but change the "labels" a
> bit and add some more converse examples (and ignore the "highlighting"
> aspect for a moment)...
>
> doc1 = "Sony"
> doc2 = "Samsung Galaxy"
> doc3 = "Sony Playstation"
>
> queryA = "Sony Experia"       ... matches only doc1
> queryB = "Sony Playstation 3" ... matches doc3 and doc1
> queryC = "Samsung 52inch LC"  ... doesn't match anything
> queryD = "Samsung Galaxy S4"  ... matches doc2
> queryE = "Galaxy Samsung S4"  ... matches doc2
>
>
> ...do i still have that correct?
>
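
A minimal sketch of the matching rule as restated above, checked against the five example queries. It assumes plain whitespace tokenization and case-sensitive comparison, and the names `matches` and `matching_docs` are made up for illustration; it only mirrors the set logic and says nothing about how Solr would evaluate it:

```python
# Toy model of "document D matches request Q iff W(D) is a subset of W(Q)".
# Assumption: words are split on whitespace and compared case-sensitively.

DOCS = {
    "doc1": "Sony",
    "doc2": "Samsung Galaxy",
    "doc3": "Sony Playstation",
}


def matches(doc_text, query_text):
    """Return True when every word of the document also occurs in the query."""
    return set(doc_text.split()) <= set(query_text.split())


def matching_docs(query_text):
    """Return the sorted names of all known documents matched by the query."""
    return sorted(name for name, text in DOCS.items() if matches(text, query_text))
```

For instance, `matching_docs("Sony Playstation 3")` returns both doc1 and doc3, because each of those documents is fully contained in the query, while "Samsung 52inch LC" returns nothing since no document is a subset of it.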
>
> A similar question came up in the past, but i can't find my response now
> so i'll try to recreate it ...
>
>
> 1) if you don't care about using non-trivial analysis (ie: you don't need
> stemming, or synonyms, etc..), you can do this with some
> really simple function queries -- assuming you index a field containing
> the number of "words" in each document, in addition to the words
> themselves.  Assuming your words are in a field named "words" and the
> number of words is in a field named "words_count" a request for something
> like "Galaxy Samsung S4" can be represented as...
>
>   q={!frange l=0 u=0}sub(words_count,
>                          sum(termfreq('words','Galaxy'),
>                              termfreq('words','Samsung'),
>                              termfreq('words','S4')))
>
> ...ie: you want to compute the sum of the term frequencies for each of
> the words requested, and then you want to subtract that sum from the
> number of terms in the document -- and then you only want to match
> documents where the result of that subtraction is 0.
>
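
To make the arithmetic concrete, here is a rough sketch that builds the query string above from a list of words and simulates the subtraction outside Solr. The helper names `build_frange_query` and `frange_value` are hypothetical, the field names are the ones assumed in the text, and the builder does no escaping of quotes or special characters inside the words:

```python
from collections import Counter


def build_frange_query(words, field="words", count_field="words_count"):
    """Build the {!frange} query string for the given request words.

    Assumption: words contain no quotes or other characters needing escaping.
    """
    terms = ",".join("termfreq('%s','%s')" % (field, w) for w in words)
    return "{!frange l=0 u=0}sub(%s,sum(%s))" % (count_field, terms)


def frange_value(doc_words, query_words):
    """Simulate sub(words_count, sum(termfreq(...))) for one document."""
    term_freq = Counter(doc_words)        # stands in for termfreq()
    words_count = len(doc_words)          # stands in for the words_count field
    return words_count - sum(term_freq[w] for w in query_words)
```

With the request "Galaxy Samsung S4", the document "Samsung Galaxy" gives 2 - (1 + 1 + 0) = 0 and so falls inside the l=0 u=0 range, while "Sony Playstation" gives 2 - 0 = 2 and is excluded.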
> one complexity that comes up, is that you haven't specified:
>
>   * can the list of words in your documents contain duplicates?
>   * can the list of words in your query contain duplicates?
>   * should a document with duplicate words match only if the query also
> contains the same word duplicated?
>
> ...the answers to those questions make the math more complicated (and are
> left as an exercise for the reader)
>
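
A small illustration of why duplicates matter, plus one possible way around them. This is only a sketch under the assumption that duplicates should be ignored on both sides (which is just one of the possible answers to the questions above); `frange_value` repeats the naive arithmetic and `frange_value_distinct` is a hypothetical variant that counts distinct words only:

```python
from collections import Counter


def frange_value(doc_words, query_words):
    """Naive arithmetic: words_count minus the sum of raw term frequencies."""
    term_freq = Counter(doc_words)
    return len(doc_words) - sum(term_freq[w] for w in query_words)


def frange_value_distinct(doc_words, query_words):
    """Variant assuming duplicates are ignored in both document and query.

    Equivalent to indexing the number of distinct words and de-duplicating
    the request words before building the query.
    """
    distinct_doc = set(doc_words)
    distinct_query = set(query_words)
    return len(distinct_doc) - len(distinct_doc & distinct_query)


# A duplicated query word is counted twice by the naive version, so the
# document "Sony" yields -1 for the request "Sony Sony" and is not matched.
naive = frange_value(["Sony"], ["Sony", "Sony"])
distinct = frange_value_distinct(["Sony"], ["Sony", "Sony"])
```

If duplicates are instead meant to be significant (a document with a repeated word requiring the same repetition in the query), the distinct-word variant is the wrong choice and the comparison would need per-word counts.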
>
> 2) if you *do* care about using non-trivial analysis, then you can't use
> the simple "termfreq()" function, which deals with raw terms -- instead
> you have to use the "query()" function to ensure that the input is parsed
> appropriately -- but then you have to wrap that function in something that
> will normalize the scores - so in place of termfreq('words','Galaxy')
> you'd want something like...
>
>             if(query({!field f=words v='Galaxy'}),1,0)
>
> ...but again the math gets much harder if you make things more complex
> with duplicate words in the document or duplicate words in the query --
> you'd probably have to use a custom similarity to get the scores returned
> by the query() function to be usable as is in the match equation (and drop
> the "if()" function)
>
>
> As for the highlighting part of the problem -- that becomes much easier --
> independent of the queries you use to *match* the documents, you can then
> specify a "hl.q" param to specify a much simpler query just containing the
> basic list of words (as a simple boolean query, all clauses optional) and
> let it highlight them in your list of words.
>
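
A hedged sketch of how the matching query and the separate highlighting query could be put together as request parameters. `hl`, `hl.fl` and `hl.q` are standard Solr highlighting parameters; the helper name `highlight_params` is hypothetical, the field name "words" is the one assumed earlier, and the sketch assumes the default query operator is OR so that every clause in `hl.q` is optional:

```python
from urllib.parse import urlencode


def highlight_params(match_query, words, field="words"):
    """Combine an arbitrary matching query with a simpler highlighting query.

    Assumptions: `field` is stored so it can be highlighted, the default
    operator is OR, and the words need no query-syntax escaping.
    """
    return {
        "q": match_query,                              # decides which docs match
        "hl": "true",                                  # turn highlighting on
        "hl.fl": field,                                # field to highlight
        "hl.q": "%s:(%s)" % (field, " ".join(words)),  # what to highlight
    }


params = highlight_params(
    "{!frange l=0 u=0}sub(words_count,sum(termfreq('words','Galaxy'),"
    "termfreq('words','Samsung'),termfreq('words','S4')))",
    ["Galaxy", "Samsung", "S4"],
)
query_string = urlencode(params)  # ready to append to a /select URL
```

The point of the split is that `q` can stay as complicated as the matching logic requires, while `hl.q` remains a plain list of optional terms for the highlighter.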
>
>
>
>
>
>
> -Hoss
>
