On Fri, Aug 9, 2013 at 2:56 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
> : I'll look into this. Thanks for the concrete example as I don't even
> : know which classes to start to look at to implement such a feature.
>
> Either roman isn't understanding what you are asking for, or i'm not --
> but i don't think what roman described will work for you...
>
> : > so if your query contains no duplicates and all terms must match, you can
> : > be sure that you are collecting docs only when the number of terms matches
> : > number of clauses in the query
>
> several of the examples you gave did not match what Roman is describing,
> as i understand it. Most people on this thread seem to be getting
> confused by having their perceptions "flipped" about what your "data known
> in advance is" vs the "data you get at request time".
>
> You described this...
>
> : >>>>> Product keyword: "Sony"
> : >>>>> Product keyword: "Samsung Galaxy"
> : >>>>>
> : >>>>> We would like to be able to detect given a product title whether or not it
> : >>>>> matches any known keywords. For a keyword to be matched all of its terms
> : >>>>> must be present in the product title given.
> : >>>>>
> : >>>>> Product Title: "Sony Experia"
> : >>>>> Matches and returns a highlight: "<em>Sony</em> Experia"
>
> ...suggesting that what you call "product keywords" are the "data you know
> about in advance" and "product titles" are the data you get at request
> time.
>
> So your example of the "request time" input (ie: query) "Sony Experia"
> matching "data known in advance" (ie: indexed document) "Sony" would not
> work with Roman's example.
>
> To rephrase (what i think i understand is) your goal...
>
>  * you have many (10*3+) documents known in advance
>  * any document D contains a set of words W(D) of varying sizes
>  * any request Q contains a set of words W(Q) of varying sizes
>  * you want a given request Q to match a document D if and only if:
>    - W(D) is a subset of W(Q)

aha! this was not what i was understanding!
i was assuming W(Q) is a subset of W(D) - or rather, W(Q) === W(D)

so now i finally see the reasoning behind it and the use case, which is a
VERY interesting one.

roman

>    - ie: no item exists in W(D) that does not exist in W(Q)
>    - ie: any number of items may exist in W(Q) that are not in W(D)
>
>
> So to reiterate your examples from before, but change the "labels" a
> bit and add some more converse examples (and ignore the "highlighting"
> aspect for a moment)...
>
>   doc1 = "Sony"
>   doc2 = "Samsung Galaxy"
>   doc3 = "Sony Playstation"
>
>   queryA = "Sony Experia"       ... matches only doc1
>   queryB = "Sony Playstation 3" ... matches doc3 and doc1
>   queryC = "Samsung 52inch LC"  ... doesn't match anything
>   queryD = "Samsung Galaxy S4"  ... matches doc2
>   queryE = "Galaxy Samsung S4"  ... matches doc2
>
>
> ...do i still have that correct?
>
>
> A similar question came up in the past, but i can't find my response now
> so i'll try to recreate it ...
>
>
> 1) if you don't care about using non-trivial analysis (ie: you don't need
> stemming, or synonyms, etc..), you can do this with some
> really simple function queries -- assuming you index a field containing
> the number of "words" in each document, in addition to the words
> themselves.  Assuming your words are in a field named "words" and the
> number of words is in a field named "words_count" a request for something
> like "Galaxy Samsung S4" can be represented as...
>
>   q={!frange l=0 u=0}sub(words_count,
>                          sum(termfreq('words','Galaxy'),
>                              termfreq('words','Samsung'),
>                              termfreq('words','S4')))
>
> ...ie: you want to compute the sum of the term frequencies for each of
> the words requested, and then you want to subtract that sum from the
> number of terms in the document -- and then you only want to match
> documents where the result of that subtraction is 0.
>
> one complexity that comes up, is that you haven't specified:
>
>  * can the list of words in your documents contain duplicates?
>  * can the list of words in your query contain duplicates?
>  * should a document with duplicate words match only if the query also
>    contains the same word duplicated?
>
> ...the answers to those questions make the math more complicated (and are
> left as an exercise for the reader)
>
>
> 2) if you *do* care about using non-trivial analysis, then you can't use
> the simple "termfreq()" function, which deals with raw terms -- instead
> you have to use the "query()" function to ensure that the input is parsed
> appropriately -- but then you have to wrap that function in something that
> will normalize the scores - so in place of termfreq('words','Galaxy')
> you'd want something like...
>
>   if(query({!field f=words v='Galaxy'}),1,0)
>
> ...but again the math gets much harder if you make things more complex
> with duplicate words in the document or duplicate words in the query --
> you'd probably have to use a custom similarity to get the scores returned
> by the query() function to be usable as is in the match equation (and drop
> the "if()" function)
>
>
> As for the highlighting part of the problem -- that becomes much easier --
> independent of the queries you use to *match* the documents, you can then
> specify a "hl.q" param to specify a much simpler query just containing the
> basic list of words (as a simple boolean query, all clauses optional) and
> let it highlight them in your list of words.
>
>
> -Hoss
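
For anyone who wants to sanity-check the "W(D) is a subset of W(Q)" rule outside of Solr, here is a minimal Python sketch of it, run against the doc1..doc3 / queryA..queryE examples quoted above. It is only a model of the matching rule, not Solr code. The helper names (words, matching_docs, frange_query) are made up for illustration, the whitespace-and-lowercase tokenizing is an assumption standing in for "no non-trivial analysis", and duplicate words are ignored in the set-based check. frange_query() just assembles the option 1 query string for a request; it does no escaping, keeps duplicates as given, and has not been tested against a live Solr instance.

```python
def words(text):
    """Split on whitespace and lowercase (assumed stand-in for trivial analysis)."""
    return set(text.lower().split())


def matching_docs(query, docs):
    """Return ids of docs whose word set W(D) is a subset of the request's W(Q)."""
    wq = words(query)
    return sorted(doc_id for doc_id, text in docs.items() if words(text) <= wq)


def frange_query(query, field="words", count_field="words_count"):
    """Assemble the option 1 function query string for a request.

    Hypothetical helper: no escaping of quotes, no handling of duplicates.
    """
    terms = ",".join("termfreq('%s','%s')" % (field, w) for w in query.split())
    return "{!frange l=0 u=0}sub(%s,sum(%s))" % (count_field, terms)


# The "known in advance" documents from the examples above.
DOCS = {
    "doc1": "Sony",
    "doc2": "Samsung Galaxy",
    "doc3": "Sony Playstation",
}

if __name__ == "__main__":
    for q in ("Sony Experia", "Sony Playstation 3", "Samsung 52inch LC",
              "Samsung Galaxy S4", "Galaxy Samsung S4"):
        print(q, "->", matching_docs(q, DOCS))
    print(frange_query("Galaxy Samsung S4"))
```

Note that word order in the request does not matter here, which is why "Galaxy Samsung S4" and "Samsung Galaxy S4" behave the same way.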