Any ideas?

On Aug 10, 2013, at 6:28 PM, Mark <static.void....@gmail.com> wrote:
> Our schema is pretty basic.. nothing fancy going on here
>
> <fieldType name="text" class="solr.TextField" omitNorms="false">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0" preserveOriginal="1"/>
>     <filter class="solr.KStemFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" preserveOriginal="1"/>
>     <filter class="solr.KStemFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
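A quick way to see exactly what this index-time chain emits for a given title is Solr's field-analysis handler. A minimal sketch, assuming the stock FieldAnalysisRequestHandler is mapped to /analysis/field and the core lives at http://localhost:8983/solr (both assumptions, adjust to taste):

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed location of the core

    # Run a sample title through the "text" fieldType's index-time chain
    # and print the tokens the final filter emits.
    resp = requests.get(SOLR + "/analysis/field", params={
        "analysis.fieldtype": "text",
        "analysis.fieldvalue": "2013Sony Playstation",
        "wt": "json",
        "json.nl": "flat",   # alternating [name, value, ...] lists
    })
    resp.raise_for_status()

    # The "index" entry alternates analyzer class names with token lists;
    # the last token list is what would actually be indexed.
    stages = resp.json()["analysis"]["field_types"]["text"]["index"]
    print([t["text"] for t in stages[-1]])
    # e.g. ['2013sony', '2013', 'sony', 'playstation']

Feeding a suspect title through this shows exactly which terms a munged listing indexes down to.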
>
> On Aug 10, 2013, at 3:40 PM, "Jack Krupansky" <j...@basetechnology.com> wrote:
>
>> Now we're getting somewhere!
>>
>> To (over-simplify), you simply want to know if a given "listing" would
>> match a high-value pattern, either in a "clean" manner (obvious keywords)
>> or in an "unclean" manner (e.g., fuzzy keyword matching, stemming,
>> n-grams).
>>
>> To a large extent this also depends on how rich and powerful your
>> end-user query support is. So, if the user searches for "sony",
>> "samsung", or "apple", will it match some oddball listing that fuzzily
>> matches those terms?
>>
>> So... tell us, how rich is your query interface? I mean, do you support
>> wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or
>> "app", or... will "sony" match "sonblah-blah")?
>>
>> Reverse-search may in fact be what you need in this case, since you
>> literally do mean "if I index this document, will it match any of these
>> queries" (but doesn't score a hit on your direct check for whether it is
>> a clean keyword match).
>>
>> In your previous examples you only gave clean product titles, not
>> examples of circumventions of simple keyword matches.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Mark
>> Sent: Saturday, August 10, 2013 6:24 PM
>> To: solr-user@lucene.apache.org
>> Cc: Chris Hostetter
>> Subject: Re: Percolate feature?
>>
>>> So to reiterate your examples from before, but change the "labels" a
>>> bit and add some more converse examples (and ignore the "highlighting"
>>> aspect for a moment)...
>>>
>>> doc1 = "Sony"
>>> doc2 = "Samsung Galaxy"
>>> doc3 = "Sony Playstation"
>>>
>>> queryA = "Sony Experia" ... matches only doc1
>>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>>> queryC = "Samsung 52inch LC" ... doesn't match anything
>>> queryD = "Samsung Galaxy S4" ... matches doc2
>>> queryE = "Galaxy Samsung S4" ... matches doc2
>>>
>>> ...do I still have that correct?
>>
>> Yes
>>
>>> 2) if you *do* care about using non-trivial analysis, then you can't use
>>> the simple "termfreq()" function, which deals with raw terms -- instead
>>> you have to use the "query()" function to ensure that the input is
>>> parsed appropriately -- but then you have to wrap that function in
>>> something that will normalize the scores - so in place of
>>> termfreq('words','Galaxy') you'd want something like...
>>
>> Yes, we will be using non-trivial analysis. Now here's another twist...
>> what if we don't care about scoring?
>>
>> Let's talk about the real use case. We are a marketplace that sells
>> products that users have listed. For certain popular, high-risk, or
>> restricted keywords we charge the seller an extra fee or ban the
>> listing. We now have sellers purposely misspelling their listings to
>> circumvent this fee. They will start adding suffixes to their product
>> listings, such as "Sonies", knowing that it gets indexed down to "Sony"
>> and thus matches a user's query for Sony. Or they will munge together
>> numbers and products... "2013Sony". The same thing goes for adding crazy
>> non-ASCII characters to the front of the keyword: "ΒSony". This is
>> obviously a problem because we aren't charging for these keywords and,
>> more importantly, it makes our search results look like shit.
>>
>> We would like to:
>>
>> 1) Detect when a certain keyword is in a product title at listing time
>> so we may charge the seller. This was my idea of a "reverse search",
>> although it sounds like I may have caused too much confusion with that
>> term.
>> 2) Attempt to autocorrect these titles, hence the need for highlighting,
>> so we can try to replace the terms... this, of course, done outside of
>> Solr via an external service.
>>
>> Since we do some stemming (KStemmer) and filtering
>> (WordDelimiterFilterFactory), this makes conventional approaches such as
>> regex quite troublesome. Regex is also quite slow, scales horribly, and
>> always needs to be in lockstep with schema changes.
>>
>> Now knowing this, is there a good way to approach this?
>>
>> Thanks
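For what it's worth, the listing-time check in (1) can be sketched without touching the index at all: run both the protected keywords and the incoming title through the same index-time analyzer (via the /analysis/field call shown earlier) and do a token-subset test. Everything here -- field type name, keyword list, URL -- is an assumption for illustration, not prescribed:

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed

    def analyzed_tokens(text):
        """Final index-time tokens for `text` under the 'text' fieldType
        (same /analysis/field call as the earlier sketch)."""
        resp = requests.get(SOLR + "/analysis/field", params={
            "analysis.fieldtype": "text",
            "analysis.fieldvalue": text,
            "wt": "json",
            "json.nl": "flat",
        })
        resp.raise_for_status()
        stages = resp.json()["analysis"]["field_types"]["text"]["index"]
        return {t["text"] for t in stages[-1]}

    def matched_keywords(title, keywords):
        """Keywords all of whose analyzed terms appear in the analyzed
        title -- e.g. '2013Sony' trips 'Sony' because WordDelimiterFilter
        splits out 'sony' on the number/letter boundary."""
        title_terms = analyzed_tokens(title)
        return [kw for kw in keywords if analyzed_tokens(kw) <= title_terms]

    # hypothetical protected-keyword list:
    print(matched_keywords("2013Sony Playstation bundle",
                           ["Sony", "Samsung Galaxy"]))   # -> ['Sony']

Because the analyzer itself does the normalization, this stays in lockstep with schema changes in a way regexes can't.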
>>
>> On Aug 9, 2013, at 11:56 AM, Chris Hostetter <hossman_luc...@fucit.org>
>> wrote:
>>
>>> : I'll look into this. Thanks for the concrete example as I don't even
>>> : know which classes to start to look at to implement such a feature.
>>>
>>> Either Roman isn't understanding what you are asking for, or I'm not --
>>> but I don't think what Roman described will work for you...
>>>
>>> : > so if your query contains no duplicates and all terms must match,
>>> : > you can be sure that you are collecting docs only when the number
>>> : > of terms matches the number of clauses in the query
>>>
>>> Several of the examples you gave did not match what Roman is
>>> describing, as I understand it. Most people on this thread seem to be
>>> getting confused by having their perceptions "flipped" about what your
>>> "data known in advance" is vs. the "data you get at request time".
>>>
>>> You described this...
>>>
>>> : >>>>> Product keyword: "Sony"
>>> : >>>>> Product keyword: "Samsung Galaxy"
>>> : >>>>>
>>> : >>>>> We would like to be able to detect given a product title whether
>>> : >>>>> or not it matches any known keywords. For a keyword to be
>>> : >>>>> matched all of its terms must be present in the product title
>>> : >>>>> given.
>>> : >>>>>
>>> : >>>>> Product Title: "Sony Experia"
>>> : >>>>> Matches and returns a highlight: "<em>Sony</em> Experia"
>>>
>>> ...suggesting that what you call "product keywords" are the data you
>>> know about in advance, and "product titles" are the data you get at
>>> request time.
>>>
>>> So your example of the "request time" input (ie: query) "Sony Experia"
>>> matching "data known in advance" (ie: indexed document) "Sony" would
>>> not work with Roman's example.
>>>
>>> To rephrase (what I think I understand is) your goal...
>>>
>>> * you have many (10*3+) documents known in advance
>>> * any document D contains a set of words W(D) of varying sizes
>>> * any request Q contains a set of words W(Q) of varying sizes
>>> * you want a given request Q to match a document D if and only if:
>>>   - W(D) is a subset of W(Q)
>>>   - ie: no item exists in W(D) that does not exist in W(Q)
>>>   - ie: any number of items may exist in W(Q) that are not in W(D)
>>>
>>> So to reiterate your examples from before, but change the "labels" a
>>> bit and add some more converse examples (and ignore the "highlighting"
>>> aspect for a moment)...
>>>
>>> doc1 = "Sony"
>>> doc2 = "Samsung Galaxy"
>>> doc3 = "Sony Playstation"
>>>
>>> queryA = "Sony Experia" ... matches only doc1
>>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>>> queryC = "Samsung 52inch LC" ... doesn't match anything
>>> queryD = "Samsung Galaxy S4" ... matches doc2
>>> queryE = "Galaxy Samsung S4" ... matches doc2
>>>
>>> ...do I still have that correct?
>>>
>>> A similar question came up in the past, but I can't find my response
>>> now, so I'll try to recreate it...
>>>
>>> 1) if you don't care about using non-trivial analysis (ie: you don't
>>> need stemming, or synonyms, etc.), you can do this with some really
>>> simple function queries -- assuming you index a field containing the
>>> number of "words" in each document, in addition to the words
>>> themselves. Assuming your words are in a field named "words" and the
>>> number of words is in a field named "words_count", a request for
>>> something like "Galaxy Samsung S4" can be represented as...
>>>
>>>   q={!frange l=0 u=0}sub(words_count,
>>>                          sum(termfreq('words','Galaxy'),
>>>                              termfreq('words','Samsung'),
>>>                              termfreq('words','S4')))
>>>
>>> ...ie: you want to compute the sum of the term frequencies for each of
>>> the words requested, and then you want to subtract that sum from the
>>> number of terms in the document -- and then you only want to match
>>> documents where the result of that subtraction is 0.
>>>
>>> one complexity that comes up is that you haven't specified:
>>>
>>> * can the list of words in your documents contain duplicates?
>>> * can the list of words in your query contain duplicates?
>>> * should a document with duplicate words match only if the query also
>>>   contains the same word duplicated?
>>>
>>> ...the answers to those questions make the math more complicated (and
>>> are left as an exercise for the reader)
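Generating that request for an arbitrary word list is mechanical; a rough sketch, assuming the words/words_count fields above, plain single-term words, and a core at http://localhost:8983/solr:

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed

    def subset_match_query(words):
        """Build the approach-(1) query: docs whose entire 'words' field
        is covered by `words`, i.e. sum of termfreqs == words_count.
        Assumes plain single-term words (no quote escaping needed)."""
        freqs = ",".join("termfreq('words','%s')" % w for w in words)
        return "{!frange l=0 u=0}sub(words_count,sum(%s))" % freqs

    params = {
        "q": subset_match_query(["Galaxy", "Samsung", "S4"]),
        "fl": "id,words",
        "wt": "json",
    }
    docs = requests.get(SOLR + "/select", params=params).json()
    print(docs["response"]["docs"])   # should include doc2 "Samsung Galaxy"

Note that termfreq() works on raw indexed terms, so the incoming words must already be normalized the same way the "words" field was at index time.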
>>>
>>> 2) if you *do* care about using non-trivial analysis, then you can't
>>> use the simple "termfreq()" function, which deals with raw terms --
>>> instead you have to use the "query()" function to ensure that the input
>>> is parsed appropriately -- but then you have to wrap that function in
>>> something that will normalize the scores -- so in place of
>>> termfreq('words','Galaxy') you'd want something like...
>>>
>>>   if(query({!field f=words v='Galaxy'}),1,0)
>>>
>>> ...but again, the math gets much harder if you make things more complex
>>> with duplicate words in the document or duplicate words in the query --
>>> you'd probably have to use a custom similarity to get the scores
>>> returned by the query() function to be usable as-is in the match
>>> equation (and drop the "if()" function)
>>>
>>> As for the highlighting part of the problem -- that becomes much easier
>>> -- independent of the queries you use to *match* the documents, you can
>>> then specify an "hl.q" param to specify a much simpler query just
>>> containing the basic list of words (as a simple boolean query, all
>>> clauses optional) and let it highlight them in your list of words.
>>>
>>> -Hoss
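Tying the match query and the hl.q param together, a request might look like the following sketch. It assumes the analysis-aware variant from (2), the hypothetical words/words_count fields, and default highlighter config; as Hoss notes, the scores query() returns depend on the similarity, which is why each clause is wrapped in if(...,1,0):

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed

    words = ["Sony", "Playstation", "3"]

    # One analysis-aware match clause per word, per approach (2):
    clauses = ",".join("if(query({!field f=words v='%s'}),1,0)" % w
                       for w in words)

    params = {
        "q": "{!frange l=0 u=0}sub(words_count,sum(%s))" % clauses,
        "hl": "true",
        "hl.fl": "words",
        # simpler boolean query, all clauses optional, just for highlighting:
        "hl.q": "words:(%s)" % " ".join(words),
        "wt": "json",
    }
    resp = requests.get(SOLR + "/select", params=params).json()
    print(resp.get("highlighting", {}))
    # e.g. {'doc3': {'words': ['<em>Sony</em> <em>Playstation</em>']}}

The highlighted snippets are what an external autocorrect service could use to locate and rewrite the offending terms in a title.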