Any ideas?

On Aug 10, 2013, at 6:28 PM, Mark <static.void....@gmail.com> wrote:
> Our schema is pretty basic.. nothing fancy going on here
>
> <fieldType name="text" class="solr.TextField" omitNorms="false">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0" preserveOriginal="1"/>
>     <filter class="solr.KStemFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="1" preserveOriginal="1"/>
>     <filter class="solr.KStemFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
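A quick way to see exactly what this index-time chain emits for a given title is Solr's field-analysis handler. A minimal sketch, assuming the stock FieldAnalysisRequestHandler is mapped to /analysis/field and the core lives at http://localhost:8983/solr (both assumptions, adjust to taste):

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed location of the core

    # Run a sample title through the "text" fieldType's index-time chain
    # and print the tokens the final filter emits.
    resp = requests.get(SOLR + "/analysis/field", params={
        "analysis.fieldtype": "text",
        "analysis.fieldvalue": "2013Sony Playstation",
        "wt": "json",
        "json.nl": "flat",   # alternating [name, value, ...] lists
    })
    resp.raise_for_status()

    # The "index" entry alternates analyzer class names with token lists;
    # the last token list is what would actually be indexed.
    stages = resp.json()["analysis"]["field_types"]["text"]["index"]
    print([t["text"] for t in stages[-1]])
    # e.g. ['2013sony', '2013', 'sony', 'playstation']

Feeding a suspect title through this shows exactly which terms a munged listing indexes down to.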
>
> On Aug 10, 2013, at 3:40 PM, "Jack Krupansky" <j...@basetechnology.com> wrote:
>
>> Now we're getting somewhere!
>>
>> To (over-simplify), you simply want to know if a given "listing" would
>> match a high-value pattern, either in a "clean" manner (obvious keywords)
>> or in an "unclean" manner (e.g., fuzzy keyword matching, stemming,
>> n-grams).
>>
>> To a large extent this also depends on how rich and powerful your
>> end-user query support is. So, if the user searches for "sony",
>> "samsung", or "apple", will it match some oddball listing that fuzzily
>> matches those terms?
>>
>> So... tell us, how rich is your query interface? I mean, do you support
>> wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or
>> "app", or... will "sony" match "sonblah-blah")?
>>
>> Reverse-search may in fact be what you need in this case, since you
>> literally do mean "if I index this document, will it match any of these
>> queries" (but doesn't score a hit on your direct check for whether it is
>> a clean keyword match).
>>
>> In your previous examples you only gave clean product titles, not
>> examples of circumventions of simple keyword matches.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Mark
>> Sent: Saturday, August 10, 2013 6:24 PM
>> To: solr-user@lucene.apache.org
>> Cc: Chris Hostetter
>> Subject: Re: Percolate feature?
>>
>>> So to reiterate your examples from before, but change the "labels" a
>>> bit and add some more converse examples (and ignore the "highlighting"
>>> aspect for a moment)...
>>>
>>> doc1 = "Sony"
>>> doc2 = "Samsung Galaxy"
>>> doc3 = "Sony Playstation"
>>>
>>> queryA = "Sony Experia" ... matches only doc1
>>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>>> queryC = "Samsung 52inch LC" ... doesn't match anything
>>> queryD = "Samsung Galaxy S4" ... matches doc2
>>> queryE = "Galaxy Samsung S4" ... matches doc2
>>>
>>> ...do I still have that correct?
>>
>> Yes
>>
>>> 2) if you *do* care about using non-trivial analysis, then you can't use
>>> the simple "termfreq()" function, which deals with raw terms -- instead
>>> you have to use the "query()" function to ensure that the input is
>>> parsed appropriately -- but then you have to wrap that function in
>>> something that will normalize the scores - so in place of
>>> termfreq('words','Galaxy') you'd want something like...
>>
>> Yes, we will be using non-trivial analysis. Now here's another twist...
>> what if we don't care about scoring?
>>
>> Let's talk about the real use case. We are a marketplace that sells
>> products that users have listed. For certain popular, high-risk, or
>> restricted keywords we charge the seller an extra fee or ban the
>> listing. We now have sellers purposely misspelling their listings to
>> circumvent this fee. They will start adding suffixes to their product
>> listings, such as "Sonies", knowing that it gets indexed down to "Sony"
>> and thus matches a user's query for Sony. Or they will munge together
>> numbers and products... "2013Sony". The same thing goes for adding crazy
>> non-ASCII characters to the front of the keyword: "ΒSony". This is
>> obviously a problem because we aren't charging for these keywords and,
>> more importantly, it makes our search results look like shit.
>>
>> We would like to:
>>
>> 1) Detect when a certain keyword is in a product title at listing time
>> so we may charge the seller. This was my idea of a "reverse search",
>> although it sounds like I may have caused too much confusion with that
>> term.
>> 2) Attempt to autocorrect these titles, hence the need for highlighting,
>> so we can try to replace the terms... this, of course, done outside of
>> Solr via an external service.
>>
>> Since we do some stemming (KStemmer) and filtering
>> (WordDelimiterFilterFactory), this makes conventional approaches such as
>> regex quite troublesome. Regex is also quite slow, scales horribly, and
>> always needs to be in lockstep with schema changes.
>>
>> Now knowing this, is there a good way to approach this?
>>
>> Thanks
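For what it's worth, the listing-time check in (1) can be sketched without touching the index at all: run both the protected keywords and the incoming title through the same index-time analyzer (via the /analysis/field call shown earlier) and do a token-subset test. Everything here -- field type name, keyword list, URL -- is an assumption for illustration, not prescribed:

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed

    def analyzed_tokens(text):
        """Final index-time tokens for `text` under the 'text' fieldType
        (same /analysis/field call as the earlier sketch)."""
        resp = requests.get(SOLR + "/analysis/field", params={
            "analysis.fieldtype": "text",
            "analysis.fieldvalue": text,
            "wt": "json",
            "json.nl": "flat",
        })
        resp.raise_for_status()
        stages = resp.json()["analysis"]["field_types"]["text"]["index"]
        return {t["text"] for t in stages[-1]}

    def matched_keywords(title, keywords):
        """Keywords all of whose analyzed terms appear in the analyzed
        title -- e.g. '2013Sony' trips 'Sony' because WordDelimiterFilter
        splits out 'sony' on the number/letter boundary."""
        title_terms = analyzed_tokens(title)
        return [kw for kw in keywords if analyzed_tokens(kw) <= title_terms]

    # hypothetical protected-keyword list:
    print(matched_keywords("2013Sony Playstation bundle",
                           ["Sony", "Samsung Galaxy"]))   # -> ['Sony']

Because the analyzer itself does the normalization, this stays in lockstep with schema changes in a way regexes can't.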
>>
>> On Aug 9, 2013, at 11:56 AM, Chris Hostetter <hossman_luc...@fucit.org>
>> wrote:
>>
>>> : I'll look into this. Thanks for the concrete example as I don't even
>>> : know which classes to start to look at to implement such a feature.
>>>
>>> Either Roman isn't understanding what you are asking for, or I'm not --
>>> but I don't think what Roman described will work for you...
>>>
>>> : > so if your query contains no duplicates and all terms must match,
>>> : > you can be sure that you are collecting docs only when the number
>>> : > of terms matches the number of clauses in the query
>>>
>>> Several of the examples you gave did not match what Roman is
>>> describing, as I understand it. Most people on this thread seem to be
>>> getting confused by having their perceptions "flipped" about what your
>>> "data known in advance" is vs. the "data you get at request time".
>>>
>>> You described this...
>>>
>>> : >>>>> Product keyword: "Sony"
>>> : >>>>> Product keyword: "Samsung Galaxy"
>>> : >>>>>
>>> : >>>>> We would like to be able to detect given a product title whether
>>> : >>>>> or not it matches any known keywords. For a keyword to be
>>> : >>>>> matched all of its terms must be present in the product title
>>> : >>>>> given.
>>> : >>>>>
>>> : >>>>> Product Title: "Sony Experia"
>>> : >>>>> Matches and returns a highlight: "<em>Sony</em> Experia"
>>>
>>> ...suggesting that what you call "product keywords" are the data you
>>> know about in advance, and "product titles" are the data you get at
>>> request time.
>>>
>>> So your example of the "request time" input (ie: query) "Sony Experia"
>>> matching "data known in advance" (ie: indexed document) "Sony" would
>>> not work with Roman's example.
>>>
>>> To rephrase (what I think I understand is) your goal...
>>>
>>> * you have many (10*3+) documents known in advance
>>> * any document D contains a set of words W(D) of varying sizes
>>> * any request Q contains a set of words W(Q) of varying sizes
>>> * you want a given request Q to match a document D if and only if:
>>>   - W(D) is a subset of W(Q)
>>>   - ie: no item exists in W(D) that does not exist in W(Q)
>>>   - ie: any number of items may exist in W(Q) that are not in W(D)
>>>
>>> So to reiterate your examples from before, but change the "labels" a
>>> bit and add some more converse examples (and ignore the "highlighting"
>>> aspect for a moment)...
>>>
>>> doc1 = "Sony"
>>> doc2 = "Samsung Galaxy"
>>> doc3 = "Sony Playstation"
>>>
>>> queryA = "Sony Experia" ... matches only doc1
>>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>>> queryC = "Samsung 52inch LC" ... doesn't match anything
>>> queryD = "Samsung Galaxy S4" ... matches doc2
>>> queryE = "Galaxy Samsung S4" ... matches doc2
>>>
>>> ...do I still have that correct?
>>>
>>> A similar question came up in the past, but I can't find my response
>>> now, so I'll try to recreate it...
>>>
>>> 1) if you don't care about using non-trivial analysis (ie: you don't
>>> need stemming, or synonyms, etc.), you can do this with some really
>>> simple function queries -- assuming you index a field containing the
>>> number of "words" in each document, in addition to the words
>>> themselves. Assuming your words are in a field named "words" and the
>>> number of words is in a field named "words_count", a request for
>>> something like "Galaxy Samsung S4" can be represented as...
>>>
>>>   q={!frange l=0 u=0}sub(words_count,
>>>                          sum(termfreq('words','Galaxy'),
>>>                              termfreq('words','Samsung'),
>>>                              termfreq('words','S4')))
>>>
>>> ...ie: you want to compute the sum of the term frequencies for each of
>>> the words requested, and then you want to subtract that sum from the
>>> number of terms in the document -- and then you only want to match
>>> documents where the result of that subtraction is 0.
>>>
>>> one complexity that comes up is that you haven't specified:
>>>
>>> * can the list of words in your documents contain duplicates?
>>> * can the list of words in your query contain duplicates?
>>> * should a document with duplicate words match only if the query also
>>>   contains the same word duplicated?
>>>
>>> ...the answers to those questions make the math more complicated (and
>>> are left as an exercise for the reader)
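Generating that request for an arbitrary word list is mechanical; a rough sketch, assuming the words/words_count fields above, plain single-term words, and a core at http://localhost:8983/solr:

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed

    def subset_match_query(words):
        """Build the approach-(1) query: docs whose entire 'words' field
        is covered by `words`, i.e. sum of termfreqs == words_count.
        Assumes plain single-term words (no quote escaping needed)."""
        freqs = ",".join("termfreq('words','%s')" % w for w in words)
        return "{!frange l=0 u=0}sub(words_count,sum(%s))" % freqs

    params = {
        "q": subset_match_query(["Galaxy", "Samsung", "S4"]),
        "fl": "id,words",
        "wt": "json",
    }
    docs = requests.get(SOLR + "/select", params=params).json()
    print(docs["response"]["docs"])   # should include doc2 "Samsung Galaxy"

Note that termfreq() works on raw indexed terms, so the incoming words must already be normalized the same way the "words" field was at index time.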
>>>
>>> 2) if you *do* care about using non-trivial analysis, then you can't
>>> use the simple "termfreq()" function, which deals with raw terms --
>>> instead you have to use the "query()" function to ensure that the input
>>> is parsed appropriately -- but then you have to wrap that function in
>>> something that will normalize the scores -- so in place of
>>> termfreq('words','Galaxy') you'd want something like...
>>>
>>>   if(query({!field f=words v='Galaxy'}),1,0)
>>>
>>> ...but again, the math gets much harder if you make things more complex
>>> with duplicate words in the document or duplicate words in the query --
>>> you'd probably have to use a custom similarity to get the scores
>>> returned by the query() function to be usable as-is in the match
>>> equation (and drop the "if()" function)
>>>
>>> As for the highlighting part of the problem -- that becomes much easier
>>> -- independent of the queries you use to *match* the documents, you can
>>> then specify an "hl.q" param to specify a much simpler query just
>>> containing the basic list of words (as a simple boolean query, all
>>> clauses optional) and let it highlight them in your list of words.
>>>
>>> -Hoss
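Tying the match query and the hl.q param together, a request might look like the following sketch. It assumes the analysis-aware variant from (2), the hypothetical words/words_count fields, and default highlighter config; as Hoss notes, the scores query() returns depend on the similarity, which is why each clause is wrapped in if(...,1,0):

    import requests

    SOLR = "http://localhost:8983/solr"   # assumed

    words = ["Sony", "Playstation", "3"]

    # One analysis-aware match clause per word, per approach (2):
    clauses = ",".join("if(query({!field f=words v='%s'}),1,0)" % w
                       for w in words)

    params = {
        "q": "{!frange l=0 u=0}sub(words_count,sum(%s))" % clauses,
        "hl": "true",
        "hl.fl": "words",
        # simpler boolean query, all clauses optional, just for highlighting:
        "hl.q": "words:(%s)" % " ".join(words),
        "wt": "json",
    }
    resp = requests.get(SOLR + "/select", params=params).json()
    print(resp.get("highlighting", {}))
    # e.g. {'doc3': {'words': ['<em>Sony</em> <em>Playstation</em>']}}

The highlighted snippets are what an external autocorrect service could use to locate and rewrite the offending terms in a title.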