: Let's talk about the real use case. We are a marketplace that sells
: products that users have listed. For certain popular, high-risk, or
: restricted keywords we charge the seller an extra fee or ban the listing.
: We now have sellers purposely misspelling their listings to circumvent
: this fee. They will add suffixes to their product listings, such as
: "Sonies", knowing that it gets indexed down to "Sony" and thus matches a
: user's query for Sony. Or they will munge together numbers and
: products... "2013Sony". The same thing goes for adding crazy non-ASCII
: characters to the front of the keyword: "ΒSony". This is obviously a
: problem because we aren't charging for these keywords and, more
: importantly, it makes our search results look like shit.
: 1) Detect when a certain keyword is in a product title at listing time
: so we may charge the seller. This was my idea of a "reverse search",
: although it sounds like I may have caused too much confusion with that
: term.
Ok ... with the concrete specifics of your situation in mind, I can think
of two completely different approaches -- depending on how precise you need
to be about your definition of a "match" and how you want to deal with
ongoing maintenance as your system evolves...
## Approach #1 - NRT index & searching w/custom plugin
Even if you have 1000-5000 of these special queries to check, executing
them should be very fast against a small index where most of the queries
won't match anything -- especially if you write a custom component that
pre-parses them into Query objects and hangs onto them in memory.
(As a sample data point: with the 32 sample docs from Solr 4.x, I
configured a request handler with 5000 unique facet.query defaults using
the {!field} qparser. Most of these facet queries didn't match anything,
but a handful of them matched some of the same documents. With completely
cold caches, these 5000 facet queries had a QTime of 502ms on my laptop --
and that includes parsing all 5000 query strings.)
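
For reference, that experiment boils down to defaults along these lines in
solrconfig.xml (the handler name, field name, and keywords here are all
invented):

```xml
<!-- one facet.query default per special keyword; each is parsed with
     the {!field} qparser against a hypothetical "title" field -->
<requestHandler name="/validate" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">true</str>
    <str name="facet.query">{!field f=title}sony</str>
    <str name="facet.query">{!field f=title}playstation 3</str>
    <!-- ...repeated for each of the ~5000 special keywords... -->
  </lst>
</requestHandler>
```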
So imagine you wrote a custom SearchComponent that could read your X
special queries from some remote database on init (and re-load them on
command) and parse them into Queries, which it then holds on to in some
kind of data structure that also tracks why you care about them (ie:
charge 10% more, banned, etc...). At query time, your custom component
would filter the main result set of docs against these queries to look for
matches that should be reported (along with the metadata about the queries
that match), and could also inspect the results of any query that matches
and generate highlighting for each query+doc pair that matches. You would
then register this custom search component in a special "validation" Solr
core that is otherwise configured exactly the same as your regular
production index.
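
To make that less abstract, here is a very rough sketch of such a
component (the class name, response key, and loading mechanism are all
hypothetical; a real implementation would cache the parsed Query objects
instead of re-parsing them on every request):

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.search.Query;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.QParser;
import org.apache.solr.search.SolrIndexSearcher;

/**
 * Hypothetical component that checks every doc in the main result set
 * against a list of "special" queries and reports which ones matched.
 */
public class SpecialQueryValidator extends SearchComponent {

  /** query string -> why you care (ie: "charge 10% more", "banned", ...) */
  private final Map<String,String> specialQueries = new LinkedHashMap<>();

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // hypothetical: (re)load the query strings + metadata from your
    // remote database here, or on init / on command as described above
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    SolrIndexSearcher searcher = rb.req.getSearcher();
    DocList results = rb.getResults().docList;
    NamedList<Object> report = new NamedList<>();

    for (Map.Entry<String,String> entry : specialQueries.entrySet()) {
      try {
        // parsed against the exact same schema/analysis as real searches
        Query q = QParser.getParser(entry.getKey(), null, rb.req).getQuery();
        DocSet qDocs = searcher.getDocSet(q);
        DocIterator it = results.iterator();
        while (it.hasNext()) {
          if (qDocs.exists(it.nextDoc())) {
            // report the matching query string + its metadata
            report.add(entry.getKey(), entry.getValue());
          }
        }
      } catch (Exception e) {
        throw new IOException(e);
      }
    }
    rb.rsp.add("special_matches", report);
  }

  @Override
  public String getDescription() {
    return "validates result docs against special keyword queries";
  }
}
```

You'd hook it into the validation core's solrconfig.xml as a
last-component on your handler so it runs after the stock query component
has built the result set.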
When a client says "here's my Y products I want to add", you would (a
rough SolrJ sketch of this round trip follows the list)...
1) index those Y products into your validation Solr core using
softCommit=true&openSearcher=true
2) execute a query using your special search component, filtered to just
the list of Y unique ids of the products the client just gave you (that
way you can handle concurrent requests from different clients w/o false
positives)
3) use the results of that query to tell your client things like "product
#123 matches 'Sony' so we are charging you more; and product #456 matches
'Porn' so we are rejecting it"
4) only when done would you re-index those products into your "real"
index.
5) help keep your "validation" index small by also doing a deleteById on
that whole batch of Y docs when you are done validating.
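
Something like this rough, modern-SolrJ sketch (the core name, /validate
handler, and "special_matches" response key all follow the hypothetical
names used above):

```java
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ValidationRoundTrip {
  public static void validateBatch(List<SolrInputDocument> products,
                                   List<String> ids) throws Exception {
    try (HttpSolrClient validation = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/validation").build()) {

      // 1) index the batch; a soft commit opens a new searcher cheaply
      validation.add(products);
      validation.commit(true, true, true); // waitFlush, waitSearcher, softCommit

      // 2) run the handler w/ the custom component, filtered to this batch
      SolrQuery q = new SolrQuery("*:*");
      q.setRequestHandler("/validate");
      q.addFilterQuery("{!terms f=id}" + String.join(",", ids));
      QueryResponse rsp = validation.query(q);

      // 3) report matches back to the client (hypothetical response key)
      System.out.println("matches: " + rsp.getResponse().get("special_matches"));

      // 4) ...only products that pass get indexed into the "real" core...

      // 5) keep the validation index small
      validation.deleteById(ids);
      validation.commit();
    }
  }
}
```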
The upside of this approach is that it helps you ensure the validation
logic you apply to products when you get them from clients *exactly*
matches your real queries, even if your schema & analysis evolve over
time. The downside is that it's a decent amount of custom plugin code you
need to write upfront, and it will get slower if/when the number of
special validation queries increases.
## Approach #2 - Approximate things with a reverse search
Build a small index where each document contains the text of one of your
special queries, copied into multiple fields with a variety of analysis
options configured (in particular: I suspect shingles would be fruitful
here). Set up a query structure that uses functions to combine the scores
of many queries against each of those fields -- this might be simple
addition, or you might want it to be conditional, ie: maybe you multiply
the sum of the scores of some queries against simple fields by the score
of a query against a really simple (ie: strict) field to eliminate false
positives.
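
For instance, a hypothetical schema.xml fragment along those lines (all
field and type names invented):

```xml
<!-- each special query becomes one doc; its text is copied into
     several fields w/ different analysis -->
<field name="kw"          type="string"       indexed="true" stored="true"/>
<field name="kw_simple"   type="text_general" indexed="true"/>
<field name="kw_shingles" type="text_shingle" indexed="true"/>
<!-- per-query score threshold, used below -->
<field name="threshold"   type="float"        indexed="true" stored="true"/>
<copyField source="kw" dest="kw_simple"/>
<copyField source="kw" dest="kw_shingles"/>

<fieldType name="text_shingle" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="3"/>
  </analyzer>
</fieldType>
```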
Experiment a bit to see what kinds of inputs get you what kinds of scores,
and maybe associate a "threshold" with each special query, indexed as a
numeric field on those docs; then fold that threshold value into your
calculation using the {!frange} parser to make sure you only count matches
that score above the "threshold" of the document they match against.
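
Expressed as raw request params, that check might look something like the
following (param names, fields, and boosts all invented):

```
q={!frange l=0}sub(query($titleq),threshold)
titleq={!edismax qf='kw_simple kw_shingles^2'}2013Sony Camera
```

Here query($titleq) produces the score of the incoming product title
against each special-query doc, and the {!frange l=0} wrapper only matches
docs where that score minus their indexed threshold is non-negative.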
Then, when your client gives you Y documents, send each one, one at a
time, against your little "reverse" index and see if it matches anything;
if so, report back to the client what it matched.
The upside of this approach is that it doesn't require implementing any
custom Java plugins, and it scales better if you expect the number of
special "queries" to grow w/o bound. The downside is that it only gives
you a heuristic indicating that something *might* be a match, and you will
have to constantly tune & adjust it if/when the analysis or query
structure of your production search system evolves.
-Hoss