It doesn't sound like a very good match with Solr - or any other search
engine or any relational database or data store for that matter. Sure,
maybe you can get something to work with extraordinary effort, but it is
unlikely that you will ever be happy with the results. You should probably
just bite the bullet and develop a full-custom in-memory data store that is
wired for the kinds of matching you are trying to accomplish. Sure, you can
probably scavenge some code/logic from Lucene, but that won't help with the
kind of patterns you are trying to match. Or... if you're willing to put
enough effort into it, you might be able to develop a custom Lucene Query
class that does in fact align with your pattern-matching requirements. But
that's not an out-of-the-box feature at this stage.

It's not that this type of sentence matching is so unusual or hasn't come
up before (once a year or so?), but it just doesn't have any natural fit in
Lucene as it was originally conceptualized. Solr and Lucene are focused on
querying documents, i.e. matching a query against a document, not matching one
set of documents against another set of documents. IOW, Solr/Lucene is a
query/search engine, not a document/sentence set matching system.

A sentence matcher would be a great new feature for Lucene/Solr, but it's
not there today.

You can also take a look at Elasticsearch Percolator for another example of
matching incoming documents against stored queries:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
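
To make the "stored queries" idea concrete, here is a rough sketch in plain
Python. It is not how Percolator, Luwak, or Lucene work internally. It assumes
every pattern is a single exact phrase, and the names (PhraseRegistry,
tokenize, the pattern ids) are made up for illustration. Patterns are kept in
memory, indexed by their first token, and each incoming sentence is checked
only against the patterns that could start at one of its tokens.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class PhraseRegistry:
    """Hypothetical in-memory store of exact-phrase patterns."""

    def __init__(self):
        self._phrases = {}                        # pattern id -> token list
        self._by_first_token = defaultdict(set)   # first token -> pattern ids

    def add(self, pattern_id, phrase):
        """Add a pattern, replacing any existing pattern with the same id."""
        tokens = tokenize(phrase)
        if not tokens:
            raise ValueError("pattern has no tokens")
        self.remove(pattern_id)
        self._phrases[pattern_id] = tokens
        self._by_first_token[tokens[0]].add(pattern_id)

    def remove(self, pattern_id):
        """Drop a pattern if it is present."""
        tokens = self._phrases.pop(pattern_id, None)
        if tokens:
            self._by_first_token[tokens[0]].discard(pattern_id)

    def match(self, sentence):
        """Return the ids of all patterns occurring as a phrase in sentence."""
        tokens = tokenize(sentence)
        hits = set()
        for i, token in enumerate(tokens):
            # Only patterns starting with this token can match at position i.
            for pattern_id in self._by_first_token.get(token, ()):
                phrase = self._phrases[pattern_id]
                if tokens[i:i + len(phrase)] == phrase:
                    hits.add(pattern_id)
        return hits


registry = PhraseRegistry()
registry.add("p1", "crime has fallen")
registry.add("p2", "unemployment rate")
print(registry.match("Ministers say crime has fallen by 10%."))
```

This only covers the real-time direction (new sentence against existing
patterns). When a pattern is added or changed, the sentences already stored
still have to be re-checked against it, which is a separate batch job and is
where an ordinary index over the sentences would help.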


-- Jack Krupansky

On Tue, Jan 5, 2016 at 11:05 AM, Allison, Timothy B. <talli...@mitre.org>
wrote:

> Might want to look into:
>
> https://github.com/flaxsearch/luwak
>
> or
>  https://github.com/OpenSextant/SolrTextTagger
>
> -----Original Message-----
> From: Will Moy [mailto:w...@fullfact.org]
> Sent: Tuesday, January 05, 2016 11:02 AM
> To: solr-user@lucene.apache.org
> Subject: Many patterns against many sentences, storing all results
>
> Hello
>
> Please may I have your advice as to whether Solr is a good tool for this
> job?
>
> We have (per year) –
> Up to 50,000,000 sentences
> And about 5,000 search patterns (i.e. queries)
>
> Our task is to identify all matches between any sentence and any search
> pattern.
>
> That list of detections must be kept up to date as patterns are added or
> updated (a handful an hour), and as new sentences are added.
>
> Some of the sentences will be added in real time, at probably max 100 /
> second and usually much less. The detections on these should be provided
> within 3 seconds.
>
> It's an unusual application in that we want all results in an external DB,
> and also in that every sentence is either a hit or not. We don't care about
> scoring results, only about matches for the exact search pattern entered.
>
> The application is automatically detecting instances of fact-checked
> statements.
>
> The smaller-scale prototype was done with postgres full text searching,
> but that can't do exact phrase matching or other more sophisticated
> searches, so it's out.
>
> Thanks very much
>
> Will
>
