Thank you both, that's really helpful. Luwak and Percolator look like good
places to dig deeper.

Best wishes

Will


*Will Moy*
Director
020 3397 5140

*Full Fact*
fullfact.org
Twitter <https://twitter.com/FullFact> • Facebook
<https://www.facebook.com/FullFact.org> • LinkedIn
<http://www.linkedin.com/company/fullfact>

A registered charity (no. 1158683) and a non-profit company (no. 6975984)
limited by guarantee and registered in England and Wales. 9 Warwick Court,
London WC1R 5DJ.

On 5 January 2016 at 17:17, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> It doesn't sound like a very good match with Solr - or any other search
> engine or any relational database or data store for that matter. Sure,
> maybe you can get something to work with extraordinary effort, but it is
> unlikely that you will ever be happy with the results. You should probably
> just bite the bullet and develop a full-custom in-memory data store that is
> wired for the kinds of matching you are trying to accomplish. Sure, you can
> probably scavenge some code/logic from Lucene, but that won't help with the
> kind of patterns you are trying to match. Or... if you're willing to put
> enough effort into it you might be able to develop a custom Lucene Query
> class that did in fact align with your pattern matching requirements. But
> that's not an out of the box feature at this stage.
>
> It's not that this type of sentence matching is so unusual or hasn't come
> up before (once a year or so?), but it just doesn't have any natural fit in
> Lucene as it was originally conceptualized. Solr and Lucene are focused on
> query of documents or matching a query against a document, not matching one
> set of documents against another set of documents. IOW, Solr/Lucene is a
> Query/Search engine, not a document/sentence set matching system.
>
> A sentence matcher would be a great new feature for Lucene/Solr, but it's
> not there today.
>
> You can also take a look at Elasticsearch Percolator for another example of
> matching incoming documents against stored queries:
>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
>
>
> -- Jack Krupansky
>
> On Tue, Jan 5, 2016 at 11:05 AM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > Might want to look into:
> >
> > https://github.com/flaxsearch/luwak
> >
> > or
> >  https://github.com/OpenSextant/SolrTextTagger
> >
> > -----Original Message-----
> > From: Will Moy [mailto:w...@fullfact.org]
> > Sent: Tuesday, January 05, 2016 11:02 AM
> > To: solr-user@lucene.apache.org
> > Subject: Many patterns against many sentences, storing all results
> >
> > Hello
> >
> > Please may I have your advice as to whether Solr is a good tool for this
> > job?
> >
> > We have (per year) –
> > Up to 50,000,000 sentences
> > And about 5,000 search patterns (i.e. queries)
> >
> > Our task is to identify all matches between any sentence and any search
> > pattern.
> >
> > That list of detections must be kept up to date as patterns are added or
> > updated (a handful an hour), and as new sentences are added.
> >
> > Some of the sentences will be added in real time, at probably max 100 /
> > second and usually much less. The detections on these should be provided
> > within 3 seconds.
> >
> > It's an unusual application in that we want all results in an external DB,
> > and also in that every sentence is either a hit or not. We don't care about
> > scoring results, only about matches for the exact search pattern entered.
> >
> > The application is automatically detecting instances of fact-checked
> > statements.
> >
> > The smaller-scale prototype was done with postgres full text searching,
> > but that can't do exact phrase matching or other more sophisticated
> > searches, so it's out.
> >
> > Thanks very much
> >
> > Will
> >
>
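
For readers following the thread, the "stored queries matched against incoming
documents" idea mentioned above can be illustrated with a minimal sketch. This
is not the Luwak or Elasticsearch Percolator API; the class PhraseMatcher, the
tokenize helper, and the pattern IDs are hypothetical names chosen for
illustration. It assumes every pattern is a plain exact phrase, which is
narrower than the "more sophisticated searches" the original question has in
mind.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed normalization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class PhraseMatcher:
    """Hypothetical in-memory store of exact-phrase patterns."""

    def __init__(self):
        self.patterns = {}                      # pattern_id -> tuple of tokens
        self.by_first_token = defaultdict(set)  # first token -> pattern ids

    def add_pattern(self, pattern_id, phrase):
        """Add a pattern, or replace it if the id already exists."""
        tokens = tuple(tokenize(phrase))
        if not tokens:
            raise ValueError("pattern contains no tokens")
        self.remove_pattern(pattern_id)
        self.patterns[pattern_id] = tokens
        self.by_first_token[tokens[0]].add(pattern_id)

    def remove_pattern(self, pattern_id):
        """Drop a pattern if present; do nothing otherwise."""
        tokens = self.patterns.pop(pattern_id, None)
        if tokens is not None:
            self.by_first_token[tokens[0]].discard(pattern_id)

    def match(self, sentence):
        """Return the ids of all patterns occurring as exact phrases."""
        tokens = tokenize(sentence)
        hits = set()
        for i, token in enumerate(tokens):
            # Only patterns starting with this token can match at position i.
            for pattern_id in self.by_first_token.get(token, ()):
                pattern = self.patterns[pattern_id]
                if tuple(tokens[i:i + len(pattern)]) == pattern:
                    hits.add(pattern_id)
        return hits


matcher = PhraseMatcher()
matcher.add_pattern("p1", "unemployment has fallen")
matcher.add_pattern("p2", "crime is rising")
print(matcher.match("The minister said unemployment has fallen, again."))
```

Indexing patterns by their first token means each incoming sentence is checked
only against patterns that could start at one of its words, rather than against
all 5,000. The sketch covers new sentences only: when a pattern is added or
updated, previously seen sentences would still need to be re-scanned, and
writing the hits to an external DB is left out.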
