Thank you both, that's really helpful. Luwak and Percolator look like good places to dig deeper.
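For anyone reading this thread later, the stored-query idea behind both suggestions can be sketched in a few lines. This is only a minimal illustration and not how Luwak or Percolator are implemented: the `PhraseMatcher` class, its method names and the simplistic tokenizer are all hypothetical, and it handles exact phrases only. Patterns are held in memory and each incoming sentence is checked against them, so every sentence is simply a hit or not, with no scoring.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split into word tokens (an assumed, simplistic analyzer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class PhraseMatcher:
    """Hypothetical in-memory store of exact-phrase patterns, keyed by first token."""

    def __init__(self):
        # first token of a pattern -> {pattern_id: tuple of pattern tokens}
        self._by_first_token = defaultdict(dict)

    def add_pattern(self, pattern_id, phrase):
        tokens = tuple(tokenize(phrase))
        if not tokens:
            raise ValueError("pattern must contain at least one token")
        self.remove_pattern(pattern_id)  # re-adding an existing id updates it
        self._by_first_token[tokens[0]][pattern_id] = tokens

    def remove_pattern(self, pattern_id):
        for patterns in self._by_first_token.values():
            patterns.pop(pattern_id, None)

    def match(self, sentence):
        """Return the sorted ids of every pattern found as a contiguous phrase."""
        tokens = tuple(tokenize(sentence))
        hits = set()
        for i, token in enumerate(tokens):
            # Only patterns starting with this token can match at position i.
            for pattern_id, pattern in self._by_first_token.get(token, {}).items():
                if tokens[i:i + len(pattern)] == pattern:
                    hits.add(pattern_id)
        return sorted(hits)


matcher = PhraseMatcher()
matcher.add_pattern("p1", "unemployment has fallen")
matcher.add_pattern("p2", "crime is rising")
print(matcher.match("The minister said unemployment has fallen again."))  # ['p1']
```

A new sentence costs one pass over its tokens. A new or changed pattern would still need to be run once against the existing sentences, which is the part a search index is better suited to.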

Best wishes

Will

*Will Moy*
Director
020 3397 5140

*Full Fact*
fullfact.org
Twitter <https://twitter.com/FullFact> • Facebook <https://www.facebook.com/FullFact.org> • LinkedIn <http://www.linkedin.com/company/fullfact>

A registered charity (no. 1158683) and a non-profit company (no. 6975984) limited by guarantee and registered in England and Wales. 9 Warwick Court, London WC1R 5DJ.

On 5 January 2016 at 17:17, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> It doesn't sound like a very good match with Solr - or any other search
> engine or any relational database or data store for that matter. Sure,
> maybe you can get something to work with extraordinary effort, but it is
> unlikely that you will ever be happy with the results. You should probably
> just bite the bullet and develop a full-custom in-memory data store that is
> wired for the kinds of matching you are trying to accomplish. Sure, you can
> probably scavenge some code/logic from Lucene, but that won't help with the
> kind of patterns you are trying to match. Or... if you're willing to put
> enough effort into it you might be able to develop a custom Lucene Query
> class that did in fact align with your pattern matching requirements. But
> that's not an out of the box feature at this stage.
>
> It's not that this type of sentence matching is so unusual or hasn't come
> up before (once a year or so?), but it just doesn't have any natural fit in
> Lucene as it was originally conceptualized. Solr and Lucene are focused on
> query of documents or matching a query against a document, not matching one
> set of documents against another set of documents. IOW, Solr/Lucene is a
> Query/Search engine, not a document/sentence set matching system.
>
> A sentence matcher would be a great new feature for Lucene/Solr, but it's
> not there today.
>
> You can also take a look at Elasticsearch Percolator for another example of
> matching incoming documents against stored queries:
>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
>
> -- Jack Krupansky
>
> On Tue, Jan 5, 2016 at 11:05 AM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > Might want to look into:
> >
> > https://github.com/flaxsearch/luwak
> >
> > or
> > https://github.com/OpenSextant/SolrTextTagger
> >
> > -----Original Message-----
> > From: Will Moy [mailto:w...@fullfact.org]
> > Sent: Tuesday, January 05, 2016 11:02 AM
> > To: solr-user@lucene.apache.org
> > Subject: Many patterns against many sentences, storing all results
> >
> > Hello
> >
> > Please may I have your advice as to whether Solr is a good tool for this
> > job?
> >
> > We have (per year) –
> > Up to 50,000,000 sentences
> > And about 5,000 search patterns (i.e. queries)
> >
> > Our task is to identify all matches between any sentence and any search
> > pattern.
> >
> > That list of detections must be kept up to date as patterns are added or
> > updated (a handful an hour), and as new sentences are added.
> >
> > Some of the sentences will be added in real time, at probably max 100 /
> > second and usually much less. The detections on these should be provided
> > within 3 seconds.
> >
> > It's an unusual application in that we want all results in an external DB,
> > and also in that every sentence is either a hit or not. We don't care about
> > scoring results, only about matches for the exact search pattern entered.
> >
> > The application is automatically detecting instances of factchecked
> > statements.
> >
> > The smaller-scale prototype was done with postgres full text searching,
> > but that can't do exact phrase matching or other more sophisticated
> > searches, so it's out.
> >
> > Thanks very much
> >
> > Will