Hi - trying to extract entities for facets (or anything else) using IDF works poorly at best. MLT (MoreLikeThis) works well because of its scoring, not because it extracts entities; it doesn't. The OpenNLP Lucene filters do what you need, but the results depend on the model you build. The freely available maxent models are crude, so you do need to build models yourself. Tedious, but more fruitful than any shortcut I know of.
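To make the index-time idea concrete, here is a minimal sketch in Python. It is not the OpenNLP filter chain itself: the name list stands in for a model you trained yourself, and the field name `entities_ss` is an assumption rather than something from this thread. The point is only that entities are extracted before indexing and stored in their own low-cardinality field, which you then facet on instead of the full `content` field.

```python
import re

# Hypothetical stand-in for a trained NER model: a tiny list of names.
# A real setup would call a model you built yourself instead.
KNOWN_ENTITIES = {"Churchill", "Gandhi", "Einstein"}


def extract_entities(text, known=KNOWN_ENTITIES):
    """Return the sorted, de-duplicated known entities that occur in text."""
    tokens = set(re.findall(r"[A-Za-z]+", text))
    return sorted(tokens & set(known))


def build_solr_doc(doc_id, content):
    """Build an update document with a separate entity field.

    'entities_ss' is an assumed multi-valued string field name.
    """
    return {
        "id": doc_id,
        "content": content,
        "entities_ss": extract_entities(content),
    }


doc = build_solr_doc("pdf-2", "blablabla Gandhi blablabla Churchill blablabla")
print(doc["entities_ss"])  # ['Churchill', 'Gandhi']
```

Faceting on the entity field then returns only names the model recognised, so the number of distinct values stays far below that of the raw text.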
M.

-----Original message-----
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Tuesday 24th March 2015 0:50
> To: solr-user@lucene.apache.org
> Subject: Re: Creating facets based on the content field
>
> I wasn't talking about using NLP at query time. I was trying to convey
> that perhaps NLP processing on documents at _index_ time could reduce
> the number of distinct tokens you then facet over at query time.
>
> The basic caution still remains: faceting on high-cardinality fields
> is expensive. It's just a caution about trying your queries on a
> corpus that's representative of your final corpus in terms of size
> before deciding whether it's fast enough and will work on your
> hardware.
>
> Best,
> Erick
>
> On Mon, Mar 23, 2015 at 2:20 PM, Philippe de Rochambeau <phi...@free.fr>
> wrote:
> > Hi Erick,
> > can you use NLP for query-time faceting? How?
> > Moreover, can you use it to find keyword patterns?
> > Cheers,
> > Philippe
> >
> >
> >> On 23 March 2015 at 18:44, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >> Be a little careful here about memory. Faceting on high-cardinality
> >> fields is a very good way to encounter OOM and/or performance
> >> problems.
> >>
> >> But you're right, in Solr faceting is a query-time construct, it needs
> >> nothing at index time. The NLP stuff can help narrow down the number
> >> of unique values in the field you're faceting on.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mon, Mar 23, 2015 at 9:41 AM, <phi...@free.fr> wrote:
> >>> I just want a list of recurring words (for now.)
> >>>
> >>> I removed the manually-created facets from solrconfig.xml and SOLR
> >>> "automagically" created a facet list for me.
> >>>
> >>> But thanks for your suggestions.
> >>>
> >>>
> >>> ----- Original message -----
> >>> From: "Charlie Hull" <char...@flax.co.uk>
> >>> To: solr-user@lucene.apache.org
> >>> Sent: Monday 23 March 2015 17:26:18
> >>> Subject: Re: Creating facets based on the content field
> >>>
> >>>> On 23/03/2015 16:08, phi...@free.fr wrote:
> >>>> Let's say that one pdf has the following contents:
> >>>
> >>> Aren't you thinking of Named Entity Recognition? We've used Stanford NLP
> >>> for this in the past and it's quite good at People, Places and
> >>> Organisations out of the box (needs tuning for other classes of
> >>> entities). You can then add these entities as metadata to your document
> >>> objects and index them so you can facet on them appropriately.
> >>>
> >>> Cheers
> >>>
> >>> Charlie
> >>>>
> >>>> "[thousands of characters] blablabla Churchill blablabla [thousands of
> >>>> text characters]"
> >>>>
> >>>> ... and another PDF contains:
> >>>>
> >>>> "[thousands of characters] blablabla Gandhi [thousands of characters]
> >>>> Churchill blablabla [thousands of text characters]"
> >>>>
> >>>> As you can see, these two PDFs contain keywords that are potential
> >>>> candidates for facets (e.g. Churchill, Gandhi, ...), but I have no
> >>>> way of knowing that when adding facets to the solrconfig.xml file,
> >>>> unless I read all the PDFs (which will take me years) and compile a list
> >>>> of often-occurring words and names.
> >>>>
> >>>> The fallback solution is therefore to guess the keywords that are
> >>>> likely to appear in the PDFs; e.g.:
> >>>>
> >>>> <str name="facet.query">Aircraft</str>
> >>>> <str name="facet.query">Armistice</str>
> >>>> <str name="facet.query">Austria</str>
> >>>> <str name="facet.query">Bolshevik</str>
> >>>> <str name="facet.query">Britain</str>
> >>>> <str name="facet.query">British</str>
> >>>> <str name="facet.query">Charlie Chaplin</str>
> >>>> <str name="facet.query">Clemenceau</str>
> >>>> <str name="facet.query">Einstein</str>
> >>>> ...
> >>>>
> >>>>
> >>>> However, how can I be sure that these facets will be useful to the other
> >>>> 'core' users? For instance, let's say that one
> >>>> user is more interested in Gandhi than Einstein: the "Einstein" facet is
> >>>> therefore useless to him and a "Gandhi" facet is missing from
> >>>> solrconfig.xml.
> >>>>
> >>>> Is there a way to dynamically generate a list of facets based on words
> >>>> contained in the content field?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Philippe
> >>>>
> >>>>
> >>>> ----- Original message -----
> >>>> From: "Erik Hatcher" <erik.hatc...@gmail.com>
> >>>> To: solr-user@lucene.apache.org
> >>>> Sent: Monday 23 March 2015 16:30:49
> >>>> Subject: Re: Creating facets based on the content field
> >>>>
> >>>> Philippe - can you provide a concrete example of what you mean by
> >>>> creating facets on a field's content? Or maybe rather, what's missing
> >>>> from doing &facet.field=content currently?
> >>>>
> >>>> Erik
> >>>>
> >>>>
> >>>>> On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> let's say that you have indexed hundreds of PDFs using the following
> >>>>> curl command:
> >>>>>
> >>>>> curl -Ss -X POST
> >>>>> 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
> >>>>>
> >>>>> The PDF's contents are now stored in core0's "content" field.
> >>>>>
> >>>>> I wonder how you create facets based on the field's contents, if you
> >>>>> don't know in advance what it contains (unless you have compiled a list
> >>>>> of frequently-occurring words in the PDFs, after reading them.)
> >>>>>
> >>>>> Many thanks.
> >>>>>
> >>>>> Philippe
> >>>
> >>>
> >>> --
> >>> Charlie Hull
> >>> Flax - Open Source Enterprise Search
> >>>
> >>> tel/fax: +44 (0)8700 118334
> >>> mobile: +44 (0)7767 825828
> >>> web: www.flax.co.uk
>
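
For reference, the `&facet.field=content` request discussed in the quoted thread can be sketched as a URL built in Python. The host and core come from the curl command above; the `/select` handler path and the `facet.limit` and `facet.mincount` values are assumptions chosen for illustration. As cautioned in the thread, try this on a corpus of realistic size first, because faceting on a high-cardinality field is expensive.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def facet_url(base, field, limit=20, mincount=2):
    """Build a Solr query URL that returns only facet counts for one field."""
    params = [
        ("q", "*:*"),            # match every document
        ("rows", "0"),           # no documents needed, only facet counts
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),  # field whose terms are counted
        ("facet.limit", str(limit)),        # assumed cap on returned terms
        ("facet.mincount", str(mincount)),  # assumed cutoff: skip rare terms
    ]
    return base + "?" + urlencode(params)


# Assumed handler path; host and core are taken from the thread.
url = facet_url("http://mysolr:8990/solr/core0/select", "content")
print(url)
```

Pointing `field` at a dedicated entity field instead of `content` keeps the same request shape while faceting over far fewer distinct values.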