I just want a list of recurring words (for now). I removed the manually created facets from solrconfig.xml and Solr "automagically" created a facet list for me.
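In case it helps someone searching the archive: a list of recurring words like this can be requested at query time by faceting on the field, along the lines of Erik's &facet.field=content. Below is a minimal Python sketch under stated assumptions, not a description of the exact setup. The host, port and core name are the ones from the curl command quoted further down; the function names and the facet.limit / facet.mincount values are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL: host, port and core name come from the curl command in this thread.
SOLR_SELECT_URL = "http://mysolr:8990/solr/core0/select"


def build_facet_url(field="content", limit=20, mincount=2, base=SOLR_SELECT_URL):
    """Build a query URL asking Solr for the most frequent indexed terms of a field."""
    params = {
        "q": "*:*",                  # match every document
        "rows": 0,                   # only the facet counts are needed, not the documents
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,        # hypothetical value: keep the top N terms
        "facet.mincount": mincount,  # hypothetical value: drop terms found in fewer documents
    }
    return base + "?" + urlencode(params)


def parse_facet_counts(response, field="content"):
    """Turn Solr's default flat [term, count, term, count, ...] list into (term, count) pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


def fetch_top_terms(field="content", **kwargs):
    """Query Solr (network call) and return the top terms for the field."""
    with urlopen(build_facet_url(field, **kwargs)) as resp:
        return parse_facet_counts(json.load(resp), field)
```

The counts are per indexed term, so what comes back depends on the field's analyzer: terms may be lowercased or stemmed, and common stop words will top the list unless they are filtered out.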
But thanks for your suggestions.

----- Original Message -----
From: "Charlie Hull" <char...@flax.co.uk>
To: solr-user@lucene.apache.org
Sent: Monday, 23 March 2015 17:26:18
Subject: Re: Creating facets based on the content field

On 23/03/2015 16:08, phi...@free.fr wrote:
> Let's say that one pdf has the following contents:

Aren't you thinking of Named Entity Recognition? We've used Stanford NLP
for this in the past and it's quite good at People, Places and
Organisations out of the box (needs tuning for other classes of
entities). You can then add these entities as metadata to your document
objects and index them so you can facet on them appropriately.

Cheers

Charlie

>
> "[thousands of characters] blablabla Churchill blablabla [thousands of text
> characters]"
>
> ... and another PDF contains:
>
> "[thousands of characters] blablabla Gandhi [thousands of characters]
> Churchill blablabla [thousands of text characters]"
>
> As you can see, these two PDFs contain keywords that are potential candidates
> for facets (e.g. Churchill, Gandhi, ...), but I have no
> way of knowing that when adding facets to the solrconfig.xml file, unless I
> read all the PDFs (which will take me years) and compile a list of
> often-occurring words and names.
>
> The fallback solution is therefore to guess the keywords that are likely to
> appear in the PDFs; e.g.:
>
> <str name="facet.query">Aircraft</str>
> <str name="facet.query">Armistice</str>
> <str name="facet.query">Austria</str>
> <str name="facet.query">Bolshevik</str>
> <str name="facet.query">Britain</str>
> <str name="facet.query">British</str>
> <str name="facet.query">Charlie Chaplin</str>
> <str name="facet.query">Clemenceau</str>
> <str name="facet.query">Einstein</str>
> ...
>
> However, how can I be sure that these facets will be useful to the other
> 'core' users? For instance, let's say that one
> user is more interested in Gandhi than Einstein: the "Einstein" facet is
> therefore useless to him and a "Gandhi" facet is missing from solrconfig.xml.
>
> Is there a way to dynamically generate a list of facets based on words
> contained in the content field?
>
> Cheers,
>
> Philippe
>
> ----- Original Message -----
> From: "Erik Hatcher" <erik.hatc...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday, 23 March 2015 16:30:49
> Subject: Re: Creating facets based on the content field
>
> Philippe - can you provide a concrete example of what you mean by creating
> facets on a field’s content? Or maybe rather, what’s missing from doing
> &facet.field=content currently?
>
> Erik
>
>> On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
>>
>> Hello,
>>
>> let's say that you have indexed hundreds of PDFs using the following curl
>> command:
>>
>> curl -Ss -X POST \
>> 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
>>
>> The PDF's contents are now stored in core0's "content" field.
>>
>> I wonder how you create facets based on the field's contents, if you don't
>> know in advance what it contains (unless you have compiled a list of
>> frequently-occurring words in the PDFs, after reading them.)
>>
>> Many thanks.
>>
>> Philippe

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk