Let's say that one PDF has the following contents: "[thousands of characters] blablabla Churchill blablabla [thousands of text characters]"
... and another PDF contains: "[thousands of characters] blablabla Gandhi [thousands of characters] Churchill blablabla [thousands of text characters]"

As you can see, these two PDFs contain keywords that are potential candidates for facets (e.g. Churchill, Gandhi, ...), but I have no way of knowing that when adding facets to the solrconfig.xml file, unless I read all the PDFs (which would take me years) and compile a list of frequently occurring words and names.

The fallback solution is therefore to guess which keywords are likely to appear in the PDFs; e.g.:

<str name="facet.query">Aircraft</str>
<str name="facet.query">Armistice</str>
<str name="facet.query">Austria</str>
<str name="facet.query">Bolshevik</str>
<str name="facet.query">Britain</str>
<str name="facet.query">British</str>
<str name="facet.query">Charlie Chaplin</str>
<str name="facet.query">Clemenceau</str>
<str name="facet.query">Einstein</str>
...

However, how can I be sure that these facets will be useful to the core's other users? For instance, let's say that one user is more interested in Gandhi than in Einstein: the "Einstein" facet is therefore useless to him, and a "Gandhi" facet is missing from solrconfig.xml.

Is there a way to dynamically generate a list of facets based on the words contained in the content field?

Cheers,

Philippe

----- Original message -----
From: "Erik Hatcher" <erik.hatc...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Monday 23 March 2015 16:30:49
Subject: Re: Creating facets based on the content field

Philippe - can you provide a concrete example of what you mean by creating facets on a field’s content? Or maybe rather, what’s missing from doing &facet.field=content currently?
Erik

> On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
>
> Hello,
>
> let's say that you have indexed hundreds of PDFs using the following curl
> command:
>
> curl -Ss -X POST 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
>
> The PDFs' contents are now stored in core0's "content" field.
>
> I wonder how you create facets based on the field's contents, if you don't
> know in advance what it contains (unless you have compiled a list of
> frequently-occurring words in the PDFs, after reading them.)
>
> Many thanks.
>
> Philippe
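
For reference, the &facet.field=content request mentioned in the reply could be sketched roughly as follows in Python. This is only an illustration, not a tested setup: it assumes the "content" field is indexed and tokenized, that the core exposes the usual /select handler, and that the JSON response uses Solr's default flat [term, count, term, count, ...] layout for field facets. The function names and the facet.limit / facet.mincount values are hypothetical and chosen just for the example.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, field="content", limit=20, min_count=2):
    """Build a /select URL asking Solr for the most frequent terms of a field.

    base_url is the core URL (e.g. http://mysolr:8990/solr/core0).
    limit and min_count are illustrative values, not recommendations.
    """
    params = {
        "q": "*:*",                   # match every document
        "rows": 0,                    # only the facet counts are wanted
        "wt": "json",
        "facet": "true",
        "facet.field": field,         # facet on the indexed terms of this field
        "facet.limit": limit,         # return at most this many terms
        "facet.mincount": min_count,  # skip terms with fewer matching documents
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


def parse_facet_counts(response, field="content"):
    """Turn the flat [term, count, term, count, ...] list into (term, count) pairs.

    Assumes the default JSON layout for facet_fields; a different json.nl
    setting would change the shape of this list.
    """
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    # Print the request URL; fetching it requires a running Solr instance.
    print(build_facet_url("http://mysolr:8990/solr/core0"))
```

Note that the counts returned this way are numbers of documents containing each indexed term, so the result depends on how the field is analyzed: common words will dominate unless they are filtered out, and multi-word names such as "Charlie Chaplin" will come back as separate terms.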