I think you are overcomplicating this before actually trying it. If you index your texts and tokenize them into individual words, then "facet.field=content" will actually give you the list of words sorted by their occurrence count (that is, the number of matching documents each word appears in). That's what faceting does.
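To make that concrete, here is a minimal sketch of such a request, assuming the host, port and core name from the curl command quoted further down and the standard /select handler. The helper names and the sample response are made up for illustration; only the facet parameters themselves are standard Solr ones. Nothing here contacts a server: it only builds the URL and shows how to read the flat term/count list Solr returns in its default JSON layout.

```python
from urllib.parse import urlencode

# Assumption: same host, port and core as in the quoted curl command,
# queried through the standard /select handler.
SOLR_SELECT = "http://mysolr:8990/solr/core0/select"


def build_facet_url(field="content", limit=20):
    """Build a select URL that returns only facet counts for `field`."""
    params = [
        ("q", "*:*"),            # match every document
        ("rows", 0),             # no documents in the response, facets only
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),  # facet on the tokenized field
        ("facet.limit", limit),  # keep only the top `limit` terms
        ("facet.mincount", 1),   # skip terms that match no document
    ]
    return SOLR_SELECT + "?" + urlencode(params)


def top_terms(response, field="content"):
    """Pair up Solr's flat [term, count, term, count, ...] facet list."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical response fragment, hand-written for illustration only.
sample = {"facet_counts": {"facet_fields": {"content": ["churchill", 2, "gandhi", 1]}}}

print(build_facet_url())
print(top_terms(sample))
```

On real text, expect very common words to dominate the top of the list unless the field's analyzer filters stop words.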
A bigger problem, judging from your example, is that I still don't see how exactly that will be good for your users. But perhaps seeing the actual results will help with that too.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

On 23 March 2015 at 12:08, <phi...@free.fr> wrote:
> Let's say that one PDF has the following contents:
>
> "[thousands of characters] blablabla Churchill blablabla [thousands of text characters]"
>
> ... and another PDF contains:
>
> "[thousands of characters] blablabla Gandhi [thousands of characters] Churchill blablabla [thousands of text characters]"
>
> As you can see, these two PDFs contain keywords that are potential candidates
> for facets (e.g. Churchill, Gandhi, ...), but I have no
> way of knowing that when adding facets to the solrconfig.xml file, unless I
> read all the PDFs (which will take me years) and compile a list of
> often-occurring words and names.
>
> The fallback solution is therefore to guess the keywords that are likely to
> appear in the PDFs; e.g.:
>
> <str name="facet.query">Aircraft</str>
> <str name="facet.query">Armistice</str>
> <str name="facet.query">Austria</str>
> <str name="facet.query">Bolshevik</str>
> <str name="facet.query">Britain</str>
> <str name="facet.query">British</str>
> <str name="facet.query">Charlie Chaplin</str>
> <str name="facet.query">Clemenceau</str>
> <str name="facet.query">Einstein</str>
> ...
>
> However, how can I be sure that these facets will be useful to the other
> 'core' users? For instance, let's say that one
> user is more interested in Gandhi than Einstein: the "Einstein" facet is
> therefore useless to him and a "Gandhi" facet is missing from solrconfig.xml.
>
> Is there a way to dynamically generate a list of facets based on words
> contained in the content field?
>
> Cheers,
>
> Philippe
>
> ----- Original message -----
> From: "Erik Hatcher" <erik.hatc...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday 23 March 2015 16:30:49
> Subject: Re: Creating facets based on the content field
>
> Philippe - can you provide a concrete example of what you mean by creating
> facets on a field’s content? Or maybe rather, what’s missing from doing
> &facet.field=content currently?
>
> Erik
>
>> On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
>>
>> Hello,
>>
>> let's say that you have indexed hundreds of PDFs using the following curl
>> command:
>>
>> curl -Ss -X POST
>> 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
>>
>> The PDFs' contents are now stored in core0's "content" field.
>>
>> I wonder how you create facets based on the field's contents if you don't
>> know in advance what it contains (unless you have compiled a list of
>> frequently occurring words in the PDFs after reading them).
>>
>> Many thanks.
>>
>> Philippe