I think you are over-complicating this before actually trying it. If
you index your texts and tokenize them into individual words, then
"facet.field=content" will actually give you the list of words sorted
by their count (the number of matching documents each word occurs in).
That's what faceting does.
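
For what it's worth, here is a rough sketch of such a request in Python. It is only an illustration: the host, port and core name are copied from the curl command quoted further down, the "content" field is assumed to be indexed and tokenized (with lowercasing), and the sample response is made up to mirror the two-PDF example rather than taken from a real server.

```python
from urllib.parse import urlencode

# Assumption: host, port and core name come from the curl command quoted
# later in this thread; adjust them for your own installation.
SOLR_SELECT = "http://mysolr:8990/solr/core0/select"


def facet_url(field="content", limit=20):
    """Build a select URL that asks only for facet counts on `field`."""
    params = {
        "q": "*:*",            # match every document
        "rows": 0,             # no documents needed, only the counts
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,  # return the top N terms
        "facet.mincount": 1,   # skip terms with no matching document
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


def top_terms(response, field="content"):
    """Turn Solr's flat [term, count, term, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Made-up response fragment in Solr's default flat JSON layout,
# for illustration only.
sample = {
    "facet_counts": {
        "facet_fields": {"content": ["churchill", 2, "gandhi", 1]}
    }
}

print(facet_url())
print(top_terms(sample))
```

Expect the top of a real list to be dominated by very common words unless the field's analyzer removes stopwords, so it is worth looking at actual output before deciding whether this is useful.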

A bigger problem is - from your example - that I still don't see how
exactly that will be good for your users. But perhaps seeing the
actual results will help with that too.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 23 March 2015 at 12:08,  <phi...@free.fr> wrote:
> Let's say that one pdf has the following contents:
>
> "[thousands of characters] blablabla Churchill blablabla [thousands of text 
> characters]"
>
> ... and another PDF contains:
>
> "[thousands of characters] blablabla Gandhi [thousands of characters] 
> Churchill blablabla [thousands of text characters]"
>
> As you can see, these two PDFs contain keywords that are potential candidates
> for facets (e.g. Churchill, Gandhi, ...), but I have no
> way of knowing that when adding facets to the solrconfig.xml file, unless I 
> read all the PDFs (which will take me years) and compile a list of 
> often-occurring words and names.
>
> The fallback solution is therefore to guess the keywords, which are likely to 
> appear in the PDFs; e.g.:
>
>                                 <str name="facet.query">Aircraft</str>
>                                 <str name="facet.query">Armistice</str>
>                                 <str name="facet.query">Austria</str>
>                                 <str name="facet.query">Bolshevik</str>
>                                 <str name="facet.query">Britain</str>
>                                 <str name="facet.query">British</str>
>                                 <str name="facet.query">Charlie Chaplin</str>
>                                 <str name="facet.query">Clemenceau</str>
>                                 <str name="facet.query">Einstein</str>
> ...
>
>
> However, how can I be sure that these facets will be useful to the other 
> 'core' users? For instance, let's say that one
> user is more interested in Gandhi than Einstein: the "Einstein" facet is
> therefore useless to him and a "Gandhi" facet is missing from solrconfig.xml.
>
> Is there a way to dynamically generate a list of facets based on words 
> contained in the content field?
>
> Cheers,
>
> Philippe
>
>
>
>
>
> ----- Original message -----
> From: "Erik Hatcher" <erik.hatc...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Monday 23 March 2015 16:30:49
> Subject: Re: Creating facets based on the content field
>
> Philippe - can you provide a concrete example of what you mean by creating
> facets on a field’s content? Or maybe rather, what’s missing from doing
> &facet.field=content currently?
>
>     Erik
>
>
>
>
>> On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
>>
>> Hello,
>>
>> let's say that you have indexed hundreds of PDFs using the following curl
>> command:
>>
>> curl -Ss -X POST 
>> 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
>>
>> The PDF's contents are now stored in core0's "content" field.
>>
>> I wonder how you create facets based on the field's contents, if you don't 
>> know in advance what it contains (unless you have compiled a list of 
>> frequently-occurring words in the PDFs, after reading them.)
>>
>> Many thanks.
>>
>> Philippe
>>
>>
>
