Let's say that one PDF has the following contents:

"[thousands of characters] blablabla Churchill blablabla [thousands of text 
characters]"

... and another PDF contains:

"[thousands of characters] blablabla Gandhi [thousands of characters] Churchill 
blablabla [thousands of text characters]"

As you can see, these two PDFs contain keywords that are potential candidates 
for facets (e.g. Churchill, Gandhi, ...), but I have no way of knowing that 
when adding facets to the solrconfig.xml file, unless I read all the PDFs 
(which would take me years) and compile a list of frequently occurring words 
and names.
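
To make "compile a list of frequently occurring words" concrete, here is a 
minimal sketch of what that would amount to if the extracted text were 
available outside Solr. The function name candidate_keywords, the min_length 
cutoff and the two sample strings are assumptions for illustration only; they 
are not part of Solr or of my setup.

```python
import re
from collections import Counter


def candidate_keywords(texts, top_n=5, min_length=4):
    """Rank words by the number of texts they appear in (document frequency)."""
    doc_freq = Counter()
    for text in texts:
        # Use a set so each word is counted at most once per text.
        words = set(re.findall(r"[A-Za-z]+", text.lower()))
        doc_freq.update(w for w in words if len(w) >= min_length)
    return doc_freq.most_common(top_n)


if __name__ == "__main__":
    # Stand-ins for the extracted text of two PDFs.
    samples = [
        "blablabla Churchill blablabla",
        "blablabla Gandhi Churchill blablabla",
    ]
    for word, count in candidate_keywords(samples):
        print(word, count)
```

Filler words rank as high as names in a count like this, so the result would 
still need a stop-word list or manual pruning before it could serve as a facet 
list.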

The fallback solution is therefore to guess which keywords are likely to 
appear in the PDFs, e.g.:

    <str name="facet.query">Aircraft</str>
    <str name="facet.query">Armistice</str>
    <str name="facet.query">Austria</str>
    <str name="facet.query">Bolshevik</str>
    <str name="facet.query">Britain</str>
    <str name="facet.query">British</str>
    <str name="facet.query">Charlie Chaplin</str>
    <str name="facet.query">Clemenceau</str>
    <str name="facet.query">Einstein</str>
...


However, how can I be sure that these facets will be useful to the core's 
other users? For instance, let's say that one user is more interested in 
Gandhi than Einstein: the "Einstein" facet is therefore useless to him, and a 
"Gandhi" facet is missing from solrconfig.xml.

Is there a way to dynamically generate a list of facets based on words 
contained in the content field?

Cheers,

Philippe





----- Original message -----
From: "Erik Hatcher" <erik.hatc...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Monday, 23 March 2015 16:30:49
Subject: Re: Creating facets based on the content field

Philippe - can you provide a concrete example of what you mean by creating 
facets on a field’s content? Or rather, what’s missing from doing 
&facet.field=content currently?
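
For reference, a rough sketch of what such a request could look like when 
built from Python. The host, port and core name are taken from the curl 
command quoted below; the /select handler path, the limit and mincount values, 
and the helper names build_facet_url and pair_up are assumptions for 
illustration. The sample list passed to pair_up is made up, not real output.

```python
from urllib.parse import urlencode

# Host, port and core come from the curl command in this thread;
# the /select path is an assumption (Solr's usual default handler).
SOLR_SELECT_URL = "http://mysolr:8990/solr/core0/select"


def build_facet_url(field="content", limit=20, mincount=2):
    """Build a URL asking Solr for the most frequent indexed terms in `field`."""
    params = [
        ("q", "*:*"),                  # match every document
        ("rows", 0),                   # only the facet counts are wanted
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),        # number of terms to return
        ("facet.mincount", mincount),  # skip terms found in fewer documents
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def pair_up(flat):
    """Convert Solr's default flat list [term, count, term, count, ...] to pairs."""
    return list(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(build_facet_url())
    # Made-up example of a facet_counts.facet_fields.content value.
    print(pair_up(["churchill", 2, "gandhi", 1]))
```

Note that faceting on a tokenized text field returns the analyzed terms, so 
names are typically lowercased and a multi-word name such as "Charlie Chaplin" 
comes back as separate terms.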

    Erik




> On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
> 
> Hello,
> 
> let's say that you have indexed hundreds of PDFs using the following curl 
> command:
> 
> curl -Ss -X POST 
> 'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
> 
> The PDF's contents are now stored in core0's "content" field.
> 
> I wonder how you create facets based on the field's contents, if you don't 
> know in advance what it contains (unless you have compiled a list of 
> frequently-occurring words in the PDFs, after reading them.)
> 
> Many thanks.
> 
> Philippe
> 
> 
