On 23/03/2015 16:08, phi...@free.fr wrote:
Let's say that one pdf has the following contents:

Aren't you thinking of Named Entity Recognition? We've used Stanford NLP for this in the past and it's quite good at People, Places and Organisations out of the box (needs tuning for other classes of entities). You can then add these entities as metadata to your document objects and index them so you can facet on them appropriately.
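
As a rough sketch of that last step (attaching entities as metadata so they can be faceted on), the snippet below assumes the NER stage has already produced (text, label) pairs and only shows how they might be turned into a Solr document and a facet request. The field names people_ss, places_ss and organisations_ss are hypothetical multi-valued string fields, not something from this thread; the labels PERSON, LOCATION and ORGANIZATION are the ones used by Stanford NER's 3-class model.

```python
from collections import defaultdict
from urllib.parse import urlencode

# Hypothetical mapping from NER labels to multi-valued Solr string fields.
LABEL_TO_FIELD = {
    "PERSON": "people_ss",
    "LOCATION": "places_ss",
    "ORGANIZATION": "organisations_ss",
}


def build_solr_doc(doc_id, content, entities):
    """Build a Solr document dict with one multi-valued field per entity class.

    `entities` is an iterable of (text, label) pairs from an NER step that
    is assumed to have run already. Unknown labels are ignored and
    duplicates are dropped while keeping first-seen order.
    """
    fields = defaultdict(list)
    for text, label in entities:
        field = LABEL_TO_FIELD.get(label)
        if field and text not in fields[field]:
            fields[field].append(text)
    doc = {"id": doc_id, "content": content}
    doc.update(fields)
    return doc


def facet_params(query="*:*"):
    """Return a query string that facets on every entity field."""
    pairs = [("q", query), ("facet", "true"), ("facet.mincount", "1")]
    pairs += [("facet.field", field) for field in LABEL_TO_FIELD.values()]
    return urlencode(pairs)
```

With documents indexed this way, each user sees whichever people, places and organisations actually occur in the matching documents, rather than a fixed list of facet.query entries.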

Cheers

Charlie

"[thousands of characters] blablabla Churchill blablabla [thousands of text 
characters]"

... and another PDF contains:

"[thousands of characters] blablabla Gandhi [thousands of characters] Churchill 
blablabla [thousands of text characters]"

As you can see, these two PDFs contain keywords that are potential candidates 
for facets (e.g. Churchill, Gandhi, ...), but I have no
way of knowing that when adding facets to the solrconfig.xml file, unless I 
read all the PDFs (which will take me years) and compile a list of 
often-occurring words and names.

The fallback solution is therefore to guess the keywords, which are likely to 
appear in the PDFs; e.g.:

    <str name="facet.query">Aircraft</str>
    <str name="facet.query">Armistice</str>
    <str name="facet.query">Austria</str>
    <str name="facet.query">Bolshevik</str>
    <str name="facet.query">Britain</str>
    <str name="facet.query">British</str>
    <str name="facet.query">Charlie Chaplin</str>
    <str name="facet.query">Clemenceau</str>
    <str name="facet.query">Einstein</str>
...


However, how can I be sure that these facets will be useful to the other 'core' 
users? For instance, let's say that one
user is more interested in Gandhi than Einstein: the "Einstein" facet is therefore 
useless to him and a "Gandhi" facet is missing from solrconfig.xml.

Is there a way to dynamically generate a list of facets based on words 
contained in the content field?

Cheers,

Philippe





----- Original Message -----
From: "Erik Hatcher" <erik.hatc...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Monday 23 March 2015 16:30:49
Subject: Re: Creating facets based on the content field

Philippe - can you provide a concrete example of what you mean by creating facets 
on a field’s content? Or maybe rather, what’s missing from doing 
&facet.field=content currently?
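
For reference, a request of the kind Erik mentions could be built roughly as follows. The host and core name come from the curl command quoted further down; rows=0 and the facet.limit and facet.mincount values are illustrative choices, not settings from this thread. Faceting on a tokenized text field returns the indexed terms after analysis, so the values are individual terms rather than names or phrases.

```python
from urllib.parse import urlencode


def content_facet_url(base="http://mysolr:8990/solr/core0", limit=50):
    """Build a /select URL that facets on the terms of the content field."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),               # only the facet counts are wanted
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", "content"),
        ("facet.limit", str(limit)),  # illustrative cap on returned terms
        ("facet.mincount", "1"),
    ]
    return f"{base}/select?{urlencode(params)}"


def pair_facet_counts(flat):
    """Turn Solr's default flat [term, count, term, count, ...] list into pairs."""
    return list(zip(flat[0::2], flat[1::2]))
```

In the default JSON response the counts for this field sit under facet_counts -> facet_fields -> content as a flat list, which pair_facet_counts converts into (term, count) tuples.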

     Erik




On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:

Hello,

let's say that you have indexed hundreds of PDFs using the following curl 
command:

curl -Ss -X POST \
'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'

The PDF's contents are now stored in core0's "content" field.

I wonder how you create facets based on the field's contents, if you don't know 
in advance what it contains (unless you have compiled a list of 
frequently-occurring words in the PDFs, after reading them.)

Many thanks.

Philippe





--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
