On 23/03/2015 16:08, phi...@free.fr wrote:
Let's say that one pdf has the following contents:
Aren't you thinking of Named Entity Recognition? We've used Stanford NLP
for this in the past and it's quite good at People, Places and
Organisations out of the box (needs tuning for other classes of
entities). You can then add these entities as metadata to your document
objects and index them so you can facet on them appropriately.
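A minimal sketch of that last step, in Python. It assumes an NER tool has already produced (text, label) pairs for a document. The label names and the Solr field names (people_ss, places_ss, organisations_ss) are hypothetical placeholders, to be adapted to your NER model and schema:

```python
from collections import defaultdict

# Hypothetical mapping from NER labels to multi-valued Solr fields.
LABEL_TO_FIELD = {
    "PERSON": "people_ss",
    "LOCATION": "places_ss",
    "ORGANIZATION": "organisations_ss",
}


def entities_to_fields(entities):
    """Group (text, label) pairs into de-duplicated, sorted field values."""
    fields = defaultdict(set)
    for text, label in entities:
        field = LABEL_TO_FIELD.get(label)
        if field:  # skip labels we do not want to facet on
            fields[field].add(text.strip())
    return {name: sorted(values) for name, values in fields.items()}


def build_solr_doc(url, content, entities):
    """Return a dict ready to be sent to Solr as a JSON document."""
    doc = {"url": url, "content": content}
    doc.update(entities_to_fields(entities))
    return doc


# Entities as an NER tool might return them (made-up sample input).
doc = build_solr_doc(
    "/path/to/the/pdf.pdf",
    "... Gandhi ... Churchill ...",
    [("Gandhi", "PERSON"), ("Churchill", "PERSON"),
     ("Churchill", "PERSON"), ("Britain", "LOCATION")],
)
print(doc)
```

With fields like these indexed, faceting becomes a matter of facet.field=people_ss (and so on) rather than a hand-maintained list of facet.query entries.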
Cheers
Charlie
"[thousands of characters] blablabla Churchill blablabla [thousands of text
characters]"
... and another PDF contains:
"[thousands of characters] blablabla Gandhi [thousands of characters] Churchill
blablabla [thousands of text characters]"
As you can see, these two PDFs contain keywords that are potential candidates
for facets (e.g. Churchill, Gandhi, ...), but I have no
way of knowing that when adding facets to the solrconfig.xml file, unless I
read all the PDFs (which would take me years) and compile a list of
frequently occurring words and names.
The fallback solution is therefore to guess which keywords are likely to
appear in the PDFs; e.g.:
<str name="facet.query">Aircraft</str>
<str name="facet.query">Armistice</str>
<str name="facet.query">Austria</str>
<str name="facet.query">Bolshevik</str>
<str name="facet.query">Britain</str>
<str name="facet.query">British</str>
<str name="facet.query">Charlie Chaplin</str>
<str name="facet.query">Clemenceau</str>
<str name="facet.query">Einstein</str>
...
However, how can I be sure that these facets will be useful to the other 'core'
users? For instance, let's say that one
user is more interested in Gandhi than Einstein: the "Einstein" facet is therefore
useless to him and a "Gandhi" facet is missing from solrconfig.xml.
Is there a way to dynamically generate a list of facets based on words
contained in the content field?
Cheers,
Philippe
----- Original Message -----
From: "Erik Hatcher" <erik.hatc...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Monday, 23 March 2015 16:30:49
Subject: Re: Creating facets based on the content field
Philippe - can you provide a concrete example of what you mean by creating facets
on a field’s content? Or maybe rather, what’s missing from doing
&facet.field=content currently?
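For reference, a rough Python sketch of the request I mean. It only builds the URL. The host, port and core come from your curl command; the /select handler, rows=0 and the facet.limit and facet.mincount values are assumptions to adjust:

```python
from urllib.parse import urlencode


def facet_url(base, field, limit=20, mincount=2):
    """Build a Solr query URL that returns only facet counts for `field`."""
    params = [
        ("q", "*:*"),                  # match every document
        ("rows", 0),                   # no documents, just the facet counts
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),        # keep only the most frequent terms
        ("facet.mincount", mincount),  # drop terms below this count
    ]
    return base + "/select?" + urlencode(params)


url = facet_url("http://mysolr:8990/solr/core0", "content")
print(url)
```

Keep in mind that faceting on a tokenized text field returns the indexed terms as the analyzer produced them (lowercased, possibly stemmed), and that it can be memory-hungry on a large index.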
Erik
On Mar 23, 2015, at 10:48 AM, phi...@free.fr wrote:
Hello,
let's say that you have indexed hundreds of PDFs using the following curl
command:
curl -Ss -X POST \
  'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'
The PDF's contents are now stored in core0's "content" field.
I wonder how you create facets based on the field's contents, if you don't know
in advance what it contains (unless you have compiled a list of
frequently-occurring words in the PDFs, after reading them.)
Many thanks.
Philippe
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk