2011/8/29 Rode González <r...@libnova.es>:
> Hi Gora.
>
> The phrases are separated by dots or commas (I think it's the easiest way to 
> do this).

In that case, you should be able to use a tokeniser to split
the input into phrases; depending on which characters you want
to break phrases at, a stock tokeniser may do, or you may need
to write a custom one (see the sketch below). Please see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
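
For simple delimiters like dots and commas, the stock
PatternTokenizerFactory may be enough. A minimal sketch in
schema.xml (the field type name "phrases" is made up):

  <fieldType name="phrases" class="solr.TextField">
    <analyzer>
      <!-- split on dots and commas, swallowing any spaces after them -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[.,]\s*"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>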

It is also entirely possible to index the full text, and just do a
phrase search later. This is probably the easiest option, unless
you have a huge volume of text and indexing only the phrases
would make the index significantly smaller.
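
For example, a phrase search is just the quoted form of the query
(host, port, and field name here are only the usual defaults):

  http://localhost:8983/solr/select?q=text:%22separated+by+dots%22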

> The documents to index come from PDFs (scanned books, other PDF docs) or
> other binary docs that the /update/extract handler can process.

Text-based PDFs should be fine if you are using Tika with Solr.
However, PDFs of scanned books will typically be a set of
page images, and you would need to pre-process these with
some kind of OCR before Solr can index their text.
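
For reference, invoking the extracting handler usually looks
something like this (URL, literal.id, and file name are just
examples):

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
       -F "myfile=@book.pdf"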

Regards,
Gora
