2011/8/29 Rode González <r...@libnova.es>:
> Hi Gora.
>
> The phrases are separated by dots or commas (I think it's the easiest
> way to do this).
In that case, you should be able to use a tokeniser to split the input
into phrases, though you will probably need to write a custom
tokeniser, depending on which characters you want to break phrases at.
Please see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
(There is a rough sketch of such a tokeniser in the P.S. below.)

It is also entirely possible to index the full text, and just do a
phrase search later. This is probably the easiest option, unless you
have a huge volume of text and the volume of phrases to be indexed is
significantly lower.

> The documents to index come from pdf (books scanned, other pdf docs)
> or other binary docs that the /update/extract handler can manipulate.

Text PDFs should be fine if you are using Tika with Solr. However, the
PDFs of scanned books will typically be sets of images, and you would
need to pre-process these with some kind of OCR.

Regards,
Gora
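
P.S. A minimal, untested sketch of such a tokeniser, written against
the Lucene/Solr 3.x API. The class name PhraseTokenizer, and the
choice of '.' and ',' as separators, are just placeholders for
whatever you actually need:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.util.Version;

// Emits everything between '.' and ',' as a single token, so that
// each "phrase" is indexed as one term. Whitespace is kept inside
// the token rather than splitting on it.
public class PhraseTokenizer extends CharTokenizer {
    public PhraseTokenizer(Version matchVersion, Reader input) {
        super(matchVersion, input);
    }

    @Override
    protected boolean isTokenChar(int c) {
        // Only the phrase separators end a token.
        return c != '.' && c != ',';
    }
}

You would wrap this in a TokenizerFactory and reference it from your
field type in schema.xml. Two caveats: tokens will keep any leading or
trailing spaces (something like solr.TrimFilterFactory should help
there), and CharTokenizer truncates tokens at 255 characters, so very
long phrases would get split.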
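
For the second option, no custom code is needed at indexing time:
quoting the query string gives you a phrase query. A SolrJ example,
assuming the default server URL and a made-up phrase:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PhraseSearch {
    public static void main(String[] args) throws Exception {
        SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // The surrounding quotes make this a phrase query, i.e. the
        // terms must appear adjacent to each other, and in order.
        SolrQuery query = new SolrQuery("\"separated by dots or commas\"");
        QueryResponse response = server.query(query);
        System.out.println("Hits: " + response.getResults().getNumFound());
    }
}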
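
And for completeness, posting a binary document to /update/extract
from SolrJ looks something like the sketch below (again untested; the
file name and literal.id value are made up, and for scanned books you
would run OCR first and post the resulting text instead):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractPdf {
    public static void main(String[] args) throws Exception {
        SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Send the PDF to the ExtractingRequestHandler, which uses
        // Tika to pull the text out of the document.
        ContentStreamUpdateRequest req =
            new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("book.pdf"));
        req.setParam("literal.id", "book1"); // unique key for the doc
        req.setParam("commit", "true");
        server.request(req);
    }
}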