See the Solr Cell contrib:
http://wiki.apache.org/solr/ExtractingRequestHandler
Note that it's 1.4-dev only. If you want it for 1.3, you'll have to
use Tika on the client side.
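
For the Solr Cell route, here is a rough Python sketch of what the request
could look like. It assumes the extracting handler is mapped to
/update/extract in your solrconfig.xml, that your unique key field is
called "id", and that the handler accepts the raw HTML as the POST body;
check all three against your own setup. The helper name
build_extract_request and the localhost URL are made up for illustration.
The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_url, doc_id, html_bytes):
    """Build (but do not send) a POST to the extracting handler.

    Assumptions: the handler lives at <solr_url>/update/extract and the
    schema's unique key field is "id" (set through literal.id).
    """
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        url=f"{solr_url}/update/extract?{params}",
        data=html_bytes,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


# Hypothetical page content, encoded as UTF-8 before it goes on the wire.
page = "<html><body>नमस्ते</body></html>".encode("utf-8")
req = build_extract_request("http://localhost:8983/solr", "doc1", page)
# urllib.request.urlopen(req) would actually send it to a running Solr.
```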
Solr does support Unicode indexing.
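
If you go the client-side route on 1.3, the idea is to pull the text out
of the HTML yourself and wrap it in Solr's <add><doc> XML, encoded as
UTF-8. The sketch below uses Python's standard library HTML parser as a
stand-in for Tika, so it is much cruder than what Tika would give you.
The field names "id" and "text" are assumptions; use whatever your
schema.xml defines. TextExtractor and html_to_solr_xml are names I made
up for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_solr_xml(doc_id, html):
    """Return a UTF-8 encoded <add><doc>...</doc></add> message as bytes."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" and "text" are assumed field names, not taken from any real schema.
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="text").text = " ".join(parser.parts)
    return ET.tostring(add, encoding="utf-8")


sample = ("<html><head><title>शीर्षक</title><style>p{}</style></head>"
          "<body><p>नमस्ते दुनिया</p></body></html>")
payload = html_to_solr_xml("page-1", sample)
```

The resulting bytes would then be POSTed to the update handler with a
Content-Type that declares UTF-8 (for example "text/xml; charset=utf-8"),
followed by a commit. Make sure the HTML is decoded with the page's real
charset before parsing, or the indexed text will be garbage no matter
what Solr does.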
On Apr 24, 2009, at 2:22 AM, ahmed baseet wrote:
Hi All,
I'm trying to index some regional (non-English) HTML pages with Solr. I
thought of indexing the corresponding Unicode text for each page, since
Solr supports Unicode indexing, right?
But I'm not able to extract XML from the HTML page, and posting to Solr
requires XML. Can anyone suggest a good method for extracting XML from
HTML, or just let me know how to index non-English HTML pages with Solr
so that I can search them with Unicode queries (for the corresponding
regional query)? Thanks in advance.
--Ahmed.
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search