Grant, thanks for your quick response. In the meantime I did a bit of googling and found that there are Java Swing HTML parsers that can extract the plain text from an HTML page. I tried running the sample examples against non-English pages and they work fine. So now I'm thinking of putting the whole extracted text (the Unicode text, obviously) into a single field, say "PageContent", adding the basic XML tags like <add>, <doc>, etc. to form the update XML, and then pushing that off to Solr for indexing. Since the whole page goes into one field, I don't know whether Solr will handle the size, because pages can be quite large sometimes. What is the maximum field length supported by Solr? [Is that the maxFieldLength setting in solrconfig.xml, 10000 tokens by default? I think so.] And will this approach still make sense at search time?
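To make the idea concrete, here is a rough sketch of what I have in mind. It's only a sketch: it assumes Solr's update handler lives at http://localhost:8983/solr/update and that "id" and "PageContent" fields are already defined in schema.xml.

import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlToSolr {

    // Collects the text content emitted by the Swing HTML parser.
    static class TextCollector extends HTMLEditorKit.ParserCallback {
        final StringBuilder text = new StringBuilder();
        public void handleText(char[] data, int pos) {
            text.append(data).append(' ');
        }
    }

    // Minimal XML escaping so the page text can't break the update message.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        // 1. Fetch the page and extract plain Unicode text.
        Reader in = new InputStreamReader(new URL(args[0]).openStream(), "UTF-8");
        TextCollector collector = new TextCollector();
        new ParserDelegator().parse(in, collector, true); // true = ignore charset attr
        in.close();

        // 2. Wrap the text in Solr's <add><doc> update XML.
        //    Using the page URL as the unique "id" is just my assumption here.
        String xml = "<add><doc>"
                + "<field name=\"id\">" + escape(args[0]) + "</field>"
                + "<field name=\"PageContent\">" + escape(collector.text.toString()) + "</field>"
                + "</doc></add>";

        // 3. POST it to the update handler as UTF-8.
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        System.out.println("Solr responded: " + conn.getResponseCode());
    }
}

I'd then POST a separate <commit/> to the same URL to make the document visible to searches.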
I'd be grateful for advice from the Solr users on this list. Thanks.

--Ahmed.

On Fri, Apr 24, 2009 at 4:32 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> See the Solr Cell contrib:
> http://wiki.apache.org/solr/ExtractingRequestHandler. Note, it's 1.4-dev
> only. If you want it for 1.3, you'll have to use Tika on the client side.
>
> Solr does support Unicode indexing.
>
> On Apr 24, 2009, at 2:22 AM, ahmed baseet wrote:
>
>> Hi All,
>> I'm trying to index some regional/non-English HTML pages with Solr. I
>> thought of indexing the corresponding Unicode text for each page, since
>> Solr supports Unicode indexing, right?
>> But I'm not able to extract XML from the HTML pages, and posting to Solr
>> requires XML. Can anyone tell me a good method of extracting XML from
>> HTML, or just let me know how to index non-English HTML pages with Solr
>> so that I can search them with Unicode queries (in the corresponding
>> regional language). Thanks in advance.
>>
>> --Ahmed.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
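P.S. For the 1.3 route Grant mentions, I'm guessing the client-side Tika extraction would look roughly like the following. Again just a sketch, and it assumes the tika-core and tika-parsers jars are on the classpath.

import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtract {
    public static void main(String[] args) throws Exception {
        InputStream in = new URL(args[0]).openStream();
        try {
            // -1 disables BodyContentHandler's default write limit,
            // since the pages can be quite large.
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            // AutoDetectParser works out the content type and charset itself.
            new AutoDetectParser().parse(in, handler, metadata);
            // Plain Unicode text, ready for the "PageContent" field.
            System.out.println(handler.toString());
        } finally {
            in.close();
        }
    }
}

The text from handler.toString() could then go straight into the update XML in my sketch above.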