Grant, thanks for your quick response.

In the meantime I did a bit of googling and found that there are Java Swing
HTML parsers that can extract the plain text from an HTML page. I tried
running some sample examples with non-English pages and found that it works
fine. My plan is to put the whole extracted text (the Unicode text,
obviously) into a single field, say "PageContent", wrap it with the basic
XML tags like <add>, <doc>, etc. to form the update XML, and then push that
off to Solr for indexing, roughly as in the sketch below. But since the
whole page goes into a single field, I don't know whether Solr will handle
the size, because pages can be quite large sometimes. What is the maximum
field length supported by Solr? Is it 10000 by default (I believe that's
the maxFieldLength setting in solrconfig.xml)? And will searching against
such a large single field still make sense?
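To make the plan concrete, here is roughly what I have in mind (just a
rough sketch, not tested against my real pages; the field name
"PageContent", the "id" field, and the Solr URL are placeholders of my own,
and I'm assuming the pages are UTF-8):

import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

public class HtmlToSolr {

    // Pull the plain (Unicode) text out of an HTML page with Swing's parser.
    static String extractText(Reader html) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // Keep the parser from bailing out when the page declares its own charset.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(html, doc, 0);
        return doc.getText(0, doc.getLength());
    }

    // Minimal XML escaping so the extracted text is safe inside <field>.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        String pageUrl = args[0];
        // Assuming UTF-8 here; a real crawler would sniff the page's charset.
        Reader html = new InputStreamReader(new URL(pageUrl).openStream(), "UTF-8");
        String content = extractText(html);
        html.close();

        String xml = "<add><doc>"
                + "<field name=\"id\">" + escape(pageUrl) + "</field>"
                + "<field name=\"PageContent\">" + escape(content) + "</field>"
                + "</doc></add>";

        // Post the update XML to Solr's update handler as UTF-8.
        URL solr = new URL("http://localhost:8983/solr/update");
        HttpURLConnection conn = (HttpURLConnection) solr.openConnection();
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        Writer out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        out.write(xml);
        out.close();
        System.out.println("Solr responded: " + conn.getResponseCode());
    }
}

If I understand correctly, I'd still need to send a <commit/> to the update
handler afterwards before the document actually becomes searchable.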

Any advice from the Solr users here would be much appreciated. Thanks.

--Ahmed.

On Fri, Apr 24, 2009 at 4:32 PM, Grant Ingersoll <gsing...@apache.org> wrote:

> See the Solr Cell contrib:
> http://wiki.apache.org/solr/ExtractingRequestHandler.  Note, it's 1.4-dev
> only.  If you want it for 1.3, you'll have to use Tika on the client side.
>
> Solr does support Unicode indexing.
>
>
> On Apr 24, 2009, at 2:22 AM, ahmed baseet wrote:
>
>> Hi All,
>> I'm trying to index some regional/non-English HTML pages with Solr. I
>> thought of indexing the corresponding Unicode text for each page, since
>> Solr supports Unicode indexing, right?
>> But I'm not able to extract XML from the HTML pages, and posting to Solr
>> requires XML. Can anyone suggest a good method of extracting XML from
>> HTML, or just let me know how to index non-English HTML pages with Solr
>> in a way that lets me search them with Unicode queries (in the
>> corresponding regional language)? Thanks in advance.
>>
>> --Ahmed.
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
