Re: WELCOME to solr-user@lucene.apache.org

khalid y Sat, 05 Dec 2009 14:44:43 -0800

Thanks a lot for your response!

For the first solution:

I need to index all the content of my websites, and I just want Tika to ignore
<meta name="id"> because I already have an id.
I'll try it on Monday and tell you if it works.
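
A minimal sketch of the two-step flow quoted below (extract only, then post
your own fields), in Python. The base URL, the /update/extract and /update
paths, and the "text" field name are assumptions about the setup, not
something confirmed in this thread. Parsing the extractOnly response is left
out.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from xml.sax.saxutils import escape

# Assumption: Solr runs locally and uses these handler paths.
SOLR_BASE = "http://localhost:8983/solr"  # hypothetical URL


def build_extract_url(base=SOLR_BASE):
    """Step 1: URL asking the extracting handler to parse without indexing."""
    return base + "/update/extract?" + urlencode({"extractOnly": "true"})


def build_add_xml(doc_id, text, text_field="text"):
    """Step 2: <add> message holding only the fields we choose ourselves.

    The Tika metadata (including <meta name="id">) never reaches the index,
    because we only send our own id and the extracted text.
    "text" is a hypothetical field name; use one from your schema.
    """
    return (
        '<add><doc>'
        '<field name="id">{}</field>'
        '<field name="{}">{}</field>'
        '</doc></add>'
    ).format(escape(str(doc_id)), text_field, escape(text))


def post(url, body, content_type):
    """Send a POST request and return the raw response bytes."""
    req = Request(url, data=body, headers={"Content-Type": content_type})
    with urlopen(req) as resp:
        return resp.read()
```

Usage would look roughly like this: post the HTML file to
build_extract_url() with content type text/html, pull the text out of the
response, then post build_add_xml(10, text) to SOLR_BASE + "/update" with
content type text/xml, followed by a commit.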

The second solution:
Are you sure Tika uses the HTML tokenizer? I'll check.

2009/12/5 Raghuveer Kancherla <raghuveer.kanche...@aplopio.com>

> 2 ways I can think of ...
>
>   - ExtractingRequestHandler (this is what I am guessing you are using now)
>
> Set extractOnly=true when making a request to the ExtractingRequestHandler
> and get the parsed content back. Then make a POST request to the update
> request handler with whatever fields and field values you want.
>
>   - Use HTMLStripWhitespaceTokenizerFactory. This article may help explain
>   what I mean:
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripWhitespaceTokenizerFactory
>
>
>
> - Raghu
>
>
>
> On Sat, Dec 5, 2009 at 3:44 AM, khalid y <kern...@gmail.com> wrote:
>
> > Hi,
> >
> > I have a problem with Solr. I'm indexing some HTML content and Solr crashes
> > because my id field is multivalued.
> > I found that Tika reads the HTML and extracts metadata like <meta name="id"
> > content="12"> from my HTML files, but my documents already have an id set by
> > literal.id=10.
> >
> > I tried to map the id from Tika with fmap.id=ignored_, but it also ignores my
> > literal.id.
> >
> > I'm using Solr 1.4 and Tika 0.5.
> >
> > Can someone explain to me how I can ignore the Tika id metadata?
> >
> > Thanks
> >
>
