Re: How to Index Pure Text into Separate Fields?

Erick Erickson Wed, 29 Sep 2010 14:59:56 -0700

Can you provide a few more details? You mention xpath, which leads me
to believe that you are using DIH, is that true? How are you getting
your documents to index? Parts of a filesystem?

I ask because there are many ways to approach this. If you're using DIH
against a filesystem, you could use two fileDataSources, one that works
only on files with a particular extension (xml, say) and another that
processes .txt files.
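
If you end up feeding documents from your own code rather than DIH, the
same split-by-extension idea might look roughly like the Python sketch
below. Everything in it is an assumption made for illustration: the
handler names, the field names, and above all the rule that the first
line of a .txt file is the title and the rest is the body. Your text
files may need a completely different rule, and the xml handler is only
a placeholder for whatever your xpath expressions extract.

```python
from pathlib import Path


def parse_xml_file(path):
    # Placeholder (hypothetical): a real handler would apply your
    # xpath expressions here and return one field per expression.
    return {"id": str(path), "source_type": "xml"}


def parse_text_file(path):
    # Assumed rule, for illustration only: first line is the title,
    # everything after it is the body.
    text = Path(path).read_text(encoding="utf-8")
    title, _, body = text.partition("\n")
    return {
        "id": str(path),
        "source_type": "txt",
        "title": title.strip(),
        "body": body.strip(),
    }


# One handler per file extension, mirroring the "two sources" idea.
HANDLERS = {".xml": parse_xml_file, ".txt": parse_text_file}


def build_documents(root):
    """Walk root and build one field dictionary per supported file."""
    docs = []
    for path in sorted(Path(root).rglob("*")):
        handler = HANDLERS.get(path.suffix.lower())
        if path.is_file() and handler is not None:
            docs.append(handler(path))
    return docs
```

Each dictionary would then be sent to the index as one document, with
its keys mapped to fields in your schema. Files with any other extension
are simply skipped.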

That said, if you're trying to index "just the text" of a Word document,
you have to parse it quite differently than a plain text file. Take a
look at Tika for that.

All of which may not help you at all, because I'm guessing...

So I think a more complete problem statement would help us help you.

Best
Erick

On Wed, Sep 29, 2010 at 3:56 PM, Savannah Beckett <
savannah_becket...@yahoo.com> wrote:

> Hi,
>   I am using xpath to index different parts of the html pages into
> different fields.  Now, I have some pure text documents that have no
> html.  So I can't use xpath.  How do I index this pure text into
> different fields of the index?  How do I make nutch/solr understand
> that these different parts belong to different fields?  Maybe I can use
> existing content in the fields in my index?
> Thanks.
>
>
>
