Re: Parsing and indexing parts of the input file paths

Andrew Musselman Wed, 22 Jul 2015 10:01:26 -0700

Thanks; I don't know how the file path is getting into the id field.  Must
be some Tika default?


On Wed, Jul 22, 2015 at 9:52 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> The id field is absolutely NOT the thing you need to try to parse.
> Assuming you're stuffing the file path into that field, use a
> copyField to copy the filepath into another text (not string)
> field and do your work there.
>
> As far as whether the filepath is in some other field, well, you have
> to put it there, either through Tika configurations or explicitly through
> your crawler.
>
> Best,
> Erick
>
> On Wed, Jul 22, 2015 at 9:47 AM, Andrew Musselman
> <andrew.mussel...@gmail.com> wrote:
> > Trying to figure out how to parse the file path, which when I run the
> > "cloud" instance becomes the "id" for each PDF document.
> >
> > Is that "id" field the thing to parse with PatternReplaceFilterFactory in
> > the config?  If not, is there a "file-path" field I can parse?
> >
> > On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> I don't understand your question. If you're talking about two different
> >> fields, use copyField.
> >>
> >> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
> >> <andrew.mussel...@gmail.com> wrote:
> >> > Fwding to user..
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Andrew Musselman <andrew.mussel...@gmail.com>
> >> > Date: Wed, Jul 22, 2015 at 8:54 AM
> >> > Subject: Re: Parsing and indexing parts of the input file paths
> >> > To: d...@lucene.apache.org
> >> >
> >> >
> >> > Thanks, and tell it to index the "id" field, which eventually contains
> >> > the file path?
> >> >
> >> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson
> >> > <erickerick...@gmail.com> wrote:
> >> >
> >> >> PatternReplaceFilterFactory would be just a configuration solution:
> >> >> construct a fieldType in schema.xml and you're done. It would require
> >> >> re-indexing, of course.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
> >> >> <andrew.mussel...@gmail.com> wrote:
> >> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known,
> >> >> > and can be put into config, let's assume.  Would this be config-only
> >> >> > or would it require some code, and could you point to some classes I
> >> >> > can start with if I need to write code, and some up-to-date docs?
> >> >> >
> >> >> > Same for the update processor, is there an example I could read?
> >> >> >
> >> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher
> >> >> > <erik.hatc...@gmail.com> wrote:
> >> >> >>
> >> >> >> If this is only for search, then an analysis chain could be crafted,
> >> >> >> likely with the pattern regex filter in the mix, to pull out pieces
> >> >> >> of the path.  How will you know the prefix of the file though?
> >> >> >>
> >> >> >> There’s also the ability to do this sort of thing in an update
> >> >> >> processor, most easily using the script update processor, using a bit
> >> >> >> of JavaScript to pull out the piece(s) you want to index (and even
> >> >> >> store at this point).
> >> >> >> —
> >> >> >> Erik Hatcher, Senior Solutions Architect
> >> >> >> http://www.lucidworks.com
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman
> >> >> >> <andrew.mussel...@gmail.com> wrote:
> >> >> >>
> >> >> >> Dear user and dev lists,
> >> >> >>
> >> >> >> We are loading files from a directory and would like to index a portion
> >> >> >> of each file path as a field as well as the text inside the file.
> >> >> >>
> >> >> >> E.g., on HDFS we have this file path:
> >> >> >>
> >> >> >> /user/andrew/1234/1234/file.pdf
> >> >> >>
> >> >> >> And we would like the "1234" token parsed from the file path and indexed
> >> >> >> as an additional field that can be searched on.
> >> >> >>
> >> >> >> From my initial searches I can't see how to do this easily, so would I
> >> >> >> need to write some custom code, or a plugin?
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >>
>
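
For reference, the extraction discussed in this thread comes down to a single
regular expression, whether it ends up as the pattern of a
PatternReplaceFilterFactory or inside a script update processor. Below is a
minimal Python sketch of that regex logic only; it is not Solr configuration.
It assumes the known prefix is the "/user/andrew/" from the example path, and
the names PREFIX, SEGMENT_AFTER_PREFIX and extract_token are hypothetical,
chosen just for illustration.

```python
import re

# Assumption: the known prefix from the example path in the thread.
PREFIX = "/user/andrew/"

# Capture the first path segment (directory name) that follows the prefix.
SEGMENT_AFTER_PREFIX = re.compile(r"^" + re.escape(PREFIX) + r"([^/]+)/")


def extract_token(file_path):
    """Return the first directory name after PREFIX, or None if there is none."""
    match = SEGMENT_AFTER_PREFIX.match(file_path)
    return match.group(1) if match else None


print(extract_token("/user/andrew/1234/1234/file.pdf"))  # prints 1234
```

Paths that do not start with the prefix, or that have no directory between
the prefix and the file name, yield None, so a document with such a path would
simply get no value for the extra field. The same pattern would need to be
checked against Java regex syntax before being placed in a Solr fieldType.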
