The id field is absolutely NOT the thing you should try to parse.
Assuming you're stuffing the file path into that field, use a
copyField to copy the file path into another text (not string)
field and do your work there. String fields are not analyzed, so a
pattern filter would never run on them.

As for whether the file path is in some other field: you have
to put it there yourself, either through Tika configuration or explicitly
in your crawler.
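
A minimal sketch of what that could look like in schema.xml, assuming the
id holds a bare path such as /user/andrew/1234/1234/file.pdf. The field and
type name "path_token" and the regex are made up for illustration; adjust
them to your setup. If your ids carry a scheme or host in front of the path,
the pattern would need to change.

```xml
<!-- Hypothetical field type: keeps the whole path as one token, then
     rewrites that token to the first directory after the known prefix. -->
<fieldType name="path_token" class="solr.TextField">
  <analyzer>
    <!-- KeywordTokenizerFactory emits the entire input as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Assumed prefix /user/andrew/ taken from the example path in this thread -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^/user/andrew/([^/]+)/.*$"
            replacement="$1"
            replace="all"/>
  </analyzer>
</fieldType>

<!-- Hypothetical destination field; only the indexed term is rewritten -->
<field name="path_token" type="path_token" indexed="true" stored="false"/>

<!-- Copy the raw id (the file path) into the analyzed field at index time -->
<copyField source="id" dest="path_token"/>
```

With something like this in place, a query such as path_token:1234 should
match the example document. Note that analysis only changes the indexed
terms; if you need the extracted piece as a stored value, an update
processor is the better fit.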

Best,
Erick

On Wed, Jul 22, 2015 at 9:47 AM, Andrew Musselman
<andrew.mussel...@gmail.com> wrote:
> Trying to figure out how to parse the file path, which when I run the
> "cloud" instance becomes the "id" for each PDF document.
>
> Is that "id" field the thing to parse with PatternReplaceFilterFactory in
> the config?  If not, is there a "file-path" field I can parse?
>
> On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Don't understand your question. If you're talking two different
>> fields, use copyField.
>>
>> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
>> <andrew.mussel...@gmail.com> wrote:
>> > Fwding to user..
>> >
>> > ---------- Forwarded message ----------
>> > From: Andrew Musselman <andrew.mussel...@gmail.com>
>> > Date: Wed, Jul 22, 2015 at 8:54 AM
>> > Subject: Re: Parsing and indexing parts of the input file paths
>> > To: d...@lucene.apache.org
>> >
>> >
>> > Thanks, and tell it to index the "id" field, which eventually contains the
>> > file path?
>> >
>> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> PatternReplaceFilterFactory would be a configuration-only solution:
>> >> construct a fieldType in schema.xml and you're done. It would require
>> >> re-indexing, of course.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
>> >> <andrew.mussel...@gmail.com> wrote:
>> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known, and
>> >> > can be put into config, let's assume.  Would this be config-only or would it
>> >> > require some code, and could you point to some classes I can start with if I
>> >> > need to write code, and some up-to-date docs?
>> >> >
>> >> > Same for the update processor, is there an example I could read?
>> >> >
>> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <erik.hatc...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> If this is only for search, then an analysis chain could be crafted,
>> >> >> likely with the pattern regex filter in the mix, to pull out pieces of the
>> >> >> path.  How will you know the prefix of the file though?
>> >> >>
>> >> >> There’s also the ability to do this sort of thing in an update processor,
>> >> >> most easily using the script update processor, using a bit of JavaScript to
>> >> >> pull out the piece(s) you want to index (and even store at this point).
>> >> >>
>> >> >> —
>> >> >> Erik Hatcher, Senior Solutions Architect
>> >> >> http://www.lucidworks.com
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <andrew.mussel...@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >> Dear user and dev lists,
>> >> >>
>> >> >> We are loading files from a directory and would like to index a portion of
>> >> >> each file path as a field as well as the text inside the file.
>> >> >>
>> >> >> E.g., on HDFS we have this file path:
>> >> >>
>> >> >> /user/andrew/1234/1234/file.pdf
>> >> >>
>> >> >> And we would like the "1234" token parsed from the file path and indexed
>> >> >> as an additional field that can be searched on.
>> >> >>
>> >> >> From my initial searches I can't see how to do this easily, so would I
>> >> >> need to write some custom code, or a plugin?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >>
>> >> >
>> >>