Re: Parsing and indexing parts of the input file paths

Erick Erickson Wed, 22 Jul 2015 09:53:14 -0700

The id field is absolutely NOT the thing you need to try to parse.
Assuming you're stuffing the file path into that field, use a
copyField to copy the file path into another text (not string)
field and do your work there.
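
A rough sketch of what that could look like in schema.xml. The names
"path_token" and "path_part" are made up for illustration, and the pattern
assumes the id holds a plain path such as /user/andrew/1234/1234/file.pdf
with the known "/user/andrew/" prefix mentioned in this thread. Adjust it
if your ids carry a scheme or host in front of the path.

```xml
<!-- Hypothetical field type: keep the whole path as one token, then
     replace it with the first directory after the known prefix. -->
<fieldType name="path_token" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Assumed prefix /user/andrew/; $1 is the captured path segment. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^/user/andrew/([^/]+)/.*$"
            replacement="$1"
            replace="all"/>
  </analyzer>
</fieldType>

<!-- Hypothetical destination field; the id field itself stays untouched. -->
<field name="path_part" type="path_token" indexed="true" stored="false"/>
<copyField source="id" dest="path_part"/>
```

With this in place a query like path_part:1234 should match the example
document. A path that does not match the pattern is indexed unchanged as a
single token, and existing documents need to be re-indexed before the new
field is populated.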


As far as whether the filepath is in some other field, well, you have
to put it there, either through Tika configurations or explicitly through
your crawler.

Best,
Erick

On Wed, Jul 22, 2015 at 9:47 AM, Andrew Musselman
<[email protected]> wrote:
> Trying to figure out how to parse the file path, which when I run the
> "cloud" instance becomes the "id" for each PDF document.
>
> Is that "id" field the thing to parse with PatternReplaceFilterFactory in
> the config?  If not, is there a "file-path" field I can parse?
>
> On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson <[email protected]>
> wrote:
>
>> I don't understand your question. If you're talking about two different
>> fields, use copyField.
>>
>> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
>> <[email protected]> wrote:
>> > Fwding to user..
>> >
>> > ---------- Forwarded message ----------
>> > From: Andrew Musselman <[email protected]>
>> > Date: Wed, Jul 22, 2015 at 8:54 AM
>> > Subject: Re: Parsing and indexing parts of the input file paths
>> > To: [email protected]
>> >
>> >
>> > Thanks, and tell it to index the "id" field, which eventually contains the
>> > file path?
>> >
>> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <[email protected]>
>> > wrote:
>> >
>> >> PatternReplaceFilterFactory would be just a configuration solution:
>> >> construct a fieldType in schema.xml and you're done. It would require
>> >> re-indexing, of course.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
>> >> <[email protected]> wrote:
>> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known, and
>> >> > can be put into config, let's assume.  Would this be config-only or would it
>> >> > require some code, and could you point to some classes I can start with if I
>> >> > need to write code, and some up-to-date docs?
>> >> >
>> >> > Same for the update processor, is there an example I could read?
>> >> >
>> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> If this is only for search, then an analysis chain could be crafted,
>> >> >> likely with the pattern regex filter in the mix, to pull out pieces of the
>> >> >> path.  How will you know the prefix of the file though?
>> >> >>
>> >> >> There’s also the ability to do this sort of thing in an update processor,
>> >> >> most easily using the script update processor, using a bit of JavaScript to
>> >> >> pull out the piece(s) you want to index (and even store at this point).
>> >> >>
>> >> >> —
>> >> >> Erik Hatcher, Senior Solutions Architect
>> >> >> http://www.lucidworks.com
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman <[email protected]>
>> >> >> wrote:
>> >> >>
>> >> >> Dear user and dev lists,
>> >> >>
>> >> >> We are loading files from a directory and would like to index a portion of
>> >> >> each file path as a field as well as the text inside the file.
>> >> >>
>> >> >> E.g., on HDFS we have this file path:
>> >> >>
>> >> >> /user/andrew/1234/1234/file.pdf
>> >> >>
>> >> >> And we would like the "1234" token parsed from the file path and indexed
>> >> >> as an additional field that can be searched on.
>> >> >>
>> >> >> From my initial searches I can't see how to do this easily, so would I
>> >> >> need to write some custom code, or a plugin?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >>
>> >> >
>> >>
