Thanks; I don't know how the file path is getting into the id field. Must be some Tika default?
On Wed, Jul 22, 2015 at 9:52 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> the id field is absolutely NOT the thing you need to try to parse.
> Assuming you're stuffing the file path into that field, use a
> copyField to copy the filepath into another text (not string)
> field and do your work there.
>
> As far as whether the filepath is in some other field, well, you have
> to put it there, either through Tika configurations or explicitly through
> your crawler.
>
> Best,
> Erick
>
> On Wed, Jul 22, 2015 at 9:47 AM, Andrew Musselman
> <andrew.mussel...@gmail.com> wrote:
> > Trying to figure out how to parse the file path, which when I run the
> > "cloud" instance becomes the "id" for each PDF document.
> >
> > Is that "id" field the thing to parse with PatternReplaceFilterFactory in
> > the config? If not, is there a "file-path" field I can parse?
> >
> > On Wed, Jul 22, 2015 at 9:42 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Don't understand your question. If you're talking two different
> >> fields, use copyField.
> >>
> >> On Wed, Jul 22, 2015 at 8:55 AM, Andrew Musselman
> >> <andrew.mussel...@gmail.com> wrote:
> >> > Fwding to user..
> >> >
> >> > ---------- Forwarded message ----------
> >> > From: Andrew Musselman <andrew.mussel...@gmail.com>
> >> > Date: Wed, Jul 22, 2015 at 8:54 AM
> >> > Subject: Re: Parsing and indexing parts of the input file paths
> >> > To: d...@lucene.apache.org
> >> >
> >> >
> >> > Thanks, and tell it to index the "id" field, which eventually contains
> >> > the file path?
> >> >
> >> > On Wed, Jul 22, 2015 at 8:48 AM, Erick Erickson <erickerick...@gmail.com>
> >> > wrote:
> >> >
> >> >> PatternReplaceFilterFactory would be just a configuration solution,
> >> >> construct a fieldType in schema.xml and you're done. It would require
> >> >> re-indexing of course.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Tue, Jul 21, 2015 at 5:59 PM, Andrew Musselman
> >> >> <andrew.mussel...@gmail.com> wrote:
> >> >> > Erik, thanks; the prefix starting with "/user/andrew/" will be known,
> >> >> > and can be put into config, let's assume. Would this be config-only or
> >> >> > would it require some code, and could you point to some classes I can
> >> >> > start with if I need to write code, and some up-to-date docs?
> >> >> >
> >> >> > Same for the update processor, is there an example I could read?
> >> >> >
> >> >> > On Tue, Jul 21, 2015 at 11:19 AM, Erik Hatcher <erik.hatc...@gmail.com>
> >> >> > wrote:
> >> >> >>
> >> >> >> If this is only for search, then an analysis chain could be crafted,
> >> >> >> likely with the pattern regex filter in the mix, to pull out pieces
> >> >> >> of the path. How will you know the prefix of the file though?
> >> >> >>
> >> >> >> There’s also the ability to do this sort of thing in an update
> >> >> >> processor, most easily using the script update processor, using a bit
> >> >> >> of JavaScript to pull out the piece(s) you want to index (and even
> >> >> >> store at this point).
> >> >> >>
> >> >> >> --
> >> >> >> Erik Hatcher, Senior Solutions Architect
> >> >> >> http://www.lucidworks.com
> >> >> >>
> >> >> >> On Jul 21, 2015, at 1:31 PM, Andrew Musselman
> >> >> >> <andrew.mussel...@gmail.com> wrote:
> >> >> >>
> >> >> >> Dear user and dev lists,
> >> >> >>
> >> >> >> We are loading files from a directory and would like to index a
> >> >> >> portion of each file path as a field as well as the text inside the
> >> >> >> file.
> >> >> >>
> >> >> >> E.g., on HDFS we have this file path:
> >> >> >>
> >> >> >> /user/andrew/1234/1234/file.pdf
> >> >> >>
> >> >> >> And we would like the "1234" token parsed from the file path and
> >> >> >> indexed as an additional field that can be searched on.
> >> >> >>
> >> >> >> From my initial searches I can't see how to do this easily, so would
> >> >> >> I need to write some custom code, or a plugin?
> >> >> >>
> >> >> >> Thanks!
> >> >> >>
> >> >> >
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >> >>
> >> >
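
For reference, the script update processor route mentioned above might look roughly like the sketch below. It is only an illustration under several assumptions: the file path arrives in the "id" field, the known prefix is "/user/andrew/", the script is loaded through Solr's StatelessScriptUpdateProcessorFactory in solrconfig.xml, and the target field name "path_token_s" is hypothetical (it is not taken from anyone's actual schema in this thread). Only the first directory after the prefix is extracted.

```javascript
// Assumed values for illustration; adjust to the real setup.
var PATH_PREFIX = "/user/andrew/";
var TARGET_FIELD = "path_token_s"; // hypothetical field name

// Return the first directory name after the prefix, or null when the
// path does not start with the prefix or has no directory after it.
function extractToken(path, prefix) {
  if (typeof path !== "string" || path.indexOf(prefix) !== 0) {
    return null;
  }
  var rest = path.substring(prefix.length);
  var slash = rest.indexOf("/");
  if (slash <= 0) {
    return null; // file sits directly under the prefix, or empty segment
  }
  return rest.substring(0, slash);
}

// Called by Solr for each added document; cmd.solrDoc is the
// SolrInputDocument being indexed.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var id = doc.getFieldValue("id");
  // String(...) turns a Java string into a JavaScript string.
  var token = extractToken(id == null ? null : String(id), PATH_PREFIX);
  if (token !== null) {
    doc.setField(TARGET_FIELD, token);
  }
}

// The remaining hooks are intentionally no-ops.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

With the example path /user/andrew/1234/1234/file.pdf, extractToken yields "1234". Unlike the analysis-chain approach, a field set this way can be stored as well as indexed, which is the trade-off Erik Hatcher points out.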