Re: PathHierarchyTokenizerFactory single level match

lstusr 5u93n4 Fri, 23 Nov 2018 11:45:01 -0800

Lots of discussion about XY problems on this list lately..... Maybe I'm a
bit guilty. :D


I used the example from the docs to be clear, but our real use case is
indexing file metadata on a large filesystem. With a few fields like owner,
group, mode, lastmodified, filesize, type, and path, the path field is the
only non-numeric, non-date field that can exceed a couple of characters. So
we want to be able to say: give me all of the directories directly under a
particular parent, without anything nested below them.

Using the level_count is a great idea. I think this is the way we'll go
here.
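
For the archives, here is a minimal sketch of how the level_count approach
might look on the client side, in Python. The helper names are made up for
illustration, and "path" as the Solr field name is an assumption based on the
field list above. level_count is the extra field Erick suggested, computed at
index time. The direct children of a parent with n levels have a level_count
of n + 1, so that is the value the filter asks for.

```python
def level_count(path, delimiter="/"):
    """Count the non-empty components of a path (value for the level_count field)."""
    return sum(1 for part in path.split(delimiter) if part)


def escape_phrase(value):
    """Escape backslashes and double quotes so the value can sit inside a quoted phrase."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def direct_children_params(parent, field="path", delimiter="/"):
    """Build q/fq parameters selecting only the direct children of `parent`.

    Assumes `field` uses the descendent_path field type, so a query for the
    parent matches the parent and everything below it; the level_count filter
    then narrows the result to exactly one level down.
    """
    # Drop a trailing delimiter. Keeping the bare delimiter for the root is an
    # assumption; whether it matches depends on the tokens actually indexed.
    parent = parent.rstrip(delimiter) or delimiter
    return [
        ("q", "*:*"),
        ("fq", '{}:"{}"'.format(field, escape_phrase(parent))),
        ("fq", "level_count:{}".format(level_count(parent, delimiter) + 1)),
    ]
```

With the docs example, direct_children_params("Books/NonFic", field="category")
yields the filters category:"Books/NonFic" and level_count:3. The pairs can be
passed to urllib.parse.urlencode to build the query string. The path is quoted
because "/" has a special meaning in the standard query parser.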

Thanks for your help!

Kyle

On Fri, 23 Nov 2018 at 14:18, Erick Erickson <erickerick...@gmail.com>
wrote:

> A couple of things.
>
> bq. the field is by far the largest contributor to the index size already,
>
> That's a rather odd statement. It implies that there's very little
> else in your documents. If you have any descriptions etc. I'd think
> that the category info wouldn't be all that huge in comparison. How
> are you measuring?
>
> One alternative would be to index an extra field with just the
> _number_ of levels, so Books/NonFic/Science would have a second field
> "level_count" set to 3. Now your secondary search becomes
> "q=whatever&fq=category:Books/NonFic&fq=level_count:3".
>
> Best,
> Erick
> On Fri, Nov 23, 2018 at 6:24 AM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> >
> > Hi,
> >
> > I have a schema that has a descendent_path field as configured in the
> > PathHierarchyTokenizerFactory docs:
> >
> >  <fieldType name="descendent_path" class="solr.TextField">
> >    <analyzer type="index">
> >      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
> >    </analyzer>
> >    <analyzer type="query">
> >      <tokenizer class="solr.KeywordTokenizerFactory" />
> >    </analyzer>
> >  </fieldType>
> >
> >
> > Using the example in the docs:  *For example, in the configuration below a
> > query for Books/NonFic will match documents indexed with values like
> > Books/NonFic, Books/NonFic/Law, Books/NonFic/Science/Physics, etc. But it
> > will not match documents indexed with values like Books, or Books/Fic.*
> > This works great and solves a primary use case.
> >
> > However, we have a secondary use case where we need to get all documents
> > that match a single level. For example, let's say I wanted all of the
> > categories in Books/NonFic/, like Books/NonFic/Science, Books/NonFic/Art,
> > Books/NonFic/Math, etc. I can query for Books/NonFic, but this gives me
> > all of the descendant records too. One solution is to query for:
> >
> > category:Books/NonFic/* -category:Books/NonFic/*/*
> >
> > which seems like it works, but feels a little clunky.
> >
> > The other solution I can think of is to put a separate field into the
> > document at index time for each record, something like parentCategory,
> > which would be non-tokenized and indexed (not stored), holding
> > Books/NonFic for each of the Books/NonFic/[Science, Art, Math] documents.
> > However, with this solution I'm duplicating the information and increasing
> > my index size. This is not the worst thing, I know, but the field is by far
> > the largest contributor to the index size already, and doubling the
> > information there will have a noticeable impact on the disk footprint.
> >
> > So my question: with a projected index size in the billions of documents,
> > would you take either one of those two approaches? Or a third that I
> > haven't thought of?
> >
> > Thanks,
> >
> > Kyle
>
