Re: PathHierarchyTokenizerFactory single level match

Erick Erickson Sat, 24 Nov 2018 09:06:02 -0800

Ah, I see. And I doubt there's any position information or vector
information for that field so it's probably as small as it could be
anyway.


One note about stored data, assuming you've set stored="true". It's
all kept in the "fdt" and "fdx" segment files and doesn't have much
effect on the memory requirements for searching. It's only accessed to
return the top N documents, so while the search may look at a zillion
docs, the stored data will only be accessed for, say, the 10 documents
returned. True, it occupies a lot of disk space....

Good luck!
On Fri, Nov 23, 2018 at 11:44 AM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
>
> Lots of discussion about XY problems on this list lately..... Maybe I'm a
> bit guilty. :D
>
> I used the example from the docs to be clear, but our real use case is
> indexing file metadata on a large filesystem. With a few fields like owner,
> group, mode, lastmodified, filesize, type, and path, the path field is the
> only non-numeric, non-date field that can exceed a couple of characters. So
> we want to be able to say: give me all of the directories in a particular
> parent, and get the answer without the children.
>
> Using the level_count is a great idea. I think this is the way we'll go
> here.
>
> Thanks for your help!
>
> Kyle
>
> On Fri, 23 Nov 2018 at 14:18, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > A couple of things.
> >
> > bq. the field is by far the largest contributor to the index size already,
> >
> > That's a rather odd statement. It implies that there's very little
> > else in your documents. If you have any descriptions etc. I'd think
> > that the category info wouldn't be all that huge in comparison. How
> > are you measuring?
> >
> > One alternative would be to index an extra field with just the
> > _number_ of levels, so Books/NonFic/Science would have a second field
> > "level_count" set to 3. Now your secondary search becomes
> > "q=whatever&fq=category:Books/NonFic&fq=level_count:3".
> >
> > Best,
> > Erick
> > On Fri, Nov 23, 2018 at 6:24 AM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I have a schema that has a descendent_path field as configured in the
> > > PathHierarchyTokenizerFactory docs:
> > >
> > >  <fieldType name="descendent_path" class="solr.TextField">
> > >    <analyzer type="index">
> > >      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
> > >    </analyzer>
> > >    <analyzer type="query">
> > >      <tokenizer class="solr.KeywordTokenizerFactory" />
> > >    </analyzer>
> > >  </fieldType>
> > >
> > >
> > > Using the example in the docs:  *For example, in the configuration below a
> > > query for Books/NonFic will match documents indexed with values like
> > > Books/NonFic, Books/NonFic/Law, Books/NonFic/Science/Physics, etc. But it
> > > will not match documents indexed with values like Books, or Books/Fic.* This
> > > works great and solves a primary use case.
> > >
> > > However, we have a secondary use case where we need to get all documents
> > > that match a single level. For example, let's say I wanted all of the
> > > categories in Books/NonFic/, like Books/NonFic/Science, Books/NonFic/Art,
> > > Books/NonFic/Math, etc. I can query for Books/NonFic, but this gives me
> > > all of the deeper descendant records too. One solution is to query for:
> > >
> > > category:Books/NonFic/* -category:Books/NonFic/*/*
> > >
> > > which seems like it works, but feels a little clunky.
> > >
> > > The other solution I can think of is to put a separate, non-tokenized field
> > > into the document at index time for each record, something like
> > > parentCategory, which would be non-tokenized and indexed (not stored) like
> > > Books/NonFic for each of the Books/NonFic/[Science, Art, Math] documents.
> > > However, with this solution I'm duplicating the information and increasing
> > > my index size. This is not the worst thing, I know, but the field is by far
> > > the largest contributor to the index size already, and doubling the
> > > information there will have a noticeable impact on the disk footprint.
> > >
> > > So my question: with a projected index size in the billions of documents,
> > > would you take either one of those two approaches? Or a third that I
> > > haven't thought of?
> > >
> > > Thanks,
> > >
> > > Kyle
> >
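
For reference, a minimal Python sketch of the level_count suggestion quoted
above. It assumes documents are plain dicts with the path in a field named
category, as in the thread's examples; the helper names (level_count,
with_level_count, direct_children_params) are made up for illustration and
are not part of Solr. Direct children of a parent at depth d have depth
d + 1, so the children of Books/NonFic carry level_count 3.

```python
from urllib.parse import urlencode


def level_count(path: str, delimiter: str = "/") -> int:
    """Count the non-empty components of a path, e.g. Books/NonFic/Science -> 3."""
    return len([part for part in path.split(delimiter) if part])


def with_level_count(doc: dict, path_field: str = "category") -> dict:
    """Return a copy of the document with a level_count field added.

    Meant to run client-side before the document is sent for indexing;
    the field name level_count is taken from the thread.
    """
    enriched = dict(doc)
    enriched["level_count"] = level_count(doc[path_field])
    return enriched


def direct_children_params(parent: str, q: str = "*:*",
                           path_field: str = "category") -> list:
    """Build q/fq pairs that select only the direct children of parent.

    The first fq matches the parent and everything below it (the
    descendent_path behaviour); the second keeps only documents exactly
    one level deeper than the parent.
    """
    return [
        ("q", q),
        ("fq", f"{path_field}:{parent}"),
        ("fq", f"level_count:{level_count(parent) + 1}"),
    ]


if __name__ == "__main__":
    doc = with_level_count({"id": "1", "category": "Books/NonFic/Science"})
    print(doc)
    # URL-encoded query string for the children of Books/NonFic
    print(urlencode(direct_children_params("Books/NonFic", q="whatever")))
```

The path is interpolated into the fq as written in the thread; depending on
the query parser in use, the / may need escaping. For the filesystem use
case, leading and trailing delimiters are ignored by level_count, so
/usr/local/ counts as 2 levels.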
