Re: PathHierarchyTokenizerFactory single level match

Erick Erickson Fri, 23 Nov 2018 11:18:58 -0800

A couple of things.

bq. the field is by far the largest contributor to the index size already,


That's a rather odd statement. It implies that there's very little
else in your documents. If you have any descriptions etc. I'd think
that the category info wouldn't be all that huge in comparison. How
are you measuring?

One alternative would be to index an extra field with just the
_number_ of levels, so Books/NonFic/Science would have a second field
"level_count" set to 3. Now your secondary search, for only the direct
children of Books/NonFic, becomes
"q=whatever&fq=category:Books/NonFic&fq=level_count:3".
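
As a rough sketch of that idea (not tested against a real Solr instance):
the direct children of a path with n levels are the documents whose
level_count is n + 1, so the children of Books/NonFic carry level_count 3.
The helper names below are made up for illustration, "category" and
"level_count" are the field names used in this thread, and depending on
the query parser the "/" in the filter value may need escaping.

```python
from urllib.parse import urlencode


def level_count(path: str, delimiter: str = "/") -> int:
    """Count the non-empty segments of a category path."""
    return len([segment for segment in path.split(delimiter) if segment])


def with_level_count(doc: dict, field: str = "category") -> dict:
    """Return a copy of the document with a level_count field added (index time)."""
    return {**doc, "level_count": level_count(doc[field])}


def direct_children_params(parent: str, q: str = "*:*") -> list:
    """Build q/fq parameters that restrict results to direct children of `parent`."""
    child_depth = level_count(parent) + 1
    return [
        ("q", q),
        ("fq", f"category:{parent}"),
        ("fq", f"level_count:{child_depth}"),
    ]


if __name__ == "__main__":
    print(with_level_count({"id": "1", "category": "Books/NonFic/Science"}))
    # urlencode percent-encodes the values for use in a request URL
    print(urlencode(direct_children_params("Books/NonFic", q="whatever")))
```

The extra field is a single small integer per document, so it should add
far less to the index than a second copy of the path would.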

Best,
Erick
On Fri, Nov 23, 2018 at 6:24 AM lstusr 5u93n4 <[email protected]> wrote:
>
> Hi,
>
> I have a schema that has a descendent_path field as configured in the
> PathHierarchyTokenizerFactory docs:
>
>  <fieldType name="descendent_path" class="solr.TextField">
>    <analyzer type="index">
>      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
>    </analyzer>
>    <analyzer type="query">
>      <tokenizer class="solr.KeywordTokenizerFactory" />
>    </analyzer>
>  </fieldType>
>
>
> Using the example in the docs:  *For example, in the configuration below a
> query for Books/NonFic will match documents indexed with values like
> Books/NonFic, Books/NonFic/Law, Books/NonFic/Science/Physics, etc. But it
> will not match documents indexed with values like Books, or Books/Fic.* This
> works great and solves a primary use case.
>
> However, we have a secondary use case where we need to get all documents
> that match a single level. For example, let's say I wanted all of the
> categories directly under Books/NonFic/, like Books/NonFic/Science,
> Books/NonFic/Art, Books/NonFic/Math, etc. I can query for Books/NonFic,
> but this gives me all deeper descendant records too. One solution is to
> query for:
>
> category:Books/NonFic/* -category:Books/NonFic/*/*
>
> which seems like it works, but feels a little clunky.
>
> The other solution I can think of is to put a separate field into the
> document at index time for each record, something like parentCategory,
> which would be non-tokenized and indexed (not stored), holding
> Books/NonFic for each of the Books/NonFic/[Science, Art, Math] documents.
> However, with this solution I'm duplicating the information and increasing
> my index size. This is not the worst thing, I know, but the field is by far
> the largest contributor to the index size already, and doubling the
> information there will have a noticeable impact on the disk footprint.
>
> So my question: with a projected index size in the billions of documents,
> would you take either one of those two approaches? Or a third that I
> haven't thought of?
>
> Thanks,
>
> Kyle
