Ah, I see. And I doubt there's any position information or vector information for that field so it's probably as small as it could be anyway.
One note about stored data, assuming you've set stored="true". It's all kept in the "fdt" and "fdx" segment files and doesn't have much effect on the memory requirements for searching. It's only accessed to return the top N documents, so while the search may look at a zillion docs, the stored data will only be accessed for, say, the 10 documents returned. True, it occupies a lot of disk space....

Good luck!

On Fri, Nov 23, 2018 at 11:44 AM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
>
> Lots of discussion about XY problems on this list lately..... Maybe I'm a
> bit guilty. :D
>
> I used the example from the docs to be clear, but our real use case is
> indexing file metadata on a large filesystem. With a few fields like owner,
> group, mode, lastmodified, filesize, type, and path, the path field is the
> only non-numeric, non-date field that can exceed a couple of characters. So
> we want to be able to say: give me all of the directories in a particular
> parent, and get the answer without the children.
>
> Using the level_count is a great idea. I think this is the way we'll go
> here.
>
> Thanks for your help!
>
> Kyle
>
> On Fri, 23 Nov 2018 at 14:18, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > A couple of things.
> >
> > bq. the field is by far the largest contributor to the index size already,
> >
> > That's a rather odd statement. It implies that there's very little
> > else in your documents. If you have any descriptions etc. I'd think
> > that the category info wouldn't be all that huge in comparison. How
> > are you measuring?
> >
> > One alternative would be to index an extra field with just the
> > _number_ of levels, so Books/NonFic/Science would have a second field
> > "level_count" set to 3. Now your secondary search becomes
> > "q=whatever&fq=category:Books/NonFic&fq=level_count:3".
> >
> > Best,
> > Erick
> > On Fri, Nov 23, 2018 at 6:24 AM lstusr 5u93n4 <lstusr...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I have a schema that has a descendent_path field as configured in the
> > > PathHierarchyTokenizerFactory docs:
> > >
> > > <fieldType name="descendent_path" class="solr.TextField">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/" />
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.KeywordTokenizerFactory" />
> > >   </analyzer>
> > > </fieldType>
> > >
> > > Using the example in the docs: *For example, in the configuration below a
> > > query for Books/NonFic will match documents indexed with values like
> > > Books/NonFic, Books/NonFic/Law, Books/NonFic/Science/Physics, etc. But it
> > > will not match documents indexed with values like Books, or Books/Fic.*
> > > This works great and solves a primary use case.
> > >
> > > However, we have a secondary use case where we need to get all documents
> > > that match a single level. For example, let's say I wanted all of the
> > > categories in Books/NonFic/, like Books/NonFic/Science, Books/NonFic/Art,
> > > Books/NonFic/Math, etc. I can query for Books/NonFic, but this gives me
> > > all children records too. One solution is to query for:
> > >
> > > category:Books/NonFic/* -category:Books/NonFic/*/*
> > >
> > > which seems like it works, but feels a little clunky.
> > >
> > > The other solution I can think of is to put a separate, non-tokenized
> > > field into the document at index time for each record, something like
> > > parentCategory, which would be non-tokenized and indexed (not stored)
> > > like Books/NonFic for each of the Books/NonFic/[Science, Art, Math]
> > > documents. However, with this solution I'm duplicating the information
> > > and increasing my index size.
> > > This is not the worst thing, I know, but the field is by far
> > > the largest contributor to the index size already, and doubling the
> > > information there will have a noticeable impact on the disk footprint.
> > >
> > > So my question: with a projected index size in the billions of documents,
> > > would you take either one of those two approaches? Or a third that I
> > > haven't thought of?
> > >
> > > Thanks,
> > >
> > > Kyle
> >
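
A minimal sketch of the level_count idea discussed in this thread, in Python. The helper names (level_count, path_prefixes, direct_children_filters, matches) are hypothetical and not part of Solr; the field names category and level_count are taken from the thread. Counting Books/NonFic/Science as 3 levels, the direct children of a parent sit at the parent's depth plus one, so the children of Books/NonFic carry level_count 3. The sketch assumes paths without a leading delimiter, and it builds the filter strings without any escaping; depending on the query parser, characters such as "/" may need escaping or quoting.

```python
def level_count(path, delimiter="/"):
    """Number of non-empty path segments, e.g. Books/NonFic/Science -> 3."""
    return len([seg for seg in path.split(delimiter) if seg])


def path_prefixes(path, delimiter="/"):
    """Rough emulation of the prefixes a path-hierarchy tokenizer indexes.

    Assumes no leading delimiter: Books/NonFic/Science yields
    Books, Books/NonFic, Books/NonFic/Science.
    """
    segs = [seg for seg in path.split(delimiter) if seg]
    return [delimiter.join(segs[: i + 1]) for i in range(len(segs))]


def direct_children_filters(parent, path_field="category", count_field="level_count"):
    """Build the two filter queries that select only immediate children.

    The strings are unescaped, mirroring the form used in the thread.
    """
    child_depth = level_count(parent) + 1
    return [f"{path_field}:{parent}", f"{count_field}:{child_depth}"]


def matches(doc_path, parent):
    """In-memory stand-in for the two filters, to check the logic only."""
    is_descendant_or_self = parent in path_prefixes(doc_path)
    is_one_level_down = level_count(doc_path) == level_count(parent) + 1
    return is_descendant_or_self and is_one_level_down


docs = [
    "Books",
    "Books/Fic",
    "Books/NonFic",
    "Books/NonFic/Science",
    "Books/NonFic/Science/Physics",
]

filters = direct_children_filters("Books/NonFic")
children = [d for d in docs if matches(d, "Books/NonFic")]

print(filters)   # ['category:Books/NonFic', 'level_count:3']
print(children)  # ['Books/NonFic/Science']
```

The extra field is a small integer per document, so it should add far less to the index than a second copy of the parent path would, which is the trade-off raised in the original question.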