[ https://issues.apache.org/jira/browse/SOLR-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164611#comment-17164611 ]
Michael Gibney commented on SOLR-8362:
--------------------------------------

I opened SOLR-14454 with a narrow scope in mind: to enable absent export functionality while keeping the patch footprint as small as possible. In response to [~dsmiley]'s feedback, I experimented with a second approach that didn't shy away from making more general changes (shifting the priority from "small footprint" to "generality and consistency").

As a consequence of the more general approach, I arrived at a patch ([PR #1691|https://github.com/apache/lucene-solr/pull/1691]) with a larger footprint – general enough that I think it's more appropriately associated with SOLR-8362.

The main motivation for the new patch is to enable functionality that is not currently achievable even with creative index configuration; namely: arbitrary-size "value-access"-use-case docValues (including useDocValuesAsStored, export/functions/streaming expressions, atomic updates), and docValues for analyzed fields.

The most general stumbling block encountered was (as expected) the mutually exclusive use cases served by docValues over text content. There are three distinct categories of purpose that might be served by text docValues:
# faceting
# sorting (including grouping)
# value access (useDocValuesAsStored, export/functions/streaming expressions, atomic updates)

Of these, the first (faceting) requires strict correspondence between docValues and indexed terms (otherwise faceting is trappy on refinement – see SOLR-13056); for the latter two purposes (sorting and value access), there is no functional benefit to setting indexed==true.
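To make the three purposes concrete, here is a minimal, self-contained sketch of what each one would want stored for the same input text. It does not use Solr or Lucene and is not taken from PR #1691; the {{PurposeSketch}} class, the {{Purpose}} enum, the whitespace/lowercase "analyzer", and the lowercase sort normalization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in Solr or Lucene.
class PurposeSketch {

  /** The three categories of purpose listed above. */
  enum Purpose { FACET, SORT, VALUE_ACCESS }

  /** Stand-in for an analysis chain: lowercase, then split on whitespace. */
  static List<String> analyze(String raw) {
    List<String> terms = new ArrayList<>();
    for (String token : raw.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!token.isEmpty()) {
        terms.add(token);
      }
    }
    return terms;
  }

  /** Returns the values one might write to docValues for the given purpose. */
  static List<String> valuesFor(Purpose purpose, String raw) {
    switch (purpose) {
      case FACET:
        // Must correspond to the indexed terms: sorted, de-duplicated post-analysis tokens.
        return new ArrayList<>(new TreeSet<>(analyze(raw)));
      case SORT:
        // A single value per document; the normalization here is an assumed example.
        return Collections.singletonList(raw.trim().toLowerCase(Locale.ROOT));
      default:
        // VALUE_ACCESS: the original input, unmodified, so it can be returned as if stored.
        return Collections.singletonList(raw);
    }
  }

  public static void main(String[] args) {
    String input = "Quick brown Quick fox";
    for (Purpose purpose : Purpose.values()) {
      System.out.println(purpose + " -> " + valuesFor(purpose, input));
    }
  }
}
```

The point of the sketch is only that a single docValues representation cannot serve all three purposes at once: the facet values are tied to the analyzer's output, while the sort and value-access values are independent of whether the field is indexed at all.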
A strict, minimalist approach would dictate that _all_ analyzed fields must enforce a direct correspondence between docValues and indexed values (by generating docValues based on post-analysis terms), and that users who want docValues to serve non-faceting purposes must (sacrificing convenience and user-facing simplicity) declare different fields for these different purposes.

A similarly strict, but more user-friendly approach would be to continue to support the semantics (at the Solr level) of "docValues==true" enabling facet, sort, group, value access, functions, etc. I think such an approach necessitates decoupling the Solr "field" concept from the concept of a "field" in the underlying Lucene index. Without such decoupling, I think the only other options are various combinations of the "strict, minimalist approach" mentioned above, and/or zero-sum tradeoffs made by (or on behalf of) the user.

PR #1691 pursues this "decoupling" approach via the "polyField" concept in Solr. The polyField {{TextField}} approach ends up (I think) being quite user-friendly, and I think it addresses many of the concerns that are documented on this issue and on SOLR-11917. The vast majority of users, with all the above-envisioned use cases, should be able to set {{docValues=true}} and have things "just work". But the approach also provides considerable flexibility for expert use cases (e.g., inline sort-value analyzers/normalization), without introducing too many extra fieldType configuration parameters.

With these changes, {{TextField}} ends up being a full and more general replacement for {{SortableTextField}}.
To illustrate this, I left all tests for {{SortableTextField}} unchanged (though I added some tests for related {{TextField}} behavior in {{TestSortableTextField}}), made {{SortableTextField}} syntactic sugar for {{TextField}} with a handful of extra options and restrictions, and removed {{SortableTextField}} from the class ancestry of {{NestPathField}} (with the latter now directly extending {{TextField}}).

There are two main API additions:
# {{org.apache.solr.schema.DocValuesRefIterator}}, which abstracts DocValues access for cases that only need to iterate over documents and BytesRef values and do not need term ords (which sort and facet use cases do need). Where {{[solr.]FieldType.isUtf8Field()==true}}, {{[solr.]FieldType.getDocValuesRefIterator(LeafReader, SchemaField)}} allows the utf8 {{FieldType}} to mediate "value-access"-type cases – a prerequisite for the flexibility to select different DocValues representations for "stored"-type values.
# {{[lucene.]IndexableFieldType.tokenDocValuesType()}} and {{[solr.]FieldType.getAnalyzedDocValuesType()}}, which indicate whether docValues should be generated from post-analysis terms.

Regarding the (minimal) Lucene changes: I know it would be possible to support docValues on post-analysis terms entirely in Solr, by "pre-analyzing" in {{createFields(...)}}, collecting tokens and buffering each into a separate {{*DocValuesField}} instance. But it seemed so straightforward (and general-purpose useful?) to do this in Lucene that I went that route initially. If there's interest in pursuing that approach, it could spin off into a separate Lucene issue.

There are some nocommits to call attention to certain aspects of the code, and not all of the _new_ functionality has tests as robust as I'd like, but the tests that are there (including the new ones such as they are) seem pretty well-behaved.
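As a rough illustration of the idea behind the first API addition (iterating over documents and their byte values without exposing term ords), here is a self-contained sketch in plain Java. It is not the {{DocValuesRefIterator}} from the PR: the {{ValueRefIterator}} interface, its method names, the {{NO_MORE_DOCS}} sentinel, and the in-memory map standing in for a segment's docValues are all assumptions, and {{byte[]}} stands in for Lucene's BytesRef.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the interface introduced in PR #1691.
class RefIteratorSketch {

  /** Sentinel returned by nextDoc() when no documents remain (assumed convention). */
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Value-access view: documents and their UTF-8 values, with no term ordinals. */
  interface ValueRefIterator {
    /** Advances to the next document that has values, or returns NO_MORE_DOCS. */
    int nextDoc();

    /** Returns the next value of the current document, or null when it has no more. */
    byte[] nextRef();
  }

  /** Wraps an in-memory map (doc id to values) as a stand-in for per-segment docValues. */
  static ValueRefIterator over(TreeMap<Integer, List<String>> valuesByDoc) {
    final Iterator<Map.Entry<Integer, List<String>>> docs = valuesByDoc.entrySet().iterator();
    return new ValueRefIterator() {
      private Iterator<String> values = Collections.emptyIterator();

      @Override
      public int nextDoc() {
        if (!docs.hasNext()) {
          values = Collections.emptyIterator();
          return NO_MORE_DOCS;
        }
        Map.Entry<Integer, List<String>> entry = docs.next();
        values = entry.getValue().iterator();
        return entry.getKey();
      }

      @Override
      public byte[] nextRef() {
        return values.hasNext() ? values.next().getBytes(StandardCharsets.UTF_8) : null;
      }
    };
  }

  /** A consumer such as export only needs (doc, value) pairs, in document order. */
  static List<String> collect(ValueRefIterator it) {
    List<String> out = new ArrayList<>();
    for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
      for (byte[] ref = it.nextRef(); ref != null; ref = it.nextRef()) {
        out.add(doc + ":" + new String(ref, StandardCharsets.UTF_8));
      }
    }
    return out;
  }

  static List<String> demo() {
    TreeMap<Integer, List<String>> valuesByDoc = new TreeMap<>();
    valuesByDoc.put(3, Arrays.asList("b"));
    valuesByDoc.put(1, Arrays.asList("a", "c"));
    return collect(over(valuesByDoc));
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Because the consumer never asks for an ordinal, the backing representation is free to change (for example, a binary representation for large values versus a sorted-set representation), which is the flexibility the comment describes for "stored"-type values.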
Notably, I haven't looked at all at actually enabling atomic updates for {{TextField}}, but IIUC it should in principle be possible (now that {{docValues="true"}} and {{useDocValuesAsStored="true"}} are supported).

> Add docValues support for TextField
> -----------------------------------
>
> Key: SOLR-8362
> URL: https://issues.apache.org/jira/browse/SOLR-8362
> Project: Solr
> Issue Type: Improvement
> Reporter: Chris M. Hostetter
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> At the last lucene/solr revolution, Toke asked a question about why TextField
> doesn't support docValues. The short answer is because no one ever added it,
> but the longer answer was because we would have to think through carefully
> the _intent_ of supporting docValues for a "tokenized" field like TextField,
> and how to support various conflicting usecases where they could be handy.