[jira] [Commented] (SOLR-8362) Add docValues support for TextField

Michael Gibney (Jira) Fri, 24 Jul 2020 12:58:32 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-8362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164611#comment-17164611
 ]


Michael Gibney commented on SOLR-8362:
--------------------------------------

I opened SOLR-14454 with a narrow scope in mind: to enable the missing export 
functionality while keeping the patch footprint as small as possible. In 
response to [~dsmiley]'s feedback, I experimented with a second approach that 
didn't shy away from making more general changes (shifting the priority from 
"small footprint" to "generality and consistency").

As a consequence of the more general approach, I arrived at a patch ([PR 
#1691|https://github.com/apache/lucene-solr/pull/1691]) with a larger footprint 
– general enough that I think it's more appropriately associated with SOLR-8362.

The main motivation for the new patch is to enable functionality that is not 
currently achievable even with creative index configuration; namely: 
arbitrary-size "value-access"-use-case docValues (including 
useDocValuesAsStored, export/functions/streaming expressions, atomic updates), 
and docValues for analyzed fields.

The most general stumbling block encountered was (as expected) the mutually 
exclusive use cases served by docValues over text content. There are three 
distinct categories of purpose that might be served by text docValues:
 # faceting
 # sorting (including grouping)
 # value access (useDocValuesAsStored, export/functions/streaming expressions, 
atomic updates)

Of these, the first (faceting) requires strict correspondence between docValues 
and indexed terms (otherwise faceting is trappy on refinement – see 
SOLR-13056); for the latter two purposes (sorting and value access), there is 
no functional benefit to setting indexed==true.
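
To make the correspondence requirement concrete, here is a toy sketch in plain Java. It is not Solr or Lucene code: the class, the lowercase-and-split "analyzer", and the modeling of refinement as an exact-term lookup against indexed terms are all assumptions made for illustration. It shows how facet buckets built from raw values can fail to match any indexed term, while buckets built from post-analysis terms always do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy model only: none of these names exist in Solr or Lucene. */
final class FacetCorrespondenceSketch {

  /** Stand-in analyzer (an assumption): lowercase, then split on whitespace. */
  static List<String> analyze(String raw) {
    List<String> terms = new ArrayList<>();
    for (String t : raw.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
      if (!t.isEmpty()) {
        terms.add(t);
      }
    }
    return terms;
  }

  /** Facet buckets built from docValues holding either raw values or post-analysis terms. */
  static Map<String, Integer> facetBuckets(List<String> docs, boolean postAnalysis) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String raw : docs) {
      List<String> values = postAnalysis ? analyze(raw) : Collections.singletonList(raw);
      // Count each distinct value at most once per document.
      for (String v : new LinkedHashSet<>(values)) {
        counts.merge(v, 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Number of docs whose indexed (post-analysis) terms contain exactly this term. */
  static int termQueryCount(List<String> docs, String term) {
    int n = 0;
    for (String raw : docs) {
      if (analyze(raw).contains(term)) {
        n++;
      }
    }
    return n;
  }
}
```

In this model, a raw-value bucket such as "New York" has a count of 1, yet looking it up as an indexed term finds nothing, because the index only contains "new" and "york". Post-analysis buckets do not have this problem: each bucket key is by construction an indexed term.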

A strict, minimalist approach would dictate that _all_ analyzed fields must 
enforce a direct correspondence between docValues and indexed values (by 
generating docValues based on post-analysis terms), and that users who desire 
non-faceting purposes to be served by docValues must (sacrificing convenience 
and user-facing simplicity) declare different fields for these different 
purposes.

A similarly strict, but more user-friendly approach would be to continue to 
support the semantics (at the Solr level) of "docValues==true" enabling facet, 
sort, group, value access, functions, etc. I think such an approach 
necessitates decoupling the Solr "field" concept from the concept of a "field" 
in the underlying Lucene index. Without such decoupling, I think the only other 
options are various combinations of the "strict, minimalist approach" mentioned 
above, and/or zero-sum tradeoffs made by (or on behalf of) the user. PR #1691 
pursues this "decoupling" approach via the "polyField" concept in Solr.

The polyField {{TextField}} approach ends up (I think) being quite 
user-friendly, and I think it addresses many of the concerns that are documented 
on this issue and on SOLR-11917. The vast majority of users, with all the 
above-envisioned use cases, should be able to set {{docValues=true}} and have 
things "just work". But the approach also provides considerable flexibility for 
expert use cases (e.g., inline sort-value analyzers/normalization), without 
introducing too many extra fieldType configuration parameters. With these 
changes, {{TextField}} ends up being a full and more general replacement for 
{{SortableTextField}}. To illustrate this, I left all tests for 
{{SortableTextField}} unchanged (though I added some tests for related 
{{TextField}} behavior in {{TestSortableTextField}}), made 
{{SortableTextField}} syntactic sugar for {{TextField}} with a handful of extra 
options and restrictions, and removed {{SortableTextField}} from the class 
ancestry of {{NestPathField}} (with the latter now directly extending 
{{TextField}}).

There are two main API additions:
 # {{org.apache.solr.schema.DocValuesRefIterator}}, which abstracts 
DocValues access for cases that only need to iterate over documents and 
BytesRef values, but don't care about term ords (as one would for sort or facet 
use cases). Where {{[solr.]FieldType.isUtf8Field()==true}}, 
{{[solr.]FieldType.getDocValuesRefIterator(LeafReader, SchemaField)}} allows 
the utf8 {{FieldType}} to mediate "value-access"-type cases – a prerequisite 
for the flexibility to select different DocValues representations for 
"stored"-type values.
 # {{[lucene.]IndexableFieldType.tokenDocValuesType()}} and 
{{[solr.]FieldType.getAnalyzedDocValuesType()}} to indicate whether docValues 
should be generated from post-analysis terms.
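
As a rough illustration of the first addition, the sketch below shows what an ord-free "value access" abstraction could look like. It is a guess at the general shape, not the interface from PR #1691: the names {{ValueRefIterator}}, {{InMemoryRefIterator}} and {{ValueAccessSketch}}, the method signatures, and the use of {{byte[]}} in place of Lucene's {{BytesRef}} are all hypothetical, chosen so the example runs on the JDK alone.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical shape: iterate a document's values without exposing term ords. */
interface ValueRefIterator {
  /** Positions on docId; returns true if that document has at least one value. */
  boolean advanceExact(int docId);

  /** Returns the next value for the current document, or null when exhausted. */
  byte[] nextRef();
}

/** In-memory stand-in for whatever docValues representation backs the field. */
final class InMemoryRefIterator implements ValueRefIterator {
  private final Map<Integer, List<String>> valuesByDoc;
  private Iterator<String> current = Collections.emptyIterator();

  InMemoryRefIterator(Map<Integer, List<String>> valuesByDoc) {
    this.valuesByDoc = valuesByDoc;
  }

  @Override
  public boolean advanceExact(int docId) {
    List<String> vals = valuesByDoc.getOrDefault(docId, Collections.emptyList());
    current = vals.iterator();
    return !vals.isEmpty();
  }

  @Override
  public byte[] nextRef() {
    return current.hasNext() ? current.next().getBytes(StandardCharsets.UTF_8) : null;
  }
}

/** A consumer (e.g. export-style code) that needs only values, never ords. */
final class ValueAccessSketch {
  static List<String> readDoc(ValueRefIterator it, int docId) {
    List<String> out = new ArrayList<>();
    if (it.advanceExact(docId)) {
      for (byte[] ref = it.nextRef(); ref != null; ref = it.nextRef()) {
        out.add(new String(ref, StandardCharsets.UTF_8));
      }
    }
    return out;
  }
}
```

The point of the indirection is that the consumer is written once against the iterator, so the field type stays free to choose a different underlying representation for "stored"-type values without touching callers.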

Regarding the (minimal) Lucene changes: I know it would be possible to support 
docValues on post-analysis terms entirely in Solr, by "pre-analyzing" in 
{{createFields(...)}}, collecting 
tokens and buffering each into a separate {{*DocValuesField}} instance. But it 
seemed so straightforward (and general-purpose useful?) to do this in Lucene 
that I went that route initially. If there's interest in pursuing that 
approach, it could spin off into a separate Lucene issue.
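
For comparison, the Solr-only alternative described above (pre-analyze, collect tokens, buffer each into its own docValues field instance) could be sketched roughly as follows. This is plain Java with stand-ins, not code from the patch: {{TokenDocValue}} stands in for a per-token {{*DocValuesField}} instance, the analyzer is a toy, and this {{createFields}} only mimics the role of the Solr method of the same name.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of "pre-analysis" in createFields; all names here are hypothetical. */
final class PreAnalysisSketch {

  /** Stand-in for one per-token docValues field instance. */
  static final class TokenDocValue {
    final String field;
    final String term;

    TokenDocValue(String field, String term) {
      this.field = field;
      this.term = term;
    }
  }

  /** Stand-in analyzer (an assumption): lowercase, then split on whitespace. */
  static List<String> analyze(String raw) {
    List<String> terms = new ArrayList<>();
    for (String t : raw.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
      if (!t.isEmpty()) {
        terms.add(t);
      }
    }
    return terms;
  }

  /** Runs analysis up front and buffers one field instance per distinct token. */
  static List<TokenDocValue> createFields(String field, String raw) {
    // Deduplicate while keeping first-seen order; a set-valued docValues
    // representation would collapse duplicate terms anyway.
    Set<String> unique = new LinkedHashSet<>(analyze(raw));
    List<TokenDocValue> out = new ArrayList<>();
    for (String term : unique) {
      out.add(new TokenDocValue(field, term));
    }
    return out;
  }
}
```

One cost this makes visible: analysis runs here only to produce docValues, so in a real implementation the same input would presumably be analyzed a second time when the indexed field itself is processed.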

There are some nocommits to call attention to certain aspects of the code, and 
not all of the _new_ functionality has tests as robust as I'd like, but the 
tests that are there (including the new ones such as they are) seem pretty 
well-behaved. Notably, I haven't looked at all at actually enabling atomic 
updates for {{TextField}}, but IIUC it should in principle be possible (now 
that {{docValues="true"}} and {{useDocValuesAsStored="true"}} are supported).

> Add docValues support for TextField
> -----------------------------------
>
>                 Key: SOLR-8362
>                 URL: https://issues.apache.org/jira/browse/SOLR-8362
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Chris M. Hostetter
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> At the last lucene/solr revolution, Toke asked a question about why TextField 
> doesn't support docValues.  The short answer is because no one ever added it, 
> but the longer answer was because we would have to think through carefully 
> the _intent_ of supporting docValues for  a "tokenized" field like TextField, 
> and how to support various conflicting usecases where they could be handy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
