[ 
https://issues.apache.org/jira/browse/SOLR-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202344#comment-17202344
 ] 

Alexandre Rafalovitch commented on SOLR-14701:
----------------------------------------------

Ok, let me try to walk through this and see what I am missing in my knowledge 
of the code. While I hear both of you, I am having trouble reconciling your 
suggestions with the code we currently have.

Let's say we have

{code:xml}
<requestHandler name="/update/guess-schema" class="solr.GuessSchemaHandler">
{code}

We can no longer send the output of PDF parsing to it, because that has to go 
to */update/extract*. 
Custom JSON has the same problem if we want to benefit from those default 
parameters, unless we register */update/guess-schema/json/docs* as a duplicate. 
Or are these use cases we don't care about? With the URP approach, they are 
still supported.
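
As a rough sketch of why the URP approach keeps these use cases working: a 
chain can be selected through the standard update.chain parameter, so the 
existing handlers keep their own URLs and defaults. The initParams block below 
is an assumption about how that could be wired up, not part of any patch; it 
reuses the chain name from the current configuration.

{code:xml}
<!-- Sketch only: apply the guessing chain to every handler under /update, -->
<!-- which covers /update/extract and /update/json/docs without duplicates. -->
<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</initParams>
{code}

A single request can still choose a different chain by passing update.chain 
explicitly, since request parameters override handler defaults.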

Ok, next issue. Currently we have:

{code:xml}
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema"
    default="${update.autoCreateFields:true}"
    processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
{code}

So, we are talking about taking the last element in the URPChain (currently 
add-schema-fields) and uplifting it to a full Update Handler. What happens to 
the rest of the chain? Do we keep it as the default for all Update handlers? 
Make it implicit? Remember that the current treatment of the 'default' flag 
breaks things when it is actually turned off, because it disables all the 
parsing logic (dates, uuid, field renaming). In my proposal, you put that chain 
on whatever handler you want and turn on or off the final component only. 
Or do we explicitly define that chain on every update request handler? Except 
they are all implicit now, so, again, we would have to make that implicit as 
well.
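
To make the 'default' flag problem concrete: with the flag off, requests fall 
back to Solr's built-in chain and skip every processor named in the chain 
above, not just add-schema-fields. The sketch below shows one way the parsing 
steps could be kept independent of the guessing step. The chain name 
"parse-only" is hypothetical, and this is not the flag mechanism from my 
proposal, only an illustration of the separation it aims for.

{code:xml}
<!-- Hypothetical chain name. Same parsing processors as the current chain, -->
<!-- minus add-schema-fields, so nothing here modifies the schema. -->
<updateRequestProcessorChain name="parse-only"
    processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
{code}

Documents that contain fields missing from the schema would still be rejected 
under such a chain, which is the expected behavior once guessing is off.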

If we are keeping the URPChain, then in this new UpdateHandler we will need to 
reach into the chain and remove the URPs that we just inserted to do any sort 
of document processing logic. That happens inside ContentStreamHandlerBase, two 
levels above the new implementation. We cannot just wrap getLoader() because 
not all the logic goes through the loader (the commit path does not). I mean, 
maybe we can, but that starts to feel quite brittle if we remove some URPs on 
addDocument, but then run them all on commit(). They may have all-or-nothing 
expectations. 

Finally, at the highest conceptual level, I thought we had more or less agreed 
that this guessing, whichever way it is implemented, is not something we want 
in production, and certainly not in SolrCloud production. That includes my URP 
approach; I am pretty sure it will not work in SolrCloud. So, whatever 
implementation I write for this, I believe it will go into a learning schema: a 
separate example schema that is not meant for production. I agree that this 
seems like a point in favor of generating JSON instead of updating the schema 
directly, but that (even with JSON) opens a can of discrepancies between schema 
designs. At the same time, it is a point against any sort of implicit setup. A 
catch-22, in a way.  

With my approach, at least, it can all be explicit in a separate, tidy, minimal 
schema (with explicit commits only, etc.) for users to experiment with. They 
can then apply what they learned to their true production schema, which will 
also have auto commits, caches, additional types, etc.

(As an aside, I think generating 'Schema API' instructions from a live schema 
should be a separate JIRA with a full implementation, possibly as an 
enhancement of a Schema API call to allow other use cases such as 'copy 
definition from this schema to that'.)

Wrapping up,
I can see how to make my approach work and be better than what we have 
currently, apart from a semi-ugly flag and an index/commit/index-again cycle. 
I cannot see how to make your suggestion work however much I stare at it and 
our code. It does not mean it is not doable. I just don't know how to take it 
from your suggestion to the implementation.
If somebody else can, I will be very happy for them to do so and to learn from 
them for the next time.
But I am -1 on shipping 9 with the current broken implementation, which 
misleads the user with the very first advice shown upon executing 
*bin/solr create -c corename*.

> Deprecate Schemaless Mode (Discussion)
> --------------------------------------
>
>                 Key: SOLR-14701
>                 URL: https://issues.apache.org/jira/browse/SOLR-14701
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Marcus Eagan
>            Assignee: Alexandre Rafalovitch
>            Priority: Major
>         Attachments: image-2020-08-04-01-35-03-075.png
>
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> I know this won't be the most popular ticket out there, but I am growing more 
> and more sympathetic to the idea that we should rip many of the freedoms out 
> that cause users more harm than not. One of the freedoms I saw time and time 
> again to cause issues was schemaless mode. It doesn't work as named or 
> documented, so I think it should be deprecated. 
> If you use it in production reliably and in a way that cannot be accomplished 
> another way, I am happy to hear from more knowledgeable folks as to why 
> deprecation is a bad idea. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
