[ https://issues.apache.org/jira/browse/SOLR-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202344#comment-17202344 ]
Alexandre Rafalovitch commented on SOLR-14701:
----------------------------------------------

Ok, let me try to walk through this and see what I am missing in code knowledge. Because, while I hear both of you, I am having issues reconciling it with the code we currently have.

Let's say we have
{code:xml}
<requestHandler name="/update/guess-schema" class="solr.GuessSchemaHandler" >
{code}
We can no longer send the output of PDF parsing to it, because that has to go to */update/extract*. The same applies to custom JSON if we want to benefit from those default parameters, unless we register */update/guess-schema/json/docs* as a duplicate. Or are these use-cases we don't care about? With the URP approach, they are still supported.

Ok, next issue. Currently we have:
{code:xml}
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema" default="${update.autoCreateFields:true}" processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
{code}
So, we are talking about taking that last element in the URPChain (currently add-schema-fields) and uplifting it to a full Update Handler. What happens to the rest of the chain? Do we keep it as the default for all Update handlers? Make it implicit? Remember that the current treatment of the 'default' flag breaks things when it is actually turned off, because it disables all the parsing logic (dates, uuid, field renaming). In my proposal, you put that chain on whatever handler you want and turn on/off the final component only.

Or do we explicitly define that chain on every update request processor? Except they are all implicit now, so yes, again, we have to make that implicit as well.

If we are keeping the URPChain, we will need, in this new UpdateHandler, to reach into the chain and remove the URPs that we just inserted to do any sort of document processing logic. That's inside ContentStreamHandlerBase, two levels above the new implementation.
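
For reference, a minimal sketch of what "put that chain on whatever handler you want" could look like in solrconfig.xml. This is an illustration only, not taken from the shipped configset: the choice of */update/extract* as the handler and the explicit *update.chain* default are assumptions, and the on/off flag for the final add-schema-fields step is left out because its name is not settled in this discussion.
{code:xml}
<!-- Assumption: the chain is declared without default="...", so it only
     runs where a handler asks for it explicitly. -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema"
    processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- Assumption: the extract handler is used as the example target; any
     update handler could reference the chain the same way. -->
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</requestHandler>
{code}
With a layout like this, the parsing URPs (dates, uuid, field renaming) stay attached to the handler regardless of whether the final schema-modifying step is enabled.
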
We cannot just wrap getLoader() because not all the logic goes through the loader (the commit path does not). I mean, maybe we can, but that starts to feel quite brittle if we remove some URPs on addDocument, but then run them all on commit(). They may have all-or-nothing expectations.

Finally, at the highest conceptual level, I thought we kind of agreed that this guessing, with whichever implementation, is not something we want in production. Certainly not in SolrCloud production. That includes my URP approach; I am pretty sure it will not work in SolrCloud. So, whatever implementation I write for this, I believe it will go into a learning schema, a separate example schema that is not used in production. Which I agree seems like a point for generating JSON instead of updating the schema directly, but it then (even with JSON) opens the can of discrepancies between schema designs. But it makes, at the same time, a point against any sort of implicit setup. A catch-22 in a way. With my approach, at least, it can all be explicit in a separate tidy minimal schema (with explicit commits only, etc.) for users to experiment with and then apply the learning to their true production schema, which will also have auto commits and caches and additional types, etc.

(As an aside, I think generating 'Schema API' instructions from a live schema should be a separate JIRA with a full implementation. Possibly as an enhancement of the Schema API call to allow other use cases such as 'copy definition from this schema to that'.)

Wrapping up, I can see how to make my approach work and be better than what we have currently, apart from a semi-ugly flag and an index/commit/index-again cycle. I cannot see how to make your suggestion work, however much I stare at it and our code. It does not mean it is not doable. I just don't know how to take it from your suggestion to the implementation. If somebody else can, I will be very happy for them to do so and to learn from them for the next time.
But I am -1 on shipping 9 with the current broken implementation that is misleading the user with the very first advice upon executing *bin/solr create -c corename.*

> Deprecate Schemaless Mode (Discussion)
> --------------------------------------
>
> Key: SOLR-14701
> URL: https://issues.apache.org/jira/browse/SOLR-14701
> Project: Solr
> Issue Type: Improvement
> Components: Schema and Analysis
> Reporter: Marcus Eagan
> Assignee: Alexandre Rafalovitch
> Priority: Major
> Attachments: image-2020-08-04-01-35-03-075.png
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> I know this won't be the most popular ticket out there, but I am growing more
> and more sympathetic to the idea that we should rip many of the freedoms out
> that cause users more harm than not. One of the freedoms I saw time and time
> again to cause issues was schemaless mode. It doesn't work as named or
> documented, so I think it should be deprecated.
> If you use it in production reliably and in a way that cannot be accomplished
> another way, I am happy to hear from more knowledgeable folks as to why
> deprecation is a bad idea.

--
This message was sent by Atlassian Jira (v8.3.4#803005)