[ https://issues.apache.org/jira/browse/SOLR-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194982#comment-17194982 ]
David Eric Pugh commented on SOLR-14701:
----------------------------------------

I'm trying to sort out in my own head this whole idea of a "learning schema", and I think that's a red herring, i.e. not the crux of the issue. What I'm trying to understand is when I think about schema and data types first, before indexing data, and when I think about them "after" I've indexed my data.

The first case is definitely when I know what I'm building: "add a new field quantity as an integer to support inventory numbers". In that use case, I think we should be encouraging folks to use the Schema API to add their specific "quantity" field, or, as a fallback, to use dynamic fields and index into "quantity_i" (there's a rough sketch of this at the end of this comment).

The second case, and this touches on what you are doing, is when I am indexing data from a source where I actually don't know much about the data I'm indexing. For example, I've got a [streaming expression|https://github.com/epugh/playing-with-solr-streaming-expressions/blob/interact_with_tika_server/streaming_expressions/src/main/java/com/o19s/solr/streaming/SpaCyStream.java] doing NER with spaCy. The result might be a single value, it might be an array, it might be numbers, it might turn into multiple fields. And what I find is that, invariably, when I think a field is a number or a date, along comes something that isn't and blows up my process (the same pressure behind your work to widen types). What I'm not seeing in GuessSchemaFieldsUpdateProcessorFactory is the ability to manage the widening of types over multiple commits as new data is added.

So maybe what I'm really trying to say is that we need better tooling around unknown fields being indexed and stored in Solr, tooling robust enough to be used everywhere for certain use cases (and we drop the "learning" idea). Maybe what we need is a simpler AddSchemaFieldsUpdateProcessorFactory where every field you don't recognize becomes a multivalued string (second sketch at the end of this comment)? The data is then in Solr, and you can add copyFields etc. afterwards. Let's double down on making that work smoothly instead of trying to do indexing in some sort of perfect way straight from a messy source:

{code}
commit(films,
  update(films,
    select(
      search(films, q="*:*"),
      initial_release_date as initial_release_date_dt
    )
  )
)
{code}

We've long been hobbled by the big limitation that we don't have the "update multiple documents all at once" semantics that a SQL database has, which is what lets a SQL schema evolve easily over time. I like that this issue is exposing that, and I'd like to see it solved for production, not just as an odd "learning" type thing. Especially because everything we do in a "learning" example goes straight to production anyway!
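To make the first case concrete, something like the call below is what I'd point folks at for the "quantity" example. This is only a rough, untested sketch: it assumes a collection named films and the pint field type from the default configset, so adjust names and types to your own setup.

{code:bash}
# Sketch: add an explicit integer field via the Schema API (not tested)
curl -X POST -H 'Content-type: application/json' \
  'http://localhost:8983/solr/films/schema' \
  -d '{
    "add-field": {
      "name": "quantity",
      "type": "pint",
      "stored": true
    }
  }'
{code}

The fallback is even simpler: if your schema still has the stock *_i dynamic field rule, just index into quantity_i and skip the API call.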
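And for the "everything you don't recognize is a multivalued string" idea, I'm picturing something like the chain below in solrconfig.xml. Again, just a sketch and not tested: the chain name is made up, and it leans on the strings field type in the default configset being a multivalued StrField.

{code:xml}
<!-- Sketch: every field not already in the schema gets added as a multivalued string -->
<updateRequestProcessorChain name="add-unknown-fields-as-strings" default="true">
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <!-- no typeMapping guessing: unknown fields all land on one safe type -->
    <str name="defaultFieldType">strings</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
{code}

From there you can add copyFields, or reshape the data with a streaming expression like the one above, once you actually know what the data looks like.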
> Deprecate Schemaless Mode (Discussion)
> --------------------------------------
>
>                 Key: SOLR-14701
>                 URL: https://issues.apache.org/jira/browse/SOLR-14701
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Marcus Eagan
>            Assignee: Alexandre Rafalovitch
>            Priority: Major
>         Attachments: image-2020-08-04-01-35-03-075.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I know this won't be the most popular ticket out there, but I am growing more and more sympathetic to the idea that we should rip out many of the freedoms that cause users more harm than not. One of the freedoms I saw time and time again cause issues was schemaless mode. It doesn't work as named or documented, so I think it should be deprecated.
>
> If you use it in production reliably and in a way that cannot be accomplished another way, I am happy to hear from more knowledgeable folks as to why deprecation is a bad idea.