[ https://issues.apache.org/jira/browse/SOLR-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17194982#comment-17194982 ]
David Eric Pugh commented on SOLR-14701:
----------------------------------------

I'm trying to sort out in my own head this whole idea of a "learning schema", and I think that's a red herring, i.e. not the crux of the issue. What I'm trying to understand is when I think about schema and data types first, before indexing data, and when I think about them "after" I've indexed my data.

The first case is definitely when I know what I'm building: "add a new field quantity as an integer to support inventory numbers". In that use case, I think we should be encouraging folks to use the Schema API to add their specific "quantity" field, or, as a fallback, to use dynamic fields and index into "quantity_i" (there's a rough sketch of this at the end of this comment).

The second case, and this touches on what you are doing, is when I am indexing data from a source where I actually don't know much about the data I'm indexing. For example, I've got a [streaming expression|https://github.com/epugh/playing-with-solr-streaming-expressions/blob/interact_with_tika_server/streaming_expressions/src/main/java/com/o19s/solr/streaming/SpaCyStream.java] doing NER with spaCy. The result might be a single value, it might be an array, it might be numbers, it might turn into multiple fields. And what I find is that, invariably, when I think a field is a number or a date, along comes something that isn't and blows up my process (the same pressure behind your work to widen types). What I'm not seeing in GuessSchemaFieldsUpdateProcessorFactory is the ability to manage the widening of types over multiple commits as new data is added.

So maybe what I'm really trying to say is that we need better tooling around unknown fields being indexed and stored in Solr, tooling robust enough to be used everywhere for certain use cases (and we drop the "learning" idea). Maybe what we need is a simpler AddSchemaFieldsUpdateProcessorFactory where every field you don't recognize becomes a multivalued string (second sketch at the end of this comment)? The data is then in Solr, and you can add copyFields etc. afterwards. Let's double down on making that work smoothly instead of trying to do indexing in some sort of perfect way straight from a messy source:

{code}
commit(films,
  update(films,
    select(
      search(films, q="*:*"),
      initial_release_date as initial_release_date_dt
    )
  )
)
{code}

We've long been hobbled by the big limitation that we don't have the "update multiple documents all at once" semantics that a SQL database has, which is what lets a SQL schema evolve easily over time. I like that this issue is exposing that, and I'd like to see it solved for production, not just as an odd "learning" type thing. Especially because everything we do in a "learning" example goes straight to production anyway!
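To make the first case concrete, something like the call below is what I'd point folks at for the "quantity" example. This is only a rough, untested sketch: it assumes a collection named films and the pint field type from the default configset, so adjust names and types to your own setup.

{code:bash}
# Sketch: add an explicit integer field via the Schema API (not tested)
curl -X POST -H 'Content-type: application/json' \
  'http://localhost:8983/solr/films/schema' \
  -d '{
    "add-field": {
      "name": "quantity",
      "type": "pint",
      "stored": true
    }
  }'
{code}

The fallback is even simpler: if your schema still has the stock *_i dynamic field rule, just index into quantity_i and skip the API call.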
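And for the "everything you don't recognize is a multivalued string" idea, I'm picturing something like the chain below in solrconfig.xml. Again, just a sketch and not tested: the chain name is made up, and it leans on the strings field type in the default configset being a multivalued StrField.

{code:xml}
<!-- Sketch: every field not already in the schema gets added as a multivalued string -->
<updateRequestProcessorChain name="add-unknown-fields-as-strings" default="true">
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <!-- no typeMapping guessing: unknown fields all land on one safe type -->
    <str name="defaultFieldType">strings</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
{code}

From there you can add copyFields, or reshape the data with a streaming expression like the one above, once you actually know what the data looks like.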
> Deprecate Schemaless Mode (Discussion)
> --------------------------------------
>
>                 Key: SOLR-14701
>                 URL: https://issues.apache.org/jira/browse/SOLR-14701
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>            Reporter: Marcus Eagan
>            Assignee: Alexandre Rafalovitch
>            Priority: Major
>         Attachments: image-2020-08-04-01-35-03-075.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I know this won't be the most popular ticket out there, but I am growing more and more sympathetic to the idea that we should rip out many of the freedoms that cause users more harm than not. One of the freedoms I saw time and time again cause issues was schemaless mode. It doesn't work as named or documented, so I think it should be deprecated.
>
> If you use it in production reliably and in a way that cannot be accomplished another way, I am happy to hear from more knowledgeable folks as to why deprecation is a bad idea.