ErickErickson commented on pull request #1863:
URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698933577


   I’ve always thought “guess” was mostly a reflection of how inadequate we 
devs thought the whole process was. I agree it’s not very confidence-inspiring…
   
   Off the top of my head, the idea of a new handler has a lot of appeal. It’s 
easy to tell a user “index as many training documents as you want to this 
handler to create your schema. NO DOCUMENTS WILL BE INDEXED during this stage, 
this is for refining your schema. When you believe you’ve sent enough training 
documents for Solr to have a reasonable chance of defining the correct schema, 
reload the collection and send the documents again, this time to the update 
handler, to be able to search them”. Maybe “schema-trainer”?
   
   It seems to me that there could be some efficiencies here.
   
   1> Could we bypass reloading the collection each time? We’d have to read the 
schema directly and modify it. Or at least only reload it at the end of each 
batch if there were changes.
   
   2> The response could indicate whether there were any changes to the schema. 
I can imagine a process whereby I send 10 docs, see there were changes. Then 
send 10 more and see there were changes. Rinse/repeat until I’d been able to 
send N batches without any changes.
   
   3> This still isn’t going to bullet-proof semi-structured docs. Or any other 
really, but it’ll at least make things far more robust. What happens if I index 
1,000 Word docs, call it good, then try to index PDFs, or PNGs, or some other 
new type of document? Or do we change the update handler to put all 
unrecognized fields into a text field? If so, we should probably log warnings 
that this is the case.
   
   4> We’d have to throw an error if the training handler was used after there 
were documents in the index _and_ any existing field needed to be modified. If 
we did that, users could try to send docs through the training handler at any 
time, which would partially handle <3>.
   
   5> One tricky bit would be how to train on a bunch of documents _after_ some 
docs were already indexed. Prior to indexing any docs, we can freely modify 
existing fields in the schema. Back to <3>. I’m indexing all the Word docs just 
fine. Now I want to index PDFs and they have new fields, so I want to throw a 
bunch of them at the training handler. If I send them one-by-one, how to 
distinguish between a field that had been defined that no currently-indexed 
documents use and can be changed vs. one that has documents indexed against 
it? This’ll bear some thought because the replica processing the training could 
theoretically examine the local index and see no docs using that field but some 
replica of some other shard _does_ have one or more docs using the field. And 
in the case where implicit routing is used or even composite keys this gets 
worse. This either gets really complicated or we make a rule like “if you use 
the training handler after documents are indexed, you get one chance to send a 
batch of documents. Training will fail if any existing field needs to be 
modified”.
   
   6> We’d be able to confine field guessing exclusively to the training 
handler.
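   
   To make <2> and the rule at the end of <5> concrete, here is a rough sketch 
in Python. It is not Solr code and does not use any Solr API. Every name in it 
(train_batch, train_until_stable, guess_type, widen, SchemaConflict, the type 
names) is hypothetical, and the type guessing is a crude stand-in. It only 
shows the shape of the idea: each batch reports whether the schema changed, 
the client stops after N quiet batches, and a batch fails as a whole if it 
would modify an existing field once documents are in the index.

```python
from typing import Dict, Iterable, List, Tuple

Doc = Dict[str, object]
Schema = Dict[str, str]  # field name -> guessed type name (hypothetical names)


class SchemaConflict(Exception):
    """Training would modify a field that indexed documents may already use."""


def guess_type(value: object) -> str:
    # Crude stand-in for real field guessing; bool is checked before int
    # because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "text"


def widen(current: str, guessed: str) -> str:
    # Pick a type able to hold both; fall back to text when unsure.
    if current == guessed:
        return current
    if {current, guessed} == {"long", "double"}:
        return "double"
    return "text"


def train_batch(schema: Schema, batch: Iterable[Doc],
                index_has_docs: bool = False) -> bool:
    """Refine the schema from one batch and report whether it changed.

    Nothing is indexed. With index_has_docs=True, new fields may still be
    added, but changing a pre-existing field fails the whole batch and
    leaves the schema untouched.
    """
    proposed = dict(schema)
    for doc in batch:
        for field, value in doc.items():
            current = proposed.get(field)
            guessed = guess_type(value)
            merged = guessed if current is None else widen(current, guessed)
            if merged == current:
                continue
            if index_has_docs and field in schema:
                raise SchemaConflict(field)
            proposed[field] = merged
    changed = proposed != schema
    schema.clear()
    schema.update(proposed)
    return changed


def train_until_stable(schema: Schema, batches: Iterable[List[Doc]],
                       quiet_batches: int = 3) -> Tuple[int, bool]:
    """Send batches until quiet_batches in a row leave the schema unchanged.

    Returns (batches sent, whether the schema settled).
    """
    sent = 0
    quiet = 0
    for batch in batches:
        sent += 1
        quiet = 0 if train_batch(schema, batch) else quiet + 1
        if quiet >= quiet_batches:
            break
    return sent, quiet >= quiet_batches
```

   In this sketch the conflict check only looks at the schema as it stood when 
the batch started. It sidesteps the cross-shard question in <5> by treating 
every pre-existing field as possibly in use once any documents are indexed.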
   
   Erick
   
   
   
   > On Sep 25, 2020, at 7:38 AM, Alexandre Rafalovitch 
<[email protected]> wrote:
   > 
   > 
   > @noblepaul I don't think your proposal is fully thought out.
   > 
   >    • This seems not orthogonal to being able to send CSV/JSON/XML/DIH to 
it, if you are proposing for it to be another one of pathVsLoaders
   >    • The whole 'I give you schema' proposal ignores the fact that 
DateParser or BoolParser URP present in guess process also needs to be present 
in whatever schema you send it to. That 'curl' command is hiding the user from 
the actual issues the guessing is supposed to help with.
   >    • Even your types "string vs text" is not something that can currently 
be guessed.
   > Can you do a counter-proposal in code that skips all the guessing and just 
shows this send/return path, but is still in the execution path and takes the 
concerns above into account? I could not find such a place.
   > 
   > @epugh Predict works for me. Explore feels a bit too general and confusing 
(more interactive/UI feel). But, in general, I am not stuck on Guess at all.
   > 
   
   

