ErickErickson commented on pull request #1863: URL: https://github.com/apache/lucene-solr/pull/1863#issuecomment-698933577
I’ve always thought “guess” was mostly a reflection of how inadequate we devs thought the whole process was. I agree it’s not very confidence-inspiring…

Off the top of my head, the idea of a new handler has a lot of appeal. It’s easy to tell a user “index as many training documents as you want to this handler to create your schema. NO DOCUMENTS WILL BE INDEXED during this stage, this is for refining your schema. When you believe you’ve sent enough training documents for Solr to have a reasonable chance of defining the correct schema, you must send the documents again, this time to the update handler after reloading the collection, to be able to search them”. Maybe “schema-trainer”?

It seems to me that there could be some efficiencies here.

1> Could we bypass reloading the collection each time? We’d have to read the schema directly and modify it. Or at least only reload it at the end of each batch if there were changes.

2> The response could indicate whether there were any changes to the schema. I can imagine a process whereby I send 10 docs and see there were changes, then send 10 more and see there were changes. Rinse/repeat until I’d been able to send N batches without any changes.

3> This still isn’t going to bullet-proof semi-structured docs. Or any other really, but it’ll at least make things far more robust. What happens if I index 1,000 Word docs, call it good, then index PDFs? Or PNGs or… anyway, I try to index some new type of document. Or do we change the update handler to put all unrecognized fields into a text field? We probably should log warnings that this is the case.

4> We’d have to throw an error if the training handler was used after there were documents in the index _and_ any existing field needed to be modified. If we did that, users could try to send docs through the training handler at any time, which would partially handle <3>.

5> One tricky bit would be how to train on a bunch of documents _after_ some docs were already indexed.
Prior to indexing any docs, we can freely modify existing fields in the schema. Back to <3>. I’m indexing all the Word docs just fine. Now I want to index PDFs and they have new fields, so I want to throw a bunch of them at the training handler. If I send them one-by-one, how do we distinguish between a field that has been defined but that no currently-indexed documents use (and so can be changed) vs. one that has documents indexed against it? This’ll bear some thought because the replica processing the training could theoretically examine the local index and see no docs using that field, but some replica of some other shard _does_ have one or more docs using the field. And in the case where implicit routing is used, or even composite keys, this gets worse. This either gets really complicated or we make a rule like “if you use the training handler after documents are indexed, you get one chance to send a batch of documents. Training will fail if any existing field needs to be modified”.

6> We’d be able to confine field guessing exclusively to the training handler.

Erick

> On Sep 25, 2020, at 7:38 AM, Alexandre Rafalovitch wrote:
>
> @noblepaul I don't think your proposal is fully thought out.
>
> • This seems not orthogonal to being able to send CSV/JSON/XML/DIH to it, if you are proposing for it to be another one of pathVsLoaders
> • The whole 'I give you schema' proposal ignores the fact that the DateParser or BoolParser URP present in the guess process also needs to be present in whatever schema you send it to. That 'curl' command is hiding the user from the actual issues the guessing is supposed to help with.
> • Even your types "string vs text" is not something that can currently be guessed.
>
> Can you do a counter-proposal in code that skips all the guessing and just shows this send/return path? But is still in the execution path to take the concerns above into account. I could not find such a place.
>
> @epugh Predict works for me.
> Explore feels a bit too general and confusing (more interactive/UI feel). But, in general, I am not stuck on Guess at all.
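
For reference, the batch-and-check loop from point 2> could look roughly like the Python sketch below. It is only an illustration under stated assumptions: no such training handler exists in Solr today, so `send_batch`, `train_until_stable`, `required_stable`, and the `FakeTrainer` stand-in are all hypothetical names, and the fake only tracks field names rather than guessing types.

```python
from typing import Any, Callable, Dict, Iterable, List

Batch = List[Dict[str, Any]]


def train_until_stable(
    batches: Iterable[Batch],
    send_batch: Callable[[Batch], bool],
    required_stable: int = 3,
) -> bool:
    """Send batches until `required_stable` consecutive ones leave the schema unchanged.

    `send_batch` is a hypothetical callable that posts one batch to the
    proposed training handler and returns True if the schema changed.
    Returns True once the schema looks stable, False if batches run out first.
    """
    stable = 0
    for batch in batches:
        changed = send_batch(batch)
        # Any schema change resets the count of consecutive quiet batches.
        stable = 0 if changed else stable + 1
        if stable >= required_stable:
            return True
    return False


class FakeTrainer:
    """In-memory stand-in for the hypothetical handler (assumption: it only
    records which field names it has seen and indexes nothing)."""

    def __init__(self) -> None:
        self.fields: set = set()

    def send(self, batch: Batch) -> bool:
        new_fields = {name for doc in batch for name in doc} - self.fields
        self.fields |= new_fields
        return bool(new_fields)


if __name__ == "__main__":
    trainer = FakeTrainer()
    sample = [
        [{"id": 1, "title": "a"}],
        [{"id": 2, "author": "b"}],
        [{"id": 3, "title": "c"}],
        [{"id": 4}],
        [{"id": 5, "author": "d"}],
    ]
    print(train_until_stable(sample, trainer.send, required_stable=3))
```

Against a real handler, `send_batch` would wrap an HTTP request and read whatever "schema changed" flag the response carried; that response format is not defined anywhere in this thread.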
