mikemccand commented on pull request #157: URL: https://github.com/apache/lucene/pull/157#issuecomment-854760044
Maybe another way to improve the correctness checking in the randomized test (or maybe in a new randomized test) would be to randomly generate a set of strings from a limited alphabet, create the minimal automaton matching only those strings (we already have a nice API to do that efficiently), call flatten, and then confirm that the resulting output graph still accepts all the original strings? I.e., flatten should only ever "generalize" (accept strings that the original machine did not) and never "remove" previously accepted strings?

But I think one missing part for such a test would be an "Automaton to TokenStream" converter, i.e. a "serializer" from an (acyclic) Automaton to a TokenStream. I think such a thing would not be too difficult to build: basically just topo sort the input graph (and throw an exception if it has cycles), then emit the transitions as tokens. The `posInc` attribute is guaranteed to never go negative because of the topo sort.

This would (separately) be a nice utility API to convert between these two things that are really nearly the same ;)
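A rough sketch of the serializer idea, in plain Python rather than against the Lucene API. Everything here is an assumption made for illustration: the automaton is a dict mapping a state to a list of `(label, dest)` arcs, each label is a whole term rather than a single character, every state is assumed reachable from the start state, and the function name `automaton_to_tokens` is hypothetical. Each state gets its topological index as its position, and each arc becomes a `(term, posInc, posLength)` triple.

```python
from collections import deque


def automaton_to_tokens(transitions, start=0):
    """Serialize an acyclic automaton into (term, posInc, posLength) triples.

    transitions: dict mapping state -> list of (label, dest) arcs.
    This is an illustrative stand-in, not Lucene's Automaton or TokenStream.
    """
    # Collect every state and count its incoming arcs.
    states = set(transitions) | {start}
    for arcs in transitions.values():
        for _, dest in arcs:
            states.add(dest)
    indegree = {s: 0 for s in states}
    for arcs in transitions.values():
        for _, dest in arcs:
            indegree[dest] += 1

    # Kahn's algorithm: repeatedly take states with no unprocessed inputs.
    roots = sorted(s for s in states if indegree[s] == 0)
    queue = deque(roots)
    order = []
    while queue:
        state = queue.popleft()
        order.append(state)
        for _, dest in transitions.get(state, ()):
            indegree[dest] -= 1
            if indegree[dest] == 0:
                queue.append(dest)

    if len(order) != len(states):
        raise ValueError("automaton has a cycle")
    if roots != [start]:
        raise ValueError("every state must be reachable from the start state")

    # A state's position is its index in the topological order.
    position = {s: i for i, s in enumerate(order)}

    tokens = []
    last = -1  # so the first token gets posInc == 1
    for src in order:
        for label, dest in transitions.get(src, ()):
            pos_inc = position[src] - last            # >= 0: sources are visited in order
            pos_length = position[dest] - position[src]  # > 0: dest sorts after src
            tokens.append((label, pos_inc, pos_length))
            last = position[src]
    return tokens


# "wifi network" and "wi fi network" sharing the same start and end states.
example = {
    0: [("wifi", 2), ("wi", 1)],
    1: [("fi", 2)],
    2: [("network", 3)],
}
tokens = automaton_to_tokens(example)
for token in tokens:
    print(token)
```

Because arcs are emitted in order of their source state's position, `posInc` cannot go negative, and because every destination sorts after its source, `posLength` is always positive. A state with no leaving arcs that lands in the middle of the order shows up as a `posInc` greater than 1 on the next token.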