glawson0 commented on pull request #157: URL: https://github.com/apache/lucene/pull/157#issuecomment-861386826
>Maybe another way to improve the checking for correctness in the randomized test (or maybe in a new randomized test) would be to randomly generate a set of strings from a limited alphabet, create the minimal automaton matching only those strings (we have a nice API to do that, efficiently, already), call flatten, and the confirm that the resulting output graph still accepts all the original strings? Are you referring to the Automoton.Builder API here? > But I think one missing part for such a test would be an "Automaton to TokenStream" converter, i.e. a "serializer" from (acyclic) Automaton to TokenStream. I think such a thing would not be too difficult to build, basically just topo sort the input graph (and throw exception if it has cycles), then emit the transitions as tokens. The `posInc` attribute is guaranteed to never go negative because of the topo sort. This would (separately) be a nice utility API to convert between these two things that are really nearly the same ;) I'm having trouble with the transition from Automoton back to TokenStream. `posInc` is easy to calculate because after a toposort it's either 1 or 1+ number of holes since last non-hole. `poslen` is giving me some trouble because it needs to know how many positions were in alternate paths. I should be able to look at each path from a node that has multiple edges, follow those paths until they converge counting token separators along the way, then take the max of those as the length to be covered. However I'm not sure how to distribute that length across paths. For example ``` --->O--->O--->O--- | | | V (Start) (End) | ^ ----->O------>O----- ``` There are 3 positions on the longest path so the total positions to be covered is 3, but on the short 2 node path I don't think it's possible to tell what should be `poslen`=1 and which should be `poslen`=2. For creating a valid token stream that can match the same as the automaton can accept it might not matter. However to reverse `tokenstreamToAutomaton` we would need to get those `poslen`s right. We would also need to set Offsets. Offsets are completely lost in the conversion from `TokenStream` to `Automaton`. We should be able to generate offsets that are valid for a `TokenStream` but may not reflect what the text originally looked like. For creating a general `AutomotonToTokenStream` I think this would be a problem, as `tokenstreamToAutomaton` -> `AutomotonToTokenStream` would end in a different `TokenStream` then one started with. For the generalize test I might have to create an `Automaton` from a `TokenStream` with and one without `FlattenGraphFilter`, then create the acceptable paths from the one without `FGF` and check if the `Automaton` with `FGF` can accept all of them. It's less elegant but would still confirm that the flattened graph was a `generalization`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org