glawson0 commented on pull request #157:
URL: https://github.com/apache/lucene/pull/157#issuecomment-861386826


   >Maybe another way to improve the checking for correctness in the randomized 
test (or maybe in a new randomized test) would be to randomly generate a set of 
strings from a limited alphabet, create the minimal automaton matching only 
those strings (we have a nice API to do that, efficiently, already), call 
flatten, and the confirm that the resulting output graph still accepts all the 
original strings?
   
   Are you referring to the Automoton.Builder API here?
   
   > But I think one missing part for such a test would be an "Automaton to 
TokenStream" converter, i.e. a "serializer" from (acyclic) Automaton to 
TokenStream. I think such a thing would not be too difficult to build, 
basically just topo sort the input graph (and throw exception if it has 
cycles), then emit the transitions as tokens. The `posInc` attribute is 
guaranteed to never go negative because of the topo sort. This would 
(separately) be a nice utility API to convert between these two things that are 
really nearly the same ;)
   
   I'm having trouble with the transition from Automoton back to TokenStream. 
`posInc` is easy to calculate because after a toposort it's either 1 or 1+ 
number of holes since last non-hole. `poslen` is giving me some trouble because 
it needs to know how many positions were in alternate paths. I should be able 
to look at each path from a node that has multiple edges, follow those paths 
until they converge counting token separators along the way, then take the max 
of those as the length to be covered. However I'm not sure how to distribute 
that length across paths.
   For example
   ```
    --->O--->O--->O---
   |                  |
   |                  V
   (Start)           (End)
   |                  ^
   ----->O------>O-----
   ```
   There are 3 positions on the longest path so the total positions to be 
covered is 3, but on the short 2 node path I don't think it's possible to tell 
what should be `poslen`=1 and which should be `poslen`=2. For creating a valid 
token stream that can match the same as the automaton can accept it might not 
matter. However to reverse `tokenstreamToAutomaton` we would need to get those 
`poslen`s right.
   We would also need to set Offsets. Offsets are completely lost in the 
conversion from `TokenStream` to `Automaton`.  We should be able to generate 
offsets that are valid for a `TokenStream` but may not reflect what the text 
originally looked like. For creating a general `AutomotonToTokenStream` I think 
this would be a problem, as `tokenstreamToAutomaton` -> 
`AutomotonToTokenStream` would end in a different `TokenStream` then one 
started with.
   
   For the generalize test I might have to create an `Automaton` from a 
`TokenStream` with and one without `FlattenGraphFilter`, then create the 
acceptable paths from the one without `FGF` and check if the `Automaton` with 
`FGF` can accept all of them. It's less elegant but would still confirm that 
the flattened graph was a `generalization`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to