ptmcg opened a new issue, #2860:
URL: https://github.com/apache/iceberg-python/issues/2860

   ### Feature Request / Improvement
   
   I'm happy to see that pyparsing is part of the pyiceberg project. I've 
attached two files that are railroad diagrams for this parser:
   - pyiceberg_diag.html - browseable railroad diagram (SVG), non-terrminal 
elements are links to subdiagrams
   - pyiceberg_parser.png - static PNG version of the same diagram
   
   Overall I think the parser is well-structured, and makes good use of the 
`infix_notation`, `one_of`, and `DelimitedList` helpers.
   
   My suggestions are largely cosmetic, but may add some parse-time performance 
improvements.
   
   Recommendations to use `set_name` are either for better diagramming, better 
exception messages, or both.
   
   ## `set_results_name()` vs. `set_name()`
   
   I'd like to point out the distinction between `set_name` and 
`set_results_name`. `set_name` is most useful to describe the expression 
itself, in abstract. `set_results_name` should be used to mark expressions with 
some contextual name, so that their parsed value can be retrieved by that name. 
 I usually think of `set_name` describes what the expression _is_, and 
`set_results_name` is how the expression _is used_.
   
   For example, these expressions should probably use `set_name`, not 
`set_results_name`:
   
   ```python
   boolean = one_of(["true", "false"], 
caseless=True).set_results_name("boolean")
   string = sgl_quoted_string.set_results_name("raw_quoted_string")
   decimal = common.real().set_results_name("decimal")
   integer = common.signed_integer().set_results_name("integer")
   literal = Group(string | decimal | integer | 
boolean).set_results_name("literal")
   literal_set = Group(
       DelimitedList(string) | DelimitedList(decimal) | DelimitedList(integer) 
| DelimitedList(boolean)
   ).set_results_name("literal_set")
   ```
   
   You'll also find that exception messages are less cryptic with more use of 
`set_name`. 
   
   ```python
   Word(nums).parse_string("NAN")
   # raises ParseException: Expected W:(0-9), found 'NAN'
   
   Word(nums).set_name("integer").parse_string("NAN")
   # raises ParseException: Expected integer, found 'NAN'
   ```
   
   There are 11 calls to set_results_name. 
   
   Many of the `set_results_name` calls are probably better as `set_name` 
calls. I try to reserve `set_results_name` for expressions that need to be 
referenced in a parse action, or other post-parsing code. "op" is a good 
example of using `set_results_name`. 
   
   "column" is an excellent example of being an exception to this guideline. It 
is used in many places, but always as the subject of some larger expression. 
"column" should be defined using both `set_results_name` and `set_name`.
   
   Lastly, you can keep your set_name and set_results_name calls to a minimum:
   - insert a call to pyparsing.autoname_elements(), which will call set_name 
on all locally defined ParserElements that have not already been given a name 
(making this call before calling the `infix_notation` function will use the 
names as part of that functions internal expression building)
   - use the `expr("results_name")` short-cut for 
`expr.set_results_name("results_name")`
   
   
   ## Collapse IS/IS NOT calls
   
   You can reduce the number of terms in your parser by merging IS and IS NOT 
expressions (fewer terms can translate into faster parsing).  Here you define 
separate `is_null` and `not_null` expressions in building null_check:
   
   ```python
   is_null = column + IS + NULL
   not_null = column + IS + NOT + NULL
   null_check = is_null | not_null
   ```
   
   I propose defining an IS_NOT expression, so that null_check can simplify 
down (and also reduce 2 parse action functions down to one):
   
   ```python
   IS_NOT = (IS + NOT).add_parse_action(lambda: "IS_NOT")
   null_check = column + (IS_NOT | IS)("op") + NULL
   
   @null_check.set_parse_action
   def _(result: ParseResults) -> BooleanExpression:
       expr_class = IsNull if result.op == "IS" else NotNull
       return expr_class(result.column)
   ```
   
   Similar treatment can be given to IN/NOT IN and LIKE/NOT LIKE.
   
   
   ## Creating the diagram
   
   I wrote this little script to build the attached diagrams:
   
   ```python
   import pyiceberg.expressions.parser as ibp
   
   ibp.boolean_expression.create_diagram("pyiceberg_parser_diag.html")
   ```
   
   Run this before making any changes, and you'll get a diagram that is 
difficult to navigate, probably even difficult to view! Make a few `set_name` 
calls (like adding `set_name("column")` to `column`) and regenerate the diagram 
and see how the structure becomes clearer. The non-terminals in the diagram are 
actually links to their corresponding sub-diagrams, so navigation in a large 
diagram is much easier. I've gone back and refactored a number of the pyparsing 
examples, after generating their diagrams.
   
   ## Other notes
   
   Not really a pyparsing note, but just offering this alternative style for 
your if-elif-else chains:
   
   ```python
   @left_ref.set_parse_action
   def _(result: ParseResults) -> BooleanExpression:
       op_classes = {
           ">": GreaterThan,
           ">=": GreaterThanOrEqual,
           "<": LessThan,
           "<=": LessThanOrEqual,
           "=": EqualTo,
           "==": EqualTo,
           "!=": NotEqualTo,
           "<>": NotEqualTo,
       }
       op_class = op_classes.get(result.op)
       if op_class is not None:
           return op_class(result.column, result.literal)
       raise ValueError(f"Unsupported operation type: {result.op!r}")
   ```
   
   I encourage people at work, when echoing erroneous values in an exception, 
to add the "!r" format marker. This encloses the value in single quotes _and_ 
expands any non-printing characters to hex notation. A big help for when the 
parser accidentally includes trailing space in the operator string, and so it 
doesn't match any of the expected patterns.
   
   ## Finally
   
   Again, your parser is fine as-is, these suggestions just may make it a bit 
easier to maintain, and even a bit faster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to