rdblue commented on code in PR #15561:
URL: https://github.com/apache/iceberg/pull/15561#discussion_r2921467152


##########
core/src/main/java/org/apache/iceberg/rest/RESTTableScan.java:
##########
@@ -85,7 +85,7 @@ class RESTTableScan extends DataTableScan {
   private final Object hadoopConf;
   private final FileIO tableIO;
   private String planId = null;
-  private FileIO fileIOForPlanId = null;
+  private FileIO scanFileIO = null;

Review Comment:
   I'm looking at the information passed around in this class more based on 
having a couple of unnecessary fields and I have a few more questions:
   
   Why does this use `Map<String, String> headers` rather than 
`Supplier<Map<String, String>> headers` that the table has? Doesn't this mean 
that the headers from the auth session are static? How does credential refresh 
work if the authentication header is stale?
   
   Was the `TableOperations` field originally used? It isn't used now, other 
than to pass it to refined scans. I think it should be removed from both the 
constructor and the fields.
   
   This passes `ResourcePaths` and `TableIdentifier` to build 2 paths that 
could be passed in a 1 path using the plan ID. I think this should be 
reconsidered. The plan and fetch endpoints can be passed in as simple strings, 
and it would simplify this to use a `Function<String, String>` to construct the 
plan path with ID. Another option is to create an object similar to 
`ResourcePaths` that is specific to a table (hides `TableIdentifier`) and use 
that. Either one would be cleaner.
   
   This also passes `supportedEndpoints`, which spreads out the logic for how 
to handle endpoints that aren't supported. These endpoints are also checked 
when needed, so this code will create a plan request and could then fail to 
fetch tasks if the fetch endpoint isn't supported. It also throws error 
messages that are not helpful, like "Server does not support endpoint: GET 
/v1/{prefix}/namespaces/{namespace}/tables/{table}/plan/{plan-id}" instead of 
"Invalid status: submitted (service does not support async planning)" that 
would be more helpful.
   
   A cleaner way to handle endpoints is to verify required endpoints before 
creating this scan. Both the plan and fetch endpoints should be required. Then 
the optional endpoint should be booleans, like `supportsAsync` and 
`supportsCancel`.
   
   Last, this leaks catalog properties and the Hadoop conf so that 
`CatalogUtil.loadFileIO` can be called with `List<Credential>`. Why not pass a 
function to create the `FileIO` so that the properties and conf are contained 
in the REST table operations?
   
   I think this works right now and these aren't blockers (other than the 
potential auth session issue), but I would really like to see this class 
simplified by reducing the number of things that have to be passed to it and 
remove some of the things that are done here, like handling endpoint checks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to