aokolnychyi commented on code in PR #6655:
URL: https://github.com/apache/iceberg/pull/6655#discussion_r1095169075
########## core/src/main/java/org/apache/iceberg/hadoop/Util.java: ##########
@@ -84,10 +88,38 @@ public static String[] blockLocations(FileIO io, ScanTaskGroup<?> taskGroup) {
     return locations.toArray(HadoopInputFile.NO_LOCATION_PREFERENCE);
   }
 
+  public static boolean usesHadoopFileIO(FileIO io, String location) {
+    if (io instanceof HadoopFileIO) {
+      return true;
+
+    } else if (io instanceof ResolvingFileIO) {
+      ResolvingFileIO resolvingFileIO = (ResolvingFileIO) io;
+      return resolvingFileIO.ioClass(location).isAssignableFrom(HadoopFileIO.class);
+
+    } else {
+      return false;
+    }
+  }
+
+  public static boolean mayHaveBlockLocations(FileIO io, String location) {

Review Comment:
   I'd add the new methods after the `blockLocations` method below. The new code splits apart the two `blockLocations` methods, which are related; it would be easier to have them next to each other. Otherwise, one has to jump around the file.

########## core/src/main/java/org/apache/iceberg/hadoop/Util.java: ##########
@@ -84,10 +88,38 @@ public static String[] blockLocations(FileIO io, ScanTaskGroup<?> taskGroup) {
     return locations.toArray(HadoopInputFile.NO_LOCATION_PREFERENCE);
   }
 
+  public static boolean usesHadoopFileIO(FileIO io, String location) {
+    if (io instanceof HadoopFileIO) {
+      return true;
+
+    } else if (io instanceof ResolvingFileIO) {
+      ResolvingFileIO resolvingFileIO = (ResolvingFileIO) io;
+      return resolvingFileIO.ioClass(location).isAssignableFrom(HadoopFileIO.class);
+
+    } else {
+      return false;
+    }
+  }
+
+  public static boolean mayHaveBlockLocations(FileIO io, String location) {
+    if (usesHadoopFileIO(io, location)) {
+      InputFile inputFile = io.newInputFile(location);
+      if (inputFile instanceof HadoopInputFile) {
+        String scheme = ((HadoopInputFile) inputFile).getFileSystem().getScheme();
+        return LOCALITY_WHITELIST_FS.contains(scheme);
+
+      } else {
+        return false;
+      }
+    }
+
+    return false;
+  }
+
   private static String[] blockLocations(FileIO io, ContentScanTask<?> task) {
-    InputFile inputFile = io.newInputFile(task.file().path().toString());
-    if (inputFile instanceof HadoopInputFile) {
-      HadoopInputFile hadoopInputFile = (HadoopInputFile) inputFile;
+    String location = task.file().path().toString();
+    if (usesHadoopFileIO(io, location)) {
+      HadoopInputFile hadoopInputFile = (HadoopInputFile) io.newInputFile(location);

Review Comment:
   Should we keep the `instanceof` check here, to stay consistent with `mayHaveBlockLocations` and the old code?
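   To make the concern concrete, here is a minimal, self-contained sketch (plain JDK types, not the Iceberg classes) of why guarding a downcast with `instanceof` degrades gracefully, while an unconditional cast fails at runtime as soon as the type assumption does not hold:
   ```
   // Illustration only; Integer/Number stand in for HadoopInputFile/InputFile.
   public class InstanceofGuardDemo {

     // Guarded downcast: falls back when the runtime type is not the expected one.
     static String describe(Object value) {
       if (value instanceof Integer) {
         return "int: " + ((Integer) value);
       }
       return "not an int";
     }

     public static void main(String[] args) {
       System.out.println(describe(42));       // int: 42
       System.out.println(describe("hello"));  // not an int

       // Unconditional downcast: the same "wrong type" input now fails at runtime.
       try {
         Integer forced = (Integer) (Object) "hello";
         System.out.println(forced);
       } catch (ClassCastException e) {
         System.out.println("ClassCastException: " + e.getMessage());
       }
     }
   }
   ```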
########## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java: ##########
@@ -67,14 +62,13 @@ public boolean caseSensitive() {
   }
 
   public boolean localityEnabled() {
-    if (table.io() instanceof HadoopFileIO) {
-      HadoopInputFile file = (HadoopInputFile) table.io().newInputFile(table.location());
-      String scheme = file.getFileSystem().getScheme();
-      boolean defaultValue = LOCALITY_WHITELIST_FS.contains(scheme);
+    if (Util.usesHadoopFileIO(table.io(), table.location())) {

Review Comment:
   Can this be simplified like this?
   ```
   boolean defaultValue = Util.mayHaveBlockLocations(table.io(), table.location());
   return PropertyUtil.propertyAsBoolean(readOptions, SparkReadOptions.LOCALITY, defaultValue);
   ```
   (There is a short note at the end of this message on how the explicit read option interacts with this default.)

########## core/src/main/java/org/apache/iceberg/hadoop/Util.java: ##########
@@ -84,10 +88,38 @@ public static String[] blockLocations(FileIO io, ScanTaskGroup<?> taskGroup) {
     return locations.toArray(HadoopInputFile.NO_LOCATION_PREFERENCE);
   }
 
+  public static boolean usesHadoopFileIO(FileIO io, String location) {
+    if (io instanceof HadoopFileIO) {
+      return true;
+
+    } else if (io instanceof ResolvingFileIO) {
+      ResolvingFileIO resolvingFileIO = (ResolvingFileIO) io;
+      return resolvingFileIO.ioClass(location).isAssignableFrom(HadoopFileIO.class);

Review Comment:
   Shouldn't this be the other way around?
   ```
   HadoopFileIO.class.isAssignableFrom(resolvingFileIO.ioClass(location));
   ```
   https://stackoverflow.com/a/6983596/4108401
   (A small stand-alone demo of the direction is at the end of this message.)

########## core/src/main/java/org/apache/iceberg/hadoop/Util.java: ##########
@@ -84,10 +88,38 @@ public static String[] blockLocations(FileIO io, ScanTaskGroup<?> taskGroup) {
     return locations.toArray(HadoopInputFile.NO_LOCATION_PREFERENCE);
   }
 
+  public static boolean usesHadoopFileIO(FileIO io, String location) {

Review Comment:
   I think this can be private, and we can simply use `mayHaveBlockLocations` in Spark.
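Note on the `localityEnabled()` suggestion above: a small, hedged sketch of how the computed default interacts with an explicitly set read option. It assumes the existing `PropertyUtil.propertyAsBoolean(Map, String, boolean)` helper, and the literal `"locality"` below is only a stand-in for `SparkReadOptions.LOCALITY`.
```
import java.util.Map;
import org.apache.iceberg.util.PropertyUtil;

// Sketch only: an explicitly set read option still wins over the default
// computed from Util.mayHaveBlockLocations(...).
public class LocalityDefaultDemo {
  public static void main(String[] args) {
    boolean defaultValue = true;  // pretend mayHaveBlockLocations(...) returned true

    Map<String, String> noOptions = Map.of();
    Map<String, String> explicitOff = Map.of("locality", "false");

    // No explicit option set -> the computed default is used.
    System.out.println(PropertyUtil.propertyAsBoolean(noOptions, "locality", defaultValue));    // true

    // Explicit option set -> it overrides the computed default.
    System.out.println(PropertyUtil.propertyAsBoolean(explicitOff, "locality", defaultValue));  // false
  }
}
```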
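And to make the `isAssignableFrom` direction above concrete, a minimal stand-alone sketch (plain JDK types, not the Iceberg classes). `A.class.isAssignableFrom(B.class)` asks whether `B` is the same as, or a subtype of, `A`; so the check that asks "is the resolved IO class a `HadoopFileIO`?" is `HadoopFileIO.class.isAssignableFrom(resolvingFileIO.ioClass(location))`.
```
// Demo of the direction: Number plays the role of HadoopFileIO,
// Integer the role of the class returned by resolvingFileIO.ioClass(location).
public class IsAssignableFromDemo {
  public static void main(String[] args) {
    // "Is Integer a Number (or a subtype)?" -> true: the direction we want.
    System.out.println(Number.class.isAssignableFrom(Integer.class));  // true

    // "Is Number an Integer (or a subtype)?" -> false: the reversed check.
    System.out.println(Integer.class.isAssignableFrom(Number.class));  // false
  }
}
```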