aokolnychyi commented on code in PR #6655:
URL: https://github.com/apache/iceberg/pull/6655#discussion_r1095169075
########## core/src/main/java/org/apache/iceberg/hadoop/Util.java: ##########
@@ -84,10 +88,38 @@ public static String[] blockLocations(FileIO io, ScanTaskGroup<?> taskGroup) {
     return locations.toArray(HadoopInputFile.NO_LOCATION_PREFERENCE);
   }
 
+  public static boolean usesHadoopFileIO(FileIO io, String location) {
+    if (io instanceof HadoopFileIO) {
+      return true;
+
+    } else if (io instanceof ResolvingFileIO) {
+      ResolvingFileIO resolvingFileIO = (ResolvingFileIO) io;
+      return resolvingFileIO.ioClass(location).isAssignableFrom(HadoopFileIO.class);
+
+    } else {
+      return false;
+    }
+  }
+
+  public static boolean mayHaveBlockLocations(FileIO io, String location) {

Review Comment:
   I'd add the new methods after the `blockLocations` method below. The new code splits apart the two `blockLocations` methods, which are related; it would be easier to have them next to each other. Otherwise, one has to jump around the file.

########## core/src/main/java/org/apache/iceberg/hadoop/Util.java: ##########
@@ -84,10 +88,38 @@ public static String[] blockLocations(FileIO io, ScanTaskGroup<?> taskGroup) {
     return locations.toArray(HadoopInputFile.NO_LOCATION_PREFERENCE);
   }
 
+  public static boolean usesHadoopFileIO(FileIO io, String location) {
+    if (io instanceof HadoopFileIO) {
+      return true;
+
+    } else if (io instanceof ResolvingFileIO) {
+      ResolvingFileIO resolvingFileIO = (ResolvingFileIO) io;
+      return resolvingFileIO.ioClass(location).isAssignableFrom(HadoopFileIO.class);
+
+    } else {
+      return false;
+    }
+  }
+
+  public static boolean mayHaveBlockLocations(FileIO io, String location) {
+    if (usesHadoopFileIO(io, location)) {
+      InputFile inputFile = io.newInputFile(location);
+      if (inputFile instanceof HadoopInputFile) {
+        String scheme = ((HadoopInputFile) inputFile).getFileSystem().getScheme();
+        return LOCALITY_WHITELIST_FS.contains(scheme);
+
+      } else {
+        return false;
+      }
+    }
+
+    return false;
+  }
+
   private static String[] blockLocations(FileIO io, ContentScanTask<?> task) {
-    InputFile inputFile = io.newInputFile(task.file().path().toString());
-    if (inputFile instanceof HadoopInputFile) {
-      HadoopInputFile hadoopInputFile = (HadoopInputFile) inputFile;
+    String location = task.file().path().toString();
+    if (usesHadoopFileIO(io, location)) {
+      HadoopInputFile hadoopInputFile = (HadoopInputFile) io.newInputFile(location);

Review Comment:
   Should we keep the `instanceof` check here, to stay consistent with `mayHaveBlockLocations` and the old code?
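   To make the concern concrete, here is a minimal, self-contained sketch (plain JDK types, not the Iceberg classes) of why guarding a downcast with `instanceof` degrades gracefully, while an unconditional cast fails at runtime as soon as the type assumption does not hold:
   ```
   // Illustration only; Integer/Number stand in for HadoopInputFile/InputFile.
   public class InstanceofGuardDemo {

     // Guarded downcast: falls back when the runtime type is not the expected one.
     static String describe(Object value) {
       if (value instanceof Integer) {
         return "int: " + ((Integer) value);
       }
       return "not an int";
     }

     public static void main(String[] args) {
       System.out.println(describe(42));       // int: 42
       System.out.println(describe("hello"));  // not an int

       // Unconditional downcast: the same "wrong type" input now fails at runtime.
       try {
         Integer forced = (Integer) (Object) "hello";
         System.out.println(forced);
       } catch (ClassCastException e) {
         System.out.println("ClassCastException: " + e.getMessage());
       }
     }
   }
   ```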
########## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java: ##########
@@ -67,14 +62,13 @@ public boolean caseSensitive() {
   }
 
   public boolean localityEnabled() {
-    if (table.io() instanceof HadoopFileIO) {
-      HadoopInputFile file = (HadoopInputFile) table.io().newInputFile(table.location());
-      String scheme = file.getFileSystem().getScheme();
-      boolean defaultValue = LOCALITY_WHITELIST_FS.contains(scheme);
+    if (Util.usesHadoopFileIO(table.io(), table.location())) {

Review Comment:
   Can this be simplified like this?
   ```
   boolean defaultValue = Util.mayHaveBlockLocations(table.io(), table.location());
   return PropertyUtil.propertyAsBoolean(readOptions, SparkReadOptions.LOCALITY, defaultValue);
   ```
   (There is a short note at the end of this message on how the explicit read option interacts with this default.)

########## core/src/main/java/org/apache/iceberg/hadoop/Util.java: ##########
@@ -84,10 +88,38 @@ public static String[] blockLocations(FileIO io, ScanTaskGroup<?> taskGroup) {
     return locations.toArray(HadoopInputFile.NO_LOCATION_PREFERENCE);
   }
 
+  public static boolean usesHadoopFileIO(FileIO io, String location) {
+    if (io instanceof HadoopFileIO) {
+      return true;
+
+    } else if (io instanceof ResolvingFileIO) {
+      ResolvingFileIO resolvingFileIO = (ResolvingFileIO) io;
+      return resolvingFileIO.ioClass(location).isAssignableFrom(HadoopFileIO.class);

Review Comment:
   Shouldn't this be the other way around?
   ```
   HadoopFileIO.class.isAssignableFrom(resolvingFileIO.ioClass(location));
   ```
   https://stackoverflow.com/a/6983596/4108401
   (A small stand-alone demo of the direction is at the end of this message.)

########## core/src/main/java/org/apache/iceberg/hadoop/Util.java: ##########
@@ -84,10 +88,38 @@ public static String[] blockLocations(FileIO io, ScanTaskGroup<?> taskGroup) {
     return locations.toArray(HadoopInputFile.NO_LOCATION_PREFERENCE);
   }
 
+  public static boolean usesHadoopFileIO(FileIO io, String location) {

Review Comment:
   I think this can be private, and we can simply use `mayHaveBlockLocations` in Spark.
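Note on the `localityEnabled()` suggestion above: a small, hedged sketch of how the computed default interacts with an explicitly set read option. It assumes the existing `PropertyUtil.propertyAsBoolean(Map, String, boolean)` helper, and the literal `"locality"` below is only a stand-in for `SparkReadOptions.LOCALITY`.
```
import java.util.Map;
import org.apache.iceberg.util.PropertyUtil;

// Sketch only: an explicitly set read option still wins over the default
// computed from Util.mayHaveBlockLocations(...).
public class LocalityDefaultDemo {
  public static void main(String[] args) {
    boolean defaultValue = true;  // pretend mayHaveBlockLocations(...) returned true

    Map<String, String> noOptions = Map.of();
    Map<String, String> explicitOff = Map.of("locality", "false");

    // No explicit option set -> the computed default is used.
    System.out.println(PropertyUtil.propertyAsBoolean(noOptions, "locality", defaultValue));    // true

    // Explicit option set -> it overrides the computed default.
    System.out.println(PropertyUtil.propertyAsBoolean(explicitOff, "locality", defaultValue));  // false
  }
}
```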
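And to make the `isAssignableFrom` direction above concrete, a minimal stand-alone sketch (plain JDK types, not the Iceberg classes). `A.class.isAssignableFrom(B.class)` asks whether `B` is the same as, or a subtype of, `A`; so the check that asks "is the resolved IO class a `HadoopFileIO`?" is `HadoopFileIO.class.isAssignableFrom(resolvingFileIO.ioClass(location))`.
```
// Demo of the direction: Number plays the role of HadoopFileIO,
// Integer the role of the class returned by resolvingFileIO.ioClass(location).
public class IsAssignableFromDemo {
  public static void main(String[] args) {
    // "Is Integer a Number (or a subtype)?" -> true: the direction we want.
    System.out.println(Number.class.isAssignableFrom(Integer.class));  // true

    // "Is Number an Integer (or a subtype)?" -> false: the reversed check.
    System.out.println(Integer.class.isAssignableFrom(Number.class));  // false
  }
}
```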