Re: [PR] [Improvement](Iceberg) Optimize LocationPath.of performance for Iceberg table queries [doris]

via GitHub Mon, 22 Dec 2025 22:46:35 -0800


Copilot commented on code in PR #59217:
URL: https://github.com/apache/doris/pull/59217#discussion_r2642166920



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java:
##########
@@ -363,9 +375,83 @@ private CloseableIterable<FileScanTask> 
planFileScanTask(TableScan scan) {
         return TableScanUtil.splitFiles(scan.planFiles(), targetSplitSize);
     }
 
+    /**
+     * Initialize cached values for LocationPath creation on first use.
+     * This avoids repeated StorageProperties lookup, scheme parsing, and 
S3URI regex parsing for each file.
+     */
+    private void initLocationPathCache(String samplePath) {
+        if (locationPathCacheInitialized) {
+            return;
+        }
+        synchronized (this) {
+            if (locationPathCacheInitialized) {
+                return;
+            }
+            try {
+                // Create a LocationPath using the full method to get all 
cached values
+                LocationPath sampleLocationPath = LocationPath.of(samplePath, 
storagePropertiesMap);
+                cachedStorageProperties = 
sampleLocationPath.getStorageProperties();
+                cachedSchema = sampleLocationPath.getSchema();
+                cachedFsIdentifier = sampleLocationPath.getFsIdentifier();
+
+                // Extract fsIdPrefix like "s3://" from fsIdentifier like 
"s3://bucket"
+                int schemeEnd = cachedFsIdentifier.indexOf("://");
+                if (schemeEnd > 0) {
+                    cachedFsIdPrefix = cachedFsIdentifier.substring(0, 
schemeEnd + 3);
+                }
+
+                // Cache path prefix mapping for fast transformation
+                // This allows subsequent files to skip S3URI regex parsing 
entirely
+                String normalizedPath = 
sampleLocationPath.getNormalizedLocation();
+
+                // Find the common prefix by looking for the last '/' before 
the filename
+                int lastSlashInOriginal = samplePath.lastIndexOf('/');
+                int lastSlashInNormalized = normalizedPath.lastIndexOf('/');
+
+                if (lastSlashInOriginal > 0 && lastSlashInNormalized > 0) {
+                    cachedOriginalPathPrefix = samplePath.substring(0, 
lastSlashInOriginal + 1);
+                    cachedNormalizedPathPrefix = normalizedPath.substring(0, 
lastSlashInNormalized + 1);
+                }
+
+                locationPathCacheInitialized = true;
+            } catch (Exception e) {
+                // If caching fails, we'll fall back to the full method each 
time
+                LOG.warn("Failed to initialize LocationPath cache, will use 
full parsing", e);
+                locationPathCacheInitialized = true;
+            }
+        }
+    }
+
+    /**
+     * Create a LocationPath with cached values for better performance.
+     * Uses cached path prefix mapping to completely bypass S3URI regex 
parsing for most files.
+     * Falls back to full parsing if cache is not available or path doesn't 
match cached prefix.
+     */
+    private LocationPath createLocationPathWithCache(String path) {
+        // Initialize cache on first call
+        if (!locationPathCacheInitialized) {
+            initLocationPathCache(path);
+        }
+
+        // Fast path: if path starts with cached original prefix, directly 
transform without any parsing
+        if (cachedOriginalPathPrefix != null && 
path.startsWith(cachedOriginalPathPrefix)) {
+            // Transform: replace original prefix with normalized prefix
+            String normalizedPath = cachedNormalizedPathPrefix + 
path.substring(cachedOriginalPathPrefix.length());
+            return LocationPath.ofDirect(normalizedPath, cachedSchema, 
cachedFsIdentifier, cachedStorageProperties);
+        }

Review Comment:
   This method assumes that files with the same prefix will always have the 
same normalized path prefix. However, this may not be correct in all cases, 
particularly when files in different subdirectories could have different 
storage properties or when path normalization depends on the full path rather 
than just the prefix.
   
   For example, if different files require different normalization rules based 
on their full paths (not just the prefix), using the cached prefix mapping 
could produce incorrect results. Consider validating that the assumption holds 
for all files in the table, or adding a comment documenting this assumption.



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java:
##########
@@ -116,6 +116,18 @@ public class IcebergScanNode extends FileQueryScanNode {
     private Map<StorageProperties.Type, StorageProperties> 
storagePropertiesMap;
     private Map<String, String> backendStorageProperties;
 
+    // Cached values for LocationPath creation optimization
+    // These are lazily initialized on first use to avoid parsing overhead for 
each file
+    private volatile StorageProperties cachedStorageProperties;
+    private volatile String cachedSchema;
+    private volatile String cachedFsIdPrefix;
+    private volatile boolean locationPathCacheInitialized = false;
+    // Cache for path prefix transformation to avoid repeated S3URI parsing
+    // Maps original path prefix (e.g., "https://bucket.s3.amazonaws.com/";) to 
normalized prefix (e.g., "s3://bucket/")
+    private volatile String cachedOriginalPathPrefix;
+    private volatile String cachedNormalizedPathPrefix;
+    private volatile String cachedFsIdentifier;

Review Comment:
   The volatile fields in this caching implementation may have visibility 
issues in multi-threaded contexts. While `locationPathCacheInitialized` is 
checked with double-checked locking, the other cached fields 
(cachedOriginalPathPrefix, cachedNormalizedPathPrefix, etc.) may not be 
properly visible to other threads even after `locationPathCacheInitialized` is 
set to true.
   
   The issue is that while the synchronized block ensures that the 
initialization completes atomically, the writes to the non-volatile cached 
fields inside the synchronized block may not be visible to other threads 
reading them outside the synchronized block in `createLocationPathWithCache`.
   
   Consider either:
   1. Making all cached fields (cachedOriginalPathPrefix, 
cachedNormalizedPathPrefix, cachedFsIdentifier) volatile as well
   2. Or using a single atomic reference containing all cached values as an 
immutable object



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/property/storage/S3PropertyUtils.java:
##########
@@ -132,16 +142,50 @@ public static String validateAndNormalizeUri(String path,
         if (StringUtils.isBlank(path)) {
             throw new StoragePropertiesException("path is null");
         }
-        if (path.startsWith("s3://")) {
+
+        // Fast path 1: s3:// paths are already in the normalized format 
expected by BE
+        if (path.startsWith(S3_SCHEME_PREFIX)) {
             return path;
         }
 
+        // Fast path 2: simple S3-compatible schemes (oss://, cos://, s3a://, 
etc.)
+        // can be converted with simple string replacement: 
scheme://bucket/key -> s3://bucket/key
+        String normalized = trySimpleSchemeConversion(path);
+        if (normalized != null) {
+            return normalized;
+        }
+
+        // Full parsing path: for HTTP URLs and other complex formats
         boolean usePathStyle = Boolean.parseBoolean(stringUsePathStyle);
         boolean forceParsingByStandardUri = 
Boolean.parseBoolean(stringForceParsingByStandardUri);
         S3URI s3uri = S3URI.create(path, usePathStyle, 
forceParsingByStandardUri);
         return "s3" + S3URI.SCHEME_DELIM + s3uri.getBucket() + 
S3URI.PATH_DELIM + s3uri.getKey();
     }
 
+    /**
+     * Try to convert simple S3-compatible scheme URIs to s3:// format using 
string replacement.
+     * This avoids expensive regex parsing for common cases like 
oss://bucket/key, s3a://bucket/key, etc.
+     *
+     * @param path the input path
+     * @return converted s3:// path if successful, null if the path doesn't 
match simple pattern
+     */
+    private static String trySimpleSchemeConversion(String path) {
+        int delimIndex = path.indexOf(SCHEME_DELIM);
+        if (delimIndex <= 0) {
+            return null;
+        }
+
+        String scheme = path.substring(0, delimIndex).toLowerCase();
+        for (String compatibleScheme : SIMPLE_S3_COMPATIBLE_SCHEMES) {
+            if (compatibleScheme.equals(scheme)) {
+                // Simple conversion: replace scheme with "s3"
+                // e.g., "oss://bucket/key" -> "s3://bucket/key"
+                return S3_SCHEME_PREFIX + path.substring(delimIndex + 
SCHEME_DELIM.length());
+            }
+        }
+        return null;
+    }

Review Comment:
   The new `trySimpleSchemeConversion` method and the fast path optimization 
for S3-compatible schemes (oss://, cos://, s3a://, etc.) lack test coverage. 
While existing tests cover the basic validateAndNormalizeUri functionality, 
they don't specifically test the new optimization path.
   
   Consider adding test cases that verify:
   1. Conversion of each S3-compatible scheme listed in 
SIMPLE_S3_COMPATIBLE_SCHEMES (e.g., "oss://bucket/key" -> "s3://bucket/key")
   2. Case sensitivity handling (e.g., "OSS://bucket/key" vs "oss://bucket/key")
   3. Edge cases like missing bucket/key after the scheme
   4. Paths that should fall back to full S3URI parsing



##########
fe/fe-core/src/main/java/org/apache/doris/common/util/LocationPath.java:
##########
@@ -201,6 +199,65 @@ public static LocationPath of(String location,
         }
     }
 
+    /**
+     * Ultra-fast factory method that directly constructs LocationPath without 
any parsing.
+     * This is used when the normalized location is already known (e.g., from 
prefix transformation).
+     *
+     * @param normalizedLocation the already-normalized location string
+     * @param schema             pre-computed schema
+     * @param fsIdentifier       pre-computed filesystem identifier
+     * @param storageProperties  the storage properties (can be null)
+     * @return a new LocationPath instance
+     */
+    public static LocationPath ofDirect(String normalizedLocation,
+                                        String schema,
+                                        String fsIdentifier,
+                                        StorageProperties storageProperties) {
+        return new LocationPath(schema, normalizedLocation, fsIdentifier, 
storageProperties);
+    }
+
+    /**
+     * Fast factory method that reuses pre-computed schema and fsIdentifier.
+     * This is optimized for batch processing where many files share the same 
bucket/prefix.
+     *
+     * @param location           the input URI location string
+     * @param storageProperties  pre-computed storage properties for 
normalization
+     * @param cachedSchema       pre-computed schema (can be null to compute)
+     * @param cachedFsIdPrefix   pre-computed fsIdentifier prefix like "s3://" 
(can be null to compute)
+     * @return a new LocationPath instance
+     */
+    public static LocationPath ofWithCache(String location,
+                                           StorageProperties storageProperties,
+                                           String cachedSchema,
+                                           String cachedFsIdPrefix) {
+        try {
+            String normalizedLocation = 
storageProperties.validateAndNormalizeUri(location);
+
+            String fsIdentifier;
+            if (cachedFsIdPrefix != null && 
normalizedLocation.startsWith(cachedFsIdPrefix)) {
+                // Fast path: extract authority from normalized location 
without full URI parsing
+                int authorityStart = cachedFsIdPrefix.length();
+                int authorityEnd = normalizedLocation.indexOf('/', 
authorityStart);
+                if (authorityEnd == -1) {
+                    authorityEnd = normalizedLocation.length();
+                }
+                String authority = 
normalizedLocation.substring(authorityStart, authorityEnd);
+                fsIdentifier = cachedFsIdPrefix + authority;
+            } else {
+                // Fallback to full URI parsing
+                String encodedLocation = encodedLocation(normalizedLocation);
+                URI uri = URI.create(encodedLocation);
+                fsIdentifier = Strings.nullToEmpty(uri.getScheme()) + "://"
+                        + Strings.nullToEmpty(uri.getAuthority());
+            }
+
+            String schema = cachedSchema != null ? cachedSchema : 
extractScheme(location);
+            return new LocationPath(schema, normalizedLocation, fsIdentifier, 
storageProperties);
+        } catch (UserException e) {
+            throw new StoragePropertiesException("Failed to create 
LocationPath for location: " + location, e);
+        }
+    }

Review Comment:
   The new methods `LocationPath.ofDirect` and `LocationPath.ofWithCache` lack 
test coverage. While the existing LocationPathTest has comprehensive coverage 
for the original `LocationPath.of` method, the new optimization methods 
introduced in this PR are not tested.
   
   Consider adding test cases that:
   1. Verify `ofDirect` correctly constructs LocationPath with pre-computed 
values
   2. Test `ofWithCache` with various scenarios (cached fsIdPrefix present vs 
absent)
   3. Validate that the fast path produces the same results as the full parsing 
path
   4. Test edge cases like paths without authority or malformed normalized 
locations



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/property/storage/S3PropertyUtils.java:
##########
@@ -132,16 +142,50 @@ public static String validateAndNormalizeUri(String path,
         if (StringUtils.isBlank(path)) {
             throw new StoragePropertiesException("path is null");
         }
-        if (path.startsWith("s3://")) {
+
+        // Fast path 1: s3:// paths are already in the normalized format 
expected by BE
+        if (path.startsWith(S3_SCHEME_PREFIX)) {
             return path;
         }
 
+        // Fast path 2: simple S3-compatible schemes (oss://, cos://, s3a://, 
etc.)
+        // can be converted with simple string replacement: 
scheme://bucket/key -> s3://bucket/key
+        String normalized = trySimpleSchemeConversion(path);
+        if (normalized != null) {
+            return normalized;
+        }
+
+        // Full parsing path: for HTTP URLs and other complex formats
         boolean usePathStyle = Boolean.parseBoolean(stringUsePathStyle);
         boolean forceParsingByStandardUri = 
Boolean.parseBoolean(stringForceParsingByStandardUri);
         S3URI s3uri = S3URI.create(path, usePathStyle, 
forceParsingByStandardUri);
         return "s3" + S3URI.SCHEME_DELIM + s3uri.getBucket() + 
S3URI.PATH_DELIM + s3uri.getKey();
     }
 
+    /**
+     * Try to convert simple S3-compatible scheme URIs to s3:// format using 
string replacement.
+     * This avoids expensive regex parsing for common cases like 
oss://bucket/key, s3a://bucket/key, etc.
+     *
+     * @param path the input path
+     * @return converted s3:// path if successful, null if the path doesn't 
match simple pattern
+     */
+    private static String trySimpleSchemeConversion(String path) {
+        int delimIndex = path.indexOf(SCHEME_DELIM);
+        if (delimIndex <= 0) {
+            return null;
+        }
+
+        String scheme = path.substring(0, delimIndex).toLowerCase();
+        for (String compatibleScheme : SIMPLE_S3_COMPATIBLE_SCHEMES) {
+            if (compatibleScheme.equals(scheme)) {
+                // Simple conversion: replace scheme with "s3"
+                // e.g., "oss://bucket/key" -> "s3://bucket/key"
+                return S3_SCHEME_PREFIX + path.substring(delimIndex + 
SCHEME_DELIM.length());
+            }
+        }
+        return null;
+    }

Review Comment:
   The simple scheme conversion in `trySimpleSchemeConversion` doesn't validate 
that the path after the scheme delimiter is actually a valid S3-style path 
(bucket/key format). It performs a blind string replacement that could produce 
invalid results for malformed URIs.
   
   For example, paths like "oss://http://example.com/path"; or "s3a://" (missing 
bucket/key) would be converted to "s3://http://example.com/path"; or "s3://" 
without validation. While these edge cases are unlikely in production, consider 
adding basic validation (e.g., checking that there's content after "://") to 
ensure the converted path is well-formed.



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java:
##########
@@ -363,9 +375,83 @@ private CloseableIterable<FileScanTask> 
planFileScanTask(TableScan scan) {
         return TableScanUtil.splitFiles(scan.planFiles(), targetSplitSize);
     }
 
+    /**
+     * Initialize cached values for LocationPath creation on first use.
+     * This avoids repeated StorageProperties lookup, scheme parsing, and 
S3URI regex parsing for each file.
+     */
+    private void initLocationPathCache(String samplePath) {
+        if (locationPathCacheInitialized) {
+            return;
+        }
+        synchronized (this) {
+            if (locationPathCacheInitialized) {
+                return;
+            }
+            try {
+                // Create a LocationPath using the full method to get all 
cached values
+                LocationPath sampleLocationPath = LocationPath.of(samplePath, 
storagePropertiesMap);
+                cachedStorageProperties = 
sampleLocationPath.getStorageProperties();
+                cachedSchema = sampleLocationPath.getSchema();
+                cachedFsIdentifier = sampleLocationPath.getFsIdentifier();
+
+                // Extract fsIdPrefix like "s3://" from fsIdentifier like 
"s3://bucket"
+                int schemeEnd = cachedFsIdentifier.indexOf("://");
+                if (schemeEnd > 0) {
+                    cachedFsIdPrefix = cachedFsIdentifier.substring(0, 
schemeEnd + 3);
+                }
+
+                // Cache path prefix mapping for fast transformation
+                // This allows subsequent files to skip S3URI regex parsing 
entirely
+                String normalizedPath = 
sampleLocationPath.getNormalizedLocation();
+
+                // Find the common prefix by looking for the last '/' before 
the filename
+                int lastSlashInOriginal = samplePath.lastIndexOf('/');
+                int lastSlashInNormalized = normalizedPath.lastIndexOf('/');
+
+                if (lastSlashInOriginal > 0 && lastSlashInNormalized > 0) {
+                    cachedOriginalPathPrefix = samplePath.substring(0, 
lastSlashInOriginal + 1);
+                    cachedNormalizedPathPrefix = normalizedPath.substring(0, 
lastSlashInNormalized + 1);
+                }
+
+                locationPathCacheInitialized = true;
+            } catch (Exception e) {
+                // If caching fails, we'll fall back to the full method each 
time
+                LOG.warn("Failed to initialize LocationPath cache, will use 
full parsing", e);
+                locationPathCacheInitialized = true;

Review Comment:
   Setting `locationPathCacheInitialized` to true even when an exception occurs 
prevents the cache from being initialized on subsequent attempts. This means 
that if initialization fails once, all subsequent file processing will skip the 
cache optimization entirely and fall back to the full parsing path indefinitely.
   
   Consider either:
   1. Not setting `locationPathCacheInitialized` to true in the catch block, 
allowing retry on the next file
   2. Or using a separate flag to track initialization failure and avoid 
repeated initialization attempts while still allowing fallback logic
   ```suggestion
                   // Do not mark the cache as initialized on failure; allow 
future retries.
   ```



##########
fe/fe-core/src/main/java/org/apache/doris/common/util/LocationPath.java:
##########
@@ -201,6 +199,65 @@ public static LocationPath of(String location,
         }
     }
 
+    /**
+     * Ultra-fast factory method that directly constructs LocationPath without 
any parsing.
+     * This is used when the normalized location is already known (e.g., from 
prefix transformation).
+     *
+     * @param normalizedLocation the already-normalized location string
+     * @param schema             pre-computed schema
+     * @param fsIdentifier       pre-computed filesystem identifier
+     * @param storageProperties  the storage properties (can be null)
+     * @return a new LocationPath instance
+     */
+    public static LocationPath ofDirect(String normalizedLocation,
+                                        String schema,
+                                        String fsIdentifier,
+                                        StorageProperties storageProperties) {
+        return new LocationPath(schema, normalizedLocation, fsIdentifier, 
storageProperties);
+    }
+
+    /**
+     * Fast factory method that reuses pre-computed schema and fsIdentifier.
+     * This is optimized for batch processing where many files share the same 
bucket/prefix.
+     *
+     * @param location           the input URI location string
+     * @param storageProperties  pre-computed storage properties for 
normalization
+     * @param cachedSchema       pre-computed schema (can be null to compute)
+     * @param cachedFsIdPrefix   pre-computed fsIdentifier prefix like "s3://" 
(can be null to compute)
+     * @return a new LocationPath instance
+     */
+    public static LocationPath ofWithCache(String location,
+                                           StorageProperties storageProperties,
+                                           String cachedSchema,
+                                           String cachedFsIdPrefix) {
+        try {
+            String normalizedLocation = 
storageProperties.validateAndNormalizeUri(location);
+
+            String fsIdentifier;
+            if (cachedFsIdPrefix != null && 
normalizedLocation.startsWith(cachedFsIdPrefix)) {
+                // Fast path: extract authority from normalized location 
without full URI parsing
+                int authorityStart = cachedFsIdPrefix.length();
+                int authorityEnd = normalizedLocation.indexOf('/', 
authorityStart);
+                if (authorityEnd == -1) {
+                    authorityEnd = normalizedLocation.length();
+                }
+                String authority = 
normalizedLocation.substring(authorityStart, authorityEnd);

Review Comment:
   In the fast path for extracting the authority from the normalized location, 
if the normalized location ends with a '/' character immediately after the 
fsIdPrefix (e.g., "s3:///path"), the authority extraction logic would produce 
an empty string for the authority. This would result in an fsIdentifier like 
"s3://" (missing the bucket), which differs from the behavior of the full URI 
parsing fallback.
   
   While this edge case is unlikely in practice (as it would represent an 
invalid S3 path), the inconsistent behavior between fast path and fallback 
could cause issues. Consider adding validation to ensure that the authority is 
not empty.
   ```suggestion
                   String authority = 
normalizedLocation.substring(authorityStart, authorityEnd);
                   if (authority.isEmpty()) {
                       throw new StoragePropertiesException(
                               "Invalid location: missing authority in 
normalized URI: " + normalizedLocation);
                   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [Improvement](Iceberg) Optimize LocationPath.of performance for Iceberg table queries [doris]

Reply via email to