Copilot commented on code in PR #59217:
URL: https://github.com/apache/doris/pull/59217#discussion_r2642166920
##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java:
##########
@@ -363,9 +375,83 @@ private CloseableIterable<FileScanTask>
planFileScanTask(TableScan scan) {
return TableScanUtil.splitFiles(scan.planFiles(), targetSplitSize);
}
+ /**
+ * Initialize cached values for LocationPath creation on first use.
+ * This avoids repeated StorageProperties lookup, scheme parsing, and
S3URI regex parsing for each file.
+ */
+ private void initLocationPathCache(String samplePath) {
+ if (locationPathCacheInitialized) {
+ return;
+ }
+ synchronized (this) {
+ if (locationPathCacheInitialized) {
+ return;
+ }
+ try {
+ // Create a LocationPath using the full method to get all
cached values
+ LocationPath sampleLocationPath = LocationPath.of(samplePath,
storagePropertiesMap);
+ cachedStorageProperties =
sampleLocationPath.getStorageProperties();
+ cachedSchema = sampleLocationPath.getSchema();
+ cachedFsIdentifier = sampleLocationPath.getFsIdentifier();
+
+ // Extract fsIdPrefix like "s3://" from fsIdentifier like
"s3://bucket"
+ int schemeEnd = cachedFsIdentifier.indexOf("://");
+ if (schemeEnd > 0) {
+ cachedFsIdPrefix = cachedFsIdentifier.substring(0,
schemeEnd + 3);
+ }
+
+ // Cache path prefix mapping for fast transformation
+ // This allows subsequent files to skip S3URI regex parsing
entirely
+ String normalizedPath =
sampleLocationPath.getNormalizedLocation();
+
+ // Find the common prefix by looking for the last '/' before
the filename
+ int lastSlashInOriginal = samplePath.lastIndexOf('/');
+ int lastSlashInNormalized = normalizedPath.lastIndexOf('/');
+
+ if (lastSlashInOriginal > 0 && lastSlashInNormalized > 0) {
+ cachedOriginalPathPrefix = samplePath.substring(0,
lastSlashInOriginal + 1);
+ cachedNormalizedPathPrefix = normalizedPath.substring(0,
lastSlashInNormalized + 1);
+ }
+
+ locationPathCacheInitialized = true;
+ } catch (Exception e) {
+ // If caching fails, we'll fall back to the full method each
time
+ LOG.warn("Failed to initialize LocationPath cache, will use
full parsing", e);
+ locationPathCacheInitialized = true;
+ }
+ }
+ }
+
+ /**
+ * Create a LocationPath with cached values for better performance.
+ * Uses cached path prefix mapping to completely bypass S3URI regex
parsing for most files.
+ * Falls back to full parsing if cache is not available or path doesn't
match cached prefix.
+ */
+ private LocationPath createLocationPathWithCache(String path) {
+ // Initialize cache on first call
+ if (!locationPathCacheInitialized) {
+ initLocationPathCache(path);
+ }
+
+ // Fast path: if path starts with cached original prefix, directly
transform without any parsing
+ if (cachedOriginalPathPrefix != null &&
path.startsWith(cachedOriginalPathPrefix)) {
+ // Transform: replace original prefix with normalized prefix
+ String normalizedPath = cachedNormalizedPathPrefix +
path.substring(cachedOriginalPathPrefix.length());
+ return LocationPath.ofDirect(normalizedPath, cachedSchema,
cachedFsIdentifier, cachedStorageProperties);
+ }
Review Comment:
This method assumes that files with the same prefix will always have the
same normalized path prefix. However, this may not be correct in all cases,
particularly when files in different subdirectories could have different
storage properties or when path normalization depends on the full path rather
than just the prefix.
For example, if different files require different normalization rules based
on their full paths (not just the prefix), using the cached prefix mapping
could produce incorrect results. Consider validating that the assumption holds
for all files in the table, or adding a comment documenting this assumption.
##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java:
##########
@@ -116,6 +116,18 @@ public class IcebergScanNode extends FileQueryScanNode {
private Map<StorageProperties.Type, StorageProperties>
storagePropertiesMap;
private Map<String, String> backendStorageProperties;
+ // Cached values for LocationPath creation optimization
+ // These are lazily initialized on first use to avoid parsing overhead for
each file
+ private volatile StorageProperties cachedStorageProperties;
+ private volatile String cachedSchema;
+ private volatile String cachedFsIdPrefix;
+ private volatile boolean locationPathCacheInitialized = false;
+ // Cache for path prefix transformation to avoid repeated S3URI parsing
+ // Maps original path prefix (e.g., "https://bucket.s3.amazonaws.com/") to
normalized prefix (e.g., "s3://bucket/")
+ private volatile String cachedOriginalPathPrefix;
+ private volatile String cachedNormalizedPathPrefix;
+ private volatile String cachedFsIdentifier;
Review Comment:
The volatile fields in this caching implementation may have visibility
issues in multi-threaded contexts. While `locationPathCacheInitialized` is
checked with double-checked locking, the other cached fields
(cachedOriginalPathPrefix, cachedNormalizedPathPrefix, etc.) may not be
properly visible to other threads even after `locationPathCacheInitialized` is
set to true.
The issue is that while the synchronized block ensures that the
initialization completes atomically, the writes to the non-volatile cached
fields inside the synchronized block may not be visible to other threads
reading them outside the synchronized block in `createLocationPathWithCache`.
Consider either:
1. Making all cached fields (cachedOriginalPathPrefix,
cachedNormalizedPathPrefix, cachedFsIdentifier) volatile as well
2. Or using a single atomic reference containing all cached values as an
immutable object
##########
fe/fe-core/src/main/java/org/apache/doris/datasource/property/storage/S3PropertyUtils.java:
##########
@@ -132,16 +142,50 @@ public static String validateAndNormalizeUri(String path,
if (StringUtils.isBlank(path)) {
throw new StoragePropertiesException("path is null");
}
- if (path.startsWith("s3://")) {
+
+ // Fast path 1: s3:// paths are already in the normalized format
expected by BE
+ if (path.startsWith(S3_SCHEME_PREFIX)) {
return path;
}
+ // Fast path 2: simple S3-compatible schemes (oss://, cos://, s3a://,
etc.)
+ // can be converted with simple string replacement:
scheme://bucket/key -> s3://bucket/key
+ String normalized = trySimpleSchemeConversion(path);
+ if (normalized != null) {
+ return normalized;
+ }
+
+ // Full parsing path: for HTTP URLs and other complex formats
boolean usePathStyle = Boolean.parseBoolean(stringUsePathStyle);
boolean forceParsingByStandardUri =
Boolean.parseBoolean(stringForceParsingByStandardUri);
S3URI s3uri = S3URI.create(path, usePathStyle,
forceParsingByStandardUri);
return "s3" + S3URI.SCHEME_DELIM + s3uri.getBucket() +
S3URI.PATH_DELIM + s3uri.getKey();
}
+ /**
+ * Try to convert simple S3-compatible scheme URIs to s3:// format using
string replacement.
+ * This avoids expensive regex parsing for common cases like
oss://bucket/key, s3a://bucket/key, etc.
+ *
+ * @param path the input path
+ * @return converted s3:// path if successful, null if the path doesn't
match simple pattern
+ */
+ private static String trySimpleSchemeConversion(String path) {
+ int delimIndex = path.indexOf(SCHEME_DELIM);
+ if (delimIndex <= 0) {
+ return null;
+ }
+
+ String scheme = path.substring(0, delimIndex).toLowerCase();
+ for (String compatibleScheme : SIMPLE_S3_COMPATIBLE_SCHEMES) {
+ if (compatibleScheme.equals(scheme)) {
+ // Simple conversion: replace scheme with "s3"
+ // e.g., "oss://bucket/key" -> "s3://bucket/key"
+ return S3_SCHEME_PREFIX + path.substring(delimIndex +
SCHEME_DELIM.length());
+ }
+ }
+ return null;
+ }
Review Comment:
The new `trySimpleSchemeConversion` method and the fast path optimization
for S3-compatible schemes (oss://, cos://, s3a://, etc.) lack test coverage.
While existing tests cover the basic validateAndNormalizeUri functionality,
they don't specifically test the new optimization path.
Consider adding test cases that verify:
1. Conversion of each S3-compatible scheme listed in
SIMPLE_S3_COMPATIBLE_SCHEMES (e.g., "oss://bucket/key" -> "s3://bucket/key")
2. Case sensitivity handling (e.g., "OSS://bucket/key" vs "oss://bucket/key")
3. Edge cases like missing bucket/key after the scheme
4. Paths that should fall back to full S3URI parsing
##########
fe/fe-core/src/main/java/org/apache/doris/common/util/LocationPath.java:
##########
@@ -201,6 +199,65 @@ public static LocationPath of(String location,
}
}
+ /**
+ * Ultra-fast factory method that directly constructs LocationPath without
any parsing.
+ * This is used when the normalized location is already known (e.g., from
prefix transformation).
+ *
+ * @param normalizedLocation the already-normalized location string
+ * @param schema pre-computed schema
+ * @param fsIdentifier pre-computed filesystem identifier
+ * @param storageProperties the storage properties (can be null)
+ * @return a new LocationPath instance
+ */
+ public static LocationPath ofDirect(String normalizedLocation,
+ String schema,
+ String fsIdentifier,
+ StorageProperties storageProperties) {
+ return new LocationPath(schema, normalizedLocation, fsIdentifier,
storageProperties);
+ }
+
+ /**
+ * Fast factory method that reuses pre-computed schema and fsIdentifier.
+ * This is optimized for batch processing where many files share the same
bucket/prefix.
+ *
+ * @param location the input URI location string
+ * @param storageProperties pre-computed storage properties for
normalization
+ * @param cachedSchema pre-computed schema (can be null to compute)
+ * @param cachedFsIdPrefix pre-computed fsIdentifier prefix like "s3://"
(can be null to compute)
+ * @return a new LocationPath instance
+ */
+ public static LocationPath ofWithCache(String location,
+ StorageProperties storageProperties,
+ String cachedSchema,
+ String cachedFsIdPrefix) {
+ try {
+ String normalizedLocation =
storageProperties.validateAndNormalizeUri(location);
+
+ String fsIdentifier;
+ if (cachedFsIdPrefix != null &&
normalizedLocation.startsWith(cachedFsIdPrefix)) {
+ // Fast path: extract authority from normalized location
without full URI parsing
+ int authorityStart = cachedFsIdPrefix.length();
+ int authorityEnd = normalizedLocation.indexOf('/',
authorityStart);
+ if (authorityEnd == -1) {
+ authorityEnd = normalizedLocation.length();
+ }
+ String authority =
normalizedLocation.substring(authorityStart, authorityEnd);
+ fsIdentifier = cachedFsIdPrefix + authority;
+ } else {
+ // Fallback to full URI parsing
+ String encodedLocation = encodedLocation(normalizedLocation);
+ URI uri = URI.create(encodedLocation);
+ fsIdentifier = Strings.nullToEmpty(uri.getScheme()) + "://"
+ + Strings.nullToEmpty(uri.getAuthority());
+ }
+
+ String schema = cachedSchema != null ? cachedSchema :
extractScheme(location);
+ return new LocationPath(schema, normalizedLocation, fsIdentifier,
storageProperties);
+ } catch (UserException e) {
+ throw new StoragePropertiesException("Failed to create
LocationPath for location: " + location, e);
+ }
+ }
Review Comment:
The new methods `LocationPath.ofDirect` and `LocationPath.ofWithCache` lack
test coverage. While the existing LocationPathTest has comprehensive coverage
for the original `LocationPath.of` method, the new optimization methods
introduced in this PR are not tested.
Consider adding test cases that:
1. Verify `ofDirect` correctly constructs LocationPath with pre-computed
values
2. Test `ofWithCache` with various scenarios (cached fsIdPrefix present vs
absent)
3. Validate that the fast path produces the same results as the full parsing
path
4. Test edge cases like paths without authority or malformed normalized
locations
##########
fe/fe-core/src/main/java/org/apache/doris/datasource/property/storage/S3PropertyUtils.java:
##########
@@ -132,16 +142,50 @@ public static String validateAndNormalizeUri(String path,
if (StringUtils.isBlank(path)) {
throw new StoragePropertiesException("path is null");
}
- if (path.startsWith("s3://")) {
+
+ // Fast path 1: s3:// paths are already in the normalized format
expected by BE
+ if (path.startsWith(S3_SCHEME_PREFIX)) {
return path;
}
+ // Fast path 2: simple S3-compatible schemes (oss://, cos://, s3a://,
etc.)
+ // can be converted with simple string replacement:
scheme://bucket/key -> s3://bucket/key
+ String normalized = trySimpleSchemeConversion(path);
+ if (normalized != null) {
+ return normalized;
+ }
+
+ // Full parsing path: for HTTP URLs and other complex formats
boolean usePathStyle = Boolean.parseBoolean(stringUsePathStyle);
boolean forceParsingByStandardUri =
Boolean.parseBoolean(stringForceParsingByStandardUri);
S3URI s3uri = S3URI.create(path, usePathStyle,
forceParsingByStandardUri);
return "s3" + S3URI.SCHEME_DELIM + s3uri.getBucket() +
S3URI.PATH_DELIM + s3uri.getKey();
}
+ /**
+ * Try to convert simple S3-compatible scheme URIs to s3:// format using
string replacement.
+ * This avoids expensive regex parsing for common cases like
oss://bucket/key, s3a://bucket/key, etc.
+ *
+ * @param path the input path
+ * @return converted s3:// path if successful, null if the path doesn't
match simple pattern
+ */
+ private static String trySimpleSchemeConversion(String path) {
+ int delimIndex = path.indexOf(SCHEME_DELIM);
+ if (delimIndex <= 0) {
+ return null;
+ }
+
+ String scheme = path.substring(0, delimIndex).toLowerCase();
+ for (String compatibleScheme : SIMPLE_S3_COMPATIBLE_SCHEMES) {
+ if (compatibleScheme.equals(scheme)) {
+ // Simple conversion: replace scheme with "s3"
+ // e.g., "oss://bucket/key" -> "s3://bucket/key"
+ return S3_SCHEME_PREFIX + path.substring(delimIndex +
SCHEME_DELIM.length());
+ }
+ }
+ return null;
+ }
Review Comment:
The simple scheme conversion in `trySimpleSchemeConversion` doesn't validate
that the path after the scheme delimiter is actually a valid S3-style path
(bucket/key format). It performs a blind string replacement that could produce
invalid results for malformed URIs.
For example, paths like "oss://http://example.com/path" or "s3a://" (missing
bucket/key) would be converted to "s3://http://example.com/path" or "s3://"
without validation. While these edge cases are unlikely in production, consider
adding basic validation (e.g., checking that there's content after "://") to
ensure the converted path is well-formed.
##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/source/IcebergScanNode.java:
##########
@@ -363,9 +375,83 @@ private CloseableIterable<FileScanTask>
planFileScanTask(TableScan scan) {
return TableScanUtil.splitFiles(scan.planFiles(), targetSplitSize);
}
+ /**
+ * Initialize cached values for LocationPath creation on first use.
+ * This avoids repeated StorageProperties lookup, scheme parsing, and
S3URI regex parsing for each file.
+ */
+ private void initLocationPathCache(String samplePath) {
+ if (locationPathCacheInitialized) {
+ return;
+ }
+ synchronized (this) {
+ if (locationPathCacheInitialized) {
+ return;
+ }
+ try {
+ // Create a LocationPath using the full method to get all
cached values
+ LocationPath sampleLocationPath = LocationPath.of(samplePath,
storagePropertiesMap);
+ cachedStorageProperties =
sampleLocationPath.getStorageProperties();
+ cachedSchema = sampleLocationPath.getSchema();
+ cachedFsIdentifier = sampleLocationPath.getFsIdentifier();
+
+ // Extract fsIdPrefix like "s3://" from fsIdentifier like
"s3://bucket"
+ int schemeEnd = cachedFsIdentifier.indexOf("://");
+ if (schemeEnd > 0) {
+ cachedFsIdPrefix = cachedFsIdentifier.substring(0,
schemeEnd + 3);
+ }
+
+ // Cache path prefix mapping for fast transformation
+ // This allows subsequent files to skip S3URI regex parsing
entirely
+ String normalizedPath =
sampleLocationPath.getNormalizedLocation();
+
+ // Find the common prefix by looking for the last '/' before
the filename
+ int lastSlashInOriginal = samplePath.lastIndexOf('/');
+ int lastSlashInNormalized = normalizedPath.lastIndexOf('/');
+
+ if (lastSlashInOriginal > 0 && lastSlashInNormalized > 0) {
+ cachedOriginalPathPrefix = samplePath.substring(0,
lastSlashInOriginal + 1);
+ cachedNormalizedPathPrefix = normalizedPath.substring(0,
lastSlashInNormalized + 1);
+ }
+
+ locationPathCacheInitialized = true;
+ } catch (Exception e) {
+ // If caching fails, we'll fall back to the full method each
time
+ LOG.warn("Failed to initialize LocationPath cache, will use
full parsing", e);
+ locationPathCacheInitialized = true;
Review Comment:
Setting `locationPathCacheInitialized` to true even when an exception occurs
prevents the cache from being initialized on subsequent attempts. This means
that if initialization fails once, all subsequent file processing will skip the
cache optimization entirely and fall back to the full parsing path indefinitely.
Consider either:
1. Not setting `locationPathCacheInitialized` to true in the catch block,
allowing retry on the next file
2. Or using a separate flag to track initialization failure and avoid
repeated initialization attempts while still allowing fallback logic
```suggestion
// Do not mark the cache as initialized on failure; allow
future retries.
```
##########
fe/fe-core/src/main/java/org/apache/doris/common/util/LocationPath.java:
##########
@@ -201,6 +199,65 @@ public static LocationPath of(String location,
}
}
+ /**
+ * Ultra-fast factory method that directly constructs LocationPath without
any parsing.
+ * This is used when the normalized location is already known (e.g., from
prefix transformation).
+ *
+ * @param normalizedLocation the already-normalized location string
+ * @param schema pre-computed schema
+ * @param fsIdentifier pre-computed filesystem identifier
+ * @param storageProperties the storage properties (can be null)
+ * @return a new LocationPath instance
+ */
+ public static LocationPath ofDirect(String normalizedLocation,
+ String schema,
+ String fsIdentifier,
+ StorageProperties storageProperties) {
+ return new LocationPath(schema, normalizedLocation, fsIdentifier,
storageProperties);
+ }
+
+ /**
+ * Fast factory method that reuses pre-computed schema and fsIdentifier.
+ * This is optimized for batch processing where many files share the same
bucket/prefix.
+ *
+ * @param location the input URI location string
+ * @param storageProperties pre-computed storage properties for
normalization
+ * @param cachedSchema pre-computed schema (can be null to compute)
+ * @param cachedFsIdPrefix pre-computed fsIdentifier prefix like "s3://"
(can be null to compute)
+ * @return a new LocationPath instance
+ */
+ public static LocationPath ofWithCache(String location,
+ StorageProperties storageProperties,
+ String cachedSchema,
+ String cachedFsIdPrefix) {
+ try {
+ String normalizedLocation =
storageProperties.validateAndNormalizeUri(location);
+
+ String fsIdentifier;
+ if (cachedFsIdPrefix != null &&
normalizedLocation.startsWith(cachedFsIdPrefix)) {
+ // Fast path: extract authority from normalized location
without full URI parsing
+ int authorityStart = cachedFsIdPrefix.length();
+ int authorityEnd = normalizedLocation.indexOf('/',
authorityStart);
+ if (authorityEnd == -1) {
+ authorityEnd = normalizedLocation.length();
+ }
+ String authority =
normalizedLocation.substring(authorityStart, authorityEnd);
Review Comment:
In the fast path for extracting the authority from the normalized location,
if the normalized location ends with a '/' character immediately after the
fsIdPrefix (e.g., "s3:///path"), the authority extraction logic would produce
an empty string for the authority. This would result in an fsIdentifier like
"s3://" (missing the bucket), which differs from the behavior of the full URI
parsing fallback.
While this edge case is unlikely in practice (as it would represent an
invalid S3 path), the inconsistent behavior between fast path and fallback
could cause issues. Consider adding validation to ensure that the authority is
not empty.
```suggestion
String authority =
normalizedLocation.substring(authorityStart, authorityEnd);
if (authority.isEmpty()) {
throw new StoragePropertiesException(
"Invalid location: missing authority in
normalized URI: " + normalizedLocation);
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]