Copilot commented on code in PR #2776:
URL: https://github.com/apache/sedona/pull/2776#discussion_r2970141827
##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java:
##########
@@ -63,7 +82,39 @@ private OsmNode parse(int idx, Osmformat.StringTable
stringTable) {
HashMap<String, String> tags = parseTags(stringTable);
- return new OsmNode(id, lat, lon, tags);
+ OsmNode node = new OsmNode(id, lat, lon, tags);
+
+ if (hasDenseInfo) {
+ Osmformat.DenseInfo denseInfo = nodes.getDenseinfo();
+
+ // version is NOT delta-encoded
+ node.setVersion(denseInfo.getVersion(idx));
+
+ // timestamp, changeset, uid, user_sid are delta-encoded
+ long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
+ long changeset = denseInfo.getChangeset(idx) + firstChangeset;
+ int uid = denseInfo.getUid(idx) + firstUid;
+ int userSid = denseInfo.getUserSid(idx) + firstUserSid;
+
+ firstTimestamp = timestamp;
+ firstChangeset = changeset;
+ firstUid = uid;
+ firstUserSid = userSid;
+
+ node.setTimestamp(timestamp * dateGranularity);
+ node.setChangeset(changeset);
+ node.setUid(uid);
+ if (userSid > 0) {
+ node.setUser(stringTable.getS(userSid).toStringUtf8());
+ }
+
+ // visible is NOT delta-encoded, and may not be present
+ if (denseInfo.getVisibleCount() > idx) {
+ node.setVisible(denseInfo.getVisible(idx));
Review Comment:
For DenseNodes, `visible` is only set when `visibleCount > idx`, otherwise
it remains null. The OSM PBF spec notes that in history files
(required_features `HistoricalInformation`) missing `visible` values must be
treated as `true`, so returning null can be incorrect. Consider defaulting to
true based on header required_features (or leaving null but
documenting/encoding that behavior explicitly).
```suggestion
node.setVisible(denseInfo.getVisible(idx));
} else {
// Per OSM PBF spec, missing 'visible' must be treated as true
(especially in history files).
node.setVisible(true);
```
##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sedona.sql.datasources.osmpbf.features;
+
+import org.apache.sedona.sql.datasources.osmpbf.build.Osmformat;
+import org.apache.sedona.sql.datasources.osmpbf.model.OSMEntity;
+
+public class InfoResolver {
+
+ public static void populateInfo(
+ OSMEntity entity,
+ Osmformat.Info info,
+ Osmformat.StringTable stringTable,
+ int dateGranularity) {
+ if (info == null) {
+ return;
+ }
+ entity.setVersion(info.getVersion());
+ entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
+ entity.setChangeset(info.getChangeset());
+ entity.setUid(info.getUid());
+ if (info.getUserSid() > 0) {
+ entity.setUser(stringTable.getS(info.getUserSid()).toStringUtf8());
+ }
+ if (info.hasVisible()) {
+ entity.setVisible(info.getVisible());
Review Comment:
`visible` handling is incomplete per the PBF spec: when the file has
required_features `HistoricalInformation`, objects that *lack* the `visible`
flag must be assumed `true` (see generated Osmformat docs). Currently, if
`info.hasVisible()` is false the column stays null, which is incorrect for
history files. Consider parsing the HeaderBlock required_features and
defaulting `visible` to true when appropriate (or explicitly documenting the
limitation).
```suggestion
entity.setVisible(info.getVisible());
} else {
// Per OSM PBF spec, when HistoricalInformation is a required_feature,
// missing "visible" must be treated as true. Default to true when
absent.
entity.setVisible(true);
```
##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sedona.sql.datasources.osmpbf.features;
+
+import org.apache.sedona.sql.datasources.osmpbf.build.Osmformat;
+import org.apache.sedona.sql.datasources.osmpbf.model.OSMEntity;
+
+public class InfoResolver {
+
+ public static void populateInfo(
+ OSMEntity entity,
+ Osmformat.Info info,
+ Osmformat.StringTable stringTable,
+ int dateGranularity) {
+ if (info == null) {
+ return;
+ }
+ entity.setVersion(info.getVersion());
+ entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
+ entity.setChangeset(info.getChangeset());
+ entity.setUid(info.getUid());
Review Comment:
`populateInfo` writes default protobuf values into the entity even when the
corresponding fields are not present (e.g., `version` defaults to -1,
`timestamp/changeset/uid` default to 0). This makes missing metadata look like
real values and can produce incorrect timestamps (epoch) or version -1. Use
`info.hasVersion/hasTimestamp/hasChangeset/hasUid` (and optionally treat
version==-1 as unset) to only set these fields when present; otherwise leave
them null.
```suggestion
if (info.hasVersion()) {
int version = info.getVersion();
if (version != -1) {
entity.setVersion(version);
}
}
if (info.hasTimestamp()) {
entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
}
if (info.hasChangeset()) {
entity.setChangeset(info.getChangeset());
}
if (info.hasUid()) {
entity.setUid(info.getUid());
}
```
##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java:
##########
@@ -26,23 +26,42 @@ public class DenseNodeExtractor implements Extractor {
long latOffset;
long lonOffset;
long granularity;
+ int dateGranularity;
long firstId;
long firstLat;
long firstLon;
Integer keyIndex;
+ // DenseInfo delta accumulators
+ boolean hasDenseInfo;
+ long firstTimestamp;
+ long firstChangeset;
+ int firstUid;
+ int firstUserSid;
+
Osmformat.DenseNodes nodes;
public DenseNodeExtractor(
- Osmformat.DenseNodes nodes, long latOffset, long lonOffset, long
granularity) {
+ Osmformat.DenseNodes nodes,
+ long latOffset,
+ long lonOffset,
+ long granularity,
+ int dateGranularity) {
this.firstId = 0;
this.firstLat = 0;
this.firstLon = 0;
this.latOffset = latOffset;
this.lonOffset = lonOffset;
this.granularity = granularity;
+ this.dateGranularity = dateGranularity;
this.nodes = nodes;
this.keyIndex = 0;
+
+ this.hasDenseInfo = nodes.hasDenseinfo() &&
nodes.getDenseinfo().getVersionCount() > 0;
Review Comment:
`hasDenseInfo` is derived only from `getVersionCount() > 0`, but the code
later assumes `timestamp/changeset/uid/user_sid` lists are also present for
every node. If a PBF includes DenseInfo with only versions populated (other
lists empty), `getTimestamp(idx)` etc will throw. Consider computing presence
per-field (using `get*Count() > idx`) instead of a single boolean gate.
##########
spark/common/src/test/scala/org/apache/sedona/sql/OsmReaderTest.scala:
##########
@@ -232,6 +232,90 @@ class OsmReaderTest extends TestBaseScala with Matchers {
relationsList should contain theSameElementsAs expectedRelationsList
}
+ it("should parse metadata fields (changeset, timestamp, uid, user,
version)") {
+ val osmData = sparkSession.read
+ .format("osmpbf")
+ .load(monacoPath)
+
+ // All entities should have version, timestamp, changeset populated
+ val totalCount = osmData.count()
+ val withMetadata = osmData
+ .filter("version is not null and timestamp is not null and changeset
is not null")
+ .count()
+
+ withMetadata shouldEqual totalCount
+
+ // Verify timestamp values are reasonable (after year 2000, before year
2100)
+ val timestamps = osmData
+ .selectExpr("min(timestamp)", "max(timestamp)")
+ .collect()
+ .head
+
+ val minTimestamp = timestamps.getTimestamp(0)
+ val maxTimestamp = timestamps.getTimestamp(1)
+
+ minTimestamp.after(java.sql.Timestamp.valueOf("2000-01-01 00:00:00"))
shouldBe true
+ maxTimestamp.before(java.sql.Timestamp.valueOf("2100-01-01 00:00:00"))
shouldBe true
+
+ // Verify version is positive
+ val minVersion = osmData
+ .selectExpr("min(version)")
+ .collect()
+ .head
+ .getInt(0)
+
+ minVersion should be >= 1
+
+ // Verify changeset is non-negative
+ val minChangeset = osmData
+ .selectExpr("min(changeset)")
+ .collect()
+ .head
+ .getLong(0)
+
+ minChangeset should be >= 0L
+
+ // Verify metadata works for each entity kind
+ for (kind <- Seq("node", "way", "relation")) {
+ val kindData = osmData.filter(s"kind == '$kind'")
+ val kindWithMeta = kindData
+ .filter("version is not null and timestamp is not null")
+ .count()
+
+ kindWithMeta shouldEqual kindData.count()
+ }
+ }
+
+ it("should include metadata fields in schema for dense nodes") {
+ val denseData = sparkSession.read
+ .format("osmpbf")
+ .load(densePath)
+
+ // Verify schema includes the new fields
+ val fieldNames = denseData.schema.fieldNames
+ fieldNames should contain("changeset")
+ fieldNames should contain("timestamp")
+ fieldNames should contain("uid")
+ fieldNames should contain("user")
+ fieldNames should contain("version")
+ fieldNames should contain("visible")
Review Comment:
The new dense-node test only asserts that the schema contains the metadata
columns; it doesn’t validate that `DenseNodeExtractor` actually
populates/decodes them (including the delta-decoding logic for
timestamp/changeset/uid/user_sid). Adding at least one assertion on
non-null/range-correct values for a dense PBF fixture would prevent regressions
where the columns exist but are always null/incorrect.
```suggestion
fieldNames should contain("visible")
// Verify that at least one dense node has populated metadata values
val nodeWithMetadata = denseData
.where(col("kind") === "node")
.select("changeset", "timestamp", "uid", "user", "version",
"visible")
.head()
val changeset = nodeWithMetadata.getAs[Long]("changeset")
val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
val uid = nodeWithMetadata.getAs[Long]("uid")
val user = nodeWithMetadata.getAs[String]("user")
val version = nodeWithMetadata.getAs[Long]("version")
val visible = nodeWithMetadata.getAs[Boolean]("visible")
// Basic range/non-null checks to ensure delta-decoded metadata is
populated
changeset should be > 0L
timestamp should be > 0L
uid should be >= 0L
version should be > 0L
user should not be null
user.trim.isEmpty shouldBe false
// Accessing 'visible' as Boolean already guarantees it is non-null
(visible == true || visible == false) shouldBe true
```
##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java:
##########
@@ -63,7 +82,39 @@ private OsmNode parse(int idx, Osmformat.StringTable
stringTable) {
HashMap<String, String> tags = parseTags(stringTable);
- return new OsmNode(id, lat, lon, tags);
+ OsmNode node = new OsmNode(id, lat, lon, tags);
+
+ if (hasDenseInfo) {
+ Osmformat.DenseInfo denseInfo = nodes.getDenseinfo();
+
+ // version is NOT delta-encoded
+ node.setVersion(denseInfo.getVersion(idx));
+
+ // timestamp, changeset, uid, user_sid are delta-encoded
+ long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
+ long changeset = denseInfo.getChangeset(idx) + firstChangeset;
+ int uid = denseInfo.getUid(idx) + firstUid;
+ int userSid = denseInfo.getUserSid(idx) + firstUserSid;
+
+ firstTimestamp = timestamp;
+ firstChangeset = changeset;
+ firstUid = uid;
+ firstUserSid = userSid;
+
+ node.setTimestamp(timestamp * dateGranularity);
+ node.setChangeset(changeset);
+ node.setUid(uid);
+ if (userSid > 0) {
+ node.setUser(stringTable.getS(userSid).toStringUtf8());
+ }
+
Review Comment:
Inside the `hasDenseInfo` block, `denseInfo.getTimestamp/ getChangeset/
getUid/ getUserSid` are indexed unconditionally. These are repeated fields and
may be absent or shorter than `idCount` in some files, causing
`IndexOutOfBoundsException` during parsing. Guard each access with the
corresponding `get*Count() > idx` (and only advance the delta accumulator for
fields that are actually present).
```suggestion
if (denseInfo.getVersionCount() > idx) {
node.setVersion(denseInfo.getVersion(idx));
}
// timestamp, changeset, uid, user_sid are delta-encoded
if (denseInfo.getTimestampCount() > idx) {
long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
firstTimestamp = timestamp;
node.setTimestamp(timestamp * dateGranularity);
}
if (denseInfo.getChangesetCount() > idx) {
long changeset = denseInfo.getChangeset(idx) + firstChangeset;
firstChangeset = changeset;
node.setChangeset(changeset);
}
if (denseInfo.getUidCount() > idx) {
int uid = denseInfo.getUid(idx) + firstUid;
firstUid = uid;
node.setUid(uid);
}
if (denseInfo.getUserSidCount() > idx) {
int userSid = denseInfo.getUserSid(idx) + firstUserSid;
firstUserSid = userSid;
if (userSid > 0) {
node.setUser(stringTable.getS(userSid).toStringUtf8());
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]