Re: [PR] [GH-2760] Extend OSM PBF reader to support additional metadata fields [sedona]

via GitHub Sat, 21 Mar 2026 13:33:07 -0700


Copilot commented on code in PR #2776:
URL: https://github.com/apache/sedona/pull/2776#discussion_r2970141827



##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java:
##########
@@ -63,7 +82,39 @@ private OsmNode parse(int idx, Osmformat.StringTable 
stringTable) {
 
     HashMap<String, String> tags = parseTags(stringTable);
 
-    return new OsmNode(id, lat, lon, tags);
+    OsmNode node = new OsmNode(id, lat, lon, tags);
+
+    if (hasDenseInfo) {
+      Osmformat.DenseInfo denseInfo = nodes.getDenseinfo();
+
+      // version is NOT delta-encoded
+      node.setVersion(denseInfo.getVersion(idx));
+
+      // timestamp, changeset, uid, user_sid are delta-encoded
+      long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
+      long changeset = denseInfo.getChangeset(idx) + firstChangeset;
+      int uid = denseInfo.getUid(idx) + firstUid;
+      int userSid = denseInfo.getUserSid(idx) + firstUserSid;
+
+      firstTimestamp = timestamp;
+      firstChangeset = changeset;
+      firstUid = uid;
+      firstUserSid = userSid;
+
+      node.setTimestamp(timestamp * dateGranularity);
+      node.setChangeset(changeset);
+      node.setUid(uid);
+      if (userSid > 0) {
+        node.setUser(stringTable.getS(userSid).toStringUtf8());
+      }
+
+      // visible is NOT delta-encoded, and may not be present
+      if (denseInfo.getVisibleCount() > idx) {
+        node.setVisible(denseInfo.getVisible(idx));

Review Comment:
   For DenseNodes, `visible` is only set when `visibleCount > idx`, otherwise 
it remains null. The OSM PBF spec notes that in history files 
(required_features `HistoricalInformation`) missing `visible` values must be 
treated as `true`, so returning null can be incorrect. Consider defaulting to 
true based on header required_features (or leaving null but 
documenting/encoding that behavior explicitly).
   ```suggestion
           node.setVisible(denseInfo.getVisible(idx));
         } else {
           // Per OSM PBF spec, missing 'visible' must be treated as true 
(especially in history files).
           node.setVisible(true);
   ```



##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sedona.sql.datasources.osmpbf.features;
+
+import org.apache.sedona.sql.datasources.osmpbf.build.Osmformat;
+import org.apache.sedona.sql.datasources.osmpbf.model.OSMEntity;
+
+public class InfoResolver {
+
+  public static void populateInfo(
+      OSMEntity entity,
+      Osmformat.Info info,
+      Osmformat.StringTable stringTable,
+      int dateGranularity) {
+    if (info == null) {
+      return;
+    }
+    entity.setVersion(info.getVersion());
+    entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
+    entity.setChangeset(info.getChangeset());
+    entity.setUid(info.getUid());
+    if (info.getUserSid() > 0) {
+      entity.setUser(stringTable.getS(info.getUserSid()).toStringUtf8());
+    }
+    if (info.hasVisible()) {
+      entity.setVisible(info.getVisible());

Review Comment:
   `visible` handling is incomplete per the PBF spec: when the file has 
required_features `HistoricalInformation`, objects that *lack* the `visible` 
flag must be assumed `true` (see generated Osmformat docs). Currently, if 
`info.hasVisible()` is false the column stays null, which is incorrect for 
history files. Consider parsing the HeaderBlock required_features and 
defaulting `visible` to true when appropriate (or explicitly documenting the 
limitation).
   ```suggestion
         entity.setVisible(info.getVisible());
       } else {
         // Per OSM PBF spec, when HistoricalInformation is a required_feature,
         // missing "visible" must be treated as true. Default to true when 
absent.
         entity.setVisible(true);
   ```



##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/features/InfoResolver.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.sedona.sql.datasources.osmpbf.features;
+
+import org.apache.sedona.sql.datasources.osmpbf.build.Osmformat;
+import org.apache.sedona.sql.datasources.osmpbf.model.OSMEntity;
+
+public class InfoResolver {
+
+  public static void populateInfo(
+      OSMEntity entity,
+      Osmformat.Info info,
+      Osmformat.StringTable stringTable,
+      int dateGranularity) {
+    if (info == null) {
+      return;
+    }
+    entity.setVersion(info.getVersion());
+    entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
+    entity.setChangeset(info.getChangeset());
+    entity.setUid(info.getUid());

Review Comment:
   `populateInfo` writes default protobuf values into the entity even when the 
corresponding fields are not present (e.g., `version` defaults to -1, 
`timestamp/changeset/uid` default to 0). This makes missing metadata look like 
real values and can produce incorrect timestamps (epoch) or version -1. Use 
`info.hasVersion/hasTimestamp/hasChangeset/hasUid` (and optionally treat 
version==-1 as unset) to only set these fields when present; otherwise leave 
them null.
   ```suggestion
       if (info.hasVersion()) {
         int version = info.getVersion();
         if (version != -1) {
           entity.setVersion(version);
         }
       }
       if (info.hasTimestamp()) {
         entity.setTimestamp((long) info.getTimestamp() * dateGranularity);
       }
       if (info.hasChangeset()) {
         entity.setChangeset(info.getChangeset());
       }
       if (info.hasUid()) {
         entity.setUid(info.getUid());
       }
   ```



##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java:
##########
@@ -26,23 +26,42 @@ public class DenseNodeExtractor implements Extractor {
   long latOffset;
   long lonOffset;
   long granularity;
+  int dateGranularity;
   long firstId;
   long firstLat;
   long firstLon;
   Integer keyIndex;
 
+  // DenseInfo delta accumulators
+  boolean hasDenseInfo;
+  long firstTimestamp;
+  long firstChangeset;
+  int firstUid;
+  int firstUserSid;
+
   Osmformat.DenseNodes nodes;
 
   public DenseNodeExtractor(
-      Osmformat.DenseNodes nodes, long latOffset, long lonOffset, long 
granularity) {
+      Osmformat.DenseNodes nodes,
+      long latOffset,
+      long lonOffset,
+      long granularity,
+      int dateGranularity) {
     this.firstId = 0;
     this.firstLat = 0;
     this.firstLon = 0;
     this.latOffset = latOffset;
     this.lonOffset = lonOffset;
     this.granularity = granularity;
+    this.dateGranularity = dateGranularity;
     this.nodes = nodes;
     this.keyIndex = 0;
+
+    this.hasDenseInfo = nodes.hasDenseinfo() && 
nodes.getDenseinfo().getVersionCount() > 0;

Review Comment:
   `hasDenseInfo` is derived only from `getVersionCount() > 0`, but the code 
later assumes `timestamp/changeset/uid/user_sid` lists are also present for 
every node. If a PBF includes DenseInfo with only versions populated (other 
lists empty), `getTimestamp(idx)` etc will throw. Consider computing presence 
per-field (using `get*Count() > idx`) instead of a single boolean gate.



##########
spark/common/src/test/scala/org/apache/sedona/sql/OsmReaderTest.scala:
##########
@@ -232,6 +232,90 @@ class OsmReaderTest extends TestBaseScala with Matchers {
       relationsList should contain theSameElementsAs expectedRelationsList
     }
 
+    it("should parse metadata fields (changeset, timestamp, uid, user, 
version)") {
+      val osmData = sparkSession.read
+        .format("osmpbf")
+        .load(monacoPath)
+
+      // All entities should have version, timestamp, changeset populated
+      val totalCount = osmData.count()
+      val withMetadata = osmData
+        .filter("version is not null and timestamp is not null and changeset 
is not null")
+        .count()
+
+      withMetadata shouldEqual totalCount
+
+      // Verify timestamp values are reasonable (after year 2000, before year 
2100)
+      val timestamps = osmData
+        .selectExpr("min(timestamp)", "max(timestamp)")
+        .collect()
+        .head
+
+      val minTimestamp = timestamps.getTimestamp(0)
+      val maxTimestamp = timestamps.getTimestamp(1)
+
+      minTimestamp.after(java.sql.Timestamp.valueOf("2000-01-01 00:00:00")) 
shouldBe true
+      maxTimestamp.before(java.sql.Timestamp.valueOf("2100-01-01 00:00:00")) 
shouldBe true
+
+      // Verify version is positive
+      val minVersion = osmData
+        .selectExpr("min(version)")
+        .collect()
+        .head
+        .getInt(0)
+
+      minVersion should be >= 1
+
+      // Verify changeset is non-negative
+      val minChangeset = osmData
+        .selectExpr("min(changeset)")
+        .collect()
+        .head
+        .getLong(0)
+
+      minChangeset should be >= 0L
+
+      // Verify metadata works for each entity kind
+      for (kind <- Seq("node", "way", "relation")) {
+        val kindData = osmData.filter(s"kind == '$kind'")
+        val kindWithMeta = kindData
+          .filter("version is not null and timestamp is not null")
+          .count()
+
+        kindWithMeta shouldEqual kindData.count()
+      }
+    }
+
+    it("should include metadata fields in schema for dense nodes") {
+      val denseData = sparkSession.read
+        .format("osmpbf")
+        .load(densePath)
+
+      // Verify schema includes the new fields
+      val fieldNames = denseData.schema.fieldNames
+      fieldNames should contain("changeset")
+      fieldNames should contain("timestamp")
+      fieldNames should contain("uid")
+      fieldNames should contain("user")
+      fieldNames should contain("version")
+      fieldNames should contain("visible")

Review Comment:
   The new dense-node test only asserts that the schema contains the metadata 
columns; it doesn’t validate that `DenseNodeExtractor` actually 
populates/decodes them (including the delta-decoding logic for 
timestamp/changeset/uid/user_sid). Adding at least one assertion on 
non-null/range-correct values for a dense PBF fixture would prevent regressions 
where the columns exist but are always null/incorrect.
   ```suggestion
         fieldNames should contain("visible")
   
         // Verify that at least one dense node has populated metadata values
         val nodeWithMetadata = denseData
           .where(col("kind") === "node")
           .select("changeset", "timestamp", "uid", "user", "version", 
"visible")
           .head()
   
         val changeset = nodeWithMetadata.getAs[Long]("changeset")
         val timestamp = nodeWithMetadata.getAs[Long]("timestamp")
         val uid = nodeWithMetadata.getAs[Long]("uid")
         val user = nodeWithMetadata.getAs[String]("user")
         val version = nodeWithMetadata.getAs[Long]("version")
         val visible = nodeWithMetadata.getAs[Boolean]("visible")
   
         // Basic range/non-null checks to ensure delta-decoded metadata is 
populated
         changeset should be > 0L
         timestamp should be > 0L
         uid should be >= 0L
         version should be > 0L
         user should not be null
         user.trim.isEmpty shouldBe false
         // Accessing 'visible' as Boolean already guarantees it is non-null
         (visible == true || visible == false) shouldBe true
   ```



##########
spark/common/src/main/java/org/apache/sedona/sql/datasources/osmpbf/extractors/DenseNodeExtractor.java:
##########
@@ -63,7 +82,39 @@ private OsmNode parse(int idx, Osmformat.StringTable 
stringTable) {
 
     HashMap<String, String> tags = parseTags(stringTable);
 
-    return new OsmNode(id, lat, lon, tags);
+    OsmNode node = new OsmNode(id, lat, lon, tags);
+
+    if (hasDenseInfo) {
+      Osmformat.DenseInfo denseInfo = nodes.getDenseinfo();
+
+      // version is NOT delta-encoded
+      node.setVersion(denseInfo.getVersion(idx));
+
+      // timestamp, changeset, uid, user_sid are delta-encoded
+      long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
+      long changeset = denseInfo.getChangeset(idx) + firstChangeset;
+      int uid = denseInfo.getUid(idx) + firstUid;
+      int userSid = denseInfo.getUserSid(idx) + firstUserSid;
+
+      firstTimestamp = timestamp;
+      firstChangeset = changeset;
+      firstUid = uid;
+      firstUserSid = userSid;
+
+      node.setTimestamp(timestamp * dateGranularity);
+      node.setChangeset(changeset);
+      node.setUid(uid);
+      if (userSid > 0) {
+        node.setUser(stringTable.getS(userSid).toStringUtf8());
+      }
+

Review Comment:
   Inside the `hasDenseInfo` block, `denseInfo.getTimestamp/ getChangeset/ 
getUid/ getUserSid` are indexed unconditionally. These are repeated fields and 
may be absent or shorter than `idCount` in some files, causing 
`IndexOutOfBoundsException` during parsing. Guard each access with the 
corresponding `get*Count() > idx` (and only advance the delta accumulator for 
fields that are actually present).
   ```suggestion
         if (denseInfo.getVersionCount() > idx) {
           node.setVersion(denseInfo.getVersion(idx));
         }
   
         // timestamp, changeset, uid, user_sid are delta-encoded
         if (denseInfo.getTimestampCount() > idx) {
           long timestamp = denseInfo.getTimestamp(idx) + firstTimestamp;
           firstTimestamp = timestamp;
           node.setTimestamp(timestamp * dateGranularity);
         }
   
         if (denseInfo.getChangesetCount() > idx) {
           long changeset = denseInfo.getChangeset(idx) + firstChangeset;
           firstChangeset = changeset;
           node.setChangeset(changeset);
         }
   
         if (denseInfo.getUidCount() > idx) {
           int uid = denseInfo.getUid(idx) + firstUid;
           firstUid = uid;
           node.setUid(uid);
         }
   
         if (denseInfo.getUserSidCount() > idx) {
           int userSid = denseInfo.getUserSid(idx) + firstUserSid;
           firstUserSid = userSid;
           if (userSid > 0) {
             node.setUser(stringTable.getS(userSid).toStringUtf8());
           }
         }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [GH-2760] Extend OSM PBF reader to support additional metadata fields [sedona]

Reply via email to