jiayuasu opened a new pull request, #2776:
URL: https://github.com/apache/sedona/pull/2776

   ## Did you read the Contributor Guide?
   
   - Yes, I have read the [Contributor 
Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor 
Developer Guide](https://sedona.apache.org/latest/community/develop/)
   
   ## Is this PR related to a ticket?
   
   - Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2760
   
   ## What changes were proposed in this PR?
   
   Extend the OSM PBF reader to extract the following metadata fields from 
`Info` (for Node/Way/Relation) and `DenseInfo` (for DenseNodes) protobuf 
messages:
   
   - **`changeset`** (BIGINT) — the changeset ID the entity belongs to
   - **`timestamp`** (TIMESTAMP) — when the entity was last modified
   - **`uid`** (INT) — the user ID of the last editor
   - **`user`** (STRING) — the username of the last editor (resolved from the 
string table via `user_sid`)
   - **`version`** (INT) — the entity version number
   - **`visible`** (BOOLEAN) — whether the entity is visible (relevant for 
history files)
   
   These fields are part of the standard OSM PBF format specification but were 
previously ignored by the reader.
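
To make the field semantics concrete, here is a minimal sketch of how a raw `Info.timestamp` and `user_sid` might be turned into the `timestamp` and `user` values. It assumes the OSM PBF convention that timestamps are stored in units of `date_granularity` milliseconds (default 1000). The class and method names (`InfoFieldsSketch`, `toInstant`, `resolveUser`) are hypothetical and are not the `InfoResolver` API from this PR; the string table is modeled as a plain `List<String>`.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical helper for illustration only; not the PR's InfoResolver.
final class InfoFieldsSketch {

    // Raw PBF timestamps are counted in units of dateGranularity milliseconds
    // since the Unix epoch (the format's default granularity is 1000).
    static Instant toInstant(long rawTimestamp, int dateGranularity) {
        return Instant.ofEpochMilli(rawTimestamp * (long) dateGranularity);
    }

    // Looks up the username by its string-table index (user_sid).
    // Returning null for an out-of-range index is an assumption of this sketch.
    static String resolveUser(List<String> stringTable, int userSid) {
        if (userSid < 0 || userSid >= stringTable.size()) {
            return null;
        }
        return stringTable.get(userSid);
    }
}
```

With the default granularity, a raw value of 1600000000 becomes 1600000000000 epoch milliseconds.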
   
   ### Key implementation details
   
   - Added `InfoResolver` utility class to extract Info fields from 
`Osmformat.Info` for nodes, ways, and relations
   - Extended `DenseNodeExtractor` to decode DenseInfo fields (`timestamp`, 
`changeset`, `uid`, and `user_sid` are delta-encoded)
   - Added metadata fields with setters to `OSMEntity` to avoid constructor 
bloat
   - Updated `SchemaProvider` to include the 6 new columns in the output schema
   - Updated `OsmPartitionReader` to map the new fields into Spark `InternalRow`
   - Passed `PrimitiveBlock` (instead of just `StringTable`) to `WayIterator` 
and `RelationIterator` so they can access both the string table and 
`date_granularity` for timestamp conversion
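
The delta decoding mentioned for DenseInfo can be sketched as a running sum: each stored value is the difference from the previous element, so the absolute value is recovered by accumulating. This is an illustrative sketch only, assuming the signed deltas have already been decoded by the protobuf library. `DenseDeltaSketch` and `decodeDeltas` are hypothetical names, not code from `DenseNodeExtractor`.

```java
// Hypothetical helper for illustration only; not the PR's DenseNodeExtractor.
final class DenseDeltaSketch {

    // Turns a delta-encoded sequence into absolute values via a running sum.
    // Applies to fields such as timestamp, changeset, uid, and user_sid.
    static long[] decodeDeltas(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];   // delta relative to the previous element
            values[i] = running;    // absolute value for element i
        }
        return values;
    }
}
```

For example, the deltas `{100, 5, -3}` decode to `{100, 105, 102}`.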
   
   ## How was this patch tested?
   
   Added 3 new tests to `OsmReaderTest`:
   1. **Metadata fields populated for all entities**: verifies that `version`, 
`timestamp`, and `changeset` are non-null for all nodes, ways, and relations in 
the Monaco PBF dataset, that timestamps fall in a reasonable range, and that 
version >= 1 and changeset >= 0
   2. **Schema includes metadata for dense nodes**: verifies the 6 new fields 
appear in the schema when reading dense node PBF files
   3. **Schema includes metadata for normal nodes**: verifies the 6 new fields 
appear in the schema when reading normal node PBF files
   
   All 10 existing tests continue to pass.
   
   ## Did this PR include necessary documentation updates?
   
   - No, this PR does not change any public API, so no documentation update is 
needed. The new fields appear automatically in the output schema when reading 
OSM PBF files.

