jiayuasu opened a new pull request, #2776: URL: https://github.com/apache/sedona/pull/2776
## Did you read the Contributor Guide?

- Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Developer Guide](https://sedona.apache.org/latest/community/develop/)

## Is this PR related to a ticket?

- Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2760

## What changes were proposed in this PR?

Extend the OSM PBF reader to extract the following metadata fields from `Info` (for Node/Way/Relation) and `DenseInfo` (for DenseNodes) protobuf messages:

- **`changeset`** (BIGINT): the changeset ID the entity belongs to
- **`timestamp`** (TIMESTAMP): when the entity was last modified
- **`uid`** (INT): the user ID of the last editor
- **`user`** (STRING): the username of the last editor (resolved from the string table via `user_sid`)
- **`version`** (INT): the entity version number
- **`visible`** (BOOLEAN): whether the entity is visible (relevant for history files)

These fields are part of the standard OSM PBF format specification but were previously ignored by the reader.

### Key implementation details

- Added `InfoResolver` utility class to extract Info fields from `Osmformat.Info` for nodes, ways, and relations
- Extended `DenseNodeExtractor` to decode DenseInfo fields (delta-encoded for timestamp, changeset, uid, user_sid)
- Added metadata fields with setters to `OSMEntity` to avoid constructor bloat
- Updated `SchemaProvider` to include the 6 new columns in the output schema
- Updated `OsmPartitionReader` to map the new fields into Spark `InternalRow`
- Passed `PrimitiveBlock` (instead of just `StringTable`) to `WayIterator` and `RelationIterator` so they can access both the string table and `date_granularity` for timestamp conversion

## How was this patch tested?

Added 3 new tests to `OsmReaderTest`:

1. **Metadata fields populated for all entities**: verifies that `version`, `timestamp`, and `changeset` are non-null for all nodes/ways/relations in the Monaco PBF dataset, that timestamps are in a reasonable range, and that version >= 1 and changeset >= 0
2. **Schema includes metadata for dense nodes**: verifies the 6 new fields appear in the schema when reading dense node PBF files
3. **Schema includes metadata for normal nodes**: verifies the 6 new fields appear in the schema when reading normal node PBF files

All 10 existing tests continue to pass.

## Did this PR include necessary documentation updates?

- No, this PR does not affect any public API, so there is no need to change the documentation. The new fields are automatically available in the output schema when reading OSM PBF files.
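
For reviewers unfamiliar with the DenseInfo encoding mentioned under the key implementation details, the following is a minimal, self-contained sketch of the general technique: delta-encoded columns are restored with a running sum, raw timestamps are scaled by `date_granularity` (milliseconds per unit, 1000 by default in the OSM PBF format), and `user_sid` is used as an index into the string table. The class name `DenseInfoDeltaSketch`, its method names, and the sample values are hypothetical and chosen for illustration only; this is not the code in `DenseNodeExtractor` or `InfoResolver`.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical illustration only; not part of the Sedona code base.
final class DenseInfoDeltaSketch {

    private DenseInfoDeltaSketch() {}

    /** Restores absolute values from a delta-encoded column using a running sum. */
    static long[] decodeDeltas(long[] deltas) {
        long[] absolute = new long[deltas.length];
        long running = 0L;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            absolute[i] = running;
        }
        return absolute;
    }

    /** Converts a raw PBF timestamp to an Instant: millis = raw * date_granularity. */
    static Instant toInstant(long rawTimestamp, int dateGranularityMillis) {
        return Instant.ofEpochMilli(rawTimestamp * dateGranularityMillis);
    }

    /** Looks up the username for an already delta-decoded user_sid. */
    static String resolveUser(List<String> stringTable, long userSid) {
        return stringTable.get((int) userSid);
    }

    public static void main(String[] args) {
        // Made-up sample columns for three dense nodes.
        long[] timestampDeltas = {1_700_000_000L, 5L, -2L};
        long[] userSidDeltas = {1L, 1L, -1L};
        // Assumed string table; index 0 is conventionally the empty string.
        List<String> stringTable = List.of("", "alice", "bob");

        long[] timestamps = decodeDeltas(timestampDeltas);
        long[] userSids = decodeDeltas(userSidDeltas);

        for (int i = 0; i < timestamps.length; i++) {
            Instant modified = toInstant(timestamps[i], 1000);
            String user = resolveUser(stringTable, userSids[i]);
            System.out.println(modified + " " + user);
        }
    }
}
```

Note that `version` and `visible` are not delta-encoded in the OSM PBF format, so a sketch like this would only apply to the timestamp, changeset, uid, and user_sid columns.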
