[GitHub] [iceberg] JonasJ-ap commented on a diff in pull request #7873: Python: Avro write

via GitHub Mon, 03 Jul 2023 16:19:58 -0700


JonasJ-ap commented on code in PR #7873:
URL: https://github.com/apache/iceberg/pull/7873#discussion_r1251334009



##########
python/pyiceberg/avro/file.py:
##########
@@ -204,3 +214,58 @@ def __next__(self) -> D:
 
     def _read_header(self) -> AvroFileHeader:
         return construct_reader(META_SCHEMA, {-1: AvroFileHeader}).read(self.decoder)
+
+
+class AvroOutputFile(Generic[D]):
+    output_file: OutputFile
+    output_stream: OutputStream
+    schema: Schema
+    schema_name: str
+    encoder: BinaryEncoder
+    sync_bytes: bytes
+    writer: Writer
+
+    def __init__(self, output_file: OutputFile, schema: Schema, schema_name: str) -> None:
+        self.output_file = output_file
+        self.schema = schema
+        self.schema_name = schema_name
+        self.sync_bytes = os.urandom(SYNC_SIZE)
+        self.writer = construct_writer(self.schema)
+
+    def __enter__(self) -> AvroOutputFile[D]:
+        """
+        Opens the file and writes the header.
+
+        Returns:
+            The file object to write records to
+        """
+        self.output_stream = self.output_file.create(overwrite=True)
+        self.encoder = BinaryEncoder(self.output_stream)
+
+        self._write_header()
+        self.writer = construct_writer(self.schema)
+
+        return self
+
+    def __exit__(
+        self, exctype: Optional[Type[BaseException]], excinst: Optional[BaseException], exctb: Optional[TracebackType]
+    ) -> None:
+        """Performs cleanup when exiting the scope of a 'with' statement."""
+        self.output_stream.close()
+
+    def _write_header(self) -> None:
+        json_schema = json.dumps(AvroSchemaConversion().iceberg_to_avro(self.schema, schema_name=self.schema_name))
+        header = AvroFileHeader(magic=MAGIC, meta={_SCHEMA_KEY: json_schema, _CODEC_KEY: "null"}, sync=self.sync_bytes)

Review Comment:
   I am trying to implement a manifest writer and a manifest list writer based 
on this PR. While reviewing, I noticed that Iceberg requires manifest files to 
store the partition spec and other metadata ("schema", "format-version") in the 
Avro file's key-value metadata. Similarly, the manifest list file stores 
"snapshot-id", "parent-snapshot-id", etc. (see 
https://iceberg.apache.org/spec/#manifests).
   
   Considering this, I think we may want to allow extra metadata to be written 
to the Avro output file header:
   ```python
   def __init__(self, output_file: OutputFile, schema: Schema, schema_name: str, metadata: Dict[str, str] = EMPTY_DICT) -> None:
     ...
     self.metadata = metadata
   
   def _write_header(self) -> None:
     ...
     meta = {**{_SCHEMA_KEY: json_schema, _CODEC_KEY: "null"}, **self.metadata}
     header = AvroFileHeader(magic=MAGIC, meta=meta, sync=self.sync_bytes)
   ```
   Would you mind sharing your thoughts on this? If there is anything I may 
have misunderstood, please let me know.
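
   To make the merge step concrete, here is a minimal standalone sketch. It is 
not code from this PR: `build_header_meta` is a hypothetical helper, and the 
key values `"avro.schema"` and `"avro.codec"` are assumed to match 
`_SCHEMA_KEY` and `_CODEC_KEY`. Unlike the snippet above, where caller 
metadata silently overrides the required entries, this variant rejects 
conflicting keys; that is one possible design choice, not a requirement of the 
spec.

```python
from typing import Dict, Mapping, Optional

# Assumed values for the header keys used in the PR (_SCHEMA_KEY, _CODEC_KEY).
_SCHEMA_KEY = "avro.schema"
_CODEC_KEY = "avro.codec"


def build_header_meta(json_schema: str, extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical helper: combine the required Avro entries with caller metadata."""
    meta = {_SCHEMA_KEY: json_schema, _CODEC_KEY: "null"}
    for key, value in (extra or {}).items():
        # Refuse to overwrite the schema or codec entries written by the file itself.
        if key in meta:
            raise ValueError(f"Metadata key is reserved: {key}")
        meta[key] = value
    return meta


# Example: properties a manifest writer might pass (values are illustrative).
manifest_meta = build_header_meta(
    '{"type": "record", "name": "manifest_entry", "fields": []}',
    {"format-version": "2", "partition-spec-id": "0"},
)
```

   The resulting dictionary could then be passed as `meta` when constructing 
the `AvroFileHeader` in `_write_header`.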



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
