haydenflinner commented on issue #2456:
URL: https://github.com/apache/iceberg/issues/2456#issuecomment-1416277410

   This appears to be the final barrier to me actually inserting data into 
Iceberg. Since I can't find any way to write to Iceberg except by involving 
Spark, some PySpark is below. Here is my current attempt, after turning off the 
Spark-side checks, to rely only on the Iceberg checks. The goal is to somehow 
convince the DataFrame that the columns are not nullable, without hand-writing 
the whole schema or incurring a ridiculous speed penalty. The DataFrame is only 
a few KB, yet going anywhere near `createDataFrame` or RDDs causes a massive 
slowdown.
   
   ```python
    import tempfile

    from pyspark.sql.functions import to_date

    # strftime seems easier than using SQL to cast datetime to a DATE logical type
    if 'my_date' in df.columns:
        df['my_date'] = df['my_date'].dt.strftime('%Y-%m-%d')

    # Write a parquet file for loading into Spark, because
    # spark.createDataFrame(pandas_df) is astoundingly slow
    path = tempfile.mktemp(dir="/tmp/hflinner/parquet", suffix=".parquet")
    df.to_parquet(path, index=False, allow_truncated_timestamps=True,
                  coerce_timestamps='us')

    spark = _get_spark()
    sdf = spark.read.parquet(path)

    # Undo the strftime hack; also try to fix nullability of the date and path columns.
    if 'my_date' in df.columns:
        # Note: to_date takes a Spark datetime pattern ('yyyy-MM-dd'),
        # not a strftime pattern ('%Y-%m-%d')
        sdf = sdf.withColumn('my_date', to_date(sdf.my_date, 'yyyy-MM-dd'))
        sdf = sdf.filter(sdf.my_date.isNotNull()).filter(sdf.backed_up_path.isNotNull())
        # These assignments only mutate the Python-side schema copy; they do not
        # change the nullability Spark actually tracks for the DataFrame.
        sdf.schema['my_date'].nullable = False
        sdf.schema['backed_up_path'].nullable = False
    sdf.writeTo(f"dev_catalog.{tablename}").append()
   
   
   -->
   IllegalArgumentException: Cannot write incompatible dataset to table with 
schema:
   table {
     1: server_name: optional string
     2: my_date: required date
     3: backed_up_path: required string
     4: backed_up_filesize: optional long
     5: num_lines: optional long
   }
   Provided schema:
   table {
     1: server_name: optional string
     2: my_date: optional date
     3: backed_up_path: optional string
     4: backed_up_filesize: optional long
   }
   Problems:
   * my_date should be required, but is optional
   * backed_up_path should be required, but is optional
   ```
   
   I don't have a schema object (the schema is written as SQL DDL for Iceberg), 
and I don't want to create an RDD.

