[I] Parquet bloom filter doesn't work with nested fields [iceberg]

via GitHub Fri, 08 Mar 2024 07:06:25 -0800


hussein-awala opened a new issue, #9898:
URL: https://github.com/apache/iceberg/issues/9898


   ### Apache Iceberg version
   
   1.4.3 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I have an Iceberg table and want to create two bloom filters: one on a root 
string column (`a`) and one on a string column nested in a struct (`b.c`). I set 
the properties `write.parquet.bloom-filter-enabled.column.a` and 
`write.parquet.bloom-filter-enabled.column.b.c` to `true`, then checked a written 
data file with `parquet-cli`. The root column has a bloom filter, but the nested 
column does not:
   ```bash
   $ parquet bloom-filter /path/to/file.parquet -c a -v <not existing value>
   
   Row group 0:
   
--------------------------------------------------------------------------------
   value <not existing value> NOT exists.
   
   $ parquet bloom-filter /path/to/file.parquet -c a -v <existing value>
   
   Row group 0:
   
--------------------------------------------------------------------------------
   value <existing value> maybe exists.
   
   $ parquet bloom-filter /path/to/file.parquet -c b.c -v <some value>
   
   Row group 0:
   
--------------------------------------------------------------------------------
   column b.c has no bloom filter
   
   # check if it's an issue with column name parsing:
   $ parquet bloom-filter /path/to/file.parquet -c b.d -v <some value>
   Argument error: Schema doesn't have column: b.d
   ```
   
   However, I tried the same thing with plain Spark and Parquet (no Iceberg), and it worked without any issue:
   ```scala
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.Row
   import spark.implicits._
   
   val schema = StructType(Array(
       StructField("a", StringType, true),
       StructField("b", StringType, true),
       StructField("nested", StructType(Array(
         StructField("c", StringType, true),
         StructField("d", StringType, true)
       )), true)
   ))
   
   val data = Seq(
       Row("1", "25", Row("100", "a")),
       Row("2", "30", Row("200", "b")),
       Row("3", "35", Row("300", "c")),
       Row("4", "40", Row("400", "d")),
       Row("5", "45", Row("500", "e"))
   )
   
   val df = spark.createDataFrame(
       spark.sparkContext.parallelize(data),
       schema
   )
   
   df.write.format("parquet")
       .option("parquet.bloom.filter.enabled#a", "true")
       .option("parquet.bloom.filter.enabled#nested.c", "true")
       .save("bloom_parquet")
   ```
   Checking the output with `parquet-cli`:
   ```bash
   $ parquet bloom-filter \
       bloom_parquet/part-00002-9fac4c38-7113-45df-8db9-d96c3f6b6a8e-c000.snappy.parquet \
       -c a -v "1"
   
   Row group 0:
   
--------------------------------------------------------------------------------
   value 1 maybe exists.
   
   $ parquet bloom-filter \
       bloom_parquet/part-00002-9fac4c38-7113-45df-8db9-d96c3f6b6a8e-c000.snappy.parquet \
       -c nested.c -v "1"
   
   Row group 0:
   
--------------------------------------------------------------------------------
   value 1 NOT exists.
   ```
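   
   For reference, the two Iceberg table properties mentioned at the top can be set from Spark along these lines. This is only a sketch: the table name `my_catalog.db.events` is a hypothetical placeholder, and it assumes an active `spark` session with an Iceberg catalog already configured.
   ```scala
   // Assumptions (illustrative only):
   //   - `spark` is an active SparkSession with an Iceberg catalog configured
   //   - `my_catalog.db.events` is a hypothetical Iceberg table with a string
   //     column `a` and a struct column `b` containing a string field `c`
   spark.sql("""
     ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
       'write.parquet.bloom-filter-enabled.column.a' = 'true',
       'write.parquet.bloom-filter-enabled.column.b.c' = 'true'
     )
   """)
   ```
   Table properties only affect files written after the change, so the `parquet bloom-filter` check should be run against a newly written data file.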
   
   


