[ 
https://issues.apache.org/jira/browse/HADOOP-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080717#comment-18080717
 ] 

ASF GitHub Bot commented on HADOOP-19863:
-----------------------------------------

steveloughran opened a new pull request, #8495:
URL: https://github.com/apache/hadoop/pull/8495

   
   
   Contributed by Steve Loughran.
   
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue ID 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   ### AI Tooling
   
   If an AI tool was used:
   
   - [ ] The PR includes the phrase "Contains content generated by <tool>"
         where <tool> is the name of the AI tool used.
   - [ ] My use of AI contributions follows the ASF legal policy
         https://www.apache.org/legal/generative-tooling.html




> Incorrect Vectored IO metrics from Local Filesystem
> ---------------------------------------------------
>
>                 Key: HADOOP-19863
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19863
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: fs
>    Affects Versions: 3.5.0
>            Reporter: Peter Toth
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.1
>
>         Attachments: Screenshot 2026-04-16 at 19.02.30.png, Screenshot 
> 2026-04-16 at 19.03.51.png
>
>
> As discussed in 
> [https://github.com/apache/parquet-java/issues/2703#issuecomment-4260121705] 
> we noticed that when vectored IO is enabled, the {{BytesRead}} metrics of 
> Spark tasks are not correct.
> Spark fetches that metric via {{FileSystem.getAllStatistics}}; see
>  - 
> [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L98-L109]
>  and
>  - 
> [https://github.com/apache/spark/blob/5d491f62748b4b9c34bc3b5bd7390f7b5ca75053/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L164-L170]
> Repro with latest Spark 4.2.0-SNAPSHOT using Hadoop 3.5.0:
> Vectored IO is enabled by default:
> {code:java}
> ➜  bin/spark-shell
> scala> spark.createDataFrame((0 until 5000).map(i => (i, s"left_$i"))).repartition(1).write.parquet("/tmp/t2")
> scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
> scala> sql("SELECT * FROM t2").collect()
> {code}
> !Screenshot 2026-04-16 at 19.02.30.png|width=85%!
> Vectored IO is disabled explicitly:
> {code:java}
> ➜  bin/spark-shell --conf spark.hadoop.parquet.hadoop.vectored.io.enabled=false
> scala> spark.read.parquet("/tmp/t2").createOrReplaceTempView("t2")
> scala> sql("SELECT * FROM t2").collect()
> {code}
> !Screenshot 2026-04-16 at 19.03.51.png|width=85%!
> In my case the generated test file size was ~45KB:
> {code:java}
> ➜  ls -ll /tmp/t2
> total 88
> -rw-r--r--@ 1 ptoth  wheel      0 Apr 16 18:57 _SUCCESS
> -rw-r--r--@ 1 ptoth  wheel  44944 Apr 16 18:57 part-00000-cf825cf6-2fa5-46a2-b897-dbb9dc9828a7-c000.snappy.parquet
> {code}
> I believe reading the Parquet footers doesn't go through vectored IO, so the 
> decreased 1680B probably belongs to that.
> There is no data pruning in the query, so the metric value should be around 
> the file size.
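
The hypothesis in the description can be illustrated with a small, self-contained sketch. This is not Hadoop code: `BytesReadSketch`, `readRange` and `simulate` are hypothetical names, the `AtomicLong` merely stands in for a filesystem bytes-read counter, and treating the 1680 B figure as the footer read is an assumption taken from the guess above. It only shows how a counter that skips ranged reads would report the footer bytes alone, while a counter that includes them would report roughly the file size.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; none of this is Hadoop or Spark code. */
public class BytesReadSketch {

    /** Stand-in for a per-filesystem bytes-read counter. */
    static final AtomicLong bytesRead = new AtomicLong();

    /** Positional read of [offset, offset + length); updates the counter only if asked to. */
    static ByteBuffer readRange(FileChannel ch, long offset, int length, boolean count)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, offset + buf.position());
            if (n < 0) {
                break; // end of file reached before the range was filled
            }
        }
        if (count) {
            bytesRead.addAndGet(buf.position());
        }
        buf.flip();
        return buf;
    }

    /** Reads a trailing "footer" (always counted), then the rest of the file as two ranges. */
    static long simulate(Path file, int footerLen, boolean countRanges) throws IOException {
        bytesRead.set(0);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long dataLen = ch.size() - footerLen;
            readRange(ch, dataLen, footerLen, true);                  // footer read
            int half = (int) (dataLen / 2);
            readRange(ch, 0, half, countRanges);                      // first data range
            readRange(ch, half, (int) (dataLen - half), countRanges); // second data range
        }
        return bytesRead.get();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bytes-read-sketch", ".bin");
        try {
            Files.write(file, new byte[44944]); // same size as the test file in the listing
            System.out.println("ranges not counted: " + simulate(file, 1680, false));
            System.out.println("ranges counted:     " + simulate(file, 1680, true));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With the ranged reads left uncounted, the sketch reports 1680 bytes; with them counted, it reports 44944 bytes, the full file size, which is the kind of value the description expects from the metric.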



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
