fpacanowski opened a new issue, #45117: URL: https://github.com/apache/arrow/issues/45117
### Describe the usage question you have. Please include as many useful details as possible.

I'm experiencing performance issues when reading Parquet files in Ruby. I've created a very simple synthetic benchmark: a table with a single `float` column and 1,000,000 rows. I want to read this data into an array of hashes. Here's my code:

```ruby
require 'arrow'
require 'parquet'
require 'benchmark'

def read_parquet
  table = Arrow::TableLoader.load('data.parquet', { format: :parquet })
  table.each_record(reuse_record: true).map(&:to_h)
end

Benchmark.bmbm do |x|
  x.report("read_parquet") { read_parquet }
end
```

And the report:

```
$ bundle exec ruby --yjit read.rb
Rehearsal ------------------------------------------------
read_parquet  20.894371   0.129733  21.024104 ( 21.066472)
-------------------------------------- total: 21.024104sec

                   user     system      total        real
read_parquet  21.872816   0.079736  21.952552 ( 21.991875)
```

I also tested equivalent code in Python:

```python
import pyarrow.parquet as pq
import timeit

def read_parquet():
    table = pq.read_table('data.parquet')
    return table.to_pylist()

time_taken = timeit.timeit(read_parquet, number=10)  # Run 10 times
print(f"Average time: {time_taken / 10:.6f} seconds")
```

which yields:

```
$ poetry run python read.py
Average time: 0.610864 seconds
```

This means that the Ruby version is **30-40x** slower than pyarrow. Is there anything I can do to improve the performance here?

For completeness, here's the script that generates the test data (the `data.parquet` file):

```ruby
require 'arrow'
require 'parquet'
require 'benchmark'

schema = Arrow::Schema.new([Arrow::Field.new("foo", :float)])
data = 1_000_000.times.map { {foo: rand} }
table = Arrow::RecordBatchBuilder.build(schema, data).to_table
table.save('data.parquet', format: :parquet, compression: :uncompressed)
```

### Component(s)

Ruby

--
This is an automated message from the Apache Git Service.
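As a rough check on the 30-40x figure quoted in the issue, the arithmetic can be sketched in a few lines of Ruby. The timings are copied from the two reports in the message (the `real` column of the measured `Benchmark.bmbm` run and the `timeit` average); the helper names `slowdown` and `microseconds_per_row` are hypothetical and exist only for this illustration. Note that the two numbers come from different harnesses, so the ratio is only an approximation.

```ruby
# Wall-clock timings taken from the reports quoted in the issue.
ruby_seconds   = 21.991875  # "real" column of the measured Benchmark.bmbm run
python_seconds = 0.610864   # timeit average over 10 runs
rows           = 1_000_000  # rows in the synthetic table

# Hypothetical helper: how many times slower `slow` is than `fast`.
def slowdown(slow, fast)
  slow / fast
end

# Hypothetical helper: average cost per row in microseconds.
def microseconds_per_row(seconds, rows)
  seconds * 1_000_000 / rows
end

puts format("slowdown: %.1fx", slowdown(ruby_seconds, python_seconds))
puts format("Ruby:   %.2f us/row", microseconds_per_row(ruby_seconds, rows))
puts format("Python: %.2f us/row", microseconds_per_row(python_seconds, rows))
```

With these inputs the ratio comes out at roughly 36x, which corresponds to about 22 microseconds per row in Ruby against well under 1 microsecond per row in Python.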