KristofferC opened a new issue, #528:
URL: https://github.com/apache/arrow-julia/issues/528

   Running the following code, which generates some data and then reads it via `Arrow.Table`, shows a severe slowdown when Julia is started with multiple threads:
   
   ```julia
   using DataFrames, Dates, Arrow, StatsBase, Random, InlineStrings
   
   function generate_data(f)
       number_of_companies = 10000
       dates = collect(Date(2001,1,1):Day(1):Date(2020,12,31))
       companyid = sample(100000:1000000, number_of_companies, replace = false)
   
       number_of_items = length(companyid)*length(dates)
   
       df = DataFrame(
               dates = repeat(dates, outer = number_of_companies),
               companyid = repeat(companyid, inner = length(dates)),
               item1 = rand(number_of_items),
               item2 = randn(number_of_items),
               item3 = rand(1:1000,number_of_items),
               item4 = repeat([String7(randstring(['a':'z' 'A':'Z'],5)) for _ in 1:number_of_companies],length(dates))
           )
   
       @info "Saving to $f"
       open(f, "w") do f
           Arrow.write(f, Tables.partitioner(groupby(df,:dates)))
       end
   end
   
   f = "mytestdata.arrow"
   
   if !isfile(f)
       generate_data(f)
   end
   
   Arrow.Table(f)
   @time Arrow.Table(f)
   ```
   
   Results:
   
   ```
   ❯julia arrowthreads.jl
     0.203852 seconds (2.38 M allocations: 126.388 MiB, 34.93% gc time, 1.32% compilation time)
   
   ❯ julia  --project --threads=3 arrowthreads.jl
     6.603782 seconds (2.39 M allocations: 126.349 MiB, 0.46% gc time)
   ```
   
   We can see that `Arrow.Table` spawns a task here: https://github.com/apache/arrow-julia/blob/2696105d01cfda7c55d1902951a20908a3c205e5/src/table.jl#L525C18-L528
   Profiling shows that almost all of the time is spent waiting on the lock in https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/5fced8291da84bd081cb2e27d2e16f5bc8081f38/src/synchronizer.jl#L108.
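
   For anyone who wants to reproduce the profile, here is a minimal sketch using the `Profile` standard library. It assumes the `mytestdata.arrow` file from the script above already exists. The exact commands and options behind the profile mentioned above are not given here, so the `Profile.print` settings below are just one reasonable choice:

   ```julia
   using Arrow, Profile

   f = "mytestdata.arrow"   # file written by generate_data in the script above

   Arrow.Table(f)           # warm-up run so compilation is not profiled
   Profile.clear()          # drop any samples left over from earlier runs
   @profile Arrow.Table(f)  # collect samples while the table is read

   # Flat listing sorted by sample count; frames with fewer than 10 samples are hidden.
   Profile.print(format = :flat, sortedby = :count, mincount = 10)
   ```

   Running this with `julia --project --threads=3` matches the slow case shown in the results.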

