KristofferC opened a new issue, #528: URL: https://github.com/apache/arrow-julia/issues/528
Running the following code which generates some data and then reads it via `Arrow.Table` shows a very bad slow down when using threads: ```julia using DataFrames, Dates, Arrow, StatsBase, Random, InlineStrings function generate_data(f) number_of_companies = 10000 dates = collect(Date(2001,1,1):Day(1):Date(2020,12,31)) companyid = sample(100000:1000000, number_of_companies, replace = false) number_of_items = length(companyid)*length(dates) df = DataFrame( dates = repeat(dates, outer = number_of_companies), companyid = repeat(companyid, inner = length(dates)), item1 = rand(number_of_items), item2 = randn(number_of_items), item3 = rand(1:1000,number_of_items), item4 = repeat([String7(randstring(['a':'z' 'A':'Z'],5)) for _ in 1:number_of_companies],length(dates)) ) @info "Saving to $f" open(f, "w") do f Arrow.write(f, Tables.partitioner(groupby(df,:dates))) end end f = "mytestdata.arrow" if !isfile(f) generate_data(f) end Arrow.Table(f) @time Arrow.Table(f) ``` Results: ``` ❯julia arrowthreads.jl 0.203852 seconds (2.38 M allocations: 126.388 MiB, 34.93% gc time, 1.32% compilation time) ❯ julia --project --threads=3 arrowthreads.jl 6.603782 seconds (2.39 M allocations: 126.349 MiB, 0.46% gc time) ``` We can see that `Arrow.Table` spawns a task here https://github.com/apache/arrow-julia/blob/2696105d01cfda7c55d1902951a20908a3c205e5/src/table.jl#L525C18-L528 and from profiling we are spending almost all time waiting on the lock in https://github.com/JuliaServices/ConcurrentUtilities.jl/blob/5fced8291da84bd081cb2e27d2e16f5bc8081f38/src/synchronizer.jl#L108. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org