erikhansenwong opened a new issue, #46224:
URL: https://github.com/apache/arrow/issues/46224

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   On some calls to `Table.join_asof`, my Python process becomes unresponsive 
and uses zero CPU. It appears to be a thread deadlock or something 
similar. I have created an example that triggers the deadlock with high 
probability on my laptop.
   
   Here are the details of my setup:
   - Python 3.12.7
   - pyarrow==19.0.1
   - numpy==2.2.4
   - pandas==2.2.3
   - Ubuntu 22.04.5
   - CPU: 13th Gen Intel(R) Core(TM) i9-13980HX
   
   I was also able to produce the deadlock on a colleague's Mac laptop with 
Apple silicon using this example, so I assume it won't make a big difference 
what hardware it runs on.
   
   On my laptop, the loop in the following example always deadlocks before the 300th iteration:
   
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   
   n_left = 100
   n_right = 200_000
   left_start = pd.Timestamp("2025-04-07T07:45:55", tz="UTC")
   right_start = pd.Timestamp("2025-04-07T00:00:00", tz="UTC")
   time_end = pd.Timestamp("2025-04-07T12:05:59", tz="UTC")
   
   tolerance_nanos = 60 * 1_000_000_000
   np.random.seed(0)
   
   
   def get_timestamps(start, end, n):
       seconds = (end - start).total_seconds()
       td = np.random.uniform(0, 1, n)
       td *= np.random.choice([0, 1], n)
       td *= seconds / td.sum()
       td = td.cumsum()
       return start + pd.to_timedelta(td, "seconds")
   
   
   left_schema = pa.schema([pa.field("timestamp", pa.timestamp("ns", "UTC"))])
   right_schema = pa.schema(
       [
           pa.field("timestamp", pa.timestamp("ns", "UTC")),
           pa.field("value", pa.float64()),
       ]
   )
   
   left = pa.table(
       {"timestamp": get_timestamps(left_start, time_end, n_left)},
       schema=left_schema,
   )
   right = pa.table(
       {
           "timestamp": get_timestamps(right_start, time_end, n_right),
           "value": np.random.normal(100, 5, n_right),
       },
       schema=right_schema,
   )
   
   for i in range(1000):
       print(f"{i:>5} | {pd.Timestamp.now()}")
       left.join_asof(
           right,
           on="timestamp",
           by=[],
           tolerance=tolerance_nanos,
       )
   
   ```
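
Because the hung process sits at zero CPU and never returns, the hang can be detected automatically by running the reproduction in a child process with a time limit. The following is a minimal sketch, not part of the original report: the helper name `hangs` and the timeout values are hypothetical, and it uses only the standard library, so it makes no assumptions about pyarrow internals.

```python
import subprocess
import sys


def hangs(snippet: str, timeout_s: float) -> bool:
    """Run `snippet` in a fresh Python process.

    Returns True if the child does not finish within `timeout_s` seconds,
    which is treated here as a sign of a hang (an assumption: a slow but
    healthy run would also be reported as a hang).
    """
    try:
        subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=timeout_s,
            check=False,                # a non-zero exit is not a hang
            stdout=subprocess.DEVNULL,  # discard the per-iteration prints
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising on timeout.
        return True
    return False


if __name__ == "__main__":
    # Placeholder snippet; substitute the text of the reproduction script.
    print(hangs("import time; time.sleep(5)", timeout_s=1.0))
```

To use it with the example above, pass the script's source as `snippet` and pick a timeout comfortably longer than a normal run of the loop. Running the child as `python -X faulthandler` and sending it `SIGABRT` would additionally dump the Python-level stack of each thread, although frames inside native Arrow code would not be shown.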
   
   ### Component(s)
   
   Python

