georgikoyrushki95 opened a new issue, #45078:
URL: https://github.com/apache/arrow/issues/45078

   ### Describe the usage question you have. Please include as many useful 
details as possible.
   
   
   Hello, I have a Python use case involving Arrow Flight that is 
illustrated by the snippet below:
   
   ```Python
   import pyarrow as pa
   import pyarrow.flight as flight
   
   def do_some_work(…):
        # … set-up
        client = flight.FlightClient(xxx)
        ticket = xxx
       
        reader = client.do_get(ticket)
        # Assume the table is quite large - 500 MB
        arrow_table: pa.Table = reader.read_all()
        
        # Assume res is a very small object, compared to
        # the size of the table.
        res = do_something_quick_with(arrow_table)
       
        # (A) From this point onwards arrow_table is no longer needed…
   
        # … rest of the pipeline that uses res and does a lot of other things …
   ```
   
   The above snippet is a slight simplification. The real-world scenario is a 
little more complex because the table is obtained in a library I don’t 
necessarily have easy control over and is passed to user-level code.
   
   Benchmarking in high-volume scenarios has shown that freeing the memory 
held by `arrow_table` at point `(A)` above would help considerably. The table 
itself has no explicit `.close()` method or anything else indicating that the 
memory associated with it can be released. A few things I have tried:
   
   * Obtaining the actual RecordBatchReader and calling close on it:
   ```Python
   
   reader: RecordBatchReader = client.do_get(ticket).to_reader()
   # … use the reader to obtain the arrow table …
   
   # close the reader
   reader.close()
   ```
   * Deleting the reference to the arrow table via `del` and hoping at some 
point GC would kick in. 
   * Deleting the reference via `del` and explicitly calling the GC (just for 
testing, I am aware this is not a recommended practice).
   
   In the last two cases, as a debugging exercise, I printed the number of 
references to the `arrow_table` object before calling `del`. I expected it to 
be 1, but it was higher, so my assumption is that something inside the Flight 
framework keeps holding a reference.
   
   That said, my question is: **is there a deterministic way that always 
works to release the memory of a `pyarrow.Table`?** I can imagine that in most 
cases doing this would be cumbersome and that it is best to rely on 
reference counting plus the GC kicking in naturally, but in this 
particular case it would be quite useful.
   
   I would also be grateful for some pointers on the lifetime 
implications of these objects in Python. It is not very clear from the 
documentation, for example, whether the lifetime of `arrow_table` above is tied 
to the lifetime of the `reader` and vice versa. Again, I appreciate that in 99% of 
the cases we shouldn’t need to care about it, but for the remaining 1% 
having this explained in a little more depth would be of great use!
   
   P.S. I had to implement a near-identical example in Java, where the 
`VectorSchemaRoot` API conveniently exposes a `.close()` method, which works 
quite nicely for my use case.
   
   
   
   ### Component(s)
   
   Documentation, FlightRPC, Python

