georgikoyrushki95 opened a new issue, #45078:
URL: https://github.com/apache/arrow/issues/45078
### Describe the usage question you have. Please include as many useful
details as possible.
Hello, I have a use case in Python involving arrow flight that is
exemplified by the below snippet:
```Python
import pyarrow as pa
import pyarrow.flight as flight

def do_some_work(…):
    # … set-up
    client = flight.FlightClient(xxx)
    ticket = xxx
    reader = client.do_get(ticket)
    # Assume the table is quite large - 500 MB
    arrow_table: pa.Table = reader.read_all()
    # Assume res is a very small object, compared to
    # the size of the table.
    res = do_something_quick_with(arrow_table)
    # (A) From this point onwards arrow_table is no longer needed…
    # … rest of the pipeline that uses res and does a lot of other things …
```
The above snippet is a slight simplification. The real-world scenario is a
little more complex because the table is obtained in a library I don’t
necessarily have easy control over and is passed to user-level code.
At point `(A)` above, benchmarking in high-volume scenarios has shown that it
would be really good to free the memory held by `arrow_table`. The table
itself does not have an explicit `.close()` method or anything else indicating
we are able to free the memory associated with it. A few things I have tried are:
* Obtaining the actual RecordBatchReader and calling close on it:
```Python
reader: pa.RecordBatchReader = client.do_get(ticket).to_reader()
# … use the reader to obtain the arrow table …
# close the reader
reader.close()
```
* Deleting the reference to the arrow table via `del` and hoping at some
point GC would kick in.
* Deleting the reference via `del` and explicitly calling the GC (just for
testing, I am aware this is not a recommended practice).
In the last two cases above, just as a debugging exercise, I ended up printing
the number of references to the `arrow_table` object before calling `del`.
The expectation was that it would be 1, but it was more than that, so my
assumption is that something is held internally within the Flight framework.
The above said, my question is: **is there a deterministic way that always
works to release the memory of a `pyarrow.Table`?** I can imagine why in most
cases doing this would be quite cumbersome, and that it is best to rely on
reference counting plus the GC naturally kicking in, but in this
particular case it would be quite useful.
I would also be grateful for some pointers on the lifetime implications of
these objects in Python. It is not very clear from the documentation, for
example, whether the lifetime of `arrow_table` above is tied to the lifetime
of the `reader` and vice versa. Again, I appreciate that in 99% of cases we
shouldn’t need to care about it, but there’s still the 1% where having this
explained in a little more depth would be of great use!
P.S. There is a near-identical example I had to do within Java and the
`VectorSchemaRoot`’s API conveniently exposes a `.close()` method, which works
quite nicely in my use case.
### Component(s)
Documentation, FlightRPC, Python