CurtHagenlocher opened a new issue, #179:
URL: https://github.com/apache/arrow-dotnet/issues/179

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Moved from 
[https://github.com/apache/arrow/issues/37069](https://github.com/apache/arrow/issues/37069).
 Original text follows:
   
   I have some C# code using the Apache.Arrow 12.0.1 NuGet package to write .feather files with `ArrowFileWriter.WriteRecordBatch()` and its companions. These files are then read in R 4.2 using `read_feather()` from the arrow 12.0.0 package (Windows 10 22H2, RStudio 2023.06.1). This works fine for files with a single record batch up to at least 2,129,214,698 bytes (1.98 GB). Much above that (the next file size my code happens to produce is 2,176,530,466 bytes, 2.02 GB), `read_feather()` fails with
   ```R
   Error: IOError: Invalid IPC message: negative bodyLength
   ```
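   For context, the write path looks roughly like the sketch below. Column names, types, and row counts are placeholders rather than my actual schema; the point is just that a single large batch goes through `ArrowFileWriter.WriteRecordBatch()`.
   ```csharp
   using System.IO;
   using System.Linq;
   using Apache.Arrow;
   using Apache.Arrow.Ipc;

   // Build one large record batch (placeholder data; the real file has
   // roughly 44M rows spread across several columns).
   const int rowCount = 44_000_000;
   var batch = new RecordBatch.Builder()
       .Append("example_int32", false,
           col => col.Int32(a => a.AppendRange(Enumerable.Range(0, rowCount))))
       .Build();

   // Write it as a single batch to an Arrow IPC (.feather) file.
   using var stream = File.OpenWrite("example.feather");
   using var writer = new ArrowFileWriter(stream, batch.Schema);
   writer.WriteRecordBatch(batch);
   writer.WriteEnd();
   ```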
   The error appears to be coming from [CheckMetadataAndGetBodyLength()](https://github.com/apache/arrow/blob/main/cpp/src/arrow/ipc/message.cc#L188), which takes `bodyLength` as an `int64_t`. So presumably something upstream of `CheckMetadataAndGetBodyLength()` is handling `bodyLength` as a 32-bit signed value and inducing integer rollover.
   
   The error message is a bit cryptic, and from a brief look at the code it's unclear how a block's `bodyLength` relates to file and record batch size. Since forcing the total (uncompressed) size of a record batch below 2 GB avoids the negative bodyLength error, it appears `CheckMetadataAndGetBodyLength()` is picking up something controlled by record batch size rather than file size. So far I've tested multi-batch files up to 3.6 GB and `read_feather()` handles them fine.
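   To illustrate the rollover I suspect: if anything along the read path narrows the 64-bit `bodyLength` to 32 bits, the body size of the failing file comes out negative. This is just a sketch of the suspected mechanism, not a claim about where in the code it happens.
   ```csharp
   using System;

   // The failing file is 2,176,530,466 bytes; narrowing a value in that range
   // from 64 to 32 bits wraps it negative, matching "negative bodyLength".
   long bodyLength = 2_176_530_466;
   int truncated = unchecked((int)bodyLength);
   Console.WriteLine(truncated); // -2118436830
   ```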
   
   The [Arrow spec](https://arrow.apache.org/docs/format/Columnar.html) seems to suggest compliant Arrow implementations support array lengths up to 2³¹ - 1 elements. The record batch format here is 11 four-byte columns, plus a one-byte column and a couple of two-byte columns, so one reading of the spec is that files (or maybe record batches) up to 98 GB should be compatible with the recommended limits on multi-language use.
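   Back-of-the-envelope arithmetic for that 98 GB figure, assuming the schema is eleven 4-byte columns, one 1-byte column, and two 2-byte columns:
   ```csharp
   using System;

   // 11 * 4 + 1 + 2 * 2 = 49 bytes per row; at the spec's 2^31 - 1 elements
   // per array, that's roughly 98 GiB of fixed-width data.
   long bytesPerRow = 11 * 4 + 1 + 2 * 2;           // 49
   long maxRows = int.MaxValue;                     // 2^31 - 1
   double maxGiB = bytesPerRow * (double)maxRows / (1L << 30);
   Console.WriteLine(maxGiB);                       // ~98
   ```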
   
   I'm including the C++ tag on this as I'm not seeing where C# or R would be inducing rollover, which suggests it may be a lower-level issue or perhaps something at the managed-unmanaged boundary. [`ArrowBuffer`](https://github.com/apache/arrow/blob/main/csharp/src/Apache.Arrow/ArrowBuffer.cs#L51) is built on [`ReadOnlyMemory<byte>`](https://learn.microsoft.com/en-us/dotnet/api/system.readonlymemory-1?view=net-7.0) and therefore constrained by its int32 `Length` property. I'm not sure whether that's considered to meet the spec, but it doesn't appear to be causing rollover here: the 2.02 GB file is only 44M records, so no more than roughly 170 MB per column even when everything is in one `RecordBatch`. The unchecked int32 conversion in [`ArrowMemoryReaderImplementation.ReadNextRecordBatch()`](https://github.com/apache/arrow/blob/main/csharp/src/Apache.Arrow/Ipc/ArrowMemoryReaderImplementation.cs#L83) probably isn't involved either. Since I'm not getting an exception at write time, the checked C# conversions from `long` can be ruled out as well.
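   As a sanity check on that per-column estimate (a sketch, using the approximate row count from above):
   ```csharp
   using System;

   // ~44M rows of 4-byte values is on the order of 170 MB per column, well
   // under int.MaxValue (~2.1 GB), so the individual column buffers shouldn't
   // be overflowing ReadOnlyMemory<byte>.Length on their own.
   long rows = 44_000_000;
   long widestColumnBytes = rows * sizeof(int);
   Console.WriteLine(widestColumnBytes < int.MaxValue); // True
   ```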
   
   

