CurtHagenlocher opened a new issue, #179:
URL: https://github.com/apache/arrow-dotnet/issues/179
### Describe the bug, including details regarding any error messages, version, and platform.

Moved from [https://github.com/apache/arrow/issues/37069](https://github.com/apache/arrow/issues/37069). Original text follows:

I have some C# code using the Apache.Arrow 12.0.1 nuget to write .feather files with `ArrowFileWriter.WriteRecordBatch()` and its companions (a minimal sketch of the write path is included at the end of this report). These files are then read in R 4.2 using `read_feather()` from the arrow 12.0.0 package (Windows 10 22H2, RStudio 2023.06.1). This works fine for files containing a single record batch up to at least 2,129,214,698 bytes (1.98 GB). Much above that it breaks: the next file size my code happens to produce is 2,176,530,466 bytes (2.02 GB), and reading it fails with

```R
Error: IOError: Invalid IPC message: negative bodyLength
```

This appears to be coming from [CheckMetadataAndGetBodyLength()](https://github.com/apache/arrow/blob/main/cpp/src/arrow/ipc/message.cc#L188), which takes `bodyLength` as an `int64_t`. So presumably something upstream of `CheckMetadataAndGetBodyLength()` is handling `bodyLength` as a 32-bit signed integer and inducing rollover (also illustrated at the end of this report). The error message is a bit cryptic, and from a brief look at the code it's unclear how a block's `bodyLength` relates to file and record batch size. Since forcing the total (uncompressed) size of a record batch below 2 GB avoids the negative `bodyLength` error, it appears `CheckMetadataAndGetBodyLength()` is picking up something controlled by record batch size rather than file size. So far I've tested multi-batch files up to 3.6 GB and `read_feather()` handles them fine.

The [Arrow spec](https://arrow.apache.org/docs/format/Columnar.html) seems to suggest that compliant Arrow implementations should support array lengths up to 2³¹ - 1 elements. The record batch schema here is 11 four-byte columns plus one one-byte column and a couple of two-byte columns, roughly 49 bytes per row, so one reading of the spec is that files (or maybe record batches) up to about 98 GB (49 bytes × (2³¹ - 1) rows) should be compatible with the recommended limits on multi-language use.

I'm including the C++ tag on this because I'm not seeing where C# or R would be inducing the rollover, which suggests it may be a lower-level issue or perhaps something at the managed/unmanaged boundary. [`ArrowBuffer`](https://github.com/apache/arrow/blob/main/csharp/src/Apache.Arrow/ArrowBuffer.cs#L51) is built on [`ReadOnlyMemory<byte>`](https://learn.microsoft.com/en-us/dotnet/api/system.readonlymemory-1?view=net-7.0) and is therefore constrained by its int32 `Length` property. I'm not sure whether that is considered spec-compliant, but it doesn't appear to be what's causing the rollover here (the 2.02 GB file is only 44M records, so no more than about 170 MB per column even with everything in one `RecordBatch`), and the unchecked int32 conversion in [`ArrowMemoryReaderImplementation.ReadNextRecordBatch()`](https://github.com/apache/arrow/blob/main/csharp/src/Apache.Arrow/Ipc/ArrowMemoryReaderImplementation.cs#L83) probably isn't involved either. Since I'm not getting an exception at write time, the checked C# conversions from `long` can also be excluded.
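
For reference, here is a minimal sketch of the write path described above. The schema, column names, and row count are illustrative only (chosen so that a single record batch exceeds 2³¹ bytes of body data), not my actual data; the point is that the C# side writes such a batch without any exception.

```csharp
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Ipc;
using Apache.Arrow.Types;

class WriteBigBatch
{
    static void Main()
    {
        // Illustrative numbers only: 12 Int32 columns x 50M rows is roughly 2.4 GB of
        // body data, comfortably past the 2^31-byte mark in a single record batch.
        const int rows = 50_000_000;
        const int columnCount = 12;

        // Build a schema of non-nullable Int32 columns c0..c11.
        var schemaBuilder = new Schema.Builder();
        for (int i = 0; i < columnCount; i++)
        {
            int col = i;
            schemaBuilder.Field(f => f.Name($"c{col}").DataType(Int32Type.Default).Nullable(false));
        }
        Schema schema = schemaBuilder.Build();

        // Fill each column with dummy values.
        var arrays = new IArrowArray[columnCount];
        for (int i = 0; i < columnCount; i++)
        {
            var builder = new Int32Array.Builder();
            for (int r = 0; r < rows; r++)
            {
                builder.Append(r);
            }
            arrays[i] = builder.Build();
        }

        var batch = new RecordBatch(schema, arrays, rows);

        // Writing succeeds with no exception; the failure only shows up when R reads the file.
        using var stream = File.Create("big.feather");
        using var writer = new ArrowFileWriter(stream, schema);
        writer.WriteRecordBatch(batch);
        writer.WriteEnd();
    }
}
```

Reading the resulting file back with `read_feather()` in R is what produces the negative `bodyLength` error; splitting the same data across several `WriteRecordBatch()` calls, each batch well under 2 GB, produces a file that R reads without complaint.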
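
And purely to illustrate the kind of rollover I suspect (this is not a claim about where in the stack it actually happens): if a body length a little over 2 GB passes through a signed 32-bit value anywhere, it comes back out negative.

```csharp
using System;

class RolloverDemo
{
    static void Main()
    {
        // A body length a little over 2 GB, similar in size to the failing file.
        long bodyLength = 2_176_530_466;

        // If this value is squeezed through a signed 32-bit integer anywhere...
        int truncated = unchecked((int)bodyLength);

        // ...widening it back to 64 bits sign-extends the now-negative value.
        long roundTripped = truncated;

        Console.WriteLine(truncated);     // -2118436830
        Console.WriteLine(roundTripped);  // -2118436830, i.e. a "negative bodyLength"
    }
}
```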
