hadrian-reppas opened a new issue, #46780:
URL: https://github.com/apache/arrow/issues/46780

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The current asof join implementation does not check the null (validity) bitmap before
indexing into the time column's value buffer, so a null time is compared using
whatever value sits in its buffer slot. This creates matches that should
not be possible. For example:
   
   ```python
   import pyarrow as pa
   
   lhs = pa.table({
       "time": pa.array([None], type=pa.int64())
   })
   rhs = pa.table({
       "time": [0],
       "data": [True],
   })
   
   lhs.join_asof(rhs, "time", [], 0)
   ```
   produces
   ```
      time  data
   0  null  true
   ```
   which implies that null equals 0. Using C++ we can manipulate the value 
buffer and get null to "equal" any integer:
   ```cpp
   #include <cassert>
   
   #include <arrow/api.h>
   #include <arrow/acero/asof_join_node.h>
   #include <arrow/acero/exec_plan.h>
   
   int main() {
     std::shared_ptr<arrow::Table> lhs, rhs;
     {
       auto null_bitmap = arrow::AllocateEmptyBitmap(1).ValueOrDie();
   
       std::shared_ptr<arrow::Buffer> value_buffer =
           arrow::AllocateBuffer(sizeof(int32_t)).ValueOrDie();
       auto value_ptr =
           reinterpret_cast<int32_t*>(value_buffer->mutable_data());
       value_ptr[0] = 123;
   
       auto array_data = arrow::ArrayData::Make(arrow::int32(), 1,
                                                {null_bitmap, value_buffer}, 1);
       auto time = arrow::MakeArray(array_data);
   
       auto schema = arrow::schema({arrow::field("time", arrow::int32())});
       lhs = arrow::Table::Make(schema, {time});
     }
   
     {
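       arrow::Int32Builder builder;
       std::shared_ptr<arrow::Array> time, payload;
       // Keep the builder calls outside assert() so they still run under NDEBUG.
       arrow::Status st = builder.Append(123);
       assert(st.ok());
       st = builder.Finish(&time);
       assert(st.ok());
   
       st = builder.Append(1);
       assert(st.ok());
       st = builder.Finish(&payload);
       assert(st.ok());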
   
       auto schema = arrow::schema({arrow::field("time", arrow::int32()),
                                    arrow::field("payload", arrow::int32())});
       rhs = arrow::Table::Make(schema, {time, payload});
     }
     
     // lhs:         rhs:
     //     time          time  payload
     //  0  null       0   123        1
   
     std::shared_ptr<arrow::Table> result;
     {
       arrow::acero::AsofJoinNodeOptions options({{"time", {}}, {"time", {}}}, 0);
       auto asofjoin = arrow::acero::Declaration("asofjoin",
           {arrow::acero::Declaration("table_source",
                                      arrow::acero::TableSourceNodeOptions(lhs)),
            arrow::acero::Declaration("table_source",
                                      arrow::acero::TableSourceNodeOptions(rhs))},
           std::move(options));
   
       result = arrow::acero::DeclarationToTable(asofjoin).ValueOrDie();
     }
   
     // result:
     //     time  payload
     //  0  null        1   -> non-null payload means null matched 123
   }
   ```
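
The mechanism behind both reproductions can be modeled without Arrow. The following is a minimal pure-Python sketch, assuming a simplified primitive-array layout (an LSB-first validity bitmap plus a little-endian int32 value buffer). `is_valid`, `read_unchecked`, and `read_checked` are hypothetical helpers for illustration, not Arrow functions.

```python
import struct


def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if bit i is set (validity bitmaps are LSB-first)."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def read_unchecked(values: bytes, i: int) -> int:
    """Read slot i of an int32 value buffer, ignoring validity."""
    return struct.unpack_from("<i", values, i * 4)[0]


def read_checked(bitmap: bytes, values: bytes, i: int):
    """Read slot i, returning None when the validity bit is unset."""
    return read_unchecked(values, i) if is_valid(bitmap, i) else None


# One-element array: validity bit cleared (null), value slot holding 123,
# mirroring the hand-built lhs array in the C++ reproduction.
bitmap = bytes([0b00000000])
values = struct.pack("<i", 123)

unchecked = read_unchecked(values, 0)  # sees the stale 123
checked = read_checked(bitmap, values, 0)  # sees a null
print(unchecked, checked)
```

A lookup that behaves like `read_unchecked` would treat the null row as time 123, which is consistent with the match seen above.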
   
   This could be fixed by adding a null check in the `GetTime` function in 
`time_series_util.cc`, but it's not clear to me what the correct behavior 
should be.
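
One candidate behavior, offered only as an assumption since the semantics are left open here, is that a row with a null time never matches and its joined columns come back null. Below is a minimal pure-Python sketch of a backward as-of join with that rule. `asof_join` and its tuple-based rows are hypothetical, the tolerance convention is the sketch's own, and none of this reflects how Acero's `GetTime` or join node is actually written.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], Optional[int]]  # (time, payload); None models null


def asof_join(lhs_times: List[Optional[int]], rhs: List[Row],
              tolerance: int) -> List[Row]:
    """Backward as-of join: for each left time, take the latest right row with
    left_time - tolerance <= right_time <= left_time. Null times never match."""
    # Drop right rows with a null time, then sort so the last hit is the latest.
    candidates = sorted((r for r in rhs if r[0] is not None),
                        key=lambda r: r[0])
    out: List[Row] = []
    for t in lhs_times:
        payload = None
        if t is not None:  # the null check that an unchecked lookup lacks
            for rt, rp in candidates:
                if t - tolerance <= rt <= t:
                    payload = rp
        out.append((t, payload))
    return out


# Same shape as the C++ reproduction, plus one non-null left row for contrast.
result = asof_join([None, 123], [(123, 1)], tolerance=0)
print(result)
```

Under this rule the null left row keeps a null payload, while the non-null row with time 123 still matches. Other choices, such as raising an error on null times, would be equally easy to express at the same check.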
   
   ### Component(s)
   
   C++

