Yicong-Huang opened a new issue, #4524:
URL: https://github.com/apache/texera/issues/4524

   ## Summary
   
   
`core/runnables/test_main_loop.py::TestMainLoop::test_main_loop_thread_can_align_ecm`
fails intermittently on CI for unrelated PRs (observed on #4512 and #4520, 
same assertion at `test_main_loop.py:1176`). Locally it always passes (100/100 
with `-x` on macOS, 30/30 of the full file).
   
   ## Root cause
   
   The test assumes that two items put into `output_queue` come out in FIFO 
order:
   
   ```python
   input_queue.put(ECMElement(tag=mock_control_input_channel, payload=test_ecm))  # 1
   input_queue.put(mock_binary_data_element)                                      # 2
   input_queue.put(ECMElement(tag=mock_data_input_channel, payload=test_ecm))     # 3 -> aligns ECM, runs NoOperation
   output_data_element: DataElement = output_queue.get()    # expects data first
   ...
   output_control_element: DCMElement = output_queue.get()  # expects control reply second
   ```
   
   But `output_queue` is an `InternalQueue`, which is a 
`LinkedBlockingMultiQueue` keyed by channel with **priority 1 for control 
sub-queues and priority 2 for data sub-queues** (`internal_queue.py:80`). 
`LinkedBlockingMultiQueue.get()` always pops from the highest-priority enabled 
sub-queue that currently holds an item, so the order of retrieval depends on 
what is queued at the moment of the call, not on insertion order.
   
   In MainLoop the puts happen sequentially:
   
   1. DataElement → data sub-queue (priority 2)
   2. NoOperation reply DCMElement → control sub-queue (priority 1)
   
   On a fast machine, the test calls `.get()` after step 1 but before step 2, 
so only the data is in the queue — it comes out first, the test passes. On a 
slow CI runner, MainLoop reaches step 2 before the test calls `.get()` — both 
items are queued, the priority queue returns the control reply first, and the 
assertion at line 1176 fails with:
   
   ```
   AssertionError: ChannelIdentity(..., to='sender', is_control=True)  # actual (control)
                != ChannelIdentity(..., to='dummy_worker_id', is_control=False)  # expected (data)
   ```
   
   The production behavior is correct — control should win priority over data 
on the egress queue.
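
The two interleavings described above can be reproduced deterministically with a toy model. This is a minimal sketch, not Texera's `InternalQueue`: `ToyPriorityMultiQueue`, `drain_order`, and the string payloads are hypothetical stand-ins, and the only behavior assumed is "pop from the lowest-numbered-priority non-empty sub-queue".

```python
import threading
from collections import deque


class ToyPriorityMultiQueue:
    """Hypothetical stand-in for a priority multi-queue (not the real InternalQueue)."""

    def __init__(self):
        self._sub_queues = {}  # key -> (priority, deque of items)
        self._cond = threading.Condition()

    def add_sub_queue(self, key, priority):
        with self._cond:
            self._sub_queues[key] = (priority, deque())

    def put(self, key, item):
        with self._cond:
            self._sub_queues[key][1].append(item)
            self._cond.notify()

    def get(self):
        # Block until some sub-queue has an item, then pop from the
        # non-empty sub-queue with the smallest priority number.
        with self._cond:
            while True:
                non_empty = [(prio, q) for prio, q in self._sub_queues.values() if q]
                if non_empty:
                    _, best = min(non_empty, key=lambda pair: pair[0])
                    return best.popleft()
                self._cond.wait()


def drain_order(get_between_puts):
    """Simulate the consumer calling get() either between or after the two puts."""
    q = ToyPriorityMultiQueue()
    q.add_sub_queue("control", priority=1)
    q.add_sub_queue("data", priority=2)
    out = []
    q.put("data", "DataElement")        # step 1
    if get_between_puts:                # "fast machine": consumer wins the race
        out.append(q.get())
    q.put("control", "DCMElement")      # step 2
    while len(out) < 2:
        out.append(q.get())
    return out


print(drain_order(get_between_puts=True))   # ['DataElement', 'DCMElement']
print(drain_order(get_between_puts=False))  # ['DCMElement', 'DataElement']
```

The second call corresponds to the slow-runner case: both items are already queued when the consumer first calls `get()`, so the control item overtakes the data item.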
   
   ## Proposed fix
   
   Make the test order-tolerant: drain two items from `output_queue`, identify 
each by type (`DataElement` vs `DCMElement`), and assert each independently. No 
production code change.
   
   ## Priority
   
   P2 – Medium (blocks unrelated PRs every few CI runs)
   
   ## Task Type
   
   - [x] Testing / QA

