The GitHub Actions job "Required Checks" on 
texera.git/gh-readonly-queue/main/pr-5658-73c76f51920b0900de67bbc0baa1ee5be5b87bf0
 has succeeded.
Run started by GitHub user aglinxinyuan (triggered by aglinxinyuan).

Head commit for run:
190823f7562ba8c0bb2a515b0ce4823cf640e049 / Xinyuan Lin <[email protected]>
test(workflow-operator): add unit test coverage for CaseSensitiveAnalyzer (#5658)

### What changes were proposed in this PR?

Pin the behavior of the Lucene `Analyzer` used by the keyword-search
operator when the user opts into case-sensitive matching. This analyzer
skips the lowercasing pipeline used by `StandardAnalyzer`, so a
regression here would silently downgrade case-sensitive search to
case-insensitive search. No production-code changes.

| Spec | Source class | Tests |
| --- | --- | --- |
| `CaseSensitiveAnalyzerSpec` | `CaseSensitiveAnalyzer` | 13 |

Spec file name follows the `<srcClassName>Spec.scala` one-to-one
convention.

**Behavior pinned**

| Surface | Contract |
| --- | --- |
| Mixed-case input | every emitted token preserves its original case |
| All-uppercase / all-lowercase tokens | preserved (no normalization in either direction) |
| Single-space splitting | tokens are separated cleanly |
| Tabs and newlines | also split tokens |
| Collapsed whitespace runs | no empty tokens emitted |
| Embedded punctuation (`abc,def`) | stays one token (`WhitespaceTokenizer` only splits on whitespace) |
| Sentence-final punctuation (`Hello, world!`) | stays attached (`Hello,`, `world!`) |
| Empty input | no tokens |
| Pure-whitespace input | no tokens |
| `StopFilter` with `CharArraySet.EMPTY_SET` | English stop words (`the` / `and` / `a`) are NOT removed (vs `StandardAnalyzer`'s default behavior) |
| Different field names | same tokenization (field-name independent) |
| Successive `tokenStream` calls | each gets its own independent stream |
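
As a rough illustration of the tokenization rows in the table, the contract can be modeled in plain Scala without Lucene. This is only a sketch of the expected outputs, not the analyzer's implementation; `WhitespaceModel` and `tokens` are hypothetical names introduced here.

```scala
// Hypothetical reference model of the pinned contract; it does not use Lucene
// and is not the code under test.
object WhitespaceModel {
  // Split on runs of whitespace (spaces, tabs, newlines), keep case and
  // punctuation untouched, and drop any empty pieces.
  def tokens(text: String): List[String] =
    text.split("\\s+").toList.filter(_.nonEmpty)
}
```

For example, `WhitespaceModel.tokens("Hello, world!")` yields `List("Hello,", "world!")`, and both empty and pure-whitespace inputs yield an empty list.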

The harness uses the canonical Lucene `reset → incrementToken → end →
close` lifecycle and collects `CharTermAttribute` values into a buffer,
the same pattern any future analyzer spec in this codebase should follow.
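
A minimal sketch of that lifecycle, assuming Lucene is on the test classpath. The helper name `collectTokens`, the object `AnalyzerHarness`, and the default field name `"content"` are hypothetical, and the actual spec may structure its harness differently.

```scala
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable.ListBuffer

// Hypothetical helper: runs one analyzer over one string and returns the
// emitted terms in order.
object AnalyzerHarness {
  def collectTokens(
      analyzer: Analyzer,
      text: String,
      field: String = "content" // assumed field name, for illustration only
  ): List[String] = {
    val stream = analyzer.tokenStream(field, text)
    val term = stream.addAttribute(classOf[CharTermAttribute])
    val out = ListBuffer.empty[String]
    try {
      stream.reset() // must be called before the first incrementToken()
      while (stream.incrementToken()) out += term.toString
      stream.end() // finalizes end-of-stream state
    } finally {
      stream.close() // releases the stream so the analyzer can reuse it
    }
    out.toList
  }
}
```

With Lucene's stock `WhitespaceAnalyzer` standing in for the analyzer under test, `AnalyzerHarness.collectTokens(new WhitespaceAnalyzer(), "Hello, world!")` returns `List("Hello,", "world!")`. Closing the stream in `finally` matters because an `Analyzer` may reuse its token stream components, and requesting a new stream before the previous one is closed violates the `TokenStream` contract.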

### Any related issues, documentation, discussions?

Closes #5654.

### How was this PR tested?

Pure unit-test addition; verified locally with:

- `sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.keywordSearch.CaseSensitiveAnalyzerSpec"`: 13 tests, all green
- `sbt scalafmtCheckAll` — clean
- CI to confirm

### Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7 [1M context])

Report URL: https://github.com/apache/texera/actions/runs/27450622230

With regards,
GitHub Actions via GitBox
