Branch: refs/heads/main
  Home:   https://github.com/WebKit/WebKit
  Commit: 85ab337b0ba3700b847d0568c2c9ff4423f770a2
      
https://github.com/WebKit/WebKit/commit/85ab337b0ba3700b847d0568c2c9ff4423f770a2
  Author: Wenson Hsieh <[email protected]>
  Date:   2025-12-19 (Fri, 19 Dec 2025)

  Changed paths:
    M LayoutTests/fast/text-extraction/debug-text-extraction-markdown-expected.txt
    M LayoutTests/fast/text-extraction/debug-text-extraction-markdown.html
    M Source/WTF/wtf/text/CharacterProperties.h
    M Source/WebCore/page/text-extraction/TextExtraction.cpp
    M Source/WebCore/page/text-extraction/TextExtractionTypes.h
    M Source/WebKit/Shared/TextExtractionToStringConversion.cpp
    M Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in

  Log Message:
  -----------
  [AutoFill Debugging] Add a heuristic to represent numerals split across superscript elements as decimals
https://bugs.webkit.org/show_bug.cgi?id=304446
rdar://166707253

Reviewed by Abrar Rahman Protyasha.

When performing text extraction on a DOM subtree like:

```
<span>$</span>11<sup>99</sup>
```

...we currently represent this in markdown text extraction results as:

```
$
11
99
```

...since each element is separated out into its own line. To ensure that this common way of
rendering monetary values is mapped into a more readable format, we add a heuristic to post-process
per-line text extraction results, so that:

1.  If a currency symbol is followed by a numeric value on the same line, the two pieces of text are
    joined together with no separator between them.

2.  If an integer is followed by another integer inside a superscript on the same line, the two
    pieces of text are joined together with a full stop character `.` between them.
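
The two rules above can be sketched roughly as follows. This is a minimal, standalone illustration and
not the code in TextExtractionToStringConversion.cpp: the `Line` struct, `mergeLines`, and the helper
names are hypothetical, the currency check is a small hard-coded stand-in for the ICU-backed
`WTF::isCurrencySymbol`, and "numeric value" is approximated as "starts with an ASCII digit".

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for one per-line extraction result.
struct Line {
    std::string text;
    unsigned blockNumber; // Number of the enclosing block-level container.
    bool inSuperscript;   // Whether the text came from inside a <sup> element.
};

// Assumption: a tiny hard-coded set replaces the ICU-backed currency check.
static bool isCurrencySymbolText(const std::string& text)
{
    return text == "$" || text == "\xE2\x82\xAC" /* euro sign */ || text == "\xC2\xA3" /* pound sign */;
}

static bool isInteger(const std::string& text)
{
    if (text.empty())
        return false;
    for (unsigned char character : text) {
        if (!std::isdigit(character))
            return false;
    }
    return true;
}

static bool startsWithDigit(const std::string& text)
{
    return !text.empty() && std::isdigit(static_cast<unsigned char>(text.front()));
}

// Merges adjacent lines that share an enclosing block, following rules 1 and 2.
std::vector<std::string> mergeLines(const std::vector<Line>& lines)
{
    std::vector<std::string> result;
    const Line* previous = nullptr;
    for (const auto& line : lines) {
        if (previous && previous->blockNumber == line.blockNumber) {
            // Rule 1: currency symbol followed by a numeric value, no separator.
            if (isCurrencySymbolText(previous->text) && startsWithDigit(line.text)) {
                result.back() += line.text;
                previous = &line;
                continue;
            }
            // Rule 2: integer followed by a superscript integer, joined with a full stop.
            if (isInteger(previous->text) && !previous->inSuperscript
                && line.inSuperscript && isInteger(line.text)) {
                result.back() += "." + line.text;
                previous = &line;
                continue;
            }
        }
        result.push_back(line.text);
        previous = &line;
    }
    return result;
}
```

With the three lines from the example above (`$`, `11`, and a superscript `99`, all in the same
block), this sketch produces the single line `$11.99`. Lines in different blocks are left unmerged.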

* LayoutTests/fast/text-extraction/debug-text-extraction-markdown-expected.txt:
* LayoutTests/fast/text-extraction/debug-text-extraction-markdown.html:

Augment an existing test to exercise the change.

* Source/WTF/wtf/text/CharacterProperties.h:
(WTF::isCurrencySymbol):

Add a new helper function that uses ICU to determine whether the given character represents a
currency symbol.

* Source/WebCore/page/text-extraction/TextExtraction.cpp:
(WebCore::TextExtraction::TraversalContext::pushEnclosingBlock):
(WebCore::TextExtraction::TraversalContext::enclosingBlockNumber const):
(WebCore::TextExtraction::TraversalContext::popEnclosingBlock):

Add a mechanism to assign numbers to each enclosing block-level container as we recursively extract
items from the DOM. This allows us to easily determine whether it's appropriate to merge two
adjacent pieces of text later down the line (see below).
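
As a rough illustration of the numbering mechanism, a stack of block numbers with a monotonically
increasing counter is one way to do this. The class below is a hypothetical sketch, not the actual
`TraversalContext`; only the three method names are borrowed from the list above, and the choice of
`std::optional` for "no enclosing block" is an assumption.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical sketch: hands out a fresh number for every block entered during traversal.
class BlockNumberingContext {
public:
    // Called when traversal enters a block-level container.
    void pushEnclosingBlock() { m_enclosingBlocks.push_back(++m_lastBlockNumber); }

    // Called when traversal leaves that container.
    void popEnclosingBlock()
    {
        assert(!m_enclosingBlocks.empty());
        m_enclosingBlocks.pop_back();
    }

    // Number of the innermost enclosing block, or nullopt outside of any block.
    std::optional<unsigned> enclosingBlockNumber() const
    {
        if (m_enclosingBlocks.empty())
            return std::nullopt;
        return m_enclosingBlocks.back();
    }

private:
    std::vector<unsigned> m_enclosingBlocks;
    unsigned m_lastBlockNumber { 0 };
};
```

Because numbers are never reused, two text items carry the same number only if they share the same
innermost block, which is the property the merging step needs.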

(WebCore::TextExtraction::extractRecursive):
(WebCore::TextExtraction::extractItem):
* Source/WebCore/page/text-extraction/TextExtractionTypes.h:
* Source/WebKit/Shared/TextExtractionToStringConversion.cpp:
(WebKit::shouldEmitFullStopBetweenLines):
(WebKit::shouldJoinWithPreviousLine):

Implement a heuristic to merge adjacent text items; see description above.

(WebKit::TextExtractionAggregator::~TextExtractionAggregator):
(WebKit::TextExtractionAggregator::takeResults):
(WebKit::TextExtractionAggregator::addResult):
(WebKit::TextExtractionAggregator::useTextTreeOutput const):
(WebKit::TextExtractionAggregator::appendToLine):

Refactor this logic to keep track of superscript level while traversing items, and add more
information about each line so that we can post-process the final result before invoking the
completion handler.

(WebKit::TextExtractionAggregator::pushSuperscript):
(WebKit::TextExtractionAggregator::superscriptLevel const):
(WebKit::TextExtractionAggregator::popSuperscript):
(WebKit::addTextRepresentationRecursive):
* Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in:

Canonical link: https://commits.webkit.org/304753@main


