Branch: refs/heads/main
Home: https://github.com/WebKit/WebKit
Commit: 85ab337b0ba3700b847d0568c2c9ff4423f770a2
https://github.com/WebKit/WebKit/commit/85ab337b0ba3700b847d0568c2c9ff4423f770a2
Author: Wenson Hsieh <[email protected]>
Date: 2025-12-19 (Fri, 19 Dec 2025)
Changed paths:
M LayoutTests/fast/text-extraction/debug-text-extraction-markdown-expected.txt
M LayoutTests/fast/text-extraction/debug-text-extraction-markdown.html
M Source/WTF/wtf/text/CharacterProperties.h
M Source/WebCore/page/text-extraction/TextExtraction.cpp
M Source/WebCore/page/text-extraction/TextExtractionTypes.h
M Source/WebKit/Shared/TextExtractionToStringConversion.cpp
M Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in
Log Message:
-----------
[AutoFill Debugging] Add a heuristic to represent numerals split across superscript elements as decimals
https://bugs.webkit.org/show_bug.cgi?id=304446
rdar://166707253
Reviewed by Abrar Rahman Protyasha.
When performing text extraction on a DOM subtree like:
```
<span>$</span>11<sup>99</sup>
```
...we currently represent this in markdown text extraction results as:
```
$
11
99
```
...since each element is separated out into its own line. To ensure that this common way of rendering monetary values is mapped into a more readable format, we add a heuristic to post-process per-line text extraction results, so that:
1. If a currency symbol is followed by a numeric value on the same line, the two pieces of text are joined together with no separator between them.
2. If an integer is followed by another integer inside a superscript on the same line, the two pieces of text are joined together with a full stop character `.` between them.
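The two rules above can be sketched as follows. This is a minimal illustration, not the code from this patch: `Fragment`, `joinFragments`, and `isCurrencySymbolForIllustration` are hypothetical names, the currency check is a hard-coded stand-in for the ICU-backed `WTF::isCurrencySymbol`, and "same line" is approximated by comparing an assumed per-fragment block number.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical per-fragment record; the metadata tracked by the real patch may differ.
struct Fragment {
    std::string text;
    unsigned blockNumber { 0 };      // Identifies the enclosing block-level container.
    unsigned superscriptLevel { 0 }; // Nesting depth of <sup>; 0 means not superscript.
};

static bool isAllDigits(std::string_view text)
{
    if (text.empty())
        return false;
    for (char c : text) {
        if (c < '0' || c > '9')
            return false;
    }
    return true;
}

static bool startsWithDigit(std::string_view text)
{
    return !text.empty() && text.front() >= '0' && text.front() <= '9';
}

// Assumption: stand-in for an ICU-backed check; only recognizes a few symbols.
static bool isCurrencySymbolForIllustration(std::string_view text)
{
    return text == "$" || text == "€" || text == "£";
}

// Emits one output line per fragment unless a rule merges it into the previous line.
std::vector<std::string> joinFragments(const std::vector<Fragment>& fragments)
{
    std::vector<std::string> lines;
    const Fragment* previous = nullptr;
    for (auto& fragment : fragments) {
        bool sameBlock = previous && previous->blockNumber == fragment.blockNumber;
        if (sameBlock && isCurrencySymbolForIllustration(previous->text) && startsWithDigit(fragment.text)) {
            // Rule 1: currency symbol followed by a number, joined with no separator.
            lines.back() += fragment.text;
        } else if (sameBlock && isAllDigits(previous->text) && isAllDigits(fragment.text)
            && fragment.superscriptLevel > previous->superscriptLevel) {
            // Rule 2: integer followed by a superscript integer, joined with a full stop.
            lines.back() += "." + fragment.text;
        } else
            lines.push_back(fragment.text);
        previous = &fragment;
    }
    return lines;
}
```

With the fragments `$`, `11`, and a superscript `99` all in the same block, this sketch produces the single line `$11.99`; fragments in different blocks stay on separate lines.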
* LayoutTests/fast/text-extraction/debug-text-extraction-markdown-expected.txt:
* LayoutTests/fast/text-extraction/debug-text-extraction-markdown.html:
Augment an existing test to exercise the change.
* Source/WTF/wtf/text/CharacterProperties.h:
(WTF::isCurrencySymbol):
Add a new helper function that uses ICU to determine whether the given character represents a currency symbol.
* Source/WebCore/page/text-extraction/TextExtraction.cpp:
(WebCore::TextExtraction::TraversalContext::pushEnclosingBlock):
(WebCore::TextExtraction::TraversalContext::enclosingBlockNumber const):
(WebCore::TextExtraction::TraversalContext::popEnclosingBlock):
Add a mechanism to assign numbers to each enclosing block-level container as we recursively extract items from the DOM. This allows us to easily determine whether it's appropriate to merge the two pieces of text later down the line (see below).
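One way such block numbering could work is sketched below. The class name `BlockNumberTracker` and its internals are assumptions for illustration; only the three method names mirror those listed for `TraversalContext`, and the actual implementation may differ.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tracker: each block-level container entered during traversal
// receives a fresh number, so two text items share a number only when they
// sit directly inside the same container.
class BlockNumberTracker {
public:
    void pushEnclosingBlock() { m_stack.push_back(++m_lastAssignedNumber); }

    void popEnclosingBlock()
    {
        assert(!m_stack.empty());
        m_stack.pop_back();
    }

    // Returns 0 when the traversal is not inside any block-level container.
    unsigned enclosingBlockNumber() const { return m_stack.empty() ? 0 : m_stack.back(); }

private:
    std::vector<unsigned> m_stack;
    unsigned m_lastAssignedNumber { 0 };
};
```

Because numbers are never reused, two sibling containers visited one after the other get different numbers, which keeps text from adjacent blocks from being merged.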
(WebCore::TextExtraction::extractRecursive):
(WebCore::TextExtraction::extractItem):
* Source/WebCore/page/text-extraction/TextExtractionTypes.h:
* Source/WebKit/Shared/TextExtractionToStringConversion.cpp:
(WebKit::shouldEmitFullStopBetweenLines):
(WebKit::shouldJoinWithPreviousLine):
Implement a heuristic to merge adjacent text items; see description above.
(WebKit::TextExtractionAggregator::~TextExtractionAggregator):
(WebKit::TextExtractionAggregator::takeResults):
(WebKit::TextExtractionAggregator::addResult):
(WebKit::TextExtractionAggregator::useTextTreeOutput const):
(WebKit::TextExtractionAggregator::appendToLine):
Refactor this logic to keep track of superscript level while traversing items, and add more information about each line so that we can post-process the final result before invoking the completion handler.
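Superscript-level tracking during traversal might look like the following sketch. `SuperscriptTracker` and `SuperscriptScope` are hypothetical names; only the three method names mirror those listed for `TextExtractionAggregator`, and the scope guard is an assumed convenience rather than something this patch is known to include.

```cpp
#include <cassert>

// Hypothetical counter for how many <sup> elements enclose the current item.
class SuperscriptTracker {
public:
    void pushSuperscript() { ++m_level; }

    void popSuperscript()
    {
        assert(m_level);
        --m_level;
    }

    unsigned superscriptLevel() const { return m_level; }

private:
    unsigned m_level { 0 };
};

// Assumed RAII helper: raises the level while a superscript subtree is
// being visited and restores it when the visit ends.
class SuperscriptScope {
public:
    explicit SuperscriptScope(SuperscriptTracker& tracker)
        : m_tracker(tracker)
    {
        m_tracker.pushSuperscript();
    }

    ~SuperscriptScope() { m_tracker.popSuperscript(); }

    SuperscriptScope(const SuperscriptScope&) = delete;
    SuperscriptScope& operator=(const SuperscriptScope&) = delete;

private:
    SuperscriptTracker& m_tracker;
};
```

Recording the level alongside each line is what would let a later pass tell a plain integer from one that appeared inside a superscript.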
(WebKit::TextExtractionAggregator::pushSuperscript):
(WebKit::TextExtractionAggregator::superscriptLevel const):
(WebKit::TextExtractionAggregator::popSuperscript):
(WebKit::addTextRepresentationRecursive):
* Source/WebKit/Shared/WebCoreArgumentCoders.serialization.in:
Canonical link: https://commits.webkit.org/304753@main