Charles Connell created HBASE-28905:
---------------------------------------

             Summary: Skip excessive evaluations of LINK_NAME_PATTERN and 
REF_NAME_PATTERN regular expressions
                 Key: HBASE-28905
                 URL: https://issues.apache.org/jira/browse/HBASE-28905
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.6.0
            Reporter: Charles Connell
            Assignee: Charles Connell


To test if a file is a link file, HBase checks if its file name matches the 
regex
{code:java}
^(?:((?:[_\p{Digit}\p{IsAlphabetic}]+))(?:\=))?((?:[_\p{Digit}\p{IsAlphabetic}][-_.\p{Digit}\p{IsAlphabetic}]*))=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$
{code}
To test if an HFile has a "reference name," HBase checks if its file name 
matches the regex
{code:java}
^([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?|^(?:((?:[_\p{Digit}\p{IsAlphabetic}]+))(?:\=))?((?:[_\p{Digit}\p{IsAlphabetic}][-_.\p{Digit}\p{IsAlphabetic}]*))=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$)\.(.+)$
{code}
Matching against these big regexes is computationally expensive. HBASE-27474 
introduced (in 2.6.0) code in a hot path in HFileReaderImpl that checks whether 
an HFile is a link or reference file while deciding whether to cache blocks 
from that file. In flamegraphs taken at my company during performance tests, 
these regex evaluations took 2-3% of the CPU time on a busy 
RegionServer.

Later, the hot-path invocation of the regexes was removed in HBASE-28596 in 
branch-2 and later, but not branch-2.6, so only the 2.6.x series suffers the 
performance regression. Nonetheless, all invocations of these regexes are still 
unnecessarily expensive and can be fast-failed easily.

The link name pattern requires a literal "=", so any string that does not 
contain a "=" cannot match the regex. The reference name pattern requires a 
literal ".", so any string that does not contain a "." cannot match the 
regex. Checking for these characters first lets most file names skip the 
regex entirely. This optimization is mostly helpful in 2.6.x, but is valid 
in all branches.
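
A minimal sketch of the fast-fail idea follows. It is not the actual patch: the class and method names (FastFailNameCheck, isLinkName, isReferenceName) are hypothetical and chosen for illustration only. The regexes are copied from the description above.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not an HBase class.
final class FastFailNameCheck {
  // Link name regex, copied from the issue description.
  private static final String LINK_NAME_REGEX =
    "^(?:((?:[_\\p{Digit}\\p{IsAlphabetic}]+))(?:\\=))?"
      + "((?:[_\\p{Digit}\\p{IsAlphabetic}][-_.\\p{Digit}\\p{IsAlphabetic}]*))"
      + "=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$";

  // Reference name regex, copied from the issue description.
  private static final String REF_NAME_REGEX =
    "^([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?|" + LINK_NAME_REGEX + ")\\.(.+)$";

  private static final Pattern LINK_NAME_PATTERN = Pattern.compile(LINK_NAME_REGEX);
  private static final Pattern REF_NAME_PATTERN = Pattern.compile(REF_NAME_REGEX);

  private FastFailNameCheck() {
  }

  // A link name must contain '=', so skip the regex when it is absent.
  static boolean isLinkName(String name) {
    return name.indexOf('=') >= 0 && LINK_NAME_PATTERN.matcher(name).matches();
  }

  // A reference name must contain '.', so skip the regex when it is absent.
  static boolean isReferenceName(String name) {
    return name.indexOf('.') >= 0 && REF_NAME_PATTERN.matcher(name).matches();
  }
}
```

The indexOf check is a single linear scan with no backtracking, and it only rules names out: a name that does contain the character still goes through the full regex, so the result is the same as calling the regex alone.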

In performance tests, this optimization removed the regex evaluations 
from my flamegraphs entirely and reduced query latency by 15%. Some charts are 
attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
