Charles Connell created HBASE-28905:
---------------------------------------
Summary: Skip excessive evaluations of LINK_NAME_PATTERN and REF_NAME_PATTERN regular expressions
Key: HBASE-28905
URL: https://issues.apache.org/jira/browse/HBASE-28905
Project: HBase
Issue Type: Improvement
Affects Versions: 2.6.0
Reporter: Charles Connell
Assignee: Charles Connell


To test if a file is a link file, HBase checks if its file name matches the regex
{code:java}
^(?:((?:[_\p{Digit}\p{IsAlphabetic}]+))(?:\=))?((?:[_\p{Digit}\p{IsAlphabetic}][-_.\p{Digit}\p{IsAlphabetic}]*))=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$
{code}
To test if an HFile has a "reference name," HBase checks if its file name matches the regex
{code:java}
^([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?|^(?:((?:[_\p{Digit}\p{IsAlphabetic}]+))(?:\=))?((?:[_\p{Digit}\p{IsAlphabetic}][-_.\p{Digit}\p{IsAlphabetic}]*))=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$)\.(.+)$
{code}
Matching against these big regexes is computationally expensive. HBASE-27474 introduced (in 2.6.0) code in a hot path in HFileReaderImpl that checks whether an HFile is a link or reference file while deciding whether to cache blocks from that file. In flamegraphs taken at my company during performance tests, these regex evaluations took 2-3% of the CPU time on a busy RegionServer. The hot-path invocation of the regexes was later removed by HBASE-28596 in branch-2 and later, but not in branch-2.6, so only the 2.6.x series suffers the performance regression.

Nonetheless, all invocations of these regexes are still unnecessarily expensive and can easily be made to fail fast. The link name pattern requires a literal "=", so any string that does not contain a "=" cannot match it. The reference name pattern requires a literal ".", so any string that does not contain a "." cannot match it. This optimization is mostly helpful in 2.6.x, but is valid in all branches.
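The fast-fail idea described above can be sketched roughly as follows. This is a minimal illustration and not the actual patch: the class name FastFailNameCheck and the methods isLinkName and isReferenceName are hypothetical, and the two patterns are simply copied from the regexes quoted in this issue. The sample file names in main are made up for the example.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not taken from the HBase code base. */
final class FastFailNameCheck {
  // Link name regex, copied from the issue text.
  static final Pattern LINK_NAME_PATTERN = Pattern.compile(
    "^(?:((?:[_\\p{Digit}\\p{IsAlphabetic}]+))(?:\\=))?"
      + "((?:[_\\p{Digit}\\p{IsAlphabetic}][-_.\\p{Digit}\\p{IsAlphabetic}]*))"
      + "=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$");

  // Reference name regex, copied from the issue text.
  static final Pattern REF_NAME_PATTERN = Pattern.compile(
    "^([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?|"
      + "^(?:((?:[_\\p{Digit}\\p{IsAlphabetic}]+))(?:\\=))?"
      + "((?:[_\\p{Digit}\\p{IsAlphabetic}][-_.\\p{Digit}\\p{IsAlphabetic}]*))"
      + "=((?:[a-f0-9]+))-([0-9a-f]+(?:(?:_SeqId_[0-9]+_)|(?:_del))?)$)"
      + "\\.(.+)$");

  private FastFailNameCheck() {
  }

  /** Returns true if the name matches the link pattern; skips the regex when no '=' is present. */
  static boolean isLinkName(String fileName) {
    // Every match of the link pattern contains a literal '=', so this scan is a safe pre-filter.
    if (fileName.indexOf('=') < 0) {
      return false;
    }
    return LINK_NAME_PATTERN.matcher(fileName).matches();
  }

  /** Returns true if the name matches the reference pattern; skips the regex when no '.' is present. */
  static boolean isReferenceName(String fileName) {
    // Every match of the reference pattern contains a literal '.', so this scan is a safe pre-filter.
    if (fileName.indexOf('.') < 0) {
      return false;
    }
    return REF_NAME_PATTERN.matcher(fileName).matches();
  }

  public static void main(String[] args) {
    // Made-up sample names: a plain hex HFile name, a link-style name, a reference-style name.
    String[] names = {
      "0123456789abcdef",
      "mytable=abcdef0123-0123456789abcdef",
      "0123456789abcdef.parentregion"
    };
    for (String name : names) {
      System.out.println(name + " link=" + isLinkName(name) + " ref=" + isReferenceName(name));
    }
  }
}
```

The pre-checks only ever short-circuit to false, so names that do contain the required character still go through the full regex and the result is unchanged. A plain hex HFile name, which contains neither "=" nor ".", never reaches either regex.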
In performance tests with this optimization applied, the regex evaluations disappeared from my flamegraphs entirely and query latency dropped by 15%. Some charts are attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)