[
https://issues.apache.org/jira/browse/HADOOP-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
BELUGA BEHR updated HADOOP-14525:
---------------------------------
Description:
In Apache Hive, VARCHAR fields are much slower than STRING fields when a
precision (maximum string length) is specified. Keep in mind that this
precision is the number of characters (Unicode code points) in the string, not
the number of UTF-8 bytes.
The general procedure is:
# Load an entire byte buffer into a {{Text}} object
# Convert it to a {{String}}
# Count the first N code points, where N is the precision
# Substring the {{String}} at the corresponding position
# Convert the {{String}} back into a byte array and populate the {{Text}} object
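To make the cost concrete, the steps above can be sketched in plain Java roughly as follows. This is only an illustration, not Hive's actual code: the class and method names are hypothetical, and a raw UTF-8 byte array stands in for the contents of the {{Text}} object.

```java
import java.nio.charset.StandardCharsets;

public class VarcharRoundTrip {
    // Hypothetical helper: enforces a precision of maxCodePoints by decoding
    // the UTF-8 bytes to a String and re-encoding the truncated result.
    static byte[] truncate(byte[] utf8, int maxCodePoints) {
        // Decode the whole buffer (first copy).
        String s = new String(utf8, StandardCharsets.UTF_8);
        // Count code points; a String is UTF-16, so this is not s.length().
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return utf8; // already within the precision
        }
        // Find the char index just past the first maxCodePoints code points.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        // Substring (second copy) and re-encode to UTF-8 (third copy).
        return s.substring(0, end).getBytes(StandardCharsets.UTF_8);
    }
}
```

Every value that exceeds the precision is copied several times here, which is the overhead the request below is trying to avoid.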
It would be great if the {{Text}} object could offer a truncate/substring
method based on character count that did not require copying data around.
Along the same lines, a {{getCharacterLength()}} method may also be useful to
determine if the precision has been exceeded.
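As a rough sketch of what such methods might do internally, the code below scans the UTF-8 bytes directly and never builds a {{String}}. It is not an existing Hadoop API: the class and method names are hypothetical, and the input is assumed to be well-formed UTF-8.

```java
public class Utf8Truncate {
    // Returns the number of bytes occupied by the first maxCodePoints code
    // points, or length if the data holds fewer code points than that.
    static int byteLengthForCodePoints(byte[] utf8, int length, int maxCodePoints) {
        int count = 0;
        for (int i = 0; i < length; i++) {
            // Bytes of the form 10xxxxxx are continuation bytes; any other
            // byte starts a new code point.
            if ((utf8[i] & 0xC0) != 0x80) {
                if (count == maxCodePoints) {
                    return i; // this byte starts the first code point to drop
                }
                count++;
            }
        }
        return length;
    }

    // Counts code points by counting the bytes that start one.
    static int characterLength(byte[] utf8, int length) {
        int count = 0;
        for (int i = 0; i < length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }
}
```

With a {{Text}} object, the arguments could presumably come from {{getBytes()}} and {{getLength()}}, and truncation would then only need to shorten the recorded length to the returned byte offset rather than copy any data.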
> org.apache.hadoop.io.Text Truncate
> ----------------------------------
>
> Key: HADOOP-14525
> URL: https://issues.apache.org/jira/browse/HADOOP-14525
> Project: Hadoop Common
> Issue Type: Improvement
> Components: io
> Affects Versions: 2.8.1
> Reporter: BELUGA BEHR