[ https://issues.apache.org/jira/browse/HBASE-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HBASE-25357:
-----------------------------------
    Labels: pull-request-available  (was: )

> allow specifying binary row key range to pre-split regions
> ----------------------------------------------------------
>
>                 Key: HBASE-25357
>                 URL: https://issues.apache.org/jira/browse/HBASE-25357
>             Project: HBase
>          Issue Type: Improvement
>          Components: spark
>            Reporter: Yubao Liu
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, the Spark HBase connector uses a `String` to specify regionStart
> and regionEnd, but we often have serialized binary row keys. I made a small
> patch at [https://github.com/apache/hbase-connectors/pull/72/files] that
> always treats the `String` as ISO_8859_1, so we can put raw bytes into the
> String object and get them back unchanged.
> This has a drawback: if your row key really is a Unicode string with
> characters beyond the ISO_8859_1 charset, you must convert it to UTF-8
> encoded bytes and then wrap those bytes in an ISO_8859_1 string. This is a
> limitation of the Spark option interface, which only accepts a
> string-to-string map.
> {code:java}
> import java.nio.charset.StandardCharsets;
>
> df.write()
>     .format("org.apache.hadoop.hbase.spark")
>     .option(HBaseTableCatalog.tableCatalog(), catalog)
>     .option(HBaseTableCatalog.newTable(), 5)
>     // UTF-8 bytes of the start key, wrapped one byte per char in an ISO_8859_1 String
>     .option(HBaseTableCatalog.regionStart(),
>         new String("你好".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
>     // UTF-8 bytes of the end key, wrapped the same way
>     .option(HBaseTableCatalog.regionEnd(),
>         new String("世界".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1))
>     .mode(SaveMode.Append)
>     .save();
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
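
The byte round trip that the description relies on can be checked in isolation. The sketch below is not taken from the connector or the patch; the class name `BinaryRowKeyOption` and the helpers `toOptionString` and `toRowKey` are hypothetical names chosen for illustration. It only assumes that both sides of the option agree on ISO_8859_1, as the description states.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of hbase-connectors.
final class BinaryRowKeyOption {

    // ISO_8859_1 maps every byte 0x00-0xFF to exactly one char, so no byte is lost.
    public static String toOptionString(byte[] rowKey) {
        return new String(rowKey, StandardCharsets.ISO_8859_1);
    }

    // Reverse step: recover the original bytes from the option value.
    public static byte[] toRowKey(String optionValue) {
        return optionValue.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // An arbitrary binary key, including bytes that are invalid as UTF-8.
        byte[] binaryKey = {0x00, (byte) 0x80, (byte) 0xFF, 0x7F};
        String wrapped = toOptionString(binaryKey);
        System.out.println(Arrays.equals(binaryKey, toRowKey(wrapped))); // true

        // Decoding the same bytes as UTF-8 replaces the invalid ones, so they are lost.
        byte[] lossy = new String(binaryKey, StandardCharsets.UTF_8)
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(binaryKey, lossy)); // false

        // A Unicode key: encode to UTF-8 first, then wrap the bytes.
        byte[] utf8Key = "你好".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(utf8Key, toRowKey(toOptionString(utf8Key)))); // true
    }
}
```

The wrapped string is not human-readable for non-Latin keys (one char per UTF-8 byte), which is the drawback the description mentions.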