eldenmoon opened a new issue, #15571:
URL: https://github.com/apache/doris/issues/15571

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
   # Background
   Doris is built as an MPP OLAP system. It is very fast on analytic queries, but for high-concurrency serving scenarios it has the following drawbacks:
   1. It is built on a column-oriented storage engine. In high-concurrency serving scenarios users usually want to fetch a whole row, but a column-oriented format massively amplifies random read IO when the table is wide.
   2. The Doris query engine and planner are too heavy for simple queries such as point queries. We need a short fast path for such queries.
   3. The FE is the access layer for SQL queries and is written in Java; analyzing and parsing SQL incurs very high CPU overhead under high-concurrency query load.
   
   To address these drawbacks, we provide three optimizations, one for each:
   1. a row store format in the system
   2. a short-path optimization for point queries
   3. prepared statements
   
   I'll discuss the details of each below.
   
   # Row store
   For the row store format, we need to consider the following performance and storage issues:
   1. Fast access to each column of a row
   2. Disk storage cost; less is better
   3. Simple enough and extensible
   
   I investigated and surveyed several designs and found a good codec choice that Doris already uses: `JSONB`. It stores the binary format of each column, and we map each column id to its column value for fast column access. The binary format is compact, and the binary is taken directly from the vectorized column binary format. The codec details can be found in `be/src/vec/jsonb/serialize.cpp`; the codec logic is very simple and extensible. A sketch of the id-to-value idea follows below.
   
   For storing a row, an extra column called `__DORIS_SOURCE_COL__` is added to each segment, so the row store is just another column: we convert each Block to a JSONB column, append it to the back of the Block, and the extra column is flushed from the memtable together with the same Block.
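   
   To make the id-to-value mapping concrete, here is a minimal, self-contained sketch of the idea in Java. It is purely illustrative: the real codec is the C++ JSONB serializer in `be/src/vec/jsonb/serialize.cpp`, and the `RowCodecSketch` class, its wire layout, and its method names are assumptions made up for this example.
   ```
   import java.io.ByteArrayOutputStream;
   import java.nio.ByteBuffer;
   import java.nio.charset.StandardCharsets;
   import java.util.LinkedHashMap;
   import java.util.Map;
   
   // Illustrative row codec: a row is a sequence of (column id, length, value bytes)
   // entries, so a single column can be fetched by id without decoding the whole row.
   public class RowCodecSketch {
       public static byte[] encode(Map<Integer, byte[]> row) {
           ByteArrayOutputStream out = new ByteArrayOutputStream();
           for (Map.Entry<Integer, byte[]> e : row.entrySet()) {
               ByteBuffer header = ByteBuffer.allocate(8);
               header.putInt(e.getKey());          // column id
               header.putInt(e.getValue().length); // value length
               out.writeBytes(header.array());
               out.writeBytes(e.getValue());
           }
           return out.toByteArray();
       }
   
       // Fetch a single column by id, skipping over the others.
       public static byte[] findColumn(byte[] encoded, int columnId) {
           ByteBuffer buf = ByteBuffer.wrap(encoded);
           while (buf.remaining() >= 8) {
               int id = buf.getInt();
               int len = buf.getInt();
               if (id == columnId) {
                   byte[] value = new byte[len];
                   buf.get(value);
                   return value;
               }
               buf.position(buf.position() + len); // skip this value
           }
           return null; // column absent, e.g. added after this row was written
       }
   
       public static void main(String[] args) {
           Map<Integer, byte[]> row = new LinkedHashMap<>();
           row.put(0, "123".getBytes(StandardCharsets.UTF_8));
           row.put(1, "hello".getBytes(StandardCharsets.UTF_8));
           byte[] encoded = encode(row);
           System.out.println(new String(findColumn(encoded, 1), StandardCharsets.UTF_8)); // hello
       }
   }
   ```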
   
   # Short path optimization
   A simple query like `select * from tbl where pk1 = 123 and pk2 = 456` only touches a single tablet, so we can use a lightweight RPC interface to fetch the data instead of generating a huge fragment plan and executing it on the heavy execution engine. The RPC interface is as follows:
   ```
   // We use Thrift definitions for some structures, since TExpr,
   // list<Exprs.TExpr>, and Descriptors.TDescriptorTable are all in Thrift format.
   // Converting them to protobuf would be redundant work.
   message PTabletKeyLookupRequest {
       required int64 tablet_id = 1;
       repeated KeyTuple key_tuples = 2;
       // serialized from Descriptors.TDescriptorTable
       optional bytes desc_tbl = 4;
       // serialized from TExprList
       optional bytes output_expr = 5;
   }
   
   message PTabletKeyLookupResponse {
       required PStatus status = 1;
       optional bytes row_batch = 5;
       optional bool empty_batch = 6;
   }
   
   rpc tablet_fetch_data(PTabletKeyLookupRequest) returns (PTabletKeyLookupResponse);
   ```
   `tablet_id` is calculated from the key condition columns, and `key_tuples` is the string format of the primary key; in the above example `key_tuples` looks like `['123', '456']`. The key tuples are encoded into the primary key storage format, and `lookup_row_key` is used to identify the positions of the keys in the tablet. `lookup_row_key` locates the key's row position and checks whether the key is in the delete bitmap: if not, it returns the row's `RowLocation`, otherwise it returns NotFound. We then use the location to read the row directly from the related segment. Point queries only need the `__DORIS_SOURCE_COL__` column, so we just locate the row in this column, read the raw JSONB value, and decode it into a Block for later expression evaluation. A sketch of this flow follows.
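   
   Below is a minimal, self-contained sketch of that lookup flow in Java. It is only illustrative: the real implementation is C++ in the BE, and the class, the stand-in collections, and the toy key encoding here are assumptions made up for this example.
   ```
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.Optional;
   import java.util.Set;
   
   // Illustrative point-lookup flow: plain Java collections stand in for the
   // tablet's primary key index, the delete bitmap, and the __DORIS_SOURCE_COL__
   // row-store column.
   public class PointLookupSketch {
       // encoded primary key -> row id, standing in for the primary key index
       static final Map<String, Integer> keyIndex = new HashMap<>();
       // row ids marked deleted, standing in for the delete bitmap
       static final Set<Integer> deleteBitmap = new HashSet<>();
       // row id -> encoded row bytes, standing in for __DORIS_SOURCE_COL__
       static final Map<Integer, byte[]> rowColumn = new HashMap<>();
   
       // Returns the encoded row for a key tuple, or empty for NotFound/deleted keys.
       static Optional<byte[]> lookupRowKey(List<String> keyTuple) {
           String encodedKey = String.join("\u0000", keyTuple); // toy key encoding
           Integer rowId = keyIndex.get(encodedKey);            // locate the row
           if (rowId == null || deleteBitmap.contains(rowId)) {
               return Optional.empty();                         // NotFound
           }
           // Read only the row-store column at that position; the caller decodes
           // the bytes (JSONB in Doris) into a Block and evaluates output exprs.
           return Optional.ofNullable(rowColumn.get(rowId));
       }
   
       public static void main(String[] args) {
           keyIndex.put("123\u0000456", 0);
           rowColumn.put(0, new byte[] {1, 2, 3}); // pretend JSONB bytes
           System.out.println(lookupRowKey(List.of("123", "456")).isPresent()); // true
       }
   }
   ```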
   
   # Prepared Statement
   MySQL provides a useful statement caching mechanism on both the server and client side called Prepared Statement. To reduce the FE's analyze and parse CPU cost per query, we can use the server-side Prepared Statement mechanism. We provide a `PreparedStatement` feature in the FE that is fully compatible with the MySQL protocol (currently it only supports point queries like the ones mentioned above). When enabled, the PreparedStatement's SQL and expressions are precalculated, cached in a session-level in-memory HashMap, and reused later on. We measured a 4x+ performance improvement from `PreparedStatement` when the CPU is the bottleneck for such queries. Server-side Prepared Statement uses the MySQL binary protocol as its transfer protocol; see [MySQL binary row](https://dev.mysql.com/doc/dev/mysqlserver/latest/page_protocol_binary_resultset.html#sect_protocol_binary_resultset_row) for more details. We also implemented this protocol in `mysql_row_buffer.[h|cpp]` according to the standard MySQL binary protocol. A client-side usage example follows.
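   
   Here is a sketch of how a client would use the feature through standard JDBC. The host, port (assuming the default FE MySQL port 9030), credentials, database, and key value are placeholders; the important part is `useServerPrepStmts=true` in the URL, which makes MySQL Connector/J send a server-side prepare instead of emulating it on the client.
   ```
   import java.sql.Connection;
   import java.sql.DriverManager;
   import java.sql.PreparedStatement;
   import java.sql.ResultSet;
   import java.sql.SQLException;
   
   public class PreparedPointQuery {
       public static void main(String[] args) throws SQLException {
           // useServerPrepStmts=true enables MySQL server-side prepared statements,
           // so the FE parses and plans the point query once per session.
           String url = "jdbc:mysql://127.0.0.1:9030/ycsb"
                   + "?useServerPrepStmts=true&cachePrepStmts=true";
           try (Connection conn = DriverManager.getConnection(url, "root", "");
                PreparedStatement stmt = conn.prepareStatement(
                        "SELECT * FROM usertable WHERE YCSB_KEY = ?")) {
               stmt.setString(1, "user123"); // bound via the MySQL binary protocol
               try (ResultSet rs = stmt.executeQuery()) {
                   while (rs.next()) {
                       System.out.println(rs.getString("FIELD0"));
                   }
               }
           }
       }
   }
   ```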
   Caching statements in the FE alone is not enough. On the BE side we also need to cache reusable structures such as a preallocated Block pool, query descriptors, and expressions, since serializing and deserializing them is also heavy CPU work. Each Prepared Statement is attached to a UUID named `cacheID`, and each prepared query execution reuses the above-mentioned structures in the BE according to its `cacheID`, as sketched below.
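   
   The following Java sketch shows the shape of that BE-side reuse. It is illustrative only: the real cache lives in the C++ BE, and the `PreparedQueryCache` class, its field types, and the stub deserializers are assumptions made up for this example.
   ```
   import java.util.Map;
   import java.util.UUID;
   import java.util.concurrent.ConcurrentHashMap;
   
   // Illustrative cacheID -> reusable-state map: descriptors and expressions are
   // deserialized once per prepared statement and reused on later executions.
   public class PreparedQueryCache {
       record ReusableState(Object descriptorTable, Object outputExprs, Object blockPool) {}
   
       private final Map<UUID, ReusableState> cache = new ConcurrentHashMap<>();
   
       public ReusableState getOrCreate(UUID cacheId, byte[] descTbl, byte[] outputExpr) {
           // Deserialize the Thrift structures and preallocate a Block pool only on
           // the first execution for this cacheID; later executions reuse them.
           return cache.computeIfAbsent(cacheId,
                   id -> new ReusableState(deserializeDescTbl(descTbl),
                           deserializeExprs(outputExpr), allocateBlockPool()));
       }
   
       // Stubs standing in for Thrift deserialization and Block pool allocation.
       private Object deserializeDescTbl(byte[] bytes) { return new Object(); }
       private Object deserializeExprs(byte[] bytes) { return new Object(); }
       private Object allocateBlockPool() { return new Object(); }
   }
   ```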
   
   # Benchmark
   I ran a standard [YCSB](https://github.com/eldenmoon/YCSB) benchmark to measure the performance improvement. Below is the result.
   (The test machine is a single cloud machine with 16 cores, 64G memory, and 4 * 1T HDD. Before the benchmark I did some warmup for caching. I will test performance on a cluster later on; it should scale linearly.)
   
   The run cost about 12 cores:
   ```
   [OVERALL], Throughput(ops/sec), 31134.704298145927
   [READ], Operations, 9882727
   [READ], AverageLatency(us), 638.3491656705685
   [READ], MinLatency(us), 250
   [READ], MaxLatency(us), 152447
   [READ], 95thPercentileLatency(us), 856
   [READ], 99thPercentileLatency(us), 1277
   [READ], Return=OK, 9882727
   [READ], Return=NOT_FOUND, 117273
   ```
   
   The table was created with the following SQL:
   ```
   CREATE TABLE `usertable` (
     `YCSB_KEY` varchar(255) NULL,
     `FIELD0` text NULL,
     `FIELD1` text NULL,
     `FIELD2` text NULL,
     `FIELD3` text NULL,
     `FIELD4` text NULL,
     `FIELD5` text NULL,
     `FIELD6` text NULL,
     `FIELD7` text NULL,
     `FIELD8` text NULL,
     `FIELD9` text NULL
   ) ENGINE=OLAP
   UNIQUE KEY(`YCSB_KEY`)
   COMMENT 'OLAP'
   DISTRIBUTED BY HASH(`YCSB_KEY`) BUCKETS 16
   PROPERTIES (
   "replication_allocation" = "tag.location.default: 1",
   "in_memory" = "false",
   "persistent" = "false",
   "storage_format" = "V2",
   "enable_unique_key_merge_on_write" = "true",
   "light_schema_change" = "true",
   "store_row_column" = "true",
   "disable_auto_compaction" = "false"
   );
   ```
   and the query was executed with `useServerPrepStmts=true`:
   ```
   SELECT * from usertable WHERE YCSB_KEY = ?
   ```
   
   
   ### Solution
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   

