johnpyp opened a new issue, #37755:
URL: https://github.com/apache/doris/issues/37755

   ### Search before asking
   
   - [X] I have searched the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   doris-2.1.4-rc03-e93678fd1e
   
   ### What's Wrong?
   
   I have one table defined with `DUPLICATE KEY(...)` and another with `UNIQUE 
KEY(...)`. The two table definitions are *identical* apart from that one 
difference, and both use the same key columns. Their row counts are also nearly 
identical (~800M rows each):
   
   (using `SHOW DATA`: first table uses DUPLICATE KEY, second table uses UNIQUE 
KEY)
   
![image](https://github.com/user-attachments/assets/68c3a2d8-e418-4d37-9602-8ad3198896bf)
   
   
![image](https://github.com/user-attachments/assets/6651886c-9288-447e-b63a-7421269ddcc7)
   
   
   ### What You Expected?
   
   The `UNIQUE KEY` table should be approximately the same size as the 
`DUPLICATE KEY` table, perhaps slightly larger due to hidden-column overhead, 
but definitely not more than 2x as large.
   
   ### How to Reproduce?
   
   1. Create any two tables, one using `DUPLICATE KEY` and one using `UNIQUE 
KEY`.
   2. Ingest the same data into each.
   3. `ANALYZE TABLE` on each table to make sure the storage numbers are up to 
date.
   4. Compare data sizes with `SHOW DATA`
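   
   The steps above can be sketched roughly as follows. This is only an 
illustration: the table names (`events_dup`, `events_uniq`, `events_source`), 
the columns, the bucket count, and `replication_num` are all hypothetical and 
not taken from my real schema. It also assumes the source rows are already 
unique on the key columns, so both tables end up with the same row count.
   
   ```sql
   -- Hypothetical schema: names, types, and properties are illustrative only.
   CREATE TABLE events_dup (
       id      BIGINT,
       ts      DATETIME,
       payload VARCHAR(256)
   )
   DUPLICATE KEY(id, ts)
   DISTRIBUTED BY HASH(id) BUCKETS 8
   PROPERTIES ("replication_num" = "1");

   -- Same definition; only the key model differs.
   CREATE TABLE events_uniq (
       id      BIGINT,
       ts      DATETIME,
       payload VARCHAR(256)
   )
   UNIQUE KEY(id, ts)
   DISTRIBUTED BY HASH(id) BUCKETS 8
   PROPERTIES ("replication_num" = "1");

   -- Step 2: load the same data into both tables
   -- (events_source is an assumed pre-existing table).
   INSERT INTO events_dup  SELECT id, ts, payload FROM events_source;
   INSERT INTO events_uniq SELECT id, ts, payload FROM events_source;

   -- Step 3: refresh statistics on both tables.
   ANALYZE TABLE events_dup WITH SYNC;
   ANALYZE TABLE events_uniq WITH SYNC;

   -- Step 4: compare the reported sizes.
   SHOW DATA FROM events_dup;
   SHOW DATA FROM events_uniq;
   ```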
   
   ### Anything Else?
   
   If this is for some reason an intended feature of the UNIQUE data model, it 
would be great to warn about it in the documentation (I couldn't find anything 
about it).
   
   Additionally, it would be nice to have an "Offline Deduplication" operation 
that I can run on demand for `DUPLICATE` tables (maybe by using temporary 
segment swaps or something similar), like ClickHouse's `OPTIMIZE TABLE ... 
DEDUPLICATE`.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

