Steve Carlin created IMPALA-12961:
-------------------------------------
Summary: Use a Map instead of an ArrayList for Expr in HDFS RelNode
Key: IMPALA-12961
URL: https://issues.apache.org/jira/browse/IMPALA-12961
Project: IMPALA
Issue Type: Sub-task
Reporter: Steve Carlin
This came up during code review of ImpalaHdfsScanRel:
"For wide tables where we are only needing a few columns projected, we will end
up with a long list with mostly Nulls. A LinkedHashMap (preserves Insertion
order) where the key is position and value is the SlotRef would be better
suited despite the cpu cost of hashing. In general, in a query planner, memory
is the most precious commodity since the plan search space can be large, so
anything we can do to reduce memory footprint would be preferred."
One counterargument: the list is used in other RelNodes, where it is the more
natural fit. For instance, the Project RelNode will have a RexInputRef RexNode
such as "$2", which refers to its input by position, so an array indexed by
position maps onto it directly. Every other RelNode works this way except for
the ScanNode.
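
To make the two options concrete, here is a minimal sketch of both layouts. It
is not the ImpalaHdfsScanRel code: the class and method names are made up for
illustration, and a plain String stands in for the SlotRef.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; String is a stand-in for Impala's SlotRef.
class ScanExprLayouts {
    // List layout: one entry per table column, null where the column is not projected.
    static List<String> asList(int numColumns, Map<Integer, String> projected) {
        List<String> exprs = new ArrayList<>(Collections.<String>nCopies(numColumns, null));
        projected.forEach((pos, slot) -> exprs.set(pos, slot));
        return exprs;
    }

    // Map layout: only projected columns are stored, keyed by column position.
    // LinkedHashMap keeps the insertion order of the projected columns.
    static Map<Integer, String> asMap(Map<Integer, String> projected) {
        return new LinkedHashMap<>(projected);
    }

    public static void main(String[] args) {
        Map<Integer, String> projected = new LinkedHashMap<>();
        projected.put(2, "slot_c2");
        projected.put(417, "slot_c417");

        List<String> list = asList(500, projected);
        Map<Integer, String> map = asMap(projected);

        // A "$2" reference resolves the same way in both layouts.
        System.out.println(list.get(2).equals(map.get(2)));
        // The layouts differ in how many entries they hold.
        System.out.println(list.size() + " entries vs " + map.size() + " entries");
    }
}
```

With the list, a positional reference is a direct index; with the map, every
lookup boxes the position and hashes it, in exchange for storing only the
projected columns.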
To add to the counterargument: let's take a worst-case scenario of a query
that has 10 tables with 500 columns apiece. If we are allocating 8-byte
pointers, we would need 10*500*8 bytes to hold this information, which is
40,000 bytes. While reducing the memory footprint is important, reducing it by
40,000 bytes really isn't going to make an impact. Even if we take into
account that multiple queries would be running simultaneously, this is a very
short-lived code path. So should we go with the more natural approach versus
the less memory-intensive approach?
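
The estimate above can be reproduced with a small sketch. The helper name is
hypothetical, and it counts only the reference slots of the list, assuming
8-byte references and ignoring object and array headers.

```java
// Hypothetical helper; reproduces the back-of-the-envelope estimate only.
class PlannerMemoryEstimate {
    // Bytes used by the reference slots of a fully populated list layout.
    // Assumes a fixed reference size and ignores list and object headers.
    static long listReferenceBytes(int tables, int columnsPerTable, int bytesPerReference) {
        return (long) tables * columnsPerTable * bytesPerReference;
    }

    public static void main(String[] args) {
        // Worst case from the ticket: 10 tables, 500 columns each, 8-byte pointers.
        System.out.println(listReferenceBytes(10, 500, 8)); // prints 40000
    }
}
```

Note that the map is not free either: each LinkedHashMap entry is a separate
object holding several references plus a boxed Integer key, so it costs well
over 8 bytes per stored column and only saves memory when few columns are
projected.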