[GitHub] [iceberg] xwmr-max commented on pull request #6440: Flink: Support Look-up Function

GitBox Thu, 12 Jan 2023 00:28:09 -0800


xwmr-max commented on PR #6440:
URL: https://github.com/apache/iceberg/pull/6440#issuecomment-1379970862


   > The lookup function is for lookup joins in Flink [1]. I have the same
   > question as @zinking. Normally, lookup functions fit better with
   > point-query storage systems (like JDBC).
   >
   > Let's discuss the two scenarios separately.
   >
   > * Small Iceberg table that can fit into memory comfortably using caching.
   >   In this case, the cache should always be enabled; I don't see a reason
   >   why the cache should be disabled. Also, if a taskmanager has 8 slots,
   >   does the lookup function cache 1 or 8 copies of the reference data set?
   > * Large Iceberg table. Would FLIP-204 [2] help?
   >
   > [1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/joins/#lookup-join
   > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-204%3A+Introduce+Hash+Lookup+Join
   
   Hi stevenzwu. Thank you for your review. Let me address the two points you 
raised separately.
   
   - As you said, a small Iceberg table can easily be loaded into memory 
using the cache, and queries against it are fast, so from that point of view 
the cache could always be enabled. However, there are some circumstances to 
consider. First, our solution provides _lookup-join-cache-size_ and 
_lookup-join-cache-ttl_ to control the cache size and the expiration time 
respectively, so that the cache can be sized according to actual conditions 
and the queried data can be kept up to date. Second, the scheme improves 
query efficiency by storing rows with the same primary key in the cache; if 
the cache does not contain rows for a given primary key, the latest data is 
loaded from the table. In addition, if a taskmanager has 8 slots, the lookup 
function needs to cache a copy of the data set. The lookup function is just a 
basic capability that future work, such as secondary indexes, can build on to 
improve performance. At present Iceberg does not support this basic 
capability, which would satisfy the requirements of many scenarios.
   - FLIP-204 [2] only raises the cache hit ratio: the user can use a hint to 
enable a partitioned lookup join, which forces the input of the lookup join to 
be hash-shuffled by the lookup keys. This can indeed relieve pressure on the 
cache, but Iceberg tables with larger data do not support it well. With the 
basic lookup function in place, however, we can apply this in the future.
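
To make the cache behaviour in the first point concrete, here is a minimal, language-neutral sketch in Python. It is not the code from this PR. `LookupCache`, `loader`, `max_size`, and `ttl_seconds` are hypothetical names standing in for the lookup function, the table read, and the _lookup-join-cache-size_ and _lookup-join-cache-ttl_ options; the units (entry count, seconds) and the least-recently-used eviction policy are assumptions.

```python
import time
from collections import OrderedDict


class LookupCache:
    """Illustrative size- and TTL-bounded cache keyed by primary key."""

    def __init__(self, max_size, ttl_seconds, loader, clock=time.monotonic):
        self.max_size = max_size        # assumed analogue of lookup-join-cache-size
        self.ttl_seconds = ttl_seconds  # assumed analogue of lookup-join-cache-ttl
        self.loader = loader            # hypothetical: reads rows for a key from the table
        self.clock = clock              # injectable so expiry can be tested
        self._entries = OrderedDict()   # key -> (loaded_at, rows), oldest use first

    def lookup(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None:
            loaded_at, rows = entry
            if now - loaded_at < self.ttl_seconds:
                # Cache hit: mark the key as most recently used.
                self._entries.move_to_end(key)
                return rows
            # Entry expired: drop it and fall through to a reload.
            del self._entries[key]
        # Cache miss: load the latest rows for this key from the table.
        rows = self.loader(key)
        self._entries[key] = (now, rows)
        if len(self._entries) > self.max_size:
            # Over capacity: evict the least recently used key.
            self._entries.popitem(last=False)
        return rows
```

A hit returns the cached rows for the primary key; a miss or an expired entry triggers a reload from the table, which is how the TTL bounds the staleness of the joined data.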
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

