dario-liberman commented on PR #10867:
URL: https://github.com/apache/pinot/pull/10867#issuecomment-1632555378

   > I am working on other aggregation strategies that do not require 
partitioning - 
[master...dario-liberman:pinot:funnel-strategies](https://github.com/apache/pinot/compare/master...dario-liberman:pinot:funnel-strategies)
   
   @kishoreg @chenboat - Finally had a chance to add tests and complete the PR 
for the remaining aggregation strategies - 
https://github.com/apache/pinot/pull/11092
   
   > > > > In order for this aggregation to work, does it require all the data 
to be partitioned by segments (i.e. all users show up in the same segment, and 
no user can be shared across segments)? That is the pre-requisite for 
`SEGMENT_PARTITIONED_DISTINCT_COUNT`
   > > > 
   > > > 
   > > > Yes. That is the prerequisite for using the aggregation function. For 
a realtime table, it needs the Kafka topic to be partitioned (e.g., by user IDs).
   > > 
   > > 
   > > This is probably not practical and we should consider fixing this. Even 
if the Kafka topic is partitioned by the same user_id, there is no guarantee 
that a given user's data will stay within a single segment.
   > 
   > I shared above a work-in-progress branch with more funnel count 
aggregation strategies, effectively equivalent to DISTINCTCOUNT, 
DISTINCTCOUNTBITMAP and DISTINCTCOUNTTHETASKETCH. These do not depend on 
partitioning.
   > 
   > The strategy equivalent to SEGMENTPARTITIONEDDISTINCTCOUNT we have here is 
just a first version. When the column is configured as a partition column, the 
same users appear in more than one segment only across time boundaries between 
segments, which, when grouping over time (e.g., per hour) to see funnel trends, 
gives good enough approximations. In the future it should be possible to 
incorporate a partition-level (or server-level?) phase so that we aggregate 
differently between segments within the same partition and segments across 
partitions. I will need more time for that, though. For now I am adding 
different strategies so we can use the right one for each use case, as it will 
also depend on the desired sessionization window.
   
   Hopefully these aggregation strategies address the concerns here?
   
   Regarding the other concern about sub-function arguments not being real 
transform functions, that will be addressed in a separate PR.
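
To illustrate the partitioning prerequisite discussed above, here is a minimal sketch in plain Python (not Pinot code). The function names and the sample user IDs are hypothetical. It assumes the segment-partitioned strategy behaves like summing per-segment distinct counts, while the partition-independent strategies behave like merging per-segment sets before counting:

```python
from typing import Iterable, List, Set


def sum_of_segment_distinct_counts(segments: List[Iterable[str]]) -> int:
    """Count distinct users per segment and add the counts up.

    Only exact when no user appears in more than one segment.
    """
    return sum(len(set(segment)) for segment in segments)


def merged_distinct_count(segments: List[Iterable[str]]) -> int:
    """Merge the per-segment user sets first, then count.

    Exact regardless of how users are spread across segments.
    """
    merged: Set[str] = set()
    for segment in segments:
        merged.update(segment)
    return len(merged)


# Hypothetical data: each inner list is the user_id column of one segment.
partitioned = [["u1", "u2", "u1"], ["u3", "u4"]]  # no user spans segments
overlapping = [["u1", "u2"], ["u2", "u3"]]        # u2 spans both segments

partitioned_sum = sum_of_segment_distinct_counts(partitioned)
partitioned_merged = merged_distinct_count(partitioned)
overlapping_sum = sum_of_segment_distinct_counts(overlapping)
overlapping_merged = merged_distinct_count(overlapping)
```

With the partitioned data both approaches agree. With the overlapping data the summed count is higher than the merged count because `u2` is counted once per segment, which is the over-counting concern raised in the quoted thread.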


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

