Why would that require a fancy query optimiser? It doesn’t seem like it would be much more difficult than modifying our function binding rules and having a specialised version.
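Concretely, the rewrite being discussed might look like this, assuming a hypothetical table `t` with partition key `pk` and a `list<int>` column `l` (the table is invented for illustration, and UNNEST subqueries do not exist in CQL today, so the first form is purely hypothetical syntax):

```sql
-- Naive form: would deserialize the whole collection just to count
-- its elements (UNNEST is hypothetical, not implemented).
SELECT COUNT(e) FROM (SELECT UNNEST(l) AS e FROM t WHERE pk = 0);

-- Specialized form: COLLECTION_COUNT can answer from the element-count
-- prefix of the serialized collection, without full deserialization.
SELECT COLLECTION_COUNT(l) FROM t WHERE pk = 0;
```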
> On 9 Dec 2022, at 13:26, Andrés de la Peña <adelap...@apache.org> wrote: > > > Note that specialized collection functions are also an opportunity for > optimization. For example, COLLECTION_COUNT reads only the first bytes of a > serialized collection, since those bytes contain the number of elements in > that collection. The simplest implementation of COUNT(UNNEST(collection)) > wouldn't do that. It would probably require deserializing the entire > collection. A fancy query optimizer could internally translate > COUNT(UNNEST(collection)) to COLLECTION_COUNT(collection) to get a nice > performance improvement. Unfortunately we don't have such an optimizer at the > moment. > > I don't see a reason not to leave those collection functions as they are, > renaming aside. They can be later complemented by the generic UNNEST, or > subqueries, when someone is willing to work on those features. > > It's not clear to me how UNNEST + UDA would operate on maps. We would still > need a way to extract the keys or the values of maps, like the current > MAP_KEYS and MAP_VALUES functions, wouldn't we? > > As for the impossibility of applying COLLECTION_MAX, COLLECTION_MIN, etc. to > maps, I wouldn't be against renaming those to SET_MAX, LIST_MAX, SET_MIN, > LIST_MIN, etc. Sets and lists have many things in common and it's a pity that > we don't have a common name for them. This lack of a common name for lists > and sets is something that permeates the code in multiple places. This > kind of problem is probably the reason why Java's maps aren't collections. > > > > > >> On Fri, 9 Dec 2022 at 11:26, Benedict <bened...@apache.org> wrote: >> Right, this is basically my view - it can be syntactic sugar for UNNEST >> subqueries as and when we offer those (if not now), but I think we should be >> able to apply any UDA or aggregate to collections with some syntax that’s >> ergonomic. 
>> >> I don’t think APPLY is the right way to express it, my version was >> MAX(column AS COLLECTION), which means: bind this operator to the collection >> rather than the rows (and since this is CAST-like, I’d say this is also a >> reasonable way to apply aggregations to single values too) >> >> But perhaps there’s some other third option. Or, if not, let’s simply >> support UNNEST subqueries. >> >> >>>> On 9 Dec 2022, at 11:19, Claude Warren, Jr via dev >>>> <dev@cassandra.apache.org> wrote: >>>> >>> >>> I still think that semantically it makes sense to have a function that >>> applies an aggregate to various collection types. So rather than building >>> ARRAY_MAX, do APPLY(MAX, column) or APPLY(MAX(column)); it is clear what is >>> being requested, and APPLY can be the source of truth for which aggregate >>> functions work on which column types. >>> >>>> On Fri, Dec 9, 2022 at 10:28 AM Andrés de la Peña <adelap...@apache.org> >>>> wrote: >>>> Indeed this discussion is useful now that we know that there is dissension >>>> about these changes. However, as those changes were happening none of the >>>> people involved in them felt the need for a DISCUSS thread, and I opened >>>> this thread as soon as Benedict objected to the changes. I think this is >>>> perfectly in line with ASF's wise policy about lazy consensus: >>>> https://community.apache.org/committers/lazyConsensus.html >>>> >>>>> My point here was that none I could find use ARRAY_COUNT - either >>>>> ARRAY_SIZE or ARRAY_LENGTH >>>> >>>> A quick search on Google shows: >>>> https://www.ibm.com/docs/en/psfa/7.2.1?topic=functions-array-count >>>> https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/arrayfun.html#fn-array-count >>>> https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Array/ARRAY_COUNT.htm >>>> >>>>> We seem to be mixing and matching our databases we use as templates here. 
>>>>> Most of the standard examples we use (Postgres, MySQL etc) do not offer >>>>> this, and do not offer equivalent functions on any of their other >>>>> collection types. >>>> >>>> I don't know what's wrong with those databases, nor what makes MySQL and >>>> Postgres more standard than others. AFAIK MySQL doesn't even have an array >>>> type, so it is hardly going to have array functions. Postgres however does >>>> have arrays. Those arrays can be manipulated with both subqueries and an >>>> unnest function. However, this doesn't prevent Postgres from also having a set >>>> of array functions: https://www.postgresql.org/docs/12/functions-array.html >>>> >>>> It seems that it is difficult to find a DB supporting arrays that doesn't >>>> also offer an assorted set of array functions. DBs can perfectly support >>>> subqueries, unnesting and utility functions at the same time. For example, >>>> with Postgres you can get the size of an array with a subquery, or with >>>> UNNEST, or with the ARRAY_LENGTH function. >>>> >>>> The collection functions that we discuss here are mostly analogous to >>>> those sets of functions that we find anywhere. They are a quite small, >>>> well-encapsulated and non-invasive feature that has the advantage of being >>>> already done. Those functions don't prevent us from adding support for >>>> subqueries or unnesting whenever someone wants to work on them. >>>> >>>> Adding subqueries at a later stage wouldn't involve the deprecation of the >>>> collection functions, since those are still useful as a shortcut, as we >>>> see in other DBs out there. >>>> >>>> >>>> >>>>> On Thu, 8 Dec 2022 at 20:47, J. D. Jordan <jeremiah.jor...@gmail.com> >>>>> wrote: >>>>> I think this thread proves the point that a DISCUSS thread for API >>>>> changes on dev@ will get more viewpoints than just having something in >>>>> JIRA. I think this thread has been useful and should result in us having >>>>> a better user facing API than without it. 
>>>>> >>>>>>> On Dec 8, 2022, at 1:57 PM, Andrés de la Peña <adelap...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>> >>>>>>> I expect we’ll rehash it every API thread otherwise. >>>>>> >>>>>> Since you bring up the topic, I understand that opposing every single >>>>>> reviewed decision that has been taken on CASSANDRA-17811, >>>>>> CASSANDRA-8877, CASSANDRA-17425 and CASSANDRA-18085 could make an >>>>>> argument in favour of the policy demanding a DISCUSS thread for every >>>>>> new feature, big or small. One could argue that any change to the work >>>>>> on this area that comes out of the current discussion could have been >>>>>> prevented by a previous DISCUSS thread. >>>>>> >>>>>> However, those tickets have been around for a long time in a very public way, >>>>>> and there hasn't been any controversy around them until now. So I think >>>>>> that an initial thread wouldn't have prevented anyone from resurrecting >>>>>> the discussion at any point in time if they haven't put in the time to look >>>>>> into the changes before. We have already seen those after-commit >>>>>> discussions even for discussed, voted, approved and reviewed CEPs. The >>>>>> only thing we would get is two discussion threads instead of one. By the >>>>>> looks of it, I doubt that the suggested policy about discuss threads is >>>>>> going to be accepted. >>>>>> >>>>>> In any case, this is a separate topic from what we're discussing here. >>>>>> >>>>>> >>>>>>> On Thu, 8 Dec 2022 at 18:21, Benedict <bened...@apache.org> wrote: >>>>>>> It feels like this is a recurring kind of discussion, and I wonder if >>>>>>> there’s any value in deciding on a general approach to guide these >>>>>>> discussions in future? Are we aiming to look like SQL as we evolve, and >>>>>>> if so which products do we want to be informed by? >>>>>>> >>>>>>> I expect we’ll rehash it every API thread otherwise. 
>>>>>>> >>>>>>>>> On 8 Dec 2022, at 17:37, Benedict <bened...@apache.org> wrote: >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>>> 1) Do they offer ARRAY_SUM or ARRAY_AVG? >>>>>>>>> Yes, a quick search on Google shows some examples: >>>>>>>>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg >>>>>>>>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g >>>>>>>>> https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum >>>>>>>>> https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html >>>>>>>> >>>>>>>> We seem to be mixing and matching our databases we use as templates >>>>>>>> here. Most of the standard examples we use (Postgres, MySQL etc) do >>>>>>>> not offer this, and do not offer equivalent functions on any of their >>>>>>>> other collection types. >>>>>>>> >>>>>>>> If we are to use industry norms as inspiration or justification, >>>>>>>> subqueries do seem to be the approach taken. So the question is really >>>>>>>> why we should introduce non-standard variants of fairly atypical >>>>>>>> functions instead of the more typical approach of subqueries? Because >>>>>>>> it’s more work, or some other reason? We should be explicit. >>>>>>>> >>>>>>>> If we aren’t using industry norms as our justification, we should >>>>>>>> probably jettison the lot as honestly the landscape is a mess in this >>>>>>>> area. >>>>>>>> >>>>>>>>> >>>>>>>>>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH? >>>>>>>>> Yes, again we can search for some examples: >>>>>>>>> https://docs.snowflake.com/en/sql-reference/functions/array_size.html >>>>>>>>> https://docs.databricks.com/sql/language-manual/functions/array_size.html >>>>>>>> >>>>>>>> My point here was that none I could find use ARRAY_COUNT - either >>>>>>>> ARRAY_SIZE or ARRAY_LENGTH >>>>>>>> >>>>>>>>>> 3) A map is a collection in C* parlance, but I gather from below you >>>>>>>>>> expect these methods not to operate on them? 
>>>>>>>>> Nope, only COLLECTION_COUNT works on sets, lists and maps. >>>>>>>>> COLLECTION_MIN and COLLECTION_MAX require a set or list. >>>>>>>>> COLLECTION_SUM and COLLECTION_AVG require a numeric collection, the >>>>>>>>> same way that the ARRAY_SUM and ARRAY_AVG functions above require a >>>>>>>>> numeric array. >>>>>>>> >>>>>>>> Should collection_max only work on list and set when we can call >>>>>>>> MAX(map) today (at least I think we can)? Whatever we decide our >>>>>>>> aggregate behaviour is, it should be consistent here. >>>>>>>> >>>>>>>>>> Does ARRAY_MAX operate on single values? If we are to base our >>>>>>>>>> decisions on norms elsewhere, we should be consistent about it. >>>>>>>>> It doesn't in any of the examples above. Those functions aren't a >>>>>>>>> standard so I don't know if there are other DBs around that support >>>>>>>>> it. In any case, the fact that we look for inspiration in other >>>>>>>>> databases to minimize surprise etc. doesn't mean that we have to do >>>>>>>>> exactly the same. After all, CQL is not SQL and our collections >>>>>>>>> aren't SQL arrays. >>>>>>>>> >>>>>>>>> Note that the collection functions added by CASSANDRA-8877 don't >>>>>>>>> operate on single values either. That idea was proposed by Yifan on >>>>>>>>> CASSANDRA-18078 and it looked good to Francisco and me. The patch is >>>>>>>>> on CASSANDRA-18085, already reviewed and blocked waiting on the >>>>>>>>> outcome of this discussion. >>>>>>>>> >>>>>>>>> The new collection functions can do the same as the new MAXWRITETIME >>>>>>>>> function, but not only for getting max timestamps, but also min >>>>>>>>> timestamps and min/max TTLs. The missing part is that MAXWRITETIME can >>>>>>>>> accept both collections and single elements, so callers don't need to >>>>>>>>> know the type of the column. That's the motivation behind the idea of >>>>>>>>> doing the same with the collection functions, so they can entirely >>>>>>>>> replace MAXWRITETIME. 
>>>>>>>>> >>>>>>>>> However I wouldn't be against leaving the collection functions >>>>>>>>> working only on collections, as originally designed, and as they >>>>>>>>> currently are on trunk. The question is what we do with MAXWRITETIME. >>>>>>>>> That function is also only on trunk, and it might be repetitive given >>>>>>>>> the more generic collection functions. It's also a bit odd that there >>>>>>>>> isn't, for example, a similar MINTTL function. Maybe we should start >>>>>>>>> a separate discussion thread about that new function? >>>>>>>> >>>>>>>> I think we should figure out our overall strategy - these are all >>>>>>>> pieces of the puzzle IMO. But I guess the above questions seem to come >>>>>>>> first and will shape this. I would be in favour of some general >>>>>>>> approach, however, such as either first casting to a collection, or >>>>>>>> passing an aggregation operator to WRITETIME. >>>>>>>> >>>>>>>> >>>>>>>>>> On Thu, 8 Dec 2022 at 14:21, Benedict <bened...@apache.org> wrote: >>>>>>>>>> 1) Do they offer ARRAY_SUM or ARRAY_AVG? >>>>>>>>>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH? >>>>>>>>>> 3) A map is a collection in C* parlance, but I gather from below you >>>>>>>>>> expect these methods not to operate on them? >>>>>>>>>> >>>>>>>>>> Does ARRAY_MAX operate on single values? If we are to base our >>>>>>>>>> decisions on norms elsewhere, we should be consistent about it. >>>>>>>>>> >>>>>>>>>> It’s worth noting that ARRAY is an ISO SQL concept, as is MULTISET. >>>>>>>>>> Some databases also have Set or Map types, such as MySQL’s Set and >>>>>>>>>> Postgres’ hstore. These databases only support ARRAY_ functions, >>>>>>>>>> seemingly, plus special MULTISET operators defined by the SQL >>>>>>>>>> standard where that data type is supported. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> On 8 Dec 2022, at 12:11, Andrés de la Peña <adelap...@apache.org> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> "ARRAY_MAX" and "ARRAY_MIN" functions to get the max/min element in >>>>>>>>>>> a list are not an uncommon practice. You can find them in SparkSQL, >>>>>>>>>>> Amazon Timestream, Teradata, etc. Since we have what we call >>>>>>>>>>> collections instead of arrays, it makes sense to call the analogous >>>>>>>>>>> functions "COLLECTION_MAX", "COLLECTION_MIN", etc. >>>>>>>>>>> >>>>>>>>>>> As for maps, CASSANDRA-8877 also introduced "MAP_KEYS" and >>>>>>>>>>> "MAP_VALUES" functions to get the keys or the values of a map, so >>>>>>>>>>> one can feed them to "MAX", "COLLECTION_MAX", etc. 
That isn't >>>>>>>>>>> anything too original either; you can find identical functions in >>>>>>>>>>> SparkSQL, for example. >>>>>>>>>>> >>>>>>>>>>> I find simple utility functions easier to use than subqueries. But >>>>>>>>>>> we don't have to choose. We can also have subqueries if someone >>>>>>>>>>> finds the time to work on them. >>>>>>>>>>> >>>>>>>>>>>> On Thu, 8 Dec 2022 at 12:04, Claude Warren, Jr via dev >>>>>>>>>>>> <dev@cassandra.apache.org> wrote: >>>>>>>>>>>> I think the semantics of the situation are important here. >>>>>>>>>>>> >>>>>>>>>>>> Let’s take MAX as our example aggregate function. >>>>>>>>>>>> >>>>>>>>>>>> We all expect that in a DB context MAX(column) will return the >>>>>>>>>>>> value of the column with the maximum value. That is the expected >>>>>>>>>>>> semantics of MAX. >>>>>>>>>>>> >>>>>>>>>>>> The question here is that there are data types that are >>>>>>>>>>>> multi-valued and there is a desire to apply MAX to the values >>>>>>>>>>>> within the column. I would expect that this would return the >>>>>>>>>>>> maximum value of the column for every row in the DB. >>>>>>>>>>>> >>>>>>>>>>>> Suppose there were a keyword that operated like the Java BiFunction >>>>>>>>>>>> class, where the function would apply a second function to the >>>>>>>>>>>> column data. For the purposes of this discussion let’s call this >>>>>>>>>>>> function APPLY. >>>>>>>>>>>> >>>>>>>>>>>> So APPLY(MAX, column) would return the maximum value from the >>>>>>>>>>>> column for each row in the DB. >>>>>>>>>>>> >>>>>>>>>>>> MAX(APPLY(MAX, column)) would get the maximum value from the column >>>>>>>>>>>> across all the rows. >>>>>>>>>>>> >>>>>>>>>>>> Similarly, APPLY could be used with other functions: >>>>>>>>>>>> MAX(APPLY(MIN, column)) would get the largest minimum value from the column >>>>>>>>>>>> across all rows. >>>>>>>>>>>> >>>>>>>>>>>> These statements make clear semantically what is being asked for. 
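A sketch of how the proposed APPLY keyword could read, assuming a hypothetical table `t` with a `list<int>` column `l` (APPLY is a proposal in this thread, not implemented syntax, and the table is invented for illustration):

```sql
-- Within each row: the maximum element of that row's collection.
SELECT APPLY(MAX, l) FROM t;

-- Across rows: the largest of the per-row maximums.
SELECT MAX(APPLY(MAX, l)) FROM t;

-- Mixed: the largest of the per-row minimums.
SELECT MAX(APPLY(MIN, l)) FROM t;
```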
>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Dec 8, 2022 at 10:57 AM Benedict <bened...@apache.org> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> I meant unnest, not unwrap. >>>>>>>>>>>>> >>>>>>>>>>>>>>> On 8 Dec 2022, at 10:34, Benedict <bened...@apache.org> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I do not think we should have functions that aggregate across >>>>>>>>>>>>>>> rows and functions that operate within a row use the same name. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I’m sympathetic to that view for sure. I wouldn’t be too >>>>>>>>>>>>>> disappointed by that outcome, and SQL engines seem to take a >>>>>>>>>>>>>> similar approach, however they mostly rely on sub-queries to get >>>>>>>>>>>>>> around this problem, and the SQL standard introduces UNWRAP for >>>>>>>>>>>>>> operating on arrays (by translating them into a table), >>>>>>>>>>>>>> permitting subqueries to aggregate them. It seems to me we have >>>>>>>>>>>>>> four options: >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1) introduce functionality similar to UNWRAP and subqueries >>>>>>>>>>>>>> 2) introduce new syntax to permit operating on collections with >>>>>>>>>>>>>> the same functions >>>>>>>>>>>>>> 3) permit the same functions to operate on both, with a >>>>>>>>>>>>>> precedence order, and introduce syntax to permit breaking the >>>>>>>>>>>>>> precedence order >>>>>>>>>>>>>> 4) introduce new functions >>>>>>>>>>>>>> >>>>>>>>>>>>>> (1) might look like SELECT (SELECT MAX(item) FROM UNWRAP(list)) >>>>>>>>>>>>>> AS max_item FROM table >>>>>>>>>>>>>> >>>>>>>>>>>>>> (2) and (3) might look something like: >>>>>>>>>>>>>> >>>>>>>>>>>>>> SELECT MAX(list AS COLLECTION) or >>>>>>>>>>>>>> SELECT MAX(list AS ROWS) >>>>>>>>>>>>>> >>>>>>>>>>>>>> (4) might look something like we have already, but perhaps with >>>>>>>>>>>>>> different names >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The comparator for collections is the lexicographical compare >>>>>>>>>>>>>>> on the collection items >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is a fair point, I 
mistakenly thought it sorted first on >>>>>>>>>>>>>> size. Even this definition is a little funkier for Map types, >>>>>>>>>>>>>> where the values of a key may cause something to sort earlier >>>>>>>>>>>>>> than a map whose next key sorts first. There are multiple >>>>>>>>>>>>>> potential lexicographical sorts for Maps (i.e., by keys first, >>>>>>>>>>>>>> then values, or by (key, value) pairs), so this is particularly >>>>>>>>>>>>>> poorly defined IMO. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The maximum of a blob type is pretty well defined I think, as >>>>>>>>>>>>>> are boolean, inetaddress etc. However, even for List or Set >>>>>>>>>>>>>> collections there are multiple reasonable functions one could >>>>>>>>>>>>>> define for maximum, so it would make more sense to me to permit >>>>>>>>>>>>>> the user to define the comparison as part of the MAX function if >>>>>>>>>>>>>> we are to offer it. However, with the lexicographical definition >>>>>>>>>>>>>> we have I am somewhat less concerned for Set and List. Map seems >>>>>>>>>>>>>> like a real problem though, if we support these operators (which >>>>>>>>>>>>>> perhaps we do not). >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 7 Dec 2022, at 12:13, Andrés de la Peña >>>>>>>>>>>>>>>> <adelap...@apache.org> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The comparator for collections is the lexicographical compare >>>>>>>>>>>>>>> on the collection items. That might not be the most useful >>>>>>>>>>>>>>> thing, but it's not impossible to imagine cases where that >>>>>>>>>>>>>>> ordering can be useful. To give a random example, you can use a >>>>>>>>>>>>>>> list column to store the name and surnames of a person, >>>>>>>>>>>>>>> considering that some people can have multiple surnames. You >>>>>>>>>>>>>>> can then sort rows based on that list column, to get the names >>>>>>>>>>>>>>> in alphabetical order, or to get the first or last person >>>>>>>>>>>>>>> according to that order. 
I'm sure we can think of more cases >>>>>>>>>>>>>>> where the lexicographical order of a list can be useful, >>>>>>>>>>>>>>> although I agree that it's not the greatest feature ever. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It's worth mentioning that collections are not the only data >>>>>>>>>>>>>>> types where the MIN/MAX functions are of dubious utility. For >>>>>>>>>>>>>>> example, blob columns can also be used with MIN/MAX. Same as >>>>>>>>>>>>>>> with collections, the min/max blobs are selected according to >>>>>>>>>>>>>>> the comparator for the data type. That comparator is the >>>>>>>>>>>>>>> lexicographic compare on the unsigned values of the byte >>>>>>>>>>>>>>> contents. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The utility of MIN/MAX on inet and boolean columns isn't very >>>>>>>>>>>>>>> clear either, although one can always imagine use cases. For >>>>>>>>>>>>>>> example, MAX of a boolean column can be used as a logical >>>>>>>>>>>>>>> disjunction. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> If we were to special-case MIN/MAX functions to reject >>>>>>>>>>>>>>> collections, we should also reject other data types such as, at >>>>>>>>>>>>>>> least, blobs. That would require a deprecation plan. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Also, it's not that the comparator used by MIN/MAX is an >>>>>>>>>>>>>>> internal obscure thing. The action of that comparator is very >>>>>>>>>>>>>>> visible when any of those data types is used in a clustering >>>>>>>>>>>>>>> column, and it's used as the basis for "ORDER BY" clauses. >>>>>>>>>>>>>>> Should we also reject blobs, collections, tuples and UDTs on >>>>>>>>>>>>>>> "ORDER BY"? I don't think so. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I rather think that basing MIN/MAX on the regular order of the >>>>>>>>>>>>>>> column data type is consistent, easy to do and easy to >>>>>>>>>>>>>>> understand. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I don't see the need to add rules explicitly forbidding some >>>>>>>>>>>>>>> data types on MIN/MAX functions just because we can't easily >>>>>>>>>>>>>>> figure out a use case for their ordering. Especially when we >>>>>>>>>>>>>>> are exposing that same ordering on clusterings and "ORDER BY". >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, 6 Dec 2022 at 18:56, J. D. Jordan >>>>>>>>>>>>>>>> <jeremiah.jor...@gmail.com> wrote: >>>>>>>>>>>>>>>> If the functionality truly has never actually worked, then >>>>>>>>>>>>>>>> throwing an error that MAX is not supported for collections >>>>>>>>>>>>>>>> seems reasonable. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> But we should throw an error, I do not think we should have >>>>>>>>>>>>>>>> functions that aggregate across rows and functions that >>>>>>>>>>>>>>>> operate within a row use the same name. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> My expectation as a user would be that MAX either always >>>>>>>>>>>>>>>> aggregates across rows, so results in a single row of output >>>>>>>>>>>>>>>> or always operates within a row, so returns the full set of >>>>>>>>>>>>>>>> rows matching the query. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So if we want a max that aggregates across rows that works for >>>>>>>>>>>>>>>> collections we could change it to return the aggregated max >>>>>>>>>>>>>>>> across all rows. Or we just leave it as an error and if >>>>>>>>>>>>>>>> someone wants the max across all rows they would ask for >>>>>>>>>>>>>>>> MAX(COLLECTION_MAX(column)). Yes I still agree COLLECTION_MAX >>>>>>>>>>>>>>>> may be a bad name. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> > On Dec 6, 2022, at 11:55 AM, Benedict <bened...@apache.org> >>>>>>>>>>>>>>>> > wrote: >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > As far as I am aware it has never worked in a release, and >>>>>>>>>>>>>>>> > so deprecating it is probably not as challenging as you >>>>>>>>>>>>>>>> > think. 
Only folk that have been able to parse the raw bytes >>>>>>>>>>>>>>>> > of the collection in storage format would be affected - >>>>>>>>>>>>>>>> > which we can probably treat as zero. >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan >>>>>>>>>>>>>>>> >> <jeremiah.jor...@gmail.com> wrote: >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that >>>>>>>>>>>>>>>> >>> operates over rows containing collections, returning the >>>>>>>>>>>>>>>> >>> collection with the most elements. This is just a >>>>>>>>>>>>>>>> >>> nonsensical operation to support IMO. We should decide as >>>>>>>>>>>>>>>> >>> a community whether we “fix” this aggregation, or remove >>>>>>>>>>>>>>>> >>> it. >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> The current MAX function does not work this way afaik? It >>>>>>>>>>>>>>>> >> returns the row with the column that has the highest value >>>>>>>>>>>>>>>> >> in the clustering-order sense, as if the collection were used >>>>>>>>>>>>>>>> >> as a clustering key. While that also may have limited use, >>>>>>>>>>>>>>>> >> I don’t think it worthwhile to deprecate such use and all >>>>>>>>>>>>>>>> >> the headache that comes with doing so. >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >>> 2. I think “collection_" prefixed methods are >>>>>>>>>>>>>>>> >>> non-intuitive for discovery, and all-else equal it would >>>>>>>>>>>>>>>> >>> be better to use MAX,MIN, etc, same as for aggregations. >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> If we actually wanted to move towards using the existing >>>>>>>>>>>>>>>> >> names with new meanings, then I think that would take us >>>>>>>>>>>>>>>> >> multiple major releases. First deprecate existing use in >>>>>>>>>>>>>>>> >> current releases. Then make it an error in the next major >>>>>>>>>>>>>>>> >> release X. Then change the behavior in major release X+1. 
>>>>>>>>>>>>>>>> >> Just switching the behavior without having a major where >>>>>>>>>>>>>>>> >> such queries error out would make a bunch of user queries >>>>>>>>>>>>>>>> >> start returning “wrong” data. >>>>>>>>>>>>>>>> >> Also I don’t think those functions being cross-row >>>>>>>>>>>>>>>> >> aggregations for some column types, but within-row >>>>>>>>>>>>>>>> >> collection operations for other types, is any more >>>>>>>>>>>>>>>> >> intuitive, and actually would be more confusing. So I am >>>>>>>>>>>>>>>> >> -1 on using the same names. >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >>> 3. I think it is peculiar to permit methods named >>>>>>>>>>>>>>>> >>> collection_ to operate over non-collection types when they >>>>>>>>>>>>>>>> >>> are explicitly collection variants. >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> While I could see some point to this, I do not think it >>>>>>>>>>>>>>>> >> would be confusing for something named collection_XXX to >>>>>>>>>>>>>>>> >> treat a non-collection as a collection of 1. But maybe >>>>>>>>>>>>>>>> >> there is a better name for these functions. Rather than >>>>>>>>>>>>>>>> >> seeing them as collection variants, we should see them as >>>>>>>>>>>>>>>> >> variants that operate on the data in a single row, rather >>>>>>>>>>>>>>>> >> than aggregating across multiple rows. But even with that >>>>>>>>>>>>>>>> >> perspective I don’t know what the best name would be. >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict >>>>>>>>>>>>>>>> >>>> <bened...@apache.org> wrote: >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> >>> Thanks Andres, I think community input on direction here >>>>>>>>>>>>>>>> >>> will be invaluable. There’s a bunch of interrelated >>>>>>>>>>>>>>>> >>> tickets, and my opinions are as follows: >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that >>>>>>>>>>>>>>>> >>> operates over rows containing collections, returning the >>>>>>>>>>>>>>>> >>> collection with the most elements. 
This is just a >>>>>>>>>>>>>>>> >>> nonsensical operation to support IMO. We should decide as >>>>>>>>>>>>>>>> >>> a community whether we “fix” this aggregation, or remove >>>>>>>>>>>>>>>> >>> it. >>>>>>>>>>>>>>>> >>> 2. I think “collection_" prefixed methods are >>>>>>>>>>>>>>>> >>> non-intuitive for discovery, and all-else equal it would >>>>>>>>>>>>>>>> >>> be better to use MAX,MIN, etc, same as for aggregations. >>>>>>>>>>>>>>>> >>> 3. I think it is peculiar to permit methods named >>>>>>>>>>>>>>>> >>> collection_ to operate over non-collection types when they >>>>>>>>>>>>>>>> >>> are explicitly collection variants. >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> >>> Given (1), (2) becomes simple except for COUNT which >>>>>>>>>>>>>>>> >>> remains ambiguous, but this could be solved by either >>>>>>>>>>>>>>>> >>> providing a separate method for collections (e.g. SIZE) >>>>>>>>>>>>>>>> >>> which seems fine to me, or by offering a precedence order >>>>>>>>>>>>>>>> >>> for matching and a keyword for overriding the precedence >>>>>>>>>>>>>>>> >>> order (e.g. COUNT(collection AS COLLECTION)). >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> >>> Given (2), (3) is a little more difficult. However, I >>>>>>>>>>>>>>>> >>> think this can be solved several ways. >>>>>>>>>>>>>>>> >>> - We could permit explicit casts to collection types, that >>>>>>>>>>>>>>>> >>> for a collection type would be a no-op, and for a single >>>>>>>>>>>>>>>> >>> value would create a collection >>>>>>>>>>>>>>>> >>> - With precedence orders, by always selecting the scalar >>>>>>>>>>>>>>>> >>> function last >>>>>>>>>>>>>>>> >>> - By permitting WRITETIME to accept a binary operator >>>>>>>>>>>>>>>> >>> reduce function to resolve multiple values >>>>>>>>>>>>>>>> >>> >>>>>>>>>>>>>>>> >>> These decisions all imply trade-offs on each other, and >>>>>>>>>>>>>>>> >>> affect the evolution of CQL, so I think community input >>>>>>>>>>>>>>>> >>> would be helpful. 
>>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña <adelap...@apache.org>
>>>>> wrote:
>>>>
>>>> This will require some long introduction for context:
>>>>
>>>> The MAX/MIN functions aggregate rows to get the row with the
>>>> min/max column value according to their comparator. For
>>>> collections, the comparison is on the lexicographical order of the
>>>> collection elements. That's the very same comparator that is used
>>>> when collections are used as clustering keys and for ORDER BY.
>>>>
>>>> However, a bug in the MIN/MAX aggregate functions meant that the
>>>> results were presented in their unserialized form, although the row
>>>> selection was correct. That bug was recently solved by
>>>> CASSANDRA-17811. During that ticket the option of simply disabling
>>>> MIN/MAX on collections was also considered, since applying those
>>>> functions to collections doesn't seem super useful. However, that
>>>> option was quickly discarded and the operation was fixed so the
>>>> MIN/MAX functions correctly work for every data type.
>>>>
>>>> As a byproduct of the internal improvements of that fix,
>>>> CASSANDRA-8877 introduced a new set of functions that can perform
>>>> aggregations of the elements of a collection. Those were named
>>>> "map_keys", "map_values", "collection_min", "collection_max",
>>>> "collection_sum", and "collection_count".
>>>> Those are the names mentioned on the mailing list thread about
>>>> function naming conventions. Despite doing a kind of
>>>> within-collection aggregation, these functions are not what we
>>>> usually call aggregate functions, since they don't aggregate
>>>> multiple rows together.
>>>>
>>>> On a different line of work, CASSANDRA-17425 added to trunk a
>>>> MAXWRITETIME function to get the max timestamp of a multi-cell
>>>> column. However, the new collection functions can be used in
>>>> combination with the WRITETIME and TTL functions to retrieve the
>>>> min/max/sum/avg timestamp or TTL of a multi-cell column. Since the
>>>> new functions give a generic way of aggregating the timestamps and
>>>> TTLs of multi-cell columns, CASSANDRA-18078 proposed to remove that
>>>> MAXWRITETIME function.
>>>>
>>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove
>>>> that function in favour of the new generic collection functions.
>>>> However, the MAXWRITETIME function can work on both single-cell and
>>>> multi-cell columns, whereas "COLLECTION_MAX(WRITETIME(column))"
>>>> would only work on multi-cell columns. That's because MAXWRITETIME
>>>> of a not-multicell column doesn't return a collection, and one
>>>> should simply use "WRITETIME(column)" instead. So it was proposed
>>>> in CASSANDRA-18037 that collection functions applied to a
>>>> not-collection value consider that value as the only element of a
>>>> singleton collection.
>>>> So, for example, COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. That
>>>> ticket has already been reviewed and it's mostly ready to commit.
>>>>
>>>> Now we can go straight to the point:
>>>>
>>>> Recently Benedict brought back the idea of deprecating aggregate
>>>> functions applied to collections, the very same idea that was
>>>> mentioned in the CASSANDRA-17811 description almost four months
>>>> ago. That way we could rename the new collection functions
>>>> MIN/MAX/SUM/AVG, same as the classic aggregate functions. That way
>>>> MIN/MAX/SUM/AVG would be an aggregate function when applied to
>>>> not-collection columns, and a scalar function when applied to
>>>> collections. We can't do that with COUNT because there would be an
>>>> ambiguity, so the proposal for that case is renaming
>>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not
>>>> correctly exposing the proposal.
>>>>
>>>> I however would prefer to keep aggregate functions working on
>>>> collections, and keep the names of the new collection functions as
>>>> "COLLECTION_*". Reasons are:
>>>>
>>>> 1 - Making aggregate functions not work on collections might be
>>>> considered as breaking backward compatibility and require a
>>>> deprecation plan.
>>>> 2 - Keeping aggregate functions working on collections might not
>>>> look super useful, but they make the set of aggregate functions
>>>> consistent and applicable to every column type.
>>>> 3 - Using the "COLLECTION_" prefix on collection functions
>>>> establishes a clear distinction between row aggregations and
>>>> collection aggregations, while at the same time exposing the
>>>> analogy between each pair of functions.
>>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
>>>> workarounds such as using the column type when possible, or trying
>>>> to figure out synonyms like in the case of COUNT/SIZE. Even if that
>>>> works for this case, future functions can find more trouble when
>>>> trying to figure out workarounds to avoid clashing with existing
>>>> function names. For example, we might want to add a SIZE function
>>>> that gets the size in bytes of any column, or we might want to add
>>>> a MAX function that gets the maximum of a set of columns, etc. An
>>>> example of the synonym-based approach that comes to mind is MySQL's
>>>> MAX and GREATEST functions, where MAX is for row aggregation and
>>>> GREATEST is for column aggregation.
>>>> 5 - If MIN/MAX function selection is based on the column type, we
>>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) =
>>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining
>>>> collection functions with time functions.
>>>>
>>>> What do others think? What should we do with aggregate functions on
>>>> collections, collection functions, and MAXWRITETIME?
>>
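The singleton-collection semantics from CASSANDRA-18037 discussed above, where COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7, can be modelled in a few lines of Python. This is a hypothetical sketch of the behaviour being proposed, not Cassandra internals:

```python
# Hypothetical sketch of the CASSANDRA-18037 proposal: collection
# functions applied to a non-collection value treat that value as the
# only element of a singleton collection.

def as_collection(value):
    """Pass lists/sets through; wrap any scalar as a one-element list."""
    if isinstance(value, (list, set)):
        return list(value)
    return [value]

def collection_max(value):
    return max(as_collection(value))

def collection_count(value):
    return len(as_collection(value))

print(collection_max(7))          # 7: scalar treated as [7]
print(collection_max([3, 9, 1]))  # 9
print(collection_count(7))        # 1
```

This wrapping is what lets COLLECTION_MAX(WRITETIME(column)) work uniformly: WRITETIME of a multi-cell column yields a collection of timestamps, while WRITETIME of a single-cell column yields one timestamp that is simply treated as a singleton.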