Re: CEP-30: Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes

2023-05-17 Thread Jasonstack Zhao Yang
Hi,

I have updated the CEP with some details about distributed queries in the
*Approach* section.

David:

> given results have a real ranking, the current 2i logic may yield
incorrect results

C* internal iterators are all in primary key order. So we need two
in-memory top-k filters, one at the replica side and one at the coordinator
side, to make sure the returned rows are actually the top-k but are still in
primary key order.

> if 1 of the queries fails and can’t fall back to peers… does the query
fail (I assume so)

Yes, it will fail. We can make it pass if lower recall is acceptable.

Caleb:

> With smaller clusters or use-cases that are extremely
write-heavy/read-light, it's possible that the full scatter/gather won't be
too onerous, especially w/ a few small tweaks (on top of a non-vnode
cluster)

You are right. A smaller cluster would definitely require less coordinator
memory to cache all of the required replicas' responses.


Jeremy:

>  With SAI, can you have partial results?  When you have a query that is
non-key based, you need to have full token range coverage of the results.
If that isn't possible, will Vector Search/SAI return partial results?

No partial results are allowed. The query will fail with an unavailability
exception if some required token range is not available. For ANN search,
users might be willing to accept lower recall (partial results) in exchange
for higher availability.

>  First, how is ordering/scoring done?
> Each replica returns back to the coordinator a sorted set of results and
the coordinator will have to see all of the results globally in order to do
a global ordering.  You can't know what the top result is unless you've
seen everything.  As to the scoring, I'm not sure how that will get
calculated.

The results will be the top-k but still in primary key order. Scores are
computed using the vector similarity function.

Top-k search needs two top-k filters, as described in the CEP.
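
For illustration, here is a minimal sketch of one such similarity function
(cosine similarity is used purely as an example; the method below is an
assumption for this sketch, not an existing Cassandra API):

    // Cosine similarity between a query vector and a stored row vector.
    // Higher is more similar; the result is in [-1, 1].
    static float cosineSimilarity(float[] query, float[] stored)
    {
        float dot = 0, qNorm = 0, sNorm = 0;
        for (int i = 0; i < query.length; i++)
        {
            dot += query[i] * stored[i];
            qNorm += query[i] * query[i];
            sNorm += stored[i] * stored[i];
        }
        return (float) (dot / Math.sqrt(qNorm * sNorm));
    }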

> Second, if I am ordering the results like for a Vector Search and I want
to have the top 1 result.  How is the scoring done and what happens if
there are 20 that have the same score?  How will the coordinator decide
which 1 is returned out of 20?

It will be the row that sorts first in primary key order.
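
To make the two-filter idea concrete, here is a rough, self-contained sketch
of the coordinator-side step only, assuming each replica has already applied
its own local top-k filter and attached a score to every row it returns. All
names and types below are assumptions made for illustration, not actual
Cassandra internals. The coordinator keeps the k best rows seen across all
replica responses, breaks score ties in favour of the row that sorts first in
primary key order, and then re-sorts the winners back into primary key order
before returning them:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Hypothetical row shape: a comparable primary key plus a similarity score.
    record ScoredRow(String primaryKey, float score) {}

    class CoordinatorTopK
    {
        // Worst-first ordering for the bounded heap: lower score is worse;
        // on a score tie the row with the larger primary key is worse, so it
        // is evicted first and the row that sorts first in primary key order
        // survives.
        private static final Comparator<ScoredRow> WORST_FIRST = (a, b) -> {
            int byScore = Float.compare(a.score(), b.score());
            return byScore != 0 ? byScore : b.primaryKey().compareTo(a.primaryKey());
        };

        // Merge per-replica top-k responses into a global top-k, returned in
        // primary key order (the order C* internal iterators produce).
        static List<ScoredRow> merge(List<List<ScoredRow>> replicaResponses, int k)
        {
            PriorityQueue<ScoredRow> best = new PriorityQueue<>(WORST_FIRST);
            for (List<ScoredRow> response : replicaResponses)
            {
                for (ScoredRow row : response)
                {
                    best.offer(row);
                    if (best.size() > k)
                        best.poll(); // drop the current worst row
                }
            }
            List<ScoredRow> result = new ArrayList<>(best);
            result.sort(Comparator.comparing(ScoredRow::primaryKey));
            return result;
        }
    }

The replica-side filter is the same idea applied locally over a node's own
iterators before its response is sent back to the coordinator.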

On Wed, 10 May 2023 at 05:39, Jeremy Hanna 
wrote:

> Just wanted to add that I don't have any special knowledge of CEP-30
> beyond what Jonathan posted and just trying to help clarify and answer
> questions as I can with some knowledge and experience from DSE Search and
> SAI.  Thanks to Caleb for helping validate some things as well.  And to be
> clear about partial results - the default with DSE Search at least is to
> fail a query if it can't get the full token range coverage.  However there
> is an option to allow for shards being unavailable and return partial
> results.
>
> On May 9, 2023, at 3:38 PM, Jeremy Hanna 
> wrote:
>
> I talked to David and some others in slack to hopefully clarify:
>
> With SAI, can you have partial results?  When you have a query that is
> non-key based, you need to have full token range coverage of the results.
> If that isn't possible, will Vector Search/SAI return partial results?
>
> Anything can happen in the implementation, but for scoring, it may not
> make sense to return partial results because it's misleading.  For
> non-global queries, it could or couldn't return partial results depending
> on implementation/configuration.  In DSE you could have partial results
> depending on the options.   However I couldn't find partial results defined
> in CEP-7 or CEP-30.
>
> The other questions are about scoring.
>
> First, how is ordering/scoring done?
>
> Each replica returns back to the coordinator a sorted set of results and
> the coordinator will have to see all of the results globally in order to do
> a global ordering.  You can't know what the top result is unless you've
> seen everything.  As to the scoring, I'm not sure how that will get
> calculated.
>
> Second, if I am ordering the results like for a Vector Search and I want
> to have the top 1 result.  How is the scoring done and what happens if
> there are 20 that have the same score?  How will the coordinator decide
> which 1 is returned out of 20?
>
> It returns results in token/partition and then clustering order.
>
> On May 9, 2023, at 2:53 PM, Caleb Rackliffe 
> wrote:
>
> Anyone on this ML who still remembers DSE Search (or has experience w/
> Elastic or SolrCloud) probably also knows that there are some significant
> pieces of an optimized scatter/gather apparatus for IR (even without
sorting, which also doesn't exist yet) that do not exist in C* or its
> range query system (which SAI and all other 2i implementations use). SAI,
> like all C* 2i implementations, is still a local index, and as that is the
> case, anything built on it will perform best in partition-scoped (at least
> on the read side) use-cases. (On the bright side, the project is moving
> toward larger partitions being a possibility.) With smaller clusters or
> use-cases that are

Re: CEP-30: Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes

2023-05-17 Thread David Capwell
Thanks for the update, LGTM

> On May 17, 2023, at 5:35 AM, Jasonstack Zhao Yang  
> wrote:
> 
> Hi,
> 
> I have updated the CEP with some details about distributed queries in the 
> Approach section.
> 
> David:
> 
> > given results have a real ranking, the current 2i logic may yield incorrect 
> > results
> 
> C* internal iterators are all in primary key order. So we need two in-memory
> top-k filters, one at the replica side and one at the coordinator side, to
> make sure the returned rows are actually the top-k but are still in primary
> key order.
> 
> > if 1 of the queries fails and can’t fall back to peers… does the query fail 
> > (I assume so)
> 
> Yes, it will fail. We can make it pass if lower recall is acceptable.
> 
> Caleb:
> 
> > With smaller clusters or use-cases that are extremely 
> > write-heavy/read-light, it's possible that the full scatter/gather won't be 
> > too onerous, especially w/ a few small tweaks (on top of a non-vnode 
> > cluster)
> 
> You are right. A smaller cluster would definitely require less coordinator
> memory to cache all of the required replicas' responses.
> 
> 
> Jeremy:
> 
> >  With SAI, can you have partial results?  When you have a query that is 
> > non-key based, you need to have full token range coverage of the results.  
> > If that isn't possible, will Vector Search/SAI return partial results?
> 
> No partial results are allowed. The query will fail with an unavailability
> exception if some required token range is not available. For ANN search,
> users might be willing to accept lower recall (partial results) in exchange
> for higher availability.
> 
> >  First, how is ordering/scoring done?
> > Each replica returns back to the coordinator a sorted set of results and 
> > the coordinator will have to see all of the results globally in order to do 
> > a global ordering.  You can't know what the top result is unless you've 
> > seen everything.  As to the scoring, I'm not sure how that will get 
> > calculated.
> 
> The results will be the top-k but still in primary key order. Scores are
> computed using the vector similarity function.
> 
> Top-k search needs two top-k filters, as described in the CEP.
> 
> > Second, if I am ordering the results like for a Vector Search and I want to 
> > have the top 1 result.  How is the scoring done and what happens if there 
> > are 20 that have the same score?  How will the coordinator decide which 1 
> > is returned out of 20?
> 
> It will be the row that sorts first in primary key order.
> 
> On Wed, 10 May 2023 at 05:39, Jeremy Hanna wrote:
>> Just wanted to add that I don't have any special knowledge of CEP-30 beyond 
>> what Jonathan posted and just trying to help clarify and answer questions as 
>> I can with some knowledge and experience from DSE Search and SAI.  Thanks to 
>> Caleb for helping validate some things as well.  And to be clear about 
>> partial results - the default with DSE Search at least is to fail a query if 
>> it can't get the full token range coverage.  However there is an option to 
>> allow for shards being unavailable and return partial results.
>> 
>>> On May 9, 2023, at 3:38 PM, Jeremy Hanna wrote:
>>> 
>>> I talked to David and some others in slack to hopefully clarify:
>>> 
>>> With SAI, can you have partial results?  When you have a query that is 
>>> non-key based, you need to have full token range coverage of the results.  
>>> If that isn't possible, will Vector Search/SAI return partial results?
>>> 
>>> Anything can happen in the implementation, but for scoring, it may not make 
>>> sense to return partial results because it's misleading.  For non-global 
>>> queries, it could or couldn't return partial results depending on 
>>> implementation/configuration.  In DSE you could have partial results 
>>> depending on the options.   However I couldn't find partial results defined 
>>> in CEP-7 or CEP-30.
>>> 
>>> The other questions are about scoring.
>>> 
>>> First, how is ordering/scoring done?
>>> 
>>> Each replica returns back to the coordinator a sorted set of results and 
>>> the coordinator will have to see all of the results globally in order to do 
>>> a global ordering.  You can't know what the top result is unless you've 
>>> seen everything.  As to the scoring, I'm not sure how that will get 
>>> calculated.
>>> 
>>> Second, if I am ordering the results like for a Vector Search and I want to 
>>> have the top 1 result.  How is the scoring done and what happens if there 
>>> are 20 that have the same score?  How will the coordinator decide which 1 
>>> is returned out of 20?
>>> 
>>> It returns results in token/partition and then clustering order.
>>> 
 On May 9, 2023, at 2:53 PM, Caleb Rackliffe wrote:
 
 Anyone on this ML who still remembers DSE Search (or has experience w/ 
 Elastic or SolrCloud) probably also knows that there are some significant 
 pieces of an optimized scatter/gather apparatus 

Re: [DISCUSS] The future of CREATE INDEX

2023-05-17 Thread Henrik Ingo
I have read the thread but chose to reply to the top message...

I'm coming to this with the background of having worked with MySQL, where
both the storage engine and index implementation had many options, and
often of course some index types were only available in some engines.

I would humbly suggest:

1. What's up with naming anything "legacy". Calling the current index type
"2i" seems perfectly fine with me. From what I've heard it can work great
for many users?

2. It should be possible to always specify the index type explicitly. In
other words, it should be possible to CREATE CUSTOM INDEX ... USING "2i"
(if it isn't already)

2b) It should be possible to just say "SAI" or "SASIIndex", not the full
Java path.

3. It's a fair point that the "CUSTOM" word may make this sound a bit too
special... The simplest change IMO is to just make the CUSTOM word optional.

4. Benedict's point that a YAML option is per node is a good one... For
example, you wouldn't want some nodes to create a 2i index and other nodes
a SAI index for the same index. That said, how many other YAML options
can you think of that would create total chaos if different nodes actually
had different values for them? For example what if a guardrail allowed some
action on some nodes but not others?  Maybe what we need is a jira ticket
to enforce that certain sections of the config must not differ?

5. That said, the default index type could also be a property of the
keyspace

6. MySQL allows the DBA to determine the default engine. This seems to work
well. If the user doesn't care, they don't care, if they do, they use the
explicit syntax.

henrik


On Wed, May 10, 2023 at 12:45 AM Caleb Rackliffe 
wrote:

> Earlier today, Mick started a thread on the future of our index creation
> DDL on Slack:
>
> https://the-asf.slack.com/archives/C018YGVCHMZ/p1683527794220019
> 
>
> At the moment, there are two ways to create a secondary index.
>
> *1.) CREATE INDEX [IF NOT EXISTS] [name] ON <table> (<column>)*
>
> This creates an optionally named legacy 2i on the provided table and
> column.
>
> ex. CREATE INDEX my_index ON ks.tbl(my_text_col)
>
> *2.) CREATE CUSTOM INDEX [IF NOT EXISTS] [name] ON <table> (<column>)
> USING <class_name> [WITH OPTIONS = <map>]*
>
> This creates a secondary index on the provided table and column using the
> specified 2i implementation class and (optional) parameters.
>
> ex. CREATE CUSTOM INDEX my_index ON ks.tbl(my_text_col) USING
> 'StorageAttachedIndex'
>
> (Note that the work on SAI added aliasing, so `StorageAttachedIndex` is
> shorthand for the fully-qualified class name, which is also valid.)
>
> So what is there to discuss?
>
> The concern Mick raised is...
>
> "...just folk continuing to use CREATE INDEX  because they think CREATE
> CUSTOM INDEX is advanced (or just don't know of it), and we leave users
> doing 2i (when they think they are, and/or we definitely want them to be,
> using SAI)"
>
> To paraphrase, we want people to use SAI once it's available where
> possible, and the default behavior of CREATE INDEX could be at odds w/
> that.
>
> The proposal we seem to have landed on is something like the following:
>
> For 5.0:
>
> 1.) Disable by default the creation of new legacy 2i via CREATE INDEX.
> 2.) Leave CREATE CUSTOM INDEX...USING... available by default.
>
> (Note: How this would interact w/ the existing secondary_indexes_enabled
> YAML options isn't clear yet.)
>
> Post-5.0:
>
> 1.) Deprecate and eventually remove SASI when SAI hits full feature parity
> w/ it.
> 2.) Replace both CREATE INDEX and CREATE CUSTOM INDEX w/ something of a
> hybrid between the two. For example, CREATE INDEX...USING...WITH. This
> would both be flexible enough to accommodate index implementation selection
> and prescriptive enough to force the user to make a decision (and wouldn't
> change the legacy behavior of the existing CREATE INDEX). In this world,
> creating a legacy 2i might look something like CREATE INDEX...USING
> `legacy`.
> 3.) Eventually deprecate CREATE CUSTOM INDEX...USING.
>
> Eventually we would have a single enabled DDL statement for index creation
> that would be minimal but also explicit/able to handle some evolution.
>
> What does everyone think?
>


-- 

Henrik Ingo

c. +358 40 569 7354

w. www.datastax.com



Re: [DISCUSS] The future of CREATE INDEX

2023-05-17 Thread Caleb Rackliffe
> 1. What's up with naming anything "legacy". Calling the current index
type "2i" seems perfectly fine with me. From what I've heard it can work
great for many users?

We can give the existing default secondary index any public-facing name we
like, but "2i" is too broad. It just stands for "secondary index", which is
obviously broad enough to cover anything. The use of "legacy" is
conversational, and it reflects the assertion that SAI should, when at
feature parity, be superior to the existing default 2i implementation for
any workload w/ partition-restricted queries. It will surely be possible to
construct a scenario where SAI's SSTable-attached design, combined with
global scatter/gather queries and a huge number of local/per-node SSTables,
causes it to perform worse than the existing default 2i, which is just an
inverted index implemented as a hidden table w/ search terms as partition
keys.
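
To make that comparison concrete, here is a toy sketch (the names and types
below are illustrative assumptions, not Cassandra code) of the shape of the
existing default 2i on a single node: a hidden structure keyed by search term
that points back at base-table primary keys, covering only that node's
locally owned data, which is why any query that is not partition-restricted
still has to fan out to enough replicas to cover the full token range:

    import java.util.Map;
    import java.util.NavigableSet;
    import java.util.TreeMap;
    import java.util.TreeSet;

    // Toy model of the legacy 2i "hidden table": search term -> primary keys
    // of the base-table rows containing that term, for locally owned data only.
    class LocalInvertedIndex
    {
        private final Map<String, NavigableSet<String>> termToPrimaryKeys = new TreeMap<>();

        void index(String term, String primaryKey)
        {
            termToPrimaryKeys.computeIfAbsent(term, t -> new TreeSet<>()).add(primaryKey);
        }

        NavigableSet<String> lookup(String term)
        {
            return termToPrimaryKeys.getOrDefault(term, new TreeSet<>());
        }
    }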

> 2. It should be possible to always specify the index type explicitly. In
other words, it should be possible to CREATE CUSTOM INDEX ... USING "2i"
(if it isn't already)

Yes. It should be possible to specify the type no matter what syntax we
use. However, if we started this project from scratch, I don't think we
would build CREATE CUSTOM INDEX in the first place.

> 2b) It should be possible to just say "SAI" or "SASIIndex", not the full
Java path.
> 3. It's a fair point that the "CUSTOM" word may make this sound a bit too
special... The simplest change IMO is to just make the CUSTOM word optional.

Agreed on both, and 2b (aliasing) is already supported for CREATE CUSTOM
INDEX. (It may be that we should move toward something like a
ServiceLoader-enabled set of named 2i's.)
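
As a rough sketch of what that last parenthetical could look like (the
interface and registry below are hypothetical illustrations, not existing
Cassandra types), a ServiceLoader-based set of named index implementations
might be wired up like this:

    import java.util.Optional;
    import java.util.ServiceLoader;

    // Hypothetical SPI: each index implementation advertises a short name
    // ("SAI", "SASI", "2i", ...) via a META-INF/services entry.
    interface NamedIndexProvider
    {
        String name();
        // factory methods for the actual index implementation would live here
    }

    final class IndexProviderRegistry
    {
        // Resolve a short alias like "SAI" to whichever provider is on the classpath.
        static Optional<NamedIndexProvider> lookup(String alias)
        {
            return ServiceLoader.load(NamedIndexProvider.class)
                                .stream()
                                .map(ServiceLoader.Provider::get)
                                .filter(p -> p.name().equalsIgnoreCase(alias))
                                .findFirst();
        }
    }

CREATE INDEX / CREATE CUSTOM INDEX could then resolve whichever short name
the user supplies through a registry like this rather than through a
hard-coded alias table.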

> 4. Benedict's point that a YAML option is per node is a good one... For
example, you wouldn't want some nodes to create a 2i index and other nodes
a SAI index for the same index. That said, how many other YAML options
can you think of that would create total chaos if different nodes actually
had different values for them? For example what if a guardrail allowed some
action on some nodes but not others?  Maybe what we need is a jira ticket
to enforce that certain sections of the config must not differ?

At some point, my guess is that TCM will give us the ability to have
consistent, cluster-wide metadata/configuration. Right now, we have quite a
few YAML options that control cluster-wide behavior including our
prohibition on creating experimental SASI indexes and our option to disable
2i creation. None of the options we've discussed should make it possible
for a single secondary index on a column of a table to have differing local
implementations.

> 6. MySQL allows the DBA to determine the default engine. This seems to
work well. If the user doesn't care, they don't care, if they do, they use
the explicit syntax.

Sounds like option #3 on the 3rd POLL.

On Wed, May 17, 2023 at 3:33 PM Henrik Ingo 
wrote:

> I have read the thread but chose to reply to the top message...
>
> I'm coming to this with the background of having worked with MySQL, where
> both the storage engine and index implementation had many options, and
> often of course some index types were only available in some engines.
>
> I would humbly suggest:
>
> 1. What's up with naming anything "legacy". Calling the current index type
> "2i" seems perfectly fine with me. From what I've heard it can work great
> for many users?
>
> 2. It should be possible to always specify the index type explicitly. In
> other words, it should be possible to CREATE CUSTOM INDEX ... USING "2i"
> (if it isn't already)
>
> 2b) It should be possible to just say "SAI" or "SASIIndex", not the full
> Java path.
>
> 3. It's a fair point that the "CUSTOM" word may make this sound a bit too
> special... The simplest change IMO is to just make the CUSTOM word optional.
>
> 4. Benedict's point that a YAML option is per node is a good one... For
> example, you wouldn't want some nodes to create a 2i index and other nodes
> a SAI index for the same index. That said, how many other YAML options
> can you think of that would create total chaos if different nodes actually
> had different values for them? For example what if a guardrail allowed some
> action on some nodes but not others?  Maybe what we need is a jira ticket
> to enforce that certain sections of the config must not differ?
>
> 5. That said, the default index type could also be a property of the
> keyspace
>
> 6. MySQL allows the DBA to determine the default engine. This seems to
> work well. If the user doesn't care, they don't care, if they do, they use
> the explicit syntax.
>
> henrik
>
>
> On Wed, May 10, 2023 at 12:45 AM Caleb Rackliffe 
> wrote:
>
>> Earlier today, Mick started a thread on the future of our index creation
>> DDL on Slack:
>>
>> https://the-asf.slack.com/archives/C018YGVCHMZ/p1683527794220019
>> 

Re: [DISCUSS] Feature branch version hygiene

2023-05-17 Thread Mick Semb Wever
On Tue, 16 May 2023 at 13:02, J. D. Jordan 
wrote:

> Process question/discussion. Should tickets that are merged to CEP feature
> branches, like  https://issues.apache.org/jira/browse/CASSANDRA-18204, have
> a fixver of 5.0 on them After merging to the feature branch?
>
>
> For the SAI CEP which is also using the feature branch method the
> "reviewed and merged to feature branch" tickets seem to be given a version
> of NA.
>
>
> Not sure that's the best “waiting for cep to merge” version either?  But
> it seems better than putting 5.0 on them to me.
>
>
> Why I’m not keen on 5.0 is because if we cut the release today those
> tickets would not be there.
>
>
> What do other people think?  Is there a better version designation we can
> use?
>
>
> On a different project I have in the past made a “version number” in JIRA
> for each long running feature branch. Tickets merged to the feature branch
> got the epic ticket number as their version, and then it got updated to the
> “real” version when the feature branch was merged to trunk.
>


Thanks for raising the thread, I remember there was some confusion early
wrt features branches too.

To rehash: for everything currently resolved in trunk, 5.0 is the correct
fixVersion.  (And there should be no unresolved issues today with the 5.0
fixVersion; they should be 5.x.)


When alpha1 is cut, the 5.0-alpha1 fixVersion is created and everything
with 5.0 also gets 5.0-alpha1. At the same time the 5.0-alpha2, 5.0-beta,
5.0-rc, and 5.0.0 fixVersions are created. Both 5.0-beta and 5.0-rc are
blocking placeholder fixVersions: no resolved issues are left with these
fixVersions, the same as with the .x placeholder fixVersions. 5.0.0 is
also used as a blocking version, though it is also an eventual fixVersion
for resolved tickets. Also note that all tickets up to and including
5.0.0 will also have the 5.0 fixVersion.


A particular reason for doing things the way they are is to make it easy
for the release manager to bulk correct fixVersions, at release time or
even later, i.e. without having to read the ticket or go talk to authors or
painstakingly crawl CHANGES.txt.


For feature branches my suggestion is that we create a fixVersion for each
of them, e.g. 5.0-cep-15

Yup, that's your suggestion Jeremiah (I wrote this up on the plane before I
got to read your post properly).


(As you say) This then makes it easy to see where the code is (or what the
patch is currently being based on). And when the feature branch is merged,
it is easy to bulk replace it with trunk's fixVersion, e.g. 5.0-cep-15
with 5.0.


The NA fixVersion was introduced for the other repositories, e.g. website
updates.


Re: [DISCUSS] Feature branch version hygiene

2023-05-17 Thread Caleb Rackliffe
So when a CEP slips, do we have to create a 5.1-cep-N? Could we just have a
version that's "NextMajorRelease" or something like that? It should still
be pretty easy to bulk replace if we have something else to filter on, like
belonging to an epic?

On Wed, May 17, 2023 at 6:42 PM Mick Semb Wever  wrote:

>
>
> On Tue, 16 May 2023 at 13:02, J. D. Jordan 
> wrote:
>
>> Process question/discussion. Should tickets that are merged to CEP
>> feature branches, like
>> https://issues.apache.org/jira/browse/CASSANDRA-18204, have a fixver of
>> 5.0 on them After merging to the feature branch?
>>
>>
>> For the SAI CEP which is also using the feature branch method the
>> "reviewed and merged to feature branch" tickets seem to be given a version
>> of NA.
>>
>>
>> Not sure that's the best “waiting for cep to merge” version either?  But
>> it seems better than putting 5.0 on them to me.
>>
>>
>> Why I’m not keen on 5.0 is because if we cut the release today those
>> tickets would not be there.
>>
>>
>> What do other people think?  Is there a better version designation we can
>> use?
>>
>>
>> On a different project I have in the past made a “version number” in JIRA
>> for each long running feature branch. Tickets merged to the feature branch
>> got the epic ticket number as their version, and then it got updated to the
>> “real” version when the feature branch was merged to trunk.
>>
>
>
> Thanks for raising the thread, I remember there was some confusion early
> wrt features branches too.
>
> To rehash, for everything currently resolved in trunk 5.0 is the correct
> fixVersion.  (And there should be no unresolved issues today with 5.0
> fixVersion, they should be 5.x)
>
>
> When alpha1 is cut, then the 5.0-alpha1 fixVersion is created and
> everything with 5.0 also gets  5.0-alpha1. At the same time 5.0-alpha2,
> 5.0-beta, 5.0-rc, 5.0.0 fixVersions are created. Here both 5.0-beta and
> 5.0-rc are blocking placeholder fixVersions: no resolved issues are left
> with this fixVersion the same as the .x placeholder fixVersions. The 5.0.0
> is also used as a blocking version, though it is also an eventual
> fixVersion for resolved tickets. Also note, all tickets up to and
> including 5.0.0 will also have the 5.0 fixVersion.
>
>
> A particular reason for doing things the way they are is to make it easy
> for the release manager to bulk correct fixVersions, at release time or
> even later, i.e. without having to read the ticket or go talk to authors
> or painstakingly crawl CHANGES.txt.
>
>
> For feature branches my suggestion is that we create a fixVersion for each
> of them, e.g. 5.0-cep-15
>
> Yup, that's your suggestion Jeremiah (I wrote this up on the plane before
> I got to read your post properly).
>
>
> (As you say) This then makes it easy to see where the code is (or what
> the patch is currently being based on). And when the feature branch is
> merged then it is easy to bulk replace it with trunk's fixVersion, e.g.  
> 5.0-cep-15
> with 5.0
>
>
> The NA fixVersion was introduced for the other repositories, e.g. website
> updates.
>


Re: [DISCUSS] Feature branch version hygiene

2023-05-17 Thread Caleb Rackliffe
...otherwise I'm fine w/ just the CEP name, like "CEP-7" for SAI, etc.

On Wed, May 17, 2023 at 11:24 PM Caleb Rackliffe 
wrote:

> So when a CEP slips, do we have to create a 5.1-cep-N? Could we just have
> a version that's "NextMajorRelease" or something like that? It should still
> be pretty easy to bulk replace if we have something else to filter on, like
> belonging to an epic?
>
> On Wed, May 17, 2023 at 6:42 PM Mick Semb Wever  wrote:
>
>>
>>
>> On Tue, 16 May 2023 at 13:02, J. D. Jordan 
>> wrote:
>>
>>> Process question/discussion. Should tickets that are merged to CEP
>>> feature branches, like
>>> https://issues.apache.org/jira/browse/CASSANDRA-18204, have a fixver of
>>> 5.0 on them After merging to the feature branch?
>>>
>>>
>>> For the SAI CEP which is also using the feature branch method the
>>> "reviewed and merged to feature branch" tickets seem to be given a version
>>> of NA.
>>>
>>>
>>> Not sure that's the best “waiting for cep to merge” version either?  But
>>> it seems better than putting 5.0 on them to me.
>>>
>>>
>>> Why I’m not keen on 5.0 is because if we cut the release today those
>>> tickets would not be there.
>>>
>>>
>>> What do other people think?  Is there a better version designation we
>>> can use?
>>>
>>>
>>> On a different project I have in the past made a “version number” in
>>> JIRA for each long running feature branch. Tickets merged to the feature
>>> branch got the epic ticket number as their version, and then it got updated
>>> to the “real” version when the feature branch was merged to trunk.
>>>
>>
>>
>> Thanks for raising the thread, I remember there was some confusion early
>> wrt features branches too.
>>
>> To rehash, for everything currently resolved in trunk 5.0 is the correct
>> fixVersion.  (And there should be no unresolved issues today with 5.0
>> fixVersion, they should be 5.x)
>>
>>
>> When alpha1 is cut, then the 5.0-alpha1 fixVersion is created and
>> everything with 5.0 also gets  5.0-alpha1. At the same time 5.0-alpha2,
>> 5.0-beta, 5.0-rc, 5.0.0 fixVersions are created. Here both 5.0-beta and
>> 5.0-rc are blocking placeholder fixVersions: no resolved issues are left
>> with this fixVersion the same as the .x placeholder fixVersions. The 5.0.0
>> is also used as a blocking version, though it is also an eventual
>> fixVersion for resolved tickets. Also note, all tickets up to and
>> including 5.0.0 will also have the 5.0 fixVersion.
>>
>>
>> A particular reason for doing things the way they are is to make it easy
>> for the release manager to bulk correct fixVersions, at release time or
>> even later, i.e. without having to read the ticket or go talk to authors
>> or painstakingly crawl CHANGES.txt.
>>
>>
>> For feature branches my suggestion is that we create a fixVersion for
>> each of them, e.g. 5.0-cep-15
>>
>> Yup, that's your suggestion Jeremiah (I wrote this up on the plane before
>> I got to read your post properly).
>>
>>
>> (As you say) This then makes it easy to see where the code is (or what
>> the patch is currently being based on). And when the feature branch is
>> merged then it is easy to bulk replace it with trunk's fixVersion, e.g.  
>> 5.0-cep-15
>> with 5.0
>>
>>
>> The NA fixVersion was introduced for the other repositories, e.g. website
>> updates.
>>
>