[
https://issues.apache.org/jira/browse/CASSANDRA-21173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18065705#comment-18065705
]
Matt Byrd commented on CASSANDRA-21173:
---------------------------------------
Yeah, I think it's a slightly more intuitive output. I have updated the
snapshotId to be "ks:tb:tag" when tableId is null.
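As a rough illustration of the id format (a minimal sketch, not the code in the PR): the class and method names are hypothetical, and the four-part "ks:tb:tableId:tag" form used for the non-null case is an assumption on my part.

```java
import java.util.UUID;

final class SnapshotIdSketch {
    // Hypothetical helper, not the actual Cassandra method.
    // Assumption: with a known tableId the id has four parts (ks:tb:tableId:tag);
    // with a null tableId the id part is omitted, giving "ks:tb:tag".
    static String buildSnapshotId(String keyspace, String table, UUID tableId, String tag) {
        if (tableId == null) {
            return String.format("%s:%s:%s", keyspace, table, tag);
        }
        return String.format("%s:%s:%s:%s", keyspace, table, tableId, tag);
    }

    public static void main(String[] args) {
        // Pre-2.1 table whose id could not be resolved when the snapshot was loaded.
        System.out.println(buildSnapshotId("ks", "tb", null, "tag"));
        // Table with a known id.
        System.out.println(buildSnapshotId("ks", "tb", UUID.randomUUID(), "tag"));
    }
}
```

The point is only that a literal "null" segment no longer shows up in the output.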
In general there seem to be very few actual problems with tableId being null.
The only one I've managed to come up with still seems fairly harmless, if
perhaps a bit confusing:
1. User takes a snapshot of a pre-2.1 table.
2. User restarts the node; the snapshot gets loaded with tableId=null (since
TCM is not up yet).
3. User creates a snapshot with the same tag; prePopulateSnapshots will not
find it among the existing snapshots (since we get tableId from
cfs.metadata.id.asUUID()).
4. It's not marked as an existing snapshot, so we don't raise
RuntimeException(format("Snapshot %s for %s.%s already exists.", ...)) and
we proceed to attempt to create the snapshot again, likely failing later with
new DuplicateHardlinkException("Tried to create duplicate hard link from " +
from + " to " + to).
If we do want to solve that, I think on trunk we can just move SnapshotManager
earlier in the startup sequence.
The existing comment and positioning seem primarily intended to preserve the
existing behavior (a sensible default).
None of the default existing startup checks seem to interact with snapshots
(nothing checks available disk space, or examines them in any way, we mostly
just do things like look to see that the data-directories are present and with
the correct permissions).
It's not impossible that users have implemented their own custom startup checks
which attempt to do some of these things, like failing to start up if the disk
is too full or if there are too many snapshots or something.
So there could be some scenarios where, for some reason, some expired snapshots
weren't deleted or weren't yet eligible for deletion; we restart having accrued
enough on disk in the interim to fail the check, but are not yet able to clear
the ephemeral snapshots due to the re-ordering.
In such an event you could always configure that check off temporarily, if upon
inspection you deemed it not to be a problem due to the ephemeral snapshots
which would soon be cleared.
It does seem like a tolerable change.
I can't think of any other options for trunk, bar maybe moving the
startup checks later (which seems intuitively harder to reason about and a
less good idea).
It would be good to solve the problem similarly in both places and we likely
need a trunk patch relatively soon anyhow.
I've updated both PRs to add a bit more logging of the exceptional cases and to
do the startup sequencing re-ordering, but if folks prefer, we can still go for
any of the other options outlined.
[^ci_summary.html] [^ci_summary.html]
> Snapshots from tables without table-id embedded in their folder name are not
> loaded by SnapshotLoader
> -----------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-21173
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21173
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Local/Snapshots, Local/Startup and Shutdown
> Reporter: Matt Byrd
> Assignee: Matt Byrd
> Priority: Normal
> Fix For: 5.0.x, 5.1, 6.x
>
> Attachments: ci_summary-1.html, ci_summary.html,
> ci_summary_trunk_mbyrd_CASSANDRA-21173.html
>
>
> Tables created prior to 2.1 do not have a table-id embedded in their table
> folder name.
> This is handled correctly in Directories.java (see constructor).
> Unfortunately, in SnapshotLoader we use a regex which attempts to extract the
> table-id and hence skips over any tables created prior to 2.1.
> The end result is that these tables are not visible in list snapshot and more
> importantly cannot be cleared via nodetool clearsnapshot. This was noticed
> upon major upgrade to 5.0.
> I've observed this on 5.0; from reading the code, it appears likely improved
> in 5.1, in that it now also requires a restart to trigger.
> Some related tickets:
> Introduction of table-id and backwards compatible handling of old folders
> originally here:
> https://issues.apache.org/jira/browse/CASSANDRA-5202
> Machinery to list snapshots which doesn’t handle old format was added here:
> https://issues.apache.org/jira/browse/CASSANDRA-16843
> https://github.com/apache/cassandra/commit/31aa17a2a3b18bdda723123cad811f075287807d
> There was some discussion at the time of not handling pre 2.1 tables here:
> https://issues.apache.org/jira/browse/CASSANDRA-16843?focusedCommentId=17440088&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17440088
> Then nodetool clearsnapshot stopped working here with:
> https://issues.apache.org/jira/browse/CASSANDRA-17757
> Things improve a bit in 5.1 with
> https://issues.apache.org/jira/browse/CASSANDRA-18111
> Now we no longer try and load the snapshots via SnapshotLoader in entirety
> before deciding if we can clear them, but instead make use of
> SnapshotManager. Whilst snapshots taken while the JVM is running are now
> visible and clearable, from reading the code it appears that upon restart we
> lose that information and cannot view/clear snapshots created before the
> restart.
> One solution to handle these pre 2.1 tables, is to include the table-id in
> the manifest.json, then we'll be able to read this information if not
> available from folder name upon restart.
> Another possibility which doesn't fix as many problems, is just to expose via
> jmx/nodetool
> something to allow operators to bypass the snapshot loading mechanism and
> directly clear the old pre-2.1 snapshots.
> A more involved and risky change would be to somehow think about how we
> migrate all this existing data in different folder structures to new
> consistent folder structure, but this seems quite involved and would likely
> deserve its own JIRA at least.
> I have a patch locally against trunk for the first approach, just storing the
> tableId in the manifest.json, which does this and will run it against CI.
> I'll have a further think about if there are any other approaches, if anyone
> has any ideas let me know.
> Another thing to consider is where we should apply this change.
> Probably at a minimum 5.0, since that's where one can no longer nodetool
> clearsnapshot on certain tables and the effect is a bit worse there than in
> 5.1.
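For reference, the folder-name issue from the quoted description can be sketched with a pattern in which the table-id suffix is optional. This is illustrative only and is not the SnapshotLoader regex (which matches a full snapshot path); the class name is hypothetical, and it assumes table names consist of letters, digits and underscores and that the id suffix is 32 lowercase hex characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TableDirNameSketch {
    // Assumptions: table name is [A-Za-z0-9_]+ and the optional suffix is
    // "-" followed by 32 lowercase hex characters (the table id without dashes).
    static final Pattern TABLE_DIR =
        Pattern.compile("^(?<name>[A-Za-z0-9_]+)(?:-(?<id>[0-9a-f]{32}))?$");

    /** Returns {name, idOrNull}, or null if the directory name does not match. */
    static String[] parse(String dirName) {
        Matcher m = TABLE_DIR.matcher(dirName);
        if (!m.matches()) {
            return null;
        }
        // group("id") is null when the optional suffix is absent (pre-2.1 layout).
        return new String[] { m.group("name"), m.group("id") };
    }

    public static void main(String[] args) {
        String[] withId = parse("users-5a1c395ec1df11e8a3c2e5f0b2a1d4c7");
        String[] legacy = parse("users");
        System.out.println(withId[0] + " id=" + withId[1]);
        System.out.println(legacy[0] + " id=" + legacy[1]);
    }
}
```

A pattern that requires the suffix would skip the second directory entirely, which is the behaviour described in the ticket.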
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]