After a bit more analysis and some testing I have a new branch that I think solves the problem. [1] I have also created a pull request internal to my clone so that it is easy to see the changes. [2]
The new strategy moves the insertion of the proxy from the Cassandra File class to the Directories class. This means that all actions on the table are captured, which solves a problem encountered in the earlier strategy. The strategy is to create a path on a different FileSystem and return that. The example code only moves the data for the table to another directory on the same FileSystem, but using a different FileSystem implementation should be a trivial change.

The current code works on an entire keyspace. While code exists to limit the redirect to a single table, I have not tested that branch yet and am not certain that it will work. There is also some code (i.e. the PathParser) that may no longer be needed but has not been removed yet.

Please take a look and let me know if you see any issues with this solution.

Claude

[1] https://github.com/Claudenw/cassandra/tree/FileSystemProxy
[2] https://github.com/Claudenw/cassandra/pull/5/files

On Tue, Oct 10, 2023 at 10:28 AM Claude Warren, Jr <claude.war...@aiven.io> wrote:

> I have been exploring adding a second Path to the Cassandra File object.
> The original path is the path within the standard Cassandra directory
> tree, and the second is a translated path used when what was called a
> ChannelProxy is in place.
>
> A problem arises when Directories.getLocationForDisk() is called. It
> seems to be looking for locations that start with the data directory
> absolute path. I can change it to look for the original path rather than
> the translated path. But in other cases the translated path is the one
> that is needed.
>
> I notice that there is a concept of multiple file locations in the code
> base, particularly in the Directories.DataDirectories class, where there are
> "locationsForNonSystemKeyspaces" and "locationsForSystemKeyspace" in the
> constructor, and in the
> DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations() method,
> which returns an array of String and is populated from the cassandra.yaml
> file.
>
> DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations() only
> ever seems to return an array of one item.
>
> Why does DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()
> return an array?
>
> Should the system set the path to the root of the ColumnFamilyStore in the
> ColumnFamilyStore directories instance?
> Should Directories.getLocationForDisk() do the proxy to the other file
> system?
>
> Where is the proper location to change from the standard internal
> representation to the remote location?
>
>
> On Fri, Sep 29, 2023 at 8:07 AM Claude Warren, Jr <claude.war...@aiven.io>
> wrote:
>
>> Sorry, I was out sick and did not respond yesterday.
>>
>> Henrik, how does your system work? What is the design strategy? Also,
>> is your code available somewhere?
>>
>> After looking at the code some more I think that the best solution is not
>> a FileChannelProxy but to modify the Cassandra File class to get a
>> FileSystem object for a Factory to build the Path that is used within that
>> object. I think that this makes it a very small change that will pick up
>> 90+% of the cases. We then just need to find the edge cases.
>>
>>
>> On Fri, Sep 29, 2023 at 1:14 AM German Eichberger via dev <
>> dev@cassandra.apache.org> wrote:
>>
>>> Super excited about this as well. Happy to help test with Azure and any
>>> other way needed.
>>>
>>> Thanks,
>>> German
>>> ------------------------------
>>> *From:* guo Maxwell <cclive1...@gmail.com>
>>> *Sent:* Wednesday, September 27, 2023 7:38 PM
>>> *To:* dev@cassandra.apache.org <dev@cassandra.apache.org>
>>> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable ChannelProxy
>>> to alias external storage locations
>>>
>>> Thanks. So I think a Jira can be created now. And I'd be happy to
>>> provide some help with this as well if needed.
>>>
>>> On Thu, Sep 28, 2023 at 12:21 AM Henrik Ingo <henrik.i...@datastax.com> wrote:
>>>
>>> It seems I was volunteered to rebase the Astra implementation of this
>>> functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
>>> of course.) I'll try to get going today or tomorrow, so that this
>>> discussion can then benefit from having that code available for inspection,
>>> and potentially from using it as a solution to this use case.
>>>
>>> On Tue, Sep 26, 2023 at 8:04 PM Jake Luciani <jak...@gmail.com> wrote:
>>>
>>> We (DataStax) have a FileSystemProvider for Astra we can provide.
>>> Works with S3/GCS/Azure.
>>>
>>> I'll ask someone on our end to make it accessible.
>>>
>>> This would work by having a bucket prefix per node. But there are lots
>>> of details needed to support things like out of bound compaction
>>> (mentioned in CEP).
>>>
>>> Jake
>>>
>>> On Tue, Sep 26, 2023 at 12:56 PM Benedict <bened...@apache.org> wrote:
>>> >
>>> > I agree with Ariel, the more suitable insertion point is probably the
>>> > JDK level FileSystemProvider and FileSystem abstraction.
>>> >
>>> > It might also be that we can reuse existing work here in some cases?
>>> >
>>> > On 26 Sep 2023, at 17:49, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>> >
>>> >
>>> > Hi,
>>> >
>>> > Support for multiple storage backends including remote storage
>>> > backends is a pretty high value piece of functionality. I am happy to see
>>> > there is interest in that.
>>> >
>>> > I think that `ChannelProxyFactory` as an integration point is going to
>>> > quickly turn into a dead end as we get into really using multiple storage
>>> > backends. We need to be able to list files, and really the full range of
>>> > filesystem interactions that Java supports should work with any backend to
>>> > make development, testing, and using existing code straightforward.
>>> >
>>> > It's a little more work to get C* to create paths for alternate
>>> > backends where appropriate, but that work is probably necessary even with
>>> > `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple
>>> > Filesystems). There will probably also be backend-specific behaviors that
>>> > show up above the `ChannelProxy` layer that will depend on the backend.
>>> >
>>> > Ideally there would be some config to specify several backend
>>> > filesystems and their individual configuration that can be used, as well as
>>> > configuration and support for a "backend file router" for file creation
>>> > (and opening) that can be used to route files to the most appropriate
>>> > backend.
>>> >
>>> > Regards,
>>> > Ariel
>>> >
>>> > On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
>>> >
>>> > I have just filed CEP-36 [1] to allow for keyspace/table storage
>>> > outside of the standard storage space.
>>> >
>>> > There are two desires driving this change:
>>> >
>>> > The ability to temporarily move some keyspaces/tables to storage
>>> > outside the normal directory tree, on another disk, so that compaction can
>>> > occur in situations where there is not enough disk space for compaction and
>>> > processing of the moved data cannot be suspended.
>>> > The ability to store infrequently used data on slower, cheaper storage
>>> > layers.
>>> >
>>> > I have a working POC implementation [2], though there are some issues
>>> > still to be solved and much logging to be reduced.
>>> >
>>> > I look forward to productive discussions,
>>> > Claude
>>> >
>>> > [1] https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
>>> > [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
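
For anyone who wants a feel for the Directories-level redirect described at the top of this message without reading the branch, here is a minimal sketch. It is not the code in the branch: the class name KeyspaceRedirect, its constructor arguments, and the example paths are all hypothetical, and the default FileSystem stands in for what would be a remote FileSystem implementation. It only shows the idea of taking a path under the standard data directory and returning the equivalent path under a root that may belong to a different FileSystem.

```java
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Path;

/**
 * Hypothetical helper (not part of Cassandra): maps a path under the standard
 * data directory to the matching location under an alternate root, which may
 * live on a different java.nio FileSystem.
 */
final class KeyspaceRedirect {
    private final Path dataDir;     // standard data directory (assumed layout)
    private final String keyspace;  // keyspace whose files are redirected
    private final Path targetRoot;  // root on the alternate FileSystem

    KeyspaceRedirect(Path dataDir, String keyspace, Path targetRoot) {
        this.dataDir = dataDir.toAbsolutePath().normalize();
        this.keyspace = keyspace;
        this.targetRoot = targetRoot;
    }

    /** Returns the redirected path, or the original path if it is outside the keyspace. */
    Path redirect(Path original) {
        Path abs = original.toAbsolutePath().normalize();
        if (!abs.startsWith(dataDir.resolve(keyspace))) {
            return original; // not under the redirected keyspace: leave untouched
        }
        Path relative = dataDir.relativize(abs);
        Path result = targetRoot;
        // Copy name elements as strings: Path.resolve(Path) may throw
        // ProviderMismatchException when the two paths come from different providers.
        for (Path element : relative) {
            result = result.resolve(element.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // The default FileSystem is used here only as a stand-in for a remote one.
        FileSystem fs = FileSystems.getDefault();
        KeyspaceRedirect redirect = new KeyspaceRedirect(
                fs.getPath("/var/lib/cassandra/data"), "ks1", fs.getPath("/mnt/alt"));
        System.out.println(redirect.redirect(
                fs.getPath("/var/lib/cassandra/data/ks1/t1/nb-1-big-Data.db")));
        System.out.println(redirect.redirect(
                fs.getPath("/var/lib/cassandra/data/ks2/t1/nb-1-big-Data.db")));
    }
}
```

Because the rewrite happens where the directory is chosen rather than where each file is opened, every later operation on the returned Path goes to the alternate FileSystem. Limiting the redirect to a single table would presumably mean matching one more path element, which this sketch does not attempt.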