Re: Cassandra on top of B-Tree
> Log structured databases have append-only characteristics, e.g. BDB-JE.
> Is it an alternative to SSTables? Those mature database products might have
> done a lot for cache management. Not sure whether it can improve read
> performance or not.

BDB JE seems to be targeted mostly at cases where data fits in RAM, or reasonably close to it.

A problem is that while writes will be append-only as long as the database is sufficiently small, you start taking reads once the internal btree nodes no longer fit in RAM. So depending on cache size, at a certain number of keys (and thus size of the btree) you become seek-bound on reads while writing, even though the writes are in and of themselves append-only and not subject to seek overhead.

Another effect, which I have not specifically confirmed in testing but expect to happen, is that once you reach this point of taking reads, compaction is probably going to be a lot more expensive. While normally JE can pick the log segment with the most garbage and mostly stream through it, re-writing non-garbage, that process will also become entirely seek-bound if only a small subset of the btree fits in RAM. So now you have a seek-bound compaction process that must keep up with the append-only write process, meaning that your append-only writes are limited by those seeks in addition to any seeks taken "directly" when generating the writes.

Also keep in mind that JE won't have on-disk locality for either internal nodes or leaf (data) nodes. The guaranteed append-only nature of Cassandra, in combination with its on-disk locality, is one reason to prefer it, under some circumstances, over JE even for non-clustered local use on a single machine.

(As a parenthesis: I doubt JE is being used very much with huge databases, since a very significant CPU bottleneck turned out to be the O(n) (with respect to the number of log segments) file listings. This is probably easily patched, or configured away by using larger log segments, but the repeated O(n) file listings suggest to me that huge databases are not an expected use case - beyond some hints in the documentation that would indicate it's meant for smaller databases.)

-- / Peter Schuller
Re: Packaging Cassandra for Debian [was: Packaging Cassandra for Ubuntu]
(To be clear: I'm not a Cassandra developer, nor have I yet used Cassandra in a real production setting - though I'm hoping to...)

> I'm curious what others think:

My estimation is that Cassandra is not an "apt-get install and forget" type of software yet. If you're going to use Cassandra, you are most likely going to want to spend some serious time reading up on it and carefully configuring, monitoring and operating it. For that reason, I think the benefit of getting Cassandra into Debian stable or Ubuntu is of limited importance to users with a serious need.

However, it *is* extremely practical to have software in nicely packaged form. I would like to suggest that if packaging is to be a higher priority for Cassandra, it might be more useful to provide pre-built .deb packages (possibly in the form of an apt repository) that people can add to their sources.list. Combine that with clear instructions on pinning (in the case of an apt repository) and users have a very convenient way of installing and upgrading Cassandra.

At the same time, the Cassandra developers would save, I suspect, significant effort that would otherwise go into trying to remain long-term compatible and supporting users running very old releases who cannot legitimately be told "upgrade to the latest version" after having been fed Cassandra from e.g. Debian stable.

In contrast, remaining compatible with Debian stable (i.e., building .deb:s for new versions of Cassandra on old versions of Debian/Ubuntu) is likely not a big deal at all given the lack of native dependencies; it also means that you need not solve, as a pre-requisite, things like the lack of a separately packaged Thrift or other dependencies.

In short, I would suggest something along the above lines as an easier-to-accomplish goal for the Cassandra developers, yet one providing high payoff to users.

-- / Peter Schuller
Re: Hello
> I've been ramping up on the code base, getting acquainted with the major
> components, and I feel like there is a lack of commenting. Commenting the
> code would be a good way for me to learn the system better, and I suspect it
> would also be useful for others interested in the future. Would providing
> some quality comments -- high-level descriptions for each class, more
> detailed descriptions for large/ambiguous functions -- be a useful way to
> start contributing?

I'm not a Cassandra developer, but one suggestion may be to fill in some package-info.java:s so that the generated javadocs are easier to navigate in terms of the overall structure and the roles of packages.

-- / Peter Schuller
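Purely as an illustration of that suggestion (not an actual patch): a package-info.java is just a file containing package-level javadoc followed by the package declaration, which then shows up in the generated package summary. Something like the following, where the description is of course made up:

    /**
     * Hypothetical example: a short description of what lives in this package
     * and how it relates to the rest of the system would go here, so that the
     * generated javadoc package summary gives an overview of the code base.
     */
    package org.apache.cassandra.db;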
Re: Cassandra performance and read/write latency
> Has anyone else seen nodes "hang" for several seconds like this? I'm
> not sure if this is a Java VM issue (e.g. garbage collection) or something

Since garbage collection is logged (if you're running with default settings etc.), any multi-second GC:s should be discoverable in said log, so for testing that hypothesis I'd check there first. Cassandra itself logs GC:s, but you can also turn on the JVM's own GC logging with e.g. "-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps".

> I'm also interested in comparing notes with anyone else that has been doing
> read/write throughput benchmarks with Cassandra.

I did some batch write testing to see how it scaled, up to about 200 million rows and 200 GB; I had occasional spikes in latency that were due to disk writes being flushed by the OS. However, it was probably exacerbated in this case by the fact that this was ZFS/FreeBSD, and ZFS (in my humble opinion, and at least on FreeBSD) consistently exhibits the behavior for me that it flushes writes too late and ends up blocking applications even when there is left-over bandwidth. In my case I "eliminated" the issue for the purpose of my test by having a stupid while loop simply doing "sync" every handful of seconds, to avoid accumulating too much data in the cache.

While I expect this to be less of a problem for other setups, it's possible this is what you're seeing; for example, if the operating system is blocking writes to the commit log (are you running with periodic or batch-wise fsync?).

-- / Peter Schuller
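For concreteness, the "stupid while loop" mentioned above was nothing fancier than a shell one-liner along these lines (the interval being arbitrary):

    while true; do sync; sleep 5; done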
Minimizing the impact of compaction on latency and throughput
Hello,

I have repeatedly seen users report that background compaction is overly detrimental to the behavior of the node with respect to latency. While I have not yet deployed Cassandra in a production situation where latencies are closely monitored, these reports do not really sound very surprising to me given the nature of compaction, and unless otherwise stated by developers here on the list I tend to believe that it is a real issue.

Ignoring implementation difficulties for a moment, a few things that could improve the situation, and that seem sensible to me, are:

* Utilizing posix_fadvise() on both reads and writes to avoid obliterating the operating system's caching of the sstables.
* Adding the ability to rate limit disk I/O (in particular writes).
* Adding the ability to perform direct I/O.
* Adding the ability to fsync() regularly on writes to force the operating system not to flush hundreds of megabytes of data out in a single burst.
* (Not an improvement but a general observation: it seems useless for writes to the commit log to remain in cache after an fsync(), so they are a good candidate for posix_fadvise().)

None of these would be silver bullets, and the importance and appropriate settings for each would be very dependent on operating system, hardware, etc. But having the ability to control some or all of these should, I suspect, allow significantly lessening the impact of compaction under a variety of circumstances. With respect to cache eviction, this is one area where the impact can probably be expected to be higher the more you rely on the operating system's caching, and the less you rely on in-JVM caching done by Cassandra.

The most obvious problem points to me include:

* posix_fadvise() and direct I/O cause portability and build issues, necessitating native code.
* Rate limiting is very indirect due to read-ahead, caching, etc. In particular for writes, rate limiting would likely be almost useless without also having fsync() or direct I/O, unless the rate is limited to an extremely small amount and the cluster is taking very few writes (such that the typical background flushing done by most OS:es happens often enough to not imply huge amounts of data).

Any thoughts? Has this already been considered and rejected? Do you think compaction is in fact not a problem to begin with? Are there other, easier, better ways to accomplish the goal?

-- / Peter Schuller
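To make the posix_fadvise() points a bit more concrete, here is a minimal sketch (illustrative only, not Cassandra code) of how the call could be reached from Java via JNA direct mapping. The advice constant is the Linux value, and obtaining the integer file descriptor behind a java.io.FileDescriptor is assumed to be handled separately (typically by reflecting on its private "fd" field):

    import com.sun.jna.LastErrorException;
    import com.sun.jna.Native;

    /** Illustrative only: bind posix_fadvise() from libc via JNA direct mapping. */
    final class NativeIO
    {
        // Linux value of POSIX_FADV_DONTNEED; the constant is platform-specific.
        static final int POSIX_FADV_DONTNEED = 4;

        static native int posix_fadvise(int fd, long offset, long len, int advice)
            throws LastErrorException;

        static
        {
            Native.register("c"); // bind the native method above against libc
        }

        /** Ask the kernel to drop cached pages for a file that has just been fsync():ed. */
        static void dropPageCache(int fd)
        {
            // offset=0, len=0 means "the whole file"; on Linux the pages must already
            // be clean (i.e. fsync() first) for the kernel to actually evict them.
            posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        }
    }

The write-side hook would then roughly be: write a chunk, fsync(), dropPageCache() - which is also exactly the pattern the commit log bullet above is asking for.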
Re: Minimizing the impact of compaction on latency and throughput
> This makes sense, but from what I have seen, read contention vs
> cassandra is a much bigger deal than write contention (unless you
> don't have a separate device for your commitlog, but optimizing for
> that case isn't one of our goals).

I am not really concerned with write performance, but rather with writes affecting reads. Cache eviction due to streaming writes has the obvious effect on the read hit ratio, and in general large sustained writes tend to affect read latency very negatively (and typically in a bursty fashion). So the idea was to specifically optimize to make reads minimally affected by writes (in the sense of the background compaction eventually resulting from those writes).

Understood and agreed about the commit log (though as long as write bursts are within what a battery-backed RAID controller can keep in its cache, I'd expect it to potentially work pretty well without separation, if you do have such a setup).

-- / Peter Schuller
Re: Minimizing the impact of compaction on latency and throughput
> It might be worth experimenting with posix_fadvise. I don't think
> implementing our own i/o scheduler or rate-limiter would be as good a
> use of time (it sounds like you're on that page too).

Ok. And yes, I mostly agree, although I can imagine circumstances where a pretty simple rate limiter might help significantly - albeit something that has to be tweaked very specifically for the situation/hardware rather than being auto-tuned.

If I have the time I may look into posix_fadvise() to begin with (but I'm not promising anything). Thanks for the input!

-- / Peter Schuller
Re: Minimizing the impact of compaction on latency and throughput
> This looks relevant:
> http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
> comments for directions to code sample)

Thanks. That's helpful; I've been trying to avoid JNI in the past so wasn't familiar with the API, and the main difficulty was likely to be how best to expose the functionality to Java. Having someone do almost exactly the same thing helps ;) I'm also glad they confirmed the effect in a very similar situation.

I'm also leaning towards O_DIRECT because:

(1) Even if posix_fadvise() is used, on writes you'll need to fsync() before fadvise() anyway in order to allow Linux to evict the pages (a theoretical OS implementation might remember the advise call, but Linux doesn't - at least not until recently).

(2) posix_fadvise() feels more obscure and less portable than O_DIRECT, the latter being well understood and used by e.g. databases for a long time.

(3) O_DIRECT allows more direct control over when I/O happens and to what extent (without playing tricks or making assumptions about e.g. read-ahead), so it will probably make it easier to kill both birds with one stone.

You indicated you were skeptical about writing an I/O scheduler. While I agree that writing a real I/O scheduler is difficult, I suspect that if we do direct I/O a fairly simple scheme should work well. Being able to tweak a target MB/sec rate, select a chunk size, and select the time window over which to rate limit would, I suspect, go a long way. The situation is a bit special since in this case we are talking about one type of I/O that runs under controlled circumstances (controlled concurrency, we know how much memory we eat in total, etc). I suspect there may be a problem sustaining rates during high read loads though. We'll see. I'll try to make time for trying this out.

-- / Peter Schuller
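As a sketch of how simple the rate limiting part could be once writes actually hit the device when issued (via O_DIRECT or fsync()), something like a chunk-based throttle may be all that is needed. This is illustrative only; the class and the accounting are made up for the example:

    /**
     * Illustrative sketch: keep the average write rate at or below a target
     * number of bytes per second by sleeping after a chunk when ahead of schedule.
     */
    final class SimpleThrottle
    {
        private final long targetBytesPerSecond;
        private final long startedAtNanos = System.nanoTime();
        private long bytesSoFar = 0;

        SimpleThrottle(long targetBytesPerSecond)
        {
            this.targetBytesPerSecond = targetBytesPerSecond;
        }

        /** Call after writing (and syncing / direct-writing) each chunk. */
        void throttle(long chunkBytes) throws InterruptedException
        {
            bytesSoFar += chunkBytes;
            double expectedSeconds = (double) bytesSoFar / targetBytesPerSecond;
            double elapsedSeconds = (System.nanoTime() - startedAtNanos) / 1e9;
            long sleepMillis = (long) ((expectedSeconds - elapsedSeconds) * 1000);
            if (sleepMillis > 0)
                Thread.sleep(sleepMillis); // ahead of the target rate; back off
        }
    }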
Re: Minimizing the impact of compaction on latency and throughput
> Due to the need for doing data alignment in the application itself (you are
> bypassing all the OS magic here), there is really nothing portable about
> O_DIRECT. Just have a look at open(2) on linux:

[snip]

> So, just within Linux you got different mechanisms for this depending on
> kernel and fs in use and you need to figure out what to do yourself as the
> OS will not tell you that. Don't expect this alignment stuff to be more
> standardized across OSes than inside of Linux. Still find this portable?

The concept of direct I/O, yes. I don't have experience of what the practical portability is with respect to alignment, however, so maybe those details are a problem. But things like under what circumstances which flags to posix_fadvise() actually have the desired effect don't feel very portable either. One might have a look at to what extent direct I/O works well in e.g. PostgreSQL or something like that, across platforms. But maybe you're right and O_DIRECT is just not worth it.

> O_DIRECT also bypasses the cache completely, so you lose a lot of the I/O

That was the intent.

> scheduling and caching across multiple readers/writers in threaded apps and
> separated processes which the OS may offer.

This is specifically what I want to bypass. I want to bypass the operating system's caching to (1) avoid trashing the cache and (2) know that a rate limited write translates fairly well to underlying storage. Rate limiting asynchronous writes will often be less than ideal since the operating system will tend to, by design, defer writes. This aspect can of course be overcome with fsync(), and that does not even require native code, which is a big point in its favor. But if we still need native code for posix_fadvise() anyway (for reads), then that hit is taken anyway. But sure. Perhaps posix_fadvise() in combination with regular fsync():ing on writes may be preferable to direct I/O (with fsync() being required both for rate limiting purposes, if one is to combat that, and for avoiding cache eviction given the way fadvise works in Linux at the moment).

> This can especially be a big
> loss when you got servers with loads of memory for large filesystem caches
> where you might find it hard to actually utilize the cache in the
> application.

The entire point is to bypass the cache during compaction. But this does not (unless I'm mistaken about how Cassandra works) invalidate already pre-existing caches at the Cassandra/JVM level. In addition, for large data sets (large being significantly larger than RAM size), the data pulled into cache as part of compaction is not going to be useful anyway, as is. There is the special case where all or most data fit in RAM and having all compaction I/O go through the cache may even be preferable; but in the general case, I really don't see the advantage of having that I/O go through the cache.

If you do have most or all data in RAM, then certainly having all that data warm at all times is preferable to doing I/O on a cold buffer cache against sstables. But on the other hand, any use of direct I/O or fadvise() will be optional (presumably). And in a setup whereby your performance is entirely dependent on most data being in RAM at all times, you will already have issues with e.g. cold starts of nodes. In any case, I definitely consider there to be good reasons not to rely only on operating system caching; compaction is one of those reasons, both with and without direct I/O or fadvise().

> O_DIRECT was made to solve HW performance limitations on servers 10+ years
> ago. It is far from an ideal solution today (but until stuff like fadvise is
> implemented properly, somewhat unavoidable)

I think there are pretty clear and obvious use cases where the cache eviction implied by large bulk streaming operations on large amounts of data is not what you want (there are any number of practical situations where this has been an issue for me, if nothing else). But if I'm overlooking something that would mean that this optimization, trying to avoid eviction, is useless with Cassandra, please do explain it to me :)

But I'll definitely buy that posix_fadvise() is probably a cleaner solution.

-- / Peter Schuller
Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]
> More than one fd can be open on a given file, and many of the open fd's are
> on files that have been deleted. The stale fd's are all on Data.db files in the
> data directory, which I have separate from the commit log directory.
>
> I haven't had a chance to look at the code handling files, and I am not any
> sort of Java expert, but could this be due to Java's lazy resource clean up?
> I wonder if when considering writing your own file handling classes for
> O_DIRECT or posix_fadvise or whatever, an explicit close(2) might help.

The fact that there are open fds to deleted files is interesting... I wonder if people have reported weird disk space usage in the past (since such deleted files would not show up with 'du -sh' but eat space on the device until closed).

My general understanding is that Cassandra does specifically rely on the GC to know when unused sstables can be removed. However, the fact that the files are deleted I think means that this is not the problem, and the question is rather why open file descriptors/streams are leaking to these deleted sstables. But I'm speaking now without knowing when/where streams are closed.

Are the deleted files indeed sstables, or was that a bad assumption on my part?

-- / Peter Schuller
Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]
> As a Cassandra newbie, I'm not sure how to tell, but they are all
> to *.Data.db files, and all under the DataFileDirectory (as spec'ed
> in storage-conf.xml), which is a separate directory than the
> CommitLogDirectory. I did not see any *Index.db or *Filter.db
> files, but I may have missed them.

The *.Data.db files are indeed sstables.

-- / Peter Schuller
Re: GC Storm
(adding dev@)

> (2) Can we implement multi-thread compaction?

I think this is the only way to scale. Or at least to implement concurrent compaction (whether it is by division into threads or not) of multiple size classes. As long as the worst-case compactions are significantly slower than best-case compactions, you will presumably have the problem of lots of sstables accumulating during long compactions. Since having few sstables is part of the design goal (or so I have assumed, or else you would seek too much on disk when doing e.g. a range query), triggering situations where this is not the case is a performance problem for readers.

I've been thinking about this for a bit, and maybe there could be one tweakable configuration setting which sets the desired machine concurrency, that the user tweaks in order to make compaction fast enough in relation to incoming writes. Regardless of the database size, this is necessary whenever Cassandra is able to take writes faster than a CPU-bound compaction thread is able to process them.

The other thing would be to have an intelligent compaction scheduler that does something along the lines of scheduling a compaction thread for every "level" of compaction (i.e., one for log_m(n) = 1, one for log_m(n) = 2, etc). To avoid inefficiency and huge spikes in CPU usage, these compaction threads could stop every now and then (something reasonable; say every 100 MB compacted or so) and yield to other compaction threads. This way:

(a) a limited number of threads will be actively runnable at any given moment, allowing the user to limit the effect background compaction can have on CPU usage
(b) but on the other hand, it also means that more than one CPU can be used; whatever is appropriate for the cluster
(c) it should be reasonably easy to implement because each compaction is just a regular thread doing what it does now already
(d) the synchronization overhead between compaction threads should be completely irrelevant as long as one selects a high enough synchronization threshold (100 MB was just a suggestion; it might be 1 GB)
(e) log_m(n) will never be large enough for it to be a scaling problem that you have one thread per "level"

Thoughts?

-- / Peter Schuller
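To illustrate the scheduling idea above (a sketch only; the names and the exact granularity are made up): one worker per compaction "level", all sharing a semaphore whose permit count is the desired compaction concurrency, with each worker giving up its permit after every ~100 MB compacted:

    import java.util.concurrent.Semaphore;

    /** Illustrative sketch of a per-level compaction worker that yields periodically. */
    final class LevelCompactionTask implements Runnable
    {
        private static final long YIELD_EVERY_BYTES = 100L * 1024 * 1024;

        private final Semaphore activeSlots; // shared; permits == desired compaction concurrency

        LevelCompactionTask(Semaphore activeSlots)
        {
            this.activeSlots = activeSlots;
        }

        public void run()
        {
            try
            {
                while (hasRemainingWork())
                {
                    activeSlots.acquire(); // only N compaction workers make progress at once
                    try
                    {
                        compactRoughly(YIELD_EVERY_BYTES); // hypothetical: compact ~this many bytes, then return
                    }
                    finally
                    {
                        activeSlots.release(); // give workers for other levels a turn
                    }
                }
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        }

        // Placeholders standing in for the real compaction machinery:
        private boolean hasRemainingWork() { return false; }
        private void compactRoughly(long bytes) { }
    }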
Re: improving read performance
> This drawback is unfortunate for systems that use time-based row keys. In
> such systems, row data will generally not be fragmented very much, if at
> all, but reads suffer because the assumption is that all data is fragmented.
> Even further, in a real-time system where reads occur quickly after
> writes, if the data is in memory, the sstables are still checked.

Perhaps I am misunderstanding you, but why is this a problem (in the particular case of time-based row keys) given the existence of the bloom filters, which should eliminate the need to go down to the sstables except for those that actually contain data for the row (in almost all cases, subject to bloom filter false positives)?

Also, for the case of the edges where memtables are flushed, a write-through row cache should help alleviate that. I forget off hand whether the row cache is in fact write-through or not though.

-- / Peter Schuller
Re: improving read performance
> Actually, the points you make are things I have overlooked and actually make
> me feel more comfortable about how cassandra will perform for my use cases.
> I'm interested, in my case, to find out what the bloom filter
> false-positive rate is. Hopefully, a stat is kept on this.

Assuming a lack of implementation bugs and a good enough hash algorithm, the false positive rate of bloom filters is mathematically determined. See:

http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html

And in cassandra:

java/org/apache/cassandra/utils/BloomCalculations.java
java/org/apache/cassandra/utils/BloomFilter.java

(I don't know without checking (no time right now) whether the false positive rate is actually tracked or not.)

> As long as
> ALL of the bloom filters are in memory, the hit should be minimal for a

Bloom filters are by design in memory at all times (they are the worst possible case you can imagine in terms of random access, so it would never make sense to keep them on disk even partially). (This assumes the JVM isn't being otherwise swapped out, which is another issue.)

> Good point on the row cache. I had actually misread the comments in the
> yaml, mistaking "do not use on ColumnFamilies with LARGE ROWS" as "do not
> use on ColumnFamilies with a LARGE NUMBER OF ROWS". I don't know if this
> will improve performance much since I don't understand yet if this
> eliminates the need to check for the data in the SStables. If it doesn't,
> then what is the point of the row cache since the data is also in an
> in-memory memtable?

It does eliminate the need to go down to sstables. It also survives compactions (so doesn't go cold when sstables are replaced). Reasons to not use the row cache with large rows include:

* In general it's a waste of memory better given to the OS page cache, unless possibly you're continually reading entire rows rather than subsets of rows.
* For truly large rows you may have immediate issues with the size of the data being cached; e.g. attempting to cache a 2 GB row is not the best idea in terms of heap space consumption; you'll likely OOM or trigger fallbacks to full GC, etc.
* Having a larger key cache may often be more productive.

> That aside, splitting the memtable in 2 could make checking the bloom
> filters unnecessary in most cases for me, but I'm not sure it's worth the
> effort.

Write-through row caching seems like a more direct approach to me personally, off hand. Also, to the extent that you're worried about false positive rates, larger bloom filters may still be an option (not currently configurable; it would require source changes).

-- / Peter Schuller
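For reference, the false positive probability from the paper linked above, for k hash functions, n inserted keys and m bits of filter, together with a tiny illustration of the kind of numbers involved (the example parameters are arbitrary):

    /** Theoretical bloom filter false positive probability: p = (1 - e^(-k*n/m))^k */
    static double falsePositiveRate(int k, long n, long m)
    {
        return Math.pow(1 - Math.exp(-k * (double) n / m), k);
    }

    // e.g. 1 million keys in a 10 million bit filter with 7 hash functions:
    // falsePositiveRate(7, 1000000, 10000000) is roughly 0.008, i.e. about 0.8%.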
Re: Atomically adding a column to columns_
> Function addColumn in class SuperColumn tries to atomically add a column to
> the concurrent collection “columns_” using the following code:

Deletions in Cassandra involve an insertion of a tombstone rather than actual column deletion. In the case of this bit of code, I believe, and I am not speaking authoritatively, the removal only happens in (1) the read path when filtering results (on presumably query-local data) and (2) during compaction (on presumably compaction-local data). I did browse through the various callers, which seemed to confirm that this was the case.

In other words, I do not believe the remove() code path is ever taken concurrently with insertions (by design, and not by accident). Anyone care to confirm/deny?

-- / Peter Schuller
Re: Atomically adding a column to columns_
> It would be good to document this, or, since the
> correct-even-for-remove logic is not much more complicated, switch to
> that.

Submitted: https://issues.apache.org/jira/browse/CASSANDRA-1559

-- / Peter Schuller
Re: Reducing confusion around client libraries
Without weighing in on the remainder of the discussion: regardless of what happens, what do people think about at least making some minor structural changes to e.g. the wiki? For example, putting the Thrift API in a section other than the "user documentation", and being pretty vocal (to the point of sounding repetitious) about recommending that applications use the higher-level clients? I think it's pretty easy to "accidentally" land in the Thrift documentation if you browse around.

I'd be happy to try to tweak this a bit - stuff like moving the Thrift refs, finding places on the wiki that link to the Thrift API page, and re-considering whether (or at least how) to link, etc.

-- / Peter Schuller
mmap:ed i/o and buffer sizes
I was looking closer at sliced_buffer_size_in_kb and column_index_size_in_kb and reached the conclusion that, for the purpose of I/O, these are irrelevant when using the mmap:ed I/O mode (which makes sense, since there is no way to use a "buffer size" when all you're doing is touching memory). The only effect is that column_index_size_in_kb still affects the size at which indexing triggers, which is as advertised. Firstly, can anyone confirm/deny my interpretation?

Secondly, has anyone done testing as to the effects of mmap():ed I/O on the efficiency (in terms of disk seeks) of reads on large data sets? The CPU benefits of mmap() may be negated when disk-bound if the read-ahead logic of the kernel interacts sub-optimally with Cassandra's use case. Potentially, even reading more than a single page could imply multiple seeks (assuming a loaded system with other I/O in the queue) if there is no read-ahead until the first successive access. I have not checked what actually does happen, nor have I benchmarked for comparison. But I'd be interested in hearing if people have already addressed this in the past.

-- / Peter Schuller
Re: Any chance of getting cassandra releases published to repo1.maven.org?
> I will say this once and only once, if you want to migrate to a maven based
> build, I could help... However I do not want to mix issues, I'm quite
> prepared to work with the current build system, there is a community of
> developers here that seem happy with the existing build system, they are the
> ones doing the work, whatever they prefer goes... I just want to get the
> artifacts into central

Speaking only from the point of view of a user (application developer), I think a major plus comes from that end. Being able to have your standard maven project just pull in dependencies, rather than having to work around infrastructure and make special exceptions to get cassandra client X and specific special version of thrift Y, is extremely nice.

(I'd personally love a pom.xml for cassandra itself too, if only to avoid having to configure IDE:s manually. I may be interested in helping out trying to maintain one, but I'm not sure I have sufficient maven fu yet to be effective (but I'm getting there).)

Regardless, the Riptano maven repository is greatly appreciated as it already stands.

-- / Peter Schuller
Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)
> Sorry for spam again. :-)

No, thanks a lot for tracking that down and reporting details! Presumably a significant number of users are on that version of Ubuntu running with openjdk.

-- / Peter Schuller
Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)
> vic...@:~$ sudo ps aux | grep "cassandra"
> cassandra 11034 0.2 22.9 1107772 462764 ? Sl Dec17 6:13
> /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar
> org.apache.cassandra.thrift.CassandraDaemon
>
> Cassandra uses 462764 Kb, roughly 460 Mb for 2 Mb of data... And it keeps
> getting bigger.
> It is important to know that I have just a few inserts, quite a lot of reads
> though. Also Cassandra seems to completely ignore the JVM limitations such as
> Xmx.
> If I don't stop and launch Cassandra every 15 or 20 days it simply crashes,
> due to oom errors.

The resident size is not unexpected given that your Xmx is 512 MB. The virtual size may or may not be expected, depending; for example, thread stacks as previously discussed in this thread. If you're not seeing the *resident* set size go above the maximum heap size, you're unlikely to be seeing the same problem.

With respect to OOM, see http://www.riptano.com/docs/0.6/operations/tuning - but without more information it's difficult to know what specifically it is that you're hitting.

Are you seriously saying you're running for 15-20 days with only 2 MB of live data?

-- / Peter Schuller
API pages on the wiki
I have noticed API06 and API07 being created, with API now being de-synched from API07. What is the intended method of maintaining these? Should API reflect trunk, or the "latest release"? Should we just make API not contain anything at all, and simply link to the appropriate page for each version?

-- / Peter Schuller
Re: API pages on the wiki
> ^ I like that idea. Make it a landing page that points to the other
> versions (trunk, 07, 06, etc.).

Right. The main downside I can think of is that you don't get the annotated nature of the current API page, where you can read what is true now while simultaneously getting a sense of when new features were introduced. But my feeling is that it is better to have single versions of documentation that are more realistic to maintain than to de-normalize it. Theoretically one should optimize for the reader rather than the writer, but on the theory that out-of-date/incorrect documentation is worse than the user having to read two versions to get a sense of the differences, it seems to make sense.

-- / Peter Schuller
Clarification on intended bootstrapping semantics
(I considered user@ but I don't want to cause confusion so I'm limiting myself to dev@)

I would like to clarify/confirm a few things so that I understand correctly and so that I can update the documentation to be clearer. I was unsure about some of these things until code inspection. (If I have totally missed documentation that covers this in this level of detail, please let me know.)

So to be clear: I'm making statements and asking for confirmation. I'm not making claims, so don't assume this is true if you don't know. There is some overlap between some of the points.

(1) Starting from scratch with N nodes, M of which are seed nodes (M <= N), no initial token and autobootstrap false, starting them all up with clean data directories should be supported without concern for start-up order. The cluster is expected to converge. As long as no writes are happening, and there is no data in the cluster, there is no problem. There will be no "divergent histories" that prevent gossip from working.

(2) Specifically, given that one wants for example 2 seeds, there is no requirement to join the "second" seed as a non-seed *first*, only to then restart with it as a seed after having joined the cluster.

(3) The critical invariant for the operator to maintain with respect to seed nodes is that no node is ever listed as a seed node in other nodes' configuration without said seed node first having joined the cluster.

(3a) The exception being initial cluster start where there is no data and no traffic, where it's fine to just start all nodes in arbitrary order.

(4) It is always fine for a seed node to consider itself a seed, even during initial start-up and joining the ring.

(5) Enabling auto_bootstrap does not just affect the method by which tokens are selected, but also affects *whether* the bootstrap process includes streaming data from other nodes prior to becoming up in the ring (i.e., whether StorageService.bootstrap() is going to be called in initServer()).

(6) Having a node join a pre-existing cluster with data in it without auto_bootstrap set to true would cause the node to join the ring but be void of data, thus potentially violating consistency guarantees (but recovery is possible by running repair).

(7) A consequence of (5)+(6) is that auto_bootstrap should *always* be enabled on all nodes in a production cluster, except:

(7a) New nodes being brought in as seeds
(7b) During the very first initial cluster setup with no data

(8) The above is intended and on purpose, and it would be correct to operate under these assumptions when updating/improving documentation.

-- / Peter Schuller
Re: Maintenance releases
> I'm willing to concede that I may have an abnormally conservative
> opinion about this. But I wanted to voice my concern in hopes we can
> improve the quality and delivery of our maintenance releases.

(speaking now from the perspective of a consumer, disregarding the implications on development)

I don't think you're being overly conservative. In particular in light of the recent 1.0 discussion and whether or not to signal that Cassandra is "ready" and no longer needs committers on staff etc., having solid releases is probably an important part of instilling confidence in the project and decreasing the risk associated with adopting a new release.

For example, from the point of view of the user, I think that things like CASSANDRA-1992 should preferably result in an almost immediate bugfix-only release with instructions and impact information for users.

Being only a very minor contributor I don't have a full grasp of the implications on the agility of development that such a change would have, but certainly I think that if a goal is to have users feel more confident (and in fact be safer) in using a Cassandra release without careful monitoring of the mailing lists and JIRA, adjusting the release engineering a bit seems like a high-priority change towards that goal.

-- / Peter Schuller
Re: Please unsubscribe me from the email list.
> Please unsubscribe gary.mo...@xerox.com from this email list.

http://wiki.apache.org/cassandra/FAQ#unsubscribe

-- / Peter Schuller
Cassandra documentation (and in this case the datastax anti-entropy docs)
In response to the apparent mass confusion about nodetool repair that became evident in the thread:

http://www.mail-archive.com/user@cassandra.apache.org/msg11755.html

I started looking around to see what is actually claimed about repair. I found that the Datastax docs:

http://www.datastax.com/docs/0.7/operations/scheduled_tasks#repair

... use phrasing which seems very, very wrong. It strongly seems to imply that you should not normally run nodetool repair on a cluster.

First of all, have I completely flown off the handle and completely and utterly confused myself - is what I say in the E-Mail thread wrong?

On the assumption that I'm not crazy, I think this is a good time to talk about documentation. I've been itching for a while about the state of documentation. There is the ad-hoc wiki, and there is the Datastax stuff, but neither is really complete. What I ask myself is how we can achieve the goal that people who are trying to adopt Cassandra can do so, and use it reliably, without extensive time spent following mailing lists, JIRA, and trying to keep track of what's still true and not on the wiki, etc. This includes challenges like:

* How to actually phrase and structure documentation in an accessible fashion for people who just want to *use* Cassandra, and not be knee-deep in the community.
* Trying to minimize the amount of "subtle detail" that you have to get right in order to not have a problem; the very up-to-you-to-fix and not-very-well-advertised state of 'nodetool repair' is a good example. Whatever can be done to avoid there even having to *be* documentation for it, except for people who want to know extra details or are doing stuff like not having deletes and wanting to avoid repair.
* Keeping the documentation up-to-date.

Do people agree with the goals and the claim that we're not there? What are good ways to achieve the goals?

I keep feeling that there should really be a handbook. The Datastax docs seem to be the right "format" (similarly to the FreeBSD handbook, which is a good example). But it seems we need something more agile that people can easily contribute to, while it still can be kept up-to-date. So what can be done? Is having a handbook a good idea?

The key property of what I call a handbook is that there is some section on e.g. "Cassandra operations" that is reasonably long, and that someone can read through from beginning to end and get a coherent overall view of how things work, and know the important aspects that must be taken care of in production clusters. It's fine if every single little detail and potential kink isn't there (like long details about how to calculate memtable thresholds). But stuff like "yeah, you need to run nodetool repair at least as often as X" is important. So are operational best practices for performing operations on a cluster in a safe manner (e.g., moving nodes, seeds being sensitive, gossip delays, bootstrapping multiple nodes at once, etc).

I'm not sure how to get there. It's not like I'm *so* motivated and have *so* much time that if people agree I'll sit down and write 500 pages of Cassandra handbook. So the question is how to achieve something incrementally that is yet more organized than the wiki.

Thoughts?

-- / Peter Schuller
Re: [ANN] Branched; freeze in effect
> I'd like to get going on the first beta as soon as possible, (this week
> would be great). If you know of issues that should be dealt with before
> that happens, please let me know.

I think this one is a good candidate even if not specific to 0.8:

https://issues.apache.org/jira/browse/CASSANDRA-2420

-- / Peter Schuller
Re: [VOTE] Release Apache Cassandra 0.8.1
> If things are working properly, he shouldn't see any of these messages
> on a minor version upgrade because the version number should remain
> constant. The message version should only change on major releases
> (if necessary).
>
> It could be that the existing cluster was built with non-released code
> (sporting a different message version).

I believe it is expected in this case due to https://issues.apache.org/jira/browse/CASSANDRA-2280

-- / Peter Schuller
Re: [VOTE] Release Apache Cassandra 0.8.1 (take #4)
> [1]: http://goo.gl/qbvPB (CHANGES.txt)
> [2]: http://goo.gl/7uQXl (NEWS.txt)

CASSANDRA-2280 (https://issues.apache.org/jira/browse/CASSANDRA-2280) is not listed (and I don't see the change in the branch), which I suspect may be on purpose? However, if so the JIRA ticket should probably be updated to reflect that it is slated for 0.8.2?

-- / Peter Schuller
Releasing 0.8.6 due to CASSANDRA-3166?
As came up in a thread on user@, I would suggest that CASSANDRA-3166[1] is enough reason to release 0.8.6. Asking people to build from source and patch to perform a rolling upgrade isn't good. [1] https://issues.apache.org/jira/browse/CASSANDRA-3166 -- / Peter Schuller (@scode on twitter)
Re: Releasing 0.8.6 due to CASSANDRA-3166?
> http://mail-archives.apache.org/mod_mbox/cassandra-dev/201109.mbox/%3CCAKkz8Q307TaOfw=7tpkaooal_a+ry_gewnyo-vwnugoenv3...@mail.gmail.com%3E

Oops, I'm sorry. I did actually search my mailbox first, but obviously I failed.

-- / Peter Schuller (@scode on twitter)
Re: [VOTE] Release Apache Cassandra 1.0.0-rc2 (Release Candidate 2)
> [1]: http://goo.gl/YtJLq (CHANGES.txt)

Contains merge markers:

>>>>>>> .merge-right.r1176712
0.8.6

-- / Peter Schuller (@scode on twitter)
Re: major version release schedule
Here is another thing to consider: there is considerable cost involved in running/developing on old branches as the divergence between the version you're running and trunk increases. For those actively doing development, such divergence actually causes extra work and slows down development.

A more reasonable approach IMO is to make sure that important *bugfixes* are backported to branches that are sufficiently old to satisfy the criteria of the OP. But that is orthogonal to how often new releases happen. The OP compares with PostgreSQL, and they're in a similar position: you can run on a fairly old version and still get critical bug fixes, meaning that if you don't actually need the new version there is no one telling you that you must upgrade.

It seems to me that what matters is mostly two things:

(1) When you *do* need/want to upgrade, the upgrade path you care about should be stable and working, and the version you're upgrading to should be stable.

(2) Critical fixes still need to be maintained for the version you're running (or else you are in fact kind of forced to upgrade).

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: major version release schedule
> Until recently we were working hard to reach a set of goals that
> culminated in a 1.0 release. I'm not sure we've had a formal
> discussion on it, but just talking to people, there seems to be
> consensus around the idea that we're now shifting our goals and
> priorities around some (usability, stability, etc). If that's the
> case, I think we should at least be open to reevaluating our release
> process and schedule accordingly (whether that means lengthening,
> shortening, and/or simply shifting the barrier-to-entry for stable
> updates).

Personally I am all for added stability, quality, and testing. But I don't see how a decreased release frequency will cause more stability. It may be that decreased release frequency is the necessary *result* of more stability, but I don't think the causality points in the other direction, unless developers ship things early just to get them into a release.

But also keep in mind: if we reach a point where major users of Cassandra need to run on significantly divergent versions of Cassandra because the release is just too old, the "normal" mainstream release will end up getting even less testing.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: cassandra node is not starting
Could this just be commit log replay of the truncate?

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: cassandra node is not starting
> Could this just be commit log replay of the truncate?

Nevermind :)

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra has moved to Git
> So, can I summarize our policy as "git pull --rebase"?

If you're talking about the simple case of hacking away at a branch that multiple people are working on, without "semantically" desiring a branch, then IMO *always* use --rebase. Here's my stackoverflow rant about why:

http://stackoverflow.com/questions/2472254/when-should-i-use-git-pull-rebase/4675513#4675513

I think that should apply to Cassandra too. It's all about what you're trying to do, and doing git pull w/o rebase when just happily hacking along, without any intent to diverge history, just leaves a completely soiled history due to git's default of a merge commit on pull. But that's very distinct from cases where you actually *do* have a branch (publicly speaking).

Sorry, I don't want to pollute this discussion too much. I just have this pet peeve about the specific issue of "git pull" vs "git pull --rebase" in the simple hacking-away-at-a-single-branch case.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra has moved to Git
(And btw, major +1 on the transition to git!) -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Using quorum to read data
> 2012-01-09 11:58:33,325 WARN
> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -
> 192.168.80.31(192.168.80.31):9160 host still appears to be down: Unable to
> open transport to 192.168.80.31(192.168.80.31):9160 ,
> java.net.ConnectException: Connection refused: connect
>
> What am I missing?

That's not a timeout, but a failure to connect. Did you give Hector all of the nodes in the cluster or enable auto-discovery? If the above is the only error you see, it seems likely that you just need to configure your client to talk to the entire cluster. (I don't remember off hand how to tell Hector to use auto-discovery.)

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Cassandra has moved to Git
[speaking obviously as a non-committer, just offering a perspective]

A potential factor to consider: if one knows that all work in topic branches ends up merged without anyone first rebasing to clean up, you now have a constant trade-off between history readability and committing often. I personally dislike anything that gives you a reason *not* to commit. I'd much rather commit like crazy and clean it up before merge than maintain additional branches privately just for that purpose, or play around with stashes.

I.e., in addition to the effects on history, if people feel it does make history harder to read, it presumably affects the behavior of those people in day-to-day work in terms of their propensity to commit.

If the issue is not rebasing public branches, one can presumably always introduce a convention whereby work happens in branch X; and if rebasing is needed, you do that in X/rebased/1. If a further iteration ends up happening, X/rebased/2. Or something along those lines. This would get you:

* History looks the way you want it to look.
* All original history is maintained if you really want to look at it (I *do* think it could be useful when diving into a JIRA ticket after the fact to figure out what the reasoning was).
* You're not rebasing published branches.

The downside I suppose is that the branch count increases.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: What's the point of the deviation of java code style in Cassandra?
> that's a disambiguation wiki page. what exactly are you talking about?

http://en.wiktionary.org/wiki/when_in_Rome,_do_as_the_Romans_do

Can we *please* stop this thread?

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Welcome committer Peter Schuller
> The Apache Cassandra PMC has voted to add Peter as a committer. Thank
> you Peter, and we look forward to continuing to work with you!

Thank *you*, as do I :)

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Rules of thumbs for committers
> - Turn on rebase for public branches ("git config branch.$name.rebase
> true" for name in cassandra-0.8 through trunk)

And for the record, this means that "git pull" becomes the equivalent of "git pull --rebase", avoiding a constant stream of merge commits on every pull+push iteration.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
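Spelled out, the per-branch setting from the quoted text looks like this (the branch names are just the ones mentioned); "branch.autosetuprebase" is the stock git option for getting the same behavior on branches you start tracking in the future:

    git config branch.trunk.rebase true
    git config branch.cassandra-0.8.rebase true
    # optionally, default all newly created tracking branches to rebase-on-pull:
    git config branch.autosetuprebase always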
Re: [VOTE] Release Apache Cassandra 1.1.0-beta1
+1 -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: [VOTE] Release Apache Cassandra 1.0.8
+1 (but FYI changelog has a typo "ahndling"). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!

I am very happy to see some momentum on this, and I would like to go even further than what you propose. The main reasons why I do not think simply adding vnodes and making random assignments is the best end goal are:

(1) The probability of data loss increases linearly with cluster size.
(2) It does not take network topology into account.

What follows is mostly a half-finished long text that I have been sitting on for a few months but not finished/posted. Unfortunately I do not have the possibility (due to time constraints) to go through everything in detail and update it with current information and to specifically address what you already said, so there will be overlap with your original post. However, given your E-Mail and the momentum in this thread, I really wanted to post something rather than not. It would be awesome if interested parties had a chance to read the referenced CRUSH paper, and the deltas proposed below.

The goals are to address everything you already wanted in your post, while also addressing:

(1) Probability of data loss
(2) Network topology awareness

The following text will first try to paint a picture of the goals that I have in mind, and then go on to the actual proposed solution. The proposition is very, very short and undetailed right now and there is plenty of discussion and detail to fill in. I apologize, but again, I really want to post something now that this is being brought up.

BEGIN un-polished text ("we" = "I"):

= CRUSHing Cassandra

Author: Peter Schuller

This is a proposal for a significant re-design of some fundamentals of Cassandra, aimed at addressing a number of current issues as well as anticipating future issues. It is particularly aimed at large clusters, but as a side-effect should improve the small cluster experience as well.

== New terminology: Distribution factor

A Cassandra cluster is today said to have `N` nodes, and data is replicated at a particular replication factor (`RF`). The placement of replicas is such that all rows that have a certain node `N1` as their primary replica are located on a specific set of `RF-1` other nodes. In addition, `N1` holds secondary replicas of data for `RF-1` other nodes. In total, it shares data with `2RF - 2` other nodes. The number of nodes with whom a node shares data is the distribution factor. In the case of Cassandra, `DF = 2RF - 2` (so for example `RF = 3` gives `DF = 4`).

== Goals

The goals this suggestion attempts to help solve include the following.

=== Goal: DF should not be tied to RF, nor N

`DF` is important for these reasons:

* The `DF` determines how many nodes are involved in re-constructing a lost node after failure; the higher the `DF`, the less of a performance impact reconstruction has on the remaining nodes.
* The `DF` determines the impact on other nodes, with respect to read/write load, of a node being down.
* The `DF` affects the probability of multiple failures causing data loss, since one loses data if any `RF` nodes within a group of `DF` nodes all go down.

Having `DF` tied to `RF` like Cassandra does now has its problems. A single node failure has a significant effect on the performance characteristics of neighboring nodes (in terms relative to the normal level of load on the neighbors). On large data sets, a failed node needing reconstruction is a significant event, as it:

* Increases the load on neighbors just from going down.
* Further increases the load on neighbors as they have to stream data (adding I/O and cache thrashing).

This typically leads to the desire to throttle / rate limit reconstruction, which adds to the reconstruction window in addition to the fact that it was already naturally bottlenecking on neighbors.

The other extreme is to tie `DF` to `N`, such that the data contained on one node has its secondary replicas spread out over the entire ring. This is an unacceptable choice because the probability of multiple failures causing data loss increases linearly with the cluster size. In other words, we want `DF` to be tied to neither `RF` nor `N`. Rather, `DF` should be chosen as a trade-off between the effects of `DF`:

* The higher the `DF`, the higher the probability of data loss in case of multiple failures.
* The higher the `DF`, the faster it is to reconstruct/replace a lost node.
* The higher the `DF`, the less impact node failures have on the performance requirements placed on other nodes.

In making this determination, one must take into account that if a larger `DF` makes reconstruction/replacement significantly faster, that also decreases the time window in which multiple failures can occur. Increasing `DF` is thus not *necessarily* increasing the total probability of data loss (for small values of `DF`).

=== Goal: Topologically aware redundancy

We maintain the goal of being topology aware f
Re: RFC: Cassandra Virtual Nodes
Point of clarification: My use of the term "bucket" is completely unrelated to the term "bucket" used in the CRUSH paper. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
> function to distribute partitions in the topology.

Yes, agreed. The distribution factor limiting is also compatible with non-CRUSH schemes, by hash chaining from the primary replica instead of the row key.

> This brings me on to (1) and the reasons for our choice of virtual
> node scheme - choose N random tokens - instead of the Dynamo-like
> scheme that you suggest where the partitions are fixed in advance.
> With the Dynamo scheme, the size of a virtual node partition will only
> ever grow as more data is inserted. Since the number of partitions is
> fixed when the cluster is created, the partition size is unbounded.

I'm not sure I see why my suggestion is Dynamo-like. In what way? It is essentially random in the sense of converging toward uniformity, but with the difference that the mapping is produced deterministically (and in a stable fashion).

> There are certain advantages to having a limit on partition size.
> Streaming failures that cause retries do not have to resend so much
> data. Streaming operations can be staggered in smaller chunks to
> minimise the impact on the nodes involved. Load balancing can operate
> on a finer granularity.

In my scheme the limit would be implicit in the number of partitions (combined with cluster size). And yes, this is a very good property IMO. Relatedly, as mentioned, you can make sure to structure the data in terms of these partitions. In terms of current Cassandra, this means that "cleanup" is an instantaneous operation instead of an expensive operation that has to truck through a lot of data (less so with leveled compaction).

> The other concern you mentioned was
>
>> The probability of data loss increases linearly with cluster size.
>
> but you also acknowledge that
>
>> In making this determination, one must take into account that if a
>> larger `DF` makes reconstruction/replacement significantly faster,
>> that also decreases the time window in which multiple failures can
>> occur. Increasing `DF` is thus not *necessarily* increasing the total
>> probability of data loss (for small values of `DF`).
>
> Our calculations lead us to believe that in fact the shorter rebuild
> window more than compensates for the increased probability of multiple
> failure, so with DF=N the probability of data loss is minimised.
>
> The CRUSH paper also states:
>
> "With 2-way mirroring these two factors cancel
> each other out, while overall data safety with more than two
> replicas increases with declustering [Xin et al. 2004]"
>
> ("declustering" meaning increasing DF towards N)

This is a common argument against limiting DF, but I am *strongly* skeptical of it in real-life scenarios. This is one of those cases where I think one needs to look beyond the pure math. For one thing, a huge concern for me is that I don't want active re-balancing to have such an extreme availability requirement that downtime in actively doing it implies a significantly increased risk of data loss. I don't want a system to be constantly teetering on the edge of data loss, where it is not even safe to *shut it down* because the absence of data loss depends on the system as a whole being fully available, working and actively moving data around.

It also hinges on reality matching theory perfectly in terms of "independent" failures truly being independent. I would feel much more comfortable with a system that did not rely on super-fast re-replication to ensure data safety. But like I said, a lot of people make this argument - I just remain unconvinced thus far.

Further, even looking at just the math, the claim cannot possibly hold as N grows sufficiently large. At some point you will bottleneck on the network and no longer benefit from a higher DF, but the probability of data loss doesn't drop off until you reach DF = number of partitions (because at that point an increased cluster size doesn't increase the number of nodes sharing data with any given node).

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
>> Using this ring bucket in the CRUSH topology, (with the hash function
>> being the identity function) would give the exact same distribution
>> properties as the virtual node strategy that I suggested previously,
>> but of course with much better topology awareness.
>
> I will have to re-read your original post. I seem to have missed something :)

I did, and I may or may not understand what you mean. Are you comparing vnodes + hashing with CRUSH + pre-partitioning by hash + identity hash as you traverse down the topology tree?

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
(I may comment on other things more later)

> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency. I

Actually, it does. At least assuming DF > RF (as in the original proposal, and mine). The impact of a node suffering from a performance degradation is mitigated because the effects are spread out over DF-1 (N-1 in the original post) nodes instead of just RF nodes.

> think some progress could be made here by allowing a "pack" of independent
> Cassandra nodes to be ran on a single host; somewhat (but nowhere near
> entirely) similar to a pre-fork model used by some UNIX-based servers.

I have a pretty significant knee-jerk negative reaction to that idea, to be honest, even if the pack is limited to a handful of instances. In order for vnodes to be useful with random placement, we'd need much more than a handful of vnodes per node (Cassandra instances in a "pack" in that model).

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
> Each node would have a lower and an upper token, which would form a range
> that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of these operations
> matched the replica's token range in the gossip tables. Each node would
> locally store its own current active token range as well as a target token
> range it's "moving" towards.

How is this significantly different from just using "nodetool move" (in post-1.x) more rapidly and on smaller segments at a time? There is the ring delay stuff which makes it un-workable to do at high granularity, but that should apply to the active range solution too.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
> Software wise it is the same deal. Each node streams off only disk 4
> to the new node.

I think an implication for the software is that if you want to make specific selections of partitions to move, you are effectively incompatible with deterministically generating the mapping of partition to responsible node. I.e., it probably means the vnode information must be kept as state. That is probably difficult to reconcile with balancing solutions like consistent hashing/CRUSH/etc.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
> You would have to iterate through all sstables on the system to repair one
> vnode, yes: but building the tree for just one range of the data means that
> huge portions of the sstables files can be skipped. It should scale down
> linearly as the number of vnodes increases (ie, with 100 vnodes, it will
> take 1/100th the time to repair one vnode).

The story is less good for "nodetool cleanup" however, which still has to truck over the entire dataset. (The partitions/buckets in my CRUSH-inspired scheme address this by allowing each ring segment, in vnode terminology, to be stored separately in the file system.)

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: RFC: Cassandra Virtual Nodes
> No I don't think you did, in fact, depending on the size of your SSTable a
> contiguous range (or the entire SSTable) may or may not be affected by a
> cleanup/move or any type of topology change. There is lots of room for
> optimization here. After loading the indexes we actually know start/end
> range for an SSTable so we can include/exclude it in any such operation.

Just note that unless there is some correlation between token range and how these sstables get created to begin with (as with leveled compaction), you're highly unlikely to be able to optimize here. For uniformly distributed tokens (hashed keys), all sstables are likely to contain almost the entire possible token range.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)