Re: Cassandra on top of B-Tree

2010-03-30 Thread Peter Schuller
> Log structural database  has the append-only characteristics, e.g. BDB-JE.
> Is it an alternative for SSTable? Those matured database product might have
> done a lot for cache management. Not sure whether it can improve the
> performance of read or not.

BDB JE seems to be targeted mostly at cases where data fits in RAM,
or reasonably close to it. A problem is that while writes will be
append-only as long as the database is sufficiently small, you start
taking reads once the internal btree nodes no longer fit in RAM. So
depending on cache size, at a certain number of keys (and thus size of
the btree) you start being seek-bound on reads while writing, even
though the writes are in and of themselves append-only and not subject
to seek overhead.

Another effect, which I have not specifically confirmed in testing but
expect to happen, is that once you reach this point of taking reads,
compaction is probably going to be a lot more expensive. While normally
JE can pick the log segment with the most garbage and mostly stream
through it, re-writing non-garbage, that process will also become
entirely seek bound if only a small subset of the btree fits in RAM. So
now you have a seek-bound compaction process that must keep up with the
append-only write process, meaning that your append-only writes are
limited by said seeks in addition to any seeks taken "directly" when
generating the writes.

Also keep in mind that JE won't have on-disk locality for either
internal nodes or leaf (data) nodes.

The guaranteed append-only nature of Cassandra, in combination with
the on-disk locality, is one reason to prefer it, under some
circumstances, over JE even for non-clustered local use on a single
machine.

(As a parenthesis: I doubt JE is being used very much with huge
databases, since a very significant CPU bottleneck comes from O(n)
(with respect to the number of log segments) file listings. This is
probably easily patched, or configured away by using larger log
segments, but the repeated O(n) file listings suggest to me that huge
databases are not an expected use case - beyond some hints in the
documentation that would indicate it's meant for smaller databases.)

-- 
/ Peter Schuller


Re: Packaging Cassandra for Debian [was: Packaging Cassandra for Ubuntu]

2010-06-04 Thread Peter Schuller
(To be clear: I'm not a cassandra developer, nor have I yet used
Cassandra in a real production setting - though I'm hoping to...)

> I'm curious what others think:

My estimation is that Cassandra is not an "apt-get install and forget"
type of software yet. If you're going to use Cassandra, you are most
likely going to want to spend some serious time reading up on it and
carefully configuring, monitoring and operating it.

For that reason, I think the benefit of getting cassandra into Debian
stable or Ubuntu is of limited importance to users with a serious
need.

However, it *is* extremely practical to have software in nicely
packaged form. I would like to suggest that if packaging is to be a
higher priority for Cassandra, it might be more useful to provide
pre-built .deb packages (possibly in the form of an apt repository)
that people can add to their sources.list. Combine that with clear
instructions on pinning (in the case of an apt repository) and
users have a very convenient way of installing and upgrading
Cassandra.

At the same time, the Cassandra developers would save, I suspect,
significant effort involved in trying to remain long-term compatible
and giving support to users running very old releases who cannot
legitimately be told "upgrade to the latest version" after having been
fed Cassandra from e.g. Debian stable. In contrast, remaining
compatible with Debian stable (i.e., building .deb:s for new versions
of Cassandra on old versions of Debian/Ubuntu) is likely not a big
deal at all given the lack of native dependencies; it also means that
you need not solve, as a pre-requisite, things like the lack of a
separately packaged thrift or other dependencies.

In short, I would suggest something along the above lines as an
easier-to-accomplish goal for the Cassandra developers, yet one that
provides a high payoff to users.

-- 
/ Peter Schuller


Re: Hello

2010-07-02 Thread Peter Schuller
> I've been ramping up the code base, getting acquainted with the major
> components, and I feel like there is a lack of commenting. Commenting the
> code would be a good way for me to learn the system better, and I suspect it
> would also be useful for others interested in the future. Would providing
> some quality comments -- high-level descriptions for each class, more
> detailed descriptions for large/ambiguous functions -- be a useful way to
> start contributing?

I'm not a cassandra developer, but one suggestion may be to fill in
some package-info.java:s such that generated javadocs are easier to
navigate in terms of the overall structure and the roles of packages.

-- 
/ Peter Schuller


Re: Cassandra performance and read/write latency

2010-07-06 Thread Peter Schuller
> Has anyone else seen nodes "hang" for several seconds like this?  I'm
> not sure if this is a Java VM issue (e.g. garbage collection) or something

Since garbage collection is logged (if you're running with default
settings etc), any multi-second GC:s should be discoverable in said
log. So for testing that hypothesis I'd check there first. Cassandra
itself logs GC:s, but you can also turn on the JVM's own GC logging
with e.g. "-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps".

> I'm also interested in comparing notes with anyone  else that has been doing
> read/write throughput benchmarks with Cassandra.

I did some batch write testing to see how it scaled up to about 200
million rows and 200 GB; I had occasional spikes in latency that were
due to disk writes being flushed by the OS. However it was probably
exacerbated in this case by the fact that this was ZFS/FreeBSD, and ZFS
(in my humble opinion, and at least on FreeBSD) consistently exhibits
the behavior that it flushes writes too late and ends up blocking
applications even if you have left-over bandwidth.

In my case I "eliminated" the issue for the purpose of my test by
having a stupid while loop simply doing "sync" every handful of
seconds, to avoid accumulating too much data in the cache.
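
(For illustration only - the workaround was essentially equivalent to
the following throwaway sketch, which just shells out to sync(8) on an
interval; it is not anything in Cassandra itself:)

    // Throwaway workaround sketch: periodically invoke sync(8) so the OS
    // never accumulates a huge amount of dirty data to flush in one burst.
    public class PeriodicSync
    {
        public static void main(String[] args) throws Exception
        {
            while (true)
            {
                Runtime.getRuntime().exec("sync").waitFor();
                Thread.sleep(5000); // every handful of seconds
            }
        }
    }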

While I expect this to be less of a problem for other setups, it's
possible that this is what you're seeing - for example, if the
operating system is blocking writes to the commit log (are you running
with periodic or batch-wise fsync?).

-- 
/ Peter Schuller


Minimizing the impact of compaction on latency and throughput

2010-07-07 Thread Peter Schuller
Hello,

I have repeatedly seen users report that background compaction is
overly detrimental to the behavior of the node with respect to
latency. While I have not yet deployed cassandra in a production
situation where latencies are closely monitored, these reports do not
really sound very surprising to me given the nature of compaction and
unless otherwise stated by developers here on the list I tend to
believe that it is a real issue.

Ignoring implementation difficulties for a moment, a few things that
could improve the situation, that seem sensible to me, are:

* Utilizing posix_fadvise() on both reads and writes to avoid
obliterating the operating system's caching of the sstables.
* Add the ability to rate limit disk I/O (in particular writes).
* Add the ability to perform direct I/O.
* Add the ability to fsync() regularly on writes to force the
operating system to not decide to flush hundreds of megabytes of data
out in a single burst.
* (Not an improvement but general observation: it seems useless for
writes to the commit log to remain in cache after an fsync(), and so
they are a good candidate for posix_fadvise())

None of these would be silver bullets, and the importance and
appropriate settings for each would be very dependent on operating
system, hardware, etc. But having the ability to control some or all
of these should, I suspect, allow significantly lessening the impact
of compaction under a variety of circumstances.
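
To make the fsync() suggestion above concrete, here is a minimal sketch
of what I mean (untested; the class name and the 32 MB interval are
just made up for illustration):

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Sketch: fsync() every N bytes while writing, so the OS never has
    // hundreds of megabytes of dirty pages to flush in a single burst.
    public class SyncingWriter
    {
        private static final long SYNC_INTERVAL_BYTES = 32L * 1024 * 1024; // hypothetical default

        public static void writeAll(FileChannel out, Iterable<ByteBuffer> chunks) throws Exception
        {
            long written = 0;
            for (ByteBuffer chunk : chunks)
            {
                written += out.write(chunk);
                if (written >= SYNC_INTERVAL_BYTES)
                {
                    out.force(false); // fsync(); bounds the amount of dirty data
                    written = 0;
                }
            }
            out.force(false); // final flush
        }
    }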

With respect to cache eviction, this is one area where the impact
can probably be expected to be higher the more you rely on the
operating system's caching, and the less you rely on in-JVM caching
done by Cassandra.

The most obvious problem points to me include:

* posix_fadvise() and direct I/O cause portability and building
issues, necessitating native code.
* Rate limiting is very indirect due to read-ahead, caching, etc. In
particular for writes, rate limiting would likely be almost useless
without also having fsync() or direct I/O, unless the rate is limited
to an extremely small amount and the cluster is taking very few writes
(such that the typical background flushing done by most OS:es happens
often enough to not imply huge amounts of data).

Any thoughts? Has this already been considered and rejected? Do you
think compaction is in fact not a problem already? Are there other,
easier, better ways to accomplish the goal?

-- 
/ Peter Schuller


Re: Minimizing the impact of compaction on latency and throughput

2010-07-07 Thread Peter Schuller
> This makes sense, but from what I have seen, read contention vs
> cassandra is a much bigger deal than write contention (unless you
> don't have a separate device for your commitlog, but optimizing for
> that case isn't one of our goals).

I am not really concerned with write performance, but rather with
writes affecting reads. Cache eviction due to streaming writes has the
obvious effect on hit ratio on reads, and in general large sustained
writes tend to very negatively affect read latency (and typically in a
bursty fashion). So, the idea was to specifically optimize to try to
make reads be minimally affected by writes (in the sense of the
background compaction eventually resulting from those writes).

Understood and agreed about the commit log (though as long as write
bursts are within what a battery-backed RAID controller can keep in
its cache I'd expect it to potentially work pretty well without
separation, if you do have such a setup).

-- 
/ Peter Schuller


Re: Minimizing the impact of compaction on latency and throughput

2010-07-08 Thread Peter Schuller
> It might be worth experimenting with posix_fadvise.  I don't think
> implementing our own i/o scheduler or rate-limiter would be as good a
> use of time (it sounds like you're on that page too).

Ok. And yes I mostly agree, although I can imagine circumstances where
a pretty simple rate limiter might help significantly - albeit
something that has to be tweaked very specifically for the
situation/hardware rather than being auto-tuned.

If I have the time I may look into posix_fadvise() to begin with (but
I'm not promising anything).

Thanks for the input!

-- 
/ Peter Schuller


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Peter Schuller
> This looks relevant:
> http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
> comments for directions to code sample)

Thanks. That's helpful; I've been trying to avoid JNI in the past so
wasn't familiar with the API, and the main difficulty was likely to be
how to best expose the functionality to Java. Having someone do almost
exactly the same thing helps ;)

I'm also glad they confirmed the effect in a very similar situation.
I'm leaning towards O_DIRECT as well because:

(1) Even if posix_fadvise() is used, on writes you'll need to fsync()
before fadvise() anyway in order to allow Linux to evict the pages (a
theoretical OS implementation might remember the advise call, but
Linux doesn't - at least not up until recently).

(2) posix_fadvise() feels more obscure and less portable than
O_DIRECT, the latter being well-understood and used by e.g. databases
for a long time.

(3) O_DIRECT allows more direct control over when I/O happens and to
what extent (without playing tricks or making assumptions about e.g.
read-ahead) so will probably make it easier to kill both birds with
one stone.

You indicated you were skeptical about writing an I/O scheduler. While
I agree that writing a real I/O scheduler is difficult, I suspect that
if we do direct I/O a fairly simple scheme should work well. Being able
to tweak a target MB/sec rate, select a chunk size, and select the time
window over which to rate limit would, I suspect, go a long way.
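
To be explicit about what I mean by "fairly simple", roughly something
like this (an untested sketch; all names and defaults are made up):

    // Sketch of a trivial rate limiter for compaction I/O: given a target
    // rate, a chunk size and a time window, sleep whenever the caller is
    // ahead of schedule within the current window.
    public class SimpleRateLimiter
    {
        private final double targetBytesPerSec;
        private final long windowMillis;
        private long windowStart = System.currentTimeMillis();
        private long bytesInWindow = 0;

        public SimpleRateLimiter(double targetMBPerSec, long windowMillis)
        {
            this.targetBytesPerSec = targetMBPerSec * 1024 * 1024;
            this.windowMillis = windowMillis;
        }

        // Called once per chunk read or written; blocks if over the target rate.
        public synchronized void acquire(long chunkBytes) throws InterruptedException
        {
            long now = System.currentTimeMillis();
            if (now - windowStart >= windowMillis)
            {
                windowStart = now;
                bytesInWindow = 0;
            }
            bytesInWindow += chunkBytes;
            long earliestAllowed = windowStart + (long) (bytesInWindow / targetBytesPerSec * 1000);
            if (earliestAllowed > now)
                Thread.sleep(earliestAllowed - now);
        }
    }

The compaction code would call acquire() once per chunk; with direct
I/O (or an fsync() per chunk) the target rate should translate
reasonably directly to what the device actually sees.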

The situation is a bit special since in this case we are talking about
one type of I/O that is run during controlled circumstances
(controlled concurrency, we know how much memory we eat in total,
etc).

I suspect there may be a problem sustaining rates during high read
loads though. We'll see.

I'll try to make time for trying this out.

-- 
/ Peter Schuller


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Peter Schuller
> Due to the need for doing data alignment in the application itself (you are
> bypassing all the OS magic here), there is really nothing portable about
> O_DIRECT. Just have a look at open(2) on linux:

[snip]

> So, just within Linux you got different mechanisms for this depending on
> kernel and fs in use and you need to figure out what to do yourself as the
> OS will not tell you that. Don't expect this alignment stuff to be more
> standardized across OSes than inside of Linux. Still find this portable?

The concept of direct I/O, yes. I don't have experience of what the
practical portability is with respect to alignment, however, so maybe
those details are a problem. But things like under what circumstances
which flags to posix_fadvise() actually have the desired effect don't
feel very portable either.

One might have a look at to what extent direct I/O works well in e.g.
postgresql or something like that, across platforms. But maybe you're
right and O_DIRECT is just not worth it.

> O_DIRECT also bypasses the cache completely, so you lose a lot of the I/O

That was the intent.

> scheduling and caching across multiple reads/writers in threaded apps and
> separated processes which the OS may offer.

This is specifically what I want to bypass. I want to bypass the
operating system's caching to (1) avoid trashing the cache and (2)
know that a rate limited write translates fairly well to underlying
storage. Rate limiting asynchronous writes will often be less than
ideal since the operating system will tend to, by design, defer
writes. This aspect can of course be overcome with fsync() however.
And that does not even require native code, so is a big point in its
favor. But if we still need native code for posix_fadvise() anyway
(for reads), then that hit is taken anyway.

But sure. Perhaps posix_fadvise() in combination with regular
fsync():ing on writes may be preferable to direct I/O (with fsync()
being required both for rate limiting purposes if one is to combat
that, and for avoiding cache eviction given the way fadvise works in
Linux atm).
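
For reference, exposing the needed call to Java via JNA would look
roughly like the following (an untested sketch; the constant and the
fd extraction would need to be verified per platform, and nothing here
is existing Cassandra code):

    import com.sun.jna.Native;
    import java.io.FileDescriptor;
    import java.lang.reflect.Field;

    // Sketch of exposing posix_fadvise(2) via JNA so that just-fsynced
    // compaction output can be dropped from the page cache.
    public final class NativeFadvise
    {
        public static final int POSIX_FADV_DONTNEED = 4; // Linux value; verify per platform

        static { Native.register("c"); }

        private static native int posix_fadvise(int fd, long offset, long len, int advice);

        // Extract the raw fd (non-portable; relies on the private "fd" field).
        public static int fd(FileDescriptor descriptor) throws Exception
        {
            Field field = FileDescriptor.class.getDeclaredField("fd");
            field.setAccessible(true);
            return field.getInt(descriptor);
        }

        // After fsync()ing a written region, advise the kernel to evict it.
        public static void dontNeed(FileDescriptor descriptor, long offset, long len) throws Exception
        {
            posix_fadvise(fd(descriptor), offset, len, POSIX_FADV_DONTNEED);
        }
    }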

> This can especially be a big
> loss when you got servers with loads of memory for large filesystem caches
> where you might find it hard to actually utilize the cache in the
> application.

The entire point is to bypass the cache during compaction. But this
does not (unless I'm mistaken about how Cassandra works) invalidate
already pre-existing caches at the Cassandra/JVM level. In addition,
for large data sets (large being significantly larger than RAM size),
the data pulled into cache as part of compaction is not going to be
useful anyway, as is. There is the special case where all or most
data fit in RAM and having all compaction I/O go through the cache may
even be preferable; but in the general case, I really don't see the
advantage of having that I/O go through cache.

If you do have most or all data in RAM, then certainly having all that
data warm at all times is preferable to doing I/O on a cold buffer
cache against sstables. But on the other hand, any use of direct I/O
or fadvise() will be optional (presumably). And given a setup whereby
your performance is entirely dependent on most data being in RAM at
all times, you will already have issues with e.g. cold starts of
nodes.

In any case, I definitely consider there to be good reasons not to
rely only on operating system caching; compaction is one of these
reasons, both with and without direct I/O or fadvise().

> O_DIRECT was made to solve HW performance limitation on servers 10+ years
> ago. It is far from an ideal solution today (but until stuff like fadvise is
> implemented properly, somewhat unavoidable)

I think there are pretty clear and obvious use-cases where the cache
eviction implied by large bulk streaming operations on large amounts
of data is not what you want (there are any number of practical
situations where this has been an issue for me, if nothing else). But
if I'm overlooking something that would mean that this optimization,
trying to avoid eviction, is useless with Cassandra please do explain
it to me :)

But I'll definitely buy that posix_fadvise() is probably a cleaner solution.

-- 
/ Peter Schuller


Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

2010-07-14 Thread Peter Schuller
> More than one fd can be open on a given file, and many of open fd's are
> on files that have been deleted.  The stale fd's are all on Data.db files in
> the
> data directory, which I have separate from the commit log directory.
>
> I haven't had a chance to look at the code handling files, and I am not any
> sort of Java expert, but could this be due to Java's lazy resource clean up?
> I wonder if when considering writing your own file handling classes for
> O_DIRECT or posix_fadvise or whatever, an explicit close(2) might help.

The fact that there are open fds to deleted files is interesting... I
wonder if people have reported weird disk space usage in the past
(since such deleted files would not show up with 'du -sh' but eat
space on the device until closed).

My general understanding is that Cassandra does specifically rely on
the GC to know when unused sstables can be removed. However the fact
that the files are deleted I think means that this is not the problem,
and the question is rather why open file descriptors/streams are
leaking to these deleted sstables. But I'm speaking now without
knowing when/where streams are closed.

Are the deleted files indeed sstable, or was that a bad assumption on my part?

-- 
/ Peter Schuller


Re: Too many open files [was Re: Minimizing the impact of compaction on latency and throughput]

2010-07-14 Thread Peter Schuller
> As a Cassandra newbie, I'm not sure how to tell, but they are all
> to *.Data.db files, and all under the DataFileDirectory (as spec'ed
> in storage-conf.xml), which is a separate directory than the
> CommitLogDirectory.  I did not see any *Index.db or *Filter.db
> files, but I may have missed them.

The *.Data.db files are indeed sstables.

-- 
/ Peter Schuller


Re: GC Storm

2010-07-18 Thread Peter Schuller
(adding dev@)

> (2) Can we implement multi-thread compaction?

I think this is the only way to scale. Or at least to implement
concurrent compaction (whether it is by division into threads or not)
of multiple size classes. As long as the worst-case compactions are
significantly slower than best-base compactions, then presumably you
will have the problem of accumulation of lots of sstables during long
compactions. Since having few sstables is part of the design goal (or
so I have assumed, or else you will seek to much on disk when doing
e.g. a range query), triggering situations where this is not the case
is a performance problem for readers.

I've been thinking about this for a bit, and maybe there could be a
single configuration setting for the desired machine concurrency,
which the user tweaks in order to make compaction fast enough in
relation to incoming writes. Regardless of the database size, this is
necessary whenever Cassandra is able to take writes faster than a
CPU-bound compaction thread is able to process them.

The other thing would be to have an intelligent compaction scheduler
that does something along the lines of scheduling a compaction thread
for every "level" of compaction (i.e., one for log_m(n) = 1, one for
log_m(n) = 2, etc). To avoid inefficiency and huge spikes in CPU
usage, these compaction threads could stop every now and then
(something reasonable; say every 100 MB compacted) and yield to other
compaction threads.

This way:

(a) a limited number of threads will be actively runnable at any given
moment, allowing the user to limit the effect background compaction
can have on CPU usage
(b) but on the other hand, it also means that more than one CPU can be
used; whatever is appropriate for the cluster
(c) it should be reasonably easy to implement because each compaction
is just a regular thread doing what it does now already
(d) the synchronization overhead between compaction threads should be
completely irrelevant as long as one selects a high enough
synchronization threshold (100 MB was just a suggestion; it might be
1 GB)
(e) log_m(n) will never be large enough for it to be a scaling problem
that you have one thread per "level"
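
A rough sketch of the yielding described above, just to make the idea
concrete (all names are hypothetical; this is not existing code):

    import java.util.concurrent.Semaphore;

    // One compaction thread per "level", but only a bounded number actively
    // compacting at any given moment; each thread yields every N bytes.
    public class YieldingCompactionScheduler
    {
        private static final long YIELD_EVERY_BYTES = 100L * 1024 * 1024; // e.g. 100 MB

        private final Semaphore activeSlots; // the configured machine concurrency

        public YieldingCompactionScheduler(int concurrency)
        {
            this.activeSlots = new Semaphore(concurrency, true);
        }

        // Each per-level compaction thread runs something like this.
        public void run(CompactionTask task) throws InterruptedException
        {
            while (!task.isDone())
            {
                activeSlots.acquire();
                try
                {
                    task.compactBytes(YIELD_EVERY_BYTES); // bounded amount of work
                }
                finally
                {
                    activeSlots.release(); // yield to other compaction threads
                }
            }
        }

        public interface CompactionTask
        {
            boolean isDone();
            void compactBytes(long maxBytes);
        }
    }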

Thoughts?

-- 
/ Peter Schuller


Re: improving read performance

2010-09-20 Thread Peter Schuller
> This drawback is unfortunate for systems that use time-based row keys.    In
> such systems, row data will generally not be fragmented very much, if at
> all, but reads suffer because the assumption is that all data is fragmented.
>    Even further, in a real-time system where reads occur quickly after
> writes, if the data is in memory, the sstables are still checked.

Perhaps I am misunderstanding you, but why is this a problem (in the
particular case of time-based row keys) given the existence of the
bloom filters, which should eliminate the need to go down to any
sstables other than those that actually contain data for the row (in
almost all cases, subject to bloom filter false positives)?

Also, for the edge case around memtable flushes, a write-through row
cache should help alleviate that. I forget off hand whether the row
cache is in fact write-through or not, though.

-- 
/ Peter Schuller


Re: improving read performance

2010-09-20 Thread Peter Schuller
> Actually, the points you make are things I have overlooked and actually make
> me feel more comfortable about how cassandra will perform for my use cases.
>   I'm interested, in my case, to find out what the bloom filter
> false-positive rate is.   Hopefully, a stat is kept on this.

Assuming lack of implementation bugs and a good enough hash algorithm,
the false positive rate of bloom filters is mathematically
determined. See:

   http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
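
(For reference - this is the standard result, not anything specific to
Cassandra's implementation: with m filter bits, k hash functions and n
inserted keys, the expected false positive rate is approximately

    p ≈ (1 - e^(-k*n/m))^k

so the rate is fixed once the bits-per-key and hash count are chosen.)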

And in cassandra:

   java/org/apache/cassandra/utils/BloomCalculations.java
   java/org/apache/cassandra/utils/BloomFilter.java

(I don't know without checking (no time right now) whether the false
positive rate is actually tracked or not.)

>   As long as
> ALL of the bloom filters are in memory, the hit should be minimal  for a

Bloom filters are by design in memory at all times (they are the worst
possible case you can imagine in terms of random access, so it would
never make sense to keep them on disk even partially).

(This assumes the JVM isn't being otherwise swapped out, which is
another issue.)

> Good point on the row cache.   I had actually misread the comments in the
> yaml, mistaking "do not use on ColumnFamilies with LARGE ROWS" , as "do not
> use on ColumnFamilies with a LARGE NUMBER OF ROWS".    I don't know if this
> will improve performance much since I don't understand yet if this
> eliminates the need to check for the data in the SStables.   If it doesn't
> then what is the point of the row cache since the data is also in an
> in-memory memtable?

It does eliminate the need to go down to sstables. It also survives
compactions (so doesn't go cold when sstables are replaced).

Reasons to not use the row cache with large rows include:

* In general it's a waste of memory better given to the OS page cache,
unless possibly you're continually reading entire rows rather than
subsets of rows.

* For truly large rows you may have immediate issues with the size of
the data being cached; e.g. attempting to cache a 2 GB row is not the
best idea in terms of heap space consumption; you'll likely OOM or
trigger fallbacks to full GC, etc.

* Having a larger key cache may often be more productive.

> That aside, splitting the memtable in 2, could make checking the bloom
> filters unnecessary in most cases for me, but I'm not sure it's worth the
> effort.

Write-through row caching seems like a more direct approach to me
personally, off hand. Also to the extent that you're worried about
false positive rates, larger bloom filters may still be an option (not
currently configurable; would require source changes).

-- 
/ Peter Schuller


Re: Atomically adding a column to columns_

2010-09-27 Thread Peter Schuller
> Function addColumn at class SuperColumn tries to atomically add a column to
> the concurrent collection “columns_” using the following code:

Deletions in Cassandra involve an insertion of a tombstone rather than
actual column deletion.

In the case of this bit of code, I believe, and I am not speaking
authoritatively, the removal only happens in (1) the read path when
filtering results (on presumably query-local data) and (2) during
compaction (on presumably compaction-local data). I did browse through
the various callers which seemed to confirm that this was the case. In
other words, I do not believe the remove() code path is ever taken
concurrently with insertions (by design, and not by accident).

Anyone care to confirm/deny?
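
(For reference, the generic lock-free pattern that stays correct even
with concurrent removals is roughly the following; this is a sketch
with a hypothetical reconcile(), not the actual SuperColumn code:)

    import java.util.concurrent.ConcurrentMap;

    // Insert that remains correct even if other threads concurrently
    // remove or replace the entry.
    public final class AtomicAdd
    {
        public interface Reconciler<V>
        {
            V reconcile(V existing, V incoming);
        }

        public static <K, V> void add(ConcurrentMap<K, V> columns, K name, V column,
                                      Reconciler<V> reconciler)
        {
            V current = columns.putIfAbsent(name, column);
            while (current != null)
            {
                V merged = reconciler.reconcile(current, column);
                if (columns.replace(name, current, merged))
                    return;
                // The entry changed or was removed concurrently; re-read and retry.
                current = columns.putIfAbsent(name, column);
            }
        }
    }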

-- 
/ Peter Schuller


Re: Atomically adding a column to columns_

2010-09-29 Thread Peter Schuller
> It would be good to document this, or, since the
> correct-even-for-remove logic is not much more complicated, switch to
> that.

Submitted:

https://issues.apache.org/jira/browse/CASSANDRA-1559

-- 
/ Peter Schuller


Re: Reducing confusion around client libraries

2010-12-04 Thread Peter Schuller
Without weighing in on the remainder of the discussion; regardless of
what happens what do people think about at least making some minor
structural changes to e.g. the wiki? For example, putting the Thrift
API in a section other than the "user documentation", and being pretty
vocal (to the point of sounding repetitious) about recommending that
applications use the higher level clients?

I think it's pretty easy to "accidentally" land in the Thrift
documentation if you browse around.

I'd be happy to try to tweak this a bit - stuff like moving the thrift
refs, finding places on the wiki that link to the thrift API page, and
re-considering whether (or at least how) to link, etc.

-- 
/ Peter Schuller


mmap:ed i/o and buffer sizes

2010-12-10 Thread Peter Schuller
I was looking closer at sliced_buffer_size_in_kb and
column_index_size_in_kb and reached the conclusion that for the
purpose of I/O, these are irrelevant when using mmap:ed I/O mode
(which makes sense, since there is no way to use a "buffer size" when
all you're doing is touching memory). The only effect is that
column_index_size_in_kb still affects the size at which indexing
triggers, which is as advertised.

Firstly, can anyone confirm/deny my interpretation?

Secondly, has anyone done testing as to the effects of mmap():ed I/O
on the efficiency (in terms of disk seeks) of reads on large data
sets? The CPU benefits of mmap() may be negated when disk-bound if the
read-ahead logic of the kernel interacts sub-optimally with
Cassandra's use-case. Potentially even reading more than a single page
could imply multiple seeks (assuming a loaded system with other I/O in
the queue) if there is no read-ahead until the first successive
access.

I have not checked what actually does happen, nor have I benchmarked
for comparison. But I'd be interested in hearing if people have
already addressed this in the past.

-- 
/ Peter Schuller


Re: Any chance of getting cassandra releases published to repo1.maven.org?

2010-12-14 Thread Peter Schuller
> I will say this once and only once, if you want to migrate to a maven based
> build, I could help... However I do not want to mix issues, I'm quite
> prepared to work with the current build system, there is a community of
> developers here that seem happy with the existing build system, they are the
> ones doing the work, whatever they prefer goes... I just want to get the
> artifacts into central

Speaking only from the point of view of a user (application
developer), I think a major plus comes from that end. Being able
to have your standard maven project just pull in dependencies, rather
than having to work around infrastructure and make special exceptions
to get cassandra client X and a specific version of thrift Y, is
extremely nice.

(I'd personally love a pom.xml for cassandra itself too, if only to
avoid having to configure IDE:s manually. I may be interested in
helping out trying to maintain one, but I'm not sure I have sufficient
maven fu yet to be effective (but I'm getting there).)

Regardless, the Riptano maven repository is greatly appreciated as it
appears already.

-- 
/ Peter Schuller


Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)

2010-12-16 Thread Peter Schuller
> Sorry for spam again. :-)

No, thanks a lot for tracking that down and reporting details!
Presumably a significant number of users are on that version of Ubuntu
running with OpenJDK.

-- 
/ Peter Schuller


Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)

2010-12-19 Thread Peter Schuller
> vic...@:~$ sudo ps aux | grep "cassandra"
> cassandra     11034  0.2 22.9 1107772 462764 ?  Sl   Dec17   6:13
> /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar
> org.apache.cassandra.thrift.CassandraDaemon
>
> Cassandra uses 462764 Kb, roughly 460 Mb for 2 Mb of data... And it keeps
> getting bigger.
> It is important to know that I have just a few insert, quite a lot of read
> though. Also Cassandra seams to completly ignore the JVM limitations such as
> Xmx.
> If I don't stop and launch Cassandra every 15 ou 20 days it simply crashes,
> due to oom errors.

The resident size is not unexpected given that your Xmx is 512 MB. The
virtual size may or may not be expected, depending on, for example,
the thread stacks previously discussed in this thread.

If you're not seeing the *resident* set size go above the maximum heap
size, you're unlikely to be seeing the same problem.

With respect to OOM, see
http://www.riptano.com/docs/0.6/operations/tuning - but without more
information it's difficult to know what specifically it is that you're
hitting. Are you seriously saying you're running for 15-20 days with
only 2 MB of live data?

-- 
/ Peter Schuller


API pages on the wiki

2011-01-11 Thread Peter Schuller
I have noticed API06 and API07 being created, with API now being
de-synched from API07.

What is the intentional method of maintaining these? Should API
reflect trunk, or the "latest release"?

Should we just make API not contain anything at all, and simply link
to the appropriate page for each version?

-- 
/ Peter Schuller


Re: API pages on the wiki

2011-01-11 Thread Peter Schuller
> ^ I like that idea.  Make it a landing page that points to the other
> versions (trunk, 07, 06, etc.).

Right. The main downside I can think of is that you don't get the
annotated nature of the current API page, where you can read what is
true now while simultaneously getting a sense of when new features
were introduced.

But my feeling is that it is better to have single versions of the
documentation that are more realistic to maintain than to de-normalize
it. Theoretically one should optimize for the reader rather than the
writer, but on the theory that out-of-date/incorrect documentation is
worse than the user having to read two versions to get a sense of the
differences, it seems to make sense.

-- 
/ Peter Schuller


Clarification on intended bootstrapping semantics

2011-01-20 Thread Peter Schuller
(I considered user@ but I don't want to cause confusion so limiting
myself to dev@)

I would like to clarify/confirm a few things so that I understand
correctly and so that I can update the documentation to be clearer. I
was unsure about some of these things until code inspection. (If I
have totally missed documentation that covers this in this level of
details please let me know.)

So to be clear: I'm making statements and asking for confirmation. I'm
not making claims, so don't assume this is true if you don't know.
There is some overlap between some of the points.

(1) Starting from scratch with N nodes, M of which are seed nodes (M
<= N), no initial token and auto_bootstrap false, starting them all up
with clean data directories should be supported without concern for
start-up order. The cluster is expected to converge. As long as no
writes are happening, and there is no data in the cluster, there is no
problem. There will be no "divergent histories" that prevent gossip
from working.

(2) Specifically, given that one wants for example 2 seeds, there is
no requirement to join the "second" seed as a non-seed *first*, only
to then restart with it as seed after having joined the cluster.

(3) The critical invariant for the operator to maintain with respect
to seed nodes, is that no node is ever listed as a seed node in other
node's configuration, without said seed node first having joined the
cluster.
(3a) The exception being initial cluster start where there is no data
and no traffic, where it's fine to just start all nodes in arbitrary
order.

(4) It is always fine for a seed node to consider itself a seed even
during initial start-up and joining the ring.

(5) enabling auto_bootstrap does not just affect the method by which
tokens are selected, but also affects *whether* the bootstrap process
includes streaming data from other nodes prior to becoming up in the
ring (i.e., whether StorageService.bootstrap() is going to be called
in initServer())

(6) having a node join a pre-existing cluster with data in it without
auto_bootstrap set to true would cause the node to join the ring but
be void of data, thus potentially violating consistency guarantees
(but recovery is possible by running repair)

(7) A consequence of (5)+(6) is that auto_bootstrap should *always* be
enabled on all nodes in a production cluster, except:
(7a) New nodes being brought in as seeds
(7b) During the very first initial cluster setup with no data

(8) The above is intended and on purpose, and it would be correct to
operate under these assumptions when updating/improving documentation.

-- 
/ Peter Schuller


Re: Maintenance releases

2011-02-11 Thread Peter Schuller
> I'm willing to concede that I may have an abnormally conservative
> opinion about this.  But I wanted to voice my concern in hopes we can
> improve the quality and delivery of our maintenance releases.

(speaking now from the perspective of a consumer, disregarding the
implications on development)

I don't think you're being conservative. In particular in light of the
recent 1.0 discussion and whether or not to signal that Cassandra is
"ready" and no longer needing committers on staff etc; having solid
releases is probably an important part of instilling confidence in the
project and decreasing the risk associated with adopting a new
release. For example, from the point of view of the user, I think that
things like CASSANDRA-1992 should preferably result in an almost
immediate bugfix-only release with instructions and impact information
for users.

Being only a very minor contributor, I don't have a full grasp of the
implications such a change would have on the agility of development,
but certainly I think that if a goal is to have users feel more
confident (and in fact be safer) in using a Cassandra release without
careful monitoring of the mailing lists and JIRA, adjusting the
release engineering a bit seems like a high-priority change towards
that goal.

-- 
/ Peter Schuller


Re: Please unsubscribe me from the email list.

2011-02-14 Thread Peter Schuller
> Please unsubscribe gary.mo...@xerox.com from this email list.

http://wiki.apache.org/cassandra/FAQ#unsubscribe

-- 
/ Peter Schuller


Cassandra documentation (and in this case the datastax anti-entropy docs)

2011-03-31 Thread Peter Schuller
In response to the apparent mass confusion about nodetool repair that
became evident in the thread:

   http://www.mail-archive.com/user@cassandra.apache.org/msg11755.html

I started looking around to see what is actually claimed about repair.
I found that the Datastax docs:

   http://www.datastax.com/docs/0.7/operations/scheduled_tasks#repair

... use phrasing which seems very wrong. They strongly seem to
imply that you should not normally run nodetool repair on a cluster.

First of all, have I completely flown off the handle and completely
and utterly confused myself - is what I say in the E-Mail thread
wrong?

On the assumption that I'm not crazy, I think this is a good time to
talk about documentation. I've been itching for a while about the
state of documentation. There is the ad-hoc wiki, and there is the
Datastax stuff, but neither is really complete.

What I ask myself is how we can achieve the goal that people who are
trying to adopt Cassandra can do so, and use it reliably, without
extensive time spent following mailing lists, JIRA, and trying to keep
track of what's still true and what's not on the wiki, etc.

This includes challenges like:

* How to actually phrase and structure documentation in an accessible
fashion for people who just want to *use* Cassandra, and not be
knee-deep in the community.

* Try to minimize the amount of "subtle detail" that you have to get
right in order to not have a problem; the very up-to-you-to-fix and
not-very-well-advertised state of 'nodetool repair' is a good example.
Whatever can be done to avoid there even having to *be* documentation
for it should be done, except for people who want to know extra
details or are doing stuff like avoiding deletes and wanting to avoid
repair.

* Keeping the documentation up-to-date.

Do people agree with the goals and the claim that we're not there?
What are good ways to achieve the goals?

I keep feeling that there should really be a handbook. The
datastax docs seem to be the right "format" (similarly to the FreeBSD
handbook, which is a good example). But it seems we need something
more agile that people can easily contribute to, while it still can be
kept up-to-date. So what can be done?

Is having a handbook a good idea? The key property of what I call a
handbook is that there is some section on e.g. "Cassandra operations"
that is reasonably long, and that someone can read through from
beginning to end and get a coherent overall view of how things work,
and know the important aspects that must be taken care of in
production clusters.

It's fine if every single little detail and potential kink isn't there
(like long details about how to calculate memtable thresholds).
But stuff like "yeah, you need to run nodetool repair at least as
often as X" is important. So are operational best practices for
performing operations on a cluster in a safe manner (e.g., moving
nodes, seeds being sensitive, gossip delays, bootstrapping multiple
nodes at once, etc).

I'm not sure how to get there. It's not like I'm *so* motivated and
have *so* much time that if people agree I'll sit down and write 500
pages of Cassandra handbook. So the question is how to achieve
something incrementally that is yet more organized than the wiki.

Thoughts?

-- 
/ Peter Schuller


Re: [ANN] Branched; freeze in effect

2011-04-11 Thread Peter Schuller
> I'd like to get going on the first beta as soon as possible, (this week
> would be great).  If you know of issues that should be dealt with before
> that happens, please let me know.

I think this one is a good candidate even if not specific to 0.8:

   https://issues.apache.org/jira/browse/CASSANDRA-2420

-- 
/ Peter Schuller


Re: [VOTE] Release Apache Cassandra 0.8.1

2011-06-20 Thread Peter Schuller
> If things are working properly, he shouldn't see any of these messages
> on a minor version upgrade because the version number should remain
> constant.  The message version should only change on major releases
> (if necessary).
>
> It could be that the existing cluster was built with non-released code
> (sporting a different message version).

I believe it is expected in this case due to
https://issues.apache.org/jira/browse/CASSANDRA-2280

-- 
/ Peter Schuller


Re: [VOTE] Release Apache Cassandra 0.8.1 (take #4)

2011-06-27 Thread Peter Schuller
> [1]: http://goo.gl/qbvPB (CHANGES.txt)
> [2]: http://goo.gl/7uQXl (NEWS.txt)

CASSANDRA-2280 (https://issues.apache.org/jira/browse/CASSANDRA-2280)
is not listed (and I don't see the change in the branch), which I
suspect may be on purpose? However, if so the JIRA ticket should
probably be updated to reflect that it is slated for 0.8.2?

-- 
/ Peter Schuller


Releasing 0.8.6 due to CASSANDRA-3166?

2011-09-17 Thread Peter Schuller
As came up in a thread on user@, I would suggest that
CASSANDRA-3166[1] is enough reason to release 0.8.6. Asking people to
build from source and patch to perform a rolling upgrade isn't good.

[1] https://issues.apache.org/jira/browse/CASSANDRA-3166

-- 
/ Peter Schuller (@scode on twitter)


Re: Releasing 0.8.6 due to CASSANDRA-3166?

2011-09-17 Thread Peter Schuller
> http://mail-archives.apache.org/mod_mbox/cassandra-dev/201109.mbox/%3CCAKkz8Q307TaOfw=7tpkaooal_a+ry_gewnyo-vwnugoenv3...@mail.gmail.com%3E

Oops, I'm sorry. I did actually search my mailbox first, but obviously I failed.

-- 
/ Peter Schuller (@scode on twitter)


Re: [VOTE] Release Apache Cassandra 1.0.0-rc2 (Release Candidate 2)

2011-09-30 Thread Peter Schuller
> [1]: http://goo.gl/YtJLq (CHANGES.txt)

Contains merge markers.

>>>>>>> .merge-right.r1176712
0.8.6


-- 
/ Peter Schuller (@scode on twitter)


Re: major version release schedule

2011-12-20 Thread Peter Schuller
Here is another thing to consider: There is considerable cost involved
in running/developing on old branches as the divergence between the
version you're running and trunk increases.

For those actively doing development, such divergence actually causes
extra work and slows down development.

A more reasonable approach IMO is to make sure that important
*bugfixes* are backported to branches that are sufficiently old to
satisfy the criteria of the OP. But that is orthogonal to how often
new releases happen.

The OP compares with PostgreSQL, and they're in a similar position.
You can run on a fairly old version and still get critical bug fixes,
meaning that if you don't actually need the new version there is no
one telling you that you must upgrade.

It seems to me that what matters are mostly two things:

(1) When you *do* need/want to upgrade, the upgrade path you care
about should be stable and working, and the version you're upgrading
to should be stable.
(2) Critical fixes need to still be maintained for the version you're
running (else you are in fact kind of forced to upgrade).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: major version release schedule

2011-12-20 Thread Peter Schuller
> Until recently we were working hard to reach a set of goals that
> culminated in a 1.0 release.  I'm not sure we've had a formal
> discussion on it, but just talking to people, there seems to be
> consensus around the idea that we're now shifting our goals and
> priorities around some (usability, stability, etc).  If that's the
> case, I think we should at least be open to reevaluating our release
> process and schedule accordingly (whether that means lengthening,
> shorting, and/or simply shifting the barrier-to-entry for stable
> updates).

Personally I am all for added stability, quality, and testing. But I
don't see how a decreased release frequency will cause more stability.
It may be that decreased release frequency is the necessary *result*
of more stability, but I don't think the causality points in the other
direction unless developers ship things early to get it into the
release.

But also keep in mind: If we reach a point where major users of
Cassandra need to run on significantly divergent versions of Cassandra
because the release is just too old, the "normal" mainstream release
will end up getting even less testing.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: cassandra node is not starting

2012-01-02 Thread Peter Schuller
Could this just be commit log replay of the truncate?

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: cassandra node is not starting

2012-01-02 Thread Peter Schuller
> Could this just be commit log replay of the truncate?

Nevermind :)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Cassandra has moved to Git

2012-01-04 Thread Peter Schuller
> So, can I summarize our policy as "git pull --rebase"?

If you're talking about the simple case of hacking away at a branch
that multiple people are working on, without "semantically" desiring a
branch, then IMO *always* use --rebase. Here's my stackoverflow
rant about why:

   
http://stackoverflow.com/questions/2472254/when-should-i-use-git-pull-rebase/4675513#4675513

I think that should apply to Cassandra too. It's all about what you're
trying to do; doing git pull w/o rebase when just happily hacking
along, without any intent to diverge history, leaves a completely
soiled history due to git's default of a merge commit on pull.

But that's very distinct from cases where you actually *do* have a
branch (publicly speaking).

Sorry, I don't want to pollute this discussion too much. I just have
this pet peeve about the specific issue of "git pull" vs "git pull
--rebase" in the simple hacking-away-at-a-single-branch case.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Cassandra has moved to Git

2012-01-04 Thread Peter Schuller
(And btw, major +1 on the transition to git!)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Using quorum to read data

2012-01-09 Thread Peter Schuller
> 2012-01-09 11:58:33,325 WARN
> [me.prettyprint.cassandra.connection.CassandraHostRetryService] -  192.168.80.31(192.168.80.31):9160 host still appears to be down: Unable to
> open transport to 192.168.80.31(192.168.80.31):9160 ,
> java.net.ConnectException: Connection refused: connect>
>
> What am I missing?

That's not a timeout, but a failure to connect. Did you give Hector
all of the nodes in the cluster or enable auto-discovery? If the above
is the only error you see, it seems likely that you just need to
configure your client to talk to the entire cluster.

(I don't remember off hand how to tell hector to use auto-discovery.)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Cassandra has moved to Git

2012-01-09 Thread Peter Schuller
[speaking obviously as non-committer, just offering a perspective]

A potential factor to consider: If one knows that all work in topic
branches ends up merged without anyone first rebasing to clean up, you
now have a constant trade-off between history readability and
committing often. I personally dislike anything that gives people a
reason *not* to commit. I'd much rather commit like crazy and clean it
up before merge than maintain additional branches privately just for
the purpose, or play around with stashes.

I.e., in addition to the effects on history, if people feel it does
make history harder to read, it presumably affects the behavior of
those people in day-to-day work in terms of their propensity to
commit.

If the issue is not rebasing public branches, one can presumably
always introduce a convention whereby work happens in branch X; and if
rebasing is needed, you do that in X/rebased/1. If a further iteration
ends up happening, X/rebased/2. Or something along those lines. This
would get you:

* History looks the way you want it to look.
* All original history is maintained if you really want to look at it
(I *do* think it could be useful when diving into a JIRA ticket after
the fact to figure out what the reasoning was).
* You're not rebasing published branches.

The downside I suppose is that the branch count increases.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: What's the point of the deviation of java code style in Cassandra?

2012-01-27 Thread Peter Schuller
> that's a disambiguation wiki page. what exactly are you talking about?

http://en.wiktionary.org/wiki/when_in_Rome,_do_as_the_Romans_do

Can we *please* stop this thread?

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Welcome committer Peter Schuller

2012-02-13 Thread Peter Schuller
> The Apache Cassandra PMC has voted to add Peter as a committer.  Thank
> you Peter, and we look forward to continuing to work with you!

Thank *you*, as do I :)

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: Rules of thumbs for committers

2012-02-14 Thread Peter Schuller
> - Turn on rebase for public branches ("git config branch.$name.rebase
> true" for name in cassandra-0.8 through trunk)

And for the record, this means that "git pull" becomes the equivalent
of "git pull --rebase", avoiding a constant stream of merge commits on
every pull+push iteration.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: [VOTE] Release Apache Cassandra 1.1.0-beta1

2012-02-16 Thread Peter Schuller
+1

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: [VOTE] Release Apache Cassandra 1.0.8

2012-02-22 Thread Peter Schuller
+1 (but FYI changelog has a typo "ahndling").

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Peter Schuller
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!

I am very happy to see some momentum on this, and I would like to go
even further than what you propose. The main reasons why I do not
think simply adding vnodes and making random assignments is the best
end goal are:

(1) The probability of data loss increases linearly with cluster size.
(2) It does not take network topology into account.

What follows is mostly a half-finished long text that I have been
sitting on for a few months but not finished/posted. Unfortunately I
do not have the possibility (due to time constraints) to go through
everything in detail and update with current information and to
specifically address what you already said, so there will be overlap
with your original post. However given your E-Mail and the momentum in
this thread, I really wanted to post something rather than not. It
would be awesome if interested parties had a chance to read the
referenced CRUSH paper, and the deltas proposed below.

The goals are to address everything you already wanted in your post,
while also addressing:

(1) Probability of data loss
(2) Network topology awareness

The following text will first try to paint a picture of the goals that
I have in mind, and then go on to the actual proposed solution. The
proposition is very very short and undetailed now and there is plenty
of discussion and details to fill in. I apologize, but again, I really
want to post something now that this is being brought up.

BEGIN un-polished text ("we" = "I"):=

= CRUSHing Cassandra

Author: Peter Schuller 

This is a proposal for a significant re-design of some fundamentals of
Cassandra, aimed at addressing a number of current issues as well as
anticipating future issues. It is particularly aimed at large
clusters, but as a side-effect should improve the small cluster
experience as well.

== New terminology: Distribution factor

A Cassandra cluster is today said to have `N` nodes, and data is
replicated at a particular replication factor (`RF`). The placement of
replicas is such that all rows that have a certain node `N1` as their
primary replica are located on a specific set of `RF-1` other
nodes. In addition, `N1` holds secondary replicas of data for `RF-1`
other nodes. In total, it shares data with `2RF - 2` other nodes.

The number of nodes with whom a node shares data is the distribution
factor. In the case of Cassandra, `DF = 2RF - 2` (e.g. `RF = 3` gives
`DF = 4`).

== Goals

The goals this suggestion attempts to help solve include the following.

=== Goal: DF should not be tied to RF, nor N

`DF` is important for these reasons:

* The `DF` determines how many nodes are involved in re-constructing a
  lost node after failure; the higher the `DF`, the less of a
  performance impact a reconstruction has on remaining nodes.
* The `DF` determines the impact, with respect to read/write load,
  that a node being down has on other nodes.
* The `DF` affects the probability of multiple failures causing data
  loss, since one loses data if any `RF` nodes within a group of `DF`
  nodes all go down.

Having `DF` tied to `RF` like Cassandra does now has its problems. A
single node failure has a significant effect on the performance
characteristics of neighboring nodes (in terms relative to the normal
level of load on the neighbors).

On large data sets, a failed node needing reconstruction is a
significant event, as it

* Increases the load on neighbors just from going down.
* Further increases the load on neighbors as they have to stream data
(adding I/O
  and cache thrashing).

This typically leads to the desire to throttle / rate limit
reconstruction, which adds to the reconstruction window in addition to
the fact that it was already naturally bottlenecking on neighbors.

The other extreme is to tie `DF` to `N`, such that the data contained
on one node has its secondary replicas spread out over the entire
ring. This is an unacceptable choice because the probability of
multiple failures increases linearly with the cluster size.

In other words, we want `DF` to be tied to neither `RF` nor
`N`. Rather, `DF` should be chosen as a trade-off between the effects
of `DF`:

* The higher the `DF`, the higher the probability of data loss in case
  of multiple failures.
* The higher the `DF`, the faster to reconstruct/replace a lost node.
* The higher the `DF`, the less impact is seen on node failures on the
  performance requirements on other nodes.

In making this determination, one must take into account that if a
larger `DF` makes reconstruction/replacement significantly faster,
that also decreases the time window in which multiple failures can
occur. Increasing `DF` is thus not *necessarily* increasing the total
probability of data loss (for small values of `DF`).

=== Goal: Topologically aware redundancy

We maintain the goal of being topology aware f

Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Peter Schuller
Point of clarification: My use of the term "bucket" is completely
unrelated to the term "bucket" used in the CRUSH paper.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
> function to distribute partitions in the topology.

Yes, agreed. Also, the distribution factor limiting is compatible
with non-CRUSH schemes as well, by hash chaining from the primary
replica instead of the row key.

> This brings me on to (1) and the reasons for our choice of virtual
> node scheme - choose N random tokens - instead of the Dynamo-like
> scheme that you suggest where the partitions are fixed in advance.
> With the Dynamo scheme, the size of a virtual node partition will only
> ever grow as more data is inserted. Since the number of partitions is
> fixed when the cluster is created, the partition size is unbounded.

I'm not sure I see why my suggestion is Dynamo-like. In what way? It
is essentially random in the sense of converging toward uniformity,
but with the difference that the mapping is produced deterministically
(and in a stable fashion).
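
To illustrate what "deterministic and stable" can mean here, a small
sketch using rendezvous (highest-random-weight) hashing over a fixed set
of partitions - not necessarily the exact scheme I have in mind, just
one standard way to get a stateless mapping that only reassigns
partitions onto a newly added node (node names are hypothetical):

    import hashlib

    def score(partition, node):
        h = hashlib.sha256(f"{partition}:{node}".encode()).digest()
        return int.from_bytes(h, "big")

    def owner(partition, nodes):
        # Deterministic: no stored ring state; any node can recompute it.
        return max(nodes, key=lambda node: score(partition, node))

    nodes = ["node-a", "node-b", "node-c"]
    before = {p: owner(p, nodes) for p in range(16)}
    after = {p: owner(p, nodes + ["node-d"]) for p in range(16)}
    # The only changes are partitions moving onto the new node.
    print({p: after[p] for p in range(16) if before[p] != after[p]})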

> There are certain advantages to having a limit on partition size.
> Streaming failures that cause retries do not have to resend so much
> data. Streaming operations can be staggered in smaller chunks to
> minimise the impact on the nodes involved. Load balancing can operate
> on a finer granularity.

In my scheme the limit would be implicit in the number of partitions
(combined with the cluster size). And yes, this is a very good property
IMO. A related point, as mentioned, is that you can make sure to
structure the data in terms of these partitions. In terms of current
Cassandra, this means that "cleanup" becomes an instantaneous operation
instead of an expensive one that has to truck through a lot of data
(less so with leveled compaction).

> The other concern you mentioned was
>> The probability of data loss increases linearly with cluster size.
>
> but you also acknowledge that
>
>> In making this determination, one must take into account that if a
>> larger `DF` makes reconstruction/replacement significantly faster,
>> that also decreases the time window in which multiple failures can
>> occurr. Increasing `DF` is thus not *necessarily* increasing the total
>> probability of data loss (for small values of `DF`).
>
> Our calculations lead us to believe that in fact the shorter rebuild
> window more than compensates for the increased probability of multiple
> failure, so with DF=N the probability of data loss is minimised.
>
> The CRUSH paper also states:
>
> "With 2-way mirroring these two factors cancel
> each other out, while overall data safety with more than two
> replicas increases with declustering [Xin et al. 2004]"
>
> ("declustering" meaning increasing DF towards N)

This is a common argument against limiting RDF, but I am *strongly*
skeptical of it in real-life scenarios. This is one of those cases
where I think one needs to look beyond the pure math. For one thing, a
huge concern for me is that I don't want active re-balancing to have
such an extreme availability requirement that any downtime in that
process implies a significantly increased risk of data loss. I don't
want a system that is constantly teetering on the edge of data loss,
where it is not even safe to *shut it down* because avoiding data loss
depends on the system as a whole being fully available, working and
actively moving data around. It also hinges on reality matching theory
perfectly in terms of "independent" failures truly being independent. I
would feel much more comfortable with a system that did not rely on
super-fast re-replication to ensure data safety. But like I said, a lot
of people make this argument - I just remain unconvinced thus far.

Further, even looking at just the math, the claim cannot possibly hold
as N grows sufficiently large. At some point you bottleneck on the
network and no longer benefit from a higher RDF, yet the probability of
data loss keeps growing until you reach DF = number of partitions
(because only at that point does an increased cluster size stop
increasing the number of nodes that share data with any given node).
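
A rough numeric sketch of that saturation effect (all figures are
made-up assumptions, and failures are treated as independent):

    import math

    def p_loss(df, rf=3, mtbf_hours=10_000, data_gb=1_000,
               rate_per_source_gbph=50, inbound_cap_gbph=400):
        # Past the cap, a larger DF no longer shortens the rebuild window...
        rebuild_hours = data_gb / min(df * rate_per_source_gbph, inbound_cap_gbph)
        p = rebuild_hours / mtbf_hours
        # ...but the number of fatal failure combinations keeps growing.
        return math.comb(df, rf - 1) * p ** (rf - 1)

    for df in (4, 8, 64, 256):
        print(df, p_loss(df))

With these toy inputs the risk is flat up to the point where the
replacement node's inbound bandwidth saturates, and grows steadily with
DF after that.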

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
>> Using this ring bucket in the CRUSH topology, (with the hash function
>> being the identity function) would give the exact same distribution
>> properties as the virtual node strategy that I suggested previously,
>> but of course with much better topology awareness.
>
> I will have to re-read your original post. I seem to have missed something :)

I did, and I may or may not understand what you mean.

Are you comparing vnodes + hashing, with CRUSH + pre-partitioning by
hash + identity hash as you traverse down the topology tree?

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-19 Thread Peter Schuller
(I may comment on other things more later)

> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency. I

Actually, it does. At least assuming DF > RF (as in the original
proposal, and mine). The impact of a node suffering from a performance
degradation is mitigated because the effects are spread out over DF-1
(N-1 in the original post) nodes instead of just RF nodes.
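
Rough arithmetic, assuming uniformly distributed load (the RF and DF
values below are illustrative only):

    def extra_load_per_neighbor(sharing_nodes):
        # The affected node's share of work is spread over the nodes it
        # shares data with.
        return 1.0 / sharing_nodes

    rf, df = 3, 20
    print(extra_load_per_neighbor(rf - 1))  # ~0.50: +50% on each of 2 neighbors
    print(extra_load_per_neighbor(df - 1))  # ~0.05: +5% on each of 19 neighbors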

> think some progress could be made here by allowing a "pack" of independent
> Cassandra nodes to be run on a single host; somewhat (but nowhere near
> entirely) similar to a pre-fork model used by some UNIX-based servers.

I have a pretty significant knee-jerk negative reaction to that idea,
to be honest, even if the pack is limited to a handful of instances. In
order for vnodes to be useful with random placement, we'd need much
more than a handful of vnodes per node (i.e., Cassandra instances per
"pack" in that model).

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-20 Thread Peter Schuller
> Each node would have a lower and an upper token, which would form a range
> that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of these operations
> matched the replica's token range in the gossip tables. Each node would
> locally store its own current active token range as well as a target token
> range it's "moving" towards.

How is this significantly different from just using "nodetool move"
(in post-1.x) more rapidly and on smaller segments at a time?

There is the ring delay issue, which makes it unworkable to do at high
granularity, but that should apply to the active-range solution too.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Peter Schuller
> Software wise it is the same deal. Each node streams off only disk 4
> to the new node.

I think one implication for the software is that if you want to make
specific selections of partitions to move, you are effectively
incompatible with deterministically generating the mapping of
partition to responsible node. I.e., it probably means the vnode
information must be kept as state. That approach is probably difficult
to reconcile with balancing solutions like consistent hashing, CRUSH, etc.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Peter Schuller
> You would have to iterate through all sstables on the system to repair one
> vnode, yes: but building the tree for just one range of the data means that
> huge portions of the sstables files can be skipped. It should scale down
> linearly as the number of vnodes increases (ie, with 100 vnodes, it will
> take 1/100th the time to repair one vnode).

The story is less good for "nodetool cleanup" however, which still has
to truck over the entire dataset.

(The partitions/buckets in my CRUSH-inspired scheme address this by
allowing each ring segment, in vnode terminology, to be stored
separately in the file system.)
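
A sketch of what that buys you operationally, assuming a hypothetical
on-disk layout in which each partition/segment gets its own directory of
sstables (the layout and paths are assumptions for illustration, not
current Cassandra behavior):

    import shutil
    from pathlib import Path

    def cleanup(data_dir: Path, owned_partitions: set) -> None:
        # Drop data for partitions this node no longer owns by deleting
        # whole per-partition directories - no sstable rewrite required.
        for d in data_dir.glob("partition-*"):
            if int(d.name.split("-", 1)[1]) not in owned_partitions:
                shutil.rmtree(d)

    # cleanup(Path("/var/lib/cassandra/data/ks/cf"), {1, 7, 42})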

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Peter Schuller
> No I don't think you did, in fact, depending on the size of your SSTable a
> contiguous range (or the entire SSTable) may or may not be affected by a
> cleanup/move or any type of topology change. There is lots of room for
> optimization here. After loading the indexes we actually know start/end
> range for an SSTable so we can include/exclude it in any such operation.

Just note that unless there is some correlation between token range and
how these sstables are created in the first place (as with leveled
compaction), you're highly unlikely to be able to optimize here. For
uniformly distributed tokens (hashed keys), all sstables are likely to
contain almost the entire possible token range.
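
A quick simulation of that claim (not Cassandra code; uniform random
tokens stand in for hashed keys):

    import random

    RING = 2 ** 64
    for rows_per_sstable in (1_000, 10_000, 100_000):
        tokens = [random.randrange(RING) for _ in range(rows_per_sstable)]
        coverage = (max(tokens) - min(tokens)) / RING
        print(rows_per_sstable, f"{coverage:.4%}")  # quickly approaches 100%

So a min/max-token check per sstable will almost never let you skip a
file unless the compaction strategy itself keeps ranges narrow.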

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)