version.properties missing in trunk?

2010-07-13 Thread Morten Wegelbye Nissen

Hello,

Could it be that org/apache/cassandra/config/version.properties is 
missing in trunk? Or should this file be generated somewhere?


( FBUtilities.getCassandraVersionString is not feeling very good about 
the absence of this file )


./Morten


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Thomas Downing

On a related note:  I am running some feasibility tests looking for
high ingest rate capabilities.  While testing Cassandra the problem
I've encountered is that it runs out of file handles during compaction.
Until that event, there was no significant impact on throughput as
I was using it (0.1 query per second, ~10,000 records/query).

Up to this point, Cassandra was definitely in the lead among the
alternatives.

This was with 0.6.3, single node installation.  Ingest rate was about
4000 records/second, 1600 bytes/record, 24 bytes/key, using
batch_mutate.  Unfortunately, Cassandra seems unable to recover from
this state.  This occurs at about 100M records in the database.

I tried a 0.7.0 snapshot, but encountered earlier and worse problems.

The machine is 4 CPU AMD64 2.2GHz, 4GB.  There was no swapping.

The only mention of running out of file handles I found in the archives
or the defect list was related to queries - but I am notoriously blind.
I see the same behavior running ingest only, no queries.

I've blown away the logs and data, but if there is interest in further info
on this problem, such as stacktrace and specific numbers, I will re-run
the test and send them along.

Thanks much for all your work.

Thomas Downing

On 7/12/2010 10:52 PM, Jonathan Ellis wrote:
> This looks relevant:
> http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
> comments for directions to code sample)
>
> On Fri, Jul 9, 2010 at 1:52 AM, Peter Schuller wrote:
>>> It might be worth experimenting with posix_fadvise.  I don't think
>>> implementing our own i/o scheduler or rate-limiter would be as good a
>>> use of time (it sounds like you're on that page too).
>>
>> Ok. And yes I mostly agree, although I can imagine circumstances where
>> a pretty simple rate limiter might help significantly - albeit be
>> something that has to be tweaked very specifically for the
>> situation/hardware rather than being auto-tuned.
>>
>> If I have the time I may look into posix_fadvise() to begin with (but
>> I'm not promising anything).
>>
>> Thanks for the input!
>>
>> --
>> / Peter Schuller

Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Peter Schuller
> This looks relevant:
> http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html (see
> comments for directions to code sample)

Thanks. That's helpful; I've avoided JNI in the past, so I wasn't
familiar with the API, and the main difficulty was likely to be how
best to expose the functionality to Java. Having someone do almost
exactly the same thing helps ;)

I'm also glad they confirmed the effect in a very similar situation.
I'm leaning towards O_DIRECT as well, because:

(1) Even if posix_fadvise() is used, on writes you'll need to fsync()
before fadvise() anyway in order to allow Linux to evict the pages (a
theoretical OS implementation might remember the advise call, but
Linux doesn't - at least not up until recently; see the sketch after
this list).

(2) posix_fadvise() feels more obscure and less portable than
O_DIRECT, the latter being well-understood and used by e.g. databases
for a long time.

(3) O_DIRECT allows more direct control over when I/O happens and to
what extent (without playing tricks or making assumptions about e.g.
read-ahead) so will probably make it easier to kill both birds with
one stone.
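
For illustration, here is a minimal C sketch of the write-side sequence
from point (1) - my own sketch, untested, with error handling trimmed:

    #define _XOPEN_SOURCE 600   /* for posix_fadvise() */
    #include <fcntl.h>
    #include <unistd.h>

    /* fsync() first: Linux will not evict dirty pages on
     * POSIX_FADV_DONTNEED, so they must be clean for the
     * advice to have any effect. */
    int drop_written_pages(int fd, off_t offset, off_t len)
    {
        if (fsync(fd) != 0)
            return -1;
        /* tell the kernel we will not reuse this range */
        return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }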

You indicated you were skeptical about writing an I/O scheduler. While
I agree that writing a real I/O scheduler is difficult, I suspect that
if we do direct I/O a fairly simple scheme should work well. Being
able to tweak a target MB/sec rate, select a chunk size, and select
the time window over which to rate limit would, I suspect, go a long
way.
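
To make that concrete, a rough sketch of the kind of thing I have in
mind (my own illustration; the constants are arbitrary placeholders):

    /* Throttle chunked writes to a target byte rate: after each chunk,
     * sleep if we are ahead of the cumulative budget.  (This limits
     * over the whole run; a windowed variant would reset the budget
     * periodically.) */
    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK_SIZE (128 * 1024)         /* tunable chunk size      */
    #define TARGET_BPS (20 * 1024 * 1024)   /* tunable target, bytes/s */

    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Call once per chunk written. */
    void throttle(uint64_t bytes_written_so_far, double start_time)
    {
        double budget_elapsed = (double) bytes_written_so_far / TARGET_BPS;
        double real_elapsed = now_seconds() - start_time;
        if (real_elapsed < budget_elapsed)
            usleep((useconds_t) ((budget_elapsed - real_elapsed) * 1e6));
    }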

The situation is a bit special since in this case we are talking about
one type of I/O that is run during controlled circumstances
(controlled concurrency, we know how much memory we eat in total,
etc).

I suspect there may be a problem sustaining rates during high read
loads though. We'll see.

I'll try to make time for trying this out.

-- 
/ Peter Schuller


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Terje Marthinussen
> (2) posix_fadvise() feels more obscure and less portable than
> O_DIRECT, the latter being well-understood and used by e.g. databases
> for a long time.
>

Due to the need for doing data alignment in the application itself (you are
bypassing all the OS magic here), there is really nothing portable about
O_DIRECT. Just have a look at open(2) on Linux:

  O_DIRECT
       The O_DIRECT flag may impose alignment restrictions on the length
       and address of userspace buffers and the file offset of I/Os.  In
       Linux alignment restrictions vary by file system and kernel version
       and might be absent entirely.  However there is currently no file
       system-independent interface for an application to discover these
       restrictions for a given file or file system.  Some file systems
       provide their own interfaces for doing so, for example the
       XFS_IOC_DIOINFO operation in xfsctl(3).

So, just within Linux you have different mechanisms for this depending on
the kernel and filesystem in use, and you need to figure out what to do
yourself, as the OS will not tell you. Don't expect this alignment stuff to
be more standardized across OSes than it is within Linux. Still find this
portable?
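
Just to show what that means in practice, a sketch of the alignment dance
O_DIRECT forces on the application (illustration only; the 512-byte
alignment is an assumption that happens to hold on many, but not all,
filesystem/kernel combinations):

    #define _GNU_SOURCE        /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGNMENT 512      /* often the logical block size - a guess */

    int write_direct(const char *path, size_t len)
    {
        void *buf;
        /* the buffer address must be aligned ... */
        if (posix_memalign(&buf, ALIGNMENT, len) != 0)
            return -1;
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { free(buf); return -1; }
        /* ... and len and the file offset must be multiples of it too
         * (buffer contents omitted in this sketch) */
        ssize_t n = write(fd, buf, len);
        close(fd);
        free(buf);
        return n < 0 ? -1 : 0;
    }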

O_DIRECT also bypasses the cache completely, so you lose a lot of the I/O
scheduling and caching across multiple readers/writers in threaded apps and
separate processes which the OS may offer. This can especially be a big
loss when you have servers with loads of memory for large filesystem caches,
where you might find it hard to actually utilize the cache in the
application.

O_DIRECT was made to solve HW performance limitations on servers 10+ years
ago. It is far from an ideal solution today (but until stuff like fadvise is
implemented properly, somewhat unavoidable).

Best regards,
Terje


Re: version.properties missing in trunk?

2010-07-13 Thread Gary Dusbabek
`ant createVersionPropFile` takes care of this, as does `ant build`.

Gary.


On Tue, Jul 13, 2010 at 04:08, Morten Wegelbye Nissen wrote:
> Hello,
>
> Could it be that org/apache/cassandra/config/version.properties is missing
> in trunk? Or should this file be generated somewhere?
>
> ( FBUtilities.getCassandraVersionString is not feeling very good about the
> absence of this file )
>
> ./Morten
>


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Peter Schuller
> Due to the need for doing data alignment in the application itself (you are
> bypassing all the OS magic here), there is really nothing portable about
> O_DIRECT. Just have a look at open(2) on Linux:

[snip]

> So, just within Linux you have different mechanisms for this depending on
> the kernel and filesystem in use, and you need to figure out what to do
> yourself, as the OS will not tell you. Don't expect this alignment stuff to
> be more standardized across OSes than it is within Linux. Still find this
> portable?

The concept of direct I/O, yes. I don't have experience of what the
practical portability is with respect to alignment, however, so maybe
those details are a problem. But things like under what circumstances
which flags to posix_fadvise() actually have the desired effect
don't feel very portable either.

One might have a look at to what extent direct I/O works well across
platforms in e.g. PostgreSQL or something like that. But maybe you're
right and O_DIRECT is just not worth it.

> O_DIRECT also bypasses the cache completely, so you lose a lot of the I/O

That was the intent.

> scheduling and caching across multiple readers/writers in threaded apps and
> separate processes which the OS may offer.

This is specifically what I want to bypass. I want to bypass the
operating system's caching to (1) avoid trashing the cache and (2)
know that a rate limited write translates fairly well to underlying
storage. Rate limiting asynchronous writes will often be less than
ideal since the operating system will tend to, by design, defer
writes. This aspect can of course be overcome with fsync(), which does
not even require native code - a big point in its favor. But if we
still need native code for posix_fadvise() anyway (for reads), then
that hit is taken anyway.

But sure. Perhaps posix_fadvise() in combination with regular
fsync():ing on writes is preferable to direct I/O (fsync() being
required both for rate limiting purposes, if one is to combat deferred
writes, and for avoiding cache eviction, given the way fadvise works
in Linux atm).

> This can especially be a big
> loss when you have servers with loads of memory for large filesystem caches,
> where you might find it hard to actually utilize the cache in the
> application.

The entire point is to bypass the cache during compaction. But this
does not (unless I'm mistaken about how Cassandra works) invalidate
already pre-existing caches at the Cassandra/JVM level. In addition,
for large data sets (large being significantly larger than RAM size),
the data pulled into cache as part of compaction is not going to be
useful anyway, as is. There is the special case where all or most
data fit in RAM and having all compaction I/O go through the cache may
even be preferable; but in the general case, I really don't see the
advantage of having that I/O go through cache.

If you do have most or all data in RAM, then certainly having all that
data warm at all times is preferable to doing I/O on a cold buffer
cache against sstables. But on the other hand, any use of direct I/O
or fadvise() will be optional (presumably). And given a setup whereby
your performance is entirely dependent on most data being in RAM at
all times, you will already have issues with e.g. cold starts of
nodes.

In any case, I definitely consider there to be good reasons not to
rely only on operating system caching; compaction is one of these
reasons, both with and without direct I/O or fadvise().

> O_DIRECT was made to solve HW performance limitations on servers 10+ years
> ago. It is far from an ideal solution today (but until stuff like fadvise is
> implemented properly, somewhat unavoidable).

I think there are pretty clear and obvious use-cases where the cache
eviction implied by large bulk streaming operations on large amounts
of data is not what you want (there are any number of practical
situations where this has been an issue for me, if nothing else). But
if I'm overlooking something that would mean that this optimization,
trying to avoid eviction, is useless with Cassandra, please do explain
it to me :)

But I'll definitely buy that posix_fadvise() is probably a cleaner solution.

-- 
/ Peter Schuller


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Jonathan Ellis
On Tue, Jul 13, 2010 at 4:19 AM, Thomas Downing wrote:
> On a related note:  I am running some feasibility tests looking for
> high ingest rate capabilities.  While testing Cassandra the problem
> I've encountered is that it runs out of file handles during compaction.

This usually just means "increase the allowed fh via ulimit/etc."

Increasing the memtable thresholds so that you create fewer sstables,
but larger ones, is also a good idea.  The defaults are small so
Cassandra can work on a 1GB heap, which is much smaller than most
production ones.  Reasonable rule of thumb: if you have a heap of N
GB, increase both the throughput and count thresholds by N times.
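
For example - assuming the 0.6 storage-conf.xml element names and the
64MB / 0.3 million defaults - a 4GB heap would suggest something like
the following, alongside raising the fd limit with ulimit -n before
starting the daemon:

    <!-- scaled ~4x from the defaults for a 4GB heap (sketch only;
         check the names against your storage-conf.xml) -->
    <MemtableThroughputInMB>256</MemtableThroughputInMB>
    <MemtableOperationsInMillions>1.2</MemtableOperationsInMillions>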

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Jonathan Ellis
On Tue, Jul 13, 2010 at 6:18 AM, Terje Marthinussen wrote:
> Due to the need for doing data alignment in the application itself (you are
> bypassing all the OS magic here), there is really nothing portable about
> O_DIRECT.

I'm totally fine with saying "Here's a JNI library for Linux [or even
Linux version >= 2.6.X]" since that makes up 99% of our production
deployments, and leaving the remaining 1% with the status quo.

> O_DIRECT also bypasses the cache completely

Right, that's the idea. :)

> O_DIRECT was made to solve HW performance limitations on servers 10+ years
> ago. It is far from an ideal solution today (but until stuff like fadvise is
> implemented properly, somewhat unavoidable).

Exactly: the fadvise mode that would actually be useful to us is a
no-op and not likely to change soon. A bit of history:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm1/broken-out/fadvise-make-posix_fadv_noreuse-a-no-op.patch

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Terje Marthinussen
On Tue, Jul 13, 2010 at 10:26 PM, Jonathan Ellis wrote:

>
> I'm totally fine with saying "Here's a JNI library for Linux [or even
> Linux version >= 2.6.X]" since that makes up 99% of our production
> deployments, and leaving the remaining 1% with the status quo.
>

You really need to say Linux > 2.6 and filesystem xyz.

That probably reduces the percentage a bit, but probably not critically.

It has been quite a while since I wrote code for direct I/O (I really try
to avoid using it anymore), but from memory, as long as there is a framework
which is somewhat extendable and can be used as a basis for new platforms,
it should be reasonably trivial for a somewhat experienced person to add a
new Unix-like platform in a couple of days.

No idea about Windows; I have never written this kind of code there.


> > O_DIRECT also bypasses the cache completely
>
> Right, that's the idea. :)
>

Hm... I would have thought it was clear that my idea is that you do want to
interact with the cache if you can! :)

Under high load, you might reduce performance 10-30% by throwing out the
scheduling benefits you get from the OS (yes, that is based on real life
experience). Of course, that is given that you can somehow avoid the
worst case scenarios without direct I/O. As always, things will differ from
use case to use case.

A well-performing HW RAID card with sufficient writeback cache might also
help reduce the negative impact of direct I/O.

Funny enough, it is often the systems with light read load that are hardest
hit. Systems with heavy read load have more pressure on the cache on the
read side, and the writes will not push content out of the cache (or
applications out of physical memory) as easily. To make things more
annoying, OSes (not just Linux) have a tendency to behave differently from
release to release. What is a problem on one Linux release is not
necessarily a problem on another.

I have not seen huge problems when compacting on Cassandra in terms of I/O
myself, but I am currently working on HW with loads of memory, so I might
not see the problems others see. I am more concerned with other performance
issues at the moment.

One nifty effect which may, or may not, be worth looking into: when you flip
over to the new compacted SSTable, the last thing you wrote to it will be
sitting in cache, ready to be read once you start using it. It can as such
be worth ordering the compaction so that the most performance-critical parts
are written last, and written without direct I/O or similar settings, so
they will be ready in cache when needed.

I am not sure to what extent parts of the SSTables have structures of
importance like this for Cassandra. Haven't really thought about it until
now.

Might also be worth looking at I/O scheduler settings in the Linux kernel.
Some of the I/O schedulers also support ionice/io priorities.

I have never used it on single threads, but I have read that ioprio_set()
accepts thread ids (not just process ids, as the man page indicates). While
not super efficient, in my experience, at preventing cache flushing of
mostly idle data, if the compaction I/O occurs in isolated threads, so that
ionice can be applied to those threads, it should help.
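
A sketch of what that could look like on Linux (glibc has no ioprio_set()
wrapper, so this goes through syscall(2); the macros are copied by hand
from linux/ioprio.h since that header is not always installed for
userspace):

    #include <sys/syscall.h>
    #include <unistd.h>

    #define IOPRIO_CLASS_SHIFT 13
    #define IOPRIO_CLASS_IDLE  3  /* only gets disk time when disk is idle */
    #define IOPRIO_WHO_PROCESS 1  /* takes a pid - or a tid, per the above */
    #define IOPRIO_PRIO_VALUE(cls, data) \
        (((cls) << IOPRIO_CLASS_SHIFT) | (data))

    /* Mark one thread (by kernel tid) idle-priority so its
     * compaction I/O yields to everything else. */
    int make_thread_io_idle(pid_t tid)
    {
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, tid,
                       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
    }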



> Exactly: the fadvise mode that would actually be useful to us is a
> no-op and not likely to change soon. A bit of history:

Interesting, I had not seen that before.
Thanks!

Terje


Minor point on FailureDetector

2010-07-13 Thread Masood Mortazavi
In FailureDetector, in ArrivalWindow, in "double p(double t)",

is

return 1 - ( 1 - Math.pow(Math.E, exponent) );

really needed, instead of

return Math.pow(Math.E, exponent);

I believe the integral of the exponential distribution from "t" to
infinity leads to the latter value.
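
Spelling that step out, with lambda as the rate parameter (1/mean
inter-arrival time here, so that the code's exponent corresponds to
-lambda * t):

    P(X > t) = \int_t^\infty \lambda e^{-\lambda x}\,dx
             = \left[ -e^{-\lambda x} \right]_t^\infty
             = e^{-\lambda t}

which is exactly Math.pow(Math.E, exponent); the 1 - (1 - ...) wrapper
cancels out.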

While most Java compilers are probably smart enough to take care of the
redundant subtractions, I see no explanatory value in the former form. I
just ran into this as I was reviewing the accrual failure detector code.

- m .