performance with a "large" number of supercolumns/columns

2010-06-22 Thread Terje Marthinussen
Hi,

I was looking a bit at a case we have with a column family which has 439k
supercolumns, each supercolumn with ~3 columns, for a total of some 1.3
million objects.

This takes about 9 seconds to read over Thrift on first access, 4-5 seconds
on second access.

Taking a closer look at this, I noticed that roughly half of this time was
spent in Cassandra.

I will look more at this, but I thought I would ask people here first, in
case someone already has a good explanation...

Most of the time is spent here:

    public void collectReducedColumns(IColumnContainer container, Iterator<IColumn> reducedColumns, int gcBefore)
    {
        int liveColumns = 0;
        AbstractType comparator = container.getComparator();

        while (reducedColumns.hasNext())
        {
            if (liveColumns >= count)
                break;

            IColumn column = reducedColumns.next();
            if (logger.isDebugEnabled())
                logger.debug(String.format("collecting %s of %s: %s",
                                           liveColumns, count, column.getString(comparator)));

            if (finish.length > 0
                && ((!reversed && comparator.compare(column.name(), finish) > 0)
                    || (reversed && comparator.compare(column.name(), finish) < 0)))
                break;

            // only count live columns towards the `count` criteria
            if (!column.isMarkedForDelete()
                && (!container.isMarkedForDelete()
                    || (ClockRelationship.GREATER_THAN ==
                        column.mostRecentLiveChangeAt().compare(container.getMarkedForDeleteAt()))))
            {
                liveColumns++;
            }

            // but we need to add all non-gc-able columns to the result for read repair:
            if (QueryFilter.isRelevant(column, container, gcBefore))
                container.addColumn(column);
        }
    }

Adding some time-measuring prints revealed a few interesting things.
1. The first time I request the row (I request the entire row in this case),
collectReducedColumns() is called twice. I have not had time to understand
why yet, but there is one difference between the calls: all columns are
returned both times, but the first call is made with Integer.MAX_VALUE as
"count", while the second call uses the max count actually requested by the
client. The first call takes about 3.7 seconds to process, the second about
1.7 seconds. Whatever the reason, I think we should be able to remove one of
these calls?

2. Almost half the time is spent in container.addColumn(). This is probably
no big surprise, as there is a lot of code hidden "down there". I am however
very curious whether it could be significantly optimized, especially in the
case where the predicates cause all columns to be included. That said, I
have not managed to read enough Cassandra code yet to understand tombstones
or all the other things going on in that part of the code.

3. A bit more of a surprise: most of the remaining 50% of the time seems to
be spent in while (reducedColumns.hasNext()).
That is, if you save System.nanoTime() at the end of the loop body and again
after hasNext(), and sum up the differences, it accounts for almost 50% of
the total time. That seems quite weird to me.
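Roughly like this (a simplified sketch of the idea, not the exact
instrumentation I used; variable names follow the code above):

    long iterNanos = 0, bodyNanos = 0;
    while (true)
    {
        long t0 = System.nanoTime();
        boolean more = reducedColumns.hasNext();
        if (!more)
        {
            iterNanos += System.nanoTime() - t0;
            break;
        }
        IColumn column = reducedColumns.next();
        iterNanos += System.nanoTime() - t0;

        long t1 = System.nanoTime();
        // ... original loop body: count/finish checks, tombstone logic,
        // container.addColumn(column) ...
        bodyNanos += System.nanoTime() - t1;
    }
    logger.info(String.format("iterator: %.1f ms, body: %.1f ms",
                              iterNanos / 1e6, bodyNanos / 1e6));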

I will dig more, but I was curious if someone had answers already.

Best regards,
Terje


Re: performance with a "large" number of supercolumns/columns

2010-07-07 Thread Terje Marthinussen
Hi,

yes, I have been looking a bit at this, but not as much as I wanted to.

I kind of forgot about the whole double call on first access (probably
something related to caching?), since even with that fixed, the performance
is not good enough for what I need.

The hasNext() call goes through both Apache commons-collections and
google-collections, and simply speaking, the amount of code which ends up
being executed for the hasNext()/next() part of this loop is a bit
ridiculous.

I started to optimize some code in commons-collections and got a 10-20%
improvement (for my specific test case), but as I started understanding both
the collection code and the Cassandra code better, it started dawning on me
that:

1. There is just too much code and too many "layers" involved when hasNext()
is called. I suspect this requires a redesign, and the
google/commons-collections libraries may have to be thrown out. This seems
to be a performance-critical area of Cassandra, so I suspect the cost of
building functionality tuned specifically for Cassandra would be worth it
(see the sketch after this list).

2. From what I understand, columns from all SSTables/Memtables are merged
before tombstones are considered and the slice predicates applied. To reduce
deserialization and Column object creation, maybe some of this filtering and
merging could/should be done before Column objects are even created?

While this may make things somewhat hairy code-wise, JVM performance and
memory consumption tend not to be happy in scenarios with heavy object
creation, and we get a lot of objects here.

3. There would seem to be good opportunities to optimize the case where a
complete set of columns is fetched (slices with null start/end predicates),
but I am currently not sure how much time that would save, or whether it
would be worth the work and the added code complexity.
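To illustrate what I mean by a flatter design for 1., here is a minimal
sketch of a purpose-built k-way merge iterator. This is my own illustration,
assuming each SSTable/Memtable scanner yields columns in name order; it does
not yet reduce equal names the way Cassandra must, but it shows the hot path
with a single heap and no extra collection layers:

    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.PriorityQueue;

    public final class MergeIterator<T> implements Iterator<T>
    {
        private static final class Source<T>
        {
            final Iterator<T> iter;
            T head;
            Source(Iterator<T> iter) { this.iter = iter; this.head = iter.next(); }
        }

        private final PriorityQueue<Source<T>> queue;

        public MergeIterator(Iterable<Iterator<T>> sources, final Comparator<T> cmp)
        {
            queue = new PriorityQueue<Source<T>>(16, new Comparator<Source<T>>()
            {
                public int compare(Source<T> a, Source<T> b) { return cmp.compare(a.head, b.head); }
            });
            for (Iterator<T> it : sources)
                if (it.hasNext())
                    queue.add(new Source<T>(it));
        }

        public boolean hasNext() { return !queue.isEmpty(); }

        public T next()
        {
            Source<T> src = queue.poll();
            T result = src.head;
            if (src.iter.hasNext())
            {
                src.head = src.iter.next();
                queue.add(src); // reinsert under its new head
            }
            return result;
        }

        public void remove() { throw new UnsupportedOperationException(); }
    }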

I may make an attempt at 1., but I don't feel I understand enough of the
Cassandra code yet to do any of the above at the moment, so it may take a
little while.

Regards,
Terje

On Thu, Jul 8, 2010 at 8:07 AM, Jonathan Ellis  wrote:

> Hi Terje,
>
> Sorry to not get to this sooner.  Are you still looking into this?
>

Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Terje Marthinussen
> (2) posix_fadvise() feels more obscure and less portable than
> O_DIRECT, the latter being well-understood and used by e.g. databases
> for a long time.
>

Due to the need to do data alignment in the application itself (you are
bypassing all the OS magic here), there is really nothing portable about
O_DIRECT. Just have a look at open(2) on Linux:

   O_DIRECT
       The O_DIRECT flag may impose alignment restrictions on the length
       and address of userspace buffers and the file offset of I/Os. In
       Linux alignment restrictions vary by file system and kernel version
       and might be absent entirely. However there is currently no file
       system-independent interface for an application to discover these
       restrictions for a given file or file system. Some file systems
       provide their own interfaces for doing so, for example the
       XFS_IOC_DIOINFO operation in xfsctl(3).

So, just within Linux you get different mechanisms for this depending on the
kernel and filesystem in use, and you need to figure out what to do
yourself, as the OS will not tell you. Don't expect this alignment business
to be any more standardized across OSes than it is inside Linux. Still find
this portable?

O_DIRECT also bypasses the cache completely, so you lose a lot of the I/O
scheduling and caching across multiple readers/writers in threaded apps and
separate processes which the OS may offer. This can be an especially big
loss when you have servers with loads of memory for large filesystem caches,
where you might find it hard to actually utilize the cache from the
application.

O_DIRECT was made to work around HW performance limitations on servers 10+
years ago. It is far from an ideal solution today (but until something like
fadvise is implemented properly, somewhat unavoidable).
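For comparison, here is a minimal sketch of what calling posix_fadvise(2)
from Java via JNA could look like. This is my own illustration; the advice
constant value is a Linux-specific assumption, so check your headers:

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    // Direct mapping of posix_fadvise(2) from libc via JNA.
    public interface CLibrary extends Library
    {
        CLibrary INSTANCE = (CLibrary) Native.loadLibrary("c", CLibrary.class);

        int POSIX_FADV_DONTNEED = 4; // Linux value, from <fcntl.h>

        // int posix_fadvise(int fd, off_t offset, off_t len, int advice);
        int posix_fadvise(int fd, long offset, long len, int advice);
    }

    // Usage: after streaming out data we do not expect to read again soon,
    // tell the kernel the pages are droppable instead of bypassing the cache:
    // CLibrary.INSTANCE.posix_fadvise(fd, 0, 0, CLibrary.POSIX_FADV_DONTNEED);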

Best regards,
Terje


Re: Minimizing the impact of compaction on latency and throughput

2010-07-13 Thread Terje Marthinussen
On Tue, Jul 13, 2010 at 10:26 PM, Jonathan Ellis  wrote:

>
> I'm totally fine with saying "Here's a JNI library for Linux [or even
> Linux version >= 2.6.X]" since that makes up 99% of our production
> deployments, and leaving the remaining 1% with the status quo.
>

You really need to say Linux > 2.6 and filesystem xyz.

That probably reduces the percentage a bit, but probably not critically.

It is quite a while since I have written direct I/O code (I really try to
avoid it these days), but from memory, as long as there is a framework which
is somewhat extensible and can be used as a basis for new platforms, it
should be reasonably trivial for a somewhat experienced person to add a new
Unix-like platform in a couple of days.

No idea about Windows; I have never written that kind of code there.


> > O_DIRECT also bypasses the cache completely
>
> Right, that's the idea. :)
>

Hm... I would have thought it was clear that my idea is that you do want to
interact with the cache if you can! :)

Under high load, you might reduce performance 10-30% by throwing away the
scheduling benefits you get from the OS (yes, that is based on real-life
experience). Of course, that assumes you can somehow avoid the worst-case
scenarios without direct I/O. As always, things will differ from use case to
use case.

A well-performing HW RAID card with sufficient writeback cache might also
help reduce the negative impact of direct I/O.

Funnily enough, it is often the systems with light read load that are
hardest hit. Systems with heavy read load put more pressure on the cache on
the read side, so writes will not push content out of the cache (or
applications out of physical memory) as easily. To make things more
annoying, OSes (not just Linux) have a tendency to behave differently from
release to release. What is a problem on one Linux release is not
necessarily a problem on another.

I have not seen huge I/O problems when compacting in Cassandra myself, but I
am currently working on HW with loads of memory, so I might not see the
problems others see. I am more concerned with other performance issues at
the moment.

One nifty effect which may, or may not, be worth looking into: when you flip
over to the new compacted SSTable, the last thing you wrote to it will still
be sitting in cache, ready to be read once you start using it. It can as
such be worth ordering the compaction so that the most performance-critical
parts are written last, and written without direct I/O or similar settings,
so they are ready in cache when needed.

I am not sure to what extent parts of the SSTables have structures of
importance like this for Cassandra. I hadn't really thought about it until
now.

It might also be worth looking at I/O scheduler settings in the Linux
kernel. Some of the I/O schedulers also support ionice/I/O priorities.

I have never used it on single threads, but I have read that ioprio_set()
accepts thread IDs (not just process IDs, as the man page indicates). While
not super efficient, in my experience, at preventing cache flushing of
mostly idle data, if the compaction I/O occurs in isolated threads so that
ionice can be applied to those threads, it should help (see the sketch
below).
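A rough sketch of per-thread I/O priority from Java via JNA. All the numbers
below are Linux x86-64 assumptions taken from kernel headers (SYS_gettid,
SYS_ioprio_set, the IOPRIO_* constants), so verify them for your platform
before using anything like this:

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    public interface Syscalls extends Library
    {
        Syscalls INSTANCE = (Syscalls) Native.loadLibrary("c", Syscalls.class);

        // long syscall(long number, ...); mapped as a variadic function
        int syscall(int number, Object... args);
    }

    public final class CompactionIoNice
    {
        // Assumed Linux x86-64 values; check your kernel headers.
        private static final int SYS_GETTID = 186;
        private static final int SYS_IOPRIO_SET = 251;
        private static final int IOPRIO_WHO_PROCESS = 1;
        private static final int IOPRIO_CLASS_IDLE = 3;
        private static final int IOPRIO_CLASS_SHIFT = 13;

        // Call from inside the compaction thread to move only its I/O to the
        // idle class, roughly what "ionice -c3" does for a whole process.
        public static void idleCurrentThread()
        {
            int tid = Syscalls.INSTANCE.syscall(SYS_GETTID);
            int ioprio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT;
            int rc = Syscalls.INSTANCE.syscall(SYS_IOPRIO_SET, IOPRIO_WHO_PROCESS, tid, ioprio);
            if (rc != 0)
                throw new IllegalStateException("ioprio_set failed: " + rc);
        }
    }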



> Exactly: the fadvise mode that would actually be useful to us is a
> no-op and not likely to change soon. A bit of history:

Interesting, I had not seen that before.
Thanks!

Terje


column family names

2010-08-30 Thread Terje Marthinussen
Hi,

Now that we can create column families on the fly, it gets interesting to
use column families more as part of the data model (it can reduce disk space
quite a bit vs. supercolumns in some cases).

However, currently the column family name validator is pretty strict,
allowing only word characters, and in some cases it is pretty darned nice to
be able to put something like a "-" inbetweenallthewords.

Any reason to be this strict, or could it be loosened up a little bit?
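Something like this is all I am asking for. A sketch only; the strict
"\\w+" pattern is my assumption based on the word-character restriction
mentioned above, not a quote from the actual validator:

    import java.util.regex.Pattern;

    public final class CfNameValidator
    {
        // Assumed current behaviour: word characters only.
        private static final Pattern STRICT = Pattern.compile("\\w+");
        // Proposed: also allow '-' between the words.
        private static final Pattern RELAXED = Pattern.compile("[\\w-]+");

        public static boolean isLegalName(String name)
        {
            return RELAXED.matcher(name).matches();
        }
    }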

Terje


cassandra disk usage

2010-08-30 Thread Terje Marthinussen
Hi,

Was just looking at an SSTable file after loading a dataset. The data load
has no updates of data, but:
- Columns can in some rare cases be added to existing supercolumns
- SuperColumns will be added to the same key (but not overwriting existing
data). I batch these, but it is quite likely that there will be 2-3 updates
to a key.

This is a randomly selected SSTable file from a much bigger dataset.

The data is stored as date (supercolumn) / type (column) / value.
Date is a simple "20100811"-style string.
Value is a small integer, 2 digits on average.

If I run a simple strings on the SSTable and look for the data:
value: 692 KB of data
type: 4.01 MB of data
date: 4.6 MB of data

In total: 9.4 MB

The size of the .db file, however, is 36.4 MB...

The expansion from the column headers is bad enough, but I can somehow
accept that. The almost 4x expansion on top of that is a bit harder to
justify...

Does anyone already know where this expansion comes from? Or do I need to
take a careful look at the source (probably useful anyway :))
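For a rough sense of where some of it could go, here is a back-of-the-envelope
per-column cost, assuming the 0.6/0.7-era column serialization (2-byte name
length, 1 flag byte, 8-byte timestamp, 4-byte value length). Treat the layout
as my assumption, not a spec:

    // Bytes on disk for one column under the assumed layout; supercolumn
    // headers, row keys, bloom filters and index entries come on top of this.
    static int serializedColumnSize(int nameLen, int valueLen)
    {
        return 2 + nameLen   // name length + name
             + 1             // flags
             + 8             // timestamp
             + 4 + valueLen; // value length + value
    }
    // e.g. a 4-byte type name with a 2-byte value costs ~21 bytes on disk
    // for 6 bytes of payload, a 3-4x expansion before any other overhead.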

Terje


Re: Simple Compression Idea

2011-01-31 Thread Terje Marthinussen
There is a lot of overhead in the serialized data itself (just have a look
at an sstable file).

It would be great to be able to compress at the byte-array level rather than
the string level.
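For illustration, operating on byte[] directly is straightforward with the
JDK's built-in deflate support. A minimal sketch only, not a proposal for
the actual codec:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.DeflaterOutputStream;

    public final class ByteArrayCompressor
    {
        // Compress a raw byte array without ever going through String/UTF-8.
        public static byte[] compress(byte[] input) throws IOException
        {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DeflaterOutputStream dos = new DeflaterOutputStream(bos);
            dos.write(input);
            dos.close(); // finishes the deflate stream and flushes
            return bos.toByteArray();
        }
    }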

Regards,
Terje

On 1 Feb 2011, at 03:15, "David G. Boney"  wrote:

> In Cassandra, strings are stored as UTF-8. In arithmetic coding compression, 
> the modeling is separate from the coding. A standard arrangement is to have a 
> 0-order model, frequencies of individual bytes, 1-order model, frequencies of 
> two byte occurrences, and 2-order models, frequencies of three byte 
> occurrences.  For multibyte unicode encodings, which almost all fall in the 
> two or three byte encodings, the higher order models should capture the 
> statistics. Having said that, there is nothing to prevent someone from 
> developing a UTF-8 specific model for capturing the statistics better.
> 
> An adaptive arithmetic coding model starts with the probabilities of all the 
> symbols being equal. This results in all the symbols using 8-bits per 
> encoding. The model adapts with each symbol it sees but it takes seeing some 
> data to start seeing compression. I do not know, nor have I seen reports in 
> the literature, of how many bytes, on average, it takes to start seeing 
> compression. All the papers I have seen have studied large blocks of text. 
> 
> My assumption is that the majority of strings to compress in Cassandra will 
> be short strings, less than 32 bytes, to medium length strings, less than 256 
> bytes. An example of short strings might be column names. An example of 
> medium length strings would be URLs. My assumption is that most short strings 
> would not get much compression from adaptive arithmetic coding compression 
> but medium length to longer strings would. In an effort to get higher 
> compression on the short strings, static encoding methods could be used. A 
> static encoding method uses a hard coded frequency table for the bytes. The 
> start of the compressed string could have a 2 bit code, 00-no compression, 
> 01-static, default probability table, 10-static, locale probability table, 
> 11-adaptive coding. The default static coding would be based on English. For 
> the static locale probability table, the 2 bit code would need to be followed 
> by a code table, indicating the probability table to use for a particular 
> language. The locale probability tables would be stored in the compressor jar 
> file as a separate class for each locale supported (This issue needs more 
> analysis, what I don't think is effective is to load the probability tables 
> from disk each time a new compressor is created). During the encoding phase, 
> both adaptive and static coding of the string would occur. In general, 
> compression with a static coding table is very fast. Static coding is 
> basically a table lookup from a table with 256 entries. It is the adaptive 
> coding that is more computationally intensive. Furthermore, you could place a 
> limit on the use of static coding, say strings less than 256 bytes. This 
> would need to be set empirically. The shorter string from the two methods is 
> used as the compressed string. There is no additional burden on the decoding 
> side, since you know the type of compression encoding.
> 
> It might be the case that block compressors can achieve greater compression 
> ratios. From what I have read on the mailing list and JIRA, it appears that 
> there is a lot of pain involved to implement block based compressors. This 
> method, a compressed string data type, is presented as a method that is 
> minimally invasive to the Cassandra architecture. Since everything is already 
> stored as a byte array, nothing would need to be changed in the internal data 
> types. Furthermore, no surgery of the internal tables is needed. The main 
> addition is the development of comparators, for keys and column headings, 
> that know how to decompress the byte arrays before comparing them. There is 
> also the additional work on writing the codec, but there are a number of 
> examples in C and Java from which to draw. Moffat's web site would be a good 
> place to start.
> 
> -
> Sincerely,
> David G. Boney
> dbon...@semanticartifacts.com
> http://www.semanticartifacts.com
> 
> 
> 
> 
> On Jan 31, 2011, at 2:19 AM, Stephen Connolly wrote:
> 
>> On 31 January 2011 04:41, David G. Boney  
>> wrote:
>>> I propose a simple idea for compression using a compressed string datatype.
>>> 
>>> The compressed string datatype could be implemented for column family keys 
>>> by creating a compressed string ordered partitioner. The compressed string 
>>> ordered partitioner works by decompressing the string and then applying an 
>>> ordered partitioner for strings to the decompressed string. The hash based 
>>> partitioner would be used with the compressed string without any 
>>> modification. The compressed string datatype could be implemented for 
>>> column keys by creating a compressed string comparator. 

2GB rows and errors

2011-03-04 Thread Terje Marthinussen
Hi,

Any good reason this guy

    public int bytesPastMark(FileMark mark)
    {
        assert mark instanceof BufferedRandomAccessFileMark;
        long bytes = getFilePointer() - ((BufferedRandomAccessFileMark) mark).pointer;

        assert bytes >= 0;
        if (bytes > Integer.MAX_VALUE)
            throw new UnsupportedOperationException("Overflow: " + bytes);
        return (int) bytes;
    }

does not show an error more like "Overflow: Maximum row size is 2GB.
Currently: " + bytes?

The error you get today is not exactly self-explanatory :)

Terje


Re: 2GB rows and errors

2011-03-04 Thread Terje Marthinussen
Ah, yes, I should have noticed that distinction.

We actually hit this overflow on a row that was more than 60GB (yes, we had
to count the number of digits a few times to make sure).

On Sat, Mar 5, 2011 at 5:41 AM, Jonathan Ellis  wrote:

> Second try:
>
> - this isn't used in row size (which is not limited to 2GB)
> - it's used both for the column index summary and index-block reading,
> both of which should be well under 2GB
> - however, I don't see any technical reason this method should return
> an int instead of a long
> - if we make that change we should probably do additional sanity
> checks in the callers, which will have the necessary context to
> provide better error messages
>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
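A minimal sketch of the change Jonathan describes, returning long and
letting callers do context-aware range checks; method and field names follow
the snippet quoted in this thread:

    public long bytesPastMark(FileMark mark)
    {
        assert mark instanceof BufferedRandomAccessFileMark;
        long bytes = getFilePointer() - ((BufferedRandomAccessFileMark) mark).pointer;
        assert bytes >= 0;
        return bytes; // callers can range-check and report precise errors
    }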


Re: [VOTE] Apache Cassandra 0.8.0-beta1

2011-04-19 Thread Terje Marthinussen
Unfortunate, as it means beta1 is useless for our testing, but I guess it
does not make much difference if we try 0.8 trunk instead.

Terje

On Tue, Apr 19, 2011 at 9:46 PM, Jonathan Ellis  wrote:

> We typically only block betas for major regressions, not bugs that are
> already present in a released version (all of 0.7 releases in this
> case).  So yes, this will be in the next one after beta1 (probably rc1
> since 0.8 won't be a moving target like 0.7 was).
>
> On Tue, Apr 19, 2011 at 7:42 AM, Shotaro Kamio 
> wrote:
> > -1,
> > I'd like 0.8 to include the fix for CASSANDRA-2406. The fix was
> > committed to the 0.8 branch recently. I'll wait for beta2.
> >
> >
> > Regards,
> > Shotaro
> >
> >
> > On Tue, Apr 19, 2011 at 10:42 AM, Jonathan Ellis 
> wrote:
> >> CASSANDRA-2448 means
> https://issues.apache.org/jira/browse/CASSANDRA-2448
> >>
> >> On Mon, Apr 18, 2011 at 8:32 PM, Huang Stanley 
> wrote:
> >>> +1
> >>>
> >>> I'm looking forward to 0.8 :-)
> >>>
> >>> And one question about 0.8 beta1,
> >>> I saw an item in the change log:
> >>>
> >>>
> >>> * remove "nodetool loadbalance" (CASSANDRA-2448)
> >>>
> >>> Any reason? Any alternative solution to keep the data "balanced"?
> >>>
> >>> regards,
> >>>
> >>> Stanley Huang
> >>>
> >>> regards,
> >>>
> >>> Stanley Huang
> >>>
> >>> On Tue, Apr 19, 2011 at 5:28 AM, Gary Dusbabek 
> wrote:
> >>>
>  +1
> 
>  On Mon, Apr 18, 2011 at 13:44, Eric Evans 
> wrote:
>  >
>  > OK.  Here are artifacts for a proposed 0.8 beta1 release.
>  >
>  > You will note the addition of three new artifacts, cql-1.0.0.tar.gz,
>  > txcql-1.0.0.tar.gz and apache-cassandra-cql-1.0.0.jar.  These are
>  > language drivers for CQL; Be sure to include them in your review.
>  >
>  > SVN:
>  >
> 
> https://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.7@r1094668
>  > 0.8.0-beta1 artifacts: http://people.apache.org/~eevans
>  >
>  > The vote will be open for 72 hours, longer if needed.
>  >
>  > Thanks!
>  >
>  > --
>  > Eric Evans
>  > eev...@rackspace.com
>  >
>  >
> 
> >>>
> >>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of DataStax, the source for professional Cassandra support
> >> http://www.datastax.com
> >>
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>


Re: Cassandra 1.0

2011-06-16 Thread Terje Marthinussen
There is already so much stuff in the 1.0 branch that I don't think 4 months
to feature freeze is a problem.

Assuming big stuff like the new sstable format goes into 1.0, I am more
concerned about the 1 month from freeze to release.

Regards,
Terje

On 17 Jun 2011, at 01:39, Eric Evans  wrote:

> On Thu, 2011-06-16 at 09:15 -0700, Ryan King wrote:
>> I think maybe 4 months was too short? Do we optimistically want to try
>> that again or plan on taking a bit more time?
> 
> I think 4 months is about the right amount of time.
> 
> Also, our upgrade story is better than it has ever been, which changes
> things here considerably.
> 
>> Either way I'm happy to have a plan. :) 
> 
> Yup.
> 
> -- 
> Eric Evans
> eev...@rackspace.com
>