performance with a "large" number of supercolumns/columns
Hi,

I was looking a bit at a case we have with a column family which has 439k supercolumns, each supercolumn with ~3 columns, so a total of some 1.3 million objects.

This takes about 9 seconds to read with thrift on first access, 4-5 seconds on second access.

I took a little closer look at this and noticed that roughly half of this time was spent in Cassandra. I will look more at this, but I thought I would just ask people here as it could be that someone already has good explanations...

Most of the time is spent here:

    public void collectReducedColumns(IColumnContainer container, Iterator reducedColumns, int gcBefore)
    {
        int liveColumns = 0;
        AbstractType comparator = container.getComparator();

        while (reducedColumns.hasNext())
        {
            if (liveColumns >= count)
                break;

            IColumn column = reducedColumns.next();
            if (logger.isDebugEnabled())
                logger.debug(String.format("collecting %s of %s: %s",
                                           liveColumns, count, column.getString(comparator)));

            if (finish.length > 0
                && ((!reversed && comparator.compare(column.name(), finish) > 0))
                    || (reversed && comparator.compare(column.name(), finish) < 0))
                break;

            // only count live columns towards the `count` criteria
            if (!column.isMarkedForDelete()
                && (!container.isMarkedForDelete()
                    || (ClockRelationship.GREATER_THAN == column.mostRecentLiveChangeAt().compare(container.getMarkedForDeleteAt()))))
            {
                liveColumns++;
            }

            // but we need to add all non-gc-able columns to the result for read repair:
            if (QueryFilter.isRelevant(column, container, gcBefore))
                container.addColumn(column);
        }
    }

Adding some time measuring prints revealed a few interesting things.

1. The first time I request the row (I request the entire row in this case), collectReducedColumns() is called twice. I have not had time to understand why yet, but there is one difference between the calls: all columns are returned both times, but the first call is done with MAXINT as "count" while the second call has the max count actually requested by the client as "count".
The first call takes about 3.7 seconds to process, the second about 1.7 seconds. Whatever the reason, I think we should be able to remove one of these calls?

2. Almost half the time is spent in container.addColumn(). This is probably no big surprise as there is a lot of code hidden "down there". I am however very curious whether it could not be significantly optimized, especially in the case where you have predicates which cause all columns to be included. That said, I have not managed to read enough Cassandra code to understand tombstones or all the other things which are going on in that part of the code.

3. A bit more of a surprise: most of the remaining 50% of the time seems to occur at

    while (reducedColumns.hasNext())

That is, save System.nanoTime() at the end of the loop and after hasNext(), sum up the differences, and it accounts for almost 50% of the total time. That seems quite weird to me.

I will dig more, but I was curious if someone had answers already.

Best regards,
Terje
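The measurement described in point 3 can be reproduced with a pattern like the following. This is a generic sketch (class and method names are illustrative, not from the Cassandra source): accumulate the time spent inside hasNext()/next() separately from the loop body.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorTiming
{
    // Sum only the nanoseconds spent in hasNext()/next(); the loop body
    // (addColumn etc. in the real code) runs between the two timestamps
    // and is deliberately excluded from the total.
    public static long timeIteration(Iterator<?> iter)
    {
        long iterNanos = 0;
        long t0 = System.nanoTime();
        while (true)
        {
            boolean more = iter.hasNext();
            if (!more)
                break;
            Object column = iter.next();
            long t1 = System.nanoTime();
            iterNanos += t1 - t0;
            // ... untimed loop body would run here ...
            t0 = System.nanoTime();
        }
        return iterNanos;
    }

    public static void main(String[] args)
    {
        List<Integer> cols = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println("nanos in hasNext()/next(): " + timeIteration(cols.iterator()));
    }
}
```

With a layered iterator stack (collating, filtering, reducing), the surprising result reported above is that this per-call overhead dominates.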
Re: performance with a "large" number of supercolumns/columns
Hi,

Yes, I have been looking a bit at this, but not as much as I wanted to. I kind of forgot about the whole double call on first access (probably something related to caching?) as even fixing that, the performance is not good enough for what I need.

The hasNext() call causes interactions with both apache commons-collections and google-collections, and simply speaking, the amount of code which ends up being executed for the hasNext()/next() part of this loop is a bit ridiculous. I started to optimize some code in commons-collections and got a 10-20% improvement (for my specific test case), but as I started understanding both the collection code and the Cassandra code better, it started dawning on me that:

1. There is just too much code and too many "layers" involved when hasNext() is called. I suspect this requires a re-design, and google/commons-collections may have to be thrown out. This would seem to be a pretty critical area of Cassandra code for performance, so I suspect the cost of making functionality specifically tuned for Cassandra should be worth it.

2. From what I understand, columns from all SSTables/Memtables are merged before considering tombstones and applying the slice predicates. To reduce deserialization and Column object creation, maybe some of this filtering and column merging could/should be done before Column objects are even created? While this may make things somewhat hairy codewise, JVM performance/memory consumption tends to be unhappy in scenarios with heavy object creation, and we get a lot of objects here.

3. It would seem like there might be good possibilities to do a fair bit of optimization for the case where a complete set of columns is fetched (slices with null for start/end predicates), but I am not currently sure how much time it would save and whether it would be worth the work and added complexity to the code.

I may make an attempt at 1.
but I don't feel entirely like I understand enough of the Cassandra code yet to do any of the above at the moment, so it may take a little while.

Regards,
Terje

On Thu, Jul 8, 2010 at 8:07 AM, Jonathan Ellis wrote:
> Hi Terje,
>
> Sorry to not get to this sooner. Are you still looking into this?
>
> On Tue, Jun 22, 2010 at 12:47 PM, Terje Marthinussen wrote:
> > Hi,
> >
> > I was looking a bit on a case we have with columnfamily which has 439k
> > supercolumns, each supercolumn with ~3 columns each so a total of some
> > 1.3 million objects in total.
> >
> > This takes about 9 second to read with thrift on first access, 4-5
> > second on second access.
> >
> > I took a little closer look at this, I noticed that roughly half of this
> > time was spend in cassandra.
> >
> > I will look more at this, but I thought I would just ask people here as
> > it could be that someone already had good explanations...
> >
> > Most of the time is spent here
> >
> >     public void collectReducedColumns(IColumnContainer container,
> >                                       Iterator reducedColumns, int gcBefore)
> >     {
> >         int liveColumns = 0;
> >         AbstractType comparator = container.getComparator();
> >
> >         while (reducedColumns.hasNext())
> >         {
> >             if (liveColumns >= count)
> >                 break;
> >
> >             IColumn column = reducedColumns.next();
> >             if (logger.isDebugEnabled())
> >                 logger.debug(String.format("collecting %s of %s: %s",
> >                                            liveColumns, count, column.getString(comparator)));
> >
> >             if (finish.length > 0
> >                 && ((!reversed && comparator.compare(column.name(), finish) > 0))
> >                     || (reversed && comparator.compare(column.name(), finish) < 0))
> >                 break;
> >
> >             // only count live columns towards the `count` criteria
> >             if (!column.isMarkedForDelete()
> >                 && (!container.isMarkedForDelete()
> >                     || (ClockRelationship.GREATER_THAN == column.mostRecentLiveChangeAt().compare(container.getMarkedForDeleteAt()))))
> >             {
> >                 liveColumns++;
> >             }
> >
> >             // but we need to add all non-gc-able columns to the result for read repair:
> >             if (QueryFilter.isRelevant(column, container, gcBefore))
> >                 container.addColumn(column);
> >         }
> >     }
> >
> > Adding some time measuring print revealed a few interesting this.
> > 1. First time I request the row (I request the entire row in this case),
> > collec
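The "too many layers" concern in point 1 above could, as a rough sketch, be addressed with a purpose-built k-way merge over the per-Memtable/SSTable column iterators using a single binary heap, instead of the stacked commons-collections/google-collections iterators. All names here are hypothetical, and this only merges in comparator order; Cassandra's reduce step (combining versions of the same column) would still be needed on top.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch
{
    // Merge several sorted iterators into one sorted stream.
    // Heap entries are pairs: [next value from a source, that source iterator].
    @SuppressWarnings("unchecked")
    static <T> Iterator<T> merge(List<Iterator<T>> sources, Comparator<T> cmp)
    {
        PriorityQueue<Object[]> heap = new PriorityQueue<>(
            Math.max(1, sources.size()),
            (a, b) -> cmp.compare((T) a[0], (T) b[0]));
        for (Iterator<T> it : sources)
            if (it.hasNext())
                heap.add(new Object[]{ it.next(), it });

        return new Iterator<T>()
        {
            public boolean hasNext() { return !heap.isEmpty(); }

            public T next()
            {
                Object[] top = heap.poll();
                Iterator<T> src = (Iterator<T>) top[1];
                if (src.hasNext())          // refill the heap from the source we consumed
                    heap.add(new Object[]{ src.next(), src });
                return (T) top[0];
            }
        };
    }

    public static void main(String[] args)
    {
        List<Iterator<Integer>> sources = Arrays.asList(
            Arrays.asList(1, 4, 7).iterator(),
            Arrays.asList(2, 5).iterator(),
            Arrays.asList(3, 6).iterator());
        List<Integer> out = new ArrayList<>();
        merge(sources, Comparator.<Integer>naturalOrder()).forEachRemaining(out::add);
        System.out.println(out);  // prints [1, 2, 3, 4, 5, 6, 7]
    }
}
```

The point of the sketch is that each hasNext()/next() is one heap operation rather than a chain of delegating iterator calls; whether that wins in practice would need the kind of profiling described in the thread.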
Re: Minimizing the impact of compaction on latency and throughput
> (2) posix_fadvise() feels more obscure and less portable than
> O_DIRECT, the latter being well-understood and used by e.g. databases
> for a long time.

Due to the need for doing data alignment in the application itself (you are bypassing all the OS magic here), there is really nothing portable about O_DIRECT. Just have a look at open(2) on Linux:

    O_DIRECT
        The O_DIRECT flag may impose alignment restrictions on the length and
        address of userspace buffers and the file offset of I/Os. In Linux,
        alignment restrictions vary by file system and kernel version and
        might be absent entirely. However there is currently no file
        system-independent interface for an application to discover these
        restrictions for a given file or file system. Some file systems
        provide their own interfaces for doing so, for example the
        XFS_IOC_DIOINFO operation in xfsctl(3).

So, just within Linux you get different mechanisms for this depending on the kernel and file system in use, and you need to figure out what to do yourself as the OS will not tell you. Don't expect this alignment stuff to be more standardized across OSes than it is inside of Linux. Still find this portable?

O_DIRECT also bypasses the cache completely, so you lose a lot of the I/O scheduling and caching across multiple readers/writers in threaded apps and separate processes which the OS may offer. This can especially be a big loss when you have servers with loads of memory for large filesystem caches, where you might find it hard to actually utilize the cache in the application.

O_DIRECT was made to work around HW performance limitations on servers 10+ years ago. It is far from an ideal solution today (but until stuff like fadvise is implemented properly, somewhat unavoidable).

Best regards,
Terje
Re: Minimizing the impact of compaction on latency and throughput
On Tue, Jul 13, 2010 at 10:26 PM, Jonathan Ellis wrote:
>
> I'm totally fine with saying "Here's a JNI library for Linux [or even
> Linux version >= 2.6.X]" since that makes up 99% of our production
> deployments, and leaving the remaining 1% with the status quo.

You really need to say Linux > 2.6 *and* filesystem xyz. That probably reduces the percentage a bit, but probably not critically.

It is quite a while since I have written code for direct I/O (I really try to avoid using it anymore), but from memory, as long as there is a framework which is somewhat extensible and can be used as a basis for new platforms, it should be reasonably trivial for a somewhat experienced person to add a new unix-like platform in a couple of days. No idea for Windows; I have never written code for this there.

> > O_DIRECT also bypasses the cache completely
>
> Right, that's the idea. :)

Hm... I would have thought it was clear that my idea is that you do want to interact with the cache if you can! :)

Under high load, you might reduce performance 10-30% by throwing out the scheduling benefits you get from the OS (yes, that is based on real-life experience). Of course, that is given that you can somehow avoid the worst-case scenarios without direct I/O. As always, things will differ from use case to use case. A well-performing HW raid card with sufficient writeback cache might also help reduce the negative impact of direct I/O.

Funnily enough, it is often the systems with light read load that are hardest hit. Systems with heavy read load have more pressure on the cache on the read side, and writes will not push content out of the cache (or applications out of physical memory) as easily. To make things more annoying, OSes (not just Linux) have a tendency of behaving differently from release to release. What is a problem on one Linux release is not necessarily a problem on another.
I have not seen huge problems with I/O when compacting on Cassandra myself, but I am currently working on HW with loads of memory, so I might not see the problems others see. I am more concerned with other performance issues at the moment.

One nifty effect which may, or may not, be worth looking into: when you flip over to the new compacted SSTable, the last thing you wrote to the new table will be sitting in cache, ready to be read once you start using it. It can as such be worth ordering the compaction so that the most performance-critical parts are written last, and written without direct I/O or similar settings, so they will be ready in cache when needed. I am not sure to what extent parts of the SSTables have structures of importance like this for Cassandra; haven't really thought about it until now.

It might also be worth looking at I/O scheduler settings in the Linux kernel. Some of the I/O schedulers also support ionice/io priorities. I have never used it on single threads, but I have read that ioprio_set() accepts thread ids (not just process ids like the man page indicates). While not super efficient at preventing cache flushing of mostly idle data, in my experience, if the compaction I/O occurs in isolated threads so ionice can be applied to that thread, it should help.

> Exactly: the fadvise mode that would actually be useful to us is a
> no-op and not likely to change soon. A bit of history:

Interesting, I had not seen that before. Thanks!

Terje
column family names
Hi,

Now that we can make column families on the fly, it gets interesting to use column families more as part of the data model (this can reduce disk space quite a bit vs. super columns in some cases). However, currently the column family name validator is pretty strict, allowing only word characters, and in some cases it is pretty darned nice to be able to put something like a "-" inbetweenallthewords.

Any reason to be this strict, or could it be loosened up a little bit?

Terje
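As a sketch of what the loosening could look like: the current restriction amounts to roughly a word-character-only pattern, and allowing hyphens between words is a one-pattern change. Both patterns and the class here are illustrative, not the actual validator code.

```java
import java.util.regex.Pattern;

public class NameValidator
{
    // Roughly what the current restriction amounts to (word characters only);
    // assumed for illustration, not copied from the Cassandra source.
    static final Pattern STRICT = Pattern.compile("\\w+");

    // A loosened version that also allows '-' between the words, as suggested.
    static final Pattern RELAXED = Pattern.compile("\\w+(?:-\\w+)*");

    static boolean isValid(Pattern p, String name)
    {
        return p.matcher(name).matches();
    }

    public static void main(String[] args)
    {
        System.out.println(isValid(STRICT, "in-between-all-the-words"));   // prints false
        System.out.println(isValid(RELAXED, "in-between-all-the-words"));  // prints true
    }
}
```

The relaxed pattern still rejects leading/trailing hyphens and anything outside word characters, so file-system-unfriendly names stay excluded.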
cassandra disk usage
Hi,

Was just looking at an SSTable file after loading a dataset. The data load has no updates of data, but:
- Columns can in some rare cases be added to existing super columns
- SuperColumns will be added to the same key (but not overwriting existing data)

I batch these, but it is quite likely that there will be 2-3 updates to a key.

This is a randomly selected SSTable file from a much bigger dataset. The data is stored as date(super)/type(column)/value. Date is a simple "20100811" type string. Value is a small integer, 2 digits on average.

If I run a simple strings on the SSTable and look for the data:
- value: 692 KB of data
- type: 4.01 MB of data
- date: 4.6 MB of data
In total: 9.4 MB.

The size of the .db file, however, is 36.4 MB... The expansion from the column headers is bad enough, but I can somehow accept that. The almost 4x expansion on top of that is a bit harder to justify. Anyone know already where this expansion comes from? Or do I need to take a careful look at the source (probably useful anyway :))

Terje
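A back-of-envelope estimate suggests per-object headers alone could account for much of the gap. The header sizes below are assumptions for illustration (a length prefix, flags, timestamp, and value length per column; a length prefix, deletion fields, and column count per supercolumn), not figures taken from the Cassandra serialization code.

```java
public class OverheadEstimate
{
    // Assumed fixed serialization overhead per object (illustrative only):
    static final int COLUMN_HEADER = 2 + 1 + 8 + 4;       // name len + flags + timestamp + value len = 15
    static final int SUPERCOLUMN_HEADER = 2 + 4 + 8 + 4;  // name len + local delete + delete-at + column count = 18

    public static void main(String[] args)
    {
        long superColumns = 439_000L;
        long columns = superColumns * 3;                   // ~3 columns per supercolumn

        // payload sizes measured with strings on the SSTable, from the post
        long payload = 692_000L + 4_010_000L + 4_600_000L; // ~9.3 MB

        long headers = superColumns * SUPERCOLUMN_HEADER + columns * COLUMN_HEADER;
        System.out.printf("payload: %,d bytes%n", payload);
        System.out.printf("headers: %,d bytes%n", headers);
        System.out.printf("total:   %,d bytes (observed file: ~36.4 MB)%n", payload + headers);
    }
}
```

Under these assumed header sizes, ~1.76 million objects contribute roughly 27.7 MB of headers, which together with the 9.3 MB payload lands close to the observed 36.4 MB file; the real layout (and the row index/bloom filter) would need checking against the source.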
Re: Simple Compression Idea
There is a lot of overhead in the serialized data itself (just have a look at an sstable file). It would be great to be able to compress at the byte-array level rather than the string level.

Regards,
Terje

On 1 Feb 2011, at 03:15, "David G. Boney" wrote:
> In Cassandra, strings are stored as UTF-8. In arithmetic coding compression,
> the modeling is separate from the coding. A standard arrangement is to have a
> 0-order model (frequencies of individual bytes), a 1-order model (frequencies
> of two-byte occurrences), and 2-order models (frequencies of three-byte
> occurrences). For multibyte unicode encodings, which almost all fall in the
> two- or three-byte encodings, the higher-order models should capture the
> statistics. Having said that, there is nothing to prevent someone from
> developing a UTF-8-specific model for capturing the statistics better.
>
> An adaptive arithmetic coding model starts with the probabilities of all the
> symbols being equal. This results in all the symbols using 8 bits per
> encoding. The model adapts with each symbol it sees, but it takes seeing some
> data to start seeing compression. I do not know, nor have any reports from
> the literature, how many bytes, on average, it takes to start seeing
> compression. All the papers I have seen have studied large blocks of text.
>
> My assumption is that the majority of strings to compress in Cassandra will
> be short strings, less than 32 bytes, to medium-length strings, less than 256
> bytes. An example of short strings might be column names. An example of
> medium-length strings would be URLs. My assumption is that most short strings
> would not get much compression from adaptive arithmetic coding, but
> medium-length to longer strings would. In an effort to get higher compression
> on the short strings, static encoding methods could be used. A static
> encoding method uses a hard-coded frequency table for the bytes.
> The start of the compressed string could have a 2-bit code: 00 - no
> compression, 01 - static, default probability table, 10 - static, locale
> probability table, 11 - adaptive coding. The default static coding would be
> based on English. For the static locale probability table, the 2-bit code
> would need to be followed by a code table indicating the probability table to
> use for a particular language. The locale probability tables would be stored
> in the compressor jar file as a separate class for each locale supported
> (this issue needs more analysis; what I don't think is effective is to load
> the probability tables from disk each time a new compressor is created).
> During the encoding phase, both adaptive and static coding of the string
> would occur. In general, compression with a static coding table is very fast:
> static coding is basically a lookup in a table with 256 entries. It is the
> adaptive coding that is more computationally intensive. Furthermore, you
> could place a limit on the use of static coding, say strings less than 256
> bytes. This would need to be set empirically. The shorter string from the two
> methods is used as the compressed string. There is no additional burden on
> the decoding side, since you know the type of compression encoding.
>
> It might be the case that block compressors can achieve greater compression
> ratios. From what I have read on the mailing list and JIRA, it appears that
> there is a lot of pain involved to implement block-based compressors. This
> method, a compressed string data type, is presented as a method that is
> minimally invasive to the Cassandra architecture. Since everything is already
> stored as a byte array, nothing would need to be changed in the internal data
> types. Furthermore, no surgery of the internal tables is needed. The main
> addition is the development of comparators, for keys and column headings,
> that know how to decompress the byte arrays before comparing them.
> There is also the additional work of writing the codec, but there are a
> number of examples in C and Java from which to draw. Moffat's web site would
> be a good place to start.
>
> -
> Sincerely,
> David G. Boney
> dbon...@semanticartifacts.com
> http://www.semanticartifacts.com
>
> On Jan 31, 2011, at 2:19 AM, Stephen Connolly wrote:
>
>> On 31 January 2011 04:41, David G. Boney wrote:
>>> I propose a simple idea for compression using a compressed string datatype.
>>>
>>> The compressed string datatype could be implemented for column family keys
>>> by creating a compressed string ordered partitioner. The compressed string
>>> ordered partitioner works by decompressing the string and then applying an
>>> ordered partitioner for strings to the decompressed string. The hash-based
>>> partitioner would be used with the compressed string without any
>>> modification. The compressed string datatype could be implemented for
>>> column keys by creating a compressed string comparator.
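The assumption above — that short strings barely compress while longer ones do — is easy to check with a stock compressor. Deflate stands in here for the arithmetic coder under discussion (different algorithm, but the fixed per-stream overhead tells a similar story for short inputs); the strings are made up for illustration.

```java
import java.util.zip.Deflater;

public class ShortStringCompression
{
    // Compress a byte array with zlib/deflate and return the compressed size.
    static int compressedSize(byte[] input)
    {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] out = new byte[input.length * 2 + 64]; // generous buffer
        int n = d.deflate(out);
        d.end();
        return n;
    }

    public static void main(String[] args)
    {
        byte[] shortStr = "user_name".getBytes();        // a typical column name
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 64; i++)                     // a longer, redundant value
            sb.append("http://www.semanticartifacts.com/page/");
        byte[] longStr = sb.toString().getBytes();

        // The short string expands (stream overhead exceeds any savings);
        // the long string compresses heavily.
        System.out.println("short: " + shortStr.length + " -> " + compressedSize(shortStr));
        System.out.println("long:  " + longStr.length + " -> " + compressedSize(longStr));
    }
}
```

This is the motivation for the proposal's per-string mode bits: fall back to "no compression" or a static table when the adaptive coder cannot pay for its own overhead.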
2GB rows and errors
Hi,

Any good reason this guy

    public int bytesPastMark(FileMark mark)
    {
        assert mark instanceof BufferedRandomAccessFileMark;
        long bytes = getFilePointer() - ((BufferedRandomAccessFileMark) mark).pointer;

        assert bytes >= 0;
        if (bytes > Integer.MAX_VALUE)
            throw new UnsupportedOperationException("Overflow: " + bytes);
        return (int) bytes;
    }

does not show an error more like "Overflow: maximum row size is 2GB. Currently: " + bytes?

The error you get today is not exactly self-explanatory :)

Terje
Re: 2GB rows and errors
Ah, yes, I should have noticed that distinction. We actually hit this overflow on a row that was more than 60GB (yes, we had to count the number of digits a few times to make sure).

On Sat, Mar 5, 2011 at 5:41 AM, Jonathan Ellis wrote:
> Second try:
>
> - this isn't used in row size (which is not limited to 2GB)
> - it's used both for the column index summary and index-block reading,
>   both of which should be well under 2GB
> - however, I don't see any technical reason this method should return
>   an int instead of a long
> - if we make that change we should probably do additional sanity
>   checks in the callers, which will have the necessary context to
>   provide better error messages
>
> On Fri, Mar 4, 2011 at 1:36 PM, Terje Marthinussen wrote:
> > Hi,
> >
> > Any good reason this guy
> >
> >     public int bytesPastMark(FileMark mark)
> >     {
> >         assert mark instanceof BufferedRandomAccessFileMark;
> >         long bytes = getFilePointer() - ((BufferedRandomAccessFileMark) mark).pointer;
> >
> >         assert bytes >= 0;
> >         if (bytes > Integer.MAX_VALUE)
> >             throw new UnsupportedOperationException("Overflow: " + bytes);
> >         return (int) bytes;
> >     }
> >
> > does not show an error more like "Overflow: Maximum row size 2GB.
> > Currently:" + bytes?
> >
> > Error you get today is not exactly self explaining :)
> >
> > Terje
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
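A minimal sketch of the change suggested above — return a long and leave range checks to callers, which know what the offset means. The FileMark interface and names here are simplified stand-ins, not the actual Cassandra classes.

```java
public class MarkSketch
{
    // Simplified stand-in for Cassandra's FileMark.
    interface FileMark
    {
        long pointer();
    }

    // Hypothetical long-returning version: no overflow possible here;
    // the caller decides what "too big" means in its own context.
    static long bytesPastMark(long filePointer, FileMark mark)
    {
        long bytes = filePointer - mark.pointer();
        assert bytes >= 0;
        return bytes;
    }

    public static void main(String[] args)
    {
        FileMark mark = () -> 100L;
        long bytes = bytesPastMark(5_000_000_000L, mark);  // > 2GB is now fine
        System.out.println(bytes);
        // A caller with context can produce a message that actually explains itself:
        if (bytes > Integer.MAX_VALUE)
            System.out.println("index block unexpectedly large: " + bytes + " bytes");
    }
}
```

The caller-side check is where the improved error message from the original question would live, since only the caller knows whether the offset is a row, an index summary, or an index block.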
Re: [VOTE] Apache Cassandra 0.8.0-beta1
Unfortunate as it means beta1 is useless for testing for us, but I guess it does not make much difference if we try 0.8 trunk instead. Terje On Tue, Apr 19, 2011 at 9:46 PM, Jonathan Ellis wrote: > We typically only block betas for major regressions, not bugs that are > already present in a released version (all of 0.7 releases in this > case). So yes, this will be in the next one after beta1 (probably rc1 > since 0.8 won't be a moving target like 0.7 was). > > On Tue, Apr 19, 2011 at 7:42 AM, Shotaro Kamio > wrote: > > -1, > > I'd like to make 0.8 to include the fix for CASSANDRA-2406. The fix is > > committed to the 0.8 branch lately. I'll wait for beta2. > > > > > > Regards, > > Shotaro > > > > > > On Tue, Apr 19, 2011 at 10:42 AM, Jonathan Ellis > wrote: > >> CASSANDRA-2448 means > https://issues.apache.org/jira/browse/CASSANDRA-2448 > >> > >> On Mon, Apr 18, 2011 at 8:32 PM, Huang Stanley > wrote: > >>> +1 > >>> > >>> I'm looking forward to 0.8 :-) > >>> > >>> And one question about 0.8 beta1, > >>> I saw an item in the change log: > >>> > >>> > >>> * remove "nodetool loadbalance" (CASSANDRA-2448) > >>> > >>> any reason? any alternative solution to keep data "balance"? > >>> > >>> regards, > >>> > >>> Stanley Huang > >>> > >>> regards, > >>> > >>> Stanley Huang > >>> > >>> On Tue, Apr 19, 2011 at 5:28 AM, Gary Dusbabek > wrote: > >>> > +1 > > On Mon, Apr 18, 2011 at 13:44, Eric Evans > wrote: > > > > OK. Here are artifacts for a proposed 0.8 beta1 release. > > > > You will note the addition of three new artifacts, cql-1.0.0.tar.gz, > > txcql-1.0.0.tar.gz and apache-cassandra-cql-1.0.0.jar. These are > > language drivers for CQL; Be sure to include them in your review. > > > > SVN: > > > > https://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.7@r1094668 > > 0.8.0-beta1 artifacts: http://people.apache.org/~eevans > > > > The vote will be open for 72 hours, longer if needed. > > > > Thanks! 
> > > > -- > > Eric Evans > > eev...@rackspace.com > > > > > > >>> > >> > >> > >> > >> -- > >> Jonathan Ellis > >> Project Chair, Apache Cassandra > >> co-founder of DataStax, the source for professional Cassandra support > >> http://www.datastax.com > >> > > > > > > -- > Jonathan Ellis > Project Chair, Apache Cassandra > co-founder of DataStax, the source for professional Cassandra support > http://www.datastax.com >
Re: Cassandra 1.0
There is already so much stuff on the 1.0 branch that I don't think 4 months to feature freeze is a problem. Assuming big stuff like the new sstable format will go into 1.0, I am more concerned about the 1 month from freeze to release.

Regards,
Terje

On 17 Jun 2011, at 01:39, Eric Evans wrote:
> On Thu, 2011-06-16 at 09:15 -0700, Ryan King wrote:
>> I think maybe 4 months was too short? Do we optimistically want to try
>> that again or plan on taking a bit more time?
>
> I think 4 months is about the right amount of time.
>
> Also, our upgrade story is better than it has ever been, which changes
> things here considerably.
>
>> Either way I'm happy to have a plan. :)
>
> Yup.
>
> --
> Eric Evans
> eev...@rackspace.com