Cache Saving Question

2012-07-30 Thread Zhu Han
We have a CF on which both the key cache and the row cache are enabled. We
found that the key cache is saved to disk periodically per the configuration,
while the row cache has not been saved at all in the last two months. Both the
log and the on-disk mtime confirm this.

Is this the expected behavior?

The node runs Cassandra 1.0.10. Here is the CF definition:

create column family Demo
  with column_type = 'Standard'
  and comparator = 'BytesType'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and rows_cached = 10.0
  and row_cache_save_period = 14400
  and row_cache_keys_to_save = 2147483647
  and keys_cached = 20.0
  and key_cache_save_period = 14400
  and read_repair_chance = 0.01
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and row_cache_provider = 'SerializingCacheProvider'
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and compression_options = {'chunk_length_kb' : '64',
    'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};


best regards,
Zhu Han


Re: Cassandra on top of B-Tree

2010-03-30 Thread Zhu Han
Log-structured databases, e.g. BDB-JE, have append-only characteristics.
Could one be an alternative to SSTables? Those mature database products have
likely done a lot of work on cache management. I'm not sure whether that
would improve read performance or not.

Compression support in those products is another issue.

Just as Avinash pointed out, a distributed B-tree may bring some trouble
because of the complicated operations during rebalancing, unless there is
some elegant algorithm I'm not aware of.

best regards,
hanzhu


On Mon, Mar 29, 2010 at 8:12 PM, Jonathan Ellis  wrote:

> On Mon, Mar 29, 2010 at 6:52 AM, Michael Poole 
> wrote:
> > SSTables aren't written on every update.  Why would a B-Tree
> > implementation differ?
>
> Because traditional B-trees are update-in-place, and although CouchDB
> has an append-only B-tree, it's limited to one writer at a time which
> is (one reason) why they get 6x less throughput than we do on writes.
>
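
To make the contrast concrete, here is a minimal, hypothetical sketch (not
Cassandra's actual code) of an append-only write path: every update is a
sequential append plus an in-memory index move, whereas a traditional B-tree
would locate the page holding the key and rewrite it in place.

import java.io.*;
import java.util.HashMap;
import java.util.Map;

/** Sketch of an append-only (log-structured) store, for discussion only.
 *  Assumes a fresh log file; recovery and compaction are omitted. */
public class AppendOnlyStore
{
    private final DataOutputStream log;   // sequential writes only, never seeks
    private final Map<String, Long> index = new HashMap<String, Long>(); // key -> log offset
    private long offset = 0;

    public AppendOnlyStore(File file) throws IOException
    {
        log = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file, true)));
    }

    /** An update never touches old data: it appends a new record and repoints
     *  the index; the old record becomes garbage for a later compaction pass. */
    public synchronized void put(String key, byte[] value) throws IOException
    {
        byte[] k = key.getBytes("UTF-8");
        long recordOffset = offset;
        log.writeInt(k.length);
        log.write(k);
        log.writeInt(value.length);
        log.write(value);
        offset += 4 + k.length + 4 + value.length;
        index.put(key, recordOffset);
    }
}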


Re: Cassandra on top of B-Tree

2010-03-31 Thread Zhu Han
Good catch!  The major drawback of JE is its lack of on-disk locality if the
internal nodes cannot fit in RAM.  BDB can provide such on-disk locality.
Seems like BDB-JE could be an alternative to the memtable + tablet log, and
BDB an alternative to SSTables.

This is just a general discussion. I don't believe an open source project will
bet its future on a commercial product with another license. :-)

best regards,
hanzhu


On Tue, Mar 30, 2010 at 6:45 PM, Peter Schuller  wrote:

> > Log structural database  has the append-only characteristics, e.g.
> BDB-JE.
> > Is it an alternative for SSTable? Those matured database product might
> have
> > done a lot for cache management. Not sure whether it can improve the
> > performance of read or not.
>
> BDB JE seems to be targeted mostly at cases where data fits in RAM,
> or reasonably close to it. A problem is that while writes will be
> append-only as long as the database is sufficiently small, you start
> taking reads once the internal btree nodes no longer fit in RAM. So
> depending on cache size, at a certain number of keys (thus size of the
> btree) you start being seek-bound on reads while writing, even though
> the writes are in and of themselves append-only and not subject to
> seek overhead.
>
> Another effect, which I have not specifically confirmed in testing but
> expect to happen, is that once you reach this point of taking reads,
> compaction is probably going to be a lot more expensive.
> While normally JE can pick a log segment with the most garbage and
> mostly stream through it, re-writing non-garbage, that process will
> then also become entirely seek bound if only a small subset of the
> btree fits in RAM. So now you have a seek bound compaction process
> that must keep up with the append-only write process, meaning that
> your append-only writes are limited by said seeks in addition to any
> seeks it takes "directly" when generating the writes.
>
> Also keep in mind that JE won't have on-disk locality for either
> internal nodes or leaf (data) nodes.
>
> The guaranteed append-only nature of Cassandra, in combination with
> the on-disk locality, is one reason to prefer it, under some
> circumstances, over JE even for non-clustered local use on a single
> machine.
>
> (As a parenthesis: I doubt JE is being used very much with huge
> databases, since a very significant CPU bottleneck became O(n) (with
> respect to the number of log segments) file listings. This is probably
> easily patched, or configured away by using larger log segments, but
> the repeated O(n) file listings suggest to me that huge databases are
> not an expected use case - beyond some hints in the documentation that
> would indicate it's meant for smaller databases.)
>
> --
> / Peter Schuller
>


Re: [DISCUSSION] High-volume counters in Cassandra

2010-09-05 Thread Zhu Han
+1 for Jonathan Ellis.

I might not be on the same page as you active community members, but I'm
wondering why not put this feature into a popular client library or a
contrib package?

In CASSANDRA-1072 + CASSANDRA-1397, the increment of a counter is not
idempotent, so it's difficult to align with the consistency model of
Cassandra.  It's not worth putting a lot of code into the core codebase just
to serve a single feature.

In CASSANDRA-1421, the increment is idempotent and easier to align with
Cassandra. However, the read performance could be poor because it has to
reconcile a lot of columns. The memory consumption on a Cassandra node might
be much higher than with the above approach, if I understood it correctly.

If you decide to put the feature into the client library, the client library
can take the same approach as CASSANDRA-1421 and serialize the increments
from a single writer to limit the columns generated.  If the writers of a
single counter are just hundreds of processes, I don't think it is a big
deal for performance.

If you worry about the performance on the client side because it serializes
the increments of a single counter, maintain a queue for each counter;
it's easy to batch multiple updates in the same queue.
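
A minimal sketch of this client-side idea (hypothetical names, not a real
client library): a single writer per counter drains a queue and folds many
increments into one batched write.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: serialize all increments of one counter through a single writer. */
public class CounterWriter
{
    private final BlockingQueue<Long> pending = new LinkedBlockingQueue<Long>();

    /** Called by any number of application threads. */
    public void increment(long delta)
    {
        pending.add(delta);
    }

    /** The single writer loop: block for one increment, then fold in
     *  everything else that queued up behind it, and send one write. */
    public void runWriterLoop() throws InterruptedException
    {
        while (!Thread.currentThread().isInterrupted())
        {
            long sum = pending.take();
            List<Long> more = new ArrayList<Long>();
            pending.drainTo(more);
            for (long d : more)
                sum += d;
            sendIncrement(sum);
        }
    }

    private void sendIncrement(long delta)
    {
        // hypothetical: one Thrift insert for the batched delta instead of many
    }
}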



best regards,
hanzhu


On Fri, Sep 3, 2010 at 4:55 AM, Jonathan Ellis  wrote:

> I still have not seen any response to my other misgivings about 1072
> that I have raised on the ticket.  Specifically, the existing patch is
> based around a Clock structure that, since 580 is a dead end, is no
> longer necessary.
>
> I'm also uneasy about adding 200k of code that meshes as poorly with
> the rest of Cassandra as this does.  The more it can be split off into
> separate code paths, the better.  Adding its own thrift method is a
> good start, but it should go deeper than that.
>
> On Thu, Sep 2, 2010 at 12:01 PM, Johan Oskarsson 
> wrote:
> > In the last few months Digg and Twitter have been using a counter patch
> that lets Cassandra act as a high-volume realtime counting system. Atomic
> counters enable new applications that were previously difficult to implement
> at scale, including realtime analytics and large-scale systems monitoring.
> >
> > Discussion
> > There are currently two different suggestions for how to implement
> counters in Cassandra. The discussion has so far been limited to those
> following the jiras (CASSANDRA-1072 and CASSANDRA-1421) closely and we don’t
> seem to be nearing a decision. I want to open it up to the Cassandra
> community at large to get additional feedback.
> >
> > Below are very basic and brief introductions to the alternatives. Please
> help us move forward by reading through the docs and jiras and reply to this
> thread with your thoughts. Would one or the other, both or neither be
> suitable for inclusion in Cassandra? Is there a third option? What can we do
> to reach a decision?
> >
> > We believe that both options can coexist; their strengths and weaknesses
> make them suitable for different use cases.
> >
> >
> > CASSANDRA-1072 + CASSANDRA-1397
> > https://issues.apache.org/jira/browse/CASSANDRA-1072 (see design doc)
> > https://issues.apache.org/jira/browse/CASSANDRA-1397
> >
> > How does it work?
> > A node is picked as the primary replica for each write. The context byte
> array for a column contains (primary replica ip, value). Any previous data
> with the same ip is reconciled with the new increment and put as the column
> value.
> >
> > Concerns raised
> > * an increment in flight will be lost if the wrong node goes down
> > * if an increment operation times out it’s impossible to know if it has
> been executed or not
> >
> > The most recent jira comment proposes a new API method for increments
> that reflects the different consistency level guarantees.
> >
> >
> > CASSANDRA-1421
> > https://issues.apache.org/jira/browse/CASSANDRA-1421
> >
> > How does it work?
> > Each increment for a counter is stored as a (UUID, value) tuple. The read
> operations will read all these increment tuples for a counter, reconcile and
> return. On a regular interval the values are all read and reconciled into
> one value to reduce the amount of data required for each read operation.
> >
> > Concerns raised
> > * poor read performance, especially for time-series data
> > * post aggregation reconciliation issues
> >
> >
> > Again, we feel that both options can co-exist, especially if the 1072
> patch uses a new API method that reflects its different consistency level
> guarantees. Our proposal is to accept 1072 into trunk with the new API
> method, and when an implementation of 1421 is completed it can be accepted
> alongside.
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>
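
A minimal sketch of the 1421-style counter described above (hypothetical
names): each increment is stored as an (id, value) tuple, reads reconcile by
summing, and a periodic pass folds the tuples back into one.

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Sketch of the CASSANDRA-1421 idea: increments as (UUID, value) tuples. */
public class TupleCounter
{
    private final Map<UUID, Long> increments = new HashMap<UUID, Long>();

    /** Retrying a timed-out increment with the same id is idempotent:
     *  the tuple is overwritten rather than double-counted. */
    public synchronized void add(UUID id, long delta)
    {
        increments.put(id, delta);
    }

    /** Read path: reconcile all tuples; the cost grows with the number of
     *  unmerged increments, hence the poor-read-performance concern. */
    public synchronized long read()
    {
        long sum = 0;
        for (long d : increments.values())
            sum += d;
        return sum;
    }

    /** Periodic aggregation: replace many tuples with one pre-summed tuple. */
    public synchronized void aggregate()
    {
        long sum = read();
        increments.clear();
        increments.put(UUID.randomUUID(), sum);
    }
}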


Re: [DISCUSSION] High-volume counters in Cassandra

2010-09-05 Thread Zhu Han
I thought about it again for a while.  It might be a good trade-off to
implement CASSANDRA-1421 as a new API and limit the new code to the
StorageProxy level, with no dependency on the internal mechanisms of
Cassandra, e.g. compaction, membership management and other complicated
stuff.

If it does not pollute the core code base, it will be easier to refine or
remove when there is a better idea.

If so, the number of writers of a single counter is at most equal to the
number of Cassandra nodes.

If the client takes the simple optimization of sending the Thrift request
only to the Cassandra nodes which are the storage nodes of the counter, the
performance should be almost the same as "CASSANDRA-1072 + CASSANDRA-1397",
as the number of writers is the same as the number of replicas. You can also
save an extra round trip. That's what I did in my project.

best regards,
hanzhu


On Sun, Sep 5, 2010 at 5:24 PM, Zhu Han  wrote:

> +1 for Jonathan Ellis.
>
> I might not be on the same page as you active community members, but I'm
> wondering why not put this feature into a popular client library or a
> contrib package?
>
> In CASSANDRA-1072 + CASSANDRA-1397, the increment of a counter is not
> idempotent, so it's difficult to align with the consistency model of
> Cassandra.  It's not worth putting a lot of code into the core codebase just
> to serve a single feature.
>
> In CASSANDRA-1421, the increment is idempotent and easier to align with
> Cassandra. However, the read performance could be poor because it has to
> reconcile a lot of columns. The memory consumption on a Cassandra node might
> be much higher than with the above approach, if I understood it correctly.
>
> If you decide to put the feature into the client library, the client library
> can take the same approach as CASSANDRA-1421 and serialize the increments
> from a single writer to limit the columns generated.  If the writers of a
> single counter are just hundreds of processes, I don't think it is a big
> deal for performance.
>
> If you worry about the performance on the client side because it serializes
> the increments of a single counter, maintain a queue for each counter;
> it's easy to batch multiple updates in the same queue.
>
>
>
> best regards,
> hanzhu
>
>
>
> On Fri, Sep 3, 2010 at 4:55 AM, Jonathan Ellis  wrote:
>
>> I still have not seen any response to my other misgivings about 1072
>> that I have raised on the ticket.  Specifically, the existing patch is
>> based around a Clock structure that, since 580 is a dead end, is no
>> longer necessary.
>>
>> I'm also uneasy about adding 200k of code that meshes as poorly with
>> the rest of Cassandra as this does.  The more it can be split off into
>> separate code paths, the better.  Adding its own thrift method is a
>> good start, but it should go deeper than that.
>>
>> On Thu, Sep 2, 2010 at 12:01 PM, Johan Oskarsson 
>> wrote:
>> > In the last few months Digg and Twitter have been using a counter patch
>> that lets Cassandra act as a high-volume realtime counting system. Atomic
>> counters enable new applications that were previously difficult to implement
>> at scale, including realtime analytics and large-scale systems monitoring.
>> >
>> > Discussion
>> > There are currently two different suggestions for how to implement
>> counters in Cassandra. The discussion has so far been limited to those
>> following the jiras (CASSANDRA-1072 and CASSANDRA-1421) closely and we don’t
>> seem to be nearing a decision. I want to open it up to the Cassandra
>> community at large to get additional feedback.
>> >
>> > Below are very basic and brief introductions to the alternatives. Please
>> help us move forward by reading through the docs and jiras and reply to this
>> thread with your thoughts. Would one or the other, both or neither be
>> suitable for inclusion in Cassandra? Is there a third option? What can we do
>> to reach a decision?
>> >
>> > We believe that both options can coexist; their strengths and weaknesses
>> make them suitable for different use cases.
>> >
>> >
>> > CASSANDRA-1072 + CASSANDRA-1397
>> > https://issues.apache.org/jira/browse/CASSANDRA-1072 (see design doc)
>> > https://issues.apache.org/jira/browse/CASSANDRA-1397
>> >
>> > How does it work?
>> > A node is picked as the primary replica for each write. The context byte
>> array for a column contains (primary replica ip, value). Any previous data
>> with the same ip is reconciled with the new increment and put as the column
>> value.
>> >

Randomly read repair?

2010-09-18 Thread Zhu Han
Hi,

I noticed the code snippet below in StorageProxy#strongRead(). Why is read
repair still triggered randomly when the digest is mismatched for CL.QUORUM
and CL.ALL? IMHO, at those two consistency levels the client wants the
returned result to be the consistent one.  Read repair should be triggered
unconditionally here.

If the client does not care about consistency, CL.ONE is the natural
choice. For that level, read repair can be triggered randomly per the
keyspace configuration.

>     try
>     {
>         long startTime2 = System.currentTimeMillis();
>         row = quorumResponseHandler.get();
>         if (row != null)
>             rows.add(row);
>
>         if (logger.isDebugEnabled())
>             logger.debug("quorumResponseHandler: " + (System.currentTimeMillis() - startTime2) + " ms.");
>     }
>     catch (DigestMismatchException ex)
>     {
>         if (randomlyReadRepair(command))
>         {
>             AbstractReplicationStrategy rs = StorageService.instance.getReplicationStrategy(command.table);
>             QuorumResponseHandler qrhRepair = rs.getQuorumResponseHandler(new ReadResponseResolver(command.table), ConsistencyLevel.QUORUM);
>             if (logger.isDebugEnabled())
>                 logger.debug("Digest mismatch:", ex);
>             Message messageRepair = command.makeReadMessage();
>             MessagingService.instance.sendRR(messageRepair, commandEndpoints.get(i), qrhRepair);
>             if (repairResponseHandlers == null)
>                 repairResponseHandlers = new ArrayList<QuorumResponseHandler>();
>             repairResponseHandlers.add(qrhRepair);
>         }
>




best regards,
hanzhu


Re: Randomly read repair?

2010-09-19 Thread Zhu Han
>
> (IMO the "right" thing is more complicated -- we shouldn't send
> requests to _all_ the replicas on the _first_ read with CL.QUORUM,
> except as dictated by randomlyReadRepair.)
>
>
I agree with you. But which replicas to send requests to depends on the
replication strategy... Only the nearest nodes should receive the requests.

BTW, in which version of Cassandra was the random read repair feature first
implemented? 0.7 beta?
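
A sketch of the behavior being argued for (hypothetical code, not the actual
StorageProxy logic): under CL.QUORUM, send the data read to the closest
replica and digest reads to just enough others to form a quorum, instead of
contacting all replicas up front.

import java.util.List;

/** Sketch: choose read targets for CL.QUORUM without contacting every replica. */
public class ReadTargetSelector
{
    /** Assumes replicasByProximity is pre-sorted by the replication strategy /
     *  snitch and contains all replicationFactor replicas. The first endpoint
     *  gets the full data read; the rest get digest reads. */
    public static List<String> targetsForQuorum(List<String> replicasByProximity,
                                                int replicationFactor)
    {
        int quorum = replicationFactor / 2 + 1;
        return replicasByProximity.subList(0, quorum);
    }
}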


> On Sat, Sep 18, 2010 at 9:39 AM, Zhu Han  wrote:
> > Hi,
> >
> > I noticed the code snippet below in StorageProxy#strongRead(). Why is read
> > repair still triggered randomly when the digest is mismatched for CL.QUORUM
> > and CL.ALL? IMHO, at those two consistency levels the client wants the
> > returned result to be the consistent one.  Read repair should be triggered
> > unconditionally here.
> >
> > If the client does not care about consistency, CL.ONE is the natural
> > choice. For that level, read repair can be triggered randomly per the
> > keyspace configuration.
> >
> >>     try
> >>     {
> >>         long startTime2 = System.currentTimeMillis();
> >>         row = quorumResponseHandler.get();
> >>         if (row != null)
> >>             rows.add(row);
> >>
> >>         if (logger.isDebugEnabled())
> >>             logger.debug("quorumResponseHandler: " + (System.currentTimeMillis() - startTime2) + " ms.");
> >>     }
> >>     catch (DigestMismatchException ex)
> >>     {
> >>         if (randomlyReadRepair(command))
> >>         {
> >>             AbstractReplicationStrategy rs = StorageService.instance.getReplicationStrategy(command.table);
> >>             QuorumResponseHandler qrhRepair = rs.getQuorumResponseHandler(new ReadResponseResolver(command.table), ConsistencyLevel.QUORUM);
> >>             if (logger.isDebugEnabled())
> >>                 logger.debug("Digest mismatch:", ex);
> >>             Message messageRepair = command.makeReadMessage();
> >>             MessagingService.instance.sendRR(messageRepair, commandEndpoints.get(i), qrhRepair);
> >>             if (repairResponseHandlers == null)
> >>                 repairResponseHandlers = new ArrayList<QuorumResponseHandler>();
> >>             repairResponseHandlers.add(qrhRepair);
> >>         }
> >>
> >
> >
> >
> >
> > best regards,
> > hanzhu
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: improving read performance

2010-09-21 Thread Zhu Han
> Reasons to not use the row cache with large rows include:
>
> * In general it's a waste of memory better given to the OS page cache,
> unless possibly you're continually reading entire rows rather than
> subsets of rows.
>
> * For truly large rows you may have immediate issues with the size of
> the data being cached; e.g. attempting to cache a 2 GB row is not the
> best idea in terms of heap space consumption; you'll likely OOM or
> trigger fallbacks to full GC, etc.
>
> * Having a larger key cache may often be more productive.
>
> > That aside, splitting the memtable in 2, could make checking the bloom
> > filters unnecessary in most cases for me, but I'm not sure it's worth the
> > effort.
>
> Write-through row caching seems like a more direct approach to me
> personally, off hand. Also to the extent that you're worried about
> false positive rates, larger bloom filters may still be an option (not
> currently configurable; would require source changes).
>
IMHO, it's very difficult to tune the JVM when it caches a lot of data for a
long time, because modern GCs are not designed for that purpose.

There is a patch to make the row cache pluggable so that it can be replaced
by memcached[1]. This is likely the right way to go.

[1] https://issues.apache.org/jira/browse/CASSANDRA-1283


> --
> / Peter Schuller
>
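
A minimal sketch of what a pluggable row cache could look like (hypothetical
interface; the actual CASSANDRA-1283 patch differs in detail): Cassandra
would program against the interface, and a memcached-backed implementation
would keep the cached rows off the JVM heap so the GC never sees them.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical pluggable row cache interface, for discussion only. */
interface RowCacheProvider
{
    byte[] get(byte[] key);
    void put(byte[] key, byte[] serializedRow);
    void invalidate(byte[] key);
}

/** On-heap reference implementation; a memcached-backed one would implement
 *  the same interface but store the serialized rows outside the JVM heap. */
class OnHeapRowCache implements RowCacheProvider
{
    private final Map<String, byte[]> rows = new ConcurrentHashMap<String, byte[]>();

    public byte[] get(byte[] key)                     { return rows.get(hex(key)); }
    public void put(byte[] key, byte[] serializedRow) { rows.put(hex(key), serializedRow); }
    public void invalidate(byte[] key)                { rows.remove(hex(key)); }

    // byte[] uses identity-based equals, so key the map on a hex string instead
    private static String hex(byte[] b)
    {
        StringBuilder sb = new StringBuilder();
        for (byte x : b)
            sb.append(String.format("%02x", x));
        return sb.toString();
    }
}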


Re: [DISCUSSION] High-volume counters in Cassandra

2010-09-25 Thread Zhu Han
>
>
> On the other hand, if the patch authors never bring it up to the
> standards of the rest of the project, well, then it's a good thing we
> didn't commit it under a "commit now, fix later" process.
>
> > Maybe this fork could be prevented if committers could give the guidance?
>
> While it's true -- and unfortunate, mea culpa -- that the rest of us
> weren't involved enough at the beginning of the counter design
> process, that's not the case any more.
>

Can we just let the patch be committed, but mark it as "alpha" or
"experimental"? Then we can refine it after commit without breaking any
contract.

IIRC, the clock structure was pushed to trunk several months ago
(CASSANDRA-1070 [1]). The "commit first, fix later" process is unavoidable
even without this feature. That's my two cents on it.

[1]: https://issues.apache.org/jira/browse/CASSANDRA-1070


>
> The people most familiar with this patch besides its authors are
> myself and Sylvain, and we have said (starting a month ago) that
> building on the Clock structure looks like the wrong approach.  That's
> a big change to the patch, so it's understandable that this is painful
> for the authors.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: [DISCUSSION] High-volume counters in Cassandra

2010-09-26 Thread Zhu Han
I propose a new way to solve the counter problem in CASSANDRA-1502[1].
Since I do not follow the jira updates very carefully, I paste it here to
let more people comment on it and see whether it is feasible.

"Seems like we have not found a solution acceptable to everybody. I try to
propose a new approach. Let's see whether anybody can shed some light on it
and make it a reality.

1) We add a basic data structure, called a counter, which is a special type
of super column.

2) The name of each column in the counter super column is the host name of
a cassandra node. And the value is the calculated result from that node.

3) WRITE PATH: Once a node receives the add/dec request for a counter, it
de-serializes its local counter super column and updates the column named
after itself atomically. After that, it propagates the updated column value
to the other replicas, just like how the mutation of a normal column is
propagated. Different consistency levels can be supported as before.

4) READ PATH: Depending on the consistency level, contact several replicas,
read back the counter super column as a whole, and get the latest counter
value by summing up all columns in the counter. Read-repair logic can work
as before.

IMHO, the biggest advantage of this approach is re-using as many mechanisms
already in the code as possible, so it might not be so disruptive. But
adding a new thrift API is inevitable."

NB: If it's feasible, I might not be the right person to work on it, as I
have not touched the internals of cassandra for more than a year. I want to
contribute something to help us reach consensus.

[1]
https://issues.apache.org/jira/browse/CASSANDRA-1502?focusedCommentId=12915103&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12915103
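
A minimal sketch of the proposed data structure (hypothetical names): one
partial value per node, writes touch only the local node's column, and reads
sum all the columns.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of the proposed counter super column: one column per node. */
public class CounterSuperColumn
{
    // column name = node host name, column value = that node's partial count
    private final Map<String, Long> partials = new ConcurrentHashMap<String, Long>();

    /** WRITE PATH: the receiving node atomically updates only its own column,
     *  then propagates that single column to the other replicas. */
    public void add(String localHost, long delta)
    {
        synchronized (partials)
        {
            Long current = partials.get(localHost);
            partials.put(localHost, (current == null ? 0 : current) + delta);
        }
    }

    /** READ PATH: the counter value is the sum of all per-node columns. */
    public long value()
    {
        long sum = 0;
        for (long partial : partials.values())
            sum += partial;
        return sum;
    }
}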

best regards,
hanzhu


On Sun, Sep 26, 2010 at 9:49 PM, Jonathan Ellis  wrote:

> you have misunderstood.  if we continue the 1072 approach of writing
> counter data to the clock field, this is necessarily incompatible with
> the right way of writing counter data to the value field.  it's no
> longer simply a matter of reversing 1070.
>
> On Sat, Sep 25, 2010 at 11:50 PM, Zhu Han  wrote:
> > Jonathan,
> >
> > This is a personal email.
> >
> > On Sun, Sep 26, 2010 at 1:27 PM, Jonathan Ellis 
> wrote:
> >>
> >> On Sat, Sep 25, 2010 at 8:57 PM, Zhu Han  wrote:
> >> > Can we just let the patch committed but mark it as "alpah" or
> >> > "experimental"?
> >>
> >> I explained exactly why that is not a good approach here:
> >> http://www.mail-archive.com/dev@cassandra.apache.org/msg00917.html
> >>
> > Yes, I see. But the clock structure has been in trunk since CASSANDRA-1070.
> > We still need to clean it out regardless, and we need somebody to
> > volunteer to take on this work. Considering the complexity of
> > CASSANDRA-1070, a programmer with in-depth knowledge of this patch is
> > preferable. And it will take some time to do it.
> >
> > Fortunately,  Johan Oskarsson has promised to take it in the comment of
> > Cassandra-1072[1]:
> >
> > "The clock changes would get into trunk quicker if we didn't, avoiding
> the
> > extra overhead of a big patch during reviews, merge with trunk, code
> updates
> > and publication of a new patch.
> > If the concern is that we won't attend to the clocks once this patch is
> in I
> > can promise that we'll look at it straight away. "
> >
> > And if twitter/digg/simplegeo fork their own tree of cassandra, this will
> > give a big marketing opportunity to the supporters of other NoSQL
> > systems. As you know, the competition is quite fierce currently.
> >
> > So, instead of sticking with this awkward situation, why not change to
> > another strategy:
> >
> >> "Fork another experimental tree from 0.7 beta 1 and accept
> >> Cassandra-1072.  At the same time, start the clean up work on this tree.
> >> Once it's finalized , merge them back to 0.7, no matter it's 0.7.1 or
> 0.7.2.
> >>
> >> Hence, these guys from twitter does not need to maintain a huge
> >> out-of-tree patch, while the quality impact of cassandra-1072 is still
> >> limited.
> >
> > I do know the pain of maintaining a large patch out of the official tree.
> > Once it gets in, everybody will feel much better.
> >
> > If you give this patch some opportunity, Johan or others can be highly
> > motivated because the whole community works together.  It's a compromise,
> > but it's worth it.
> >
> > [1]
> >
> https://issues.apache.org/jira/browse/CASSANDRA-1072?focusedCommentId=12909234&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12909234
> >
> >
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of Riptano, the source for professional Cassandra support
> >> http://riptano.com
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: [DISCUSSION] High-volume counters in Cassandra

2010-10-01 Thread Zhu Han
> They have however at least one advantage:
>  - your super columns are indexed, you don't have to deserialize them
>entirely each time.
>

The size of the counter super column is limited by how many replicas have
propagated values as the lead replica.  Its size is upper-bounded by the
number of replicas.  Even if we support hinted handoff of counters, the size
of the super column only equals the number of nodes in the cluster in the
worst case.

IMHO, it's not a big deal to de-serialize them entirely; they can fit into
memory very easily. Did I miss anything here?

One advantage of implementing counters as a new CF type:
-- you can add counter-specific configuration very easily.

best regards,
hanzhu

> To sum up, I can see the following drawbacks to such encoding:
>  - querying SC by names is less efficient.
>  - it takes more disk space (but that's the cheapest resource we have
> isn't it).
>
> I'd say these are fair compromises.
>
> --
> Sylvain
>


Re: [VOTE] 0.6.7 RC1

2010-11-05 Thread Zhu Han
Is the link to CHANGES.txt broken?

best regards,
hanzhu


On Fri, Nov 5, 2010 at 2:39 AM, Eric Evans  wrote:

>
> The list of changes[1] since 0.6.6 is fairly small, but there's been
> some interest in getting CASSANDRA-1656[2] (a HH bug) out to people.  I
> propose the following for release as 0.6.7.
>
> SVN:
> https://svn.apache.org/repos/asf/cassandra/branches/cassandra-...@r1031103
> 0.6.7 artifacts: 
> http://people.apache.org/~eevans
>
> The vote will be open for 72 hours.
>
> [1]: http://goo.gl/pGEx (CHANGES.txt)
> [2]: https://issues.apache.org/jira/browse/CASSANDRA-1656
> [3]: http://goo.gl/IQ3rR (NEWS.txt)
>
> --
> Eric Evans
> eev...@rackspace.com
>
>


Re: Very high memory utilization (not caused by mmap on sstables)

2010-12-15 Thread Zhu Han
After investigating it more deeply, I suspect it's a native memory leak in
the JVM. The large anonymous map in the lower address space should be the
native heap of the JVM, not the Java object heap.  Has anybody seen this
before?

I'll try to upgrade the JVM tonight.

best regards,
hanzhu


On Thu, Dec 16, 2010 at 10:50 AM, Zhu Han  wrote:

> Hi,
>
> I have a test node with apache-cassandra-0.6.8 on ubuntu 10.4.  The
> hardware environment is an OpenVZ container. JVM settings is
> # java -Xmx128m -version
> java version "1.6.0_18"
> OpenJDK Runtime Environment (IcedTea6 1.8.2) (6b18-1.8.2-4ubuntu2)
> OpenJDK 64-Bit Server VM (build 16.0-b13, mixed mode)
>
> This is the memory settings:
>
> "/usr/bin/java -ea -Xms1G -Xmx1G ..."
>
> And the ondisk footprint of sstables is very small:
>
> "#du -sh data/
>  "9.8Mdata/"
>
> The node was infrequently accessed in the last three weeks.  After that, I
> observed abnormal memory utilization in top:
>
>   PID USER  PR  NI  *VIRT*  *RES*  SHR S %CPU %MEMTIME+
> COMMAND
>
>  7836 root  15   0 *3300m* *2.4g*  13m S0 26.0   2:58.51
> java
>
> The jvm heap utilization is quite normal:
>
> #sudo jstat -gc -J"-Xmx128m" 7836
>  S0CS1CS0US1U  *EC*   *EU*  *OC**
> OU**PC   PU*  YGC  YGCT  FGCFGCT GCT
>
> 8512.0 8512.0 372.8   0.0   *68160.0*   *5225.7*   *963392.0   508200.7
> 30604.0 18373.4*4803.979  2  0.0053.984
>
> And then I try "pmap" to see the native memory mapping. *There is two
> large anonymous mmap regions.*
>
> 080dc000 1573568K rw---[ anon ]
> 2b2afc90  1079180K rw---[ anon ]
>
> The second one should be the JVM heap.  What is the first one?  Mmap of an
> sstable should never be an anonymous mmap, but a file-backed mmap.  *Is it a
> native memory leak?  *Does cassandra allocate any DirectByteBuffer?
>
> best regards,
> hanzhu
>
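
On the DirectByteBuffer question: on Java 7 and later, the JVM can be asked
directly how much direct (off-heap) and mapped buffer memory is in use. A
small sketch (this API is not available on the Java 6 build used above):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

/** Print direct and mapped buffer pool usage (requires Java 7+). */
public class DirectMemoryCheck
{
    public static void main(String[] args)
    {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
        {
            // "direct" covers DirectByteBuffer allocations; "mapped" covers file mmaps
            System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed(),
                    pool.getTotalCapacity());
        }
    }
}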


Re: Very high memory utilization (not caused by mmap on sstables)

2010-12-15 Thread Zhu Han
The test node is behind a firewall, so it took me some time to find a way to
get JMX diagnostic information from it.

What's interesting is that both the HeapMemoryUsage and NonHeapMemoryUsage
reported by the JVM are quite reasonable.  So it's a mystery why the JVM
process maps such a big anonymous memory region...

$ java -Xmx128m -jar /tmp/cmdline-jmxclient-0.10.3.jar - localhost:8080
java.lang:type=Memory HeapMemoryUsage
12/16/2010 15:07:45 +0800 org.archive.jmx.Client HeapMemoryUsage:
committed: 1065025536
init: 1073741824
max: 1065025536
used: 18295328

$java -Xmx128m -jar /tmp/cmdline-jmxclient-0.10.3.jar - localhost:8080
java.lang:type=Memory NonHeapMemoryUsage
12/16/2010 15:01:51 +0800 org.archive.jmx.Client NonHeapMemoryUsage:
committed: 34308096
init: 24313856
max: 226492416
used: 21475376

If anybody is interested in it, I can provide more diagnostic information
before I restart the instance.

best regards,
hanzhu


On Thu, Dec 16, 2010 at 1:00 PM, Zhu Han  wrote:

> After investigating it deeper,  I suspect it's native memory leak of JVM.
> The large anonymous map on lower address space should be the native heap of
> JVM,  but not java object heap.  Has anybody met it before?
>
> I'll try to upgrade the JVM tonight.
>
> best regards,
> hanzhu
>
>
>
> On Thu, Dec 16, 2010 at 10:50 AM, Zhu Han  wrote:
>
>> Hi,
>>
>> I have a test node with apache-cassandra-0.6.8 on ubuntu 10.4.  The
>> hardware environment is an OpenVZ container. JVM settings is
>> # java -Xmx128m -version
>> java version "1.6.0_18"
>> OpenJDK Runtime Environment (IcedTea6 1.8.2) (6b18-1.8.2-4ubuntu2)
>> OpenJDK 64-Bit Server VM (build 16.0-b13, mixed mode)
>>
>> This is the memory settings:
>>
>> "/usr/bin/java -ea -Xms1G -Xmx1G ..."
>>
>> And the ondisk footprint of sstables is very small:
>>
>> "#du -sh data/
>>  "9.8Mdata/"
>>
>> The node was infrequently accessed in the last  three weeks.  After that,
>> I observe the abnormal memory utilization by top:
>>
>>   PID USER  PR  NI  *VIRT*  *RES*  SHR S %CPU %MEMTIME+
>> COMMAND
>>
>>  7836 root  15   0 *3300m* *2.4g*  13m S0 26.0   2:58.51
>> java
>>
>> The jvm heap utilization is quite normal:
>>
>> #sudo jstat -gc -J"-Xmx128m" 7836
>>  S0CS1CS0US1U  *EC*   *EU*  *OC**
>> OU**PC   PU*  YGC  YGCT  FGCFGCT
>> GCT
>> 8512.0 8512.0 372.8   0.0   *68160.0*   *5225.7*   *963392.0   508200.7
>> 30604.0 18373.4*4803.979  2  0.0053.984
>>
>> And then I try "pmap" to see the native memory mapping. *There is two
>> large anonymous mmap regions.*
>>
>> 080dc000 1573568K rw---[ anon ]
>> 2b2afc90  1079180K rw---[ anon ]
>>
>> The second one should be JVM heap.  What is the first one?  Mmap of
>> sstable should never be anonymous mmap, but file based mmap.  *Is it  a
>> native memory leak?  *Does cassandra allocate any DirectByteBuffer?
>>
>> best regards,
>> hanzhu
>>
>
>


Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)

2010-12-15 Thread Zhu Han
Sorry for the spam again. :-)

I think I found the root cause. Here is a bug report[1] on a memory leak in
ParNewGC.  It is fixed in OpenJDK 1.6.0_20 (IcedTea6 1.9.2)[2].

So the suggestion is: whoever runs cassandra on Ubuntu 10.04, please
upgrade OpenJDK to the latest version.

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6824570
[2] http://blog.fuseyism.com/index.php/2010/09/10/icedtea6-19-released/

best regards,
hanzhu


On Thu, Dec 16, 2010 at 3:10 PM, Zhu Han  wrote:

> The test node is behind a firewall. So I took some time to find a way to
> get JMX diagnostic information from it.
>
> What's interesting is, both the HeapMemoryUsage and NonHeapMemoryUsage
> reported by JVM is quite reasonable.  So, it's a myth why the JVM process
> maps such a big anonymous memory region...
>
> $ java -Xmx128m -jar /tmp/cmdline-jmxclient-0.10.3.jar - localhost:8080
> java.lang:type=Memory HeapMemoryUsage
> 12/16/2010 15:07:45 +0800 org.archive.jmx.Client HeapMemoryUsage:
> committed: 1065025536
> init: 1073741824
> max: 1065025536
> used: 18295328
>
> $java -Xmx128m -jar /tmp/cmdline-jmxclient-0.10.3.jar - localhost:8080
> java.lang:type=Memory NonHeapMemoryUsage
> 12/16/2010 15:01:51 +0800 org.archive.jmx.Client NonHeapMemoryUsage:
> committed: 34308096
> init: 24313856
> max: 226492416
> used: 21475376
>
> If anybody is interested in it, I can provide more diagnostic information
> before I restart the instance.
>
> best regards,
> hanzhu
>
>
>
> On Thu, Dec 16, 2010 at 1:00 PM, Zhu Han  wrote:
>
>> After investigating it deeper,  I suspect it's native memory leak of JVM.
>> The large anonymous map on lower address space should be the native heap of
>> JVM,  but not java object heap.  Has anybody met it before?
>>
>> I'll try to upgrade the JVM tonight.
>>
>> best regards,
>> hanzhu
>>
>>
>>
>> On Thu, Dec 16, 2010 at 10:50 AM, Zhu Han  wrote:
>>
>>> Hi,
>>>
>>> I have a test node with apache-cassandra-0.6.8 on ubuntu 10.4.  The
>>> hardware environment is an OpenVZ container. JVM settings is
>>> # java -Xmx128m -version
>>> java version "1.6.0_18"
>>> OpenJDK Runtime Environment (IcedTea6 1.8.2) (6b18-1.8.2-4ubuntu2)
>>> OpenJDK 64-Bit Server VM (build 16.0-b13, mixed mode)
>>>
>>> This is the memory settings:
>>>
>>> "/usr/bin/java -ea -Xms1G -Xmx1G ..."
>>>
>>> And the ondisk footprint of sstables is very small:
>>>
>>> "#du -sh data/
>>>  "9.8Mdata/"
>>>
>>> The node was infrequently accessed in the last  three weeks.  After that,
>>> I observe the abnormal memory utilization by top:
>>>
>>>   PID USER  PR  NI  *VIRT*  *RES*  SHR S %CPU %MEMTIME+
>>> COMMAND
>>>
>>>  7836 root  15   0 *3300m* *2.4g*  13m S0 26.0   2:58.51
>>> java
>>>
>>> The jvm heap utilization is quite normal:
>>>
>>> #sudo jstat -gc -J"-Xmx128m" 7836
>>>  S0CS1CS0US1U  *EC*   *EU*  *OC*
>>> *OU**PC   PU*  YGC  YGCT  FGCFGCT
>>> GCT
>>> 8512.0 8512.0 372.8   0.0   *68160.0*   *5225.7*   *963392.0   508200.7
>>> 30604.0 18373.4*4803.979  2  0.0053.984
>>>
>>> And then I try "pmap" to see the native memory mapping. *There is two
>>> large anonymous mmap regions.*
>>>
>>> 080dc000 1573568K rw---[ anon ]
>>> 2b2afc90  1079180K rw---[ anon ]
>>>
>>> The second one should be JVM heap.  What is the first one?  Mmap of
>>> sstable should never be anonymous mmap, but file based mmap.  *Is it  a
>>> native memory leak?  *Does cassandra allocate any DirectByteBuffer?
>>>
>>> best regards,
>>> hanzhu
>>>
>>
>>
>


Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)

2010-12-16 Thread Zhu Han
I tried it, but it did not work for me this afternoon.

Thank you!

best regards,
hanzhu


On Thu, Dec 16, 2010 at 8:59 PM, Matthew Conway  wrote:

> Thanks for debugging this, I'm running into the same problem.
> BTW, if you can ssh into your nodes, you can use jconsole over ssh:
> http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html
>
> Matt
>
>
> On Dec 16, 2010, at Thu Dec 16, 2:39 AM, Zhu Han wrote:
>
> > Sorry for spam again. :-)
> >
> > I think I find the root cause. Here is a bug report[1] on memory leak of
> > ParNewGC.  It is solved by OpenJDK 1.6.0_20(IcedTea6 1.9.2)[2].
> >
> > So the suggestion is: for who runs cassandra  of Ubuntu 10.04, please
> > upgrade OpenJDK to the latest version.
> >
> > [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6824570
> > [2] http://blog.fuseyism.com/index.php/2010/09/10/icedtea6-19-released/
> >
> > best regards,
> > hanzhu
> >
> >
> > On Thu, Dec 16, 2010 at 3:10 PM, Zhu Han  wrote:
> >
> >> The test node is behind a firewall. So I took some time to find a way to
> >> get JMX diagnostic information from it.
> >>
> >> What's interesting is, both the HeapMemoryUsage and NonHeapMemoryUsage
> >> reported by JVM is quite reasonable.  So, it's a myth why the JVM
> process
> >> maps such a big anonymous memory region...
> >>
> >> $ java -Xmx128m -jar /tmp/cmdline-jmxclient-0.10.3.jar - localhost:8080
> >> java.lang:type=Memory HeapMemoryUsage
> >> 12/16/2010 15:07:45 +0800 org.archive.jmx.Client HeapMemoryUsage:
> >> committed: 1065025536
> >> init: 1073741824
> >> max: 1065025536
> >> used: 18295328
> >>
> >> $java -Xmx128m -jar /tmp/cmdline-jmxclient-0.10.3.jar - localhost:8080
> >> java.lang:type=Memory NonHeapMemoryUsage
> >> 12/16/2010 15:01:51 +0800 org.archive.jmx.Client NonHeapMemoryUsage:
> >> committed: 34308096
> >> init: 24313856
> >> max: 226492416
> >> used: 21475376
> >>
> >> If anybody is interested in it, I can provide more diagnostic
> information
> >> before I restart the instance.
> >>
> >> best regards,
> >> hanzhu
> >>
> >>
> >>
> >> On Thu, Dec 16, 2010 at 1:00 PM, Zhu Han  wrote:
> >>
> >>> After investigating it deeper,  I suspect it's native memory leak of
> JVM.
> >>> The large anonymous map on lower address space should be the native
> heap of
> >>> JVM,  but not java object heap.  Has anybody met it before?
> >>>
> >>> I'll try to upgrade the JVM tonight.
> >>>
> >>> best regards,
> >>> hanzhu
> >>>
> >>>
> >>>
> >>> On Thu, Dec 16, 2010 at 10:50 AM, Zhu Han 
> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I have a test node with apache-cassandra-0.6.8 on ubuntu 10.4.  The
> >>>> hardware environment is an OpenVZ container. JVM settings is
> >>>> # java -Xmx128m -version
> >>>> java version "1.6.0_18"
> >>>> OpenJDK Runtime Environment (IcedTea6 1.8.2) (6b18-1.8.2-4ubuntu2)
> >>>> OpenJDK 64-Bit Server VM (build 16.0-b13, mixed mode)
> >>>>
> >>>> This is the memory settings:
> >>>>
> >>>> "/usr/bin/java -ea -Xms1G -Xmx1G ..."
> >>>>
> >>>> And the ondisk footprint of sstables is very small:
> >>>>
> >>>> "#du -sh data/
> >>>> "9.8Mdata/"
> >>>>
> >>>> The node was infrequently accessed in the last  three weeks.  After
> that,
> >>>> I observe the abnormal memory utilization by top:
> >>>>
> >>>>  PID USER  PR  NI  *VIRT*  *RES*  SHR S %CPU %MEMTIME+
> >>>> COMMAND
> >>>>
> >>>> 7836 root  15   0 *3300m* *2.4g*  13m S0 26.0   2:58.51
> >>>> java
> >>>>
> >>>> The jvm heap utilization is quite normal:
> >>>>
> >>>> #sudo jstat -gc -J"-Xmx128m" 7836
> >>>> S0CS1CS0US1U  *EC*   *EU*  *OC*
> >>>> *OU**PC   PU*  YGC  YGCT  FGCFGCT
> >>>> GCT
> >>>> 8512.0 8512.0 372.8   0.0   *68160.0*   *5225.7*   *963392.0
> 508200.7
> >>>> 30604.0 18373.4*4803.979  2  0.0053.984
> >>>>
> >>>> And then I try "pmap" to see the native memory mapping. *There is two
> >>>> large anonymous mmap regions.*
> >>>>
> >>>> 080dc000 1573568K rw---[ anon ]
> >>>> 2b2afc90  1079180K rw---[ anon ]
> >>>>
> >>>> The second one should be JVM heap.  What is the first one?  Mmap of
> >>>> sstable should never be anonymous mmap, but file based mmap.  *Is it
>  a
> >>>> native memory leak?  *Does cassandra allocate any DirectByteBuffer?
> >>>>
> >>>> best regards,
> >>>> hanzhu
> >>>>
> >>>
> >>>
> >>
>
>


Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)

2010-12-17 Thread Zhu Han
Seems like the problem is still there after I upgraded to "OpenJDK Runtime
Environment (IcedTea6 1.9.2)", so it is not related to the bug I reported
two days ago.

Can somebody else share some info with us? What Java environment do you use?
Is it stable for long-lived cassandra instances?

best regards,
hanzhu


On Thu, Dec 16, 2010 at 9:28 PM, Zhu Han  wrote:

> I've tried it. But it does not work for me this afternoon.
>
> Thank you!
>
> best regards,
> hanzhu
>
>
>
> On Thu, Dec 16, 2010 at 8:59 PM, Matthew Conway wrote:
>
>> Thanks for debugging this, I'm running into the same problem.
>> BTW, if you can ssh into your nodes, you can use jconsole over ssh:
>> http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html
>>
>> Matt
>>
>>
>> On Dec 16, 2010, at Thu Dec 16, 2:39 AM, Zhu Han wrote:
>>
>> > Sorry for spam again. :-)
>> >
>> > I think I find the root cause. Here is a bug report[1] on memory leak of
>> > ParNewGC.  It is solved by OpenJDK 1.6.0_20(IcedTea6 1.9.2)[2].
>> >
>> > So the suggestion is: for who runs cassandra  of Ubuntu 10.04, please
>> > upgrade OpenJDK to the latest version.
>> >
>> > [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6824570
>> > [2] http://blog.fuseyism.com/index.php/2010/09/10/icedtea6-19-released/
>> >
>> > best regards,
>> > hanzhu
>> >
>> >
>> > On Thu, Dec 16, 2010 at 3:10 PM, Zhu Han  wrote:
>> >
>> >> The test node is behind a firewall. So I took some time to find a way
>> to
>> >> get JMX diagnostic information from it.
>> >>
>> >> What's interesting is, both the HeapMemoryUsage and NonHeapMemoryUsage
>> >> reported by JVM is quite reasonable.  So, it's a myth why the JVM
>> process
>> >> maps such a big anonymous memory region...
>> >>
>> >> $ java -Xmx128m -jar /tmp/cmdline-jmxclient-0.10.3.jar - localhost:8080
>> >> java.lang:type=Memory HeapMemoryUsage
>> >> 12/16/2010 15:07:45 +0800 org.archive.jmx.Client HeapMemoryUsage:
>> >> committed: 1065025536
>> >> init: 1073741824
>> >> max: 1065025536
>> >> used: 18295328
>> >>
>> >> $java -Xmx128m -jar /tmp/cmdline-jmxclient-0.10.3.jar - localhost:8080
>> >> java.lang:type=Memory NonHeapMemoryUsage
>> >> 12/16/2010 15:01:51 +0800 org.archive.jmx.Client NonHeapMemoryUsage:
>> >> committed: 34308096
>> >> init: 24313856
>> >> max: 226492416
>> >> used: 21475376
>> >>
>> >> If anybody is interested in it, I can provide more diagnostic
>> information
>> >> before I restart the instance.
>> >>
>> >> best regards,
>> >> hanzhu
>> >>
>> >>
>> >>
>> >> On Thu, Dec 16, 2010 at 1:00 PM, Zhu Han  wrote:
>> >>
>> >>> After investigating it deeper,  I suspect it's native memory leak of
>> JVM.
>> >>> The large anonymous map on lower address space should be the native
>> heap of
>> >>> JVM,  but not java object heap.  Has anybody met it before?
>> >>>
>> >>> I'll try to upgrade the JVM tonight.
>> >>>
>> >>> best regards,
>> >>> hanzhu
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Dec 16, 2010 at 10:50 AM, Zhu Han 
>> wrote:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> I have a test node with apache-cassandra-0.6.8 on ubuntu 10.4.  The
>> >>>> hardware environment is an OpenVZ container. JVM settings is
>> >>>> # java -Xmx128m -version
>> >>>> java version "1.6.0_18"
>> >>>> OpenJDK Runtime Environment (IcedTea6 1.8.2) (6b18-1.8.2-4ubuntu2)
>> >>>> OpenJDK 64-Bit Server VM (build 16.0-b13, mixed mode)
>> >>>>
>> >>>> This is the memory settings:
>> >>>>
>> >>>> "/usr/bin/java -ea -Xms1G -Xmx1G ..."
>> >>>>
>> >>>> And the ondisk footprint of sstables is very small:
>> >>>>
>> >>>> "#du -sh data/
>> >>>> "9.8Mdata/"
>> >>>>
>> >>>> The node was infrequently accessed in the last  three weeks.  After
>> that,
>> >>>> I observe the abnormal memory utilization by top:
>> >>>>
>> >>>>  PID USER  PR  NI  *VIRT*  *RES*  SHR S %CPU %MEMTIME+
>> >>>> COMMAND
>> >>>>
>> >>>> 7836 root  15   0 *3300m* *2.4g*  13m S0 26.0   2:58.51
>> >>>> java
>> >>>>
>> >>>> The jvm heap utilization is quite normal:
>> >>>>
>> >>>> #sudo jstat -gc -J"-Xmx128m" 7836
>> >>>> S0CS1CS0US1U  *EC*   *EU*  *OC*
>> >>>> *OU**PC   PU*  YGC  YGCT  FGCFGCT
>> >>>> GCT
>> >>>> 8512.0 8512.0 372.8   0.0   *68160.0*   *5225.7*   *963392.0
>> 508200.7
>> >>>> 30604.0 18373.4*4803.979  2  0.0053.984
>> >>>>
>> >>>> And then I try "pmap" to see the native memory mapping. *There is two
>> >>>> large anonymous mmap regions.*
>> >>>>
>> >>>> 080dc000 1573568K rw---[ anon ]
>> >>>> 2b2afc90  1079180K rw---[ anon ]
>> >>>>
>> >>>> The second one should be JVM heap.  What is the first one?  Mmap of
>> >>>> sstable should never be anonymous mmap, but file based mmap.  *Is it
>>  a
>> >>>> native memory leak?  *Does cassandra allocate any DirectByteBuffer?
>> >>>>
>> >>>> best regards,
>> >>>> hanzhu
>> >>>>
>> >>>
>> >>>
>> >>
>>
>>
>


Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)

2010-12-18 Thread Zhu Han
The problem still seems to be in the C-heap of the JVM, which leaks about
70MB every day. Here is the summary:

on 12/19: 010c3000 178548K rw---[ anon ]
on 12/18: 010c3000 110320K rw---[ anon ]
on 12/17: 010c3000  39256K rw---[ anon ]

This should not be the JVM object heap, because the object heap size is
fixed per the JVM settings below. Here is the map of the JVM object heap,
which remains constant:

010c3000  39256K rw---[ anon ]

I'll post it to the OpenJDK mailing list to seek help.

> Zhu,
> Couple of quick questions:
>  How many threads are in your JVM?
>

There are hundreds of threads. Here are the relevant Cassandra settings:
1) <ConcurrentReads>8</ConcurrentReads>
   <ConcurrentWrites>128</ConcurrentWrites>

The thread stack size on this server is 1MB, so I observe hundreds of
individual 1MB mmap segments.

>  Can you also post the full commandline as well?
>
Sure. All of them are default settings.

/usr/bin/java -ea -Xms1G -Xmx1G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8080
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dstorage-config=bin/../conf -cp
bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.8.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar
org.apache.cassandra.thrift.CassandraDaemon


>  Also, output of cat /proc/meminfo
>

This is an OpenVZ-based testing environment, so /proc/meminfo is not very
helpful. Whatever, I paste it here anyway.


MemTotal:  9838380 kB
MemFree:   4005900 kB
Buffers: 0 kB
Cached:  0 kB
SwapCached:  0 kB
Active:  0 kB
Inactive:0 kB
HighTotal:   0 kB
HighFree:0 kB
LowTotal:  9838380 kB
LowFree:   4005900 kB
SwapTotal:   0 kB
SwapFree:0 kB
Dirty:   0 kB
Writeback:   0 kB
AnonPages:   0 kB
Mapped:  0 kB
Slab:0 kB
PageTables:  0 kB
NFS_Unstable:0 kB
Bounce:  0 kB
CommitLimit: 0 kB
Committed_AS:0 kB
VmallocTotal:0 kB
VmallocUsed: 0 kB
VmallocChunk:0 kB
HugePages_Total: 0
HugePages_Free:  0
HugePages_Rsvd:  0
Hugepagesize: 2048 kB


> thanks,
> Sri
>
> On Fri, Dec 17, 2010 at 7:15 PM, Zhu Han  wrote:
>
> > Seems like  the problem there after I upgrade to "OpenJDK Runtime
> > Environment (IcedTea6 1.9.2)". So it is not related to the bug I reported
> > two days ago.
> >
> > Can somebody else share some info with us? What's the java environment
> you
> > used? Is it stable for long-lived cassandra instances?
> >
> > best regards,
> > hanzhu
> >
> >
> > On Thu, Dec 16, 2010 at 9:28 PM, Zhu Han  wrote:
> >
> > > I've tried it. But it does not work for me this afternoon.
> > >
> > > Thank you!
> > >
> > > best regards,
> > > hanzhu
> > >
> > >
> > >
> > > On Thu, Dec 16, 2010 at 8:59 PM, Matthew Conway  > >wrote:
> > >
> > >> Thanks for debugging this, I'm running into the same problem.
> > >> BTW, if you can ssh into your nodes, you can use jconsole over ssh:
> > >> http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html
> > >>
> > >> Matt
> > >>
> > >>
> > >> On Dec 16, 2010, at Thu Dec 16, 2:39 AM, Zhu Han wrote:
> > >>
> > >> > Sorry for spam again. :-)
> > >> >
> > >> > I think I find the root cause. Here is a bug report[1] on memory
> leak
> > of
> > >> > ParNewGC.  It is solved by OpenJDK 1.6.0_20(IcedTea6 1.9.2)[2].
> > >> >
> > >> > So the suggestion is: for who runs cassandra  of Ubuntu 10.04,
> please
> > >> > upgrade OpenJDK to the latest version.
> > >> >
> > >> > [1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6824570
> > >> > [2]
> > http://blog.fuseyism.com/index.php/2010/09/10/icedtea6-19

Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)

2010-12-18 Thread Zhu Han
There is a typo in my previous mail, sorry...

best regards,
hanzhu


On Sun, Dec 19, 2010 at 10:29 AM, Zhu Han  wrote:

> The problem seems still like the C-heap of JVM, which leaks 70MB every day.
> Here is the summary:
>
> on 12/19: 010c3000 178548K rw---[ anon ]
> on 12/18: 010c3000 110320K rw---[ anon ]
> on 12/17: 010c3000  39256K rw---[ anon ]
>
> This should not be the JVM object heap, because the object heap size is
> fixed up per the below JVM settings. Here is the map of JVM object heap,
> which remains constant.
>
> 010c3000  39256K rw---[ anon ]
>

It should be :
2b58433c 1069824K rw---[ anon ]


>
> I'll paste it to open-jdk mailist to seek for help.
>
> Zhu,
>> Couple of quick questions:
>>  How many threads are in your JVM?
>>
>
> There are hundreds of threads. Here is the settings of Cassandra:
> 1)  *8
>   128*
>
> The thread stack size on this server is 1MB. So I observe hundreds of
> single mmap segment as 1MB.
>
>  Can you also post the full commandline as well?
>>
> Sure. All of them are default settings.
>
> /usr/bin/java -ea -Xms1G -Xmx1G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8080
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dstorage-config=bin/../conf -cp
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.8.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar
> org.apache.cassandra.thrift.CassandraDaemon
>
>
>>  Also, output of cat /proc/meminfo
>>
>
> This is an openvz based testing environment. So /proc/meminfo is not very
> helpful. Whatever, I paste it here.
>
>
> MemTotal:  9838380 kB
> MemFree:   4005900 kB
> Buffers: 0 kB
> Cached:  0 kB
> SwapCached:  0 kB
> Active:  0 kB
> Inactive:0 kB
> HighTotal:   0 kB
> HighFree:0 kB
> LowTotal:  9838380 kB
> LowFree:   4005900 kB
> SwapTotal:   0 kB
> SwapFree:0 kB
> Dirty:   0 kB
> Writeback:   0 kB
> AnonPages:   0 kB
> Mapped:  0 kB
> Slab:0 kB
> PageTables:  0 kB
> NFS_Unstable:0 kB
> Bounce:  0 kB
> CommitLimit: 0 kB
> Committed_AS:0 kB
> VmallocTotal:0 kB
> VmallocUsed: 0 kB
> VmallocChunk:0 kB
> HugePages_Total: 0
> HugePages_Free:  0
> HugePages_Rsvd:  0
> Hugepagesize: 2048 kB
>
>
>> thanks,
>> Sri
>>
>> On Fri, Dec 17, 2010 at 7:15 PM, Zhu Han  wrote:
>>
>> > Seems like  the problem there after I upgrade to "OpenJDK Runtime
>> > Environment (IcedTea6 1.9.2)". So it is not related to the bug I
>> reported
>> > two days ago.
>> >
>> > Can somebody else share some info with us? What's the java environment
>> you
>> > used? Is it stable for long-lived cassandra instances?
>> >
>> > best regards,
>> > hanzhu
>> >
>> >
>> > On Thu, Dec 16, 2010 at 9:28 PM, Zhu Han  wrote:
>> >
>> > > I've tried it. But it does not work for me this afternoon.
>> > >
>> > > Thank you!
>> > >
>> > > best regards,
>> > > hanzhu
>> > >
>> > >
>> > >
>> > > On Thu, Dec 16, 2010 at 8:59 PM, Matthew Conway > > >wrote:
>> > >
>> > >> Thanks for debugging this, I'm running into the same problem.
>> > >> BTW, if you can ssh into your nodes, you can use jconsole over ssh:
>> > >> http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html
>> > >>
>> > >> Matt
>> > >>
>> > >>
>> > >> On Dec 16, 2010, at Thu Dec 16, 2:39 AM, Zhu 

Re: [SOLVED] Very high memory utilization (not caused by mmap on sstables)

2010-12-20 Thread Zhu Han
Can anybody recommend a stable enough JDK environment for the 0.6.x branch
on Ubuntu Server?

Thank you!

best regards,
hanzhu


On Sun, Dec 19, 2010 at 10:29 AM, Zhu Han  wrote:

> The problem seems still like the C-heap of JVM, which leaks 70MB every day.
> Here is the summary:
>
> on 12/19: 010c3000 178548K rw---[ anon ]
> on 12/18: 010c3000 110320K rw---[ anon ]
> on 12/17: 010c3000  39256K rw---[ anon ]
>
> This should not be the JVM object heap, because the object heap size is
> fixed up per the below JVM settings. Here is the map of JVM object heap,
> which remains constant.
>
> 010c3000  39256K rw---[ anon ]
>
> I'll paste it to open-jdk mailist to seek for help.
>
> Zhu,
>> Couple of quick questions:
>>  How many threads are in your JVM?
>>
>
> There are hundreds of threads. Here is the settings of Cassandra:
> 1)  *8
>   128*
>
> The thread stack size on this server is 1MB. So I observe hundreds of
> single mmap segment as 1MB.
>
>  Can you also post the full commandline as well?
>>
> Sure. All of them are default settings.
>
> /usr/bin/java -ea -Xms1G -Xmx1G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8080
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dstorage-config=bin/../conf -cp
> bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.8.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar
> org.apache.cassandra.thrift.CassandraDaemon
>
>
>>  Also, output of cat /proc/meminfo
>>
>
> This is an openvz based testing environment. So /proc/meminfo is not very
> helpful. Whatever, I paste it here.
>
>
> MemTotal:  9838380 kB
> MemFree:   4005900 kB
> Buffers: 0 kB
> Cached:  0 kB
> SwapCached:  0 kB
> Active:  0 kB
> Inactive:0 kB
> HighTotal:   0 kB
> HighFree:0 kB
> LowTotal:  9838380 kB
> LowFree:   4005900 kB
> SwapTotal:   0 kB
> SwapFree:0 kB
> Dirty:   0 kB
> Writeback:   0 kB
> AnonPages:   0 kB
> Mapped:  0 kB
> Slab:0 kB
> PageTables:  0 kB
> NFS_Unstable:0 kB
> Bounce:  0 kB
> CommitLimit: 0 kB
> Committed_AS:0 kB
> VmallocTotal:0 kB
> VmallocUsed: 0 kB
> VmallocChunk:0 kB
> HugePages_Total: 0
> HugePages_Free:  0
> HugePages_Rsvd:  0
> Hugepagesize: 2048 kB
>
>
>> thanks,
>> Sri
>>
>> On Fri, Dec 17, 2010 at 7:15 PM, Zhu Han  wrote:
>>
>> > Seems like the problem is still there after I upgraded to "OpenJDK Runtime
>> > Environment (IcedTea6 1.9.2)", so it is not related to the bug I reported
>> > two days ago.
>> >
>> > Can somebody else share some info with us? What Java environment do you
>> > use? Is it stable for long-lived Cassandra instances?
>> >
>> > best regards,
>> > hanzhu
>> >
>> >
>> > On Thu, Dec 16, 2010 at 9:28 PM, Zhu Han  wrote:
>> >
>> > > I tried it, but it did not work for me this afternoon.
>> > >
>> > > Thank you!
>> > >
>> > > best regards,
>> > > hanzhu
>> > >
>> > >
>> > >
>> > > On Thu, Dec 16, 2010 at 8:59 PM, Matthew Conway wrote:
>> > >
>> > >> Thanks for debugging this, I'm running into the same problem.
>> > >> BTW, if you can ssh into your nodes, you can use jconsole over ssh:
>> > >> http://simplygenius.com/2010/08/jconsole-via-socks-ssh-tunnel.html
>> > >>
>> > >> Matt
>> > >>
>> > >>
>> > >> On Dec 16, 2010, at Thu Dec 16, 2:3
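As an aside, the hundreds of 1MB anonymous segments described in this
thread are consistent with hundreds of threads each reserving a default
1MB stack (-Xss1m). A minimal sketch, ours and not from the thread, to
sanity-check that from inside a JVM:

    // Sketch: estimate native memory reserved by thread stacks.
    // Assumes the 1MB default stack size reported in the thread; this
    // does not account for other C-heap usage such as malloc arenas.
    public class StackMemoryEstimate {
        public static void main(String[] args) {
            int threads = Thread.getAllStackTraces().size();
            System.out.println(threads + " threads, roughly "
                    + threads + " MB reserved for stacks");
        }
    }

This only measures stack reservations; the growing segment Zhu reports
would come from the native heap itself, which is why it had to be chased
with pmap rather than JVM heap tools.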

Re: Distributed counters are in trunk

2010-12-21 Thread Zhu Han
Thank you all for this work.

Is there any plan for CASSANDRA-1546 [1]? Will it be merged as an
alternative, or does the current patch embrace 1546?

[1] https://issues.apache.org/jira/browse/CASSANDRA-1546

best regards,
hanzhu


On Wed, Dec 22, 2010 at 10:12 AM, Jonathan Ellis  wrote:

> Thanks to Kelvin, Johan, Ryan, Sylvain, Chris, and everyone else for their
> hard work on this!
>
> For mere mortals: http://wiki.apache.org/cassandra/Counters.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: How is Cassandra being used?

2011-11-16 Thread Zhu Han
On Wed, Nov 16, 2011 at 3:03 PM, Norman Maurer  wrote:

> 2011/11/16 Jonathan Ellis :
> > I started a "users survey" thread over on the users list (replies are
> > still trickling in), but as useful as that is, I'd like to get
> > feedback that is more quantitative and with a broader base.  This will
> > let us prioritize our development efforts to better address what
> > people are actually using it for, with less guesswork.  For instance:
> > we put a lot of effort into compression for 1.0.0; if it turned out
> > that only 1% of 1.0.x users actually enable compression, then it means
> > that we should spend less effort fine-tuning that moving forward, and
> > use the energy elsewhere.
> >
> > (Of course it could also mean that we did a terrible job getting the
> > word out about new features and explaining how to use them, but either
> > way, it would be good to know!)
> >
> > I propose adding a basic cluster reporting feature to cassandra.yaml,
> > enabled by default.  It would send anonymous information about your
> > cluster to an apache.org VM.  Information like, number (but not names)
> > of keyspaces and columnfamilies, ks-level options like compression, cf
> > options like compaction strategy, data types (again, not names) of
> > columns, average row size (or better: the histogram data), and average
> > sstables per read.
> >
> > Thoughts?
>

-1.

It may scare some admins who store sensitive data in Cassandra. Even if it
can be disabled, we cannot sleep well at night when we know the door could
be opened unintentionally...


> Hi there,
>
> I'm not a Cassandra dev but a user of it. I would really "hate" to
> see such code in the Cassandra code-base. I understand that it would
> be kind of useful to get a better feeling about usage etc., but it's
> really something that scares the shit out of many managers (and even
> devs ;) ).
>
> So -1 to add this code (*non-binding)
>
> Bye,
> Norman
>
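For concreteness, the anonymous payload Jonathan sketches above (counts,
options, and histogram data, but no names) might look like the following.
Every field here is our illustration of the proposal, not an actual
Cassandra feature:

    // Hypothetical sketch of an anonymous cluster report: only counts,
    // option values and histogram data; no keyspace or column family names.
    public class ClusterReport {
        int keyspaceCount;
        int columnFamilyCount;
        boolean compressionEnabled;
        String compactionStrategy;      // class name only
        long[] rowSizeHistogramBuckets; // bucket counts, no keys or values
        double avgSstablesPerRead;
    }

Even a payload this spare is what the -1 votes are reacting to: the concern
is the reporting channel existing at all, not the specific fields.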


Re: RFC: Cassandra Virtual Nodes

2012-03-17 Thread Zhu Han
On Sat, Mar 17, 2012 at 7:38 AM, Sam Overton  wrote:

> Hello cassandra-dev,
>
> This is a long email. It concerns a significant change to Cassandra, so
> deserves a thorough introduction.
>
> *The summary is*: we believe virtual nodes are the way forward. We would
> like to add virtual nodes to Cassandra and we are asking for comments,
> criticism and collaboration!
>
> Cassandra's current partitioning scheme is sub-optimal for bootstrap,
> decommission, repair and re-balance operations, and places the burden on
> users to properly calculate tokens (a common cause of mistakes), which is a
> recurring pain-point.
>
> Virtual nodes have a variety of benefits over the one-to-one mapping of
> host to key range which Cassandra currently supports.
>
> Among these benefits are:
>
> * Even load balancing when growing and shrinking the cluster
> A virtual node scheme ensures that all hosts in a cluster have an even
> portion of the total data, and a new node bootstrapped into the cluster
> will assume its share of the data. Doubling or halving the cluster to
> ensure even load distribution would no longer be necessary.
>
> * Distributed rebuild
> When sizing a cluster, one of the considerations is the amount of time
> required to recover from a failed node. This is the exposure time, during
> which a secondary failure could cause data loss. In order to guarantee an
> upper bound on the exposure time, the amount of data which can be stored on
> each host is limited by the amount of time taken to recover the required
> replica count. At Acunu we have found that the exposure time is frequently
> the limiting factor which dictates the maximum allowed node size in
> customers' clusters.
>
> Using a virtual node scheme, the data stored on one host is not replicated
> on just RF-1 other physical hosts. Each virtual node is replicated to RF-1
> other virtual nodes which may be on a different set of physical hosts to
> replicas of other virtual nodes stored on the same host. This means data
> for one host is replicated evenly across the entire cluster.
>
> In the event of a failure then, restoring the replica count can be done in
> a fully distributed way. Each host in the cluster participates in the
> rebuild, drastically reducing the exposure time, allowing more data to be
> stored on a single host while still maintaining an acceptable upper bound
> on the likelihood of secondary failure. This reduces TCO concerns.
>
> * Greater failure tolerance in streaming
> Operations which require streaming of a large range of data, eg. bootstrap,
> decommission, repair, etc. incur a heavy cost if an error (eg. dropped
> network connection) is encountered during the streaming. Currently the
> whole range must be re-streamed, and this could constitute a very large
> amount of data. Virtual nodes reduce the impact of streaming failures,
> since each virtual node is a much smaller range of the key-space, so
> re-streaming a whole virtual node is a much cheaper process.
>
> * Evenly distributed impact of streaming operations
> Streaming operations such as bootstrap, repair, et al. would involve every
> node in the cluster. This would distribute the load of these operations
> across the whole cluster, and could be staggered so that only a small
> subset of nodes were affected at once, similar to staggered repair[1].
>
> * Possibility for active load balancing
> Load balancing in Cassandra currently involves moving a token to
> increase/reduce the amount of key-space for which a host is responsible.
> This only allows load balancing between neighbouring nodes, so it could
> involve moving more than one token just to redistribute a single overloaded
> node. Virtual nodes could allow load balancing on a much finer granularity,
> so heavily loaded portions of the key-space could be redistributed to
> lighter-loaded hosts by reassigning one or more virtual nodes.
>
>
> Implementing a virtual node scheme in Cassandra is not an insignificant
> amount of work, and it will touch a large amount of the codebase related to
> partitioning, placement, routing, gossip, and so on. We do believe that
> this is possible to do incrementally, and in such a way that there is an
> easy upgrade path for pre-virtual-node deployments.
>
> It would not however touch the storage layer. The virtual node concept is
> solely for partitioning and placement, not for segregating the data storage
> of the host, so all keys for all virtual nodes on a host would be stored in
> the same SSTables.
>
> We are not proposing the adoption of the same scheme used by Voldemort[2]
> and described in the Dynamo paper[3]. We feel this scheme is too different
> from Cassandra's current distribution model to be a viable target for
> incremental development. Their scheme also fixes the number of virtual
> nodes for the lifetime of the cluster, which can prove to be a ceiling to
> scaling the cluster if the virtual nodes grow too large.
>
> The proposed design is:
> * Assign each host T
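The message is truncated above at "Assign each host T". Our reading of the
gist is T randomly chosen tokens per host; a toy sketch under that
assumption (not the actual proposal's code):

    import java.math.BigInteger;
    import java.security.SecureRandom;
    import java.util.ArrayList;
    import java.util.List;

    // Toy sketch: give each host T random tokens in RandomPartitioner's
    // [0, 2^127) space, so each host owns T small slices of the ring.
    public class VnodeTokens {
        public static List<BigInteger> randomTokens(int t) {
            SecureRandom rnd = new SecureRandom();
            List<BigInteger> tokens = new ArrayList<BigInteger>(t);
            for (int i = 0; i < t; i++) {
                tokens.add(new BigInteger(127, rnd)); // uniform in [0, 2^127)
            }
            return tokens;
        }
    }

With T in the hundreds, the law of large numbers evens out per-host load,
which underlies most of the benefits listed above.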

Re: RFC: Cassandra Virtual Nodes

2012-03-21 Thread Zhu Han
On Tue, Mar 20, 2012 at 11:24 PM, Jeremiah Jordan <
jeremiah.jor...@morningstar.com> wrote:

> So taking a step back, if we want "vnodes" why can't we just give every
> node 100 tokens instead of only one?  Seems to me this would have less
> impact on the rest of the code.  It would just look like you had a 500 node
> cluster, instead of a 5 node cluster.  Your replication strategy would have
> to know about the physical machines so that data gets replicated right, but
> there is already some concept of this with the data center aware and rack
> aware stuff.
>
> From what I see I think you could get most of the benefits of vnodes by
> implementing a new Placement Strategy that did something like this, and you
> wouldn't have to touch (and maybe break) code in other places.
>
> Am I crazy? Naive?
>
> Once you had this setup, you could start to implement the vnode-like stuff
> on top of it, like bootstrapping nodes one token at a time and taking
> tokens on from the whole cluster, not just your neighbor.
>

I second it.

Are there any goals we missed which cannot be achieved by assigning
multiple tokens to a single node?
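As a rough illustration of Jeremiah's point that the replication strategy
would have to know about the physical machines, placement could walk the
ring of virtual nodes and skip candidates whose physical host already
holds a replica. All names below are hypothetical:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: choose RF replicas by walking the token ring,
    // skipping virtual nodes whose physical host was already picked.
    public class MultiTokenPlacement {
        static class VNode {
            final String physicalHost;
            VNode(String physicalHost) { this.physicalHost = physicalHost; }
        }

        static List<String> replicasFor(List<VNode> ring, int startIdx, int rf) {
            List<String> hosts = new ArrayList<String>();
            for (int i = 0; hosts.size() < rf && i < ring.size(); i++) {
                String host = ring.get((startIdx + i) % ring.size()).physicalHost;
                if (!hosts.contains(host)) {
                    hosts.add(host);
                }
            }
            return hosts;
        }
    }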


>
> -Jeremiah Jordan
>
> 
> From: Rick Branson [rbran...@datastax.com]
> Sent: Monday, March 19, 2012 5:16 PM
> To: dev@cassandra.apache.org
> Subject: Re: RFC: Cassandra Virtual Nodes
>
> I think if we could go back and rebuild Cassandra from scratch, vnodes
> would likely be implemented from the beginning. However, I'm concerned that
> implementing them now could be a big distraction from more productive uses
> of all of our time and introduce major potential stability issues into what
> is becoming a business critical piece of infrastructure for many people.
> Instead of just complaining and pedantry, though, I'd like to offer a
> feasible alternative:
>
> Has there been consideration given to the idea of supporting a single
> token range for a node?
>
> While not theoretically as capable as vnodes, it seems to me to be more
> practical as it would have a significantly lower impact on the codebase and
> provides a much clearer migration path. It also seems to solve a majority
> of complaints regarding operational issues with Cassandra clusters.
>
> Each node would have a lower and an upper token, which would form a range
> that would be actively distributed via gossip. Read and replication
> requests would only be routed to a replica when the key of these operations
> matched the replica's token range in the gossip tables. Each node would
> locally store its own current active token range as well as a target token
> range it's "moving" towards.
>
> As a new node undergoes bootstrap, the bounds would be gradually expanded
> to allow it to handle requests for a wider range of the keyspace as it
> moves towards its target token range. This idea boils down to a move from
> hard cutovers to smoother operations by gradually adjusting active token
> ranges over a period of time. It would apply to token change operations
> (nodetool 'move' and 'removetoken') as well.
>
> Failure during streaming could be recovered at the bounds instead of
> restarting the whole process as the active bounds would effectively track
> the progress for bootstrap & target token changes. Implicitly these
> operations would be throttled to some degree. Node repair (AES) could also
> be modified using the same overall ideas to provide a more gradual impact on
> the cluster overall similar as the ideas given in CASSANDRA-3721.
>
> While this doesn't spread the load over the cluster for these operations
> evenly like vnodes does, this is likely an issue that could be worked
> around by performing concurrent (throttled) bootstrap & node repair (AES)
> operations. It does allow some kind of "active" load balancing, but clearly
> this is not as flexible or as useful as vnodes, but you should be using
> RandomPartitioner or sort-of-randomized keys with OPP right? ;)
>
> As a side note: vnodes fail to provide solutions to node-based limitations
> that seem to me to cause a substantial portion of operational issues such
> as impact of node restarts / upgrades, GC and compaction induced latency. I
> think some progress could be made here by allowing a "pack" of independent
> Cassandra nodes to be ran on a single host; somewhat (but nowhere near
> entirely) similar to a pre-fork model used by some UNIX-based servers.
>
> Input?
>
> --
> Rick Branson
> DataStax
>
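A toy rendering of Rick's "gradually expanding bounds" idea, with entirely
hypothetical names (this is not Cassandra code):

    import java.math.BigInteger;

    // Sketch: a bootstrapping node serves a contiguous token range and
    // widens it stepwise toward a target range as streaming completes.
    // (Ring wrap-around is ignored for brevity.)
    public class ActiveRange {
        private BigInteger lower, upper;             // currently served
        private final BigInteger targetLower, targetUpper;

        ActiveRange(BigInteger lower, BigInteger upper,
                    BigInteger targetLower, BigInteger targetUpper) {
            this.lower = lower; this.upper = upper;
            this.targetLower = targetLower; this.targetUpper = targetUpper;
        }

        boolean owns(BigInteger token) {
            return token.compareTo(lower) >= 0 && token.compareTo(upper) < 0;
        }

        // Called as streaming finishes for the next slice of the keyspace.
        void expand(BigInteger step) {
            lower = lower.subtract(step).max(targetLower);
            upper = upper.add(step).min(targetUpper);
        }
    }

A failure during streaming then costs only the slice in flight, since the
active bounds record how far the bootstrap has progressed.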


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Thu, Mar 22, 2012 at 6:20 PM, Richard Low  wrote:

> On 22 March 2012 05:48, Zhu Han  wrote:
>
> > I second it.
> >
> > Are there any goals we missed which cannot be achieved by assigning
> > multiple tokens to a single node?
>
> This is exactly the proposed solution.  The discussion is about how to
> implement this, and the methods of choosing tokens and replication
> strategy.
>

Does the new scheme still require the node to iterate over all sstables to
build the merkle tree, or to stream data for partition-level
repair and move?

The disk IO triggered by the above steps could be very time-consuming if the
dataset on a single node is very large. It could be much more costly than
the network IO, especially when concurrent repair tasks hit the same node.

Are there any good ideas on this?


> Richard.
>
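To make the cost concrete: a range-limited validation pass hashes only the
rows whose token falls in the vnode's range, roughly 1/T of the keys for T
vnodes, but it may still have to walk every sstable to find them. A toy
sketch, ours, with hypothetical types (real repair builds a merkle tree,
not a single digest):

    import java.math.BigInteger;
    import java.security.MessageDigest;

    // Toy sketch of range-limited validation for repair: hash only rows
    // inside [lower, upper). Row and the scan source are hypothetical.
    public class RangeValidation {
        interface Row {
            BigInteger token();
            byte[] digest();
        }

        static void validate(Iterable<Row> sstableScan, BigInteger lower,
                             BigInteger upper, MessageDigest tree) {
            for (Row row : sstableScan) {
                BigInteger t = row.token();
                if (t.compareTo(lower) >= 0 && t.compareTo(upper) < 0) {
                    tree.update(row.digest()); // ~1/T of rows for T vnodes
                }
            }
        }
    }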


Re: RFC: Cassandra Virtual Nodes

2012-03-22 Thread Zhu Han
On Fri, Mar 23, 2012 at 6:54 AM, Peter Schuller  wrote:

> > You would have to iterate through all sstables on the system to repair
> > one vnode, yes: but building the tree for just one range of the data
> > means that huge portions of the sstable files can be skipped. It should
> > scale down linearly as the number of vnodes increases (ie, with 100
> > vnodes, it will take 1/100th the time to repair one vnode).
>

The SSTable indices would still need to be scanned under size-tiered
compaction. Am I missing anything here?


> The story is less good for "nodetool cleanup" however, which still has
> to truck over the entire dataset.
>
> (The partitions/buckets in my crush-inspired scheme address this by
> allowing that each ring segment, in vnode terminology, be stored
> separately in the file system.)
>

But the number of files can be a big problem if there are hundreds of
vnodes and millions of sstables on the same physical node.

We need a way to pin sstable inodes in memory. Otherwise, the average
number of disk IOs needed to access a row in an sstable could be five
or more.


>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>


Re: RFC: Cassandra Virtual Nodes

2012-03-23 Thread Zhu Han
On Sat, Mar 24, 2012 at 7:55 AM, Peter Schuller  wrote:

> > No I don't think you did; in fact, depending on the size of your SSTable,
> > a contiguous range (or the entire SSTable) may or may not be affected by a
> > cleanup/move or any type of topology change. There is lots of room for
> > optimization here. After loading the indexes we actually know start/end
> > range for an SSTable so we can include/exclude it in any such operation
>
>
> Just note that unless there is some correlation between range and
> these sstables being created to begin with (like with leveled), you're
> highly unlikely to be able to optimize here. For uniformly distributed
> tokens (hashed keys), all sstables are likely to have almost the
> entire possible token range in them.
>

As Peter pointed out, with the random partitioner the rows of a specific
range may be scattered across all sstables.

Unless a whole sstable can be skipped, disk seeks are the performance killer
here.




> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>
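For concreteness, the whole-sstable skip test is just an interval
intersection on the sstable's minimum and maximum tokens (field names
below are hypothetical):

    import java.math.BigInteger;

    // Sketch: an sstable can be skipped for a repair of [rangeLo, rangeHi)
    // only when its [minToken, maxToken] span misses the range entirely.
    public class SstableSkipCheck {
        static boolean intersects(BigInteger minToken, BigInteger maxToken,
                                  BigInteger rangeLo, BigInteger rangeHi) {
            return maxToken.compareTo(rangeLo) >= 0
                && minToken.compareTo(rangeHi) < 0;
        }
    }

With the random partitioner nearly every sstable spans nearly the whole
ring, so this check almost never fires, which is exactly the point made
above; a compaction strategy that correlates sstables with ranges (like
leveled) would be needed to make it useful.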


Re: Re: how to upgrade my cassandra from SizeTieredCompaction to LeveledCompaction

2012-05-13 Thread Zhu Han
On Mon, May 14, 2012 at 10:34 AM, zhangcheng  wrote:

> thanks, Edward.
>
> In my test, when I changed to the leveled strategy, the compaction couldn't
> finish because we have 700G of new data every day!
>
> What can I do if I want to save on compaction space?
>

For such a big dataset, you may trigger manual compaction periodically, as
long as there are not many deletions.

Putting several TB of data on a single node requires a lot of computing and
IO power to serve read requests and compaction.


>
>
>
>
> zhangcheng
>
> From: Edward Capriolo
> Date: 2012-05-14 10:14
> To: dev; zhangcheng
> Subject: Re: how to upgrade my cassandra from SizeTieredCompaction to
> LeveledCompaction
> As soon as you use the CLI to change the compaction strategy for a
> column family, Cassandra will consider all SSTables level 0 and begin
> leveling them. With that much data, think hard before making the
> change. You have to understand how leveled compaction will work with your
> workload.
>
> On Sun, May 13, 2012 at 10:09 PM, zhangcheng  wrote:
> >
> > There is 2T of data on each server. Can someone give me some advice?
> >
> > Thanks.
>
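For reference, the change Edward describes is a one-statement schema
update. In the 1.0-era cassandra-cli it would look something like the
following, where "MyCF" and the sstable size are placeholders, not values
from this thread:

    update column family MyCF
      with compaction_strategy =
    'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
      and compaction_strategy_options = {'sstable_size_in_mb' : '10'};

Once this runs, all existing SSTables are treated as level 0 and re-leveled
in the background, which with 2T per node is the massive compaction load
warned about above.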

