Re: Standardizing Timestamps Across Clients
+1, I think this is a great idea. I've been bitten by this when switching clients, and it took a while to figure out what was going on. Good job on pycassa, vomjom!

-Ben

2010/3/18 Ted Zlatanov:
> On Thu, 18 Mar 2010 02:36:35 -0500 Jonathan Hseu wrote:
>
> JH> Jonathan Ellis suggested that I bring this issue to the dev mailing list:
> JH> Cassandra should recommend a default timestamp across all client libraries.
> ...
> JH> Here's what different clients are using:
>
> JH> 1. Cassandra CLI: Milliseconds since UTC epoch.
> JH> 2. lazyboy: Seconds since UTC epoch. It used to be seconds since local time epoch. Now it's changing again to microseconds since UTC epoch.
> JH> 3. driftx's client: Milliseconds since UTC epoch.
> JH> 4. The example app, Twissandra: Microseconds since UTC epoch.
> JH> 5. pycassa: Microseconds since UTC epoch. It used to be seconds since local time epoch.
> JH> 6. The most popular Cassandra Ruby client: Microseconds since UTC epoch.
>
> It's good to standardize :)
>
> In Perl land, Net::Cassandra::Easy is using seconds but should be using microseconds. I'll change it for 0.4 (the underlying Thrift code will DTRT for the 64-bit encoding using Bit::Vector). Net::Cassandra uses seconds and should also be changed; CC'd to that module's maintainer.
>
> Ted
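The microseconds-since-UTC-epoch convention the thread converges on is a one-liner in most languages; a minimal Python sketch (function names here are illustrative, not from any client library):

```python
import time

def usec_timestamp() -> int:
    """Microseconds since the Unix (UTC) epoch -- the convention
    most clients in this thread are converging on."""
    return int(time.time() * 1_000_000)

def msec_timestamp() -> int:
    """Milliseconds since the Unix epoch -- the CLI's convention,
    differing from the above only by a factor of 1000."""
    return int(time.time() * 1000)
```

Since `time.time()` already returns UTC-based seconds regardless of the local timezone, this also avoids the "seconds since local time epoch" bug mentioned for earlier lazyboy and pycassa versions.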
Re: Create a Cassandra demo application
I'm interested in mentoring for this position. I think Ruby or Python would be best for an example app, and I think that having a well-documented Lucandra implementation (Wikipedia search) could really help a broad range of people out.

-Ben Standefer

On Tue, Mar 23, 2010 at 9:22 PM, Janith Bandara wrote:
> Hi All,
>
> I'm a student at UCSC [1], eligible for the GSoC 2010 [2] projects. I'm interested in the Apache Cassandra project [3].
>
> I chose the idea [4] called "Create a Cassandra demo application" from the Apache GSoC project list [5].
>
> I hope to develop a more functional demo application than Twissandra [6].
>
> Is there an assignee or mentor for this idea?
>
> Thanks & Regards,
> Janith.
>
> [1] http://ucsc.cmb.ac.lk/
> [2] http://code.google.com/soc/
> [3] http://cassandra.apache.org/
> [4] https://issues.apache.org/jira/browse/CASSANDRA-873
> [5] https://issues.apache.org/jira/secure/IssueNavigator.jspa?mode=hide&requestId=12314021
> [6] http://twissandra.com/
Re: Gsoc2010 proposal
Priyanka, I think our listserv might be dropping your attachment. Please send it directly to me at benstande...@gmail.com.

-Ben

On Fri, Mar 26, 2010 at 10:41 AM, Priyanka Sharma wrote:
> Hi,
>
> I am Priyanka Sharma, a master's student at Vrije University, Amsterdam. My major is "parallel and distributed systems".
> I am interested in participating in GSoC 2010 with Cassandra. I would like to implement the "demo application for Cassandra" idea.
> I have attached my proposal (not fully final) to this email, along with my CV for your reference.
>
> I would like to have your comments on my proposal so that I can make it better. Kindly give me some feedback.
>
> --
> Thanks & Regards,
> Priyanka
Re: cassandra increment counters, Jira #1072
Interesting idea with the counter row approach, but I think it puts a dubious responsibility on the Cassandra user. Sure, Cassandra users are expected to maintain the size of a row, but asking them to constantly aggregate counts of UUIDs in rows that are growing rapidly, just to maintain a counter, seems out of the realm of the average Cassandra end user.

My napkin math may be slightly off, but if a "counter row aggregator" stopped functioning, crashed, or didn't do its job correctly on a counter receiving 2,000 increments per second, you'd end up with a single row of more than 2.57 GB after 24 hours (2,000/sec x 86,400 seconds x 16 bytes per UUID). That approaches the magnitude of memory on a single node and would seem (to me?) to significantly impact load and load distribution. Maybe there is a way Cassandra could perform the counter row aggregation internally (with read repair?) and offer it to end users as a clean, simple, intuitive interface.

I have never thought counters were something Cassandra handles well. If there is not a satisfactory way to integrate counters into the Cassandra internals, I think it'd be great for somebody in-the-know to provide in-depth, detailed documentation on best practices for implementing counters. I think distributed and scalable counters can be a killer app for Cassandra, and circumventing locking systems such as ZooKeeper is key.

Disclaimer: I'm not quite a Cassandra developer, more of an Ops guy and user, just trying to add perspective. I do not want a pony.

-Ben Standefer

On Thu, Aug 12, 2010 at 8:54 PM, Jonathan Ellis wrote:
> There are two concerns that give me pause.
>
> The first is that 1072 is tackling a use case that Cassandra already handles well: a high volume of writes to a counter, with low-volume reads. (This can be done by inserting UUIDs into a counter row and aggregating them in the background, at read time, or with some combination of the two. The counter rows can be sharded if necessary.)
>
> The second is that the approach in 1072 resembles an entirely separate system that happens to use parts of Cassandra's infrastructure -- the Thrift API, the MessagingService, the sstable format -- but isn't really part of it. ConsistencyLevel is not respected, and special cases abound to weld things in that don't fit, e.g. the AES/Streaming business.
>
> On Thu, Aug 12, 2010 at 1:28 AM, Robin Bowes wrote:
> > Hi Jonathan,
> >
> > I'm contacting you in your capacity as project lead for the Cassandra project. I am wondering how close ticket #1072 is to implementation [1].
> >
> > We are about to do a proof of concept with Cassandra to replace around 20 MySQL partitions (1 partition = 4 machines: master/slave in DC A, master/slave in DC B).
> >
> > We're essentially just counting web hits -- around 10k/second at peak times -- so increment counters are pretty much essential functionality for us.
> >
> > How close is the patch in #1072 to being acceptable? What is blocking it?
> >
> > Thanks,
> >
> > R.
> >
> > [1] https://issues.apache.org/jira/browse/CASSANDRA-1072
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
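The uuid-column counting pattern Jonathan describes, and the row-growth napkin math from the reply, can be sketched as follows (an in-memory stand-in for a Cassandra row, purely illustrative; all names are invented for this sketch):

```python
import uuid

# Each increment writes a fresh UUID column; the read path sums them.
# A dict stands in for the counter row (column name -> value).
counter_row = {}

def increment(delta: int = 1) -> None:
    counter_row[uuid.uuid1()] = delta  # one new 16-byte column name per write

def read_count() -> int:
    return sum(counter_row.values())   # aggregate at read time

# Napkin math from the thread: if the aggregator stops, a counter taking
# 2,000 increments/sec accumulates raw UUID columns all day long:
row_bytes = 2_000 * 86_400 * 16        # increments/sec * secs/day * bytes/UUID
assert round(row_bytes / 2**30, 2) == 2.57   # ~2.57 GiB, matching the estimate
```

This is why the reply argues the aggregation (in the background or at read time) must keep pace with the write rate, and why pushing that responsibility onto end users is risky.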
Re: [DISCUSSION] High-volume counters in Cassandra
At SimpleGeo, we're close to just merging 1072 internally. I've talked with several members of the community who have already done this and are running 1072 in production or quasi-production. It seems that if this isn't merged, people are going to merge it internally anyway. Such a widely desired feature should not be left as a patch for users to merge themselves.

I like the idea of including both approaches and choosing between them based on your requirements.

-Ben Standefer

On Thu, Sep 2, 2010 at 12:01 PM, Johan Oskarsson wrote:
> In the last few months Digg and Twitter have been using a counter patch that lets Cassandra act as a high-volume realtime counting system. Atomic counters enable new applications that were previously difficult to implement at scale, including realtime analytics and large-scale systems monitoring.
>
> Discussion
> There are currently two different suggestions for how to implement counters in Cassandra. The discussion has so far been limited to those following the jiras (CASSANDRA-1072 and CASSANDRA-1421) closely, and we don't seem to be nearing a decision. I want to open it up to the Cassandra community at large to get additional feedback.
>
> Below are very basic and brief introductions to the alternatives. Please help us move forward by reading through the docs and jiras and replying to this thread with your thoughts. Would one or the other, both, or neither be suitable for inclusion in Cassandra? Is there a third option? What can we do to reach a decision?
>
> We believe that both options can coexist; their strengths and weaknesses make them suitable for different use cases.
>
> CASSANDRA-1072 + CASSANDRA-1397
> https://issues.apache.org/jira/browse/CASSANDRA-1072 (see design doc)
> https://issues.apache.org/jira/browse/CASSANDRA-1397
>
> How does it work?
> A node is picked as the primary replica for each write. The context byte array for a column contains (primary replica ip, value). Any previous data with the same ip is reconciled with the new increment and stored as the column value.
>
> Concerns raised
> * An increment in flight will be lost if the wrong node goes down.
> * If an increment operation times out, it's impossible to know whether it has been executed or not.
>
> The most recent jira comment proposes a new API method for increments that reflects the different consistency level guarantees.
>
> CASSANDRA-1421
> https://issues.apache.org/jira/browse/CASSANDRA-1421
>
> How does it work?
> Each increment for a counter is stored as a (UUID, value) tuple. Read operations read all these increment tuples for a counter, reconcile them, and return the result. On a regular interval the values are all read and reconciled into one value to reduce the amount of data required for each read.
>
> Concerns raised
> * Poor read performance, especially for time-series data.
> * Post-aggregation reconciliation issues.
>
> Again, we feel that both options can co-exist, especially if the 1072 patch uses a new API method that reflects its different consistency level guarantees. Our proposal is to accept 1072 into trunk with the new API method; when an implementation of 1421 is completed, it can be accepted alongside.
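The two reconciliation styles described above can be caricatured in a few lines. This is a simplified illustration, not the actual patch code; all function names are invented for this sketch:

```python
import uuid

# CASSANDRA-1072 style: the counter's context holds one partial sum per
# primary replica; a new increment is merged into its replica's slot.
def reconcile_1072(context: dict, replica_ip: str, delta: int) -> dict:
    merged = dict(context)
    merged[replica_ip] = merged.get(replica_ip, 0) + delta
    return merged

def value_1072(context: dict) -> int:
    # The counter's value is the sum of all per-replica partial sums.
    return sum(context.values())

# CASSANDRA-1421 style: every increment is its own (UUID, delta) tuple;
# reads reconcile the whole list (periodic compaction collapses it).
def value_1421(increments: list) -> int:
    return sum(delta for _uid, delta in increments)
```

The sketch makes the trade-off visible: 1072's context stays bounded by the replica count but loses an in-flight increment with its primary replica, while 1421's tuple list is lossless per increment but grows with write volume until it is compacted, hence the read-performance concern.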
Re: Distributed Counters Use Cases
Ryan, thanks for the insight. FWIW, SimpleGeo's use cases are very similar to the 2nd use case Ryan mentioned. We want to do rollups by time, geography, and facet of a customer's record. The most important benefit Cassandra brings for us is the ability to handle a large number of rows (very detailed rollups). Secondary is the ability to increment at high volume (the increment buffering that Ryan has mentioned seems highly valuable). For example, for a hypothetical customer:

Dirty Burritos, Inc. -- # burritos sold, by neighborhood and meat:

  Mission: Total: 1,726; Meat: Chicken 765, Beef 620, Chorizo 173, Veggie 168
  SOMA:    Total: 1,526; Meat: Chicken 665, Beef 520, Chorizo 173, Veggie 168
  Marina:  Total: 1,326; Meat: Chicken 565, Beef 420, Chorizo 173, Veggie 168

We would roll up by many different time periods (minutes, hours, days), geographic boundaries (neighborhood, zip, city, state), metrics (# burritos sold, order total, delivery time), and properties (meat, male/female, order size). With a smart schema, I think we can store and update this data in real time and make it reasonably query-able, and it will be much simpler and easier than batch processing.

This kind of reporting isn't novel or special, but the cost to produce this data becomes extremely low when you don't have to futz with Hadoop, batch processing, broken jobs, etc. We have looked at a few ways to store each increment in a new column, possibly with some kind of high-level compaction that comes through and cleans it up, but it just becomes unwieldy at the app level. We plan on messing with #1072 in the very near future, as well as offering to beta test the increment buffering Ryan has mentioned.

-Ben Standefer

On Wed, Oct 6, 2010 at 6:30 PM, Ryan King wrote:
> In the spirit of making sure we have clear communication about our work, I'd like to outline the use cases Twitter has for distributed counters. I expect that many of you using Cassandra currently or in the future will have similar use cases.
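The kind of rollup schema Ben describes can be sketched as composed counter row keys, where a single sale event fans out into a handful of increments. A Python sketch, with every name and key format hypothetical:

```python
# Hypothetical rollup-key composition for the burrito example above:
# one counter row per (customer, metric, time bucket, geography), with
# one counter column per facet (chicken, beef, ...).
def rollup_key(customer: str, metric: str, period: str, geo: str) -> str:
    return ":".join((customer, metric, period, geo))

# A single burrito sale in the Mission would touch several rollup rows:
keys = [
    rollup_key("dirty-burritos", "burritos_sold", "2010-10-06T18h", "mission"),
    rollup_key("dirty-burritos", "burritos_sold", "2010-10-06", "mission"),
    rollup_key("dirty-burritos", "burritos_sold", "2010-10-06", "san-francisco"),
]
# Each event becomes a few increments written in real time, instead of a
# batch job recomputing the rollups later.
```

With native counters, each of those keys gets an increment on the matching facet column at write time, which is what makes the report above queryable without Hadoop.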
> The first use case is pretty simple: high-scale counters. Our Tweet button [1] is powered by #1072 counters. We count every mention of every url that comes through a public tweet. As you would expect, there are a lot of urls and a lot of traffic to this widget (it's on many high-traffic sites, though it is heavily cached).
>
> The second is a bit more complex: time-series data. We have built infrastructure that can process logs (in real time, from scribe) or other events and convert them into a series of keys to increment, buffering the data for 1 minute before incrementing those keys. For logs, each aggregator does its own increment (so per thing you're tracking you get an increment from each aggregator), but for events it'll be one increment per event. We plan to open source all of this soon.
>
> We're hoping to soon start replacing our ganglia clusters with this. For the ganglia use case we end up with a large number of increments for every read. For monitoring data, even a reasonably sized fleet with a moderate number of metrics can generate a huge amount of data. Imagine you have 500 machines (not how many we have) and measure 300 metrics per machine (a reasonable estimate based on our experience). Suppose you want to measure these things every minute and roll the values up every hour, day, month, and for all time. Suppose also that you are tracking sum, count, min, max, and sum of squares (so that you can do standard deviation). You also want to track these metrics across groups like web hosts, databases, datacenters, etc.
>
> These basic assumptions would mean this kind of traffic:
>
>   (500 machines + 100 groups) * 300 metrics * 5 time granularities * 4 aggregates = 3,600,000 increments/minute
>
> Read traffic, being employee-only, would be negligible compared to this.
>
> One other use case is that for many of the metrics we track, we want to track the usage across several facets.
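The monitoring arithmetic above can be checked with a short script (labels follow the original message's formula; the factor breakdown is as stated there):

```python
# Checking the increment-volume arithmetic from the monitoring example.
machines, groups = 500, 100
metrics_per_machine = 300
time_granularities = 5   # per the message's "* 5 ... time granularities"
aggregates = 4           # per the message's "* 4 ... aggregates"

increments_per_minute = ((machines + groups) * metrics_per_machine
                         * time_granularities * aggregates)
assert increments_per_minute == 3_600_000  # matches the stated total
```

That is 60,000 increments per second sustained, before any read traffic, which is the scale that motivates the increment buffering described above.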
> For example [2], to build our local trends feature, you could store a time series of terms per city. In this case supercolumns would be a natural fit, because the set of facets is unknown and open.
>
> Imagine a CF that has data like this:
>
> city0 => hour0 => { term1 => 2, term2 => 1000, term3 => 1 }, hour1 => { term5 => 2, term2 => 10 }
> city1 => hour0 => { term12 => 3, term0 => 500, term3 => 1 }, hour1 => { term5 => 2, term2 => 10 }
>
> Of course, there are some other ways to model this data -- you could collapse the subcolumn names into the column names and re-do how you slice (you have to slice anyway). You would have to have fixed-width terms then, though.
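The two layouts Ryan contrasts can be mocked up with plain dicts. This is a data-model illustration only, not Cassandra code; the composite-name separator is an assumption:

```python
# Supercolumn-like nesting: city -> hour -> term -> count.
nested = {
    "city0": {
        "hour0": {"term1": 2, "term2": 1000, "term3": 1},
        "hour1": {"term5": 2, "term2": 10},
    },
}

# Flattened alternative: collapse the term into the column name
# (e.g. "hour0:term2") and slice on the composite name instead.
flat = {
    "city0": {
        "hour0:term1": 2, "hour0:term2": 1000, "hour0:term3": 1,
        "hour1:term5": 2, "hour1:term2": 10,
    },
}

# Both layouts answer "top term for city0 in hour0"; the flat one trades
# supercolumns for composite names, which is why fixed-width (or cleanly
# delimited) terms matter there.
top = max(nested["city0"]["hour0"].items(), key=lambda kv: kv[1])
```

The open-ended facet set (new terms appear constantly) is what makes the nested form feel natural, while the flat form keeps the schema to ordinary columns at the cost of encoding discipline in the names.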
Re: Distributed counters are in trunk
Huzzah!!! Great work everybody!

-Ben Standefer

On Tue, Dec 21, 2010 at 6:12 PM, Jonathan Ellis wrote:
> Thanks to Kelvin, Johan, Ryan, Sylvain, Chris, and everyone else for their hard work on this!
>
> For mere mortals: http://wiki.apache.org/cassandra/Counters
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com