Document storage

2012-03-28 Thread Ben McCann
Hi,

I was wondering if it would be interesting to add some type of
document-oriented data type.

I've found it somewhat awkward to store document-oriented data in Cassandra
today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
Cassandra cannot differentiate it from any other string or byte array.
However, if my column validation_class could be a JsonType that would
allow tools to potentially do more interesting introspection on the column
value.  E.g. bug 3647 <https://issues.apache.org/jira/browse/CASSANDRA-3647>
calls for supporting arbitrarily nested "documents" in CQL.  Running a
query against the JSON column in Pig is possible as well, but again in this
use case it would be helpful to be able to encode in column metadata that
the column is stored as JSON.  For debugging, running nightly reports, etc.
it would be quite useful compared to the opaque string and byte array types
we have today.  JSON is appealing because it would be easy to implement.
 Something like Thrift or Protocol Buffers would actually be interesting
since they would be more space efficient.  However, they would also be a
bit more difficult to implement because of the extra typing information
they provide.  I'm hoping with Cassandra 1.0's addition of compression that
storing JSON is not too inefficient.

Would there be interest in adding a JsonType?  I could look at putting a
patch together.

Thanks,
Ben


Re: Document storage

2012-03-28 Thread Ben McCann
Any thoughts?  I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben




Re: Document storage

2012-03-28 Thread Ben McCann
I don't imagine sort is a meaningful operation on JSON data.  As long as
the sorting is consistent I would think that should be sufficient.


On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo wrote:

> Some work I did stores JSON blobs in columns. The question on JSON
> type is how to sort it.
>
> On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna wrote:
> > I don't speak for the project, but you might give it a day or two for
> > people to respond and/or perhaps create a jira ticket.  Seems like that's a
> > reasonable data type that would get some traction - a json type.  However,
> > what would validation look like?  That's one of the main reasons there are
> > the data types and validators, in order to validate on insert.
> >
> >
>
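
On the sorting question above: one way to get an ordering that is at least
consistent, even if not meaningful, is to compare a canonical serialization
of each document. The sketch below is only an illustration of that idea in
Python; `canonical_bytes` and `compare` are hypothetical names, and nothing
here reflects how a Cassandra comparator is actually implemented.

```python
import json


def canonical_bytes(doc) -> bytes:
    """Serialize with sorted keys and fixed separators so that equal
    documents always produce identical bytes."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def compare(a, b) -> int:
    """Return -1, 0 or 1 by comparing the canonical byte forms."""
    ka, kb = canonical_bytes(a), canonical_bytes(b)
    return (ka > kb) - (ka < kb)
```

Two documents that differ only in key order compare as equal under this
scheme, which is the property "consistent" needs here.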


Re: Document storage

2012-03-29 Thread Ben McCann
Sounds awesome Drew.  Mind sharing your custom type?  I just wrote a basic
JSON type and did the validation the same way you did, but I don't have any
SMILE support yet.  It seems that if your type were committed to the
Cassandra codebase then the issue you ran into of the CLI only supporting
built-in types would no longer be a problem for you (though fixing the
issue anyway would be good and I voted for it).  Btw, any reason you
compress it with Snappy yourself instead of just setting sstable_compression
to SnappyCompressor
<http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression>
and letting Cassandra do that part?

-Ben


On Wed, Mar 28, 2012 at 11:28 PM, Drew Kutcharian  wrote:

> I'm actually doing something almost the same. I serialize my objects into
> byte[] using Jackson's SMILE format, then compress it using Snappy then
> store the byte[] in Cassandra. I actually created a simple Cassandra Type
> for this but I hit a wall with cassandra-cli:
>
> https://issues.apache.org/jira/browse/CASSANDRA-4081
>
> Please vote on the JIRA if you are interested.
>
> Validation is pretty simple: you just need to read the value and parse it
> using Jackson; if you don't get any exceptions, your JSON/SMILE is valid ;)
>
> -- Drew
>
>
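
The validate-by-parsing approach described in this message can be sketched
as follows. This is a minimal Python illustration using the standard `json`
module rather than Jackson, and `is_valid_json` is a hypothetical helper
name, not part of any Cassandra or Jackson API.

```python
import json


def is_valid_json(raw: bytes) -> bool:
    """Treat the value as valid if it decodes as UTF-8 and parses as JSON."""
    try:
        json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return False
    return True
```

The cost is one full parse per insert, which is presumably why the size of
the document matters for the overhead question raised elsewhere in the thread.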


Re: Document storage

2012-03-29 Thread Ben McCann
Could you explain further how I would use CASSANDRA-3647?  There's still
very little documentation on composite columns and it was not clear to me
whether they could be used to store document oriented data.  Say for
example that I had a document like:

user: {
  firstName: 'ben',
  skills: ['java', 'javascript', 'html'],
  education: {
    school: 'cmu',
    major: 'computer science'
  }
}

How would I flatten this to be stored and then reconstruct the document?


On Thu, Mar 29, 2012 at 5:44 AM, Jake Luciani  wrote:

> Is there a reason you would prefer a JSONType over CASSANDRA-3647?  It
> would seem the only thing a JSON type offers you is validation.  3647 takes
> it much further by deconstructing a JSON document using composite columns
> to flatten the document out, with the ability to access and update portions
> of the document (as well as reconstruct it).
>
>
>
>
> --
> http://twitter.com/tjake
>
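
To make the flatten-and-reconstruct question concrete, here is a rough
Python sketch of one possible scheme: every leaf value is stored under a
tuple path (standing in for a composite column name), with list positions
as integer path components. The functions `flatten` and `unflatten` are
hypothetical, and this is an assumption about how such a mapping could
look, not what CASSANDRA-3647 specifies.

```python
def flatten(value, path=()):
    """Yield (path, leaf) pairs; dict keys and list indexes become path parts."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, path + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, path + (index,))
    else:
        yield path, value


def _listify(node):
    """Turn dicts whose keys are all integers back into lists."""
    if not isinstance(node, dict):
        return node
    if node and all(isinstance(k, int) for k in node):
        return [_listify(node[i]) for i in sorted(node)]
    return {k: _listify(v) for k, v in node.items()}


def unflatten(columns):
    """Rebuild the nested document from a {path: leaf} mapping.
    Assumes the top level is a dict; empty containers are not preserved."""
    tree = {}
    for path, leaf in columns.items():
        node = tree
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = leaf
    return _listify(tree)
```

With this layout, the example document becomes six columns, and a single
field such as ("user", "skills", 1) can be read or overwritten on its own.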


Re: Document storage

2012-03-29 Thread Ben McCann
Creating materialized paths may well be a possible solution.  If that were
the solution the community were to agree upon then I would like it to be a
standardized and well-documented best practice.  I asked how to store a
list of values on the user list and no one suggested
["fieldName", ]: "fieldValue".  It would be a
huge pain right now to create materialized paths like this for each of my
objects, so client library support would definitely be needed.  And the
client libraries should agree.  If Astyanax and lazyboy both add support
for materialized path and I write an object to Cassandra with Astyanax,
then I should be able to read it back with lazyboy.  The benefit of using
JSON/SMILE is that it's very clear that there's exactly one way to
serialize and deserialize the data and it's very easy.  It's not clear to
me that this is true using materialized paths.


On Thu, Mar 29, 2012 at 8:21 AM, Tyler Patterson wrote:

> >
> >
> > Would there be interest in adding a JsonType?
>
>
> What about checking that data inserted into a JsonType is valid JSON? How
> would you do it, and would the overhead be something we are concerned
> about, especially if the JSON string is large?
>


Re: Document storage

2012-03-29 Thread Ben McCann
Jonathan, I asked Brian about his REST API
<https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Us>
and
he said he does not take the json objects and split them because the
client libraries do not agree on implementations.  This was exactly my
concern as well with this solution.  I would be perfectly happy to do it
this way instead of using JSON if it were standardized.  The reason I
suggested JSON is that it is standardized.  As far as I can tell, Cassandra
doesn't support maps and lists in a standardized way today, which is the
root of my problem.

-Ben


On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian  wrote:

> Yes, I meant the "row header index". What I have done is that I'm storing
> an object (i.e. UserProfile) where you read or write it as a whole (a user
> updates their user details in a single page in the UI). So I serialize that
> object into a binary JSON using SMILE format. I then compress it using
> Snappy on the client side. So as far as Cassandra cares it's storing a
> byte[].
>
> Now on the client side, I'm using cassandra-cli with a custom type that
> knows how to turn a byte[] into a JSON text and back. The only issue was
> CASSANDRA-4081 where "assume" doesn't work with custom types. If
> CASSANDRA-4081 gets fixed, I'll get the best of both worlds.
>
> Also advantages of this vs. the thrift based Super Column families are:
>
> 1. Saving extra CPU usage on the Cassandra nodes, since
> serialize/deserialize and compression/decompression happen on the client
> nodes where there is plenty of idle CPU time
>
> 2. Saving network bandwidth since I'm sending over a compressed byte[]
>
>
> -- Drew
>
>
>
> On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:
>
> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian wrote:
> >>> I think this is a much better approach because that gives you the
> >>> ability to update or retrieve just parts of objects efficiently,
> >>> rather than making column values just blobs with a bunch of special
> >>> case logic to introspect them.  Which feels like a big step backwards
> >>> to me.
> >>
> >> Unless your access pattern involves reading/writing the whole document
> >> each time. In that case you're better off serializing the whole document
> >> and storing it in a column as a byte[] without incurring the overhead of
> >> column indexes. Right?
> >
> > Hmm, not sure what you're thinking of there.
> >
> > If you mean the "index" that's part of the row header for random
> > access within a row, then no, serializing to byte[] doesn't save you
> > anything.
> >
> > If you mean secondary indexes, don't declare any if you don't want any. :)
> >
> > Just telling C* to store a byte[] *will* be slightly lighter-weight
> > than giving it named columns, but we're talking negligible compared to
> > the overhead of actually moving the data on or off disk in the first
> > place.  Not even close to being worth giving up being able to deal
> > with your data from standard tools like cqlsh, IMO.
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
> > http://www.datastax.com
>
>
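
The client-side pipeline described above (serialize the whole object,
compress it, hand Cassandra an opaque byte[]) can be sketched like this.
The sketch is in Python and substitutes the standard `json` and `zlib`
modules for SMILE and Snappy, which are not in the standard library, so it
only shows the shape of the approach; `pack` and `unpack` are hypothetical
names.

```python
import json
import zlib


def pack(obj) -> bytes:
    """Serialize the whole object, then compress it on the client."""
    serialized = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return zlib.compress(serialized)


def unpack(blob: bytes):
    """Reverse of pack: decompress, then parse."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The server only ever sees the output of `pack`, which is why tools such as
cassandra-cli need a custom type to display the value as text.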


Re: Document storage

2012-03-29 Thread Ben McCann
Thanks Jonathan.  The only reason I suggested JSON was because it already
has support for lists.  Native support for lists in Cassandra would more
than satisfy me.  Are there any existing proposals or a bug I can follow?
 I'm not familiar with the Cassandra codebase, so I'm not entirely sure how
helpful I can be, but I'd certainly be interested in taking a look to see
what's required.

-Ben


On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill wrote:

> Jonathan,
>
> I was actually going to take this up with Nate McCall a few weeks back.  I
> think it might make sense to get the client development community together
> (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)
>
> I agree whole-heartedly that it shouldn't go into the database for all the
> reasons you point out.
>
> If we can all decide on some standards for data storage (e.g. composite
> types), indexing strategies, etc.  We can provide higher-level functions
> through the client libraries and also provide interoperability between
> them.  (without bloating Cassandra)
>
> CCing Nate.  Nate, thoughts?
> I wouldn't mind coordinating/facilitating the conversation.  If we know
> who should be involved.
>
> -brian
>
> 
> Brian O'Neill
> Lead Architect, Software Development
> Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
> p: 215.588.6024
> blog: http://weblogs.java.net/blog/boneill42/
> blog: http://brianoneill.blogspot.com/
>
>
>


Re: Document storage

2012-03-29 Thread Ben McCann
Cool.  How were you thinking we should store the data?  As a standardized
composite column (e.g. potentially a list as ["fieldName", ]:
"fieldValue" and a set as  ["fieldName",  "fieldValue" ]:"")?  Or as a new
column type?
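
As an illustration of the two composite-column layouts floated above, here
is a small Python sketch. It assumes the second component of the list
encoding is the element's position, which the text above does not spell
out, and all four function names are hypothetical.

```python
def encode_list(field, values):
    """List: column name is (field, position), column value is the element.
    The position component is an assumption made for this sketch."""
    return {(field, position): value for position, value in enumerate(values)}


def encode_set(field, values):
    """Set: column name is (field, element), column value is empty."""
    return {(field, value): "" for value in values}


def decode_list(field, columns):
    """Collect the columns for one field and order them by position."""
    items = sorted((name[1], value) for name, value in columns.items()
                   if name[0] == field)
    return [value for _, value in items]


def decode_set(field, columns):
    """The set members are the second component of each column name."""
    return {name[1] for name in columns if name[0] == field}
```

In the set layout duplicates collapse naturally, because two identical
elements map to the same column name.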


On Thu, Mar 29, 2012 at 12:35 PM, Jonathan Ellis  wrote:

> I kind of hijacked
> https://issues.apache.org/jira/browse/CASSANDRA-3647 ("Sylvain
> suggests we start with (non-nested) lists, maps, and sets. I agree
> that this is a great 80/20 approach to the problem") but we could
> split it out to another ticket.
>

Re: Document storage

2012-03-30 Thread Ben McCann
>
> If you don't need selective updates and having something as compact as
> possible on disk makes an important difference for you, sure, do use blobs.
> The only argument is that you can already do that without any change to
> the core.


The thing that we can't do today without changes to the core is index on
subparts of some document format like Protobuf/JSON/etc.  If Cassandra were
to understand one of these formats, it could remove the need for manual
management of an index.
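
For context, "manual management of an index" over blob values might look
roughly like the following Python sketch, where the client has to parse
each blob and keep a separate value-to-keys mapping in sync on every write.
`BlobStore` is a hypothetical in-memory stand-in, not a Cassandra client
API, and it assumes the indexed field holds a hashable scalar.

```python
import json
from collections import defaultdict


class BlobStore:
    """Stores documents as opaque JSON strings plus a hand-maintained index."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.rows = {}                  # row key -> serialized blob
        self.index = defaultdict(set)   # field value -> set of row keys

    def put(self, key, doc):
        old = self.rows.get(key)
        if old is not None:
            # The store cannot see inside the blob, so the client must
            # parse the old value to remove the stale index entry.
            old_value = json.loads(old).get(self.indexed_field)
            self.index[old_value].discard(key)
        self.rows[key] = json.dumps(doc)
        self.index[doc.get(self.indexed_field)].add(key)

    def find(self, value):
        return sorted(self.index.get(value, ()))
```

Every writer has to repeat this bookkeeping correctly, which is the burden
a format-aware column type could take over.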


On Fri, Mar 30, 2012 at 10:23 AM, Sylvain Lebresne wrote:

> On Fri, Mar 30, 2012 at 6:01 PM, Daniel Doubleday wrote:
> > But decomposing into columns will lead to more of that:
> >
> > - Total amount of serialized data is (in most cases a lot) larger than
> > protobuffed / compressed version
>
> At least with sstable compression, I would expect the difference to
> not be too big in practice.
>
> > - If you do selective updates the document will be scattered over
> > multiple ssts plus if you do sliced reads you can't optimize reads as
> > opposed to the single column version that when updated is automatically
> > superseding older versions so most reads will hit only one sst
>
> But if you need to do selective updates, then a blob just doesn't work
> so that comparison is moot.
>
> Now I don't think anyone pretended that you should never use blobs
> (whether that's protobuffed, jsoned, ...). If you don't need selective
> updates and having something as compact as possible on disk makes an
> important difference for you, sure, do use blobs. The only argument is
> that you can already do that without any change to the core. What we
> are saying is that for the case where you care more about schema
> flexibility (being able to do selective updates, to index on some
> subpart, etc...) then we think that something like the map and list
> idea of CASSANDRA-3647 will probably be a more natural fit to the
> current CQL API.
>
> --
> Sylvain
>
> >
> > All these reads make the hot dataset. If it fits the page cache you're
> > fine. If it doesn't you need to buy more iron.
> >
> > Really could not resist because your statement seems to be contrary to
> > all our tests / learnings.
> >
> > Cheers,
> > Daniel
> >
> > From dev list:
> >
> > Re: Document storage
> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian 
> wrote:
> >>> I think this is a much better approach because that gives you the
> >>> ability to update or retrieve just parts of objects efficiently,
> >>> rather than making column values just blobs with a bunch of special
> >>> case logic to introspect them.  Which feels like a big step backwards
> >>> to me.
> >>
> >> Unless your access pattern involves reading/writing the whole document
> >> each time. In that case you're better off serializing the whole document
> >> and storing it in a column as a byte[] without incurring the overhead of
> >> column indexes. Right?
> >
> > Hmm, not sure what you're thinking of there.
> >
> > If you mean the "index" that's part of the row header for random
> > access within a row, then no, serializing to byte[] doesn't save you
> > anything.
> >
> > If you mean secondary indexes, don't declare any if you don't want any. :)
> >
> > Just telling C* to store a byte[] *will* be slightly lighter-weight
> > than giving it named columns, but we're talking negligible compared to
> > the overhead of actually moving the data on or off disk in the first
> > place.  Not even close to being worth giving up being able to deal
> > with your data from standard tools like cqlsh, IMO.
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
> > http://www.datastax.com
> >
>