Re: OoM querying very wide-row in CLI

2012-03-28 Thread Brian O'Neill
Sorry, I didn't realize we weren't hip to pulls yet.

I created a JIRA and attached the patch.
https://issues.apache.org/jira/browse/CASSANDRA-4098

-brian

On Tue, Mar 27, 2012 at 10:42 PM, Brian O'Neill wrote:

> Here she is:
> https://github.com/apache/cassandra/pull/8
>
> Verified functionally with the attached data script.
>
> -brian
>
>
>
> On Tue, Mar 27, 2012 at 9:49 PM, Brian O'Neill wrote:
>
>> 10-4.  I'll see if I can track it down and submit a pull request that
>> specifies a default if one does not exist.
>>
>> -brian
>>
>> 
>> Brian O'Neill
>> Lead Architect, Software Development
>> Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
>> p: 215.588.6024
>> blog: http://weblogs.java.net/blog/boneill42/
>> blog: http://brianoneill.blogspot.com/
>>
>>
>>
>>
>>
>>
>>
>> On 3/27/12 9:45 PM, "Jonathan Ellis"  wrote:
>>
>> >I believe we added support for specifying a column range to the cli
>> >recently.  I don't know if there is a default limit.
>> >
>> >On Tue, Mar 27, 2012 at 8:40 PM, Brian O'Neill 
>> >wrote:
>> >> Today, running 1.0.7, we saw a node crash with an OutOfMemory.
>> >> We have a single row with ~10million columns in it. (using it as an
>> >>index)
>> >> Accidentally, we attempted to list the CF in CLI that had the wide-row.
>> >>  This caused the CLI to hang and then eventually crashed Cassandra with
>> >>an
>> >> OoM.
>> >>
>> >> I know this is a case of "If it hurts when you do that, don't do that",
>> >>but
>> >> we may want to better protect against it in the CLI and/or the DB.  I
>> >>know
>> >> we limit row counts on lists in CLI.  Do we also limit column counts?
>> >>If
>> >> not, I don't mind submitting a patch for this.
>> >>
>> >> let me know,
>> >> brian
>> >>
>> >> --
>> >> Brian ONeill
>> >> Lead Architect, Health Market Science (http://healthmarketscience.com)
>> >> mobile:215.588.6024
>> >> blog: http://weblogs.java.net/blog/boneill42/
>> >> blog: http://brianoneill.blogspot.com/
>> >
>> >
>> >
>> >--
>> >Jonathan Ellis
>> >Project Chair, Apache Cassandra
>> >co-founder of DataStax, the source for professional Cassandra support
>> >http://www.datastax.com
>>
>>
>>
>
>
> --
> Brian ONeill
> Lead Architect, Health Market Science (http://healthmarketscience.com)
> mobile:215.588.6024
> blog: http://weblogs.java.net/blog/boneill42/
> blog: http://brianoneill.blogspot.com/
>
>


-- 
Brian ONeill
Lead Architect, Health Market Science (http://healthmarketscience.com)
mobile:215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/
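
For readers following the thread: the protection discussed above (capping how many columns a listing pulls from a row when the user gives no limit) can be sketched roughly as below. This is not the CASSANDRA-4098 patch. The TreeMap stands in for a wide row, and `DEFAULT_COLUMN_LIMIT`, `sliceColumns`, and the value 100 are hypothetical names and numbers chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class ColumnLimitSketch {
    // Hypothetical default; the thread does not say what value the patch uses.
    static final int DEFAULT_COLUMN_LIMIT = 100;

    // Returns at most `limit` column names from the row, starting at `start` (inclusive).
    static List<String> sliceColumns(SortedMap<String, byte[]> row, String start, int limit) {
        List<String> page = new ArrayList<>();
        for (String name : row.tailMap(start).keySet()) {
            if (page.size() >= limit) {
                break; // stop before the whole row is materialized
            }
            page.add(name);
        }
        return page;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a wide row (column names sorted, values empty).
        SortedMap<String, byte[]> wideRow = new TreeMap<>();
        for (int i = 0; i < 10_000; i++) {
            wideRow.put(String.format("col%05d", i), new byte[0]);
        }
        List<String> first = sliceColumns(wideRow, "", DEFAULT_COLUMN_LIMIT);
        System.out.println(first.size() + " columns, last = " + first.get(first.size() - 1));
    }
}
```

The point is only that the reader of the row, not the size of the row, decides how much comes back; a caller who wants more has to ask for the next slice explicitly.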


Document storage

2012-03-28 Thread Ben McCann
Hi,

I was wondering if it would be interesting to add some type of
document-oriented data type.

I've found it somewhat awkward to store document-oriented data in Cassandra
today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
Cassandra cannot differentiate it from any other string or byte array.
 However, if my column validation_class could be a JsonType that would
allow tools to potentially do more interesting introspection on the column
value.  E.g. bug 3647 <https://issues.apache.org/jira/browse/CASSANDRA-3647>
calls for supporting arbitrarily nested "documents" in CQL.  Running a
query against the JSON column in Pig is possible as well, but again in this
use case it would be helpful to be able to encode in column metadata that
the column is stored as JSON.  For debugging, running nightly reports, etc.
it would be quite useful compared to the opaque string and byte array types
we have today.  JSON is appealing because it would be easy to implement.
 Something like Thrift or Protocol Buffers would actually be interesting
since they would be more space efficient.  However, they would also be a
bit more difficult to implement because of the extra typing information
they provide.  I'm hoping with Cassandra 1.0's addition of compression that
storing JSON is not too inefficient.

Would there be interest in adding a JsonType?  I could look at putting a
patch together.

Thanks,
Ben


Re: Document storage

2012-03-28 Thread Ben McCann
Any thoughts?  I'd like to submit a patch, but only if it will be accepted.

Thanks,
Ben


On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:

> Hi,
>
> I was wondering if it would be interesting to add some type of
> document-oriented data type.
>
> I've found it somewhat awkward to store document-oriented data in
> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
> store it, but Cassandra cannot differentiate it from any other string or
> byte array.  However, if my column validation_class could be a JsonType
> that would allow tools to potentially do more interesting introspection on
> the column value.  E.g. bug 3647
> <https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for
> supporting arbitrarily nested "documents" in CQL.  Running a
> query against the JSON column in Pig is possible as well, but again in this
> use case it would be helpful to be able to encode in column metadata that
> the column is stored as JSON.  For debugging, running nightly reports, etc.
> it would be quite useful compared to the opaque string and byte array types
> we have today.  JSON is appealing because it would be easy to implement.
>  Something like Thrift or Protocol Buffers would actually be interesting
> since they would be more space efficient.  However, they would also be a
> bit more difficult to implement because of the extra typing information
> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
> storing JSON is not too inefficient.
>
> Would there be interest in adding a JsonType?  I could look at putting a
> patch together.
>
> Thanks,
> Ben
>
>


Re: Document storage

2012-03-28 Thread Jeremy Hanna
I don't speak for the project, but you might give it a day or two for people to 
respond and/or perhaps create a jira ticket.  Seems like that's a reasonable 
data type that would get some traction - a json type.  However, what would 
validation look like?  That's one of the main reasons there are the data types 
and validators, in order to validate on insert.

On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:

> Any thoughts?  I'd like to submit a patch, but only if it will be accepted.
> 
> Thanks,
> Ben
> 
> 
> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:
> 
>> Hi,
>> 
>> I was wondering if it would be interesting to add some type of
>> document-oriented data type.
>> 
>> I've found it somewhat awkward to store document-oriented data in
>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
>> store it, but Cassandra cannot differentiate it from any other string or
>> byte array.  However, if my column validation_class could be a JsonType
>> that would allow tools to potentially do more interesting introspection on
>> the column value.  E.g. bug 3647
>> <https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for
>> supporting arbitrarily nested "documents" in CQL.  Running a
>> query against the JSON column in Pig is possible as well, but again in this
>> use case it would be helpful to be able to encode in column metadata that
>> the column is stored as JSON.  For debugging, running nightly reports, etc.
>> it would be quite useful compared to the opaque string and byte array types
>> we have today.  JSON is appealing because it would be easy to implement.
>> Something like Thrift or Protocol Buffers would actually be interesting
>> since they would be more space efficient.  However, they would also be a
>> bit more difficult to implement because of the extra typing information
>> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
>> storing JSON is not too inefficient.
>> 
>> Would there be interest in adding a JsonType?  I could look at putting a
>> patch together.
>> 
>> Thanks,
>> Ben
>> 
>> 



Re: Document storage

2012-03-28 Thread Jeremiah Jordan
Sounds interesting to me.  I looked into adding protocol buffer support at one 
point, and it didn't look like it would be too much work.  The tricky part was 
I also wanted to add indexing support for attributes of the inserted protocol 
buffers.  That looked a little trickier, but still not impossible.  Though 
other stuff came up and I never got around to actually writing any code.
JSON support would be nice, especially if you figured out how to get built in 
indexing of the attributes inside the JSON to work =).

-Jeremiah

On Mar 28, 2012, at 10:58 AM, Ben McCann wrote:

> Hi,
> 
> I was wondering if it would be interesting to add some type of
> document-oriented data type.
> 
> I've found it somewhat awkward to store document-oriented data in Cassandra
> today.  I can make a JSON/Protobuf/Thrift, serialize it, and store it, but
> Cassandra cannot differentiate it from any other string or byte array.
> However, if my column validation_class could be a JsonType that would
> allow tools to potentially do more interesting introspection on the column
> value.  E.g. bug 3647 <https://issues.apache.org/jira/browse/CASSANDRA-3647>
> calls for supporting arbitrarily nested "documents" in CQL.  Running a
> query against the JSON column in Pig is possible as well, but again in this
> use case it would be helpful to be able to encode in column metadata that
> the column is stored as JSON.  For debugging, running nightly reports, etc.
> it would be quite useful compared to the opaque string and byte array types
> we have today.  JSON is appealing because it would be easy to implement.
> Something like Thrift or Protocol Buffers would actually be interesting
> since they would be more space efficient.  However, they would also be a
> bit more difficult to implement because of the extra typing information
> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
> storing JSON is not too inefficient.
> 
> Would there be interest in adding a JsonType?  I could look at putting a
> patch together.
> 
> Thanks,
> Ben



Re: Document storage

2012-03-28 Thread Tatu Saloranta
On Wed, Mar 28, 2012 at 6:59 PM, Jeremiah Jordan
 wrote:
> Sounds interesting to me.  I looked into adding protocol buffer support at 
> one point, and it didn't look like it would be too much work.  The tricky 
> part was I also wanted to add indexing support for attributes of the inserted 
> protocol buffers.  That looked a little trickier, but still not impossible.  
> Though other stuff came up and I never got around to actually writing any 
> code.
> JSON support would be nice, especially if you figured out how to get built in 
> indexing of the attributes inside the JSON to work =).

Also, for whatever it's worth, it should be trivial to add support for
Smile (binary JSON serialization):
http://wiki.fasterxml.com/SmileFormatSpec
since its logical data structure is pure JSON, no extensions or
subsetting. The main Java impl is by the Jackson project, but there is
also a C codec (https://github.com/pierre/libsmile), and prototypes
for PHP and Ruby bindings as well.
For all data it's a bit faster and a bit more compact: about 30% for
individual items, but more (40 - 70%) for data sequences (due to
optional back-referencing).

JSON and Smile can be auto-detected from first 4 bytes or so, reliably
and efficiently, so one should be able to add this either
transparently or explicitly.
One could even transcode things on the fly -- store as Smile, expose
filtered results as JSON (and accept JSON or both). This could reduce
storage cost while keeping the benefits of a flexible data format.

-+ Tatu +-
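
A rough sketch of the auto-detection idea, for illustration only. It relies on the Smile header being the three bytes ':' ')' '\n' followed by a version/flags byte, and it assumes the JSON payload is UTF-8 text whose top-level value is an object or an array (top-level scalars would need more cases). `FormatSniffer` and `detect` are made-up names, not part of Jackson or Cassandra.

```java
import java.nio.charset.StandardCharsets;

public class FormatSniffer {
    enum Format { SMILE, JSON, UNKNOWN }

    static Format detect(byte[] data) {
        // Smile header: ':' ')' '\n' plus a fourth byte carrying version and flags.
        if (data.length >= 4 && data[0] == ':' && data[1] == ')' && data[2] == '\n') {
            return Format.SMILE;
        }
        // Textual JSON: skip leading whitespace, then look at the first real byte.
        for (byte b : data) {
            if (b == ' ' || b == '\t' || b == '\n' || b == '\r') {
                continue;
            }
            // Assumption: stored documents are objects or arrays.
            return (b == '{' || b == '[') ? Format.JSON : Format.UNKNOWN;
        }
        return Format.UNKNOWN; // empty or whitespace-only input
    }

    public static void main(String[] args) {
        byte[] smileLike = {':', ')', '\n', 0x00, (byte) 0xFA, (byte) 0xFB};
        byte[] json = "  {\"id\": 1}".getBytes(StandardCharsets.UTF_8);
        byte[] other = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(smileLike));
        System.out.println(detect(json));
        System.out.println(detect(other));
    }
}
```

A check like this only says which parser to try; the value still has to be parsed to know whether it is well formed.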


Re: Document storage

2012-03-28 Thread Edward Capriolo
Some work I did stores JSON blobs in columns. The question on JSON
type is how to sort it.

On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
 wrote:
> I don't speak for the project, but you might give it a day or two for people 
> to respond and/or perhaps create a jira ticket.  Seems like that's a 
> reasonable data type that would get some traction - a json type.  However, 
> what would validation look like?  That's one of the main reasons there are 
> the data types and validators, in order to validate on insert.
>
> On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
>
>> Any thoughts?  I'd like to submit a patch, but only if it will be accepted.
>>
>> Thanks,
>> Ben
>>
>>
>> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:
>>
>>> Hi,
>>>
>>> I was wondering if it would be interesting to add some type of
>>> document-oriented data type.
>>>
>>> I've found it somewhat awkward to store document-oriented data in
>>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
>>> store it, but Cassandra cannot differentiate it from any other string or
>>> byte array.  However, if my column validation_class could be a JsonType
>>> that would allow tools to potentially do more interesting introspection on
>>> the column value.  E.g. bug 3647
>>> <https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for
>>> supporting arbitrarily nested "documents" in CQL.  Running a
>>> query against the JSON column in Pig is possible as well, but again in this
>>> use case it would be helpful to be able to encode in column metadata that
>>> the column is stored as JSON.  For debugging, running nightly reports, etc.
>>> it would be quite useful compared to the opaque string and byte array types
>>> we have today.  JSON is appealing because it would be easy to implement.
>>> Something like Thrift or Protocol Buffers would actually be interesting
>>> since they would be more space efficient.  However, they would also be a
>>> bit more difficult to implement because of the extra typing information
>>> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
>>> storing JSON is not too inefficient.
>>>
>>> Would there be interest in adding a JsonType?  I could look at putting a
>>> patch together.
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>


Re: Document storage

2012-03-28 Thread Ben McCann
I don't imagine sort is a meaningful operation on JSON data.  As long as
the sorting is consistent I would think that should be sufficient.
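
To make "consistent" concrete, here is a minimal sketch of one possible ordering: compare the serialized values byte by byte, treating each byte as unsigned. The order means nothing semantically, but it is total and stable for any input. This is an illustration under that assumption, not a description of what a JsonType would actually have to do; `ByteOrderSketch` and `UNSIGNED_LEXICOGRAPHIC` are made-up names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ByteOrderSketch {
    // Lexicographic comparison of raw bytes, each byte read as 0..255.
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // Equal on the shared prefix: the shorter value sorts first.
        return Integer.compare(a.length, b.length);
    };

    public static void main(String[] args) {
        List<byte[]> docs = new ArrayList<>();
        docs.add("{\"b\":1}".getBytes(StandardCharsets.UTF_8));
        docs.add("{\"a\":2}".getBytes(StandardCharsets.UTF_8));
        docs.add("[1,2]".getBytes(StandardCharsets.UTF_8));
        docs.sort(UNSIGNED_LEXICOGRAPHIC);
        for (byte[] doc : docs) {
            System.out.println(new String(doc, StandardCharsets.UTF_8));
        }
    }
}
```

One caveat worth keeping in mind: two documents that are equal as JSON but differ in whitespace or key order compare as different under this scheme.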


On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo wrote:

> Some work I did stores JSON blobs in columns. The question on JSON
> type is how to sort it.
>
> On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
>  wrote:
> > I don't speak for the project, but you might give it a day or two for
> > people to respond and/or perhaps create a jira ticket.  Seems like that's a
> > reasonable data type that would get some traction - a json type.  However,
> > what would validation look like?  That's one of the main reasons there are
> > the data types and validators, in order to validate on insert.
> >
> > On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
> >
> >> Any thoughts?  I'd like to submit a patch, but only if it will be
> >> accepted.
> >>
> >> Thanks,
> >> Ben
> >>
> >>
> >> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:
> >>
> >>> Hi,
> >>>
> >>> I was wondering if it would be interesting to add some type of
> >>> document-oriented data type.
> >>>
> >>> I've found it somewhat awkward to store document-oriented data in
> >>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
> >>> store it, but Cassandra cannot differentiate it from any other string or
> >>> byte array.  However, if my column validation_class could be a JsonType
> >>> that would allow tools to potentially do more interesting introspection on
> >>> the column value.  E.g. bug 3647
> >>> <https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for
> >>> supporting arbitrarily nested "documents" in CQL.  Running a
> >>> query against the JSON column in Pig is possible as well, but again in this
> >>> use case it would be helpful to be able to encode in column metadata that
> >>> the column is stored as JSON.  For debugging, running nightly reports, etc.
> >>> it would be quite useful compared to the opaque string and byte array types
> >>> we have today.  JSON is appealing because it would be easy to implement.
> >>> Something like Thrift or Protocol Buffers would actually be interesting
> >>> since they would be more space efficient.  However, they would also be a
> >>> bit more difficult to implement because of the extra typing information
> >>> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
> >>> storing JSON is not too inefficient.
> >>>
> >>> Would there be interest in adding a JsonType?  I could look at putting a
> >>> patch together.
> >>> Thanks,
> >>> Ben
> >>>
> >>>
> >
>


Re: Document storage

2012-03-28 Thread Drew Kutcharian
I'm actually doing something almost the same. I serialize my objects into 
byte[] using Jackson's SMILE format, then compress it using Snappy then store 
the byte[] in Cassandra. I actually created a simple Cassandra Type for this 
but I hit a wall with cassandra-cli:

https://issues.apache.org/jira/browse/CASSANDRA-4081

Please vote on the JIRA if you are interested.

Validation is pretty simple: you just need to read the value and parse it using 
Jackson; if you don't get any exceptions, your JSON/Smile is valid ;)

-- Drew
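
To illustrate the "parse it and see whether it throws" approach to validation, here is a minimal sketch. Note the swap: the message above uses Jackson, while this sketch uses a small hand-rolled JSON parser as a stand-in so that it runs on the JDK alone. With Jackson on the classpath, the body of `isValidJson` would reduce to a parse call wrapped in the same try/catch. `JsonValidationSketch` and `isValidJson` are hypothetical names, and how such a check would plug into a Cassandra type is not shown.

```java
public class JsonValidationSketch {
    private final String s;
    private int pos = 0;

    private JsonValidationSketch(String s) { this.s = s; }

    // True if the text is exactly one well-formed JSON value.
    static boolean isValidJson(String text) {
        JsonValidationSketch p = new JsonValidationSketch(text);
        try {
            p.skipWs();
            p.value();
            p.skipWs();
            return p.pos == text.length(); // reject trailing garbage
        } catch (IllegalArgumentException e) {
            return false;                  // any parse error means "invalid"
        }
    }

    private IllegalArgumentException bad(String why) {
        return new IllegalArgumentException(why + " at offset " + pos);
    }
    private char peek() {
        if (pos >= s.length()) throw bad("unexpected end");
        return s.charAt(pos);
    }
    private void expect(char c) {
        if (peek() != c) throw bad("expected '" + c + "'");
        pos++;
    }
    private void skipWs() {
        while (pos < s.length() && " \t\n\r".indexOf(s.charAt(pos)) >= 0) pos++;
    }
    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    private void value() {
        char c = peek();
        if (c == '{') members('{', '}', true);
        else if (c == '[') members('[', ']', false);
        else if (c == '"') string();
        else if (c == '-' || isDigit(c)) number();
        else literal();
    }

    // Parses an object (keyed == true) or an array (keyed == false).
    private void members(char open, char close, boolean keyed) {
        expect(open);
        skipWs();
        if (peek() == close) { pos++; return; }
        while (true) {
            skipWs();
            if (keyed) { string(); skipWs(); expect(':'); skipWs(); }
            value();
            skipWs();
            if (peek() == ',') { pos++; continue; }
            expect(close);
            return;
        }
    }

    private void string() {
        expect('"');
        while (true) {
            char c = peek();
            pos++;
            if (c == '"') return;
            if (c < 0x20) throw bad("raw control character in string");
            if (c != '\\') continue;
            char e = peek();
            pos++;
            if (e == 'u') {
                for (int i = 0; i < 4; i++) {
                    if ("0123456789abcdefABCDEF".indexOf(peek()) < 0) throw bad("bad \\u escape");
                    pos++;
                }
            } else if ("\"\\/bfnrt".indexOf(e) < 0) {
                throw bad("bad escape");
            }
        }
    }

    private void number() {
        if (peek() == '-') pos++;
        if (peek() == '0') pos++; else digits();   // no leading zeros
        if (pos < s.length() && s.charAt(pos) == '.') { pos++; digits(); }
        if (pos < s.length() && (s.charAt(pos) == 'e' || s.charAt(pos) == 'E')) {
            pos++;
            if (peek() == '+' || peek() == '-') pos++;
            digits();
        }
    }
    private void digits() {
        if (!isDigit(peek())) throw bad("expected digit");
        while (pos < s.length() && isDigit(s.charAt(pos))) pos++;
    }
    private void literal() {
        for (String lit : new String[] {"true", "false", "null"}) {
            if (s.startsWith(lit, pos)) { pos += lit.length(); return; }
        }
        throw bad("unexpected token");
    }

    public static void main(String[] args) {
        String[] samples = {"{\"name\": \"Ben\", \"tags\": [1, 2.5e3, null]}", "{\"name\": }", "[1, 2,]"};
        for (String sample : samples) {
            System.out.println(isValidJson(sample) + "  " + sample);
        }
    }
}
```

The sketch recurses once per nesting level, so a real validator would also want a depth limit to guard against pathologically nested input.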



On Mar 28, 2012, at 9:28 PM, Ben McCann wrote:

> I don't imagine sort is a meaningful operation on JSON data.  As long as
> the sorting is consistent I would think that should be sufficient.
> 
> 
> On Wed, Mar 28, 2012 at 8:51 PM, Edward Capriolo wrote:
> 
>> Some work I did stores JSON blobs in columns. The question on JSON
>> type is how to sort it.
>> 
>> On Wed, Mar 28, 2012 at 7:35 PM, Jeremy Hanna
>>  wrote:
>>> I don't speak for the project, but you might give it a day or two for
>>> people to respond and/or perhaps create a jira ticket.  Seems like that's a
>>> reasonable data type that would get some traction - a json type.  However,
>>> what would validation look like?  That's one of the main reasons there are
>>> the data types and validators, in order to validate on insert.
>>> 
>>> On Mar 29, 2012, at 12:27 AM, Ben McCann wrote:
>>> 
>>>> Any thoughts?  I'd like to submit a patch, but only if it will be
>>>> accepted.
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On Wed, Mar 28, 2012 at 8:58 AM, Ben McCann  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I was wondering if it would be interesting to add some type of
>>>>> document-oriented data type.
>>>>>
>>>>> I've found it somewhat awkward to store document-oriented data in
>>>>> Cassandra today.  I can make a JSON/Protobuf/Thrift, serialize it, and
>>>>> store it, but Cassandra cannot differentiate it from any other string or
>>>>> byte array.  However, if my column validation_class could be a JsonType
>>>>> that would allow tools to potentially do more interesting introspection on
>>>>> the column value.  E.g. bug 3647
>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-3647> calls for
>>>>> supporting arbitrarily nested "documents" in CQL.  Running a
>>>>> query against the JSON column in Pig is possible as well, but again in this
>>>>> use case it would be helpful to be able to encode in column metadata that
>>>>> the column is stored as JSON.  For debugging, running nightly reports, etc.
>>>>> it would be quite useful compared to the opaque string and byte array types
>>>>> we have today.  JSON is appealing because it would be easy to implement.
>>>>> Something like Thrift or Protocol Buffers would actually be interesting
>>>>> since they would be more space efficient.  However, they would also be a
>>>>> bit more difficult to implement because of the extra typing information
>>>>> they provide.  I'm hoping with Cassandra 1.0's addition of compression that
>>>>> storing JSON is not too inefficient.
>>>>>
>>>>> Would there be interest in adding a JsonType?  I could look at putting a
>>>>> patch together.
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>> 
>>