Re: C* engine

2013-12-19 Thread Mark Papadakis
There is a _lot_ of room for performance improvement, both in terms of design 
decisions and even more so in terms of implementation specifics.

I don’t think it makes sense (any kind of sense) to rewrite it in C or any other 
language, given the amount of work and (perhaps most importantly) testing that has 
gone into C* so far. Besides, choosing a different language is not what matters most 
for extracting performance; improving the existing codebase is the right thing to do. 
As I understand it, though, improving what’s already there in terms of stability and 
conformance, and releasing new features, is a higher-priority item for the developers 
for now. It will happen eventually.

Mark Papadakis


On Dec 19, 2013, at 9:22 PM, Roman Vasilyev  wrote:

> Hello,
> 
> Don't want to start a "holy war"; just let me share my crazy thoughts.
> I believe this could improve Cassandra's speed and robustness.
> 
> What would people say if I proposed having the Cassandra engine written in C/C++, 
> with this engine giving you the ability to run extensions in Java, Groovy, and a 
> bunch of other languages like Perl/Python/Ruby?
> 
> I just want to understand for myself whether this solution would be useful, or 
> whether I'm looking in the wrong direction.
> 
> Thank you for reading.



Re: C* engine

2013-12-20 Thread Mark Papadakis
For what it’s worth, our datastore is written in C++ and is very similar to C* in 
terms of functionality and semantics; whatever performance delta exists mostly comes 
down to reducing I/O and taking advantage of locality of reference (both in actual 
I/O and in how memory and data structures are accessed). There is no real benefit to 
moving to another language in order to better address those. 

@markpapadakis

On Dec 19, 2013, at 11:56 PM, Roman Vasilyev  wrote:

> This is what I expected as an answer: that the effort would simply wipe out any 
> benefits.
> All I can hear is that it's better not to touch stuff that works and to find 
> improvements a different way, i.e. change (algorithms/caching/...), which I agree 
> with; it seems the bottleneck will be FS I/O or network delays anyway.
> 
> Just want to thank you all for the answers to my question.
> 
> -- Original Message --
> From: "Eric Evans" 
> To: dev@cassandra.apache.org; "Roman Vasilyev" 
> Sent: 12/19/2013 1:47:03 PM
> Subject: Re: C* engine
> 
>> [ Roman Vasilyev ]
>>> Don't want to start a "holy war"; just let me share my crazy thoughts.
>>> I believe this could improve Cassandra's speed and robustness.
>> 
>> That's an awfully large reset button you want to press; presumably the
>> benefits will justify the enormity of the effort? What are we talking
>> here, 10x improvement? 100x improvement? More?
>> 
>>> What would people say if I proposed having the Cassandra engine written
>>> in C/C++, with this engine giving you the ability to run extensions in
>>> Java, Groovy, and a bunch of other languages like Perl/Python/Ruby?
>> 
>> No one can stop you from working on this; Good luck
>> 
>> 
>> P.S. Please subscribe to the list (dev-subsrc...@cassandra.apache.org)
>> 
>> --
>> Eric Evans
>> eev...@sym-link.com
> 



[contrib] Idea/ Reduced I/O and CPU cost for GET ops

2014-09-07 Thread Mark Papadakis
1. 
PROS/CONS: Easier to compute; no restriction to the first 31 distinct columns

2. 
PROS/CONS: Can support interleaved columns (as opposed to the all-exact-columns 
requirement of method 1)

This is configurable on a per-CF basis (we usually choose 2).

Maybe you could consider such a heuristic for C*; it should probably benefit 
your users too.
Apologies if any of this doesn’t make sense in any way; feel free to ignore :)


Mark Papadakis
@markpapadakis

Re: [contrib] Idea/ Reduced I/O and CPU cost for GET ops

2014-09-07 Thread Mark Papadakis
Hi Benedict,


> On Sep 8, 2014, at 4:01 AM, Benedict Elliott Smith 
>  wrote:
> 
> Hi Mark,
> 
> This specific heuristic is unlikely to be applied, as (if I've understood
> it correctly) it has a very narrow window of utility to only those updates
> that hit *exactly* the same clustering columns (cql rows) *and* data
> columns,
Well, method #1 indeed applies to updates that pertain to exactly the same columns, 
like you said. That is to say, going back to the products example: if you end up 
updating e.g. (price, title) every day and therefore end up with a dozen SSTables, 
where one SSTable holds all properties for the product (merchant info, category, 
image, ...) and almost all others hold an update for just (price, title), this helps 
you avoid I/O and CPU costs by only having to consider 2 SSTables (index chunk 
segments are almost always cached).
Method #2 does not have this requirement. It works even if you don’t have the exact 
same set of columns for the row in each SSTable; however, you are limited to the 
first 31 distinct ones on a per-SSTable basis.
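A minimal Java sketch of how a per-SSTable column mask like the one in method #2 
could prune a GET; the names here are hypothetical, and timestamps/tombstones are 
ignored for brevity.

import java.util.ArrayList;
import java.util.List;

public class ColumnMaskSkipSketch {
    // Per SSTable: a 32-bit mask with bit i set if the i-th distinct column
    // (i < 31) is present for the row in that SSTable.
    static final class SSTableEntry {
        final String sstable;
        final int columnMask;
        SSTableEntry(String sstable, int columnMask) {
            this.sstable = sstable;
            this.columnMask = columnMask;
        }
    }

    // 'newestFirst' is ordered newest to oldest; 'wantedMask' marks the requested
    // columns. An older SSTable is read only if it can contribute a requested
    // column that no newer SSTable already covered, and the walk stops once every
    // requested column has its newest value.
    static List<String> tablesToRead(List<SSTableEntry> newestFirst, int wantedMask) {
        List<String> toRead = new ArrayList<>();
        int covered = 0;
        for (SSTableEntry e : newestFirst) {
            int contributes = e.columnMask & wantedMask & ~covered;
            if (contributes != 0) {
                toRead.add(e.sstable);
                covered |= contributes;
            }
            if ((covered & wantedMask) == wantedMask) break;
        }
        return toRead;
    }
}

In the (price, title) example above, this walk would visit the newest (price, title) 
update plus the one SSTable holding the remaining properties, and skip the dozen 
intermediate ones.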


> and is not trivial to maintain (either cpu- or memory-wise).
> However, two variants on this heuristic are already being considered for
> inclusion as part of the new sstable format we're introducing in
> CASSANDRA-7447 <https://issues.apache.org/jira/browse/CASSANDRA-7447>,
> which is an extension of the heuristic that is already applied at the
> whole sstable level.
> 
Very interesting. We have made some similar choices (albeit implemented differently, 
and for different reasons: e.g. timestamp/TTL compression, a per-file layout index, 
etc.). The described heuristics, in practice and for the workloads and update/retrieval 
patterns I described, are extremely effective. They also turned out to be particularly 
‘cheap’ to compute and store, and add very little overhead to the GET logic 
implementation. Again, these are configured on a per-CF basis (either method is 
selected), so we only do this where we know it makes sense to do it. 
Thank you for the link; some great ideas there. This looks like a great update 
to C*.


> (1) Per partition, we will store the maximum timestamp (whether or not this
> sits in the hash index / key cache, or in the clustering column index part,
> is an open question)
> - this permits us to stop looking at files once we have a complete set of
> the data we expect to return, i.e. all selected fields for the complete set
> of selected rows
> 
> (2) Per clustering row, we may store enough information to construct the
> max timestamp for the row
> - this permits us to stop looking at data pages if we know we have all
> selected fields for a given row only
> 
> 
> 
> 
> On Sun, Sep 7, 2014 at 11:30 PM, Mark Papadakis 
> wrote:
> 
>> Greetings,
>> 
>> This heuristic helps us eliminate unnecessary I/O in certain workloads and
>> datasets, often by many orders of magnitude. This is a description of the
>> problems we faced and how we dealt with them; I am pretty certain this can
>> be implemented easily in C*, although it will likely require a new SSTable
>> format that can support the semantics described below.
>> 
>> # Example
>> One of our services, a price comparison service, has many millions of
>> products in our datastore, and we access 100+ rows on a single page
>> request (almost all of them in 2-3 MultiGets, which are executed in
>> parallel in our datastore implementation). This is fine, and it rarely
>> takes more than 100ms to get back responses for all those requests.
>> 
>> Because, in practice, we need to update all key=>value rows multiple times
>> a day (merchants tend to update their prices every few hours for some odd
>> reason), a key’s columns exist in multiple (almost always all) SSTables of
>> a CF, and so we almost always have to merge the final value for each key
>> from all those SSTables, as opposed to only needing to access a single
>> SSTable to do that.
>> 
>> In fact, for most CFs of this service, we need to merge most of their
>> SSTables to get the final CF, for that same reason (rows update very
>> frequently, as opposed to, say, a ‘users’ CF where you typically only set
>> the row once on creation and very infrequently after that).
>> Surely, there must be ways to exploit this pattern and its access and
>> update semantics (there are).
>> 
>> Our SSTables are partitioned into chunks. One of those chunks is the index
>> chunk, which holds distinctKeyId:u64 => offset:u32, sorted by distinctKeyId.
>> We have a map which allows us to map distinctKeyId:u64 => data chunk
>> region (skip list), so that this offset is an offset into the respectiv
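A minimal Java sketch of the index-chunk lookup described above (names are 
hypothetical; the data-chunk region map and skip list are omitted, and the u64 key 
ids are held in signed longs for simplicity): a sorted array of distinct key ids 
with a parallel array of data offsets, resolved with a single binary search.

import java.util.Arrays;

public class IndexChunkSketch {
    private final long[] distinctKeyIds;  // sorted ascending
    private final int[] offsets;          // offsets[i] pairs with distinctKeyIds[i]

    IndexChunkSketch(long[] distinctKeyIds, int[] offsets) {
        this.distinctKeyIds = distinctKeyIds;
        this.offsets = offsets;
    }

    // Returns the data offset for the key, or -1 if this SSTable has no entry for it.
    int offsetOf(long distinctKeyId) {
        int i = Arrays.binarySearch(distinctKeyIds, distinctKeyId);
        return i >= 0 ? offsets[i] : -1;
    }
}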

Re: why bloom filter is only for row key?

2014-09-14 Thread Mark Papadakis
There was a per-CF skip list and bloom filter in earlier C* releases (serialized 
right before the actual value). I haven't checked lately, but I believe this is no 
longer the case, or it's configurable and disabled by default. There is probably a 
ticket on JIRA that outlines the rationale.
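For context on the question below, a minimal sketch of what a row-key bloom filter 
buys on the read path, using Guava's BloomFilter; the classes here are illustrative, 
not Cassandra's actual ones. Each SSTable keeps a filter over its row keys, so a read 
can skip files that definitely do not contain the key.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RowKeyFilterSketch {
    static final class SSTable {
        final String path;
        // One filter per SSTable, populated with the row keys the file contains.
        final BloomFilter<CharSequence> rowKeys =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
        SSTable(String path) { this.path = path; }
    }

    // Only the SSTables whose filter says "maybe" need to be touched; false positives
    // cost an extra read, false negatives cannot happen.
    static List<SSTable> candidates(List<SSTable> sstables, String rowKey) {
        List<SSTable> out = new ArrayList<>();
        for (SSTable t : sstables)
            if (t.rowKeys.mightContain(rowKey))
                out.add(t);
        return out;
    }
}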

Sent from my iPhone

> On 14 Sep 2014, at 9:22 p.m., Philo Yang  wrote:
> 
> Hi all,
> 
> After reading some docs, I find that the bloom filter is built on row keys, not
> on column keys. Can anyone tell me what the reasoning is for not building a
> bloom filter on column keys? Is it a good idea to offer a table property option
> between row key and primary key for what the bloom filter is built on?
> 
> Thanks,
> Philo Yang