Hi Frank,

You could also use Hadoop with no reducer, or with the IdentityReducer, which 
ensures data locality as long as you start a TaskTracker on the Cassandra nodes 
where the data resides.
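
For illustration, here is a minimal (untested) sketch of such a map-only job. 
The keyspace/column family names, the initial address and the class names 
(MapOnlyFactorizationJob, RowFactorizationMapper) are placeholders, not 
something that exists in the Cassandra tree:

import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MapOnlyFactorizationJob
{
    // Map-only: each task reads the rows of the split it was scheduled on, so
    // with a TaskTracker co-located on every Cassandra node the reads stay local.
    public static class RowFactorizationMapper
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, NullWritable>
    {
        @Override
        protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
        {
            // per-row factorization logic would go here; nothing is emitted
            // because the model stays on the local node
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        // placeholder cluster settings
        ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_matrix");
        // read all columns of each row
        ConfigHelper.setInputSlicePredicate(conf, new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               false, Integer.MAX_VALUE)));

        Job job = new Job(conf, "local-matrix-factorization");
        job.setJarByClass(MapOnlyFactorizationJob.class);
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setMapperClass(RowFactorizationMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(NullOutputFormat.class); // results are kept locally
        job.setNumReduceTasks(0);                          // no reduce phase
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}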

Concerning the difficulty of getting tokens in a vnodes environment, that is 
exactly what the Cassandra Hadoop integration already handles. You can have a look at 
https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/hadoop/AbstractColumnFamilyInputFormat.java#L114
which shows how splits are enumerated. And since you get the endpoint address of 
each split, you can choose where to launch your code.
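
For example (again an untested sketch, assuming a Configuration prepared with 
the same placeholder ConfigHelper settings as in the previous sketch), you could 
enumerate the splits yourself and group them by endpoint to decide where to 
launch your computation:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;

public class SplitLocality
{
    // Expects a Configuration prepared with ConfigHelper (keyspace, column
    // family, partitioner, initial address, slice predicate) as shown above.
    public static Map<String, Integer> splitsPerEndpoint(Configuration conf) throws Exception
    {
        Job job = new Job(conf);
        List<InputSplit> splits = new ColumnFamilyInputFormat().getSplits(job);

        // With vnodes there are many small token ranges, but each split still
        // reports the addresses of the replicas that own it.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (InputSplit split : splits)
            for (String endpoint : split.getLocations())
            {
                Integer current = counts.get(endpoint);
                counts.put(endpoint, current == null ? 1 : current + 1);
            }
        return counts;
    }
}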

Regards.
-- 
Cyril SCETBON
(Orange Sophia Antipolis)

On 23 Apr 2014, at 15:15, franck.me...@orange.com wrote:

> Hi,
> 
> 
> We are using Cassandra at Orange to manage a big sparse matrix on a cluster 
> of servers.
> 
> On this database we want to run a sparse matrix factorization algorithm.
> 
> 
> 
> We need to parallelize this matrix factorization algorithm, for instance by 
> computing the factorization model row by row.
> 
> So we want to distribute the computation of the rows on each server.
> 
> A natural way to do this would be to apply the algorithm on each server, 
> using the local rows that are stored by this server.
> 
> As the factorization model is also distributed, there is no need to merge the 
> results (no need for a kind of "reduce phase").
> 
> So there is no need for Hadoop.
> 
> Cassandra and the distributed algorithm on each server could be sufficient.
> 
> 
> 
> The problem is that access to local data is currently not easy with the 
> Cassandra API:
> 
> - There is a token() function that allows iterating over local rows.
> 
> - but this token() function works well only with the one-token-per-server 
> partitioning scheme of Cassandra; with the 256-virtual-token (vnodes) scheme, 
> it becomes very difficult to access local rows efficiently.
> 
> - Unfortunately, it seems that the one-token-per-server partitioning scheme is 
> not recommended and may become deprecated, as the latter scheme is more 
> efficient for cluster management.
> 
> 
> 
> We believe that easy access to local data could be a key feature for 
> Cassandra, enabling implicit parallelization strategies for many classes of 
> algorithms and classical processing.
> 
> To provide this key feature, it would suffice to offer an easy, transparent 
> and sustainable function to access local data (local tables). This function 
> would just have to remain compatible with future partitioning schemes.
> 
> 
> 
> Do you think this request could be a priority for Cassandra?
> 
> If so, when and how do you plan to provide this feature, so that we can adapt 
> our developments?
> 
> 
> 
> Many thanks for considering my request,
> 
> Best Regards,
> 
> 
> 
> Frank Meyer.
> 
> Research Engineer
> 
> Orange Labs - Lannion
> 
> 
> Frank Meyer
> France Telecom OLPS/UCE/CRM-DA/PROF (LD128)
> 2 avenue Pierre Marzin 22307 Lannion Cedex
> E-mail : franck.me...@orange.com<mailto:franck.me...@orange-ftgroup.com>
> Telephone : +33 (0)2 96 05 28 89
> http://www.francetelecom.com/rd
> 
> 
