Re: [Proposal] lxdb - proposal for Apache Incubation

lidong dai Fri, 05 Mar 2021 02:55:23 -0800

Hi,
  Kammi’s summary is very comprehensive. Try to open source the project first, and
you'd better find an experienced mentor to help you; it will be very
helpful! Good luck.



Best Regards
---------------
DolphinScheduler(Incubator) PPMC
Lidong Dai
dailidon...@gmail.com
---------------


On Sun, Feb 28, 2021 at 6:52 PM Furkan KAMACI <furkankam...@gmail.com>
wrote:

> Hi,
>
> Actually you have detailed documentation which explains your approach
> compared to similar systems and the resulting performance metrics,
> i.e. reducing storage by 10 to 100 times or having low-latency
> queries.
>
> My advice is as follows (some points are the same as Sheng's and Liang's):
>
> 1) Find an experienced mentor to guide you.
>
> 2) Start to translate your documentation to English.
>
> 3) Open source your project. How can we have a comment on your project if
> we cannot see anything about it?
>
> 4) Gain contributors to your project. At least you should show your
> intention to have committers/contributors from outside your company. Eliminate
> the risk of non-meritocratic management of the project.
>
> 5) Structure your proposal. Explain why people need this project, which
> problems current projects have and how you managed to handle them. We
> should understand whether it is a bundle of other projects, a completely new
> project, or a wrapper around other projects which eliminates their
> shortcomings.
>
> 6) Find a suitable name for your project, to avoid trademark problems
> that may cost you time once you enter incubation.
>
> Kind Regards,
> Furkan KAMACI
>
>
> On Sun, Feb 28, 2021 at 1:02 PM Liang Chen <chenliang6...@gmail.com>
> wrote:
>
> > Hi
> >
> > It would be better if you could find an experienced IPMC member to help
> > you prepare the proposal.
> > Based on Sheng Wu's input, I have one more comment: can you please explain
> > what the differences are from other similar data analysis DBs? You can
> > consider explaining from a use-case perspective.
> >
> > Regards
> > Liang
> >
> >
> > fp wrote
> > > Dear Apache Incubator Community,
> > >
> > >
> > > Please accept the following proposal for presentation and discussion:
> > > https://github.com/lucene-cn/lxdb/wiki
> > >
> > >
> > > LXDB is a high-performance OLAP and full-text search database. It is based on
> > > HBase, but replaces HFile with a Lucene index to support more effective
> > > secondary indexes. It is also based on Spark SQL, so you can use the SQL API
> > > to access data and run OLAP calculations. The Lucene index is stored on
> > > HDFS (not on local disk).
> > >
> > >
> > > In our production systems, LXDB supports 200+ clusters, some of which
> > > have 1000+ nodes, and inserts 200 billion rows per day (20000
> > > billion rows in total). One of the biggest single tables has 200 million
> > > Lucene indexes on LXDB.
> > >
> > >
> > > Doug Cutting, the father of Hadoop, split Nutch into HBase, MapReduce (Hive),
> > > HDFS and Lucene. We have merged these separate projects again: LXDB equals
> > > Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super database. It took me
> > > 10 years to complete this merging. But the purpose is no longer a
> > > search engine, but a database.
> > >
> > >
> > >
> > >
> > >
> > > Best regards
> > >   yannian mu
> > >
> > >
> > >
> > >
> > > LXDB Proposal
> > > == Abstract ==
> > > LXDB is a high-performance OLAP and full-text search database.
> > >
> > >
> > > === It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes ===
> > > We modified the HBase region server and changed HFile to Lucene: when putting
> > > data, we put a document into Lucene instead of putting data into HFile.
> > > The Lucene index is stored on the region server (it is not stored in a
> > > different cluster as with Elasticsearch + HBase, which takes two copies of the data).
> > >
> > >
> > > === It is based on Spark SQL for OLAP ===
> > > We integrated Spark and HBase together. Usage is like this:
> > > 1. Unpack lxdb.tar.gz.
> > > 2. Configure the hadoop_config path.
> > > 3. Run start-all.sh to start the cluster.
> > > LXDB starts Spark through Hadoop YARN, and then the Spark executor
> > > process starts an embedded HBase region server service.
> > >
> > >
> > > You can operate the LXDB database through the Spark SQL API (Hive) or the MySQL API.
> > > 1. The SQL uses Spark RDD + the HBase scanner to access HBase.
> > > 2. The SQL conditions (filter or group-by aggregation) are pushed down to HBase.
> > > 3. HBase uses the Lucene index to filter data in the region server.
> > > Spark, HBase and Lucene are all embedded and integrated together; they are
> > > not separate clusters. That is the difference from the Solr/ES +
> > > HBase + Spark solution.
> > >
> > >
> > > == Background ==
> > > === Multiple copies of data ===
> > > Apache HBase + Elasticsearch is the most popular solution for full-text
> > > search, but it is weak at online analytical processing.
> > > So most of the time the production system uses Spark (or Hive, Impala or
> > > Presto), HBase and Solr/ES at the same time. Multiple copies of data are
> > > stored in multiple systems, and the systems have different APIs. Data
> > > consistency is difficult to guarantee. For these reasons we merged
> > > Spark, HBase and Elasticsearch into one project. Its goal is to use one copy of
> > > data, one cluster and one API to cover OLAP, KV, full-text and other database
> > > scenarios.
> > >
> > >
> > > === Merging and splitting of Lucene indexes (HStore) across different machines on HDFS ===
> > > As we all know, Solr/ES store files in the local file system, so the shard
> > > count must be a fixed number. But if we store the index on HDFS, the index
> > > becomes splittable like an HBase HStore: it can split or merge across machine
> > > nodes. This is very useful for a distributed database, because it determines
> > > how many resources are allocated to a table. Most of the time the number of
> > > records in a table changes over time, so the number of shards always needs
> > > adjusting. If the index is stored locally it cannot split across different
> > > machines, but a Lucene index stored on HDFS can do it.
> > > Whether the number of shards can be flexibly adjusted, and whether the system
> > > has the ability to scale elastically, is particularly important in a
> > > distributed database.
> > >
> > >
> > >
> > > === Solving the insufficiency of secondary indexes ===
> > > Some people use HBase secondary indexes, as in the Phoenix project. But those
> > > schemes, based on the HBase rowkey, have a lot of redundancy: you cannot
> > > create too many indexes, and the data inflation rate is too high. So using a
> > > Lucene index instead of a secondary index is the best choice.
> > >
> > >
> > > === We add a Lucene index for Spark OLAP ===
> > > Most OLAP systems, such as Hive, Spark SQL, Impala or some of the MPP
> > > databases, have brute-force scanning problems and poor data timeliness.
> > > 1. They use brute-force scans to calculate the data. Another choice is to add
> > > an index to the big data. In some cases using an index can greatly improve on
> > > the performance of the original brute-force scanning. I think that, just
> > > as in a traditional database, indexing technology can greatly improve
> > > the speed of the database.
> > > 2. Another problem of those databases or systems is that most of them are
> > > offline or batch systems. LXDB's goal is real-time append and real-time
> > > KV update, just like HBase.
> > >
> > >
> > > == Future ==
> > > === Lucene on Parquet ===
> > > In the near future I will change the Lucene tim, tip (inverted index), dvd and
> > > dvm files to a format like Parquet or ORC. The goals are:
> > > - to solve the performance problem of traversing the Lucene index;
> > > - to solve the problem that opening a Lucene file needs to load files such as
> > > tip into memory, which makes opening a Lucene index file slow;
> > > - to enable Lucene to store multi-column joint indexes by column, which is
> > > used to handle logic such as multi-table joins, materialized views and
> > > multi-field group-by on the inverted index.
> > > The current Lucene index has many problems because of too many file pointers
> > > and its single-column design. We want to modify Lucene to make it more
> > > suitable for HDFS, good not only at full-text retrieval but also at
> > > statistical analysis, as a real database-level index. We want Lucene to be
> > > splittable, so that storage can be separated from computation.
> > >
> > >
> > >
> > >
> > > === Supporting all kinds of predicate pushdown calculation ===
> > > We find that if we combine the calculation method closely with the data,
> > > we can get more performance out of the database. An index
> > > is only one way of pushing calculation down. Another example is storage
> > > pushdown: we can store the index on an SSD device and the data part on a SATA
> > > device. We can store data that is often grouped together in advance,
> > > instead of calculating row by row. We can give important tables or
> > > columns dedicated devices and resources. HBase still lacks these
> > > capabilities, which we need to improve further.
> > >
> > >
> > > === Intervening in data distribution ===
> > > We can use the row key to direct data to different nodes, which makes
> > > many interesting things possible.
> > >
> > >
> > > === Resource control, resource isolation ===
> > > Lucene currently does not support resource isolation, but on HDFS
> > > we can do it. I can control the priority of SQL so that Lucene work with
> > > higher priority gets faster IO resources.
> > >
> > >
> > > == Status ==
> > > In 2011 I released the first open source version, mdrill, at Alibaba.
> > > At that time, mdrill used 10 nodes (48 GB machines) to support 400 billion
> > > records of data. The first index on HDFS comes from this version. It was one
> > > year ahead of the community. https://github.com/alibaba/mdrill
> > >
> > >
> > > In 2014 I stopped updating the mdrill project because I joined
> > > Tencent. In our team we developed the Hermes project, where we also built
> > > Lucene on HDFS. Hermes now imports 1000 billion rows of data per day in
> > > real time. It's the largest database I've ever developed:
> > > https://plus.tencent.com/bigdata/hermes
> > >
> > >
> > > In 2018 I set up my own company called Luxin; Lu Xin is the Chinese
> > > pronunciation of Lucene. As I am a fan of Lucene, the company's domain is
> > > lucene.xin and its mail domain is lucene.cn.
> > > Luxin's first version of LXDB is called LSQL, which means Lucene SQL.
> > > It uses Lucene (2.5.3) + HDFS + Spark (1.6.3). It is stable, and about 200+
> > > clusters use LSQL. It processes about 200 billion rows per day, with a total
> > > of 20000 billion rows in one single cluster (1000 nodes).
> > >
> > >
> > > In 2020, during COVID-19, our team decided to develop the next
> > > generation of LSQL, called LXDB (lx = Lucene pronunciation). We added
> > > HBase to LSQL to solve the update problem. We have now finished the
> > > first version of LXDB. https://github.com/lucene-cn/lxdb/wiki
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > == Known Risks ==
> > > === Meritocracy ===
> > >
> > >
> > > LXDB has been deployed in production and serves more than 200 lines
> > > of business. It has demonstrated great performance benefits and has
> > > proved to be a better way to do reporting and analysis on big data. Still, we
> > > look forward to growing a rich user and developer community.
> > >
> > >
> > > === Orphaned products ===
> > >
> > >
> > > The core developers currently work full-time for Luxin.
> > > LXDB is widely adopted by many companies and individuals. There's no
> > > realistic chance of it becoming orphaned. We also have a 1000-person
> > > Tencent QQ instant messaging group.
> > >
> > >
> > >
> > > === Inexperience with Open Source ===
> > >
> > > The core developers are all active users and followers of open source.
> > > They are already committers and contributors to the LXDB project.
> > > Developer yannian mu has ten years of experience on open source projects,
> > > including jstorm https://github.com/alibaba/jstorm and
> > > mdrill https://github.com/alibaba/mdrill
> > >
> > >
> > >
> > >
> > > === Homogenous Developers ===
> > >
> > >
> > > Most of the core developers are from Luxin, because the product has been
> > > closed source. But once LXDB is open sourced, it will receive many bug
> > > fixes and enhancements from other developers not working at Luxin. What
> > > you learn from the community, you return to the community.
> > >
> > >
> > >
> > >
> > >
> > > === Reliance on Salaried Developers ===
> > >
> > >
> > > Luxin invested in LXDB as the solution and some of its key engineers
> > > are working full time on the project. In addition, since there is a
> > > growing Big Data need for scalable solutions, we look forward to other
> > > Apache developers and researchers contributing to the project. Also key
> > > to addressing the risk associated with relying on salaried developers
> > > from a single entity is to increase the diversity of the contributors and
> > > actively lobby for it; Apache LXDB intends to do this.
> > >
> > >
> > > === An Excessive Fascination with the Apache Brand ===
> > >
> > >
> > > Lxdb is proposing to enter incubation at Apache in order to help efforts
> > > to diversify the committer-base, not so much to capitalize on the Apache
> > > brand. The Lxdb project is in production use already inside lxdb, but is
> > > not expected to be an lxdb product for external customers. As such, the
> > > lxdb project is not seeking to use the Apache brand as a marketing tool.
> > >
> > >
> > >
> > >
> > >
> > > === Documentation ===
> > >
> > >
> > > Information about LXDB can be found at https://github.com/lucene-cn/lxdb.
> > > The following links provide more information about LXDB in open source:
> > >
> > >
> > > * wiki site: https://github.com/lucene-cn/lxdb/wiki
> > > * Issue Tracking: https://github.com/lucene-cn/lxdb/issues
> > > * Overview: https://github.com/lucene-cn/lxdb/wiki/intro
> > > * Luxin home page: http://www.lucene.xin
> > >
> > > * lsql document: http://docs.lucene.xin/lsql/v21/
> > >
> > >
> > >
> > > == Initial Source ==
> > >
> > >
> > > LXDB will develop its source code under an Apache license at
> > > https://github.com/lucene-cn/lxdb.
> > >
> > >
> > >
> > >
> > >
> > >
> > > === Core Developers ===
> > >
> > >
> > >
> > > Currently most of the core developers of LXDB are working in the
> > > research team of Luxin.
> > >
> > >
> > > - yannian mu (dev)
> > > - yu chen (dev)
> > > - guangshi hao (dev)
> > > - wei sun (dev)
> > > - qihua zheng (dev)
> > > - xin wang (dev)
> > > - qingsong liu (dev)
> > > - anxing zhou (Tester)
> > > - jiajun duan (Tester)
> > >
> > >
> > >
> > > == External Dependencies ==
> > >
> > > All dependencies are managed using Apache Maven.
> > >
> > > Dependency   License              Optional?
> > > lucene       Apache License 2.0   true
> > > zookeeper    Apache License 2.0   true
> > > hbase        Apache License 2.0   true
> > > spark        Apache License 2.0   true
> > > hadoop       Apache License 2.0   true
> > > hive         Apache License 2.0   true
> > >
> > >
> > >
> > >
> > > == Required Resources ==
> > >
> > >
> > > === Mailing lists ===
> > >
> > >
> > >  * lxdb-private (PMC discussion)
> > >  * lxdb-dev (developer discussion)
> > >  * lxdb-user (user discussion)
> > >  * lxdb-commits (SCM commits)
> > >  * lxdb-issues (JIRA issue feed)
> > >
> > >
> > > === Subversion Directory ===
> > >
> > >
> > > Instead of Subversion, LXDB prefers Git as its source control
> > > management system: git://git.apache.org/lxdb
> >
> >
> >
> >
> >
> > --
> > Sent from: http://apache-incubator-general.996316.n3.nabble.com/
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org
> >
> >
>
