Bertty Contreras created COMDEV-472:
---------------------------------------
Summary: Apache Wayang (Incubating): AI Data Generator for Cost
Model Calibration
Key: COMDEV-472
URL: https://issues.apache.org/jira/browse/COMDEV-472
Project: Community Development
Issue Type: New Feature
Components: GSoC/Mentoring ideas
Reporter: Bertty Contreras
*Synopsis*
Apache Wayang (Incubating) currently uses a cost model to select the right
set of platforms while optimizing query plans. However, how often it picks
the correct configuration depends on the quality of that cost model. The
idea is to build an AI pipeline that generates data for the existing
profiler of Apache Wayang (Incubating), while a second AI component carries
out the calibration process itself.
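To make the idea concrete, here is a minimal sketch of the generate-then-calibrate loop. It is not Apache Wayang's profiler or API. The platform names, the linear cost formula (startup overhead plus per-tuple cost), and all function names are hypothetical, and ordinary least squares stands in for the AI calibration component.

```python
import random

# Hypothetical "true" per-platform costs, unknown to the optimizer:
# runtime = startup overhead + per-tuple cost * input cardinality.
TRUE_COSTS = {"platform_a": (5.0, 0.010), "platform_b": (200.0, 0.001)}


def generate_samples(platform, n, rng):
    """Stand-in for the data generator: emit (cardinality, runtime) pairs."""
    overhead, per_tuple = TRUE_COSTS[platform]
    samples = []
    for _ in range(n):
        cardinality = rng.randint(1_000, 1_000_000)
        noise = rng.gauss(0.0, 1.0)  # simulated measurement noise
        samples.append((cardinality, overhead + per_tuple * cardinality + noise))
    return samples


def calibrate(samples):
    """Fit runtime = a + b * cardinality by ordinary least squares."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def pick_platform(model, cardinality):
    """Choose the platform with the lowest estimated cost."""
    return min(model, key=lambda p: model[p][0] + model[p][1] * cardinality)


rng = random.Random(42)
model = {p: calibrate(generate_samples(p, 200, rng)) for p in TRUE_COSTS}
```

With the calibrated parameters, `pick_platform(model, 2_000)` favors the low-overhead platform and `pick_platform(model, 900_000)` favors the one with the cheaper per-tuple cost. A poorly calibrated model would move that crossover point and lead to wrong choices.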
*Benefits to Community*
The community gains the option of a well-calibrated cost model for their own
environments with little human effort. Because cost modelling is one of the
most difficult tasks, such an AI pipeline will improve users’ experience
with Apache Wayang (Incubating).
*Deliverables*
The expected deliverable is an adaptation of the paper "Expand your Training
Limits! Generating Training Data for ML-based Data Management" [1]. The
authors assume an ML-based cost model, so the idea needs modifications to run
in the current setup of Apache Wayang (Incubating).
The expected steps are the following:
* Understand the paper [1]
* Become familiar with the current profiling process of Apache Wayang (Incubating)
* Design the AI profile pipeline, based on [1] and the current profiler
* Discuss ideas on how to integrate the designed AI pipeline into Apache
Wayang (Incubating)
* Implement the AI-DataGenerator Component
*Related Work*
[1] [Expand your Training Limits! Generating Training Data for ML-based Data
Management|https://www.agora-ecosystem.com/publications_pdf/expand_training_limits.pdf]
[2] [RHEEMix in the data jungle: a cost-based optimizer for cross-platform
systems|https://wayang.apache.org/assets/pdf/paper/journal_vldb.pdf]
*Biographical Information*
Bertty Contreras-Rojas is a Senior Software Engineer at Databloom Inc. and a
member of the PPMC of Apache Wayang (Incubating). He has many years of
experience developing data-intensive processing systems for several
industries, such as banking. He was a research engineer at the Qatar Computing
Research Institute, where he was responsible for developing the declarative
query engine for Rheem and adding new underlying platforms to Rheem.
Rodrigo Pardo-Meza is a Senior Software Engineer at Databloom Inc. and a
member of the PPMC of Apache Wayang (Incubating). He has many years of
experience developing applications that support Big Data processing, with
experience implementing ETL processes over distributed systems to optimize
inventories in supply chains. He was a research engineer at the Qatar
Computing Research Institute, where he specialized in human interface
interaction with big data analytics. During this time, he co-developed an
ML-based cross-platform query optimizer.
Jorge Quiané is the head of the Big Data Systems research group at the Berlin
Institute for the Foundations of Learning and Data (BIFOLD) and a Principal
Researcher at DIMA (TU Berlin). He also acts as the Scientific Coordinator of
the IAM group at the German Research Center for Artificial Intelligence (DFKI).
His current research is in the broad area of big data: mainly in federated data
analytics, scalable data infrastructures, and distributed query processing. He
has published numerous research papers on data management and novel system
architectures. He has recently been honoured with the 2022 ACM SIGMOD Research
Highlight Award and the Best Paper Award at ICDE 2021 for his work on
“Efficient Control Flow in Dataflow Systems”. He holds five patents in core
database areas and on machine learning. Earlier in his career, he was a Senior
Scientist at the Qatar Computing Research Institute (QCRI) and a Postdoctoral
Researcher at Saarland University. He obtained his PhD in computer science from
INRIA (Nantes University).