This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a6ce63fb9c docs: update third party projects (#497)
a6ce63fb9c is described below
commit a6ce63fb9c82dc8f25f42f377b487c0de2aff826
Author: Matthew Powers <[email protected]>
AuthorDate: Thu Jan 25 11:18:05 2024 -0500
docs: update third party projects (#497)
---
site/third-party-projects.html | 79 ++++++++++++++++++++++--------------------
third-party-projects.md | 77 ++++++++++++++++++++--------------------
2 files changed, 81 insertions(+), 75 deletions(-)
diff --git a/site/third-party-projects.html b/site/third-party-projects.html
index ba0911b733..a0f7a953f8 100644
--- a/site/third-party-projects.html
+++ b/site/third-party-projects.html
@@ -141,40 +141,57 @@
<div class="col-12 col-md-9">
<p>This page tracks external software projects that supplement Apache
Spark and add to its ecosystem.</p>
-<p>To add a project, open a pull request against the <a
href="https://github.com/apache/spark-website">spark-website</a>
-repository. Add an entry to
-<a
href="https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md">this
markdown file</a>,
-then run <code class="language-plaintext highlighter-rouge">jekyll
build</code> to generate the HTML too. Include
-both in your pull request. See the README in this repo for more
information.</p>
+<h2 id="popular-libraries-with-pyspark-integrations">Popular libraries with
PySpark integrations</h2>
-<p>Note that all project and product names should follow <a
href="/trademarks.html">trademark guidelines</a>.</p>
+<ul>
+ <li><a
href="https://github.com/great-expectations/great_expectations">great-expectations</a>
- Always know what to expect from your data</li>
+ <li><a href="https://github.com/apache/airflow">Apache Airflow</a> - A
platform to programmatically author, schedule, and monitor workflows</li>
+ <li><a href="https://github.com/dmlc/xgboost">xgboost</a> - Scalable,
portable and distributed gradient boosting</li>
+ <li><a href="https://github.com/shap/shap">shap</a> - A game theoretic
approach to explain the output of any machine learning model</li>
+ <li><a href="https://github.com/awslabs/python-deequ">python-deequ</a> -
Measures data quality in large datasets</li>
+ <li><a href="https://github.com/datahub-project/datahub">datahub</a> -
Metadata platform for the modern data stack</li>
+ <li><a href="https://github.com/dbt-labs/dbt-spark">dbt-spark</a> - Enables
dbt to work with Apache Spark</li>
+</ul>
-<h2>spark-packages.org</h2>
+<h2 id="connectors">Connectors</h2>
-<p><a href="https://spark-packages.org/">spark-packages.org</a> is an
external,
-community-managed list of third-party libraries, add-ons, and applications
that work with
-Apache Spark. You can add a package as long as you have a GitHub
repository.</p>
+<ul>
+ <li><a
href="https://github.com/spark-redshift-community/spark-redshift">spark-redshift</a>
- Performant Redshift data source for Apache Spark</li>
+  <li><a
href="https://github.com/microsoft/sql-spark-connector">sql-spark-connector</a>
- Apache Spark Connector for SQL Server and Azure SQL</li>
+ <li><a
href="https://github.com/Azure/azure-cosmosdb-spark">azure-cosmos-spark</a> -
Apache Spark Connector for Azure Cosmos DB</li>
+ <li><a
href="https://github.com/Azure/azure-event-hubs-spark">azure-event-hubs-spark</a>
- Enables continuous data processing with Apache Spark and Azure Event
Hubs</li>
+ <li><a
href="https://github.com/Azure/azure-kusto-spark">azure-kusto-spark</a> -
Apache Spark connector for Azure Kusto</li>
+ <li><a href="https://github.com/mongodb/mongo-spark">mongo-spark</a> - The
MongoDB Spark connector</li>
+ <li><a
href="https://github.com/couchbase/couchbase-spark-connector">couchbase-spark-connector</a>
- The Official Couchbase Spark connector</li>
+ <li><a
href="https://github.com/datastax/spark-cassandra-connector">spark-cassandra-connector</a>
- DataStax connector for Apache Spark to Apache Cassandra</li>
+ <li><a
href="https://github.com/elastic/elasticsearch-hadoop">elasticsearch-hadoop</a>
- Elasticsearch real-time search and analytics natively integrated with
Spark</li>
+ <li><a
href="https://github.com/neo4j-contrib/neo4j-spark-connector">neo4j-spark-connector</a>
- Neo4j Connector for Apache Spark</li>
+ <li><a
href="https://github.com/StarRocks/starrocks-connector-for-apache-spark">starrocks-connector-for-apache-spark</a>
- StarRocks Apache Spark connector</li>
+  <li><a href="https://github.com/pingcap/tispark">tispark</a> - Built for
running Apache Spark on top of TiDB/TiKV</li>
+</ul>
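Most of the connectors above are published as Maven artifacts, so they can be pulled into a Spark session at launch time with Spark's built-in `--packages` flag. A minimal sketch; the coordinates and version below are illustrative only, check each connector's README for the current ones:

```shell
# Launch PySpark with a connector resolved from Maven Central.
# The coordinates/version here are examples, not authoritative.
pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.1

# spark-shell and spark-submit accept the same flag:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.1 my_job.py
```

Once the artifact is on the classpath, the connector typically registers a data source format that `spark.read.format(...)` / `df.write.format(...)` can use.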
+
+<h2 id="open-table-formats">Open table formats</h2>
+
+<ul>
+ <li><a href="https://delta.io">Delta Lake</a> - Storage layer that provides
ACID transactions and scalable metadata handling for Apache Spark workloads</li>
+  <li><a href="https://github.com/apache/hudi">Hudi</a> - Upserts, deletes, and
incremental processing on big data</li>
+ <li><a href="https://github.com/apache/iceberg">Iceberg</a> - Open table
format for analytic datasets</li>
+</ul>
<h2>Infrastructure projects</h2>
<ul>
- <li><a href="https://github.com/spark-jobserver/spark-jobserver">REST Job
Server for Apache Spark</a> -
-REST interface for managing and submitting Spark jobs on the same cluster.</li>
- <li><a href="http://mlbase.org/">MLbase</a> - Machine Learning research
project on top of Spark</li>
+  <li><a href="https://github.com/apache/kyuubi">Kyuubi</a> - A distributed,
multi-tenant gateway that provides serverless SQL on data warehouses and
lakehouses</li>
+ <li><a href="https://github.com/spark-jobserver/spark-jobserver">REST Job
Server for Apache Spark</a> - REST interface for managing and submitting Spark
jobs on the same cluster.</li>
<li><a href="https://mesos.apache.org/">Apache Mesos</a> - Cluster
management system that supports
running Spark</li>
<li><a href="https://www.alluxio.org/">Alluxio</a> (née Tachyon) - Memory
speed virtual distributed
storage system that supports running Spark</li>
<li><a href="https://github.com/filodb/FiloDB">FiloDB</a> - a Spark
integrated analytical/columnar
database, with in-memory option capable of sub-second concurrent queries</li>
- <li><a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose
notebook which supports 20+ language backends,
-including Apache Spark</li>
- <li><a href="https://github.com/EclairJS/eclairjs-node">EclairJS</a> -
enables Node.js developers to code
-against Spark, and data scientists to use Javascript in Jupyter notebooks.</li>
- <li><a href="https://github.com/Hydrospheredata/mist">Mist</a> - Serverless
proxy for Spark cluster (spark middleware)</li>
+ <li><a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose
notebook which supports 20+ language backends, including Apache Spark</li>
<li><a
href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator">K8S
Operator for Apache Spark</a> - Kubernetes operator for specifying and managing
the lifecycle of Apache Spark applications on Kubernetes.</li>
<li><a
href="https://developer.ibm.com/storage/products/ibm-spectrum-conductor-spark/">IBM
Spectrum Conductor</a> - Cluster management software that integrates with
Spark and modern computing frameworks.</li>
- <li><a href="https://delta.io">Delta Lake</a> - Storage layer that provides
ACID transactions and scalable metadata handling for Apache Spark
workloads.</li>
<li><a href="https://mlflow.org">MLflow</a> - Open source platform to manage
the machine learning lifecycle, including deploying models from diverse machine
learning libraries on Apache Spark.</li>
<li><a
href="https://datafu.apache.org/docs/spark/getting-started.html">Apache
DataFu</a> - A collection of utils and user-defined-functions for working with
large scale data in Apache Spark, as well as making Scala-Python
interoperability easier.</li>
</ul>
@@ -184,16 +201,6 @@ against Spark, and data scientists to use Javascript in
Jupyter notebooks.</li>
<ul>
<li><a href="https://mahout.apache.org/">Apache Mahout</a> - Previously on
Hadoop MapReduce,
Mahout has switched to using Spark as the backend</li>
- <li><a href="https://wiki.apache.org/mrql/">Apache MRQL</a> - A query
processing and optimization
-system for large-scale, distributed data analysis, built on top of Apache
Hadoop, Hama, and Spark</li>
- <li><a href="https://github.com/sameeragarwal/blinkdb">BlinkDB</a> - a
massively parallel, approximate query engine built
-on top of Shark and Spark</li>
- <li><a href="https://github.com/adobe-research/spindle">Spindle</a> -
Spark/Parquet-based web
-analytics query engine</li>
- <li><a
href="https://github.com/thunderain-project/thunderain">Thunderain</a> - a
framework
-for combining stream processing with historical data, think Lambda
architecture</li>
- <li><a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda
architecture on Apache Spark,
-Apache Kafka for real-time large scale machine learning</li>
<li><a href="https://github.com/bigdatagenomics/adam">ADAM</a> - A framework
and CLI for loading,
transforming, and analyzing genomic data using Apache Spark</li>
<li><a href="https://github.com/salesforce/TransmogrifAI">TransmogrifAI</a>
- AutoML library for building modular, reusable, strongly typed machine
learning workflows on Spark with minimal hand tuning</li>
@@ -204,7 +211,6 @@ transforming, and analyzing genomic data using Apache
Spark</li>
<h2>Performance, monitoring, and debugging tools for Spark</h2>
<ul>
- <li><a href="https://github.com/g1thubhub/phil_stopwatch">Performance and
debugging library</a> - A library to analyze Spark and PySpark applications for
improving performance and finding the cause of failures</li>
<li><a href="https://www.datamechanics.co/delight">Data Mechanics
Delight</a> - Delight is a free, hosted, cross-platform Spark UI alternative
backed by an open-source Spark agent. It features new metrics and
visualizations to simplify Spark monitoring and performance tuning.</li>
</ul>
@@ -219,16 +225,9 @@ transforming, and analyzing genomic data using Apache
Spark</li>
<h3>Clojure</h3>
<ul>
- <li><a
href="https://github.com/TheClimateCorporation/clj-spark">clj-spark</a></li>
<li><a href="https://github.com/zero-one-group/geni">Geni</a> - A Clojure
dataframe library that runs on Apache Spark with a focus on optimizing the REPL
experience.</li>
</ul>
-<h3>Groovy</h3>
-
-<ul>
- <li><a
href="https://github.com/bunions1/groovy-spark-example">groovy-spark-example</a></li>
-</ul>
-
<h3>Julia</h3>
<ul>
@@ -241,6 +240,12 @@ transforming, and analyzing genomic data using Apache
Spark</li>
<li><a href="https://github.com/JetBrains/kotlin-spark-api">Kotlin for
Apache Spark</a></li>
</ul>
+<h2 id="adding-new-projects">Adding new projects</h2>
+
+<p>To add a project, open a pull request against the <a
href="https://github.com/apache/spark-website">spark-website</a> repository.
Add an entry to <a
href="https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md">this
markdown file</a>, then run <code class="language-plaintext
highlighter-rouge">jekyll build</code> to generate the HTML too. Include both
in your pull request. See the README in this repo for more information.</p>
+
+<p>Note that all project and product names should follow <a
href="/trademarks.html">trademark guidelines</a>.</p>
+
</div>
<div class="col-12 col-md-3">
<div class="news" style="margin-bottom: 20px;">
diff --git a/third-party-projects.md b/third-party-projects.md
index cf6f3c8102..e8b4b16c85 100644
--- a/third-party-projects.md
+++ b/third-party-projects.md
@@ -9,39 +9,50 @@ navigation:
This page tracks external software projects that supplement Apache Spark and
add to its ecosystem.
-To add a project, open a pull request against the
[spark-website](https://github.com/apache/spark-website)
-repository. Add an entry to
-[this markdown
file](https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md),
-then run `jekyll build` to generate the HTML too. Include
-both in your pull request. See the README in this repo for more information.
-
-Note that all project and product names should follow [trademark
guidelines](/trademarks.html).
-
-<h2>spark-packages.org</h2>
-
-<a href="https://spark-packages.org/">spark-packages.org</a> is an external,
-community-managed list of third-party libraries, add-ons, and applications
that work with
-Apache Spark. You can add a package as long as you have a GitHub repository.
+## Popular libraries with PySpark integrations
+
+-
[great-expectations](https://github.com/great-expectations/great_expectations)
- Always know what to expect from your data
+- [Apache Airflow](https://github.com/apache/airflow) - A platform to
programmatically author, schedule, and monitor workflows
+- [xgboost](https://github.com/dmlc/xgboost) - Scalable, portable and
distributed gradient boosting
+- [shap](https://github.com/shap/shap) - A game theoretic approach to explain
the output of any machine learning model
+- [python-deequ](https://github.com/awslabs/python-deequ) - Measures data
quality in large datasets
+- [datahub](https://github.com/datahub-project/datahub) - Metadata platform
for the modern data stack
+- [dbt-spark](https://github.com/dbt-labs/dbt-spark) - Enables dbt to work
with Apache Spark
+
+## Connectors
+
+- [spark-redshift](https://github.com/spark-redshift-community/spark-redshift)
- Performant Redshift data source for Apache Spark
+- [sql-spark-connector](https://github.com/microsoft/sql-spark-connector) -
Apache Spark Connector for SQL Server and Azure SQL
+- [azure-cosmos-spark](https://github.com/Azure/azure-cosmosdb-spark) - Apache
Spark Connector for Azure Cosmos DB
+- [azure-event-hubs-spark](https://github.com/Azure/azure-event-hubs-spark) -
Enables continuous data processing with Apache Spark and Azure Event Hubs
+- [azure-kusto-spark](https://github.com/Azure/azure-kusto-spark) - Apache
Spark connector for Azure Kusto
+- [mongo-spark](https://github.com/mongodb/mongo-spark) - The MongoDB Spark
connector
+-
[couchbase-spark-connector](https://github.com/couchbase/couchbase-spark-connector)
- The Official Couchbase Spark connector
+-
[spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector)
- DataStax connector for Apache Spark to Apache Cassandra
+- [elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop) -
Elasticsearch real-time search and analytics natively integrated with Spark
+-
[neo4j-spark-connector](https://github.com/neo4j-contrib/neo4j-spark-connector)
- Neo4j Connector for Apache Spark
+-
[starrocks-connector-for-apache-spark](https://github.com/StarRocks/starrocks-connector-for-apache-spark)
- StarRocks Apache Spark connector
+- [tispark](https://github.com/pingcap/tispark) - Built for running Apache
Spark on top of TiDB/TiKV
+
+## Open table formats
+
+- [Delta Lake](https://delta.io) - Storage layer that provides ACID
transactions and scalable metadata handling for Apache Spark workloads
+- [Hudi](https://github.com/apache/hudi) - Upserts, deletes, and incremental
processing on big data
+- [Iceberg](https://github.com/apache/iceberg) - Open table format for
analytic datasets
<h2>Infrastructure projects</h2>
-- <a href="https://github.com/spark-jobserver/spark-jobserver">REST Job Server
for Apache Spark</a> -
-REST interface for managing and submitting Spark jobs on the same cluster.
-- <a href="http://mlbase.org/">MLbase</a> - Machine Learning research project
on top of Spark
+- [Kyuubi](https://github.com/apache/kyuubi) - A distributed, multi-tenant
gateway that provides serverless SQL on data warehouses and lakehouses
+- <a href="https://github.com/spark-jobserver/spark-jobserver">REST Job Server
for Apache Spark</a> - REST interface for managing and submitting Spark jobs on
the same cluster.
- <a href="https://mesos.apache.org/">Apache Mesos</a> - Cluster management
system that supports
running Spark
- <a href="https://www.alluxio.org/">Alluxio</a> (née Tachyon) - Memory speed
virtual distributed
storage system that supports running Spark
- <a href="https://github.com/filodb/FiloDB">FiloDB</a> - a Spark integrated
analytical/columnar
database, with in-memory option capable of sub-second concurrent queries
-- <a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose notebook
which supports 20+ language backends,
-including Apache Spark
-- <a href="https://github.com/EclairJS/eclairjs-node">EclairJS</a> - enables
Node.js developers to code
-against Spark, and data scientists to use Javascript in Jupyter notebooks.
-- <a href="https://github.com/Hydrospheredata/mist">Mist</a> - Serverless
proxy for Spark cluster (spark middleware)
+- <a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose notebook
which supports 20+ language backends, including Apache Spark
- <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator">K8S
Operator for Apache Spark</a> - Kubernetes operator for specifying and managing
the lifecycle of Apache Spark applications on Kubernetes.
- <a
href="https://developer.ibm.com/storage/products/ibm-spectrum-conductor-spark/">IBM
Spectrum Conductor</a> - Cluster management software that integrates with
Spark and modern computing frameworks.
-- <a href="https://delta.io">Delta Lake</a> - Storage layer that provides ACID
transactions and scalable metadata handling for Apache Spark workloads.
- <a href="https://mlflow.org">MLflow</a> - Open source platform to manage the
machine learning lifecycle, including deploying models from diverse machine
learning libraries on Apache Spark.
- <a href="https://datafu.apache.org/docs/spark/getting-started.html">Apache
DataFu</a> - A collection of utils and user-defined-functions for working with
large scale data in Apache Spark, as well as making Scala-Python
interoperability easier.
@@ -49,16 +60,6 @@ against Spark, and data scientists to use Javascript in
Jupyter notebooks.
- <a href="https://mahout.apache.org/">Apache Mahout</a> - Previously on
Hadoop MapReduce,
Mahout has switched to using Spark as the backend
-- <a href="https://wiki.apache.org/mrql/">Apache MRQL</a> - A query processing
and optimization
-system for large-scale, distributed data analysis, built on top of Apache
Hadoop, Hama, and Spark
-- <a href="https://github.com/sameeragarwal/blinkdb">BlinkDB</a> - a massively
parallel, approximate query engine built
-on top of Shark and Spark
-- <a href="https://github.com/adobe-research/spindle">Spindle</a> -
Spark/Parquet-based web
-analytics query engine
-- <a href="https://github.com/thunderain-project/thunderain">Thunderain</a> -
a framework
-for combining stream processing with historical data, think Lambda architecture
-- <a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda
architecture on Apache Spark,
-Apache Kafka for real-time large scale machine learning
- <a href="https://github.com/bigdatagenomics/adam">ADAM</a> - A framework and
CLI for loading,
transforming, and analyzing genomic data using Apache Spark
- <a href="https://github.com/salesforce/TransmogrifAI">TransmogrifAI</a> -
AutoML library for building modular, reusable, strongly typed machine learning
workflows on Spark with minimal hand tuning
@@ -67,7 +68,6 @@ transforming, and analyzing genomic data using Apache Spark
<h2>Performance, monitoring, and debugging tools for Spark</h2>
-- <a href="https://github.com/g1thubhub/phil_stopwatch">Performance and
debugging library</a> - A library to analyze Spark and PySpark applications for
improving performance and finding the cause of failures
- <a href="https://www.datamechanics.co/delight">Data Mechanics Delight</a> -
Delight is a free, hosted, cross-platform Spark UI alternative backed by an
open-source Spark agent. It features new metrics and visualizations to simplify
Spark monitoring and performance tuning.
<h2>Additional language bindings</h2>
@@ -78,13 +78,8 @@ transforming, and analyzing genomic data using Apache Spark
<h3>Clojure</h3>
-- <a href="https://github.com/TheClimateCorporation/clj-spark">clj-spark</a>
- <a href="https://github.com/zero-one-group/geni">Geni</a> - A Clojure
dataframe library that runs on Apache Spark with a focus on optimizing the REPL
experience.
-<h3>Groovy</h3>
-
-- <a
href="https://github.com/bunions1/groovy-spark-example">groovy-spark-example</a>
-
<h3>Julia</h3>
- <a href="https://github.com/dfdx/Spark.jl">Spark.jl</a>
@@ -92,3 +87,9 @@ transforming, and analyzing genomic data using Apache Spark
<h3>Kotlin</h3>
- <a href="https://github.com/JetBrains/kotlin-spark-api">Kotlin for Apache
Spark</a>
+
+## Adding new projects
+
+To add a project, open a pull request against the
[spark-website](https://github.com/apache/spark-website) repository. Add an
entry to [this markdown
file](https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md),
then run `jekyll build` to generate the HTML too. Include both in your pull
request. See the README in this repo for more information.
+
+Note that all project and product names should follow [trademark
guidelines](/trademarks.html).
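The contribution steps described in the new "Adding new projects" section can be sketched as a shell workflow (the branch name, editor, and commit message below are illustrative; see the spark-website README for the authoritative build instructions):

```shell
# Clone the website repo and create a working branch.
git clone https://github.com/apache/spark-website.git
cd spark-website
git checkout -b add-my-project

# Add the new entry to the markdown source...
"$EDITOR" third-party-projects.md

# ...then regenerate the HTML so both files go into one pull request.
# (Depending on your setup this may be `bundle exec jekyll build`.)
jekyll build

git add third-party-projects.md site/third-party-projects.html
git commit -m "docs: add my-project to third party projects"
```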
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]