This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a6ce63fb9c docs: update third party projects (#497)
a6ce63fb9c is described below
commit a6ce63fb9c82dc8f25f42f377b487c0de2aff826
Author: Matthew Powers <[email protected]>
AuthorDate: Thu Jan 25 11:18:05 2024 -0500
docs: update third party projects (#497)
---
site/third-party-projects.html | 79 ++++++++++++++++++++++--------------------
third-party-projects.md | 77 ++++++++++++++++++++--------------------
2 files changed, 81 insertions(+), 75 deletions(-)
diff --git a/site/third-party-projects.html b/site/third-party-projects.html
index ba0911b733..a0f7a953f8 100644
--- a/site/third-party-projects.html
+++ b/site/third-party-projects.html
@@ -141,40 +141,57 @@
<div class="col-12 col-md-9">
<p>This page tracks external software projects that supplement Apache
Spark and add to its ecosystem.</p>
-<p>To add a project, open a pull request against the <a
href="https://github.com/apache/spark-website">spark-website</a>
-repository. Add an entry to
-<a
href="https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md">this
markdown file</a>,
-then run <code class="language-plaintext highlighter-rouge">jekyll
build</code> to generate the HTML too. Include
-both in your pull request. See the README in this repo for more
information.</p>
+<h2 id="popular-libraries-with-pyspark-integrations">Popular libraries with
PySpark integrations</h2>
-<p>Note that all project and product names should follow <a
href="/trademarks.html">trademark guidelines</a>.</p>
+<ul>
+ <li><a
href="https://github.com/great-expectations/great_expectations">great-expectations</a>
- Always know what to expect from your data</li>
+ <li><a href="https://github.com/apache/airflow">Apache Airflow</a> - A
platform to programmatically author, schedule, and monitor workflows</li>
+ <li><a href="https://github.com/dmlc/xgboost">xgboost</a> - Scalable,
portable and distributed gradient boosting</li>
+ <li><a href="https://github.com/shap/shap">shap</a> - A game theoretic
approach to explain the output of any machine learning model</li>
+ <li><a href="https://github.com/awslabs/python-deequ">python-deequ</a> -
Measures data quality in large datasets</li>
+ <li><a href="https://github.com/datahub-project/datahub">datahub</a> -
Metadata platform for the modern data stack</li>
+ <li><a href="https://github.com/dbt-labs/dbt-spark">dbt-spark</a> - Enables
dbt to work with Apache Spark</li>
+</ul>
-<h2>spark-packages.org</h2>
+<h2 id="connectors">Connectors</h2>
-<p><a href="https://spark-packages.org/">spark-packages.org</a> is an
external,
-community-managed list of third-party libraries, add-ons, and applications
that work with
-Apache Spark. You can add a package as long as you have a GitHub
repository.</p>
+<ul>
+ <li><a
href="https://github.com/spark-redshift-community/spark-redshift">spark-redshift</a>
- Performant Redshift data source for Apache Spark</li>
+  <li><a
href="https://github.com/microsoft/sql-spark-connector">sql-spark-connector</a>
- Apache Spark Connector for SQL Server and Azure SQL</li>
+ <li><a
href="https://github.com/Azure/azure-cosmosdb-spark">azure-cosmos-spark</a> -
Apache Spark Connector for Azure Cosmos DB</li>
+ <li><a
href="https://github.com/Azure/azure-event-hubs-spark">azure-event-hubs-spark</a>
- Enables continuous data processing with Apache Spark and Azure Event
Hubs</li>
+ <li><a
href="https://github.com/Azure/azure-kusto-spark">azure-kusto-spark</a> -
Apache Spark connector for Azure Kusto</li>
+ <li><a href="https://github.com/mongodb/mongo-spark">mongo-spark</a> - The
MongoDB Spark connector</li>
+ <li><a
href="https://github.com/couchbase/couchbase-spark-connector">couchbase-spark-connector</a>
- The Official Couchbase Spark connector</li>
+ <li><a
href="https://github.com/datastax/spark-cassandra-connector">spark-cassandra-connector</a>
- DataStax connector for Apache Spark to Apache Cassandra</li>
+ <li><a
href="https://github.com/elastic/elasticsearch-hadoop">elasticsearch-hadoop</a>
- Elasticsearch real-time search and analytics natively integrated with
Spark</li>
+ <li><a
href="https://github.com/neo4j-contrib/neo4j-spark-connector">neo4j-spark-connector</a>
- Neo4j Connector for Apache Spark</li>
+ <li><a
href="https://github.com/StarRocks/starrocks-connector-for-apache-spark">starrocks-connector-for-apache-spark</a>
- StarRocks Apache Spark connector</li>
+  <li><a href="https://github.com/pingcap/tispark">tispark</a> - Built for
running Apache Spark on top of TiDB/TiKV</li>
+</ul>
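Most of the connectors above are published as Maven artifacts, so they can be pulled into a Spark session at launch time with Spark's built-in `--packages` flag. A minimal sketch; the coordinates and version below are illustrative only, check each connector's README for the current ones:

```shell
# Launch PySpark with a connector resolved from Maven Central.
# The coordinates/version here are examples, not authoritative.
pyspark --packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.1

# spark-shell and spark-submit accept the same flag:
spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.1 my_job.py
```

Once the artifact is on the classpath, the connector typically registers a data source format that `spark.read.format(...)` / `df.write.format(...)` can use.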
+
+<h2 id="open-table-formats">Open table formats</h2>
+
+<ul>
+ <li><a href="https://delta.io">Delta Lake</a> - Storage layer that provides
ACID transactions and scalable metadata handling for Apache Spark workloads</li>
+  <li><a href="https://github.com/apache/hudi">Hudi</a> - Upserts, deletes, and
incremental processing on big data</li>
+ <li><a href="https://github.com/apache/iceberg">Iceberg</a> - Open table
format for analytic datasets</li>
+</ul>
<h2>Infrastructure projects</h2>
<ul>
- <li><a href="https://github.com/spark-jobserver/spark-jobserver">REST Job
Server for Apache Spark</a> -
-REST interface for managing and submitting Spark jobs on the same cluster.</li>
- <li><a href="http://mlbase.org/">MLbase</a> - Machine Learning research
project on top of Spark</li>
+  <li><a href="https://github.com/apache/kyuubi">Kyuubi</a> - A distributed,
multi-tenant gateway that provides serverless SQL on data warehouses and
lakehouses</li>
+ <li><a href="https://github.com/spark-jobserver/spark-jobserver">REST Job
Server for Apache Spark</a> - REST interface for managing and submitting Spark
jobs on the same cluster.</li>
<li><a href="https://mesos.apache.org/">Apache Mesos</a> - Cluster
management system that supports
running Spark</li>
<li><a href="https://www.alluxio.org/">Alluxio</a> (née Tachyon) - Memory
speed virtual distributed
storage system that supports running Spark</li>
<li><a href="https://github.com/filodb/FiloDB">FiloDB</a> - a Spark
integrated analytical/columnar
database, with in-memory option capable of sub-second concurrent queries</li>
- <li><a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose
notebook which supports 20+ language backends,
-including Apache Spark</li>
- <li><a href="https://github.com/EclairJS/eclairjs-node">EclairJS</a> -
enables Node.js developers to code
-against Spark, and data scientists to use Javascript in Jupyter notebooks.</li>
- <li><a href="https://github.com/Hydrospheredata/mist">Mist</a> - Serverless
proxy for Spark cluster (spark middleware)</li>
+ <li><a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose
notebook which supports 20+ language backends, including Apache Spark</li>
<li><a
href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator">K8S
Operator for Apache Spark</a> - Kubernetes operator for specifying and managing
the lifecycle of Apache Spark applications on Kubernetes.</li>
<li><a
href="https://developer.ibm.com/storage/products/ibm-spectrum-conductor-spark/">IBM
Spectrum Conductor</a> - Cluster management software that integrates with
Spark and modern computing frameworks.</li>
- <li><a href="https://delta.io">Delta Lake</a> - Storage layer that provides
ACID transactions and scalable metadata handling for Apache Spark
workloads.</li>
<li><a href="https://mlflow.org">MLflow</a> - Open source platform to manage
the machine learning lifecycle, including deploying models from diverse machine
learning libraries on Apache Spark.</li>
<li><a
href="https://datafu.apache.org/docs/spark/getting-started.html">Apache
DataFu</a> - A collection of utils and user-defined-functions for working with
large scale data in Apache Spark, as well as making Scala-Python
interoperability easier.</li>
</ul>
@@ -184,16 +201,6 @@ against Spark, and data scientists to use Javascript in
Jupyter notebooks.</li>
<ul>
<li><a href="https://mahout.apache.org/">Apache Mahout</a> - Previously on
Hadoop MapReduce,
Mahout has switched to using Spark as the backend</li>
- <li><a href="https://wiki.apache.org/mrql/">Apache MRQL</a> - A query
processing and optimization
-system for large-scale, distributed data analysis, built on top of Apache
Hadoop, Hama, and Spark</li>
- <li><a href="https://github.com/sameeragarwal/blinkdb">BlinkDB</a> - a
massively parallel, approximate query engine built
-on top of Shark and Spark</li>
- <li><a href="https://github.com/adobe-research/spindle">Spindle</a> -
Spark/Parquet-based web
-analytics query engine</li>
- <li><a
href="https://github.com/thunderain-project/thunderain">Thunderain</a> - a
framework
-for combining stream processing with historical data, think Lambda
architecture</li>
- <li><a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda
architecture on Apache Spark,
-Apache Kafka for real-time large scale machine learning</li>
<li><a href="https://github.com/bigdatagenomics/adam">ADAM</a> - A framework
and CLI for loading,
transforming, and analyzing genomic data using Apache Spark</li>
<li><a href="https://github.com/salesforce/TransmogrifAI">TransmogrifAI</a>
- AutoML library for building modular, reusable, strongly typed machine
learning workflows on Spark with minimal hand tuning</li>
@@ -204,7 +211,6 @@ transforming, and analyzing genomic data using Apache
Spark</li>
<h2>Performance, monitoring, and debugging tools for Spark</h2>
<ul>
- <li><a href="https://github.com/g1thubhub/phil_stopwatch">Performance and
debugging library</a> - A library to analyze Spark and PySpark applications for
improving performance and finding the cause of failures</li>
<li><a href="https://www.datamechanics.co/delight">Data Mechanics
Delight</a> - Delight is a free, hosted, cross-platform Spark UI alternative
backed by an open-source Spark agent. It features new metrics and
visualizations to simplify Spark monitoring and performance tuning.</li>
</ul>
@@ -219,16 +225,9 @@ transforming, and analyzing genomic data using Apache
Spark</li>
<h3>Clojure</h3>
<ul>
- <li><a
href="https://github.com/TheClimateCorporation/clj-spark">clj-spark</a></li>
<li><a href="https://github.com/zero-one-group/geni">Geni</a> - A Clojure
dataframe library that runs on Apache Spark with a focus on optimizing the REPL
experience.</li>
</ul>
-<h3>Groovy</h3>
-
-<ul>
- <li><a
href="https://github.com/bunions1/groovy-spark-example">groovy-spark-example</a></li>
-</ul>
-
<h3>Julia</h3>
<ul>
@@ -241,6 +240,12 @@ transforming, and analyzing genomic data using Apache
Spark</li>
<li><a href="https://github.com/JetBrains/kotlin-spark-api">Kotlin for
Apache Spark</a></li>
</ul>
+<h2 id="adding-new-projects">Adding new projects</h2>
+
+<p>To add a project, open a pull request against the <a
href="https://github.com/apache/spark-website">spark-website</a> repository.
Add an entry to <a
href="https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md">this
markdown file</a>, then run <code class="language-plaintext
highlighter-rouge">jekyll build</code> to generate the HTML too. Include both
in your pull request. See the README in this repo for more information.</p>
+
+<p>Note that all project and product names should follow <a
href="/trademarks.html">trademark guidelines</a>.</p>
+
</div>
<div class="col-12 col-md-3">
<div class="news" style="margin-bottom: 20px;">
diff --git a/third-party-projects.md b/third-party-projects.md
index cf6f3c8102..e8b4b16c85 100644
--- a/third-party-projects.md
+++ b/third-party-projects.md
@@ -9,39 +9,50 @@ navigation:
This page tracks external software projects that supplement Apache Spark and
add to its ecosystem.
-To add a project, open a pull request against the
[spark-website](https://github.com/apache/spark-website)
-repository. Add an entry to
-[this markdown
file](https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md),
-then run `jekyll build` to generate the HTML too. Include
-both in your pull request. See the README in this repo for more information.
-
-Note that all project and product names should follow [trademark
guidelines](/trademarks.html).
-
-<h2>spark-packages.org</h2>
-
-<a href="https://spark-packages.org/">spark-packages.org</a> is an external,
-community-managed list of third-party libraries, add-ons, and applications
that work with
-Apache Spark. You can add a package as long as you have a GitHub repository.
+## Popular libraries with PySpark integrations
+
+-
[great-expectations](https://github.com/great-expectations/great_expectations)
- Always know what to expect from your data
+- [Apache Airflow](https://github.com/apache/airflow) - A platform to
programmatically author, schedule, and monitor workflows
+- [xgboost](https://github.com/dmlc/xgboost) - Scalable, portable and
distributed gradient boosting
+- [shap](https://github.com/shap/shap) - A game theoretic approach to explain
the output of any machine learning model
+- [python-deequ](https://github.com/awslabs/python-deequ) - Measures data
quality in large datasets
+- [datahub](https://github.com/datahub-project/datahub) - Metadata platform
for the modern data stack
+- [dbt-spark](https://github.com/dbt-labs/dbt-spark) - Enables dbt to work
with Apache Spark
+
+## Connectors
+
+- [spark-redshift](https://github.com/spark-redshift-community/spark-redshift)
- Performant Redshift data source for Apache Spark
+- [sql-spark-connector](https://github.com/microsoft/sql-spark-connector) -
Apache Spark Connector for SQL Server and Azure SQL
+- [azure-cosmos-spark](https://github.com/Azure/azure-cosmosdb-spark) - Apache
Spark Connector for Azure Cosmos DB
+- [azure-event-hubs-spark](https://github.com/Azure/azure-event-hubs-spark) -
Enables continuous data processing with Apache Spark and Azure Event Hubs
+- [azure-kusto-spark](https://github.com/Azure/azure-kusto-spark) - Apache
Spark connector for Azure Kusto
+- [mongo-spark](https://github.com/mongodb/mongo-spark) - The MongoDB Spark
connector
+-
[couchbase-spark-connector](https://github.com/couchbase/couchbase-spark-connector)
- The Official Couchbase Spark connector
+-
[spark-cassandra-connector](https://github.com/datastax/spark-cassandra-connector)
- DataStax connector for Apache Spark to Apache Cassandra
+- [elasticsearch-hadoop](https://github.com/elastic/elasticsearch-hadoop) -
Elasticsearch real-time search and analytics natively integrated with Spark
+-
[neo4j-spark-connector](https://github.com/neo4j-contrib/neo4j-spark-connector)
- Neo4j Connector for Apache Spark
+-
[starrocks-connector-for-apache-spark](https://github.com/StarRocks/starrocks-connector-for-apache-spark)
- StarRocks Apache Spark connector
+- [tispark](https://github.com/pingcap/tispark) - Built for running Apache
Spark on top of TiDB/TiKV
+
+## Open table formats
+
+- [Delta Lake](https://delta.io) - Storage layer that provides ACID
transactions and scalable metadata handling for Apache Spark workloads
+- [Hudi](https://github.com/apache/hudi) - Upserts, deletes, and incremental
processing on big data
+- [Iceberg](https://github.com/apache/iceberg) - Open table format for
analytic datasets
<h2>Infrastructure projects</h2>
-- <a href="https://github.com/spark-jobserver/spark-jobserver">REST Job Server
for Apache Spark</a> -
-REST interface for managing and submitting Spark jobs on the same cluster.
-- <a href="http://mlbase.org/">MLbase</a> - Machine Learning research project
on top of Spark
+- [Kyuubi](https://github.com/apache/kyuubi) - A distributed, multi-tenant
gateway that provides serverless SQL on data warehouses and lakehouses
+- <a href="https://github.com/spark-jobserver/spark-jobserver">REST Job Server
for Apache Spark</a> - REST interface for managing and submitting Spark jobs on
the same cluster.
- <a href="https://mesos.apache.org/">Apache Mesos</a> - Cluster management
system that supports
running Spark
- <a href="https://www.alluxio.org/">Alluxio</a> (née Tachyon) - Memory speed
virtual distributed
storage system that supports running Spark
- <a href="https://github.com/filodb/FiloDB">FiloDB</a> - a Spark integrated
analytical/columnar
database, with in-memory option capable of sub-second concurrent queries
-- <a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose notebook
which supports 20+ language backends,
-including Apache Spark
-- <a href="https://github.com/EclairJS/eclairjs-node">EclairJS</a> - enables
Node.js developers to code
-against Spark, and data scientists to use Javascript in Jupyter notebooks.
-- <a href="https://github.com/Hydrospheredata/mist">Mist</a> - Serverless
proxy for Spark cluster (spark middleware)
+- <a href="http://zeppelin-project.org/">Zeppelin</a> - Multi-purpose notebook
which supports 20+ language backends, including Apache Spark
- <a href="https://github.com/GoogleCloudPlatform/spark-on-k8s-operator">K8S
Operator for Apache Spark</a> - Kubernetes operator for specifying and managing
the lifecycle of Apache Spark applications on Kubernetes.
- <a
href="https://developer.ibm.com/storage/products/ibm-spectrum-conductor-spark/">IBM
Spectrum Conductor</a> - Cluster management software that integrates with
Spark and modern computing frameworks.
-- <a href="https://delta.io">Delta Lake</a> - Storage layer that provides ACID
transactions and scalable metadata handling for Apache Spark workloads.
- <a href="https://mlflow.org">MLflow</a> - Open source platform to manage the
machine learning lifecycle, including deploying models from diverse machine
learning libraries on Apache Spark.
- <a href="https://datafu.apache.org/docs/spark/getting-started.html">Apache
DataFu</a> - A collection of utils and user-defined-functions for working with
large scale data in Apache Spark, as well as making Scala-Python
interoperability easier.
@@ -49,16 +60,6 @@ against Spark, and data scientists to use Javascript in
Jupyter notebooks.
- <a href="https://mahout.apache.org/">Apache Mahout</a> - Previously on
Hadoop MapReduce,
Mahout has switched to using Spark as the backend
-- <a href="https://wiki.apache.org/mrql/">Apache MRQL</a> - A query processing
and optimization
-system for large-scale, distributed data analysis, built on top of Apache
Hadoop, Hama, and Spark
-- <a href="https://github.com/sameeragarwal/blinkdb">BlinkDB</a> - a massively
parallel, approximate query engine built
-on top of Shark and Spark
-- <a href="https://github.com/adobe-research/spindle">Spindle</a> -
Spark/Parquet-based web
-analytics query engine
-- <a href="https://github.com/thunderain-project/thunderain">Thunderain</a> -
a framework
-for combining stream processing with historical data, think Lambda architecture
-- <a href="https://github.com/OryxProject/oryx">Oryx</a> - Lambda
architecture on Apache Spark,
-Apache Kafka for real-time large scale machine learning
- <a href="https://github.com/bigdatagenomics/adam">ADAM</a> - A framework and
CLI for loading,
transforming, and analyzing genomic data using Apache Spark
- <a href="https://github.com/salesforce/TransmogrifAI">TransmogrifAI</a> -
AutoML library for building modular, reusable, strongly typed machine learning
workflows on Spark with minimal hand tuning
@@ -67,7 +68,6 @@ transforming, and analyzing genomic data using Apache Spark
<h2>Performance, monitoring, and debugging tools for Spark</h2>
-- <a href="https://github.com/g1thubhub/phil_stopwatch">Performance and
debugging library</a> - A library to analyze Spark and PySpark applications for
improving performance and finding the cause of failures
- <a href="https://www.datamechanics.co/delight">Data Mechanics Delight</a> -
Delight is a free, hosted, cross-platform Spark UI alternative backed by an
open-source Spark agent. It features new metrics and visualizations to simplify
Spark monitoring and performance tuning.
<h2>Additional language bindings</h2>
@@ -78,13 +78,8 @@ transforming, and analyzing genomic data using Apache Spark
<h3>Clojure</h3>
-- <a href="https://github.com/TheClimateCorporation/clj-spark">clj-spark</a>
- <a href="https://github.com/zero-one-group/geni">Geni</a> - A Clojure
dataframe library that runs on Apache Spark with a focus on optimizing the REPL
experience.
-<h3>Groovy</h3>
-
-- <a
href="https://github.com/bunions1/groovy-spark-example">groovy-spark-example</a>
-
<h3>Julia</h3>
- <a href="https://github.com/dfdx/Spark.jl">Spark.jl</a>
@@ -92,3 +87,9 @@ transforming, and analyzing genomic data using Apache Spark
<h3>Kotlin</h3>
- <a href="https://github.com/JetBrains/kotlin-spark-api">Kotlin for Apache
Spark</a>
+
+## Adding new projects
+
+To add a project, open a pull request against the
[spark-website](https://github.com/apache/spark-website) repository. Add an
entry to [this markdown
file](https://github.com/apache/spark-website/blob/asf-site/third-party-projects.md),
then run `jekyll build` to generate the HTML too. Include both in your pull
request. See the README in this repo for more information.
+
+Note that all project and product names should follow [trademark
guidelines](/trademarks.html).
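The contribution steps described in the new "Adding new projects" section can be sketched as a shell workflow (the branch name, editor, and commit message below are illustrative; see the spark-website README for the authoritative build instructions):

```shell
# Clone the website repo and create a working branch.
git clone https://github.com/apache/spark-website.git
cd spark-website
git checkout -b add-my-project

# Add the new entry to the markdown source...
"$EDITOR" third-party-projects.md

# ...then regenerate the HTML so both files go into one pull request.
# (Depending on your setup this may be `bundle exec jekyll build`.)
jekyll build

git add third-party-projects.md site/third-party-projects.html
git commit -m "docs: add my-project to third party projects"
```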
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]