Sougata Pandit created SPARK-56286:
--------------------------------------

             Summary: Add DataFrame.dataQuality API for column profiling in PySpark
                 Key: SPARK-56286
                 URL: https://issues.apache.org/jira/browse/SPARK-56286
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Sougata Pandit


This improvement proposes a new PySpark DataFrame API, `dataQuality()`, for 
exploratory dataset profiling.

The API would return a DataFrame with one row per input column and one 
synthetic dataset-level row. It would include metrics such as row count, column 
count, total cells, non-null count, null count, null ratio, distinct count, 
min, max, and mode. For numeric columns, it would also include mean, standard 
deviation, and median.
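
To make the proposed output shape concrete, here is a minimal plain-Python sketch of the metrics listed above. It is only an illustration, not the proposed implementation: the function names, the field names, the `__dataset__` label for the synthetic row, and the choices of sample standard deviation and first-seen tie-breaking for the mode are all assumptions, since the issue does not specify them.

```python
from collections import Counter
from statistics import mean, median, stdev
from typing import Any, Dict, List, Optional


def profile_column(name: str, values: List[Optional[Any]]) -> Dict[str, Any]:
    """Build one per-column row; None stands in for SQL NULL (assumption)."""
    non_null = [v for v in values if v is not None]
    total = len(values)
    null_count = total - len(non_null)
    row: Dict[str, Any] = {
        "column": name,
        "non_null_count": len(non_null),
        "null_count": null_count,
        "null_ratio": null_count / total if total else None,
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        # Ties are broken by first occurrence; the issue does not define this.
        "mode": Counter(non_null).most_common(1)[0][0] if non_null else None,
    }
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if is_numeric:
        row["mean"] = mean(non_null)
        # Sample standard deviation needs at least two values.
        row["stddev"] = stdev(non_null) if len(non_null) > 1 else None
        row["median"] = median(non_null)
    return row


def profile_dataset(columns: Dict[str, List[Optional[Any]]]) -> List[Dict[str, Any]]:
    """Return the synthetic dataset-level row followed by one row per column."""
    row_count = len(next(iter(columns.values()), []))
    summary = {
        "column": "__dataset__",  # hypothetical label for the synthetic row
        "row_count": row_count,
        "column_count": len(columns),
        "total_cells": row_count * len(columns),
    }
    return [summary] + [profile_column(n, v) for n, v in columns.items()]


report = profile_dataset({
    "age": [30, None, 40, 40],
    "city": ["NYC", "LA", None, "NYC"],
})
```

Here `report[0]` is the dataset-level row, and the remaining entries describe `age` (with the numeric-only metrics) and `city` (without them).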

This is intended to simplify data quality inspection workflows that currently 
require users to compose multiple custom aggregations manually.
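
For comparison, the manual approach might look roughly like the sketch below, which uses only existing `pyspark.sql.functions` aggregates. The helper names (`metrics_for`, `manual_profile`), the `column__metric` alias scheme, and the nested-dict return shape are hypothetical and not part of the proposal. The sketch assumes column names without dots and columns of orderable types, uses an approximate median via `percentile_approx`, and leaves out the mode for brevity.

```python
from typing import Any, Dict, List

# Metric names follow the issue text; splitting them into two lists is an assumption.
BASE_METRICS = ["non_null_count", "null_count", "distinct_count", "min", "max"]
NUMERIC_METRICS = ["mean", "stddev", "median"]


def metrics_for(is_numeric: bool) -> List[str]:
    """Return the metric names computed for one column."""
    return BASE_METRICS + (NUMERIC_METRICS if is_numeric else [])


def manual_profile(df) -> Dict[str, Dict[str, Any]]:
    """Profile every column of a Spark DataFrame with a single aggregation."""
    # Imported lazily so this module can be loaded without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import NumericType

    builders = {
        "non_null_count": lambda c: F.count(c),  # count() skips NULLs
        "null_count": lambda c: F.count(F.when(c.isNull(), 1)),
        "distinct_count": lambda c: F.countDistinct(c),  # NULLs excluded
        "min": lambda c: F.min(c),
        "max": lambda c: F.max(c),
        "mean": lambda c: F.mean(c),
        "stddev": lambda c: F.stddev(c),  # sample standard deviation
        "median": lambda c: F.percentile_approx(c, 0.5),  # approximate
    }
    plan = {
        f.name: metrics_for(isinstance(f.dataType, NumericType))
        for f in df.schema.fields
    }

    exprs = [F.count(F.lit(1)).alias("row_count")]
    for name, metrics in plan.items():
        for metric in metrics:
            exprs.append(builders[metric](F.col(name)).alias(f"{name}__{metric}"))

    row = df.agg(*exprs).first()
    total = row["row_count"]

    result: Dict[str, Dict[str, Any]] = {}
    for name, metrics in plan.items():
        stats = {metric: row[f"{name}__{metric}"] for metric in metrics}
        stats["null_ratio"] = stats["null_count"] / total if total else None
        result[name] = stats
    return result
```

With an active `SparkSession` named `spark`, calling `manual_profile(spark.createDataFrame([(30, "NYC"), (None, "LA")], ["age", "city"]))` would return one dict of metrics per column. Even this reduced version needs eight hand-written expressions per numeric column, which is the kind of boilerplate the proposed API is meant to absorb.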



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
