Hi everyone, I would like to start a discussion on whether there would be interest in adding a SAS provider to Airflow, based on some challenges and solutions we have encountered in our environment. Let me explain.
Until recently, we relied on the JdbcHook combined with GenericTransfer to extract data from SAS to our Postgres DWH. However, this approach introduces significant overhead due to the JVM dependency (via jaydebeapi). In practice, it leads to duplicated memory usage and instability during larger transfers, since our workers run in memory-limited K8S pods. Lately, we have experienced multiple failures when importing very large tables for the first time, including JVM crashes caused by OOM issues as well as intermittent backend errors from SAS during long runs of multiple hours, especially initial runs. We also evaluated saspy as an alternative, but since it relies on the same underlying JDBC/JVM mechanism (e.g. jaydebeapi), it does not resolve these issues.

To address this, we implemented a native SASHook that uses SSH underneath to connect to the SAS environment. This allows us to execute SQL queries directly on the SAS server and retrieve the results without involving a JVM. On top of that, we built a DB-API 2.0-compatible cursor implementation that handles the interaction over SSH and integrates with pandas (leveraging read_sas to return results as DataFrames), so it plugs transparently into the DBApiHook.

This approach has proven much more stable in our DAGs; since adopting it, we have not encountered any failures. While performance is not on par with native ODBC or database drivers, it is still an improvement over the JDBC bridge: for reference, we can fetch around 10k records in ~2 seconds. The initial query is slightly slower due to the SSH handshake and session setup, but overall the trade-off is acceptable given the stability gains. Performance can probably be improved further, and we will look into it, but at least the instability is resolved and our pipelines are reliable again.

I would like to get feedback on how to contribute this, if there is any interest of course. Should I split this into two parts?
A standalone open-source project providing a generic SAS DB-API 2.0 implementation, plus an Airflow provider that depends on this library? Or would it be preferable to include the full implementation directly within the SAS Airflow provider (e.g., as a hook and cursor)? I'm happy to contribute the implementation and help maintain it, but I'd like to align with the community on the preferred approach before moving forward. An advantage of this solution is that it doesn't require any new dependencies: it relies purely on dependencies (paramiko and pandas) already present in the Airflow ecosystem. Any thoughts or suggestions would be greatly appreciated. Best regards, David
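P.S. For anyone who wants to see what this could look like, here is a minimal sketch of the cursor side. All names here (build_sas_program, SASSSHCursor, run_remote, work.result) are illustrative, not the actual implementation: the idea is a small PEP 249 subset where the SSH transport is injected as a callable, a PROC SQL wrapper materialises the query result on the server, and pandas.read_sas parses the exported dataset back into a DataFrame.

```python
import io

import pandas as pd


def build_sas_program(sql: str, out_table: str = "work.result") -> str:
    """Wrap a query in a PROC SQL step that materialises the result set
    into a dataset the server can export for transfer back to the client."""
    return (
        "proc sql;\n"
        f"  create table {out_table} as\n"
        f"  {sql.rstrip().rstrip(';')};\n"
        "quit;\n"
    )


class SASSSHCursor:
    """Tiny subset of PEP 249 (execute/fetchall/close/description),
    backed by an injected callable that ships the SAS program to the
    server (e.g. via paramiko's exec_command) and returns the exported
    dataset as raw bytes in SAS transport (XPORT) format."""

    arraysize = 1  # required by PEP 249

    def __init__(self, run_remote):
        # run_remote: Callable[[str], bytes]; injected so the SSH
        # plumbing stays outside the cursor and the class is testable.
        self._run_remote = run_remote
        self._frame = None
        self.description = None
        self.rowcount = -1

    def execute(self, operation, parameters=None):
        program = build_sas_program(operation)
        payload = self._run_remote(program)
        # pandas.read_sas parses the transported file into a DataFrame
        self._frame = pd.read_sas(io.BytesIO(payload), format="xport")
        self.rowcount = len(self._frame)
        # PEP 249 description entries are 7-tuples; only name is mandatory
        self.description = [
            (col, None, None, None, None, None, None)
            for col in self._frame.columns
        ]
        return self

    def fetchall(self):
        return list(self._frame.itertuples(index=False, name=None))

    def close(self):
        self._frame = None
```

Injecting the transport as a plain callable is just one way to keep the DB-API surface independent of paramiko; the real hook would manage the SSH session lifecycle itself.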
