bradcnovi commented on issue #2525: URL: https://github.com/apache/iceberg-python/issues/2525#issuecomment-3338801386
<img width="1348" height="497" alt="Image" src="https://github.com/user-attachments/assets/99b2325d-c97b-4c08-b568-e49e9a8397e5" />
<img width="1343" height="774" alt="Image" src="https://github.com/user-attachments/assets/d9a418f6-dead-4792-a0c7-6aec44ba33cd" />

Using the Analysis feature to ask OpenAI what improvements can be made:

---

Thanks for the trace. With only the import-time profile, I'll focus on startup/architecture and dependency choices that affect both design and performance. If you can share the repo or a high-level module layout, I can tighten this with concrete diffs.

**What the trace tells us**

- The `tests/benchmark.test_benchmark` import takes ~1.3s, dominated by imports.
- `pyarrow.parquet` costs ~0.6s.
- A separate `parquet.core` import costs another ~0.6s. This looks like a different package named `parquet`, not `pyarrow.parquet`. If unintentional, this is likely a duplicate/competing Parquet engine.
- The `fs` import costs ~0.4s (likely PyFilesystem2). If you intended fsspec, you might be pulling in both worlds.
- The `numpy` import is ~0.2s (normal).
- The importlib bootstrap overhead shows many discrete loads, indicating many top-level imports.

**High-impact improvements**

1) Make the package import "cheap"
   - Goal: package import < 100–200 ms; defer heavy deps to call sites.
   - Move heavy imports (pyarrow, parquet, fs/fsspec, numpy) inside the functions/methods that use them (see the first sketch after this list).
   - Use `typing.TYPE_CHECKING` for type-only imports to avoid runtime imports.
   - Keep `__init__.py` minimal: export symbols, avoid any I/O, logging setup, or configuration parsing at import time.
   - If you expose a CLI, keep the entry point very thin and import heavy submodules only after parsing arguments (so `--help` is instant).

2) Resolve Parquet engine duplication
   - Your trace shows both `pyarrow.parquet` and `parquet.core` importing. Unless you intentionally support multiple engines, this is a red flag.
   - Pick one default engine (pyarrow or fastparquet). If supporting multiple, implement a plugin/strategy pattern (a sketch follows this list):
     - Provide an IO facade (e.g., `io.read_parquet` / `io.write_parquet`).
     - Dynamically select the engine via parameter, config, or environment.
     - Only import the selected engine inside the function body.
     - Make other engines optional extras (`pip install brad_pydata2025[fastparquet]`).
   - Add a test that fails if both engines are imported during a simple import of your top-level package.

3) Filesystem abstraction hygiene
   - If you only need fsspec, avoid importing `fs` (PyFilesystem2). If you support both, wrap them behind a thin "FS adapter" interface and import on demand based on the URL scheme or config.
   - Avoid importing cloud backends at module import. Let fsspec lazily resolve protocol handlers (e.g., s3fs/gcsfs) when it encounters an `s3://` or `gs://` path.

4) Benchmark isolation
   - Ensure benchmarks don't include import time in the inner timings. Use pytest-benchmark or asv; pre-import modules in a setup phase so measurements focus on the target function (see the pytest-benchmark sketch below).
   - Provide warm-up iterations and explicit cold/warm cache scenarios.
   - Store benchmark data and environment info so regressions are attributable.

5) Enforce an import-time budget in CI
   - Add a test using `python -X importtime` or an import profiler to assert the package import stays under a threshold on CI hardware (sketched below).
   - Track import-time regressions over PRs.

6) Lazy submodule loading
   - Consider `lazy_loader` (or a small custom `__getattr__` in `__init__.py`, sketched below) to expose submodules without importing them upfront. This works well for large optional areas like io, datasets, and plugins.
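For item 1, a minimal sketch of deferring a heavy import to the call site while keeping type-only imports behind `TYPE_CHECKING`. The module and function names here are hypothetical, not anything that exists in the repo today:

```python
# mypkg/io/parquet_io.py -- illustrative module; names are placeholders
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Type-only import: never executed at runtime, so pyarrow is not
    # pulled in just by importing this module.
    import pyarrow as pa


def read_table(path: str) -> pa.Table:
    # Heavy import deferred to the call site: the first call pays the
    # ~0.6s pyarrow.parquet cost, a plain `import mypkg` does not.
    import pyarrow.parquet as pq

    return pq.read_table(path)
```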
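For item 2, one possible shape for the engine facade with on-demand imports. The `engine` parameter, the extra name, and the return-type handling are all assumptions; a real facade would probably normalize the return type across engines:

```python
# mypkg/io/facade.py -- illustrative facade over multiple Parquet engines
from __future__ import annotations

from typing import Any


def read_parquet(path: str, engine: str = "pyarrow") -> Any:
    """Read a Parquet file with the selected engine, imported on demand."""
    if engine == "pyarrow":
        import pyarrow.parquet as pq  # only imported when actually used

        return pq.read_table(path)  # returns a pyarrow.Table
    if engine == "fastparquet":
        # Optional extra, e.g. `pip install mypkg[fastparquet]` (hypothetical name)
        from fastparquet import ParquetFile

        return ParquetFile(path).to_pandas()  # returns a pandas DataFrame
    raise ValueError(f"unknown parquet engine: {engine!r}")
```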
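For item 4, a rough pytest-benchmark shape where the import and the fixture file are handled outside the timed region (the data path is a placeholder):

```python
# tests/benchmark/test_read.py -- sketch; dataset path is a placeholder
import pyarrow.parquet as pq  # imported at collection time, outside the timed call


def test_read_metadata_speed(benchmark):
    path = "tests/data/example.parquet"  # fixed, checked-in dataset
    # Only the target call is timed; warm-up/calibration is handled by the plugin.
    benchmark(pq.read_metadata, path)
```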
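For item 5, a CI guard could look roughly like this. The package name and the 0.3s budget are placeholders to tune for your CI hardware, and the measurement includes interpreter startup:

```python
# tests/test_import_time.py -- rough sketch of an import-time budget check
import subprocess
import sys
import time


def test_cold_import_stays_under_budget():
    # Spawn a fresh interpreter so the import is genuinely cold.
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-c", "import mypkg"],  # replace with your package name
        check=True,
    )
    elapsed = time.perf_counter() - start
    # Budget includes ~tens of ms of interpreter startup; adjust accordingly.
    assert elapsed < 0.3, f"import took {elapsed:.2f}s, budget is 0.3s"
```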
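And for item 6, the module-level `__getattr__` hook (PEP 562) that defers submodule imports; the submodule names are illustrative:

```python
# mypkg/__init__.py -- lazy submodule loading via PEP 562
import importlib

_SUBMODULES = {"io", "datasets"}  # hypothetical submodule names


def __getattr__(name):
    # `mypkg.io` is only imported the first time someone touches it.
    if name in _SUBMODULES:
        return importlib.import_module(f"{__name__}.{name}")
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")


def __dir__():
    return sorted(list(globals()) + list(_SUBMODULES))
```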
7) Configuration and logging at runtime, not import time
   - Don't parse env vars or read config files at import time. Provide an explicit `load_config()` call or initialize on first use.
   - Avoid `logging.basicConfig` at import; leave logging configuration to the application/CLI entry point.

8) Packaging and dependency footprint
   - Only declare hard requirements that are universally needed. Put IO and engine choices behind `extras_require` to reduce the baseline import weight.
   - If you need a version string, call `importlib.metadata.version("pkg")` inside a function, or inject a constant at build time; avoid `pkg_resources` (slow).
   - Prefer small utility deps over "kitchen sink" helpers to keep transitive import cost low.

**PyArrow-specific guidance**

- If you only need Parquet metadata quickly, use `pyarrow.parquet.read_metadata` or `ParquetFile(path).metadata`; this avoids constructing a dataset or scanning files (sketch further below).
- Avoid pandas round-trips. Operate on Arrow Tables/RecordBatches where possible to stay zero-copy and reduce overhead.
- For large scans, use datasets with filtering/predicate pushdown, and keep `use_threads=True` (the default) if CPU parallelism helps. For many small files, use fsspec caching or file coalescing to reduce overhead.
- If you sometimes only write Parquet, defer importing read-related modules to keep the import lighter.

**API design suggestions**

- Define a stable core API that is independent of any specific engine or filesystem. Example layers:
  - `core/`: pure logic and data models; no heavy deps.
  - `io/`: adapters for parquet, csv, etc.; each engine in a submodule with on-demand import.
  - `fs/`: adapters for filesystems (fsspec, PyFilesystem2) with a small registry and on-demand import.
  - `cli/`: argument parsing and thin orchestration; no heavy imports at module scope.
- Use protocols or abstract base classes to define the engine and filesystem contracts (a Protocol sketch follows below). This keeps the core testable without heavy deps.
- Keep data models as dataclasses or pydantic v2 (if needed). If you are on pydantic v1, consider migrating to v2 for speed and import-time benefits.

**Testing and quality gates**

- `tests/test_import_time.py`: asserts the cold import stays under a budget.
- `tests/test_optional_deps.py`: verifies no ImportError on the core package without extras installed.
- `tests/test_engine_selection.py`: ensures only the chosen engine gets imported (inspect `sys.modules`; sketch below).
- A benchmark suite with fixed datasets and asv or pytest-benchmark.

**Concrete quick wins you can do today**

- Move `import pyarrow` and `import parquet` (the non-pyarrow one) out of module top level into the functions that use them.
- Remove any logic from `__init__.py` besides symbol exports.
- Decide on a single default Parquet engine and put the other behind an optional extra.
- Replace top-level `from X import Y` across wide modules with local imports at use sites, especially for pyarrow, parquet, fs/fsspec, pandas, and numpy.
- Add a small import-time CI test and a benchmark that pre-imports the module in its setup phase.
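For the metadata-only path mentioned under the PyArrow guidance, roughly (the file path is a placeholder):

```python
# Inspect Parquet metadata without scanning data pages or building a dataset.
import pyarrow.parquet as pq

meta = pq.read_metadata("data/example.parquet")  # placeholder path
print(meta.num_rows, meta.num_row_groups)

# A per-file handle also exposes the schema and row-group stats without a full read.
pf = pq.ParquetFile("data/example.parquet")
print(pf.schema_arrow)
```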
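A possible shape for the engine contract via `typing.Protocol`; everything here is illustrative rather than an existing API:

```python
# mypkg/io/base.py -- hypothetical engine contract
from __future__ import annotations

from typing import Any, Protocol


class ParquetEngine(Protocol):
    """Contract every Parquet backend adapter would satisfy."""

    def read(self, path: str, columns: list[str] | None = None) -> Any: ...

    def write(self, table: Any, path: str) -> None: ...
```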
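And a rough version of the engine-selection gate from the quality-gates list, again with a placeholder package name:

```python
# tests/test_engine_selection.py -- sketch: importing the package must not
# drag in a Parquet engine; only an explicit read should.
import subprocess
import sys

CHECK = (
    "import sys, mypkg; "  # replace mypkg with the real package name
    "assert 'pyarrow' not in sys.modules; "
    "assert 'fastparquet' not in sys.modules"
)


def test_no_engine_imported_at_package_import():
    # Fresh interpreter so previously imported modules don't leak into sys.modules.
    subprocess.run([sys.executable, "-c", CHECK], check=True)
```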
If you share your repo structure (package layout and a few key modules), I'll propose a concrete module graph, show where to move imports, and outline a minimal plugin registry for Parquet engines and filesystems with code snippets.
