bradcnovi commented on issue #2525: URL: https://github.com/apache/iceberg-python/issues/2525#issuecomment-3338801386
<img width="1348" height="497" alt="Image" src="https://github.com/user-attachments/assets/99b2325d-c97b-4c08-b568-e49e9a8397e5" />
<img width="1343" height="774" alt="Image" src="https://github.com/user-attachments/assets/d9a418f6-dead-4792-a0c7-6aec44ba33cd" />

Using the Analysis feature to ask OpenAI what improvements can be made:

---

Thanks for the trace. With only the import-time profile, I'll focus on startup/architecture and dependency choices that affect both design and performance. If you can share the repo or a high-level module layout, I can tighten this with concrete diffs.

**What the trace tells us**

- The `tests/benchmark.test_benchmark` import takes ~1.3s, dominated by imports.
- `pyarrow.parquet` costs ~0.6s.
- A separate `parquet.core` import costs another ~0.6s. This looks like a different package named `parquet`, not `pyarrow.parquet`. If unintentional, this is likely a duplicate/competing Parquet engine.
- The `fs` import costs ~0.4s (likely PyFilesystem2). If you intended fsspec, you might be pulling in both worlds.
- The `numpy` import is ~0.2s (normal).
- The importlib bootstrap overhead shows many discrete loads, indicating many top-level imports.

**High-impact improvements**

1) Make the package import "cheap"
   - Goal: package import < 100–200 ms; defer heavy deps to call sites.
   - Move heavy imports (pyarrow, parquet, fs/fsspec, numpy) inside the functions/methods that use them (see the first sketch after this list).
   - Use `typing.TYPE_CHECKING` for type-only imports to avoid runtime imports.
   - Keep `__init__.py` minimal: export symbols, avoid any I/O, logging setup, or configuration parsing at import time.
   - If you expose a CLI, keep the entry point very thin and import heavy submodules only after parsing arguments (so `--help` is instant).

2) Resolve Parquet engine duplication
   - Your trace shows both `pyarrow.parquet` and `parquet.core` importing. Unless you intentionally support multiple engines, this is a red flag.
   - Pick one default engine (pyarrow or fastparquet). If supporting multiple, implement a plugin/strategy pattern (a sketch follows this list):
     - Provide an IO facade (e.g., `io.read_parquet` / `io.write_parquet`).
     - Dynamically select the engine via parameter, config, or environment.
     - Only import the selected engine inside the function body.
     - Make other engines optional extras (`pip install brad_pydata2025[fastparquet]`).
   - Add a test that fails if both engines are imported during a simple import of your top-level package.

3) Filesystem abstraction hygiene
   - If you only need fsspec, avoid importing `fs` (PyFilesystem2). If you support both, wrap them behind a thin "FS adapter" interface and import on demand based on the URL scheme or config.
   - Avoid importing cloud backends at module import. Let fsspec lazily resolve protocol handlers (e.g., s3fs/gcsfs) when it encounters an `s3://` or `gs://` path.

4) Benchmark isolation
   - Ensure benchmarks don't include import time in the inner timings. Use pytest-benchmark or asv; pre-import modules in a setup phase so measurements focus on the target function (see the pytest-benchmark sketch below).
   - Provide warm-up iterations and explicit cold/warm cache scenarios.
   - Store benchmark data and environment info so regressions are attributable.

5) Enforce an import-time budget in CI
   - Add a test using `python -X importtime` or an import profiler to assert the package import stays under a threshold on CI hardware (sketched below).
   - Track import-time regressions over PRs.

6) Lazy submodule loading
   - Consider `lazy_loader` (or a small custom `__getattr__` in `__init__.py`, sketched below) to expose submodules without importing them upfront. This works well for large optional areas like io, datasets, and plugins.
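For item 1, a minimal sketch of deferring a heavy import to the call site while keeping type-only imports behind `TYPE_CHECKING`. The module and function names here are hypothetical, not anything that exists in the repo today:

```python
# mypkg/io/parquet_io.py -- illustrative module; names are placeholders
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Type-only import: never executed at runtime, so pyarrow is not
    # pulled in just by importing this module.
    import pyarrow as pa


def read_table(path: str) -> pa.Table:
    # Heavy import deferred to the call site: the first call pays the
    # ~0.6s pyarrow.parquet cost, a plain `import mypkg` does not.
    import pyarrow.parquet as pq

    return pq.read_table(path)
```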
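For item 2, one possible shape for the engine facade with on-demand imports. The `engine` parameter, the extra name, and the return-type handling are all assumptions; a real facade would probably normalize the return type across engines:

```python
# mypkg/io/facade.py -- illustrative facade over multiple Parquet engines
from __future__ import annotations

from typing import Any


def read_parquet(path: str, engine: str = "pyarrow") -> Any:
    """Read a Parquet file with the selected engine, imported on demand."""
    if engine == "pyarrow":
        import pyarrow.parquet as pq  # only imported when actually used

        return pq.read_table(path)  # returns a pyarrow.Table
    if engine == "fastparquet":
        # Optional extra, e.g. `pip install mypkg[fastparquet]` (hypothetical name)
        from fastparquet import ParquetFile

        return ParquetFile(path).to_pandas()  # returns a pandas DataFrame
    raise ValueError(f"unknown parquet engine: {engine!r}")
```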
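For item 4, a rough pytest-benchmark shape where the import and the fixture file are handled outside the timed region (the data path is a placeholder):

```python
# tests/benchmark/test_read.py -- sketch; dataset path is a placeholder
import pyarrow.parquet as pq  # imported at collection time, outside the timed call


def test_read_metadata_speed(benchmark):
    path = "tests/data/example.parquet"  # fixed, checked-in dataset
    # Only the target call is timed; warm-up/calibration is handled by the plugin.
    benchmark(pq.read_metadata, path)
```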
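For item 5, a CI guard could look roughly like this. The package name and the 0.3s budget are placeholders to tune for your CI hardware, and the measurement includes interpreter startup:

```python
# tests/test_import_time.py -- rough sketch of an import-time budget check
import subprocess
import sys
import time


def test_cold_import_stays_under_budget():
    # Spawn a fresh interpreter so the import is genuinely cold.
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-c", "import mypkg"],  # replace with your package name
        check=True,
    )
    elapsed = time.perf_counter() - start
    # Budget includes ~tens of ms of interpreter startup; adjust accordingly.
    assert elapsed < 0.3, f"import took {elapsed:.2f}s, budget is 0.3s"
```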
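And for item 6, the module-level `__getattr__` hook (PEP 562) that defers submodule imports; the submodule names are illustrative:

```python
# mypkg/__init__.py -- lazy submodule loading via PEP 562
import importlib

_SUBMODULES = {"io", "datasets"}  # hypothetical submodule names


def __getattr__(name):
    # `mypkg.io` is only imported the first time someone touches it.
    if name in _SUBMODULES:
        return importlib.import_module(f"{__name__}.{name}")
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")


def __dir__():
    return sorted(list(globals()) + list(_SUBMODULES))
```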
7) Configuration and logging at runtime, not import time
   - Don't parse env vars or read config files at import time. Provide an explicit `load_config()` call or initialize on first use.
   - Avoid `logging.basicConfig` at import; leave logging configuration to the application/CLI entry point.

8) Packaging and dependency footprint
   - Only declare hard requirements that are universally needed. Put IO and engine choices behind `extras_require` to reduce the baseline import weight.
   - If you need a version string, call `importlib.metadata.version("pkg")` inside a function, or inject a constant at build time; avoid `pkg_resources` (slow).
   - Prefer small utility deps over "kitchen sink" helpers to keep transitive import cost low.

**PyArrow-specific guidance**

- If you only need Parquet metadata quickly, use `pyarrow.parquet.read_metadata` or `ParquetFile(path).metadata`; this avoids constructing a dataset or scanning files (sketch further below).
- Avoid pandas round-trips. Operate on Arrow Tables/RecordBatches where possible to stay zero-copy and reduce overhead.
- For large scans, use datasets with filtering/predicate pushdown, and keep `use_threads=True` (the default) if CPU parallelism helps. For many small files, use fsspec caching or file coalescing to reduce overhead.
- If you sometimes only write Parquet, defer importing read-related modules to keep the import lighter.

**API design suggestions**

- Define a stable core API that is independent of any specific engine or filesystem. Example layers:
  - `core/`: pure logic and data models; no heavy deps.
  - `io/`: adapters for parquet, csv, etc.; each engine in a submodule with on-demand import.
  - `fs/`: adapters for filesystems (fsspec, PyFilesystem2) with a small registry and on-demand import.
  - `cli/`: argument parsing and thin orchestration; no heavy imports at module scope.
- Use protocols or abstract base classes to define the engine and filesystem contracts (a Protocol sketch follows below). This keeps the core testable without heavy deps.
- Keep data models as dataclasses or pydantic v2 (if needed). If you are on pydantic v1, consider migrating to v2 for speed and import-time benefits.

**Testing and quality gates**

- `tests/test_import_time.py`: asserts the cold import stays under a budget.
- `tests/test_optional_deps.py`: verifies no ImportError on the core package without extras installed.
- `tests/test_engine_selection.py`: ensures only the chosen engine gets imported (inspect `sys.modules`; sketch below).
- A benchmark suite with fixed datasets and asv or pytest-benchmark.

**Concrete quick wins you can do today**

- Move `import pyarrow` and `import parquet` (the non-pyarrow one) out of module top level into the functions that use them.
- Remove any logic from `__init__.py` besides symbol exports.
- Decide on a single default Parquet engine and put the other behind an optional extra.
- Replace top-level `from X import Y` across wide modules with local imports at use sites, especially for pyarrow, parquet, fs/fsspec, pandas, and numpy.
- Add a small import-time CI test and a benchmark that pre-imports the module in its setup phase.
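For the metadata-only path mentioned under the PyArrow guidance, roughly (the file path is a placeholder):

```python
# Inspect Parquet metadata without scanning data pages or building a dataset.
import pyarrow.parquet as pq

meta = pq.read_metadata("data/example.parquet")  # placeholder path
print(meta.num_rows, meta.num_row_groups)

# A per-file handle also exposes the schema and row-group stats without a full read.
pf = pq.ParquetFile("data/example.parquet")
print(pf.schema_arrow)
```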
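A possible shape for the engine contract via `typing.Protocol`; everything here is illustrative rather than an existing API:

```python
# mypkg/io/base.py -- hypothetical engine contract
from __future__ import annotations

from typing import Any, Protocol


class ParquetEngine(Protocol):
    """Contract every Parquet backend adapter would satisfy."""

    def read(self, path: str, columns: list[str] | None = None) -> Any: ...

    def write(self, table: Any, path: str) -> None: ...
```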
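And a rough version of the engine-selection gate from the quality-gates list, again with a placeholder package name:

```python
# tests/test_engine_selection.py -- sketch: importing the package must not
# drag in a Parquet engine; only an explicit read should.
import subprocess
import sys

CHECK = (
    "import sys, mypkg; "  # replace mypkg with the real package name
    "assert 'pyarrow' not in sys.modules; "
    "assert 'fastparquet' not in sys.modules"
)


def test_no_engine_imported_at_package_import():
    # Fresh interpreter so previously imported modules don't leak into sys.modules.
    subprocess.run([sys.executable, "-c", CHECK], check=True)
```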
If you share your repo structure (package layout and a few key modules), I'll propose a concrete module graph, show where to move imports, and outline a minimal plugin registry for Parquet engines and filesystems with code snippets.
