korbit-ai[bot] commented on code in PR #35318: URL: https://github.com/apache/superset/pull/35318#discussion_r2386409370
########## superset/config.py: ########## @@ -837,6 +837,9 @@ class D3TimeFormat(TypedDict, total=False): } THUMBNAIL_ERROR_CACHE_TTL = int(timedelta(days=1).total_seconds()) +# Cache warmup user +SUPERSET_CACHE_WARMUP_USER = "admin" Review Comment: ### Hardcoded cache warmup user may not exist <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The hardcoded cache warmup user 'admin' may not exist in all Superset installations, potentially causing cache warmup operations to fail. ###### Why this matters If the 'admin' user doesn't exist in a particular Superset deployment, any cache warmup operations that rely on this user will fail, breaking the new WebDriver-based cache warmup functionality. ###### Suggested change ∙ *Feature Preview* Make the cache warmup user configurable with a fallback mechanism or validation. For example: ```python # Cache warmup user - should be configured per deployment # Ensure this user exists and has appropriate permissions SUPERSET_CACHE_WARMUP_USER = os.environ.get("SUPERSET_CACHE_WARMUP_USER", "admin") ``` Or add validation logic to ensure the user exists before attempting cache warmup operations. ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/6cfa437d-2a6f-4d54-b38d-4f08a2048966/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/6cfa437d-2a6f-4d54-b38d-4f08a2048966?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/6cfa437d-2a6f-4d54-b38d-4f08a2048966?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/6cfa437d-2a6f-4d54-b38d-4f08a2048966?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/6cfa437d-2a6f-4d54-b38d-4f08a2048966) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:37495b7c-8543-4931-b3a6-4ba8703c031f --> [](37495b7c-8543-4931-b3a6-4ba8703c031f) ########## superset/utils/webdriver.py: ########## @@ -243,7 +246,30 @@ def get_screenshot( # pylint: disable=too-many-locals, too-many-statements # n class WebDriverSelenium(WebDriverProxy): - def create(self) -> WebDriver: + def __init__( + self, + driver_type: str, + window: WindowSize | None = None, + user: User | None = None, + ): + super().__init__(driver_type, window) + self._user = user + self._driver = None + + def __del__(self) -> None: + self._destroy() Review Comment: ### Unreliable resource cleanup with __del__ <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? Using `__del__` for resource cleanup of WebDriver instances is unreliable and can cause resource leaks. ###### Why this matters Python's garbage collector doesn't guarantee when or if `__del__` will be called, potentially leaving WebDriver processes running indefinitely and consuming system resources. ###### Suggested change ∙ *Feature Preview* Implement a context manager protocol (`__enter__`/`__exit__`) or provide an explicit `close()` method that users must call, and document the resource management requirements clearly. ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/161cd273-8db8-4d92-bd3e-efb4a66a9e45/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/161cd273-8db8-4d92-bd3e-efb4a66a9e45?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/161cd273-8db8-4d92-bd3e-efb4a66a9e45?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/161cd273-8db8-4d92-bd3e-efb4a66a9e45?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/161cd273-8db8-4d92-bd3e-efb4a66a9e45) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:0a98da93-e3c2-4b2c-b82b-5e54d55b8856 --> [](0a98da93-e3c2-4b2c-b82b-5e54d55b8856) ########## superset/tasks/cache.py: ########## @@ -118,8 +109,12 @@ class DummyStrategy(Strategy): # pylint: disable=too-few-public-methods name = "dummy" - def get_tasks(self) -> list[CacheWarmupTask]: - return [get_task(chart) for chart in db.session.query(Slice).all()] + def get_urls(self) -> list[str]: + dashboards = ( + db.session.query(Dashboard).filter(Dashboard.published.is_(True)).all() + ) + + return [get_dash_url(dashboard) for dashboard in dashboards if dashboard.slices] Review Comment: ### Unbounded memory usage in dashboard query <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? Loading all published dashboards into memory at once using .all() without pagination or batching. ###### Why this matters This approach can consume excessive memory and cause performance degradation or out-of-memory errors when there are thousands of dashboards. The entire result set is loaded into memory before filtering. ###### Suggested change ∙ *Feature Preview* Use database-level filtering and implement batching or streaming: ```python # Filter at database level to avoid loading unnecessary data dashboards = ( db.session.query(Dashboard) .filter(Dashboard.published.is_(True)) .filter(Dashboard.slices.any()) # Only dashboards with slices ) # Process in batches to control memory usage batch_size = 100 urls = [] for offset in range(0, dashboards.count(), batch_size): batch = dashboards.offset(offset).limit(batch_size).all() urls.extend([get_dash_url(dashboard) for dashboard in batch]) return urls ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/1de87a6a-f9f5-4830-a2a6-c9b206c68a81/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/1de87a6a-f9f5-4830-a2a6-c9b206c68a81?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/1de87a6a-f9f5-4830-a2a6-c9b206c68a81?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/1de87a6a-f9f5-4830-a2a6-c9b206c68a81?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/1de87a6a-f9f5-4830-a2a6-c9b206c68a81) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:11f43670-85f7-4233-9830-33bcb5a2b6ff --> [](11f43670-85f7-4233-9830-33bcb5a2b6ff) ########## superset/utils/webdriver.py: ########## @@ -243,7 +246,30 @@ def get_screenshot( # pylint: disable=too-many-locals, too-many-statements # n class WebDriverSelenium(WebDriverProxy): - def create(self) -> WebDriver: + def __init__( + self, + driver_type: str, + window: WindowSize | None = None, + user: User | None = None, + ): + super().__init__(driver_type, window) + self._user = user + self._driver = None + + def __del__(self) -> None: + self._destroy() + + @property + def driver(self) -> WebDriver: + if not self._driver: + self._driver = self._create() + assert self._driver # for mypy + self._driver.set_window_size(*self._window) + if self._user: + self._auth(self._user) + return self._driver Review Comment: ### No error recovery in expensive driver initialization <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The driver property performs expensive operations (driver creation, window sizing, authentication) on every first access without any caching validation or error recovery. ###### Why this matters If driver creation fails partially (e.g., after `_create()` but before `set_window_size()`), subsequent calls will return a malformed driver instance, causing cascading failures without retry mechanisms. ###### Suggested change ∙ *Feature Preview* Wrap the entire initialization block in try-catch, reset `self._driver = None` on any failure, and consider adding a flag to track successful initialization completion. ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7659419f-dc90-4e0e-96bd-1f1dd80381b5/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7659419f-dc90-4e0e-96bd-1f1dd80381b5?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7659419f-dc90-4e0e-96bd-1f1dd80381b5?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7659419f-dc90-4e0e-96bd-1f1dd80381b5?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7659419f-dc90-4e0e-96bd-1f1dd80381b5) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:f0206eab-bce1-4e73-a494-a25d2d2372b3 --> [](f0206eab-bce1-4e73-a494-a25d2d2372b3) ########## superset/utils/webdriver.py: ########## @@ -243,7 +246,30 @@ def get_screenshot( # pylint: disable=too-many-locals, too-many-statements # n class WebDriverSelenium(WebDriverProxy): - def create(self) -> WebDriver: + def __init__( + self, + driver_type: str, + window: WindowSize | None = None, + user: User | None = None, + ): + super().__init__(driver_type, window) + self._user = user + self._driver = None Review Comment: ### Missing thread safety in lazy driver initialization <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? WebDriverSelenium now stores driver instance as class state with lazy initialization, but lacks proper thread safety mechanisms for concurrent access. ###### Why this matters Multiple threads could simultaneously check `if not self._driver` and create multiple driver instances, leading to resource leaks and unpredictable behavior in concurrent environments. ###### Suggested change ∙ *Feature Preview* Add thread synchronization using `threading.Lock()` around the driver creation check and initialization in the `driver` property to ensure only one thread creates the driver instance. ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/4c46e220-ca9f-4634-910a-d14548c384cd/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/4c46e220-ca9f-4634-910a-d14548c384cd?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/4c46e220-ca9f-4634-910a-d14548c384cd?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/4c46e220-ca9f-4634-910a-d14548c384cd?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/4c46e220-ca9f-4634-910a-d14548c384cd) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:4a080eae-3c7e-4fd2-b0d0-be5fecaa30bc --> [](4a080eae-3c7e-4fd2-b0d0-be5fecaa30bc) ########## superset/tasks/cache.py: ########## @@ -307,25 +239,20 @@ def cache_warmup( logger.exception(message) return message - results: dict[str, list[str]] = {"scheduled": [], "errors": []} - for task in strategy.get_tasks(): - username = task["username"] - payload = json.dumps(task["payload"]) - if username: - try: - user = security_manager.get_user_by_username(username) - cookies = MachineAuthProvider.get_auth_cookies(user) - headers = { - "Cookie": f"session={cookies.get('session', '')}", - "Content-Type": "application/json", - } - logger.info("Scheduling %s", payload) - fetch_url.delay(payload, headers) - results["scheduled"].append(payload) - except SchedulingError: - logger.exception("Error scheduling fetch_url for payload: %s", payload) - results["errors"].append(payload) - else: - logger.warn("Executor not found for %s", payload) + results: dict[str, list[str]] = {"success": [], "errors": []} + + user = security_manager.find_user( + username=current_app.config["SUPERSET_CACHE_WARMUP_USER"] + ) + wd = WebDriverSelenium(current_app.config["WEBDRIVER_TYPE"], user=user) + + for url in strategy.get_urls(): + try: + logger.info("Fetching %s", url) + wd.get_screenshot(url, "grid-container") + results["success"].append(url) + except URLError: + logger.exception("Error warming up cache!") + results["errors"].append(url) Review Comment: ### WebDriver resource leak in cache warmup loop <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? WebDriver instance is created once and reused across all URLs without proper cleanup or session management. ###### Why this matters This can lead to memory leaks, stale browser sessions, and potential crashes when processing many URLs. Browser resources accumulate without being released, and connection timeouts or browser crashes can affect subsequent URL processing. ###### Suggested change ∙ *Feature Preview* Implement proper WebDriver lifecycle management by either creating a new instance per URL or adding explicit cleanup in a try-finally block: ```python for url in strategy.get_urls(): wd = None try: wd = WebDriverSelenium(current_app.config["WEBDRIVER_TYPE"], user=user) logger.info("Fetching %s", url) wd.get_screenshot(url, "grid-container") results["success"].append(url) except URLError: logger.exception("Error warming up cache!") results["errors"].append(url) finally: if wd: wd.quit() # or appropriate cleanup method ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/a0acf75f-54b8-4755-8de3-20f6618f1625/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/a0acf75f-54b8-4755-8de3-20f6618f1625?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/a0acf75f-54b8-4755-8de3-20f6618f1625?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/a0acf75f-54b8-4755-8de3-20f6618f1625?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/a0acf75f-54b8-4755-8de3-20f6618f1625) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:d7c5ece6-04a2-496f-bf93-00339323fb03 --> [](d7c5ece6-04a2-496f-bf93-00339323fb03) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
