Re: [PR] fix: Upload CSV as Dataset [superset]

via GitHub Mon, 25 Aug 2025 12:43:29 -0700


korbit-ai[bot] commented on code in PR #34763:
URL: https://github.com/apache/superset/pull/34763#discussion_r2298968063



##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -188,13 +189,32 @@
                         index_col = kwargs.get("index_col")
                         if isinstance(index_col, str):
                             result.index.name = index_col
-                    return result
-                return pd.DataFrame()
+                    df = result
+            else:
+                df = pd.read_csv(
+                    filepath_or_buffer=file.stream,
+                    **kwargs,
+                )
 
-            return pd.read_csv(
-                filepath_or_buffer=file.stream,
-                **kwargs,
-            )
+            if types:
+                for column, dtype in types.items():
+                    try:
+                        df[column] = df[column].astype(dtype)

Review Comment:
   ### Inefficient Column-by-Column Type Conversion <sub>![category 
Performance](https://img.shields.io/badge/Performance-4f46e5)</sub>
   
   <details>
     <summary>Tell me more</summary>
   
   ###### What is the issue?
   Type conversion is performed column by column, which is less efficient than 
converting all columns at once.
   
   
   ###### Why this matters
   Sequential column conversion creates multiple intermediate DataFrames and 
increases memory usage.
   
   ###### Suggested change ∙ *Feature Preview*
   Convert all columns at once using pandas' built-in functionality:
   ```python
   if types:
       try:
           df = df.astype(types)
       except ValueError as ex:
           # Handle error and identify problematic column
   ```
   
   
   ###### Provide feedback to improve future suggestions
   [![Nice 
Catch](https://img.shields.io/badge/👍%20Nice%20Catch-71BC78)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846/upvote)
 
[![Incorrect](https://img.shields.io/badge/👎%20Incorrect-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_not_true=true)
  [![Not in 
Scope](https://img.shields.io/badge/👎%20Out%20of%20PR%20scope-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_out_of_scope=true)
 [![Not in coding 
standard](https://img.shields.io/badge/👎%20Not%20in%20our%20standards-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_not_in_standard=true)
 
[![Other](https://img.shields.io/badge/👎%20Other-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846)
   </details>
   
   <sub>
   
   💬 Looking for more details? Reply to this comment to chat with Korbit.
   </sub>
   
   <!--- korbi internal id:387d2699-b8f6-4b2d-8b08-906faef26e7e -->
   
   
   [](387d2699-b8f6-4b2d-8b08-906faef26e7e)



##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -188,13 +189,32 @@
                         index_col = kwargs.get("index_col")
                         if isinstance(index_col, str):
                             result.index.name = index_col
-                    return result
-                return pd.DataFrame()
+                    df = result
+            else:
+                df = pd.read_csv(
+                    filepath_or_buffer=file.stream,
+                    **kwargs,
+                )
 
-            return pd.read_csv(
-                filepath_or_buffer=file.stream,
-                **kwargs,
-            )
+            if types:
+                for column, dtype in types.items():
+                    try:
+                        df[column] = df[column].astype(dtype)
+                    except ValueError as ex:
+                        error_msg = f"Non {dtype} value found in column 
'{column}'."
+                        ex_msg = str(ex)
+                        invalid_value = 
ex_msg.split(":")[-1].strip().strip("'")
+                        error_msg += (
+                            f" Value: '{invalid_value}'" if invalid_value else 
""
+                        )
+                        mask = df[column] == invalid_value
+                        if mask.any() and invalid_value:
+                            line_number = mask.idxmax() + kwargs.get("header", 
0) + 1

Review Comment:
   ### Inefficient Invalid Value Detection <sub>![category 
Performance](https://img.shields.io/badge/Performance-4f46e5)</sub>
   
   <details>
     <summary>Tell me more</summary>
   
   ###### What is the issue?
   The code performs a full column scan to find the first occurrence of an 
invalid value.
   
   
   ###### Why this matters
   This creates an unnecessary intermediate boolean mask array and scans the 
entire column even after finding the first invalid value.
   
   ###### Suggested change ∙ *Feature Preview*
   Use more efficient first occurrence detection:
   ```python
   try:
       invalid_idx = df.index[df[column] == invalid_value][0]
       line_number = invalid_idx + kwargs.get("header", 0) + 1
   except IndexError:
       line_number = None
   ```
   
   
   ###### Provide feedback to improve future suggestions
   [![Nice 
Catch](https://img.shields.io/badge/👍%20Nice%20Catch-71BC78)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59/upvote)
 
[![Incorrect](https://img.shields.io/badge/👎%20Incorrect-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_not_true=true)
  [![Not in 
Scope](https://img.shields.io/badge/👎%20Out%20of%20PR%20scope-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_out_of_scope=true)
 [![Not in coding 
standard](https://img.shields.io/badge/👎%20Not%20in%20our%20standards-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_not_in_standard=true)
 
[![Other](https://img.shields.io/badge/👎%20Other-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59)
   </details>
   
   <sub>
   
   💬 Looking for more details? Reply to this comment to chat with Korbit.
   </sub>
   
   <!--- korbi internal id:1bf1dc73-1eab-492d-9d87-9df32c635bca -->
   
   
   [](1bf1dc73-1eab-492d-9d87-9df32c635bca)



##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -188,13 +189,32 @@ def _read_csv(  # noqa: C901
                         index_col = kwargs.get("index_col")
                         if isinstance(index_col, str):
                             result.index.name = index_col
-                    return result
-                return pd.DataFrame()
+                    df = result
+            else:
+                df = pd.read_csv(
+                    filepath_or_buffer=file.stream,
+                    **kwargs,
+                )
 
-            return pd.read_csv(
-                filepath_or_buffer=file.stream,
-                **kwargs,
-            )
+            if types:
+                for column, dtype in types.items():
+                    try:
+                        df[column] = df[column].astype(dtype)
+                    except ValueError as ex:
+                        error_msg = f"Non {dtype} value found in column 
'{column}'."
+                        ex_msg = str(ex)
+                        invalid_value = 
ex_msg.split(":")[-1].strip().strip("'")
+                        error_msg += (
+                            f" Value: '{invalid_value}'" if invalid_value else 
""
+                        )
+                        mask = df[column] == invalid_value

Review Comment:
   ### Unreliable Invalid Value Detection <sub>![category 
Functionality](https://img.shields.io/badge/Functionality-0284c7)</sub>
   
   <details>
     <summary>Tell me more</summary>
   
   ###### What is the issue?
   The code assumes the invalid value extracted from the exception message 
matches exactly with values in the DataFrame, which may not always be true due 
to type differences or string representation variations.
   
   
   ###### Why this matters
   This can lead to incorrect line number reporting or missing error detection 
when the invalid value's string representation differs from how it appears in 
the DataFrame.
   
   ###### Suggested change ∙ *Feature Preview*
   Use pandas' type conversion error handling to identify problematic values 
more reliably:
   ```python
   try:
       pd.to_numeric(df[column]) if dtype in ('int64', 'float64') else 
df[column].astype(dtype)
   except ValueError as ex:
       error_msg = f"Non {dtype} value found in column '{column}'."
       mask = pd.to_numeric(df[column], errors='coerce').isna() if dtype in 
('int64', 'float64') else df[column].astype(dtype, errors='ignore') == 
df[column]
       if mask.any():
           line_number = mask.idxmax() + kwargs.get('header', 0) + 1
           invalid_value = df.loc[mask.idxmax(), column]
           error_msg += f" Value: '{invalid_value}', line: {line_number}."
       raise DatabaseUploadFailed(message=error_msg) from ex
   ```
   
   
   ###### Provide feedback to improve future suggestions
   [![Nice 
Catch](https://img.shields.io/badge/👍%20Nice%20Catch-71BC78)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0/upvote)
 
[![Incorrect](https://img.shields.io/badge/👎%20Incorrect-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_not_true=true)
  [![Not in 
Scope](https://img.shields.io/badge/👎%20Out%20of%20PR%20scope-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_out_of_scope=true)
 [![Not in coding 
standard](https://img.shields.io/badge/👎%20Not%20in%20our%20standards-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_not_in_standard=true)
 
[![Other](https://img.shields.io/badge/👎%20Other-white)](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0)
   </details>
   
   <sub>
   
   💬 Looking for more details? Reply to this comment to chat with Korbit.
   </sub>
   
   <!--- korbi internal id:e455147b-bf2d-4c49-a656-37dbf69d2f17 -->
   
   
   [](e455147b-bf2d-4c49-a656-37dbf69d2f17)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: Upload CSV as Dataset [superset]

Reply via email to