korbit-ai[bot] commented on code in PR #34763:
URL: https://github.com/apache/superset/pull/34763#discussion_r2298968063
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -188,13 +189,32 @@
index_col = kwargs.get("index_col")
if isinstance(index_col, str):
result.index.name = index_col
- return result
- return pd.DataFrame()
+ df = result
+ else:
+ df = pd.read_csv(
+ filepath_or_buffer=file.stream,
+ **kwargs,
+ )
- return pd.read_csv(
- filepath_or_buffer=file.stream,
- **kwargs,
- )
+ if types:
+ for column, dtype in types.items():
+ try:
+ df[column] = df[column].astype(dtype)
Review Comment:
### Inefficient Column-by-Column Type Conversion <sub></sub>
<details>
<summary>Tell me more</summary>
###### What is the issue?
Type conversion is performed column by column, which is less efficient than
converting all columns at once.
###### Why this matters
Sequential column conversion creates multiple intermediate DataFrames and
increases memory usage.
###### Suggested change ∙ *Feature Preview*
Convert all columns at once using pandas' built-in functionality:
```python
if types:
try:
df = df.astype(types)
except ValueError as ex:
# Handle error and identify problematic column
```
###### Provide feedback to improve future suggestions
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846/upvote)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_not_true=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_out_of_scope=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846?what_not_in_standard=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b00c46fa-9113-42aa-8959-0ae3bc92c846)
</details>
<sub>
💬 Looking for more details? Reply to this comment to chat with Korbit.
</sub>
<!--- korbi internal id:387d2699-b8f6-4b2d-8b08-906faef26e7e -->
[](387d2699-b8f6-4b2d-8b08-906faef26e7e)
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -188,13 +189,32 @@
index_col = kwargs.get("index_col")
if isinstance(index_col, str):
result.index.name = index_col
- return result
- return pd.DataFrame()
+ df = result
+ else:
+ df = pd.read_csv(
+ filepath_or_buffer=file.stream,
+ **kwargs,
+ )
- return pd.read_csv(
- filepath_or_buffer=file.stream,
- **kwargs,
- )
+ if types:
+ for column, dtype in types.items():
+ try:
+ df[column] = df[column].astype(dtype)
+ except ValueError as ex:
+ error_msg = f"Non {dtype} value found in column
'{column}'."
+ ex_msg = str(ex)
+ invalid_value =
ex_msg.split(":")[-1].strip().strip("'")
+ error_msg += (
+ f" Value: '{invalid_value}'" if invalid_value else
""
+ )
+ mask = df[column] == invalid_value
+ if mask.any() and invalid_value:
+ line_number = mask.idxmax() + kwargs.get("header",
0) + 1
Review Comment:
### Inefficient Invalid Value Detection <sub></sub>
<details>
<summary>Tell me more</summary>
###### What is the issue?
The code performs a full column scan to find the first occurrence of an
invalid value.
###### Why this matters
This creates an unnecessary intermediate boolean mask array and scans the
entire column even after finding the first invalid value.
###### Suggested change ∙ *Feature Preview*
Use more efficient first occurrence detection:
```python
try:
invalid_idx = df.index[df[column] == invalid_value][0]
line_number = invalid_idx + kwargs.get("header", 0) + 1
except IndexError:
line_number = None
```
###### Provide feedback to improve future suggestions
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59/upvote)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_not_true=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_out_of_scope=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59?what_not_in_standard=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/7f7948ed-cfe1-4937-9d5a-d68e4d7b3f59)
</details>
<sub>
💬 Looking for more details? Reply to this comment to chat with Korbit.
</sub>
<!--- korbi internal id:1bf1dc73-1eab-492d-9d87-9df32c635bca -->
[](1bf1dc73-1eab-492d-9d87-9df32c635bca)
##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -188,13 +189,32 @@ def _read_csv( # noqa: C901
index_col = kwargs.get("index_col")
if isinstance(index_col, str):
result.index.name = index_col
- return result
- return pd.DataFrame()
+ df = result
+ else:
+ df = pd.read_csv(
+ filepath_or_buffer=file.stream,
+ **kwargs,
+ )
- return pd.read_csv(
- filepath_or_buffer=file.stream,
- **kwargs,
- )
+ if types:
+ for column, dtype in types.items():
+ try:
+ df[column] = df[column].astype(dtype)
+ except ValueError as ex:
+ error_msg = f"Non {dtype} value found in column
'{column}'."
+ ex_msg = str(ex)
+ invalid_value =
ex_msg.split(":")[-1].strip().strip("'")
+ error_msg += (
+ f" Value: '{invalid_value}'" if invalid_value else
""
+ )
+ mask = df[column] == invalid_value
Review Comment:
### Unreliable Invalid Value Detection <sub></sub>
<details>
<summary>Tell me more</summary>
###### What is the issue?
The code assumes the invalid value extracted from the exception message
matches exactly with values in the DataFrame, which may not always be true due
to type differences or string representation variations.
###### Why this matters
This can lead to incorrect line number reporting or missing error detection
when the invalid value's string representation differs from how it appears in
the DataFrame.
###### Suggested change ∙ *Feature Preview*
Use pandas' type conversion error handling to identify problematic values
more reliably:
```python
try:
pd.to_numeric(df[column]) if dtype in ('int64', 'float64') else
df[column].astype(dtype)
except ValueError as ex:
error_msg = f"Non {dtype} value found in column '{column}'."
mask = pd.to_numeric(df[column], errors='coerce').isna() if dtype in
('int64', 'float64') else df[column].astype(dtype, errors='ignore') ==
df[column]
if mask.any():
line_number = mask.idxmax() + kwargs.get('header', 0) + 1
invalid_value = df.loc[mask.idxmax(), column]
error_msg += f" Value: '{invalid_value}', line: {line_number}."
raise DatabaseUploadFailed(message=error_msg) from ex
```
###### Provide feedback to improve future suggestions
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0/upvote)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_not_true=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_out_of_scope=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0?what_not_in_standard=true)
[](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/5c997253-98e0-429b-8da4-72e2bad3ded0)
</details>
<sub>
💬 Looking for more details? Reply to this comment to chat with Korbit.
</sub>
<!--- korbi internal id:e455147b-bf2d-4c49-a656-37dbf69d2f17 -->
[](e455147b-bf2d-4c49-a656-37dbf69d2f17)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]