Re: [PR] fix: Upload CSV as Dataset [superset]

via GitHub Wed, 17 Sep 2025 12:40:49 -0700


bito-code-review[bot] commented on code in PR #34763:
URL: https://github.com/apache/superset/pull/34763#discussion_r2330585733



##########
superset/commands/database/uploaders/csv_reader.py:
##########
@@ -33,6 +33,10 @@
 
 logger = logging.getLogger(__name__)
 
+# Fixed error limit to avoid huge payloads and poor UX given that a file
+# might contain thousands of errors.
+MAX_DISPLAYED_ERRORS = 5

Review Comment:
   
   <div>
   
   
   <div id="suggestion">
   <div id="issue"><b>Configuration compatibility break</b></div>
   <div id="fix">
   
   The change from configurable `CSV_UPLOAD_MAX_ERRORS_DISPLAYED` to hardcoded 
`MAX_DISPLAYED_ERRORS = 5` breaks backward compatibility for deployments that 
have customized this limit. This affects the `CSVReader._get_error_details()` 
and `CSVReader._create_error_message()` methods which use this value to limit 
error display. Restore the configurable behavior while keeping the improved 
constant name.
   </div>
   <details>
   <summary>
   <b>Code suggestion</b>
   </summary>
   <blockquote>Check the AI-generated fix before applying</blockquote>
   <div id="code">
   
   
   ```suggestion
   # Fixed error limit to avoid huge payloads and poor UX given that a file
   # might contain thousands of errors.
   MAX_DISPLAYED_ERRORS = 
current_app.config.get("CSV_UPLOAD_MAX_ERRORS_DISPLAYED", 5)
   ```
   
   </div>
   </details>
   </div>
   
   
   
   <small><i>Code Review Run <a 
href=https://github.com/apache/superset/pull/34763#issuecomment-3266796725>#efe81c</a></i></small>
   </div>
   
   ---
   Should Bito avoid suggestions like this for future reviews? (<a 
href=https://alpha.bito.ai/home/ai-agents/review-rules>Manage Rules</a>)
   - [ ] Yes, avoid them



##########
tests/unit_tests/commands/databases/csv_reader_test.py:
##########
@@ -421,6 +421,561 @@ def test_csv_reader_file_metadata_invalid_file():
     )
 
 
+def test_csv_reader_integer_in_float_column():
+    csv_data = [
+        ["Name", "Score", "City"],
+        ["name1", 25.5, "city1"],
+        ["name2", 25, "city2"],
+    ]
+
+    csv_reader = CSVReader(
+        options=CSVReaderOptions(column_data_types={"Score": "float"})
+    )
+
+    df = csv_reader.file_to_dataframe(create_csv_file(csv_data))
+
+    assert df.shape == (2, 3)
+    assert df["Score"].dtype == "float64"
+
+
+def test_csv_reader_object_type_auto_inferring():
+    # this case below won't raise a error
+    csv_data = [
+        ["Name", "id", "City"],
+        ["name1", 25.5, "city1"],
+        ["name2", 15, "city2"],
+        ["name3", 123456789086, "city3"],
+        ["name4", "abc", "city4"],
+        ["name5", 4.75, "city5"],
+    ]
+
+    csv_reader = CSVReader()
+
+    df = csv_reader.file_to_dataframe(create_csv_file(csv_data))
+
+    assert df.shape == (5, 3)
+    # pandas automatically infers the type if column_data_types is not informed
+    # if there's only one string in the column it converts the whole column to 
object
+    assert df["id"].dtype == "object"
+
+
+def test_csv_reader_float_type_auto_inferring():
+    csv_data = [
+        ["Name", "id", "City"],
+        ["name1", "25", "city1"],
+        ["name2", "15", "city2"],
+        ["name3", "123456789086", "city3"],
+        ["name5", "4.75", "city5"],
+    ]
+
+    csv_reader = CSVReader()
+
+    df = csv_reader.file_to_dataframe(create_csv_file(csv_data))
+
+    assert df.shape == (4, 3)
+    # The type here is automatically inferred to float due to 4.75 value
+    assert df["id"].dtype == "float64"
+
+
+def test_csv_reader_int_type_auto_inferring():
+    csv_data = [
+        ["Name", "id", "City"],
+        ["name1", "0", "city1"],
+        ["name2", "15", "city2"],
+        ["name3", "123456789086", "city3"],
+        ["name5", "45", "city5"],
+    ]
+
+    csv_reader = CSVReader()
+
+    df = csv_reader.file_to_dataframe(create_csv_file(csv_data))
+
+    assert df.shape == (4, 3)
+    assert df["id"].dtype == "int64"
+
+
+def test_csv_reader_bigint_type_auto_inferring():
+    csv_data = [
+        ["Name", "id", "City"],
+        ["name1", "9223372036854775807", "city1"],
+        ["name2", "9223372036854775806", "city2"],
+        ["name3", "1234567890123456789", "city3"],
+        ["name4", "0", "city4"],
+        ["name5", "-9223372036854775808", "city5"],
+    ]
+
+    csv_reader = CSVReader()
+
+    df = csv_reader.file_to_dataframe(create_csv_file(csv_data))
+
+    assert df.shape == (5, 3)
+    assert df["id"].dtype == "int64"
+    assert df.iloc[0]["id"] == 9223372036854775807
+    assert df.iloc[4]["id"] == -9223372036854775808
+
+
+def test_csv_reader_int_typing():
+    csv_data = [
+        ["Name", "id", "City"],
+        ["name1", "0", "city1"],
+        ["name2", "15", "city2"],
+        ["name3", "123456789086", "city3"],
+        ["name5", "45", "city5"],
+    ]
+
+    csv_reader = CSVReader(options=CSVReaderOptions(column_data_types={"id": 
"int"}))
+
+    df = csv_reader.file_to_dataframe(create_csv_file(csv_data))
+
+    assert df.shape == (4, 3)
+    assert df["id"].dtype == "int64"
+
+
+def test_csv_reader_float_typing():
+    csv_data = [
+        ["Name", "score", "City"],
+        ["name1", "0", "city1"],
+        ["name2", "15.3", "city2"],
+        ["name3", "45", "city3"],
+        ["name5", "23.1342", "city5"],
+    ]
+
+    csv_reader = CSVReader(
+        options=CSVReaderOptions(column_data_types={"score": "float"})
+    )
+
+    df = csv_reader.file_to_dataframe(create_csv_file(csv_data))
+
+    assert df.shape == (4, 3)
+    assert df["score"].dtype == "float64"
+
+
+def test_csv_reader_multiple_errors_display():
+    """Test that multiple errors are displayed with proper formatting."""
+    csv_data = [
+        ["Name", "Age", "Score"],
+        ["Alice", "25", "95.5"],
+        ["Bob", "invalid1", "87.2"],
+        ["Charlie", "invalid2", "92.1"],
+        ["Diana", "invalid3", "88.5"],
+        ["Eve", "invalid4", "90.0"],
+        ["Frank", "30", "85.5"],
+    ]
+
+    csv_reader = CSVReader(options=CSVReaderOptions(column_data_types={"Age": 
"int64"}))
+
+    with pytest.raises(DatabaseUploadFailed) as ex:
+        csv_reader.file_to_dataframe(create_csv_file(csv_data))
+
+    error_msg = str(ex.value)
+    assert "Cannot convert column 'Age' to int64" in error_msg
+    assert "Found 4 error(s):" in error_msg
+    assert "Line 3: 'invalid1' cannot be converted to int64" in error_msg
+    assert "Line 4: 'invalid2' cannot be converted to int64" in error_msg
+    assert "Line 5: 'invalid3' cannot be converted to int64" in error_msg
+    assert "Line 6: 'invalid4' cannot be converted to int64" in error_msg
+    # With MAX_DISPLAYED_ERRORS = 5, all 4 errors should be shown without 
truncation
+    assert "and" not in error_msg or "more error(s)" not in error_msg

Review Comment:
   
   <div>
   
   
   <div id="suggestion">
   <div id="issue"><b>Assertion logic error</b></div>
   <div id="fix">
   
   The assertion logic uses an incorrect OR condition that creates a false 
positive test. The assertion `"and" not in error_msg or "more error(s)" not in 
error_msg` will pass even when both substrings are present because it's an OR 
condition. This means the test won't catch regression issues when truncation 
messages incorrectly appear. Change to AND condition to properly validate that 
neither truncation indicator appears when all errors should be displayed.
   </div>
   <details>
   <summary>
   <b>Code suggestion</b>
   </summary>
   <blockquote>Check the AI-generated fix before applying</blockquote>
   <div id="code">
   
   
   ```suggestion
       assert "and" not in error_msg and "more error(s)" not in error_msg
   ```
   
   </div>
   </details>
   </div>
   
   
   
   <small><i>Code Review Run <a 
href=https://github.com/apache/superset/pull/34763#issuecomment-3266796725>#efe81c</a></i></small>
   </div>
   
   ---
   Should Bito avoid suggestions like this for future reviews? (<a 
href=https://alpha.bito.ai/home/ai-agents/review-rules>Manage Rules</a>)
   - [ ] Yes, avoid them



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: Upload CSV as Dataset [superset]

Reply via email to