Skip to content

BUG: read_csv with chained fsspec TAR file and compression="infer" #60100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Nov 11, 2024

Conversation

KevsterAmp
Copy link
Contributor

@KevsterAmp KevsterAmp commented Oct 24, 2024

@KevsterAmp KevsterAmp marked this pull request as ready for review October 29, 2024 14:33
@KevsterAmp
Copy link
Contributor Author

tests needs ffspec to be installed. What do you think @rhshadrach?

@rhshadrach
Copy link
Member

Use the decorator td.skip_if_no("fsspec").

@KevsterAmp
Copy link
Contributor Author

CI failures seems unrelated

cc: @rhshadrach

pd.read_csv(f"tar://test.csv::file://{tar_file}", compression=None)
pd.read_csv(f"tar://test.csv::file://{tar_file}", compression="infer")
except Exception as e:
pytest.fail(e)
Copy link
Member

@rhshadrach rhshadrach Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of using pytest.fail as opposed to just removing the try-except entirely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was initially thinking that I need to use somekind of pytest function to validate. But it seems like it wouldn't be the case for this

Comment on lines 650 to 651
pd.read_csv(f"tar://test.csv::file://{tar_file}", compression=None)
pd.read_csv(f"tar://test.csv::file://{tar_file}", compression="infer")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check the results are as expected.

@@ -642,6 +643,16 @@ def close(self):
handles.created_handles.append(TestError())


@td.skip_if_no("fsspec")
def test_read_csv_chained_url_no_error():
tar_file = "pandas/tests/io/data/tar/test-csv.tar"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment with the issue number as the first line

@KevsterAmp
Copy link
Contributor Author

Improved the test function to validate the dataframe inputs as well

(pandas-dev) kev@mac pandas % pytest pandas/tests/io/test_common.py::test_read_csv_chained_url_no_error
+ /opt/homebrew/Caskroom/miniforge/base/envs/pandas-dev/bin/ninja
[1/1] Generating write_version_file with a custom command
============================================================================================================ test session starts =============================================================================================================
platform darwin -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0
PyQt5 5.15.9 -- Qt runtime 5.15.8 -- Qt compiled 5.15.8
rootdir: /Users/kev/self/pandas
configfile: pyproject.toml
plugins: localserver-0.0.0, qt-4.4.0, cov-5.0.0, hypothesis-6.115.5, cython-0.3.1, anyio-4.6.2.post1, xdist-3.6.1
collected 1 item

pandas/tests/io/test_common.py .

------------------------------------------------------------------------------------------ generated xml file: /Users/kev/self/pandas/test-data.xml ------------------------------------------------------------------------------------------
============================================================================================================ slowest 30 durations ============================================================================================================
0.01s call     pandas/tests/io/test_common.py::test_read_csv_chained_url_no_error

(2 durations < 0.005s hidden.  Use -vv to show these durations.)

@KevsterAmp KevsterAmp requested a review from rhshadrach November 4, 2024 13:04
Comment on lines 654 to 655
x_to_json_expected_output = '{"1;2":{"0":"3;4"}}'
y_to_json_expected_output = '{"1;2":{"0":"3;4"}}'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just create the expected DataFrame and use tm.assert_frame_equal. Conversion JSON is lossy.

@KevsterAmp KevsterAmp requested a review from rhshadrach November 6, 2024 09:32
@KevsterAmp
Copy link
Contributor Author

CI looks unrelated

@rhshadrach
Copy link
Member

CI looks unrelated

The test that's failing is the one being added here.

https://github.com/pandas-dev/pandas/actions/runs/11700807499/job/32585578297?pr=60100#step:8:83

@KevsterAmp
Copy link
Contributor Author

The test that's failing is the one being added here.

Any ideas to fix it? Since the CI was minimum versions I assumed that it was just outdated version of Tarfile or pandas

@mroeschke
Copy link
Member

Looks like a bug with the version of fsspec being tested (TarContainedFile in fsspec didn't define flush). You'll want to specify a minimum version in skip_if_no (2023.1.0)

@mroeschke mroeschke added the IO CSV read_csv, to_csv label Nov 8, 2024
chained_file_url = f"tar://test.csv::file://{tar_file_path}"

result_a = pd.read_csv(chained_file_url, compression=None, sep=";")
result_b = pd.read_csv(chained_file_url, compression="infer", sep=";")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use pytest.mark.parametrize over the compression parameter

@mroeschke mroeschke added this to the 3.0 milestone Nov 11, 2024
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach added the Bug label Nov 11, 2024
@rhshadrach rhshadrach changed the title BUG: read_csv with chained fsspec TAR file and compression="infer" fails with tarfile.ReadError BUG: read_csv with chained fsspec TAR file and compression="infer" Nov 11, 2024
@rhshadrach rhshadrach merged commit 22df68e into pandas-dev:main Nov 11, 2024
55 checks passed
@rhshadrach
Copy link
Member

Thanks @KevsterAmp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_csv with chained fsspec TAR file and compression="infer" fails with tarfile.ReadError
3 participants