BUG: `read_csv` with chained fsspec TAR file and `compression="infer"` #60100

KevsterAmp · 2024-10-24T13:40:32Z

closes BUG: read_csv with chained fsspec TAR file and compression="infer" fails with tarfile.ReadError #60028
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
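
For context, one plausible reading of the failure in #60028 is that `compression="infer"` looks at the end of the chained URL, sees `.tar`, and then tries to open the already-extracted CSV member as a TAR archive. The stdlib-only sketch below reproduces that kind of double unpacking without pandas or fsspec. The in-memory archive and the `reopen_as_tar` helper are illustrative assumptions, not code from this PR.

```python
import io
import tarfile

csv_bytes = b"1;2\n3;4\n"

# Build an in-memory TAR archive holding a single member, test.csv.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w") as tar:
    info = tarfile.TarInfo(name="test.csv")
    info.size = len(csv_bytes)
    tar.addfile(info, io.BytesIO(csv_bytes))
archive.seek(0)

# Step 1: pull the member out of the archive. Conceptually this is what
# the "tar://test.csv" part of the chained URL already does.
with tarfile.open(fileobj=archive, mode="r") as tar:
    extracted = tar.extractfile("test.csv").read()


def reopen_as_tar(data: bytes) -> str:
    """Hypothetical helper: try to open raw bytes as a TAR archive."""
    try:
        tarfile.open(fileobj=io.BytesIO(data), mode="r").close()
    except tarfile.ReadError as exc:
        return f"ReadError: {exc}"
    return "ok"


# Step 2: treating the extracted CSV bytes as another TAR archive fails,
# while the real archive bytes open fine.
csv_result = reopen_as_tar(extracted)
archive_result = reopen_as_tar(archive.getvalue())
```

Here `csv_result` starts with `ReadError`, while `archive_result` is `"ok"`: the member is plain CSV, so a second TAR layer has nothing valid to read.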

KevsterAmp · 2024-10-29T23:20:42Z

The tests need fsspec to be installed. What do you think, @rhshadrach?

rhshadrach · 2024-10-30T21:09:33Z

Use the decorator td.skip_if_no("fsspec").

KevsterAmp · 2024-10-30T23:52:04Z

CI failures seem unrelated

cc: @rhshadrach

rhshadrach · 2024-10-31T20:16:49Z

pandas/tests/io/test_common.py

+        pd.read_csv(f"tar://test.csv::file://{tar_file}", compression=None)
+        pd.read_csv(f"tar://test.csv::file://{tar_file}", compression="infer")
+    except Exception as e:
+        pytest.fail(e)


What's the benefit of using pytest.fail as opposed to just removing the try-except entirely?

KevsterAmp · Nov 4, 2024 (reply)

I initially thought I needed some kind of pytest function to validate this, but it seems that isn't the case here.

rhshadrach · 2024-10-31T20:17:07Z

pandas/tests/io/test_common.py

+        pd.read_csv(f"tar://test.csv::file://{tar_file}", compression=None)
+        pd.read_csv(f"tar://test.csv::file://{tar_file}", compression="infer")


Can you check that the results are as expected?

rhshadrach · 2024-10-31T20:17:25Z

pandas/tests/io/test_common.py

@@ -642,6 +643,16 @@ def close(self):
                handles.created_handles.append(TestError())


+@td.skip_if_no("fsspec")
+def test_read_csv_chained_url_no_error():
+    tar_file = "pandas/tests/io/data/tar/test-csv.tar"


Add a comment with the issue number as the first line

KevsterAmp · 2024-11-04T13:03:15Z

Improved the test function to validate the resulting DataFrames as well

(pandas-dev) kev@mac pandas % pytest pandas/tests/io/test_common.py::test_read_csv_chained_url_no_error
+ /opt/homebrew/Caskroom/miniforge/base/envs/pandas-dev/bin/ninja
[1/1] Generating write_version_file with a custom command
============================================================================================================ test session starts =============================================================================================================
platform darwin -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0
PyQt5 5.15.9 -- Qt runtime 5.15.8 -- Qt compiled 5.15.8
rootdir: /Users/kev/self/pandas
configfile: pyproject.toml
plugins: localserver-0.0.0, qt-4.4.0, cov-5.0.0, hypothesis-6.115.5, cython-0.3.1, anyio-4.6.2.post1, xdist-3.6.1
collected 1 item

pandas/tests/io/test_common.py .

------------------------------------------------------------------------------------------ generated xml file: /Users/kev/self/pandas/test-data.xml ------------------------------------------------------------------------------------------
============================================================================================================ slowest 30 durations ============================================================================================================
0.01s call     pandas/tests/io/test_common.py::test_read_csv_chained_url_no_error

(2 durations < 0.005s hidden.  Use -vv to show these durations.)

rhshadrach · 2024-11-05T14:15:27Z

pandas/tests/io/test_common.py

+    x_to_json_expected_output = '{"1;2":{"0":"3;4"}}'
+    y_to_json_expected_output = '{"1;2":{"0":"3;4"}}'


Can you just create the expected DataFrame and use tm.assert_frame_equal? Conversion to JSON is lossy.
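
To illustrate the "lossy" point with the standard library only (a sketch of the general JSON behaviour, not of pandas' `to_json` itself): JSON object keys are always strings, so integer labels and string labels serialize to the same text, and a comparison made on the JSON string cannot tell them apart. The two dictionaries below are made-up stand-ins for column and index labels.

```python
import json

# Two different in-memory structures: integer labels vs. string labels.
int_labels = {1: {0: 3}}
str_labels = {"1": {"0": 3}}

# As Python objects they are clearly not equal.
objects_equal = int_labels == str_labels

# Serialized to JSON, the difference disappears because keys become strings.
json_equal = json.dumps(int_labels) == json.dumps(str_labels)

# Round-tripping the integer-labelled data does not give the original back.
round_tripped = json.loads(json.dumps(int_labels))
```

`objects_equal` is `False` while `json_equal` is `True`, and `round_tripped` equals `str_labels` rather than `int_labels`. Comparing DataFrames directly avoids this kind of silent collapse of label types.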

KevsterAmp · 2024-11-06T12:24:24Z

CI looks unrelated

rhshadrach · 2024-11-07T21:50:10Z

> CI looks unrelated

The test that's failing is the one being added here.

https://github.com/pandas-dev/pandas/actions/runs/11700807499/job/32585578297?pr=60100#step:8:83

KevsterAmp · 2024-11-08T09:39:43Z

> The test that's failing is the one being added here.

Any ideas on how to fix it? Since the failing CI job was the minimum-versions build, I assumed it was just an outdated version of tarfile or pandas.

mroeschke · 2024-11-08T18:58:22Z

Looks like a bug with the version of fsspec being tested (TarContainedFile in fsspec didn't define flush). You'll want to specify a minimum version in skip_if_no (2023.1.0)
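
As a rough illustration of what a minimum-version gate has to decide, here is a stdlib-only sketch. It is not how `td.skip_if_no` is implemented; `version_tuple`, `is_older`, and `should_skip` are hypothetical names, and the parsing only looks at the leading numeric components of a version string.

```python
import re
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '2023.1.0' -> (2023, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version found in {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_older(installed: str, minimum: str) -> bool:
    """True if `installed` is strictly older than `minimum`."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2023.1' and '2023.1.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def should_skip(package: str, min_version: str) -> bool:
    """Skip when the package is missing or older than `min_version`."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return True
    return is_older(installed, min_version)
```

Comparing parsed tuples matters for calendar-style versions such as fsspec's: as plain strings, `"2023.10.0" < "2023.2.0"` is true, which would order the releases incorrectly.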

mroeschke · 2024-11-08T18:59:11Z

pandas/tests/io/test_common.py

+    chained_file_url = f"tar://test.csv::file://{tar_file_path}"
+
+    result_a = pd.read_csv(chained_file_url, compression=None, sep=";")
+    result_b = pd.read_csv(chained_file_url, compression="infer", sep=";")


Could you use pytest.mark.parametrize over the compression parameter?

rhshadrach

lgtm

rhshadrach · 2024-11-11T21:16:08Z

Thanks @KevsterAmp

KevsterAmp added 5 commits October 24, 2024 21:40

add to whatsnew

795b260

Merge remote-tracking branch 'upstream/main' into bug-read_csv-tarfile

0f0ac4e

extract the target file to access when chained URLs are used

f05dec5

add isinstance to filter on str inputs only

cb94060

add tests and test tar file

9e1ba27
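
The commit messages "extract the target file to access when chained URLs are used" and "add isinstance to filter on str inputs only" suggest the shape of the fix: when the input is a string containing the fsspec chaining separator `::`, infer compression from the first segment (the file actually read) rather than from the whole URL. A minimal sketch of that idea follows, assuming a hypothetical `infer_compression_sketch` helper and a simplified extension table. It is not the pandas implementation.

```python
from typing import Optional

# Simplified, illustrative extension table (an assumption for this sketch).
_EXTENSION_TO_COMPRESSION = {
    ".tar": "tar",
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
}


def infer_compression_sketch(path_or_buffer: object) -> Optional[str]:
    """Infer a compression name from a path, honouring chained URLs."""
    # Only string paths carry an extension; buffers are left alone.
    if not isinstance(path_or_buffer, str):
        return None

    # In a chained URL such as "tar://test.csv::file://archive.tar",
    # the first segment names the file that is ultimately read.
    target = path_or_buffer.split("::")[0].lower()

    for extension, compression in _EXTENSION_TO_COMPRESSION.items():
        if target.endswith(extension):
            return compression
    return None


chained = infer_compression_sketch("tar://test.csv::file://data/test-csv.tar")
plain = infer_compression_sketch("data/test-csv.tar")
```

With this sketch, `chained` is `None` because the target is `test.csv`, while `plain` is `"tar"` because the path itself is the archive.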

KevsterAmp marked this pull request as ready for review October 29, 2024 14:33

KevsterAmp added 2 commits October 29, 2024 22:39

rename func to start with "test"; revert removed random test func

3aaad97

formatting improvements by ruff

778e385

add @td.skip_if_no("fsspec") on test func

33b601d

rhshadrach requested changes Oct 31, 2024

improve test function for read_csv chained urls

fc469c7

Merge remote-tracking branch 'upstream/main' into bug-read_csv-tarfile

deb21df

KevsterAmp requested a review from rhshadrach November 4, 2024 13:04

rhshadrach reviewed Nov 5, 2024

KevsterAmp added 2 commits November 6, 2024 17:31

use tm.assert_frame_equal; add separator on read_csv; improve chained_file_url

32fef29

Merge remote-tracking branch 'upstream/main' into bug-read_csv-tarfile

0fe1864

KevsterAmp requested a review from rhshadrach November 6, 2024 09:32

mroeschke added the IO CSV read_csv, to_csv label Nov 8, 2024

mroeschke reviewed Nov 8, 2024

KevsterAmp added 3 commits November 11, 2024 20:04

add min_version to td.skip_if_no due to ffspec bug

0dc0444

utilize pytest.mark.parametrize for testing

04f9246

Merge remote-tracking branch 'upstream/main' into bug-read_csv-tarfile

53d997f

mroeschke approved these changes Nov 11, 2024

mroeschke added this to the 3.0 milestone Nov 11, 2024

rhshadrach approved these changes Nov 11, 2024

rhshadrach added the Bug label Nov 11, 2024

rhshadrach changed the title ~~BUG: read_csv with chained fsspec TAR file and compression="infer" fails with tarfile.ReadError~~ BUG: read_csv with chained fsspec TAR file and compression="infer" Nov 11, 2024

rhshadrach merged commit 22df68e into pandas-dev:main Nov 11, 2024
55 checks passed
