REF: simplify CSVFormatter #36046

ivanovmg · 2020-09-01T17:49:29Z

Refactor CSVFormatter

Put data validation in setters
Extract helper methods and properties

Needed to eliminate compression setter due to the interdependencies between ioargs and compression.
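
As a rough illustration of the "validation in setters" idea described above, here is a minimal sketch. `MiniFormatter` is a hypothetical stand-in, not the actual `CSVFormatter`, and both the default chunk-size formula and the positivity check are assumptions made for the example.

```python
from typing import Optional, Sequence


class MiniFormatter:
    """Hypothetical stand-in for a formatter; not pandas' CSVFormatter."""

    def __init__(self, cols: Sequence[str], chunksize: Optional[int] = None) -> None:
        self.cols = list(cols)
        # Assigning through the property runs the validation in the setter.
        self.chunksize = chunksize  # type: ignore[assignment]

    @property
    def chunksize(self) -> int:
        return self._chunksize

    @chunksize.setter
    def chunksize(self, chunksize: Optional[int]) -> None:
        if chunksize is None:
            # Assumed default: aim for roughly 100,000 cells per chunk.
            chunksize = (100_000 // (len(self.cols) or 1)) or 1
        if chunksize <= 0:
            raise ValueError("chunksize must be positive")
        self._chunksize = int(chunksize)
```

With this layout `__init__` stays a flat list of assignments, and each rule lives next to the attribute it guards.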

jreback

looks pretty good, a few comments

pandas/io/formats/csvs.py

gfyoung · 2020-09-04T23:54:35Z

@ivanovmg: Welcome to pandas! I like the general direction of this PR.

If you can address all of @jreback's comments, I think we should be good.

jreback

thanks for the update @ivanovmg couple of questions

jreback · 2020-09-05T14:27:34Z

pandas/_typing.py

@@ -82,6 +83,7 @@

 Axis = Union[str, int]
 Label = Optional[Hashable]
+IndexLabel = Optional[Union[bool, str, Sequence[Label]]]


shouldn't this just be Optional[Union[Label, Sequence[Label]]] ?

cc @simonjayhawkins

As far as I understand, index label can be bool (at least in the original implementation). So, I left it exactly as it was.
It can probably be changed into Optional[Union[bool, Label, Sequence[Label]]], if needed.

bool should be included in Hashable though? if that's the case then I don't think you need it here (or is something breaking)?

Replaced IndexLabel as suggested in b7dae11
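
The point about `bool` already being covered by `Hashable` can be checked at runtime. The snippet below is only a sketch: it copies the alias definitions from the diff and the suggested replacement, and the example values are arbitrary.

```python
from typing import Hashable, Optional, Sequence, Union

# Aliases as discussed in this thread (the suggested form of IndexLabel).
Label = Optional[Hashable]
IndexLabel = Optional[Union[Label, Sequence[Label]]]

# bool, str, int and tuples all satisfy Hashable at runtime ...
hashable_examples = [True, "idx", 0, ("a", "b")]
all_hashable = all(isinstance(value, Hashable) for value in hashable_examples)

# ... whereas a list does not, which is why Sequence[Label] is still needed.
list_is_hashable = isinstance(["a", "b"], Hashable)
```

Here `all_hashable` is `True` and `list_is_hashable` is `False`, so spelling out `bool` and `str` separately adds nothing beyond `Label`.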

jreback · 2020-09-05T14:29:05Z

pandas/io/formats/csvs.py

@@ -75,50 +83,92 @@ def __init__(
        self.should_close = ioargs.should_close
        self.mode = ioargs.mode

+        # GH21227 internal compression is not used for non-binary handles.


shouldn't this actually be in get_filepath_or_buffer instead? in pandas/io/common.py?

cc @twoertwein

that is currently not in get_filepath_or_buffer (but it would be a good place for it!). We have the same code also in to_csv:

pandas/pandas/io/formats/csvs.py

Line 162 in bdb6e26

# GH21227 internal compression is not used for non-binary handles.

If I didn't miss a function, all callers of get_filepath_or_buffer use the returned compression from it (if they care about compression).

I moved this check on compression method into pandas/io/common.py. See ca888c1.

thank you! With this change, I think we don't need the warning in pandas/pandas/io/formats/csvs.py line 162 anymore.

get_filepath_or_buffer opens some URL-like strings in binary mode (because pandas supports more operations in binary mode) even if the caller specified a non-binary mode (unless the caller explicitly specified text mode with "t"). If (isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer)) or is_fsspec_url(filepath_or_buffer), you will need to use fsspec_mode instead of mode. I'm not sure how keen the pandas devs are on function decorators. I can imagine that it might be easier to first call get_filepath_or_buffer (it will decide whether to use binary/text mode) and then throw the warning.

@twoertwein, indeed I removed the warning from pandas/io/formats/csvs.py. All of these checks are now handed to the function get_filepath_or_buffer.
I am not sure I understood: is there anything else to be done on the topic? Because it seems that everything is fine now: tests pass and there is no check on the compression method in CSVFormatter.

I'm surprised that it doesn't break this test

pandas/pandas/tests/io/test_gcs.py

Line 78 in c688a0f

def test_to_csv_compression_encoding_gcs(gcs_buffer, compression_only, encoding):

It should call get_filepath_or_buffer with mode='w' or mode='r' but get_filepath_or_buffer will use fsspec_mode which adds the binary flag.

edit: I just realized why this PR doesn't break this test: fsspecs are strings, they have no write method.
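
To make the discussion above concrete, here is a minimal sketch of the kind of check being moved. `effective_compression` is a hypothetical helper written for illustration; it is not the code in pandas/io/common.py, and the `gcs://` path is a made-up placeholder.

```python
import io
import warnings
from typing import Any, Optional


def effective_compression(
    path_or_buf: Any, mode: str, compression: Optional[str]
) -> Optional[str]:
    """Hypothetical helper: drop compression for non-binary file handles."""
    # Only objects with a write method count as handles; plain strings do not.
    is_handle = hasattr(path_or_buf, "write")
    if compression and is_handle and "b" not in mode:
        warnings.warn(
            "compression has no effect when passing a non-binary object as input.",
            RuntimeWarning,
            stacklevel=2,
        )
        return None
    return compression


with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # A text handle opened in "w" mode loses its compression setting.
    text_result = effective_compression(io.StringIO(), "w", "gzip")

# A string path (including an fsspec-style URL) has no write method,
# so the check never fires and compression is kept.
path_result = effective_compression("gcs://bucket/file.csv.gz", "w", "gzip")
```

This mirrors the observation in the edit above: because the check keys on the presence of a `write` method, string paths pass through untouched regardless of the mode flag.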

jreback · 2020-09-05T14:29:56Z

pandas/io/formats/csvs.py


+    @property
+    def quotechar(self):


can you add return type annotations as much as possible on these properties / functions

jreback · 2020-09-05T14:30:14Z

pandas/io/formats/csvs.py

+        return self._cols
+
+    @cols.setter
+    def cols(self, cols):


and type if you can

Added, but with the limitations.
See 1346995

pandas/io/formats/csvs.py

For some reason mypy would not recognize that chunksize turns from Optional[int] to int inside the setter. Even setting an intentional assertion ``assert chunksize is not None`` does not help.

Limitations: - ignore type[assignment] error. - Created additional method _refine_cols to allow conversion from Optional[Sequence[Label]] to Sequence[Label].
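
The helper-method workaround mentioned in the limitations can be sketched as follows. `ColumnSelector` and its validation rule are hypothetical; only the general shape (a `_refine_cols`-style method that converts `Optional[Sequence[Label]]` to `Sequence[Label]` before the property is assigned) follows the commit note.

```python
from typing import Hashable, List, Optional, Sequence


class ColumnSelector:
    """Hypothetical class illustrating the helper-method workaround."""

    def __init__(
        self,
        available: Sequence[Hashable],
        cols: Optional[Sequence[Hashable]] = None,
    ) -> None:
        self._available: List[Hashable] = list(available)
        # The helper narrows Optional[...] before the property is assigned,
        # so getter and setter can share one non-Optional type.
        self.cols = self._refine_cols(cols)

    @property
    def cols(self) -> Sequence[Hashable]:
        return self._cols

    @cols.setter
    def cols(self, cols: Sequence[Hashable]) -> None:
        self._cols = list(cols)

    def _refine_cols(
        self, cols: Optional[Sequence[Hashable]]
    ) -> Sequence[Hashable]:
        if cols is None:
            return self._available
        missing = [c for c in cols if c not in self._available]
        if missing:
            raise KeyError(f"columns not found: {missing}")
        return list(cols)
```

Keeping the property symmetric avoids the ignored assignment error, at the cost of one extra private method.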

jreback

very nice, one small question on typing. pls ping on green.

jreback · 2020-09-05T18:28:28Z

pandas/_typing.py

@@ -82,6 +83,7 @@

 Axis = Union[str, int]
 Label = Optional[Hashable]
+IndexLabel = Optional[Union[bool, str, Sequence[Label]]]


bool should be included in Hashable though? if that's the case then I don't think you need it here (or is something breaking)?

jreback · 2020-09-05T18:28:59Z

pandas/io/formats/csvs.py

+                index_label = [index_label]
+        self._index_label = index_label
+
+    def _get_index_label_from_obj(self):


ideally type the returns of these

jreback

sorry @ivanovmg one more set of comments, ping on green.

jreback · 2020-09-06T18:04:38Z

pandas/io/formats/csvs.py

-        # save it
-        self.cols = cols
+    @property
+    def _number_format(self) -> dict:


technically this should be Dict[str, Any]

cc @simonjayhawkins don't we have a mypy setting that fails on this? (or should we have a code-check)?

Thank you! Corrected typing.

cc @simonjayhawkins don't we have a mypy setting that fails on this? (or should we have a code-check)?

see #30539
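
For readers unfamiliar with the distinction, a bare `dict` return annotation leaves the key and value types unspecified, while `Dict[str, Any]` at least pins the keys to strings. The sketch below uses a hypothetical `NumberSettings` class and an assumed set of options. As far as I know, mypy's `disallow_any_generics` option is one setting that reports bare generics like this; whether pandas enabled it at the time is not stated in this thread.

```python
from typing import Any, Dict, Optional


class NumberSettings:
    """Hypothetical holder for number-formatting options."""

    def __init__(
        self,
        na_rep: str = "",
        float_format: Optional[str] = None,
        decimal: str = ".",
    ) -> None:
        self.na_rep = na_rep
        self.float_format = float_format
        self.decimal = decimal

    @property
    def number_format(self) -> Dict[str, Any]:
        """Formatting settings bundled as keyword arguments."""
        return dict(
            na_rep=self.na_rep,
            float_format=self.float_format,
            decimal=self.decimal,
        )
```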

jreback · 2020-09-06T18:05:23Z

pandas/io/formats/csvs.py

+        return bool(self._has_aliases or self.header)
+
+    @property
+    def write_cols(self):


can you type here

jreback · 2020-09-06T18:05:30Z

pandas/io/formats/csvs.py

-        if not index:
-            self.nlevels = 0
+    @property
+    def encoded_labels(self):


can you type

ivanovmg · 2020-09-09T10:03:33Z

@jreback, is there anything else to be done on this PR? Or good to go?

jreback · 2020-09-09T10:33:23Z

thanks @ivanovmg very nice

simonjayhawkins

I had just started reviewing the code when it was merged, so this comment is just for info.

simonjayhawkins · 2020-09-09T10:35:00Z

pandas/_typing.py

@@ -82,6 +83,7 @@

 Axis = Union[str, int]
 Label = Optional[Hashable]
+IndexLabel = Optional[Union[Label, Sequence[Label]]]


the Optional in Label above is to include None in Label. (In hindsight, I think this could have been Union[Hashable, None] for clarity but we don't use the Union[..., None] pattern)

And also I'm not sure how we got the Ordered alias below.

In pandas._typing, I would generally prefer that we don't include the Optional, and just add it when needed in the annotations. I think this generally allows more use of the aliases (i.e. after setting a default, since we don't always set defaults in the signatures)

so in the code, you would have

index_label: Optional[IndexLabel] = None

instead of

index_label: IndexLabel = None

of course Label includes None anyway, so the Optional isn't strictly needed. it's just a stylistic preference.

@simonjayhawkins, I agree that it is more reasonable to define Label and IndexLabel without Optional. I will take a look at this in a separate PR.
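
A minimal sketch of the style preferred above, with the aliases defined without `Optional` and the `Optional` added only at the use site. The alias definitions here are hypothetical variants of the ones in the diff, and `normalize_index_label` is an illustrative function, not pandas code.

```python
from typing import Hashable, List, Optional, Sequence, Union

# Hypothetical aliases without Optional baked in.
Label = Hashable
IndexLabel = Union[Label, Sequence[Label]]


def normalize_index_label(
    index_label: Optional[IndexLabel] = None,
) -> Optional[List[Label]]:
    """Wrap a single label in a list; pass None through unchanged."""
    if index_label is None:
        return None
    if isinstance(index_label, list):
        return list(index_label)
    # Anything else (str, int, tuple, ...) is treated as one label.
    return [index_label]
```

After the `None` check, the remaining value can be described by the bare `IndexLabel` alias, which is the reuse benefit mentioned in the comment.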

ivanovmg added 19 commits September 2, 2020 00:45

REF: extract properties cols and has_mi_columns

4223349

REF: extract property chunksize

58ef283

REF: extract property quotechar

f4fe66d

REF: extract properties data_index and nlevels

59a2d21

REF: refactor _save_chunk

29256d4

REF: refactor _save

a6e84e1

REF: extract method _save_body

c840b3f

REF: reorder _save-like methods

6828146

REF: extract compression property

98d4e47

REF: Extract property index_label

d6b2827

REF: extract helper properties

15dbc83

REF: delete local variables in _save_header

5e7b778

REF: extract method _get_header_rows

6e3b389

REF: move check for header into _save function

d733f0f

TYP: add several type annotations

cdeb115

FIX: fix index labels

417e74a

FIX: fix multiindex

9df1d82

Merge branch 'master' into refactor/csvs

9fd8d13

FIX: fix test failures on compression

22955db

Needed to eliminate compression setter due to the interdependencies between ioargs and compression.

ivanovmg force-pushed the refactor/csvs branch from faf1b4b to 22955db September 4, 2020 17:10

ivanovmg added 2 commits September 5, 2020 02:55

REF: eliminate preallocation of self.data

5dcff8e

REF: extract method _convert_to_native_types

ff144d8

jreback requested changes Sep 4, 2020

jreback added the IO CSV (read_csv, to_csv) and Refactor (internal refactoring of code) labels Sep 4, 2020

jreback added this to the 1.2 milestone Sep 4, 2020

jreback requested a review from gfyoung September 4, 2020 22:08

ivanovmg added 2 commits September 5, 2020 16:44

REF: rename regular -> flat as reviewed

3da7207

TYP: add type annotations as reviewed

6041666

ivanovmg requested a review from jreback September 5, 2020 09:52

ivanovmg added 3 commits September 5, 2020 19:30

FIX: mypy error with index_label

080e6e1

FIX: reorder if-statements in index_label

1e35f87

To make sure that the newer mypy (v0.782) passes.

TYP: move IndexLabel to pandas._typing

ba353a5

This eliminates repetition of the type annotations for index label in multiple places.

jreback requested changes Sep 5, 2020

ivanovmg added 6 commits September 5, 2020 21:56

TYP: quotechar, has_mi_columns, _need_to_save...

a49dd63

TYP: chunksize, but ignored assignment check

f1e1ac8

For some reason mypy would not recognize that chunksize turns from Optional[int] to int inside the setter. Even setting an intentional assertion ``assert chunksize is not None`` does not help.

TYP: cols property

1346995

Limitations: - ignore type[assignment] error. - Created additional method _refine_cols to allow conversion from Optional[Sequence[Label]] to Sequence[Label].

TYP: nlevels and _has_aliases

bebdfcf

Merge branch 'master' into refactor/csvs

b381e8a

CLN: move GH21227 check to pandas/io/common.py

ca888c1

jreback requested changes Sep 5, 2020

ivanovmg and others added 4 commits September 6, 2020 01:58

TYP: remove redundant bool from IndexLabel type

b7dae11

TYP: add to _get_index_label... methods

1f8c488

TYP: use Iterator instead of Generator

1a750b4

TYP: explicitly use List type

7b89921

ivanovmg requested a review from jreback September 6, 2020 17:59

jreback requested changes Sep 6, 2020

ivanovmg added 2 commits September 7, 2020 01:44

TYP: correct dict typing

2478084

TYP: remaining properties

e08f656

ivanovmg requested a review from jreback September 6, 2020 20:20

jreback approved these changes Sep 9, 2020

jreback merged commit 44e933a into pandas-dev:master Sep 9, 2020

simonjayhawkins reviewed Sep 9, 2020

ivanovmg mentioned this pull request Sep 16, 2020

TYP: alias IndexLabel without Optional #36401

Merged

ivanovmg deleted the refactor/csvs branch November 6, 2020 15:35
