Skip to content

REF: simplify CSVFormatter #36046

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Sep 9, 2020
Merged

REF: simplify CSVFormatter #36046

merged 39 commits into from
Sep 9, 2020

Conversation

ivanovmg
Copy link
Member

@ivanovmg ivanovmg commented Sep 1, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Refactor CSVFormatter

  1. Put data validation in setters
  2. Extract helper methods and properties

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good a few comments

@jreback jreback added IO CSV read_csv, to_csv Refactor Internal refactoring of code labels Sep 4, 2020
@jreback jreback added this to the 1.2 milestone Sep 4, 2020
@jreback jreback requested a review from gfyoung September 4, 2020 22:08
@gfyoung
Copy link
Member

gfyoung commented Sep 4, 2020

@ivanovmg: Welcome to pandas ! I like the general direction of this PR.

If you can address all of @jreback's comments, I think we should be good.

@ivanovmg ivanovmg requested a review from jreback September 5, 2020 09:52
To make sure that the newer mypy (v0.782) passes.
This eliminates repetition of the type annotations
for index label in multiple places.
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the update @ivanovmg couple of questions

@@ -82,6 +83,7 @@

Axis = Union[str, int]
Label = Optional[Hashable]
IndexLabel = Optional[Union[bool, str, Sequence[Label]]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this just be Optional[Union[Label, Sequence[Label]]] ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, index label can be bool (at least in the original implementation). So, I left it exactly as it was.
It can probably be changed into Optional[Union[bool, Label, Sequence[Label]]], if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool should be included in Hashable though? if that's the case then I don't think you need it here (or is somthing breaking)?

Copy link
Member Author

@ivanovmg ivanovmg Sep 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced IndexLabel as suggested in b7dae11

@@ -75,50 +83,92 @@ def __init__(
self.should_close = ioargs.should_close
self.mode = ioargs.mode

# GH21227 internal compression is not used for non-binary handles.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this actually be in get_filepath_or_buffer instead? in pandas/io/common.py?

cc @twoertwein

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is currently not in get_filepath_or_buffer (but it would be a good place for it!). We have the same code also in to_csv:

# GH21227 internal compression is not used for non-binary handles.

If I didn't miss a function, all callers of get_filepath_or_buffer use the returned compression from it (if they care about compression).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this check on compression method into pandas/io/common.py. See ca888c1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you! With this change, I think we don't need the warning in pandas/pandas/io/formats/csvs.py line 162 anymore.

get_filepath_or_buffer opens some URL-like strings in binary mode (because pandas supports more operations in binary mode) even if the caller specified a non-binary mode (unless the caller explicitly specified text mode with "t"). If (isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer)) or is_fsspec_url(filepath_or_buffer), you will need to use fsspec_mode instead of mode. I'm not sure how keen the pandas devs are on function decorates, I can imagine that it might be easier to first call get_filepath_or_buffer (it will decide whether to use binary/text mode) and then throw the warning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twoertwein, indeed I removed the warning from pandas/io/formats/csvs.py. All of these checks are not handed to the function get_filepath_or_buffer.
I am not sure I understood: is there anything else to be done on the topic? Because it seems that everything is fine now: tests pass and there is no check on the compression method in CSVFormatter.

Copy link
Member

@twoertwein twoertwein Sep 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that it doesn't break this test

def test_to_csv_compression_encoding_gcs(gcs_buffer, compression_only, encoding):

It should call get_filepath_or_buffer with mode='w' or mode='r' but get_filepath_or_buffer will use fsspec_mode which adds the binary flag.

edit: I just realized why this PR doesn't break this test: fsspecs are strings, they have no write method.


@property
def quotechar(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add return type annotions as much as possible on these property / functions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that.

return self._cols

@cols.setter
def cols(self, cols):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and type if you can

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, but with the limitations.
See 1346995

For some reason mypy would not recognize that chunksize
turns from Optional[int] to int inside the setter.
Even setting an intentional assertion
``assert chunksize is not None``
does not help.
Limitations:
 - ignore type[assignment] error.
 - Created additional method _refine_cols to allow
 conversion from Optional[Sequence[Label]] to Sequence[Label].
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice, one small question on typing. pls ping on green.

@@ -82,6 +83,7 @@

Axis = Union[str, int]
Label = Optional[Hashable]
IndexLabel = Optional[Union[bool, str, Sequence[Label]]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool should be included in Hashable though? if that's the case then I don't think you need it here (or is somthing breaking)?

index_label = [index_label]
self._index_label = index_label

def _get_index_label_from_obj(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally type the returns of these

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ivanovmg ivanovmg requested a review from jreback September 6, 2020 17:59
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry @ivanovmg one more set of comments, ping on green.

# save it
self.cols = cols
@property
def _number_format(self) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically this should be Dict[str, Any]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @simonjayhawkins don't we have a mypy setting that fails on this? (or should be have a code-check)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Corrected typing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @simonjayhawkins don't we have a mypy setting that fails on this? (or should be have a code-check)?

see #30539

return bool(self._has_aliases or self.header)

@property
def write_cols(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if not index:
self.nlevels = 0
@property
def encoded_labels(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ca you type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ivanovmg ivanovmg requested a review from jreback September 6, 2020 20:20
@ivanovmg
Copy link
Member Author

ivanovmg commented Sep 9, 2020

@jreback, is there anything else to be done on this PR? Or good to go?

@jreback jreback merged commit 44e933a into pandas-dev:master Sep 9, 2020
@jreback
Copy link
Contributor

jreback commented Sep 9, 2020

thanks @ivanovmg very nice

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had just started reviewing the code as it's been merged, so this comment is just for info.

@@ -82,6 +83,7 @@

Axis = Union[str, int]
Label = Optional[Hashable]
IndexLabel = Optional[Union[Label, Sequence[Label]]]
Copy link
Member

@simonjayhawkins simonjayhawkins Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Optional in Label above is to include None in Label. (On hindsight, I think this could have been Union[Hashable, None] for clarity but we don't use the Union[..., None] pattern)

And also I'm not sure how we got the Ordered alias below.

In pandas._typing, I would generally prefer that we don't include the Optional, and just add it when needed in the annotations. I think this in generally allows more use of the aliases (i.e. after setting a default since we don't always set defaults in the signatures)

so in the code, you would have

index_label: Optional[IndexLabel] = None

instead of

index_label: IndexLabel = None

of course Label includes None anyway so the Optional isn't needed anyway. it's just a stylistic preference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simonjayhawkins, I agree that it is more reasonable to define Label and IndexLabel without Optional. I will take a look at this in a separate PR.

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
* REF: extract properties cols and has_mi_columns

* REF: extract property chunksize

* REF: extract property quotechar

* REF: extract properties data_index and nlevels

* REF: refactor _save_chunk

* REF: refactor _save

* REF: extract method _save_body

* REF: reorder _save-like methods

* REF: extract compression property

* REF: Extract property index_label

* REF: extract helper properties

* REF: delete local variables in _save_header

* REF: extract method _get_header_rows

* REF: move check for header into _save function

* TYP: add several type annotations

* FIX: fix index labels

* FIX: fix multiindex

* FIX: fix test failures on compression

Needed to eliminate compression setter
due to the interdependencies between ioargs and compression.

* REF: eliminate preallocation of self.data

* REF: extract method _convert_to_native_types

* REF: rename regular -> flat as reviewed

* TYP: add type annotations as reviewed

* REF: refactor number formatting

Replace _convert_to_native_types method
in favor of a number formatting dictionary.

* FIX: mypy error with index_label

* FIX: reorder if-statements in index_label

To make sure that the newer mypy (v0.782) passes.

* TYP: move IndexLabel to pandas._typing

This eliminates repetition of the type annotations
for index label in multiple places.

* TYP: quotechar, has_mi_columns, _need_to_save...

* TYP: chunksize, but ignored assignment check

For some reason mypy would not recognize that chunksize
turns from Optional[int] to int inside the setter.
Even setting an intentional assertion
``assert chunksize is not None``
does not help.

* TYP: cols property

Limitations:
 - ignore type[assignment] error.
 - Created additional method _refine_cols to allow
 conversion from Optional[Sequence[Label]] to Sequence[Label].

* TYP: nlevels and _has_aliases

* CLN: move GH21227 check to pandas/io/common.py

* TYP: remove redundant bool from IndexLabel type

* TYP: add to _get_index_label... methods

* TYP: use Iterator instead of Generator

* TYP: explicitly use List type

* TYP: correct dict typing

* TYP: remaining properties
@ivanovmg ivanovmg deleted the refactor/csvs branch November 6, 2020 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants