
Allow reading SAS files from archives #47154


Merged
9 commits merged into pandas-dev:main on Jun 15, 2022

Conversation

jonashaag (Contributor)

  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jreback jreback added the IO SAS SAS: read_sas label Jun 5, 2022
@jreback jreback added this to the 1.5 milestone Jun 5, 2022

jreback commented Jun 5, 2022

cc @bashtage @twoertwein if any comments

@twoertwein (Member) left a comment

Looks good to me! One small optional comment.

@bashtage (Contributor) left a comment

Mostly straightforward. I don't fully see the point of the big docstring refactor, though, since there's no real reuse.


from pandas.io.common import stringify_path

if TYPE_CHECKING:
    from pandas import DataFrame


_doc_read_sas = r"""
Contributor

Seems strange to pull this out just to use Appender. Why not leave it in-line and use Substitution?

Contributor

See, e.g.,

@Substitution(name="groupby")

Contributor Author

Thanks, I copy-pasted that from somewhere else.

Contributor Author

@bashtage now we have the problem that the substituted snippet is dedented:

...
iterator : bool, defaults to False
    If True, returns an iterator for reading the file incrementally.

    .. versionchanged:: 1.2

        ``TextFileReader`` is a context manager.
compression : str or dict, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and '%s' is
...

The last line should be indented.

Any pointers on how to deal with this?
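The behavior described above can be reproduced with plain `%`-substitution and the standard library. This is a stdlib-only sketch, not pandas' actual shared docstrings; the snippet text and placeholder name are illustrative, and `textwrap.indent` is one possible workaround, not necessarily what the PR ended up doing:

```python
import textwrap

# The shared snippet is stored dedented, and %-substitution pastes it in
# verbatim, so its continuation line does not pick up the indentation
# expected at the insertion point.
snippet = (
    "compression : str or dict, default 'infer'\n"
    "For on-the-fly decompression of on-disk data."
)

template = (
    "iterator : bool, defaults to False\n"
    "    If True, returns an iterator for reading the file incrementally.\n"
    "%(decompression_options)s"
)

# Reproduces the problem: the snippet's second line lands at column 0.
broken = template % {"decompression_options": snippet}

# One possible workaround: indent every line of the snippet, then strip
# the indent from the first line (which sits right after the placeholder).
fixed = template % {
    "decompression_options": textwrap.indent(snippet, "    ").lstrip()
}
print(fixed)
```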

Contributor

Maybe use the @doc decorator, as here:

@doc(
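For context, a stdlib-only toy version of the approach the @doc decorator takes (dedent the docstring, then fill placeholders via str.format). This is a hedged sketch; `doc_like` is a made-up name and not pandas' actual implementation in `pandas.util._decorators`:

```python
import textwrap

def doc_like(**params):
    # Toy stand-in for a @doc-style decorator: dedent the raw docstring,
    # then fill {placeholders} with str.format.
    def decorator(func):
        func.__doc__ = textwrap.dedent(func.__doc__).format(**params)
        return func
    return decorator

@doc_like(decompression_options="compression : str or dict, default 'infer'")
def read_example(path):
    """
    Read an example file.

    Parameters
    ----------
    {decompression_options}
    """
    return path

print(read_example.__doc__)
```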

def test_sas_archive(datapath):
    fname_uncompressed = datapath("io", "sas", "data", "airline.sas7bdat")
    df_uncompressed = read_sas(fname_uncompressed)
    fname_compressed = datapath("io", "sas", "data", "airline.sas7bdat.gz")
Contributor

Is it worth adding a gzip'd file? Could the gzip file be created on the fly during the test? That would also let you test the other supported formats, e.g. zip.

Contributor Author

We do have .gz files in the repo, so I thought that was the way to go. I can also gzip the file in the test; let me know your preference.


Contributor

I don't think it is essential to use this method; I just wonder about the longer-term future, since it isn't strictly necessary to add compressed versions of files to the main repo. I'm happy to sign off on it as it is now.

Member

My 2c: I would prefer these files to be gzipped on the fly in the test, so that only strictly necessary files live in the main repo. It doesn't have to happen in this PR, but a follow-up would be appreciated.
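The suggested on-the-fly variant could look roughly like the sketch below. `gzip_on_the_fly` is a made-up helper name, and the fake bytes stand in for the `airline.sas7bdat` fixture so the snippet runs anywhere; in the real test, the `datapath`/`tmp_path` pytest fixtures, `read_sas`, and a frame comparison would replace the stand-ins:

```python
import gzip
import os
import shutil
import tempfile

def gzip_on_the_fly(src_path, dst_path):
    # Compress an existing fixture into a .gz file at test time, so no
    # compressed copies need to be committed to the repo.
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Stand-in demonstration with fake bytes instead of a real SAS file:
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "airline.sas7bdat")
    with open(src, "wb") as f:
        f.write(b"fake SAS bytes")
    gzip_on_the_fly(src, src + ".gz")
    with gzip.open(src + ".gz", "rb") as f:
        roundtrip = f.read()
print(roundtrip)  # b'fake SAS bytes'
```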

@@ -107,6 +121,7 @@ def read_sas(
.. versionchanged:: 1.2

``TextFileReader`` is a context manager.
%(decompression_options)s
Contributor

There's no indent here for the %(...)s placeholder.

Contributor Author

See the question in the other comment thread; I think your suggestion is wrong.

@mroeschke mroeschke merged commit 7397adc into pandas-dev:main Jun 15, 2022
@mroeschke (Member)

Thanks @jonashaag

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
* Allow reading SAS files from archives

* Add missing file

* Review feedback

* Fix

Co-authored-by: Jeff Reback <[email protected]>