
ENH: read_sas only selected columns (and rows) of a large data file #37088

Open
paris0120 opened this issue Oct 12, 2020 · 8 comments

@paris0120

paris0120 commented Oct 12, 2020

Is your feature request related to a problem?

Reading a large data file can exhaust system memory. However, most of the time we don't need all of the data in that file. It would be convenient to be able to read only the data needed.

Describe the solution you'd like

Reading only the data needed through iteration.

API breaking implications

One more function for each file type.

Describe alternatives you've considered

Row-filter conditions could be added as well, which would be sweet.

Additional context


import pandas as pd


def read_sas_by_columns(file_path, keys, chunksize=100, charset='utf-8'):
    data = pd.DataFrame()
    for df in pd.read_sas(file_path, iterator=True, chunksize=chunksize):
        # keep only the requested columns from each chunk
        data = data.append(df[keys])
    for c in data.select_dtypes(include=['object']).columns:
        data[c] = data[c].str.decode(charset)
    return data
paris0120 added the Enhancement and Needs Triage labels on Oct 12, 2020
paris0120 changed the title from "ENH: Read only columns of a large data file" to "ENH: Read only selected columns (and rows) of a large data file" on Oct 12, 2020
@jreback
Contributor

jreback commented Oct 12, 2020

cc @bashtage this is supported for many other readers, not sure about SAS

jreback added the IO SAS label and removed the Needs Triage label on Oct 12, 2020
jreback changed the title from "ENH: Read only selected columns (and rows) of a large data file" to "ENH: read_sas only selected columns (and rows) of a large data file" on Oct 12, 2020
@jreback
Contributor

jreback commented Oct 12, 2020

Row filtering is not supported, though it's very easy to do with chunking.
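A minimal sketch of that, using the existing chunked reader (the column name and condition below are just placeholders):

import pandas as pd


def read_sas_rows(file_path, predicate, chunksize=100_000):
    # Iterate the file in chunks and keep only the rows matching the predicate,
    # so the whole file never has to fit in memory at once.
    kept = []
    for chunk in pd.read_sas(file_path, chunksize=chunksize, iterator=True):
        kept.append(chunk[predicate(chunk)])
    return pd.concat(kept, axis=0)


# e.g. keep only rows where a hypothetical YEAR column is 2019 or later:
# df = read_sas_rows("large_file.sas7bdat", lambda c: c["YEAR"] >= 2019)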

@bashtage
Contributor

I think an implementation that would do on-the-fly column filtering would probably add quite a bit of complexity, so the simplest implementation would read and then drop columns before returning. This won't save much memory, and I'm not sure this is worth the complexity.

SAS files may be compressed, so grabbing select columns is non-trivial. One would need to read each line, apply the filtering on a line-by-line basis, and then return the reduced line.

Doable, but given there is chunking support, it's not clear it is worth the complexity unless someone feels very motivated.

@bashtage
Contributor

FWIW, I think this would work better:

import pandas as pd
from pandas import DataFrame
from typing import List


def read_sas_by_columns(file_path, keys, chunksize=100, charset='utf-8'):
    # Collect only the requested columns from each chunk, then concatenate once.
    df_data: List[DataFrame] = []
    for df in pd.read_sas(file_path, iterator=True, chunksize=chunksize):
        df_data.append(df[keys])
    data: DataFrame = pd.concat(df_data, axis=0)
    # Decode byte-string columns produced by read_sas into text.
    for c in data.select_dtypes(include=['object']).columns:
        data[c] = data[c].str.decode(charset)
    return data

@paris0120
Author

paris0120 commented Oct 14, 2020

I think an implementation that would do on-the-fly column filtering would probably add quite a bit of complexity, so the simplest implementation would read and then drop columns before returning. This won't save much memory, and I'm not sure this is worth the complexity.

SAS files may be compressed, so grabbing select columns is non-trivial. One would need to read each line, apply the filtering on a line-by-line basis, and then return the reduced line.

Doable, but given there is chunking support, it's not clear it is worth the complexity unless someone feels very motivated.

It can be VERY useful for SAS data. Most of the datasets I use are more than 2 GB after compression. My computer has 32 GB of RAM but can't read most of them directly; I have to read them with the function above.

Besides, the read_sas function is weird: if iterator or chunksize is set, a reader is returned instead of a DataFrame. I would prefer the reader to have its own function, like:


def get_sas_reader(
    filepath_or_buffer: FilePathOrBuffer,
    format: Optional[str] = None,
    index: Optional[Label] = None,
    encoding: Optional[str] = None,
    chunksize: Optional[int] = None
) -> ReaderBase:
    """
    Read SAS files stored as either XPORT or SAS7BDAT format files.
    Parameters
    ----------
    filepath_or_buffer : str, path object or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be:
        ``file://localhost/path/to/table.sas``.
        If you want to pass in a path object, pandas accepts any
        ``os.PathLike``.
        By file-like object, we refer to objects with a ``read()`` method,
        such as a file handler (e.g. via builtin ``open`` function)
        or ``StringIO``.
    format : str {'xport', 'sas7bdat'} or None
        If None, file format is inferred from file extension. If 'xport' or
        'sas7bdat', uses the corresponding format.
    index : identifier of index column, defaults to None
        Identifier of column that should be used as index of the DataFrame.
    encoding : str, default is None
        Encoding for text data.  If None, text data are stored as raw bytes.
    chunksize : int
        Read file `chunksize` lines at a time, returns iterator.
    Returns
    -------
    SAS7BDATReader or XportReader
    """
    if format is None:
        buffer_error_msg = (
            "If this is a buffer object rather "
            "than a string name, you must specify a format string"
        )
        filepath_or_buffer = stringify_path(filepath_or_buffer)
        if not isinstance(filepath_or_buffer, str):
            raise ValueError(buffer_error_msg)
        fname = filepath_or_buffer.lower()
        if fname.endswith(".xpt"):
            format = "xport"
        elif fname.endswith(".sas7bdat"):
            format = "sas7bdat"
        else:
            raise ValueError("unable to infer format of SAS file")
    reader: ReaderBase
    if format.lower() == "xport":
        from pandas.io.sas.sas_xport import XportReader

        reader = XportReader(
            filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
        )
    elif format.lower() == "sas7bdat":
        from pandas.io.sas.sas7bdat import SAS7BDATReader

        reader = SAS7BDATReader(
            filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
        )
    else:
        raise ValueError("unknown SAS format")
    return reader

and then


def read_sas(file_path, keys=None, chunksize=None, charset='utf-8'):
    dfs: List[DataFrame] = []
    for df in pd.get_sas_reader(file_path, chunksize=chunksize):
        # keep only the requested columns from each chunk, if any were given
        if keys is None:
            dfs.append(df)
        else:
            dfs.append(df[keys])
    data: DataFrame = pd.concat(dfs, axis=0)
    for c in data.select_dtypes(include=['object']).columns:
        data[c] = data[c].str.decode(charset)
    return data

@bashtage
Contributor

bashtage commented Oct 14, 2020

Besides, the read_sas function is weird: if iterator or chunksize is set, a reader is returned instead of a DataFrame.

This isn't weird at all. This is how all chunking iterators work.
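read_csv, for example, works the same way (the file name here is just a placeholder):

import pandas as pd

# Without chunksize, read_csv returns a DataFrame; with chunksize it returns
# a TextFileReader that yields DataFrames chunk by chunk.
for chunk in pd.read_csv("data.csv", chunksize=10_000):
    pass  # process each chunk here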

Your custom read_sas seems to do the trick with a small mod:

import pandas as pd
from typing import List


def read_sas(file_path, keys=None, chunksize=None, charset="utf-8") -> pd.DataFrame:
    def drop(df: pd.DataFrame) -> pd.DataFrame:
        if keys is None:
            return df
        # Not sure if copy is needed, but ensures original df can be GC
        return df[keys].copy()

    if chunksize is None:
        data = drop(pd.read_sas(file_path))
    else:
        dfs: List[pd.DataFrame] = []
        itr = pd.read_sas(file_path, chunksize=chunksize, iterator=True)
        for df in itr:
            dfs.append(drop(df))
        data: pd.DataFrame = pd.concat(dfs, axis=0)
    for c in data.select_dtypes(include=["object"]).columns:
        data[c] = data[c].str.decode(charset)
    return data
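
Usage would then be something like this (the file and column names are placeholders):

selected = read_sas("large_file.sas7bdat", keys=["ID", "YEAR", "VALUE"], chunksize=100_000)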

@bashtage
Contributor

I'm not opposed to column filtering, but if it is done it should be properly implemented at the lowest level, given there is this pretty simple workaround that allows very large files to be read and filtered.

@ofajardo

ofajardo commented Dec 9, 2020

Just in case: pyreadstat supports reading both selected columns and selected rows ... you could consider adding it as a backend for read_sas (it is already used in read_spss).
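
For example, a rough sketch with pyreadstat (the file and column names are placeholders; see the pyreadstat docs for the exact signature):

import pyreadstat

# Read only two columns and the first 1,000 rows of a sas7bdat file.
df, meta = pyreadstat.read_sas7bdat(
    "large_file.sas7bdat",
    usecols=["ID", "VALUE"],
    row_limit=1000,
    row_offset=0,
)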
