
ENH: read_sas only selected columns (and rows) of a large data file #37088

Open
paris0120 opened this issue Oct 12, 2020 · 8 comments

@paris0120

paris0120 commented Oct 12, 2020

Is your feature request related to a problem?

Reading a large data file can exhaust system memory. However, most of the time we don't need all of the data in that file. It would be convenient to be able to read only the data needed.

Describe the solution you'd like

Reading only the data needed through iteration.

API breaking implications

One more function for each file type.

Describe alternatives you've considered

Row-filter conditions could be added as well, which would be sweet.

Additional context


import pandas as pd


def read_sas_by_columns(file_path, keys, chunksize=100, charset='utf-8'):
    data = pd.DataFrame()
    for df in pd.read_sas(file_path, iterator=True, chunksize=chunksize):
        # keep only the requested columns from each chunk
        data = data.append(df[keys])
    for c in data.select_dtypes(include=['object']).columns:
        data[c] = data[c].str.decode(charset)
    return data
paris0120 added the Enhancement and Needs Triage labels on Oct 12, 2020
paris0120 changed the title from "ENH: Read only columns of a large data file" to "ENH: Read only selected columns (and rows) of a large data file" on Oct 12, 2020
@jreback
Contributor

jreback commented Oct 12, 2020

cc @bashtage this is supported for many other readers, not sure about SAS

jreback added the IO SAS label and removed the Needs Triage label on Oct 12, 2020
jreback changed the title from "ENH: Read only selected columns (and rows) of a large data file" to "ENH: read_sas only selected columns (and rows) of a large data file" on Oct 12, 2020
@jreback
Contributor

jreback commented Oct 12, 2020

Row filtering is not supported, though it's very easy to do with chunking.
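A minimal sketch of that, using the existing chunked reader (the column name and condition below are just placeholders):

import pandas as pd


def read_sas_rows(file_path, predicate, chunksize=100_000):
    # Iterate the file in chunks and keep only the rows matching the predicate,
    # so the whole file never has to fit in memory at once.
    kept = []
    for chunk in pd.read_sas(file_path, chunksize=chunksize, iterator=True):
        kept.append(chunk[predicate(chunk)])
    return pd.concat(kept, axis=0)


# e.g. keep only rows where a hypothetical YEAR column is 2019 or later:
# df = read_sas_rows("large_file.sas7bdat", lambda c: c["YEAR"] >= 2019)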

@bashtage
Contributor

I think an implementation that would do on-the-fly column filtering would probably add quite a bit of complexity, so the simplest implementation would read and then drop columns before returning. This won't save much memory, and I'm not sure this is worth the complexity.

SAS files may be compressed, so grabbing select columns is non-trivial. One would need to read each line, apply the filtering on a line-by-line basis, and then return the reduced line.

Doable, but given there is chunking support, it's not clear it is worth the complexity unless someone feels very motivated.

@bashtage
Contributor

FWIW, I think this would work better:

import pandas as pd
from pandas import DataFrame
from typing import List


def read_sas_by_columns(file_path, keys, chunksize=100, charset='utf-8'):
    # Collect only the requested columns from each chunk, then concatenate once.
    df_data: List[DataFrame] = []
    for df in pd.read_sas(file_path, iterator=True, chunksize=chunksize):
        df_data.append(df[keys])
    data: DataFrame = pd.concat(df_data, axis=0)
    # Decode byte-string columns produced by read_sas into text.
    for c in data.select_dtypes(include=['object']).columns:
        data[c] = data[c].str.decode(charset)
    return data

@paris0120
Author

paris0120 commented Oct 14, 2020

I think an implementation that would do on-the-fly column filtering would probably add quite a bit of complexity, so the simplest implementation would read and then drop columns before returning. This won't save much memory, and I'm not sure this is worth the complexity.

SAS files may be compressed, so grabbing select columns is non-trivial. One would need to read each line, apply the filtering on a line-by-line basis, and then return the reduced line.

Doable, but given there is chunking support, it's not clear it is worth the complexity unless someone feels very motivated.

It can be VERY useful for SAS data. Most of the datasets I use are more than 2 GB after compression. My computer has 32 GB of RAM but can't read most of them directly; I have to read them with the function above.

Besides, the read_sas function is weird: if iterator or chunksize is set, a reader is returned instead of a DataFrame. I would prefer the reader to have its own function, like:


def get_sas_reader(
    filepath_or_buffer: FilePathOrBuffer,
    format: Optional[str] = None,
    index: Optional[Label] = None,
    encoding: Optional[str] = None,
    chunksize: Optional[int] = None
) -> ReaderBase:
    """
    Read SAS files stored as either XPORT or SAS7BDAT format files.
    Parameters
    ----------
    filepath_or_buffer : str, path object or file-like object
        Any valid string path is acceptable. The string could be a URL. Valid
        URL schemes include http, ftp, s3, and file. For file URLs, a host is
        expected. A local file could be:
        ``file://localhost/path/to/table.sas``.
        If you want to pass in a path object, pandas accepts any
        ``os.PathLike``.
        By file-like object, we refer to objects with a ``read()`` method,
        such as a file handler (e.g. via builtin ``open`` function)
        or ``StringIO``.
    format : str {'xport', 'sas7bdat'} or None
        If None, file format is inferred from file extension. If 'xport' or
        'sas7bdat', uses the corresponding format.
    index : identifier of index column, defaults to None
        Identifier of column that should be used as index of the DataFrame.
    encoding : str, default is None
        Encoding for text data.  If None, text data are stored as raw bytes.
    chunksize : int
        Read file `chunksize` lines at a time, returns iterator.
    Returns
    -------
    SAS7BDATReader or XportReader
    """
    if format is None:
        buffer_error_msg = (
            "If this is a buffer object rather "
            "than a string name, you must specify a format string"
        )
        filepath_or_buffer = stringify_path(filepath_or_buffer)
        if not isinstance(filepath_or_buffer, str):
            raise ValueError(buffer_error_msg)
        fname = filepath_or_buffer.lower()
        if fname.endswith(".xpt"):
            format = "xport"
        elif fname.endswith(".sas7bdat"):
            format = "sas7bdat"
        else:
            raise ValueError("unable to infer format of SAS file")
    reader: ReaderBase
    if format.lower() == "xport":
        from pandas.io.sas.sas_xport import XportReader

        reader = XportReader(
            filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
        )
    elif format.lower() == "sas7bdat":
        from pandas.io.sas.sas7bdat import SAS7BDATReader

        reader = SAS7BDATReader(
            filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
        )
    else:
        raise ValueError("unknown SAS format")
    return reader

and then


def read_sas(file_path, keys=None, chunksize=None, charset='utf-8'):
    dfs: List[DataFrame] = []
    for df in pd.get_sas_reader(file_path, chunksize=chunksize):
        # keep only the requested columns from each chunk, if any were given
        if keys is None:
            dfs.append(df)
        else:
            dfs.append(df[keys])
    data: DataFrame = pd.concat(dfs, axis=0)
    for c in data.select_dtypes(include=['object']).columns:
        data[c] = data[c].str.decode(charset)
    return data

@bashtage
Contributor

bashtage commented Oct 14, 2020

Besides, the read_sas function is weird: if iterator or chunksize is set, a reader is returned instead of a DataFrame.

This isn't weird at all. This is how all chunking iterators work.
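read_csv, for example, works the same way (the file name here is just a placeholder):

import pandas as pd

# Without chunksize, read_csv returns a DataFrame; with chunksize it returns
# a TextFileReader that yields DataFrames chunk by chunk.
for chunk in pd.read_csv("data.csv", chunksize=10_000):
    pass  # process each chunk here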

Your custom read_sas seems to do the trick with a small mod:

import pandas as pd
from typing import List


def read_sas(file_path, keys=None, chunksize=None, charset="utf-8") -> pd.DataFrame:
    def drop(df: pd.DataFrame) -> pd.DataFrame:
        if keys is None:
            return df
        # Not sure if copy is needed, but ensures original df can be GC
        return df[keys].copy()

    if chunksize is None:
        data = drop(pd.read_sas(file_path))
    else:
        dfs: List[pd.DataFrame] = []
        itr = pd.read_sas(file_path, chunksize=chunksize, iterator=True)
        for df in itr:
            dfs.append(drop(df))
        data: pd.DataFrame = pd.concat(dfs, axis=0)
    for c in data.select_dtypes(include=["object"]).columns:
        data[c] = data[c].str.decode(charset)
    return data
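
Usage would then be something like this (the file and column names are placeholders):

selected = read_sas("large_file.sas7bdat", keys=["ID", "YEAR", "VALUE"], chunksize=100_000)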

@bashtage
Contributor

I'm not opposed to column filtering, but if it is done it should be properly implemented at the lowest level, given there is this pretty simple workaround that allows very large files to be read and filtered.

@ofajardo

ofajardo commented Dec 9, 2020

Just in case: pyreadstat supports reading both selected columns and selected rows ... you could consider adding it as a backend for read_sas (it is already used in read_spss).
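
For example, a rough sketch with pyreadstat (the file and column names are placeholders; see the pyreadstat docs for the exact signature):

import pyreadstat

# Read only two columns and the first 1,000 rows of a sas7bdat file.
df, meta = pyreadstat.read_sas7bdat(
    "large_file.sas7bdat",
    usecols=["ID", "VALUE"],
    row_limit=1000,
    row_offset=0,
)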
