ENH: read_sas only selected columns (and rows) of a large data file #37088
Comments
cc @bashtage this is supported for many other readers, not sure for SAS.
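For comparison, a couple of other pandas readers already accept column selection at read time; a small illustration (file names are placeholders):

```python
import pandas as pd

# read_csv can skip unwanted columns while parsing
df_csv = pd.read_csv("data.csv", usecols=["id", "value"])

# read_parquet pushes column selection down to the file format
df_parquet = pd.read_parquet("data.parquet", columns=["id", "value"])
```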
Row filtering is not supported, though it's very easy to do with chunking.
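A minimal sketch of row filtering via chunking; the file path, column name, and predicate are made up for illustration:

```python
import pandas as pd

chunks = []
# Read the file in manageable pieces and keep only the rows of interest
for chunk in pd.read_sas("large_file.sas7bdat", chunksize=10_000):
    chunks.append(chunk[chunk["year"] == 2020])  # hypothetical filter column
filtered = pd.concat(chunks, ignore_index=True)
```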
I think an implementation that would do on-the-fly column filtering would probably add quite a bit of complexity, so the simplest implementation would read everything and then drop columns before returning. This won't save much memory, and I'm not sure it is worth the complexity. SAS files may be compressed, so grabbing select columns is non-trivial: one would need to read each line, apply the filtering line by line, and then return the reduced line. Doable, but given that there is chunking support, it is not clear it is worth the complexity unless someone feels very motivated.
FWIW I think this would work better:

```python
import pandas as pd
from pandas import DataFrame
from typing import List


def read_sas_by_columns(file_path, keys, chunksize=100, charset='utf-8'):
    # Read the file in chunks, keeping only the requested columns from each chunk
    df_data: List[DataFrame] = []
    for df in pd.read_sas(file_path, iterator=True, chunksize=chunksize):
        df_data.append(df[keys])
    data: DataFrame = pd.concat(df_data, axis=0)
    # SAS string columns come back as bytes; decode them to str
    for c in data.select_dtypes(include=['object']).columns:
        data[c] = data[c].str.decode(charset)
    return data
```
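For reference, a call to the helper above might look like this (file path, column names, and chunk size are placeholders):

```python
df = read_sas_by_columns("large_file.sas7bdat", keys=["ID", "AGE"], chunksize=50_000)
```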
It can be VERY useful for SAS data. Most of the datasets I use are larger than 2 GB after compression; my computer has 32 GB of RAM but still can't read most of them directly, so I have to read them with the function above. Besides, the read_sas function is a bit odd: if iterator or chunksize is set, a reader is returned instead of a DataFrame. I would prefer the reader to have its own function, and then to iterate over that reader separately.
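To illustrate the behavior being discussed (the path is a placeholder): with chunksize or iterator=True, pd.read_sas returns a reader object rather than a DataFrame, and each iteration over it yields a chunk.

```python
import pandas as pd

# Without chunksize/iterator, read_sas returns a DataFrame directly
df = pd.read_sas("example.sas7bdat")

# With chunksize (or iterator=True), read_sas returns a reader object instead;
# iterating over it yields DataFrame chunks
reader = pd.read_sas("example.sas7bdat", chunksize=10_000)
for chunk in reader:
    print(chunk.shape)
reader.close()
```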
This isn't weird at all. This is how all chunking iterators work. Your custom function could be written as:

```python
import pandas as pd
from typing import List


def read_sas(file_path, keys=None, chunksize=None, charset="utf-8") -> pd.DataFrame:
    def drop(df: pd.DataFrame) -> pd.DataFrame:
        if keys is None:
            return df
        # Not sure if copy is needed, but ensures the original df can be GC'd
        return df[keys].copy()

    if chunksize is None:
        data = drop(pd.read_sas(file_path))
    else:
        dfs: List[pd.DataFrame] = []
        itr = pd.read_sas(file_path, chunksize=chunksize, iterator=True)
        for df in itr:
            dfs.append(drop(df))
        data = pd.concat(dfs, axis=0)
    # SAS string columns come back as bytes; decode them to str
    for c in data.select_dtypes(include=["object"]).columns:
        data[c] = data[c].str.decode(charset)
    return data
```
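For example (path and column names again placeholders), this wrapper covers both the full read and the chunked read:

```python
# Full read with column selection
small = read_sas("example.sas7bdat", keys=["ID", "AGE"])

# Chunked read for files that are too large to hold unfiltered in memory
big = read_sas("example.sas7bdat", keys=["ID", "AGE"], chunksize=100_000)
```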
I'm not opposed to column filtering, but if it is done it should be properly implemented at the lowest level, given that there is this pretty simple workaround that allows very large files to be read and filtered.
Just in case: pyreadstat supports reading both selected columns and selected rows. You could consider adding it as a backend for read_sas (it is already used in read_spss).
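A sketch of the pyreadstat route, assuming its documented usecols/row_limit/row_offset options (the path and column names are placeholders):

```python
import pyreadstat

# Read only two columns and the first 100,000 rows, starting at row 0
df, meta = pyreadstat.read_sas7bdat(
    "example.sas7bdat",
    usecols=["ID", "AGE"],
    row_limit=100_000,
    row_offset=0,
)
```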
Is your feature request related to a problem?
Reading a large data file can exhaust system memory. However, most of the time we don't need all of the data in that file. It would be convenient to be able to read only the data needed.
Describe the solution you'd like
Reading only the data needed through iteration.
API breaking implications
One more function for each type of file.
Describe alternatives you've considered
Row-filtering conditions could be added as well, which would be nice; see the sketch below.
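A purely hypothetical sketch of what the requested API could look like; the usecols and where parameters do not exist in pandas' read_sas today and are shown only to illustrate the request:

```python
import pandas as pd

# Hypothetical: select columns (and optionally rows) while reading
df = pd.read_sas(
    "example.sas7bdat",
    usecols=["ID", "AGE", "YEAR"],   # hypothetical column-selection parameter
    # where="YEAR >= 2015",          # hypothetical row-filtering condition
)
```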