BUG: read_stata ignores columns parameter and dtypes of empty dta files #46240
Comments
I think this occurs for other file types as well. I tried it for .csv and .xlsx file types and the same thing occurred. The second part of this bug did not occur in these file types. So I am guessing it's an issue with the method |
take |
Could you check your expected? I think you meant that the dtype should be something else than object?
|
I'm not sure if Excel files have any implicit type but I don't think so. I could check and see what happens with parquet and SAS files. I think both of those still have type information even if there are no observations. |
@sterlinm you wrote in the OP
you meant
|
@simonjayhawkins You're right, I mixed it up because I was highlighting two separate issues with reading empty files:
Here's an updated example:
Reproducible Example
import numpy as np
import pandas as pd
from pandas.io.stata import StataReader
# create a DataFrame with int32 and float64 dtypes
df = pd.DataFrame(data={"a": range(3), "b": [1.0, 2.0, 3.0]})
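# assign the column directly: setting through .loc[:, 'a'] can keep the original int64 dtype in newer pandas
df['a'] = df['a'].astype('int32')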
df_empty = df.head(0)
# write the empty and non-empty DataFrames to .dta files
df.to_stata('nonempty.dta', write_index=False, version=117)
df_empty.to_stata('empty.dta', write_index=False, version=117)
# column variables
expected_cols = pd.Index(['a'])
all_cols = df.columns
# reading one column of non-empty .dta file works
assert pd.read_stata('nonempty.dta', columns=["a"]).columns.equals(expected_cols)
# reading one column of empty .dta file does not work
assert pd.read_stata('empty.dta', columns=["a"]).columns.equals(all_cols)
assert pd.read_stata('empty.dta', columns=["xyz"]).columns.equals(all_cols) # should raise error
# reading non-empty .dta file retains correct dtypes
assert pd.read_stata('nonempty.dta').dtypes.equals(df.dtypes)
# reading empty .dta file makes all the columns object columns
assert (pd.read_stata('empty.dta').dtypes == 'object').all()
# we can confirm that the empty .dta file does retain the type information
expected_dtyplist = [np.dtype('int32'), np.dtype('float64')]
assert StataReader('nonempty.dta').dtyplist == expected_dtyplist
assert StataReader('empty.dta').dtyplist == expected_dtyplist
Expected Behavior
In the above example
In [2]: df2.dtypes
Out[2]:
a int32
b float64
dtype: object |
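Until this is fixed, one possible workaround is to post-process the frame that read_stata returns. The following is only a minimal sketch: select_columns_strict and restore_dtypes are hypothetical helper names (not pandas API), the dtype mapping is assumed to be supplied by the caller, and the empty object-dtype frame below is a stand-in for what read_stata returns for an empty file in 1.4.1.

```python
import pandas as pd


def select_columns_strict(df, columns):
    """Return only `columns` from `df`, raising if any name is missing."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    return df.loc[:, list(columns)]


def restore_dtypes(df, dtypes):
    """Cast the columns named in `dtypes` (a name -> dtype mapping)."""
    # Only cast columns that survived the column selection.
    present = {name: dt for name, dt in dtypes.items() if name in df.columns}
    return df.astype(present)


# Stand-in for the result of reading an empty .dta file: every column is object.
empty = pd.DataFrame(
    {"a": pd.Series(dtype="object"), "b": pd.Series(dtype="object")}
)

# Keep only column "a" and restore the dtypes the caller expects.
fixed = restore_dtypes(
    select_columns_strict(empty, ["a"]),
    {"a": "int32", "b": "float64"},
)
```

Asking for a column that does not exist, such as select_columns_strict(empty, ["xyz"]), raises ValueError instead of silently returning every column.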
Correct column selection when reading empty dta files Correct omitted dtype information in empty dta files closes pandas-dev#46240
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
A Stata .dta file with zero rows still has type information, but when you try to read an empty .dta file using pd.read_stata, all of the columns have object dtype. It will also ignore the columns parameter and read all of the columns.
Expected Behavior
In the above example df2.dtypes should return:
Installed Versions
Apologies, pd.show_versions() fails for some reason. I've included it, but the pandas version is 1.4.1.
In [5]: pd.__version__
Out[5]: '1.4.1'