Inconsistent dtype of category in empty Series between dict and list input #18515

Closed
toobaz opened this issue Nov 27, 2017 · 7 comments · Fixed by #18496
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@toobaz
Member

toobaz commented Nov 27, 2017

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: pd.Series([], dtype='category')
Out[2]: 
Series([], dtype: category
Categories (0, object): [])

In [3]: pd.Series({}, dtype='category')
Out[3]: 
Series([], dtype: category
Categories (0, float64): [])

In [4]: pd.Series(dtype='category')
Out[4]: 
Series([], dtype: category
Categories (0, float64): [])

Problem description

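All three calls build an empty categorical Series, but the categories dtype is object for list input and float64 for dict input or no data. The difference is unjustified.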

Expected Output

The same categories dtype in all three cases. Probably that of Out[4] (float64), which is also (implicitly) tested.
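
The comparison from the code sample can also be scripted, which makes it easier to re-check on other pandas versions. This is only a minimal sketch: the helper name `categories_dtypes` is made up for illustration and is not part of pandas, and the printed values depend on the installed version.

```python
import pandas as pd


def categories_dtypes():
    """Return the categories dtype for three ways of building an
    empty categorical Series (hypothetical helper, not a pandas API)."""
    candidates = {
        "list": pd.Series([], dtype="category"),
        "dict": pd.Series({}, dtype="category"),
        "no data": pd.Series(dtype="category"),
    }
    # .cat.categories is the Index holding the categories; its dtype is
    # the value shown after "Categories (0, ...)" in the repr.
    return {label: str(s.cat.categories.dtype) for label, s in candidates.items()}


if __name__ == "__main__":
    for label, dtype in categories_dtypes().items():
        print(f"{label:>8}: {dtype}")
```

If the three printed dtypes differ, the inconsistency reported here is present in that version.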

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0.dev0+241.gf745e52e1
pytest: 3.0.6
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.2.2
sphinx: None
patsy: 0.4.1+dev
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Nov 27, 2017

dupe of #17261

@jreback jreback closed this as completed Nov 27, 2017
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 27, 2017
@jreback jreback added this to the No action milestone Nov 27, 2017
@jorisvandenbossche jorisvandenbossche removed the Duplicate Report Duplicate issue or pull request label Nov 30, 2017
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Next Major Release Nov 30, 2017
@jorisvandenbossche
Member

This takes a separate fix, so let's keep this open?
I suppose this is due to this inconsistency in the Categorical constructor:

In [13]: pd.Categorical(np.array([]))
Out[13]: [], Categories (0, float64): []

In [14]: pd.Categorical([])
Out[14]: [], Categories (0, object): []
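
For anyone reproducing the two Categorical calls outside IPython, a small sketch follows. It assumes NumPy and pandas are installed. The variable names are illustrative, and no expected output is given for the pandas lines because it may vary by version.

```python
import numpy as np
import pandas as pd

# An empty NumPy array created without an explicit dtype defaults to float64,
# so Categorical receives typed (float) data in this case.
empty_array = np.array([])
print(empty_array.dtype)

# An empty Python list carries no dtype information at all.
from_array = pd.Categorical(empty_array)
from_list = pd.Categorical([])

print(from_array.categories.dtype)
print(from_list.categories.dtype)
```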

@jorisvandenbossche
Member

Actually, the above is maybe not that wrong? In the first case you actually pass it a float array, so it's OK that the categories are float.
So then it does need to be fixed in the Series init, which I think @toobaz is doing in #18496

@toobaz
Member Author

toobaz commented Nov 30, 2017

I suppose this is due to this inconsistency in the Categorical constructor:

The inconsistency is precisely the one described in my opening example. Which is what I'm fixing in #18496 (and has nothing to do with passing an array - which would rightly keep its dtype).

But you are right that this can be considered separate from #17261, which doesn't involve categories. @jreback probably just viewed this as included in that (which was fine to me - but the fix is distinct).

So OK with reopening, I will just mention in #18496 that it closes this.

@jorisvandenbossche
Member

The inconsistency is precisely the one described in my opening example.

No, you are describing it with Series, which is something different from Categorical :-)

Anyhow, yes consider this as a separate issue, and mention it in #18496 (+ adding new tests, whatsnew note)

@toobaz
Member Author

toobaz commented Nov 30, 2017

No, you are describing it with Series, which is something different from Categorical :-)

Indeed. This issue has nothing to do with (non-Series) Categorical (the inconsistency you are describing is unrelated to mine, and to this bug).

Anyhow, yes consider this as a separate issue, and mention it in #18496 (+ adding new tests, whatsnew note)

OK

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 30, 2017

the inconsistency you are describing is unrelated to mine

No, the inconsistency I describe is the underlying reason for the bug you reported in this issue. Of course, the actual cause is an implementation detail of how Series handles no data or an empty dict: previously it was passed as np.array([]), and now you changed that to be passed as []. Which you are fixing, so perfect!
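
Since the categories dtype of an empty Series depends on which empty container Series passes along internally, one way to avoid relying on that detail is to spell out the categories dtype. The sketch below is an assumption-laden illustration and not the fix made in #18496. `build_all` is a made-up helper, and float64 is chosen only to mirror Out[4] above.

```python
import pandas as pd

# Explicit categorical dtype with empty float64 categories, so nothing is
# left to inference from the (empty) input data.
float_categories = pd.CategoricalDtype(categories=pd.Index([], dtype="float64"))


def build_all(dtype):
    """Build an empty Series from a list, a dict, and no data
    (hypothetical helper for this illustration)."""
    return [
        pd.Series([], dtype=dtype),
        pd.Series({}, dtype=dtype),
        pd.Series(dtype=dtype),
    ]


# Collect the distinct categories dtypes; a single entry means all three
# constructions agree.
dtypes = {str(s.cat.categories.dtype) for s in build_all(float_categories)}
print(dtypes)
```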

toobaz added a commit to toobaz/pandas that referenced this issue Nov 30, 2017
toobaz added a commit to toobaz/pandas that referenced this issue Dec 1, 2017
@jreback jreback modified the milestones: Next Major Release, 0.22.0 Dec 1, 2017