take_1d yields surprising results when working with SparseArray #19506

Closed
hexgnu opened this issue Feb 2, 2018 · 7 comments · Fixed by #22325
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Sparse Sparse Data Type

Comments

@hexgnu
Contributor

hexgnu commented Feb 2, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
import pandas.core.algorithms as algos

algos.take_1d(pd.SparseArray([0,0,1], fill_value=0), [0,1,2]) #=> array([       1, 23618416,       32])

# VS

algos.take_1d(np.array([0,0,1]), [0,1,2]) #=> array([0,0,1])

Problem description

This smells to me like SparseArray passing its sparse representation to take_1d in the C code. I would expect both calls to return the same values.

Expected Output

Both calls should return array([0, 0, 1]).

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: be09289
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.16-202.fc26.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+205.gbe0928903
pytest: 3.3.1
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.27.3
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.1.0
sphinx: 1.6.5
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.0.2
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.2
fastparquet: 0.1.3
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche
Member

Those algos are not meant to be used by users, so please don't use them directly. They also assume NumPy arrays as input. Although a sparse array is an ndarray subclass, np.array(sparse_array) returns only the stored (non-fill) values, without any fills. That is why you see those strange values: they are leftovers from allocating an 'empty' result array to be filled by the take operation, of which only the first item got filled.
There is an issue for that np.array(..) behaviour: #14167
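
The effect described above can be illustrated with plain NumPy. This is a minimal sketch, not pandas' actual implementation: `sp_values`, `sp_index`, and `to_dense` are hypothetical stand-ins for how a sparse array of [0, 0, 1] with fill_value=0 might store its data. It shows that the values-only array has length 1, so positions 0 through 2 cannot be taken from it, while densifying first gives the expected result.

```python
import numpy as np

# Hypothetical stand-in for the sparse storage of [0, 0, 1] with fill_value=0.
sp_values = np.array([1])   # only the non-fill values are stored
sp_index = np.array([2])    # positions of those values in the dense array
fill_value = 0
length = 3


def to_dense(sp_values, sp_index, fill_value, length):
    """Rebuild the dense array by writing stored values over the fill value."""
    dense = np.full(length, fill_value, dtype=sp_values.dtype)
    dense[sp_index] = sp_values
    return dense


indexer = np.array([0, 1, 2])

# Taking from the values-only array fails in NumPy because it has length 1.
try:
    sp_values.take(indexer)
    values_only_ok = True
except IndexError:
    values_only_ok = False

# Densifying first gives the result the issue expects.
dense_result = to_dense(sp_values, sp_index, fill_value, length).take(indexer)
print(values_only_ok, dense_result)
```

NumPy's bounds-checked take raises here. A routine that skips that check and writes into an uninitialized output buffer could instead leave leftover memory in the result, which would be consistent with the values in the report.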

@jorisvandenbossche
Member

Ah, I see now you were actually working on that issue I linked to :-)

@hexgnu
Contributor Author

hexgnu commented Feb 2, 2018

Yeah, I had a feeling it was related. That problem is a lot harder than I expected at first... I have a PR that's somewhat done, but there are a lot of weird edge cases.

@hexgnu
Contributor Author

hexgnu commented Feb 2, 2018

Also I would never use algos as a user. I was just looking for advice on fixing something else related to merging sparse frames.

@jorisvandenbossche
Member

> Also I would never use algos as a user.

Yes, sure! That's just not always clear from the issue post :)

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Sparse Sparse Data Type labels Feb 6, 2018
@TomAugspurger
Contributor

Ideally algos.take_1d would dispatch to arr.take for SparseArray input. But we have some places in the library that seem to rely on take_1d(sparse_array) returning an ndarray.
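
The dispatch idea could look roughly like the sketch below. Everything here is hypothetical and only for illustration: `TinySparse` is a toy container (deliberately not an ndarray subclass, and not pandas' SparseArray), and `take_1d_dispatch` is an invented wrapper name, not a pandas function. It assumes the container exposes its own `take` that understands the fill value.

```python
import numpy as np


class TinySparse:
    """Hypothetical minimal sparse container (not pandas' SparseArray)."""

    def __init__(self, sp_values, sp_index, fill_value, length):
        self.sp_values = np.asarray(sp_values)
        self.sp_index = np.asarray(sp_index)
        self.fill_value = fill_value
        self.length = length

    def to_dense(self):
        # Start from the fill value, then write the stored values in place.
        dense = np.full(self.length, self.fill_value, dtype=self.sp_values.dtype)
        dense[self.sp_index] = self.sp_values
        return dense

    def take(self, indexer):
        # Simplest correct behaviour for this sketch: densify, then take.
        return self.to_dense().take(indexer)


def take_1d_dispatch(arr, indexer):
    """Hypothetical wrapper: use the array's own take unless it is a plain ndarray."""
    indexer = np.asarray(indexer)
    if not isinstance(arr, np.ndarray) and hasattr(arr, "take"):
        return arr.take(indexer)
    return np.asarray(arr).take(indexer)


sparse = TinySparse([1], [2], fill_value=0, length=3)
from_sparse = take_1d_dispatch(sparse, [0, 1, 2])
from_dense = take_1d_dispatch(np.array([0, 0, 1]), [0, 1, 2])
print(from_sparse, from_dense)
```

In this sketch `take` returns an ndarray, which sidesteps the concern about callers relying on an ndarray return. Whether a real sparse `take` should return a sparse or dense result is a separate design question.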

@TomAugspurger
Contributor

This is fixed by #22325.

TomAugspurger added a commit that referenced this issue Oct 13, 2018
Makes SparseArray an ExtensionArray.

* Fixed DataFrame.__setitem__ for updating to sparse.

Closes #22367

* Fixed Series[sparse].to_sparse

Closes #22389

Closes #21978
Closes #19506
Closes #22835
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018