Skip to content

BUG: ensuring that np.asarray() simple handles data as objects and doesn't… #22161

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Aug 10, 2018

Conversation

realead
Copy link
Contributor

@realead realead commented Aug 1, 2018

… try to do smart things (GH22160)

@codecov
Copy link

codecov bot commented Aug 1, 2018

Codecov Report

Merging #22161 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22161   +/-   ##
=======================================
  Coverage   92.08%   92.08%           
=======================================
  Files         169      169           
  Lines       50704    50704           
=======================================
  Hits        46691    46691           
  Misses       4013     4013
Flag Coverage Δ
#multiple 90.49% <100%> (ø) ⬆️
#single 42.33% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.69% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 475e391...ab06c38. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this change anything from a user perspective?

@@ -134,7 +134,7 @@ def _ensure_data(values, dtype=None):
return values, dtype, 'int64'

# we have failed, return object
values = np.asarray(values)
values = np.asarray(values, dtype=np.object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we actually should prob use pandas.core.dtypes.cast.construct_1d_array_preserving_na which is even better here. further pls run the performance suite for things like factorize, value_counts, isin, this a very performance sensitive section.

Copy link
Contributor Author

@realead realead Aug 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Actually, pandas.core.dtypes.cast.construct_1d_ndarray_preserving_na would not work for two reasons:

  1. For [42, 's'] it returns array(['42', 's'], dtype='<U11') and not the wanted array([42, 's'], dtype=object)), not sure this is the intended behavior of the function though
  2. For [np.nan] it returns array([nan], dtype=float64) which leads to result[0] is np.nan being False, but we would like to keep the id of the object.

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Aug 1, 2018
@pep8speaks
Copy link

pep8speaks commented Aug 3, 2018

Hello @realead! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 10, 2018 at 09:50 Hours UTC

@realead realead changed the title ensuring that np.asarray() simple handles data as objects and doesn't… BUG: ensuring that np.asarray() simple handles data as objects and doesn't… Aug 3, 2018
@realead
Copy link
Contributor Author

realead commented Aug 3, 2018

For asv continuous -f 1.1 upstream/master HEAD -b ^series_methods -b ^algorithms the result was:

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@jreback
Copy link
Contributor

jreback commented Aug 9, 2018

can you rebase and run perf tests on algos. I find ever small changes sometimes can really impact perf here.

@@ -573,6 +573,5 @@ Other
- :meth: `~pandas.io.formats.style.Styler.background_gradient` now takes a ``text_color_threshold`` parameter to automatically lighten the text color based on the luminance of the background color. This improves readability with dark background colors without the need to limit the background colormap range. (:issue:`21258`)
- Require at least 0.28.2 version of ``cython`` to support read-only memoryviews (:issue:`21688`)
- :meth: `~pandas.io.formats.style.Styler.background_gradient` now also supports tablewise application (in addition to rowwise and columnwise) with ``axis=None`` (:issue:`15204`)
-
-
- :meth:`pandas.core.algorithms.isin` avoids spurious casting for lists (:issue:`22160`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this user visible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Only when the user uses pandas.core.algorithms.isin directly, then the wrong behavior from #22160 is fixed. There is however no difference if isin is used via Series or Index - the values are already in a np.array and thus the bug ins't triggered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok , this is an internal, routine, ok removing this whatsnew note.

@realead
Copy link
Contributor Author

realead commented Aug 9, 2018

There were no performance changes:

asv continuous -f 1.01 upstream/master HEAD -b ^series_methods -b ^algorithms -b ^categorial

· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 60 total benchmarks (2 commits * 1 environments * 30 benchmarks)
[  0.00%] · For pandas commit hash f11b14a6:
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ··· Running benchmarks.........................
[  0.00%] ··· Setting up algorithms.py:95
[  0.00%] ··· Running benchmarks.........
[  1.67%] ··· algorithms.Duplicated.time_duplicated_float           16.1±1ms;...
[  3.33%] ··· algorithms.Duplicated.time_duplicated_int           9.85±0.3ms;...
[  5.00%] ··· algorithms.Duplicated.time_duplicated_string        10.6±0.2μs;...
[  6.67%] ··· ...icatedUniqueIndex.time_duplicated_unique_int         31.1±0.1μs
[  8.33%] ··· algorithms.Factorize.time_factorize_float             60.3±3ms;...
[ 10.00%] ··· algorithms.Factorize.time_factorize_int               20.0±3ms;...
[ 11.67%] ··· algorithms.Factorize.time_factorize_string           143±0.5ms;...
[ 13.33%] ··· algorithms.Match.time_match_string                         368±4μs
[ 15.00%] ··· series_methods.Clip.time_clip                            121±0.5μs
[ 16.67%] ··· series_methods.Dir.time_dir_strings                    1.97±0.03ms
[ 18.33%] ··· series_methods.Dropna.time_dropna                  2.72±0.04ms;...
[ 20.00%] ··· series_methods.IsIn.time_isin                      1.56±0.01ms;...
[ 21.67%] ··· ...ForObjects.time_isin_long_series_long_values         3.30±0.1ms
[ 23.33%] ··· ...cts.time_isin_long_series_long_values_floats         5.94±0.1ms
[ 25.00%] ··· ...orObjects.time_isin_long_series_short_values        2.03±0.02ms
[ 26.67%] ··· series_methods.IsInForObjects.time_isin_nans               636±6μs
[ 28.33%] ··· ...orObjects.time_isin_short_series_long_values        1.10±0.03ms
[ 30.00%] ··· series_methods.Map.time_map                           979±20μs;...
[ 31.67%] ··· series_methods.NSort.time_nlargest                 2.78±0.05ms;...
[ 33.33%] ··· series_methods.NSort.time_nsmallest                2.31±0.05ms;...
[ 35.00%] ··· ...s_methods.SeriesConstructor.time_constructor        157±1μs;...
[ 36.67%] ··· ...SeriesGetattr.time_series_datetimeindex_repr        3.11±0.03μs
[ 38.33%] ··· series_methods.ValueCounts.time_value_counts        2.20±0.3ms;...
[ 40.00%] ··· algorithms.Hashing.time_frame                           21.2±0.3ms
[ 41.67%] ··· algorithms.Hashing.time_series_categorical              5.42±0.1ms
[ 43.33%] ··· algorithms.Hashing.time_series_dates                   3.23±0.06ms
[ 45.00%] ··· algorithms.Hashing.time_series_float                   3.29±0.03ms
[ 46.67%] ··· algorithms.Hashing.time_series_int                     3.34±0.04ms
[ 48.33%] ··· algorithms.Hashing.time_series_string                   13.8±0.5ms
[ 50.00%] ··· algorithms.Hashing.time_series_timedeltas              3.29±0.03ms
[ 50.00%] · For pandas commit hash 475e391e:
[ 50.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt...
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running benchmarks.........................
[ 50.00%] ··· Setting up algorithms.py:95
[ 50.00%] ··· Running benchmarks.........
[ 51.67%] ··· algorithms.Duplicated.time_duplicated_float         14.8±0.6ms;...
[ 53.33%] ··· algorithms.Duplicated.time_duplicated_int           10.7±0.8ms;...
[ 55.00%] ··· algorithms.Duplicated.time_duplicated_string        10.5±0.2μs;...
[ 56.67%] ··· ...icatedUniqueIndex.time_duplicated_unique_int         30.8±0.4μs
[ 58.33%] ··· algorithms.Factorize.time_factorize_float             58.8±3ms;...
[ 60.00%] ··· algorithms.Factorize.time_factorize_int               21.5±2ms;...
[ 61.67%] ··· algorithms.Factorize.time_factorize_string             148±5ms;...
[ 63.33%] ··· algorithms.Match.time_match_string                         363±2μs
[ 65.00%] ··· series_methods.Clip.time_clip                              118±1μs
[ 66.67%] ··· series_methods.Dir.time_dir_strings                    1.94±0.02ms
[ 68.33%] ··· series_methods.Dropna.time_dropna                  2.77±0.06ms;...
[ 70.00%] ··· series_methods.IsIn.time_isin                      1.56±0.02ms;...
[ 71.67%] ··· ...ForObjects.time_isin_long_series_long_values        3.29±0.08ms
[ 73.33%] ··· ...cts.time_isin_long_series_long_values_floats        5.89±0.09ms
[ 75.00%] ··· ...orObjects.time_isin_long_series_short_values        2.07±0.03ms
[ 76.67%] ··· series_methods.IsInForObjects.time_isin_nans             668±200μs
[ 78.33%] ··· ...orObjects.time_isin_short_series_long_values         1.36±0.8ms
[ 80.00%] ··· series_methods.Map.time_map                        1.02±0.02ms;...
[ 81.67%] ··· series_methods.NSort.time_nlargest                 2.84±0.06ms;...
[ 83.33%] ··· series_methods.NSort.time_nsmallest                2.40±0.09ms;...
[ 85.00%] ··· ...s_methods.SeriesConstructor.time_constructor        157±2μs;...
[ 86.67%] ··· ...SeriesGetattr.time_series_datetimeindex_repr         3.17±0.2μs
[ 88.33%] ··· series_methods.ValueCounts.time_value_counts       2.19±0.04ms;...
[ 90.00%] ··· algorithms.Hashing.time_frame                           21.1±0.6ms
[ 91.67%] ··· algorithms.Hashing.time_series_categorical             5.02±0.02ms
[ 93.33%] ··· algorithms.Hashing.time_series_dates                   3.18±0.01ms
[ 95.00%] ··· algorithms.Hashing.time_series_float                   3.23±0.06ms
[ 96.67%] ··· algorithms.Hashing.time_series_int                     3.38±0.03ms
[ 98.33%] ··· algorithms.Hashing.time_series_string                   13.9±0.4ms
[100.00%] ··· algorithms.Hashing.time_series_timedeltas              3.25±0.07ms
**BENCHMARKS NOT SIGNIFICANTLY CHANGED.**

@jreback jreback added this to the 0.24.0 milestone Aug 9, 2018
@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves labels Aug 9, 2018
@jreback
Copy link
Contributor

jreback commented Aug 9, 2018

@realead ok with this, can you remove the whatsnew note. ping on green.

@realead
Copy link
Contributor Author

realead commented Aug 10, 2018

Close/Open to trigger CI-run

@realead realead closed this Aug 10, 2018
@realead realead reopened this Aug 10, 2018
@jreback jreback merged commit cc3ab4a into pandas-dev:master Aug 10, 2018
@jreback
Copy link
Contributor

jreback commented Aug 10, 2018

thanks @realead

@realead realead deleted the fix_GH22160 branch August 11, 2018 19:45
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

unexpected behavior of pd.core.algorithms._ensure_data()
3 participants