Speed up checking for NaN for floats #25946

vnlitvinov · 2019-04-01T16:42:07Z

closes N/A
tests added / passed
passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
whatsnew entry

The speedup is small because the change itself is simple, but some inputs benefit, for example empty datetime fields in a CSV file, because empty fields are parsed as float NaNs.
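To illustrate the idea, here is a minimal Python sketch of a float-only NaN check. It is not the Cython change made in this PR; the helper name `is_nan_float` is hypothetical, and the sample list only assumes that an empty field has already been parsed into a float NaN. It relies on the fact that NaN is the only float value that compares unequal to itself.

```python
def is_nan_float(val):
    """Hypothetical helper: True only for a float NaN."""
    # NaN is the only float that is not equal to itself (IEEE 754),
    # so a self-comparison is enough once we know val is a float.
    return isinstance(val, float) and val != val


# Assumption: the first entry stands for an empty CSV field parsed as NaN.
samples = [float("nan"), 1.5, "nan", None, 0.0]
results = [is_nan_float(v) for v in samples]
print(results)  # [True, False, False, False, False]
```

Note that the literal string "nan" is not treated as NaN here, which is why it matters whether empty fields stay as floats or become strings.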

vnlitvinov · 2019-04-01T16:43:04Z

Here's what `asv continuous -f 1.01 upstream/master float-nat-speedup -e -b io.csv -a sample_time=1 -a warmup_time=1` shows:

before	after	ratio	test name
[`c7c4c94`]	[`54533f3`]
master	float-nat-speedup
1.91±0.03ms	1.86±0.01ms	0.97	io.csv.ReadCSVParseDates.time_multiple_date

I'm running a more thorough benchmark now.

codecov · 2019-04-01T17:18:37Z

Codecov Report

Merging #25946 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25946      +/-   ##
==========================================
- Coverage   91.82%   91.81%   -0.01%     
==========================================
  Files         175      175              
  Lines       52581    52581              
==========================================
- Hits        48280    48276       -4     
- Misses       4301     4305       +4

Flag	Coverage Δ
#multiple	`90.36% <ø> (ø)`	⬆️
#single	`41.89% <ø> (-0.08%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/gbq.py	`75% <0%> (-12.5%)`	⬇️
pandas/core/frame.py	`96.79% <0%> (-0.12%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c7c4c94...54533f3.

pandas/_libs/tslibs/nattype.pyx

jreback · 2019-04-02T12:35:39Z

can you run the benchmarks for missing value checking and see what the change is.

vnlitvinov · 2019-04-02T16:05:52Z

Sure, I am already running those, but it takes huuuuge time to run when you set warmup and sampling times high enough (and with default settings the results are too flaky to be believable).

jreback · 2019-04-02T16:17:49Z

> Sure, I am already running those, but it takes huuuuge time to run when you set warmup and sampling times high enough (and with default settings the results are too flaky to be believable).

right, though just the missing ones shouldn't be that huge

vnlitvinov · 2019-04-02T16:26:11Z

Can you point out these "missing" benchmark names?

jreback · 2019-04-02T16:29:44Z

you can pass a regex to select a subset
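A rough Python sketch of how a regex narrows a list of benchmark names, using names that appear in the results quoted in this thread. The `select_benchmarks` helper is hypothetical, and the use of search-style matching (a match anywhere in the name) is an assumption about how `asv -b` treats its patterns, not a statement of its implementation.

```python
import re

# Benchmark names taken from the asv output quoted in this thread.
benchmark_names = [
    "io.csv.ReadCSVParseDates.time_multiple_date",
    "io.csv.ReadCSVSkipRows.time_skipprows",
    "timedelta.TimedeltaConstructor.time_from_missing",
]


def select_benchmarks(names, patterns):
    """Hypothetical helper: keep names matched by at least one pattern."""
    compiled = [re.compile(p) for p in patterns]
    # Assumption: a pattern may match anywhere in the name (re.search).
    return [n for n in names if any(c.search(n) for c in compiled)]


only_missing = select_benchmarks(benchmark_names, [".*missing.*"])
missing_and_csv = select_benchmarks(benchmark_names, [".*missing.*", "io.csv"])
print(only_missing)
print(missing_and_csv)
```

Passing several patterns selects the union of their matches, which is how the second call picks up the CSV benchmarks as well.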

vnlitvinov · 2019-04-03T07:27:23Z

Could you please recommend what benchmark names might be relevant? I didn't study the whole list of them yet...

vnlitvinov · 2019-04-03T15:17:31Z

Running `asv continuous -f 1.01 upstream/master float-nat-speedup -b '.*missing.*' -b io.csv -e -a sample_time=2 -a warmup_time=2` produces:

before	after	ratio	test name
[`c7c4c94`]	[`6e77c81`]
master	float-nat-speedup
1.48±0.01μs	1.40±0.03μs	0.95	timedelta.TimedeltaConstructor.time_from_missing
8.41±0.09ms	7.79±0.06ms	0.93	io.csv.ReadCSVSkipRows.time_skipprows(10000)

I'm sure this doesn't benchmark the worst case: I expect the change to speed up parsing a date column where most fields are empty, but only once #25754 is merged, so that NaNs (which is what empty fields are translated to) stay as floats instead of becoming literal "nan" strings.
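For reading the numbers above, this small Python sketch recomputes the ratio column as after/before from the quoted timings. The threshold check assumes that `-f 1.01` reports a result as improved when the ratio falls below 1/1.01; that reading of the flag is an assumption, and the `ratio` helper is hypothetical.

```python
FACTOR = 1.01  # value passed to -f in the command quoted above

# (test name, before, after); units differ per row but cancel in the ratio.
rows = [
    ("timedelta.TimedeltaConstructor.time_from_missing", 1.48, 1.40),
    ("io.csv.ReadCSVSkipRows.time_skipprows(10000)", 8.41, 7.79),
]


def ratio(before, after):
    """Hypothetical helper: after/before, so values below 1 mean faster."""
    return after / before


for name, before, after in rows:
    r = ratio(before, after)
    # Assumption: an improvement is reported when r < 1 / FACTOR.
    improved = r < 1 / FACTOR
    print(f"{r:.2f}  improved={improved}  {name}")
```

Both rows come out at 0.95 and 0.93, matching the table, which corresponds to roughly a 5% and 7% reduction in time.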

jreback · 2019-04-04T14:01:54Z

thanks @vnlitvin

jreback reviewed Apr 1, 2019

pandas/_libs/tslibs/nattype.pyx (outdated)

vnlitvinov closed this Apr 1, 2019

vnlitvinov force-pushed the float-nat-speedup branch from 54533f3 to c7c4c94 on April 1, 2019 18:43

vnlitvinov reopened this Apr 1, 2019

Speed up util.is_nan for float values

6e77c81

gfyoung added the Missing-data and Performance labels Apr 2, 2019

jreback added this to the 0.25.0 milestone Apr 2, 2019

jreback merged commit 1679e54 into pandas-dev:master Apr 4, 2019

vnlitvinov deleted the float-nat-speedup branch April 4, 2019 14:40

vnlitvinov mentioned this pull request Apr 8, 2019

PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime #25754

Merged


anmyachev pushed a commit to anmyachev/pandas that referenced this pull request Apr 18, 2019

Speed up util.is_nan for float values (pandas-dev#25946)

f41dd55
