Skip to content

Speed up checking for NaN for floats #25946

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 4, 2019

Conversation

vnlitvinov
Copy link
Contributor

  • closes N/A
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This isn't giving much speedup because this is simple change, but for some certain inputs like empty datetime fields in csv it gives some speed (because empty fields are parsed as float NaN-s).

@vnlitvinov
Copy link
Contributor Author

Here's what asv continuous -f 1.01 upstream/master float-nat-speedup -e -b io.csv -a sample_time=1 -a warmup_time=1 shows:

before after ratio test name
[c7c4c94] [54533f3]
master float-nat-speedup
1.91±0.03ms 1.86±0.01ms 0.97 io.csv.ReadCSVParseDates.time_multiple_date

I'm run more thorough benchmark now.

@codecov
Copy link

codecov bot commented Apr 1, 2019

Codecov Report

Merging #25946 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25946      +/-   ##
==========================================
- Coverage   91.82%   91.81%   -0.01%     
==========================================
  Files         175      175              
  Lines       52581    52581              
==========================================
- Hits        48280    48276       -4     
- Misses       4301     4305       +4
Flag Coverage Δ
#multiple 90.36% <ø> (ø) ⬆️
#single 41.89% <ø> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7c4c94...54533f3. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 1, 2019

Codecov Report

Merging #25946 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25946      +/-   ##
==========================================
- Coverage   91.82%   91.81%   -0.01%     
==========================================
  Files         175      175              
  Lines       52581    52581              
==========================================
- Hits        48280    48276       -4     
- Misses       4301     4305       +4
Flag Coverage Δ
#multiple 90.36% <ø> (ø) ⬆️
#single 41.89% <ø> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 75% <0%> (-12.5%) ⬇️
pandas/core/frame.py 96.79% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7c4c94...6e77c81. Read the comment docs.

@vnlitvinov vnlitvinov closed this Apr 1, 2019
@vnlitvinov vnlitvinov reopened this Apr 1, 2019
@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance labels Apr 2, 2019
@jreback jreback added this to the 0.25.0 milestone Apr 2, 2019
@jreback
Copy link
Contributor

jreback commented Apr 2, 2019

can you run the benchmarks for missing value checking and see what the change is.

@vnlitvinov
Copy link
Contributor Author

Sure, I am already running those, but it takes huuuuge time to run when you set warmup and sampling times high enough (and with default settings the results are too flaky to be believable).

@jreback
Copy link
Contributor

jreback commented Apr 2, 2019

Sure, I am already running those, but it takes huuuuge time to run when you set warmup and sampling times high enough (and with default settings the results are too flaky to be believable).

right, though just the missing ones shouldn't be that huge

@vnlitvinov
Copy link
Contributor Author

Can you point out these "missing" benchmark names?

@jreback
Copy link
Contributor

jreback commented Apr 2, 2019

you can pass a regex to select a subset

@vnlitvinov
Copy link
Contributor Author

Could you please recommend what benchmark names might be relevant? I didn't study the whole list of them yet...

@vnlitvinov
Copy link
Contributor Author

vnlitvinov commented Apr 3, 2019

So running this asv continuous -f 1.01 upstream/master float-nat-speedup -b '.*missing.*' -b io.csv -e -a sample_time=2 -a warmup_time=2 produces:

before after ratio test name
[c7c4c94] [6e77c81]
master float-nat-speedup
1.48±0.01μs 1.40±0.03μs 0.95 timedelta.TimedeltaConstructor.time_from_missing
8.41±0.09ms 7.79±0.06ms 0.93 io.csv.ReadCSVSkipRows.time_skipprows(10000)

I'm sure this is not benchmarking the worst case, as I think it should speed up parsing a date column where most fields are empty, but only after #25754 is merged so that NaN-s (to which empty fields are translated) stay as floats instead of literal "nan" strings.

@jreback jreback merged commit 1679e54 into pandas-dev:master Apr 4, 2019
@jreback
Copy link
Contributor

jreback commented Apr 4, 2019

thanks @vnlitvin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants