Implement helper method to get char* buffer from Python objects #25895

vnlitvinov · 2019-03-27T17:12:36Z

- closes N/A
- tests added: `test_parsers_iso8601_leading_space`
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry

This is a follow-up to #25754. It adds a utility method for getting the internal char * buffer from a unicode or bytes Python object and switches at least some usages to it.

Its advantage over what is built into Cython is that it saves at least one memory allocation: Cython internally calls a Python C API function that creates a new char * copy, which is unnecessary for most unicode objects because they are already stored internally in UTF-8. It also obtains the length in the same call.
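As a rough illustration of the idea (not the Cython helper from this PR), the same two C API calls can be reached from plain Python through `ctypes`. The function name `c_string_buf_and_size_sketch` is hypothetical, and the sketch assumes CPython, where `PyUnicode_AsUTF8AndSize` and `PyBytes_AsStringAndSize` return a pointer into a buffer owned by the object together with its length.

```python
import ctypes

# CPython C API entry points, accessed via ctypes purely for illustration.
_unicode_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
_unicode_as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
_unicode_as_utf8.restype = ctypes.c_void_p

_bytes_as_string = ctypes.pythonapi.PyBytes_AsStringAndSize
_bytes_as_string.argtypes = [
    ctypes.py_object,
    ctypes.POINTER(ctypes.c_void_p),
    ctypes.POINTER(ctypes.c_ssize_t),
]
_bytes_as_string.restype = ctypes.c_int


def c_string_buf_and_size_sketch(py_string):
    """Hypothetical helper: return (address, length) of the char buffer.

    The buffer belongs to ``py_string``; nothing is copied here, so the
    address is only meaningful while ``py_string`` is alive.
    """
    length = ctypes.c_ssize_t()
    if isinstance(py_string, str):
        # One call yields both the UTF-8 pointer and its length in bytes.
        address = _unicode_as_utf8(py_string, ctypes.byref(length))
    elif isinstance(py_string, bytes):
        buf = ctypes.c_void_p()
        _bytes_as_string(py_string, ctypes.byref(buf), ctypes.byref(length))
        address = buf.value
    else:
        # Mirrors the commit note about raising TypeError for non-strings.
        raise TypeError("expected str or bytes, got %s" % type(py_string).__name__)
    return address, length.value


text = "2019-03-27T17:12:36Z"
address, size = c_string_buf_and_size_sketch(text)
# string_at copies the bytes out only so that they can be printed.
print(size, ctypes.string_at(address, size))
```

The returned address must not outlive the Python object it came from, which is the same constraint a Cython caller of such a helper would have to respect.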

cc @jbrockmendel

pandas/_libs/tslibs/np_datetime.pyx

pandas/_libs/tslibs/util.pxd

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

pandas/_libs/hashtable_class_helper.pxi.in

vnlitvinov · 2019-03-27T18:00:01Z

Note for reviewers: errors in CI seem to be caused by #25875

codecov · 2019-03-27T18:49:22Z

Codecov Report

Merging #25895 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #25895   +/-   ##
=======================================
  Coverage   91.47%   91.47%           
=======================================
  Files         175      175           
  Lines       52863    52863           
=======================================
  Hits        48357    48357           
  Misses       4506     4506

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.04% <ø> (ø)` | ⬆️ |
| #single | `41.8% <ø> (ø)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ac318d2...4d1916c.

codecov · 2019-03-27T18:49:26Z

Codecov Report

Merging #25895 into master will increase coverage by 0.23%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25895      +/-   ##
==========================================
+ Coverage   91.53%   91.77%   +0.23%     
==========================================
  Files         175      175              
  Lines       52808    52606     -202     
==========================================
- Hits        48338    48277      -61     
+ Misses       4470     4329     -141

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.32% <ø> (+0.22%)` | ⬆️ |
| #single | `41.9% <ø> (+0.07%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/compat/pickle_compat.py | `69.51% <0%> (-6.1%)` | ⬇️ |
| pandas/plotting/_compat.py | `83.33% <0%> (-3.34%)` | ⬇️ |
| pandas/compat/numpy/__init__.py | `93.1% <0%> (-0.23%)` | ⬇️ |
| pandas/io/sas/sas_xport.py | `90.09% <0%> (-0.05%)` | ⬇️ |
| pandas/io/packers.py | `88.08% <0%> (-0.04%)` | ⬇️ |
| pandas/core/groupby/generic.py | `87.03% <0%> (-0.02%)` | ⬇️ |
| pandas/io/formats/csvs.py | `98.2% <0%> (-0.02%)` | ⬇️ |
| pandas/core/arrays/datetimes.py | `97.79% <0%> (-0.01%)` | ⬇️ |
| pandas/io/formats/format.py | `97.99% <0%> (ø)` | ⬆️ |
| pandas/core/arrays/sparse.py | `92.17% <0%> (ø)` | ⬆️ |

... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 882961d...f5f8e29.

vnlitvinov · 2019-03-28T16:24:06Z

Fun fact: it speeds up ISO 8601 parsing noticeably:

asv continuous -f 1.05 882961df 1db613e6 -e -b timeseries.ToDatetimeISO8601 -a sample_time=1 -a warmup_time=1

| before [`882961d`] (master) | after [`1db613e`] (get_string_data_pr) | ratio | test name |
|---|---|---|---|
| 2.78±0.02ms | 2.44±0.02ms | 0.87 | timeseries.ToDatetimeISO8601.time_iso8601_nosep |
| 2.85±0.03ms | 2.47±0.03ms | 0.87 | timeseries.ToDatetimeISO8601.time_iso8601_format |
| 2.81±0.01ms | 2.43±0.01ms | 0.87 | timeseries.ToDatetimeISO8601.time_iso8601_format_no_sep |
| 2.84±0.03ms | 2.44±0.03ms | 0.86 | timeseries.ToDatetimeISO8601.time_iso8601 |

jreback · 2019-03-28T19:09:43Z

lgtm. @jbrockmendel ?

pandas/_libs/tslibs/util.pxd

pandas/tests/tslibs/test_parse_iso8601.py

pandas/_libs/tslibs/np_datetime.pyx

jreback · 2019-03-29T12:22:39Z

lgtm. @jbrockmendel merge when ready.

jbrockmendel · 2019-03-29T15:26:50Z

Thanks @vnlitvin

…as-dev#25895)

* removed extra layer; using get_string_data now
* fix problem with const char* value, that return PyUnicode_AsUTF8AndSize in Python3.7 case; added docstring to get_string_data func
* fix code style
* replaced get_c_string to get_string_data, added 'note' paragraph in get_string_data docstring
* Re-instate raising TypeError when trying to get string data of non-string object
* test case for overflow in parse_iso_8601_datetime
* change get_string_data signature to more pythonic
* Added test for parsing leading spaces
* Rework get_string_data to cleaner get_c_string_buf_and_size
* Fix Python 3.7 compilation
* added comment for test; changed name variable: s -> py_string

anmyachev and others added 5 commits March 27, 2019 18:38

removed extra layer; using get_string_data now

3551121

fix problem with const char* value, that return PyUnicode_AsUTF8AndSize in Python3.7 case; added docstring to get_string_data func

d24e728

fix code style

44efe95

replaced get_c_string to get_string_data, added 'note' paragraph in get_string_data docstring

c90635b

Re-instate raising TypeError when trying to get string data of non-string object

4d1916c

vnlitvinov mentioned this pull request Mar 27, 2019

PERF: cythonizing _concat_date_cols; conversion to float without exceptions in _does_string_look_like_datetime #25754

Merged

4 tasks

jbrockmendel reviewed Mar 27, 2019

pandas/_libs/tslibs/np_datetime.pyx

jbrockmendel reviewed Mar 27, 2019

pandas/_libs/tslibs/util.pxd (outdated)

jbrockmendel reviewed Mar 27, 2019

pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

jbrockmendel reviewed Mar 27, 2019

pandas/_libs/hashtable_class_helper.pxi.in (outdated)

gfyoung added the "Internals" label (Related to non-user accessible pandas implementation) Mar 27, 2019

anmyachev and others added 4 commits March 28, 2019 10:17

test case for overflow in parse_iso_8601_datetime

789614e

change get_string_data signature to more pythonic

2bdddc8

Added test for parsing leading spaces

13c8e95

Rework get_string_data to cleaner get_c_string_buf_and_size

1db613e

vnlitvinov force-pushed the get_string_data_pr branch from 0b6902b to 1db613e Compare March 28, 2019 11:34

Fix Python 3.7 compilation

bbf37c6

jreback added this to the 0.25.0 milestone Mar 28, 2019