Skip to content

CLN: some code cleanups in pandas/_libs/ #31808

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 22, 2020

Conversation

ShaharNaveh
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@ShaharNaveh
Copy link
Member Author

ShaharNaveh commented Feb 8, 2020

The only "big" changes I have done here, is taking this pattern:

arr = ["a", "b", "c"]

for i in range(len(arr)):
    print(arr[i])

and making it to look like this:

arr = ["a", "b", "c"]

for value in arr:
    print(value)

When it comes to python, I can't see why would one use the first pattern, but when it comes to Cython, I don't know if there's some sort of optimization, that the first pattern gives that the second one don't

@@ -448,7 +448,7 @@ cdef class BlockIndex(SparseIndex):
ylen = y.blengths

# block may be split, but can't exceed original len / 2 + 1
max_len = int(min(self.length, y.length) / 2) + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there no cases where int rounds up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, in order for int() to round up it needs to be done with math.ceil or with a "hack" like:

number // divider + (number % divider > 0)

For example:

number = 42.01
divider = 8

number / divider # 5.25125
number // divider # 5.0
int(number / divider) # 5
number // divider + (number % divider > 0) # 6.0

@jorisvandenbossche
Copy link
Member

but when it comes to Cython, I don't know if there's some sort of optimization, that the first pattern gives that the second one don't

Then you will need to show that it gives similar results (eg show some timings of the changed code, or look at the generated C code).

For example, with a simple test (but I don't know if that is fully equivalent with the changes you did, this was just a quick test), I see a big overhead when iterating directly through the values of an array. Eg check

%%cython -a
import cython

import numpy as np
cimport numpy as np


@cython.wraparound(False)
@cython.boundscheck(False)
def sum1():
    cdef:
        np.int64_t[:] arr
        int n
        int val, total
    
    arr = np.random.randint(10, size=10000)
    
    n = len(arr)
    total = 0
    
    for i in range(n):
        val = arr[i]
        total += val
    
    return total

vs

%%cython -a
import cython

import numpy as np
cimport numpy as np


@cython.wraparound(False)
@cython.boundscheck(False)
def sum2():
    cdef:
        np.int64_t[:] arr
        np.int64_t val, total
    
    arr = np.random.randint(10, size=10000)
    
    n = len(arr)
    total = 0
    
    for val in arr:
        total += val
    
    return total

the second version is much slower (and with the annotation in the notebook you clearly see a lot of python interaction). But I am no cython expert, so I might also have done something wrong.

check_dts_bounds,
NPY_DATETIMEUNIT,
NPY_FR_D,
NPY_FR_us
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing trailing commas in some of these

if is_period_object(p):
return p.freq
if is_period_object(value):
return value.freq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for avoiding 1-character variable name

@jbrockmendel
Copy link
Member

@MomIsBestFriend i'd advocate reverting the non-easy parts of this and getting the clear improvements in

@ShaharNaveh
Copy link
Member Author

@jbrockmendel NP, I will do it over the weekend as soon as I have time.

if row < 6:
return 0
elif col < 6:
if row < 6 or col < 6:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this need brackets to be done in the correct order?

Copy link
Member Author

@ShaharNaveh ShaharNaveh Feb 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche I have reverted this, as I was doing it wrong before, I haven't looked at the possibility that row can be let's say 12 and col be 2.

Reverted in 208bc03

@@ -289,17 +299,15 @@ cdef int64_t DtoB(npy_datetimestruct *dts, int roll_back,
return DtoB_weekday(unix_date)


cdef inline int64_t upsample_daytime(int64_t ordinal,
asfreq_info *af_info) nogil:
cdef inline int64_t upsample_daytime(int64_t ordinal, asfreq_info *af_info) nogil:
if (af_info.is_end):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you wanted to remove the parens here, i wouldnt object

@@ -78,12 +77,10 @@ def array_strptime(object[:] values, object fmt,
if fmt is not None:
if '%W' in fmt or '%U' in fmt:
if '%Y' not in fmt and '%y' not in fmt:
raise ValueError("Cannot use '%W' or '%U' without "
"day and year")
raise ValueError("Cannot use '%W' or '%U' without day and year")
if ('%A' not in fmt and '%a' not in fmt and '%w' not
in fmt):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the in fmt can go on the previous line, then parens removed

@jbrockmendel
Copy link
Member

@MomIsBestFriend can you rebase and ping on green

@ShaharNaveh
Copy link
Member Author

ping @jbrockmendel

@jbrockmendel jbrockmendel merged commit 9e69040 into pandas-dev:master Feb 22, 2020
@jbrockmendel
Copy link
Member

thanks @MomIsBestFriend

@simonjayhawkins simonjayhawkins added this to the 1.1 milestone Feb 24, 2020
@ShaharNaveh ShaharNaveh deleted the CLN-libs branch February 29, 2020 10:27
roberthdevries pushed a commit to roberthdevries/pandas that referenced this pull request Mar 2, 2020
* CLN: some code cleanups in pandas/_libs/

* Reverted "bint"

REF:
https://github.com/pandas-dev/pandas/pull/31808/files#r376721172

* Added trailing comma to imports

REF:
https://github.com/pandas-dev/pandas/pull/31808/files#r378523656

* Reverted bad code

* Lint issues

* Reverted wrong code

REF:
pandas-dev#31808 (comment)

* Removed parens

REF:
pandas-dev#31808 (comment)

* "in fmt" in prev line

REF:
pandas-dev#31808 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants