RecursionError in DataFrame.replace with Out of Bounds Datetime #20380

Closed
JackKZ opened this issue Mar 16, 2018 · 15 comments · Fixed by #22108
Labels
Bug Datetime Datetime data dtype

Comments

@JackKZ

JackKZ commented Mar 16, 2018

RecursionError in DataFrame.replace

import pandas as pd
import datetime
df = pd.DataFrame({
    "dt" : [datetime.datetime(3017, 12, 20)], 
    "str" : ["blah"]
})
df.replace("blah", "cats")

This leads to many repetitions of the following traceback, eventually ending in a RecursionError:

OutOfBoundsDatetime                       Traceback (most recent call last)
~/sandbox/pandas/pandas/core/internals.py in replace(self, to_replace, value, inplace, filter, regex, convert, mgr)
    804                 blocks = [b.convert(by_item=True, numeric=False,
--> 805                                     copy=not inplace) for b in blocks]
    806             return blocks

~/sandbox/pandas/pandas/core/internals.py in <listcomp>(.0)
    804                 blocks = [b.convert(by_item=True, numeric=False,
--> 805                                     copy=not inplace) for b in blocks]
    806             return blocks

~/sandbox/pandas/pandas/core/internals.py in convert(self, *args, **kwargs)
   2355         if by_item and not self._is_single_block:
-> 2356             blocks = self.split_and_operate(None, f, False)
   2357         else:

~/sandbox/pandas/pandas/core/internals.py in split_and_operate(self, mask, f, inplace)
    508             if m.any():
--> 509                 nv = f(m, v, i)
    510             else:

~/sandbox/pandas/pandas/core/internals.py in f(m, v, i)
   2345             shape = v.shape
-> 2346             values = fn(v.ravel(), **fn_kwargs)
   2347             try:

~/sandbox/pandas/pandas/core/dtypes/cast.py in soft_convert_objects(values, datetime, numeric, timedelta, coerce, copy)
    834     if datetime:
--> 835         values = lib.maybe_convert_objects(values, convert_datetime=datetime)
    836

~/sandbox/pandas/pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_objects()
   1317                     seen.datetime_ = 1
-> 1318                     idatetimes[i] = convert_to_tsobject(
   1319                         val, None, None, 0, 0).value

~/sandbox/pandas/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()
    299     elif PyDateTime_Check(ts):
--> 300         return convert_datetime_to_tsobject(ts, tz, nanos)
    301     elif PyDate_Check(ts):

~/sandbox/pandas/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_datetime_to_tsobject()
    379
--> 380     check_dts_bounds(&obj.dts)
    381     check_overflows(obj)

~/sandbox/pandas/pandas/_libs/tslibs/np_datetime.pyx in pandas._libs.tslibs.np_datetime.check_dts_bounds()
    120                                                dts.min, dts.sec)
--> 121         raise OutOfBoundsDatetime(
    122             'Out of bounds nanosecond timestamp: {fmt}'.format(fmt=fmt))

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3017-12-20 00:00:00

(thanks to @ChrisMuir for the example)

@TomAugspurger
Contributor

Can you make your example reproducible? http://stackoverflow.com/help/mcve

TomAugspurger added the "Needs Info" label (clarification about behavior needed to assess issue) Mar 16, 2018
@JackKZ
Author

JackKZ commented Mar 16, 2018

example_df.zip
Here is the pickle file that I have trouble with.

Thanks!

Jack

@JackKZ
Author

JackKZ commented Mar 16, 2018

Also here is the output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

TomAugspurger commented Mar 16, 2018 via email

@ChrisMuir

ChrisMuir commented Mar 16, 2018

Just to chime in on this with more info: I'm getting a RecursionError exception when running df = df.replace(np.nan, 'NA') on Jack's data frame. Here's code to reproduce the issue:

import pandas as pd
from io import BytesIO
import zipfile
import requests

# Read the zipped file from this issue thread.
res = requests.get("https://github.com/pandas-dev/pandas/files/1820497/example_df.zip")

# Read the pkl file via pandas (pkl contains a data frame).
with zipfile.ZipFile(BytesIO(res.content)) as z:
    pkl_file = z.namelist()[0]
    with z.open(pkl_file) as pk:
        df = pd.read_pickle(pk)

# Simple call to replace.
df = df.replace("blah", "NA")

Here's the traceback:

  File "<ipython-input-6-0fda1114890b>", line 1, in <module>
    df = df.replace(np.nan, "NA")

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\generic.py", line 4612, in replace
    regex=regex)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3468, in replace
    return self.apply('replace', **kwargs)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3329, in apply
    applied = getattr(b, f)(**kwargs)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
    convert=convert, mgr=mgr)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 771, in replace
    filter=filter, regex=regex, convert=convert)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
    convert=convert, mgr=mgr)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 771, in replace
    filter=filter, regex=regex, convert=convert)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
    convert=convert, mgr=mgr)

  ....
  <snip a ton of references to line 771 and line 2216>
  ....

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 763, in replace
    copy=not inplace) for b in blocks]

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 763, in <listcomp>
    copy=not inplace) for b in blocks]

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2135, in convert
    blocks = self.split_and_operate(None, f, False)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 482, in split_and_operate
    block = make_a_block(nv, [ref_loc])

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 458, in make_a_block
    placement=ref_loc, fastpath=True)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 210, in make_block
    return make_block(values, placement=placement, ndim=ndim, **kwargs)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2957, in make_block
    return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2082, in __init__
    placement=placement, **kwargs)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 114, in __init__
    self.mgr_locs = placement

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 229, in mgr_locs
    new_mgr_locs = BlockPlacement(new_mgr_locs)

  File "pandas/_libs/lib.pyx", line 1696, in pandas._libs.lib.BlockPlacement.__init__

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 729, in require
    requirements = set(possible_flags[x.upper()] for x in requirements)

RecursionError: maximum recursion depth exceeded

I tried isolating the issue to a specific column

try:
    df = df.replace("blah", "NA")
except RecursionError:
    for curr_col in df.columns:
        try:    
            df[curr_col] = df.loc[:, curr_col].replace("blah", "NA")
            print("success for col %s" % curr_col)
        except RecursionError:
            print("fail for col %s" % curr_col)

This outputs:

success for col adulterant
success for col announcement_date
success for col data_source_detailed
success for col data_source_general
success for col failing_results
success for col filename
success for col food_name
success for col inspection_results
success for col legal_limit
success for col manufacturer_address
success for col manufacturer_name
success for col notice_number
success for col product_classification
fail for col production_date
success for col sampled_location_address
success for col sampled_location_name
success for col sampled_location_province
success for col sheetname
success for col specifications_model
success for col task_source_or_project_name
success for col test_outcome
success for col testing_agency

So the issue seems to be with col production_date. The data in this col is made up of str and datetime.datetime.

# Collect the distinct Python types present in the column.
types = {type(n) for n in df["production_date"]}

Printing types gives:

{datetime.datetime, str}

This is as far as I've gotten. Not sure what to do with all this info, but figured I'd share.

@ChrisMuir

Oh and here's my output for pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.22.0
pytest: 3.0.7
pip: 9.0.1
setuptools: 37.0.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@ChrisMuir

ChrisMuir commented Mar 16, 2018

@TomAugspurger I believe I've narrowed this down: replace() tries to throw an OutOfBoundsDatetime exception when it hits datetime.datetime values that are out of range, but instead gets stuck in infinite recursion. Check out the minimal example below.

This works without errors (year 2017 in the datetime object):

import pandas as pd
import datetime
df = pd.DataFrame({
    "dt" : [datetime.datetime(2017, 12, 20)], 
    "str" : ["blah"]
})
df.replace("blah", "cats")

However this throws a RecursionError (year 3017 in the datetime object):

import pandas as pd
import datetime
df = pd.DataFrame({
    "dt" : [datetime.datetime(3017, 12, 20)], 
    "str" : ["blah"]
})
df.replace("blah", "cats")

Saving the bad code to file and then calling it from command prompt, I'm able to see the entire traceback. Near the top, this block appears a bunch of times:

Traceback (most recent call last):
  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 763, in replace
    copy=not inplace) for b in blocks]
  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 763, in <listcomp>
    copy=not inplace) for b in blocks]
  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2135, in convert
    blocks = self.split_and_operate(None, f, False)
  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 478, in split_and_operate
    nv = f(m, v, i)
  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2125, in f
    values = fn(v.ravel(), **fn_kwargs)
  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 807, in soft_convert_objects
    values = lib.maybe_convert_objects(values, convert_datetime=datetime)
  File "pandas/_libs/src/inference.pyx", line 1290, in pandas._libs.lib.maybe_convert_objects
  File "pandas/_libs/tslib.pyx", line 1575, in pandas._libs.tslib.convert_to_tsobject
  File "pandas/_libs/tslib.pyx", line 1669, in pandas._libs.tslib.convert_datetime_to_tsobject
  File "pandas/_libs/tslib.pyx", line 1848, in pandas._libs.tslib._check_dts_bounds
pandas._libs.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3017-12-20 00:00:00

Followed by a bunch of calls referencing lines 771 and 2216, and ending with this:

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 771, in replace
    filter=filter, regex=regex, convert=convert)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
    convert=convert, mgr=mgr)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 771, in replace
    filter=filter, regex=regex, convert=convert)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2216, in replace
    convert=convert, mgr=mgr)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 760, in replace
    blocks = self.putmask(mask, value, inplace=inplace)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 1021, in putmask
    return [self.make_block(new_values, fastpath=True)]

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 210, in make_block
    return make_block(values, placement=placement, ndim=ndim, **kwargs)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2957, in make_block
    return klass(values, ndim=ndim, fastpath=fastpath, placement=placement)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 2082, in __init__
    placement=placement, **kwargs)

  File "C:\Users\cmuir\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\internals.py", line 114, in __init__
    self.mgr_locs = placement

RecursionError: maximum recursion depth exceeded

Looking at the function check_dts_bounds in pandas/_libs/tslibs/np_datetime.pyx, it looks like the call to replace is supposed to throw an OutOfBoundsDatetime exception.
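For context on why the year 3017 is out of bounds while 2017 is not, here is a rough sketch using only the standard library. It assumes, as the error message suggests, that the timestamp is stored as a signed 64-bit count of nanoseconds since the Unix epoch, with the most negative value reserved for the missing-value marker. The helper names to_epoch_ns and fits_in_ns_timestamp are made up for illustration and are not pandas functions.

```python
import datetime

# Assumed representation: signed 64-bit nanoseconds since 1970-01-01.
# The most negative int64 is assumed to be reserved for "missing" (NaT).
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63) + 1
EPOCH = datetime.datetime(1970, 1, 1)


def to_epoch_ns(dt):
    """Return a naive datetime as integer nanoseconds since the epoch."""
    delta = dt - EPOCH
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 10**9 + delta.microseconds * 1_000


def fits_in_ns_timestamp(dt):
    """True if dt is representable as a 64-bit nanosecond timestamp."""
    return INT64_MIN <= to_epoch_ns(dt) <= INT64_MAX


# Latest datetime (to microsecond precision) that still fits.
latest = EPOCH + datetime.timedelta(microseconds=INT64_MAX // 1_000)

print(latest)                                                  # 2262-04-11 23:47:16.854775
print(fits_in_ns_timestamp(datetime.datetime(2017, 12, 20)))  # True
print(fits_in_ns_timestamp(datetime.datetime(3017, 12, 20)))  # False
```

A check like this could also be used to scan an object column for the offending values before calling replace.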

@cbertinato
Contributor

cbertinato commented Mar 17, 2018

The recursion occurs because OutOfBoundsDatetime subclasses ValueError, which is caught here at line 807 in core/internals.py:

        except (TypeError, ValueError):

            # try again with a compatible block
            block = self.astype(object)
            return block.replace(
                to_replace=original_to_replace, value=value, inplace=inplace,
                filter=filter, regex=regex, convert=convert)

There is little sense in trying again if the datetime is out of bounds. This issue can be fixed by adding an except block for OutOfBoundsDatetime before this block. A relevant question: should the out-of-bounds datetime have been caught when the DataFrame was created?

I'm inclined to think 'yes'. It's better to fail sooner rather than later.
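The mechanism described above can be reproduced without pandas. The following is a toy model only: OutOfBoundsDatetime, convert, replace_buggy and replace_fixed are stand-ins written for this illustration, and the year check is a simplified bound. It shows why catching ValueError and retrying with the same data never terminates, and what handling the more specific exception first might look like. It is not necessarily what the eventual fix does.

```python
import datetime


class OutOfBoundsDatetime(ValueError):
    """Toy stand-in: like the pandas exception, it subclasses ValueError."""


def convert(values):
    """Toy conversion step that rejects datetimes past the year 2262."""
    for v in values:
        if isinstance(v, datetime.datetime) and v.year > 2262:
            raise OutOfBoundsDatetime(f"Out of bounds nanosecond timestamp: {v}")
    return values


def replace_buggy(values, to_replace, value):
    try:
        replaced = [value if v == to_replace else v for v in values]
        return convert(replaced)
    except (TypeError, ValueError):
        # Retrying with the same data hits the same error again, so this
        # recurses until Python raises RecursionError.
        return replace_buggy(list(values), to_replace, value)


def replace_fixed(values, to_replace, value):
    replaced = [value if v == to_replace else v for v in values]
    try:
        return convert(replaced)
    except OutOfBoundsDatetime:
        # The conversion is optional in this model: keep the plain objects.
        return replaced


data = [datetime.datetime(3017, 12, 20), "blah"]

try:
    replace_buggy(data, "blah", "cats")
except RecursionError as exc:
    print("buggy:", type(exc).__name__)

print("fixed:", replace_fixed(data, "blah", "cats"))
```

In this model the fixed version silently keeps the out-of-bounds value and still performs the replacement; whether to do that or to raise is a separate design question.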

@TomAugspurger
Contributor

Thanks for digging. In @ChrisMuir's small example, the column is object dtype, so I think that's OK.

Passing convert=False also works in this case, though that probably breaks other things.

TomAugspurger added the "Bug", "Difficulty Intermediate", and "Datetime" labels and removed the "Needs Info" label Jun 27, 2018
@TomAugspurger
Contributor

Updated the original post with the nice example.

TomAugspurger changed the title from "replace() method kills kernel" to "RecursionError in DataFrame.replace with Out of Bounds Datetime" Jun 27, 2018
@minggli
Contributor

minggli commented Jul 28, 2018

Raised a PR trying to fix this. 😃 convert=False breaks other cases by the look of it. At the moment I'm using try/except in the cast module; new ideas welcome.

@jreback jreback added this to the 0.24.0 milestone Jul 29, 2018
@minggli
Contributor

minggli commented Aug 8, 2018

@JackKZ @ChrisMuir would you prefer that an error be raised (for the out-of-bounds datetime) when encountering the above example, or that it be ignored silently and the replace carried out anyway?

@ChrisMuir

Well, personally I'd prefer that replace() not throw an error when it hits an out-of-bounds datetime, but that's coming from a purely selfish standpoint. I don't use Pandas a ton, and my use cases probably represent only a fraction of common use cases (I'm always working with data generated by third parties from whom I'm completely disconnected). Also, I don't know enough about the Pandas internals and philosophy to weigh in on which option makes the most sense with all users in mind.

@cbertinato
Contributor

In the example given, and perhaps in most use cases, the OOB datetime is irrelevant. Of course that datetime could lead to other problems, but as far as replace is concerned it shouldn't matter. In my opinion, it would only make sense to raise the exception when manipulating the datetime.

@mroeschke
Member

Looks like this was closed by #22108
