Add support for date_unit="D" in read_json and to_json #51000


Closed

Conversation


@DavideCanton DavideCanton commented Jan 26, 2023

This PR extends the date_unit parameter of pd.read_json and of the to_json method of Series and DataFrame by adding another allowed value: "D".

This value can be used to serialize and deserialize dates in YYYY-MM-DD format.

In the deserialization, I chose 365 as the minimum value for the "D" unit, since the minimum values for all the other units also represent one year.
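To make the threshold concrete, here is a minimal stdlib-only sketch of the idea: an integer is only decoded as a day count if it is at least 365 (roughly one year past the epoch). The function name and cutoff logic are illustrative, not pandas' actual decoder.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
MIN_DAYS = 365  # proposed minimum: smaller numbers are not treated as day counts

def decode_days(value: int) -> date:
    """Decode an integer as days since the Unix epoch, mirroring a 'D' date_unit."""
    if value < MIN_DAYS:
        raise ValueError(f"{value} is below the minimum for unit 'D'")
    return EPOCH + timedelta(days=value)

print(decode_days(365))    # smallest accepted day count -> 1971-01-01
print(decode_days(19383))  # a more recent date -> 2023-01-26
```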

The differences with the other supported units in deserialization are:

  • this unit is not automatically inferred; I've left the default minimum unit as "s" to avoid breaking backward compatibility.
  • this unit is case sensitive: only "D" is accepted, not "d".

I've already updated the unit tests and the documentation. I wasn't sure where this enhancement should go in the whatsnew, so feel free to suggest a location and I will update it.

All suggestions are welcome, please let me know what I can improve.

@@ -1063,7 +1066,9 @@ def __init__(
self.dtype = dtype

        if date_unit is not None:
            # avoid lowercasing "D" but keep backward compatibility for the other units
            if date_unit != "D":
                date_unit = date_unit.lower()
Member


Why not just put a lowercase D in _STAMP_UNITS?

Author


I wanted to keep the exact same format passed to the underlying function

@mroeschke mroeschke added the IO JSON read_json, to_json, json_normalize label Jan 30, 2023
@mroeschke
Member

In the deserialization, I chose 365 as the minimum value for the "D" unit, since the minimum values for all the other units also represent one year.

Does this work correctly when calling this during a leap year?

@DavideCanton
Author

In the deserialization, I chose 365 as the minimum value for the "D" unit, since the minimum values for all the other units also represent one year.

Does this work correctly when calling this during a leap year?

What do you mean? This threshold is only used when detecting whether a number is a suitable value for decoding a timestamp with date unit "D".

@WillAyd
Member

WillAyd commented Feb 10, 2023

Not an expert in this area, but I'm also a little unsure of how accurate this would be. The code seems to assume there are always 24 hours in a day, which isn't true when you think about things like DST.

Do you know how numpy internally manages going from nanoseconds to day precision?

@WillAyd
Member

WillAyd commented Feb 10, 2023

I guess DST might not apply here since we are talking about UTC, but yeah, I'm still not sure about taking this on directly in the JSON code. @jbrockmendel might have some insights since he has been working on non-ns array support.

@DavideCanton
Author

DavideCanton commented Feb 10, 2023

Not an expert in this area, but I'm also a little unsure of how accurate this would be. The code seems to assume there are always 24 hours in a day, which isn't true when you think about things like DST.

Do you know how numpy internally manages going from nanoseconds to day precision?

I tried some manipulations and it seems to truncate the time part, at least in my experiments:

In [105]: a = np.array(["2023-03-25T00:23:45.248755"], dtype="datetime64[ns]")

In [106]: b = a + np.array([np.timedelta64(i, 'h') for i in range(30)])

In [107]: b.astype('datetime64[D]')
Out[107]:
array(['2023-03-25', '2023-03-25', '2023-03-25', '2023-03-25',
       '2023-03-25', '2023-03-25', '2023-03-25', '2023-03-25',
       '2023-03-25', '2023-03-25', '2023-03-25', '2023-03-25',
       '2023-03-25', '2023-03-25', '2023-03-25', '2023-03-25',
       '2023-03-25', '2023-03-25', '2023-03-25', '2023-03-25',
       '2023-03-25', '2023-03-25', '2023-03-25', '2023-03-25',
       '2023-03-26', '2023-03-26', '2023-03-26', '2023-03-26',
       '2023-03-26', '2023-03-26'], dtype='datetime64[D]')

In [108]: b - a
Out[108]:
array([              0,   3600000000000,   7200000000000,  10800000000000,
        14400000000000,  18000000000000,  21600000000000,  25200000000000,
        28800000000000,  32400000000000,  36000000000000,  39600000000000,
        43200000000000,  46800000000000,  50400000000000,  54000000000000,
        57600000000000,  61200000000000,  64800000000000,  68400000000000,
        72000000000000,  75600000000000,  79200000000000,  82800000000000,
        86400000000000,  90000000000000,  93600000000000,  97200000000000,
       100800000000000, 104400000000000], dtype='timedelta64[ns]')

In [109]: (b-a).astype('timedelta64[D]')
Out[109]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1], dtype='timedelta64[D]')

I chose 2023-03-25 since DST starts in the night between the 25th and the 26th. IIRC, numpy handles neither timezones nor DST.

@jbrockmendel
Member

Do you know how numpy internally manages going from nanoseconds to day precision?

division/multiplication. numpy does everything with timezone-naive datetimes.
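In plain terms, truncating a naive nanosecond timestamp to day precision is floor division by the nanoseconds in a day. A minimal sketch of that arithmetic (not numpy's actual implementation, and ignoring leap seconds, which `datetime64` does too):

```python
# Nanoseconds per day for timezone-naive datetimes (no DST, no leap seconds).
NS_PER_DAY = 24 * 60 * 60 * 1_000_000_000

def ns_to_days(ns: int) -> int:
    """Truncate a nanosecond offset to whole days via floor division,
    matching the behavior seen in the numpy experiments above."""
    return ns // NS_PER_DAY

print(ns_to_days(86_400_000_000_000))       # exactly 24 h of nanoseconds
print(ns_to_days(86_400_000_000_000 - 1))   # one ns short of a day
print(ns_to_days(104_400_000_000_000))      # 29 h, the last timedelta printed above
```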

@DavideCanton
Author

Hi, is there any interest in merging this PR? If so, I can resolve the conflicts with the main branch.

@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Jun 26, 2023
@DavideCanton
Author

I didn't update this on purpose, since I got no feedback and had no time to keep it updated. As I said before, if there is any interest in this feature, I'll update it.

@mousedownmike

I would have used this feature if it were available. My use case involves regulatory data that applies to a day of the year (no regard for timezone).

Having an output option that does not include the time of day would significantly reduce the payload size. As previously discussed, this would still be an ISO 8601 conformant date format.
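For a rough sense of the size difference, a quick stdlib-only comparison of the same date serialized three ways (the field name and record count are made up for illustration):

```python
import json

# One date rendered as epoch milliseconds, as a full ISO timestamp, and date-only.
records_ms  = [{"d": 1674691200000} for _ in range(3)]
records_iso = [{"d": "2023-01-26T00:00:00.000"} for _ in range(3)]
records_day = [{"d": "2023-01-26"} for _ in range(3)]

for name, recs in [("epoch ms", records_ms), ("iso", records_iso), ("day", records_day)]:
    print(name, len(json.dumps(recs)))
```

The date-only payload is the smallest of the three, which is the point made above.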

I'm subscribed to this issue and happy to provide additional information/support if it will help.

@DavideCanton
Author

Hi @mousedownmike! The feature in this PR was working when I created the PR, and I was waiting for a review by a maintainer, but it was developed on pandas 1.5.x, so it probably needs some work to adapt the code to the new major version.

I'll probably look into it, but I don't know when, since I'm quite busy at the moment. If you have some spare time, feel free to work on it!

@DavideCanton
Author

@mroeschke I've updated the code, can you please reopen the PR?

@mroeschke mroeschke reopened this Jul 15, 2023
@mroeschke mroeschke requested a review from WillAyd as a code owner July 15, 2023 19:42
@DavideCanton
Author

@mroeschke why were almost all of the checks cancelled?

@mroeschke
Member

Could you merge main and push to this branch again?

@WillAyd
Member

WillAyd commented Jul 19, 2023

How do we see this aligning with the efforts in #53757 (review)? I think we are developing two different ways to deal with various levels of precision.

I'm still inclined to go with the other approach as it is more native to the type system, but maybe that doesn't include all of the work here

@DavideCanton
Author

How do we see this aligning with the efforts in #53757 (review)? I think we are developing two different ways to deal with various levels of precision.

I'm still inclined to go with the other approach as it is more native to the type system, but maybe that doesn't include all of the work here

That makes sense, since this PR was opened before pandas 2, when things were slightly different.

That said, I read through the linked PR and the issue, and I agree it doesn't include all of the support for unit="D", since that seems unsupported in read_json. I'm not very familiar with pandas internals though, so I may have missed something.

It may make sense to either close this PR and start fresh once the other is merged, or rebase this onto the one you provided.

@mroeschke
Member

Looks like #53757 has been merged. I agree it's probably best to start a fresh PR relative to that one, so closing.

@mroeschke mroeschke closed this Aug 1, 2023
Labels
IO JSON read_json, to_json, json_normalize
Development

Successfully merging this pull request may close these issues.

No way with to_json to write only date out of datetime
5 participants