DataFrame.to_json silently ignores index parameter for most orients. #25513

Closed
dargueta opened this issue Mar 1, 2019 · 20 comments · Fixed by #52143
Labels
Bug IO JSON read_json, to_json, json_normalize

Comments

@dargueta
Contributor

dargueta commented Mar 1, 2019

see
#25513 (comment)

Code Sample, a copy-pastable example if possible

from __future__ import print_function

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {'id': 'abc', 'data': 'qwerty'},
        {'id': 'def', 'data': 'uiop'}
    ],
    index='id'
)

print(df.to_json(orient='records', index=True))

# Prints out:
# [{"data":"qwerty"},{"data":"uiop"}]

series = df.squeeze()
print(series.to_json(orient='records', index=True))

# Prints out:
# ["qwerty","uiop"]

Problem description

When creating a DataFrame that has two columns, one to be used as an index and another for the data, if you call .to_json(orient='records') the index is omitted. I know that in theory I should be using a Series for this, but I'm using it to convert a CSV file into JSONL and I don't know what the CSV file is going to look like ahead of time.

In any case, squeezing the DataFrame into a Series doesn't work either. In fact, the bug in Series.to_json is even worse, as it produces an array of strings instead of an array of dictionaries.

This bug is present in master.

Expected Output

Expected output is:

[{"id":"abc","data":"qwerty"},{"id":"def","data":"uiop"}]

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: None
pyarrow: 0.12.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: None
s3fs: 0.2.0
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@mroeschke mroeschke added Bug IO JSON read_json, to_json, json_normalize labels Mar 4, 2019
@coderop2

coderop2 commented Mar 4, 2019

@mroeschke The function description clearly mentions that with orient="records" the index is not preserved, so what should we do here?
@dargueta You could use df.reset_index(inplace=True) to turn the index into a column before calling to_json().

@TomAugspurger
Contributor

Right, so the index= parameter only works for split and table orients.

I think we should change the default of index to None, and

  1. Make the default index be True when orient is split or table.
  2. Raise if index is specified for any other orient.

We should deprecate the current behavior (ignoring index for other orients) first.
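
A minimal sketch of the proposal above, written in plain Python rather than against pandas internals. The function name `resolve_index` and the `strict` flag are hypothetical and exist only for illustration: `strict=False` stands in for the deprecation period, and `strict=True` for the behavior after it.

```python
import warnings

# Orients for which the index= parameter has an effect, per the comment above.
INDEX_AWARE_ORIENTS = {"split", "table"}


def resolve_index(orient, index=None, strict=False):
    """Hypothetical helper: decide what index= should mean for a given orient."""
    if orient in INDEX_AWARE_ORIENTS:
        # Step 1: an unspecified index defaults to True for split and table.
        return True if index is None else index
    if index is None:
        # Nothing was requested, so the orient alone decides the output.
        return None
    message = f"index={index!r} has no effect when orient={orient!r}"
    if strict:
        # Step 2: once the deprecation period is over, reject the combination.
        raise ValueError(message)
    # Deprecation period: keep ignoring the argument, but say so.
    warnings.warn(message, FutureWarning, stacklevel=2)
    return None
```

With a None default, an explicit index=True or index=False is distinguishable from "not passed", which is what makes both the warning and the later error possible.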

@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone Mar 4, 2019
@TomAugspurger TomAugspurger changed the title to_json(orient='records') omits index if df has one column DataFrame.to_json ignores index parameter for most orients. Mar 4, 2019
@TomAugspurger TomAugspurger changed the title DataFrame.to_json ignores index parameter for most orients. DataFrame.to_json silently ignores index parameter for most orients. Mar 4, 2019
@coderop2

coderop2 commented Mar 4, 2019

I think ignoring the index for other orients might not be good, because sometimes the index holds meaningful values such as dates, and in that case it would be dropped completely.

@TomAugspurger
Contributor

@coderop2 can you clarify? My proposal is to stop ignoring the index parameter when it's not valid for the given orient.

@dargueta
Contributor Author

dargueta commented Mar 4, 2019

in the function description its clearly mentioned that by using orient = "records" the index is not preserved.

@coderop2 I disagree about "clearly". The documentation of the parameter states:

Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.

I took this to mean the index is always included when the orientation is "records". We should at the very least rephrase this to be more explicit.

@TomAugspurger I agree with most of your proposed solution, though I don't understand why the index should be automatically dropped for orient="records". Dropping it would make sense if the index is the default sequential series, but if I explicitly specify a column as an index, I'd expect it to be included by default.

@TomAugspurger
Contributor

TomAugspurger commented Mar 4, 2019 via email

@dargueta
Contributor Author

dargueta commented Mar 4, 2019

Good point.

@coderop2

coderop2 commented Mar 5, 2019

@TomAugspurger Sorry, I misunderstood what you were explaining. I understand now.
@dargueta I was talking about the description given in the code. It goes like this:
Encoding/decoding a Dataframe using ``'records'`` formatted JSON.
Note that index labels are not preserved with this encoding.
@TomAugspurger Since I am new to open source, can I work up a solution and present it here? If that solution is acceptable, I will create a pull request.

@dargueta
Contributor Author

dargueta commented Mar 5, 2019

I still think dropping the index is a bit of a nasty surprise, especially if my code gets handed a DataFrame that someone else's code created. I don't want to break their code if I change my serialization format on the back end. Do you think there should be a way to override the default behavior, at least for orient="records"?

@TomAugspurger
Contributor

@dargueta we need to consider backwards compatibility. Making df.to_json(orient='records') include the index would be an API breaking change.

We'll need a period where providing index=True with the records orient raises. After that time period we can reconsider allowing it.

In the meantime, you'll need to .reset_index() before calling .to_json with the records orient.
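
For reference, a short sketch of that workaround using the frame from the original report. It only relies on `DataFrame.reset_index` and `DataFrame.to_json`; the `json.loads` step is added here just to make the result easy to inspect.

```python
import json

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"id": "abc", "data": "qwerty"},
        {"id": "def", "data": "uiop"},
    ],
    index="id",
)

# reset_index() returns a new frame with the index moved into a regular
# column named "id", so the records orient serializes it like any other column.
payload = df.reset_index().to_json(orient="records")
records = json.loads(payload)
print(records)
# [{'id': 'abc', 'data': 'qwerty'}, {'id': 'def', 'data': 'uiop'}]
```

Because `reset_index()` is called without `inplace=True`, the caller's frame keeps its index.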

@TomAugspurger
Contributor

@TomAugspurger since I am new to open source can I work up a solution and present it here...if that solution is acceptable then I will create a pull req

That'd be great! Let us know if you need help getting started.

@dargueta
Contributor Author

dargueta commented Mar 5, 2019

Oh I didn't mean making it the default, just providing a way to override the default behavior so that we can include it if desired without breaking backwards compatibility.

@TomAugspurger
Contributor

TomAugspurger commented Mar 5, 2019 via email

@dargueta
Contributor Author

dargueta commented Mar 5, 2019

Well if we change the default of index to be None like you said earlier, then passing True explicitly would indicate the intent, no?

@WillAyd
Member

WillAyd commented Mar 5, 2019

Can you use the table orient in this case? It at least provides the data that you are looking for, albeit with some extra metadata attached.

We already have quite a few JSON serialization formats so I'd be hesitant to add more, especially given this functionality overlaps already with one of the other formats
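
A brief sketch of what the table orient produces for the frame in the original report. The exact contents of the schema block can vary between pandas versions, so only the top-level layout is inspected here.

```python
import json

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"id": "abc", "data": "qwerty"},
        {"id": "def", "data": "uiop"},
    ],
    index="id",
)

# The table orient wraps the rows in an object with two keys:
# "schema" (the extra metadata) and "data" (one record per row).
parsed = json.loads(df.to_json(orient="table"))
print(sorted(parsed))
print(parsed["data"])
```

The records under "data" carry the index as an "id" field, which is the content the report asks for, but the surrounding object is not a bare array of records.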

@TomAugspurger
Contributor

TomAugspurger commented Mar 5, 2019 via email

@dargueta
Contributor Author

dargueta commented Mar 5, 2019

Can you use the table orient in this case? It at least provides the data that you are looking for, albeit with some extra metadata attached.

Sadly no, I'm dumping it using lines=True so that it can be read by a database. I didn't put that in my example to minimize the number of arguments that could be interacting.

We already have quite a few JSON serialization formats so I'd be hesitant to add more

I wasn't suggesting adding another format, only changing the behavior of index.

Hmm, yes. I worry a bit about people who were previously passing index=True

I wouldn't say that's an invalid concern, but it does make me question why someone would deliberately pass index=True and orient='records' if they weren't trying to get the behavior I wanted, based on what the documentation states. I imagine the change would be an issue for users who are dynamically constructing the arguments and are relying on index being ignored. At the risk of coming off as a pedantic jerk, I'd say relying on that behavior is unwise to begin with.
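
For the JSON Lines use case described in this comment, a sketch that combines the `.reset_index()` workaround with `lines=True`. The frame is the one from the original report; splitting the output and parsing each line is only there to show the shape of the result.

```python
import json

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"id": "abc", "data": "qwerty"},
        {"id": "def", "data": "uiop"},
    ],
    index="id",
)

# lines=True requires orient="records" and writes one JSON object per line.
jsonl = df.reset_index().to_json(orient="records", lines=True)

# Parse each non-empty line independently, as a line-oriented loader would.
rows = [json.loads(line) for line in jsonl.splitlines() if line]
print(rows)
```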

@coderop2

coderop2 commented Mar 6, 2019

@TomAugspurger If we are ignoring the index for the records orient, why can't we just set it to False? If we do, an error is thrown because of this check:
if not index and orient not in ['split', 'table']:
    raise ValueError("'index=False' is only valid when 'orient' is "
                     "'split' or 'table'")
If you add 'records' to that list and pass index=False, no error is thrown.

@ARDivekar

FYI. For those looking for a solution, I got around this by doing a simple: df['index_name'] = df.index before calling df.to_json(...)
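
A sketch of this variant on the frame from the original report. The column name "id" is chosen here to match the index name, and the assignment is done on a copy so the original frame is left untouched; both choices are assumptions for illustration.

```python
import json

import pandas as pd

df = pd.DataFrame.from_records(
    [
        {"id": "abc", "data": "qwerty"},
        {"id": "def", "data": "uiop"},
    ],
    index="id",
)

# Copy first: assigning a column on df itself would mutate the caller's frame.
out = df.copy()
# The new column is appended after the existing ones, so "id" comes last.
out["id"] = out.index
records = json.loads(out.to_json(orient="records"))
print(records)
```

Unlike `.reset_index()`, this keeps the index in place and puts the copied values in the last column rather than the first.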

@tommyjcarpenter

tommyjcarpenter commented Sep 10, 2021

I just want to chime in and say, the docs are very subtle w.r.t. this issue: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html

It is very easy to miss the line "Note that index labels are not preserved with this encoding."

The fact that orient=records, index=True is a LEGAL combination of parameters, yet it ... doesn't do that, is the unexpected behavior. Pandas basically says "we heard you but nah". I would prefer orient=records, index=True be an illegal combination and explicitly tell the user, "you can't do that" [for some reason] rather than just silently ignoring the intent.

Perhaps the solution above (df['index_name'] = df.index before calling df.to_json(...)) should be mentioned in those docs to help the reader understand what to do, other than googling and eventually finding this issue.

It would also be nice if index=True just did this for you, so that it "did" work.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
dshemetov added a commit to dshemetov/pandas that referenced this issue Mar 23, 2023
- split and table allow index=True/False
- records and values only allow index=False
- index and columns only allow index=True
- raise for contradictions in the latter two
- see pandas-dev#25513
mroeschke pushed a commit that referenced this issue May 11, 2023
…2143)

* API/BUG: Make to_json index= consistent with orient
- split and table allow index=True/False
- records and values only allow index=False
- index and columns only allow index=True
- raise for contradictions in the latter two
- see #25513

* style: lint

* style: make mypy happy

* review: simplify

* review: clarify and consolidate branches

* style: add explainer comment

* doc: change error message in _json

* docs: update whatsnew 2.1.0

* docs: sort whatsnew
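
The rules listed in the commit message can be read as a small lookup table. The sketch below transcribes them into plain Python; the names `ALLOWED_INDEX` and `check_index` are hypothetical, and this is not the code from the linked pull request.

```python
# Allowed explicit values of index= for each orient, transcribed from the
# commit message above. This table is an illustration, not pandas source code.
ALLOWED_INDEX = {
    "split": {True, False},
    "table": {True, False},
    "records": {False},
    "values": {False},
    "index": {True},
    "columns": {True},
}


def check_index(orient, index=None):
    """Hypothetical validator: raise if index= contradicts the orient."""
    if index is None:
        return  # nothing was requested, so nothing can contradict
    if index not in ALLOWED_INDEX[orient]:
        raise ValueError(f"index={index!r} is not valid when orient={orient!r}")
```

Under these rules the call from the original report, orient='records' with index=True, is a contradiction and is rejected rather than silently ignored.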
Rylie-W pushed a commit to Rylie-W/pandas that referenced this issue May 19, 2023
Daquisu pushed a commit to Daquisu/pandas that referenced this issue Jul 8, 2023
8 participants