BUG: read_json silently skipping records? #4359
Comments
I just tried to open the pandas-created JSON file with the json module and I got the error below.

can you provide a minimal reproducible example? Please post it.
When I call pd.read_json(...) on the JSON string below, I only get one DataFrame row back.
can you post what you SHOULD get back? simplejson returns this, which AFAICT is only 1 row as well.
@tdhopper can you post, say, a …
The problem is that article_id is an index column, but it is not unique. Try this.
Should DataFrame.to_json() give a warning when index values are not unique?
yep.... I think it should actually raise. So: non-unique index; what about any others? hmm.... need some tests for this... will mark as a bug
Yeah, both. Its output actually contains the duplicated keys, but they are silently dropped by the JSON parser when reading the JSON string back in. Note:

In [30]: df = pd.DataFrame([['a','b'],['c','d']],index=[1,1],columns=['x','y'])
In [31]: df
Out[31]:
x y
1 a b
1 c d
In [32]: df.to_json()
Out[32]: '{"x":{"1":"a","1":"c"},"y":{"1":"b","1":"d"}}'
In [33]: json.loads(df.to_json())
Out[33]: {u'x': {u'1': u'c'}, u'y': {u'1': u'd'}}
In [34]: df.to_dict()
Out[34]: {'x': {1: 'c'}, 'y': {1: 'd'}}
In [35]: df.to_json(orient='columns')
Out[35]: '{"x":{"1":"a","1":"c"},"y":{"1":"b","1":"d"}}'
In [36]: df = pd.DataFrame([['a','b'],['c','d']],index=[1,1],columns=['x','x'])
In [37]: df
Out[37]:
x x
1 a b
1 c d
In [38]: df.to_json()
Out[38]: '{"x":{"1":"a","1":"c"},"x":{"1":"b","1":"d"}}'
In [39]: json.loads(df.to_json())
Out[39]: {u'x': {u'1': u'd'}}

So what do you think: raise an exception prompting the user to either uniquify the data or choose a different …?
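The silent key collapse in the transcripts above comes from the stdlib decoder, which builds a plain dict and keeps only the last value for a repeated key. A small sketch of how to surface the problem on the reading side, using json's object_pairs_hook (the reject_duplicates helper is my own illustration, not a pandas API):

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises on duplicate keys instead of
    silently keeping only the last value, as the default dict does."""
    d = {}
    for key, value in pairs:
        if key in d:
            raise ValueError("duplicate JSON key: %r" % (key,))
        d[key] = value
    return d

# The hook is applied to nested objects too, so the duplicated "1"
# keys produced by to_json() are caught instead of being dropped.
try:
    json.loads('{"x":{"1":"a","1":"c"}}', object_pairs_hook=reject_duplicates)
except ValueError as exc:
    print(exc)
```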
I think it should raise in the writing (and to_dict should change too). You can check index.is_unique; your post above is basically the tests. lmk
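The index.is_unique check suggested above can be sketched as a small guard on the writing side (checked_to_json is a hypothetical helper of mine, not a pandas API):

```python
import pandas as pd

def checked_to_json(df, **kwargs):
    """Refuse to serialize frames whose index or columns have duplicate
    labels, since orient='columns' output would otherwise contain
    duplicate JSON keys that parsers silently collapse."""
    if not df.index.is_unique:
        raise ValueError("DataFrame index is not unique")
    if not df.columns.is_unique:
        raise ValueError("DataFrame columns are not unique")
    return df.to_json(**kwargs)

ok = pd.DataFrame([[1]], index=[0], columns=['a'])
print(checked_to_json(ok))   # serializes normally
```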
FYI In [48]: df
Out[48]:
x x
1 a b
1 c d
In [47]: df.to_dict()
pandas/core/frame.py:984: UserWarning: DataFrame columns are not unique, some columns will be omitted.
"columns will be omitted.", UserWarning)
Out[47]: {'x': {1: 'd'}}
it will be undefined what is returned from …
OK, I'll put together a PR for the JSON and to_dict changes. IMO it should just be a warning for to_dict, as it deals with the problem, but it should be an exception for to_json, as it ends up producing invalid JSON.
how does to_dict deal with this? (aside from the warning)
I'm not sure why this would be useful, since you cannot predict which columns will be returned.
I would definitely like a big honking exception if I tried to do this, since I unpredictably lose information.
It deals with the problem in the sense that it produces valid output, unlike to_json:

In [65]: dict((('a',1),('a',2)))
Out[65]: {'a': 2}
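For what it's worth, a sketch of a lossless alternative on the to_dict side (my suggestion, not something from this thread): the 'split' orientation keeps duplicate column labels as list entries rather than dict keys, so nothing is silently dropped.

```python
import pandas as pd

# The duplicate-column frame from the transcript above.
df = pd.DataFrame([['a', 'b'], ['c', 'd']], index=[1, 1], columns=['x', 'x'])

# 'split' returns index, columns, and data as separate lists, so the
# repeated 'x' label and both rows are preserved.
d = df.to_dict(orient='split')
print(d['columns'])   # ['x', 'x']
print(d['data'])      # [['a', 'b'], ['c', 'd']]
```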
ok, let's leave to_dict for now (though I still may make a PR to fix this); ideally you can provide a recommendation (could be generic)
I'm +1 on raising an exception; force the user to deal with it.
closed via #4376
Thanks everyone!
should raise when orient='columns' and index is non-unique; orient='index' and column is non-unique?

Original issue description: I'm trying out to_json and read_json on a DataFrame with 800k rows. However, after calling to_json on the file, read_json gets back only 2k rows. This happens whether I call them in sequence or give to_json a filename and read that filename back with read_json. Judging by the size of the file, all the data is being written (the JSON is roughly the size of the pickled DataFrame object). Any idea what's going on?
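A quick check of the resolution (behavior assumed from recent pandas releases, not verified against the exact PR referenced above): to_json with the default orient='columns' now refuses a non-unique index instead of emitting duplicate keys that parsers would silently drop.

```python
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd']], index=[1, 1], columns=['x', 'y'])
try:
    df.to_json()        # default orient='columns' requires a unique index
except ValueError as exc:
    print(exc)
```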