BUG: read_json silently skipping records? #4359
Comments
I just tried to open the pandas-created JSON file with the json module and I got the error below.

can you provide a minimal reproducible example? Please post it.
When I call pd.read_json(...) on the JSON string below, I only get one DataFrame row back.
can you post what you SHOULD get back? simplejson returns this, which AFAICT is only 1 row as well.
@tdhopper can you post, say, a …
The problem is that article_id is an index column, but it is not unique. Try this.
Should DataFrame.to_json() give a warning when index values are not unique?
yep.... I think it should actually raise. So: non-unique index; what about any others? hmm.... need some tests for this... will mark as a bug
Yeah, both. Its output actually contains the duplicated keys, but they are silently dropped by the JSON parser when reading the JSON string back in. Note:

In [30]: df = pd.DataFrame([['a','b'],['c','d']],index=[1,1],columns=['x','y'])
In [31]: df
Out[31]:
x y
1 a b
1 c d
In [32]: df.to_json()
Out[32]: '{"x":{"1":"a","1":"c"},"y":{"1":"b","1":"d"}}'
In [33]: json.loads(df.to_json())
Out[33]: {u'x': {u'1': u'c'}, u'y': {u'1': u'd'}}
In [34]: df.to_dict()
Out[34]: {'x': {1: 'c'}, 'y': {1: 'd'}}
In [35]: df.to_json(orient='columns')
Out[35]: '{"x":{"1":"a","1":"c"},"y":{"1":"b","1":"d"}}'
In [36]: df = pd.DataFrame([['a','b'],['c','d']],index=[1,1],columns=['x','x'])
In [37]: df
Out[37]:
x x
1 a b
1 c d
In [38]: df.to_json()
Out[38]: '{"x":{"1":"a","1":"c"},"x":{"1":"b","1":"d"}}'
In [39]: json.loads(df.to_json())
Out[39]: {u'x': {u'1': u'd'}}

So what do you think: raise an exception prompting the user to either uniquify the data or choose a different …?
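The silent key collapse in the transcripts above comes from the stdlib decoder, which builds a plain dict and keeps only the last value for a repeated key. A small sketch of how to surface the problem on the reading side, using json's object_pairs_hook (the reject_duplicates helper is my own illustration, not a pandas API):

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that raises on duplicate keys instead of
    silently keeping only the last value, as the default dict does."""
    d = {}
    for key, value in pairs:
        if key in d:
            raise ValueError("duplicate JSON key: %r" % (key,))
        d[key] = value
    return d

# The hook is applied to nested objects too, so the duplicated "1"
# keys produced by to_json() are caught instead of being dropped.
try:
    json.loads('{"x":{"1":"a","1":"c"}}', object_pairs_hook=reject_duplicates)
except ValueError as exc:
    print(exc)
```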
I think it should raise in the writing (and to_dict should change too). You can check index.is_unique; your post above is basically the tests. lmk
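The index.is_unique check suggested above can be sketched as a small guard on the writing side (checked_to_json is a hypothetical helper of mine, not a pandas API):

```python
import pandas as pd

def checked_to_json(df, **kwargs):
    """Refuse to serialize frames whose index or columns have duplicate
    labels, since orient='columns' output would otherwise contain
    duplicate JSON keys that parsers silently collapse."""
    if not df.index.is_unique:
        raise ValueError("DataFrame index is not unique")
    if not df.columns.is_unique:
        raise ValueError("DataFrame columns are not unique")
    return df.to_json(**kwargs)

ok = pd.DataFrame([[1]], index=[0], columns=['a'])
print(checked_to_json(ok))   # serializes normally
```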
FYI In [48]: df
Out[48]:
x x
1 a b
1 c d
In [47]: df.to_dict()
pandas/core/frame.py:984: UserWarning: DataFrame columns are not unique, some columns will be omitted.
"columns will be omitted.", UserWarning)
Out[47]: {'x': {1: 'd'}}
it will be undefined what is returned from …
OK, I'll put together a PR for the JSON and to_dict changes. IMO it should just be a warning for to_dict, as it deals with the problem, but it should be an exception for to_json, as it ends up producing invalid JSON.
how does to_dict deal with this? (aside from the warning)
I'm not sure why this would be useful, since you cannot predict which columns will be returned.
I would definitely like a big honking exception if I tried to do this, since I unpredictably lose information.
It deals with the problem in the sense that it produces valid output, unlike to_json:

In [65]: dict((('a',1),('a',2)))
Out[65]: {'a': 2}
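For what it's worth, a sketch of a lossless alternative on the to_dict side (my suggestion, not something from this thread): the 'split' orientation keeps duplicate column labels as list entries rather than dict keys, so nothing is silently dropped.

```python
import pandas as pd

# The duplicate-column frame from the transcript above.
df = pd.DataFrame([['a', 'b'], ['c', 'd']], index=[1, 1], columns=['x', 'x'])

# 'split' returns index, columns, and data as separate lists, so the
# repeated 'x' label and both rows are preserved.
d = df.to_dict(orient='split')
print(d['columns'])   # ['x', 'x']
print(d['data'])      # [['a', 'b'], ['c', 'd']]
```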
ok, let's leave to_dict for now (though I still may make a PR to fix this); ideally you can provide a recommendation (could be generic)
I'm +1 on raising an exception; force the user to deal with it.
closed via #4376
Thanks everyone!
should raise when orient='columns' and index is non-unique; orient='index' and column is non-unique?

Original issue description: I'm trying out to_json and read_json on a DataFrame with 800k rows. However, after calling to_json on the file, read_json gets back only 2k rows. This happens whether I call them in sequence or give to_json a filename and read that filename back with read_json. Judging by the size of the file, all the data is being written (the JSON is roughly the size of the pickled DataFrame object). Any idea what's going on?
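A quick check of the resolution (behavior assumed from recent pandas releases, not verified against the exact PR referenced above): to_json with the default orient='columns' now refuses a non-unique index instead of emitting duplicate keys that parsers would silently drop.

```python
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['c', 'd']], index=[1, 1], columns=['x', 'y'])
try:
    df.to_json()        # default orient='columns' requires a unique index
except ValueError as exc:
    print(exc)
```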