BUG: Complex Numbers Not Imported Correctly Under JSON Read #50782

Open
3 tasks done
coatless opened this issue Jan 16, 2023 · 4 comments
Labels
Bug, Complex (Complex Numbers), Enhancement, IO JSON (read_json, to_json, json_normalize)

Comments

@coatless

Pandas version checks

Reproducible Example

import pandas as pd
import json

def json_conversion(df, orient_type = "values"):

  # convert dataframe to a JSON string
  json_str = df.to_json(orient=orient_type)

  # write the JSON string to a file
  with open('data.json', 'w') as f:
      json.dump(json_str, f)

  # read the JSON string from the file
  with open('data.json', 'r') as f:
      json_str = json.load(f)

  # Convert the JSON string back to a dataframe
  df2 = pd.read_json(json_str, orient=orient_type)

  return df2

# Create a dataframe with imaginary numbers
df = pd.DataFrame({'a': [1 + 2j, 3 + 4j], 'b': [5 + 6j, 7 + 8j]})
print(df)
#           a         b
# 0  1.0+2.0j  5.0+6.0j
# 1  3.0+4.0j  7.0+8.0j

# Check with `values`
df_values_json = json_conversion(df, "values")
print(df_values_json)
#                             0                           1
# 0  {'imag': 2.0, 'real': 1.0}  {'imag': 6.0, 'real': 5.0}
# 1  {'imag': 4.0, 'real': 3.0}  {'imag': 8.0, 'real': 7.0}


# Check with `table`
df_table_json = json_conversion(df, "table")
# TypeError: float() argument must be a string or a number, not 'dict'

Issue Description

When trying to re-create a dataframe with complex numbers using JSON, the pd.read_json() function has trouble with different orientations, e.g. orient="values" and orient="table". In particular, the reconstructed data frame either stores each number as a dictionary with "imag" and "real" entries or cannot be recreated at all because of a TypeError.

      a     b
0  1+2j  5+6j
1  3+4j  7+8j
JSON Output under `orient='values'`
[
    [
        {
            "imag":2.0,
            "real":1.0
        },
        {
            "imag":6.0,
            "real":5.0
        }
    ],
    [
        {
            "imag":4.0,
            "real":3.0
        },
        {
            "imag":8.0,
            "real":7.0
        }
    ]
]

This leads to the reconstructed data frame looking like so:

                            0                           1
0  {'imag': 2.0, 'real': 1.0}  {'imag': 6.0, 'real': 5.0}
1  {'imag': 4.0, 'real': 3.0}  {'imag': 8.0, 'real': 7.0}

In the case of orient='table', we have:

JSON Output under `orient='table'`
{
    "schema":{
        "fields":[
            {
                "name":"index",
                "type":"integer"
            },
            {
                "name":"a",
                "type":"number"
            },
            {
                "name":"b",
                "type":"number"
            }
        ],
        "primaryKey":[
            "index"
        ],
        "pandas_version":"0.20.0"
    },
    "data":[
        {
            "index":0,
            "a":{
                "imag":2.0
            },
            "b":{
                "imag":6.0
            }
        },
        {
            "index":1,
            "a":{
                "imag":4.0
            },
            "b":{
                "imag":8.0
            }
        }
    ]
}

The end output is a TypeError of:

TypeError: float() argument must be a string or a number, not 'dict'

Expected Behavior

Ideally, the original data frame should be reconstructed: up to column names in the values case, and identical to the original data frame in the table case.

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.147+
Version : #1 SMP Sat Dec 10 16:00:40 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.2
numpy : 1.21.6
pytz : 2022.7
dateutil : 2.8.2
setuptools : 57.4.0
pip : 22.0.4
Cython : 0.29.32
pytest : 3.6.4
hypothesis : None
sphinx : 3.5.4
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.9.5
jinja2 : 2.11.3
IPython : 7.9.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.2.2
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
snappy : None
sqlalchemy : 1.4.46
tables : 3.7.0
tabulate : 0.8.10
xarray : 2022.12.0
xlrd : 1.2.0
xlwt : 1.3.0
zstandard : None
tzdata : None

@coatless added the Bug and Needs Triage labels on Jan 16, 2023
@dicristina
Contributor

The documentation says that the output of df.to_json(orient="values") will be "just the values array" so it is not possible to recover the column names. To recover the actual complex values you can do something like:

df_values_json.applymap(lambda c: complex(**c))
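As a minimal sketch of that workaround, the snippet below rebuilds the complex values from the orient="values" JSON text shown in the issue. The helper name `restore_complex` is hypothetical, and the JSON is parsed with `json.loads` rather than `pd.read_json` so the example does not depend on a particular pandas version. It uses `Series.map` column by column instead of `applymap`, on the assumption that readers may be on a newer pandas where `applymap` is deprecated in favor of `DataFrame.map`.

```python
import json

import pandas as pd

# JSON text as produced by df.to_json(orient="values") in the issue.
values_json = (
    '[[{"imag":2.0,"real":1.0},{"imag":6.0,"real":5.0}],'
    '[{"imag":4.0,"real":3.0},{"imag":8.0,"real":7.0}]]'
)


def restore_complex(frame):
    """Hypothetical helper: turn {'real': ..., 'imag': ...} cells into complex values."""
    # complex() accepts `real` and `imag` as keyword arguments,
    # so the key order inside each dict does not matter.
    return frame.apply(
        lambda col: col.map(lambda cell: complex(**cell)).astype(complex)
    )


# A frame whose cells are dicts, like the one read back in the issue.
raw = pd.DataFrame(json.loads(values_json))

restored = restore_complex(raw)
# orient="values" drops the column names, so they have to be supplied again.
restored.columns = ["a", "b"]
print(restored)
```

The column names are reassigned by hand because, as noted above, the "values" orientation does not carry them.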

Unlike the previous case, with orient="table" we have the data type of each column, so in theory we should be able to reconstruct the values without any extra work. The problem is that there is no special handling for complex numbers, and when the values are read they are passed to the float function. The relevant code is in pandas/io/json/_table_schema.py.

@topper-123
Contributor

Table Schema doesn't seem to define a field type for complex numbers, so this can't be fixed in pandas as long as we follow Table Schema. I'm not an expert on Table Schema, so if I'm wrong there, I'd appreciate feedback.

So I agree that the solution proposed by @dicristina using apply/applymap is the best available right now, and I don't think this is fixable while following Table Schema.

@topper-123 added the IO JSON, Complex, and Enhancement labels and removed the Needs Triage label on Jan 25, 2023
@dicristina
Contributor

There is a mechanism already in place to add an extDtype key to the field descriptor for extension types. When the table is read the data type indicated by this key is the one used instead of the one derived from the type key. Maybe this can be used for complex numbers even though they are not a pandas extension type.

Even when the correct data type is contained in the field descriptor the representation of the complex numbers presents a small problem. The parse_table_schema function builds a mapping of dtypes and then calls df.astype(dtypes). This does not work when we have the complex numbers represented as a dictionary.
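The snippet below is a small stand-alone illustration of that last point, not the actual `parse_table_schema` code: a column whose cells are dicts cannot be cast with `astype`, while converting each cell with `complex(**cell)` first does work. The variable names are made up for the example.

```python
import pandas as pd

# A column as it looks after the JSON has been parsed: each cell is a dict.
cells = pd.Series(
    [{"imag": 2.0, "real": 1.0}, {"imag": 4.0, "real": 3.0}], name="a"
)

# Casting the dict cells directly fails, which mirrors the error in the issue.
try:
    cells.astype("float64")
    cast_error = None
except (TypeError, ValueError) as exc:
    cast_error = exc
print(type(cast_error).__name__, cast_error)

# Converting cell by cell first gives a proper complex column.
as_complex = cells.map(lambda cell: complex(**cell)).astype(complex)
print(as_complex)
```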

@topper-123
Contributor

Yes, I agree.

Looking at the Table Schema number definition, a dict doesn't look like a legal value for a "number" field, so the current behavior is a bit strange.

Maybe complex numbers should have type "object" instead (i.e. allowing the dict) and an extDtype field with value "complex". That is, type "object" would by default be read in as a JSON-like object (i.e. the result of json.loads in Python), except that if the field has an "extDtype" with value "complex", it would be converted to a complex type using complex(**val)?
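To make the proposal concrete, here is a rough sketch of how such a payload could be read. Everything in it is an assumption for illustration: pandas does not emit "type": "object" with "extDtype": "complex" for complex columns today, the payload is written by hand, and `read_proposed_table` is a hypothetical stand-alone function rather than a change to `parse_table_schema`.

```python
import json

import pandas as pd

# Hand-written payload in the proposed (hypothetical) form.
proposed_json = """
{
  "schema": {
    "fields": [
      {"name": "index", "type": "integer"},
      {"name": "a", "type": "object", "extDtype": "complex"},
      {"name": "b", "type": "object", "extDtype": "complex"}
    ],
    "primaryKey": ["index"]
  },
  "data": [
    {"index": 0, "a": {"real": 1.0, "imag": 2.0}, "b": {"real": 5.0, "imag": 6.0}},
    {"index": 1, "a": {"real": 3.0, "imag": 4.0}, "b": {"real": 7.0, "imag": 8.0}}
  ]
}
"""


def read_proposed_table(text):
    """Hypothetical reader for the proposed schema, for illustration only."""
    payload = json.loads(text)
    schema = payload["schema"]
    frame = pd.DataFrame(payload["data"])
    for field in schema["fields"]:
        # Only fields explicitly marked as complex are converted;
        # other "object" fields stay as plain JSON-like objects.
        if field.get("type") == "object" and field.get("extDtype") == "complex":
            name = field["name"]
            frame[name] = (
                frame[name].map(lambda val: complex(**val)).astype(complex)
            )
    return frame.set_index(schema["primaryKey"])


table_df = read_proposed_table(proposed_json)
print(table_df)
```

Keying the conversion on the field descriptor rather than on the cell contents means ordinary dict-valued "object" columns would be left untouched.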
