Incorrect dataframe serialisation if you have multiple columns starting with digits and the first alphabetically sorted column has a NaN value #485

fdorssers · 2022-08-08T13:01:29Z

Steps to reproduce:
I've made a small Python script that showcases the issue:

import pandas as pd
import numpy as np
from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import PointSettings
from influxdb_client.client.write.dataframe_serializer import data_frame_to_list_of_points

conn = InfluxDBClient(
    url="http://localhost:8086",
    token="<token>",
    org="<org>",
)
write_api = conn.write_api()
df = pd.DataFrame(
    index=[pd.Timestamp("2022-07-29 00:01:00", tz="UTC")],
    data={
        "1_col": np.nan,
        "2_col": 1.1,
        "a_col": 2.2
    }
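)
# Assumed write call (missing from the snippet as posted); the bucket is a
# placeholder and the measurement name is taken from the error message below.
write_api.write(
    bucket="<bucket>",
    record=df,
    data_frame_measurement_name="test_measurement",
)
# The batch item wasn't processed successfully because: (400)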
# Reason: Bad Request
# HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json; charset=utf-8', 'X-Influxdb-Build': 'OSS', 'X-Influxdb-Version': 'v2.3.0+SNAPSHOT.090f681737', 'X-Platform-Error-Code': 'invalid', 'Date': 'Mon, 08 Aug 2022 07:48:43 GMT', 'Content-Length': '128'})
# HTTP response body: {"code":"invalid","message":"unable to parse 'test_measurement ,2_col=1.1,a_col=2.2 1659052860000000000': invalid field format"}

data_frame = pd.DataFrame(data={
    '1value': [np.nan],
    'avalue': [  30.0],
    'bvalue': [  30.0]
}, index=pd.period_range('2020-05-24 10:00', freq='H', periods=1))

points = data_frame_to_list_of_points(data_frame,
                                      PointSettings(),
                                      data_frame_measurement_name='test')
# ['test avalue=30.0,bvalue=30.0 1590314400000000000'] ✅

data_frame = pd.DataFrame(data={
    '1value': [np.nan,   30.0, np.nan,   30.0, np.nan],
    '2value': [  30.0, np.nan, np.nan, np.nan, np.nan],
    '3value': [  30.0,   30.0,   30.0, np.nan, np.nan],
    'avalue': [  30.0,   30.0,   30.0,   30.0,   30.0]
}, index=pd.period_range('2020-05-24 10:00', freq='H', periods=5))

points = data_frame_to_list_of_points(data_frame,
                                      PointSettings(),
                                      data_frame_measurement_name='test')
# ['test ,2value=30.0,3value=30.0,avalue=30.0 1590314400000000000', ❌
#  'test 1value=30.0,3value=30.0,avalue=30.0 1590318000000000000',  ✅
#  'test ,3value=30.0,avalue=30.0 1590321600000000000',             ❌
#  'test 1value=30.0,avalue=30.0 1590325200000000000',              ✅
#  'test avalue=30.0 1590328800000000000']                          ✅
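
The malformed lines in the output above share one symptom: the field set begins with a comma. A minimal sketch of a check for that symptom follows. It assumes the simple line shape seen in this report (no tags, no escaped spaces), and the helper name has_leading_comma is hypothetical, not part of influxdb_client.

```python
def has_leading_comma(line: str) -> bool:
    """Return True if the field set of a line-protocol line starts with a comma.

    Assumes the shape '<measurement> <fields> <timestamp>' seen in this
    report: no tags and no escaped spaces.
    """
    parts = line.split(" ")
    # parts[0] is the measurement; parts[1] is the field set, if present.
    return len(parts) > 1 and parts[1].startswith(",")


# Lines copied from the serializer output shown in this report.
lines = [
    "test ,2value=30.0,3value=30.0,avalue=30.0 1590314400000000000",
    "test 1value=30.0,3value=30.0,avalue=30.0 1590318000000000000",
    "test ,3value=30.0,avalue=30.0 1590321600000000000",
    "test 1value=30.0,avalue=30.0 1590325200000000000",
    "test avalue=30.0 1590328800000000000",
]
broken = [line for line in lines if has_leading_comma(line)]
print(len(broken))  # 2
```

Running a check like this over the list returned by data_frame_to_list_of_points flags the rows that InfluxDB would reject before anything is sent.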

Expected behavior:
A DataFrame with multiple columns whose names start with a digit should be written to InfluxDB successfully, even when some of those columns contain NaN values.
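
To make the expectation concrete, here is a simplified sketch of the result expected for a single row: NaN fields are dropped before the remaining fields are joined, so no leading comma can appear. The function row_to_line is hypothetical and is not the library's implementation; it handles float fields only and ignores tags and escaping.

```python
import math


def row_to_line(measurement: str, row: dict, timestamp_ns: int) -> str:
    """Serialise one row to line protocol, skipping NaN fields.

    Hypothetical helper: float fields only, no tags, no escaping.
    """
    fields = [
        f"{name}={value}"
        for name, value in sorted(row.items())
        if not math.isnan(value)
    ]
    if not fields:
        # Design choice for this sketch: a row with no fields yields no line.
        return ""
    field_set = ",".join(fields)
    return f"{measurement} {field_set} {timestamp_ns}"


# First row of the five-row DataFrame from the reproduction.
row = {"1value": math.nan, "2value": 30.0, "3value": 30.0, "avalue": 30.0}
print(row_to_line("test", row, 1590314400000000000))
# test 2value=30.0,3value=30.0,avalue=30.0 1590314400000000000
```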

Actual behavior:
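With the columns sorted alphabetically, the write fails when the first column's name starts with a digit and its value is NaN, and the next column that has a value also has a name starting with a digit. The serialised line then begins its field set with a stray comma (for example 'test ,2value=30.0,...'), which InfluxDB rejects with "invalid field format". If the next column that has a value has a name starting with a letter, the row is serialised correctly.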

Time	1_col	2_col	3_col	a_col	Result
...	NaN	30.0	30.0	30.0	Fails
...	30.0	NaN	30.0	30.0	Works
...	NaN	NaN	30.0	30.0	Fails
...	30.0	NaN	NaN	30.0	Works
...	NaN	NaN	NaN	30.0	Works

However, if the first column is the only one whose name starts with a digit, there is no problem.

Time	1_col	a_col	Result
...	NaN	30.0	Works
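
The pattern in the two tables can be restated as a small predicate. This only encodes the behaviour observed in this report, not the serializer's actual logic, and the function name predicts_failure is made up for illustration. It assumes a non-empty row of float values.

```python
import math


def predicts_failure(row: dict) -> bool:
    """Apply the observed rule to one non-empty row of column -> float value.

    Predicts a failure when the alphabetically first column starts with a
    digit and is NaN, and the next column that has a value also starts with
    a digit.
    """
    columns = sorted(row)
    first = columns[0]
    if not (first[0].isdigit() and math.isnan(row[first])):
        return False
    for name in columns[1:]:
        if not math.isnan(row[name]):
            return name[0].isdigit()
    return False  # every column is NaN, so there is nothing to serialise


nan = math.nan
# (row, fails) pairs copied from the two tables above.
table = [
    ({"1_col": nan, "2_col": 30.0, "3_col": 30.0, "a_col": 30.0}, True),
    ({"1_col": 30.0, "2_col": nan, "3_col": 30.0, "a_col": 30.0}, False),
    ({"1_col": nan, "2_col": nan, "3_col": 30.0, "a_col": 30.0}, True),
    ({"1_col": 30.0, "2_col": nan, "3_col": nan, "a_col": 30.0}, False),
    ({"1_col": nan, "2_col": nan, "3_col": nan, "a_col": 30.0}, False),
    ({"1_col": nan, "a_col": 30.0}, False),
]
print(all(predicts_failure(row) == fails for row, fails in table))  # True
```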

Other:
I've also opened a PR for this: #486. I wasn't sure whether a PR alone was enough or whether an issue was also required, so I created both.

Specifications:

Client Version: 1.31.0
InfluxDB Version: 2.3.0 (docker)
Platform: Intel MacBook Pro running macOS 12.5

fdorssers mentioned this issue Aug 8, 2022

fix: serialization of dataframes with NaN values and columns starting with digits #486

Merged

6 tasks

powersj closed this as completed in #486 Aug 8, 2022

bednar added this to the 1.32.0 milestone Aug 8, 2022
