This repository was archived by the owner on Oct 29, 2024. It is now read-only.

Incorrect nanosecond timestamps being written to influxdb #649

Closed
auphofBSF opened this issue Oct 13, 2018 · 12 comments · Fixed by #811

Comments


auphofBSF commented Oct 13, 2018

Writing points with time precision in ns results in incorrect nanosecond timestamps being written to InfluxDB.
The effect can be noticed when writing points from both DataFrameClient.write_points as well as InfluxDBClient.write.
This has been noted in issues #527, #489, #346, #344 and #340

and a proposed pull request has been submitted, #346. I believe this pull request does not go far enough.

#489 describes the same "floor divide" solution as part of the solution being proposed here.

The solution I am proposing in a pull request uses floor division and pandas.Timestamp, which supports ns resolution.

Implication

In trying to use the library to update a record, it writes a new record with the erroneous timestamp instead.

Affected code regions

_dataframe_client
_convert_dataframe_to_json
_convert_dataframe_to_lines
_datetime_to_epoch
line_protocol.py
_convert_timestamp

Explanation:

Dividing an (np.int64 or int) nanosecond timepoint by 1 (ns precision) produces errors. It is necessary to use the floor-division operator //, which yields the correct (np.int64 or int) value.

If an np.int64 is divided (operator /) by an int, it yields a float64. When the initial np.int64 value is large enough, the conversion back to np.int64 loses nanosecond precision. The same error occurs with standard Python int types.

Example (formulated in an IPython notebook)

The dataframe below is from a query (an InfluxDB measurement having particular field value errors) that returned 2 timepoints:
dataframe.index  --->   DatetimeIndex(['2018-05-29 00:53:12.889962156+00:00', '2018-10-03 07:06:36.975643599+00:00'], dtype='datetime64[ns, UTC]', freq=None)
dataframe.index[1]  ---> Timestamp('2018-10-03 07:06:36.975643599+0000', tz='UTC')
dataframe.index[1].value  ---> 1538550396975643599

######    alternative method for timepoint fabrication   #########
timepoint = pd.Timestamp('2018-10-03 07:06:36.975643599+0000', tz='UTC')
timepoint.value   ---> 1538550396975643599

type(timepoint.value)   ---> int
timepoint.value / 1  ----> 1.5385503969756436e+18
type(timepoint.value / 1) ----> float
type(timepoint.value // 1) ----> int
np.int64(timepoint.value // 1)   ---> 1538550396975643599
np.int64(timepoint.value / 1)  ---> 1538550396975643648

Error(ns) is

np.int64(timepoint.value // 1) - np.int64(timepoint.value / 1)  ---> -49
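The transcript above condenses into a runnable check (plain numpy; the constant is the second timepoint from the query):

```python
import numpy as np

# Timestamp('2018-10-03 07:06:36.975643599+0000').value from the transcript.
ns = np.int64(1538550396975643599)

# True division promotes to float64; 19 significant digits exceed the
# 53-bit mantissa, so converting back to int64 changes the value.
lossy = np.int64(ns / 1)

# Floor division stays in integer arithmetic and is exact.
exact = ns // 1

print(exact - lossy)  # -49
```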

######## Check using unit test timepoint ################

EPOCH = pd.Timestamp('1970-01-01 00:00+00:00')
nowplus1h = EPOCH + pd.Timedelta('1 hour')
nowplus1h.value ---> 3600000000000
nowplus1h.value / 1 ---> 3600000000000.0
np.int64(nowplus1h.value / 1) ---> 3600000000000

No Error in this

######### Suggested unit test timepoint ################

futuretimepoint = EPOCH + pd.Timedelta('20000 day  +23:10:55.123456789')
futuretimepoint.value ---> 1728083455123456789
futuretimepoint.value / 1 ---> 1.7280834551234568e+18
np.int64(futuretimepoint.value / 1) ---> 1728083455123456768

Error(ns) is

futuretimepoint.value - np.int64(futuretimepoint.value / 1) ---> 21

All locations where timepoints are calculated need modification to yield the expected result at nanosecond precision.
The unit tests do not show this up, as the 2 test timepoints are small enough not to expose the loss of precision:
the points are EPOCH '1970-01-01 00:00+00:00' and EPOCH + 1 hour. These 2 test points have only microsecond resolution with no nanosecond component.
I propose that test case 2 be something like EPOCH + 20000 days 23h 10m 55.123456789s. It is necessary to change all calculations based on datetime (which only has microsecond resolution) to pandas Timedelta and Timestamp (which have nanosecond resolution).
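A sketch of the suggested second test point, showing why it catches the bug while EPOCH + 1 hour does not:

```python
import pandas as pd
import numpy as np

EPOCH = pd.Timestamp('1970-01-01 00:00+00:00')

# Existing test point: small value, survives a float64 round trip intact.
nowplus1h = EPOCH + pd.Timedelta('1 hour')
assert np.int64(nowplus1h.value / 1) == nowplus1h.value

# Suggested test point: large value with a nanosecond component,
# corrupted by a float64 round trip but exact under floor division.
futuretimepoint = EPOCH + pd.Timedelta('20000 days 23:10:55.123456789')
assert futuretimepoint.value == 1728083455123456789
assert np.int64(futuretimepoint.value / 1) != futuretimepoint.value
assert futuretimepoint.value // 1 == futuretimepoint.value
```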

I am preparing a pull request that attempts to address all timestamp calculations and fixes the unit tests.

Product version where issue discovered and where fixes are being tested

pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 70 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2

@auphofBSF
Author

I have noted a further issue, #489, that identifies the same solution.

auphofBSF added a commit to auphofBSF/influxdb-python that referenced this issue Oct 14, 2018
extensive but not fully tested, unittests updates to follow
also fixes ( influxdata#527,influxdata#489, influxdata#346, influxdata#344, influxdata#340)
auphofBSF added a commit to auphofBSF/influxdb-python that referenced this issue Oct 14, 2018
auphofBSF added a commit to auphofBSF/influxdb-python that referenced this issue Oct 15, 2018
auphofBSF added a commit to auphofBSF/influxdb-python that referenced this issue Oct 15, 2018

BeckmaR commented Feb 21, 2019

Same here. Using precision='ms' gives wrong timestamps, which ruined a database for me due to records getting out of order. Floats are really not a good idea here.

@auphofBSF
Author

A further issue has been identified by @tpietruszka: microsecond timepoints are not accurately saved when storing datapoints from a dataframe. This possibly affects saving timepoints at any precision.

Links and further notes are on the comments of the pull request #650 (comment)


cuxcrider commented Oct 8, 2019

Hi all, Thanks for the active work on this, I just discovered this issue today in a script I am writing to sync two influx nodes. Fortunately I noticed it before unleashing the script on all my databases.

I tried to review the progress on this issue, but I cannot really figure out why it is not merged into master yet. Does it actually take care of ns precision properly now, but is not passing some build checks that look to be related to pypy? Does this mean that unless you need pypy, all should be good?

Author

auphofBSF commented Oct 9, 2019

Hi @cuxcrider, whilst this PR now works great for ns-resolution timestamps, there are 2 open issues:

  1. (Not major) unit-test failure for pypy - only a config issue; I now have instructions to fix it, so will look at doing that.
  2. (Major for microsecond and other non-ns resolutions) as identified by @tpietruszka in #650 (comment) and then further analysed in #650 (comment)

The last time I looked at it, I adjusted the test cases for the issue that was identified. The modified test cases are in a branch on my fork, https://github.com/auphofBSF/influxdb-python/tree/bug/microsecond_issue_tpietruszka. These test-case changes need to be merged into this PR, and the necessary changes made to support passing these adjusted test cases.

From what I recall, and as described in the comment #650 (comment), we can fix this; we just need to make a decision about truncation or rounding when converting between different resolutions.
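The decision being referred to can be sketched for a ns-to-µs conversion (illustrative values only, reusing the timestamp from this issue):

```python
# Converting the ns timestamp from this issue down to microsecond precision.
ns = 1538550396975643599

# Option 1: truncate toward the epoch with floor division.
us_truncated = ns // 1000            # the trailing 599 ns are dropped

# Option 2: round to the nearest microsecond.
us_rounded = (ns + 500) // 1000      # 599 ns rounds up to the next microsecond

print(us_truncated, us_rounded)
```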

I will relook at this and see what can be done to get this PR fully resolved

@auphofBSF
Author

I have tested locally the changes to fix the precision conversions identified by @tpietruszka. I will merge these into this PR over the next 24 hrs with updated tests and try to resolve all outstanding build checks for pypy.

@cuxcrider

@auphofBSF awesome! As far as I can tell your branch is working properly for me on nanoseconds, which I guess you already know. The 'microsecond issue' seems more challenging to get to behave consistently, but it sounds like you have it nearly working.

Also, in case it helps someone, using the standard master branch influxdb python package I did in fact inject a bunch of data points that are off by a nanosecond (or a few nanoseconds?) leading to effectively duplicate datapoints in my database. As long as you do not care about having the "correct" nanosecond time of your data, you can downsample from your database and keep only one value from within an interval.

I tried using influxdb aggregate functions, but that was turning into a nightmare because it renames field keys into things like "max_field_value", "distinct_field_value". So I am opting to use this python package and implement pandas resample function.

i.e.: df_downsample = df.resample('S').first()  # T = minutes, S = seconds, L = milliseconds

I think the only "gotcha" here is that you have no way of knowing the true nanosecond timestamp of the originally recorded data.
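As a sketch of that workaround (the data is invented to mimic the near-duplicate timestamps described above):

```python
import pandas as pd

# Two readings whose timestamps differ only at the nanosecond level,
# i.e. effectively duplicate datapoints after the off-by-a-few-ns writes.
idx = pd.to_datetime(['2019-10-04 06:27:19.850557111',
                      '2019-10-04 06:27:19.850557184'])
df = pd.DataFrame({'field_value': [1.1, 1.1]}, index=idx)

# Keep the first value in each 1-second bucket; the true nanosecond
# timestamps are lost, which is the acknowledged trade-off.
df_downsample = df.resample('S').first()

print(len(df_downsample))  # 1
```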

@cuxcrider

I came across one issue in your branch:

It seems that using the dataframe client to write points will write dataframe columns that contain NaNs as strings. Or in other words, it won't write the data, thanks to Influx's inflexibility in that regard. I saw this is well documented here:
#422
#195
I reverted back to the master branch and my writes worked again.
Let me know if you're seeing different behavior; it's been a long few days, I could be doing something dumb!
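One possible client-side mitigation for the NaN columns (a sketch only, not the library's behaviour) is to drop NaN fields per row before writing, since InfluxDB has no null field value to receive them:

```python
import numpy as np
import pandas as pd

# Frame shaped like the repro below: one column with a NaN, one without.
df = pd.DataFrame({'nanFloats': [1.1, np.nan, 3.3],
                   'onlyFloats': [1.1, 2.2, 3.3]})

# Drop NaN fields row by row so only present values reach the write path.
rows = [row.dropna().to_dict() for _, row in df.iterrows()]

print(rows[1])  # {'onlyFloats': 2.2}
```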

Author

auphofBSF commented Oct 10, 2019

Hello @cuxcrider can you please confirm the commit of the branch you are using is 7e161e2 which incorporates the precision conversion fixes merged in today.

What version of Pandas, Numpy, Python and platform are you running ?

import pandas as pd
pd.show_versions()

Can you also provide a snippet of code to illustrate issue. All the current functional tests are passing so this may be something not tested for. With this info will look further into the documented issues you have listed.

I have scanned through the code of this PR and compared it against current master 3d61f1f and on first pass can't identify a probable cause, so will have to dig a little further

@cuxcrider

I am now on your most up-to-date commit 7e161e2, and it seems to still have the issues.
I am on Python 3.7.1, pandas 0.23.4, numpy 1.15.4.

Here is code that runs fine on the official influxdb repo but breaks on your branch 7e161e2.
You just have to add in your host, user, and password to run it. Sorry for the long timestamps; I was having trouble creating nanosecond datetimes, so I just gave up and literally pulled some from a database.

from influxdb import InfluxDBClient, DataFrameClient
import pandas as pd
import numpy as np
pd.show_versions()

#I use this make sure I write to influxdb anything that is a number as a float
def df_int_to_float(df):
    try:
        for i in df.select_dtypes('number').columns.values:
            df[i] = df[i].astype('float64')
    except:
        print('cycle not in dataframe')
    return df

###remember to enter your host, user, and password
def main(host = enter_your_host_here, port=8086):
    """Instantiate a connection to the InfluxDB."""
    
    user = xxx
    password = xxx
    db_name = 'data_type_example'
    client = InfluxDBClient(host, port, user, password)
    client.drop_database(db_name) 
    client.create_database(db_name)
    dfclient = DataFrameClient(host, port, user, password, db_name)
            
    for_df_dict = {"nanFloats": [1.1, float('nan') , 3.3, 4.4], "onlyFloats": [1.1, 2.2, 3.3, 4.4], 
                                  "strings":['one_one', 'two_two' ,'three_three', 'four_four']}
    df = pd.DataFrame.from_dict(for_df_dict)
    df['time'] = ['2019-10-04 06:27:19.850557111+00:00', '2019-10-04 06:27:19.850557184+00:00', '2019-10-04 06:27:42.251396864+00:00',
      '2019-10-04 06:27:42.251396974+00:00']
    df['time'] = pd.to_datetime(df['time'], unit='ns')
    df = df.set_index('time')
    df = df_int_to_float(df) 
    #####  df_types just for informational purposes
    df_types_float = df.select_dtypes(include = ['float64']) 
    df_types_bool = df.select_dtypes(include = ['bool'])
    df_types_obj = df.select_dtypes(include = ['object'])
    ########
    dfclient.write_points(df, 'test')     
    

if __name__ == '__main__':
    main()


@auphofBSF
Author

Hello @cuxcrider, I can confirm I am getting the same issue when just using my branch with the ns fix. What is missing are all the fixes already merged into master influxdata/influxdb-python. As my PR #650 is still failing the pypy test, one has to merge this PR branch locally.

I have set up a branch on my fork merging the primary master with my PR. If you do that locally, or pull my merge https://github.com/auphofBSF/influxdb-python/tree/test/merge_pr650issue649, it should work. This branch passes all existing tests except the pypy one. Here is the output on my test instance of your code.

It would be good to get this pypy test failure sorted and then hopefully this PR can be merged.

Hope this works well for you.

curl 'http://influxdb.local:8086/query?pretty=true&epoch=ns' --data-urlencode 'db=test_db_19W41issuewithNan' --data-urlencode "q=SELECT * FROM test"
{
    "results": [
        {
            "statement_id": 0,
            "series": [
                {
                    "name": "test",
                    "columns": [
                        "time",
                        "nanFloats",
                        "onlyFloats",
                        "strings"
                    ],
                    "values": [
                        [
                            1570170439850557111,
                            1.1,
                            1.1,
                            "one_one"
                        ],
                        [
                            1570170439850557184,
                            null,
                            2.2,
                            "two_two"
                        ],
                        [
                            1570170462251396864,
                            3.3,
                            3.3,
                            "three_three"
                        ],
                        [
                            1570170462251396974,
                            4.4,
                            4.4,
                            "four_four"
                        ]
                    ]
                }
            ]
        }
    ]
}

@cuxcrider

wow, awesome, thanks for doing that!

I know next to nothing about pypy, but looking for incompatibilities with pandas seems like a reasonable place to start. In searching for "pypy pandas issues" I found this issue, which sounds like it has a relatively simple workaround: https://bitbucket.org/pypy/pypy/issues/2989/pandas-currently-fails-to-install-on-pypy3

Maybe it's something easy like that?

Sorry I cannot be of much help. I hope some other folks can help so your branch can be merged into the official release. I would think people must have some major headaches with duplicate data otherwise.

Nice work on your patch thus far!
