Inconsistent results when concatenating parsed csv files with function application on windows #12250

Closed
MMCMA opened this issue Feb 7, 2016 · 2 comments

MMCMA commented Feb 7, 2016

I am puzzled by the following problem. I have a set of CSV files that I parse iteratively. Before collecting the DataFrames in a list, I apply a function (as simple as tmp_df*2) to each tmp_df. Everything looked fine at first glance, until I realized that the results are inconsistent from run to run. Strangely, I do not observe these inconsistencies when I leave the parsed data unmodified (simply set with_function=False). I have managed to reproduce the problem with the script below, which should run on both Windows and Unix. On Windows 8.1 the problem appears with with_function=True (typically after 1-5 runs); on Unix it runs without problems. With with_function=False there are no differences on either platform. I can also rule out an int-versus-float issue, because the simulated int column differs between runs as well. Here is the discussion on SO: http://stackoverflow.com/questions/35252460/inconsistent-results-when-concatenating-parsed-csv-files

import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import gettempdir


def simulate_csv_data(tmp_dir, num_files=5):
    """ Simulate csv files and write them to tmp_dir
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :return:
    """

    rows = 20000
    columns = 5
    np.random.seed(1282)

    for file_num in range(num_files):

        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        simulated_df = pd.DataFrame(np.random.standard_normal((rows, columns)))
        simulated_df['some_int'] = np.random.randint(0,100)
        simulated_df.to_csv(str(file_path))


def get_csv_data(tmp_dir, num_files=5, with_function=True):
    """ Read the csv files and return them as one concatenated df
    :param tmp_dir: Path, folder the csv files are read from
    :param num_files: int, how many csv files to read
    :param with_function: Bool, apply function to tmp_dataframe
    :return:
    """

    data_list = list()

    for file_num in range(num_files):
        # current file path
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))

        #  load csv_file
        tmp_df = pd.read_csv(str(file_path), dtype=np.float64)

        # replace infs by na
        tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)

        # apply function to tmp_dataframe
        if with_function:
            tmp_df = tmp_df*2

        data_list.append(tmp_df)

    df = pd.concat(data_list, ignore_index=True)
    df.reset_index(inplace=True)

    return df

def main():

    # INPUT ----------------------------------------------
    num_files = 5
    with_function = True
    max_comparisons = 50
    # ----------------------------------------------------

    tmp_dir = gettempdir()
    # use temporary "non_existing" dir for new file
    tmp_csv_folder = Path(tmp_dir).joinpath('csv_files_sdfs2eqqf')

    # if exists already don't simulate data/files again
    if not tmp_csv_folder.exists():
        tmp_csv_folder.mkdir()
        print('Simulating temp files...')
        simulate_csv_data(tmp_csv_folder, num_files)

    print('Getting benchmark data frame...')
    df1 = get_csv_data(tmp_csv_folder, num_files, with_function)
    df_is_same = True
    count_runs = 0

    # Run until different df is found or max runs exceeded
    print('Comparing data frames...')
    while df_is_same:
        # get another data frame
        df2 = get_csv_data(tmp_csv_folder, num_files, with_function)
        count_runs += 1
        # compare data frames
        if not df1.equals(df2):
            df_is_same = False
            print('Found unequal df after {} runs'.format(count_runs))
            # print out a standard deviations (arbitrary example)
            print('Std Run1: \n {}'.format(df1.std()))
            print('Std Run2: \n {}'.format(df2.std()))

        # stop after exactly max_comparisons runs, and only if no difference was found
        elif count_runs >= max_comparisons:
            df_is_same = False
            print('No unequal df found after {} runs'.format(count_runs))

    print('Delete the following folder if no longer needed: "{}"'.format(
            str(tmp_csv_folder)))


if __name__ == '__main__':
    main()

jreback commented Feb 7, 2016

You need numexpr 2.4.6.

See #12023.

Next time, please include the output of pd.show_versions() and a much shorter description. You are pasting code that cannot be reproduced without the source files, which makes it pretty much useless.


MMCMA commented Feb 7, 2016

Thanks, this solves the problem. Sorry, I'll try to improve.
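
For anyone hitting the same symptom, here is a minimal sketch of how one might check whether the installed numexpr is older than the 2.4.6 mentioned above. It uses only the standard library (Python 3.8+); the helper names `parse_version`, `installed_numexpr_version` and `needs_upgrade` are made up for illustration and are not part of pandas or numexpr.

```python
from importlib import metadata
from typing import Optional, Tuple

MIN_NUMEXPR = "2.4.6"  # version named in the reply above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.4.6' into (2, 4, 6); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_numexpr_version() -> Optional[str]:
    """Return the installed numexpr version string, or None if it is absent."""
    try:
        return metadata.version("numexpr")
    except metadata.PackageNotFoundError:
        return None


def needs_upgrade(installed: Optional[str], minimum: str = MIN_NUMEXPR) -> bool:
    """True only when numexpr is installed and older than `minimum`."""
    if installed is None:
        return False  # numexpr is optional; nothing to upgrade if it is missing
    return parse_version(installed) < parse_version(minimum)


if __name__ == "__main__":
    version = installed_numexpr_version()
    print("numexpr:", version or "not installed")
    if needs_upgrade(version):
        print("Upgrade to numexpr >= {}".format(MIN_NUMEXPR))
```

The output of pd.show_versions() reports the numexpr version as well, so this check is only a convenience for scripts.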

@MMCMA MMCMA closed this as completed Feb 7, 2016