# Databolt File Ingest

Quickly ingest raw files. Works with XLS, CSV, and TXT inputs, which can be exported to CSV, Parquet, SQL, and pandas DataFrames. `d6tstack` solves many of the performance and schema problems typically encountered when ingesting raw files.


### Features include

* Fast pd.to_sql() for postgres and mysql (see the sketch after this list)
* Quickly check columns for consistency across files
* Fix added/missing columns
* Fix renamed columns
* Check Excel tabs for consistency across files
* Excel to CSV converter (incl multi-sheet support)
* Out-of-core functionality to process large files
* Export to CSV, Parquet, SQL, pandas DataFrame
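
For example, the fast Postgres load from the first bullet looks roughly like the sketch below. This is a minimal, hedged sketch rather than the library's documentation: the sample data, connection string, and table name are placeholders, and the speedup comes from a bulk `COPY FROM` load, as noted for `to_psql_combine()` in the sample further down.

```

import pandas as pd
import d6tstack.utils

# placeholder dataframe - in practice this would be your ingested data
df = pd.DataFrame({'date': ['2011-01-01', '2011-02-01'], 'sales': [100, 200]})

# bulk-loads via Postgres COPY FROM, typically much faster than df.to_sql()
# (connection string and table name are placeholders)
d6tstack.utils.pd_to_psql(df, 'postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')

```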

### Sample Use

```

import d6tstack

# fast CSV to SQL import - see SQL examples notebook
# (df is an existing pandas dataframe)
d6tstack.utils.pd_to_psql(df, 'postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')
d6tstack.utils.pd_to_mysql(df, 'mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')
d6tstack.utils.pd_to_mssql(df, 'mssql+pymssql://usr:pwd@localhost/db', 'tablename') # experimental

# ingest multiple CSVs which may have data schema changes - see CSV examples notebook

import glob
>>> c = d6tstack.combine_csv.CombinerCSV(glob.glob('data/*.csv'))

# quick check if all files have consistent columns
>>> c.is_all_equal()
False

# show which files have missing columns
>>> c.is_col_present()
   filename  cost  date  profit  profit2  sales
0   feb.csv  True  True    True    False   True
2   mar.csv  True  True    True     True   True

>>> c.combine_preview() # keep all columns
   filename  cost        date  profit  profit2  sales
0   jan.csv   -80  2011-01-01      20      NaN    100
0   mar.csv  -100  2011-03-01     200      400    300

>>> d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), columns_select_common=True).combine_preview() # keep common columns
   filename  cost        date  profit  sales
0   jan.csv   -80  2011-01-01      20    100
0   mar.csv  -100  2011-03-01     200    300

>>> d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), columns_rename={'sales':'revenue'}).combine_preview()
   filename  cost        date  profit  profit2  revenue
0   jan.csv   -80  2011-01-01      20      NaN      100
0   mar.csv  -100  2011-03-01     200      400      300

# to come: check if columns match database
>>> c.is_columns_match_db('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')

# export to CSV, parquet, SQL. Out-of-core, with optimized fast imports for postgres and mysql
>>> c.to_pandas()
>>> c.to_csv_align(output_dir='process/')
>>> c.to_parquet_align(output_dir='process/')
>>> c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')
>>> c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast, using COPY FROM
>>> c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast, using LOAD DATA LOCAL INFILE

# read Excel files - see Excel examples notebook for more details
import d6tstack.convert_xls

d6tstack.convert_xls.read_excel_advanced('test.xls',
    sheet_name='Sheet1', header_xls_range="B2:E2")

d6tstack.convert_xls.XLStoCSVMultiSheet('test.xls').convert_all(header_xls_range="B2:E2")

d6tstack.convert_xls.XLStoCSVMultiFile(glob.glob('*.xls'),
    cfg_xls_sheets_sel_mode='name_global', cfg_xls_sheets_sel='Sheet1'
    ).convert_all(header_xls_range="B2:E2")

```


## Installation

We recommend using the latest version from GitHub: `pip install git+https://github.com/d6t/d6tstack.git`.

If you cannot install from GitHub, use the latest published version: `pip install d6tstack`. For Excel and Parquet support, install `d6tstack[xls]` and `d6tstack[parquet]`. Certain database-specific functions require additional packages, which you will be prompted to install as you use them.

## Documentation

* [CSV examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-csv.ipynb) - Quickly load any type of CSV file
* [Excel examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-excel.ipynb) - Quickly extract from Excel to CSV
* [Dask examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-dask.ipynb) - How to use d6tstack to solve Dask input file problems (see the sketch after this list)
* [PySpark examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-pyspark.ipynb) - How to use d6tstack to solve PySpark input file problems
* [SQL examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb) - Fast loading of CSV to SQL with pandas preprocessing
* [Function reference docs](http://d6tstack.readthedocs.io/en/latest/py-modindex.html) - Detailed documentation for modules, classes, functions
* [www.databolt.tech](https://www.databolt.tech/index-combine.html) - Web app if you don't want to code
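
As a rough illustration of the Dask workflow (a hedged sketch, not taken from the notebook: the file paths are placeholders and the exact filenames written by `to_csv_align()` may differ), aligning schemas first lets Dask read the whole folder without column-mismatch surprises:

```

import glob
import dask.dataframe as dd
import d6tstack.combine_csv

# align columns across raw CSVs and write cleaned copies to process/
c = d6tstack.combine_csv.CombinerCSV(glob.glob('data/*.csv'))
c.to_csv_align(output_dir='process/')

# dask can now read the aligned files as one dataframe
ddf = dd.read_csv('process/*.csv')
print(ddf.head())

```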

## Faster Data Engineering

Check out other d6t libraries to solve common data engineering problems, including:
* data ingest - quickly ingest raw data
* fuzzy joins - quickly join data
* data pipes - quickly share and distribute data

https://github.com/d6t/d6t-python

We also encourage you to join the Databolt blog for updates and tips and tricks: http://blog.databolt.tech

## Collecting Error Messages and Usage Statistics

To help us make this library better, it collects anonymous error messages and usage statistics, similar to how websites collect data. See [d6tcollect](https://github.com/d6t/d6tcollect) for details, including how to disable collection.
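
If you prefer to opt out, d6tcollect's documentation describes how; the snippet below is only a sketch of what that tends to look like, and the module-level `submit` flag is an assumption to be confirmed against the d6tcollect README.

```

# assumption: d6tcollect exposes a module-level switch named `submit`;
# see the d6tcollect README for the authoritative way to disable collection
import d6tcollect
d6tcollect.submit = False

import d6tstack  # imported after disabling collection

```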

It might not catch all errors, so if you run into any problems please raise an issue on GitHub.