Commit ddbc108

recommit
46 files changed: +11673, -0 lines

.gitignore

+119
@@ -0,0 +1,119 @@
tests/.test-cred.yaml

.idea/
.env
temp/
fiddle*
.pytest_cache/
test-data/output/

# add this manually
test-data/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

# pypi config file
.pypirc

LICENSE

+21
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2018 Databolt

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

MANIFEST.in

+2
@@ -0,0 +1,2 @@
include README.md
include LICENSE

README.md

+117
@@ -0,0 +1,117 @@
# Databolt File Ingest

Quickly ingest raw files. Works for XLS, CSV and TXT files, which can be exported to CSV, Parquet, SQL and pandas. `d6tstack` solves many performance and schema problems typically encountered when ingesting raw files.

![](https://www.databolt.tech/images/combiner-landing-git.png)

### Features include

* Fast pd.to_sql() for postgres and mysql
* Quickly check columns for consistency across files
* Fix added/missing columns
* Fix renamed columns
* Check Excel tabs for consistency across files
* Excel to CSV converter (incl. multi-sheet support)
* Out-of-core functionality to process large files
* Export to CSV, parquet, SQL, pandas dataframe

### Sample Use

```
import d6tstack

# fast CSV to SQL import - see SQL examples notebook
d6tstack.utils.pd_to_psql(df, 'postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')
d6tstack.utils.pd_to_mysql(df, 'mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')
d6tstack.utils.pd_to_mssql(df, 'mssql+pymssql://usr:pwd@localhost/db', 'tablename') # experimental

# ingest multiple CSVs which may have data schema changes - see CSV examples notebook

import glob
>>> c = d6tstack.combine_csv.CombinerCSV(glob.glob('data/*.csv'))

# quick check if all files have consistent columns
>>> c.is_all_equal()
False

# show which files have missing columns
>>> c.is_col_present()
  filename  cost  date  profit  profit2  sales
0  feb.csv  True  True    True    False   True
2  mar.csv  True  True    True     True   True

>>> c.combine_preview() # keep all columns
  filename  cost        date  profit  profit2  sales
0  jan.csv   -80  2011-01-01      20      NaN    100
0  mar.csv  -100  2011-03-01     200      400    300

>>> d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), columns_select_common=True).combine_preview() # keep common columns
  filename  cost        date  profit  sales
0  jan.csv   -80  2011-01-01      20    100
0  mar.csv  -100  2011-03-01     200    300

>>> d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), columns_rename={'sales':'revenue'}).combine_preview()
  filename  cost        date  profit  profit2  revenue
0  jan.csv   -80  2011-01-01      20      NaN      100
0  mar.csv  -100  2011-03-01     200      400      300

# to come: check if columns match database
>>> c.is_columns_match_db('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')

# export to CSV, parquet, SQL; out of core, with optimized fast imports for postgres and mysql
>>> c.to_pandas()
>>> c.to_csv_align(output_dir='process/')
>>> c.to_parquet_align(output_dir='process/')
>>> c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')
>>> c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename') # fast, using COPY FROM
>>> c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast, using LOAD DATA LOCAL INFILE

# read Excel files - see Excel examples notebook for more details
import d6tstack.convert_xls

d6tstack.convert_xls.read_excel_advanced('test.xls',
    sheet_name='Sheet1', header_xls_range="B2:E2")

d6tstack.convert_xls.XLStoCSVMultiSheet('test.xls').convert_all(header_xls_range="B2:E2")

d6tstack.convert_xls.XLStoCSVMultiFile(glob.glob('*.xls'),
    cfg_xls_sheets_sel_mode='name_global', cfg_xls_sheets_sel='Sheet1'
    ).convert_all(header_xls_range="B2:E2")
```

## Installation

We recommend using the latest version from GitHub: `pip install git+https://github.com/d6t/d6tstack.git`.

If you cannot install from GitHub, use the latest published version: `pip install d6tstack`. For Excel and parquet support, install `d6tstack[xls]` and `d6tstack[parquet]`. Certain database-specific functions require additional packages, which you will be prompted to install as you use them. The install commands are collected below.
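
For reference, the commands from the paragraph above in one place; the `xls` and `parquet` extras names are taken directly from that paragraph:

```
pip install git+https://github.com/d6t/d6tstack.git   # latest version from github
pip install d6tstack                                  # latest published version
pip install d6tstack[xls]                             # Excel support
pip install d6tstack[parquet]                         # parquet support
```

Note that some shells (e.g. zsh) require quoting the extras: `pip install 'd6tstack[xls]'`.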

## Documentation

* [CSV examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-csv.ipynb) - Quickly load any type of CSV files
* [Excel examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-excel.ipynb) - Quickly extract from Excel to CSV
* [Dask examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-dask.ipynb) - How to use d6tstack to solve Dask input file problems
* [Pyspark examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-pyspark.ipynb) - How to use d6tstack to solve pyspark input file problems
* [SQL examples notebook](https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb) - Fast loading of CSV to SQL with pandas preprocessing
* [Function reference docs](http://d6tstack.readthedocs.io/en/latest/py-modindex.html) - Detailed documentation for modules, classes, functions
* [www.databolt.tech](https://www.databolt.tech/index-combine.html) - Web app if you don't want to code

## Faster Data Engineering

Check out other d6t libraries to solve common data engineering problems, including:

* data ingest: quickly ingest raw data
* fuzzy joins: quickly join data
* data pipes: quickly share and distribute data

https://github.com/d6t/d6t-python

We also encourage you to follow the Databolt blog for updates and tips and tricks: http://blog.databolt.tech

## Collecting Error Messages and Usage Statistics

To help us make this library better, it collects anonymous error messages and usage statistics. It works similarly to how websites collect data. See [d6tcollect](https://github.com/d6t/d6tcollect) for details, including how to disable collection.

It might not catch all errors, so if you run into any problems, please raise an issue on GitHub.
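
A hedged sketch of opting out before using the library follows; it assumes `d6tcollect` exposes a module-level `submit` flag, which you should verify against the [d6tcollect](https://github.com/d6t/d6tcollect) repo before relying on it:

```
# Assumption: d6tcollect exposes a module-level `submit` flag to disable
# telemetry; confirm against https://github.com/d6t/d6tcollect first.
import d6tcollect
d6tcollect.submit = False

import d6tstack  # subsequent d6tstack calls should no longer report usage
```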

d6tstack/__init__.py

+5
@@ -0,0 +1,5 @@
import d6tstack.combine_csv   # CombinerCSV for combining multiple CSVs
#import d6tstack.convert_xls  # Excel-to-CSV conversion (see README); commented out, likely to avoid the optional xls dependencies
import d6tstack.sniffer       # CSV sniffing helpers
#import d6tstack.sync
import d6tstack.utils         # pd_to_psql/pd_to_mysql/pd_to_mssql fast SQL imports
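
Because `__init__.py` imports these submodules, a plain `import d6tstack` exposes them without separate imports. A minimal sketch, using hypothetical file names borrowed from the README examples:

```
import d6tstack  # also loads d6tstack.combine_csv, d6tstack.sniffer, d6tstack.utils

# hypothetical CSVs whose column schemas may differ
c = d6tstack.combine_csv.CombinerCSV(['jan.csv', 'feb.csv', 'mar.csv'])
print(c.is_all_equal())  # False if columns are inconsistent across files
```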
