
Commit 42f9277

Merge pull request pandas-dev#779 from manahl/shashank88-patch-1
Add some more documentation
2 parents 84aa655 + fb2fc9d

5 files changed: +113 -15 lines


docs/contributing.md

+15
## Contributing to Arctic Development

* Feel free to pick up an issue from the bug tracker (https://github.com/manahl/arctic/issues), or file a new issue and assign it to yourself, so we don't duplicate work on the same issue.

* Local installation
  * Clone the repo locally.
  * Create a virtualenv, e.g. `virtualenv .venv -p python3`.
  * Activate the virtualenv, e.g. `source .venv/bin/activate`.
  * Run `python setup.py install` to install dependencies into your virtualenv.
  * Arctic should now be ready to use locally; you can test it by importing it in your Python interpreter.

* After you have made changes, run the tests with `python setup.py test`. To run a specific test, use something like `python setup.py test -a tests/integration/<test_name>`.

* Run pycodestyle locally to make sure your changes pass the coding style checks.

docs/faq.md

+21-8
Arctic can query millions of rows per second per client, achieves ~10x compression on network bandwidth, ~10x compression on disk, and scales to hundreds of millions of rows per second per [MongoDB](https://www.mongodb.org/) instance.
Other benefits are:

* Serializes a number of data types, e.g. Pandas DataFrames, NumPy arrays, and arbitrary Python objects via pickling, so you don't have to handle different datatypes manually.
* Uses LZ4 compression by default on the client side to get big savings on network / disk.
* Allows you to version different stages of an object and snapshot the state (in some ways similar to git), so you can freely experiment and then just revert to a snapshot. [VersionStore only]
* Does the chunking (breaking a DataFrame into smaller parts) for you.
* Adds a concept of Users and per-User Libraries, which can build on Mongo's auth.
* Has different types of Stores, each with its own benefits. E.g. VersionStore allows you to version and snapshot data, TickStore is for storage and highly efficient retrieval of streaming data, and ChunkStore allows you to chunk and efficiently retrieve ranges of chunks. If nothing suits you, feel free to use vanilla Mongo commands with BSONStore.
* Restricts data access to Mongo and thus prevents ad hoc queries on unindexed / unsharded collections.
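The chunking bullet above can be illustrated in plain Python. This is a minimal conceptual sketch of splitting rows into fixed-size chunks, not Arctic's actual implementation:

```python
# Conceptual sketch only -- not Arctic's chunking code.
# Splits a sequence of rows into fixed-size chunks, the way a store
# breaks a DataFrame into smaller parts before writing.

def chunk_rows(rows, chunk_size):
    """Yield successive chunks of at most `chunk_size` rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]

rows = list(range(10))              # stand-in for DataFrame rows
chunks = list(chunk_rows(rows, 4))
print(chunks)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each chunk can then be compressed and stored as its own document, which is what makes range reads fast.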
## Differences between VersionStore and TickStore?

TickStore is for tick-style data, generally arriving via streaming; VersionStore is for playing around with data. It keeps versions, so you can 'undo' changes and keep track of updates.

## Which Store should I use?

* VersionStore: This is the default Store type. It gives you the ability to version and snapshot your objects, while doing the serialization, compression, etc. alongside. This is useful as you can basically play with your data and revert to an older state if needed.
* ChunkStore: Use ChunkStore when you don't care about versioning and want to store DataFrames in user-defined chunks with fast reads.
* TickStore: When you are storing constant tick data (e.g. buy / sell info from exchanges). This generally plays well with Kafka / other message brokers.
* BSONStore: For basically using raw Mongo operations via Arctic. Can be used for storing ad hoc data.
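The decision guide above can be captured as a small rule-of-thumb helper. This is purely a hypothetical sketch -- neither the function nor its flags are part of Arctic's API; it just returns the store type a user would typically pick:

```python
# Hypothetical decision helper mirroring the guide above.
# Not part of Arctic's API -- illustration only.

def choose_store(tick_data=False, need_versioning=False,
                 chunked_dataframes=False):
    if tick_data:
        return 'TickStore'      # constant streaming tick data
    if need_versioning:
        return 'VersionStore'   # default: versions + snapshots
    if chunked_dataframes:
        return 'ChunkStore'     # user-defined chunks, fast range reads
    return 'BSONStore'          # raw Mongo operations / ad hoc data

print(choose_store(need_versioning=True))   # VersionStore
```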

## Why Mongo?

## I'm running Mongo in XXXX setup - what performance should I expect?

We're constantly asked what the expected performance of Arctic is/should be for given configurations and Mongo cluster setups. It's hard to know for sure given the enormous number of ways Mongo, networks, machines, workstations, etc. can be configured. MongoDB performance tuning is outside the scope of this library, but countless tutorials and examples are available via a quick search of the Internet.

## Thread safety

VersionStore is thread safe, and interrupted operations should never corrupt data, because the data segments are written first and the pointers to them afterwards. An interrupted write can, however, leak orphaned data segments.
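The segments-before-pointers ordering described above can be shown with a toy in-memory store. This is purely conceptual, not Arctic's storage code: if a writer dies after writing segments but before publishing the pointer, readers still see the previous version, and the new segments are merely orphaned.

```python
# Toy illustration of write ordering: data segments first, pointer last.
# Purely conceptual -- not Arctic's actual storage implementation.

segments = {}   # segment_id -> data
pointers = {}   # symbol -> list of segment_ids (the "published" version)

def write_version(symbol, version, parts):
    seg_ids = []
    for i, part in enumerate(parts):
        seg_id = (symbol, version, i)
        segments[seg_id] = part      # step 1: write the data segments
        seg_ids.append(seg_id)
    pointers[symbol] = seg_ids       # step 2: publish the pointer

write_version('SYM', 1, ['a', 'b'])

# Simulate a crash mid-write of version 2: a segment was written,
# but the pointer was never published.
segments[('SYM', 2, 0)] = 'c'

# Readers still see the intact version 1; the version-2 data is leaked.
data = [segments[s] for s in pointers['SYM']]
print(data)   # ['a', 'b']
```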

docs/index.md

+17-5
Arctic is a timeseries / dataframe database that sits atop MongoDB. Arctic supports serialization of a number of datatypes for storage in the mongo document model.
## Why use Arctic?

Some of the reasons to use Arctic are:

* Serializes a number of data types, e.g. Pandas DataFrames, NumPy arrays, and arbitrary Python objects via pickling, so you don't have to handle different datatypes manually.
* Uses LZ4 compression by default on the client side to get big savings on network / disk.
* Allows you to version different stages of an object and snapshot the state (in some ways similar to git), so you can freely experiment and then just revert to a snapshot. [VersionStore only]
* Does the chunking (breaking a DataFrame into smaller parts) for you.
* Adds a concept of Users and per-User Libraries, which can build on Mongo's auth.
* Has different types of Stores, each with its own benefits. E.g. VersionStore allows you to version and snapshot data, TickStore is for storage and highly efficient retrieval of streaming data, and ChunkStore allows you to chunk and efficiently retrieve ranges of chunks. If nothing suits you, feel free to use vanilla Mongo commands with BSONStore.
* Restricts data access to Mongo and thus prevents ad hoc queries on unindexed / unsharded collections.

Head over to the FAQ and James's presentation given below for more details.
## Basic Operations

Arctic provides a [wrapper](../arctic/arctic.py) for handling connections to Mongo. The `Arctic` class is what actually connects to Arctic.
Other basic methods:

* `library.list_symbols()`
  - Does what you might expect - lists all the symbols in the given library
  ```['US_EQUITIES', 'EUR_EQUITIES', ...]```
* `arctic.get_quota(library_name)`, `arctic.set_quota(library_name, quota_in_bytes)`
  - Arctic internally sets quotas on libraries so they do not consume too much space. You can check and set quotas with these two methods. Note that these operate on the `Arctic` object, not on libraries.
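Since quotas are specified in bytes, it is worth computing the figure explicitly rather than hard-coding a long literal. A small sketch (the 10 GiB figure is just an example value):

```python
# Quotas are passed to set_quota() in bytes; compute them explicitly.
# The 10 GiB figure here is an arbitrary example.
GIB = 1024 ** 3
quota_in_bytes = 10 * GIB
# then: arctic.set_quota(library_name, quota_in_bytes)
print(quota_in_bytes)   # 10737418240
```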

docs/tickstore.md

+59-1
## Reading and Writing data with Tickstore

Sample tick:
```python
sample_ticks = [
    {
        'ASK': 1545.25,
        'ASKSIZE': 1002.0,
        'BID': 1545.0,
        'BIDSIZE': 55.0,
        'CUMVOL': 2187387.0,
        'DELETED_TIME': 0,
        'INSTRTYPE': 'FUT',
        'PRICE': 1545.0,
        'SIZE': 1.0,
        'TICK_STATUS': 0,
        'TRADEHIGH': 1561.75,
        'TRADELOW': 1537.25,
        'index': 1185076787070
    },
    {
        'CUMVOL': 354.0,
        'DELETED_TIME': 0,
        'PRICE': 1543.75,
        'SIZE': 354.0,
        'TRADEHIGH': 1543.75,
        'TRADELOW': 1543.75,
        'index': 1185141600600
    }
]
```
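The `index` field in each sample tick is a millisecond Unix-epoch timestamp. A quick stdlib check of the first sample's index confirms this:

```python
from datetime import datetime, timezone

# 'index' in the sample ticks above is milliseconds since the Unix epoch.
ts_ms = 1185076787070
ts = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(ts.date())   # 2007-07-22
```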
### Writing and reading to tickstore

```python
tickstore_lib.write('FEED::SYMBOL', sample_ticks)
df = tickstore_lib.read('FEED::SYMBOL', columns=['BID', 'ASK', 'PRICE'])
```
Another example, with a timezone-aware datetime index:

```python
from datetime import datetime as dt

import pandas as pd

from arctic.date import mktz

data = [{'A': 120, 'D': 1}, {'A': 122, 'B': 2.0}, {'A': 3, 'B': 3.0, 'D': 1}]
tick_index = [dt(2013, 6, 1, 12, 00, tzinfo=mktz('UTC')),
              dt(2013, 6, 1, 11, 00, tzinfo=mktz('UTC')),  # out-of-order
              dt(2013, 6, 1, 13, 00, tzinfo=mktz('UTC'))]
data = pd.DataFrame(data, index=tick_index)

tickstore_lib._chunk_size = 3
tickstore_lib.write('SYM', data)
tickstore_lib.read('SYM', columns=None)
```
## Usecases

* Storing billions of ticks in a compressed way with fast querying by date ranges.
* Customizable chunk sizes. The default is 100k ticks per chunk, which should fit easily in a single mongo doc for fast reads.
* Structured to work with financial tick data stored on a per-symbol basis. Generally used with Kafka, a Redis queue, or some other message broker for streaming data.
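The default 100k chunk size means the number of chunks (and thus roughly the number of mongo docs per symbol) scales simply with tick volume. A back-of-the-envelope sketch, with an assumed example volume of 250M ticks:

```python
import math

# With the default chunk size of 100k ticks, estimate how many chunks
# a symbol's history needs. The 250M figure is an arbitrary example.
CHUNK_SIZE = 100_000
n_ticks = 250_000_000
n_chunks = math.ceil(n_ticks / CHUNK_SIZE)
print(n_chunks)   # 2500
```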
See [James's talk](https://vimeo.com/showcase/3660528/video/145842301) for more details.

mkdocs.yml

+1-1
@@ -19,12 +19,12 @@ pages:
   - Introduction: 'index.md'
   - Quickstart: 'quickstart.md'
   - Configuration: 'configuration.md'
-  - Developing on mac: 'developing-conda-mac.md'
   - Storage Engines:
     - VersionStore: 'versionstore.md'
     - TickStore: 'tickstore.md'
     - ChunkStore: 'chunkstore.md'
     - ChunkStore API Reference: 'chunkstore_api.md'
+  - Contributing to Arctic: 'contributing.md'
   - Releasing: 'releasing.md'
   - Users: 'users.md'
   - FAQ: 'faq.md'
