
Commit 78a908c

Add config docs (pandas-dev#667)
* added configuration documentation
1 parent 2415b1a commit 78a908c

File tree

2 files changed: +236 -0 lines changed

docs/README.md

+8
@@ -77,3 +77,11 @@ Arctic is designed to be very extensible and currently supports a numer of diffe
* [Chunkstore](chunkstore.md)

Each one has various features and is designed to support specific and general use cases.


### Arctic configuration settings

There are a large number of configuration knobs that tune Arctic's performance and enable/disable various (experimental) features.

For more details refer to the [Arctic configuration guide](configuration.md).

docs/configuration.md

+228
@@ -0,0 +1,228 @@
# Configuration variables

Arctic has several tuning knobs under [arctic/_config.py](https://github.com/manahl/arctic/blob/master/arctic/_config.py) which affect the functionality of certain modules.

Most of these variables are initialized via environment variables, which are explained in the rest of this section.
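
Since these settings come from the environment, they can also be set from Python. The snippet below is a minimal sketch, assuming that `arctic/_config.py` evaluates the environment variables at import time, so they must be set before the first `import arctic`:

```python
# Minimal sketch: set configuration through the environment before importing
# arctic (assumption: _config.py reads these variables at import time).
import os

os.environ["STRICT_WRITE_HANDLER_MATCH"] = "1"
os.environ["LZ4_HIGH_COMPRESSION"] = "1"

import arctic  # noqa: E402  -- imported only after the environment is set
```
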
<br><br>

## VersionStore

### STRICT_WRITE_HANDLER_MATCH

Controls whether Arctic may only use the handler that matches the type of the data. If set to true, it prevents falling back to pickling when the matching handler (chosen based on the type of the data) cannot serialize the data without objects.

```
export STRICT_WRITE_HANDLER_MATCH=1
```

When strict match is enabled and the handler fails to serialize without objects, users will receive the following error:

```
Traceback (most recent call last):
  File "<ipython-input-2-f8740688f9bf>", line 1, in <module>
    library.write('SymbolA', df)
  File "/projects/git/arctic/arctic/decorators.py", line 49, in f_retry
    return f(*args, **kwargs)
  File "/projects/git/arctic/arctic/store/version_store.py", line 671, in write
    handler = self._write_handler(version, symbol, data, **kwargs)
  File "/projects/git/arctic/arctic/store/version_store.py", line 340, in _write_handler
    raise ArcticException("Not falling back to default handler for %s" % symbol)
arctic.exceptions.ArcticException: Not falling back to default handler for SymbolA
```
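
For illustration, here is a hedged sketch of a write that hits this error path. The host and library names are hypothetical, and the DataFrame deliberately contains an object-dtype column that the Pandas-specific handlers cannot serialize without pickling:

```python
# Hypothetical setup; host and library name are placeholders.
import pandas as pd
from arctic import Arctic
from arctic.exceptions import ArcticException

store = Arctic('localhost')
store.initialize_library('user.example')
library = store['user.example']

# An object-dtype column the matching handler cannot serialize without objects.
df = pd.DataFrame({'values': [object(), object()]})

try:
    library.write('SymbolA', df)
except ArcticException:
    # With STRICT_WRITE_HANDLER_MATCH=1 there is no silent fallback to pickling;
    # decide here whether to clean the data or pickle it explicitly.
    raise
```
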
<br><br>

## NdArrayStore

### CHECK_CORRUPTION_ON_APPEND

Enables more thorough sanity checks for detecting data corruption when issuing appends. The checks will introduce a 5-7% performance hit. This is disabled by default.

```
export CHECK_CORRUPTION_ON_APPEND=1
```
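
As a hedged illustration, this is the kind of append the extra checks guard (reusing the hypothetical `library` from the example above); the environment variable must be set before arctic is imported:

```python
# Hypothetical data; the corruption sanity checks run inside append() when enabled.
import pandas as pd

base = pd.DataFrame({'price': [1.0, 2.0]},
                    index=pd.date_range('2020-01-01', periods=2))
extra = pd.DataFrame({'price': [3.0]},
                     index=pd.date_range('2020-01-03', periods=1))

library.write('SymbolA', base)
library.append('SymbolA', extra)  # sanity checks are applied here
```
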
<br><br>

## Serialization

### ARCTIC_AUTO_EXPAND_CHUNK_SIZE

If a row is too large, auto-expand the data chunk size beyond the default _CHUNK_SIZE (2MB). This is disabled by default, in which case a row of the written DataFrame, in its serialized NumPy array form, should not exceed 2MB.

This setting is effective only when using the incremental serializer.

```
export ARCTIC_AUTO_EXPAND_CHUNK_SIZE=1
```
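
As a rough, hypothetical way to see whether a frame is affected, estimate the raw size of a single row and compare it against the 2MB default; the real serialized size depends on the incremental serializer, so treat this only as an approximation:

```python
# Rough estimate only; the incremental serializer's actual layout differs.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 500_000))   # a very wide frame
row_bytes = df.iloc[:1].values.nbytes              # ~4MB per row here
needs_auto_expand = row_bytes > 2 * 1024 * 1024    # default _CHUNK_SIZE is 2MB
print(row_bytes, needs_auto_expand)
```
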
### MAX_DOCUMENT_SIZE

This configuration variable is used only when ARCTIC_AUTO_EXPAND_CHUNK_SIZE is set and the user writes a DataFrame which has an extremely large number of columns (exceeding 2MB serialized).

This value must be less than 16MB, which is the maximum document size of MongoDB, taking into account the size of the other fields serialized as part of the BSON object.

It is set to the following default value:

```
In[8]: pymongo.common.MAX_BSON_SIZE * 0.8
Out[8]: 13421772.8
```
### FAST_CHECK_DF_SERIALIZABLE

Optional optimisation feature. When set, a fast check is used in the *can_write()* of the Pandas-specific store implementations. This check is used to decide the right write handler, among the registered ones, to apply to the data the user wants to write/append. It has a significant impact for large DataFrames which have columns with object dtype. The benefits are even more evident when the number of object columns is small relative to the total number of columns.

```
export FAST_CHECK_DF_SERIALIZABLE=1
```
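
A hypothetical shape that benefits most from the fast check: a large DataFrame with many numeric columns and proportionally few object-dtype columns.

```python
# Hypothetical example; column names and sizes are made up.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 200),
                  columns=['f%d' % i for i in range(200)])
df['tag'] = 'label'  # a single object-dtype column among 200 numeric ones

# With FAST_CHECK_DF_SERIALIZABLE=1, choosing the write handler for frames
# like this is much cheaper when calling library.write(...) / append(...).
```
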
<br><br>

## Forward pointers

### <span style="color:orange">ARCTIC_FORWARD_POINTERS_CFG</span> (**EXPERIMENTAL**)

This feature flag controls the mode of operation for the data segment references model.

The original implementation of Arctic stores the *_id* values of the version document inside the data segment documents (parent references). Therefore, even small updates (i.e. writes where the data change very little) cause a large number of documents to be updated, and trigger fetches into the WiredTiger cache (e.g. when appending).

Since Arctic version *1.73.0* the segment referencing model of VersionStore has included a new implementation, named *forward pointers*. In this model, the segments no longer hold information about the versions which reference/use them; instead, the list of segment SHAs is stored in the version document itself. This is beneficial in many ways. First, small updates result in updates/writes only for the new data segments, dramatically reducing the number of documents affected by the update queries (faster execution). Second, the WiredTiger cache gets less polluted, since the necessary indexes are used and existing large data segments are not fetched into the cache. Finally, all the data information is now in one place, the version document itself, making the debugging of data integrity issues easier.

Variable *ARCTIC_FORWARD_POINTERS_CFG* controls the three modes of operation as described below:

```
# This is the default mode of operation (i.e. the same as not setting the variable).
# Arctic operates identically to previous versions (<1.73.0).
export ARCTIC_FORWARD_POINTERS_CFG=DISABLED

# This mode of operation maintains both forward pointers and parent references in segments.
# For reads the forward pointer segment references are preferred if they exist.
# This mode is a performance regression, so it should be transitional, before switching to ENABLED.
export ARCTIC_FORWARD_POINTERS_CFG=HYBRID

# In this mode of operation, only forward pointers are used, and the created versions are not
# backwards compatible with older (< v1.73.0) Arctic versions.
# Note that it is still possible to read versions written with older Arctic versions.
export ARCTIC_FORWARD_POINTERS_CFG=ENABLED
```
The following table documents the compatibility for reading and writing data for all possible combinations of *ARCTIC_FORWARD_POINTERS_CFG*.

|                                                   | Version written with legacy Arctic / DISABLED | Version written with HYBRID | Version written with ENABLED |
| ------------------------------------------------- | --------------------------------------------- | --------------------------- | ---------------------------- |
| Read with Arctic < v1.73.0                        | Y | Y | -    |
| Read with ARCTIC_FORWARD_POINTERS_CFG=DISABLED    | Y | Y | Y    |
| Read with ARCTIC_FORWARD_POINTERS_CFG=HYBRID      | Y | Y | Y    |
| Read with ARCTIC_FORWARD_POINTERS_CFG=ENABLED     | Y | Y | Y    |
| Update with Arctic < v1.73.0                      | Y | Y | -    |
| Update with ARCTIC_FORWARD_POINTERS_CFG=DISABLED  | Y | Y | Y(*) |
| Update with ARCTIC_FORWARD_POINTERS_CFG=HYBRID    | Y | Y | Y(*) |
| Update with ARCTIC_FORWARD_POINTERS_CFG=ENABLED   | Y | Y | Y    |

(*) Appends will be converted to a full write.
It is recommended to make a full write for all symbols in HYBRID mode when switching from ENABLED to DISABLED.

**Migration plan**

After a certain time under test, *ARCTIC_FORWARD_POINTERS_CFG=ENABLED* will become the default configuration for future versions of Arctic.
When this happens, users shouldn't notice any issues, because the transition from DISABLED/HYBRID to ENABLED is safe
for reads and writes per the compatibility table above.

We strongly recommend, however, always transitioning between configurations via the HYBRID mode ("DISABLED->HYBRID->ENABLED"), and testing thoroughly while in HYBRID mode.

When the user desires to switch back to DISABLED, the following steps are required (a sketch follows the list):

1. Drop all snapshots.
2. Switch to HYBRID.
3. Force a full read/write for all symbols.
4. Switch to DISABLED.
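
A rough sketch of steps 1 and 3, assuming a VersionStore library `lib` (run step 1 before switching to HYBRID, and step 3 while in HYBRID mode):

```python
# Step 1: drop all snapshots (iterating list_snapshots() yields snapshot names).
for snapshot in lib.list_snapshots():
    lib.delete_snapshot(snapshot)

# Step 3: force a full read/write for every symbol.
for symbol in lib.list_symbols():
    item = lib.read(symbol)
    lib.write(symbol, item.data, metadata=item.metadata)
```
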
### ARCTIC_FORWARD_POINTERS_RECONCILE

When enabled, the number of segments is cross-verified between forward and legacy (parent) pointers. It is mostly used to verify the correct functionality of forward pointers while the feature is in an experimental state.

It has an effect only when *ARCTIC_FORWARD_POINTERS_CFG=HYBRID* and affects writes/appends.

```
export ARCTIC_FORWARD_POINTERS_RECONCILE=1
```
<br><br>

## Compression settings

### DISABLE_PARALLEL

This flag disables parallel (multi-threaded) compression of the data segments. Parallel LZ4 compression is enabled by default.

```
export DISABLE_PARALLEL=1
```
### LZ4_HIGH_COMPRESSION

Use the high-compression configuration for LZ4 (trades runtime speed for a better compression ratio).

```
export LZ4_HIGH_COMPRESSION=1
```
### LZ4_WORKERS

Configures the size of the compression thread pool (2 by default).

For a guide on how to tune the following parameters, read: arctic/benchmarks/lz4_tuning/README.txt

A rough rule of thumb is to use 2 for non-HC (VersionStore/NdArrayStore/PandasStore), and 8 for HC (TickStore).

```
export LZ4_WORKERS=4
```
### LZ4_N_PARALLEL

This setting controls the minimum number of chunks required to use parallel compression. The default value is 16, derived from benchmark results.

```
export LZ4_N_PARALLEL=4
```
### LZ4_MINSZ_PARALLEL

This setting controls the minimum data size required to use parallel compression. The default value is 524288 bytes (0.5MB), derived from benchmark results.

```
export LZ4_MINSZ_PARALLEL=1048576
```
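
For example, a hypothetical setup for an HC (TickStore-heavy) workload, following the rule of thumb above; as with the other variables, these are assumed to be read at import time, so set them before the first `import arctic`:

```python
# Hypothetical LZ4 tuning for an HC (TickStore) workload.
import os

os.environ["LZ4_HIGH_COMPRESSION"] = "1"
os.environ["LZ4_WORKERS"] = "8"                     # rule of thumb for HC
os.environ["LZ4_MINSZ_PARALLEL"] = str(512 * 1024)  # keep the 0.5MB default

import arctic  # noqa: E402  -- imported after the environment is set
```
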
