How does Indeni's platform save and retrieve collected time-series data?
In this short post I will explain, from a top-level perspective, how the Indeni platform stores and retrieves time-series (TS) data from our in-memory time-series store, for how long we store the data, and how we represent the retrieved sequence of points.
I will start with a few simple, relevant hard facts about our in-memory TS store:
- We store 60 minutes of data.
- The data is saved at 1-minute granularity.
One more fact, this time about rule evaluation on Indeni's platform: each rule specifies how much data should be taken into account when it is evaluated. Given the facts above, this can span from the last minute up to the last 60 minutes. By default, the value is the full available length, which is 60 minutes.
This raises the following underlying question: if we want to take the last T minutes, how do we translate that into a sequence of data points? More specifically, should the end points of the interval be inclusive or exclusive?
To make the question more concrete, let's assume we would like to take the last 15 minutes as of 01:00pm:
The first answer in a real-life scenario would be "take all the points from 12:45pm to 01:00pm". If we look at this answer more closely, we see that there are 16 data points within this time interval, [12:45, 12:46, ..., 01:00], because we unconsciously included both ends of the interval. In some cases that is not what we want.
Note: square brackets mean inclusive; round brackets (parentheses) mean exclusive.
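To see the off-by-one concretely, here is a minimal Python sketch that enumerates the 1-minute timestamps in the closed interval from the example. The helper name `minute_points` and the calendar date are assumptions made for illustration only; this is not Indeni's code.

```python
from datetime import datetime, timedelta

def minute_points(start, end):
    """Return every 1-minute timestamp from start to end, both ends included."""
    points = []
    current = start
    while current <= end:
        points.append(current)
        current += timedelta(minutes=1)
    return points

# The date is arbitrary; only the time of day matters for the example.
window_end = datetime(2020, 1, 1, 13, 0)
window_start = window_end - timedelta(minutes=15)

closed_window = minute_points(window_start, window_end)
print(len(closed_window))  # 16, not 15
```

Asking for "15 minutes" and getting 16 points is exactly the surprise described above.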
In addition, the above representation is more bug-prone:
- What if we want to concatenate two adjacent time intervals? Concatenating the intervals [A, B] and [B, C] results in the element 'B' appearing twice, unless we take this into account and add special, more complex concatenation logic.
- What if 'A' == 'B'? Calculating the length as 'B'-'A' gives 0 although the interval holds one element; the length should be ('B'-'A')+1.
These problems are indeed easy to solve, but they require more attention when using or manipulating the data.
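Both pitfalls of the closed interval can be reproduced in a few lines. The sketch below uses plain integer minutes instead of timestamps, and the helper name `closed_interval` is hypothetical.

```python
def closed_interval(a, b):
    """Minutes in [a, b], both ends included."""
    return list(range(a, b + 1))

first = closed_interval(0, 5)
second = closed_interval(5, 10)

# Naive concatenation repeats the shared end point (minute 5).
naive = first + second
# The fix needs extra logic: drop the first element of the second interval.
fixed = first + second[1:]

# A single-point interval: B - A is 0, but it holds one element.
single = closed_interval(7, 7)
print(naive.count(5), len(single))  # 2 1
```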
Taking (A, B), i.e. exclusive on both sides, raises even more problems:
- If 'A' == 'B', the interval is empty, i.e. its length is zero.
- Concatenating (A, B) with (B, C) leaves a "hole" at the concatenation point, because 'B' belongs to neither interval.
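The "hole" is easy to demonstrate with integer minutes. This is only an illustrative sketch; the helper name `open_interval` is an assumption.

```python
def open_interval(a, b):
    """Minutes in (a, b), both ends excluded."""
    return list(range(a + 1, b))

joined = open_interval(0, 5) + open_interval(5, 10)

# Minute 5 belongs to neither interval, so it is lost after concatenation.
print(5 in joined)               # False
print(len(open_interval(3, 3)))  # 0
```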
Therefore either [A, B) or (A, B] would be a wiser choice. Here is why:
- ('B'-'A') equals the number of elements in the sequence.
- The upper bound of the first interval is the lower bound of the next, meaning we can simply concatenate two adjacent intervals.
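The two properties listed above hold for both half-open forms. The following sketch checks them on integer minutes; `left_open` and `right_open` are hypothetical helper names, not part of any Indeni API.

```python
def left_open(a, b):
    """Minutes in (a, b]: left end excluded, right end included."""
    return list(range(a + 1, b + 1))

def right_open(a, b):
    """Minutes in [a, b): left end included, right end excluded."""
    return list(range(a, b))

results = []
for make in (left_open, right_open):
    length_ok = len(make(0, 5)) == 5 - 0                 # B - A is the element count
    concat_ok = make(0, 5) + make(5, 10) == make(0, 10)  # plain concatenation works
    results.append((length_ok, concat_ok))

print(results)  # [(True, True), (True, True)]
```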
Which one? Let's check out the trade-offs between the two.
- (A, B] means exclusive on the left end and inclusive on the right end, i.e. "we are biased towards evaluating with the latest data point, although sometimes it may be missing" because we haven't collected it yet for this minute.
- [A, B) means inclusive on the left end and exclusive on the right end, i.e. "we are biased towards evaluating with a full set of data points, although we are probably not including the latest value".
At Indeni we have chosen to use (A, B], because we want to reflect the latest state as accurately as possible and can tolerate one possibly missing data point.
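As a rough illustration of what the (A, B] choice means for "the last T minutes", here is a minimal sketch. It assumes a hypothetical store modelled as a dict from minute index to value; the function name `last_minutes` and the data are invented for the example and do not reflect Indeni's actual implementation.

```python
def last_minutes(store, now, t):
    """Return (minute, value) pairs for minutes in (now - t, now].

    Minutes with no collected value are skipped.
    """
    return [(m, store[m]) for m in range(now - t + 1, now + 1) if m in store]

# Hypothetical store: minutes 0..59 collected, minute 60 not collected yet.
store = {m: float(m) for m in range(0, 60)}

before = last_minutes(store, now=60, t=15)
print(len(before))  # 14: the latest point is still missing

# Once minute 60 arrives, the same query returns the full 15 points.
store[60] = 60.0
after = last_minutes(store, now=60, t=15)
print(len(after), after[0][0], after[-1][0])  # 15 46 60
```

The window always ends at the most recent minute, so at worst it is one point short until that minute's value is collected.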
For further reading, see this nice article by Dijkstra: http://www.cs.utexas.edu/~EWD/ewd08xx/EWD831.PDF