This is an automated email from the ASF dual-hosted git repository.

vatamane pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/couchdb.git

commit 1b4cca582de1baf08397951f451e9bdfd98f013d
Author: Nick Vatamaniuc <[email protected]>
AuthorDate: Sun Jul 20 23:33:05 2025 -0400

    Implement an exponentially decaying time intervals data structure
    
    This data structure maps time intervals to database update sequences. The 
idea
    is to be able to quickly determine which changes occurred in a time 
interval.
    
    The main goal of the design is to have a small data structure to fit well 
under
    a few KBs and yet represent time intervals from a few hours up to decades. 
This
    goal was accomplished by using exponentially decaying time intervals. The
    further back in time we go, the longer the intervals get. This matches how
    humans usually keep track of time: if we're talking about yesterday, we may
    care about hours; if we talk about last month, we may care about single 
days;
    and if we talk about last year, we may only care about the months or 
quarters,
    and so on. If we accept this historical loss of accuracy, we can hit the 
design
    goals of having only 60 time bins and a small, under 500B on-disk
    representation.
    
    The data structure format is a KV list of integers which looks like:
    `[{Time, Seq}, {OlderTime, OlderSeq}, ...]`. Times are rounded to whole 
three
    hour blocks.
    
    The head of the KV list is the youngest entry. The `Time` value is the time 
of
    the earliest sequence in that time interval. The `Seq` value indicates the
    first sequence observed in the time interval.
    
    During updates, if we're into the next three hour block and all the bins are
    filled already, then the bins are "rolled up". That means finding some older
    bins to merge together to make some room for the new one, such that the
    bin count does not increase and stays at or below the maximum limit.
    
    The main API functions are:
      * `new()` : create a new time sequence (`TSeq`) context.
      * `update(TSeq, Time, Seq)` : insert a new sequence into the timeline.
      * `since(TSeq, Time) -> Seq` get sequence before the timestamp.
      * `histogram(TSeq, UpdateSeq)` return formatted time bins and the count 
of updates which
         occurred during each interval. Use this for debugging or to give users 
an idea
         how many changes occurred in each interval. If the database was 
upgraded
         with some existing updates already, those are represented as occurring 
in
         a time bin starting in 1970-01-01.
    
    Since we're using the operating system's knowledge of time, the solution is 
not
    perfect. However, there are a few mitigations to help with some scenarios:
    
      * Time values are rounded to three hour blocks. Even if the 
synchronization
      is known to be off by a day, the user can always restrict the usage of the
      `since` parameter to a larger interval, for example only ask about time
      intervals greater than a few days.
    
      * Ignore updates which appear to happen back in time. Those are likely 
from a
      not yet synchronized clock after boot. Compare update times to the last 
entry
      or to a config setting. Users can set the config setting to a particular
      time, for example to 1971 if they know their systems jumps to 1970 after 
boot
      due to some hardware default. Any updates during that time won't be
      registered but the sequence will catch up once the NTP synchronization 
kicks
      in. It's best to set it to a much more recent time. The default is a recent
      date from before this feature was implemented. Future releases may bump that up.
    
      * If, due to some misconfiguration, time jumps far ahead, say, to year 
3000,
      or any other time configuration mishap occurred, it's always safe to reset 
the
      time-seq structure and simply start fresh at the new time. The plan is for
      the structure to not be interleaved into the doc bodies or the rev tree, 
but
      instead to have it in a separate off-to-the-side unused header field. As
such,
      it can always be safely inspected and reset if needed.
    
    There are EUnit and property tests for 100% test coverage.
    
    Thanks to Ilya (@iilyak) for writing the property tests!
---
 rel/overlay/etc/default.ini                   |  10 +
 src/couch/src/couch_time_seq.erl              | 301 ++++++++++++++++++++++
 src/couch/test/eunit/couch_time_seq_tests.erl | 348 ++++++++++++++++++++++++++
 3 files changed, 659 insertions(+)

diff --git a/rel/overlay/etc/default.ini b/rel/overlay/etc/default.ini
index 92405b01b..f972a0a10 100644
--- a/rel/overlay/etc/default.ini
+++ b/rel/overlay/etc/default.ini
@@ -130,6 +130,16 @@ view_index_dir = {{view_index_dir}}
 ; downgrade or even open the databases in question.
 ;prohibit_downgrade = true
 
+; Ignore time-sequence updates which are below this threshold. Sometimes
+; embedded systems get booted into 1970 and then get their time from NTP. To
+; mitigate that, updates below a time threshold can be ignored. This setting
+; allows setting that threshold. The value in Unix seconds.
+;
+; The default is 1754006400 = 2025-08-01T00:00:00, just some date before the
+; feature was developed. This may be bumped in future releases.
+;
+;time_seq_min_time = 1754006400
+
 [bt_engine_cache]
 ; Memory used for btree engine cache. This is a cache for top levels of
 ; database btrees (id tree, seq tree) and a few terms from the db header. Value
diff --git a/src/couch/src/couch_time_seq.erl b/src/couch/src/couch_time_seq.erl
new file mode 100644
index 000000000..7b9c5e44e
--- /dev/null
+++ b/src/couch/src/couch_time_seq.erl
@@ -0,0 +1,301 @@
+% Licensed under the Apache License, Version 2.0 (the "License"); you may not
+% use this file except in compliance with the License. You may obtain a copy of
+% the License at
+%
+%   http://www.apache.org/licenses/LICENSE-2.0
+%
+% Unless required by applicable law or agreed to in writing, software
+% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+% License for the specific language governing permissions and limitations under
+% the License.
+
+% This module implements exponentially decaying time intervals which map to
+% database update sequences. The idea is to be able to quickly determine the set
+% of changes which occurred in a rough time interval. The closer to the last
+% update time -- the higher the accuracy. For instance, going back a few days,
+% it's only possible to target individual days. Then weeks, then months, then
+% going back years can target only years.
+%
+% An example of the shape of the data structure might be:
+%
+%  +---------+  +---------+  +--------+
+%  | seq=986 |  | seq=891 |  | seq=19 |
+%  +---------+  +---------+  +--------+
+%            ^            ^           ^
+%            |            |           |
+%          t=42          t=40        t=37
+%
+% The head, on the left, points to the newest (most recently updated) time bin.
+% In this example it started at t=42. The last t=37 bin is the oldest time bin.
+%
+% If we're looking for sequences starting before t=41, we'd pick seq=891 and
+% run the changes since=891. If we're looking for a sequence starting before
+% t=37, we'd start with since=0. The main idea here is that we'd rather err
+% on the side of returning too many rows than not enough. This is how our
+% clustered changes feeds behave during rewinds already, so it's not really new
+% behavior.
+%
+% The number of bins is always less than or equal to ?MAX_BIN_COUNT. During
+% updates, when we're forced to move to a new bin, the bins are rolled-up to
+% make some room before adding the latest update.
+%
+% During the roll-up, multiple old bins might end up merging into a single bin
+% a larger width. For example the above bins might end up in a single bin. In
+% the example above the old two bins might merge and the bins now might look
+% like:
+%
+%  +---------+  +---------+
+%  | seq=986 |  | seq=891 |
+%  +---------+  +---------+
+%            ^            ^
+%            |            |
+%          t=42          t=37
+%
+% If we're now looking for sequences starting before t=40, we'd pick seq=891.
+%
+% The roll-up algorithm can be summarized as:
+%
+%   - As a preparatory step: tag all the bins with their size, and reverse them
+%     to put them in chronological order (oldest first).
+%
+%   - For each interval in ?INTERVALS:
+%
+%     * For each bin in the list of bins before the (now - next interval) time:
+%
+%        + If durations of two adjacent bins are =< interval, merge the newer
+%          one into the older one.
+%
+%     * If we merged any bins for the given interval, return.
+%
+%     * If we didn't merge any bins, continue with a longer interval
+%
+%     * Once we get to just one interval remaining, and didn't succeed merging
+%       anything using the interval schedule in ?INTERVALS, simply merge half
+%       the bins in the oldest part of the interval. This is guaranteed to make
+%       more room.
+%
+
+-module(couch_time_seq).
+
+-export([
+    new/0,
+    new/1,
+    since/2,
+    update/3,
+    histogram/2,
+    timestamp/0
+]).
+
+-export_type([
+    time_seq/0
+]).
+
+% How bin count = 60 was picked:
+%
+%  - With the ?INTERVALS schedule defined below ran 1 update per hour for 1M
+%    updates starting at year 3000 and ending at year 3114 and obtained:
+%      * Less than 10 years of spacing between years at the start: 3000, 3009, 
3018 ...
+%      * Ten individual latest days: 3114-01-20 -> 3114-01-30
+%      * Seven individual latest months: 3113-07 -> 3114-01
+%  - Uncompressed term_to_binary(TSeq) = 920B
+%  - RAM flat size erts_debug:flat_size(TSeq) * erlang:system_info(wordsize) = 
2KB
+%
+-define(MAX_BIN_COUNT, 60).
+
+-define(H, 3600).
+-define(D, ?H * 24).
+-define(M, ?D * 30).
+-define(Y, ?D * 365).
+%% erlfmt-ignore
+-define(INTERVALS, [
+    ?H * 3, ?H * 6, ?H * 12,
+    ?D, ?D * 10,
+    ?M, ?M * 6,
+    ?Y, ?Y * 2, ?Y * 4, ?Y * 8, ?Y * 16
+]).
+
+% Version number so we could upgrade the data structure in the future.
+%
+-define(VER, 1).
+
+% Users can set a minimum time boundary to avoid errors with broken clocks.
+% Sometimes embedded systems get booted into 1970 and then get their time from
+% NTP. Updates are ignored if we can tell they are obviously wrong. Use some
+% recent time as a default min time value:
+%
+%   1754006400 = 2025-08-01T00:00:00Z
+%
+-define(DEFAULT_MIN_TIME, 1754006400).
+
+-type timestamp() :: pos_integer().
+-type bin_size() :: pos_integer().
+-type update_seq() :: non_neg_integer() | now.
+-type bin() :: {timestamp(), bin_size(), update_seq()}.
+-type time_seq() :: #{v := pos_integer(), bins := [bin()]}.
+
+-spec new() -> time_seq().
+new() ->
+    #{v => ?VER, bins => []}.
+
+-spec new(time_seq()) -> time_seq().
+new(#{v := ?VER, bins := Bins} = Ctx) when is_list(Bins) ->
+    % Future upgrade clauses would go here. When upgrading make sure to add a
+    % clause to check/1 to return true for the older version as well. But the
+    % upgrade itself will happen in new/1.
+    Ctx.
+
+-spec update(time_seq(), timestamp(), update_seq()) -> time_seq().
+update(#{v := ?VER} = Ctx, Time, Seq) when is_integer(Time), is_integer(Seq), 
Seq >= 0 ->
+    #{bins := Bins} = Ctx,
+    RoundTime = round_time(Time),
+    case Time >= min_time() of
+        true -> Ctx#{bins := update_bins(Bins, RoundTime, Seq)};
+        false -> Ctx
+    end.
+
+% @edoc: Return the highest known sequence that comes before the given time. If 
the time
+% falls before the oldest bin then return 0. If the time is past the time of
+% the first bin plus the time resolution, return the atom 'now'.
+% @end
+-spec since(time_seq(), timestamp()) -> update_seq().
+since(#{v := ?VER} = Ctx, Time) when is_integer(Time) ->
+    #{bins := Bins} = Ctx,
+    Resolution = resolution_sec(),
+    case lists:dropwhile(fun({T, _}) -> Time < T end, Bins) of
+        [] -> 0;
+        [{T, _} | _] when Time > (T + Resolution) -> now;
+        [{_, Seq} | _] -> Seq
+    end.
+
+% @edoc: Return a histogram of formatted time and number of sequence updates 
which
+% happened during that interval. The result might look like:
+%
+%   [["2025-01-02T03:04:00Z", 42], ["2025-01-02T01:01:00Z", 1], ...]
+%
+% The nested list format is so it can be emitted as json directly
+% @end
+-spec histogram(time_seq(), non_neg_integer()) -> [[binary() | 
non_neg_integer()]].
+histogram(#{v := ?VER, bins := Bins}, UpdateSeq) ->
+    [[time_fmt(T), Seq] || {T, Seq} <- seq_histogram(UpdateSeq, Bins)].
+
+% @edoc: Timestamp to use with the update/3 function. This is currently the OS 
Posix/Unix time
+% in seconds obtained from the os:system_time(second), the same as the one used
+% by couch_log for timestamp log entries.
+% @end
+-spec timestamp() -> pos_integer().
+timestamp() ->
+    os:system_time(second).
+
+%
+% Private functions
+%
+
+update_bins([], Time, Seq) ->
+    [{Time, Seq}];
+update_bins([{TopT, TopS} | _] = Bins, Time, Seq) when Time > TopT, Seq > TopS 
->
+    % We're into another time period, so create a new bin and roll up older 
ones
+    [{Time, Seq} | rollup(Time, Bins)];
+update_bins([_ | _] = Bins, _Time, _Seq) ->
+    % Only allow time and sequences to move forward, otherwise ignore the
+    % update. We'll catch up later with it.
+    Bins.
+
+rollup(_, Bins) when is_list(Bins), length(Bins) < ?MAX_BIN_COUNT ->
+    Bins;
+rollup(TimeNow, Bins) ->
+    % The roll-up algorithm is a bit easier to reason about if the bins are in
+    % chronological order (reversed) and we know their duration. So, as a
+    % preparatory step, reverse the bins and make them into 3-tuples that look
+    % like {Time, Duration, Seq}.
+    ReversedBins = add_durations(TimeNow, Bins),
+    RolledUpBins = do_rollup(ReversedBins, TimeNow, ?INTERVALS),
+    % Remove the durations and put them in the original order
+    lists:reverse([{T, S} || {T, _L, S} <- RolledUpBins]).
+
+do_rollup([_ | _] = Bins, _TimeNow, [_] = _Intervals) ->
+    % We're down to the last, largest interval. If the nicer binning didn't
+    % succeed just resample the oldest half. Using the max(durations), we're
+    % sure to always make some room.
+    {Before, After} = lists:split(length(Bins) div 2, Bins),
+    {_, IntervalDurations, _} = lists:unzip3(Before),
+    Interval = lists:max(IntervalDurations),
+    RolledUpBefore = merge(Before, Interval),
+    RolledUpBefore ++ After;
+do_rollup([_ | _] = Bins, TimeNow, [Interval, NextInterval | RestIntervals]) ->
+    % Don't roll up bins after (TimeNow - NextInterval) as we'd like to have
+    % more precision closer to the present. For instance:
+    %   - When we roll-up 3 hours, stop before (now - 6 hours)
+    %   - When we roll-up 6 months, stop before (now - 1 year)
+    Thresh = TimeNow - NextInterval,
+    {Before, After} = lists:splitwith(fun({T, _, _}) -> T =< Thresh end, Bins),
+    case merge(Before, Interval) of
+        Before ->
+            % No change, try the next larger interval
+            do_rollup(Bins, TimeNow, [NextInterval | RestIntervals]);
+        RolledUpBefore ->
+            % We made some room so return
+            RolledUpBefore ++ After
+    end.
+
+% Merge adjacent two bins if they are =< interval
+%
+merge(Bins, Int) ->
+    merge(Bins, Int, []).
+
+merge([], _Int, Acc) ->
+    lists:reverse(Acc);
+merge([_] = Bins, _Int, Acc) ->
+    lists:reverse(Acc) ++ Bins;
+merge([{T1, D1, S1}, {_, D2, _} | Rest], Int, Acc) when D1 =< Int, D2 =< Int ->
+    Merged = {T1, D1 + D2, S1},
+    merge(Rest, Int, [Merged | Acc]);
+merge([Bin | Rest], Int, Acc) ->
+    merge(Rest, Int, [Bin | Acc]).
+
+% Assign durations to bins and put them in chronological order.
+% The duration will be the middle element [{Time, Duration, Seq}, ...]
+%
+add_durations(TNow, [{T0, S0} | Rest]) ->
+    Fun = fun({T, S}, [{TPrev, _, _} | _] = Acc) ->
+        [{T, TPrev - T, S} | Acc]
+    end,
+    lists:foldl(Fun, [{T0, TNow - T0, S0}], Rest).
+
+seq_histogram(_, []) ->
+    [];
+seq_histogram(UpdateSeq, Bins) ->
+    % Since update history might have started with an upgrade from a version
+    % which didn't have time-seq available, create a new bin starting at Unix
+    % time 0 to include the changes before the upgrade.
+    Bins1 =
+        case lists:last(Bins) of
+            {_, S0} when is_integer(S0), S0 > 0 -> Bins ++ [{0, 0}];
+            {_, 0} -> Bins
+        end,
+    Fun = fun({T, S}, {SPrev, Acc}) ->
+        SeqDelta = max(0, SPrev - S),
+        {S, [{T, SeqDelta} | Acc]}
+    end,
+    {_, Res} = lists:foldl(Fun, {UpdateSeq, []}, Bins1),
+    Res.
+
+% Smallest interval we can track in seconds
+%
+resolution_sec() ->
+    hd(?INTERVALS).
+
+% Round time according to the resolution.
+%
+round_time(T) when is_integer(T), T > 0 ->
+    RoundBy = resolution_sec(),
+    (T div RoundBy) * RoundBy.
+
+min_time() ->
+    config:get_integer("couchdb", "time_seq_min_time", ?DEFAULT_MIN_TIME).
+
+% Format time as YYYY-MM-DDTHH:MM:SSZ, it's accepted by RFC 3339 and ISO 8601
+% standards.
+%
+time_fmt(Time) when is_integer(Time) ->
+    list_to_binary(calendar:system_time_to_rfc3339(Time, [{offset, "Z"}])).
diff --git a/src/couch/test/eunit/couch_time_seq_tests.erl 
b/src/couch/test/eunit/couch_time_seq_tests.erl
new file mode 100644
index 000000000..289b519ce
--- /dev/null
+++ b/src/couch/test/eunit/couch_time_seq_tests.erl
@@ -0,0 +1,348 @@
+% Licensed under the Apache License, Version 2.0 (the "License"); you may not
+% use this file except in compliance with the License. You may obtain a copy of
+% the License at
+%
+%   http://www.apache.org/licenses/LICENSE-2.0
+%
+% Unless required by applicable law or agreed to in writing, software
+% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+% License for the specific language governing permissions and limitations under
+% the License.
+
+-module(couch_time_seq_tests).
+
+-ifdef(WITH_PROPER).
+-include_lib("couch/include/couch_eunit_proper.hrl").
+-else.
+-include_lib("couch/include/couch_eunit.hrl").
+-endif.
+
+% Some bogus time in the far future to avoid reaching it and have that
+% interfere with the tests
+%
+-define(TEST_TIME, "3000-01-01T00:00:00Z").
+
+new_test() ->
+    New = couch_time_seq:new(),
+    Timestamp = couch_time_seq:timestamp(),
+    TSeq1 = couch_time_seq:update(New, Timestamp, 1),
+    ?assertMatch(#{v := 1, bins := [_ | _]}, couch_time_seq:new(TSeq1)).
+
+min_time_limit_test() ->
+    TSeq = couch_time_seq:new(),
+    MinTime = couch_time_seq:update(TSeq, 1, 42),
+    ?assertEqual(TSeq, MinTime).
+
+update_first_test() ->
+    TSeq = couch_time_seq:new(),
+    TSeq1 = couch_time_seq:update(TSeq, test_time(), 1),
+    ?assertNotEqual(TSeq, TSeq1),
+    ?assertEqual(1, length(maps:get(bins, TSeq1))).
+
+update_same_hour_test() ->
+    TSeq = couch_time_seq:new(),
+    T = test_time(),
+    TSeq1 = couch_time_seq:update(TSeq, T, 1),
+    TSeq2 = couch_time_seq:update(TSeq1, T, 2),
+    #{bins := Bins} = TSeq2,
+    ?assertMatch([{_, 1}], Bins),
+    TSeq3 = couch_time_seq:update(TSeq2, T, 1),
+    ?assertEqual(TSeq2, TSeq3).
+
+stale_update_test() ->
+    TSeq = couch_time_seq:new(),
+    T = test_time(),
+    TSeq1 = couch_time_seq:update(TSeq, T + hours(3), 42),
+    TSeq2 = couch_time_seq:update(TSeq1, T, 43),
+    ?assertEqual(TSeq1, TSeq2).
+
+update_new_hour_test() ->
+    TSeq = couch_time_seq:new(),
+    TSeq1 = couch_time_seq:update(TSeq, test_time() + hours(3), 42),
+    TSeq2 = couch_time_seq:update(TSeq1, test_time() + hours(6), 43),
+    ?assertNotEqual(TSeq1, TSeq2),
+    #{bins := Bins} = TSeq2,
+    ?assertEqual(2, length(Bins)),
+    ?assertMatch([{_, 43}, {_, 42}], Bins).
+
+update_large_gap_test() ->
+    TSeq = couch_time_seq:new(),
+    T = test_time(),
+    TSeq1 = couch_time_seq:update(TSeq, T + hours(1), 42),
+    TSeq2 = couch_time_seq:update(TSeq1, T + years(30), 43),
+    ?assertNotEqual(TSeq1, TSeq2),
+    #{bins := Bins} = TSeq2,
+    ?assertEqual(2, length(Bins)),
+    ?assertMatch([{_, 43}, {_, 42}], Bins).
+
+update_when_bins_are_full_test() ->
+    TSeq = fill_bins(),
+    #{bins := Bins} = TSeq,
+    ?assertEqual(60, length(Bins)),
+    #{bins := [{TopTime, TopSeq} | _]} = TSeq,
+    TSeq1 = couch_time_seq:update(TSeq, TopTime + hours(1), TopSeq + 1),
+    #{bins := Bins1} = TSeq1,
+    ?assert(length(Bins1) =< 60),
+    TSeq2 = couch_time_seq:update(TSeq1, TopTime + hours(2), TopSeq + 2),
+    #{bins := Bins2} = TSeq2,
+    ?assert(length(Bins2) =< 60),
+    TSeq3 = couch_time_seq:update(TSeq2, TopTime + years(1), TopSeq + 3),
+    #{bins := Bins3} = TSeq3,
+    ?assert(length(Bins3) =< 60).
+
+update_large_gap_when_bins_are_full_test() ->
+    TSeq = fill_bins(),
+    #{bins := Bins} = TSeq,
+    ?assertEqual(60, length(Bins)),
+    #{bins := [{TopTime, TopSeq} | _]} = TSeq,
+    TSeq1 = couch_time_seq:update(TSeq, TopTime + years(999), TopSeq + 1),
+    #{bins := Bins1} = TSeq1,
+    {FirstTime, _} = lists:last(Bins1),
+    ?assertEqual(test_time(), FirstTime),
+    {LastTime, _} = hd(Bins1),
+    ?assert(LastTime > test_time() + years(999)).
+
+update_3_hours_8_times_test() ->
+    TSeq = update_cnt(8, hours(3)),
+    #{bins := Bins} = TSeq,
+    ?assertEqual(8, length(Bins)),
+    {FirstTime, _} = lists:last(Bins),
+    ?assertEqual(test_time(), FirstTime),
+    {LastTime, _} = hd(Bins),
+    ?assertEqual(FirstTime + hours(21), LastTime),
+    Hist = couch_time_seq:histogram(TSeq, 8),
+    lists:foreach(fun([_T, N]) -> ?assertEqual(1, N) end, Hist).
+
+update_twice_an_hour_for_a_day_test() ->
+    TSeq = update_cnt(2 * 24, hours(1) div 2),
+    #{bins := Bins} = TSeq,
+    ?assertEqual(8, length(Bins)),
+    {FirstTime, _} = lists:last(Bins),
+    ?assertEqual(test_time(), FirstTime),
+    {LastTime, _} = hd(Bins),
+    ?assertEqual(FirstTime + hours(21), LastTime),
+    Hist = couch_time_seq:histogram(TSeq, 2 * 24),
+    lists:foreach(fun([_, N]) -> ?assertEqual(6, N) end, Hist).
+
+update_once_an_hour_for_180_hours_test() ->
+    TSeq = update_cnt(180, hours(1)),
+    #{bins := Bins} = TSeq,
+    ?assertEqual(60, length(Bins)),
+    {FirstTime, _} = lists:last(Bins),
+    ?assertEqual(test_time(), FirstTime),
+    {LastTime, _} = hd(Bins),
+    ?assertEqual(FirstTime + hours(177), LastTime),
+    Hist = couch_time_seq:histogram(TSeq, 180),
+    lists:foreach(fun([_, N]) -> ?assertEqual(3, N) end, Hist).
+
+update_once_an_hour_for_183_hours_test() ->
+    TSeq = update_cnt(183, hours(1)),
+    #{bins := Bins} = TSeq,
+    % A roll-up occurred, so the majority of the bins were merged
+    ?assertEqual(32, length(Bins)),
+    {FirstTime, _} = lists:last(Bins),
+    ?assertEqual(test_time(), FirstTime),
+    {LastTime, _} = hd(Bins),
+    ?assertEqual(FirstTime + hours(180), LastTime),
+    Hist = couch_time_seq:histogram(TSeq, 183),
+    % All the counts should be 3 through 6 only
+    lists:foreach(fun([_, N]) -> ?assert(N >= 3 andalso N =< 6) end, Hist),
+    [_, HistCntFirst] = hd(Hist),
+    ?assertEqual(6, HistCntFirst),
+    [_, HistCntLast] = lists:last(Hist),
+    ?assertEqual(3, HistCntLast).
+
+update_once_an_hour_for_100_000_hours_test() ->
+    TSeq = update_cnt(100_000, 3600),
+    #{bins := Bins} = TSeq,
+    Hist = couch_time_seq:histogram(TSeq, 100_000),
+    SumChanges = lists:sum([C || [_T, C] <- Hist]),
+    ?assertEqual(100_000, SumChanges),
+    {FirstTime, _} = lists:last(Bins),
+    ?assertEqual(test_time(), FirstTime).
+
+update_10_000_with_large_interval_test() ->
+    Month = 3600 * 24 * 30,
+    TSeq = update_cnt(10_000, Month),
+    #{bins := Bins} = TSeq,
+    Hist = couch_time_seq:histogram(TSeq, 10_000),
+    SumChanges = lists:sum([C || [_T, C] <- Hist]),
+    ?assertEqual(10_000, SumChanges),
+    {FirstTime, _} = lists:last(Bins),
+    ?assertEqual(test_time(), FirstTime).
+
+before_test() ->
+    T = test_time(),
+    New = couch_time_seq:new(),
+    TSeq0 = couch_time_seq:update(New, T, 42),
+    TSeq = couch_time_seq:update(TSeq0, T + hours(3), 43),
+    % [{T + 3H, 43}, {T, 42}]
+    ?assertEqual(0, couch_time_seq:since(TSeq, 0)),
+    ?assertEqual(0, couch_time_seq:since(TSeq, T - 1)),
+    ?assertEqual(42, couch_time_seq:since(TSeq, T)),
+    ?assertEqual(42, couch_time_seq:since(TSeq, T + 1)),
+    ?assertEqual(42, couch_time_seq:since(TSeq, T + hours(3) - 1)),
+    ?assertEqual(43, couch_time_seq:since(TSeq, T + hours(3))),
+    ?assertEqual(43, couch_time_seq:since(TSeq, T + hours(3) + 1)),
+    ?assertEqual(now, couch_time_seq:since(TSeq, T + hours(999))).
+
+histogram_test() ->
+    New = couch_time_seq:new(),
+    ?assertEqual([], couch_time_seq:histogram(New, 0)),
+    ?assertEqual([], couch_time_seq:histogram(New, 1)),
+    T = test_time(),
+    % Start with seq 42 as if we're upgrading a db which was created
+    % before the time-seq feature. We don't know when those sequences were
+    % created so expect them to appear in bin with Unix time = 0 (1970-01-01)
+    TSeq0 = couch_time_seq:update(New, T, 42),
+    ?assertEqual(
+        [
+            [<<"1970-01-01T00:00:00Z">>, 42],
+            [<<"3000-01-01T00:00:00Z">>, 1]
+        ],
+        couch_time_seq:histogram(TSeq0, 43)
+    ),
+
+    TSeq = couch_time_seq:update(TSeq0, T + hours(3), 43),
+    % The 1 is the jump from 42 to 43, then 3 is a jump from 43 to now (46)
+    ?assertEqual(
+        [
+            [<<"1970-01-01T00:00:00Z">>, 42],
+            [<<"3000-01-01T00:00:00Z">>, 1],
+            [<<"3000-01-01T03:00:00Z">>, 3]
+        ],
+        couch_time_seq:histogram(TSeq, 46)
+    ).
+
+% Some test helper functions
+
+test_time() ->
+    calendar:rfc3339_to_system_time(?TEST_TIME).
+
+hours(Hours) ->
+    Hours * 3600.
+
+years(Years) ->
+    hours(1) * 24 * 356 * Years.
+
+fill_bins() ->
+    fill_bins(test_time(), 1, hours(1), couch_time_seq:new()).
+
+fill_bins(Time, Seq, TimeInc, #{bins := Bins} = TSeq) ->
+    case length(Bins) == 60 of
+        true ->
+            TSeq;
+        false ->
+            TSeq1 = couch_time_seq:update(TSeq, Time, Seq),
+            Seq1 = Seq + 1,
+            TimeInc1 =
+                case TimeInc of
+                    [From, To] -> From + rand:uniform(To - From) - 1;
+                    _ -> TimeInc
+                end,
+            Time1 = Time + TimeInc1,
+            fill_bins(Time1, Seq1, TimeInc, TSeq1)
+    end.
+
+update_cnt(N, TimeInc) ->
+    update_cnt(N, test_time(), 0, TimeInc, couch_time_seq:new()).
+
+update_cnt(0, _Time, _Seq, _TimeInc, TSeq) ->
+    TSeq;
+update_cnt(Cnt, Time, Seq, TimeInc, TSeq) ->
+    TSeq1 = couch_time_seq:update(TSeq, Time, Seq),
+    TimeInc1 =
+        case TimeInc of
+            [From, To] -> From + rand:uniform(To - From);
+            _ -> TimeInc
+        end,
+    Time1 = Time + TimeInc1,
+    Seq1 = Seq + 1,
+    update_cnt(Cnt - 1, Time1, Seq1, TimeInc, TSeq1).
+
+%
+% Property tests
+%
+
+-ifdef(WITH_PROPER).
+
+couch_time_property_test_() ->
+    ?EUNIT_QUICKCHECK(60, 10000).
+%
+% Properties
+%
+
+prop_sorted_after_update() ->
+    ?FORALL(
+        TSeq,
+        tseq_g(),
+        begin
+            #{bins := Bins} = TSeq,
+            Bins =:= lists:reverse(lists:ukeysort(1, Bins))
+        end
+    ).
+
+prop_sequences_are_non_decreasing_after_update() ->
+    ?FORALL(
+        TSeq,
+        tseq_g(),
+        begin
+            #{bins := Bins} = TSeq,
+            {_, Seqs} = lists:unzip(Bins),
+            Seqs =:= lists:reverse(lists:sort(Seqs))
+        end
+    ).
+
+prop_correct_size_after_update() ->
+    ?FORALL(
+        TSeq,
+        tseq_g(),
+        begin
+            #{bins := Bins} = TSeq,
+            60 >= length(Bins)
+        end
+    ).
+
+prop_no_0_seq_bins_after_updates() ->
+    ?FORALL(
+        TSeq,
+        tseq_g(),
+        begin
+            % Only the first bin may have a 0 sequence
+            #{bins := Bins} = TSeq,
+            case lists:reverse(Bins) of
+                [] -> true;
+                [{_, _}] -> true;
+                [{_, 0} | Rest] -> [] =:= [B || {_, 0} = B <- Rest];
+                [_ | _] -> [] =:= [B || {_, 0} = B <- Bins]
+            end
+        end
+    ).
+
+%
+% Generators
+%
+
+hours_g() ->
+    % 10 years worth of hours
+    % -24 is to test jumping backwards in time
+    range(-24, 24 * 365 * 10).
+
+seq_g() ->
+    non_neg_integer().
+
+tseq_g() ->
+    ?LET(
+        Updates,
+        list({hours_g(), seq_g()}),
+        lists:foldl(
+            fun({Hours, Seq}, TSeq) ->
+                couch_time_seq:update(TSeq, test_time() + hours(Hours), Seq)
+            end,
+            couch_time_seq:new(),
+            Updates
+        )
+    ).
+
+-endif.

Reply via email to