Installation#
Just pip-install it:
$ pip install tshistory_refinery
Create first a postgresql database:
$ createdb my_time_series
Then initialize the database schema:
$ tsh init-db postgresql:///my_time_series --no-dry-run
Last item: we need a configuration file refinery.cfg
in the home
directory, containing:
[db]
uri = postgresql:///my_time_series
Introduction#
Purpose#
tshistory
is targetted at applications using time series where
backtesting and
cross-validation
are an essential feature.
It provides exhaustivity and efficiency of the storage, with a simple Python api.
It can be used as a building block for machine learning, model optimization and validation, both for inputs and outputs.
Principles#
There are many ways to represent timeseries in a relational database,
and tshistory
provides two things:
a base python API which abstracts away the underlying storage
a postgres model, which emphasizes the compact storage of successive states of series
The core idea of tshistory is to handle successive versions of timeseries as they grow in time, allowing to get older states of any series.
Timeseries Store Usage#
Starting with a fresh database#
You need a postgresql database. You can create one like this:
createdb mydb
Then, initialize the tshistory
tables, like this:
tsh init-db postgresql://me:password@localhost/mydb
From this you’re ready to go !
Creating a series#
However here’s a simple example:
>>> import pandas as pd
>>> from tshistory.api import timeseries
>>>
>>> tsa = timeseries('postgres://me:password@localhost/mydb')
>>>
>>> series = pd.Series([1, 2, 3],
... pd.date_range(start=pd.Timestamp(2017, 1, 1),
... freq='D', periods=3))
# db insertion
>>> tsa.update('my_series', series, 'babar@pythonian.fr')
...
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
Freq: D, Name: my_series, dtype: float64
# note how our integers got turned into floats
# (there are no provisions to handle integer series as of today)
# retrieval
>>> tsa.get('my_series')
...
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
Name: my_series, dtype: float64
Note that we generally adopt the convention to name the time series api
object tsa
.
Updating a series#
This is good. Now, let’s insert more:
>>> series = pd.Series([2, 7, 8, 9],
... pd.date_range(start=pd.Timestamp(2017, 1, 2),
... freq='D', periods=4))
# db insertion
>>> tsa.update('my_series', series, 'babar@pythonian.fr')
...
2017-01-03 7.0
2017-01-04 8.0
2017-01-05 9.0
Name: my_series, dtype: float64
# you get back the *new information* you put inside
# and this is why the `2` doesn't appear (it was already put
# there in the first step)
# db retrieval
>>> tsa.get('my_series')
...
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 7.0
2017-01-04 8.0
2017-01-05 9.0
Name: my_series, dtype: float64
It is important to note that the third value was replaced, and the two
last values were just appended. As noted the point at 2017-1-2
wasn’t a new information so it was just ignored.
Retrieving history#
We can access the whole history (or parts of it) in one call:
>>> history = tsa.history('my_series')
...
>>>
>>> for idate, series in history.items(): # it's a dict
... print('insertion date:', idate)
... print(series)
...
insertion date: 2018-09-26 17:10:36.988920+02:00
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
Name: my_series, dtype: float64
insertion date: 2018-09-26 17:12:54.508252+02:00
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 7.0
2017-01-04 8.0
2017-01-05 9.0
Name: my_series, dtype: float64
Note how this shows the full serie state for each insertion date. Also the insertion date is timzeone aware.
Specific versions of a series can be retrieved individually using the
get
method as follows:
>>> tsa.get('my_series', revision_date=pd.Timestamp('2018-09-26 17:11+02:00'))
...
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
Name: my_series, dtype: float64
>>>
>>> tsa.get('my_series', revision_date=pd.Timestamp('2018-09-26 17:14+02:00'))
...
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 7.0
2017-01-04 8.0
2017-01-05 9.0
Name: my_series, dtype: float64
It is possible to retrieve only the differences between successive insertions:
>>> diffs = tsa.history('my_series', diffmode=True)
...
>>> for idate, series in diffs.items():
... print('insertion date:', idate)
... print(series)
...
insertion date: 2018-09-26 17:10:36.988920+02:00
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
Name: my_series, dtype: float64
insertion date: 2018-09-26 17:12:54.508252+02:00
2017-01-03 7.0
2017-01-04 8.0
2017-01-05 9.0
Name: my_series, dtype: float64
You can see a series metadata:
>>> tsa.update_metadata('series', {'foo': 42})
>>> tsa.metadata('series')
{foo: 42}