
Commit ddfde70

Initialize repo with latest version of plugin
Manual initialization to move from the old model of several plugins in one repo (one per folder) to the new model of one plugin per repo. The benefit is better Git integration with Dataiku DSS.
1 parent 9b743c0 commit ddfde70

21 files changed

Lines changed: 2355 additions & 21 deletions

File tree

LICENSE

Lines changed: 0 additions & 21 deletions
This file was deleted.

LICENSE.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
Copyright (C) 2019 Dataiku

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE X CONSORTIUM BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Makefile

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
PLUGIN_VERSION=0.3.0
PLUGIN_ID=time-series-forecast

plugin:
	cat plugin.json | json_pp > /dev/null
	rm -rf dist
	mkdir dist
	zip -r dist/dss-plugin-${PLUGIN_ID}-${PLUGIN_VERSION}.zip plugin.json code-env custom-recipes resource

include ../Makefile.inc

README.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
# Forecast Plugin

The Forecast plugin provides visual recipes in Dataiku DSS for working with time series data and solving forecasting problems.

> Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volumes; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (for the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid to effective and efficient planning.

<p style="text-align: right"> - Hyndman, Rob J. and George Athanasopoulos</p>
## Scope of the plugin

This plugin offers a set of 3 visual recipes to forecast yearly to hourly time series. It covers the full cycle of data cleaning, model training, evaluation, and prediction:
- Cleaning, aggregation, and resampling of time series data
- Training of forecast models on time series data, and evaluation of these models
- Predicting future values and getting historical residuals based on trained models

The following models are available in the recipe:
- Neural Network
- Seasonal Trend
- Exponential Smoothing
- ARIMA

This plugin does NOT work on sub-hourly time series (data must be at the hourly level or coarser) and does not provide signal processing techniques (Fourier Transform…).

This plugin works well when:
- The training data consists of one or multiple time series at the hour/day/week/month/quarter/year level and fits in the server’s RAM.
- The object to predict is the future of one of these time series.
## When to use this plugin

Forecasting is a branch of Machine Learning where:
- The training data consists of one or multiple time series.
- The object to predict is the future values of one of these time series.

A time series is simply a variable with several values measured over time.
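For example, daily sales counts form a time series. A minimal illustration with made-up data (pandas and the column names here are ours, not the plugin's):

```python
import pandas as pd

# A tiny univariate time series: one value of `sales` per day.
# Both column names are purely illustrative.
dates = pd.date_range("2019-01-01", periods=5, freq="D")
series = pd.DataFrame({"date": dates, "sales": [12.0, 15.0, 14.0, 18.0, 21.0]})

print(series)
```

This is the shape the plugin's recipes expect: one time column plus one or more numeric series columns.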
Forecasting is slightly different from "classic" Machine Learning (ML) as currently available in the Visual ML interface of Dataiku, because:
- Forecast models output multiple values, whereas one Visual ML analysis is designed to predict a single output.
- Open source implementations of forecast models differ from the Python/Scala ones available in Visual ML.
- Evaluation of forecast accuracy uses specific methods (errors across a forecast horizon, cross-validation) which are not currently available in Visual ML.

Having said that, it has always been possible to forecast time series in Dataiku using Visual ML with custom work:
- Feature engineering to get lagged features for each time series, for instance using the Window recipe.
- If the forecast is for more than one time step ahead: training one Visual ML model for each forecast horizon.
- Custom code to evaluate the models' accuracy and forecast future values for multiple steps.

Another way would be for a data scientist to code her own forecasting pipeline using open source R or Python libraries.

These two ways of building a forecasting pipeline require good knowledge of machine learning, forecasting techniques and programming, and are not accessible to a Data Analyst. With this plugin, we want to offer a simple way to build a forecasting pipeline without code.
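To give a feel for the custom-work route described above, here is an illustrative sketch with invented data, where lagged features feed one model per forecast horizon; scikit-learn's LinearRegression stands in for whatever model Visual ML would train, and none of this is the plugin's own code:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy series; in DSS, the lags could come from a Window recipe instead.
y = pd.Series([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 19.0, 21.0, 22.0, 24.0])

# Lagged features: values at t-1, t-2, t-3.
lags = pd.concat({f"lag_{k}": y.shift(k) for k in (1, 2, 3)}, axis=1)

# One model per forecast horizon (here, 1 and 2 steps ahead).
models = {}
for h in (1, 2):
    target = y.shift(-(h - 1))  # value h steps after the most recent lag
    frame = pd.concat([lags, target.rename("target")], axis=1).dropna()
    models[h] = LinearRegression().fit(frame[lags.columns], frame["target"])

# Forecast both horizons from the most recent lags.
latest = lags.iloc[[-1]]
forecasts = {h: float(m.predict(latest)[0]) for h, m in models.items()}
```

Evaluating such a pipeline properly (errors per horizon step, time-based cross-validation) is the additional custom code the plugin aims to make unnecessary.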
## Installation and usage of the plugin

Please see the [plugin page](https://www.dataiku.com/dss/plugins/info/forecast.html) on Dataiku's website.

Note that the plugin uses an R code environment, so R (version 3.5.0 or above) must be installed and integrated with Dataiku on your machine. Anaconda R is not supported.

## Changelog

**Version 0.3.0 "beta 3" (2019-05)**

* Remove dependency on the rstan and prophet packages.

**Version 0.2.0 "beta 2" (2019-03)**

* Multivariate forecasting: added support of external regressors for the Neural Network, Prophet and ARIMA models (requires availability of future values of the regressors when forecasting).

**Version 0.1.0 "beta 1" (2019-01)**

* Initial release
* First pipeline for univariate forecasting of hourly to yearly time series
## Roadmap

- Evaluation recipe:
  * For the cross-validation strategy: error metrics at each step within the horizon
- Prediction recipe:
  * Ability to get forecasts from multiple models at the same time, for ensembling
  * Fan plot of confidence intervals within the horizon

You can check the development branch ["time-series-forecast"](https://github.com/dataiku/dataiku-contrib/tree/time-series-forecast/time-series-forecast) on the [dataiku-contrib](https://github.com/dataiku/dataiku-contrib) git repo.

You can also ask your questions on our Q&A, [answers.dataiku.com](https://answers.dataiku.com), or open a [GitHub issue](https://github.com/dataiku/dataiku-contrib/issues).
## Advanced Usages

### Forecasts by Entity

If you want to run the recipes to get multiple forecast models per entity (e.g. per product or store), you will need partitioning. This requires all datasets to be partitioned by one dimension for the category, using the [discrete dimension](https://doc.dataiku.com/dss/latest/partitions/identifiers.html#discrete-dimension-identifiers) feature in Dataiku. If the input data is not partitioned, you can use a Sync recipe to repartition it, as explained in [this article](https://www.dataiku.com/learn/guide/other/partitioning/partitioning-redispatch.html).

### Combination of Forecast and Machine Learning

A full pipeline would combine ML with forecast models. First, you can predict the forecast residuals (actual value - forecast) using ML models; ML is indeed most effective once the trend and seasonality have been removed, so that the time series is stationary. Second, you can perform anomaly detection using clustering in Visual ML, by detecting spikes in the forecast residuals.
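A minimal sketch of the residual idea, with invented numbers and a simple z-score rule standing in for Visual ML clustering:

```python
import pandas as pd

# Hypothetical columns: observed values and a forecast model's fitted values.
df = pd.DataFrame({
    "actual":   [100.0, 102.0, 101.0, 140.0, 103.0, 104.0],
    "forecast": [ 99.0, 101.0, 102.0, 101.0, 103.0, 105.0],
})

# Residuals (actual - forecast): the detrended, deseasonalized signal
# that a downstream ML model or anomaly detector would work on.
df["residual"] = df["actual"] - df["forecast"]

# Flag spikes: residuals more than 2 standard deviations from the mean.
z = (df["residual"] - df["residual"].mean()) / df["residual"].std()
df["anomaly"] = z.abs() > 2

print(df[df["anomaly"]])
```

Here the single large residual (140 observed vs. 101 forecast) is flagged; a clustering model in Visual ML would play the same role on real residuals.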

# License

The Forecast plugin is:

Copyright (c) 2019 Dataiku SAS

Licensed under the [MIT License](LICENSE.md).

code-env/R/desc.json

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{
    "installCorePackages": true,
    "installJupyterSupport": true,
    "conda": false,
    "forceConda": false,
    "deploymentMode": "DESIGN_MANAGED",
    "envSettings": {
        "inheritGlobalSettings": true
    }
}

code-env/R/spec/rPackages.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
"forecast","8.5"
"xts","0.11-2"
"lubridate","1.7.4"
"R.utils","2.8.0"
"dplyr","0.8.0.1"
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
##### LIBRARY LOADING #####

library(dataiku)
source(file.path(dkuCustomRecipeResource(), "clean.R"))


##### INPUT OUTPUT CONFIGURATION #####

INPUT_DATASET_NAME <- dkuCustomRecipeInputNamesForRole('INPUT_DATASET_NAME')[1]
OUTPUT_DATASET_NAME <- dkuCustomRecipeOutputNamesForRole('OUTPUT_DATASET_NAME')[1]

# Expose every plugin parameter as a variable in the local environment
config <- dkuCustomRecipeConfig()
for (n in names(config)) {
  assign(n, CleanPluginParam(config[[n]]))
}

# Check that partitioning settings are correct, if activated
checkPartitioning <- CheckPartitioningSettings(INPUT_DATASET_NAME)

selectedColumns <- c(TIME_COLUMN, SERIES_COLUMNS)
columnClasses <- c("character", rep("numeric", length(SERIES_COLUMNS)))
dfInput <- dkuReadDataset(INPUT_DATASET_NAME, columns = selectedColumns, colClasses = columnClasses)


##### DATA PREPARATION STAGE #####

PrintPlugin("Data preparation stage starting...")

dfOutput <- dfInput %>%
  PrepareDataframeWithTimeSeries(TIME_COLUMN, SERIES_COLUMNS,
                                 GRANULARITY, AGGREGATION_STRATEGY) %>%
  CleanDataframeWithTimeSeries(TIME_COLUMN, SERIES_COLUMNS, GRANULARITY,
                               MISSING_VALUES, MISSING_IMPUTE_WITH, MISSING_IMPUTE_CONSTANT,
                               OUTLIERS, OUTLIERS_IMPUTE_WITH, OUTLIERS_IMPUTE_CONSTANT)

# Sanity check: resampling should not blow up the number of rows
if (nrow(dfOutput) > 3 * nrow(dfInput)) {
  PrintPlugin(paste0("Resampled data is 3 times longer than input data. ",
                     "Please check time granularity setting."), stop = TRUE)
}

PrintPlugin("Data preparation stage completed, saving prepared data to output dataset.")

WriteDatasetWithPartitioningColumn(dfOutput, OUTPUT_DATASET_NAME)

PrintPlugin("All stages completed!")
