LTStsVarSel does variable selection in the robust time series model LTSts

LTSts requires variable selection when the optimal model parameters are not known in advance. This happens in particular when the function has to be applied to many heterogeneous datasets in an automatic way, possibly on a regular basis (that is the model parameters are expected to change over time even for datasets associated to the same phenomenon).

The approach consists in iteratively eliminating the less significant estimated model parameter, starting from an over-parametrized model. The model parameters are re-estimated with LTSts at each step, until all the p-values are below a given threshold. Then, the output is a reduced time series model with significant parameters only.

```
```

run LTStsVarSel with all default options.`reduced_est`

=LTStsVarSel(`y`

)

```
```

run LTStsVarSel starting from a specific over-parametrized model.`reduced_est`

=LTStsVarSel(`y`

,
`Name, Value`

)

```
[
```

run LTStsVarSel starting from over-parametrized model with autoregressive components.`reduced_est`

,
` reduced_model`

]
=LTStsVarSel(___)

```
[
```

run LTStsVarSel with default options and return warning messages.`reduced_est`

,
` reduced_model`

,
` msgstr`

]
=LTStsVarSel(___)

rng('default') rng(1); % data model model=struct; model.trend=1; % linear trend model.trendb=[0 1]; % parameters of the linear trend model.s=12; % monthly time series model.seasonal=1; % 1 harmonic model.seasonalb=[10 10]; % parameter for one harmonic model.signal2noiseratio = 100; % signal to noise ratio n = 100; % sample size % generate data out_sim=simulateTS(n,'plots',1,'model',model); %run LTStsVarSel with all default options [out_model_0, out_reduced_0] = LTStsVarSel(out_sim.y); % optional: add a FS step to the LTSts estimator % outFS = FSRts(out_sim.y,'model',out_model_0); % To be fixed: 'Non existent user option found-> ' 'ARp'

rng('default') rng(1); % sample size n = 100; tmp = rand(n,1); % data model model=struct; model.trend=1; % linear trend model.trendb=[0 1]; % parameters of the linear trend model.s=12; % monthly time series model.seasonal=1; % 1 harmonic model.seasonalb=[10 10]; % parameter for one harmonic model.lshiftb=100; % level shift amplitude model.lshift= 30; % level shift position model.signal2noiseratio = 100; % signal to noise ratio model.X = tmp.*[1:n]'; % a extra covariate model.Xb = 1; % beta coefficient of the covariate out_sim=simulateTS(n,'plots',1,'model',model); % complete model to be tested. overmodel=struct; overmodel.trend=2; % quadratic trend overmodel.s=12; % monthly time series overmodel.seasonal=303; % number of harmonics overmodel.lshift=4:n-4; % positions of level shift which have to be considered overmodel.X=tmp.*[1:n]'; % pval threshold thPval=0.01; [out_model_1, out_reduced_1] = LTStsVarSel(out_sim.y,'model',overmodel,'thPval',thPval,'plots',1);

rng('default') rng(1); % add three autoregressive components to the complete model. n = 100; % sample size tmp = rand(n,1); model=struct; model.trend=1; % linear trend model.trendb=[0 1]; % parameters of the linear trend model.s=12; % monthly time series model.seasonal=1; % 1 harmonic model.seasonalb=[10 10]; model.X = tmp.*[1:n]'; % a extra covariate model.Xb = 1; % beta coefficient of the covariate model.signal2noiseratio = 100; % signal to noise ratio out_sim=simulateTS(n,'plots',1,'model',model); % complete model to be tested. overmodel=struct; overmodel.trend=2; % quadratic trend overmodel.s=12; % monthly time series overmodel.seasonal=303; % number of harmonics overmodel.lshift=0; % no level shift overmodel.X=tmp.*[1:n]'; overmodel.ARp=1:3; % pval threshold thPval=0.01; [out_model_2, out_reduced_2] = LTStsVarSel(out_sim.y,'model',overmodel,'thPval',thPval);

rng('default') rng(1); % data model model=struct; model.trend=1; % linear trend model.trendb=[0 1]; % parameters of the linear trend model.s=12; % monthly time series model.seasonal=1; % 1 harmonic model.seasonalb=[10 10]; % parameter for one harmonic model.lshiftb=100; % level shift amplitude model.lshift= 30; % level shift position model.signal2noiseratio = 100; % signal to noise ratio n = 100; % sample size tmp = rand(n,1); model.X = tmp.*[1:n]'; % a extra covariate model.Xb = 1; % beta coefficient of the covariate % generate data out_sim=simulateTS(n,'plots',1,'model',model); [out_model_3, out_reduced_3, messages] = LTStsVarSel(out_sim.y);

rng('default') rng(10); % data model model=struct; model.trend=1; % linear trend model.trendb=[0 1]; % parameters of the linear trend model.s=12; % monthly time series model.seasonal=1; % 1 harmonic model.seasonalb=[2 3]; % parameter for one harmonic model.lshiftb=50; % level shift amplitude model.lshift= 35; % level shift position model.signal2noiseratio = 30; % signal to noise ratio model.ARp=1; model.ARb=0.9; n = 50; % sample size tmp = rand(n,1); model.X = tmp; % a extra covariate model.Xb = 1; % beta coefficient of the covariate % generate data out_sim=simulateTS(n,'plots',1,'model',model); overmodel=model; overmodel.trend=2; % quadratic trend overmodel.s=12; % monthly time series overmodel.seasonal=303; % number of harmonics overmodel.lshift=-1; overmodel=rmfield(overmodel,"trendb"); overmodel=rmfield(overmodel,"seasonalb"); overmodel=rmfield(overmodel,"signal2noiseratio"); overmodel=rmfield(overmodel,"lshiftb"); overmodel=rmfield(overmodel,"Xb"); overmodel=rmfield(overmodel,"ARb"); nsamp=100; tic; [out_model_3, out_reduced_3] = LTStsVarSel(out_sim.y, ... 'model',overmodel,'nsamp',nsamp); compTime=toc; disp(compTime) disp('Final selected model without option firstTestLS') disp(out_model_3) [out_model_3N, out_reduced_3N] = LTStsVarSel(out_sim.y, ... 'model',overmodel,'firstTestLS',true,'nsamp',nsamp); disp('Final selected model with option firstTestLS') disp(out_model_3N)

`y`

— Time series to analyze.
Vector.A row or a column vector with T elements, which contains the time series.

**
Data Types: **`single| double`

Specify optional comma-separated pairs of `Name,Value`

arguments.
`Name`

is the argument name and `Value`

is the corresponding value. `Name`

must appear
inside single quotes (`' '`

).
You can specify several name and value pair arguments in any order as ```
Name1,Value1,...,NameN,ValueN
```

.

```
'firstTestLS', false
```

,```
'model', model
```

,```
'nsamp',500
```

,```
'thPval',0.05
```

,```
'plots',1
```

,```
'msg',1
```

,```
'dispresults',true
```

`firstTestLS`

—initial test for presence of level shift.boolean.if firstTestLS is true, we immediately find the position of the level shift in a model which does not contain autoregressive terms, the seasonal specification is 101 If the level shift component is significant we pass the level shift component in fixed position to the variable selection procedure. Note also that the units declared as outliers with a p-value smaller than 0.001 are used to form model.ARtentout. model.ARtentout is used in the subsequent steps of the variable selection procedure, every time there is a call to LTSts with an autoregressive component. The default value of firstTestLS is false.

**Example: **```
'firstTestLS', false
```

**Data Types: **`logical`

`model`

—model type.structure.A structure which specifies the (over-parametrized) model which will be used to initialise the variable selection process. The model structure is identical to the one defined for function LTSts: for convenience, we list the fields also here:

Value | Description |
---|---|

`s` |
scalar (length of seasonal period). For monthly data s=12 (default), for quartely data s=4, ... |

`trend` |
scalar (order of the trend component). trend = 0 implies no trend, trend = 1 implies linear trend with intercept, trend = 2 implies quadratic trend, trend = 3 implies cubic trend. Admissible values for trend are, 0, 1, 2 and 3. In the paper RPRH to denote the order of the trend symbol A is used. If this field is not present into input structure model, model.trend=2 is used. |

`seasonal` |
scalar (integer specifying number of frequencies, i.e. harmonics, in the seasonal component. Possible values for seasonal are $1, 2, ..., [s/2]$, where $[s/2]=floor(s/2)$. For example: if seasonal =1 we have: $\beta_1 \cos( 2 \pi t/s) + \beta_2 sin ( 2 \pi t/s)$; if seasonal =2 we have: $\beta_1 \cos( 2 \pi t/s) + \beta_2 \sin ( 2 \pi t/s) + \beta_3 \cos(4 \pi t/s) + \beta_4 \sin (4 \pi t/s)$. Note that when $s$ is even the sine term disappears for $j=s/2$ and so the maximum number of trigonometric parameters is $s-1$. If seasonal is a number greater than 100 then it is possible to specify how the seasonal component grows over time. For example, seasonal = 101 implies a seasonal component which just uses one frequency which grows linearly over time as follows: $(1+\beta_3 t)\times ( \beta_1 cos( 2 \pi t/s) + \beta_2 \sin ( 2 \pi t/s))$. For example, seasonal =201 implies a seasonal component which just uses one frequency which grows in a quadratic way over time as follows: $(1+\beta_3 t + \beta_4 t^2)\times( \beta_1 \cos( 2 \pi t/s) + \beta_2 \sin ( 2 \pi t/s))$. seasonal =0 implies a non seasonal model. In the paper RPRH to denote the number of frequencies of the seasonal component symbol B is used, while symbol G is used to denote the order of the trend of the seasonal component. Therefore, for example, model.seasonal=201 corresponds to B=1 and G=2, while model.seasonal=3 corresponds to B=3 and G=0. If this field is not present into input structure model, model.seasonal=303 is used. |

`X` |
matrix of size T-by-nexpl containing the values of nexpl extra covariates which are likely to affect y. |

`lshift` |
scalar or vector associated to level shift component. lshift=0 (default) implies no level shift component. If model.lshift is vector of positive integers, then it is associated to the positions of level shifts which have to be considered. The most significant one is included in the fitted model. For example if model.lshift =[13 20 36] a tentative level shift is imposed in position $t=13$, $t=20$ and $t=36$. The most significant among these positions in included in the final model. In other words, the following extra parameters are added to the final model: $\beta_{LS1}* I(t \geq \beta_{LS2})$ where $\beta_{LS1}$ is a real number (associated with the magnitude of the level shift) and $\beta_{LS2}$ is an integer which assumes values 13, 20 or 36 and and $I$ denotes the indicator function. As a particular case, if model.lshift =13 then a level shift in position $t=13$ is added to the model. In other words the following additional parameters are added: $\beta_{LS1}* I(t \geq 13)$ where $\beta_{LS1}$ is a real number and $I$ denotes the indicator function. If lshift = -1 tentative level shifts are considered for positions $p+1,p+2, ..., T-p$ and the most significant one is included in the final model ($p$ is the total number of parameters in the fitted model). Note that lshift=-1 is not supported for C-coder translation. In the paper RPRH $\beta_{LS1}$ is denoted with symbol $\delta_1$, while, $\beta_{LS2}$ is denoted with symbol $\delta_2$. |

`ARp` |
vector with non negative integer numbers specifying the autoregressive components. For example: model.ARp=[1 2] means a AR(2) process; model.ARp=2 means just the lag 2 component; model.ARp=[1 2 5 8] means AR(2) + lag 5 + lag 8; model.ARp=0 (default) means no autoregressive component. |

`ARtentout` |
matrix of size r-by-2 containing the list of the units declared as outliers (first column) and corresponding fitted values (second column) or empty scalar. If model.ARtentout is not empty, when the autoregressive component is present, the y values which are used to compute the autoregressive component are replaced by model.tentout(:,2) for the units contained in model.tentout(:,1) Remark: the default overparametrized model is for monthly data with a quadratic trend (3 parameters) + seasonal component with three harmonics which grows in a cubic way over time (9 parameters), no additional explanatory variables, no level shift and no AR component that is model=struct; model.s=12; model.trend=2; model.seasonal=303; model.X=''; model.lshift=0; model.ARp=0; Using the notation of the paper RPRH we have A=2, B=3; G=3; and $\delta_1=0$. |

**Example: **```
'model', model
```

**Data Types: **`struct`

`nsamp`

—number of subsamples to extract.scalar | vector of length 2.Vector of length 1 or 2 which controls the number of subsamples which will be extracted to find the robust estimator. If lshift is not equal to 0 then nsamp(1) controls the number of subsets which have to be extracted to find the solution for t=lshift(1). nsamp(2) controls the number of subsets which have to be extracted to find the solution for t=lshift(2), lshift(3), ..., lshift(end).

Note that nsamp(2) is generally smaller than nsamp(1) because in order to compute the best solution for t=lshift(2), lshift(3), ..., lshift(end), we use the lts.bestr/2 best solutions from previous t (after shifting the position of the level shift in the estimator of beta). If lshift is a vector of positive integers the default value of nsamp is (500 250). If lshift is a vector of positive integers and nsamp is supplied as a scalar the default is to extract [nsamp/2] subsamples for t=lshift(1), lshift(2), ... Therefore, for example, in order to extract 600 subsamples for t=lshift(1) and 300 subsamples for t= lshift(2) ... you can use nsamp =600 or nsamp=[600 300].

The default value of nsamp is 1000;

Remark: if nsamp=0 all subsets will be extracted.

They will be (n choose p).

**Example: **```
'nsamp',500
```

**Data Types: **`double`

`thPval`

—threshold for pvalues.scalar.A value between 0 and 1.

An estimated parameter/variable is eliminated if the associated pvalue is below thPval. Default is thPval=0.01.

**Example: **```
'thPval',0.05
```

**Data Types: **`double`

`plots`

—Plots on the screen.scalar.If plots = 1, the typical LTSts plots will be shown on the screen. The default value of plot is 0 i.e. no plot is shown on the screen.

**Example: **```
'plots',1
```

**Data Types: **`double`

`msg`

—Messages on the screen.scalar.Scalar which controls whether LTSts will display or not messages on the screen. Deafault is msg=0, that is no messages are displayed on the screen. If msg==1 messages displayed on the screen are about estimated time to compute the estimator and the warnings about 'MATLAB:rankDeficientMatrix', 'MATLAB:singularMatrix' and 'MATLAB:nearlySingularMatrix'.

**Example: **```
'msg',1
```

**Data Types: **`double`

`dispresults`

—Display results of final fit.boolean.If dispresults is true, labels of coefficients, estimated coefficients, standard errors, tstat and p-values are shown on the screen in a fully formatted way. The default value of dispresults is false.

**Example: **```
'dispresults',true
```

**Data Types: **`logical`

`reduced_est`

—A reduced model structure obtained by eliminating
parameters that are non-significant.
It is a structure
containing the typical input model fields for function
LTSts (refer to LTSts for details):
model.s = the optimal length of seasonal periodmodel.trend = the optimal order of the trend.

model.seasonal = the optimal number of frequencies in the seasonal component.

model.lshift = the optimal level shift position.

model.X = a matrix containing the values of the extra covariates which are likely to affect y. If the imput model specifies autoregressive components in model.ARp, then the selected ones will be also included in model.X.

`reduced_model`

—Output fields of the optimal model.
StructureThe fields are those of function LTSts (refer to LTSts for details): out.B = matrix of estimated beta coefficients.

out.h = number of observations that have determined the initial LTS estimator.

out.bs = vector of the units with the smallest squared residuals before the reweighting step.

out.Hsubset = matrix of the units forming best H subset for each tentative level shift considered.

out.numscale2 = matrix of the values of the lts.bestr smallest values of the target function.

out.BestIndexes = matrix of indexes associated with the best nbestindexes solutions.

out.Likloc = matrix containing local sum of squares of residuals determining the best position of level shift.

out.RES = matrix containing scaled residuals for all the units of the original time series monitored in steps lshift+1, lshift+2, ....

out.yhat = vector of fitted values after final step.

out.residuals = vector of scaled residuals from after final NLS step.

out.weights = vector of weights after adaptive reweighting.

out.scale = final scale estimate of the residuals using final weights.

out.conflev = confidence level used to declare outliers.

out.outliers = vector of the units declared outliers.

out.singsub = number of subsets wihtout full rank.

out.y = response vector y.

out.X = data matrix X containing trend, seasonal, expl (with autoregressive component) and lshift.

out.class = 'LTSts'.

Optional Output:

Rousseeuw, P.J., Perrotta D., Riani M. and Hubert, M. (2018), Robust Monitoring of Many Time Series with Application to Fraud Detection, "Econometrics and Statistics". [RPRH]