The R package dpseg performs piecewise linear segmentation of 2-dimensional data by a dynamic programming algorithm (R. Bellman 1961). It was developed for time series data, for the dissection of bacterial growth phases, and for genome-wide read-count data from next generation sequencing. print, plot and predict methods allow quick evaluation of the results. Parameter scanning routines (estimateP, scanP) help to find the best choice of parameters for a given problem.
The package and its documentation are also intended to serve as a tutorial on the concepts of linear regression, dynamic programming and the segmentation problem. The movie function animates the progress of the algorithm through the data. Generic implementations of the dynamic programming routines allow testing of alternative segmentation criteria.
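A minimal quick-start sketch (illustrative only; it uses the example data oddata that ships with the package and is analyzed in detail below):
## minimal quick-start: segment the example growth data with default parameters
library(dpseg)
fit <- dpseg(x=oddata$Time, y=log(oddata[,"A2"]), verb=0) # default scoring (type="var"), P=0
print(fit) # segment table
plot(fit)  # data with break-points and segment regression lines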
The problem is to find break-points in 2-dimensional ordered data {(xi, yi), i = 1, ..., n}, e.g., a time series. This can be formulated as a dynamic programming recursion:
$$ \begin{equation} S_j = \max_{i\le j} (S_{i-\jump} + \text{score}(i,j)) - P\;, (\#eq:recur) \end{equation} $$
where the scoring function score(i, j) quantifies how well a segment between points i and j is defined.
The break-point penalty term P sets bounds on segment lengths and should be chosen close to the optimal value of the scoring function (Section @ref(selectp)). A higher P will yield longer segments; especially for non-monotonic data, a P lower than the optimal value of the scoring function can work better.
The binary jump parameter $\jump \in \{0,1\}$ determines whether the break-point i is both the end of the previous and the start of the current segment ($\jump=0$), or whether the previous segment ends at i − 1 ($\jump=1$). The latter choice allows discontinuous jumps between adjacent segments (Fig. @ref(fig:jumps)).
Segmentation into linear segments can be achieved by using a goodness-of-fit measure. Linear models are often evaluated by the coefficient of determination R2 (“r-squared”), and we can use this directly as a scoring function (type="r2" in dpseg):
$$ \begin{equation} \text{score}(i,j) = R^2 - 1 = \frac{\sum (\hat y_i - \bar y)^2}{\sum (y_i - \bar y)^2}-1\;, (\#eq:scorer2) \end{equation} $$
where ŷi is the linear model and ȳ the mean of the original data, see Section @ref(appii) and equation @ref(eq:r2) for details. Its square root, the Pearson correlation, can also be used as scoring function (type="cor"). Subtraction of 1 simply aligns the different scoring functions in dpseg at score ≤ 0 and thereby allows the use of a consistent default penalty parameter of P = 0. R2 depends on the slope (eq. @ref(eq:r2)) and will score poorly for segments without slope (Fig. @ref(fig:lines) and @ref(fig:custom), Section @ref(lines)).
The negative variance of the residuals ri = yi − ŷi between data and the regression line does not depend on the slope (eq. @ref(eq:var)); it is in many cases the better choice and the default in dpseg (type="var"):
$$ \begin{equation} \text{score}(i,j) = -\Var(r) = - \frac{1}{n-1} \sum r_i^2\;, (\#eq:score) \end{equation} $$
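For intuition, the built-in scores of a single candidate segment from i to j can be reproduced with base R. This is only an illustrative sketch: dpseg itself computes these measures incrementally (Section @ref(appii)), and the helper names and the exact form of the "cor" score below are our assumptions.
## illustrative only: reproduce the built-in segment scores with base R
score_r2  <- function(i, j, x, y) summary(lm(y[i:j]~x[i:j]))$r.squared - 1 # cf. type="r2"
score_cor <- function(i, j, x, y) abs(cor(x[i:j], y[i:j])) - 1             # cf. type="cor" (assumed form)
score_var <- function(i, j, x, y) -var(residuals(lm(y[i:j]~x[i:j])))       # cf. type="var" (default)
set.seed(1)
xs <- 1:20; ys <- 0.5*xs + rnorm(20, sd=0.2) # a single noisy linear segment
c(r2=score_r2(1,20,xs,ys), cor=score_cor(1,20,xs,ys), var=score_var(1,20,xs,ys)) # all close to 0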
See Section @ref(scoring) for a short discussion of the choice between the built-in scoring functions. Section @ref(lines) discusses the handling of straight horizontal or vertical lines (with zero variance) in the data, and Sections @ref(matrix) and @ref(generic) introduce custom scoring options.
Incremental calculation of the variances (Section @ref(appii)) allows for a computationally efficient implementation (Section @ref(benchmark)) of the recursion in dpseg.
During the calculation of Sj the position imax < j that yielded the maximal score is stored in a vector. The segmentation can then easily be retrieved by starting at the last position j = n, the end of the last segment, looking up its imax, the start of that segment, and proceeding from $j=i_\text{max} - \jump$ (the end of the preceding segment). If jumps are allowed ($\jump=1$), the preceding segment ends one position before the start of the current segment, at i − 1, and this must be accounted for in the backtrace.
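As a toy illustration of this backtrace (the full implementation is shown in Section @ref(appi)), assume a vector imax that stores, for each end point j, the segment start that maximized Sj:
## toy backtrace for two segments 1..5 and 5..10, with jumps=0
imax <- c(1, 1, 1, 1, 1, 5, 5, 5, 5, 5) # imax[j]: best segment start for end point j
ends <- length(imax)                    # start at the end of the data
while (ends[1] > 1) ends <- c(imax[ends[1]], ends)
ends                                    # break-points: 1 5 10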
Depending on the scoring function, short segments may not be meaningful; e.g., goodness-of-fit measures for linear regression such as the variance of the residuals are 0 for segments of length ℓ = 2, and an “optimal” segmentation would split the data into pairs with perfect “fits”. Thus, we can restrict the maximization of the scoring function to $i\le(j-\lmin+1)$.
When handling data sets much larger than the expected segment length, one can also define a maximal segment length, and thereby save memory and computation time by only considering $i\ge(j-\lmax +1)$.
Combining both restrictions, and with $\Ell_x=\ell_x-1$, the recursion becomes:
$$ \begin{equation} S_j = \max_{\substack{i\le j-\Ell_{\min}\\i\ge j-\Ell_{\max}}} (S_{i-\jump} + \text{score}(i,j)) - P\;. (\#eq:recurl) \end{equation} $$
The main function dpseg takes x and y vectors of equal length as input and returns an object of class dpseg. This result object can be inspected by print and plot methods. A predict method returns y values at specified novel x values on the straight line segments.
The result object is a simple list (S3 object) with the list item segments as the main result: a table (data frame) that lists the start and end x-values of the segments, the start and end indices in the data vectors, the linear regression coefficients and goodness-of-fit measures for the segments (intercept, slope, r-squared, variance of residuals).
library(dpseg)
type <- "var" # use the (default) scoring, -var(residuals(lm(y~x)))
jumps <- FALSE # allow discrete jumps between segments?
P <- 1e-4 # break-point penalty, use higher P for longer segments
# get example data `oddata` - bacterial growth measured as optical density OD
x <- oddata$Time
y <- log(oddata[,"A2"]) # note: exponential growth -> log(y) is linear
# NOTE: the scoring function results are stored as a matrix for re-use below
segs <- dpseg(x=x, y=y, jumps=jumps, P=P, type=type, store.matrix=TRUE)
## using 'recursion_linreg_c' with type 'var'
## calculating recursion for 229 datapoints
##
## Dynamic programming-based segmentation of 229 xy data points:
##
## x1 x2 start end intercept slope r2 var
## 1 1.022778 1.362778 1 3 0.7194953 -4.383977 0.9909440 0.005076021
## 2 1.362778 1.703194 3 5 -9.3479269 3.016111 0.9626144 0.010235471
## 3 1.703194 2.213611 5 8 -6.2212774 1.153347 0.9766701 0.001532996
## ...
## x1 x2 start end intercept slope r2 var
## 7 15.68903 17.91181 87 100 -3.4426960 0.18899698 0.9940106 1.101408e-04
## 8 17.91181 21.32958 100 120 -0.3825874 0.01895679 0.8638382 6.374447e-05
## 9 21.32958 39.94000 120 229 0.4767462 -0.01989881 0.9919736 9.501867e-05
##
## Found: 9 segments.
## Parameters:
## type: var; minl: 3; maxl: 229; P: 1e-04; jumps: 0
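The result object can be inspected further. The following is an illustrative sketch; in particular, we assume that predict called without new x values returns predictions at the original x values.
## inspect the result object (illustrative)
class(segs)           # "dpseg"
head(segs$segments)   # main result table: break-points, regression coefficients, fit measures
plot(segs)            # data with segment break-points and regression lines
yfit <- predict(segs) # assumed: predictions at the original x values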
For educational purposes, as well as for a detailed evaluation of novel scoring functions (Section @ref(generic)) or parameters, the movie function visualizes the performance of the dynamic programming algorithm as it progresses through the data:
# use arguments frames and res to control file size
movie(segs, format="gif", file.name="movie", path="pkg/vignettes/figures",
frames=seq(1,length(x),3), delay=.3,res=50)
This will plot each step of the algorithm and visualize the development of the scoring function. For this function to work, the argument store.matrix=TRUE must be set in the call to dpseg.
The minimal and maximal segment length parameters $\lmin$ and $\lmax$ (arguments minl and maxl of dpseg) are the easiest way to restrict the algorithm (and memory usage) to certain segment lengths. However, this overrules potentially better segmentations with lower variance.
A more meaningful way to restrict segment lengths for given data is to explicitly allow a higher variance of segments. This can be achieved with the break-point penalty parameter P (argument P), which can be used directly to tune the segmentation. It should be chosen in the order of magnitude of the tolerated variances. A higher P allows a higher variance of the individual segments and will yield longer segments. P < 0 rewards break-points and yields more and shorter segments.
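As a quick illustration of the effect of P on the example data (the two values below are arbitrary choices):
## effect of the break-point penalty P (arbitrary example values)
par(mfcol=c(1,2))
plot(dpseg(x=x, y=y, P=1e-4, verb=0)) # lower P: more and shorter segments
plot(dpseg(x=x, y=y, P=1e-2, verb=0)) # higher P: fewer and longer segments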
To choose an optimal P for a given application, the package offers two functions: estimateP and scanP. estimateP makes use of the excellent performance of base R’s smooth.spline implementation and reports the variance of the residuals; the argument plot=TRUE allows evaluation of the results:
## using 'recursion_linreg_c' with type 'var'
## calculating recursion for 229 datapoints
The reported value is a good starting point. For our example data, a value an order of magnitude lower than this estimate, P=estimateP(x=x,y=y)/10, achieved a satisfying segmentation, but this will strongly depend on the type of data and the noise around the expected segments.
Alternative estimators can be easily defined; e.g., estimateP itself simply calls smooth.spline.
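As a rough sketch only, a comparable custom estimator could look as follows; the helper name estimate_spline_var is ours and the actual estimateP implementation may differ in details:
## rough sketch of a spline-based penalty estimator (name and details are ours)
estimate_spline_var <- function(x, y, ...) {
    fit <- stats::smooth.spline(x, y, ...) # base R smoothing spline
    var(fit$yin - fit$y)                   # variance of residuals at the unique x values
}
estimate_spline_var(x=x, y=y)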
The convenience function scanP calculates segmentations for a vector of P values and returns (plots) the resulting numbers of segments and the median of the segment variances of residuals. Based on these values one can select a value of P that splits the data into a reasonable number of segments with acceptable variance of residuals:
scanP is more efficient than scanning over dpseg calls since it calculates the scoring function matrix only once and uses this matrix for scans over P (Section @ref(matrix)).
## NOTE: dpseg is slower for many segments!
sp <- scanP(x=x, y=y, P=seq(-.01,.1,length.out=50), plot=TRUE)
## running dpseg 50 times: ..................................................
The performance of the different built-in scoring functions depends on the type of data and the goal of the user.
The default scoring function, the negative variance of the residuals (type="var", $-\Var(r)$), is independent of the slope of the segments (eq. @ref(eq:var)) and is recommended in most cases.
Only when explicitly looking for a segmentation of zig-zag-like data with only increases and decreases may the coefficient of determination (type="r2", R2) or its square root, Pearson’s correlation (type="cor"), be the better choice. In dpseg, these are not used in the sense of a goodness-of-fit measure but in their mathematical meaning as a measure of (linear) correlation: how well can x explain variation in y? Lines with low slope score worse than steeper lines (see slope b1 in eq. @ref(eq:r2)). This behaviour is exemplified in Sections @ref(example), @ref(lines), and @ref(generic). Figure @ref(fig:rcppdynprogdata) provides an example where the solution is stable over a broader range of the penalty P when scoring by the Pearson correlation.
Using the non-monotonic, sine-based example data from the RcppDynProg R package (Mount and Zumel 2019), we can first use scanP to test different penalties for the different scoring functions. Here, dpseg yields the intended segmentation with the default scoring function (type="var") and a negative penalty, while the Pearson correlation (type="cor") worked well with the default penalty P = 0 and appears to have a broader range of penalties that achieve the intended split into 7 segments:
## example from https://winvector.github.io/RcppDynProg/articles/SegmentationL.html
set.seed(2018)
d <- data.frame(x = 0.05*(1:(300))) # ordered in x
d$y_real <- sin((0.3*d$x)^2)
d$y <- d$y_real + 0.25*rnorm(length(d$y_real))
par(mfrow=c(2,2))
scanP(x=d$x, y=d$y, P=seq(-0.005,0.005,length.out=50), verb=0) # note: no messages by verb=0
scanP(x=d$x, y=d$y, P=seq(-0.05,0.05,length.out=50), type="cor", verb=0)
plot(dpseg(x=d$x, y=d$y, P=-.0025, verb=0))
plot(dpseg(x=d$x, y=d$y, type="cor", verb=0))
Experimental data may contain short horizontal or vertical data stretches, e.g., as artefacts of low measurement sensitivity. For such segments the calculation of the scoring function involves division by 0 and yields undefined values.
For a horizontal straight line with $\Var(y)=0$ (top panel in Fig. @ref(fig:lines)), i.e., a short segment of x values with equal y values, the R2 and Pearson correlation measures are undefined (eq. @ref(eq:r2)). dpseg will set them to 0, their minimal value, consistent with their general behaviour to favor sloped and ignore horizontal segments. The variance of residuals can be calculated (eq. @ref(eq:var)) as $\Var(r)=0$, and the segments can be detected with the default scoring function (top left panel in Fig. @ref(fig:lines)).
For a vertical line with $\Var(x)=0$ (lower panel in Fig. @ref(fig:lines)), i.e., different y values at equal x, all measures are undefined (eq. @ref(eq:r2) and @ref(eq:var)). dpseg will set the scoring function to −∞, effectively ignoring them in the segmentation. The data points will be part of adjacent or spanning segments. If such cases are present in the data between intended segments, it may be advisable to use jumps=TRUE to allow for split segments and discrete jumps between adjacent segments (see Section @ref(jumps)).
In both cases, dpseg will issue a summarized warning if the problem was relevant for the scoring function used, providing the number of occurrences of such cases during the recursion:
y <- c(1:5,rep(5,10),6:10) # several equal y values
x <- 1:length(y)
par(mfrow=c(2,2))
plot(dpseg(x=x, y=y, type="var", verb=0)) # Var(y)=0, horizontal line
plot(dpseg(x=x, y=y, type="r2", verb=0))
## Warning in recursion(x = x, y = y, maxl = maxl, jumps = jumps, P = P, minl =
## minl, : Syy==0: equal y-values or numeric cancellation in segment, score was
## set to -1 (r2=cor=0). This occurred 45 times.
## Warning in recursion(x = x, y = y, maxl = maxl, jumps = jumps, P = P, minl =
## minl, : Sxx==0: equal x-values or numeric cancellation in segment, score was
## set to -infinity. This occurred 45 times.
## Warning in recursion(x = x, y = y, maxl = maxl, jumps = jumps, P = P, minl =
## minl, : Sxx==0: equal x-values or numeric cancellation in segment, score was
## set to -infinity. This occurred 45 times.
By default, dpseg assumes that adjacent segments are linked, that is: the last data point of a segment is also the first data point of the next adjacent segment. Setting the argument jumps=TRUE yields segments that do not share their start and end points (see $\jump$ in eq. @ref(eq:recur)). This can be useful if the data contain discrete jumps between adjacent segments (Fig. @ref(fig:jumps)).
x <- c(1:10)
y <- c(1:4,7:12) # discrete jump between two lines
par(mfcol=c(1,2))
plot(dpseg(x=x, y=y, type="var", jumps=FALSE, verb=0))
plot(dpseg(x=x, y=y, type="var", jumps=TRUE, verb=0))
When the scoring function can be pre-calculated, a simple look-up in the triangular scoring matrix (banded by $\lmin$ and $\lmax$) suffices for the recursion. This option is available in dpseg by passing a scoring matrix Sij as argument y (instead of a numeric vector).
For example, we can use the scoring matrix generated above (Fig. @ref(fig:dpsegdemo), stored due to the argument store.matrix=TRUE) and test different parameters P, $\jump$, $\lmin'>\lmin$ and $\lmax'<\lmax$ more efficiently (Section @ref(benchmark)). This approach is also used in the scanP function (Section @ref(selectp)).
## use the scoring matrix from a previous run for generic recursion,
## with store.matrix=TRUE, and test different parameters.
segm <- dpseg(y=segs$SCR, jumps=jumps, P=2*P, minl=5, maxl=50)
## setup use with pre-calculated scoring matrix
## calculating recursion for 229 datapoints
##
## Dynamic programming-based segmentation of 229 xy data points:
##
## x1 x2 start end
## 1 1 37 1 37
## 2 37 46 37 46
## 3 46 87 46 87
## ...
## x1 x2 start end
## 5 100 131 100 131
## 6 131 180 131 180
## 7 180 229 180 229
##
## Found: 7 segments.
## Parameters:
## type: matrix; minl: 5; maxl: 50; P: 2e-04; jumps: 0
Note that in this case the algorithm does not use or know of the original x, y data, and the result therefore contains only the segment break-point indices in the original data vectors. The predict and plot functions will not work. We can use the convenience function addLm to add the original x, y data and the linear regression coefficients (via lm) for the calculated segments:
## add lm-based regression coefficients and original x/y data
segm <- addLm(segm, x=oddata$Time, y=log(oddata[,"A2"]))
plot(segm)
Alternative scoring functions can be tested easily by either providing a scoring matrix Sij as described above, or by providing the score(i,j) function directly. The latter can be achieved by defining a scoring function with the signature shown in the following example, which uses the coefficient of determination R2 (r-squared) for segmentation:
Since our scoring function does not provide linear regression parameters, we can use the argument add.lm=TRUE to add intercept and slope data via base R’s lm, as required for the predict and plot methods.
score_rsq <- function(i, j, x, y,...) summary(lm(y[i:j]~x[i:j]))$r.squared
segn <- dpseg(x=x, y=y, jumps=jumps, P=.99, scoref=score_rsq, add.lm=TRUE)
plot(segn)
The implementation of equation @ref(eq:recurl) is straightforward. Section @ref(appi) shows a fully functional implementation in about 30 lines of R code. However, this is highly inefficient. The theoretical complexity is quadratic, i.e., 𝒪(n2), and even with segment length restrictions a linear regression with base R’s lm function is performed $(n-1)*(\lmax-\lmin+1)$ times.
A more efficient implementation calculates the linear regression parameters incrementally (eq. @ref(eq:score) and Section @ref(appii)) while looping through the data. Benchmarking of different implementations, using a growth curve of a culture of Escherichia coli cells, showed that this incremental implementation was about 2 orders of magnitude faster than the generic implementation, and another 2 orders of magnitude faster when implemented in C++ via the Rcpp package (Fig. @ref(fig:benchmark)). The latter approach is thus the default recursion used in dpseg.
With a pre-calculated scoring matrix as input to dpseg, i.e., without any regression calculations during the recursion, the recursion was ca. 25% faster. This allows scanning over the parameters P, $\jump$, $\lmin$ and $\lmax$ more efficiently.
The function piecewise_linear of the related package RcppDynProg (Mount and Zumel 2019) (version 0.1.3, without weights) was only slightly faster than the incremental calculation in base R and ca. 100x slower than the default implementation of dpseg.
The implementation of the recursion @ref(eq:recurl), the scoring function @ref(eq:score) and the backtrace is straightforward:
## RECURSION
recursion <- function(x, y, maxl, jumps=0, P=0, minl=3, S0=1) {
    N = length(x)
    S = numeric(N) # init to 0
    imax = numeric(N)
    S[1] = -P
    for ( j in 2:N ) {
        si = rep(-Inf, maxl-minl+1)
        irng = (j-maxl):(j-minl) +1
        irng = irng[irng>0]
        for ( i in irng ) {
            idx = i-(j-maxl)
            sij = ifelse(i==1&jumps==1, S0, S[i-jumps])
            si[idx] = sij + score(x, y, i, j) - P
        }
        S[j] = max(si)
        idx = which.max(si)
        imax[j] = idx + (j-maxl)
    }
    imax
}
## SCORING FUNCTION
score <- function(x, y, k, l) -var(residuals(lm(y[k:l]~x[k:l])))
## BACKTRACING
backtrace <- function(imax, jumps=0) {
    end = length(imax) # end of last segment
    ends = end
    while ( end>1 ) {
        end = imax[end] - jumps
        ends = c(end, ends)
    }
    ends[1] <- ends[1] + jumps # note: start of first segment
    ends
}
We construct a simple test case, with three linear segments without jumps, and added noise:
# simple test case
k1=1
k2=.05
k3=-.5
y1 <- k1*1:5
y2 <- k2*1:5 + k1*5
y3 <- k3*1:5 + k2*5 + k1*5
set.seed(1)
ym <- c(y1, y2, y3)
nsd <- .25 # noise, standard deviation
y <- ym + rnorm(length(ym), 0, nsd) # add noise
x <- 1:length(y)
## run recursion
JUMPS = 0
imax = recursion(x, y, maxl=length(x), jumps=JUMPS, P=0, minl=3, S0=1)
## backtrace
ends = backtrace(imax, jumps=JUMPS)
print(ends)
## [1] 1 5 10 15
## plot
plot(x,y)
lines(x,ym)
legend("bottom", title="slopes:", legend=paste(k1,k2,k3,sep=", "), bty="n")
abline(v=ends)
Summarized values such as deviations from the mean can not be calculated incrementally, i.e., by simply adding one term per data point while looping through the data i = 1, ..., n, since the mean is only known after a full pass over the data. We are looking for a method to calculate the linear regression parameters incrementally, for an efficient implementation of the dynamic programming recursions (eq. @ref(eq:recur) and @ref(eq:recurl)).
Consider n measured data pairs {(xi, yi), i = 1, ..., n}, with the dependent variable yi (e.g., a measured value) and the independent variable xi (e.g., the time of measurement), for which we suspect a linear relation with intercept β0 and slope β1, and with added random measurement errors, in regression analysis denoted the residuals ri:
$$ y_i = \beta_0 + \beta_1 x_i + r_i\;. $$
The goal is to find a straight line, denoted the regression line, that best describes this linear relation:
$$ \hat y_i = b_0 + b_1 x_i\;. $$
The gold standard approach to find the regression line is to minimize the sum of squares of the residuals: $\min\limits_{\beta_0,\beta_1} \sum r_i^2$. We treat the data as constants and look for parameters b0 and b1, estimators of the real parameters β0 and β1, that minimize the deviation of the data from this line.
$$ \begin{equation} f(\beta_0, \beta_1) = \sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2 (\#eq:beta) \end{equation} $$
Note that for clarity we shorten the sum symbol from now on as: $\sum = \sum_{i=1}^n$. It is important not to mix such estimators with the real parameters, since we do not know how accurate our estimation will be. The real parameters β0 and β1 reflect the actual physical process that underlies the data we measured.
To minimize the residuals, we take the partial derivatives and set them to 0.
$$ \begin{align} \frac{\partial f}{\partial \beta_0} \vert_{b_0} &= -2\sum (y_i-b_0-\beta_1 x_i) &=0\\ \Leftrightarrow\; & \sum y_i - n b_0 - \beta_1 \sum x_i &=0\\ \Leftrightarrow\; & b_0 = \frac{1}{n}\sum y_i - \beta_1 \frac{1}{n} \sum x_i &\\\hline \frac{\partial f}{\partial \beta_1} \vert_{b_1} &= -2\sum (y_i-\beta_0-b_1 x_i)\, x_i &=0\\ \Leftrightarrow\; & \sum x_i y_i - \beta_0 \sum x_i - b_1 \sum x_i^2 &=0 \end{align} $$
Combining both minimization criteria, with b0 = β0 and b1 = β1:
$$ \begin{align} \sum x_i y_i - \left(\frac{1}{n}\sum y_i - b_1 \frac{1}{n} \sum x_i\right) \sum x_i - b_1 \sum x_i^2 =0\\ \sum x_i y_i - \frac{1}{n}\sum y_i \sum x_i + b_1 \frac{1}{n} \left(\sum x_i\right)^2 - b_1 \sum x_i^2 =0\\ \sum x_i y_i - \frac{1}{n}\sum y_i \sum x_i = b_1 \left(\sum x_i^2 - \frac{1}{n} \left(\sum x_i\right)^2\right)\\ b_1 = \frac{\sum x_i y_i - \frac{1}{n}\sum y_i \sum x_i}{\sum x_i^2 - \frac{1}{n} \left(\sum x_i\right)^2} \end{align} $$
Introducing the arithmetic mean, $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$, and using the centered form of the variance, we obtain the well known equation for linear regression:
$$ \begin{align} b_0 &= \bar y - b_1 \bar x\\ b_1 &= \frac{\sum(x_i-\bar x )(y_i-\bar y)}{\sum(x_i-\bar x)^2}\;. (\#eq:emin) \end{align} $$
This can be understood intuitively, especially the centered form. It contains the sum of the squared differences of the xi data points from their arithmetic mean x̄, as well as a similar expression with the products of x and y data. These “sums of squares (products)” (SQ) are a measure of the spread of the data around their mean, known as variance (covariance).
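Equation @ref(eq:emin) can be checked numerically against base R’s lm (an illustrative sketch with simulated data):
## numerical check of eq. (emin) against lm() with simulated data
set.seed(1)
xr <- runif(50); yr <- 2 + 3*xr + rnorm(50, sd=0.1)
b1 <- sum((xr-mean(xr))*(yr-mean(yr))) / sum((xr-mean(xr))^2)
b0 <- mean(yr) - b1*mean(xr)
rbind(manual=c(b0, b1), lm=coef(lm(yr~xr))) # both rows should agree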
Required Differentiation Rules:
$$ \frac{\mathrm{d}}{\mathrm{d}u}\, u^2 = 2u\;, \qquad \frac{\mathrm{d}}{\mathrm{d}\beta}\, g(h(\beta)) = g'(h(\beta))\, h'(\beta) \quad\text{(chain rule)}\;. $$
Required Summation Rules:
$$ \sum (a_i + b_i) = \sum a_i + \sum b_i\;, \qquad \sum c\, a_i = c \sum a_i\;, \qquad \sum_{i=1}^n c = n\, c\;. $$
The last step to obtain equation @ref(eq:emin) is known in German as the Verschiebungssatz von Steiner (Steiner translation theorem) and in French as the théorème de König-Huygens. In English this transformation has no special name; the two forms are just known as the centered (left) and expanded (right) forms of the variance:
$$ \sum (x_i - \bar x)^2 = \sum x_i^2 - \frac{1}{n} \left(\sum x_i\right)^2\;. $$
We can apply this to all sums of squares required to calculate various statistical measures of the data:
$$ \begin{align} S_{xx} \def & \sum(x_i - \bar x)^2 &=& \sum x_i^2 - \frac{1}{n} (\sum x_i)^2\\ S_{yy} \def & \sum(y_i - \bar y)^2 &=& \sum y_i^2 - \frac{1}{n} (\sum y_i)^2\\ S_{xy} \def & \sum(y_i - \bar y)(x_i - \bar x) &=& \sum y_i x_i - \frac{1}{n} \sum y_i\sum x_i (\#eq:sos) \end{align} $$
Their expanded forms consist of additions of simple sums of the data, and thus allow calculating the sums of squares incrementally by accumulating the sums ∑xi2, ∑xi, ∑yi2, ∑yi, and ∑xiyi.
An important caveat is that the expanded forms are prone to numerical cancellation when the data values are large relative to their spread (large x̄2): the computed sums of squares can then erroneously become 0, a problem known as “catastrophic cancellation”.
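A minimal sketch of such an incremental calculation for a single growing segment (illustrative only):
## accumulate simple sums while looping through the data, then use the
## expanded forms (eq. sos) to obtain slope and variance of residuals
set.seed(1)
xd <- 1:100; yd <- 0.3*xd + rnorm(100)
sx <- sy <- sxx <- syy <- sxy <- 0
for (n in seq_along(xd)) { # add one data point at a time
    sx  <- sx  + xd[n];   sy  <- sy  + yd[n]
    sxx <- sxx + xd[n]^2; syy <- syy + yd[n]^2
    sxy <- sxy + xd[n]*yd[n]
}
Sxx <- sxx - sx^2/n; Syy <- syy - sy^2/n; Sxy <- sxy - sx*sy/n
c(slope=Sxy/Sxx, var_r=(Syy - Sxy^2/Sxx)/(n-1)) # compare: coef(lm(yd~xd))[2], var(residuals(lm(yd~xd)))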
For completeness and context, we provide some standard definitions in statistics literature.
Normalizing the sum of squares by the number of data points provides the variance, $\Var(x)$:
$$ \begin{align} &\; \tilde s_x^2 =& \frac{1}{n} S_{xx}\\ \Var(x) \def &\; s_x^2 =& \frac{1}{n-1}S_{xx} \end{align} $$
The intuitive normalization by the number of data points n is often denoted the “empirical variance”, in German mittleres Abweichungsquadrat. Statistics packages usually use the second version, normalized by n − 1, the “degrees of freedom”, and denoted the “theoretical” or “inductive” variance, in German Stichprobenvarianz. This corrected version accounts for the fact that the deviation of the last value xn from the mean is already determined by the first n − 1 values and the mean; it does not contribute further to the spread of the data. Correcting s̃2 by multiplication with $\frac{n}{n-1}$ is in German textbooks sometimes called the Bessel-Korrektur (Bessel’s correction).
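Base R’s var uses the n − 1 normalization, which can be verified directly:
## base R's var() uses the n-1 ("theoretical") normalization
xv <- c(2, 4, 4, 4, 5, 5, 7, 9)
c(var(xv), sum((xv-mean(xv))^2)/(length(xv)-1)) # identical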
A similar concept exists for the combination of the x and y data, the co-variance, $\Cov(x,y)$:
$$ \Cov(x,y) \def s_{xy}^2 = \frac{1}{n-1} S_{xy}\;. $$
Going back to our residual error minimization in equation @ref(eq:emin), we see that the slope of our regression line is given by the ratio of the data co-variance over the variance of the x data, and the normalization terms are canceled out:
$$ b_1 = \frac{\Cov(x,y)}{\Var(x)}\;. $$
The symbols s2 above refer to sums of squares or products and accordingly have squared or mixed units. To directly compare the spread of the data with the data itself, e.g., with the mean of the data, we simply take the (positive) square root of the variance to obtain the well known standard deviation:
$$ s = \sqrt{s^2} \;. $$
The standard deviation is a very useful measure in conjunction with the mean if we are interested in how widely the population data are spread around the mean, consider e.g. body height distributions; i.e., if the quantity we measure actually varies.
If we are instead interested in the “precision of the mean” of our measurements, i.e., if we are sure that the quantity we measure does not vary, e.g., the mass of an atom, and most variation of the measured values stems from errors of our measurement device, we further normalize the standard deviation to obtain the standard error:
$$ e = \frac{s}{\sqrt n}\;. $$
In short, if we expect actual variation of the measured phenomenon, the standard deviation is the measure of choice, while the standard error is an estimate of the precision of a measurement device, e.g., chemical or optical probes.
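A small numerical illustration of the two measures (simulated body heights; all values are arbitrary):
## standard deviation vs. standard error of the mean (simulated example)
set.seed(1)
h <- rnorm(100, mean=170, sd=7) # e.g. body heights in cm
c(mean=mean(h), sd=sd(h), se=sd(h)/sqrt(length(h)))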
Let’s keep these definitions in mind but return to linear regression. The aim was to obtain expressions that we can calculate incrementally. We have found a regression line that minimizes the sum of squares of the residuals, the distance of actual measured values y from the regression line ŷi.
To judge how well the regression line describes our data we can define a new term that quantifies the fraction of the spread of the data that can be explained by the regression line. Again “sum of squares” measures are used. The “Sum of sQuares Explained” (SQE) describes the spread of our predicted values ŷi around the mean of the original data ȳ:
$$ SQE \def \sum (\hat y_i - \bar y)^2\;, $$
and the “Sum of sQuares Total” (SQT) describes the total spread of the data around its mean:
$$ SQT \def \sum (y_i - \bar y)^2\;. $$
Their ratio is the fraction of the total data spread that we can explain by our regression line, often denoted r-squared:
$$ R^2 \def \frac{SQE}{SQT}\;, $$
which reaches R2 → 1 for a perfect fit. We have already introduced SQT as Syy in equation @ref(eq:sos). To calculate SQE we replace the defining terms by our regression results:
$$ \begin{align} SQE = \sum (\hat y_i - \bar y)^2 &= \sum(b_0 + b_1 x_i - (b_0 + b_1 \bar x))^2\\ &= \sum(b_1(x_i - \bar x))^2\\ &= b_1^2 \sum (x_i - \bar x)^2\\ &= \left(\frac{S_{xy}}{S_{xx}}\right)^2S_{xx} = \frac{S_{xy}^2}{S_{xx}} \end{align} $$
and get:
$$ \begin{equation} R^2 = \frac{SQE}{SQT} = \frac{S_{xy}^2}{S_{xx}S_{yy}} = b_1 \frac{S_{xy}}{S_{yy}} = \frac{\Cov(x,y)}{\Var(x)}\cdot\frac{\Cov(x,y)}{\Var(y)}\;, (\#eq:r2) \end{equation} $$
for which we can obtain all values incrementally via the expanded forms of the sums of squares Sxy, Sxx and Syy in equation @ref(eq:sos).
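Equation @ref(eq:r2) can be checked numerically against base R (an illustrative sketch with simulated data):
## check eq. (r2): R^2 from the sums of squares vs. lm()
set.seed(1)
xq <- runif(30); yq <- 1 - 2*xq + rnorm(30, sd=0.3)
Sxx <- sum(xq^2) - sum(xq)^2/length(xq)
Syy <- sum(yq^2) - sum(yq)^2/length(yq)
Sxy <- sum(xq*yq) - sum(xq)*sum(yq)/length(xq)
c(R2=Sxy^2/(Sxx*Syy), lm=summary(lm(yq~xq))$r.squared) # both agree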
Notably, r-squared is also the squared version of the Pearson correlation r:
$$ | r | = \sqrt{R^2} = \frac{| S_{xy} |}{\sqrt{S_{xx}S_{yy}}} \;, $$
and both r-squared (type="r2") and the Pearson correlation (type="cor") can be used as optional scoring functions of dpseg, where 1 is subtracted (eq. @ref(eq:scorer2)) to allow a consistent default penalty parameter P = 0.
The r-squared value (R2) reflects a direct linear dependence of y on x and requires a slope b1 ≠ 0. R2 actually measures the correlation between x and y, or how well changes in y are explained by changes in x. A horizontal line has no correlation: y does not change with x. When segmenting data into linear parts, R2-based measures ignore regions without change in y, at b1 ≈ 0, where the data spread around a mean value ȳ = b0. This can be seen in Figures @ref(fig:lines) and @ref(fig:custom), where a short horizontal segment of the data is ignored when using R2 as the scoring function.
Minimization of the variance of the residuals is independent of the slope and is thus a better optimization criterion for such cases:
$$ \Var(r) = s_r^2 = \frac{1}{n-1} \sum (r_i-\bar r)^2= \frac{1}{n-1} \sum r_i^2\;, $$
where ∑ri = 0, and hence r̄ = 0, follows from the minimization condition $\frac{\partial f}{\partial \beta_0} \vert_{b_0}=0$ of equation @ref(eq:beta).
Minimization of the “Sum of sQuares of the Residuals” (SQR) was the initial optimization criterion (eq. @ref(eq:beta)):
$$ SQR \def \sum r_i^2 = \sum(y_i- \hat y_i)^2\;, $$
and an expression for incremental calculation is obtained by partitioning the total sum of squares (Quadratsummenzerlegung):
$$ \begin{align} \sum(y_i- \bar y)^2 &= \sum(y_i- \bar y + \hat y_i - \hat y_i)^2\\ &= \sum((\hat y_i -\bar y)+(y_i- \hat y_i))^2\\ &= \sum((\hat y_i -\bar y)+ r_i)^2\\ &= \sum((\hat y_i - \bar y)^2 + r_i^2 + 2 r_i \hat y_i - 2 r_i \bar y)\\ &= \sum(\hat y_i - \bar y)^2 + \sum r_i^2 + 2 \sum r_i \hat y_i - 2 \bar y \sum r_i\\ \sum(y_i- \bar y)^2 &= \sum(\hat y_i - \bar y)^2 + \sum r_i^2 \\ SQT &= SQE + SQR\;, \end{align} $$
again with ∑ri = 0 and ∑riŷi = b0∑ri + b1∑rixi = 0, which both follow from the minimization conditions of equation @ref(eq:beta).
In other words the total spread of the data is the sum of the spread explained by our regression line ŷ and the spread that remains in the un-explained residuals. We can obtain the latter as:
$$ \begin{align} \sum r_i^2 &= \sum(y_i- \bar y)^2 - \sum(\hat y_i - \bar y)^2 = SQT - SQE\;, \end{align} $$
and the variance of the residuals as:
$$ \begin{equation} \Var(r) = s_r^2 = \frac{1}{n-1} \sum r_i^2 = \frac{1}{n-1}\left(S_{yy} - \frac{S_{xy}^2}{S_{xx}}\right) = \Var(y) - \frac{\Cov(x,y)^2}{\Var(x)}\,, (\#eq:var) \end{equation} $$
which allows incremental calculation from the expanded forms of the sums of squares Sxy, Sxx and Syy in equation @ref(eq:sos).
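Equation @ref(eq:var) can be verified the same way (illustrative sketch):
## check eq. (var): variance of residuals from variances and covariance only
set.seed(1)
xw <- runif(30); yw <- 1 + 0.5*xw + rnorm(30, sd=0.2)
c(direct=var(residuals(lm(yw~xw))), incremental=var(yw) - cov(xw, yw)^2/var(xw)) # identical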
The negative variance of the residuals, $-\Var(r)$, is used as the default scoring function (type="var", eq. @ref(eq:score)) of dpseg.
Vertical segments in the data with $\Var(x)=0$, i.e., multiple y values at a single x value, lead to division by 0 for all scoring functions (eq. @ref(eq:r2) and @ref(eq:var)). These cases are detected by dpseg, the scoring functions are set to −∞, and a warning is issued. The data points will be part of adjacent or spanning segments (bottom panels in Fig. @ref(fig:lines)).
Horizontal segments in the data with $\Var(y)=0$ and slope b1 = 0, i.e., equal y values at adjacent x, lead to division by 0 in the correlation-based scoring functions (eq. @ref(eq:r2)). Here, dpseg will set R2 = 0 and issue a warning if correlation-based scoring is used (type="r2" and type="cor"). The scoring functions will evaluate to their minimum. This is consistent with the general behaviour of correlation-based scoring functions to favor sloped (|b1| > 0) over flat (b1 ≈ 0) segments.
The default scoring function based on the variance of residuals (eq. @ref(eq:var)) is not affected and will detect such segments (top left panel in Fig. @ref(fig:lines)).
Both cases are exemplified in Section @ref(lines) and Figure @ref(fig:lines).