Image source: Where is Wally?
Outlier detection
Novelty detection
In novelty detection, there may be normal samples in the training data that are far from other points.
and
$$ h(x) = 0 \text{, whenever } x \text{ is an inlier} $$

| | Positive prediction | Negative prediction | total |
|---|---|---|---|
| Outlier (positive) | 0 | 10 | 10 |
| Inlier (negative) | 0 | 9990 | 9990 |
| total | 0 | 10000 | 10000 |
where $N$ is the number of negative observations and $P$ is the number of positive observations.
| | Positive prediction | Negative prediction |
|---|---|---|
| Observed positive | True positive (TP) | False negative (FN) |
| Observed negative | False positive (FP) | True negative (TN) |
| Prediction based | Label based |
|---|---|
| Positive predictive value $= \frac{TP}{TP + FP}$ | True positive rate $= \frac{TP}{P}$ |
| Negative predictive value $= \frac{TN}{TN+FN}$ | True negative rate $= \frac{TN}{N}$ |
| | False positive rate $= \frac{FP}{N}$ |
| Accuracy $= \frac{TP + TN}{P+N}$ | |
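For example, plugging the earlier 10,000-sample table (a detector that never predicts "outlier") into these definitions gives

$$ \text{TPR} = \frac{0}{10} = 0, \qquad \text{TNR} = \frac{9990}{9990} = 1, \qquad \text{Accuracy} = \frac{0+9990}{10+9990} = 0.999, $$

while the positive predictive value is undefined ($0/0$). Accuracy looks excellent even though not a single outlier is detected, which is why the label-based rates matter when outliers are rare.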
| | Positive prediction | Negative prediction | total |
|---|---|---|---|
| Outlier (positive) | 3 | 15 | 18 |
| Inlier (negative) | 2 | 80 | 82 |
| total | 5 | 95 | 100 |
| Prediction based | Label based |
|---|---|
| Positive predictive value $= \qquad\qquad$ | True positive rate $= \qquad\qquad$ |
| Negative predictive value $= \qquad\qquad$ | True negative rate $= \qquad\qquad$ |
| | False positive rate $= \qquad\qquad$ |
| Accuracy $= \qquad\qquad$ | |
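To check your answers, here is a minimal Python sketch that computes each metric from the confusion-matrix counts of the 100-sample table above (TP = 3, FN = 15, FP = 2, TN = 80); the function name `confusion_metrics` is just an illustrative choice.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute prediction-based and label-based rates from confusion-matrix counts."""
    p, n = tp + fn, fp + tn          # numbers of observed positives / negatives
    return {
        "PPV": tp / (tp + fp),       # positive predictive value
        "NPV": tn / (tn + fn),       # negative predictive value
        "TPR": tp / p,               # true positive rate
        "TNR": tn / n,               # true negative rate
        "FPR": fp / n,               # false positive rate
        "ACC": (tp + tn) / (p + n),  # accuracy
    }

# Counts from the 100-sample table above
print(confusion_metrics(tp=3, fn=15, fp=2, tn=80))
# {'PPV': 0.6, 'NPV': 0.842..., 'TPR': 0.166..., 'TNR': 0.975..., 'FPR': 0.024..., 'ACC': 0.83}
```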
Assume higher scores indicate that a sample is more likely to be an outlier; e.g. $s(x)$ is the estimated probability that the sample $x$ is an outlier.
where $X_{out}$ is the collection of outliers in $X$ and $X_{in} = X\setminus X_{out}$ (we assume the labels of the test data are known).
Once such a scoring function has been learned (obtained), a classifier can be constructed by choosing a threshold $\lambda \in \mathbb R$:
$$ h^\lambda (x):=\begin{cases} 1 &s(x) \geq \lambda \\ 0 &s(x)< \lambda \end{cases} $$

Trade-off between benefit (true positive rate) and cost (false positive rate, also called the false alarm rate)
The classifier $h^\lambda(x)$ is determined by the threshold $\lambda$, so each choice of $\lambda$ gives one point $(\text{FPR}, \text{TPR})$ in the ROC graph.
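As a rough illustration (not from the original notes), the following numpy sketch sweeps $\lambda$ over the observed score values and records the resulting (FPR, TPR) points; `scores` and `labels` are assumed arrays, with label 1 marking outliers.

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the threshold lambda over the observed scores and collect (FPR, TPR) points."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    p, n = labels.sum(), (~labels).sum()            # number of outliers / inliers
    points = [(0.0, 0.0)]                           # lambda = +infinity: nothing is flagged
    for lam in np.sort(np.unique(scores))[::-1]:    # decreasing thresholds
        pred = scores >= lam                        # h^lambda(x) = 1  iff  s(x) >= lambda
        tpr = (pred & labels).sum() / p             # true positive rate
        fpr = (pred & ~labels).sum() / n            # false positive rate
        points.append((fpr, tpr))
    return points

# Toy example: the outliers (label 1) mostly receive higher scores
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]
print(roc_points(scores, labels))
```

In practice, `sklearn.metrics.roc_curve(labels, scores)` performs essentially the same sweep.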
Simply, for a scoring function $s(x)$
Unknown parameters: the location $\mu \in \mathbb R^p$ and the scatter, a positive definite $p\times p$ matrix $\Sigma$
For instance, the density function of the multivariate normal distribution
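For reference (a standard fact, not written out in the original notes), this density is

$$ f(\mathbf x;\mu,\Sigma) = \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf x-\mu)^t\,\Sigma^{-1}(\mathbf x-\mu)\right), $$

and the classical (non-robust) estimates of the unknown parameters are the sample mean

$$ \bar{\mathbf x} := \frac{1}{N} \sum_{i=1}^N \mathbf x_i, \text{ (sample mean)} $$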
and $$ \Sigma:=\operatorname{cov}(X):= \frac{1}{N} \sum_{i=1}^N (\mathbf x_i - \bar{\mathbf x})(\mathbf x_i - \bar {\mathbf x})^t. \text{ (sample covariance)} $$
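These two formulas are easy to sanity-check in numpy (a small sketch with synthetic data; note that `np.cov` divides by $N-1$ by default, so `bias=True` matches the $1/N$ definition above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # N = 200 observations, p = 3 variables

mean = X.mean(axis=0)                              # sample mean
cov_manual = (X - mean).T @ (X - mean) / len(X)    # (1/N) sum (x_i - xbar)(x_i - xbar)^t
cov_numpy = np.cov(X, rowvar=False, bias=True)     # same quantity via numpy

print(np.allclose(cov_manual, cov_numpy))          # True
```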
$\hat{\Sigma}_{MCD}$: MCD covariance estimate (from $h$ observations)
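scikit-learn ships an implementation of this estimator (`sklearn.covariance.MinCovDet`); a minimal usage sketch, with the data and the `support_fraction` value purely illustrative:

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[:5] += 8                                   # plant a few obvious outliers

mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(X)
print(mcd.location_)                         # robust location estimate
print(mcd.covariance_)                       # robust covariance estimate (Sigma_MCD)
print(mcd.mahalanobis(X)[:5])                # large distances flag the planted outliers
```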
Generally, $\binom{N}{h}$ is far too large to search exhaustively, so we need something smarter...
Consider a data set $X:=\{ \mathbf x_1, ... ,\mathbf x_N \}$ of $p$-variate observations. Let $H_1 \subset \{1,...,N\}$ with $|H_1|=h$ and put
$$ T_1:= \frac{1}{h} \sum_{i\in H_1} \mathbf x_i, \quad S_1:= \frac{1}{h} \sum_{i\in H_1} (\mathbf x_i - T_1)(\mathbf x_i- T_1)^t. $$

If $\det (S_1) \neq 0$, then define the relative distances
$$ d_1(i):= \sqrt{(\mathbf x_i - T_1)^t S_1^{-1}(\mathbf x_i- T_1)}, \text{ for } i=1,...,N. $$

Sort these $N$ distances from the smallest, $d_1(i_1)\leq d_1(i_2)\leq \cdots \leq d_1(i_N)$, to obtain the ordered tuple $(i_1, i_2, ...,i_N)$ (which is some permutation of $(1,2,...,N)$). Let $H_2:=\{i_1,...,i_h\}$ and compute $T_2$ and $S_2$ based on $H_2$. Then
$$ \det(S_2) \leq \det(S_1), $$

with equality if and only if $T_2=T_1$ and $S_2=S_1$.
Proof. Assume that $\det(S_2)>0$; otherwise the result already holds. We can thus compute $d_2(i):=d_{(T_2, S_2)}(i)$ for all $i=1,...,N$. Using $|H_2|=h$ and the definition of $(T_2, S_2)$ we find
$$\begin{align*} \frac{1}{hp}\sum_{i\in H_2}d_2^2(i) &= \frac{1}{hp} \operatorname{tr} \sum_{i\in H_2}(\mathbf x_i -T_2)^t S_2^{-1}(\mathbf x_i - T_2) \\ &=\frac{1}{hp} \operatorname{tr}\Big( S_2^{-1}\sum_{i\in H_2}(\mathbf x_i -T_2) (\mathbf x_i - T_2)^t\Big) = \frac{1}{hp}\operatorname{tr}\big(S_2^{-1}\, h S_2\big) = \frac{1}{p}\operatorname{tr}(I_p) = 1. \tag{A.1} \label{A.1} \end{align*}$$

Moreover, put
$$\begin{equation*} \lambda:= \frac{1}{hp} \sum_{i\in H_2} d_1^2(i) = \frac{1}{hp} \sum_{k=1}^h d_1^2(i_k)\leq \frac{1}{hp} \sum_{j\in H_1} d_1^2(j) =1, \tag{A.2} \label{A.2} \end{equation*}$$

where the middle inequality holds because $H_2$ collects the $h$ smallest distances, and the last equality follows by the same trace computation as in (\ref{A.1}) applied to $(T_1,S_1)$ and $H_1$. Here $\lambda>0$, because otherwise every $\mathbf x_i$ with $i\in H_2$ would equal $T_1$, giving $\det(S_2)=0$ and contradicting our assumption. Combining (\ref{A.1}) and (\ref{A.2}) yields
$$ \frac{1}{hp} \sum_{i\in H_2} d^2_{(T_1,\lambda S_1)}(i) = \frac{1}{hp} \sum_{i\in H_2}(\mathbf x_i -T_1)^t \frac{1}{\lambda}S_1^{-1}(\mathbf x_i - T_1) = \frac{1}{\lambda hp} \sum_{i\in H_2}d_1^2(i) =\frac{\lambda}{\lambda}=1. $$

Grübel (1988) proved that $(T_2, S_2)$ is the unique minimizer of $\det(S)$ among all $(T,S)$ for which $\frac{1}{hp}\sum_{i\in H_2} d^2_{(T,S)}(i) = 1$. This implies that $\det(S_2)\leq \det(\lambda S_1)$. On the other hand, it follows from the inequality (\ref{A.2}) that $\lambda\leq 1$ and hence $\det(\lambda S_1)=\lambda^p\det(S_1)\leq \det(S_1)$, so
$$\begin{equation*} \det(S_2)\leq \det(\lambda S_1)\leq \det(S_1). \tag{A.3} \label{A.3} \end{equation*}$$

Moreover, note that $\det(S_2)=\det(S_1)$ if and only if both inequalities in (\ref{A.3}) are equalities. For the first, we know from Grübel's result that $\det(S_2)=\det(\lambda S_1)$ if and only if $(T_2, S_2)= (T_1,\lambda S_1)$. For the second, $\det(\lambda S_1) = \det(S_1)$ if and only if $\lambda = 1$, i.e. $\lambda S_1 = S_1$. Combining both yields $(T_2, S_2) = (T_1, S_1)$.
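In practice this theorem is applied iteratively (the step is often called a "C-step"): start from some $h$-subset, compute $(T, S)$, re-select the $h$ observations with the smallest distances, and repeat; each pass can only decrease $\det(S)$. A minimal numpy sketch of that iteration (illustrative only, not a full MCD implementation; the name `c_steps` is my own):

```python
import numpy as np

def c_steps(X, h, n_iter=50, seed=0):
    """Iterate the determinant-decreasing step from the theorem above."""
    rng = np.random.default_rng(seed)
    N = len(X)
    H = rng.choice(N, size=h, replace=False)           # initial h-subset H_1
    prev_det = np.inf
    for _ in range(n_iter):
        T = X[H].mean(axis=0)                          # T_k
        S = (X[H] - T).T @ (X[H] - T) / h              # S_k
        d2 = np.einsum('ij,jk,ik->i', X - T, np.linalg.inv(S), X - T)  # squared distances
        H = np.argsort(d2)[:h]                         # h smallest distances -> H_{k+1}
        det = np.linalg.det(S)
        if det >= prev_det:                            # det(S) can only decrease; stop once it stalls
            break
        prev_det = det
    return T, S

# Toy data with a few planted outliers
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(95, 2)), rng.normal(loc=8, size=(5, 2))])
T, S = c_steps(X, h=75)
print(T)   # robust location estimate, close to the origin despite the planted outliers
```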
and
$$ \mathbf w \cdot \mathbf x_i - b \leq -1, \quad \text{ if } y_i =-1 $$

Kernel method (1992)
For instance,
However, we have some problems:
Choice of feature map, e.g. random mapping $ \varphi: \mathbb R^2 \longrightarrow \mathbb R^\infty$
$$ \varphi(x_1, x_2):=(\sin(x_2), \exp(x_1+x_2), x_2, x_1^{\tan(x_2)},...) $$
Generally, the dimension of the feature space may be very high.
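A standard way to sidestep both issues (the kernel trick, well known but not spelled out here) is to evaluate inner products in the feature space directly through a kernel, without ever materializing $\varphi$. A quick numpy check for the degree-2 polynomial kernel $k(\mathbf x,\mathbf y)=(\mathbf x\cdot\mathbf y+1)^2$ on $\mathbb R^2$, whose explicit feature map happens to be only 6-dimensional but illustrates the idea:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def k(x, y):
    """Kernel evaluated directly in the input space: (x . y + 1)^2."""
    return (np.dot(x, y) + 1) ** 2

x, y = np.array([0.5, -1.2]), np.array([2.0, 0.3])
print(np.isclose(phi(x) @ phi(y), k(x, y)))   # True: same inner product, no explicit feature map needed
```

The kernel evaluation costs only $O(p)$ work in the input dimension, no matter how large (or infinite) the feature space is, which is exactly the point of the kernel method.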
with Datasaurus