Section 1: Basic Math & Linear Algebra

Introduction to Machine Learning

Linear Algebra


1.A

A symmetric matrix $A$ with non-negative eigenvalues is PSD, since

$ v^T A v = (v^T U)\,\mathrm{diag}(\lambda)\,(U^T v) = \sum_{i}\lambda_i (u_i^T v)^2 \geq 0 $, where the $\lambda_i$ are the eigenvalues of $A$ and $U$ is the orthogonal matrix of its eigenvectors.

And matrix A can be decomposed as:

$A = QDQ^T = Q\,\mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_n)\,Q^T = Q\,\mathrm{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_n})\,\mathrm{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_n})\,Q^T = XX^T$, with $X = Q\,\mathrm{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_n})$.

Since $A$ can be written as $XX^T$, for any $v$ we get:

$v^T A v = v^T X X^T v = (X^T v)^T (X^T v) = \|X^T v\|^2 \geq 0$
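
As a quick numerical sanity check of this factorization (a sketch using NumPy; the matrix, seed and tolerances are arbitrary assumptions, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# A symmetric PSD matrix built from a random factor (B B^T is always PSD).
B = rng.standard_normal((n, n))
A = B @ B.T

# Eigendecomposition of the symmetric matrix: A = Q diag(lam) Q^T.
lam, Q = np.linalg.eigh(A)
assert np.all(lam >= -1e-10)                 # non-negative eigenvalues

# The factor X = Q diag(sqrt(lam)) from the proof satisfies A = X X^T.
X = Q @ np.diag(np.sqrt(np.clip(lam, 0.0, None)))
assert np.allclose(A, X @ X.T)

# v^T A v = ||X^T v||^2 >= 0 for any v.
for _ in range(1000):
    v = rng.standard_normal(n)
    assert v @ A @ v >= -1e-10
print("1.A check passed")
```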


1.B

For a given PSD matrix $A$ and a scalar $\alpha \in \mathbb{R}$ with $\alpha \geq 0$:

(*) $v^T(\alpha A)v = \alpha\, v^T A v \geq 0$ for every $v$, since $\alpha \geq 0$ and $A \geq 0$.

Then for PSD matrices $A, B$ with $A, B \geq 0$, we also have $A + B \geq 0$:

Now let's look at $v^T(A+B)v$; we get (**): $v^T(A+B)v = v^T A v + v^T B v \geq 0$.

Then from both (*) and (**) we immediately get $\alpha A + \beta B \geq 0$ for all $\alpha, \beta \geq 0$.

The set of all $n \times n$ PSD matrices over $\mathbb{R}$ is not a vector space over $\mathbb{R}$, because it is not closed under scalar multiplication: for $\lambda < 0$ and $A > 0$ we get $\lambda A < 0$, so $\lambda A \notin \mathrm{PSD}$.
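
Both facts from 1.B can be checked numerically; the sketch below (arbitrary sizes, seed and scalars) confirms that a non-negative combination of PSD matrices is PSD while a negative multiple is not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3

def random_psd(size):
    """Build a PSD matrix as M M^T for a random square M."""
    M = rng.standard_normal((size, size))
    return M @ M.T

def is_psd(M, tol=1e-10):
    """Check positive semi-definiteness via the smallest eigenvalue."""
    return np.linalg.eigvalsh(M).min() >= -tol

A, B = random_psd(n), random_psd(n)
alpha, beta = 2.5, 0.7                  # non-negative scalars
print(is_psd(alpha * A + beta * B))     # True: closed under such combinations
print(is_psd(-1.0 * A))                 # False: not closed under scalar mult.
```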

Calculus and Probability

1.a

For $x_1, x_2, \ldots, x_n$ i.i.d. $U([0,1])$ continuous random variables, let's write the order statistics as $\overline{x_1}, \overline{x_2}, \ldots, \overline{x_n}$, where for all $i$, $\overline{x_i} \leq \overline{x_{i+1}}$.

First, let's find the CDF of $Y = \max\lbrace x_1, x_2, \ldots, x_n \rbrace = \overline{x_n}$:

$F_Y(k) = F_{\overline{x_n}}(k) = \Pr(\overline{x_n} \leq k) = \Pr(x_1 \leq k, x_2 \leq k, \ldots, x_n \leq k)$, and since they are i.i.d.:

$= \Pr(x_1 \leq k)\Pr(x_2 \leq k)\cdots\Pr(x_n \leq k) = [\Pr(x_i \leq k)]^n = [F_X(k)]^n$

(*) For $U([0,1])$:

$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \in [0,1] \\ 1 & \text{for } x > 1 \end{cases} \qquad f_X(x) = \begin{cases} 1 & \text{for } x \in [0,1] \\ 0 & \text{for } x \notin [0,1] \end{cases}$

we get:

$f_Y(k) = f_{\overline{x_n}}(k) = \frac{d}{dk}\left(F_X(k)\right)^n = n\,(F_X(k))^{n-1} f_X(k)$

Now let's substitute the values from (*) and get:

$f_Y(k) = n\,(F_X(k))^{n-1} f_X(k) = n k^{n-1}\,\mathbb{1}_{[0,1]}(k)$, which is the density of $\mathrm{Beta}(n, 1)$.

Therefore:

$\lim_{n\to\infty} E[Y] = \lim_{n\to\infty} \frac{n}{n+1} = 1$

and:

$\lim_{n\to\infty} \mathrm{Var}[Y] = \lim_{n\to\infty} \frac{n}{(n+1)^2(n+2)} = 0$
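
These two limits are easy to confirm with a small Monte Carlo sketch (the choices of $n$ and the number of trials below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 10, 200_000

# Maximum of n i.i.d. U([0,1]) samples, repeated over many trials.
y = rng.uniform(0.0, 1.0, size=(trials, n)).max(axis=1)

# Empirical mean/variance vs. the Beta(n, 1) values n/(n+1) and n/((n+1)^2 (n+2)).
print(y.mean(), n / (n + 1))
print(y.var(), n / ((n + 1) ** 2 * (n + 2)))
```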


1.b

$E[|x-\alpha|] = \int_{-\infty}^{+\infty}|x-\alpha|\,f(x)\,dx = \int_{-\infty}^{\alpha}(\alpha-x)\,f(x)\,dx + \int_{\alpha}^{+\infty}(x-\alpha)\,f(x)\,dx$

When $\alpha \in \operatorname{argmin}$ of this expression, the derivative with respect to $\alpha$ must vanish (Leibniz rule):

$\frac{d}{d\alpha}E[|x-\alpha|] = \underbrace{(\alpha-x)f(x)\big|_{x=\alpha}}_{=0} + \int_{-\infty}^{\alpha} f(x)\,dx - \underbrace{(x-\alpha)f(x)\big|_{x=\alpha}}_{=0} - \int_{\alpha}^{+\infty} f(x)\,dx = 0$

$\Rightarrow \int_{-\infty}^{\alpha} f(x)\,dx = \int_{\alpha}^{+\infty} f(x)\,dx \;\Rightarrow\; \Pr(x \leq \alpha) = \Pr(x \geq \alpha)$

$\Rightarrow \Pr(x \leq \alpha) = 1/2$, i.e. the minimizer $\alpha$ is a median of $x$.
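
A short numerical sketch of this conclusion (the skewed test distribution and the search grid are arbitrary assumptions): the value of $\alpha$ minimizing the empirical mean absolute deviation lands at the sample median.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=100_000)     # skewed, so median != mean

# Grid search for the alpha minimizing the empirical E[|x - alpha|].
alphas = np.linspace(0.0, 3.0, 601)
mean_abs_dev = [np.abs(x - a).mean() for a in alphas]
best_alpha = alphas[int(np.argmin(mean_abs_dev))]

print(best_alpha, np.median(x))                  # both close to ln(2) ~ 0.693
```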

Optimal Classifiers and Decision Rules

Let $X$ and $Y$ be random variables where $Y$ takes values in $\mathcal{Y} = \lbrace 1, \dots, L \rbrace$, and let $\ell$ be the 0-1 loss function defined in class. Hence:

$E[\ell(Y, f(X))] = \sum_{k=1}^{L} \Pr(X=\hat{x}, Y=k)\,\ell(k, f(\hat{x}))$

Using Bayes' rule:

$\sum_{k=1}^{L}\Pr(X=\hat{x})\Pr(Y=k \mid X=\hat{x})\,\ell(k, f(\hat{x})) = \Pr(X=\hat{x})\sum_{k=1}^{L}\Pr(Y=k \mid X=\hat{x})\,\ell(k, f(\hat{x}))$

With the 0-1 loss, minimizing this over the choice $f(\hat{x}) = k$ gives:

$h(\hat{x}) = \operatorname*{argmin}_{k \in \mathcal{Y}} \Pr(X=\hat{x}) \sum_{j \neq k} \Pr(Y=j \mid X=\hat{x}) = \operatorname*{argmin}_{k \in \mathcal{Y}} \Pr(X=\hat{x})\bigl(1 - \Pr(Y=k \mid X=\hat{x})\bigr)$

$\Rightarrow\; h(\hat{x}) = \operatorname*{argmax}_{k \in \mathcal{Y}} \Pr(Y=k \mid X=\hat{x})$
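
A minimal sketch of the resulting rule (the posterior values below are made-up numbers, only to illustrate the arg-max step):

```python
import numpy as np

def bayes_classifier(posteriors: np.ndarray) -> int:
    """Return the class k in {1, ..., L} maximizing Pr(Y = k | X = x_hat)."""
    return int(np.argmax(posteriors)) + 1    # +1 because classes start at 1

posterior = np.array([0.1, 0.6, 0.3])        # hypothetical Pr(Y = k | X = x_hat)
print(bayes_classifier(posterior))           # -> 2 (the most probable class)
```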

To find the decision rule for $\Pr[Y=1 \mid X=x] > \Pr[Y=0 \mid X=x]$, let's apply Bayes' rule to both sides. We get:

$\frac{f_{X|Y=1}(x)\,\Pr[Y=1]}{f_X(x)} > \frac{f_{X|Y=0}(x)\,\Pr[Y=0]}{f_X(x)}$

With $p = \Pr[Y=1]$ and shared covariance $\Sigma$:

$p\, f_1(x; \mu_1, \Sigma) > (1-p)\, f_0(x; \mu_0, \Sigma)$

$\frac{\exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1}(x-\mu_1)\right)}{\exp\left(-\frac{1}{2}(x-\mu_0)^T \Sigma^{-1}(x-\mu_0)\right)} > \frac{1-p}{p}$

$(x-\mu_0)^T \Sigma^{-1}(x-\mu_0) - (x-\mu_1)^T \Sigma^{-1}(x-\mu_1) > 2\ln\left(\frac{1-p}{p}\right)$

where $(x-\mu)^T \Sigma^{-1}(x-\mu)$ is the squared Mahalanobis distance $d_m^2(x, \mu)$ between $x$ and $\mu$, so our simplified decision rule will be:

$h(x) = \begin{cases} 1 & \text{for } d_m^2(x, \mu_0) > d_m^2(x, \mu_1) + 2\ln\left(\frac{1-p}{p}\right) \\ 0 & \text{otherwise} \end{cases}$
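
A sketch of this rule in code, assuming two Gaussians with a shared covariance $\Sigma$ and $p = \Pr[Y=1]$; the helper names and the parameter values are illustrative assumptions:

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ sigma_inv @ d)

def h(x, mu0, mu1, sigma, p):
    """Predict 1 iff d_m^2(x, mu0) > d_m^2(x, mu1) + 2 ln((1 - p) / p)."""
    sigma_inv = np.linalg.inv(sigma)
    lhs = mahalanobis_sq(x, mu0, sigma_inv)
    rhs = mahalanobis_sq(x, mu1, sigma_inv) + 2.0 * np.log((1 - p) / p)
    return 1 if lhs > rhs else 0

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(h(np.array([1.8, 1.5]), mu0, mu1, sigma, p=0.5))   # -> 1
print(h(np.array([0.2, -0.1]), mu0, mu1, sigma, p=0.5))  # -> 0
```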

When $d=1$ the covariance matrix $\Sigma$ is of size $d \times d$ (a single variance), so the decision boundary is just a point; in the same way, when $d=2$ we get a line, and for general $d$ the boundary is a $(d-1)$-dimensional hyperplane.


For $d = 1$, $\mu_0 = \mu_1 = \mu$ and $\sigma_1 \neq \sigma_0$, we look for the boundary equation from the decision rule formula we had above:

$d_m^2(x, \mu_0) - d_m^2(x, \mu_1) = 2\ln\left(\frac{1-p}{p}\right)$

$(x-\mu)^2\left(\frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2}\right) = 2\ln\left(\frac{1-p}{p}\right) \;\Rightarrow\; (x-\mu)^2 = \frac{2\ln\left(\frac{1-p}{p}\right)}{\frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2}}$

$(x-\mu) = \pm\sqrt{\frac{2\ln\left(\frac{1-p}{p}\right)}{\frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2}}} \;\Rightarrow\; x = \mu \pm \sqrt{\frac{2\ln\left(\frac{1-p}{p}\right)}{\frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2}}}$
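
As a final sketch, the two thresholds $x = \mu \pm \sqrt{\cdot}$ can be computed directly; $\mu$, $\sigma_0$, $\sigma_1$ and $p$ below are arbitrary example values (chosen so the expression under the root is positive):

```python
import numpy as np

mu, sigma0, sigma1, p = 0.0, 1.0, 2.0, 0.3

# (x - mu)^2 = 2 ln((1 - p) / p) / (1 / sigma0^2 - 1 / sigma1^2)
rhs = 2.0 * np.log((1 - p) / p) / (1.0 / sigma0**2 - 1.0 / sigma1**2)
x_low, x_high = mu - np.sqrt(rhs), mu + np.sqrt(rhs)
print(x_low, x_high)    # two symmetric decision thresholds around mu
```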