1 Assignment 1
1.1 Linear Algebra
1.a
\(\Rightarrow \) Let \(A\) be a symmetric PSD matrix with eigendecomposition \(A=Q\,\mathrm{diag}(\lambda )\,Q^t\). For every \(v\) we have \({v^tAv}\) \(=\) \({(v^tQ)\,\mathrm{diag}(\lambda )\,(Q^tv)}\) =
\({\sum _{i}\lambda _i(Q^tv)_i^2}\), and taking \(v\) to be the \(i\)-th eigenvector shows \(\lambda _i\geq 0\) for every eigenvalue \(\lambda _i\) of \(A\),
and matrix A can be decomposed as:
\[ A = QDQ^t = Q\,\mathrm{diag}(\lambda _1,\lambda _2\ldots \lambda _n)\,Q^t = \]
\[ Q\,\mathrm{diag}(\sqrt{\lambda _1},\sqrt{\lambda _2}\ldots \sqrt{\lambda _n})\,\mathrm{diag}(\sqrt{\lambda _1},\sqrt{\lambda _2}\ldots \sqrt{\lambda _n})\,Q^t= XX^t, \quad \text{where } X:=Q\,\mathrm{diag}(\sqrt{\lambda _1},\ldots ,\sqrt{\lambda _n}) \]
\(\Leftarrow \) If \(A\) can be written as \(A=XX^t\), then for every \(v\):
\[ v^tAv=v^tXX^tv=(X^tv)^t(X^tv)=\parallel X^tv \parallel ^2\geq 0 \]
so \(A\) is PSD.
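As a quick numerical sanity check of both directions (an illustrative numpy sketch, not part of the assignment; the matrix X below is arbitrary):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))     # arbitrary X, so A = X X^T should be PSD
A = X @ X.T

# direction <=: v^T A v = ||X^T v||^2 >= 0 for any v
v = rng.standard_normal(4)
print(v @ A @ v, np.linalg.norm(X.T @ v) ** 2)    # equal up to rounding

# direction =>: a symmetric PSD matrix has nonnegative eigenvalues,
# and Q diag(sqrt(lambda)) recovers a factor X with A = X X^T
lam, Q = np.linalg.eigh(A)
print(lam >= -1e-10)                              # all eigenvalues nonnegative
X2 = Q @ np.diag(np.sqrt(np.clip(lam, 0, None)))
print(np.allclose(A, X2 @ X2.T))
\end{verbatim}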
1.b
For a given PSD matrix \(A\) and \(\alpha \geq 0\), \(\alpha \in \mathbb{R}\):
(*) \(v^t(\alpha A)v=\alpha \,(v^tAv)\geq 0\) for every \(v\), since both \(\alpha \) and \(v^tAv\) are nonnegative, so \(\alpha A\geq 0\).
Similarly, for PSD matrices \(A,B\geq 0\) we get \(A+B\geq 0\), by applying the same argument to \(A+B\):
(**)
\[ v^t(A+B)v=v^tAv+v^tBv \geq 0 \]
From (*) and (**) together it immediately follows that \(\alpha A+\beta B\geq 0\) for all \(\alpha ,\beta \geq 0\).
The set of all n × n PSD matrices over \(\mathbb{R}\) is not a vector space over \(\mathbb{R}\) because it is not closed under scalar multiplication: for \(\lambda {\lt} 0\) and \(A\neq 0\),
\[ A\geq 0,\ A\neq 0 \rightarrow \lambda A \text{ has a negative eigenvalue} \rightarrow \lambda A \notin \lbrace PSD \rbrace \]
1.2 Calculus and Probability
1.a
For \(x_1,x_2\ldots x_n\) i.i.d. \(U([0, 1])\) continuous random variables, let us write the
order statistics as \(\overline{x_1} ,\overline{x_2}\ldots \overline{x_n}\), where \(\forall i\), \(\overline{x_i} \leq \overline{x_{i+1}}\). First let us find the CDF of \(Y=\max \lbrace x_1,x_2\ldots x_n \rbrace =\overline{x_n}\):
\[ F_Y(k)=F_{ \overline{x_n}}(k)=\Pr (\overline{x_n}\leq k)=\Pr (x_1\leq k,x_2\leq k\ldots x_n\leq k) \]
and because the \(x_i\) are i.i.d.:
\[ \Pr (x_1\leq k)\Pr (x_2\leq k)\ldots \Pr (x_n\leq k)=[\Pr (x_i\leq k)]^n=[F_x(k)]^n \]
\[ (*) \quad F_x(k) = \left\{ \begin{array}{rcl} {0} & \mbox{for} & k{\lt}0 \\ k & \mbox{for} & k\in [0,1] \\ 1 & \mbox{for} & k{\gt}1 \end{array}\right. \qquad f(k) = \left\{ \begin{array}{rcl} {1} & \mbox{for} & k\in [0,1] \\ 0 & \mbox{for} & k\notin [0,1] \end{array}\right. \]
we get:
\[ f_Y(k)=f_{\overline{x_n}}(k)=\frac{d}{dk}\left(F_x(k)\right)^n=n(F_x(k))^{n-1}f(k) \]
now lets set values in (*) and get :
\[ f_Y(k)=n(F_x(k))^{n-1}f(k)=nk^{n-1}I_{[0,1]}(k), \quad \text{i.e. } Y\sim Beta(n,1) \]
therefore :
\[ \lim _{ n\to \infty }E[Y]= \lim _{ n\to \infty }\frac{n}{n+1}= 1 \]
and:
\[ \lim _{ n\to \infty }\mathrm{Var}[Y]= \lim _{ n\to \infty }\frac{n}{(n+1)^2(n+2)}= 0 \]
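A small Monte Carlo check of these limits (an illustrative sketch; the sample sizes and number of repetitions are arbitrary choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
for n in (5, 50, 500):
    # draw 10,000 samples of size n from U([0,1]) and take the maximum of each
    y = rng.random((10_000, n)).max(axis=1)
    # empirical mean and variance vs. the Beta(n,1) formulas
    print(n, y.mean(), n / (n + 1), y.var(), n / ((n + 1) ** 2 * (n + 2)))
\end{verbatim}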
1.b
\[ E[|x-\alpha |] = \int ^{+\infty }_{-\infty } |x-\alpha |f(x) \, dx = \int ^{\alpha }_{-\infty } |x-\alpha |f(x) \, dx + \int ^{+\infty }_{\alpha } |x-\alpha |f(x) \, dx \]
Differentiating with respect to \(\alpha \) and setting the derivative to zero at \(\alpha \in \arg \min \) (the boundary terms vanish):
\[ 0=\frac{d}{d\alpha }E[|x-\alpha |]=\underbrace{(\alpha -x)f(x)\Big |_{x=\alpha }}_{=0}+ \int ^{\alpha }_{-\infty } f(x) \, dx + \underbrace{(x-\alpha )f(x)\Big |_{x=\alpha }}_{=0}- \int ^{+\infty }_{\alpha } f(x) \, dx \]
\[ \Rightarrow \int ^{\alpha }_{-\infty } f(x) \, dx =\int ^{+\infty }_{\alpha } f(x) \, dx \Rightarrow \Pr (x\leq \alpha )=\Pr (x\geq \alpha )\Leftrightarrow \]
\[ \Pr (x\leq \alpha )=1/2 \]
1.3 Optimal Classifiers and Decision Rules
1.a
Let X and Y be random variables where Y takes values in \(\mathcal{Y} = \lbrace 1, \dots , L\rbrace \), and let \(\triangle \) be the 0-1 loss function defined in class. The contribution of a point \(\hat{x}\) to the expected loss \(E[\triangle (Y,f(X))]\) is:
\[ \sum _{k=1}^{L}\Pr (X=\hat{x},Y=k)\,\triangle (k,f(\hat{x})) \]
Using Bayes' rule:
\[ \sum _{k=1}^{L}\Pr (X=\hat{x})\Pr (Y=k|X=\hat{x})\triangle (k,f(\hat{x}))=\Pr (X=\hat{x})\sum _{k=1}^{L}\Pr (Y=k|X=\hat{x})\triangle (k,f(\hat{x})) \]
Since \(\triangle (k,f(\hat{x}))=\mathds {1}[f(\hat{x})\neq k]\), minimizing the expected loss pointwise gives, for every \(\hat{x}\):
\[ h(\hat{x})=\text{arg}\min _{k\in \{ 1\ldots L\} }\ \Pr (X=\hat{x})\sum _{j\neq k}\Pr (Y=j|X=\hat{x}) =\text{arg}\min _{k}\ \Pr (X=\hat{x})\left(1-\Pr (Y=k|X=\hat{x})\right) \]
\[ \Rightarrow h(\hat{x})=\text{arg}\max _{k\in \{ 1,\ldots ,L\} } \Pr (Y=k|X=\hat{x}) \]
1.4 Optimal Classifiers and Decision Rules
1.b
To find a decision rule for:
\[ \Pr [y = 1 | X] {\gt} \Pr [y = 0 | X] \]
let us apply Bayes' rule to both sides; we get:
\[ \frac{f_{X|Y =1} (x)\, \Pr [Y = 1]}{f_X (x)} {\gt} \frac{f_{X|Y =0} (x)\, \Pr [Y = 0]}{f_X (x)} \]
\[ p\, f_1 (x , \mu _1 , \Sigma ) {\gt} (1-p)\, f_0 (x , \mu _0 , \Sigma ) \]
\[ \frac{ \exp \left( -\frac{1}{2}(x-\mu _1)^T \Sigma ^{-1}(x-\mu _1)\right) }{\exp \left( -\frac{1}{2}(x-\mu _0)^T \Sigma ^{-1}(x-\mu _0)\right)} {\gt} \frac{1-p}{p} \]
\[ (x-\mu _0)^T \Sigma ^{-1}(x-\mu _0) -(x-\mu _1)^T \Sigma ^{-1} (x-\mu _1){\gt}2\ln \left(\frac{1-p}{p}\right) \]
where \((x-\mu )^T \Sigma ^{-1}(x-\mu )\) is the squared Mahalanobis distance between \(x\) and \(\mu \),
so our simpler decision rule will be
\[ h(x) = \left\{ \begin{array}{rcl} {1} & \mbox{for} & d^2\mathbf{_m} (x,\mu _0) {\gt} d^2\mathbf{_m} (x,\mu _1)+2ln(\frac{1-p}{p}) \\ 0 & \mbox{} & \mbox{otherwise} \end{array}\right. \]
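A minimal sketch of this decision rule in numpy (assuming the class means mu0, mu1, the shared covariance Sigma and the prior p are given; the function name is mine, not from the assignment skeleton):
\begin{verbatim}
import numpy as np

def gaussian_decision_rule(x, mu0, mu1, Sigma, p):
    """Return 1 iff d_M^2(x, mu0) > d_M^2(x, mu1) + 2 ln((1-p)/p)."""
    Sigma_inv = np.linalg.inv(Sigma)
    d2 = lambda mu: (x - mu) @ Sigma_inv @ (x - mu)   # squared Mahalanobis distance
    return int(d2(mu0) > d2(mu1) + 2 * np.log((1 - p) / p))
\end{verbatim}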
1.c
When \(d=1\), the covariance matrix \(\Sigma \) is \(1\times 1\) and the decision boundary is a single point; in the same way, when \(d=2\) we get a line, and for a general \(d\) the boundary is a \((d-1)\)-dimensional hyperplane (the quadratic terms cancel since both classes share the same \(\Sigma \)).
1.d
For \(d = 1\), \(\mu _0 = \mu _1 = \mu \) and \(\sigma _1 \neq \sigma _0 \), we look for the points where the decision rule derived above holds with equality:
\[ d^2\mathbf{_m}(x,\mu _0)-d^2\mathbf{_m}(x,\mu _1)=2ln(\frac{1-p}{p}) \]
\[ (x-\mu )^2 \left(\frac{1}{\sigma _0^2}-\frac{1}{\sigma _1^2}\right)=2\ln \left(\frac{1-p}{p}\right) \Rightarrow (x-\mu )^2=\frac{2\sigma _0^2\sigma _1^2}{\sigma _1^2-\sigma _0^2}\ln \left(\frac{1-p}{p}\right) \]
\[ (x-\mu )= \pm \sqrt{\frac{2\sigma _0^2\sigma _1^2}{\sigma _1^2-\sigma _0^2}\ln \left(\frac{1-p}{p}\right)} \Rightarrow x=\mu \pm \sqrt{\frac{2\sigma _0^2\sigma _1^2}{\sigma _1^2-\sigma _0^2}\ln \left(\frac{1-p}{p}\right)} \]
1.5 Programming Assignment
Visualizing the Hoeffding bound:
figure
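The plot itself did not survive the conversion; below is a minimal sketch of how such a visualization could be produced, assuming fair coin flips and the two-sided Hoeffding bound \(2e^{-2n\epsilon ^2}\) (the parameter values and file name are assumptions):
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

n, repeats = 20, 100_000                 # n coin flips per experiment
rng = np.random.default_rng(0)
means = rng.integers(0, 2, size=(repeats, n)).mean(axis=1)

eps = np.linspace(0.0, 0.5, 50)
empirical = [(np.abs(means - 0.5) > e).mean() for e in eps]
hoeffding = 2 * np.exp(-2 * n * eps ** 2)

plt.plot(eps, empirical, label="empirical P(|mean - 0.5| > eps)")
plt.plot(eps, hoeffding, label="Hoeffding bound")
plt.xlabel("eps"); plt.legend(); plt.savefig("myplot.png")
\end{verbatim}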
1.6 Nearest Neighbor:
1
The k-NN accuracy for \(k=10\): it got \(882 / 1000\) correct labelings, i.e. an accuracy rate of 0.882.
2
The best k found is \(k= 4 \), with \(883 / 1000\) correct labelings, i.e. an accuracy rate of 0.883.
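A minimal k-NN sketch consistent with this setup (assuming flattened image arrays and the Euclidean metric; the variable names are mine, not the course skeleton):
\begin{verbatim}
import numpy as np

def knn_predict(train_x, train_y, query, k):
    # Euclidean distances from the query image to every training image
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    # majority vote among the k nearest labels
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

def accuracy(train_x, train_y, test_x, test_y, k):
    preds = [knn_predict(train_x, train_y, q, k) for q in test_x]
    return np.mean(np.array(preds) == test_y)
\end{verbatim}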
2 Assignment 2
2.1 1. PAC learnability of \(\ell _2\)-balls around the origin.
Given a real number \(R \geq 0\), define the hypothesis \(h_R : \mathbb{R}^d \rightarrow \lbrace 0, 1\rbrace \) by \(h_R(x)=\mathds {1}[\,||x||_2\leq R\,]\). We will prove that the hypothesis class \(H_{ball} = \lbrace h_R \mid R \geq 0\rbrace \) is PAC learnable in the realizable case.
Let us design an algorithm \(A_{balls}\) that learns \(H_{ball}\):
- •
Given a sample of size n: \(u_1, \ldots , u_n\) together with their labels.
- •
Return the smallest ball \(B_R\) that is consistent with the sample, i.e. set \(R=\max \lbrace ||u_i||_2 : u_i \text{ is labeled positive}\rbrace \) (and \(R=0\) if there are no positive examples).
- •
Since the problem is realizable by some \(B_0\in H_{ball}\), we always have \(B_R\subseteq B_0\), so \(A_{balls}\) can mistake only by labeling positive points as negative.
The error of the algorithm is \(e_P (h_R) = P [B_0 \setminus B_{R}]\)
Assume w.l.o.g. that \(P(B_0) {\gt} \epsilon \); otherwise \(e_P (h_R)\leq P(B_0)\leq \epsilon \) and we are done. Define \(T\subseteq B_0\) to be the annulus obtained by shrinking the boundary of \(B_0\) toward the origin \((0,\ldots ,0)\) until \(P(T)=\epsilon \). If some sample point \(u_i\) lands in \(T\) (such a point is labeled positive, since \(T\subseteq B_0\)), then \(R\geq ||u_i||_2\) and therefore \(B_0\setminus B_R\subseteq T\), which gives
\[ e_P (h_R) = P [B_0 \setminus B_R] \leq P(T)=\epsilon \]
Hence the event \(e_P(h_R){\gt}\epsilon \) can occur only when no sample point lands in \(T\):
\[ P [ e_P(h_R) {\gt} \epsilon ] \leq P [ \forall i,\ 1\leq i\leq n : u_i \notin T ] = (1-\epsilon )^n \leq e^{-n\epsilon } \]
Requiring \(e^{-n\epsilon }\leq \delta \) gives \(n \geq \frac{1}{\epsilon }\ln \frac{1}{\delta }\).
now lets set \(N(\epsilon ,\delta )=\frac{1}{\epsilon }ln\frac{1}{\delta }\)
We proved that there exists \(N(\epsilon ,\delta )=\frac{1}{\epsilon }\ln \frac{1}{\delta }\) such that for every \(\epsilon , \delta \) and every realizable distribution P over \(\mathbb{R}^d\) with labeling function \(B_0 \in H_{ball}\), when running \(A_{ball}\) on \(n \geq N(\epsilon ,\delta )\) training examples drawn i.i.d. from P, it returns a hypothesis \(h_R \in H_{ball}\) that satisfies the property above. Moreover, the sample complexity does not depend on the dimension d.
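A sketch of \(A_{balls}\) itself (hypothetical names; it simply returns the norm of the farthest positively labeled point):
\begin{verbatim}
import numpy as np

def A_balls(X, y):
    """Smallest origin-centered ball consistent with a realizable sample.

    X: (n, d) array of points, y: (n,) array of 0/1 labels.
    Returns R such that h_R(x) = 1[||x||_2 <= R].
    """
    norms = np.linalg.norm(X, axis=1)
    positives = norms[y == 1]
    return positives.max() if positives.size else 0.0

def h_R(R, X):
    # apply the learned hypothesis to new points
    return (np.linalg.norm(X, axis=1) <= R).astype(int)
\end{verbatim}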
2.2 PAC in Expectation.
Theorem
hypothesis class \(\mathrm{H}\) is PAC learnable if and only if \(\mathrm{H}\) is PAC learnable in expectation
Proof
(\(\Leftarrow \)) Assume \(\mathrm{H}\) is PAC learnable in expectation: there exist an algorithm A and \(\hat{N} : (0, 1) \rightarrow \mathbb{N}\) such that for every \(a\in (0,1)\) and every \(n\geq \hat{N}(a)\), \(E[e_P(A(S))]\leq a\).
Since \(e_P(A(S))\geq 0\) and \(\epsilon {\gt}0\), Markov's inequality gives
\[ P [ e_P(A(S)) {\gt} \epsilon ] \leq \frac{E [ e_P(A(S)) ]}{\epsilon } \]
Hence, setting \(N(\epsilon ,\delta ):=\hat{N}(\epsilon \delta )\), for every \(n \geq N(\epsilon ,\delta ) \) the following holds:
\[ \frac{E [ e_P(A(S)) ]}{\epsilon }\leq \frac{\epsilon \delta }{\epsilon }=\delta \]
which is exactly the PAC learnability definition, with the same A.
(\(\Rightarrow \)) Conversely, assume \(\mathrm{H}\) is PAC learnable by some algorithm A with sample complexity \(N(\epsilon ,\delta )\); the same A witnesses PAC learnability in expectation. Using the law of total expectation:
\[ E [e_P(A(s))] \]
\[ = \underbrace{E [e_P(A(s))|e_P(A(s)) \leq \epsilon ]\cdot P [e_P(A(s)) \leq \epsilon ]}_{\leq \epsilon \cdot 1} + \]
\[ \underbrace{E [e_P(A(s))|e_P (A (s)) {\gt}\epsilon ]\cdot P [e_P(A(s)) {\gt} \epsilon ]}_{\leq 1\cdot \delta } \]
\[ \leq \epsilon + \delta \]
In general, for any target accuracy \(a\in (0,1)\) we can take \(\epsilon =\delta =a/2\); then for \(n\geq N(a/2,a/2)\) we get \(E[e_P(A(S))]\leq \epsilon +\delta =a\), which is the definition of PAC learnability in expectation, and the equivalence follows.
Union Of Intervals.
We can notice that any 2k distinct points on the real line can be shattered using k intervals: in any labeling there are at most k maximal blocks of consecutive positive points, and it suffices to cover each block with one interval. Now let us look at a set of 2k+1 points, assume they are sorted \(x_1{\lt}x_2{\lt}\dots {\lt}x_{2k+1}\), and label \(x_i\) with \((-1)^{i+1}\). This labeling has k+1 positive points, each separated from the next by a negative point, so no interval may contain two of them and realizing it would require k+1 intervals. Hence no set of 2k+1 points is shattered, and the VC dimension is 2k.
Prediction by polynomials.
The VC dimension of H is the size of the largest set of examples that can be shattered by H\(\Rightarrow \) The VC dimension is infinite if for all m, there is a set of m examples shattered by H.
For every \(m\in \mathbb{N}\), pick \(m\) distinct points \(x_1,\dots ,x_m\in \mathbb{R}\) and any labeling \((y_1,y_2\dots y_m)\in \{ 0,1\} ^m\) (there are \(2^m\) of them). Using the hint, for the \(m\) distinct values \(x_1, \dots , x_m\) there exists a polynomial P of degree \(m-1\) such that \(P(x_i) = 1\) when \(y_i=1\) and \(P(x_i)=-1\) when \(y_i=0\). The corresponding \(h_P \in H_{poly}\) then labels all the elements of the sample exactly as required, for every one of the \(2^m\) labelings and for every \(m\).
Hence the VC dimension of \(H_{poly}\) is \(\infty \).
2.3 Structural Risk Minimization.
Let \(\hat{H}=\cup ^k_{i=1}H_i\) be a union of k finite hypothesis classes such that \(|H_1| \leq \dots \leq |H_k|\). Using the property relating empirical and true errors, for every class \(H_i\) and every \(\epsilon {\gt}0\):
\[ P[{\sup }_{h\in H_i}|e_s(h)-e_p(h)|{\gt}\epsilon ]\leq 2|H_i|e^{-2n\epsilon ^2} \]
now using the union bound we will get
\[ P\left[\exists i\leq k,\ \exists h\in H_i :\ |e_s(h)-e_p(h)|{\gt}\sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }}\right]\leq \sum ^k_{i=1} P\left[{\sup }_{h\in H_i}|e_s(h)-e_p(h)|{\gt}\sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }}\right] \]
For \(|S|=n\) and \(\epsilon _i = \sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }}\), each summand is bounded by
\[ 2|H_i|\exp \left(-2|S|\left(\sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }}\right)^2\right)=2|H_i|\left(\frac{2k|H_i|}{\delta }\right)^{-1}=\frac{\delta }{k} \]
and summing over the k classes gives \(\delta \). Equivalently, with probability at least \(1-\delta \):
\[ \forall i\leq k,\ \forall h \in H_i :\quad |e_s(h)-e_p(h)|\leq \sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }} \]
(b)
Let \(\hat{i}\) be the index such that SRM returns \(\text{ERM}_{\hat{i}}(S)\), and let \(i^*\) be the index of the class containing \(h^*\). Then:
\[ e_p(SRM)\leq e_s(\text{ERM}_{\hat{i}})+\sqrt{\frac{1}{2n}ln\frac{2k|H_{\hat{i}}|}{\delta }}\leq \underbrace{e_s(\text{ERM}_{i^*})+\sqrt{\frac{1}{2n}ln\frac{2k|H_{i^*}|}{\delta }}}_{\text{result from section a}} \]
Now subtract \(e_p(h^*)\) from both sides; using the ERM property \(e_s(\text{ERM}_{i^*}(S))\leq e_s(h^*)\) and the bound from section (a) we get
\[ e_s(\text{ERM}_{i^*}(S))-e_p(h^*)+\sqrt{\frac{1}{2n}ln\frac{2k|H_{i^*}|}{\delta }}\leq e_s(h^*)-e_p(h^*)+\sqrt{\frac{1}{2n}ln\frac{2k|H_{i^*}|}{\delta }}\leq \]
\[ 2\sqrt{\frac{1}{2n}ln\frac{2k|H_{i^*}|}{\delta }}\leq \epsilon \Rightarrow n\geq \frac{2}{\epsilon ^2}ln\frac{2k|H_{i^*}|}{\delta } \]
Hence for \(n\geq \frac{2}{\epsilon ^2}\ln \frac{2k|H_{i^*}|}{\delta }\), with probability at least 1-\(\delta \) we get \(e_p(\text{SRM}(S))\leq e_p(h^*)+\epsilon \).
2.4 Programming Assignment Union Of Intervals
(a)
Let the true distribution \(\Pr [x,y]=\Pr [y|x]\Pr [x]\) be such that \(x\) is distributed uniformly on the interval [0, 1], and
\[ \Pr [y=1|x] = \begin{cases} 0.8 & \quad \text{if } x\in [0,0.2]\cup [0.4,0.6]\cup [0.8,1] \\ 0.1 & \quad \text{if } x\in (0.2,0.4)\cup (0.6,0.8) \\ \end{cases} \]
hence we are looking for \(h(\hat{x})=\text{arg max}_{y\in \{ 0,1\} }\Pr [Y=y|X=\hat{x}]\):
when \(x\in [0,0.2]\cup [0.4,0.6]\cup [0.8,1]\) ,
\[ \Pr [Y=1|X=\hat{x}]=0.8 {\gt}Pr[Y=0|X=\hat{x}]=0.2 \]
and when \(x\notin [0,0.2]\cup [0.4,0.6]\cup [0.8,1]\):
\[ \Pr [Y=0|X=\hat{x}]=0.9 {\gt}Pr[Y=1|X=\hat{x}]=0.1 \]
Since \(x\) is distributed uniformly on the interval [0, 1], the optimal hypothesis in \(H_{10}\) is
\[ h(x)= \begin{cases} 1 & \quad \text{if } x\in [0,0.2]\cup [0.4,0.6]\cup [0.8,1] \\ 0 & \quad \text{else} \\ \end{cases} \]
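A small sketch of this distribution and of the true error of the optimal hypothesis (the intervals are hard-coded from the statement above; the sampling helper is my own):
\begin{verbatim}
import numpy as np

POS = [(0.0, 0.2), (0.4, 0.6), (0.8, 1.0)]   # intervals where Pr[y=1|x] = 0.8

def sample(n, rng=np.random.default_rng(0)):
    """Draw n labeled points from the distribution defined above."""
    x = np.sort(rng.random(n))
    in_pos = np.array([any(a <= xi <= b for a, b in POS) for xi in x])
    p1 = np.where(in_pos, 0.8, 0.1)
    y = (rng.random(n) < p1).astype(int)
    return x, y

# true error of the optimal hypothesis: it errs w.p. 0.2 on the positive
# intervals (total mass 0.6) and w.p. 0.1 elsewhere (mass 0.4)
print(0.6 * 0.2 + 0.4 * 0.1)
\end{verbatim}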
(b)
From the plot we can notice that the empirical error increases with the number of samples: with few samples the ERM hypothesis can fit the noisy labels almost exactly, but as the sample grows this is no longer possible. Moreover, the empirical error approaches roughly 0.15, close to the irreducible label noise from section (a) (error 0.2 inside the positive intervals and 0.1 outside them). The true error decreases as we use more samples, since the learned hypothesis gets closer to the optimum for the real distribution \(P\).
(c)
The empirical risk decreases as k increases, since the ERM algorithm has more disjoint intervals to choose from for the given data and can therefore cover more samples. On the other hand, for \(k{\gt}3\) we can notice the true error increasing, since the model overfits the sample set; \(k=3\) is the value with the best behaviour.
(d)
We can notice that when h comes from \(H_3\), the sum of the penalty and the empirical error is minimized.
(e)
Best hypothesis found:
after drawing the data we can notice that the selected hypothesis satisfies the holdout guarantee with probability \(1-\delta \), and its error is close to the optimal true error.
3 Assignment 3
3.1 Step-size Perceptron.
Consider the modification of Perceptron algorithm with the following update rule:
\[ w_{t+1} \leftarrow w_t + \eta _ty_tx_t \]
whenever \(\hat{y}_t\neq y_t\), with step size \(\eta _t=\frac{1}{\sqrt{t}}\) where t counts the mistakes made so far. Assume that the data is separable with margin \(\gamma {\gt} 0\) by a unit vector \(w^*\), and that \(||x_t|| = 1 \) for all t. On the t'th mistake, the iterate satisfies:
\[ w_{t+1}\cdot w^*=(w_t+\eta _ty_tx_t)\cdot w^*=w_t\cdot w^*+\underbrace{y_t\,x_t\cdot w^*}_{\geq \gamma }\cdot \frac{1}{\sqrt{t}}\geq w_t\cdot w^*+\frac{\gamma }{\sqrt{t}} \]
so after m mistakes (using \(1/\sqrt{t}\geq 1/\sqrt{m}\)): \(w_{m+1}\cdot w^* \geq m\cdot \frac{\gamma }{\sqrt{m}}=\sqrt{m}\,\gamma \).
And now we upper bound \(||w_{m+1}||_2\). Since on every mistake \(w_{t+1}-w_t=\eta _ty_tx_t\),
\[ m\gamma \leq \sum ^m_{t=1}y_t\,x_t\cdot \frac{w^*}{||w^*||}= \sum ^m_{t=1}\frac{(w_{t+1}-w_t)\cdot w^*}{\eta _t\,||w^*||} \]
and by telescoping (with \(w_1=0\)):
\[ ||w_{m+1}||_2= \sqrt{\sum ^m_{t=1}\left(||w_t+\eta _ty_tx_t ||^2-||w_t||^2\right) }= \sqrt{\sum ^m_{t=1}\left(\underbrace{2\eta _ty_t\,w_t\cdot x_t}_{\leq 0 \text{ on a mistake}}+\eta _t^2||x_t||^2 \right)} \]
\[ \leq \sqrt{\sum ^m_{t=1}\frac{1}{t}||x_t||^2 } = \sqrt{H_m}\thicksim \log (\sqrt{m}) \]
using the Cauchy-Schwarz inequality (and \(||w^*||=1\)):
\[ \gamma \sqrt{m} \leq w_{m+1}\cdot w^* \leq ||w_{m+1}||_2\,||w^*|| \leq \log (\sqrt{m}) \]
\[ \Rightarrow \sqrt{m} \leq \frac{1}{\gamma }\log (\sqrt{m})\Rightarrow \sqrt{m} \leq \frac{2}{\gamma }\log (\frac{1}{\gamma })\Rightarrow m\leq \frac{4}{\gamma ^2}\log ^2(\frac{1}{\gamma }) \]
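A sketch of this Perceptron variant with \(\eta _t=1/\sqrt{t}\) on the t-th mistake (an illustrative implementation; the epoch loop and the mistake condition \(y_i(w\cdot x_i)\le 0\) are my choices):
\begin{verbatim}
import numpy as np

def step_size_perceptron(X, y, epochs=10):
    """Perceptron with update w <- w + (1/sqrt(t)) * y_i x_i on the t-th mistake."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # mistake (or on the boundary)
                t += 1
                w += yi * xi / np.sqrt(t)
    return w, t                            # learned weights and number of mistakes
\end{verbatim}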
3.2 Convex functions.
2.1
Let \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) be a convex function, \(A \in \mathbb{R}^{n\times n}\) and \(b\in \mathbb{R}^n\), and define \(g(x)=f(Ax+b)\). For any \(x,y\) and \(0{\lt} \lambda {\lt}1\) we want the graph of \(g\) on the segment \([x,y]\) to fall below or on the chord. Noticing that \(b=\lambda b +(1-\lambda )b\):
\[ g(\lambda x+(1-\lambda )y)=f(A(\lambda x+(1-\lambda )y)+b)=f(\lambda (Ax+b)+(1-\lambda )(Ay+b))\leq \]
using Jensen’s inequality
\[ \lambda f(Ax+b)+(1-\lambda )f(Ay+b)=\lambda g(x)+(1-\lambda )g(y). \]
hence \(g\) satisfies the convexity inequality, and \(g(x)=f(Ax+b)\) is convex over \(\mathbb{R}^n\).
2.2
Now let us consider convex functions \(f_1(x),f_2(x)\dots f_m(x)\), \(f_i : \mathbb{R}^d \rightarrow \mathbb{R}\), and prove that \(g(x)=\text{max}_i f_i(x)\) is also convex. Using the convexity of each \(f_i\), \( f_i(\lambda x + (1-\lambda )y) \le \lambda f_i(x) + (1-\lambda )f_i(y)\), we take the maximum of both sides:
\[ \underset {i}{\text{max}} \left\{ f_i(\lambda x + (1-\lambda )y) \right\} \le \underset {i}{\text{max}} \left\{ \lambda f_i(x) + (1-\lambda )f_i(y) \right\} \]
\[ \underset {i}{\text{max}} \left\{ f_i(\lambda x + (1-\lambda )y) \right\} \le \underset {i}{\text{max}} \left\{ \lambda f_i(x) \right\} + \underset {i}{\text{max}} \left\{ (1-\lambda )f_i(y) \right\} \]
and since \(\underset {i}{\text{max}}\left\{ \lambda f_i(x)\right\} =\lambda g(x)\) and \(\underset {i}{\text{max}}\left\{ (1-\lambda )f_i(y)\right\} =(1-\lambda )g(y)\), we get
\[ g(\lambda x + (1-\lambda )y) \le \lambda g(x) + (1-\lambda )g(y). \]
\(g(x)\) is convex
2.3
Let \(\ell _{ log} : R \rightarrow R\) be the log loss, defined by
\[ \ell _{ log}(z)=\log _{2} (1+e^{-z}) \]
We know that \(f\) is convex if \(f''\geq 0\) everywhere (and here in fact \(f''{\gt}0\)):
\[ \frac{d}{dz}\left(\log _2\left(1+e^{-z}\right)\right)=-\frac{e^{-z}}{\ln \left(2\right)\left(1+e^{-z}\right)} \]
\[ \Rightarrow \frac{d}{dz} \left(-\frac{e^{-z}}{\ln \left(2\right)\left(1+e^{-z}\right)}\right)=\: \frac{e^{-z}}{\ln \left(2\right)\left(1+e^{-z}\right)^2}{\gt}0 \]
Using section 2.1 together with the convexity of \(\ell _{log}\) shown above: for \(f(\mathbf{w})\) defined by
\[ f(\mathbf{w}) = \sum _{i=1}^n\ell _{log}(y_i\,\mathbf{w\cdot x}_i)=\sum _{i=1}^n \log _{2} (1+e^{-y_i\mathbf{w\cdot x}_i}) \]
let us set \(f_i(\mathbf{w})=\log _{2} (1+e^{-y_i\mathbf{w\cdot x}_i})\). Each \(f_i\) is the composition of the convex function \(\ell _{log}\) with the linear map \(\mathbf{w}\mapsto y_i\,\mathbf{x}_i\cdot \mathbf{w}\), hence convex by section 2.1. Finally, for any \(\mathbf{u},\mathbf{v}\) and \(\lambda \in (0,1)\),
\[ f(\lambda \mathbf{u}+(1-\lambda )\mathbf{v}) =\sum _{i=1}^n f_i (\lambda \mathbf{u}+(1-\lambda )\mathbf{v})\leq \sum _{i=1}^n\left(\lambda f_i(\mathbf{u})+(1-\lambda ) f_i(\mathbf{v})\right)=\lambda f(\mathbf{u})+(1-\lambda )f(\mathbf{v}) \]
so \(f(\mathbf{w})\) is convex.
GD with projection.
3.1
Let \(y \in \mathbb {R}^d\), \(x = \prod _{\mathcal{K}} (y)\), and let \(z\in \mathcal{K}\). By assumption \(\mathcal{K} \) is a convex set, hence
\((1-\lambda )x+\lambda z=x-\lambda (x-z)\in \mathcal{K} \) for any \(\lambda \in (0,1)\). Since x is the projection of y onto \(\mathcal{K}\):
\[ ||x-y||^2\leq ||x-\lambda (x-z)-y||^2= ||(x-y)-\lambda (x-z)||^2 \]
\[ = ||x-y||^2-2\lambda \langle x-y,x-z \rangle +\lambda ^2||x-z||^2 \]
\[ \Rightarrow \langle x-y,x-z \rangle \leq \frac{\lambda }{2}||z-x||^2 \]
The above holds for any \(\lambda \in (0,1)\), and the right-hand side can be made arbitrarily small; letting \(\lambda \rightarrow 0\) we get
\[ \langle x-y,x-z \rangle \leq 0 \]
Now let us look at some \(z\in \mathcal{K}\) and use the inequality \(\langle x-y,x-z \rangle \leq 0\) that we just proved:
\[ ||y-z||^2-||x-z||^2=||y-z+x-x||^2-||x-z||^2= \]
\[ ||(x-z)-(x-y)||^2-||x-z||^2 = ||x-z||^2-2\langle x-y,x-z \rangle +||x-y||^2-||x-z||^2\geq 0 \]
\[ \Rightarrow ||y-z|| \geq ||x-z|| \]
3.2
Theorem
GD with projection satisfies the convergence theorem: given a desired accuracy \(\epsilon {\gt} 0\), set \(\eta =\frac{\epsilon }{G^2}\) and run GD with projection for \(T=\left(\frac{BG}{\epsilon }\right)^2\) iterations, where \(B\geq ||x_1-x^*||\) and \(G\geq ||\nabla f(x_t)||\) for all t.
Proof
Let \(\bar{w}=\frac{1}{T}\sum ^T_{t=1}x_t\) be the average iterate and \(y_{t+1}:=x_t-\eta \nabla f(x_t)\) the point before the projection. Using Jensen's inequality and convexity:
\[ f(\bar{w})-f(w^*)\leq \frac{1}{T}\sum ^T_{t=1}\nabla f(x_t)\cdot (x_t-x^*)= \frac{1}{T}\sum ^T_{t=1}\frac{1}{\eta }(x_t-y_{t+1})\cdot (x_t-x^*) \]
using the identity \(2\,a\cdot b=||a||^2+||b||^2-||a-b||^2\) with \(a=x_t-y_{t+1}\) and \(b=x_t-x^*\):
\[ = \frac{1}{T}\sum ^T_{t=1}\frac{1}{2\eta }\left( ||x_t-y_{t+1}||^2+||x_t-x^*||^2 - ||y_{t+1}-x^*||^2\right) \]
\[ = \frac{1}{T}\sum ^T_{t=1}\frac{1}{2\eta }\underbrace{\left( ||x_t-x^*||^2-||y_{t+1}-x^*||^2\right)}_{3.1 }+\frac{1}{T}\sum ^T_{t=1} \frac{\eta }{2}||\nabla f(x_{t})||^2 \]
Using the result from 3.1 we know that \(||y_{t+1}-x^*||\geq ||x_{t+1}-x^*|| \), so the first sum telescopes, and assuming \(||\nabla f(x_{t})||\leq G\) we get
\[ f(\bar{w})-f(w^*)\leq \frac{||x_1-x^*||^2}{2\eta T}+\frac{\eta G^2}{2}\leq \frac{B^2}{2\eta T}+\frac{\eta G^2}{2} \]
For any \(\epsilon {\gt} 0\), plugging in \(\eta =\frac{\epsilon }{G^2}\) and \(T=\left(\frac{BG}{\epsilon }\right)^2\):
\[ f(\bar{w})-f(w^*)\leq \frac{B^2}{2\eta T}+\frac{\eta G^2}{2}=\frac{\epsilon }{2}+\frac{\epsilon }{2}=\epsilon \]
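A sketch of GD with projection when \(\mathcal{K}\) is an \(\ell _2\)-ball of radius r (the quadratic objective and the radius are illustrative assumptions, not part of the assignment):
\begin{verbatim}
import numpy as np

def project_ball(y, r=1.0):
    """Euclidean projection of y onto the ball {x : ||x|| <= r}."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

def projected_gd(grad, x0, eta, T, r=1.0):
    xs = [x0]
    for _ in range(T):
        y_next = xs[-1] - eta * grad(xs[-1])   # gradient step
        xs.append(project_ball(y_next, r))     # project back onto K
    return np.mean(xs[:-1], axis=0)            # average iterate

# example: f(x) = ||x - c||^2 with unconstrained minimizer outside the unit ball
c = np.array([2.0, 0.0])
w_bar = projected_gd(lambda x: 2 * (x - c), np.zeros(2), eta=0.1, T=500)
print(w_bar)    # close to the boundary point (1, 0)
\end{verbatim}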
3.3 Gradient Descent on Smooth Functions.
Let \(f : \mathbb {R}^n \rightarrow \mathbb {R}\) be a \(\beta \)-smooth and non-negative function. Consider the gradient descent algorithm applied to f with constant step size \(\eta {\gt} 0\):
\[ x_{t+1}=x_t-\eta \nabla f(x_t) \]
Now let us use \(\beta \)-smoothness at the points \(x_t,x_{t+1}\):
\[ f(x_{t+1})\leq f(x_t)+\langle \nabla f(x_t),x_{t+1}-x_t\rangle +\frac{\beta }{2}||x_{t+1}-x_t||^2 \]
\[ \leq f(x_t)-\eta ||\nabla f(x_t)||^2 +\frac{\beta \eta ^2}{2}||\nabla f(x_t)||^2 \]
\[ f(x_{t+1})-f(x_t)\leq -(\eta -\frac{\beta \eta ^2}{2})||\nabla f(x_t)||^2 \]
Summing this inequality over \(t=1,\dots ,k\) and telescoping, we can bound the sum of the squared gradient norms:
\[ \sum ^k_{t=1}||\nabla f(x_t)||^2\leq (\eta -\frac{\beta \eta ^2}{2})^{-1}(f(x_1)-f(x_{k+1})) \]
Since \(f\) is non-negative and, for \(\eta {\lt}2/ \beta \), the coefficient \(\eta -\frac{\beta \eta ^2}{2}\) is positive, the values \(f(x_k)\) are non-increasing and bounded below by 0, so they cannot tend to \(-\infty \); hence the sequence \(\{ ||\nabla f(x_t)||^2\} \) is summable and every limit point of the GD iterates \(x_k\) has zero gradient. Let us define \(f^*:=\lim _{k\rightarrow \infty }f(x_k)\).
\[ \underset {t\le k}{\text{min}}\,||\nabla f(x_t)||^2 \leq \frac{1}{k}\sum ^{k}_{t=1}||\nabla f(x_t)||^2\leq \frac{1}{k}\cdot \frac{1}{\eta (1-\frac{\beta \eta }{2})}\left(f(x_1)-f(x_{k+1})\right) \]
hence for some constant c (depending on \(f(x_1)\), \(\eta \) and \(\beta \)) we get
\[ \underset {t\le k}{\text{min}}\,||\nabla f(x_t)||\leq \frac{c}{\sqrt{k}}\underset {k \rightarrow \infty }{\longrightarrow }0 \]
3.4 Programming Assignment
SGD for Hinge loss
(c)
(d)
SGD for log-loss.
(a)
(b)
(c)
4 Assignment 4
SVM with multiple classes.
Define the following multiclass SVM problem:
\[ f(w_1,\dots w_k)=\frac{1}{n}\sum ^n_{i=1}\ell (w_1,\dots w_k,x_i,y_i)=\frac{1}{n}\sum ^n_{i=1} \underset {j\in [K]}{\text{max}}(w_j \cdot x_i -w_{y_i}\cdot x_i + \mathds {1}(j\neq y_i) ) \]
First let us notice that taking \(j=y_i\) inside the max gives \(w_{y_i}\cdot x_i-w_{y_i}\cdot x_i+0=0\), hence every summand of \(f\) is non-negative and so is \(f\). We assume the data is linearly separable and let \(\mathbf{w^*}=(w_1^*,\dots w_k^*) \) be the actual separator of the data. It is sufficient to show that \(f(\mathbf{w^*})=0\): indeed, if a minimizer \(\mathbf{w}\) attains \(f(\mathbf{w})=0\), then for every i and every \(j\neq y_i\) the term
\[ w_j \cdot x_i -w_{y_i}\cdot x_i + \mathds {1}(j\neq y_i)=x_i(w_j-w_{y_i})+\mathds {1} \]
is at most 0, i.e. \(w_{y_i}\cdot x_i\geq w_j\cdot x_i+1\), so \(\text{arg}\max _j w_j\cdot x_i=y_i\) and the classifier makes 0 training errors.
Now let \(M:=\min _i\min _{j\neq y_i} x_i\cdot (w^*_{y_i}-w^*_j)\) be the margin of \(\mathbf{w^*}\); by the max-margin hyperplane property the true separator \(\mathbf{w^*}\) maximizes this minimum distance, so \(M{\gt}0\). Rescaling \(\mathbf{w^*}\) by \(\frac{1}{M}\), for every i and every \(j\in [K]\setminus \{ y_i\} \):
\[ \frac{x_i\cdot w_j^*}{M}-\frac{x_i\cdot w_{y_i}^*}{M}=\frac{-1}{M}\left(x_i\cdot (w_{y_i}^*-w_j^*)\right)\leq \frac{-1}{M}\,\text{Min}_{j\neq y_i}\left(x_i\cdot (w_{y_i}^*-w_j^*)\right)\leq -1 \]
since every margin is at least \(M\). Hence every term of the max with \(j\neq y_i\) is at most \(-1+1=0\), the term with \(j=y_i\) equals 0, and \(f(\frac{1}{M}\mathbf{w^*})=0\).
Hence the multiclass hinge loss attains its minimum value 0 at the (rescaled) max-margin separator, and by the argument above any minimizer of \(f\) makes 0 training errors.
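A sketch of the multiclass hinge loss \(f\) above, with the weights stored as a \(k\times d\) matrix (the names are mine):
\begin{verbatim}
import numpy as np

def multiclass_hinge_loss(W, X, y):
    """f(W) = (1/n) sum_i max_j ( w_j.x_i - w_{y_i}.x_i + 1[j != y_i] )."""
    scores = X @ W.T                                       # (n, k): w_j . x_i
    n = X.shape[0]
    margins = scores - scores[np.arange(n), y][:, None]    # w_j.x_i - w_{y_i}.x_i
    margins += (np.arange(W.shape[0])[None, :] != y[:, None])  # + 1[j != y_i]
    return margins.max(axis=1).mean()
\end{verbatim}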
Soft-SVM.
Consider the soft-SVM problem with separable data:
\[ \begin{array}{ll} \underset {\mathbf{w},\xi }{\text{min}} & 0.5||\mathbf{w}||^2 + C\sum ^n_{i=1}\xi _i \\ \text{ s.t } \forall i: & y_i\mathbf{w\cdot x}_i \ge 1-\xi _i \\ & \xi _i \ge 0. \end{array} \]
Let \(\mathbf{w}^\star \) be the solution of hard SVM, and let \(\mathbf{w}',\xi '\) be the solution of soft SVM. Since \((\mathbf{w}^\star ,\xi =0)\) is a feasible point of the soft problem, its objective upper-bounds the optimum, and I claim that the following holds for \(C\geq ||\mathbf{w}^\star ||^2\):
\[ \frac{1}{2}||\mathbf{w'}||^2+||\mathbf{w}^\star ||^2\sum _i\xi '_i\leq \frac{1}{2}||\mathbf{w'}||^2+C\sum _i\xi '_i\leq \frac{1}{2}||\mathbf{w}^\star ||^2\Rightarrow \frac{1}{2}||\mathbf{w'}||^2\leq ||\mathbf{w}^\star ||^2\left(\frac{1}{2}-\sum _i\xi '_i\right) \]
Since all the quantities are non-negative, the right-hand side must be non-negative as well, so any minimizer of the problem satisfies \(\sum _i \xi '_i {\lt} 1\), and in particular \(\xi '_i{\lt}1\) for every i. The signed distance of any \(x_i\) from the separating hyperplane then satisfies
\[ \frac{y_i\,\mathbf{w'\cdot x}_i}{||\mathbf{w'}||}\geq \frac{1-\xi '_i}{||\mathbf{w'}||}{\gt}0 \]
so a point \(x_i\) with \(\xi '_i{\gt}0\) may lie within the margin, but no point crosses the separating hyperplane; hence \(\mathbf{w}'\) separates the data.
Separability using polynomial kernel.
Let \(x_1,\dots x_n \in \mathbb {R}\) be distinct real numbers, and let \(q \geq n\) be an integer. For separable data, hard-SVM yields zero training errors. Let us write the polynomial kernel in binomial form:
\[ K(x,x')=(1+xx')^q=\sum ^q_{k=0}\binom {q}{k}(xx')^k=\sum ^q_{k=0}x^k\sqrt{\binom {q}{k}}\,x'^k\sqrt{\binom {q}{k}} \]
Multiply each column of the Vandermonde matrix by the corresponding constant \(\sqrt{\binom {q}{k}}\) from the binomial expansion above:
\[ \begin{pmatrix} x_1^0 & x_1^1 & x_1^2 & \cdots & x_1^q \\ x_2^0 & x_2^1 & x_2^2 & \cdots & x_2^q \\ \vdots & & & & \vdots \\ x_n^0 & x_n^1 & x_n^2 & \cdots & x_n^q \end{pmatrix} \Rightarrow \begin{pmatrix} x_1^0\sqrt{\binom {q}{0}} & x_1^1\sqrt{\binom {q}{1}} & x_1^2\sqrt{\binom {q}{2}} & \cdots & x_1^q\sqrt{\binom {q}{q}} \\ x_2^0\sqrt{\binom {q}{0}} & x_2^1\sqrt{\binom {q}{1}} & x_2^2\sqrt{\binom {q}{2}} & \cdots & x_2^q\sqrt{\binom {q}{q}} \\ \vdots & & & & \vdots \\ x_n^0\sqrt{\binom {q}{0}} & x_n^1\sqrt{\binom {q}{1}} & x_n^2\sqrt{\binom {q}{2}} & \cdots & x_n^q\sqrt{\binom {q}{q}} \end{pmatrix} \]
Hence the binomial form is obtained as an inner product, i.e. the kernel matrix \(K_S\) satisfies \(K(x_i,x_j)=\phi (x_i)\cdot \phi (x_j)\) with \(\phi (x)=\left(\sqrt{\binom {q}{k}}\,x^k\right)_{k=0}^q\). Using the fact that the Vandermonde matrix of n distinct points has rank n (and \(q\geq n\)), the feature vectors are linearly independent, so the lemma from class applies and hard-SVM with this kernel yields zero training errors.
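A numerical sketch of this argument: the column-scaled Vandermonde rows reproduce the kernel exactly and have full rank n for distinct points (the values of x and q are arbitrary):
\begin{verbatim}
import numpy as np
from math import comb

x = np.array([0.1, 0.5, -1.3, 2.0])      # n distinct points
n, q = len(x), 5                          # q >= n

# feature map phi(x) = ( sqrt(C(q,k)) * x^k ) for k = 0..q
coeffs = np.sqrt([comb(q, k) for k in range(q + 1)])
Phi = coeffs * x[:, None] ** np.arange(q + 1)

K_explicit = (1 + np.outer(x, x)) ** q    # polynomial kernel (1 + x x')^q
print(np.allclose(Phi @ Phi.T, K_explicit))   # True: inner products match the kernel
print(np.linalg.matrix_rank(Phi) == n)        # full rank, as for a Vandermonde matrix
\end{verbatim}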
Expressivity of ReLU networks.
- •
If \(x\geq 0\Rightarrow x=\max \{ 0,x\} ,0=\max \{ 0,-x\} \Rightarrow x=x-0=\max \{ 0,x\} -\max \{ 0,-x\} \)
If \(x{\lt} 0\Rightarrow \max \{ 0,x\} =0,\ \max \{ 0,-x\} =-x \Rightarrow \max \{ 0,x\} -\max \{ 0,-x\} =0-(-x)=x \)
- •
If \(x\geq 0 ,-x\leq 0\Rightarrow x\geq -x \Rightarrow \max (x,-x)=x=|x|\)
If \(x{\lt} 0 ,-x{\gt} 0\Rightarrow x{\lt} -x \Rightarrow \max (x,-x)=-x=|x|\)
- •
\begin{eqnarray} \frac{x_1+x_2}{2}+\frac{|x_1-x_2|}{2} & =& \frac{(x_1+x_2+\max (x_1-x_2,x_2-x_1))}{2} \\ & =& \frac{1}{2}(\max (x_1-x_2+x_1+x_2,x_2-x_1+x_1+x_2)) \\ & =& \frac{1}{2}(\max (2x_1,2x_2)) \\ & =& \max (x_1,x_2) \end{eqnarray}
4(b)
\(n_1=\max \{ x_1-x_2,0\} ,n_2=\max \{ x_2-x_1,0\} ,n_3=\max \{ x_1+x_2,0\} ,n_4=\max \{ -x_1-x_2,0\} \)
(*) We could achieve the same result with 3 neurons, since \(\max \{ x_1,x_2\} =\max \{ x_1-x_2,0\} +x_2\).
4.1 Implementing boolean functions using ReLU networks.
Consider n boolean input variables \(x_1,x_2\dots ,x_n\in \{ 0,1\} \); let us construct a neural network with ReLU activations which implements the AND function:
\[ f(x_1,x_2\dots ,x_n)=x_1\wedge x_2\wedge \dots \wedge x_n \]
we can consider the \(f\) as :
\[ f(x_1,x_2\dots ,x_n)=\max \{ -n+1+\sum ^n_{i=1} x_i,0\} \]
Hence we can build a ReLU network with one hidden layer and a constant extra input \(\hat{x}=1\), with weights
\[ w(x_i)=1 \text{ for } 1\le i\le n \text{ and } w(\hat{x})=-(n-1) \]
By connecting all the inputs to a single ReLU neuron, together with the constant input \(\hat{x}\), we get the AND function:
\[ \max \{ -(n-1)\hat{x}+\sum x_i,0\} \]
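A one-neuron ReLU check of this AND construction, with the constant input folded in as a bias of \(-(n-1)\) (an illustrative sketch):
\begin{verbatim}
import numpy as np

def relu_and(x):
    """AND of n boolean inputs via a single ReLU unit: max(sum(x) - (n-1), 0)."""
    n = len(x)
    return max(np.sum(x) - (n - 1), 0)

# exhaustive check for n = 3: output is 1 iff all inputs are 1
for bits in np.ndindex(2, 2, 2):
    assert relu_and(np.array(bits)) == int(all(bits))
\end{verbatim}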
4.2 Programming Assignment.
SVM
figure
Homogeneous polynomial kernel.
Here, the polynomial kernel of degree 2 fits the data better, since the data is circle-shaped and the kernel trick needs degree 2 to make it separable by a hyperplane in feature space.
We can see a better fit in the right-hand-side model: the independent (constant) term in the kernel function gives the needed correction, but the degree 2 polynomial still fits better here.
figure
Non-Homogeneous polynomial kernel.
figure: polynomial kernel and RBF kernel.
The RBF kernel generalizes better on the noisy data, since it can separate the data in a higher dimension than the polynomial kernel, whose boundary becomes more "elliptic" in order to cover the noisy points.
figure: RBF kernel with different \(\gamma \) values
4.3 Neural Networks.
figure
figure
figure
From the plots we can learn that the best value for the learning rate is 0.1: with a larger value each step may take us far away from the optimal minimizer, while with a smaller value we progress less effectively and training is much slower.
figure: accuracy in the final epoch
5 Assignment 5
5.1 Suboptimality of ID3.
Consider the following training set, where \(\mathcal{X}= \{ 0, 1\} ^3\) and \(Y = \{ 0, 1\} \):
\[ a=((1, 1, 1), 1)\quad b=((1, 0, 0), 1) \quad c=((1, 1, 0), 0)\quad d=((0, 0, 1), 0) \]
We wish to use this training set to build a decision tree of depth 2. For each split, let us calculate the information gain as mutual information:
\[ G(S,i)=\underbrace{C(Y)}_{pre}\underbrace{-\mathbb {P}[X_i=0]C(Y|X_i=0)-\mathbb {P}[X_i=1]C(Y|X_i=1)}_{post} \]
where \(C(Y)=C(\mathbb {P}[Y=1])\) and \(C\) is the entropy.
\begin{align*} G(Y;X_1)& = H(Y)-\mathbb {P}[X_1=0]H(Y|X_1=0)-\mathbb {P}[X_1=1]H(Y|X_1=1) \\ & =-2\left(\frac{1}{2}\log \frac{1}{2}\right)-\frac{3}{4}H(Y|X_1=1) \\ & =1+\frac{3}{4}\left(\frac{2}{3}\log \frac{2}{3}+\frac{1}{3}\log \frac{1}{3}\right)\\ & \approx 0.311\\ G(Y;X_2)& =1-\mathbb {P}[X_2=0]H(Y|X_2=0)-\mathbb {P}[X_2=1]H(Y|X_2=1) =0\\ G(Y;X_3)& =1-\mathbb {P}[X_3=0]H(Y|X_3=0)-\mathbb {P}[X_3=1]H(Y|X_3=1) =0\\ \end{align*}
Hence the first split will be on \(X_1\), and \(a,b,c\) go down the same branch. Now, if we choose to ask about \(X_2\), then \(a,c\) map to the same leaf; on the other hand, if we ask about \(X_3\), then \(b,c\) map to the same leaf. It follows that no matter which split ID3 chooses, at least one sample is classified wrongly and the training error is at least \(\frac{1}{4}\).
(b)
5.2 AdaBoost
(a)
Let \(x_1,\dots , x_m \in \mathbb {R}^d\) and let \(y_1, \dots ,y_m \in \{ -1,1\} \) be their labels. We run the AdaBoost algorithm as given in the lecture and we are at iteration \(t\), assuming that \(\epsilon _t{\gt}0\). The empirical error of \(h_t\) at iteration t of the AdaBoost run is defined by:
\[ \epsilon _t=e_{D_t,S}(h_t)=\sum _i D_t (i)\mathds {1}[h_t(x_i)\neq y_i]=\sum _{i:h(x_i)\neq y_i} D_t(i) \]
Now we can notice that the error of \(h_t\) with respect to the updated distribution \(D_{t+1}\) can be written as
\[ \Pr _{x\sim D_{t+1}}[h_t(x)\neq y]= \frac{\sum _{i:h(x_i)\neq y_i} D_t(i)e^{w_t}}{\sum _j D_t(j)e^{-w_ty_jh_t(x_j)}}= \frac{\epsilon _te^{\frac{1}{2}\ln \frac{1-\epsilon _t}{\epsilon _t}}}{Z_t}= \frac{\epsilon _t\sqrt{\frac{1-\epsilon _t}{\epsilon _t}}}{2\sqrt{\epsilon _t(1-\epsilon _t)}}=\frac{1}{2} \]
The first equality holds since the numerator collects the re-weighted mistakes and the denominator is the total re-weighted distribution, the second by the definitions of \(Z_t\) and \(w_t\), and the third using the lemma \(Z_t=2\sqrt{\epsilon _t(1-\epsilon _t)}\).
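A quick numerical illustration of this \(\frac{1}{2}\) property, with synthetic weights and predictions (an illustrative sketch, not the assignment code; it assumes \(\epsilon _t{\gt}0\)):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
m = 100
D = rng.random(m); D /= D.sum()               # current distribution D_t
y = rng.choice([-1, 1], m)                    # true labels
h = np.where(rng.random(m) < 0.7, y, -y)      # weak hypothesis, ~30% flipped labels

eps = D[h != y].sum()                         # weighted error (assumed > 0)
w = 0.5 * np.log((1 - eps) / eps)
D_next = D * np.exp(-w * y * h); D_next /= D_next.sum()
print(D_next[h != y].sum())                   # = 0.5 up to rounding
\end{verbatim}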
(b)
Let us assume that AdaBoost picks the same hypothesis twice consecutively, that is
\[ h_{t}=h_{t+1}=WL(D_{t+1},S) \]
Using the result of section (a) we know that the error of \(h_t\) with respect to \(D_{t+1}\) is exactly \(\frac{1}{2}\), in contradiction to the fact that \(h_{t+1}=WL(D_{t+1},S)\) is a weak learner (whose error must be strictly below \(\frac{1}{2}\)).
5.3 Sufficient Condition for Weak Learnability.
Let \(S=\{ (x_1,y_1),\dots (x_n,y_n)\} \) be a training set and let \(\mathcal{H}\) be a hypothesis class. Assume that there exist \(\gamma {\gt} 0\), hypotheses \(h_1,\dots ,h_k\in \mathcal{H}\) and coefficients \(a_1,\dots a_k \geq 0,\ \sum _{j=1}^k a_j=1\) for which the following holds:
\begin{align} y_i\sum ^k_{j=1}a_jh_j(x_i)\geq \gamma \quad \forall (x_i,y_i)\in S \end{align}
Let \(D\) be any distribution over \(S\). Then, taking expectations with respect to \(D\) of both sides of the equation
\[ \underset {i\sim D}{E}\left[ y_i\sum ^k_{j=1}a_jh_j(x_i)\right]= \sum ^k_{j=1}a_j \underset {i\sim D}{E}\left[ y_ih_j(x_i)\right]\geq \gamma \]
\[ \exists m\in \mathcal{H} \text{ s.t }\quad \underset {i\sim D}{E}\left[ y_i h_{m}(x_i)\right]\geq \gamma \]
The first equality holds by linearity of expectation, and the second line holds because the \(a_1,\dots ,a_k\) constitute a distribution, so at least one of the terms \(\underset {i\sim D}{E}[y_ih_j(x_i)]\) must be at least the weighted average \(\gamma \); call that hypothesis \(h_m\).
On the other hand, we can expand this expectation as a weighted sum over the sample and use \(y_ih_m(x_i)=1-2\cdot \mathds {1}[h_m(x_i)\neq y_i]\), that is:
\begin{align*} \underset {i\sim D}{E}\left[ y_i h_{m}(x_i)\right]& =\sum _{j=1}^n D (j)\,y_jh_m(x_j)\\ & = \sum _{j:h_m(x_j)=y_j}D(j) -\sum _{j:h_m(x_j)\neq y_j}D(j)\\ & =1-2\sum _{j:h_m(x_j)\neq y_j}D(j)\\ & =1-2\Pr _{j\sim D}[h_m(x_j)\neq y_j]\\ & \geq \gamma \end{align*}
Rearranging the last two lines we get \(\Pr _{i\sim D}[h_m(x)\neq y]\le \frac{1}{2}-\frac{\gamma }{2}\), i.e. \(h_m\) is a weak hypothesis for D.
(b)
Let \(S=\{ (x_1,y_1),\dots (x_n,y_n)\} \subseteq \mathbb {R}^d \times \{ -1,1\} \) be a training set that is realized by a
\(d\)-dimensional hyper-rectangle classifier, and let \(\mathcal{H}\) be the class of decision stumps.
following the hint, set:
\[ k=4d-1\quad a=\frac{1}{4d-1} \quad h_{b_j}(x) = \begin{cases} 1 & \quad x_j\ge b_j \\ -1 & \quad x_j{\lt} b_j \end{cases} ,\quad h_{c_j}(x) = \begin{cases} 1 & \quad x_j\le c_j \\ -1 & \quad x_j {\gt}c_j \end{cases} \qquad 1\le j\le d \]
and
\[ \mathcal{H}_b:=\{ h_{b_j}:1\le j\le d\} ,\quad \mathcal{H}_c:=\{ h_{c_j}:1\le j\le d\} ,\quad \mathcal{H}_k=\mathcal{H}_b\cup \mathcal{H}_c \cup \{ h(x)=-1\} \]
where the constant hypothesis \(h(x)=-1\) is taken \(2d-1\) times.
Then \(|\mathcal{H}_k|=4d-1=k\). For a sample \(s=(x_i,y_i)\in S\), we can notice that \(x_i\) is inside the hyper-rectangle iff \(b_j\le (x_i)_j \le c_j\) for all \( 1\le j \le d\). Let us define \(f\) s.t.
\begin{align*} f(x)=\sum _{h\in \mathcal{H}_k }a\, h(x)=a\left(\sum _{h\in \mathcal{H}_b} h(x)+ \sum _{h\in \mathcal{H}_c}h(x) + \sum _{h\in \mathcal{H}_k\setminus (\mathcal{H}_b\cup \mathcal{H}_c)}h(x)\right) \\ =\frac{1}{4d-1}\left(\sum _{j=1}^{d}\underbrace{\left(h_{b_j}(x)+h_{c_j}(x)\right)}_{\star }-(2d-1)\right) \end{align*}
Now plug in \(y_if(x_i)\). If \(x_i\) is inside the hyper-rectangle (\(y_i=+1\)), then every stump in \((\star )\) outputs +1, the inner sum equals \(2d\), and \(y_if(x_i)=\frac{2d-(2d-1)}{4d-1}=\frac{1}{4d-1}\). If \(x_i\) is outside (\(y_i=-1\)), then at least one stump outputs \(-1\), the inner sum is at most \(2d-2\), and \(f(x_i)\le \frac{2d-2-(2d-1)}{4d-1}=\frac{-1}{4d-1}\), so again \(y_if(x_i)\ge \frac{1}{4d-1}\). Hence, choosing
\[ \gamma =\frac{1}{4d-1} \]
we have \(y_i f(x_i)\ge \gamma \) for every \((x_i,y_i)\in S\), and using section (a) this gives the result.
5.4 Comparing notions of weak learnability.
Given a \(\gamma \)-weak learner \(\mathcal{A}\), let us construct a learner \(\mathcal{A'}\) that gets as input a number \(0{\lt} \delta {\lt}1\), a sample \(S\) and a distribution \(D\), and returns with probability \(1- \delta \) a hypothesis \(h\) such that \(e_{S,D}(h) \le 1/2-\gamma \). By the definition of a \(\gamma \)-weak learner, for any distribution P and any sample of \(k\ge N(\delta )\) examples drawn i.i.d. from P it holds that:
\[ P[\,e_P(\mathcal{A}(S)){\gt}1/2 - \gamma \,] {\lt} \delta \]
Now \(\mathcal{A'}\) draws a new sample \(\overline{S}\) of \(|\overline{S}|=N(\delta )\) points i.i.d. from the distribution over \(S\) that puts weight \(D(i)\) on \((x_i,y_i)\). For a hypothesis learned from \(\overline{S}\), the true error with respect to this distribution is exactly the weighted empirical error on \(S\):
\[ e_P(\mathcal{A}(\overline{S}))=\sum _iD(i)\mathds {1}[\mathcal{A}(\overline{S})(x_i)\neq y_i]=e_{D,S}(\mathcal{A}(\overline{S})) \]
\(\mathcal{A'}\) then returns \(\mathcal{A}(\overline{S})\); applying the weak-learner guarantee to this distribution, with probability at least \(1-\delta \):
\[ e_{S,D}(\mathcal{A}(\overline{S}))=e_P(\mathcal{A}(\overline{S})) \le 1/2 - \gamma \]
(b)
Given a \(\gamma \)-weak learner \(\mathcal{A}\), let us construct an empirical weak learner \(\mathcal{A}_{weak}\) using 4.a, and call it in each AdaBoost round as \(\mathcal{A}_{weak}(S,D_t,\delta /T)\). By the guarantee of 4.a, each round satisfies \(\epsilon _t\le 1/2-\gamma \) with probability \(1-\delta /T\), and then \(\epsilon _t(1-\epsilon _t)\le 1/4 -\gamma ^2\) and
\[ Z_t=2\sqrt{\epsilon _t(1-\epsilon _t)}\le \sqrt{1-4\gamma ^2}\le e^{-2\gamma ^2} \]
so, by a union bound over the T rounds,
\[ P\left(e_s(g){\gt} e^{-2\gamma ^2T}\right)\le P \left( \prod _{t=1}^T Z_t{\gt} e^{-2\gamma ^2T} \right)\le \sum _{t=1}^T P\left(Z_t{\gt} e^{-2\gamma ^2}\right){\lt}T \times \delta /T=\delta \]
Hence with probability at least \(1-T\times \delta /T=1-\delta \):
\[ e_s(g)\le \prod _t^T Z_t \le e^{-2\gamma ^2T} \]
5.5 Programming Assignment
(a)
figure
Training error and test error of the classifier corresponding to each iteration t.
(b)
The 10 weak classifiers chosen by the algorithm:
1) h: (1, 27, 0.5) word: bad weight: 0.2678992723486787
2) h: (-1, 22, 0.5) word: life weight: 0.182869948120488
3) h: (-1, 31, 0.5) word: many weight: 0.15131544177021744
4) h: (1, 315, 0.5) word: worst weight: 0.15555522141776174
5) h: (-1, 282, 0.5) word: perfect weight: 0.19698743995337534
6) h: (1, 23, 0.5) word: stand weight: 0.12934453488530565
7) h: (-1, 17, 0.5) word: well weight: 0.1117544481762822
8) h: (1, 183, 0.5) word: looks weight: 0.11760123845680365
9) h: (-1, 107, 0.5) word: quite weight: 0.13419699857009101
10) h: (1, 373, 0.5) word: boring weight: 0.11035504620259538
"Descriptive" words chosen by the algorithm such as "bad", "worst" and "perfect" are most likely good indicators of a bad/good review, but it is more surprising to see words like "life", "well" and "stand"; these might be words that tend to appear in a positive context, like "life".
(c)
figure