1 Assignment 1
1.1 Linear Algebra
1.a
\(\Rightarrow \) Let \(A\) be a symmetric PSD matrix with eigendecomposition \(A=Q\,\mathrm{diag}(\lambda )\,Q^t\). For every \(v\) we have \({v^tAv}\) \(=\) \({(v^tQ)\,\mathrm{diag}(\lambda )\,(Q^tv)}\) =
\({\sum _{i}\lambda _i(Q^tv)_i^2}\), and taking \(v\) to be the \(i\)-th eigenvector shows \(\lambda _i\geq 0\) for every eigenvalue \(\lambda _i\) of \(A\),
and matrix A can be decomposed as:
\[ A = QDQ^t = Q\,\mathrm{diag}(\lambda _1,\lambda _2\ldots \lambda _n)\,Q^t = \]
\[ Q\,\mathrm{diag}(\sqrt{\lambda _1},\sqrt{\lambda _2}\ldots \sqrt{\lambda _n})\,\mathrm{diag}(\sqrt{\lambda _1},\sqrt{\lambda _2}\ldots \sqrt{\lambda _n})\,Q^t= XX^t, \quad \text{where } X:=Q\,\mathrm{diag}(\sqrt{\lambda _1},\ldots ,\sqrt{\lambda _n}) \]
\(\Leftarrow \) If \(A\) can be written as \(A=XX^t\), then for every \(v\):
\[ v^tAv=v^tXX^tv=(X^tv)^t(X^tv)=\parallel X^tv \parallel ^2\geq 0 \]
so \(A\) is PSD.
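As a quick numerical sanity check of both directions (an illustrative numpy sketch, not part of the assignment; the matrix X below is arbitrary):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))     # arbitrary X, so A = X X^T should be PSD
A = X @ X.T

# direction <=: v^T A v = ||X^T v||^2 >= 0 for any v
v = rng.standard_normal(4)
print(v @ A @ v, np.linalg.norm(X.T @ v) ** 2)    # equal up to rounding

# direction =>: a symmetric PSD matrix has nonnegative eigenvalues,
# and Q diag(sqrt(lambda)) recovers a factor X with A = X X^T
lam, Q = np.linalg.eigh(A)
print(lam >= -1e-10)                              # all eigenvalues nonnegative
X2 = Q @ np.diag(np.sqrt(np.clip(lam, 0, None)))
print(np.allclose(A, X2 @ X2.T))
\end{verbatim}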
1.b
For a given PSD matrix \(A\) and \(\alpha \geq 0\), \(\alpha \in \mathbb{R}\):
(*) \(v^t(\alpha A)v=\alpha \,(v^tAv)\geq 0\) for every \(v\), since both \(\alpha \) and \(v^tAv\) are nonnegative, so \(\alpha A\geq 0\).
Similarly, for PSD matrices \(A,B\geq 0\) we get \(A+B\geq 0\), by applying the same argument to \(A+B\):
(**)
\[ v^t(A+B)v=v^tAv+v^tBv \geq 0 \]
From (*) and (**) together it immediately follows that \(\alpha A+\beta B\geq 0\) for all \(\alpha ,\beta \geq 0\).
The set of all n × n PSD matrices over \(\mathbb{R}\) is not a vector space over \(\mathbb{R}\) because it is not closed under scalar multiplication: for \(\lambda {\lt} 0\) and \(A\neq 0\),
\[ A\geq 0,\ A\neq 0 \rightarrow \lambda A \text{ has a negative eigenvalue} \rightarrow \lambda A \notin \lbrace PSD \rbrace \]
1.2 Calculus and Probability
1.a
For \(x_1,x_2\ldots x_n\) i.i.d. \(U([0, 1])\) continuous random variables, let us write the
order statistics as \(\overline{x_1} ,\overline{x_2}\ldots \overline{x_n}\), where \(\forall i\), \(\overline{x_i} \leq \overline{x_{i+1}}\). First let us find the CDF of \(Y=\max \lbrace x_1,x_2\ldots x_n \rbrace =\overline{x_n}\):
\[ F_Y(k)=F_{ \overline{x_n}}(k)=\Pr (\overline{x_n}\leq k)=\Pr (x_1\leq k,x_2\leq k\ldots x_n\leq k) \]
and because the \(x_i\) are i.i.d.:
\[ \Pr (x_1\leq k)\Pr (x_2\leq k)\ldots \Pr (x_n\leq k)=[\Pr (x_i\leq k)]^n=[F_x(k)]^n \]
\[ (*) \quad F_x(k) = \left\{ \begin{array}{rcl} {0} & \mbox{for} & k{\lt}0 \\ k & \mbox{for} & k\in [0,1] \\ 1 & \mbox{for} & k{\gt}1 \end{array}\right. \qquad f(k) = \left\{ \begin{array}{rcl} {1} & \mbox{for} & k\in [0,1] \\ 0 & \mbox{for} & k\notin [0,1] \end{array}\right. \]
we get:
\[ f_Y(k)=f_{\overline{x_n}}(k)=\frac{d}{dk}\left(F_x(k)\right)^n=n(F_x(k))^{n-1}f(k) \]
now lets set values in (*) and get :
\[ f_Y(k)=n(F_x(k))^{n-1}f(k)=nk^{n-1}I_{[0,1]}(k), \quad \text{i.e. } Y\sim Beta(n,1) \]
therefore :
\[ \lim _{ n\to \infty }E[Y]= \lim _{ n\to \infty }\frac{n}{n+1}= 1 \]
and:
\[ \lim _{ n\to \infty }\mathrm{Var}[Y]= \lim _{ n\to \infty }\frac{n}{(n+1)^2(n+2)}= 0 \]
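A small Monte Carlo check of these limits (an illustrative sketch; the sample sizes and number of repetitions are arbitrary choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
for n in (5, 50, 500):
    # draw 10,000 samples of size n from U([0,1]) and take the maximum of each
    y = rng.random((10_000, n)).max(axis=1)
    # empirical mean and variance vs. the Beta(n,1) formulas
    print(n, y.mean(), n / (n + 1), y.var(), n / ((n + 1) ** 2 * (n + 2)))
\end{verbatim}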
1.b
\[ E[|x-\alpha |] = \int ^{+\infty }_{-\infty } |x-\alpha |f(x) \, dx = \int ^{\alpha }_{-\infty } |x-\alpha |f(x) \, dx + \int ^{+\infty }_{\alpha } |x-\alpha |f(x) \, dx \]
Differentiating with respect to \(\alpha \) and setting the derivative to zero at \(\alpha \in \arg \min \) (the boundary terms vanish):
\[ 0=\frac{d}{d\alpha }E[|x-\alpha |]=\underbrace{(\alpha -x)f(x)\Big |_{x=\alpha }}_{=0}+ \int ^{\alpha }_{-\infty } f(x) \, dx + \underbrace{(x-\alpha )f(x)\Big |_{x=\alpha }}_{=0}- \int ^{+\infty }_{\alpha } f(x) \, dx \]
\[ \Rightarrow \int ^{\alpha }_{-\infty } f(x) \, dx =\int ^{+\infty }_{\alpha } f(x) \, dx \Rightarrow \Pr (x\leq \alpha )=\Pr (x\geq \alpha )\Leftrightarrow \]
\[ \Pr (x\leq \alpha )=1/2 \]
1.3 Optimal Classifiers and Decision Rules
1.a
Let X and Y be random variables where Y takes values in \(\mathcal{Y} = \lbrace 1, \dots , L\rbrace \), and let \(\triangle \) be the 0-1 loss function defined in class. The contribution of a point \(\hat{x}\) to the expected loss \(E[\triangle (Y,f(X))]\) is:
\[ \sum _{k=1}^{L}\Pr (X=\hat{x},Y=k)\,\triangle (k,f(\hat{x})) \]
Using Bayes' rule:
\[ \sum _{k=1}^{L}\Pr (X=\hat{x})\Pr (Y=k|X=\hat{x})\triangle (k,f(\hat{x}))=\Pr (X=\hat{x})\sum _{k=1}^{L}\Pr (Y=k|X=\hat{x})\triangle (k,f(\hat{x})) \]
Since \(\triangle (k,f(\hat{x}))=\mathds {1}[f(\hat{x})\neq k]\), minimizing the expected loss pointwise gives, for every \(\hat{x}\):
\[ h(\hat{x})=\text{arg}\min _{k\in \{ 1\ldots L\} }\ \Pr (X=\hat{x})\sum _{j\neq k}\Pr (Y=j|X=\hat{x}) =\text{arg}\min _{k}\ \Pr (X=\hat{x})\left(1-\Pr (Y=k|X=\hat{x})\right) \]
\[ \Rightarrow h(\hat{x})=\text{arg}\max _{k\in \{ 1,\ldots ,L\} } \Pr (Y=k|X=\hat{x}) \]
1.4 Optimal Classifiers and Decision Rules
1.b
To find a decision rule for:
\[ \Pr [y = 1 | X] {\gt} \Pr [y = 0 | X] \]
let us apply Bayes' rule to both sides; we get:
\[ \frac{f_{X|Y =1} (x)\, \Pr [Y = 1]}{f_X (x)} {\gt} \frac{f_{X|Y =0} (x)\, \Pr [Y = 0]}{f_X (x)} \]
\[ p\, f_1 (x , \mu _1 , \Sigma ) {\gt} (1-p)\, f_0 (x , \mu _0 , \Sigma ) \]
\[ \frac{ \exp \left( -\frac{1}{2}(x-\mu _1)^T \Sigma ^{-1}(x-\mu _1)\right) }{\exp \left( -\frac{1}{2}(x-\mu _0)^T \Sigma ^{-1}(x-\mu _0)\right)} {\gt} \frac{1-p}{p} \]
\[ (x-\mu _0)^T \Sigma ^{-1}(x-\mu _0) -(x-\mu _1)^T \Sigma ^{-1} (x-\mu _1){\gt}2\ln \left(\frac{1-p}{p}\right) \]
where \((x-\mu )^T \Sigma ^{-1}(x-\mu )\) is the squared Mahalanobis distance between \(x\) and \(\mu \),
so our simpler decision rule will be
\[ h(x) = \left\{ \begin{array}{rcl} {1} & \mbox{for} & d^2\mathbf{_m} (x,\mu _0) {\gt} d^2\mathbf{_m} (x,\mu _1)+2ln(\frac{1-p}{p}) \\ 0 & \mbox{} & \mbox{otherwise} \end{array}\right. \]
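A minimal sketch of this decision rule in numpy (assuming the class means mu0, mu1, the shared covariance Sigma and the prior p are given; the function name is mine, not from the assignment skeleton):
\begin{verbatim}
import numpy as np

def gaussian_decision_rule(x, mu0, mu1, Sigma, p):
    """Return 1 iff d_M^2(x, mu0) > d_M^2(x, mu1) + 2 ln((1-p)/p)."""
    Sigma_inv = np.linalg.inv(Sigma)
    d2 = lambda mu: (x - mu) @ Sigma_inv @ (x - mu)   # squared Mahalanobis distance
    return int(d2(mu0) > d2(mu1) + 2 * np.log((1 - p) / p))
\end{verbatim}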
1.c
When \(d=1\), the covariance matrix \(\Sigma \) is \(1\times 1\) and the decision boundary is a single point; in the same way, when \(d=2\) we get a line, and for a general \(d\) the boundary is a \((d-1)\)-dimensional hyperplane (the quadratic terms cancel since both classes share the same \(\Sigma \)).
1.d
For \(d = 1\), \(\mu _0 = \mu _1 = \mu \) and \(\sigma _1 \neq \sigma _0 \), we look for the points where the decision rule derived above holds with equality:
\[ d^2\mathbf{_m}(x,\mu _0)-d^2\mathbf{_m}(x,\mu _1)=2ln(\frac{1-p}{p}) \]
\[ (x-\mu )^2 \left(\frac{1}{\sigma _0^2}-\frac{1}{\sigma _1^2}\right)=2\ln \left(\frac{1-p}{p}\right) \Rightarrow (x-\mu )^2=\frac{2\sigma _0^2\sigma _1^2}{\sigma _1^2-\sigma _0^2}\ln \left(\frac{1-p}{p}\right) \]
\[ (x-\mu )= \pm \sqrt{\frac{2\sigma _0^2\sigma _1^2}{\sigma _1^2-\sigma _0^2}\ln \left(\frac{1-p}{p}\right)} \Rightarrow x=\mu \pm \sqrt{\frac{2\sigma _0^2\sigma _1^2}{\sigma _1^2-\sigma _0^2}\ln \left(\frac{1-p}{p}\right)} \]
1.5 Programming Assignment
Visualizing the Hoeffding bound:
figure
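The plot itself did not survive the conversion; below is a minimal sketch of how such a visualization could be produced, assuming fair coin flips and the two-sided Hoeffding bound \(2e^{-2n\epsilon ^2}\) (the parameter values and file name are assumptions):
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt

n, repeats = 20, 100_000                 # n coin flips per experiment
rng = np.random.default_rng(0)
means = rng.integers(0, 2, size=(repeats, n)).mean(axis=1)

eps = np.linspace(0.0, 0.5, 50)
empirical = [(np.abs(means - 0.5) > e).mean() for e in eps]
hoeffding = 2 * np.exp(-2 * n * eps ** 2)

plt.plot(eps, empirical, label="empirical P(|mean - 0.5| > eps)")
plt.plot(eps, hoeffding, label="Hoeffding bound")
plt.xlabel("eps"); plt.legend(); plt.savefig("myplot.png")
\end{verbatim}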
1.6 Nearest Neighbor:
1
The k-NN accuracy for \(k=10\): it got \(882 / 1000\) correct labelings, i.e. an accuracy rate of 0.882.
2
The best k found is \(k= 4 \), with \(883 / 1000\) correct labelings, i.e. an accuracy rate of 0.883.
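A minimal k-NN sketch consistent with this setup (assuming flattened image arrays and the Euclidean metric; the variable names are mine, not the course skeleton):
\begin{verbatim}
import numpy as np

def knn_predict(train_x, train_y, query, k):
    # Euclidean distances from the query image to every training image
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    # majority vote among the k nearest labels
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

def accuracy(train_x, train_y, test_x, test_y, k):
    preds = [knn_predict(train_x, train_y, q, k) for q in test_x]
    return np.mean(np.array(preds) == test_y)
\end{verbatim}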
2 Assignment 2
2.1 1. PAC learnability of \(\ell _2\)-balls around the origin.
Given a real number \(R \geq 0\), define the hypothesis \(h_R : \mathbb{R}^d \rightarrow \lbrace 0, 1\rbrace \) by \(h_R(x)=\mathds {1}[\,||x||_2\leq R\,]\). We will prove that the hypothesis class \(H_{ball} = \lbrace h_R \mid R \geq 0\rbrace \) is PAC learnable in the realizable case.
Let us design an algorithm \(A_{balls}\) that learns \(H_{ball}\):
- •
Given a sample of size n: \(u_1, \ldots , u_n\) together with their labels.
- •
Return the smallest ball \(B_R\) that is consistent with the sample, i.e. set \(R=\max \lbrace ||u_i||_2 : u_i \text{ is labeled positive}\rbrace \) (and \(R=0\) if there are no positive examples).
- •
Since the problem is realizable by some \(B_0\in H_{ball}\), we always have \(B_R\subseteq B_0\), so \(A_{balls}\) can mistake only by labeling positive points as negative.
The error of the algorithm is \(e_P (h_R) = P [B_0 \setminus B_{R}]\)
Assume w.l.o.g. that \(P(B_0) {\gt} \epsilon \); otherwise \(e_P (h_R)\leq P(B_0)\leq \epsilon \) and we are done. Define \(T\subseteq B_0\) to be the annulus obtained by shrinking the boundary of \(B_0\) toward the origin \((0,\ldots ,0)\) until \(P(T)=\epsilon \). If some sample point \(u_i\) lands in \(T\) (such a point is labeled positive, since \(T\subseteq B_0\)), then \(R\geq ||u_i||_2\) and therefore \(B_0\setminus B_R\subseteq T\), which gives
\[ e_P (h_R) = P [B_0 \setminus B_R] \leq P(T)=\epsilon \]
Hence the event \(e_P(h_R){\gt}\epsilon \) can occur only when no sample point lands in \(T\):
\[ P [ e_P(h_R) {\gt} \epsilon ] \leq P [ \forall i,\ 1\leq i\leq n : u_i \notin T ] = (1-\epsilon )^n \leq e^{-n\epsilon } \]
Requiring \(e^{-n\epsilon }\leq \delta \) gives \(n \geq \frac{1}{\epsilon }\ln \frac{1}{\delta }\).
now lets set \(N(\epsilon ,\delta )=\frac{1}{\epsilon }ln\frac{1}{\delta }\)
We proved that there exists \(N(\epsilon ,\delta )=\frac{1}{\epsilon }\ln \frac{1}{\delta }\) such that for every \(\epsilon , \delta \) and every realizable distribution P over \(\mathbb{R}^d\) with labeling function \(B_0 \in H_{ball}\), when running \(A_{ball}\) on \(n \geq N(\epsilon ,\delta )\) training examples drawn i.i.d. from P, it returns a hypothesis \(h_R \in H_{ball}\) that satisfies the property above. Moreover, the sample complexity does not depend on the dimension d.
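A sketch of \(A_{balls}\) itself (hypothetical names; it simply returns the norm of the farthest positively labeled point):
\begin{verbatim}
import numpy as np

def A_balls(X, y):
    """Smallest origin-centered ball consistent with a realizable sample.

    X: (n, d) array of points, y: (n,) array of 0/1 labels.
    Returns R such that h_R(x) = 1[||x||_2 <= R].
    """
    norms = np.linalg.norm(X, axis=1)
    positives = norms[y == 1]
    return positives.max() if positives.size else 0.0

def h_R(R, X):
    # apply the learned hypothesis to new points
    return (np.linalg.norm(X, axis=1) <= R).astype(int)
\end{verbatim}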
2.2 PAC in Expectation.
Theorem
hypothesis class \(\mathrm{H}\) is PAC learnable if and only if \(\mathrm{H}\) is PAC learnable in expectation
Proof
(\(\Leftarrow \)) Assume \(\mathrm{H}\) is PAC learnable in expectation: there exist an algorithm A and \(\hat{N} : (0, 1) \rightarrow \mathbb{N}\) such that for every \(a\in (0,1)\) and every \(n\geq \hat{N}(a)\), \(E[e_P(A(S))]\leq a\).
Since \(e_P(A(S))\geq 0\) and \(\epsilon {\gt}0\), Markov's inequality gives
\[ P [ e_P(A(S)) {\gt} \epsilon ] \leq \frac{E [ e_P(A(S)) ]}{\epsilon } \]
Hence, setting \(N(\epsilon ,\delta ):=\hat{N}(\epsilon \delta )\), for every \(n \geq N(\epsilon ,\delta ) \) the following holds:
\[ \frac{E [ e_P(A(S)) ]}{\epsilon }\leq \frac{\epsilon \delta }{\epsilon }=\delta \]
which is exactly the PAC learnability definition, with the same A.
(\(\Rightarrow \)) Conversely, assume \(\mathrm{H}\) is PAC learnable by some algorithm A with sample complexity \(N(\epsilon ,\delta )\); the same A witnesses PAC learnability in expectation. Using the law of total expectation:
\[ E [e_P(A(s))] \]
\[ = \underbrace{E [e_P(A(s))|e_P(A(s)) \leq \epsilon ]\cdot P [e_P(A(s)) \leq \epsilon ]}_{\leq \epsilon \cdot 1} + \]
\[ \underbrace{E [e_P(A(s))|e_P (A (s)) {\gt}\epsilon ]\cdot P [e_P(A(s)) {\gt} \epsilon ]}_{\leq 1\cdot \delta } \]
\[ \leq \epsilon + \delta \]
In general, for any target accuracy \(a\in (0,1)\) we can take \(\epsilon =\delta =a/2\); then for \(n\geq N(a/2,a/2)\) we get \(E[e_P(A(S))]\leq \epsilon +\delta =a\), which is the definition of PAC learnability in expectation, and the equivalence follows.
Union Of Intervals.
We can notice that any 2k distinct points on the real line can be shattered using k intervals: in any labeling there are at most k maximal blocks of consecutive positive points, and it suffices to cover each block with one interval. Now let us look at a set of 2k+1 points, assume they are sorted \(x_1{\lt}x_2{\lt}\dots {\lt}x_{2k+1}\), and label \(x_i\) with \((-1)^{i+1}\). This labeling has k+1 positive points, each separated from the next by a negative point, so no interval may contain two of them and realizing it would require k+1 intervals. Hence no set of 2k+1 points is shattered, and the VC dimension is 2k.
Prediction by polynomials.
The VC dimension of H is the size of the largest set of examples that can be shattered by H\(\Rightarrow \) The VC dimension is infinite if for all m, there is a set of m examples shattered by H.
For every \(m\in \mathbb{N}\), pick \(m\) distinct points \(x_1,\dots ,x_m\in \mathbb{R}\) and any labeling \((y_1,y_2\dots y_m)\in \{ 0,1\} ^m\) (there are \(2^m\) of them). Using the hint, for the \(m\) distinct values \(x_1, \dots , x_m\) there exists a polynomial P of degree \(m-1\) such that \(P(x_i) = 1\) when \(y_i=1\) and \(P(x_i)=-1\) when \(y_i=0\). The corresponding \(h_P \in H_{poly}\) then labels all the elements of the sample exactly as required, for every one of the \(2^m\) labelings and for every \(m\).
Hence the VC dimension of \(H_{poly}\) is \(\infty \).
2.3 Structural Risk Minimization.
Let \(\hat{H}=\cup ^k_{i=1}H_i\) be a union of k finite hypothesis classes such that \(|H_1| \leq \dots \leq |H_k|\). Using the property relating empirical and true errors, for every class \(H_i\) and every \(\epsilon {\gt}0\):
\[ P[{\sup }_{h\in H_i}|e_s(h)-e_p(h)|{\gt}\epsilon ]\leq 2|H_i|e^{-2n\epsilon ^2} \]
now using the union bound we will get
\[ P\left[\exists i\leq k,\ \exists h\in H_i :\ |e_s(h)-e_p(h)|{\gt}\sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }}\right]\leq \sum ^k_{i=1} P\left[{\sup }_{h\in H_i}|e_s(h)-e_p(h)|{\gt}\sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }}\right] \]
For \(|S|=n\) and \(\epsilon _i = \sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }}\), each summand is bounded by
\[ 2|H_i|\exp \left(-2|S|\left(\sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }}\right)^2\right)=2|H_i|\left(\frac{2k|H_i|}{\delta }\right)^{-1}=\frac{\delta }{k} \]
and summing over the k classes gives \(\delta \). Equivalently, with probability at least \(1-\delta \):
\[ \forall i\leq k,\ \forall h \in H_i :\quad |e_s(h)-e_p(h)|\leq \sqrt{\frac{1}{2|S|}\ln \frac{2k|H_i|}{\delta }} \]
(b)
Let \(\hat{i}\) be the index such that SRM returns \(\text{ERM}_{\hat{i}}(S)\), and let \(i^*\) be the index of the class containing \(h^*\). Then:
\[ e_p(SRM)\leq e_s(\text{ERM}_{\hat{i}})+\sqrt{\frac{1}{2n}ln\frac{2k|H_{\hat{i}}|}{\delta }}\leq \underbrace{e_s(\text{ERM}_{i^*})+\sqrt{\frac{1}{2n}ln\frac{2k|H_{i^*}|}{\delta }}}_{\text{result from section a}} \]
Now subtract \(e_p(h^*)\) from both sides; using the ERM property \(e_s(\text{ERM}_{i^*}(S))\leq e_s(h^*)\) and the bound from section (a) we get
\[ e_s(\text{ERM}_{i^*}(S))-e_p(h^*)+\sqrt{\frac{1}{2n}ln\frac{2k|H_{i^*}|}{\delta }}\leq e_s(h^*)-e_p(h^*)+\sqrt{\frac{1}{2n}ln\frac{2k|H_{i^*}|}{\delta }}\leq \]
\[ 2\sqrt{\frac{1}{2n}ln\frac{2k|H_{i^*}|}{\delta }}\leq \epsilon \Rightarrow n\geq \frac{2}{\epsilon ^2}ln\frac{2k|H_{i^*}|}{\delta } \]
Hence for \(n\geq \frac{2}{\epsilon ^2}\ln \frac{2k|H_{i^*}|}{\delta }\), with probability at least 1-\(\delta \) we get \(e_p(\text{SRM}(S))\leq e_p(h^*)+\epsilon \).
2.4 Programming Assignment Union Of Intervals
(a)
Let the true distribution \(\Pr [x,y]=\Pr [y|x]\Pr [x]\) be such that \(x\) is distributed uniformly on the interval [0, 1], and
\[ \Pr [y=1|x] = \begin{cases} 0.8 & \quad \text{if } x\in [0,0.2]\cup [0.4,0.6]\cup [0.8,1] \\ 0.1 & \quad \text{if } x\in (0.2,0.4)\cup (0.6,0.8) \\ \end{cases} \]
hence we are looking for \(h(\hat{x})=\text{arg max}_{y\in \{ 0,1\} }\Pr [Y=y|X=\hat{x}]\):
when \(x\in [0,0.2]\cup [0.4,0.6]\cup [0.8,1]\) ,
\[ \Pr [Y=1|X=\hat{x}]=0.8 {\gt}Pr[Y=0|X=\hat{x}]=0.2 \]
and when \(x\notin [0,0.2]\cup [0.4,0.6]\cup [0.8,1]\):
\[ \Pr [Y=0|X=\hat{x}]=0.9 {\gt}Pr[Y=1|X=\hat{x}]=0.1 \]
Since \(x\) is distributed uniformly on the interval [0, 1], the optimal hypothesis in \(H_{10}\) is
\[ h(x)= \begin{cases} 1 & \quad \text{if } x\in [0,0.2]\cup [0.4,0.6]\cup [0.8,1] \\ 0 & \quad \text{else} \\ \end{cases} \]
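A small sketch of this distribution and of the true error of the optimal hypothesis (the intervals are hard-coded from the statement above; the sampling helper is my own):
\begin{verbatim}
import numpy as np

POS = [(0.0, 0.2), (0.4, 0.6), (0.8, 1.0)]   # intervals where Pr[y=1|x] = 0.8

def sample(n, rng=np.random.default_rng(0)):
    """Draw n labeled points from the distribution defined above."""
    x = np.sort(rng.random(n))
    in_pos = np.array([any(a <= xi <= b for a, b in POS) for xi in x])
    p1 = np.where(in_pos, 0.8, 0.1)
    y = (rng.random(n) < p1).astype(int)
    return x, y

# true error of the optimal hypothesis: it errs w.p. 0.2 on the positive
# intervals (total mass 0.6) and w.p. 0.1 elsewhere (mass 0.4)
print(0.6 * 0.2 + 0.4 * 0.1)
\end{verbatim}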
(b)
From the plot we can notice that the empirical error increases with the number of samples: with few samples the ERM hypothesis can fit the noisy labels almost exactly, but as the sample grows this is no longer possible. Moreover, the empirical error approaches roughly 0.15, close to the irreducible label noise from section (a) (error 0.2 inside the positive intervals and 0.1 outside them). The true error decreases as we use more samples, since the learned hypothesis gets closer to the optimum for the real distribution \(P\).
(c)
The empirical risk decreases as k increases, since the ERM algorithm has more disjoint intervals to choose from for the given data and can therefore cover more samples. On the other hand, for \(k{\gt}3\) we can notice the true error increasing, since the model overfits the sample set; \(k=3\) is the value with the best behaviour.
(d)
We can notice that when h comes from \(H_3\), the sum of the penalty and the empirical error is minimized.
(e)
Best hypothesis found:
after drawing the data we can notice that the selected hypothesis satisfies the holdout guarantee with probability \(1-\delta \), and its error is close to the optimal true error.
3 Assignment 3
3.1 Step-size Perceptron.
Consider the modification of Perceptron algorithm with the following update rule:
\[ w_{t+1} \leftarrow w_t + \eta _ty_tx_t \]
whenever \(\hat{y}_t\neq y_t\), with step size \(\eta _t=\frac{1}{\sqrt{t}}\) where t counts the mistakes made so far. Assume that the data is separable with margin \(\gamma {\gt} 0\) by a unit vector \(w^*\), and that \(||x_t|| = 1 \) for all t. On the t'th mistake, the iterate satisfies:
\[ w_{t+1}\cdot w^*=(w_t+\eta _ty_tx_t)\cdot w^*=w_t\cdot w^*+\underbrace{y_t\,x_t\cdot w^*}_{\geq \gamma }\cdot \frac{1}{\sqrt{t}}\geq w_t\cdot w^*+\frac{\gamma }{\sqrt{t}} \]
so after m mistakes (using \(1/\sqrt{t}\geq 1/\sqrt{m}\)): \(w_{m+1}\cdot w^* \geq m\cdot \frac{\gamma }{\sqrt{m}}=\sqrt{m}\,\gamma \).
And now we upper bound \(||w_{m+1}||_2\). Since on every mistake \(w_{t+1}-w_t=\eta _ty_tx_t\),
\[ m\gamma \leq \sum ^m_{t=1}y_t\,x_t\cdot \frac{w^*}{||w^*||}= \sum ^m_{t=1}\frac{(w_{t+1}-w_t)\cdot w^*}{\eta _t\,||w^*||} \]
and by telescoping (with \(w_1=0\)):
\[ ||w_{m+1}||_2= \sqrt{\sum ^m_{t=1}\left(||w_t+\eta _ty_tx_t ||^2-||w_t||^2\right) }= \sqrt{\sum ^m_{t=1}\left(\underbrace{2\eta _ty_t\,w_t\cdot x_t}_{\leq 0 \text{ on a mistake}}+\eta _t^2||x_t||^2 \right)} \]
\[ \leq \sqrt{\sum ^m_{t=1}\frac{1}{t}||x_t||^2 } = \sqrt{H_m}\thicksim \log (\sqrt{m}) \]
using the Cauchy-Schwarz inequality (and \(||w^*||=1\)):
\[ \gamma \sqrt{m} \leq w_{m+1}\cdot w^* \leq ||w_{m+1}||_2\,||w^*|| \leq \log (\sqrt{m}) \]
\[ \Rightarrow \sqrt{m} \leq \frac{1}{\gamma }\log (\sqrt{m})\Rightarrow \sqrt{m} \leq \frac{2}{\gamma }\log (\frac{1}{\gamma })\Rightarrow m\leq \frac{4}{\gamma ^2}\log ^2(\frac{1}{\gamma }) \]
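A sketch of this Perceptron variant with \(\eta _t=1/\sqrt{t}\) on the t-th mistake (an illustrative implementation; the epoch loop and the mistake condition \(y_i(w\cdot x_i)\le 0\) are my choices):
\begin{verbatim}
import numpy as np

def step_size_perceptron(X, y, epochs=10):
    """Perceptron with update w <- w + (1/sqrt(t)) * y_i x_i on the t-th mistake."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # mistake (or on the boundary)
                t += 1
                w += yi * xi / np.sqrt(t)
    return w, t                            # learned weights and number of mistakes
\end{verbatim}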
3.2 Convex functions.
2.1
Let \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) be a convex function, \(A \in \mathbb{R}^{n\times n}\) and \(b\in \mathbb{R}^n\), and define \(g(x)=f(Ax+b)\). For any \(x,y\) and \(0{\lt} \lambda {\lt}1\) we want the graph of \(g\) on the segment \([x,y]\) to fall below or on the chord. Noticing that \(b=\lambda b +(1-\lambda )b\):
\[ g(\lambda x+(1-\lambda )y)=f(A(\lambda x+(1-\lambda )y)+b)=f(\lambda (Ax+b)+(1-\lambda )(Ay+b))\leq \]
using Jensen’s inequality
\[ \lambda f(Ax+b)+(1-\lambda )f(Ay+b)=\lambda g(x)+(1-\lambda )g(y). \]
hence \(g\) satisfies the convexity inequality, and \(g(x)=f(Ax+b)\) is convex over \(\mathbb{R}^n\).
2.2
Now let us consider convex functions \(f_1(x),f_2(x)\dots f_m(x)\), \(f_i : \mathbb{R}^d \rightarrow \mathbb{R}\), and prove that \(g(x)=\text{max}_i f_i(x)\) is also convex. Using the convexity of each \(f_i\), \( f_i(\lambda x + (1-\lambda )y) \le \lambda f_i(x) + (1-\lambda )f_i(y)\), we take the maximum of both sides:
\[ \underset {i}{\text{max}} \left\{ f_i(\lambda x + (1-\lambda )y) \right\} \le \underset {i}{\text{max}} \left\{ \lambda f_i(x) + (1-\lambda )f_i(y) \right\} \]
\[ \underset {i}{\text{max}} \left\{ f_i(\lambda x + (1-\lambda )y) \right\} \le \underset {i}{\text{max}} \left\{ \lambda f_i(x) \right\} + \underset {i}{\text{max}} \left\{ (1-\lambda )f_i(y) \right\} \]
and since \(\underset {i}{\text{max}}\left\{ \lambda f_i(x)\right\} =\lambda g(x)\) and \(\underset {i}{\text{max}}\left\{ (1-\lambda )f_i(y)\right\} =(1-\lambda )g(y)\), we get
\[ g(\lambda x + (1-\lambda )y) \le \lambda g(x) + (1-\lambda )g(y). \]
\(g(x)\) is convex
2.3
Let \(\ell _{ log} : R \rightarrow R\) be the log loss, defined by
\[ \ell _{ log}(z)=\log _{2} (1+e^{-z}) \]
We know that \(f\) is convex if \(f''\geq 0\) everywhere (and here in fact \(f''{\gt}0\)):
\[ \frac{d}{dz}\left(\log _2\left(1+e^{-z}\right)\right)=-\frac{e^{-z}}{\ln \left(2\right)\left(1+e^{-z}\right)} \]
\[ \Rightarrow \frac{d}{dz} \left(-\frac{e^{-z}}{\ln \left(2\right)\left(1+e^{-z}\right)}\right)=\: \frac{e^{-z}}{\ln \left(2\right)\left(1+e^{-z}\right)^2}{\gt}0 \]
Using section 2.1 together with the convexity of \(\ell _{log}\) shown above: for \(f(\mathbf{w})\) defined by
\[ f(\mathbf{w}) = \sum _{i=1}^n\ell _{log}(y_i\,\mathbf{w\cdot x}_i)=\sum _{i=1}^n \log _{2} (1+e^{-y_i\mathbf{w\cdot x}_i}) \]
let us set \(f_i(\mathbf{w})=\log _{2} (1+e^{-y_i\mathbf{w\cdot x}_i})\). Each \(f_i\) is the composition of the convex function \(\ell _{log}\) with the linear map \(\mathbf{w}\mapsto y_i\,\mathbf{x}_i\cdot \mathbf{w}\), hence convex by section 2.1. Finally, for any \(\mathbf{u},\mathbf{v}\) and \(\lambda \in (0,1)\),
\[ f(\lambda \mathbf{u}+(1-\lambda )\mathbf{v}) =\sum _{i=1}^n f_i (\lambda \mathbf{u}+(1-\lambda )\mathbf{v})\leq \sum _{i=1}^n\left(\lambda f_i(\mathbf{u})+(1-\lambda ) f_i(\mathbf{v})\right)=\lambda f(\mathbf{u})+(1-\lambda )f(\mathbf{v}) \]
so \(f(\mathbf{w})\) is convex.
GD with projection.
3.1
Let \(y \in \mathbb {R}^d\), \(x = \prod _{\mathcal{K}} (y)\), and let \(z\in \mathcal{K}\). By assumption \(\mathcal{K} \) is a convex set, hence
\((1-\lambda )x+\lambda z=x-\lambda (x-z)\in \mathcal{K} \) for any \(\lambda \in (0,1)\). Since x is the projection of y onto \(\mathcal{K}\):
\[ ||x-y||^2\leq ||x-\lambda (x-z)-y||^2= ||(x-y)-\lambda (x-z)||^2 \]
\[ = ||x-y||^2-2\lambda \langle x-y,x-z \rangle +\lambda ^2||x-z||^2 \]
\[ \Rightarrow \langle x-y,x-z \rangle \leq \frac{\lambda }{2}||z-x||^2 \]
The above holds for any \(\lambda \in (0,1)\), and the right-hand side can be made arbitrarily small; letting \(\lambda \rightarrow 0\) we get
\[ \langle x-y,x-z \rangle \leq 0 \]
Now let us look at some \(z\in \mathcal{K}\) and use the inequality \(\langle x-y,x-z \rangle \leq 0\) that we just proved:
\[ ||y-z||^2-||x-z||^2=||y-z+x-x||^2-||x-z||^2= \]
\[ ||(x-z)-(x-y)||^2-||x-z||^2 = ||x-z||^2-2\langle x-y,x-z \rangle +||x-y||^2-||x-z||^2\geq 0 \]
\[ \Rightarrow ||y-z|| \geq ||x-z|| \]
3.2
Theorem
GD with projection satisfies the convergence theorem: given a desired accuracy \(\epsilon {\gt} 0\), set \(\eta =\frac{\epsilon }{G^2}\) and run GD with projection for \(T=\left(\frac{BG}{\epsilon }\right)^2\) iterations, where \(B\geq ||x_1-x^*||\) and \(G\geq ||\nabla f(x_t)||\) for all t.
Proof
Let \(\bar{w}=\frac{1}{T}\sum ^T_{t=1}x_t\) be the average iterate and \(y_{t+1}:=x_t-\eta \nabla f(x_t)\) the point before the projection. Using Jensen's inequality and convexity:
\[ f(\bar{w})-f(w^*)\leq \frac{1}{T}\sum ^T_{t=1}\nabla f(x_t)\cdot (x_t-x^*)= \frac{1}{T}\sum ^T_{t=1}\frac{1}{\eta }(x_t-y_{t+1})\cdot (x_t-x^*) \]
using the identity \(2\,a\cdot b=||a||^2+||b||^2-||a-b||^2\) with \(a=x_t-y_{t+1}\) and \(b=x_t-x^*\):
\[ = \frac{1}{T}\sum ^T_{t=1}\frac{1}{2\eta }\left( ||x_t-y_{t+1}||^2+||x_t-x^*||^2 - ||y_{t+1}-x^*||^2\right) \]
\[ = \frac{1}{T}\sum ^T_{t=1}\frac{1}{2\eta }\underbrace{\left( ||x_t-x^*||^2-||y_{t+1}-x^*||^2\right)}_{3.1 }+\frac{1}{T}\sum ^T_{t=1} \frac{\eta }{2}||\nabla f(x_{t})||^2 \]
Using the result from 3.1 we know that \(||y_{t+1}-x^*||\geq ||x_{t+1}-x^*|| \), so the first sum telescopes, and assuming \(||\nabla f(x_{t})||\leq G\) we get
\[ f(\bar{w})-f(w^*)\leq \frac{||x_1-x^*||^2}{2\eta T}+\frac{\eta G^2}{2}\leq \frac{B^2}{2\eta T}+\frac{\eta G^2}{2} \]
For any \(\epsilon {\gt} 0\), plugging in \(\eta =\frac{\epsilon }{G^2}\) and \(T=\left(\frac{BG}{\epsilon }\right)^2\):
\[ f(\bar{w})-f(w^*)\leq \frac{B^2}{2\eta T}+\frac{\eta G^2}{2}=\frac{\epsilon }{2}+\frac{\epsilon }{2}=\epsilon \]
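A sketch of GD with projection when \(\mathcal{K}\) is an \(\ell _2\)-ball of radius r (the quadratic objective and the radius are illustrative assumptions, not part of the assignment):
\begin{verbatim}
import numpy as np

def project_ball(y, r=1.0):
    """Euclidean projection of y onto the ball {x : ||x|| <= r}."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

def projected_gd(grad, x0, eta, T, r=1.0):
    xs = [x0]
    for _ in range(T):
        y_next = xs[-1] - eta * grad(xs[-1])   # gradient step
        xs.append(project_ball(y_next, r))     # project back onto K
    return np.mean(xs[:-1], axis=0)            # average iterate

# example: f(x) = ||x - c||^2 with unconstrained minimizer outside the unit ball
c = np.array([2.0, 0.0])
w_bar = projected_gd(lambda x: 2 * (x - c), np.zeros(2), eta=0.1, T=500)
print(w_bar)    # close to the boundary point (1, 0)
\end{verbatim}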
3.3 Gradient Descent on Smooth Functions.
Let \(f : \mathbb {R}^n \rightarrow \mathbb {R}\) be a \(\beta \)-smooth and non-negative function. Consider the gradient descent algorithm applied to f with constant step size \(\eta {\gt} 0\):
\[ x_{t+1}=x_t-\eta \nabla f(x_t) \]
Now let us use \(\beta \)-smoothness at the points \(x_t,x_{t+1}\):
\[ f(x_{t+1})\leq f(x_t)+\langle \nabla f(x_t),x_{t+1}-x_t\rangle +\frac{\beta }{2}||x_{t+1}-x_t||^2 \]
\[ \leq f(x_t)-\eta ||\nabla f(x_t)||^2 +\frac{\beta \eta ^2}{2}||\nabla f(x_t)||^2 \]
\[ f(x_{t+1})-f(x_t)\leq -(\eta -\frac{\beta \eta ^2}{2})||\nabla f(x_t)||^2 \]
Summing this inequality over \(t=1,\dots ,k\) and telescoping, we can bound the sum of the squared gradient norms:
\[ \sum ^k_{t=1}||\nabla f(x_t)||^2\leq (\eta -\frac{\beta \eta ^2}{2})^{-1}(f(x_1)-f(x_{k+1})) \]
Since \(f\) is non-negative and, for \(\eta {\lt}2/ \beta \), the coefficient \(\eta -\frac{\beta \eta ^2}{2}\) is positive, the values \(f(x_k)\) are non-increasing and bounded below by 0, so they cannot tend to \(-\infty \); hence the sequence \(\{ ||\nabla f(x_t)||^2\} \) is summable and every limit point of the GD iterates \(x_k\) has zero gradient. Let us define \(f^*:=\lim _{k\rightarrow \infty }f(x_k)\).
\[ \underset {t\le k}{\text{min}}\,||\nabla f(x_t)||^2 \leq \frac{1}{k}\sum ^{k}_{t=1}||\nabla f(x_t)||^2\leq \frac{1}{k}\cdot \frac{1}{\eta (1-\frac{\beta \eta }{2})}\left(f(x_1)-f(x_{k+1})\right) \]
hence for some constant c (depending on \(f(x_1)\), \(\eta \) and \(\beta \)) we get
\[ \underset {t\le k}{\text{min}}\,||\nabla f(x_t)||\leq \frac{c}{\sqrt{k}}\underset {k \rightarrow \infty }{\longrightarrow }0 \]
3.4 Programming Assignment
SGD for Hinge loss
(c)
(d)
SGD for log-loss.
(a)
(b)
(c)
4 Assignment 4
SVM with multiple classes.
Define the following multiclass SVM problem:
\[ f(w_1,\dots w_k)=\frac{1}{n}\sum ^n_{i=1}\ell (w_1,\dots w_k,x_i,y_i)=\frac{1}{n}\sum ^n_{i=1} \underset {j\in [K]}{\text{max}}(w_j \cdot x_i -w_{y_i}\cdot x_i + \mathds {1}(j\neq y_i) ) \]
First let us notice that taking \(j=y_i\) inside the max gives \(w_{y_i}\cdot x_i-w_{y_i}\cdot x_i+0=0\), hence every summand of \(f\) is non-negative and so is \(f\). We assume the data is linearly separable and let \(\mathbf{w^*}=(w_1^*,\dots w_k^*) \) be the actual separator of the data. It is sufficient to show that \(f(\mathbf{w^*})=0\): indeed, if a minimizer \(\mathbf{w}\) attains \(f(\mathbf{w})=0\), then for every i and every \(j\neq y_i\) the term
\[ w_j \cdot x_i -w_{y_i}\cdot x_i + \mathds {1}(j\neq y_i)=x_i(w_j-w_{y_i})+\mathds {1} \]
is at most 0, i.e. \(w_{y_i}\cdot x_i\geq w_j\cdot x_i+1\), so \(\text{arg}\max _j w_j\cdot x_i=y_i\) and the classifier makes 0 training errors.
Now let \(M:=\min _i\min _{j\neq y_i} x_i\cdot (w^*_{y_i}-w^*_j)\) be the margin of \(\mathbf{w^*}\); by the max-margin hyperplane property the true separator \(\mathbf{w^*}\) maximizes this minimum distance, so \(M{\gt}0\). Rescaling \(\mathbf{w^*}\) by \(\frac{1}{M}\), for every i and every \(j\in [K]\setminus \{ y_i\} \):
\[ \frac{x_i\cdot w_j^*}{M}-\frac{x_i\cdot w_{y_i}^*}{M}=\frac{-1}{M}\left(x_i\cdot (w_{y_i}^*-w_j^*)\right)\leq \frac{-1}{M}\,\text{Min}_{j\neq y_i}\left(x_i\cdot (w_{y_i}^*-w_j^*)\right)\leq -1 \]
since every margin is at least \(M\). Hence every term of the max with \(j\neq y_i\) is at most \(-1+1=0\), the term with \(j=y_i\) equals 0, and \(f(\frac{1}{M}\mathbf{w^*})=0\).
Hence the multiclass hinge loss attains its minimum value 0 at the (rescaled) max-margin separator, and by the argument above any minimizer of \(f\) makes 0 training errors.
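A sketch of the multiclass hinge loss \(f\) above, with the weights stored as a \(k\times d\) matrix (the names are mine):
\begin{verbatim}
import numpy as np

def multiclass_hinge_loss(W, X, y):
    """f(W) = (1/n) sum_i max_j ( w_j.x_i - w_{y_i}.x_i + 1[j != y_i] )."""
    scores = X @ W.T                                       # (n, k): w_j . x_i
    n = X.shape[0]
    margins = scores - scores[np.arange(n), y][:, None]    # w_j.x_i - w_{y_i}.x_i
    margins += (np.arange(W.shape[0])[None, :] != y[:, None])  # + 1[j != y_i]
    return margins.max(axis=1).mean()
\end{verbatim}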
Soft-SVM.
Consider the soft-SVM problem with separable data:
\[ \begin{array}{ll} \underset {\mathbf{w},\xi }{\text{min}} & 0.5||\mathbf{w}||^2 + C\sum ^n_{i=1}\xi _i \\ \text{ s.t } \forall i: & y_i\mathbf{w\cdot x}_i \ge 1-\xi _i \\ & \xi _i \ge 0. \end{array} \]
Let \(\mathbf{w}^\star \) be the solution of hard SVM, and let \(\mathbf{w}',\xi '\) be the solution of soft SVM. Since \((\mathbf{w}^\star ,\xi =0)\) is a feasible point of the soft problem, its objective upper-bounds the optimum, and I claim that the following holds for \(C\geq ||\mathbf{w}^\star ||^2\):
\[ \frac{1}{2}||\mathbf{w'}||^2+||\mathbf{w}^\star ||^2\sum _i\xi '_i\leq \frac{1}{2}||\mathbf{w'}||^2+C\sum _i\xi '_i\leq \frac{1}{2}||\mathbf{w}^\star ||^2\Rightarrow \frac{1}{2}||\mathbf{w'}||^2\leq ||\mathbf{w}^\star ||^2\left(\frac{1}{2}-\sum _i\xi '_i\right) \]
Since all the quantities are non-negative, the right-hand side must be non-negative as well, so any minimizer of the problem satisfies \(\sum _i \xi '_i {\lt} 1\), and in particular \(\xi '_i{\lt}1\) for every i. The signed distance of any \(x_i\) from the separating hyperplane then satisfies
\[ \frac{y_i\,\mathbf{w'\cdot x}_i}{||\mathbf{w'}||}\geq \frac{1-\xi '_i}{||\mathbf{w'}||}{\gt}0 \]
so a point \(x_i\) with \(\xi '_i{\gt}0\) may lie within the margin, but no point crosses the separating hyperplane; hence \(\mathbf{w}'\) separates the data.
Separability using polynomial kernel.
Let \(x_1,\dots x_n \in \mathbb {R}\) be distinct real numbers, and let \(q \geq n\) be an integer. For separable data, hard-SVM yields zero training errors. Let us write the polynomial kernel in binomial form:
\[ K(x,x')=(1+xx')^q=\sum ^q_{k=0}\binom {q}{k}(xx')^k=\sum ^q_{k=0}x^k\sqrt{\binom {q}{k}}\,x'^k\sqrt{\binom {q}{k}} \]
Multiply each column of the Vandermonde matrix by the corresponding constant \(\sqrt{\binom {q}{k}}\) from the binomial expansion above:
\[ \begin{pmatrix} x_1^0 & x_1^1 & x_1^2 & \cdots & x_1^q \\ x_2^0 & x_2^1 & x_2^2 & \cdots & x_2^q \\ \vdots & & & & \vdots \\ x_n^0 & x_n^1 & x_n^2 & \cdots & x_n^q \end{pmatrix} \Rightarrow \begin{pmatrix} x_1^0\sqrt{\binom {q}{0}} & x_1^1\sqrt{\binom {q}{1}} & x_1^2\sqrt{\binom {q}{2}} & \cdots & x_1^q\sqrt{\binom {q}{q}} \\ x_2^0\sqrt{\binom {q}{0}} & x_2^1\sqrt{\binom {q}{1}} & x_2^2\sqrt{\binom {q}{2}} & \cdots & x_2^q\sqrt{\binom {q}{q}} \\ \vdots & & & & \vdots \\ x_n^0\sqrt{\binom {q}{0}} & x_n^1\sqrt{\binom {q}{1}} & x_n^2\sqrt{\binom {q}{2}} & \cdots & x_n^q\sqrt{\binom {q}{q}} \end{pmatrix} \]
Hence the binomial form is obtained as an inner product, i.e. the kernel matrix \(K_S\) satisfies \(K(x_i,x_j)=\phi (x_i)\cdot \phi (x_j)\) with \(\phi (x)=\left(\sqrt{\binom {q}{k}}\,x^k\right)_{k=0}^q\). Using the fact that the Vandermonde matrix of n distinct points has rank n (and \(q\geq n\)), the feature vectors are linearly independent, so the lemma from class applies and hard-SVM with this kernel yields zero training errors.
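A numerical sketch of this argument: the column-scaled Vandermonde rows reproduce the kernel exactly and have full rank n for distinct points (the values of x and q are arbitrary):
\begin{verbatim}
import numpy as np
from math import comb

x = np.array([0.1, 0.5, -1.3, 2.0])      # n distinct points
n, q = len(x), 5                          # q >= n

# feature map phi(x) = ( sqrt(C(q,k)) * x^k ) for k = 0..q
coeffs = np.sqrt([comb(q, k) for k in range(q + 1)])
Phi = coeffs * x[:, None] ** np.arange(q + 1)

K_explicit = (1 + np.outer(x, x)) ** q    # polynomial kernel (1 + x x')^q
print(np.allclose(Phi @ Phi.T, K_explicit))   # True: inner products match the kernel
print(np.linalg.matrix_rank(Phi) == n)        # full rank, as for a Vandermonde matrix
\end{verbatim}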
Expressivity of ReLU networks.
- •
If \(x\geq 0\Rightarrow x=\max \{ 0,x\} ,0=\max \{ 0,-x\} \Rightarrow x=x-0=\max \{ 0,x\} -\max \{ 0,-x\} \)
If \(x{\lt} 0\Rightarrow \max \{ 0,x\} =0,\ \max \{ 0,-x\} =-x \Rightarrow \max \{ 0,x\} -\max \{ 0,-x\} =0-(-x)=x \)
- •
If \(x\geq 0 ,-x\leq 0\Rightarrow x\geq -x \Rightarrow \max (x,-x)=x=|x|\)
If \(x{\lt} 0 ,-x{\gt} 0\Rightarrow x{\lt} -x \Rightarrow \max (x,-x)=-x=|x|\)
- •
\begin{eqnarray} \frac{x_1+x_2}{2}+\frac{|x_1-x_2|}{2} & =& \frac{(x_1+x_2+\max (x_1-x_2,x_2-x_1))}{2} \\ & =& \frac{1}{2}(\max (x_1-x_2+x_1+x_2,x_2-x_1+x_1+x_2)) \\ & =& \frac{1}{2}(\max (2x_1,2x_2)) \\ & =& \max (x_1,x_2) \end{eqnarray}
4(b)
\(n_1=\max \{ x_1-x_2,0\} ,n_2=\max \{ x_2-x_1,0\} ,n_3=\max \{ x_1+x_2,0\} ,n_4=\max \{ -x_1-x_2,0\} \)
(*) We could achieve the same result with 3 neurons, since \(\max \{ x_1,x_2\} =\max \{ x_1-x_2,0\} +x_2\).
4.1 Implementing boolean functions using ReLU networks.
Consider n boolean input variables \(x_1,x_2\dots ,x_n\in \{ 0,1\} \); let us construct a neural network with ReLU activations which implements the AND function:
\[ f(x_1,x_2\dots ,x_n)=x_1\wedge x_2\wedge \dots \wedge x_n \]
we can consider the \(f\) as :
\[ f(x_1,x_2\dots ,x_n)=\max \{ -n+1+\sum ^n_{i=1} x_i,0\} \]
Hence we can build a ReLU network with one hidden layer and a constant extra input \(\hat{x}=1\), with weights
\[ w(x_i)=1 \text{ for } 1\le i\le n \text{ and } w(\hat{x})=-(n-1) \]
By connecting all the inputs to a single ReLU neuron, together with the constant input \(\hat{x}\), we get the AND function:
\[ \max \{ -(n-1)\hat{x}+\sum x_i,0\} \]
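A one-neuron ReLU check of this AND construction, with the constant input folded in as a bias of \(-(n-1)\) (an illustrative sketch):
\begin{verbatim}
import numpy as np

def relu_and(x):
    """AND of n boolean inputs via a single ReLU unit: max(sum(x) - (n-1), 0)."""
    n = len(x)
    return max(np.sum(x) - (n - 1), 0)

# exhaustive check for n = 3: output is 1 iff all inputs are 1
for bits in np.ndindex(2, 2, 2):
    assert relu_and(np.array(bits)) == int(all(bits))
\end{verbatim}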
4.2 Programming Assignment.
SVM
figure
Homogeneous polynomial kernel.
Here, the polynomial kernel of degree 2 fits the data better, since the data is circle-shaped and the kernel trick needs degree 2 to make it separable by a hyperplane in feature space.
We can see a better fit in the right-hand-side model: the independent (constant) term in the kernel function gives the needed correction, but the degree 2 polynomial still fits better here.
figure
Non-Homogeneous polynomial kernel.
figure: polynomial kernel and RBF kernel.
The RBF kernel generalizes better on the noisy data, since it can separate the data in a higher dimension than the polynomial kernel, whose boundary becomes more "elliptic" in order to cover the noisy points.
figure: RBF kernel with different \(\gamma \) values
4.3 Neural Networks.
figure
figure
figure
From the plots we can learn that the best value for the learning rate is 0.1: with a larger value each step may take us far away from the optimal minimizer, while with a smaller value we progress less effectively and training is much slower.
figure: accuracy in the final epoch
5 Assignment 5
5.1 Suboptimality of ID3.
Consider the following training set, where \(\mathcal{X}= \{ 0, 1\} ^3\) and \(Y = \{ 0, 1\} \):
\[ a=((1, 1, 1), 1)\quad b=((1, 0, 0), 1) \quad c=((1, 1, 0), 0)\quad d=((0, 0, 1), 0) \]
We wish to use this training set to build a decision tree of depth 2. For each split, let us calculate the information gain as mutual information:
\[ G(S,i)=\underbrace{C(Y)}_{pre}\underbrace{-\mathbb {P}[X_i=0]C(Y|X_i=0)-\mathbb {P}[X_i=1]C(Y|X_i=1)}_{post} \]
where \(C(Y)=C(\mathbb {P}[Y=1])\) and \(C\) is the entropy.
\begin{align*} G(Y;X_1)& = H(Y)-\mathbb {P}[X_1=0]H(Y|X_1=0)-\mathbb {P}[X_1=1]H(Y|X_1=1) \\ & =-2\left(\frac{1}{2}\log \frac{1}{2}\right)-\frac{3}{4}H(Y|X_1=1) \\ & =1+\frac{3}{4}\left(\frac{2}{3}\log \frac{2}{3}+\frac{1}{3}\log \frac{1}{3}\right)\\ & \approx 0.311\\ G(Y;X_2)& =1-\mathbb {P}[X_2=0]H(Y|X_2=0)-\mathbb {P}[X_2=1]H(Y|X_2=1) =0\\ G(Y;X_3)& =1-\mathbb {P}[X_3=0]H(Y|X_3=0)-\mathbb {P}[X_3=1]H(Y|X_3=1) =0\\ \end{align*}
Hence the first split will be on \(X_1\), and \(a,b,c\) go down the same branch. Now, if we choose to ask about \(X_2\), then \(a,c\) map to the same leaf; on the other hand, if we ask about \(X_3\), then \(b,c\) map to the same leaf. It follows that no matter which split ID3 chooses, at least one sample is classified wrongly and the training error is at least \(\frac{1}{4}\).
(b)
5.2 AdaBoost
(a)
Let \(x_1,\dots , x_m \in \mathbb {R}^d\) and let \(y_1, \dots ,y_m \in \{ -1,1\} \) be their labels. We run the AdaBoost algorithm as given in the lecture and we are at iteration \(t\), assuming that \(\epsilon _t{\gt}0\). The empirical error of \(h_t\) at iteration t of the AdaBoost run is defined by:
\[ \epsilon _t=e_{D_t,S}(h_t)=\sum _i D_t (i)\mathds {1}[h_t(x_i)\neq y_i]=\sum _{i:h(x_i)\neq y_i} D_t(i) \]
Now we can notice that the error of \(h_t\) with respect to the updated distribution \(D_{t+1}\) can be written as
\[ \Pr _{x\sim D_{t+1}}[h_t(x)\neq y]= \frac{\sum _{i:h(x_i)\neq y_i} D_t(i)e^{w_t}}{\sum _j D_t(j)e^{-w_ty_jh_t(x_j)}}= \frac{\epsilon _te^{\frac{1}{2}\ln \frac{1-\epsilon _t}{\epsilon _t}}}{Z_t}= \frac{\epsilon _t\sqrt{\frac{1-\epsilon _t}{\epsilon _t}}}{2\sqrt{\epsilon _t(1-\epsilon _t)}}=\frac{1}{2} \]
The first equality holds since the numerator collects the re-weighted mistakes and the denominator is the total re-weighted distribution, the second by the definitions of \(Z_t\) and \(w_t\), and the third using the lemma \(Z_t=2\sqrt{\epsilon _t(1-\epsilon _t)}\).
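A quick numerical illustration of this \(\frac{1}{2}\) property, with synthetic weights and predictions (an illustrative sketch, not the assignment code; it assumes \(\epsilon _t{\gt}0\)):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
m = 100
D = rng.random(m); D /= D.sum()               # current distribution D_t
y = rng.choice([-1, 1], m)                    # true labels
h = np.where(rng.random(m) < 0.7, y, -y)      # weak hypothesis, ~30% flipped labels

eps = D[h != y].sum()                         # weighted error (assumed > 0)
w = 0.5 * np.log((1 - eps) / eps)
D_next = D * np.exp(-w * y * h); D_next /= D_next.sum()
print(D_next[h != y].sum())                   # = 0.5 up to rounding
\end{verbatim}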
(b)
Let us assume that AdaBoost picks the same hypothesis twice consecutively, that is
\[ h_{t}=h_{t+1}=WL(D_{t+1},S) \]
Using the result of section (a) we know that the error of \(h_t\) with respect to \(D_{t+1}\) is exactly \(\frac{1}{2}\), in contradiction to the fact that \(h_{t+1}=WL(D_{t+1},S)\) is a weak learner (whose error must be strictly below \(\frac{1}{2}\)).
5.3 Sufficient Condition for Weak Learnability.
Let \(S=\{ (x_1,y_1),\dots (x_n,y_n)\} \) be a training set and let \(\mathcal{H}\) be a hypothesis class. Assume that there exist \(\gamma {\gt} 0\), hypotheses \(h_1,\dots ,h_k\in \mathcal{H}\) and coefficients \(a_1,\dots a_k \geq 0,\ \sum _{j=1}^k a_j=1\) for which the following holds:
\begin{align} y_i\sum ^k_{j=1}a_jh_j(x_i)\geq \gamma \quad \forall (x_i,y_i)\in S \end{align}
Let \(D\) be any distribution over \(S\). Then, taking expectations with respect to \(D\) of both sides of the equation
\[ \underset {i\sim D}{E}\left[ y_i\sum ^k_{j=1}a_jh_j(x_i)\right]= \sum ^k_{j=1}a_j \underset {i\sim D}{E}\left[ y_ih_j(x_i)\right]\geq \gamma \]
\[ \exists m\in \mathcal{H} \text{ s.t }\quad \underset {i\sim D}{E}\left[ y_i h_{m}(x_i)\right]\geq \gamma \]
The first equality holds by linearity of expectation, and the second line holds because the \(a_1,\dots ,a_k\) constitute a distribution, so at least one of the terms \(\underset {i\sim D}{E}[y_ih_j(x_i)]\) must be at least the weighted average \(\gamma \); call that hypothesis \(h_m\).
On the other hand, we can expand this expectation as a weighted sum over the sample and use \(y_ih_m(x_i)=1-2\cdot \mathds {1}[h_m(x_i)\neq y_i]\), that is:
\begin{align*} \underset {i\sim D}{E}\left[ y_i h_{m}(x_i)\right]& =\sum _{j=1}^n D (j)\,y_jh_m(x_j)\\ & = \sum _{j:h_m(x_j)=y_j}D(j) -\sum _{j:h_m(x_j)\neq y_j}D(j)\\ & =1-2\sum _{j:h_m(x_j)\neq y_j}D(j)\\ & =1-2\Pr _{j\sim D}[h_m(x_j)\neq y_j]\\ & \geq \gamma \end{align*}
Rearranging the last two lines we get \(\Pr _{i\sim D}[h_m(x)\neq y]\le \frac{1}{2}-\frac{\gamma }{2}\), i.e. \(h_m\) is a weak hypothesis for D.
(b)
Let \(S=\{ (x_1,y_1),\dots (x_n,y_n)\} \subseteq \mathbb {R}^d \times \{ -1,1\} \) be a training set that is realized by a
\(d\)-dimensional hyper-rectangle classifier, and let \(\mathcal{H}\) be the class of decision stumps.
following the hint, set:
\[ k=4d-1\quad a=\frac{1}{4d-1} \quad h_{b_j}(x) = \begin{cases} 1 & \quad x_j\ge b_j \\ -1 & \quad x_j{\lt} b_j \end{cases} ,\quad h_{c_j}(x) = \begin{cases} 1 & \quad x_j\le c_j \\ -1 & \quad x_j {\gt}c_j \end{cases} \qquad 1\le j\le d \]
and
\[ \mathcal{H}_b:=\{ h_{b_j}:1\le j\le d\} ,\quad \mathcal{H}_c:=\{ h_{c_j}:1\le j\le d\} ,\quad \mathcal{H}_k=\mathcal{H}_b\cup \mathcal{H}_c \cup \{ h(x)=-1\} \]
where the constant hypothesis \(h(x)=-1\) is taken \(2d-1\) times.
Then \(|\mathcal{H}_k|=4d-1=k\). For a sample \(s=(x_i,y_i)\in S\), we can notice that \(x_i\) is inside the hyper-rectangle iff \(b_j\le (x_i)_j \le c_j\) for all \( 1\le j \le d\). Let us define \(f\) s.t.
\begin{align*} f(x)=\sum _{h\in \mathcal{H}_k }a\, h(x)=a\left(\sum _{h\in \mathcal{H}_b} h(x)+ \sum _{h\in \mathcal{H}_c}h(x) + \sum _{h\in \mathcal{H}_k\setminus (\mathcal{H}_b\cup \mathcal{H}_c)}h(x)\right) \\ =\frac{1}{4d-1}\left(\sum _{j=1}^{d}\underbrace{\left(h_{b_j}(x)+h_{c_j}(x)\right)}_{\star }-(2d-1)\right) \end{align*}
Now plug in \(y_if(x_i)\). If \(x_i\) is inside the hyper-rectangle (\(y_i=+1\)), then every stump in \((\star )\) outputs +1, the inner sum equals \(2d\), and \(y_if(x_i)=\frac{2d-(2d-1)}{4d-1}=\frac{1}{4d-1}\). If \(x_i\) is outside (\(y_i=-1\)), then at least one stump outputs \(-1\), the inner sum is at most \(2d-2\), and \(f(x_i)\le \frac{2d-2-(2d-1)}{4d-1}=\frac{-1}{4d-1}\), so again \(y_if(x_i)\ge \frac{1}{4d-1}\). Hence, choosing
\[ \gamma =\frac{1}{4d-1} \]
we have \(y_i f(x_i)\ge \gamma \) for every \((x_i,y_i)\in S\), and using section (a) this gives the result.
5.4 Comparing notions of weak learnability.
Given a \(\gamma \)-weak learner \(\mathcal{A}\), let us construct a learner \(\mathcal{A'}\) that gets as input a number \(0{\lt} \delta {\lt}1\), a sample \(S\) and a distribution \(D\), and returns with probability \(1- \delta \) a hypothesis \(h\) such that \(e_{S,D}(h) \le 1/2-\gamma \). By the definition of a \(\gamma \)-weak learner, for any distribution P and any sample of \(k\ge N(\delta )\) examples drawn i.i.d. from P it holds that:
\[ P[\,e_P(\mathcal{A}(S)){\gt}1/2 - \gamma \,] {\lt} \delta \]
Now \(\mathcal{A'}\) draws a new sample \(\overline{S}\) of \(|\overline{S}|=N(\delta )\) points i.i.d. from the distribution over \(S\) that puts weight \(D(i)\) on \((x_i,y_i)\). For a hypothesis learned from \(\overline{S}\), the true error with respect to this distribution is exactly the weighted empirical error on \(S\):
\[ e_P(\mathcal{A}(\overline{S}))=\sum _iD(i)\mathds {1}[\mathcal{A}(\overline{S})(x_i)\neq y_i]=e_{D,S}(\mathcal{A}(\overline{S})) \]
\(\mathcal{A'}\) then returns \(\mathcal{A}(\overline{S})\); applying the weak-learner guarantee to this distribution, with probability at least \(1-\delta \):
\[ e_{S,D}(\mathcal{A}(\overline{S}))=e_P(\mathcal{A}(\overline{S})) \le 1/2 - \gamma \]
(b)
Given a \(\gamma \)-weak learner \(\mathcal{A}\), let us construct an empirical weak learner \(\mathcal{A}_{weak}\) using 4.a, and call it in each AdaBoost round as \(\mathcal{A}_{weak}(S,D_t,\delta /T)\). By the guarantee of 4.a, each round satisfies \(\epsilon _t\le 1/2-\gamma \) with probability \(1-\delta /T\), and then \(\epsilon _t(1-\epsilon _t)\le 1/4 -\gamma ^2\) and
\[ Z_t=2\sqrt{\epsilon _t(1-\epsilon _t)}\le \sqrt{1-4\gamma ^2}\le e^{-2\gamma ^2} \]
so, by a union bound over the T rounds,
\[ P\left(e_s(g){\gt} e^{-2\gamma ^2T}\right)\le P \left( \prod _{t=1}^T Z_t{\gt} e^{-2\gamma ^2T} \right)\le \sum _{t=1}^T P\left(Z_t{\gt} e^{-2\gamma ^2}\right){\lt}T \times \delta /T=\delta \]
Hence with probability at least \(1-T\times \delta /T=1-\delta \):
\[ e_s(g)\le \prod _t^T Z_t \le e^{-2\gamma ^2T} \]
5.5 Programming Assignment
(a)
figure
Training error and test error of the classifier corresponding to each iteration t.
(b)
The 10 weak classifiers chosen by the algorithm:
1) h: (1, 27, 0.5) word: bad weight: 0.2678992723486787
2) h: (-1, 22, 0.5) word: life weight: 0.182869948120488
3) h: (-1, 31, 0.5) word: many weight: 0.15131544177021744
4) h: (1, 315, 0.5) word: worst weight: 0.15555522141776174
5) h: (-1, 282, 0.5) word: perfect weight: 0.19698743995337534
6) h: (1, 23, 0.5) word: stand weight: 0.12934453488530565
7) h: (-1, 17, 0.5) word: well weight: 0.1117544481762822
8) h: (1, 183, 0.5) word: looks weight: 0.11760123845680365
9) h: (-1, 107, 0.5) word: quite weight: 0.13419699857009101
10) h: (1, 373, 0.5) word: boring weight: 0.11035504620259538
"Descriptive" words chosen by the algorithm such as "bad", "worst" and "perfect" are most likely good indicators of a bad/good review, but it is more surprising to see words like "life", "well" and "stand"; these might be words that tend to appear in a positive context, like "life".
(c)
figure