1 Assignment 1
1.1 Linear Algebra
1.a
A symmetric matrix $A$ is PSD if and only if $x^\top A x \ge 0$ for every $x \in \mathbb{R}^n$, which holds if and only if every eigenvalue $\lambda_i$ of $A$ satisfies $\lambda_i \ge 0$.
Since $A$ is symmetric, it can be decomposed as $A = Q \Lambda Q^\top$ with $Q$ orthogonal and $\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_n)$.
Writing $y = Q^\top x$, we get $x^\top A x = y^\top \Lambda y = \sum_i \lambda_i y_i^2$, so $x^\top A x \ge 0$ for all $x$ exactly when all $\lambda_i \ge 0$.
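As a quick numerical companion to this equivalence, here is a minimal numpy sketch: it builds a symmetric PSD matrix and checks both the eigenvalue condition and the quadratic-form condition. The random matrix and the tolerance are illustrative choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T  # B B^T is symmetric and PSD by construction

# Eigenvalue test: for symmetric A, PSD <=> all eigenvalues are non-negative.
eigvals = np.linalg.eigvalsh(A)
print(np.all(eigvals >= -1e-10))

# Quadratic-form test on random vectors: x^T A x >= 0 for every sampled x.
xs = rng.standard_normal((1000, 4))
print(np.all(np.einsum('ij,jk,ik->i', xs, A, xs) >= -1e-10))
```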
1.b
For given PSD matrices $A$ and $B$ we have, for every $x$,
(*) $x^\top A x \ge 0$ and $x^\top B x \ge 0$.
Now let us apply (*) to $(A+B)$; we get
(**) $x^\top (A+B) x = x^\top A x + x^\top B x \ge 0$,
so from (*) and (**) we immediately get that $A+B$ is PSD.
However, the set of all $n \times n$ PSD matrices over $\mathbb{R}$ is not a vector space over $\mathbb{R}$, because it is not closed under scalar multiplication: for a nonzero PSD matrix $A$ and the scalar $-1$, the matrix $-A$ satisfies $x^\top(-A)x \le 0$, so it is not PSD.
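A short numerical check of both claims, on illustrative random matrices of my own choosing: the sum of two PSD matrices is PSD, while multiplying a nonzero PSD matrix by $-1$ breaks the property.

```python
import numpy as np

rng = np.random.default_rng(1)
make_psd = lambda: (lambda M: M @ M.T)(rng.standard_normal((3, 3)))
is_psd = lambda M: bool(np.all(np.linalg.eigvalsh(M) >= -1e-10))

A, B = make_psd(), make_psd()
print(is_psd(A + B))    # True: PSD matrices are closed under addition
print(is_psd(-1 * A))   # False: not closed under scalar multiplication
```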
1.2 Calculus and Probability
1.a
For i.i.d. continuous random variables $X_1,\dots,X_n$ with common CDF $F$, write the order statistics $X_{(1)} \le \dots \le X_{(n)}$. First, let us find the CDF of the maximum $X_{(n)}$:
$F_{X_{(n)}}(x) = \Pr(X_{(n)} \le x) = \Pr(X_1 \le x, \dots, X_n \le x) = \prod_{i=1}^n \Pr(X_i \le x) = F(x)^n, \qquad (*)$
where the product step holds because the variables are i.i.d.
Now, substituting the given distribution into (*) yields the required CDF; differentiating gives the density, and integrating gives the corresponding expectation.
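A small Monte Carlo sketch of the identity $F_{X_{(n)}}(x) = F(x)^n$; the Uniform$[0,1]$ distribution, $n$ and the evaluation point are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, x = 5, 200_000, 0.7

samples = rng.uniform(0, 1, size=(trials, n))
empirical = np.mean(samples.max(axis=1) <= x)   # empirical CDF of the maximum at x
theoretical = x ** n                            # F(x)^n with F(x) = x on [0, 1]
print(empirical, theoretical)                   # the two values should be close
```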
1.3 Optimal Classifiers and Decision Rules
1.a
Let $X$ and $Y$ be random variables where $Y$ takes values in a finite label set, and let $\ell$ be the 0-1 loss function defined in class; hence the optimal classifier is the one minimizing $\Pr(h(X) \ne Y)$.
Using Bayes' rule, the minimizer is the rule that predicts, for each $x$, the label with the largest posterior probability $\Pr(Y = y \mid X = x)$.
1.4 Optimal Classifiers and Decision Rules
1.b
To find a decision rule for the given class-conditional densities,
let us apply Bayes' rule on both sides of the comparison; after simplifying we get an expression involving $(x-\mu)^\top \Sigma^{-1}(x-\mu)$,
which is the squared Mahalanobis distance between $x$ and the class mean $\mu$.
So our simpler decision rule compares the squared Mahalanobis distances of $x$ to the two class means (together with the constant terms coming from the priors and normalizers).
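A minimal sketch of such a Mahalanobis-distance decision rule, assuming two Gaussian classes with a shared covariance and equal priors; the means and covariance below are illustrative values, not the assignment's.

```python
import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu):
    d = x - mu
    return d @ Sigma_inv @ d

def predict(x):
    # Predict the class whose mean is closer in Mahalanobis distance.
    return 0 if mahalanobis_sq(x, mu0) < mahalanobis_sq(x, mu1) else 1

print(predict(np.array([0.2, -0.1])), predict(np.array([1.8, 1.2])))  # 0 1
```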
1.c
When $d=1$ the covariance matrix is of size $d \times d = 1 \times 1$, so the decision boundary is just a point; in the same way, when $d=2$ the boundary is a line, and for general $d$ it is a $(d-1)$-dimensional surface (a hyperplane when the covariances are equal).
1.d
For the given parameters, we are looking for the equation obtained by substituting them into the decision rule derived above:
1.5 Programming Assignment
Visualizing the Hoeffding bound:
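A sketch of how such a visualization could be produced: compare the empirical probability that a Bernoulli(0.5) sample mean deviates from $0.5$ by more than $\epsilon$ with the Hoeffding bound $2e^{-2n\epsilon^2}$. The sample size and number of repetitions are illustrative choices, not the assignment's exact parameters.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n, repeats = 20, 100_000
means = rng.integers(0, 2, size=(repeats, n)).mean(axis=1)

eps = np.linspace(0.0, 0.5, 50)
empirical = [(np.abs(means - 0.5) > e).mean() for e in eps]
hoeffding = 2 * np.exp(-2 * n * eps**2)

plt.plot(eps, empirical, label="empirical")
plt.plot(eps, hoeffding, label="Hoeffding bound")
plt.xlabel("epsilon"); plt.ylabel("probability"); plt.legend()
plt.show()
```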
1.6 Nearest Neighbor:
1. The k-NN accuracy for the requested value of $k$: the classifier labels most of the test set correctly, and the accuracy rate is 0.882.
2. The best $k$ found gives the highest number of correct labels, with an accuracy rate of 0.883.
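A minimal sketch of the kind of k-NN accuracy scan used above; the digits dataset, the train/test split and the range of $k$ are stand-ins for the assignment's data, chosen only for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best_k, best_acc = None, 0.0
for k in range(1, 11):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```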
2 Assignment 2
2.1 PAC learnability of $\ell_2$-balls around the origin.
Given a real number $R > 0$, define the hypothesis $h_R(x) = \mathbb{1}[\|x\|_2 \le R]$, and we will prove that the hypothesis class $\mathcal{H} = \{h_R : R > 0\}$ is PAC learnable in the realizable case.
Let us design an algorithm $A$ that learns $\mathcal{H}$:
- Given a sample of size $N$, written $S = \{(u_1, y_1), \dots, (u_N, y_N)\}$,
- find the smallest ball $B_{\hat R}$ that is consistent with the sample,
- i.e. $\hat R = \max\{\|u_k\|_2 : y_k = 1\}$, the norm of the farthest positive example (so $\hat R \le R$).
- Since the returned ball is contained in the true ball, the algorithm can make a mistake only by labeling positive points as negative.
- The error of the algorithm is therefore $\mathrm{err}(h_{\hat R}) = \Pr\big(\hat R < \|x\|_2 \le R\big)$.
We assume that $\Pr(\hat R < \|x\|_2 \le R) > \epsilon$; otherwise, the returned hypothesis already satisfies the required property and we are finished. Now let us define $T$ to be the annulus obtained by shrinking the true boundary of radius $R$ towards the origin $(0,0)$ until $\Pr(x \in T) = \epsilon$. If the error of the algorithm exceeds $\epsilon$, then no sample point fell inside $T$; since the samples are drawn i.i.d., the probability of this event is at most
$(1-\epsilon)^N \le e^{-\epsilon N}.$
Setting $e^{-\epsilon N} \le \delta$, it suffices to take $N \ge \frac{1}{\epsilon}\ln\frac{1}{\delta}$.
We proved that there exists $N(\epsilon,\delta) = \lceil \frac{1}{\epsilon}\ln\frac{1}{\delta} \rceil$ such that for every $\epsilon, \delta \in (0,1)$ and every realizable distribution $P$ with labeling function in $\mathcal{H}$, when running $A$ on $N(\epsilon,\delta)$ training examples drawn i.i.d. from $P$, it returns a hypothesis that holds the property above with probability at least $1-\delta$. Moreover, we can notice that the sample complexity does not depend on the dimension $d$.
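A small simulation sketch of the learner described above: it returns the norm of the farthest positive example, which approaches the true radius from below as the sample grows. The true radius, the dimension and the sampling distribution are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
R_true, d = 1.0, 2

for N in (10, 100, 1000, 10_000):
    x = rng.uniform(-1.5, 1.5, size=(N, d))
    y = np.linalg.norm(x, axis=1) <= R_true            # realizable labels
    R_hat = np.linalg.norm(x[y], axis=1).max() if y.any() else 0.0
    print(N, R_hat)                                    # R_hat -> R_true from below
```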
2.2 PAC in Expectation.
Theorem
A hypothesis class $\mathcal{H}$ is PAC learnable if and only if $\mathcal{H}$ is PAC learnable in expectation.
Proof
($\Leftarrow$) Assume $\mathcal{H}$ is PAC learnable in expectation: by definition there exists an algorithm $A$ such that, with enough samples, $\mathbb{E}[L_P(A(S))] \le \epsilon'$ for any desired $\epsilon' > 0$.
Now, by Markov's inequality, we get
$\Pr\big(L_P(A(S)) \ge \epsilon\big) \le \frac{\mathbb{E}[L_P(A(S))]}{\epsilon}.$
Hence, for given $\epsilon, \delta$, let us define $\epsilon' = \epsilon\delta$; then $\Pr(L_P(A(S)) \ge \epsilon) \le \epsilon'/\epsilon = \delta$,
which matches the PAC definition, with the same algorithm $A$.
($\Rightarrow$) We saw before that the same algorithm $A$ works for both directions. Now, using the law of total expectation, if $A$ is a PAC learner then
$\mathbb{E}[L_P(A(S))] \le \epsilon \cdot \Pr\big(L_P(A(S)) \le \epsilon\big) + 1 \cdot \Pr\big(L_P(A(S)) > \epsilon\big) \le \epsilon + \delta,$
and since this holds for any $\epsilon, \delta$, taking both small enough gives the equivalence.
Union Of Intervals.
We can notice that any $2k$ distinct points on the real line can be shattered using $k$ intervals: it suffices to cover each maximal run of consecutive positive points with its own interval, and $2k$ points contain at most $k$ such runs. Now let us look at a set of $2k+1$ points; assume they are sorted, and label them alternately so that the odd-indexed points are positive. Then there are $k+1$ positive points separated by negative points, so $k+1$ disjoint intervals are needed to shatter the set (no interval may contain two consecutive points), which $k$ intervals cannot provide. Hence the VC dimension is $2k$.
Prediction by polynomials.
The VC dimension of $H$ is the size of the largest set of examples that can be shattered by $H$. The VC dimension is infinite if, for all $m$, there is a set of $m$ examples shattered by $H$.
For every $m$, take a sample set of $m$ distinct points $x_1,\dots,x_m$ and an arbitrary labeling $y_1,\dots,y_m \in \{0,1\}$. Using the hint, for any $n$ distinct values there exists a polynomial $P$ of degree $n-1$ attaining any prescribed values at those points; in particular, there is a polynomial of degree $m-1$ taking a positive value at each point labeled $1$ and a negative value at each point labeled $0$. Thresholding this polynomial therefore realizes the labeling, and this works for every labeling and every $m$.
Hence the VC dimension of $\mathcal{H}$ is infinite.
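A quick numerical illustration of the shattering argument: for $m$ distinct points and an arbitrary $\pm 1$ labeling, the interpolating polynomial of degree $m-1$ reproduces the labels after thresholding. The points and labels are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 6
x = np.sort(rng.uniform(-1, 1, size=m))
y = rng.choice([-1.0, 1.0], size=m)

coeffs = np.polyfit(x, y, deg=m - 1)     # exact interpolation: m points, degree m-1
preds = np.sign(np.polyval(coeffs, x))
print(np.array_equal(preds, y))          # True: this labeling is realized
```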
2.3 Structural Risk Minimization.
Let $\mathcal{H}_1,\dots,\mathcal{H}_k$ be $k$ finite hypothesis classes. Using the bound relating empirical and true errors, for any single class $\mathcal{H}_i$ and any $\delta_i > 0$, with probability at least $1-\delta_i$ every $h \in \mathcal{H}_i$ satisfies $|L_P(h) - L_S(h)| \le \sqrt{\frac{\ln(2|\mathcal{H}_i|/\delta_i)}{2m}}$.
Now, using the union bound over $i = 1,\dots,k$, the bounds hold simultaneously with probability at least $1-\sum_i \delta_i$;
for $\delta_i = \delta/k$ this probability is at least $1-\delta$.
(b)
Let $\hat h$ be the hypothesis returned by SRM, and let $j$ be the index of the class $\mathcal{H}_j$ containing it.
Now let us subtract and add the empirical error of $\hat h$ in the decomposition of its true error; using the ERM property of $\hat h$ within its class and the bound from (a), the excess error is controlled by twice the confidence term.
Hence, with probability at least $1-\delta$, the SRM bound holds.
2.4 Programming Assignment Union Of Intervals
(a)
Let the true distribution be such that $x$ is distributed uniformly on the interval $[0, 1]$, and $\Pr(y = 1 \mid x)$ takes the given larger value on the specified union of intervals and a smaller value elsewhere.
Hence we are looking for the hypothesis minimizing the true error,
which, for each $x$, predicts the more likely label.
Since $x$ is distributed uniformly on $[0, 1]$, the optimal hypothesis predicts $1$ exactly on the intervals where $\Pr(y = 1 \mid x) > \tfrac12$.
(b)
From the plot, we can notice that the empirical error increases with the number of samples taken, since the probability of seeing samples outside the three intervals increases; moreover, we can see that the empirical error approaches 0.15, which sits between the false-positive and false-negative error rates from section (a). The true error decreases as we test more samples, since with more data the learned hypothesis gets closer to the optimal one for the real distribution.
(c)
The empirical risk decreases as $k$ increases, since the ERM algorithm has more disjoint intervals to choose from for the given data, so it can cover more samples. On the other hand, we can notice that the true error increases, since the model overfits the sample set; the $k$ equal to the number of intervals in the true distribution is the one with the best behaviour.
(d)
We can notice that the best hypothesis is the one for which the sum of the penalty and the empirical error is minimized.
(e)
Best hypothesis found:
After drawing the data, we can notice that the hypothesis selected satisfies the holdout guarantee, and its error is close to the optimal true error.
3 Assignment 3
3.1 Step-size Perceptron.
Consider the modification of Perceptron algorithm with the following update rule:
whenever a mistake is made. Assume that the data is separable with margin $\gamma$ and that the norms of the examples are bounded for all $t$. For any iteration, the Perceptron's $i$'th iterate takes the form of a weighted sum of the examples on which mistakes were made;
after $M$ mistakes, the following lower bound on the inner product with the true separator holds,
and the corresponding upper bound on the norm of the iterate is obtained
using the Cauchy-Schwarz inequality; combining the two bounds yields the bound on the number of mistakes $M$.
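As a rough companion to this analysis, here is a minimal sketch of a Perceptron with a decaying step size on synthetic separable data; the schedule $\eta_t = 1/\sqrt{t}$, the data and the number of passes are illustrative assumptions, not necessarily the assignment's exact update rule.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 200, 2
w_true = np.array([1.0, -1.0])
X = rng.uniform(-1, 1, size=(n, d))
y = np.sign(X @ w_true)

w, t, mistakes = np.zeros(d), 0, 0
for _ in range(20):                      # a few passes over the data
    for x_i, y_i in zip(X, y):
        t += 1
        if y_i * (w @ x_i) <= 0:         # mistake: update with step size 1/sqrt(t)
            w += (1.0 / np.sqrt(t)) * y_i * x_i
            mistakes += 1
print(mistakes, np.mean(np.sign(X @ w) == y))
```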
3.2 Convex functions.
2.1
Let $f$ and $g$ be convex functions. For some $\lambda \in [0,1]$ and points $x, y$, we would like the graph of the function on the interval between $x$ and $y$ to fall below or on the chord. We can notice that,
using Jensen's inequality (the definition of convexity) for $f$ and for $g$ separately and adding the two inequalities,
the sum of the two convex functions also satisfies the convexity property over the same domain.
2.2
Now let us consider convex functions $f_1,\dots,f_k$, and we will prove that their pointwise maximum is also convex. Using the property from section (a) applied to each $f_i$, we take the maximum of both sides of the convexity inequality;
hence we can bound the maximum at a convex combination by the convex combination of the maxima, i.e.
$\max_i f_i$ is convex.
2.3
Let $\ell$ be the log loss, defined by $\ell(w) = \log\big(1 + e^{-y\langle w, x\rangle}\big)$.
We know $\ell$ is convex iff the convexity inequality holds for every pair of points;
using sections (a) and (b), it suffices to express $\ell$ in terms of functions that are already known to be convex, as defined in the exercise.
Let us set the relevant domain: it is a convex set, and any point in it can be written as a convex combination, hence the loss at that point can be written accordingly.
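A numerical midpoint check (not a proof) of the convexity of the log loss $\ell(w)=\log(1+e^{-y\langle w,x\rangle})$ along random segments; the particular $x$, $y$ and sampled points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
x, y = rng.standard_normal(3), 1.0
loss = lambda w: np.log1p(np.exp(-y * (w @ x)))

ok = True
for _ in range(1000):
    w1, w2 = rng.standard_normal(3), rng.standard_normal(3)
    lam = rng.uniform()
    ok &= loss(lam * w1 + (1 - lam) * w2) <= lam * loss(w1) + (1 - lam) * loss(w2) + 1e-12
print(ok)   # True: the convexity inequality holds on every sampled triple
```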
GD with projection.
3.1
Let $x$ and the set $K$ be given, where by assumption $K$ is a convex set; hence any point on the segment between the projection of $x$ and a point $z \in K$ can be written as a convex combination that remains in $K$.
For any such $z$,
the following holds for any choice of the combination weight, since the right-hand side can be made as small as we wish for a given $z$; on the other hand, the right-hand side could be less than $0$ for some $y$ strictly closer to $x$ than the projection, and we would get a contradiction to the minimality of the projection.
Now let us look at some $z \in K$ and choose a suitable point on the segment to conclude.
3.2
Theorem
GD with projection satisfies the convergence theorem: given a desired accuracy, set the step size accordingly and run GD with projection for the corresponding number of iterations.
Proof
GD with projection still satisfies the convergence theorem. Using Jensen's inequality and the convexity property, the suboptimality of the averaged iterate is bounded by the average of the per-step terms;
using the standard identity relating $\|x_{t+1}-x^\ast\|^2$, $\|x_t-x^\ast\|^2$, the step size and the gradient,
and using the result from 3.1, we know that the projection step can only decrease the distance to any point of the convex set; assuming bounded gradients, we get a telescoping sum.
For any desired accuracy, plugging in the step size and the number of iterations completes the proof.
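A minimal projected gradient descent sketch consistent with the setting above: minimize $f(x)=\|x-c\|^2$ over the unit $\ell_2$ ball. The objective, the constraint set and the step size are illustrative choices.

```python
import numpy as np

c = np.array([2.0, 1.0])
grad = lambda x: 2 * (x - c)              # gradient of ||x - c||^2

def project_unit_ball(x):
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

x, eta = np.zeros(2), 0.1
for _ in range(100):
    x = project_unit_ball(x - eta * grad(x))
print(x, np.linalg.norm(x))    # converges to c / ||c||, the closest feasible point
```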
3.3 Gradient Descent on Smooth Functions.
Let $f$ be a $\beta$-smooth and non-negative function. We consider the gradient descent algorithm applied to $f$ with constant step size $\eta$:
now let us compute the decrease in function value between consecutive iterates using the smoothness inequality;
since $f$ is non-negative, summing these decreases lets us bound the sum of the squared norms of the gradients.
Thus, since the function values cannot tend to $-\infty$, the sequence of squared gradient norms is summable, and therefore the gradient at every limit point of the GD iterates is equal to zero; let us define the relevant constant accordingly.
Hence, for some $c$ (depending on the value of the step size and $\beta$), the claimed bound holds.
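A small sketch of this behaviour on a $\beta$-smooth, non-negative quadratic $f(x)=\tfrac12\|Ax\|^2$ with step size $1/\beta$; the matrix, dimension and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((3, 3))
H = A.T @ A                                # f(x) = 0.5 * x^T H x, gradient H x
beta = np.linalg.eigvalsh(H).max()         # smoothness constant

x, eta = rng.standard_normal(3), 1.0 / beta
grad_norms = []
for _ in range(200):
    g = H @ x
    grad_norms.append(np.linalg.norm(g))
    x = x - eta * g
print(grad_norms[0], grad_norms[-1])       # the gradient norm shrinks towards zero
```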
3.4 Programming Assignment
SGD for Hinge loss
(c)
(d)
SGD for log-loss.
(a)
(b)
(c)
4 Assignment 4
SVM with multiple classes.
Define the following multiclass SVM problem:
First, let us notice that if all the constraints are satisfied, then $f$ is a sum of zeros; hence $f$ is a non-negative function. Since we assume the data is linearly separable, we can consider the actual separator of the data; it will be sufficient to see that, after plugging it in, any minimizer of the objective leads to $0$ errors. So we just need to make sure that any update that changes a prediction on the data would not decrease the objective below that value.
The relevant example is a support vector of the data for some class pair; according to the max-margin hyperplane property, the true separator maximizes the minimum distance over all examples, and now we just need to see that this minimum margin is attained,
since any other separator has a margin that is no larger.
Hence, after finding the actual max-margin hyperplane, any minimizer applied to the separable data leads to $0$ errors.
Soft-SVM.
Consider the soft-SVM problem with separable data:
Let $w^\ast$ be the solution of hard-SVM, and let $\hat w$ be the solution of soft-SVM. Since $w^\ast$ is a feasible solution for the soft-SVM problem (with all slack variables equal to $0$), I claim that the following holds for the two objective values.
Since all the slack variables are non-negative, any minimizer of the problem drives them to the smallest feasible values; we can notice that, for any example,
a point with positive slack is within the margin or located on the other side of the margin boundary, but none of them crosses the separating hyperplane, hence the data remains separated.
Separability using polynomial kernel.
Let $x_1,\dots,x_n$ be distinct real numbers, and let the kernel degree be an integer as given. For separable data, hard-SVM yields zero training errors. Let us write the polynomial kernel in binomial form:
multiplying each row of the Vandermonde matrix by the corresponding constant from the binomial expansion above,
the binomial form is obtained as an inner product, and the kernel matrix factors accordingly. Using the fact that a Vandermonde matrix of distinct points has full rank, the lemma holds here, and hard-SVM yields zero training errors.
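A hedged sklearn sketch of this claim: with a polynomial kernel of degree $n-1$ (and a large $C$ approximating hard-SVM), any labeling of $n$ distinct real numbers is fitted with zero training errors. The points, labels and hyperparameters are illustrative stand-ins.

```python
import numpy as np
from sklearn.svm import SVC

n = 8
x = np.linspace(-1, 1, n).reshape(-1, 1)            # distinct real numbers
y = np.array([1, -1, -1, 1, 1, -1, 1, -1])          # an arbitrary labeling

clf = SVC(kernel="poly", degree=n - 1, coef0=1.0, C=1e6)
clf.fit(x, y)
print(clf.score(x, y))                              # 1.0: zero training errors
```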
Expressivity of ReLU networks.
- If …, then …; if …, then ….
- If …, then …; if …, then ….
- …
4(b)
(*) We could achieve the same result with 3 neurons, since
4.1 Implementing boolean functions using ReLU networks.
Consider $n$ boolean input variables $x_1,\dots,x_n \in \{0,1\}$; let us construct a neural network with ReLU activations which implements the AND function:
we can express AND as $\mathrm{AND}(x_1,\dots,x_n) = \mathrm{ReLU}\!\left(\sum_{i=1}^n x_i - (n-1)\right)$.
Hence we can build a ReLU network with one hidden layer and a constant extra input (bias), where the input weights are all $1$ and the bias weight is $-(n-1)$.
By connecting all the inputs to a single neuron, and connecting the constant to it as well, we obtain the AND function.
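A direct check of this one-neuron construction, $\mathrm{AND}(x)=\mathrm{ReLU}(\sum_i x_i - (n-1))$, over all boolean inputs; $n=4$ is chosen only for illustration.

```python
import numpy as np
from itertools import product

n = 4
relu = lambda z: np.maximum(z, 0)

for bits in product([0, 1], repeat=n):
    x = np.array(bits)
    out = relu(x.sum() - (n - 1))        # single hidden neuron with bias -(n-1)
    assert out == int(all(bits))         # matches the AND function on this input
print("AND implemented correctly for all", 2 ** n, "inputs")
```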
4.2 Programming Assignment.
SVM
![\includegraphics[width=\textwidth ]{plot 4.1.png}](images/img-0017.png)
Homogeneous polynomial kernel.
Here, the polynomial kernel of degree 2 fits the data better, since the data has a circular shape, and the kernel trick needs a degree-2 mapping to make the classes separable by a hyperplane.
We can see a better fit in the right-hand side model: the independent term in the kernel function gives the needed correction, but the degree-2 polynomial still fits better here.
![\includegraphics[width=\textwidth ]{plot 4.2.png}](images/img-0018.png)
Non-Homogeneous polynomial kernel.
![\includegraphics[width=\textwidth ]{plot 4.3.png}](images/img-0019.png)
Figure: polynomial kernel and RBF kernel.
The RBF kernel generalizes better on the noisy data, since it can separate the data in a higher-dimensional space than the polynomial kernel, which becomes more "elliptic" in trying to cover the noisy points.
Figure: RBF kernel with different values.
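A sketch comparing a degree-2 polynomial kernel with an RBF kernel on circular, slightly noisy data, in the spirit of the plots above; the dataset, noise level and hyperparameters are illustrative stand-ins for the assignment's setup.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

poly = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)
print("poly train acc:", poly.score(X, y))
print("rbf  train acc:", rbf.score(X, y))
```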
4.3 Neural Networks.
![\includegraphics[width=\textwidth ]{plot 4.4.png}](images/img-0021.png)
![\includegraphics[width=\textwidth ]{plot 4.5.png}](images/img-0022.png)
![\includegraphics[width=100mm]{plot 4.6.png}](https://saarbk.github.io/assets/img/images/img-0023.png)
From the plots we can learn that the best value for the learning rate is 0.1: with a larger value, a single step may take us far away from the optimal minimizer, while with a smaller value we progress less effectively and learn more slowly.
![\includegraphics[width=\textwidth ]{plot 4.7.jpg}](images/img-0024.jpg)
Figure: accuracy in the final epoch.
5 Assignment 5
5.1 Suboptimality of ID3.
Consider the following training set, where the inputs and their labels are as given:
We wish to use this training set in order to build a decision tree of depth 2. For each split, let us calculate the gain as the mutual information,
where the gain is defined via the entropy of the labels before and after the split.
Hence the first split will be on the chosen attribute, with the conflicting samples going down the same branch; now, if we choose to ask about one of the remaining attributes, two differently-labeled samples map to the same leaf, and if we ask about the other attribute, again two differently-labeled samples map to the same leaf. It follows that, no matter which split ID3 chooses, at least one training sample is classified wrongly, and the training error is at least one mistake over the size of the training set.
(b)
5.2 AdaBoost
(a)
Let $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$ be the sample and its labels. We run the AdaBoost algorithm as given in the lecture, and we are at iteration $t$, assuming the current weak hypothesis has weighted error less than $\tfrac12$. The empirical error of $h_t$ at iteration $t+1$ of the AdaBoost run is defined by its weighted error with respect to the updated distribution:
Now we can notice that the error of $h_t$ with respect to the updated distribution can be written as
the first equality holds since we divide the weight of the mistakes by the total weight of the distribution, the second by the definition of the distribution update and the weights, and the third using the lemma from class.
(b)
Let us assume that AdaBoost picks the same hypothesis twice consecutively, that is $h_{t+1} = h_t$.
Using the result of section (a), we know that the error of $h_t$ with respect to the updated distribution is exactly $\tfrac12$, and we get a contradiction to the fact that the returned hypothesis comes from a weak learner, whose weighted error must be strictly less than $\tfrac12$.
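A numeric check of the fact used here: after AdaBoost's reweighting, the hypothesis chosen at round $t$ has weighted error exactly $\tfrac12$ under the updated distribution. The toy labels and predictions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(10)
m = 20
y = rng.choice([-1, 1], size=m)
h = rng.choice([-1, 1], size=m)           # predictions of some weak hypothesis h_t

D_t = np.full(m, 1.0 / m)                 # current distribution D_t
eps = D_t[h != y].sum()                   # weighted error of h_t under D_t
alpha = 0.5 * np.log((1 - eps) / eps)

D_next = D_t * np.exp(-alpha * y * h)     # AdaBoost weight update
D_next /= D_next.sum()
print(D_next[h != y].sum())               # 0.5 (up to floating point)
```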
5.3 Sufficient Condition for Weak Learnability.
Let $S$ be a training set and let $\mathcal{H}$ be a hypothesis class. Assume that there exist $\gamma > 0$, hypotheses $h_1,\dots,h_T \in \mathcal{H}$ and coefficients forming a distribution, for which the following holds:
Let $D$ be any distribution over $S$. Then, taking expectations with respect to $D$ of both sides of the equation,
the equality holds by the linearity of expectation and by the definition of the coefficients, which constitute a distribution.
On the other hand, we can write the expectation as a sum of indicators, that is:
Combining the last two equations, we get that some $h_j$ has weighted error with respect to $D$ bounded away from $\tfrac12$, i.e. the weak learnability condition holds.
(b)
Let $S$ be a training set that is realized by a $d$-dimensional hyper-rectangle classifier, and let $H$ be the class of decision stumps.
following the hint, set:
and
Then, for some sample, we can notice that a point is inside the hyper-rectangle iff the corresponding one-dimensional condition holds for every coordinate; let us define the stumps accordingly, s.t.
now, plugging in, the two parts are equal whenever all the stumps agree in the inner sum, and this maximizes the sum; whenever exactly one stump outputs $-1$ and all the others output $1$ we get the borderline case, and in any other case the sign is negative. Hence, by choosing the coefficients appropriately,
using section (a) gives us the result.
5.4 Comparing notions of weak learnability.
Given a $\gamma$-weak learner $A$, let us construct a learner $B$ that gets as input a number $\delta$, a sample $S$ and a distribution $D$, and returns with the required probability a hypothesis satisfying the weak-learning guarantee. By setting the parameters appropriately and running $A$ on the samples, it holds that:
Now, by drawing new samples from the reweighted distribution, the weighted error with respect to $D$ of the returned hypothesis leads us to:
Since the failure probability is small enough, $B$ will return the required hypothesis, hence:
(b)
Given a $\gamma$-weak learner $A$, let us construct a learner using part (a) by plugging in the appropriate parameters. Using its properties, it yields that, with the stated probability, both of the following hold:
and
Hence, with the required probability, the claim follows.
5.5 Programming Assignment
(a)
![\includegraphics[width=\textwidth ]{myplot2.png}](images/img-0026.png)
Figure: the training error and the test error of the classifier corresponding to each iteration $t$.
(b)
The 10 weak classifiers the algorithm chose:
1. h: (1, 27, 0.5), word: bad, weight: 0.2678992723486787
2. h: (-1, 22, 0.5), word: life, weight: 0.182869948120488
3. h: (-1, 31, 0.5), word: many, weight: 0.15131544177021744
4. h: (1, 315, 0.5), word: worst, weight: 0.15555522141776174
5. h: (-1, 282, 0.5), word: perfect, weight: 0.19698743995337534
6. h: (1, 23, 0.5), word: stand, weight: 0.12934453488530565
7. h: (-1, 17, 0.5), word: well, weight: 0.1117544481762822
8. h: (1, 183, 0.5), word: looks, weight: 0.11760123845680365
9. h: (-1, 107, 0.5), word: quite, weight: 0.13419699857009101
10. h: (1, 373, 0.5), word: boring, weight: 0.11035504620259538

The "description words" chosen by the algorithm, e.g. "bad", "worst", "perfect", are most likely good indicators of a bad/good review, but it is more surprising to see words like "life", "well", "stand"; these might be words that usually appear in a positive context, like "life".
(c)
![\includegraphics[width=\textwidth ]{myplot3.png}](images/img-0027.png)