Readings

This page contains pointers to a collection of optional readings, in case you wish to delve further into the topics covered in lectures.

Probabilistic and game-theoretic formulation of prediction problems

The following textbooks describe probabilistic formulations of prediction problems:
`A Probabilistic Theory of Pattern Recognition,' L. Devroye, L. Gyorfi, G. Lugosi, Springer, New York, 1996. [ebook]
`The Nature of Statistical Learning Theory,' Vladimir N. Vapnik, Wiley, 1995. [ebook]
`Neural Network Learning: Theoretical Foundations,' Martin Anthony and Peter L. Bartlett, Cambridge University Press, 1999, 2010. [ebook]
`An Elementary Introduction to Statistical Learning Theory,' Sanjeev Kulkarni and Gilbert Harman, Wiley, 2011. [ebook]
`Foundations of Machine Learning,' Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, MIT Press, 2012.
`Understanding Machine Learning: From Theory to Algorithms,' Shai Shalev-Shwartz and Shai Ben-David, Cambridge University Press, 2014.

This textbook describes game-theoretic formulations of prediction problems:
`Prediction, Learning, and Games.' N. Cesa-Bianchi and G. Lugosi, Cambridge University Press, 2006. [ebook]

See also the following review papers.
'Theory of Classification: a Survey of Recent Advances.' Stephane Boucheron, Olivier Bousquet and Gabor Lugosi.
'Online learning and online convex optimization.' Shai Shalev-Shwartz.

Perceptron Algorithm

The argument giving the minimax lower bound for linear threshold functions is similar to the proof of the main result in the following paper.
`A General Lower Bound on the Number of Examples Needed for Learning.' A. Ehrenfeucht, D. Haussler, M. Kearns and L. Valiant.

The following are old (1987 and 1990) revisions of older (1969 and 1965, respectively) books on linear threshold functions, the perceptron algorithm, and the perceptron convergence theorem.
Perceptrons: An Introduction to Computational Geometry. Marvin L. Minsky, Seymour A. Papert, MIT Press, 1987.
The Mathematical Foundations of Learning Machines, Nilsson, N., San Francisco: Morgan Kaufmann, 1990.

The upper bound on risk for the perceptron algorithm that we saw in lectures follows from the perceptron convergence theorem and from results converting mistake-bounded algorithms into average risk bounds. The following paper reviews these results.
Large margin classification using the perceptron algorithm. Yoav Freund and Robert E. Schapire.
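
For reference, the standard perceptron convergence theorem states that if every example satisfies $\|x_t\| \le R$ and some unit vector $w^*$ achieves margin $\gamma > 0$, i.e. $y_t \langle w^*, x_t \rangle \ge \gamma$ for all $t$, then the perceptron makes at most $(R/\gamma)^2$ mistakes on the sequence, regardless of its length. Combining a mistake bound of this kind with an online-to-batch conversion (running the algorithm on $n$ i.i.d. examples and predicting with a suitably chosen iterate) gives an average risk bound of order $(R/\gamma)^2 / n$; see the paper above for a precise statement.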

Risk Bounds and Uniform Convergence

The following books give more detail on concentration inequalities. See the first, in particular, for properties of sub-Gaussian and pre-Gaussian random variables.
Metric Characterization of Random Variables and Random Processes. V. Buldygin and I. Kozachenko. American Mathematical Society, 2000.
Concentration Inequalities: A Nonasymptotic Theory of Independence. S. Boucheron, G. Lugosi and P. Massart. Oxford University Press, 2013.


The Hoeffding-Azuma and bounded differences inequalities:
`Weighted sums of certain dependent random variables.' Kazuoki Azuma. 1967.
`Probability inequalities for sums of bounded random variables.' Wassily Hoeffding. 1963.
`On the method of bounded differences.' Colin McDiarmid. 1989.
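
As a reminder of the statement (in its standard form): if $D_1, \dots, D_n$ is a martingale difference sequence with $|D_i| \le c_i$ almost surely, then for all $t > 0$,
$$ \Pr\left( \sum_{i=1}^n D_i \ge t \right) \le \exp\left( -\frac{t^2}{2 \sum_{i=1}^n c_i^2} \right). $$
Hoeffding's inequality is the special case of independent, centered, bounded summands, and the bounded differences (McDiarmid) inequality follows by applying the bound to the Doob martingale of a function with bounded differences.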

Rademacher averages

See these papers for various versions of Rademacher complexities:
'Rademacher penalties and structural risk minimization.' Vladimir Koltchinskii.
'Model selection and error estimation.' P. Bartlett, S. Boucheron and G. Lugosi.

See also:
`Rademacher and Gaussian complexities: risk bounds and structural results.' P. L. Bartlett and S. Mendelson.
`A few notes on Statistical Learning Theory.' Shahar Mendelson.
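
The papers above use slightly different normalizations; one common convention defines the empirical Rademacher complexity of a class $F$ on a sample $x_1, \dots, x_n$ as
$$ \hat{\mathfrak{R}}_n(F) = \mathbb{E}_\sigma \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i), $$
where $\sigma_1, \dots, \sigma_n$ are independent uniform $\pm 1$ variables, and $\mathfrak{R}_n(F)$ for its expectation over the sample. Under this convention, a typical uniform convergence bound for $[0,1]$-valued classes states that with probability at least $1 - \delta$,
$$ \sup_{f \in F} \left( P f - P_n f \right) \le 2\, \mathfrak{R}_n(F) + \sqrt{\frac{\ln(1/\delta)}{2n}}. $$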

The `finite lemma' (on Rademacher averages of finite sets) is Lemma 5.2 in this paper, which also introduces local Rademacher averages.
`Some applications of concentration inequalities to statistics.' Pascal Massart.
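
In the notation above, the finite lemma (often called Massart's lemma) says that for a finite set $A \subset \mathbb{R}^n$,
$$ \mathbb{E}_\sigma \max_{a \in A} \frac{1}{n} \sum_{i=1}^n \sigma_i a_i \le \frac{\max_{a \in A} \|a\|_2 \sqrt{2 \ln |A|}}{n}. $$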

The contraction inequality is Corollary 3.17 in this book.
Probability in Banach Spaces: Isoperimetry and Processes. M. Ledoux and M. Talagrand. Springer, 1991.
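
The form usually quoted in learning theory (the constant depends on whether absolute values appear inside the supremum) is: if $\phi_1, \dots, \phi_n$ are $L$-Lipschitz functions, then
$$ \mathbb{E}_\sigma \sup_{f \in F} \sum_{i=1}^n \sigma_i \, \phi_i(f(x_i)) \le L\, \mathbb{E}_\sigma \sup_{f \in F} \sum_{i=1}^n \sigma_i f(x_i). $$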

Vapnik-Chervonenkis dimension

The growth function, VC-dimension, and pseudodimension are described in the following text (see Chapter 3). Estimates of these quantities for parameterized function classes, such as neural networks, are covered in Chapters 7 and 8.
`Neural network learning: Theoretical foundations.' Martin Anthony and Peter Bartlett. Cambridge University Press. 1999. [ebook]
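
For quick reference (standard definitions, consistent with the text above): for a class $F$ of $\{0,1\}$-valued functions, the growth function is
$$ \Pi_F(n) = \max_{x_1, \dots, x_n} \left| \left\{ (f(x_1), \dots, f(x_n)) : f \in F \right\} \right|, $$
the VC-dimension $d$ is the largest $n$ for which $\Pi_F(n) = 2^n$ (or $\infty$ if no such largest $n$ exists), and the Sauer-Shelah lemma gives $\Pi_F(n) \le \sum_{i=0}^{d} \binom{n}{i} \le (en/d)^d$ for $n \ge d \ge 1$.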

Covering numbers

This is the classic survey paper on covering and packing numbers and metric entropy:
A. N. Kolmogorov and V. M. Tihomirov. epsilon-entropy and epsilon-capacity of sets in functional spaces. Uspekhi Mat. Nauk, 1959, vol. 14, no. 2, pp. 3-86.

Haussler's bound on packing numbers for VC-classes is here:
Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. David Haussler. Journal of Combinatorial Theory, Series A, 69(2):217-232, 1995.
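
The bound has the following form (see the paper for the exact statement and constants): for a class of VC-dimension $d$, the $\epsilon$-packing number with respect to the empirical $L_1$ distance satisfies
$$ \mathcal{M}\big(\epsilon, F, L_1(P_n)\big) \le e(d+1) \left( \frac{2e}{\epsilon} \right)^d. $$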

Analogous results for scale-sensitive dimensions are in these papers:

Scale-sensitive dimensions, uniform convergence, and learnability. N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Journal of the ACM, 44(4):615-631, 1997.

Entropy and the Combinatorial Dimension. S. Mendelson, R. Vershynin. arXiv:math/0203275.

Convex losses for classification

The definitions and results we covered in the lecture are from this paper:

Convexity, classification and risk bounds. Peter L. Bartlett, Michael Jordan, and Jon McAuliffe. Journal of the American Statistical Association, 101(473):138–156, 2006.
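
Sketched briefly (see the paper for the precise conditions and the definition of the $\psi$-transform): for a classification-calibrated surrogate loss $\phi$, there is a nondecreasing function $\psi$ with $\psi(0) = 0$ and $\psi(\theta) > 0$ for $\theta > 0$ such that
$$ \psi\big( R(f) - R^* \big) \le R_\phi(f) - R_\phi^*, $$
so driving the excess $\phi$-risk to zero forces the excess classification risk to zero.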

See also:

Statistical behavior and consistency of classification methods based on convex risk minimization. Tong Zhang. Annals of Statistics, 32(1):56–134, 2004.

Several authors have considered similar questions for the multiclass case:

Statistical analysis of some multi-category large margin classification methods. Tong Zhang. Journal of Machine Learning Research, 5:1225–1251, 2004.
On the consistency of multiclass classification methods. Ambuj Tewari and Peter L. Bartlett. Journal of Machine Learning Research, 8:1007–1025, 2007.
How to compare different loss functions and their risks. Ingo Steinwart. Constructive Approximation, 26:225–287, 2007.
Convex Calibration Dimension for Multiclass Loss Matrices. Harish G. Ramaswamy and Shivani Agarwal. arXiv:1408.2764.

Kernel Methods

The following texts give a detailed treatment of kernel methods:

Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. Schölkopf, Bernhard and Smola, Alexander J. MIT Press. 2002. [ebook]

Kernel Methods for Pattern Analysis. Shawe-Taylor, John and Cristianini, Nello. Cambridge University Press. 2004. [ebook]

Support Vector Machines. I. Steinwart and A. Christmann. Springer, 2008. [ebook]

These papers introduced the hard margin and soft margin SVMs:
A training algorithm for optimal margin classifiers. B. Boser, I. Guyon, and V. Vapnik. In Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, 1992.

Support-Vector Networks. Corinna Cortes and Vladimir Vapnik. Machine Learning, 20:273-297, 1995.
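
For reference, the soft margin SVM introduced in the second paper solves (in its standard primal form)
$$ \min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i\big( \langle w, x_i \rangle + b \big) \ge 1 - \xi_i, \quad \xi_i \ge 0, $$
and the hard margin version requires $\xi_i = 0$ (equivalently, it is the limit $C \to \infty$).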

The following textbook gives a good treatment of constrained optimization problems and Lagrangian duality (see Chapter 5):

Convex Optimization. S. Boyd and L. Vandenberghe.

The original representer theorem:
Some results on Tchebycheffian Spline Functions. G. Kimeldorf and G. Wahba. J. Mathematical Analysis and Applications, 33(1):82-95, 1971.
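
A standard modern generalization of the Kimeldorf-Wahba result: if $\mathcal{H}$ is a reproducing kernel Hilbert space with kernel $k$ and $\Omega$ is nondecreasing, then the regularized empirical risk
$$ \frac{1}{n} \sum_{i=1}^n \ell\big( y_i, f(x_i) \big) + \Omega\big( \|f\|_{\mathcal{H}} \big) $$
has a minimizer of the form $f(\cdot) = \sum_{i=1}^n \alpha_i \, k(x_i, \cdot)$ for some $\alpha \in \mathbb{R}^n$ (and if $\Omega$ is strictly increasing, every minimizer has this form).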

Ensemble Methods

Some early boosting papers:
A decision-theoretic generalization of on-line learning and an application to boosting. Yoav Freund and Robert E. Schapire.

Experiments with a new boosting algorithm. Yoav Freund and Robert E. Schapire.
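
As a companion to these papers, here is a minimal sketch of the AdaBoost weight update (not code from the papers; the weak_learner interface is hypothetical and is assumed to return a classifier fit to the weighted sample, predicting values in {-1, +1}):

    import numpy as np

    def adaboost(X, y, weak_learner, rounds):
        # Minimal AdaBoost sketch: y is a numpy array with entries in {-1, +1};
        # weak_learner(X, y, w) is assumed to return a classifier h with
        # h(X) an array of {-1, +1} predictions, fit to the sample weighted by w.
        n = len(y)
        w = np.full(n, 1.0 / n)                   # uniform initial weights
        hypotheses, alphas = [], []
        for _ in range(rounds):
            h = weak_learner(X, y, w)
            pred = h(X)
            eps = float(np.sum(w * (pred != y)))  # weighted training error
            if eps <= 0.0 or eps >= 0.5:          # stop on a perfect or useless learner
                break
            alpha = 0.5 * np.log((1.0 - eps) / eps)
            w = w * np.exp(-alpha * y * pred)     # up-weight the mistakes
            w = w / w.sum()
            hypotheses.append(h)
            alphas.append(alpha)
        # final classifier: sign of the weighted vote of the weak hypotheses
        return lambda X_new: np.sign(sum(a * h(X_new) for a, h in zip(alphas, hypotheses)))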

This paper contains the result in lectures about the relationship between the existence of weak learners and the existence of a large margin convex combination. It also contains bounds on the misclassification probability of a large margin classifier.
Boosting the margin: A new explanation for the effectiveness of voting methods. Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee.

Some surveys of boosting.
The boosting approach to machine learning: An overview. Robert E. Schapire.

Boosting: Foundations and Algorithms. Robert E. Schapire and Yoav Freund. MIT Press. 2012. [ebook]

Two other views of boosting algorithms.
Arcing classifiers. Leo Breiman.
Additive logistic regression: a statistical view of boosting. Jerome Friedman, Trevor Hastie and Robert Tibshirani.

An extension of AdaBoost to real-valued base classifiers.
Improved boosting algorithms using confidence-rated predictions. R. E. Schapire and Y. Singer.

The analysis in lectures of AdaBoost as an entropy projection method follows Chapter 8 of Schapire and Freund's text, which in turn is based on this paper and the references therein:
Logistic Regression, AdaBoost and Bregman Distances. M. Collins, R. E. Schapire and Y. Singer.

Model Selection

The oracle inequality for complexity-regularized model selection is from this paper:
Model selection and error estimation. Peter L. Bartlett, Stéphane Boucheron and Gábor Lugosi.
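
Schematically (the precise penalties and constants are in the paper): the selection rule chooses the class index
$$ \hat{k} = \arg\min_k \left( \hat{R}_n(\hat{f}_k) + \mathrm{pen}_n(k) \right), $$
where $\hat{f}_k$ minimizes empirical risk in class $F_k$, and the oracle inequality bounds the risk of $\hat{f}_{\hat{k}}$ by roughly
$$ \min_k \left( \inf_{f \in F_k} R(f) + c \, \mathrm{pen}_n(k) \right) $$
plus a lower-order term, provided the penalties dominate the uniform deviations within each class.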

This talk has some more on oracle inequalities and universal consistency:
Regression Methods for Pattern Classification: Statistical Properties of Large Margin Classifiers. Peter Bartlett.

The universal consistency analysis in lectures for AdaBoost follows Chapter 12 of Schapire and Freund's text, which is based on
these papers:
AdaBoost is Consistent. Peter L. Bartlett and Mikhail Traskin.
The Rate of Convergence of AdaBoost. Indraneel Mukherjee, Cynthia Rudin and Robert E. Schapire.

The latter paper improves on the algorithmic convergence rate bound for AdaBoost from the following paper, which was the bound used in the first proof of universal consistency above.

Some Theory for Generalized Boosting Algorithms. Peter J. Bickel, Ya’acov Ritov, and Alon Zakai.

The following paper shows that the universal approximation assumption made in that analysis is essential: with a class of base classifiers that is too simple, a broad family of methods based on convex surrogate loss minimization (including AdaBoost) will fail spectacularly on some simple noisy problems.

Random Classification Noise Defeats All Convex Potential Boosters. Philip M. Long and Rocco A. Servedio.

The following paper extends the universal consistency result to the logistic loss.
Boosting with the Logistic Loss is Consistent. Matus Telgarsky.

Online learning

Most of the material up to online convex optimization (prediction with expert advice, exponential weights, the Bayesian interpretation, online to batch conversion, the dual game, sequential Rademacher averages, and optimal regret for linear games) is covered in these lecture notes:
Online Prediction. Peter Bartlett. 2016.
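
As a reminder of the central example covered there: with $N$ experts and losses in $[0,1]$ that are convex in the prediction, the exponentially weighted average forecaster maintains weights
$$ w_t(i) \propto \exp\left( -\eta \sum_{s < t} \ell_s(i) \right), $$
predicts with the corresponding weighted average of the experts' predictions, and for an appropriately tuned $\eta$ achieves regret at most $\sqrt{(n/2) \ln N}$ against the best expert over $n$ rounds.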

Early papers on prediction of individual sequences:
Aggregating strategies. V. Vovk.

How to use expert advice. N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and M.K. Warmuth.

On prediction of individual sequences. N. Cesa-Bianchi and G. Lugosi.

See also the text book:
Prediction, learning, and games. N. Cesa-Bianchi and G. Lugosi. Cambridge University Press, 2006. [ebook]

The following paper gives some results on the relationship between prediction in adversarial and probabilistic settings.

On the generalization ability of on-line learning algorithms. N. Cesa-Bianchi, A. Conconi, and C. Gentile.

The simple online-to-batch conversion described in the lecture is based on Theorem 5 from this paper.
Exploiting random walks for learning. P. L. Bartlett, P. Fischer, K.-U. Hoeffgen. COLT 1994.

Optimal regret and the formulation of the dual game, plus the proof of the upper bound in terms of sequential Rademacher averages:

A Stochastic View of Optimal Regret through Minimax Duality. J. Abernethy, A. Agarwal, P. L. Bartlett, A. Rakhlin.

The following paper introduced the online convex optimization formulation.
Online convex programming and generalized infinitesimal gradient ascent. M. Zinkevich.
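
A minimal sketch of the algorithm analysed in that paper, projected online gradient descent (the project and gradient oracles below are assumptions of the sketch, not an API from any particular library):

    import numpy as np

    def online_gradient_descent(project, gradient, x0, step_sizes):
        # Projected online gradient descent, in the style of Zinkevich's paper.
        # project(x): Euclidean projection onto the feasible convex set (assumed given).
        # gradient(t, x): a (sub)gradient of the loss revealed at round t, evaluated at x.
        # step_sizes: sequence of step sizes, e.g. proportional to 1 / sqrt(t + 1).
        x = np.asarray(x0, dtype=float)
        plays = []
        for t, eta in enumerate(step_sizes):
            plays.append(x.copy())            # play x_t, then the round-t loss is revealed
            g = np.asarray(gradient(t, x))    # subgradient at the played point
            x = project(x - eta * g)          # gradient step followed by projection
        return plays

With a bounded feasible set and bounded subgradients, step sizes proportional to $1/\sqrt{t}$ give regret of order $\sqrt{T}$.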

Logarithmic regret for strongly convex functions:
Logarithmic regret algorithms for online convex optimization. E. Hazan, A. Agarwal, and S. Kale.
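
Concretely (standard statement; see the paper for the general conditions): for $\alpha$-strongly convex losses with subgradients bounded by $G$, online gradient descent with step sizes $\eta_t = 1/(\alpha t)$ has regret at most $\frac{G^2}{2\alpha}(1 + \log T)$.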

Lower bounds:

Optimal strategies and minimax lower bounds for online convex games. J. Abernethy, P. Bartlett, A. Rakhlin and A. Tewari.

Adapting the step size to the convexity using regularization:

Adaptive online gradient descent. P. Bartlett, E. Hazan, A. Rakhlin.

Adapting the regularization (Adagrad):

Adaptive subgradient methods for online learning and stochastic optimization. John Duchi, Elad Hazan, and Yoram Singer.
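
The core update of the diagonal version (stated schematically, ignoring the projection step and the small damping term used in practice): each coordinate gets its own step size built from the history of gradients,
$$ x_{t+1, j} = x_{t, j} - \frac{\eta}{\sqrt{\sum_{s \le t} g_{s, j}^2}} \, g_{t, j}, $$
so coordinates with consistently small gradients are updated more aggressively.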

Some lecture notes and surveys that include online convex optimization:

Lecture notes on online learning. A. Rakhlin.

Online learning and online convex optimization. S. Shalev-Shwartz.

Convex Optimization: Algorithms and Complexity. S. Bubeck.