# deep learning slides

We currently offer slides for only some chapters. The 12 video lectures cover topics from neural network foundations and optimisation through to generative adversarial networks and responsible innovation. 2020 Feb 28. doi: 10.1002/hep.31207.

- All our models are just approximations of reality
- Or avoid posterior density altogether, just sample from it
- You use the prior to express your preferences on a model
- There are priors that express absence of any preferences
- We have a problem of classifying some objects $$x$$ (images, for example) into one of $$K$$ classes with the correct class given by $$y$$
- We assume the data is generated using some (partially known) classifier $$\pi_{\theta^*}$$: $$y \mid x, \pi_{\theta^*} \sim \text{Categorical}(\pi_{\theta^*}(x))$$ where $$\pi_{\theta^*}(\cdot)$$ is a neural network of a known structure and unknown weights $$\theta^*$$ believed to come from $$p(\theta)$$
- After observing the training set $$\mathcal{D}$$ the learning boils down to finding $$p(\theta \mid \mathcal{D}) \propto p(\theta) \prod_{n=1}^N p(y_n \mid x_n, \pi_\theta)$$
- We want to model uncertainties in, say, images $$x$$ (and maybe sample them), but these are very complicated objects
- We assume that each image $$x$$ has some high-level features $$z$$ that can help explain its uncertainty in a non-linear way $$p(x \mid f(z)) \ne p(x)$$ where $$f$$ is a neural network
- The features are believed to follow some simple distribution $$p(z)$$
- Sample unseen images via $$z \sim p(z)$$, $$x \sim p(x \mid z)$$
- Detect out-of-domain data using marginal density $$p(x)$$
- Suppose we have a residual neural network $$H_l(x) = F_l(x) + x$$.

Deep learning 4: regularization II slides Paper on dropout regularization homework 3 : 13 : …

- We want to make predictions about some $$x$$
- $$p(X = k) = \pi_k \Leftrightarrow p(x) = \prod_{k=1}^K \pi_k^{[x = k]}$$
- Variational Dropout Sparsifies Deep Neural Networks, D. Molchanov, A. Ashukha, D. Vetrov, ICML 2017.
Dimensions of a learning system (different types of feedback, representation, use of knowledge) 3.

- we don't need the exact true posterior $$\text{KL}(q(\theta | \Lambda) || p(\theta | \mathcal{D})) = \log p(\mathcal{D}) - \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta | \Lambda)}$$
- Hence we seek parameters $$\Lambda_*$$ maximizing the following objective (the ELBO) $$\Lambda_* = \text{argmax}_\Lambda \left[ \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)} = \mathbb{E}_{q(\theta|\Lambda)} \log p(\mathcal{D}|\theta) - \text{KL}(q(\theta|\Lambda)||p(\theta)) \right]$$
- We can't compute this quantity analytically either, but can sample from $$q$$ to get Monte Carlo estimates of the approximate posterior predictive distribution: $$q(y \mid x, \mathcal{D}) \approx \hat{q}(y|x, \mathcal{D}) = \frac{1}{M} \sum_{m=1}^M p(y \mid x, \theta^m), \quad\quad \theta^m \sim q(\theta \mid \Lambda_*)$$
- Recall the objective for variational inference $$\mathcal{L}(\Lambda) = \mathbb{E}_{q(\theta | \Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)} \to \max_{\Lambda}$$
- We'll be using a well-known optimization method
- We need a (stochastic) gradient $$\hat{g}$$ of $$\mathcal{L}(\Lambda)$$ s.t.

The slides are published under the terms of the CC-BY 4.0

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.
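The Monte Carlo estimate of the approximate posterior predictive can be illustrated on a toy problem. The sketch below is not taken from the slides: it assumes a hypothetical one-weight logistic classifier $$p(y=1 \mid x, \theta) = \text{sigmoid}(\theta x)$$ and a Gaussian $$q(\theta \mid \Lambda_*) = \mathcal{N}(\mu, \sigma^2)$$ with made-up values of $$\mu$$ and $$\sigma$$.

```python
import math
import random


def sigmoid(t):
    """Logistic function, used here as p(y=1 | x, theta) for a one-weight model."""
    return 1.0 / (1.0 + math.exp(-t))


def predictive(x, mu, sigma, num_samples=1000, seed=0):
    """Monte Carlo estimate of q(y=1 | x, D) = (1/M) * sum_m p(y=1 | x, theta^m)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(mu, sigma)   # theta^m ~ q(theta | Lambda_*)
        total += sigmoid(theta * x)    # p(y=1 | x, theta^m)
    return total / num_samples


# Hypothetical fitted variational parameters Lambda_* = (mu, sigma).
p_plugin = sigmoid(2.0 * 1.0)                 # prediction using only the mean weight
p_bayes = predictive(1.0, mu=2.0, sigma=3.0)  # prediction averaged over q
print(p_plugin, p_bayes)
```

Averaging over weight uncertainty pulls the prediction towards 0.5 compared with plugging in the mean weight, which is the sense in which the predictive distribution carries the model's uncertainty.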
Please follow the installation_instructions.md

- Often we can't, we use approximate posteriors
- Probability Theory is a great tool to reason about uncertainty
- Bayesians quantify subjective uncertainty
- Frequentists quantify inherent randomness in the long run
- People seem to interpret probability as beliefs and hence are Bayesians
- We formulate our prior beliefs about how the $$x$$ might be generated
- We collect some data of already generated $$x$$: $$\mathcal{D}_\text{train} = (x_1, ..., x_N)$$
- We update our beliefs regarding what kind of data exist by incorporating collected data
- We now can make predictions about unseen data
- And collect some more data to improve our beliefs
- We'll assume random variables have and are described by their
- $$p(X=x)$$ ($$p(x)$$ for short) – its probability density function
- $$\text{Pr}[X \in A] = \int_{A} p(X=x) dx$$ – distribution function
- In general several random variables $$X_1, ..., X_N$$ have
- It describes joint probability $$\text{Pr}(X_1 \in A_1, ..., X_N \in A_N) = \int_{A_1} ... \int_{A_N} p(x_1, ..., x_N) dx_N ... dx_1$$
- If (and only if) random variables are independent, the joint density is just a product of individual densities
- Vector random variables are just a bunch of scalar random variables
- For 2 and more random variables you should be considering their joint distribution
- $$\mathbb{E}_{p(x)} X = \int x p(x) dx$$ –
- $$\mathbb{E} [\alpha X + \beta Y] = \alpha \mathbb{E} X + \beta \mathbb{E} Y$$
- $$\mathbb{V} X = \mathbb{E} [X^2] - (\mathbb{E} X)^2 = \mathbb{E}(X - \mathbb{E} X)^2$$
- $$X$$ is said to be Uniformly distributed over $$(a, b)$$ (denoted $$X \sim U(a, b)$$) if its probability density function is $$p(x) = \begin{cases} \tfrac{1}{b-a}, & a < x < b \\ 0, &\text{otherwise} \end{cases} \quad\quad \mathbb{E} X = \frac{a+b}{2} \quad\quad \mathbb{V} X = \frac{(b-a)^2}{12}$$
- $$X$$ is called a Multivariate Gaussian (Normal) random vector with mean $$\mu \in \mathbb{R}^n$$ and positive-definite covariance matrix $$\Sigma \in \mathbb{R}^{n \times n}$$ (denoted $$X \sim \mathcal{N}(\mu, \Sigma)$$) if its joint probability density function is $$p(x) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}} \exp\left(-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
- $$X$$ is said to be Categorically distributed with probabilities
- $$X$$ is called a Bernoulli random variable with probability (of success) $$\pi \in [0, 1]$$ (denoted $$X \sim \text{Bern}(\pi)$$) if its probability mass function is $$p(X = 1) = \pi \Leftrightarrow p(x) = \pi^{x} (1-\pi)^{1-x}$$ (yes, this is a special case of the categorical distribution)
- Joint density on $$x$$ and $$y$$ defines the
- Knowing value of $$y$$ can reduce uncertainty about $$x$$, expressed via the
- Thus $$p(x, y) = p(y|x) p(x) = p(x|y) p(y)$$
- Suppose we have two jointly Gaussian random variables $$X$$ and $$Y$$: $$(X, Y) \sim \mathcal{N}\left(\left[\begin{array}{c}\mu_x \\ \mu_y \end{array} \right], \left[\begin{array}{cc}\sigma^2_x & \rho_{xy} \\ \rho_{xy} & \sigma^2_y\end{array}\right]\right)$$
- Then one can show that marginal and conditionals are also Gaussian $$p(x) = \mathcal{N}(x \mid \mu_x, \sigma^2_x)$$  $$p(y) = \mathcal{N}(y \mid \mu_y, \sigma^2_y)$$  $$p(x|y) = \mathcal{N}\left(x \mid \mu_x + \tfrac{\rho_{xy}}{\sigma_y^2} (y - \mu_y), \sigma^2_x - \tfrac{\rho_{xy}^2}{\sigma_y^2}\right)$$
- If we're interested in $$y$$, then these distributions are called
- We assume some data-generating model $$p(y, \theta \mid x) = p(y \mid x, \theta) p(\theta)$$
- We obtain some observations $$\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$$
- We seek to make predictions regarding $$y$$ for previously unseen $$x$$ having observed the training set $$\mathcal{D}$$.

To find out more, please visit MIT Professional Education. UC Berkeley has done a lot of remarkable work on deep learning, including the famous Caffe — Deep Learning Framework.

- Generator network and inference network essentially give us an autoencoder
- Inference network encodes observations into latent code
- Generator network decodes latent code into observations
- Can infer high-level abstract features of existing objects
- Uses neural network to amortize inference
- Bayesian methods are useful when we have low data-to-parameters ratio
- Impose useful priors on Neural Networks helping discover solutions of special form
- Provide Neural Networks with uncertainty estimates (uncovered)
- Neural Networks help us make more efficient Bayesian inference.

CNNs are the current state-of-the-art architecture for medical image analysis. additional references. We will help you become good at Deep Learning.
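The conditional of a bivariate Gaussian can be checked numerically. The following is a minimal sketch with made-up parameter values, where `cov_xy` plays the role of the off-diagonal entry $$\rho_{xy}$$: it draws $$y$$ from its marginal, then $$x$$ from the conditional with mean $$\mu_x + \tfrac{\rho_{xy}}{\sigma_y^2}(y - \mu_y)$$ and variance $$\sigma_x^2 - \tfrac{\rho_{xy}^2}{\sigma_y^2}$$, and compares the empirical moments of the pairs with the joint parameters.

```python
import math
import random


def conditional_x_given_y(mu_x, mu_y, var_x, var_y, cov_xy, y):
    """Mean and variance of p(x | y) for a bivariate Gaussian."""
    mean = mu_x + cov_xy / var_y * (y - mu_y)
    var = var_x - cov_xy ** 2 / var_y
    return mean, var


# Illustrative parameters (not from the slides).
mu_x, mu_y, var_x, var_y, cov_xy = 1.0, -2.0, 4.0, 1.0, 1.2

rng = random.Random(0)
xs, ys = [], []
for _ in range(50_000):
    y = rng.gauss(mu_y, math.sqrt(var_y))                      # y ~ p(y)
    m, v = conditional_x_given_y(mu_x, mu_y, var_x, var_y, cov_xy, y)
    xs.append(rng.gauss(m, math.sqrt(v)))                      # x ~ p(x | y)
    ys.append(y)

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
emp_var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
emp_cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)
print(mean_x, emp_var_x, emp_cov)  # should be close to mu_x, var_x, cov_xy
```

Sampling through $$p(y)\,p(x \mid y)$$ recovers the marginal variance of $$x$$ and the covariance, which is a quick sanity check of the factorisation $$p(x, y) = p(x \mid y)\,p(y)$$.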
Lecture slides for Chapter 4 of Deep Learning www.deeplearningbook.org Ian Goodfellow Last modified 2017-10-14 Thanks to Justin Gilmer and Jacob Buckman for helpful discussions (Goodfellow 2017) Numerical concerns for implementations of deep learning algorithms

- Then $$H_l(x) = [l \le z] F_l(x) + x$$
- Thus we have $$p(y|x,z) = \text{Categorical}(y \mid \pi(x, z))$$ where $$\pi(x, z)$$ is a residual network with $$z$$ that controls when to stop processing the $$x$$
- We choose the prior on $$z$$ s.t.

Deep learning is a sub-field of machine learning dealing with algorithms inspired by the structure and function of the brain called artificial neural networks. Artificial Intelligence Machine Learning Deep Learning Deep Learning by Y. LeCun et al.

to parameters $$\theta$$ of the generator also!

Description.

How to decide upon the number of layers at test time?

Deep Learning An MIT Press book in preparation Ian Goodfellow, Yoshua Bengio and Aaron Courville. Neural computation 1.4 (1989): 541-551. The slides and lectures are posted online, and the course is taught by three fantastic instructors. The Jupyter notebooks for the labs can be found in the labs folder of • 1993: Nvidia started… • Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh.
lectures-labs maintained by m2dsupsdlclass, Convolutional Neural Networks for Image Classification, Deep Learning for Object Detection and Image Segmentation, Sequence to sequence, attention and memory, Expressivity, Optimization and Generalization, Imbalanced classification and metric learning, Unsupervised Deep Learning and Generative models, Demo: Object Detection with pretrained RetinaNet with Keras, Backpropagation in Neural Networks using Numpy, Neural Recommender Systems with Explicit Feedback, Neural Recommender Systems with Implicit Feedback and the Triplet Loss, Fine Tuning a pretrained ConvNet with Keras (GPU required), Bonus: Convolution and ConvNets with TensorFlow, ConvNets for Classification and Localization, Character Level Language Model (GPU required), Transformers (BERT fine-tuning): Joint Intent Classification and Slot Filling, Translation of Numeric Phrases with Seq2Seq, Stochastic Optimization Landscape in Pytorch.

- with some fixed probability $$p$$ it's 0 and with probability $$1-p$$ it's some learnable value $$\Lambda_i$$
- Then for some prior $$p(\theta)$$ our optimization objective is $$\mathbb{E}_{q(\theta|\Lambda)} \sum_{n=1}^N \log p(y_n | x_n, \theta) \to \max_{\Lambda}$$ where the KL term is missing due to the model choice
- No need to take special care about differentiating through samples
- Turns out, these are Bayesian approximate inference procedures.

However, many found the accompanying video lectures, slides, and exercises not pedagogic enough for a fresh starter. Note: press “P” to display the presenter’s notes that include some comments and Book Exercises External Links Lectures. Lecture slides Basic information about deep learning Cheat sheet – stuff that everyone needs to know Useful links Grading Plan your visit Visit previous iteration of Stats385 (2017) This page was generated by … license.
However, while deep learning has proven itself to be extremely powerful, most of today’s most successful deep learning systems suffer from a number of important limitations, ranging from the requirement for enormous training data sets to lack of interpretability to vulnerability to … We plan to offer lecture slides accompanying all chapters of this book. Each layer accepts the information from the previous one and passes it on to the next on… Inria. The Deep Learning case! Machine Learning: An Overview: The slides present an introduction to machine learning along with some of the following: 1. Deep Learning Handbook.

- We assume the two-phase data-generating process:
- First, we decide upon high-level abstract features of the datum $$z \sim p(z)$$
- Then, we unpack these features using Neural Networks into an actual observable $$x$$ using the (learnable) generator $$f_\theta$$
- This leads to the following model $$p(x, z) = p(x|z) p(z)$$ where $$p(x|z) = \prod_{d=1}^D p(x_d | f_\theta(z))$$, $$p(z) = \mathcal{N}(z | 0, I)$$ and $$f_\theta$$ is some neural network
- We can sample new $$x$$ by passing samples $$z$$ through the generator once we learn it
- Would like to maximize log-marginal density of observed variables $$\log p(x)$$
- Intractable integral $$\log p(x) = \log \int p(x|z) p(z) dz$$
- Introduce approximate posterior $$q(z|x)$$: $$q(z|x) = \mathcal{N}(z|\mu_\Lambda(x), \Sigma_\Lambda(x))$$
- Where $$\mu, \Sigma$$ are generated using auxiliary inference network from the observation $$x$$
- Invoking the ELBO we obtain the following objective $$\tfrac{1}{N} \sum_{n=1}^N \left[ \mathbb{E}_{q(z_n|x_n)} \log p(x_n | z_n) - \text{KL}(q(z_n|x_n)||p(z_n)) \right] \to \max_\Lambda$$.

lower values are preferable.
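The per-datum ELBO $$\mathbb{E}_{q(z|x)} \log p(x|z) - \text{KL}(q(z|x)||p(z))$$ can be sketched on a one-dimensional toy model where the exact $$\log p(x)$$ is available for comparison. Everything model-specific below is an assumption made for illustration: the generator is a hypothetical linear map $$f(z) = wz + b$$ with unit-variance Gaussian noise, so that $$p(x) = \mathcal{N}(x \mid b, w^2 + 1)$$ and the exact posterior is Gaussian.

```python
import math
import random


def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian N(x | mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def kl_to_standard_normal(mu, var):
    """KL(N(mu, var) || N(0, 1)) in closed form."""
    return 0.5 * (mu ** 2 + var - 1.0 - math.log(var))


def elbo(x, mu, var, w, b, num_samples=20_000, seed=0):
    """Monte Carlo ELBO for p(z) = N(0, 1), p(x | z) = N(x | w*z + b, 1), q(z | x) = N(mu, var)."""
    rng = random.Random(seed)
    recon = 0.0
    for _ in range(num_samples):
        z = mu + math.sqrt(var) * rng.gauss(0.0, 1.0)   # z ~ q(z | x)
        recon += gaussian_logpdf(x, w * z + b, 1.0)     # log p(x | z)
    return recon / num_samples - kl_to_standard_normal(mu, var)


x, w, b = 1.5, 2.0, 0.5                                  # hypothetical datum and generator
log_px = gaussian_logpdf(x, b, w ** 2 + 1.0)             # exact log-marginal for this toy
post_mean = w * (x - b) / (w ** 2 + 1.0)                 # exact posterior mean
post_var = 1.0 / (w ** 2 + 1.0)                          # exact posterior variance
elbo_exact_q = elbo(x, post_mean, post_var, w, b)
elbo_prior_q = elbo(x, 0.0, 1.0, w, b)                   # q set to the prior
print(log_px, elbo_exact_q, elbo_prior_q)
```

With $$q$$ equal to the exact posterior the ELBO matches $$\log p(x)$$ up to Monte Carlo error, and a poorer $$q$$ gives a strictly smaller value.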
- Gradient-based optimization in discrete models is hard
- Invoke the Central Limit Theorem and turn the model into a continuous one
- Consider a model with continuous noise on weights $$q(\theta_i | \Lambda) = \mathcal{N}(\theta_i | \mu_i(\Lambda), \alpha_i(\Lambda) \mu^2_i(\Lambda))$$
- Neural Networks have lots of parameters, surely there's some redundancy in them
- Let's take a prior $$p(\theta)$$ that would encourage large $$\alpha$$
- Large $$\alpha_i$$ would imply that weight $$\theta_i$$ is unbounded noise that corrupts predictions
- Such weights won't be doing anything useful, hence they should be zeroed out by putting $$\mu_i(\Lambda) = 0$$
- Thus the weight $$\theta_i$$ would effectively turn into a deterministic 0.

We will be giving a two day short course on Designing Efficient Deep Learning Systems at MIT in Cambridge, MA on July 20-21, 2020. Deep learning algorithms are structured similarly to the nervous system, where neurons are connected to each other and pass information along. Slides of the talk can be accessed from this link. In other words, it mirrors the functioning of our brains.

Can we drop unnecessary computations for easy inputs?

to get started.

All the code in this repository is made available under the MIT license

Juergen Schmidhuber, Deep Learning in Neural Networks: An Overview.
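The noise model $$q(\theta_i | \Lambda) = \mathcal{N}(\theta_i | \mu_i, \alpha_i \mu_i^2)$$ and the pruning of high-noise weights can be sketched as follows. The values of `mus` and `alphas` and the cut-off `alpha_threshold` are hypothetical choices made for this illustration, not values taken from the slides.

```python
import math
import random


def sample_weight(mu, alpha, rng):
    """Draw theta ~ N(mu, alpha * mu^2), i.e. multiplicative Gaussian noise on mu."""
    return mu * (1.0 + math.sqrt(alpha) * rng.gauss(0.0, 1.0))


def prune(mus, alphas, alpha_threshold):
    """Set mu_i to 0 wherever the learned noise level alpha_i exceeds the threshold."""
    return [0.0 if a > alpha_threshold else m for m, a in zip(mus, alphas)]


rng = random.Random(0)
mus = [0.5, -1.0, 0.2]       # hypothetical learned means
alphas = [0.1, 50.0, 3.0]    # hypothetical learned noise levels
noisy = [sample_weight(m, a, rng) for m, a in zip(mus, alphas)]
sparse = prune(mus, alphas, alpha_threshold=10.0)
print(noisy, sparse)
```

A weight with large $$\alpha_i$$ has a standard deviation many times its mean, so its samples are dominated by noise. Setting its mean to zero also removes the noise, because the variance $$\alpha_i \mu_i^2$$ scales with $$\mu_i^2$$.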
- Obtain a sample from (or the mode statistic of) the true posterior $$p(z \mid x, y) \propto p(y|x, z) p(z)$$
- We define some joint model $$p(y, \theta | x) = p(y | x, \theta) p(\theta)$$
- We obtain observations $$\mathcal{D} = \{ (x_1, y_1), ..., (x_N, y_N) \}$$
- We would like to infer possible values of $$\theta$$ given observed data $$\mathcal{D}$$ $$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} | \theta) p(\theta)}{\int p(\mathcal{D}|\theta) p(\theta) d\theta}$$
- We will be approximating true posterior distribution with an approximate one
- Need a distance between distributions to measure how good the approximation is $$\text{KL}(q(x) || p(x)) = \mathbb{E}_{q(x)} \log \frac{q(x)}{p(x)} \quad\quad \textbf{Kullback-Leibler divergence}$$
- Not an actual distance, but $$\text{KL}(q(x) || p(x)) = 0$$ iff $$q(x) = p(x)$$ for all $$x$$ and is strictly positive otherwise
- Will be minimizing $$\text{KL}(q(\theta) || p(\theta | \mathcal{D}))$$ over $$q$$
- We'll take $$q(\theta)$$ from some tractable parametric family, for example Gaussian $$q(\theta | \Lambda) = \mathcal{N}(\theta \mid \mu(\Lambda), \Sigma(\Lambda))$$
- Then we reformulate the objective s.t.

Learn Deep Learning from deeplearning.ai. In this study, we used two deep-learning algorithms based … Predicting survival after hepatocellular carcinoma resection using deep-learning on histological slides Hepatology. The Deep Learning Handbook is a project in progress to help study the Deep Learning book by Goodfellow et al. Goodfellow's masterpiece is a vibrant and precious resource to introduce the booming topic of deep learning. He has spoken and written a lot about what deep learning is and is a good place to start. 2014 Lecture 2 … Deep Learning is one of the most highly sought after skills in tech.

- Let's equip the network with a mechanism to decide when to stop processing and prefer networks that stop early
- Let $$z$$ indicate the number of layers to use.
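The properties of the Kullback-Leibler divergence stated above (zero only for identical distributions, positive otherwise, and not a true distance) can be seen in the univariate Gaussian case, where the divergence has a closed form. The sketch below uses illustrative parameter values.

```python
import math


def kl_gauss(m1, s1, m2, s2):
    """KL(N(m1, s1^2) || N(m2, s2^2)) for univariate Gaussians (closed form)."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5


kl_same = kl_gauss(0.0, 1.0, 0.0, 1.0)   # identical distributions
kl_qp = kl_gauss(0.0, 1.0, 1.0, 2.0)     # KL(q || p)
kl_pq = kl_gauss(1.0, 2.0, 0.0, 1.0)     # KL(p || q), arguments swapped
print(kl_same, kl_qp, kl_pq)
```

The two directions give different values, which is why the KL divergence is not symmetric and the choice of $$\text{KL}(q || p)$$ over $$\text{KL}(p || q)$$ matters for the approximation.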
The widespread adoption of whole slide imaging has increased the demand for effective and efficient gigapixel image analysis.

Jeez, how is that related to this slide?

2012 IPAM Summer School deep learning and representation learning Videos and Slides at IPAM 2014 International Conference on Learning Representations (ICLR 2014)

Turns out, the ELBO is also a lower bound on marginal log-likelihood (hence the name), we can maximize it w.r.t.

The Course “Deep Learning” systems, typified by deep neural networks, are increasingly taking over all AI tasks, ranging from language understanding, and speech and image recognition, to machine translation, planning, and even game playing and autonomous driving. Nature 2015 “Over the next few years, start-ups and the usual big tech suspects will use deep learning to create new products and services …

- $$\mathbb{E} \hat{g} = \nabla_\Lambda \mathcal{L}(\Lambda)$$
- Problem: We can't just take $$\hat{g} = \nabla_\Lambda \log \frac{p(\mathcal{D}, \theta)}{q(\theta | \Lambda)}$$ as the samples themselves depend on $$\Lambda$$ through $$q(\theta|\Lambda)$$
- Remember the expectation is just an integral, and apply the log-derivative trick $$\nabla_\Lambda q(\theta | \Lambda) = q(\theta | \Lambda) \nabla_\Lambda \log q(\theta|\Lambda)$$ $$\nabla_\Lambda \mathcal{L}(\Lambda) = \int q(\theta|\Lambda) \log \frac{p(\mathcal{D}, \theta)}{q(\theta | \Lambda)} \nabla_\Lambda \log q(\theta | \Lambda) d\theta = \mathbb{E}_{q(\theta|\Lambda)} \log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)} \nabla_\Lambda \log q(\theta | \Lambda)$$
- Though general, this gradient estimator has too much variance in practice
- We assume the data is generated using some (partially known) classifier $$\pi_{\theta}$$: $$p(y \mid x, \theta) = \text{Cat}(y | \pi_\theta(x)) \quad\quad \theta \sim p(\theta)$$
- True posterior is intractable $$p(\theta \mid \mathcal{D}) \propto p(\theta) \prod_{n=1}^N p(y_n \mid x_n, \pi_\theta)$$
- Approximate it using $$q(\theta | \Lambda)$$: $$\Lambda_* = \text{argmax}_\Lambda \left[ \mathbb{E}_{q(\theta | \Lambda)} \sum_{n=1}^N \log p(y_n | x_n, \theta) - \text{KL}(q(\theta | \Lambda) || p(\theta)) \right]$$
- Essentially, instead of learning a single neural network that would solve the problem, we
- $$p(\theta)$$ encodes our preferences on which networks we'd like to see
- Let $$q(\theta_i | \Lambda)$$ be s.t.

Deep Learning for Whole Slide Image Analysis: An Overview.

Minimum Description Length for VAE Alice wants to transmit x as compactly as possible to Bob, who knows only the prior p(z) and the decoder weights

We thank the Orange-Keyrus-Thalès chair for supporting this class. Computationally stained slides could help automate the time-consuming process of slide staining, but Shah said the ability to de-stain and preserve images for future use is the real advantage of the deep learning techniques. Andrew Ng from Coursera and Chief Scientist at Baidu Research formally founded Google Brain that eventually resulted in the productization of deep learning technologies across a large number of Google services. This course is being taught as part of Master Datascience Paris Saclay.

How do we backpropagate through samples $$\theta_i$$?

11/11/2019. Video and slides of NeurIPS tutorial on Efficient Processing of Deep Neural Networks: from Algorithms to Hardware Architectures available here. Deep Learning algorithms aim to learn feature hierarchies with features at higher levels in the hierarchy formed by the composition of lower level features. Cognitive modeling 5.3 (1988): 1.
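The log-derivative trick described above can be tried on a toy problem with a known answer. The sketch below makes a simplifying assumption: the integrand is a fixed function $$f(\theta) = -(\theta - 3)^2$$ standing in for $$\log \frac{p(\mathcal{D}, \theta)}{q(\theta|\Lambda)}$$, and $$q = \mathcal{N}(\mu, \sigma^2)$$ with only $$\mu$$ being learned, so the exact gradient is $$-2(\mu - 3)$$.

```python
import random


def f(theta):
    """Toy integrand standing in for log p(D, theta) / q(theta | Lambda)."""
    return -(theta - 3.0) ** 2


def score_function_grad(mu, sigma, num_samples=100_000, seed=0):
    """Estimate d/dmu E_{theta ~ N(mu, sigma^2)} f(theta) with the log-derivative trick."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        theta = rng.gauss(mu, sigma)
        score = (theta - mu) / sigma ** 2      # d/dmu log N(theta | mu, sigma^2)
        samples.append(f(theta) * score)       # one-sample gradient estimate
    mean = sum(samples) / num_samples
    var = sum((s - mean) ** 2 for s in samples) / num_samples
    return mean, var


grad_sf, var_sf = score_function_grad(mu=0.0, sigma=1.0)
print(grad_sf, var_sf)   # the exact gradient for this toy is -2 * (0 - 3) = 6
```

The estimate is unbiased, but the per-sample variance is large compared with the gradient itself, which illustrates why many samples are needed in practice.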
Direct links to the rendered notebooks including solutions (to be updated in rendered mode): This lecture is built and maintained by Olivier Grisel and Charles Ollion, Charles Ollion, head of research at Heuritech -

- We have a continuous density $$q(\theta_i | \mu_i(\Lambda), \sigma_i^2(\Lambda))$$ and would like to compute the gradient of $$\mathbb{E}_{q(\theta|\Lambda)} \log \frac{p(\mathcal{D}|\theta) p(\theta)}{q(\theta|\Lambda)}$$
- The inner part – expected gradients of $$\log \frac{p(\mathcal{D}|\theta) p(\theta)}{q(\theta|\Lambda)}$$
- Sampling part – gradients through samples $$\theta \sim q(\theta|\Lambda)$$
- The objective then becomes $$\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, 1)} \log \tfrac{p(\mathcal{D}, \mu + \varepsilon \sigma)}{q(\mu + \varepsilon \sigma | \Lambda)}$$
- The objective then becomes $$\mathbb{E}_{\varepsilon \sim \mathcal{N}(0, 1)} \left[\sum_{n=1}^N \log p(y_n | x_n, \theta=\mu(\Lambda) + \varepsilon \sigma(\Lambda)) \right] - \text{KL}(q(\theta|\Lambda) || p(\theta))$$
- Training a neural network with a special kind of noise on the weights
- The magnitude of the noise is encouraged to increase
- Zeroes out unnecessary weights completely
- Essentially, training a whole ensemble of neural networks
- Actually using the ensemble is costly: $$k$$ times slower for an ensemble of $$k$$ models
- Single network (single-sample ensemble) also works.

The course is Berkeley’s current offering of deep learning. Training the model is just one part of shipping a Deep Learning project.

- Seriously though, it's just formal language, not much of the actual math is involved
- We don't need no Bayes, we already learned a lot without it.

The course covers the basics of Deep Learning, with a focus on applications. The Deep Learning Lecture Series 2020 is a collaboration between DeepMind and the UCL Centre for Artificial Intelligence.
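The reparameterisation $$\theta = \mu + \varepsilon \sigma$$ with $$\varepsilon \sim \mathcal{N}(0, 1)$$ can be sketched on a toy objective. As an assumption made for illustration, the integrand is a fixed function $$f(\theta) = -(\theta - 3)^2$$ rather than the full ELBO integrand, and only $$\mu$$ is differentiated, so the exact gradient is $$-2(\mu - 3)$$.

```python
import random


def f_prime(theta):
    """Derivative of the toy integrand f(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)


def reparam_grad(mu, sigma, num_samples=20_000, seed=0):
    """Estimate d/dmu E_{eps ~ N(0, 1)} f(mu + eps * sigma) by differentiating through samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)        # noise that does not depend on mu
        theta = mu + eps * sigma         # reparameterised sample
        samples.append(f_prime(theta))   # chain rule: d theta / d mu = 1
    mean = sum(samples) / num_samples
    var = sum((s - mean) ** 2 for s in samples) / num_samples
    return mean, var


grad_rp, var_rp = reparam_grad(mu=0.0, sigma=1.0)
print(grad_rp, var_rp)   # exact gradient is 6; per-sample variance is 4 * sigma^2 = 4
```

Because the randomness is moved into $$\varepsilon$$, the gradient passes through the sample by the ordinary chain rule, and for this toy the per-sample variance is only $$4\sigma^2$$.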
Its uncertainty quantified by the, This requires us to know the posterior distribution on model parameters $$p(\theta \mid \mathcal{D})$$ which we obtain using the Bayes' rule, Suppose the model $$y \sim \mathcal{N}(\theta^T x, \sigma^2)$$, with $$\theta \sim \mathcal{N}(\mu_0, \sigma_0^2 I)$$, Suppose we observed some data from this model $$\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$$ (generated using the same $$\theta^*$$), We don't know the optimal $$\theta$$, but the more data we observe, Posterior predictive would also be Gaussian $$p(y|x, \mathcal{D}) = \mathcal{N}(y \mid \mu_N^T x, \sigma_N^2)$$, Suppose we observe a sequence of coin flips $$(x_1, ..., x_N, ...)$$, but don't know whether the coin is fair $$x \sim \text{Bern}(\pi), \quad \pi \sim U(0, 1)$$, First, we infer posterior distribution on a hidden parameter $$\pi$$ having observed \(x_{