11 May, 2017

Where do GPUs come from.

A slide deck for an introduction to CG class.



PPTX - PDF (smaller)

Note: this is not really a tutorial in this form, there are no presenter's notes. But if you want to use this scheme to teach something similar, feel free. The CPU->GPU trajectory is heavily inspired by the brilliant work Kayvon Fatahalian did.

07 May, 2017

Privacy, bubbles, and being an expert.

Privacy is not the issue.

Much has been said about the risk of losing our privacy in this era of microwaves that can turn into cameras and other awful "internet of things" things. It seems that today there is nothing you can buy that does not intentionally spy on your behavior while also being insecure enough to allow third parties to spy on it too.

It's the wrong problem.

Especially when dealing with big, reputable companies, privacy is taken really quite seriously: there is virtually zero chance of anyone spying on you as an individual. Even when it comes to anonymized data, care is taken to avoid singling out individuals. It might be unflattering, but big companies do not really care about you.

What they care about is targeting: being able to statistically know what various groups of people prefer, in order to serve them better and to sell them stuff.


CV Dazzle
The dangers of targeting.

Algorithmic targeting has two faces. There is a positive side to it, certainly. Why would a company not want to make its customers happier? If I know what you like, I can help you find more things that you will like, and yes, that will drive sales, but it's driving sales by effectively providing a better service, a sort of digital concierge. Isn't that wonderful? Why would anyone not opt in to such amazing technology...

But there is a dark side to this mechanism too: the ease with which algorithms can tune into the easiest ways to keep us engaged, to provide happiness, rewards. We're running giant optimizers attached to human minds, and these optimizers have access to tons of data and can do a lot of experiments on a huge (and quite real) population of samples. No wonder we can quickly converge towards a maximum of the gratification landscape.

Is it right to do so? Is it ethical? Are we really improving the quality of life, or are we just giving out quick jolts of pleasure and engagement? Who can say?
Where is the line, for example, between keeping a player in a game because it's great, for some definition of great, and doing so because it provides some compulsion loops that tap into basic brain chemistry the same way slot machines do?

Will we all end up living senseless lives attached to machines that provide for our needs, farmed like the humans in the Matrix?


Ellen Porteus

Pragmatically.

I don't know, and to be honest there are good reasons not to be a pessimist. Even with just a look at the history behind us, we had similar fears for many different technologies, and so far we always came out on top.

We're smarter, more literate, more creative, more productive, happier, healthier, more peaceful and richer than we ever were, globally. It is true that technology is quickly making leaps and opening options that were unthinkable even just a decade ago, but it's also true that there is not too much of a reason to think we cannot adapt.

And I think if we look at newer generations, we can already see how this adaptation is taking place. Even observing product trends, it seems to be becoming harder and harder to engage people with the most basic compulsion loops and cheap content; acquiring users is increasingly hard, and the products that in practice make it onto the market do so by truly offering some positive innovation.

Struggling with bubbles.

Even if I'm not a pessimist though, there is something I still struggle with: the apparent emergence of radicalization, echo chambers, bubbles. I have to admit, this is something hard to quantify on a global scale, especially when it comes to placing the phenomenon in a historical perspective, but it just bothers me personally, and I think it's something we have to be aware of.

I think we are at a peculiar intersection today. 

On one hand, we have increasingly risen out of ignorance and are starting to be more concerned with the matters of the world. This might not seem to be the case looking at Trump and so on, but it's certainly true if we look at the trajectory of humanity with a bit more long-term historical perspective.

On the other hand, the kind of problems and concerns we are presented with have increased in complexity exponentially. We are exposed to the matters of the world, and the world we live in is this enormous, interconnected beast where cause and effect get lost in the chaotic nature of interactions.

Even experts don't have easy answers. I think we know that because we might be experts in a field or two ourselves, and most big questions, I believe, would be answered with "it depends".
There are a myriad of local optima in the kind of problems we deal with today, and which way to go is more about what can work in a given environment, with given people, than what can be demonstrably proven to be the best direction.


Echo Chambers

The issue.

And this is where a big monster rears its head. In a world with lots of content and information, with systems that allow us to quickly connect to huge groups of similar-minded people, with algorithms that feed us content that agrees with our views, seeking instant satisfaction over exploration, true knowledge and serendipity, when faced with increasingly complex issues, how attractive does the dark side of confirmation bias become?

We have mechanisms built into all of us, regardless of how smart we are, that were designed not to seek the truth but to be effective when navigating the world and its social interactions. Cognitive biases are there because they serve us; they are tools that stem from evolution. But is our world changing faster than our brain's ability to evolve?

Pragmatically again though, I don't intend to look too much at the far future (which I believe is generally futile, as you're trying to peek into a chaotic horizon). What annoys me is that even when you are aware of all this, and all these risks today, it's becoming hard to fight the system.
There is simply so much content out there, and so many algorithms around you (even if you isolate yourself from them) tuned to spread it in different groups, that finding good information is becoming hard.

Then again, I am not sure of the scale of this issue, because if, again, we look at things historically, we are probably still on average better informed today and less likely to be deceived than even just a few decades ago, when most people were not informed at all and it was much easier to control the few means of mass communication available.

Yet, it unavoidably irks me to look around and be surrounded by untrustworthy content and even worse, content that is made to myopically serve a small world view instead of trying to capture a phenomenon in all its complexity (either with malice, or just because it's simpler and gets clicks).
Getting accurate, global data is incredibly hard, as it's increasingly valuable and thus, kept hidden for competitive advantage.

John W.Tomac

Being an expert, or just decent.

I find that similar mechanisms and balances affect our professional lives, unsurprisingly. I often say that experience is a variance reduction technique: we become less likely to make mistakes, more effective, more knowledgeable and able to quickly dismiss dangerous directions, but we also risk becoming less flexible, rooted in beliefs and principles that might not be relevant anymore.

I find no better example of these risks than in the trajectory of certain big corporations and how they managed to become irrelevant, not due to the lack of smart, talented people, but because at a given size one risks having a gravity of one's own, and truly believing in a snapshot of the world that meanwhile has moved on. How so many smart people can manage to be blinded.

Experience is a trade-off. We can be more effective even if we might be more wrong. Maybe, more importantly, we risk losing the ability to discover more revolutionary ideas.
How much should we be open to exploration and how much should we be focused on what we do best? How much should we seek diversity in a team, and how much should we value cohesion and unity of vision? I find these to be quite hard questions.

I don't have answers, but I do have some principles I believe might be useful. The first has to do with ego, and here it helps not to have a well-developed one, to begin with, because my suggestion is to go out and seek critique, "kill your darlings".
This has been taught to me at an early age by an artist friend of mine who was always critical of my work and when I protested made me notice how the only way to be better is to find people willing to trash what you do.

In practice, I think that we should be more severe, critical, doubtful of what we love and believe than anything else. We should set for our own ideas and social groups a higher standard of scrutiny than what we do for things that are alien to us.

The second principle that I believe can help is to encourage exploration, discovery, experimentation and failure. Going outside our comfort zones is always hard but even harder is to face failure, we don't like to fail, obviously and for good reasons.
So one cannot achieve these goals without setting some small, safe spaces where exploration is easier and not burdened (I would say unconstrained, but certain other constraints actually do help) by too much early judgment.

Lastly, beware that for how much you know about all this, and are willing to act, many times you will not. I don't always follow my own principles and I think that's normal. I try to be aware of these mechanisms though. And even there, the keyword is "try".

Epilog: affecting change.

I believe that polarizing, blinding, myopic forces are at work everywhere, in our personal and professional lives, in the society at large, and being aware of them is important even just to try to navigate our world.

But if instead of just navigating the world, one wants to actually affect change, then it's imperative to understand the fight that lies ahead.

The worst thing that can ever be done is to feed the polarization forces, cater to our own and scare away people who might have been willing to consider our ideals. It does not help, it damages.

Catering to our own enclaves, rallying our people, is easy and tempting and fulfilling. It's not useless, certainly, there is value in reaffirming people who are already inclined to be on our side, but there is much more to be gained in even just instilling a doubt reaching out to someone who is on the opposite side, or is undecided in the middle, than solidifying beliefs of people that share ours already.

You can even look at current events, elections and the way they are won.

Understanding people who think differently than us, applying empathy, extending reach, is so much harder. But it's also the only smart choice.



06 May, 2017

Shadow mystery

Can shadows cause the texture UVs to shift?

This was a bug assigned to one of our engineers. Puzzling. Instead of being really useful, I started to investigate with some offline rendering.


Look at the shadows of the disappearing pillar

Wow! Mental Ray is broken :)

So apparently yes, you can easily create this optical illusion. It's pretty easy to understand why: especially on a surface with constant albedo, the texture of the surface comes entirely from shading. If the light moves, so will the highlights traverse the surface, and that creates an illusion similar to a texture shift of just a few pixels.

The effect is much stronger when the ambient light creates a highlight opposite to the main light, so when the main light is shadowed the ambient highlight dominates and the shift becomes apparent.

On a render where the ambient and highlight are on the same side, the effect is much less pronounced.



The nail in the coffin, though, was when I managed to reproduce the same effect in real life. So it's definitely an optical illusion that can happen, but it's probably made worse by realtime rendering unshadowed ambient/GI on normalmaps and by the fact that shadows abruptly cancel the sun/sky light, instead of gradually blocking only some rays and rolling out the highlight direction in the penumbra region.

01 April, 2017

A programmer's sightseeing tour: Machine Learning and Deep Neural Networks (part 2)

TL;DR: You probably don't need DNNs, but you ABSOLUTELY SHOULD know and practice data analysis!

Part 1 here!

- Curse of dimensionality

Classical machine learning works on datasets with a small or moderate number of dimensions. If you ever tried to do any function approximation or to work with optimization in general, you should have noticed that all these problems suffer from the "curse of dimensionality".

Say for example that we have a set of outdoor images of a fixed size: 256x128. And let's say that this set is labeled with a value that tells at what time of the day the image was taken. We can see this dataset as samples of an underlying function that takes as input 98304 (WxHxRGB) values and outputs one.
In theory, we could use standard function fitting to find an expression that in general can tell from any image the time it was taken, but in practice, this approach goes nowhere: it's very hard to find functional expressions in so many dimensions!
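
To get a feeling for the numbers involved, here is a small back-of-the-envelope sketch. It only counts how many coefficients a generic polynomial fit would need for the 256x128 RGB images of the example; the choice of a polynomial as the "functional expression" is my own assumption, made purely for illustration.

```python
# Size of the input space for the 256x128 RGB images in the example.
WIDTH, HEIGHT, CHANNELS = 256, 128, 3
n_inputs = WIDTH * HEIGHT * CHANNELS

# Number of coefficients of a generic polynomial in n_inputs variables:
# degree 1 needs a constant plus one term per input,
# degree 2 adds one term per (unordered) pair of inputs, squares included.
linear_terms = 1 + n_inputs
quadratic_terms = linear_terms + n_inputs * (n_inputs + 1) // 2

print(n_inputs)         # 98304
print(linear_terms)     # 98305
print(quadratic_terms)  # 4831985665
```

Even a humble quadratic would need billions of coefficients, far more than any realistic number of labeled photos could pin down.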


More dimensions, more directions, more saddles!
The classic machine learning approach to these problems is to do some manual feature selection. We can observe our data set and come up with some statistics that describe our images compactly, and that we know are useful for the problem at hand.
We could, for example, compute the average color, thinking that the time of day strongly changes the overall tinting of the photos, or we could compute how much contrast we have, how many isolated bright pixels, all sorts of measures, then use our machine learning models on these.
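
A minimal sketch of what such hand-made features could look like, in plain Python. The function name `extract_features`, the luminance weights (the usual Rec. 709 ones), the use of the standard deviation as "contrast" and the 0.9 brightness threshold are all assumptions of mine, not something prescribed by any specific method.

```python
from statistics import mean, pstdev

def extract_features(image):
    """image: list of rows, each row a list of (r, g, b) tuples in [0, 1].
    Returns a handful of summary statistics instead of all the pixels."""
    pixels = [pixel for row in image for pixel in row]
    avg_r = mean([p[0] for p in pixels])
    avg_g = mean([p[1] for p in pixels])
    avg_b = mean([p[2] for p in pixels])
    # Per-pixel brightness (Rec. 709 weights, an assumption of this sketch).
    luminance = [0.2126 * r + 0.7152 * g + 0.0722 * b for r, g, b in pixels]
    # A crude "contrast" measure: the spread of the brightness values.
    contrast = pstdev(luminance)
    # Fraction of very bright pixels (the threshold is arbitrary).
    bright_fraction = sum(1 for value in luminance if value > 0.9) / len(luminance)
    return [avg_r, avg_g, avg_b, contrast, bright_fraction]

# A 2x2 toy image: one white pixel, three black ones.
tiny_image = [
    [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
]
features = extract_features(tiny_image)
print(features)
```

Whatever the image size, the model downstream now sees five numbers instead of tens of thousands.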


A scatterplot matrix can show which dimensions of a dataset have interesting correlations.
The idea here is that there is a feedback loop between the data, the practitioner, and the algorithms. We look at the data set, make hypotheses on what could help reduce dimensionality, project, learn; if it didn't work well we rinse and repeat. 
This process is extremely useful, and it can help us understand the data better, discover relationships, invariants.

Multidimensional scaling, reducing a 3d data set to 1D
The learning process can sometimes instruct the exploration as well: we can train on a given set of features we picked, then notice that the training didn't really use much some of them (e.g. they have very small weights in our functional expression), and in that case we know these features are not very useful for the task at hand.
Or we could use a variety of automated dimensionality reduction algorithms to create the features we want. Or we could use interactive data visualization to try to understand which "axes" have more interesting information...

- Deep learning

Deep learning goes a step further and tries to eliminate the manual feature engineering part of machine learning. In general, it refers to any machine learning technique that learns hierarchical features.

In the previous example we said it's very hard to learn directly from image pixels a given high-level feature, we have to engineer features first, do dimensionality reduction. 
Deep learning comes and says, ok, we can't easily go from 98304 dimensions to time-of-day, but could we automatically go from 98304 dimensions to maybe 10000, which represent the data set well? It's a form of compression, can we do a bit of dimensionality reduction, so that the reduced data still retains most of the information of the original?

Well, of course, sure we can! Indeed we know how to do compression, we know how to do some dimensionality reduction automatically, no problem. But if we can do a bit of that, can we then keep going? Go from 10000 to 1000, from 1000 to 100, and from 100 to 10? Always nudging each layer so it keeps in mind that we want features that are good for a specific final objective? In other words, can we learn good features, recursively, thus eliminating the laborious process of data exploration and manual projection?

Turns out that with some trickery, we can.

DNN for face recognition, each layer learns higher-order features
- Deep learning is hard

Why does learning huge models fail? Try to picture the process: you have a very large vector of parameters to optimize, a point in a space of thousands, tens of thousands of dimensions. Each optimization step you want to choose a direction in which to move, you want to explore this space towards a minimum of the error function.
There are just so many possible choices, if you were to explore randomly, just try different directions, it would take forever. If before doing a step we were to try any possible direction, we would need to evaluate the error function at least once per dimension, thousands of times.
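
The "one evaluation per dimension" cost is easy to see in a sketch of a brute-force, finite-difference gradient. The helper names (`count_calls`, `numerical_gradient`) and the toy error function are made up for this illustration.

```python
def count_calls(function):
    """Wraps a function so we can see how many times it gets evaluated."""
    counter = {"calls": 0}
    def wrapped(params):
        counter["calls"] += 1
        return function(params)
    return wrapped, counter

def numerical_gradient(error, params, step=1e-6):
    """Forward differences: nudge one parameter at a time."""
    base = error(params)
    gradient = []
    for i in range(len(params)):
        nudged = list(params)
        nudged[i] += step
        gradient.append((error(nudged) - base) / step)
    return gradient

def sum_of_squares(params):
    # Toy error function, its exact gradient is 2 * params.
    return sum(p * p for p in params)

counted_error, counter = count_calls(sum_of_squares)
gradient = numerical_gradient(counted_error, [1.0, -2.0, 3.0])
print(gradient, counter["calls"])
```

Three parameters cost four evaluations of the error function; a model with tens of thousands of weights would pay tens of thousands of evaluations for every single step.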

Your most reliable guide through these choices is the gradient of the error function, a big arrow telling in which direction the error is going to be smaller if a small step is taken. But computing the gradient of a big model itself is really hard, numerical errors can lead to the so-called gradient diffusion.



Think of a neural network with many, many layers. A choice (change, gradient) of a weight in the first layer will change the output in a very indirect way, it will alter a bit the output of the layer but that output will be fed into many others before reaching the destination, the final value of the neural network.
The relationship between the layers near the output and the output itself is more clear, we can observe the layer inputs and we know what the weights will do, but layers very far from the output contribute in such an indirect way!
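
Here is a toy sketch of how indirect that contribution can get. It assumes the simplest possible "deep" network, a chain of one-neuron sigmoid layers (my simplification, not a realistic architecture), and uses the chain rule to compute how much the output changes when we touch the very first weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    """Chain of one-neuron layers: a_k = sigmoid(w_k * a_(k-1)), with a_0 = x."""
    activation = x
    for w in weights:
        activation = sigmoid(w * activation)
    return activation

def gradient_wrt_first_weight(x, weights):
    """d(output) / d(first weight), computed with the chain rule."""
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))
    # First layer: d a_1 / d w_1 = a_1 * (1 - a_1) * x
    grad = activations[1] * (1.0 - activations[1]) * x
    # Every later layer multiplies by d a_k / d a_(k-1) = a_k * (1 - a_k) * w_k
    for k in range(1, len(weights)):
        a = activations[k + 1]
        grad *= a * (1.0 - a) * weights[k]
    return grad

shallow = gradient_wrt_first_weight(1.0, [1.0] * 2)
deep = gradient_wrt_first_weight(1.0, [1.0] * 10)
print(shallow, deep)
```

The sigmoid's derivative is never larger than 0.25, so with unit weights each extra layer shrinks the gradient by at least a factor of four: after ten layers the first weight barely registers at the output.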


Imagine wanting to change the output of a pachinko machine.
Altering the pegs at the bottom has a direct result on how the balls will fall
into the baskets, but changing the top peg will have less predictable results.
Another problem is that the more complex the model is, the more data we need to train it. If we have a model with tens of thousands of parameters, but we have only a few data points, we can easily "overfit": learn an expression that perfectly approximates the data points we have, but that is not related to the underlying process that generated these data points, the function that we really wanted to learn and that we know only by examples!


Overfitting example.
In general, the more powerful we want our model to be, the more data we need to have. But obtaining the data, especially if we need labeled data, is often non-trivial!
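
A small, self-contained sketch of overfitting: six noisy samples of a straight line, fitted once with a two-parameter line and once with a polynomial that has as many parameters as data points. The data, the hand-picked "noise" values and the function names are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares straight line, 2 parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial through every point, as many parameters as samples."""
    def polynomial(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return polynomial

# Underlying process: y = 2x + 1, observed with some fixed, hand-picked noise.
noise = [0.3, -0.4, 0.5, -0.3, 0.4, -0.5]
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

line = fit_line(xs, ys)
wiggly = fit_interpolating_polynomial(xs, ys)

# A new point the models never saw: the true value at x = 6 is 13.
print("line:", line(6.0), "polynomial:", wiggly(6.0))
```

The big polynomial has zero error on the training samples, yet on the new point it is off by a wide margin, while the boring line stays close to the truth.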

- The deep autoencoder

One idea that worked in the context of neural networks is to try to learn stuff layer-by-layer, instead of trying to train the whole network at once: enter the autoencoder.

An autoencoder is a neural network that, instead of approximating a function that connects given inputs to some outputs, connects inputs to themselves. In other words, the output of the neural network has to be the same as the inputs. We can't just use an identity neural network though, the kink in this is that the autoencoder has to have a hidden layer with fewer neurons than the number of inputs (or alternatively we can use a sparsity constraint on the weights)!


Stacked autoencoder.
In other words, we are training the NN to do dimensionality reduction internally, and then expand the reduced representation back to the original number of dimensions, an autoencoder trains a coding layer and a decoding layer, all at once. 
This is, perhaps surprisingly, not hard: if the hidden layer bottleneck is not too small, we can just use backpropagation, follow the gradient, and find all the weights we need.
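
To make this concrete, here is about the smallest autoencoder I can think of, written in plain Python: two inputs, a single hidden value, two outputs, no activation functions. It is a toy under heavy assumptions (linear layers, hand-written gradients, data that really lies on a line, arbitrary learning rate and step count), not how one would build a real one.

```python
def reconstruction_error(data, encoder, decoder):
    """Mean squared difference between each input and its reconstruction."""
    total = 0.0
    for x in data:
        hidden = encoder[0] * x[0] + encoder[1] * x[1]
        total += (decoder[0] * hidden - x[0]) ** 2 + (decoder[1] * hidden - x[1]) ** 2
    return total / len(data)

def train_autoencoder(data, steps=5000, learning_rate=0.005):
    encoder = [0.1, 0.1]  # 2 inputs -> 1 hidden value (the bottleneck)
    decoder = [0.1, 0.1]  # 1 hidden value -> 2 outputs
    n = len(data)
    for _ in range(steps):
        grad_encoder = [0.0, 0.0]
        grad_decoder = [0.0, 0.0]
        for x in data:
            hidden = encoder[0] * x[0] + encoder[1] * x[1]
            error = [decoder[0] * hidden - x[0], decoder[1] * hidden - x[1]]
            # Backpropagate the squared error through the decoder...
            d_hidden = 2.0 * (error[0] * decoder[0] + error[1] * decoder[1])
            for i in range(2):
                grad_decoder[i] += 2.0 * error[i] * hidden / n
                # ...and then through the encoder.
                grad_encoder[i] += d_hidden * x[i] / n
        for i in range(2):
            encoder[i] -= learning_rate * grad_encoder[i]
            decoder[i] -= learning_rate * grad_decoder[i]
    return encoder, decoder

# 2D points that all lie on the line y = 2x: one number is enough to describe them.
data = [(t, 2.0 * t) for t in (1.0, 2.0, -1.0, -2.0, 0.5)]

error_before = reconstruction_error(data, [0.1, 0.1], [0.1, 0.1])
encoder, decoder = train_autoencoder(data)
error_after = reconstruction_error(data, encoder, decoder)
print(error_before, error_after)
```

The hidden value ends up being a learned one-dimensional coordinate along the line, which is exactly the kind of compressed feature we were after.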

The idea of the stacked autoencoder is to not stop at just one layer: once we train a first dimensionality-reduction NN we keep going by stripping the decoding layer (output) and connecting a second autoencoder to the hidden layer of the first one. And we can keep going from there, by the end, we'll have a deep network with many layers, each smaller than the preceding, each that has learned a bit of compression. Features!

The stacked autoencoder is unsupervised learning, we trained it without ever looking at our labels, but once trained nothing prevents us from doing a final training pass in a supervised way. We might have gone from our thousands of dimensions to just a few, and there at the end, we can attach a regular neural network and train the entire thing in a supervised way.

As the "difficult" layers, the ones very far from the outputs, have already learned some good features, the training will be much easier: the weights of these layers can still be affected by the optimization process, specializing to the dataset we have, but we know they already start in a good position, they have learned already features that in general are very expressive.

- Contemporary deep learning

More commonly today deep learning is not done via an unsupervised pre-training of the network, instead, we often are able to directly optimize bigger models. This has been possible via a better understanding of several components:

- The role of initialization: how to set the initial, randomized, set of weights.
- The role of the activation function shapes.
- Learning algorithms.
- Regularization.

We still use gradient descent type of algorithms, first-order local optimizers that use derivatives, but typically the optimizer uses approximate derivatives (stochastic gradient descent: the error is computed only with random subsets of the data) and tries to be smart (adagrad, adam, adadelta, rmsprop...) about how fast it should descend (the step size, also called learning rate).
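
A bare-bones sketch of stochastic gradient descent, fitting a line instead of a network so that everything stays readable. The fixed learning rate, the batch size and the noiseless toy data are assumptions of the example; the smarter optimizers mentioned above adapt the step size instead of keeping it constant.

```python
import random

def sgd_fit_line(xs, ys, steps=2000, batch_size=4, learning_rate=0.1, seed=0):
    """Fits y = slope * x + intercept by following approximate gradients."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    indices = range(len(xs))
    for _ in range(steps):
        # The gradient is estimated on a small random subset of the data.
        batch = rng.sample(indices, batch_size)
        grad_slope = 0.0
        grad_intercept = 0.0
        for i in batch:
            residual = slope * xs[i] + intercept - ys[i]
            grad_slope += 2.0 * residual * xs[i] / batch_size
            grad_intercept += 2.0 * residual / batch_size
        # Step against the gradient; learning_rate is the step size.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

xs = [i / 10.0 - 1.0 for i in range(21)]
ys = [3.0 * x + 1.0 for x in xs]
slope, intercept = sgd_fit_line(xs, ys)
print(slope, intercept)
```

Each step only ever looks at four samples, yet the estimate settles very close to the slope 3 and intercept 1 that generated the data.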

In DNNs we don't really care about reaching a global optimum, it's expected for the model to have myriads of local minima (because of symmetries, of how weights can be permuted in ways that yield the same final solution), but reaching any of them can be hard. Saddle points are more common than local minima, and ill conditioning can make gradient descent not converge.

Regularization: how to reduce the generalization error.

By far the most important principle though is regularization. The idea is to try to learn general models, that do not just perform well on the data we have, but that truly embody the underlying hidden function that generated it. 

A basic regularization method that is almost always used with NNs is early stopping. We split the data into a training set (used to compute the error and the gradients) and a validation set (used to check that the solution is able to generalize). After some training iterations we might notice that the error on the training set keeps going down, but the one on the validation set starts rising: that's when overfitting is beginning to take place (and we should stop the training "early").
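
The stopping rule itself can be sketched in a few lines. Here I assume we already have the validation error measured after each epoch; the `patience` parameter (how many non-improving epochs we tolerate before giving up) is a common convention, but the name and the exact rule are choices of this sketch.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Returns (best_epoch, best_error), stopping once the validation error
    has failed to improve for `patience` epochs in a row."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # overfitting has likely started, stop training
    return best_epoch, best_error

# The validation error goes down, then starts rising again.
history = [1.0, 0.8, 0.6, 0.65, 0.7, 0.5]
print(early_stopping_epoch(history))  # (2, 0.6)
```

Note that the rule never gets to see the last value of the list: that is the price of stopping early, and the reason why the patience is usually tuned.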

In general, regularization can be done by imposing constraints (often these can be added to the error function as extra penalties) and biasing towards simpler models (think Occam's razor) that explain the data; we accept to perform a bit worse in terms of error on the training data set if in exchange we get a simpler, more general solution.
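
As a sketch of what "adding a penalty to the error function" means, take the simplest model possible, y = weight * x, and add a cost proportional to the squared weight (an L2 penalty; the specific penalty and the toy data are my choice for the example). For this tiny case the best weight even has a closed form.

```python
def penalized_error(weight, xs, ys, penalty):
    """Squared error on the data plus a cost for using a large weight."""
    data_term = sum((weight * x - y) ** 2 for x, y in zip(xs, ys))
    return data_term + penalty * weight ** 2

def best_weight(xs, ys, penalty):
    """Minimizer of penalized_error, obtained by setting its derivative to zero."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(best_weight(xs, ys, 0.0))   # 2.0, the plain least-squares answer
print(best_weight(xs, ys, 14.0))  # 1.0, the penalty pulls the weight towards zero
```

With the penalty we fit the training data worse, on purpose: the model has to "pay" for every bit of weight it uses.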

This is really the key to deep neural networks: we don't use small networks, we construct huge ones with a large number of parameters because we know that the underlying problem is complex, it's probably exceeding what a computer can solve exactly.
But at the same time, we steer our training so that our parameters try to be as sparse as possible; it has to have a cost to use a weight, to activate a circuit, this, in turn, ensures that when the network learns something, it's something important. 
We know that we don't have lots of data compared to how big the problem is, we have only a few examples of a very general domain.

Other ways to make sure that the weights are "robust" is to inject noise, or to not always train with all of them but try cutting different parts of the network out as we train (dropout).
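
Dropout in particular is almost trivial to sketch. The version below is the so-called "inverted" flavor, where the surviving activations are scaled up so that their expected value stays the same; that detail, like the function signature, is an assumption of this example rather than the only way to do it.

```python
import random

def dropout(activations, keep_probability, rng):
    """Randomly zeroes activations during training, rescaling the survivors."""
    scale = 1.0 / keep_probability
    return [a * scale if rng.random() < keep_probability else 0.0
            for a in activations]

rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, 0.5, rng)
kept = sum(1 for value in dropped if value != 0.0)
print(kept, "of", len(hidden), "units survived this training step")
```

A different random half of the layer is switched off at every step, so no single unit can become indispensable.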


Google's famous "cat" DNN
Think, for example, of a deep neural network that has to learn how to recognize cats. We might have thousands of photos of cats, but still, there is really an infinite variety, we can't even enumerate all the possible cats in all possible poses, environments and so on, these are all infinite. And we know that in general we need a large model to learn about cats, recognizing complex shapes is something that requires quite some brain-power.
What we want though is to avoid that our model just learns to recognize exactly the handful of cats we showed it, we want it to extract some higher-level knowledge of what cats look like.

Lastly, data augmentation can be used as well: we can always try to generate more data from a smaller set of examples. We can add noise and other "distractions" to make sure that we're not learning too much the specific examples provided. 
Maybe we know that certain transforms are still valid examples of the same data, for example, a rotated cat is still a cat. A cat behind a pillar, or on a different background, is still a cat. Or maybe we can't generate more data of a given kind, but we can generate "adversarial" data: examples of things that are not what we are looking for.
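
A sketch of the "a rotated cat is still a cat" idea, with images as plain nested lists. Which transforms are actually label-preserving depends on the problem; the mirror and the 90-degree rotation used here are just two easy ones I picked.

```python
def flip_horizontal(image):
    """Mirrors each row of the image."""
    return [row[::-1] for row in image]

def rotate_clockwise(image):
    """Rotates the image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]

def augment(dataset):
    """Every (image, label) pair yields the original plus two transformed copies."""
    augmented = []
    for image, label in dataset:
        for variant in (image, flip_horizontal(image), rotate_clockwise(image)):
            augmented.append((variant, label))
    return augmented

cat = [[1, 2, 3],
       [4, 5, 6]]
bigger_dataset = augment([(cat, "cat")])
print(len(bigger_dataset))  # 3
```

The label comes along for free: we tripled the dataset without having to label a single new photo.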

- Do you need all this, in your daily job?

Today there is a ton of hype around deep learning and deep neural networks, with huge investments on all fronts. DNNs are groundbreaking in terms of their representation power and deep learning is even guiding the design of new GPUs and ad-hoc hardware! But, for all the fantastic accomplishments and great results we see, I'd say that most of the time we don't need it...

One key compromise we have with deep learning is that it replaces feature engineering with architecture engineering. True, we don't need to hand-craft features anymore, but this doesn't mean that things just work!
Finding the right architecture for a deep learning problem is hard, and it's still mostly an art, done with experience and trial and error.

This might very well be a bad tradeoff. When we explore data and try to find good features we effectively learn (ourselves) some properties of the data. We make hypotheses about what might be significant and test them. 
In contrast, deep neural networks are much more opaque and impenetrable (even if some progress has been made). And this is important, because it turns out that DNNs can be easily fooled (even if this is being "solved" via adversarial learning).

Architecture engineering also has slower iteration times: we have to train our architecture each time to see how it works, we need to tune the training algorithms themselves... In general building deep models is expensive, both in terms of human and machine time.


Decision trees, forests, gradient boosting are much
more explainable classifiers than DNNs
Deep models are in general much more complex and expensive than hand-crafted feature-based ones when it's possible to find good features for a given problem. In fact nowadays, together with solid and groundbreaking new research, we also see lots of publications of little value, that simply take a problem and apply a questionable DNN spin to it, with results that are not really better than the state of the art solution made with traditional handcrafted algorithms...

And lastly, deep models will always require more data to train. The reason is simple: when we create statistical features ourselves, we're effectively giving the machine learning process some a-priori model. We know that some things make sense, correlate with our problem and that some others do not. 
This a-priori information acts like a constraint: we restrict what we are looking for, but in exchange, we get fewer degrees of freedom and thus fewer data points can be used to fit our model.

In deep learning, we want to discover features by using very powerful models. We want to extract very general knowledge from the data, and in order to do so, we need to show the machine learning algorithm lots of examples...

In the end, the real crux of the issue is that most likely, especially if you're not already working with data and machine learning on a problem, you don't need to use the most complex, state of the art weapon in the machine learning arsenal in order to have great results!

- Conclusions

The way I see all this is that we have a continuum of choices. On one end we have expert-driven solutions: we have a problem, we apply our intellect and we come up with a formula or an algorithm that solves it. This is the conventional approach to (computer) science and when it works, it works great.

Sometimes the problems we need to solve are intractable: we might not have enough information, we might not have an underlying theoretical framework to work with, or we might simply not have in practice enough computational resources.

In these cases, we can find approximations: we make assumptions, effectively chipping away at the problem, constraining it into a simpler one that we know how to solve directly. Often it's easy to make somewhat reasonable assumptions leading to good solutions. It's very hard though to know that the assumptions we made are the best possible.

On the other end, we have "ignorant", black-box solutions: we use computers to automatically discover, learn, how to deal with a problem, and our intelligence is applied catering to the learning process, not the underlying problem we're trying to solve. 
If there are assumptions to be made, we hope that the black-box learning process will discover them automatically from the data, we don't provide any of our own reasoning.

This methodology can be very powerful and yield interesting results, as we didn't pose any limits, it might discover solutions we could have never imagined. On the other hand, it's also a huge waste: we are smart! Using our brain to chip away at a problem can definitely be better than hoping that an algorithm somehow will do something reasonable!

In between, we have an ocean of shades of data-driven solutions... It's like the old adage that one should not try to optimize a program without profiling: in general, we should not try to solve a problem without having observed it a lot, without having generated data and explored the data.

We can avoid making early assumptions: we just observe the data and try to generate as much data as possible, capturing everything we can. Then, from the data, we can find solutions. Maybe sometimes it will be obvious that a given conventional algorithm will work, we can discover through the data new facts about our problem, build a theoretical framework. Or maybe, other times, we will just be able to observe that for no clear reason our data has a given shape, and we can just approximate that and get a solution, even if we don't know exactly why...

Deep learning is truly great, but I think an even bigger benefit of the current DNN "hype", other than the novel solutions it's bringing, is that more generalist programmers are exposed to the idea of not making assumptions and writing algorithms out of thin air, but instead trying to generate data-sets and observe them.
That, to me, is the real lesson for everybody: we have to look at data more, now that it's increasingly easy to generate and explore it.

Deep learning then is just one tool, that sometimes is exactly what we need but most of the time is not. Most of the time we do know where to look. We do not have huge problems in many dimensions. And in these cases very simple techniques can work wonders! Chances are that we don't even need neural networks, they are not in general that special.

Maybe we needed just a couple of parameters and a linear regressor. Maybe a polynomial will work, or a piecewise curve, or a small set of Gaussians. Maybe we need k-means or PCA.

Or we can use data just to prove certain relationships exist to then exploit them with entirely hand-crafted algorithms, using machine learning just to validate an assumption that a given problem is solvable from a small number of inputs... Who knows! Explore!

Links for further reading.

- This is a good DNN tutorial for beginners.
- NN playground is fun.
- Keras is a great python DNN framework.
- Alan Wolfe at Blizzard made some very nice blog posts about NNs.
- You should in general know about data visualization, dimensionality reduction, machine learning, optimization & fitting, symbolic regression...
- A good tutorial on deep reinforcement learning, and one on generative adversarial networks.
- History of DL
- DL reading list. Most cited DNN papers. Another good one.
- Differentiable programming, applies gradient descent to general programming.


Thanks to Dave Neubelt, Fabio Zinno and Bart Wronski 
for providing early feedback on this article series!

26 March, 2017

A programmer's sightseeing tour: Machine Learning and Deep Neural Networks (part 1)

TL;DR: You probably don't need DNNs, but you ABSOLUTELY SHOULD know and practice data analysis!

This won't be short...

- Machine Learning

Machine learning is a huge field nowadays, with lots of techniques and sub-disciplines. It would be very hard for me to provide an overview in a single article, and I certainly don't claim to know all about it.
The goal of this article is to introduce you to the basic concepts, just enough so we can orient ourselves and understand what we might need in our daily job as programmers.

I'll try to do so using terminology that is as close as possible to what a programmer might expect, instead of the grammar of machine learning, which annoyingly often likes to call the same things in different ways depending on the specific subdomain.
This is particularly a shame because as we'll soon see, lots of different fields, even disciplines that are not even usually considered to be "machine learning", are really intertwined and closely related.

- Supervised and unsupervised learning

The first thing we have to know is that there are two main kinds of machine learning: supervised and unsupervised learning. 
Both deal with data, or if you wish, functions that we don't have direct access to but that we know through a number of samples of their outputs.

In the case of supervised learning, our data comes in the form of input->output pairs; each point is a vector of the unknown function inputs and it's labeled with the return value.
Our job is to learn a functional form that approximates the data; in other words, through data, we are learning a function that approximates a second unknown one.

Clearly supervised learning is closely related to function approximation. Another name for this is regression analysis or function fitting: we want to estimate the relationship between the input and output variables. Also related is (scattered) data interpolation and Kriging: in all cases we have some data points and we want to find a general function that underlies them.

Most of the time the actual methods that we use to fit functions to data come from numerical optimization: our model functions have a given number of degrees of freedom, the flexibility to take different shapes, and optimization is used to find the parameters that bring the model as close as possible to the data (that minimize the error).

Function fitting: 1D->1D
If the function's outputs are from a discrete set instead of being real numbers, supervised learning is also called classification: our function takes an input and emits a class label (1, 2, 3,... or cats, dogs, squirrels,...). Our job is, having seen some examples of this classification at work, to learn a way to do the same job on inputs that are outside the data set provided.

Binary classifier: 2D->label
For unsupervised learning, on the other hand, the data is just made of points in space, we have no labels, no outputs, just a distribution of samples.

As we don't have outputs, fitting a function sounds harder: functions are relations of inputs to their outputs. What we could do though is organize these points to discover relationships among them: maybe they form clusters, or maybe they span a given surface (manifold) in their n-dimensional space.

We can see clustering as a way of classifying data without knowing what the classes are, a-priori. We just notice that certain inputs are similar to each other, and we group these in a cluster. 
Maybe later we can observe the points in the cluster and decide that it's made of cats, assign a label a-posteriori.
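
As a rough illustration of grouping similar inputs, here is a minimal sketch of k-means, one common clustering method. The function names, the sample points and the hand-picked starting centers are all assumptions made for this example; practical implementations choose the starting centers more carefully and stop when the assignments no longer change.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_center(p, centers):
    # Index of the center closest to point p.
    return min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))

def mean_point(points):
    # Component-wise average of a non-empty list of points.
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (a center with an empty group stays where it is).
        centers = [mean_point(g) if g else c for g, c in zip(groups, centers)]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels

# Two obvious blobs; the labels are just cluster indices, with no meaning yet.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

The labels that come out are only indices (0 and 1 here); deciding that cluster 0 is "cats" is the a-posteriori step described above.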

2D Clustering
Closely related to clustering is dimensionality reduction (and dictionary learning/compressed sensing): if we have points in an n-dimensional space, and we can cluster them in k groups, where k is less than n, then probably we can express each point by saying how close to each group it is (projection), thus using k dimensions instead of n.
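
To make the idea of a projection concrete, here is a small sketch for the simplest case, 2D points expressed with a single coordinate. Note that it swaps in a different technique from the cluster-distance scheme above: it finds the main axis of the point cloud (principal component analysis, written in closed form for two dimensions) and stores each point as its position along that axis. All the names and the sample data are assumptions made for this example.

```python
import math

def principal_axis(points):
    """Return the mean of 2D points and the unit direction of largest variance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Closed-form orientation of the major axis of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(angle), math.sin(angle))

def project(point, mean, direction):
    # 2D -> 1D: signed position of the point along the axis.
    return ((point[0] - mean[0]) * direction[0]
            + (point[1] - mean[1]) * direction[1])

def reconstruct(t, mean, direction):
    # 1D -> 2D: the point on the axis at position t.
    return (mean[0] + t * direction[0], mean[1] + t * direction[1])

# Sample points that happen to lie exactly on a line, so nothing is lost.
points_2d = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
mean, direction = principal_axis(points_2d)
coords_1d = [project(p, mean, direction) for p in points_2d]
rebuilt = [reconstruct(t, mean, direction) for t in coords_1d]
```

With noisy data the reconstruction would only be approximate: the distance from each point to the axis is exactly the information the projection throws away.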

2D->1D Projection
Eigenfaces
Dimensionality reduction is, in turn, closely related to finding manifolds: let's imagine that our data are points in three dimensions, but we observe that they all lie always on the unit sphere.
Without losing any information, we can express them as coordinates on the sphere surface (longitude and latitude), thus saving one dimension by noticing that our data lay on a parametric surface.

And (loosely speaking) every time we can project points to a lower dimension we have in turn found a surface: if we take all the possible coordinates in the lower-dimensionality space they will map to some points of the higher-dimensionality one, generating a manifold.
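
The sphere example can be sketched in a few lines. The function names are made up for this illustration, and the angle convention (longitude measured in the x-y plane, latitude measured from the equator, both in radians) is just one possible choice.

```python
import math

def to_lon_lat(x, y, z):
    """3D -> 2D: a point on the unit sphere expressed as (longitude, latitude)."""
    lon = math.atan2(y, x)
    # Clamp z to guard against rounding slightly outside [-1, 1].
    lat = math.asin(max(-1.0, min(1.0, z)))
    return lon, lat

def from_lon_lat(lon, lat):
    """2D -> 3D: every (longitude, latitude) pair maps back onto the sphere."""
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

point = (0.0, 0.6, 0.8)           # lies on the unit sphere: 0.36 + 0.64 = 1
lon, lat = to_lon_lat(*point)     # two numbers instead of three
restored = from_lon_lat(lon, lat)
```

Sweeping `from_lon_lat` over all possible coordinates generates the whole sphere, which is the sense in which the lower-dimensional coordinates define a manifold.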

Interestingly though unsupervised learning is also related to supervised learning in a way: if we think of our hidden, unknown function as a probability density one, and our data points as samples extracted according to said probability, then unsupervised learning really just wants to find an expression of that generating function. This is also the very definition of density estimation!

Finally, we could say that the two are also related to each other through the lens of dimensionality reduction, which can be seen as nothing else than a way to learn an identity function (inputs map to outputs) where we have the constraint that the function, internally, has to lose some information, has to have a bottleneck that ensures the input data is mapped to a small number of parameters.

- Function fitting

Confused yet? Head spinning? Don't worry. Now that we have seen that most of these fields are somewhat related, we can choose just one and look at some examples. 

The idea that most programmers will be most familiar with is function fitting. We have some data, inputs and outputs, and we want to fit a function to it so that for any given input our function has the smallest possible error when compared with the outputs given.

This is commonly the realm of numerical optimization. 

Let's suppose our data can be modeled as a line. A line has only two parameters: y=a*x+b. We want to find the values of a and b so that, over all the data points (x1,y1),(x2,y2)...(xN,yN), our error is minimized, for example the L2 distance between the line's outputs and the data outputs.
This is a very well studied problem, it's called linear regression, and in the way it's posed it's solvable using linear least squares.
Note: if instead of wanting to minimize the distance between the data output and the function output, we want to minimize the distance between the data points and the line itself, we end up with principal component analysis/singular value decomposition, a very important method for dimensionality reduction - again, all these fields are intertwined!
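
For a single input variable the least-squares line has a well-known closed-form solution, sketched below. The function name and the sample data are assumptions made for this example, and the code assumes the x values are not all identical (otherwise the division by `sxx` fails).

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance of x and y, and variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept: the line passes through the means
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]      # generated from y = 2x + 1, no noise
a, b = fit_line(xs, ys)
```

With noise-free data the fit recovers the generating line; with noisy data it returns the line that minimizes the sum of squared vertical distances.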

Now, you can imagine that if our data is very complicated, approximating it with a line won't really do much, we need more powerful models. Roughly speaking we can construct more powerful models in two ways: we either use more pieces of something simple, or we start using more complicated pieces.

So, on one extreme we can think of just using linear segments, but using many of them (fitting a piecewise linear curve). On the other, we can think instead of fitting higher-order polynomials, or rational functions, or even of finding an arbitrary function made of any combination of any number of operators (symbolic regression, often done via genetic programming).

Polynomial versus piecewise linear.
The rule of thumb is that simpler models are usually easier to fit (train), but might be wasteful and grow rather large (in terms of the number of parameters). More powerful models might be much harder to fit (global nonlinear optimization), but can be more succinct.

- Neural Networks

For all the mystique there is around Neural Networks and their biological inspiration, the crux of the matter is that they are nothing more than a way to approximate functions, rather like many others, but made from a specific building block: the artificial neuron.

This neuron is conceptually very simple. At heart is a linear function: it takes a number of inputs, it multiplies them with a weight vector, it adds them together into a single number (a dot product!) and then it adds a bias value (optionally).
The only "twist" there is that after the linear part is done, a non-linear function (the activation function) is applied to the results.

If the activation function is a step (outputting one if the result was positive, zero otherwise), we have the simplest kind of neuron and the simplest neural classifier (a binary one, only two classes): the perceptron.
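
In code, such a neuron is only a few lines. This is a minimal sketch: the names are invented for the example, and the weights are picked by hand (not learned) so that the perceptron behaves like a logical AND of its two inputs.

```python
def step(value):
    # Step activation: one if the result was positive, zero otherwise.
    return 1 if value > 0 else 0

def neuron(inputs, weights, bias, activation):
    # Linear part: dot product of inputs and weights, plus the bias...
    linear = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ...followed by the non-linear activation function.
    return activation(linear)

# Hand-picked parameters: the output is 1 only when both inputs are 1.
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [neuron([a, b], and_weights, and_bias, step)
               for a in (0, 1) for b in (0, 1)]
```

Swapping `step` for another function changes the kind of neuron without touching the linear part.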

Perceptron
In general, we can use many nonlinear functions as activations, depending on the task at hand.
Regardless of this choice though, it should be clear that with a single neuron we can't do much. In fact, all we can ever do is express a distance from a hyperplane (again, we're doing a dot product), somewhat modified by the activation. The real power of neural networks comes from the "network" part.

Source
The idea is again simple: if we have N inputs, we can connect to them M neurons. These neurons will each give one output, so we end up with M outputs, and we can call this structure a neural "layer".
We can then rinse and repeat: the M outputs can be considered as inputs of a second layer of neurons and so on, till we decide enough is enough and at the final layer we use a number of outputs equal to that of the function we are seeking to approximate (often just one, but nothing prevents us from learning vector-valued functions).

The first layer, connected to our input data, is unimaginatively called the input layer, the last one is called the output layer, and any layer in between is considered a "hidden" layer. Non-deep neural networks often employ a single hidden layer.

We could write down the entire neural network as a single formula, it would end up nothing more than a nested sequence of matrix multiplies and function applications. In this formula we'll have lots of unknowns, the weights we use in the matrix multiplies. The learning process is nothing else than optimization, we find the best weights that minimize the error of our neural network to the data given.
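
Here is a small sketch of that nested structure, written with plain lists instead of a matrix library. Everything in it is an assumption made for illustration: the function names, the choice of ReLU (max(0, x)) as activation, and in particular the weights, which are set by hand rather than learned so that the network computes XOR, something a single neuron cannot do.

```python
def relu(value):
    return max(0.0, value)

def identity(value):
    return value

def layer(inputs, weights, biases, activation):
    # One row of weights per neuron: this is a matrix-vector multiply,
    # plus a bias, followed by the activation applied to each output.
    return [activation(sum(x * w for x, w in zip(inputs, row)) + bias)
            for row, bias in zip(weights, biases)]

def network(inputs, layers):
    # The outputs of each layer become the inputs of the next one.
    values = inputs
    for weights, biases, activation in layers:
        values = layer(values, weights, biases, activation)
    return values

# Hand-set weights: 2 inputs -> 2 hidden neurons -> 1 output.
xor_layers = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu),   # hidden layer
    ([[1.0, -2.0]], [0.0], identity),                # output layer
]
xor_outputs = [network([a, b], xor_layers)[0]
               for a in (0.0, 1.0) for b in (0.0, 1.0)]
```

In a real setting the numbers inside `xor_layers` are the unknowns that the optimization has to find.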

Because we typically have lots of weights, this is a rather large optimization problem, so typically fast, local, gradient-descent based optimizers are used. The idea is to start with an arbitrary set of weights and then update them by following the function partial derivatives towards a local minimum of the error.
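
To show the mechanics on something much smaller than a neural network, here is a sketch of gradient descent fitting the two parameters of a line, y=a*x+b, by minimizing the mean squared error. The function name, learning rate, step count and sample data are all assumptions chosen for this example; with a learning rate that is too large the same loop would diverge.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.1, steps=1000):
    a, b = 0.0, 0.0   # arbitrary starting point
    n = len(xs)
    for _ in range(steps):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean(error^2) with respect to a and b.
        grad_a = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Move against the gradient, towards lower error.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

sample_xs = [0.0, 1.0, 2.0, 3.0]
sample_ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 2x + 1
gd_a, gd_b = fit_line_gradient_descent(sample_xs, sample_ys)
```

This particular error function has a single minimum, so the starting point does not matter; for a neural network the same procedure only guarantees a local minimum.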

Source. See also this.
We need the partial derivatives for this process to work. It's impractical to compute them symbolically, so automatic differentiation is used, typically via a process called "backpropagation". Other methods could be used as well, or we can even have a mix of methods, using hand-written symbolic derivatives for certain parts where we know how to compute them, and automatic differentiation for others.
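
The core of backpropagation is the chain rule applied from the output back towards the inputs. The sketch below does this by hand for a single neuron with a sigmoid activation and a squared-error loss, then compares the result against a numerical estimate. It is not an automatic differentiation system, just the calculation such a system would perform for this tiny case; the names and the sample values are assumptions made for the example.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def loss(w, b, x, target):
    # Squared error of a one-input neuron with a sigmoid activation.
    return (sigmoid(w * x + b) - target) ** 2

def loss_gradient(w, b, x, target):
    # Forward pass, keeping the intermediate values.
    z = w * x + b
    out = sigmoid(z)
    # Backward pass: chain rule, from the loss back to the parameters.
    dloss_dout = 2.0 * (out - target)
    dout_dz = out * (1.0 - out)       # derivative of the sigmoid
    dloss_dz = dloss_dout * dout_dz
    return dloss_dz * x, dloss_dz     # d(loss)/dw, d(loss)/db

def numeric_gradient(w, b, x, target, eps=1e-6):
    # Central finite differences, only used here as a sanity check.
    dw = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)
    db = (loss(w, b + eps, x, target) - loss(w, b - eps, x, target)) / (2 * eps)
    return dw, db

analytic = loss_gradient(0.5, -0.2, 1.5, 1.0)
numeric = numeric_gradient(0.5, -0.2, 1.5, 1.0)
```

Finite differences need two evaluations of the loss per parameter, which is one reason they are used to check gradients rather than to train networks with many weights.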

Under certain assumptions, it can be shown that a neural network with a single hidden layer is a universal approximator: with a finite (but potentially large) number of neurons it could approximate any continuous function on compact subsets of n-dimensional real spaces (we might not be able to train it well, though...).

Part 2...