
Showing posts with label Programming tutorials. Show all posts

26 March, 2017

A programmer's sightseeing tour: Machine Learning and Deep Neural Networks (part 1)

TL;DR: You probably don't need DNNs, but you ABSOLUTELY SHOULD know and practice data analysis!

This won't be short...

- Machine Learning

Machine learning is a huge field nowadays, with lots of techniques and sub-disciplines. It would be very hard for me to provide an overview in a single article, and I certainly don't claim to know all about it.
The goal of this article is to introduce you to the basic concepts, just enough so we can orient ourselves and understand what we might need in our daily job as programmers.

I'll try to do so using terminology as close as possible to what a programmer might expect, instead of the jargon of machine learning, which annoyingly often calls the same things by different names depending on the specific subdomain.
This is a particular shame because, as we'll soon see, lots of different fields, even disciplines not usually considered "machine learning", are really intertwined and closely related.

- Supervised and unsupervised learning

The first thing we have to know is that there are two main kinds of machine learning: supervised and unsupervised learning. 
Both deal with data, or if you wish, functions that we don't have direct access to but that we know through a number of samples of their outputs.

In the case of supervised learning, our data comes in the form of input->output pairs: each point is a vector of the unknown function's inputs, labeled with the corresponding return value.
Our job is to learn a functional form that approximates the data; in other words, through the data we learn a function that approximates a second, unknown one.

Clearly supervised learning is closely related to function approximation. Another name for this is regression analysis or function fitting: we want to estimate the relationship between the input and output variables. Also related is (scattered) data interpolation and Kriging: in all cases we have some data points and we want to find a general function that underlies them.

Most of the time, the actual methods we use to fit functions to data come from numerical optimization: our model functions have a given number of degrees of freedom (flexibility to take different shapes), and optimization is used to find the parameters that make the model as close as possible to the data, i.e. that minimize the error.

Function fitting: 1D->1D
If the function's outputs come from a discrete set instead of being real numbers, supervised learning is also called classification: our function takes an input and emits a class label (1, 2, 3,... or cats, dogs, squirrels,...). Our job is, having seen some examples of this classification at work, to learn to do the same job on inputs outside the provided data set.

Binary classifier: 2D->label
For unsupervised learning, on the other hand, the data is just made of points in space: we have no labels, no outputs, just a distribution of samples.

As we don't have outputs, fitting a function sounds harder: functions are relations from inputs to outputs. What we can do, though, is organize these points to discover relationships among them: maybe they form clusters, or maybe they span a given surface (manifold) in their n-dimensional space.

We can see clustering as a way of classifying data without knowing a-priori what the classes are. We just notice that certain inputs are similar to each other, and we group them into a cluster.
Maybe later we can look at the points in a cluster, decide that it's made of cats, and assign a label a-posteriori.

2D Clustering
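To make the idea concrete, here's a minimal k-means sketch in plain NumPy; the data, the number of clusters and the iteration count are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic 2D blobs (hypothetical data, just for illustration)
points = np.vstack([rng.normal(0.0, 0.5, (100, 2)),
                    rng.normal(3.0, 0.5, (100, 2))])

def kmeans(points, k, iters=20):
    # Start from k randomly chosen data points as centroids
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids

labels, centroids = kmeans(points, 2)
```

No labels anywhere: the algorithm only exploits the fact that some points are close to each other.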
Closely related to clustering is dimensionality reduction (and dictionary learning/compressed sensing): if we have points in an n-dimensional space and we can cluster them into k groups, with k less than n, then we can probably express each point by saying how close it is to each group (projection), thus using k dimensions instead of n.

2D->1D Projection
Eigenfaces
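As a sketch of the underlying machinery (not of any specific eigenfaces pipeline), here is PCA via the SVD reducing synthetic 2D data to 1D; the data and dimensions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 2D data that mostly varies along one direction
t = rng.normal(size=200)
data = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

# PCA: center the data, then the SVD gives the principal directions
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the first principal component: 2D -> 1D
projected = centered @ Vt[0]                       # one number per point
# Mapping back gives the closest points on the principal line
reconstructed = np.outer(projected, Vt[0]) + data.mean(axis=0)
```

The projection keeps one number per point, yet reconstructs the original two coordinates with little error, because the data really lived near a 1D subspace.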
Dimensionality reduction is, in turn, closely related to finding manifolds: let's imagine that our data are points in three dimensions, but we observe that they all lie on the unit sphere.
Without losing any information, we can express them as coordinates on the sphere's surface (longitude and latitude), thus saving one dimension by noticing that our data lies on a parametric surface.
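A toy version of the sphere example, in NumPy, with invented sample data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Random points on the unit sphere: 3 coordinates per point
v = rng.normal(size=(500, 3))
xyz = v / np.linalg.norm(v, axis=1, keepdims=True)

# Two manifold coordinates suffice: latitude and longitude
lat = np.arcsin(xyz[:, 2])              # in [-pi/2, pi/2]
lon = np.arctan2(xyz[:, 1], xyz[:, 0])  # in [-pi, pi]

# The original 3D points can be reconstructed exactly from (lat, lon)
back = np.column_stack([np.cos(lat) * np.cos(lon),
                        np.cos(lat) * np.sin(lon),
                        np.sin(lat)])
```

Three numbers per point went in, two came out, and nothing was lost: that's the dimensionality reduction we get by knowing the manifold.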

And (loosely speaking) all the times we can project points to a lower dimension we have in turn found a surface: if we take all the possible coordinates in the lower-dimensionality space they will map to some points of the higher-dimensionality one, generating a manifold. 

Interestingly though, unsupervised learning is also related to supervised learning in a way: if we think of our hidden, unknown function as a probability density function, and of our data points as samples drawn according to said density, then unsupervised learning really just wants to find an expression for that generating function. This is also the very definition of density estimation!

Finally, we could say that the two are also related through the lens of dimensionality reduction, which can be seen as nothing other than learning an identity function (inputs map to outputs) under the constraint that the function, internally, has to lose some information: it must have a bottleneck that forces the input data through a small number of parameters.

- Function fitting

Confused yet? Head spinning? Don't worry. Now that we have seen that most of these fields are somewhat related, we can choose just one and look at some examples. 

The idea that most programmers will be most familiar with is function fitting. We have some data, inputs and outputs, and we want to fit a function to it so that for any given input our function has the smallest possible error when compared with the outputs given.

This is commonly the realm of numerical optimization. 

Let's say we suppose our data can be modeled as a line. A line has only two parameters, y = a*x + b; we want to find the values of a and b that minimize the error, for example the L2 distance, over the data points (x1,y1),(x2,y2)...(xN,yN).
This is a very well-studied problem called linear regression, and as posed it's solvable using linear least squares.
Note: if instead of wanting to minimize the distance between the data output and the function output, we want to minimize the distance between the data points and the line itself, we end up with principal component analysis/singular value decomposition, a very important method for dimensionality reduction - again, all these fields are intertwined!
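A minimal linear-regression sketch using NumPy's least-squares solver; the line and the noise level are made up:

```python
import numpy as np

# Noisy samples of a hypothetical line y = 2x + 1
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, 50)

# Linear least squares: find [a, b] minimizing ||A @ [a, b] - y||^2
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
```

The solver recovers a close to 2 and b close to 1 despite the noise; that recovery is exactly the "learning" in this simplest supervised model.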

Now, you can imagine that if our data is very complicated, approximating it with a line won't really do much; we need more powerful models. Roughly speaking, we can construct more powerful models in two ways: we either use more pieces of something simple, or we start using more complicated pieces.

So, on one extreme we can use simple linear segments, but many of them (fitting a piecewise linear curve); on the other, we can fit higher-order polynomials, or rational functions, or even search for an arbitrary function made of any combination of any number of operators (symbolic regression, often done via genetic programming).

Polynomial versus piecewise linear.
The rule of thumb is that simpler models are usually easier to fit (train), but might be wasteful and grow rather large (in terms of the number of parameters). More powerful models might be much harder to fit (global nonlinear optimization), but more succinct.

- Neural Networks

For all the mystique there is around Neural Networks and their biological inspiration, the crux of the matter is that they are nothing more than a way to approximate functions, rather like many others, but made from a specific building block: the artificial neuron.

This neuron is conceptually very simple. At heart it is a linear function: it takes a number of inputs, multiplies them by a weight vector, adds them together into a single number (a dot product!) and then optionally adds a bias value.
The only "twist" is that after the linear part, a non-linear function (the activation function) is applied to the result.

If the activation function is a step (outputting one if the result was positive, zero otherwise), we have the simplest kind of neuron and the simplest neural classifier (a binary one, only two classes): the perceptron.

Perceptron
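A perceptron, and its classic learning rule, can be sketched in a few lines of NumPy; the AND-gate data set here is just a toy, linearly separable example:

```python
import numpy as np

# A single neuron: dot product, bias, then a step activation
def perceptron(w, b, x):
    return 1 if np.dot(w, x) + b > 0 else 0

# The perceptron learning rule on a toy separable set (an AND gate)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = np.zeros(2), 0.0
for _ in range(20):                      # 20 epochs is plenty here
    for xi, yi in zip(X, y):
        err = yi - perceptron(w, b, xi)  # nonzero only on mistakes
        w += err * xi                    # nudge the separating hyperplane
        b += err
```

The update rule only fires on misclassified points, and for linearly separable data it is guaranteed to converge to a separating hyperplane.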
In general, we can use many nonlinear functions as activations, depending on the task at hand.
Regardless of this choice, though, it should be clear that with a single neuron we can't do much; in fact, all we can ever express is a distance from a hyperplane (again, we're doing a dot product), somewhat modified by the activation. The real power of neural networks comes from the "network" part.

The idea is again simple: if we have N inputs, we can connect to them M neurons. These neurons will each give one output, so we end up with M outputs, and we can call this structure a neural "layer".
We can then rinse and repeat: the M outputs can be considered as inputs of a second layer of neurons, and so on, till we decide enough is enough; at the final layer we use a number of outputs equal to that of the function we're seeking to approximate (often just one, but nothing prevents us from learning vector-valued functions).

The first layer, connected to our input data, is unimaginatively called the input layer, the last one is called the output layer, and any layer in between is considered a "hidden" layer. Non-deep neural networks often employ a single hidden layer.

We could write the entire neural network down as a single formula: it would be nothing more than a nested sequence of matrix multiplies and function applications. In this formula we'd have lots of unknowns: the weights used in the matrix multiplies. The learning process is nothing else than optimization: we find the weights that minimize the error of our neural network against the given data.
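A sketch of that formula in code, with invented layer sizes and random weights:

```python
import numpy as np

def mlp_forward(x, layers, activation=np.tanh):
    # Each layer is a (W, b) pair; the whole net is nested matmuls + activations
    for W, b in layers[:-1]:
        x = activation(W @ x + b)
    W, b = layers[-1]            # linear output layer
    return W @ x + b

rng = np.random.default_rng(4)
# 2 inputs -> hidden layer of 5 neurons -> 1 output (hypothetical sizes)
layers = [(rng.normal(size=(5, 2)), rng.normal(size=5)),
          (rng.normal(size=(1, 5)), rng.normal(size=1))]
y = mlp_forward(np.array([0.3, -0.7]), layers)
```

The weights here are random, so the output is meaningless; training is what turns this generic formula into an approximation of our data.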

Because we typically have lots of weights, this is a rather large optimization problem, so fast, local, gradient-descent-based optimizers are used. The idea is to start with an arbitrary set of weights and then update them by following the partial derivatives towards a local minimum of the error.

We need the partial derivatives for this process to work. It's impractical to compute them symbolically, so automatic differentiation is used, typically via a process called "backpropagation"; other methods could be used as well, or even a mix of methods, with hand-written symbolic derivatives for parts where we know how to compute them, and automatic differentiation for the others.
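For a feel of what training does, here's a toy single-hidden-layer network fit to a sine, with the gradients written out by hand (i.e. backpropagation done manually rather than by an autodiff library); all sizes, the learning rate and the iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-np.pi, np.pi, 64)[None, :]  # 1 x N inputs
t = np.sin(x)                                # targets

H = 16                                       # hidden neurons (arbitrary)
W1, b1 = rng.normal(0, 0.5, (H, 1)), np.zeros((H, 1))
W2, b2 = rng.normal(0, 0.5, (1, H)), np.zeros((1, 1))
lr = 0.1

for _ in range(3000):
    # Forward pass
    h = np.tanh(W1 @ x + b1)                 # H x N
    y = W2 @ h + b2                          # 1 x N
    # Backward pass: the chain rule, layer by layer (backpropagation)
    dy = 2 * (y - t) / x.shape[1]            # d(MSE)/dy
    dW2 = dy @ h.T
    db2 = dy.sum(axis=1, keepdims=True)
    dh = W2.T @ dy
    dz = dh * (1 - h ** 2)                   # derivative of tanh
    dW1 = dz @ x.T
    db1 = dz.sum(axis=1, keepdims=True)
    # Gradient descent step on every parameter
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g
```

Each step nudges the weights down the error gradient; after a few thousand steps the network's output hugs the sine far better than any single line could.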

Under certain assumptions, it can be shown that a neural network with a single hidden layer is a universal approximator: with a finite (but potentially large) number of neurons, it could approximate any continuous function on compact subsets of n-dimensional real space (though we might not be able to train it well).

Part 2...

06 September, 2014

Scientific Python 101

As for the Mathematica 101, after the (long) introduction I'll be talking with code...

Introduction to "Scientific Python"

In this article I'll assume a basic knowledge of Python; if you need to get up to speed, learnXinYminutes is the best resource for a programmer.

With "Scientific Python" I refer to an ecosystem of Python packages built around NumPy/SciPy/IPython. I recommend installing a scientific Python distribution; I think Anaconda is by far the best (PythonXY is an alternative). You could grab the packages from PyPI with pip in any Python distribution, but it's more of a hassle.

NumPy is the building block for most other packages. It provides a matlab-like n-dimensional array class that provides fast computation via Blas/Lapack. It can be compiled with a variety of Blas implementations (Intel's MKL, Atlas, Netlib's, OpenBlas...), a perk of using a good distribution is that it usually comes with the fastest option for your system (which usually is multithreaded MKL). SciPy adds more numerical analysis routines on top of the basic operations provided by NumPy.
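A small taste of the NumPy array model (the values are arbitrary):

```python
import numpy as np

# Vectorized, matlab-like operations on n-dimensional arrays
a = np.arange(12.0).reshape(3, 4)   # a 3x4 matrix: 0, 1, ..., 11
row_means = a.mean(axis=1)          # reduce along one axis
b = a - row_means[:, None]          # broadcasting: subtract per row
c = b @ b.T                         # matrix product (dispatched to BLAS)
```

Whole-array operations like these are what let NumPy reach native speeds while staying inside Python.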

IPython (Jupyter) is a notebook-like interface similar to Mathematica's (really, it's a client-server infrastructure with different clients, but the only one that really matters is the HTML-based notebook one). 
An alternative environment is Spyder, which is more akin to Matlab's or Mathematica Workbench (a classic IDE) and also embeds IPython consoles for immediate code execution.

Especially when learning, it's probably best to start with IPython Notebooks.

Why I looked into SciPy

While I really like Mathematica for exploratory programming and scientific computation, there are a few reasons that compelled me to look for an alternative (other than Wolfram being an ass that I hate having to feed).

First of all, Mathematica is commercial -and- expensive (same as Matlab, btw). That really doesn't matter when I use it as a tool to explore ideas and make results that will be used somewhere else, but it's really bad when you want to use it as a programming language.

I wouldn't really want to redistribute the code I write in it, and even deploying "executables" is not free. Not to mention not many people know Mathematica to begin with.
Python, in comparison, is very well known, free, and integrated pretty much everywhere. I can drop my code directly in Maya (or any other package really, python is everywhere) for artists to use, for example.

Another big advantage is that Python is familiar: even for people who don't know it, it's a simple imperative scripting language.
Mathematica is in contrast a very odd Lisp, which will look strange at first even to people who know other Lisps. Also, it's mostly about symbolic computation, and the way it evaluates can be quite mysterious. CPython internals, on the other hand, can be quite easily understood.

Lastly, a potential problem lies in the fact that python packages aren't guaranteed to have all the same licensing terms, and you might need many of them. Verifying that everything you end up installing can be used for commercial purposes is a bit of a hassle...

How does it fare?

It's free. It's integrated everywhere. It's familiar. It has lots of libraries. It works. It -can- be used as a Mathematica or Matlab replacement, while being free, so every time you need to redistribute your work (research!) it should be considered.

But it has still (many) weaknesses.

As a tool for exploratory programming, Mathematica is miles ahead. Its documentation is great, it comes with a wealth of great tools, and its visualization options are probably the best, bar none.
Experimentation is an order of magnitude better when you have good visualization and interactivity support, and Mathematica, right now, kills the competition on that front.
Manipulate[] is extremely simple, plotting is decently fast and the quality is quite high, there is lots of thought behind how the plots work, picking reasonable defaults, being numerically reliable and so on.

In Python on the other hand you get IPython and matplotlib. Ok, you got a ton of other libraries too, but matplotlib is popular and the basis of many others too. 
IPython doesn't display output when assignments are made, and it displays only the last evaluated expression. Matplotlib is really slow, really ugly, and uses a ton of memory. Also, you can either get it integrated in IPython with zero interactivity, or in a separate window with just very bare-bones support for plot rotation/translation/scaling.

There are other tools you can use, but most are 2D only, some are very fast and 3D but more cumbersome to use and so on and so forth...
Update: nowadays there are a few more libraries using WebGL, which are both fast and allow interactivity in IPython!

As a CAS I also expect Mathematica to be the best, you can do CAS in Python via SymPy/Sage/Mathics but I don't rely too much on that, personally, so I'm not in a position to evaluate.

Overall, I'll still be using Mathematica for many tasks, it's a great tool.

As a tool for numerical computation it fares better. Its main rival would be Matlab, whose strength really lies in the great toolboxes Mathworks provides. 
Even if the SciPy ecosystem is large with a good community, there are many areas where its packages are lacking, not well supported or immature.

Sadly though, for the most part Matlab is popular not because of unique functionality it provides, but because MathWorks markets well to academia, and it became the language of choice for many researchers and courses.
Also, researchers don't care about redistributing source nearly as much as they really should; this day and age, it's all still about printed publications...

So, is Matlab dead? Not even close, and to be honest, there are many issues Python has to solve. Overall though, things are shifting already, and I really can't see a bright future for Matlab or its clones, as fundamentally Python is a much better language, and for research being open is probably the most important feature. We'll see.

A note on performance and exploration

For some reason, most of the languages for scientific exploratory programming are really slow. Python, Matlab, Mathematica, they are all fairly slow languages. 

The usual argument is that it doesn't matter at all, because these are scripting languages used to glue very high-performance numerical routines. And I would totally agree. If it didn't matter.
A language for exploratory programming has to be expressive and high-level, but also fast enough for the abstractions not to fall on their knees. Sadly, Python isn't.

Even with simple code, if you're processing a modicum of data you'll need to know its internals, and the variety of options available for optimization. It's similar in this regard to Mathematica, where using functions like Compile often requires planning the code up-front to fit the restrictions of such optimizers.
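A typical example of the kind of performance pattern you have to know: the same reduction written as a Python loop and as a single vectorized call (exact timings will of course vary by machine):

```python
import time
import numpy as np

data = np.random.rand(1_000_000)

# Pure-python loop: every element gets boxed into a Python float
t0 = time.perf_counter()
acc = 0.0
for v in data:
    acc += v * v
t_loop = time.perf_counter() - t0

# The same reduction as a single call into native code
t0 = time.perf_counter()
acc_np = float(np.dot(data, data))
t_np = time.perf_counter() - t0
```

On a typical machine the loop is two orders of magnitude slower; nothing about the algorithm changed, only how much of it runs inside the interpreter.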

Empirically though it seems that the amount of time I had to spend minding performance patterns in Python is even higher than what I do in Mathematica. I suspect it's because many packages are pure python.

It's true that you can do all the optimization staying inside the interactive environment, not needing to change languages. That's not bad. But if you start having to spend a significant amount of time thinking about performance, instead of transforming data, it's a problem.

Also, it's a mystery to me why most scientific languages are not built for multithreading at all. All of them, Python, Matlab and Mathematica, execute only some underlying C code in parallel (e.g. BLAS routines), but nothing else (none of the routines not written in native code, often things such as plots, optimizers, integrators).

Even Julia, which was built specifically for performance, doesn't really do multithreading so far, just "green" threads (one at a time, like python) and multiprocessing.

Multiprocessing in Python is great; IPython makes it a breeze to configure a cluster of machines, or even many processes on a local machine. But it still requires orders of magnitude more effort than threading, and it kills interactivity (you push global objects, imports and functions manually across instances).
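A minimal multiprocessing sketch with the standard library; the work function and the sizes are placeholders:

```python
import math
from multiprocessing import Pool

def work(n):
    # CPU-bound stand-in for real number crunching
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    # Each task runs in a separate interpreter process; the arguments
    # and results are pickled and shipped between processes
    with Pool(4) as pool:
        results = pool.map(work, [200_000] * 8)
```

Unlike threads, every object crossing the process boundary has to be serialized, which is exactly the friction the paragraph above complains about.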

Mathematica at least does the multiprocessing data distribution automatically, detecting dependencies and input data that need to be transferred.

Learn by example: 



Other resources:

Tutorials
Packages
  • Scipy: numpy, scipy, matplotlib, sympy, pandas
  • Optimization and learning
  • Dill, a package that can serialize/snapshot a Python kernel. Useful when one wants to stop working on an IPython session but be able to pick it up again from the same state next time.
  • Performance
    • A comparison of Cython, Numba, PyCuda, PyOpenCl, NumPy and other frameworks on a simple problem (Mandelbrot set)
    • SciPy Weave, inlines C code in Python code, compiles and links to python on demand. Deprecated. Try Cython instead.
    • Numba, a numpy "aware" compiler, targets LLVM, compiles in runtime (annotated functions)
    • Cython, compiles annotated Python to C. Bottleneck uses it to accelerate some NumPy functions. (see also Shedskin, Pythran and ocl)
    • JobLib, makes multiprocessing easier (see IPython.Parallel too) but still not great as you can't have multithreading, multiprocessing means you'll have to send data around independent python interpreters :/
    • NumExpr, a fast interpreter of numerical expressions on arrays. Faster than NumPy by aggregating operations (instead of doing one at a time)
    • WeldNumpy is another faster interpreter, the idea here is to lazy-evaluate expressions to be able to execute them more optimally.
    • Theano, targets cpu and gpu, numpy aware, automatic differentiation. Clumsy...
    • Nuitka, offline compiles Python to C++, should support all extensions
    • PyPy, a JIT, with a tracing interpreter written in python. Doesn't support all extensions (the CPython C library interface)
    • Python/Cuda links
  • Non-homogeneous data
    • Blaze, like numpy but for non-homogeneous, incomplete data
    • PyTables, hierarchical data
  • Graphics/Plotting
    • For 3d animations, VisVis seems the only tool that is capable of achieving decent speed, quality, and has a good interface and support. It has a matlab-like interface, but actually creating objects (Line() instead of plot...) is much better/faster.
      • Update: Its successor is VisPy, at the time I first wrote this, it was still experimental. I have not tried it yet, but it seems better now.
      • Update: Ipyvolume seems viable too. 
    • Bokeh, nice plotting library, 2d only, outputs HTML5/JS so it can be interacted with in IPython Notebook. Somewhat lower-level than Matplotlib, though it does provide a bunch of plotting functions
      • Chaco is another 2d plot/gui library, very OO, similar to Bokeh it might require more code to create a graph
    • Matplotlib toolkits (MPL is SLOW and rather ugly, but it's the most supported):
      • Mplot3d, quite crappy 3d plots
      • Seaborn, good looking 2d plots
      • mpld3, a matplotlib compatible library that emits HTML5/JS using d3.js
      • NodeBox OpenGL is nifty, and DrawBot is very similar too (but OSX only at the moment). They actually derive from the same base sourcecode.
      • Point Cloud Library and PyGTS, Gnu Triangulated Surface Library
      • Others:
    # For the Anaconda Windows distribution, to use mayavi you need to install
    # mayavi and wxpython; from the command line: binstar search -t conda mayavi
    %gui wx
    %pylab wx
    %matplotlib inline
    # In IPython Notebook, %matplotlib has to come after %pylab, it seems.
    # "inline" is cool, but "qt" and "wx" allow interactivity.
    # qt is faster than wx, but mayavi requires wx.

        • PyQwt, 2d/3d library, faster than matplotlib but even uglier. PyQtGraph is another similar project. Good if interactivity is needed; it also provides GUI components to code interactive graphs.
        • DisLin, 2d/3d graphs. Does not seem to support animations
    Other

    28 June, 2014

    Stuff that every programmer should know: Data Visualization

    If you're a programmer and you don't have visualization as one of the main tools in your belt, then good news: you just found an easy way to improve your skill set. Really, it should be taught in any programming course.

    Note: This post won't get you from zero to visualization expert, but hopefully it can pique your curiosity and will provide plenty of references for further study.

    Visualizing data has two main advantages compared to looking at the same data in a tabular form. 

    The first is that we can pack more data in a graph than we can get by looking at numbers on screen, even more if we make our visualizations interactive, allowing exploration inside a data set. Our visual bandwidth is massive!

    This is also useful because it means we can avoid (or rely less on) summarization techniques (statistics) that are by their nature "lossy" and can easily hide important details (Anscombe's quartet is the usual example).

    Anscombe's quartet, from Wikipedia. The four datasets have the same summary statistics, but are clearly different when visualized.
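This is easy to verify: using the first two of Anscombe's datasets (the published values), the summary statistics agree even though a plot shows a noisy line versus a clean parabola:

```python
import numpy as np

# Anscombe's quartet, datasets I and II: same x, very different y behaviour
x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Means and variances agree to several decimals...
stats = [(y.mean(), y.var()) for y in (y1, y2)]
# ...yet y1 is a noisy line while y2 is an exact parabola: plot them!
```

The statistics are "lossy" exactly as described: two datasets that could not look more different compress to the same few numbers.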

    The second advantage, which is even more important, is that we can reason about the data much better in a visual form. 

    0.2, 0.74, 0.99, 0.87, 0.42, -0.2, -0.74, -0.99, -0.87, -0.42, 0.2

    What's that? How long do you have to think to recognize a sine in those numbers? You might start reasoning about the symmetries, 0.2, -0.2, 0.74, -0.74, then the slope and so on, if you're very bright. But how long do you think it would take to recognize the sine by plotting that data on a graph?

    It's a difference of orders of magnitude. Like in a B-movie sci-fi, you've been using only 10% of your brain (not really); imagine if we could access 100%, interesting things begin to happen.
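Assuming matplotlib is available, revealing the sine takes exactly one line of plotting (the headless backend and file name here are just for the sketch):

```python
import matplotlib
matplotlib.use("Agg")            # headless backend, just for this sketch
import matplotlib.pyplot as plt

data = [0.2, 0.74, 0.99, 0.87, 0.42, -0.2, -0.74, -0.99, -0.87, -0.42, 0.2]
plt.plot(data, "o-")             # one line, and the sine jumps out
plt.savefig("sine_wave.png")
```

In an interactive notebook you wouldn't even save a file; the point is that the cost of looking is near zero.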

    I think most of us do know that visualization is powerful, because we can appreciate it when we work with it, for example in a live profiler.
    Yet I've rarely seen people dumping data from programs into graphing software and I've rarely seen programmers that actually know the science of data visualization.

    Visualizing program behaviour is even more important in the context of rendering engineers, or any code that doesn't just either fail hard or work right.
    We can easily implement algorithms that are wrong but don't produce a completely broken output. They might just be slower (i.e. to converge) than they need to be, or more noisy, or just not quite "right", causing our artists to try to adjust for our mistakes by authoring fixes in the art (this happens -all- the time), and so on.
    And there are even situations where the output is completely broken, but it's just not obvious from a tabular output; a great example of this is the structure of LCG random numbers.

    This random number generator doesn't look good, but you can't tell from a table of its numbers...
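The classic instance of this is RANDU, an LCG whose consecutive triples satisfy an exact linear relation, so plotted in 3D all of its points fall on a handful of planes; a table of its output looks perfectly innocuous:

```python
# RANDU, the infamous LCG: x_{n+1} = 65539 * x_n mod 2^31
def randu(seed, n):
    out, x = [], seed
    for _ in range(n):
        x = (65539 * x) % 2**31
        out.append(x)
    return out

xs = randu(1, 1000)
# Every consecutive triple satisfies x_{k+2} = 6*x_{k+1} - 9*x_k (mod 2^31),
# which is why a 3D scatter of triples collapses onto just 15 planes.
```

Scanning the raw numbers you would never notice; a 3D scatter plot of consecutive triples makes the defect jump out instantly.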


    - Good visualizations

    The main objective of visualization is to be meaningful. That means choosing the right data to study a problem, and displaying it in the right projection (graph, scale, axes...).

    The right data is the data that is interesting, that shows the features of our problem. What questions are we answering (purpose)? What data do we need to display?

    The right projection is the one that shows such features in an unbiased, perceptually linear way, and that makes different dimensions comparable and possibly orthogonal. How do we reveal the knowledge the data is hiding? Is it x or 1/x? Log(x)? Should we study the ratio between quantities, or the absolute difference, and so on.

    Information about both data and scale comes first from domain expertise. A light (or sound) intensity should probably go on a logarithmic scale; maybe a dot product should be displayed as the angle between its vectors; many quantities have physical, perceptual or geometrical interpretations, and so on.

    But even more interestingly, information about data can come from the data itself, by exploration. In an interactive environment it's easy to just dump a lot of data to observe, notice certain patterns and refine the graphs and data acquisition to "zoom in" particular aspects. Interactivity is the key (as -always- in programming).


    - Tools of the trade

    When you delve a bit into visualization you'll find that there are two fairly distinct camps.

    One is visualization of categorical data, often discrete, with the main goal of finding clusters and relationships. 
    This is quite popular today because it can drive business analytics, operate on big data and in general make money (or pretty websites). Scatterplot matrices, parallel coordinate plots (very popular), Glyph plots (star plots) are some of the main tools.

    Scatterplot, nifty to understand what dimensions are interesting in a many-dimensional dataset

    The other camp is about visualization of continuous data, often in the context of scientific visualization, where we are interested in representing our quantities without distortion, in a way that is perceptually linear.

    This usually employs mostly position as a visual cue, thus 2d or 3d line/surface or point plots.
    These become harder as the dimensionality of our data increases, as it's hard to go beyond three dimensions. Color intensity and "widgets" can be used to add a couple more dimensions to points in a 3d space, but it's often easier to add dimensions via interactivity (i.e. slicing through the dataset by intersecting or projecting on a plane) instead.

    CAVE, soon to be replaced by oculus rift
    Both kinds of visualizations have applications to programming. For deterministic processes, like the output or evolution in time of algorithms and functions, we want to monitor some data and represent it in an objective, undistorted manner. We know what the data means and how it should work, and we want to check that everything goes according to what we think it should.
    But there are also times where we don't care about exact values, but seek insight into processes of which we don't have exact mental models. This applies to all non-deterministic issues, networking, threading and so on, but also to many things that are deterministic in nature but have complex behaviour, like memory hierarchy accesses and cache misses.


    - Learn about perception caveats

    Whatever your visualization is though, the first thing to be aware of is visual perception: not all visual cues are useful for quantitative analysis. 

    Perceptual biases are a big problem, because as they are perceptual, we tend not to see them, just subconsciously we are drawn to some data points more than others when we should not.


    Metacritic's homepage has horrid bar graphs.
    As the numbers are bright and sit below variable-size images, games with longer images seem to have lower scores...

    Beware of color, one of the most abused and misunderstood tools for quantitative data. Color (hue) is extremely hard to get right: it's very subjective, and it doesn't express quantities or relationships well (which color is "less" than another?), yet it's used everywhere.
    Intensity and saturation are not great either; again, very commonly used, but often inferior to other cues like point size or stroke width.


    From complexdiagrams


    - Visualization of programs

    Programs are these incredibly complicated projects we manage to carry forward, but if that's not challenging enough we really love working with them in the most complicated ways possible. 

    So of course visualization is really limited. The only "mainstream" usage you'll probably have encountered is in the form of bad graphs of data from static analysis: dependencies, modules, relationships and so on.

    A dependency matrix in NDepend

    Certainly if you want to see your program execution itself, it -has- to be text. Watch windows, memory views with hex dumps and so on. Visual Studio, which is probably the best debugger IDE we have, is not visual at all, nor does it allow easy development of visualizations (it's even hard to grab data from memory in it).

    We're programmers, so it's not a huge deal to dump data to a file or peek memory [... my article]; then we can visualize the output of our code with tools made for data.
    But an even more important technique is to visualize the behaviour of code directly, at runtime. This is really a form of tracing, which most often is limited to what's known as "printf" debugging.

    Tracing is immensely powerful as it tells us at a high level what our code is doing, as opposed to the detailed inspection of how the code is running that we can get from stepping in a debugger.
    Unfortunately there is basically no tool today for graphical representation of program state over time, so you'll have to roll your own. Working on your own source code it's easy enough to add some instrumentation to export data to a live graph; in my own experiments I don't use any library for this, just the simplest possible ad-hoc code to get the data out.
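A sketch of what I mean by ad-hoc instrumentation: dump one value per iteration to a CSV, then graph it with whatever tool you like. The names and the toy algorithm (a slowly converging series for pi) are purely illustrative:

```python
import csv
import math

def solve(iterations=100):
    # Hypothetical algorithm under study: the Leibniz series for pi/4.
    # We trace the error at every step to a CSV for later graphing.
    estimate = 0.0
    with open("trace.csv", "w", newline="") as f:
        w = csv.writer(f)
        for i in range(iterations):
            estimate += (-1) ** i / (2 * i + 1)
            w.writerow([i, abs(4 * estimate - math.pi)])  # iteration, error
    return 4 * estimate

approx_pi = solve()
```

A quick plot of trace.csv immediately shows the convergence rate (and would show a plateau or oscillation if the algorithm were subtly wrong), which is exactly the kind of insight a table of numbers hides.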

    Ideally it would be lovely to be able to instrument compiled code; it's definitely possible, but much more of a hassle without the support of a debugger. Another alternative I sometimes adopt is to have an external application peek at regular intervals into my target process' memory.
    It's simple enough, but it captures data at a very low frequency, so it's not always applicable; I use it most of the time not on programs running in realtime, but as a live memory visualization while stepping through in a debugger.

    Apple's recent Swift language seems a step in the right direction, and it looks like it pulled some ideas from Bret Victor and Light Table.
    Microsoft had a timid plugin for Visual Studio that did some very basic plotting, which doesn't seem to be actively updated, and another one for in-memory images; but what's really needed is the ability to export data easily and in realtime, as good visualizations usually have to be made ad-hoc for a specific problem.

    Cybertune/Tsunami

    If you want to delve deeper into program visualization there is a fair bit written about it by academia, along with a few interesting conferences, but what's even more interesting to me is seeing it applied to one of the hardest coding problems: reverse engineering.
    It should perhaps not be surprising, as reversers and hackers are very smart people, so it's natural for them to use the best tools for the job.
    It's quite amazing seeing how much one can understand with very little other information by just looking at visual fingerprints, data entropy and code execution patterns.
    And again visualization is a process of exploration, it can highlight some patterns and anomalies to then delve in further with more visualizations or by using other tools.

    Data entropy of an executable, graphed in hilbert order, shows signing keys locations.


    - Bonus links

    Visualization is a huge topic and it would be silly to try to teach everything that's needed in a single post, but I wanted to give some pointers hoping to get some programmers interested. If you are, here are some more links for further study.
    Note that most of what you'll find on the topic nowadays is either infovis and data-driven journalism (explaining phenomena via understandable, pretty graphics) or big-data analytics.
    These are very interesting, and I have included a few good examples below, but they are not usually what we seek: as domain experts we don't need to focus on aesthetics and communication, but on unbiased, clear, quantitative data visualization. Be mindful of the difference.

    - Addendum: a random sampling of stuff I do for work
    All made either in Mathematica or Processing and they are all interactive, realtime.
    Shader code performance metrics and deltas across versions 
    Debugging an offline baker (raytracer) by exporting float data and visualizing it as point clouds
    Approximation versus ground truth of BRDF normalization
    Approximation versus ground truth of area lights
    BRDF projection on planes (reasoning about environment lighting, card lighting)