C0DE517E: 2014

06 September, 2014

Scientific Python 101

As for the Mathematica 101, after the (long) introduction I'll be talking with code...

Introduction to "Scientific Python"

In this I'll assume a basic knowledge of Python, if you need to get up to speed, learnXinYminute is the best resource for a programmer.

With "Scientific Python" I refer to an ecosystem of python packages built around NumPy/SciPy/IPython. I recommend installing a scientific python distribution, I think Anaconda is by far the best (PythonXY is an alternative), you could grab the packages from pypi/pip from any Python distribution, but it's more of a hassle.

NumPy is the building block for most other packages. It provides a matlab-like n-dimensional array class that provides fast computation via Blas/Lapack. It can be compiled with a variety of Blas implementations (Intel's MKL, Atlas, Netlib's, OpenBlas...), a perk of using a good distribution is that it usually comes with the fastest option for your system (which usually is multithreaded MKL). SciPy adds more numerical analysis routines on top of the basic operations provided by NumPy.

IPython (Jupyter) is a notebook-like interface similar to Mathematica's (really, it's a client-server infrastructure with different clients, but the only one that really matters is the HTML-based notebook one).
An alternative environment is Spyder, which is more akin to Matlab's or Mathematica Workbench (a classic IDE) and also embeds IPython consoles for immediate code execution.

Especially when learning, it's probably best to start with IPython Notebooks.

Why I looked into SciPy

While I really like Mathematica for exploratory programming and scientific computation, there are a few reasons that compelled me to look for an alternative (other than Wolfram being an ass that I hate having to feed).

First of all, Mathematica is commercial -and- expensive (same as Matlab btw). Which really doesn't matter when I use it as a tool to explore ideas and make results that will be used somewhere else, but it's really bad as a programming language.

I wouldn't really want to redistribute the code I write in it, and even deploying "executables" is not free. Not to mention not many people know Mathematica to begin with.
Python, in comparison, is very well known, free, and integrated pretty much everywhere. I can drop my code directly in Maya (or any other package really, python is everywhere) for artists to use, for example.

Another big advantage is that Python is familiar, even for people that don't know it, it's a simple imperative scripting language.
Mathematica is in contrast a very odd Lisp, which will look strange at first even to people who know other Lisps. Also, it's mostly about symbolic computation, and the way it evaluate can be quite mysterious. CPython internals on the other hand, can be quite easily understood.

Lastly, a potential problem lies in the fact that python packages aren't guaranteed to have all the same licensing terms, and you might need many of them. Verifying that everything you end up installing can be used for commercial purposes is a bit of a hassle...

How does it fare?

It's free. It's integrated everywhere. It's familiar. It has lots of libraries. It works. It -can- be used as a Mathematica or Matlab replacement, while being free, so every time you need to redistribute your work (research!) it should be considered.

But it has still (many) weaknesses.

As a tool for exploratory programming, Mathematica is miles ahead. Its documentation is great, it comes with a larger wealth of great tools and its visualization options are probably the best bar none.
Experimentation is an order of magnitude better if you have good visualization and interactivity support, and Mathematica, right now, kills the competition on that front.
Manipulate[] is extremely simple, plotting is decently fast and the quality is quite high, there is lots of thought behind how the plots work, picking reasonable defaults, being numerically reliable and so on.

In Python on the other hand you get IPython and matplotlib. Ok, you got a ton of other libraries too, but matplotlib is popular and the basis of many others too.
IPython can't display output if assignments are made, and displays only the last evaluated expression. Matplotlib is really slow, really ugly, and uses a ton of memory. Also you can either get it integrated in IPython, but with zero interactivity, or in a separate window, with just very bare-bones support for plot rotation/translation/scale.

There are other tools you can use, but most are 2D only, some are very fast and 3D but more cumbersome to use and so on and so forth...
Update: nowadays there are a few more libraries using WebGL, which are both fast and allow interactivity in IPython!

As a CAS I also expect Mathematica to be the best, you can do CAS in Python via SymPy/Sage/Mathics but I don't rely too much on that, personally, so I'm not in a position to evaluate.

Overall, I'll still be using Mathematica for many tasks, it's a great tool.

As a tool for numerical computation it fares better. Its main rival would be Matlab, whose strength really lies in the great toolboxes Mathworks provides.
Even if the SciPy ecosystem is large with a good community, there are many areas where its packages are lacking, not well supported or immature.

Sadly though for the most Matlab is not that popular because of the unique functionality it provides, but because MathWorks markets well to the academia and it became the language of choice for many researchers and courses.
Also, researchers don't care about redistributing source nearly as much as they really should, this day and age it's all still about printed publications...

So, is Matlab dead? Not even close, and to be honest, there are many issues Python has to solve. Overall though, things are shifting already, and I really can't see a bright future for Matlab or its clones, as fundamentally Python is a much better language, and for research being open is probably the most important feature. We'll see.

A note on performance and exploration

For some reason, most of the languages for scientific exploratory programming are really slow. Python, Matlab, Mathematica, they are all fairly slow languages.

The usual argument is that it doesn't matter at all, because these are scripting languages used to glue very high-performance numerical routines. And I would totally agree. If it didn't matter.
A language for exploratory programming has to be expressive and high-level, but also fast enough for the abstractions not to fall on their knees. Sadly, Python isn't.

Even with simple code, if you're processing a modicum amount of data, you'll need to know its internals, and the variety of options available for optimization. It's similar in this regard to Mathematica, where using functions like Compile often requires planning the code up-front to fit in the restrictions of such optimizers.

Empirically though it seems that the amount of time I had to spend minding performance patterns in Python is even higher than what I do in Mathematica. I suspect it's because many packages are pure python.

It's true that you can do all the optimization staying inside the interactive environment, not needing to change languages. That's not bad. But if you start having to spend a significant amount of time thinking about performance, instead of transforming data, it's a problem.

Also, it's a mystery to me why most scientific languages are not built for multithreading, at all. All of them, Python, Matlab and Mathematica, execute only some underlying C code in parallel (e.g. blas routines). But not anything else (all the routines not written in native code, often things such as plots, optimizers, integrators).

Even Julia, which was built specifically for performance, doesn't really do multithreading so far, just "green" threads (one at a time, like python) and multiprocessing.

Multiprocessing in Python is great, IPython makes it a breeze to configure a cluster of machines or even many processes on a local machine. But it still requires order of magnitudes more effort than threading, killing interactivity (push global objects, imports, functions, all manually across instances).

Mathematica at least does the multiprocessing data distribution automatically, detecting dependencies and input data that need to be transferred.

Learn by example:

Check out my code here.

Other resources:

Tutorials

A Crash Course in Python for Scientists
Scientific Python Lectures
A gallery of interesting IPython notebooks
Performance Python for Numerical Algorithms
Anaconda's Conda package manager (also, remember to conda clean unused packages...)
Learn about division and future division
Numpy performance tricks
General Python performance tricks
Parallel Computing with IPython
Interactivity with IPython

Packages

Scipy: numpy, scipy, matplotlib, sympy, pandas
Optimization and learning

Scikit-Learn, one of SciKit packages (scipy extensions)
GPy, Gaussian processes.
Scikit-Monaco, Monte Carlo integrators
Cvxopt/cvxpy, convex optimization
NLopt, more nonlinear optimizers
PyOpt, even more optimizers
DEAP, evolutionary algorithms

Dill, a package that can serialize/snapshot a python kernel. Useful when one wants to stop working on an iPython session but want to be able to pick it up again from the same state next time.
Performance

A comparison of Cython, Numba, PyCuda, PyOpenCl, NumPy and other frameworks on a simple problem (Mandelbrot set)
SciPy Weave, inlines C code in Python code, compiles and links to python on demand. Deprecated. Try Cython instead.
Numba, a numpy "aware" compiler, targets LLVM, compiles in runtime (annotated functions)
Cython, compiles annotated python to C. Bottleneck uses it to accelerate some NumPy functions. (see also Shedskin, Pythran and ocl)
JobLib, makes multiprocessing easier (see IPython.Parallel too) but still not great as you can't have multithreading, multiprocessing means you'll have to send data around independent python interpreters :/
NumExpr, a fast interpreter of numerical expressions on arrays. Faster than numpy by aggregating operations (instead of doing one at at time)
WeldNumpy is another faster interpreter, the idea here is to lazy-evaluate expressions to be able to execute them more optimally.
Theano, targets cpu and gpu, numpy aware, automatic differentiation. Clumsy...
Nuikta, offline compiles python to C++, should support all extensions
PyPy, a JIT, with a tracing interpreter written in python. Doesn't support all extensions (the CPython C library interface)
Python/Cuda links

Non-homogeneous data

Blaze, like numpy but for non-homogeneous, incomplete data
PyTables, hierarchical data

Graphics/Plotting

For 3d animations, VisVis seems the only tool that is capable of achieving decent speed, quality, and has a good interface and support. It has a matlab-like interface, but actually creating objects (Line() instead of plot...) is much better/faster.

Update: Its successor is VisPy, at the time I first wrote this, it was still experimental. I have not tried it yet, but it seems better now.
Update: Ipyvolume seems viable too.

Bokeh, nice plotting library, 2d only, outputs HTML5/JS so it can be interacted with in IPython Notebook. Somewhat lower-level than Matplotlib, albeit it does provide a bunch of plotting functions

Chaco is another 2d plot/gui library, very OO, similar to Bokeh it might require more code to create a graph

Matplotlib toolkits (MPL is SLOW and rather ugly, but it's the most supported):

Mplot3d, quite crappy 3d plots
Seaborn, good looking 2d plots
mpld3, a matplotlib compatible library that emits HTML5/JS using d3.js

NodeBox OpenGL is nifty, and DrawBot is very similar too (but OSX only at the moment). They actually derive from the same base sourcecode.
Point Cloud Library and PyGTS, Gnu Triangulated Surface Library
Others:

Mayavi/Tvtk, 3d visualizations. Usable, but finicky and old (and tends to crash). Note:

# For anaconda windows distribution, to use mayavi you need to install

# mayavi and wxpython, use from command line binstar search -t conda mayavi

%gui wx

%pylab wx

%matplotlib inline

# In Ipython Notebook %matplotlib has to come after pylab, it seems.

# "inline" is cool but "qt" and "wx" allows interactivity

# qt is faster than wx, but mayavi requires wx

PyQwt 2d/3d library, faster than matplotlib but even uglier. PyQtGraph is another similar project. Good if interactivity is needed. Also provide GUI components to code interactive graphs.
DisLin, 2d/3d graphs. Does not seem to support animations

Other

~~ComputableApp for iOS.~~ Update: dead. Look at Juno
Deployment/packagers: PyInstaller. Alternatives: Freeze for unix, Py2App for OSX, Py2Exe for Windows

03 September, 2014

Notes on real-time renderers

This accounts only for well-established methods, there are many other methods and combinations of methods I'm not covering. It's a sort of "recap".

Forward
Single pass over geometry generates "final" image, lights are bound to draw calls (via uniforms), accurate culling of light influence on geometry requires CSG splits. Multiple lights require either loops/branches in the shaders or shader permutations.

Benefits

Fastest in its baseline case (single light per pixel, "simple" shaders or even baked lighting). Doesn't have a "constant" up-front investment, you pay as you go (more lights, more textures...).
Least memory necessary (least bandwidth, at least in theory). Makes MSAA possible.
Easy to integrate with shadowmaps (can render them one-at-a-time, or almost)
No extra pass over geometry
Any material, except ones that require screen-space passes like Jimenez's SS-SSS

Issues

Culling lights on geometry requires geometrical splits (not a huge deal, actually). Can support "static" variations of shaders to customize for a given rendering case (number/type of lights, number of textures and so on) but "pays" such optimization with combinatorial explosion of shader cases and many more draw calls.
Culling dynamic lights can't be efficiently done. Scripted lights along fixed paths can be somewhat culled via geometry cutting, but fully dynamic lights can't efficiently cut geometry in runtime, just be assigned to objects, thus wasting computation.
Decals need to be multipass, lit twice. Alternatively, for static decals mesh can be cut and texture layering used (more shader variations), or for dynamic decals color can be splatted before main pass (but that costs an access to the offscreen buffer regardless or not a decal is there).
Complex shaders might not run optimally. As you have to do texturing and lighting (and shadowing) in the same pass, shaders can require a lot of registers and yield limited occupancy. Accessing many textures in sequence might create more trashing than accessing them in separate passes.
Lighting/texturing variations have to be dealt with dynamic branches which are often problematic for the shader compiler (must allocate registers for the worst case...), conditional moves (wasted work and registers) or shader permutations (combinatorial explosion)
Many "modern" rending effects require a depth/normal pre-pass anyways (i.e. SSAO, screen-space shadows, reflections and so on). Even though some of these can be faded out after a preset range and thus they can work with a partial pre-pass.
All shading is done on geometry, which means we pay all the eventual inefficiencies (e.g. partial quads, overdraw) on all shaders.

Forward+ (light indexed)
Forward+ is basically identical to forward but doesn't do any geometry split on the scene as a pre-pass, it relies on tiles or 3d grids ("clustered") to cull lights in runtime.

Benefits

Same memory as forward, more bandwidth. Enables MSAA.
Any material (same as forward)
Compared to forward, no mesh splitting necessary, much less shader permutations, less draw calls.
Compared to forward it handles dynamic lights with good culling.

Issues

Light occlusion culling requires a full depth pre-pass for a total of two geometrical passes. Can be somehow sidestepped with a clustered light grid, if you don't have to end up splatting too many lights into it.
All shadowmaps need to be generated upfront (more memory) or splatted in screen-space in a pre-pass.
All lighting permutations need to be addressed as dynamic branches in the shader. Not good if we need to support many kinds of light/shadow types. In cases where simple lighting is needed, still has to pay the price of a monolithic ubershader that has to consider any lighting scenario.
Compared to forward, seems a steep price to pay to just get rid of geometry cutting. Note that even if it "solved" shader permutations, its solution is the same as doing forward with shaders that dynamic branch over light types/number of lights and setting these parameters per draw call.

Deferred shading
Geometry pass renders a buffer of material attributes (and other proprieties needed for lighting but bound to geometry, e.g. lightmaps, vertex-baked lighting...). Lighting is computed in screenspace either by rendering volumes (stencil) or by using tiling. Multiple shading equations need either to be handled via branches in the lighting shaders, or via multiple passes per light.

Benefits

Decouples texturing from lighting. Executes only texturing on geometry so it suffers less from partial quads, overdraw. Also, potentially can be faster on complex shaders (as discussed in the forward rendering issues).
Allows volumetric or multipass decals (and special effects) on the GBuffer (without computing the lighting twice).
Allows full-screen material passes like analytic geometric specular antialiasing (pre-filtering), which really works only done on the GBuffer, in forward it fails on all hard edges (split normals), and screen-space subsurface scattering.
Less draw calls, less shader permutations, one or few lighting shaders that can be hand-optimized well.

Issues

Uses more memory and bandwidth. Might be slower due to more memory communication needed, especially on areas with simple lighting.
Doesn't handle transparencies easily. If a tiled or clustered deferred is used, the light information can be passed to a forward+ pass for transparencies.
Limits materials that need many different material parameters to be passed from geometry to lighting (GBuffer), even if shader variation for material in a modern PBR renderer tends not to be a problem.
Can't do lighting computations per object/vertex (i.e. GI), needs to pass everything per pixel in the GBuffer. An alternative is to store baked data in a voxel structure.
Accessing lighting related textures (gobos, cubemaps) might be less cache-coherent.
In general it has lots of enticing benefits over forward, and it -might- be faster in complex lighting/material/decal scenarios, but the baseline simple lighting/shading case is much more expensive.

Notes on tiled/clustered versus "stenciled" techniques

On older hardware early-stencil was limited to a single bit, so it couldn't be used both to mark the light volume and distinguish surface types. Tiled could be needed as it allowed more material variety by categorizing tiles and issuing multiple tile draws if needed.

On newer hardware tiled benefits lie in the ability of reducing bandwidth by processing all lights in a tile in a single shader. It also has some benefits for very small lights as these might stall early in the pipeline of the rasterizer, if drawn as volumes (draws that generate too little PS work).

In fact most tile renderers demos like to show thousands of lights in view... But the reality is that it's still tricky to afford many shadowed lights per pixel in any case (even on nextgen where we have enough memory to cache shadowmaps), and unshadowed, cheap lights are worse than no lighting at all.

Often, these cheap unshadowed lights are used as "fill", a cheap replacement for GI. This is not an unreasonable use case, but there are better ways, and standard lights, even when diffuse only, are actually not a great representation of indirect radiance.
Voxel, vertex and lightmap bakes are often superior, or one could thing of special fill volumes that can take more space, embedding some radiance representation and falloff in them.

In fact one of the typical "deferred" looks that many games still have today is characterized by "many" cheap point lights without shadowing (nor GI, nor gobos...), creating ugly circular splotches in the scene.
Also tiled/clustered makes dynamic shadows somewhat harder, as you can't render one shadowmap at a time...

Tiled and clustered have their reasons, but demo scenes with thousands of cheap point lights are not one. Mostly they are interesting if you can compute other interesting data per tile/voxel.
You can still get a BW saving in "realistic" scenes with low overlap of lighting volumes, but it's a tradeoff between that and accurate per-quad light culling you get from a modern early-z volume renderer.

Deferred lighting
I'd say this is dwindling technique nowadays compared to deferred shading.

Benefits

It requires less memory per pass, but the total memory traffic summing all passes is roughly the same. The former used to be -very- important due to the limited EDRAM memory on xbox 360 (and its inability to render outside EDRAM).
In theory allows more material "hacks", but it's still very limited. In fact I'd say the material expressiveness is identical to deferred shading, but you can add "extra" lighting in the second geometry pass. On deferred shading that has to be passed along using extra GBuffer space.
Allows shadows to be generated and used per light, instead of all upfront like deferred shading/forward+

Issues

An extra geometric pass (could be avoided by using the same GBuffer as deferred shading, then doing lighting and compositing with the textures in separate fullscreen passes - but then it's almost more a variant of DS than DL imho)

Some Links:

30 August, 2014

I Support Anita

One of the first things you learn when you start making games is to ignore most of the internet. Gamers can undoubtedly be wonderful, but as in all things internet, a vocal minority of idiots can overtake most spaces of discourse. Normally, that doesn't matter, these people are irrelevant and worthless even if they might think they have any power in their jerking circles. But if after words other forms of harassment are used, things change.

I'm not a game designer, I play games, I like games, but my work is about realtime rendering, most often the fact it's used by a game is incidental. So I really didn't want to write this, I didn't think there is anything I can add to what has already been written. And Anita herself does an excellent job at defending her work. Also I think the audience of this blog is the right target for this discussion.

Still, we've passed a point and I feel everybody in this industry should be aware of what's happening and state their opinion. I needed to make a "public" stance.

Recap: Anita Sarkeesian is a media critic. She began a successful kickstarter campaign to produce reviews of gender tropes in videogames. She has been subject to intolerable, criminal harassment. People who spoke in her support have been harassed, websites have been hacked... Just google her name to see the evidence.

My personal stance:

I support the work of Anita Sarkeesian. As I would of anybody speaking intelligently about anything, even if I were in disagreement.
I agree with the message in the Tropes Vs Women series. I find it to be extremely interesting, agreeable and instrumental to raise awareness of an in many cases not well understood phenomenon.

If I have any opinion on her work, is that I suspect in most cases hurtful stereotypes don't come from malice or laziness (neither of which she mentions as possible causes by the way), but from the fact that games are mostly made by people like me, male, born in the eighties, accustomed to a given culture.
And even if we can logically see the issues we still have in gender depictions, we often lack the emotional connection and ability to notice their prevalence. We need all the critiques we can get.

I encourage everybody to take a stance, especially mainstream gaming websites and gaming companies (really, how can you resist being included here), but even smaller blog such as this one.

It's time to marginalize harassment and ban such idiots from the gaming community. To tell them that it's not socially acceptable, that most people don't share their views.
Right now for most of the video attacks (I've found no intelligent rebuttal yet) to Tropes Vs Women are "liked" on youtube. Reasonable people don't speak up, and that's even understandable, nobody should argue with idiots, they are usually better left ignored. But this got out of hands.

I'm not really up for a debate. I understand that there can be an debate on the merit of her ideas, there can be debate about her methods even, and I'd love to read anything intelligent about it.

We are way past a discussion on whether she is right or wrong. I personally think she is substantially right, but even if she was wrong I think we should all still fight for her to be able to do her job without such vile attacks. When these things happen, in such an extent, I think it's time for the industry to be vocal, for people to stop and just say no. If you think Anita's work (and her as a person) doesn't deserve at least that respect, I'd invite you to just stop following me, seriously.

Links:

23 August, 2014

Notes on #Minimalism in code

A few weeks ago I stumbled upon a nice talk at the public library near where I live, about minimalism, buy a couple of friends, Joshua Fields Millburn and Ryan Nicodemus, who call themselves "the minimalists".

Worth watching

What it's interesting is that I immediately found this notion of parsimony to be directly applicable to programming. I guess, it would apply to most arts, really. I "live tweeted" that "minimalism in life is probably showing the same level of maturity of minimalism in coding" and started sketching some notes for a later post.

On stage I hear "the two most dangerous words in the English language are: one day". This is going to be good, I wonder if these guys -were- programmers.

- Return of Investment

In real life, clutter comes with a price. Now, that might very well not be a limiting factor, I think that the amount of crap tends to depend on our self-control, while money just dictates how expensive the crap we buy gets, but still there is a price. In the virtual world of code on the other hand clutter has similar negative consequences on people's mental health, but on your finances it might even turn out to be profitable.

We all laugh at the tales of coders creating "job security" via complexity, but I don't really think it's something that happens often in a conscious way. No, what I believe is much more common is that worthless complexity is encouraged by the way certain companies and teams work.

Let's make an example, a logging system. Programmer A writes three functions, open, printf, close, it works great, it's fast, maybe he even shares the knowledge on what he learned writing it. Programmer B creates a logging framework library templated factory whatever pattern piece of crap, badly documented on top of that.

What happens? In a good company, B is fired... but it's not that easy. Because there are maybe two-three people that really know that B's solution doesn't solve anything, many others won't really question it, will just look at this overcomplicated mess and think that must be cool, complexity must exist for a reason. And it's a framework!

So now what could have been done in three lines takes several thousands, and who wants to rewrite several thousand lines? So B's framework begins to spread, it's shared by many projects, and B ends up having a big "sphere of influence" and actually getting a promotion. Stuff for creepypasta, I know. Don't do Boost:: at night (nor ever).

In real life things are even more tricky. Take for example sharing code. Sharing code is expensive (sharing knowledge though is not, always do that) but it can get beneficial of course. Where is the tipping point? "One day" is quite certainly the wrong answer, premature generalization is a worse sin than premature optimization, these days.

There is always a price to pay to generality, even when it's "reasonable". So at every step there is an operation research problem to solve, over quantities that are fuzzy to say the least.

- Needed complexity

In their presentation, Joshua and Ryan take some time to point out that they are not doing an exercise in living with the least amount of stuff as possible, they aren't trying to impose some new-age set of rules.
The point is to bring attention to making "deliberate and meaningful" choices, which again I saw as being quite relevant to programming.

Simplicity is not about removing all high-level constructs nor it is about being as high-level, and terse, as possible. I'd say it's very hard to encapsulate it in a single metric to optimize, but I like the idea of being conscious of what we do.

If we grab a concept, or a construct, do we know really why, or are we just following the latest fad from a blog for cool programmers?

Abstractions are useful to "compress" code (at least, they should always be used for that, in practice many times we abstract early and end up with more code, like writing a compressor where the decompressor takes more space than the original file...), but compression is not an end goal (otherwise we would be editing .zip files!), it can reduce complexity "locally" but it "lifts" it into a new, global, concept, that has to be learned, mastered, maintained...

Not that we shouldn't be open minded and experimental, try new tools... This is not about writing everything in C, for the rest of our lives. The opposite! We should strive for an understanding of what to use when, be aware, know more.

It's cool and tempting to have this wide arsenal of tools, and as in all arts, we are creative and with experience we can find meaningful ways to use quite literally any tool.
But there is a huge difference between being able to expertly choose between a hundred brushes for a painting, having mastered each, and thinking that if we do have and use a hundred brushes we will get a nice painting. We can't be masters of too many things in our lives, we have to chose carefully.

Complexity is not always useless, but surely should be feared more. And it hides everywhere.

Do I use a class? What are the downsides? More stuff in a header, more coupling? What it's getting me, that I can't achieve without. Do I need that template? What was a singleton for again, why would it be better than a static? Should I make that operation implicit in the type? Will the less typing needed to use a thing offset the more code needed in the implementation? Will it obscure details that should not be obscured? And so on...

- Closing remarks

Even if I said there isn't a single metric to optimize, there are a few things that I like to keep in mind.

One is the amount of code compared to the useful computation the code does. The purpose of all code is algorithmic, we transform data. How much of a given system is dedicated to actually transforming the data? How much is infrastructural? On a similar note, how much heat am I wasting doing things that are not moving the bits I need for the final result I seek?

Again, this isn't about terseness. I can have a hundred lines of assembly that sort an array or one line of python, they lie on opposite ends of terseness versus low-level control line, but in both cases they are doing useful computation.

The second is the amount of code and complexity that I save versus the amount of code and complexity I have to write to get the saving. Is it really good to have something that it's trivial to use "on the surface", but impossible to understand when it comes to the inner workings?

Sometimes the answer is totally a yes, e.g. I don't really care too much about the minute details of how Mathematica interprets my code when I do exploratory programming there. Even if it ends up with me kicking around the code a few times when I don't understand why things don't work. But I might not want that magic to happen in my game code.

Moreover, most of the times we are not even on the Pareto front, we're not making choices that maximize the utility in one or the other direction. And most of the times, such choices are highly non-linear, where we can just accept a bit more verbosity at the caller-side for a ton less headache on the implementation-side.

Lastly, the minimalists also talk about the "packing party": pack everything you have, then unpack only the things that you need as you need them, over a few weeks. Throw away the stuff that stays packed. The code equivalent coverage testing: push your game through an extensive profile step, then mark all the lines that were never reached.

Throwing away stuff is more than fine, it's beneficial. Keep the experience and knowledge. And if needed, we always have p4 anyways.

Somewhat related, and very recommended: http://www.infoq.com/presentations/Simple-Made-Easy (note though that I do think that "easy" is also a good metric, especially for big projects where you might have programmer turnover, many junior programmers and so on)

28 June, 2014

Stuff that every programmer should know: Data Visualization

If you're a programmer and you don't have visualization as one of your main tools in your belt, then good news, you just found how to easily improve your skill set. Really it should be taught in any programming course.

Note: This post won't get you from zero to visualization expert, but hopefully it can pique your curiosity and will provide plenty of references for further study.

Visualizing data has two main advantages compared to looking at the same data in a tabular form.

The first is that we can pack more data in a graph that we can get by looking at numbers on screen, even more if we make our visualizations interactive, allowing explorations inside a data set. Our visual bandwidth is massive!

This is useful also because it means we can avoid (or rely less on) summarization techniques (statistics) that are always by their nature "lossy" and can easily hide important details (the Anscombe's quartet is the usual example).

Anscombe's quartet, from wikipedia. Data has the same statistics, but clearly different when visualized

The second advantage, which is even more important, is that we can reason about the data much better in a visual form.

0.2, 0.74, 0.99, 0.87, 0.42, -0.2, -0.74, -0.99, -0.87, -0.42, 0.2

What's that? How long do you have to think to recognize a sine in numbers? You might start reasoning about the simmetries, 0.2, -0.2, 0.74, -0.74, then the slope and so on, if you're very bright. But how long do you think it would take to recognize the sine plotting that data on a graph?

It's a difference of orders of magnitude. Like in a B-movie scifi, you've been using only 10% of your brain (not really), imagine if we could access 100%, interesting things begin to happen.

I think most of us do know that visualization is powerful, because we can appreciate it when we work with it, for example in a live profiler.

Yet I've rarely seen people dumping data from programs into graphing software and I've rarely seen programmers that actually know the science of data visualization.

Visualizing program behaviour is even more important in the context of rendering engineers or any code that doesn't just either fail hard or work right.

We can easily implement algorithms that are wrong but doesn't produce a completely broken output. It might be just slower (i.e. to converge) than it needs to be, or more noisy, or just not quite "right" and cause our artists to try to adjust for our mistakes by authoring fixes in the art (this happens -all- the time) and so on.

And there are even situations where the output is completely broken, but it's just not obvious from looking at a tabular output, a great example for this would be in the structure of LCG random numbers.

This random number generator doesn't look good, but you won't tell from a table of its numbers...

- Good visualizations

The main objective of visualization is to be meaningful. That means choosing the right data to study a problem, and displaying it in the right projection (graph, scale, axes...).

The right data is the one that is interesting, it shows the features of our problem. What questions are we answering (purpose)? What data we need to display?

The right projection is the one that shows such features in an unbiased, perceptually linear way, and that makes different dimensions comparable and possibly orthogonal. How do we reveal the knowledge that data is hiding? Is it x or 1/x? Log(x)? Should we study the ratio between quantities or absolute difference and so on.

Information about both data and scale comes at first from domain expertise. A light (or sound) intensity probably should go on a logarithmic scale, maybe a dot product should be displayed as the angle between its vectors, many quantities have a physical interpretation and a perceptual interpretation or a geometrical one and so on.

But even more interestingly, information about data can come from the data itself, by exploration. In an interactive environment it's easy to just dump a lot of data to observe, notice certain patterns and refine the graphs and data acquisition to "zoom in" particular aspects. Interactivity is the key (as -always- in programming).

- Tools of the trade

When you delve a bit into visualization you'll find that there are two fairly distinct camps.

One is visualization of categorical data, often discrete, with the main goal of finding clusters and relationships.

This is quite popular today because it can drive business analytics, operate on big data and in general make money (or pretty websites). Scatterplot matrices, parallel coordinate plots (very popular), Glyph plots (star plots) are some of the main tools.

Scatterplot, nifty to understand what dimensions are interesting in a many-dimensional dataset

The other camp is about visualization of continuos data, often in the context of scientific visualization, where we are interested in representing our quantities without distortion, in a way that the are perceptually linear.

This usually employs mostly position as a visual cue, thus 2d or 3d line/surface or point plots.

These become harder with the increase of dimensionality of our data as it's hard to go beyond three dimensions. Color intensity and "widgets" could be used to add a couple more dimensions to points in a 3d space but it's often easier to add dimensions by interactivity (i.e. slicing through the dataset by intersecting or projecting on a plane) instead.

CAVE, soon to be replaced by oculus rift

Both kinds of visualizations have applications to programming. For deterministic processes, like the output or evolution in time of algorithms and functions, we want to monitor some data and represent it in an objective, undistorted manner. We know what the data means and how it should work, and we want to check that everything goes according to what we think it should.
But there are also times were we don't care about exact values but we seek for insight into processes of which we don't have exact mental models. This applies to all non-deterministic issues, networking, threading and so on, but also to many things that are deterministic in nature but have a complex behaviour, like memory hierarchy accesses and cache misses.

- Learn about perception caveats

Whatever your visualization is though, the first thing to be aware of is visual perception: not all visual cues are useful for quantitative analysis.

Perceptual biases are a big problem, because as they are perceptual, we tend not to see them, just subconsciously we are drawn to some data points more than others when we should not.

Metacritic homepage has horrid bar graphs.
As numbers are bright and below a variable-size image, games with longer images seem to have lower scores...

Beware of color, one of the most abused, misunderstood tool for quantitative data. Color (hue) is extremely hard to get right, it's very subjective and it doesn't express well quantities nor relationships (what color is less than another), yet it's used everywhere.

Intensity and saturation are not great either, again very commonly used but often inferior to other hints like point size or stroke width.

From complexdiagrams

- Visualization of programs

Programs are these incredibly complicated projects we manage to carry forward, but if that's not challenging enough we really love working with them in the most complicated ways possible.

So of course visualization is really limited. The only "mainstream" usage you will have probably encountered is in the form of bad graphs of data from static analysis. Dependences, modules, relationships and so on.

A dependency matrix in NDepend

Certainly if you have to see your program execution itself it -has- to be text. Watch windows, memory views with hex dumps and so on. Visual Studio, which is probably the best debugger IDE we have, is not visual at all nor allows for easy development of visualizations (it's even hard to grab data from memory in it).

We're programmers so it's not a huge deal to dump data to a file or peek memory [... my article], then we can visualize the output of our code with tools that are made for data.

But an even more important tool is to use visualization directly of the behaviour of code, in runtime. This is really a form of tracing which most often is limited to what's known as "printf" debugging.

Tracing is immensely powerful as it tells us at a high level what our code is doing, as opposed to the detailed inspection of how the code is running that we can get from stepping in a debugger.

Unfortunately there is today basically no tool for graphical representation of program state in time, so you'll have to roll your own. Working on your own sourcecode it's easy enough to put some instrumentation to export data to a live graph and in my own experiments I don't use any library for this, just write the simplest possible ad-hoc code to suck the data out.

Ideally though it would be lovely to be able to instrument compiled code, it's definitely possible but much more of an hassle without the support of a debugger. Another alternative that sometimes I adopt is to just have an external application peek at regular interval into my target's process memory.
It's simple enough but it captures data at a very low frequency so it's not always applicable, I use it most of the times not on programs running in realtime but as an live memory visualization while stepping through in a debugger.

Apple's recent Swift language seems a step into the right direction, and looks like it pulled some ideas from Bret Victor and Light Table.
Microsoft had a timid plugin for VisualStudio that did some very basic plotting that doesn't seem to be actively updated and another one for in-memory images, but what would be really needed is the ability to export data easily and in realtime as good visualizations are usually to be made ad-hoc for a specific problem.

Cybertune/Tsunami

If you want to delve deeper into program visualization there is a fair bit written about it by the academia, with also a few interesting conferences, but what's even more interesting to me is seeing it applied to one of the hardest coding problems: reverse engineering.
It should perhaps not be surprising as reversers and hackers are very smart people, so it should be natural for them to use the best tools in their job.

It's quite amazing seeing how much one can understand with very little other information by just looking at visual fingerprints, data entropy and code execution patterns.
And again visualization is a process of exploration, it can highlight some patterns and anomalies to then delve in further with more visualizations or by using other tools.

Data entropy of an executable, graphed in hilbert order, shows signing keys locations.

- Bonus links

Visualization is a huge topic and it would be silly to try to teach everything it's needed in a post, but I wanted to give some pointers hoping to get some programmers interested. If you are, here some more links for further study.
Note that most of what you'll find on the topic nowadays is either infovis and data-driven journalism (explaining phenomenons via understandable, pretty graphics) or big-data analytics.
These are very interesting and I have included a few good examples below, but they are not usually what we seek, as domain experts we don't need to focus on aesthetics and communication, but on unbiased, clear quantitative data visualization. Be mindful of the difference.

Some of my other posts

Mathematica 101
Peeking memory (this is great coupled with reflection or a language to define dependencies between data structures)
Trace debugging from shaders
Point clouds and processing

Essential software

Processing
Python/IPython (anaconda), matplotlib (seaborn)
Mathematica (mathics is interesting too)
d3.js could be nifty too, especially if you already have a web frontend for your application as it's increasingly popular to do
Don't use Excel, you'll waste time as it's very cumbersome to manipulate data there and its graphs are actually quite bad