*As for the Mathematica 101, after the (long) introduction I'll be talking with code...*

**Introduction to "Scientific Python"**

In this I'll assume a basic knowledge of Python, if you need to get up to speed, learnXinYminute is the best resource for a programmer.

With "Scientific Python" I refer to an ecosystem of python packages built around NumPy/SciPy/IPython. I recommend installing a scientific python distribution, I think Anaconda is by far the best (PythonXY is an alternative), you could grab the packages from pypi/pip from any Python distribution, but it's more of a hassle.

NumPy is the building block for most other packages. It provides a matlab-like n-dimensional array class that provides fast computation via Blas/Lapack. It can be compiled with a variety of Blas implementations (Intel's MKL, Atlas, Netlib's, OpenBlas...), a perk of using a good distribution is that it usually comes with the fastest option for your system (which usually is multithreaded MKL). SciPy adds more numerical analysis routines on top of the basic operations provided by NumPy.

IPython is a notebook-like interface similar to Mathematica's (really, it's a client-server infrastructure with different clients, but the only one that really matters is the HTML-based notebook one).

An alternative environment is Spyder, which is more akin to Matlab's or Mathematica Workbench (a classic IDE) and also embeds IPython consoles for immediate code execution.

An alternative environment is Spyder, which is more akin to Matlab's or Mathematica Workbench (a classic IDE) and also embeds IPython consoles for immediate code execution.

Especially when learning, it's probably best to start with IPython.

**Why I looked into SciPy**

While I really like Mathematica for exploratory programming and scientific computation, there are a few reasons that compelled me to look for an alternative (other than Wolfram being an ass that I hate having to feed).

First of all, Mathematica is commercial -and- expensive (same as Matlab btw). Which really doesn't matter when I use it as a tool to explore ideas and make results that will be used somewhere else, but it's

**really bad**as a programming language.
I wouldn't really want to redistribute the code I write in it, and even deploying "executables" is not free. Not to mention not many people know Mathematica to begin with.

Python, in comparison, is very well known, free, and integrated pretty much everywhere. I can drop my code directly

Python, in comparison, is very well known, free, and integrated pretty much everywhere. I can drop my code directly

**in Maya**(or any other package really, python is everywhere) for artists to use, for example.
Another big advantage is that Python is familiar, even for people that don't know it, it's a simple imperative scripting language.

Mathematica is in contrast a very odd Lisp, which will look strange at first even to people who know other Lisps. Also, it's mostly about symbolic computation, and the way it evaluate can be quite mysterious. CPython internals on the other hand, can be quite easily understood.

Lastly, a potential problem lies in the fact that python packages aren't guaranteed to have all the same licensing terms, and you might need many of them. Verifying that everything you end up installing can be used for commercial purposes is a bit of a hassle...

Mathematica is in contrast a very odd Lisp, which will look strange at first even to people who know other Lisps. Also, it's mostly about symbolic computation, and the way it evaluate can be quite mysterious. CPython internals on the other hand, can be quite easily understood.

Lastly, a potential problem lies in the fact that python packages aren't guaranteed to have all the same licensing terms, and you might need many of them. Verifying that everything you end up installing can be used for commercial purposes is a bit of a hassle...

**How does it fare?**

**It's free. It's integrated everywhere. It's familiar. It has lots of libraries.**

**It works**. It -can- be used as a Mathematica or Matlab replacement, while being free, so every time you need to redistribute your work (research!) it should be considered.

But it has still (many) weaknesses.

As a tool for exploratory programming, Mathematica is miles ahead. Its documentation is great, it comes with a larger wealth of great tools and its visualization options are probably the best bar none.

Experimentation is an order of magnitude better if you have good visualization and interactivity support, and Mathematica, right now, kills the competition on that front.

Python has IPython and Matplotlib. IPython can't display output if assignments are made, and displays only the last evaluated expression. Matplotlib is really slow, really ugly, and uses a ton of memory. Also you can either get it integrated in IPython, but with zero interactivity, or in a separate window, with just very bare-bones support for plot rotation/translation/scale.

Experimentation is an order of magnitude better if you have good visualization and interactivity support, and Mathematica, right now, kills the competition on that front.

Python has IPython and Matplotlib. IPython can't display output if assignments are made, and displays only the last evaluated expression. Matplotlib is really slow, really ugly, and uses a ton of memory. Also you can either get it integrated in IPython, but with zero interactivity, or in a separate window, with just very bare-bones support for plot rotation/translation/scale.

As a CAS I also expect Mathematica to be the best, you can do CAS in Python via SymPy/Sage/Mathics but I don't rely too much on that, personally, so I'm not in a position to evaluate.

Overall, I'll still be using Mathematica for many tasks, it's a great tool.

As a tool for numerical computation it fares better. Its main rival would be Matlab, whose strength really lies in the great toolboxes Mathworks provides.

Even if the SciPy ecosystem is large with a good community, there are many areas where its packages are lacking, not well supported or immature.

Overall, I'll still be using Mathematica for many tasks, it's a great tool.

As a tool for numerical computation it fares better. Its main rival would be Matlab, whose strength really lies in the great toolboxes Mathworks provides.

Even if the SciPy ecosystem is large with a good community, there are many areas where its packages are lacking, not well supported or immature.

Sadly though for the most Matlab is not that popular because of the unique functionality it provides, but because MathWorks markets well to the academia and it became the language of choice for many researchers and courses.

Also, researchers don't care about redistributing source nearly as much as they really should, this day and age it's all still about printed publications...

So, is Matlab dead? Not even close, and to be honest, there are many issues Python has to solve. Overall though, things are shifting already, and I really can't see a bright future for Matlab or its clones, as fundamentally Python is a much better language, and for research being open is probably the most important feature. We'll see.

The usual argument is that it doesn't matter at all, because these are scripting languages used to glue very high-performance numerical routines. And I would totally agree. If it didn't matter.

A language for exploratory programming has to be expressive and high-level, but also fast enough for the abstractions not to fall on their knees. Sadly, Python isn't.

Even with simple code, if you're processing a modicum amount of data, you'll need to know its internals, and the variety of options available for optimization. It's similar in this regard to Mathematica, where using functions like Compile often requires planning the code up-front to fit in the restrictions of such optimizers.

Empirically though it seems that the amount of time I had to spend minding performance patterns in Python is even higher than what I do in Mathematica. I suspect it's because many packages are pure python.

It's true that you can do all the optimization staying inside the interactive environment, not needing to change languages. That's not bad. But if you start having to spend a significant amount of time thinking about performance, instead of transforming data, it's a problem.

Also, it's a mystery to me why most scientific languages are not built for multithreading, at all. All of them, Python, Matlab and Mathematica, execute only some underlying C code in parallel (e.g. blas routines). But not anything else (all the routines not written in native code, often things such as plots, optimizers, integrators).

Even Julia, which was built specifically for performance, doesn't really do multithreading so far, just "green" threads (one at a time, like python) and multiprocessing.

Multiprocessing in Python is great, IPython makes it a breeze to configure a cluster of machines or even many processes on a local machine. But it still requires order of magnitudes more effort than threading, killing interactivity (push global objects, imports, functions, all manually across instances).

Mathematica at least does the multiprocessing data distribution automatically, detecting dependencies and input data that need to be transferred.

Also, researchers don't care about redistributing source nearly as much as they really should, this day and age it's all still about printed publications...

So, is Matlab dead? Not even close, and to be honest, there are many issues Python has to solve. Overall though, things are shifting already, and I really can't see a bright future for Matlab or its clones, as fundamentally Python is a much better language, and for research being open is probably the most important feature. We'll see.

**A note on performance and exploration****For some reason, most of the languages for scientific exploratory programming are really slow. Python, Matlab, Mathematica, they are all fairly slow languages.**

The usual argument is that it doesn't matter at all, because these are scripting languages used to glue very high-performance numerical routines. And I would totally agree. If it didn't matter.

A language for exploratory programming has to be expressive and high-level, but also fast enough for the abstractions not to fall on their knees. Sadly, Python isn't.

Even with simple code, if you're processing a modicum amount of data, you'll need to know its internals, and the variety of options available for optimization. It's similar in this regard to Mathematica, where using functions like Compile often requires planning the code up-front to fit in the restrictions of such optimizers.

Empirically though it seems that the amount of time I had to spend minding performance patterns in Python is even higher than what I do in Mathematica. I suspect it's because many packages are pure python.

It's true that you can do all the optimization staying inside the interactive environment, not needing to change languages. That's not bad. But if you start having to spend a significant amount of time thinking about performance, instead of transforming data, it's a problem.

Also, it's a mystery to me why most scientific languages are not built for multithreading, at all. All of them, Python, Matlab and Mathematica, execute only some underlying C code in parallel (e.g. blas routines). But not anything else (all the routines not written in native code, often things such as plots, optimizers, integrators).

Even Julia, which was built specifically for performance, doesn't really do multithreading so far, just "green" threads (one at a time, like python) and multiprocessing.

Multiprocessing in Python is great, IPython makes it a breeze to configure a cluster of machines or even many processes on a local machine. But it still requires order of magnitudes more effort than threading, killing interactivity (push global objects, imports, functions, all manually across instances).

Mathematica at least does the multiprocessing data distribution automatically, detecting dependencies and input data that need to be transferred.

**Finally, code: NumPy by example.**

*Execute each section in an IPython cell (copy and paste, then shift-enter)*

# First, we'll have to import NumPy and SciPy packages.

# IPython has "magics" (macros) to help common tasks like this:

%pylab

# This will avoid scientific notation when printing, see also %precision

np.set_printoptions(suppress = True)

#from __future__ import division # 1/2 = 0.5 instead of integer division... problem is that then 2/2 is float as well...

# In IPython notebook, cell execution is shift-enter

# In IPython-QT enter evaluates and control-enter is used for multiline input

# Tab is used for completion, shift-tab after a function name shows its help

# Note that IPython will display the output of the LAST expression in a cell

# but variable assignments in Python are NOT expressions, so they will

# suppress output, unlike Matlab or Mathematica. You will need to add print

# statements after the assignments in the following examples to see the output

data = [[1,2],[3,4],[5,6]]

# if you want to suppress output in the rare cases it's generated, use ;

data;

# There is a magic for help as well

%quickref

# Important magics are: %reset, %reset_selective, %whos

# %prun (profile), %pdb (debug), %run, %edit

# And you can use ? and ?? for details on a symbol

get_ipython().magic(u'pinfo %pylab')

# Other than the Python built in help() and dir(symbol)

1*3 # evaluate this in a cell

_*3 # _ refers to the output of last evaluated cell

# again, not that useful because you can't refer to the previous expression

# evaluated inside a cell, just to the previous evaluation of an entire cell

# A numpy array can be created from a homogeneous list-like object

arr = np.array(data)

# Or an uninitialized one can be created by specifying its shape

arr = np.ndarray((3,3))

# There are also many other "constructors"

arr = np.identity(5)

arr = np.zeros((4,5))

arr = np.ones((4,))

arr = np.linspace(2,3) # see also arange and logspace

arr = np.random.random((4,4))

# Arrays can also be created from functions

arr = np.fromfunction((lambda x: x*x), shape = (10,))

# ...or by parsing a string

arr = np.fromstring('1, 2', dtype=int, sep=',')

# Arrays are assigned by reference

arr2 = arr

# To create a copy, use copy

arr2 = arr.copy()

# An array shape is a descriptor, it can be changed with no copy

arr = np.zeros((4,4))

arr = arr.reshape((2,8)) # returns the same data in a new view

# numpy also supports matrices, which are arrays contrained to be 2d

mat = np.asmatrix(arr)

# Not all operations avoid copies, flatten creates a copy

arr.flatten()

# While ravel doesn't

arr = np.zeros((4,4))

arr.ravel()

# By default numpy arrays are created in C order (row-major), but

# you can arrange them in fortran order as well on creation,

# or make a fortran-order copy of an existing array

arr = np.asfortranarray(arr)

# Data is always contiguously packed in memory

# Arrays can be indexed as with python lists/tuples

arr = np.zeros((4,4))

arr[0][0]

# Or with a multidimensional index

arr[0,0]

# Negative indices start from the end

arr[-1,-1] # same as arr[3,3] for this array

# Or, like matlab, via splicing of a range: start:end

arr = arange(10)

arr[1:3] # elements 1,2

arr[:3] # 0,1,2

arr[5:] # 5,6,7,8,9

arr[5:-1] # 5,6,7,8

arr[0:4:2] # step 2: 0,3

arr = arr.reshape(2,5)

arr[0,:] # first row

arr[:,0] # first column

# Also, splicing works on a list of indices (see also choose)

arr = arr.reshape(10)

arr[[1,3,5]]

# and with a numpy array of bools (see also where)

arr=np.array([1,2,3])

arr2=np.array([0,3,2])

arr[arr > arr2]

# flat returns an 1D-iterator

arr = arr.reshape(2,5)

arr.flat[3] # same as arr.reshape(arr.size)[3]

# Core operations on arrays are "ufunc"tions, element-wise

# vectorized operations

arr = arange(0,5)

arr2 = arange(5,10)

arr + arr2

# Operations on arrays of different shapes follow "broadcasting rules"

arr = np.array([[0,1],[2,3]]) # shape (2,2)

arr2 = np.array([1,1]) # shape (1,)

# If we do arr+arr2 arr2.ndim

# an input can be used if shape in all dimensions sizes, or if

# the dimensions that don't match have size 1 in its shape

arr + arr2 # arr2 added to each row of arr!

# Broadcasting also works for assignment:

arr[...] = arr2 # [[1,1],[1,1]] note the [...] to access all contents

arr2 = arr # without [...] we just say arr2 refers to the same object as arr

arr[1,:] = 0 # This now is [[1,1],[0,0]]

# flat can be used with broadcasting too

arr.flat = 3 # [[3,3],[3,3]]

arr.flat[[1,3]] = 2 # [[3,2],[2,3]]

# broadcast "previews" broadcasting results

np.broadcast(np.array([1,2]), np.array([[1,2],[3,4]])).shape

# It's possible to manually add ones in the shape using newaxis

# See also: expand_dims

print(arr[np.newaxis(),:,:].shape) # (1,2,2)

print(arr[:,np.newaxis(),:].shape) # (2,1,2)

# There are many ways to generate list of indices as well

arr = arange(5) # 0,1,2,3,4

arr[np.nonzero(arr)] += 2 # 0,3,4,5,6

arr = np.identity(3)

arr[np.diag_indices(3)] = 0 # diagonal elements, same as np.diag(arg.shape[0])

arr[np.tril_indices(3)] = 1 # lower triangle elements

arr[np.unravel_index(5,(3,3))] = 2 # returns an index given the flattened index and a shape

# Iteration over arrays can be done with for loops and indices

# Oterating over single elements is of course slower than native

# numpy operators. Prefer vector operations with splicing and masking

# Cython, Numba, Weave or Numexpr can be used when performance matters.

arr = np.arange(10)

for idx in range(arr.size):

print idx

# For multidimensiona arrays there are indices and iterators

arr = np.identity(3)

for idx in np.ndindex(arr.shape):

print arr[idx]

for idx, val in np.ndenumerate(arr):

print idx, val # assigning to val won't change arr

for val in arr.flat:

print val # assigning to val won't change arr

for val in np.nditer(arr):

print val # same as before

for val in np.nditer(arr, op_flags=['readwrite']):

val[...] += 1 # this changes arr

# Vector and Matrix multiplication are done with dot

arr = np.array([1,2,3])

arr2 = np.identity(3)

np.dot(arr2, arr)

# Or by using the matrix object, note that 1d vectors

# are interpreted as rows when "seen" by a matrix object

np.asmatrix(arr2) * np.asmatrix(arr).transpose()

# Comparisons are also element-wise and generate masks

arr2 = np.array([2,0,0])

print (arr2 > arr)

# Branching can be done checking predicates for any true or all true

if (arr2 > arr).any():

print "at least one greater"

# Mapping a function over an array should NOT be done w/comprehensions

arr = np.arange(5)

[x*2 for x in arr] # this will return a list, not an array

# Instead use apply_along_axis, axis 0 is rows

np.apply_along_axis((lambda x: x*2), 0, arr)

# apply_along_axis is equivalent to a python loop, for simple expressions like the above, it's much slower than broadcasting

# apply_along_axis is equivalent to a python loop, for simple expressions like the above, it's much slower than broadcasting

# It's also possible to vectorize python functions, but they won't execute faster

def test(x):

return x*2

testV = np.vectorize(test)

testV(arr)

# Scipy adds a wealth of numerical analysis functions, it's simple

# so I won't write about it.

# Matplotlib (replicates Matlab plotting) is worth having a quick look.

# IPython supports matplotlib integration and can display plots inline

%matplotlib inline

# Unfortunately if you chose the inline backend you will lose the ability

# of interacting with the plots (zooming, panning...)

# Use a separate backend like %matplotlib qt if interaction is needed

# Simple plots are simple

test = np.linspace(0, 2*pi)

plt.plot(test, np.sin(test)) # %pylab imports matplotib too

plt.show() # IPython will show automatically a plot in a cell, show() starts a new plot

plt.plot(test, np.cos(test)) # a separate plot

plt.plot(test, test) # this will be part of the cos plot

#plt.show()

# Multiple plots can also be done directly with a single plot statement

test = np.arange(0., 5., 0.1)

# Notes that ** is exponentiation.

# Styles in strings: red squares and blue triangles

plt.plot(test, test**2, 'rs', test, test**3, 'b^')

plt.show()

# It's also possible to do multiple plots in a grid with subplot

plt.subplot(211)

plt.plot(test, np.sin(test))

plt.subplot(212)

plt.plot(test, np.cos(test), 'r--')

#plt.show()

# Matplotlib plots use a hierarchy of objects you can edit to

# craft the final image. There are also global objects that are

# used if you don't specify any. This is the same as before

# using separate objects instead of the global ones

fig = plt.figure()

axes1 = fig.add_subplot(211)

plt.plot(test, np.sin(test), axes = axes1)

axes2 = fig.add_subplot(212)

plt.plot(test, np.cos(test), 'r--', axes = axes2)

#fig.show()

# All components that do rendering are called artists

# Figure and Axes are container artists

# Line2D, Rectangle, Text, AxesImage and so on are primitive artists

# Top-level commands like plot generate primitives to create a graphic

# Matplotlib is extensible via toolkits. Toolkits are very important.

# For example mplot3d is a toolkit that enables 3d drawing:

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()

ax = fig.add_subplot(111, projection='3d')

points = np.random.random((100,3))

# note that scatter wants separate x,y,z arrays

ax.scatter(points[:,0], points[:,1], points[:,2])

#fig.show()

# Matplotlib is quite extensive (and ugly) and can do much more

# It can do animations, but it's very painful, by changing the

# data artists use after having generated them. Not advised.

# What I end up doing is to generate a sequence on PNG from

# matplotlib and play them in sequence. In general MPL is very slow.

# If you want to just plot functions, instead of discrete data,

# sympy plots are a usable alternative to manual sampling:

from sympy import symbols as sym

from sympy import plotting as splt

from sympy import sin

x = sym('x')

splt.plot(sin(x), (x, 0, 2*pi))

**Resources:**

Tutorials

**A Crash Course in Python for Scientists****Scientific Python Lectures****A gallery of interesting IPython notebooks****Performance Python for Numerical Algorithms**- Anaconda's
**Conda package manager**(also, remember to conda clean unused packages...) - Learn about division and future division
- Basic NumPy tutorial
- Numpy performance tricks
- General Python performance tricks
- Parallel Computing with IPython
- Interactivity with IPython

Packages

- Scipy: numpy, scipy, matplotlib, sympy, pandas
- Optimization and learning
**Scikit-Learn**, one of SciKit packages (scipy extensions)- GPy, Gaussian processes.
- Scikit-Monaco, Monte Carlo integrators
- PyBrain, learning and neural networks
- Cvxopt/cvxpy, convex optimization
- NLopt, more nonlinear optimizers
- PyOpt, even more optimizers
- DEAP, evolutionary algorithms
- PyGTS, Gnu Triangulated Surface Library
- Performance
- A comparison of Cython, Numba, PyCuda, PyOpenCl, NumPy and other frameworks on a simple problem (Mandelbrot set)
**SciPy Weave**, inlines C code in Python code, compiles and links to python on demand.**Numba**, a numpy "aware" compiler, targets LLVM, compiles in runtime (annotated functions)**Cython**, compiles annotated python to C. Bottleneck uses it to accelerate some NumPy functions. (see also Shedskin, Pythran and ocl)- JobLib, makes multiprocessing easier (see IPython.Parallel too) but still not great as you can't have multithreading, multiprocessing means you'll have to send data around independent python interpreters :/
- Parakeet, compiles annotated numpy functions to CPU and GPU code, similar to Numba
- NumExpr, a fast interpreter of numerical expressions on arrays. Faster than numpy by aggregating operations (instead of doing one at at time)
- Theano, targets cpu and gpu, numpy aware, automatic differentiation. Clumsy...
- Nuikta, offline compiles python to C++, should support all extensions
- PyPy, a JIT, with a tracing interpreter written in python. Doesn't support all extensions (the CPython C library interface)
- Python/Cuda links
- Non-homogeneous data
- Graphics/Plotting
**For 3d animations, VisVis****seems the only tool that is capable of achieving decent speed, quality, and has a good interface and support.**It has a matlab-like interface, but actually creating objects (Line() instead of plot...) is muuuuch better/faster. Its successor is VisPy, still very experimental.**Bokeh**, nice plotting library, 2d only, outputs HTML5/JS so it can be interacted with in IPython Notebook- Matplotlib toolkits (MPL is SLOW and rather ugly, but it's the most supported):
**Mplot3d**, quite crappy 3d plots**Seaborn**, good looking 2d plots- Prettyplotlib, not crappy looking 2d plots
**PyProcessing**, not really a "scipy package" but useful, especially considering how bad the options for 3d plotting are right now. See also**NodeBox OpenGL****Point Cloud Library**- Chaco, another 2d plot/gui library, very OO, not useful for quickly graphing
- mpld3, a matplotlib compatible library that emits HTML5/JS using d3.js
- Others:
- Mayavi/Tvtk, 3d visualizations. Usable, but finicky and old (and tends to crash).
*Note:*

# mayavi and wxpython, use from command line binstar search -t conda mayavi

%gui wx

%pylab wx

%matplotlib inline

# In Ipython Notebook %matplotlib has to come after pylab, it seems.

# "inline" is cool but "qt" and "wx" allows interactivity

# qt is faster than wx, but mayavi requires wx

Other

- ComputableApp for iOS
**Deployment/packagers:**PyInstaller. Alternatives:- Binstar, a package sharing system

## 1 comment:

Awesome post and help getting started with scientific Python! I wrote something with similar intentions in January http://bartwronski.com/2014/01/19/mathematics-toolbox/ and now I'm writing a short "post-mortem"/follow-up - tips& tricks, what went wrong etc. Actually it's great that you posted it around +/- similar time, so I'll reference you and remove overlap - focus on some of personal experience instead. Cheers! :)

Post a Comment