C0DE517E: DataLog & TableLog

18 November, 2017

What:

A simple system to serialize lists of numbers.

Why:

Programmers should use visualization as an everyday tool when developing algorithms.

For non-trivial code, looking only at the final results through some aggregate statistics usually means missing important details that could lead to better solutions.
Visualize often and early. Visualize the dynamic behaviour of your code!

What I used to do, for the most part, was to printf() values from C code in a simple CSV format, or directly as Mathematica arrays.
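A minimal sketch of that printf-style dumping, assuming one CSV row per sample with a leading text tag. The helper name `format_csv_row` and the row layout are made up for illustration and are not part of DataLog:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not DataLog code): formats "tag,v0,v1,...\n" into buf.
   Returns the number of characters written (excluding the terminating NUL),
   or -1 if buf is too small. */
int format_csv_row(char *buf, size_t cap, const char *tag,
                   const double *values, size_t count)
{
    assert(buf != NULL && tag != NULL);

    int n = snprintf(buf, cap, "%s", tag);
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t pos = (size_t)n;

    for (size_t i = 0; i < count; ++i) {
        /* %.9g prints up to 9 significant digits and drops trailing zeros */
        n = snprintf(buf + pos, cap - pos, ",%.9g", values[i]);
        if (n < 0 || (size_t)n >= cap - pos) return -1;
        pos += (size_t)n;
    }

    n = snprintf(buf + pos, cap - pos, "\n");
    if (n < 0 || (size_t)n >= cap - pos) return -1;
    return (int)(pos + 1);
}
```

Calling `format_csv_row(line, sizeof line, "err", v, 3)` and then `fputs(line, stdout)` emits one row per call. This is fine for a handful of values, but it also shows why the approach gets awkward: every call site needs its own formatting, and nothing orders rows coming from different threads.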

Mathematica is great for visualization, and often a one-line expression is enough to process and display the data I emitted. I often even keep that Mathematica code as a comment in the C source.
Sometimes I peek directly in the process memory...

This hack’n’slash approach is fine, but it becomes very inconvenient when you need to dump a lot of data, or when the data is generated by multiple threads or at different stages of the program.

Importing the data can be very slow as well!

Thus, I finally decided I needed better serialization code...

Features:

Schema-less. Serializes arrays of numbers. Supports nested arrays, no need to know the array dimensions up-front. Can represent any structure.

Compact. Internally stores each number in the smallest type that can contain it (from 8-bit integers to double-precision floating point). Always decodes as double, transparently.
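The "smallest type that can contain it" idea can be sketched roughly as below. This is an assumption about how such a check might look, not the actual DataLog encoder: the `NumType` tags, the set of candidate types and the function name are all hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* Hypothetical storage tags, narrowest first (not DataLog's real format). */
typedef enum { NUM_I8, NUM_I16, NUM_I32, NUM_F32, NUM_F64 } NumType;

/* Returns the narrowest tag whose type holds a value comparing equal to x.
   Note: negative zero is treated as the integer 0 in this sketch. */
NumType smallest_type(double x)
{
    /* NaN fails both comparisons, so it falls through to the float checks. */
    if (x >= INT32_MIN && x <= INT32_MAX) {
        int32_t i = (int32_t)x;          /* truncates toward zero */
        if ((double)i == x) {            /* x has no fractional part */
            if (i >= INT8_MIN  && i <= INT8_MAX)  return NUM_I8;
            if (i >= INT16_MIN && i <= INT16_MAX) return NUM_I16;
            return NUM_I32;
        }
    }
    /* Range check first: converting an out-of-range double to float is
       undefined behaviour in C. Then test that the round trip is exact. */
    if (x >= -FLT_MAX && x <= FLT_MAX && (double)(float)x == x)
        return NUM_F32;
    return NUM_F64;                      /* includes NaN and infinities */
}
```

With this scheme `100.0` needs a single byte, `0.5` fits a float exactly, and `0.1` stays a double because its float rounding differs. Since every narrower type converts back to double without loss, a decoder can always hand back doubles.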

Sample import code for Processing.

Can also serialize to CSV, Mathematica arrays and UBJSON (which Mathematica 11.x can import directly).

Multi-thread safe.

Automatically sorts and optionally collates together data streams coming from different threads.

Not too slow. Usable. I would probably rewrite it from scratch now that I understand what I can do better - but the current implementation is good enough that I don't care, and the interface is ok.

Absolutely NOT meant to be used as a "real" serialization format. Everything is meant to be easy to drop into an existing codebase, with zero dependencies, to get some data out quickly and then be removed...

Bonus: "TableLog" (included in the same source)

A system for statistical aggregation, for when you really have lots of data...
...or the problem is simple enough that you know what statistics to extract from the C code!
Represents a data table (rows, columns).

Each row should be an independent "item" or experiment.
Each column is a quantity to be measured for that item.
Multiple samples (data values) can be "pushed" to a given row/column.
Columns automatically compute statistics over samples.
Each column can aggregate a different number of samples.
Each column can be configured to compute different statistics: average, minimum, maximum, histograms of different sizes.
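To make the aggregation concrete, here is a rough sketch of what a single table entry (one row/column pair) could accumulate as samples are pushed. The `Cell` struct, `MAX_BINS` and the function names are invented for illustration and do not reflect the real TableLog interface. Samples are assumed to be finite numbers.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define MAX_BINS 16  /* hypothetical upper bound on histogram size */

/* Hypothetical accumulator for one (row, column) entry. */
typedef struct {
    size_t count;             /* number of samples pushed so far */
    double sum, min, max;
    double hist_lo, hist_hi;  /* value range covered by the histogram */
    size_t bins;              /* bins in use; 0 disables the histogram */
    size_t hist[MAX_BINS];
} Cell;

void cell_init(Cell *c, size_t bins, double lo, double hi)
{
    assert(bins <= MAX_BINS && (bins == 0 || hi > lo));
    c->count = 0;
    c->sum = 0.0;
    c->min = DBL_MAX;
    c->max = -DBL_MAX;
    c->hist_lo = lo;
    c->hist_hi = hi;
    c->bins = bins;
    for (size_t i = 0; i < MAX_BINS; ++i) c->hist[i] = 0;
}

void cell_push(Cell *c, double x)
{
    c->count++;
    c->sum += x;
    if (x < c->min) c->min = x;
    if (x > c->max) c->max = x;
    if (c->bins > 0) {
        /* Map x to [0, 1) over the range; out-of-range samples are
           clamped into the first or last bin. */
        double t = (x - c->hist_lo) / (c->hist_hi - c->hist_lo);
        size_t b;
        if (!(t > 0.0))    b = 0;
        else if (t >= 1.0) b = c->bins - 1;
        else               b = (size_t)(t * (double)c->bins);
        c->hist[b]++;
    }
}

double cell_average(const Cell *c)
{
    return c->count ? c->sum / (double)c->count : 0.0;
}
```

Because each entry only keeps running totals and bin counts, memory use stays constant no matter how many samples are pushed, which is the point of aggregating in the C code instead of dumping everything.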