Search this blog

18 November, 2017

DataLog & TableLog


What:
  • A simple system to serialize lists of numbers. 

Why: 

  • Programmers should use visualization as an everyday tool when developing algorithms. 
    • Most times if you just look at the final results via some aggregate statistics, for non trivial code, you end up missing important details that could lead to better solutions. 
    • Visualize often and early. Visualize the dynamic behaviour of your code!
  • What I used to do for the most part is to printf() from C code times values in a simple csv format, or directly as Mathematica arrays.
    • Mathematica is great for visualization and often with a one-liner expression I can process and display the data I emitted. Often I even copy the Mathematica code to do so as a comment in the C source.
    • Sometimes I peek directly in the process memory...
  • This hack’n’slash approach is fine, but it starts to be very inconvenient when you need to dump a lot of data and/or if the data is generated by multiple threads or in different stages in the program.
    • Importing the data can be very slow as well!
  • Thus, I finally decided I needed a better serialization code...

Features:

  • Schema-less. Serializes arrays of numbers. Supports nested arrays, no need to know the array dimensions up-front. Can represent any structure.
  • Compact. Stores numbers, internally, in the smallest type that can contain them (from 8-bit integers to double-precision floating point). Decodes always as double, transparently.
  • Sample import code for Processing.
  • Can also serialize to CSV, Mathematica arrays and UBJSON (which Mathematica 11.x can import directly)
  • Multi-thread safe.
    • Automatically sorts and optionally collates together data streams coming from different threads.
  • Not too slow. Usable. I would probably rewrite it from scratch now that I understand what I can do better - but the current implementation is good enough that I don't care, and the interface is ok.
  • Absolutely NOT meant to be used as a "real" serialization format, everything is meant to be easy to drop in an existing codebase, zero dependencies, and get some data out quickly, to then be removed...

Bonus: "TableLog" (included in the same source)
  • A system for statistical aggregation, for when you really have lots of data...
  • ...or the problem is simple enough that you know what statistics to extract from the C code!
  • Represents a data table (rows, columns).
    • Each row should be an independent "item" or experiment.
    • Each column is a quantity to be measured of the given item.
    • Multiple samples (data values) can be "pushed" to given rows/columns.
    • Columns automatically compute statistics over samples.
    • Each column can aggregate a different number of samples.
    • Each column can be configured to compute different statistics: average, minimum, maximum, histograms of different sizes.
  • Multithread-safe.
    • Multiple threads can write to different rows...
    • ...or the same row can be "opened" globally across threads.
    • Columns can be added incrementally (but will appear in all rows).
--- Grab them here! ---

DataLog: C code - computing & exporting data


DataLog: Processing code - importing & visualizing data


TableLog: C code
TableLog: Data imported in Excel


More visualization examples...