# First principles

If you wish to make an apple pie from scratch, you must first invent the universe.
— Carl Sagan

I’m definitely showing this quote to whoever tries to implement something without first searching for a library to do it.

# Hieroglyphics as types, whitespace as function names

I came across some curious Haskell tweets lately and decided to collect them in one place.

These reminded me of a curious fact: did you know that there are other kinds of spaces in Unicode, like U+00A0, the no-break space? What about using it in Ruby? (please don’t)
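To make it concrete (a contrived sketch, not a recommendation): Ruby symbols can contain any character, so you can smuggle a no-break space into a method name.

```ruby
# U+00A0 looks exactly like a space but isn't one to the parser.
# Defining the method via a symbol sidesteps any editor/encoding issues.
class Sneaky
  define_method(:"no\u00A0break") { "called a method with a no-break space in its name" }
end

puts Sneaky.new.send(:"no\u00A0break")
```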

Whenever I see someone talking about non-ASCII characters in programming languages, I always come back to APL, an old language that used extremely concise notation similar to mathematics itself. Due to most keyboards being horrible, it never caught on. :-(

(Mental note: having LaTeX-like math symbols embedded in a language for scientific computing would be… interesting.)

# Playing with Lua

I work for a mobile games company as a data scientist. I use Ruby for data wrangling and some kinds of analysis, Python for more specific things (essentially scikit-learn) and bash scripts for gluing everything together.

The developers use Corona for creating our games, which uses Lua. I decided to give that language a try.

Some facts:

• Lua is tiny. As someone accustomed to Python and Ruby, it is shocking to see such a small standard library. For example, this is the manual – there are only 158 Lua functions listed there.
• The syntax is incredibly simple. Take a look at these diagrams; if you understand Extended Backus-Naur Form, you can read Lua’s syntax quite easily. For comparison, Ruby’s syntax is complex enough that there are lots (and lots and lots) of small corner cases that I’ve probably never heard of, even after years of using it. Ah! And Ruby’s parse.y has 11.3k lines.
• Lua was built with embedding in mind; it is used for interface customization in World of Warcraft, for example.
• It is a Brazilian programming language! :-) Lua was created in 1993 in Rio de Janeiro, according to Wikipedia.

# The Data Package Format

At my last job, I worked with data from the Brazilian educational system in several situations. The details aren’t the important part, but the format is: a giant denormalized CSV with an accompanying PDF detailing its fields. It is very nice after you work with it for some time, but there are some things that could be better.

In that format, enumerations (fields with a fixed, finite set of values) are encoded as arbitrary integer ranges, boolean values as 0 or 1, and the other implementation details are explained in the PDF. Thus far we have a cute CSV with documented fields. Nice, right?

Actually, yes.
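For concreteness, here’s a hedged sketch of what consuming such a file looks like in Ruby. The column names and the code table below are invented for illustration; the real ones live in the accompanying PDF.

```ruby
require 'csv'

# Hypothetical enumeration: an arbitrary integer range documented in the PDF.
SCHOOL_TYPE = { 1 => :federal, 2 => :state, 3 => :municipal, 4 => :private }

CSV.foreach('census.csv', headers: true) do |row|
  type      = SCHOOL_TYPE[row['SCHOOL_TYPE'].to_i] # enum encoded as an integer
  has_water = row['WATER'].to_i == 1               # boolean encoded as 0/1
  # ... analysis goes here
end
```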

# Learning new programming languages

Programming languages are possibly one of the simplest parts of software engineering. You can know your language from the inside-out and still have problems in a project — knowing the tool doesn’t imply knowing the craft. But learning a new language is really a lot of fun.

Inspired by Avdi Grimm’s roadmap for learning new languages, I decided to give it a try and put my current interests in writing.

• Julia – http://julialang.org/
I have experience writing code in MATLAB, Octave, Python (with NumPy, SciPy and Pandas) and a bit of R, and still I’m excited about Julia. There are at least 3 features of Julia that are powerful and make me want to work with it: its just-in-time compiler, parallel for loops and the awesome metaprogramming inherited from Lisp.

The drawback is… is… well, I didn’t have time to really use it and get comfortable writing Julia programs. Yet.

• Haskell – http://www.haskell.org/

I already tried learning Haskell a couple of times. Maybe 3 or 4 or 5 times. I wrote programs based on mathematics and some simple scripts; most of the syntax isn’t strange anymore and even monads make sense now. However, I still feel a bit stiff when writing Haskell. I don’t know.

Two books I recently bought might help with that – Real World Haskell and Parallel and Concurrent Programming in Haskell. I probably need to motivate myself to write something useful with it.

• Rust – http://www.rust-lang.org/

There is a quote on Rust’s website that sums up my expectations of it:

Rust is a systems programming language that runs blazingly fast, prevents nearly all segfaults, and guarantees thread safety.

I know how to read C/C++ and even write a bit of it, but it’s messy and takes more time than I usually have for side projects. Writing code that is safe & fast shouldn’t be so hard. ;)

All in all, this is a very brief list. However, I don’t think I should focus on more languages right now. To be honest, I think my next learning targets are in applied mathematics. I need a stronger foundation in partial differential equations and probability theory. There are several topics in optimization that I should take the time to study. Calculus of variations also seems quite cool.

(good thing that I have friends in pure math to help me find references!)

# SciRuby projects for Google Summer of Code 2015

Another year with SciRuby accepted as a mentoring organization in Google Summer of Code (GSoC)! The Community Bonding Period ended yesterday; the coding period officially begins today.

I’m really happy with the projects chosen this year; they cover various subjects and some would be really useful for me, e.g. Alexej’s LMM gem, Sameer’s Daru and Will’s changes to NMatrix.

That’s all. After the next GSoC meeting, I should write about how each of the projects is going.

Having to search for your tools when you need to use them is bad organization.

Having a standard set of tools is a good thing. I have two toolboxes in my house, one for electronics and another for “hard” tools.

A voltmeter, a Raspberry Pi and an Arduino.

# Deep copying objects in Ruby

Time and time again I forget that Object#clone in Ruby is a shallow clone, and I end up getting bitten and spending 30 seconds staring at the screen wondering what the hell happened. The only difference today is that I decided to finally post about it on my blog – let’s hope this time is the last.
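For future me, a minimal sketch of the trap and the usual workaround (the Marshal trick only works for plain data objects, but that covers most cases):

```ruby
a = { list: [1, 2, 3] }

b = a.clone    # shallow: b is a new hash, but the inner array is shared
b[:list] << 4
a[:list]       # => [1, 2, 3, 4] -- what the hell happened?

c = Marshal.load(Marshal.dump(a)) # deep copy via a serialization round-trip
c[:list] << 5
a[:list]       # => [1, 2, 3, 4] -- unaffected this time
```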

# Updates on NMatrix and SciRuby development

For the last couple of days, I’ve been thinking about what I wrote two weeks ago regarding SciRuby and the whole Ruby scientific computing scene. I still believe that the sciruby gem can be used as an integrated environment, but there are some problems that must be solved first:

1. We need a reasonably feature-complete and easy-to-install version of NMatrix.
2. A good plotting tool. Right now, Naoki is working on this as part of GSoC 2014.
3. Statistics. Lots of things are already implemented in Statsample, but both Statsample::DataFrame and Statsample::Vector should use NMatrix under the hood. Supporting JRuby can be problematic here…
4. Given 1 and 2, it’s possible to implement a lot of other interesting and useful things. For example: linear regression methods, k-means clustering, neural networks, use NMatrix as a matrix type for OpenCV images. There are lots of possibilities.
5. Minimization, integration and others.

With that in mind, my objective for the following weeks is to improve NMatrix. First, there are BLAS routines (mainly from level 2, but some from levels 1 and 3 as well) that aren’t implemented in NMatrix and/or that aren’t available for the rational and ruby-object dtypes. There’s also LAPACK.

Another benefit of having complete C/C++ implementations is that we’ll eventually be able to generalize these interfaces to allow other implementations (e.g. Mac OSX vecLib’s LAPACK, Intel’s MKL), thus making it much easier to install NMatrix. As Collin (and, I think, Pjotr) said on the sciruby-dev mailing list, it should be as easy as gem install nmatrix.

## BLAS and LAPACK general implementations

• HAVE_CBLAS_H being derived from mkmf’s have_header.
• Many more routines are now implemented. Ideally, BLAS levels 1 and 2 should be complete by the end of May.

An important next step is to be able to link against arbitrary BLAS and LAPACK implementations, given that they obey the standard. Issue #188 started some ideas; issue #22 is the original (and very old) one.

## After that…

When NMatrix supports both BLAS and LAPACK without a problem — i.e. has its own implementations and can also link against arbitrary ones (OSX’s vecLib, GSL, ATLAS, Intel’s MKL, AMD’s Core Math Library) — we’ll be able to build on top of it. Some routines in NMatrix already work with every dtype, but most don’t. When we know exactly which routines can’t work with which dtypes, we’ll be in a very good position to state what we support.

Alright, we have determinants for rational matrices, but not some other operation, etc. What else? Stypes! We also need to have good support for Yale (sparse) matrices. (Note: maybe add the “old Yale” format?)

The roadmap is clear: we have to support the whole BLAS/LAPACK standard; almost everything linear algebra-wise is in there. After that, it’s mostly improvements to the interface, better method naming, better documentation and examples, better IO, etc.

Another point that would be good to address is removing the dependency on g++ > 4.6. We should strive to remove everything that depends on C++11 features, thus allowing normal Mac OSX users to install NMatrix without having to first install another compiler.

## Better documentation

We need to refactor our documentation. Oh, how we need to!

First, remove everything that shouldn’t be in the public-facing API — the classes and modules used internally by NMatrix::IO shouldn’t be available in the public API anyway, only the outside-facing stuff: how to save and load to/from each format. Probably more things as well.

Second, do a better job of being consistent with the docs. Some methods are missing a return type and similar details. Lots of methods in the C/C++ world aren’t documented at all. We can do better!

Finally, a really good documentation template. Fivefish is a good choice — it provides a very pretty, searchable and clean interface. (Create NMatrix’s docs with it, host them on my own server and see what happens.)

# Solving linear systems in NMatrix

I’m writing some guides for NMatrix, so in the following weeks there should be some posts similar to this one, but more complex.

Linear systems are among the most useful tools from elementary algebra. Many problems can be represented by them: systems of linear ODEs, operations research optimizations, linear electrical circuits and a lot of the word problems from basic algebra. We can represent these systems as

$Ax = b$

where $A$ is the matrix of coefficients, $x$ the vector of unknowns and $b$ the vector representing the right-hand side of the equations.
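A sketch of how this looks in NMatrix. I’m assuming the #solve method here; depending on your NMatrix version, the exact API may differ:

```ruby
require 'nmatrix'

# The system:
#    x + 2y = 5
#   3x + 4y = 6
a = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)
b = NMatrix.new([2, 1], [5.0, 6.0], dtype: :float64)

x = a.solve(b) # the vector of unknowns satisfying Ax = b
```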

# Gems for scientific computing

UPDATE (20/04): User centrx from #ruby-lang at freenode warned me that I forgot about RSRuby/RinRuby, so I added them to projects.yml.

At Wicked Good Ruby 2013, Bryan Liles gave a presentation about Machine Learning with Ruby. It’s a good introduction to the subject and he presents some useful tricks (I didn’t know about Grapher.app, for example). But the best advice I took away is that there’s a lot of room for improvement in the Ruby scientific computing scene.

Having contributed to some SciRuby projects in the last year, I’ve seen it first-hand. With NMatrix, it’s possible to do a lot of vector and matrix calculations easily, if you know how to install it — a task that’s much easier today. There are statsample for statistics, distribution for probability distributions, minimization, integration, the GSL bindings and others. But if you need plotting, the options are either pretty hard to use (e.g. Rubyvis) or depend on external programs (Plotrb outputs SVG files). Do you want an integrated environment, like MATLAB or Pylab? There isn’t one.

Searching for more people interested in the subject, I found a presentation about neural networks by Matthew Kirk from RubyConf 2013, a Eurucamp 2013 presentation by Juanjo Bazán and slides from a presentation by Shahrooz Afsharipour at a German university. If we needed any confirmation that there are folks looking for SciRuby, here’s the evidence.

## What can be done

In order to address these problems, I’m trying to come up with concrete steps towards creating a scientific community around Ruby. It’s obvious we need “more scientific libraries”, but what do we already have? What is easy to install and what isn’t? Should we create something new or improve what we have?

Also, I’m mapping the Ruby scientific computing landscape. I’ve compiled a YAML file with a list of the projects I’ve found so far. In the future, this could be transformed into a nice visualization on sciruby.org to help scientists find the libraries they need.

If you know how to use the R programming language, both RSRuby and RinRuby can be used. They’re libraries that run R code inside Ruby, so you can technically do anything you’d do with R from Ruby. This is suboptimal, though, and R isn’t known for its speed.

For an integrated environment, we can revive the sciruby gem. For example:
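The example I had in mind was lost with the original post, so here’s a purely hypothetical sketch; none of these names exist yet, the point is the require-one-gem experience:

```ruby
require 'sciruby'

# Hypothetical DSL: one require pulls in NMatrix, statsample, plotting, etc.
SciRuby.session do
  data  = read_csv 'experiment.csv'
  model = linear_regression data, target: :y
  plot model.residuals, title: 'Residuals'
end
```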

I’m updating the SciRuby repository in this branch. Before creating the above DSL, it’s necessary to remove a lot of cruft (e.g. use bundler/gem_tasks instead of hoe) and add some niceties (e.g. Travis CI support). Most importantly, the main SciRuby projects — NMatrix, statsample, minimization, integration, etc — should be added as dependencies, in order to have a real integrated environment without require’ing everything manually. I’ll probably submit a pull request by next week.

We also need to improve our current offerings: NMatrix installation shouldn’t depend on ATLAS, plotrb (or some other solution) needs to be more usable, we should show how IRuby can be used to write scripts with nice graphics and LaTeX support, and we should create a list of all the applications that use our libraries, for reference.

The Ruby Science Foundation was selected for Google Summer of Code 2014, so some very bright students will help us fix some of these problems during the summer. However, there’s a lot to be done in every SciRuby project, if you’ve got the time. :)

## Conclusion

We still have a long way to go before we have a full-fledged scientific community — but there’s hope! Some areas to look at:

• Good numerical libraries: NMatrix, mdarray.
• Algorithms for data mining, modeling and simulations: AI4R, ruby-fann, ruby-libsvm, statsample, distribution, etc.
• Plotting: Rubyvis is a port of Protovis, which was deprecated in favor of D3.js. Thus, we should create a plotting library around a C backend or around D3, like Plotrb.
• Integrated environment: IRuby together with SciRuby.

Except for plotting, an area that really needs a lot of love and care, most of these already work, but with usability problems (mostly around installation).

If you think that it’d be cool to have a scientific community centered around Ruby and you do have some time available, please please please:

1. Take a look at the SciRuby repositories.
2. If there’s a subject you’re interested in, see if you can refactor something, add more tests, well, anything.
3. Open issues about new features or pull requests improving the current ones.
4. If you don’t understand much about the subject but see something that could be improved, do it: is there Travis CI support? Something wrong with the gemspec? Is it still using some gem to generate gemspecs?
5. There’s the sciruby-dev mailing list and the #sciruby channel on Freenode if there’s something you want to ask or discuss.

You can find me as agarie on freenode or @carlos_agarie on Twitter.

## References

1. SciRuby. Site, GitHub.
2. List of scientific computing projects in Ruby. projects.yml.
3. Wicked Good Ruby 2013. Site.
4. Bryan Liles: Machine Learning with Ruby. bryan.
5. Matthew Kirk: Test-driven neural networks with Ruby. neural.
6. Shahrooz Afsharipour: Ruby in the context of scientific computing. Slides in PDF.
7. Juanjo Bazán: presentation at Eurucamp 2013. juanjo-slides.

# Cross validation in Ruby

These days I had some data mining problems for which I wanted to use Ruby instead of Python. One of the problems I faced is that I wanted to use k-fold cross validation, but couldn’t find a sufficiently simple gem (or gist or whatever) for it. I ended up creating my own version.

## A review of k-fold cross validation

A common way to study the performance of a model is to partition a dataset into training and validation sets. Cross validation is a method to assess whether a model generalizes well independently of how this split is chosen.

The method of k-fold cross validation is to divide the dataset into k partitions, select one at a time as the validation set and use the other k – 1 partitions for training. You end up with k different models, each with its respective performance measure against its validation set.

The image below is an example of one fold: the k-th partition is left out for validation and partitions 1, …, k-1 are used for training.

For example, if k = 5, each fold uses 80% of the dataset for training and 20% for validation.

## The implementation

My solution is a function that receives the dataset, the number of partitions and a block responsible for training and using the classifier in question. Most of it is straightforward; just keep in mind that the last partition (the (k-1)-th, counting from zero) should encompass all the remaining elements when dataset.size isn’t divisible by k. The training set is simply the elements not in the validation set.
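The original snippet was an embedded gist; below is a reconstruction from the description above, so the names and the exact slicing are mine:

```ruby
# Yields k (training, validation) pairs. The last fold absorbs the
# remainder when dataset.size isn't divisible by k.
def cross_validate(dataset, k)
  fold_size = dataset.size / k

  k.times do |i|
    lo = i * fold_size
    hi = i == k - 1 ? dataset.size : lo + fold_size

    validation = dataset[lo...hi]
    training   = dataset[0...lo] + dataset[hi..-1]

    yield training, validation
  end
end
```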

The last part is to yield both sets to the given block. Some information regarding the functionality of the yield keyword can be seen in another post and in Ruby core’s documentation.

Now suppose you have a CSV file called “dataset.csv” and you have a classifier you want to train. It’s as easy as:
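Something along these lines, where the classifier API is hypothetical:

```ruby
require 'csv'

dataset = CSV.read('dataset.csv')

cross_validate(dataset, 5) do |training, validation|
  classifier = MyClassifier.new # hypothetical: any object that trains and scores
  classifier.train(training)
  puts classifier.accuracy(validation)
end
```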

And your classifier’s code is totally decoupled from the cross validation function. I like it.

## Conclusion

I found a gem on GitHub the other day, unsurprisingly called cross validation. Its API is similar to scikit-learn’s, which I find particularly strange. Too object oriented for me.

This code isn’t a full-blown gem and I don’t think there should be one just for cross validation. It fits in a whole machine learning library, though, and I hope to build one based on NMatrix… eventually.

# What are BLAS and LAPACK

At the beginning, the names BLAS, LAPACK and ATLAS confused me — imagine a young programmer, without formal training, trying to understand what a “de facto application programming interface standard” is, with lots of strangely-named functions and some references to the ancient FORTRAN language.

As of now, I think my understanding is sufficient to write about them.

BLAS (Basic Linear Algebra Subprograms) is a standard that provides 3 levels of functions for different kinds of linear algebra operations. Consider $\alpha$ and $\beta$ scalars, $x$ and $y$ vectors and $A$, $B$ and $T$ (triangular) matrices. The levels are divided in the following way:

1. Scalar and vector operations of the form $y = \alpha * x + y$, dot product and vector norms.
2. Matrix-vector operations of the form $y = \alpha * A * x + \beta * y$ and solving $T * x = y$.
3. Matrix-matrix operations of the form $C = \alpha * A * B + \beta * C$ and solving $B = \alpha * T^{-1} * B$. GEMM (GEneral Matrix Multiply) is contained in this level.

LAPACK (Linear Algebra PACKage) provides several functions, from solving linear systems to eigenvalue problems and factorizations. It’s much better to take a look at its documentation when you’re looking for something specific.

## A bit of history

BLAS was first published in 1979, as can be seen in this paper. An interesting part of it is the section named Reasons for Developing the Package:

1. It can serve as a conceptual aid in both the design and coding stages of a programming effort to regard an operation such as the dot product as a basic building block.

2. It improves the self-documenting quality of code to identify an operation such as the dot product by a unique mnemonic name.

3. Since a significant amount of the execution time in complicated linear algebraic programs may be spent in a few low level operations, a reduction of the execution time spent in these operations may be reflected in cost savings in the running of programs. Assembly language coded subprograms for these operations provide such savings on some computers.

4. The programming of some of these low level operations involves algorithmic and implementation subtleties that are likely to be ignored in the typical applications programming environment. For example, the subprograms provided for the modified Givens transformation incorporate control of the scaling terms, which otherwise can drift monotonically toward underflow.

So it seems we still use BLAS for the reasons it was created. The paper’s a pretty good read if you have the time. (And if you don’t know what a Givens transformation is, read this.)

LAPACK was first published in 1992, as can be seen in the release history. By reading the LAWNs (LAPACK Working Notes), we can get a pretty good picture of its beginning, e.g. papers that presented techniques which were later added to it and installation notes (with sayings of the sort “[…] by sending the authors a hard copy of the output files or by returning the distribution tape with the output files stored on it”).

## Implementations

There are various implementations of the BLAS API, e.g. by Intel, AMD, Apple and the GNU Scientific Library. The one supported by NMatrix is ATLAS (Automatically Tuned Linear Algebra Software), a very cool project that uses a lot of heuristics to determine the compilation parameters that maximize the performance of its BLAS & LAPACK implementations.

As for LAPACK, its original goal was “to make the widely used EISPACK and LINPACK libraries run efficiently on shared-memory vector and parallel processors” (source). Simply put, it’s a library for speeding up various matrix-related routines by taking advantage of each architecture’s memory hierarchy. The trick is that it uses block algorithms for dealing with matrices instead of an element-by-element approach. This way, less time is spent moving data around. It’s written in Fortran 90.

Another important point regarding LAPACK is that it requires a good BLAS implementation — it assumes there’s one already available for the system at hand — and extracts performance from it by calling the level 3 operations as much as possible.

## Function naming conventions

One of the strangest things about BLAS and LAPACK is how their functions are named. In LAPACK, a subroutine name is of the form pmmaaa, where:

• p is the type of the numbers used, e.g. S for single-precision floating-point and Z for double-precision complex.
• mm is the kind of matrix used in the algorithm, e.g. GE for GEneral matrices, SY for SYmmetric and TB for Triangular Band.
• aaa is the algorithm implemented by the subroutine, e.g. QRF for QR factorization, TRS for solving linear equations from an existing factorization. Putting it together, DGEQRF computes the QR factorization of a double-precision general matrix.

BLAS functions are named as <character><name><mod>, a scheme similar to LAPACK’s, but with differences depending on the specific level. In level 1, <name> is the operation type, while in levels 2 and 3 it’s the matrix argument type. For each level, there are specific values that <mod> (if present) can take, each providing additional information about the operation. <character> is the data type, regardless of the level.

These arcane names are a legacy of FORTRAN, whose identifiers were limited to 6 characters in length. Fortran 90 lifted the limit to 31 characters, but the names used in BLAS and LAPACK remain to this day.

## Use in NMatrix

NMatrix has bindings to both BLAS and LAPACK. Let me show you:
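The embedded example didn’t survive, so here’s a hedged reconstruction of the kind of call the bindings expose. NMatrix::BLAS.gemm existed at the time of writing; double-check the signatures against your version’s docs:

```ruby
require 'nmatrix'

a = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)
b = NMatrix.new([2, 2], [5.0, 6.0, 7.0, 8.0], dtype: :float64)

# Level 3 BLAS: C = alpha * A * B + beta * C (with alpha = 1, beta = 0 here)
c = NMatrix::BLAS.gemm(a, b)
```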

If you want to take a look at the low-level bindings, grab some coffee and read the ext/nmatrix/math/ directory. Since 8f129f, it has been greatly simplified and can actually be understood.

## References

Below you can find a list of the main resources used in this post.

# The Measurable Gem

I updated the Measurable gem yesterday with documentation and corrections to the methods.

It’s a module packed with lots of methods that calculate the distance between two vectors, u and v. They’re pretty useful for machine learning tasks and can be used in various apps whenever you need to estimate the similarity of two things — strings, sets, etc.
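A quick taste, assuming the module-function style of the current API (the README is authoritative on the exact names):

```ruby
require 'measurable'

u = [1.0, 2.0]
v = [4.0, 6.0]

Measurable.euclidean(u, v) # => 5.0
Measurable.cosine(u, v)    # cosine similarity, which is *not* a metric
```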

Just a reminder that some of the methods aren’t metrics in the mathematical sense. Given a function d(x, y), it is a metric if and only if the following properties hold:

• Symmetry: d(x, y) == d(y, x).
• Non-negativity: d(x, y) >= 0, for every (x, y).
• Coincidence axiom: d(x, y) == 0 if, and only if, x == y.
• Triangle inequality: d(x, y) <= d(x, z) + d(z, y).

In any case, there are still many methods I want to add to Measurable (you can find them in the README). As I’m learning about them while I write this gem, it’s hard to know in advance what’s useful and what isn’t. Any help with references and examples (and feature requests) is appreciated.

Another point is that I want to rewrite some methods in C (e.g. the Euclidean distance) to get to know Ruby’s C API and to speed some things up. Speed would then be another good reason to use the gem, as most of the methods are very straightforward and succinct to write.

I plan on releasing versions 0.0.6 up to 0.1 very rapidly, just by adding new method definitions, updating documentation and probably adding some examples.

Well, that’s it.