## RuPy Campinas 2015

Last week I attended RuPy Campinas. It had been years since I'd gone to a programming event, and it was fun to see some familiar faces again and meet new people. :)

Here are links to the talks I attended.

## Python and the Invasion of Smart Objects

Speaker: João S. O. Bueno
Slides

JS worked with me for a few months and taught me a good deal of Python, so watching one of his talks was a lot of fun.

Of all the examples he showed, the one I liked most was Reactive Programming in Python, which you can see here. Essentially, it is the engine of a spreadsheet implemented in about 30 lines of code.
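The core trick can be sketched in a few lines (here in Ruby rather than the talk's Python): a formula cell recomputes its value from its dependencies every time it is read.

```ruby
# A toy reactive "spreadsheet" cell: plain cells hold values, formula
# cells recompute from their dependencies on every read.
class Cell
  attr_accessor :value

  def initialize(value = nil, &formula)
    @value = value
    @formula = formula
  end

  def get
    @formula ? @formula.call : @value
  end
end

a = Cell.new(2)
b = Cell.new(3)
sum = Cell.new { a.get + b.get }

sum.get      # => 5
a.value = 10
sum.get      # => 13, no manual update needed
```

A real engine also needs dependency tracking and caching, but this is the essence of it.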

## Tuning your Ruby code

It was a very quick talk. The most interesting part was discovering the benchmark-ips gem. I missed having some concrete examples, i.e. how to refactor code based on well-made benchmarks.
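benchmark-ips reports iterations per second with a statistical error margin; the stdlib Benchmark already covers the same basic workflow. A sketch comparing two ways of building an array:

```ruby
require 'benchmark'

# Compare two ways of doubling the numbers in a range. benchmark-ips
# reports iterations per second instead of raw times, but the basic
# workflow (label each variant, run them side by side) is the same.
n = 10_000
results = Benchmark.bmbm do |x|
  x.report("map")     { n.times { (1..100).map { |i| i * 2 } } }
  x.report("each <<") { n.times { out = []; (1..100).each { |i| out << i * 2 }; out } }
end
```

`bmbm` runs a rehearsal pass first to warm up the VM, which makes the numbers less noisy.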

## The evolution of a distributed architecture

Speaker: Guilherme Garnier
Slides

I heard again about some things I hadn't encountered in a while, like the Circuit Breaker pattern. The story was similar to what I've seen at other companies: the initial monolith became hard to maintain and was broken into microservices. By the end, I realized I need to learn Docker as soon as possible. :-)

## New languages: what comes after Ruby

Speaker: Fabio Akita
Slides

One of the two best talks of the day. Its high point was the graph of programming languages the author built, showing which languages influenced which over time. You can see it in the repository:

github.com/akitaonrails/computer_languages_genealogy_graphs

He mentioned several curious languages, some of which I have had the pleasure of trying, like Ada and Prolog.

The most useful part of the talk was his discussion of LLVM and how many languages are using it now, e.g. Swift. Despite the various positive points he raised, what struck me most was being left with an immense desire to learn more programming languages… anyway, I think I'll add Swift to the list of languages I want to learn.

## Spinning Plates: Concurrency with Futures in Python

Another really good talk. It's been a while since I've used anything in Python other than NumPy, SciPy or scikit-learn, but it made me want to play with concurrency (even though I need to finish Parallel and Concurrent Programming in Haskell first…).

The two libraries used in the talk were threading and asyncio.
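Ruby has no stdlib equivalent of Python's futures, but the core idea can be sketched with Thread: the block runs in the background, and Thread#value blocks until the result is ready.

```ruby
# A bare-bones "future" using Ruby's Thread: the computation starts
# immediately in a background thread, and Thread#value blocks until
# the result is available (re-raising any exception from the block).
def future(&block)
  Thread.new(&block)
end

f = future { (1..1_000_000).reduce(:+) }  # runs concurrently
# ... do other work here ...
f.value  # => 500000500000
```

A production version would add a thread pool and cancellation, which is exactly what libraries like concurrent-ruby provide.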

Filed under Events

## Hieroglyphics as types, whitespace as function names

I came across some curious Haskell tweets lately and decided to collect them in a single place.

These reminded me of a curious fact: did you know there are other kinds of spaces in Unicode, like U+00A0, the no-break space? What about using it in Ruby? (please don’t)
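Ruby treats non-ASCII characters as identifier characters, so a method name can be a single invisible no-break space. A minimal (ill-advised) sketch:

```ruby
# A method whose name is a single no-break space (U+00A0).
# Perfectly legal Ruby, and a perfectly terrible idea.
class Sneaky
  define_method("\u00A0") { "invisible!" }
end

Sneaky.new.send("\u00A0")  # => "invisible!"
```

Anyone reading `Sneaky`'s source would see what looks like a blank method name, which is why you shouldn't do this.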

Whenever I see someone talking about non-ASCII characters in programming languages, I always come back to APL, an old language with an extremely concise notation similar to mathematics itself. Because most keyboards were horrible for typing it, it never caught on. :-(

(Mental note: having some kind of LaTeX math symbols embedded into a language for scientific computing would be… interesting.)

Filed under Programming

## Playing with Lua

I work for a mobile games company as a data scientist. I use Ruby for data wrangling and some kinds of analysis, Python for more specific things (essentially scikit-learn), and bash scripts to glue everything together.

The developers use Corona for creating our games, which uses Lua. I decided to give that language a try.

Some facts:

• Lua is tiny. For someone accustomed to Python and Ruby, it is shocking to see such a small standard library. For example, this is the manual – there are only 158 Lua functions listed there.
• The syntax is incredibly simple. Take a look at these diagrams; if you understand Extended Backus-Naur Form, you can read Lua’s grammar quite easily. For comparison, Ruby’s syntax is complex enough that there are lots (and lots and lots) of small corner cases that I have probably never heard about, even after years of using it. Ah! And Ruby’s parse.y has 11.3k lines.
• Lua was built with embedding in mind; it is used for interface customization in World of Warcraft, for example.
• It is a Brazilian programming language! :-) Lua was created in 1993 in Rio de Janeiro, according to Wikipedia.

## Random number generators

After finding so many interesting things in the language, I wrote some random number generators:
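The snippets themselves were written while playing with Lua; as a flavor of how little code an RNG needs, here is a linear congruential generator, the classic textbook RNG, in a few lines of Ruby:

```ruby
# A linear congruential generator (LCG): state' = (A * state + C) mod M.
# Constants from Numerical Recipes; any full-period parameters work.
class LCG
  M = 2**32
  A = 1664525
  C = 1013904223

  def initialize(seed)
    @state = seed % M
  end

  # Next integer in [0, M).
  def next_int
    @state = (A * @state + C) % M
  end

  # Next float in [0, 1).
  def next_float
    next_int / M.to_f
  end
end

rng = LCG.new(42)
rng.next_int  # => 1083814273
```

The same structure translates almost line for line to Lua, modulo its lack of integer types before 5.3.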

I decided to write RNGs after reading John D. Cook’s post about RNGs in Julia. :)

Filed under Programming

## The Data Package Format

At my last job, I worked with data from the Brazilian educational system in several situations. The details aren’t the important part, but the format is: a giant denormalized CSV with an accompanying PDF detailing its fields. It is very nice after you’ve worked with it for some time, but some things could be better.

In that format, enumerations (fields with a fixed, finite set of values) are encoded as arbitrary integer ranges, boolean values as 0 or 1, and the other implementation details are explained in the PDF. So far we have a cute CSV with documented fields. Nice, right?

Actually, yes.

I was quite happy with it for months, doing analyses and maintaining the internal libraries used to work with it. However, as soon as we started working with data from earlier years, things went awry. Nothing obviously wrong, but the code started accumulating lots of little conditionals, things like:
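The field names and encodings below are made up, but the shape of the problem was exactly this: the same logical question needed different code for each year's encoding.

```ruby
# Hypothetical field names and encodings; the real dataset's details differ.
# The same logical question needs a different branch per year.
def approved?(row, year)
  if year <= 2008
    row["SITUACAO"] == 1       # early years: integer enum
  elsif year <= 2011
    row["SITUACAO"] == "APR"   # same field, now string codes
  else
    row["APROVACAO"] == "APR"  # field renamed entirely
  end
end

approved?({ "SITUACAO" => 1 }, 2007)       # => true
approved?({ "APROVACAO" => "APR" }, 2013)  # => true
```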

And this is freaking ugly.

I thought I could improve the situation. For example, keeping a directory for each year with the dataset (the CSV file) and a JSON describing the schema of its fields. The gains aren’t pronounced in this case; basically, it’s a translation of the PDF documentation into a computer-readable format.

We could also create a default schema that each year’s data is mapped onto. This would move the complexity from the application to data pre-processing, which I prefer – that is one of the ugliest and most troublesome steps of data analysis anyway.


Today I was organizing the output files of some internal tools I developed at my current job:

So a bunch of directories, each with various CSVs representing data for a country. For reasons that I can’t write here, I started thinking about how it would be awesome if I could write some sort of metadata file for those IDs.

This opens up some possibilities. The format of those CSV files has changed a few times in the past 2 months, and some of my recent scripts can’t work with earlier versions. If I had a metadata file describing the precise schema of those files, I could abort any incompatible operation instead of hitting an error or, much worse, failing silently.

Thankfully, I found something that fills that niche: what is known as a Data Package. It is a bundle of data (which can be in any format: CSV, XLS, etc.) and a “descriptor” file called datapackage.json. Quite simple. The specification can be found here.

For the case I’m working with, i.e. lots of CSV files, there is an extension to the format called Tabular Data Package. Its specification can be found here.
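For concreteness, a minimal (entirely made-up) datapackage.json for one of those per-country CSVs could look like:

```json
{
  "name": "country-metrics",
  "resources": [
    {
      "path": "data/BR/metrics.csv",
      "format": "csv",
      "schema": {
        "fields": [
          { "name": "id",    "type": "integer" },
          { "name": "date",  "type": "date" },
          { "name": "value", "type": "number" }
        ]
      }
    }
  ]
}
```

A script can then check the `schema` block against what it expects before touching the data.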

## Another thing

These formats are defined using what is called JSON schema, which I hadn’t heard about before. The json-schema.org website shows some interesting examples.

Filed under Open Source

## Learning new programming languages

Programming languages are possibly one of the simplest parts of software engineering. You can know your language from the inside-out and still have problems in a project — knowing the tool doesn’t imply knowing the craft. But learning a new language is really a lot of fun.

Inspired by Avdi Grimm’s roadmap for learning new languages, I decided to give it a try and put my current interests in writing.

• Julia – http://julialang.org/
I have experience writing code in MATLAB, Octave, Python (with NumPy, SciPy and Pandas) and a bit of R, and I’m still excited about Julia. There are at least 3 features of Julia that are powerful and make me want to work with it: its Just-In-Time compiler, parallel for and the awesome metaprogramming inherited from Lisp.

The drawback is… is… well, I didn’t have time to really use it and get comfortable writing Julia programs. Yet.

• Haskell – https://www.haskell.org/

I’ve already tried learning Haskell a few times. Maybe 3 or 4 or 5. I’ve written programs based on mathematics and some simple scripts; most of the syntax isn’t strange anymore, and even monads make sense now. Still, I feel a bit stiff when writing Haskell. I don’t know.

Two books I recently bought might help with that – Real World Haskell and Parallel and Concurrent Programming in Haskell. I probably need to motivate myself to write something useful with it.

• Rust – http://www.rust-lang.org/

There is a quote on Rust’s website that sums up my expectations of it:

Rust is a systems programming language that runs blazingly fast, prevents nearly all segfaults, and guarantees thread safety.

I know how to read C/C++ and even write a bit of it, but it’s messy and takes more time than I usually have for side projects. Writing code that is safe & fast shouldn’t be so hard. ;)

All-in-all, this is a very brief list. However, I don’t think I should focus on more languages right now. To be honest, I think that my next learning targets are in applied mathematics. I need a stronger foundation in Partial Differential Equations and Probability Theory. There are several topics in optimization that I should take the time to study. Calculus of variations also seems quite cool.

(good thing that I have friends in pure math to help me find references!)

Filed under Math, Open Source, Programming

## SciRuby projects for Google Summer of Code 2015

Another year with SciRuby accepted as a mentoring organization in Google Summer of Code (GSoC)! The Community Bonding Period ended yesterday; the coding period officially begins today.

I’m really happy with the projects chosen this year; they cover a variety of subjects, and some would be really useful for me, i.e. Alexej’s LMM gem, Sameer’s Daru and Will’s changes to NMatrix.

That’s all. After the next GSoC meeting, I should write about how each of the projects is going.

Filed under Open Source

Searching for your tools when you need to use them is bad organization.

Having a standard set of tools is a good thing. I have two toolboxes in my house, one for electronics and another for “hard” tools.

A voltmeter, a Raspberry Pi and an Arduino.

With that in mind, I decided to list the technologies I’m currently using at work. Some of them are marked with a * to indicate that I’m still testing & learning them.

• Machine – 2013 MacBook Pro, OS X Yosemite.
• Text editor – vim. I’ve been using it for a year and a half with no intention of switching over to another editor. My vimrc file is on GitHub.
• Programming languages – Ruby for data cleaning and other pre-processing tasks & Python for building models and preparing results for presentations. I’m working towards using only Ruby, but IRuby, Nyaplot and Daru still need some work before that is possible.
• Pry is much, much better than the default IRB console. Being able to inspect an object’s context anywhere is underrated; you only notice how powerful it is after spending a few minutes on a bug that would otherwise have taken 1 or 2 hours. Besides, pry-byebug gives you a decent debugger, with breakpoints, next and continue.
• Libraries
• SmarterCSV is quite good for handling CSV files. It has features for reading batches of rows, so bigger files are fine. Its interface is really simple, so I tend to investigate new datasets via irb -r smarter_csv. For simpler operations, like projections or joins, I prefer csvkit (as a matter of fact, implementing csvkit in Ruby on top of SmarterCSV should be a piece of cake).
• Nyaplot [*] is a great plotting library when used with IRuby. It is very easy to generate interactive plots and there is even an extension for plotting on top of maps, called Mapnya.
• Pandas for joining and grouping data in notebooks. There is a similar library in Ruby called Daru [*] that I still haven’t had the chance to try.
• Scikit-learn for building classifiers and doing cross validation easily.
• Matplotlib for plotting when in Python land. There are some niceties like allowing LaTeX in titles and labels and using subplots and axes.
• Jupyter notebooks are amazing for presenting analyses and results. One of the SciRuby projects is the IRuby notebook, by Daniel Mendler, which brings the same facilities available in IPython to Ruby.
• GNU Parallel – This is probably the single most useful tool in the list right now. I’m not dealing with large datasets; the largest are a few GBs in size. Instead of booting a Hadoop cluster on AWS, I write my “MapReduce” pipeline with a few scripts and calls to Parallel.
• Julia Language [*] – I’ve only written a few number-crunching scripts so far, but there’s a lot of potential in Julia. I hope to have something cool to show in a few weeks.

And that’s it.

Filed under Open Source, Programming

## Deep copying objects in Ruby

Time and time again I forget that Object#clone in Ruby makes a shallow copy, get bitten, and end up spending 30 seconds staring at the screen wondering what the hell happened. The only difference today is that I finally decided to post about it on my blog. Let’s hope this time is the last.

## Well, what is a deep copy?

In C++ there is the concept of a copy constructor, which is used when an object is initialized as a copy of an existing object. In many situations this can be deduced by a compiler and you don’t have to worry. If your object contains pointers to things that can’t be shared, however, you have to provide what is called a user-defined copy constructor:

A user-defined copy constructor is generally needed when an object owns pointers or non-shareable references, such as to a file […]

— Wikipedia on Copy Constructor

In Ruby, variables hold references to objects. If you want a copy of an object (e.g. an Array), you can simply do:
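For instance, with a flat array of numbers:

```ruby
a = [1, 2, 3]
b = a.clone

b << 4
b  # => [1, 2, 3, 4]
a  # => [1, 2, 3], untouched
```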

This works because Ruby numbers behave as singletons: there is only one object per value, so there are no nested references to worry about (on 64-bit builds the Ruby interpreter even inlines Numeric objects for most operations). But if the array holds other arrays or hashes instead of numbers, things start to break. The copy holds references to the same inner objects, not new objects, so when you do:
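For instance, with nested arrays:

```ruby
# Object#clone copies only the outer array; the inner arrays are shared.
a = [[1, 2], [3, 4]]
b = a.clone

b[0] << 5
a  # => [[1, 2, 5], [3, 4]], mutated through the "copy"!
b[0].equal?(a[0])  # => true, same inner object
```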

This kind of bug can be hard to understand when first encountered, so it’s definitely a good thing to have in mind.

I was in the middle of implementing what I just explained when I noticed I was reinventing the wheel. Turning to Stack Overflow, I found an answer similar to what I was doing, and another that was simpler, more interesting and applicable to my situation:
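The simpler answer is the well-known Marshal round-trip idiom:

```ruby
# Deep copy via a Marshal round-trip: serialize the whole object graph
# to a byte stream and rebuild it, yielding fresh copies of every
# nested object.
def deep_copy(obj)
  Marshal.load(Marshal.dump(obj))
end

a = [[1, 2], [3, 4]]
b = deep_copy(a)

b[0] << 5
a  # => [[1, 2], [3, 4]], unaffected this time
```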

Duh. The Marshal library is a tool for storing objects as byte streams outside of the program for later use. I’d never had to use it before, as the only apparent use case I have (storing trained statistical classifiers) can be achieved more robustly by saving their parameters in a JSON file.

But I digress. By storing the object’s data as a byte stream and reconstructing the same object afterwards, you create new copies of each of the constituent objects.

However, there are two problems with this approach:

• Some objects can’t be marshalled. You’ll need to implement marshalling logic yourself, which kind of defeats the purpose of using this technique: why not implement deep copying instead?
• It is slow. In some cases this doesn’t matter. I was building a small simulation that copied an Array with fewer than 100 Hashes at each iteration, and there were fewer than 2000 time steps in total, so it added maybe a few extra seconds. But for larger scripts this can be a problem.

The second point could be solved by thinking the problem through, but I had 30 minutes to come up with an argument for a point I was about to make in a meeting. I sure hope I never have to do this again (famous last words…)

Filed under Programming

## Books read in the first quarter of 2015

Covers of the books read this quarter

In 2014, I wrote a list of the books read at the time. This year, I’ll collect the books and papers each quarter.

A curious note: Nassim Nicholas Taleb (author of The Black Swan) wrote on Facebook about the paper on eusociality (think of insect colonies, like ants and bees). The authors showed that kin selection theory is unnecessary, given that traditional natural selection models can explain eusociality, and Richard Dawkins attacked them without addressing their mathematical models. I’m no biologist, but the text is very accessible (the math too, if you’re familiar with stochastic processes).

## Papers

• AI Planning: Systems and Techniques. James Hendler et al. PDF
• O-Plan: a Common Lisp Planning Web Service. A. Tate and J. Dalton. PDF
• EduRank: A Collaborative Filtering Approach to Personalization in E-learning. Avi Segal et al. PDF
• Discovering Gender-Specific Knowledge from Finnish Basic Education using PISA Scale Indices. M. Saarela and T. Kärkkäinen. PDF
• The Unified Logging Infrastructure for Data Analytics at Twitter. George Lee et al. PDF
• The Evolution of Eusociality. Martin A. Nowak et al. Read online. Supplementary material.

## Mangas

5 Centimeters per Second manga cover

It had been a long time since I had read any manga, but 5 Centimeters per Second is a masterpiece. Read more about it on MyAnimeList.

Filed under Books

## The most socially useful communication technology

Text is the most socially useful communication technology. It works well in 1:1, 1:N, and M:N modes. It can be indexed and searched efficiently, even by hand. It can be translated. It can be produced and consumed at variable speeds. It is asynchronous. It can be compared, diffed, clustered, corrected, summarized and filtered algorithmically. It permits multiparty editing. It permits branching conversations, lurking, annotation, quoting, reviewing, summarizing, structured responses, exegesis, even fan fic.

I read the post “Always bet on text” today and, I must say, it is a beautiful way to look at the process of communicating by writing. :)

April 1, 2015 · 17:00

## Excel trying to take over the world

Intuitively, it is not just the limited capability of ordinary software that makes it safe: it is also its lack of ambition. There is no subroutine in Excel that secretly wants to take over the world if only it were smart enough to find a way.
— Nick Bostrom, Superintelligence

I wouldn’t be so certain about it.

There are “scientists” (economists) who think it is OK to use Excel for making predictions that affect so many people, as you can see from this article in The Guardian. Essentially, they left four years of New Zealand data out of a spreadsheet. Other methodological factors were at play as well. And all of this contributed to lots of people losing their jobs in various countries when the recommended austerity measures were put in place. Imagine if Excel wanted to take over the world.

The paper that discusses Reinhart & Rogoff’s mistake in depth is “Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff”.

## Completeness and incomputability

It is notable that completeness and incomputability are complementary properties: It is easy to prove that any complete prediction method must be incomputable. Moreover, any computable prediction method cannot be complete — there will always be a large space of regularities for which the predictions are catastrophically poor.

— Ray Solomonoff, “Algorithmic Probability — Its Discovery — Its Properties and Application to Strong AI”

This quote is a paragraph from the book Randomness Through Computation, an amazing work I was reading this morning.

The idea that any computable prediction method can’t be complete is profound for those of us that work with machine learning; it implies we always have to deal with trade-offs. Explicitly considering this makes for a better thought process when designing applications.

## References

1. Ray Solomonoff — Wikipedia.
2. Solomonoff’s Lightsaber — Wikipedia, LessWrong

Filed under Uncategorized

## Books so far in 2014

I have a lot of books.

I’ve finally decided to organize my collection and keep track of what I read. In this post, I’ll list the books I’ve read since January — or at least an approximation based on the email confirmations of the ebooks I bought, my memory, and the ones on my bookshelf. I also divided them into sections. Papers are included as well.


Filed under Books

## Updates on NMatrix and SciRuby development

For the last couple of days, I’ve been thinking about what I wrote two weeks ago regarding SciRuby and the whole Ruby scientific computing scene. I still believe the sciruby gem can be used as an integrated environment, but some problems must be solved first:

1. We need a reasonably feature complete and easy to install version of NMatrix.
2. A good plotting tool. Right now, Naoki is working on this as part of GSoC 2014.
3. Statistics. Lots of things are already implemented in Statsample, but both Statsample::DataFrame and Statsample::Vector should use NMatrix under the hood. Supporting JRuby can be problematic here…
4. Given 1 and 2, it’s possible to implement a lot of other interesting and useful things. For example: linear regression methods, k-means clustering, neural networks, use NMatrix as a matrix type for OpenCV images. There are lots of possibilities.
5. Minimization, integration and others.

With that in mind, my objective for the following weeks is to improve NMatrix. First, there are BLAS routines (mainly from level II, but some from levels I and III as well) that aren’t implemented in NMatrix and/or aren’t available for the rational and Ruby object dtypes. There’s also LAPACK.

Having complete C/C++ implementations also means we can eventually generalize these interfaces to allow linking against other implementations (e.g. Mac OS X vecLib’s LAPACK, Intel’s MKL), thus making it much easier to install NMatrix. As Collin (and, I think, Pjotr) said on the sciruby-dev mailing list, it should be as easy as gem install nmatrix.

## BLAS and LAPACK general implementations

Some recent changes:

• HAVE_CBLAS_H being derived from mkmf’s have_header
• Many more routines are implemented. Ideally, BLAS level 1 and 2 should be complete by the end of May.

An important next step is to be able to link against arbitrary BLAS and LAPACK implementations, given that they obey the standard. Issue #188 started some ideas; issue #22 is the original (and very old) one.

## After that…

When NMatrix supports both BLAS and LAPACK without problems — i.e. has its own implementations and can also link against arbitrary ones (OS X’s vecLib, GSL, ATLAS, Intel’s MKL, AMD’s Core Math Library) — we’ll be able to build on top of it. Some routines in NMatrix already work with every dtype, but most don’t. When we know exactly which routines can’t work with which dtypes, we’ll be in a very good position to say what we support.

Alright, we have determinants for rational matrices, but not some other operations, and so on. What else? Stypes! We also need good support for Yale matrices. (Note: maybe add the “old Yale” format?)

In a sense, there isn’t much to decide: we have to support the whole BLAS/LAPACK standard, since almost everything linear algebra-wise is in there. After that, it’s mostly improvements to the interface: better method naming, better documentation and examples, better IO, etc.

Another point worth addressing is removing the dependency on g++ > 4.6. We should strive to remove everything that depends on C++11 features, thus allowing regular Mac OS X users to install NMatrix without first installing another compiler.

## Better documentation

We need to refactor our documentation. Oh, how we need to!

First, remove everything that shouldn’t be in the public-facing API — the classes and modules used internally by NMatrix::IO shouldn’t be in the public API anyway; only the outward-facing parts: how to save and load to/from each format. Probably other things as well.

Second, be more consistent with the docs. Some methods are missing a return type or similar details, and lots of methods in the C/C++ world aren’t documented either. We can do better!

Finally, a really good documentation template. Fivefish is a good choice — it provides a very pretty, searchable and clean interface. (create NMatrix’s docs with it and host on my own server, see what happens).

Filed under Programming

## Solving linear systems in NMatrix

I’m writing some guides for NMatrix, so in the following weeks there should be some posts similar to this one, but more complex.

Linear systems are one of the most useful tools from “common algebra”. Many problems can be represented by them: systems of linear ODEs, operations research optimizations, linear electrical circuits, and a lot of the word problems from basic algebra. We can represent these systems as

$Ax = b$

where $A$ is the matrix of coefficients, $x$ the vector of unknowns, and $b$ the vector representing the other side of the equation.
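As a sketch of the problem itself, here is a small 2×2 system solved via LU decomposition (using Ruby's stdlib Matrix class for illustration; NMatrix's own API differs):

```ruby
require 'matrix'

# Solve the 2x2 system  2x + y = 3,  x + 3y = 5,  i.e. Ax = b.
# Stdlib Matrix is used here only to illustrate the math; NMatrix's
# calls are different.
a = Matrix[[2, 1],
           [1, 3]]
b = Vector[3, 5]

x = a.lup.solve(b)
x  # => Vector[(4/5), (7/5)]
```

With integer entries, Matrix keeps the solution exact as Rationals instead of floats.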