Searching for your tools when you need to use them is bad organization.
Having a standard set of tools is a good thing. I have two toolboxes in my house, one for electronics and another for “hard” tools.
With that in mind, I decided to list the technologies I’m currently using at work. Some of them will be listed with a * to indicate that I’m still testing & learning about it.
- Machine – 2013 MacBook Pro, OS X Yosemite.
- Text editor – vim. I’ve been using it for a year and a half with no intention of switching over to another editor. My vimrc file is on GitHub.
- Programming languages – Ruby for data cleaning and other pre-processing tasks & Python for building models and preparing results for presentations. I’m working towards using only Ruby, but IRuby, Nyaplot and Daru still need some work before that is possible.
- Pry is much, much better than the default IRB console. Being able to look in objects contexts everywhere is underrated, you only notice how powerful this is after spending a few minutes to find a bug that would otherwise have taken 1 or 2 hours. Besides, pry-byebug allows you to use a decent debugger, with breakpoints, nexts and continues.
- SmarterCSV is quite good for handling CSV files. It has features for reading batches of rows, so bigger files are fine. Its interface is really simple, so I tend to investigate new datasets via irb -r smarter_csv. For simpler operations, like projections or joins, I prefer CSVkit (as a matter of fact, implementing csvkit in Ruby with SmarterCSV should be a piece of cake).
- Nyaplot [*] is a great plotting library when used with IRuby. It is very easy to generate interactive plots and there is even an extension for plotting on top of maps, called Mapnya.
- Pandas for joining and grouping data in notebooks. There is a similar library in Ruby called Daru [*] that I still haven’t had the chance to try.
- Scikit-learn for building classifiers and doing cross validation easily.
- Matplotlib for plotting when in Python land. There are some niceties like allowing LaTeX in titles and labels and using subplots and axes.
- Jupyter notebook are amazing for presenting analysis and results. One of the SciRuby projects is the IRuby notebook, by Daniel Mendler, which brings the same facilities available in IPython to Ruby.
- GNU Parallel – This is probably the single most useful tool in the list right now. I’m not dealing with large datasets; the largest are a few GBs in size. Instead of booting a Hadoop cluster on AWS, I write my “MapReduce” pipeline with a few scripts and calls to Parallel.
- Julia Language [*] – I wrote a few number crunching scripts so far, but there’s a lot of potential in Julia. I hope to have something cool to show in some weeks.
And that’s it.