“The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.” - Donald Knuth (1974)
“At Google, for example, I have found that random samples on the order of 0.1 percent work fine for analysis of business data.” - Hal Varian (2014)
You have data. You want to analyze that data. But this is the largest data set you’ve worked with, and you’re having problems because it’s so big. This post has some actionable advice to help you analyze your data. First, let me describe how I classify data by size:
I have a few suggestions for working with medium data.
SQL databases (PostgreSQL, MySQL, MariaDB, SQLite) are extremely helpful for working with data on disk rather than in memory. The different flavors of SQL databases have different advantages. The main advantage of SQLite is that it does not rely on a client-server setup. On machines where you don’t have root access to set up a database server, you can still use SQLite to work with data on disk. If you are working in R, the dplyr library is amazing for working with tables stored in SQL databases.
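As a minimal sketch, assuming a SQLite file called `sales.sqlite` containing a table `orders` (both names are hypothetical), dplyr lets you write ordinary verbs that get translated into SQL and executed inside the database, so only the summarized result is pulled into memory:

```r
library(DBI)
library(dplyr)

# Connect to a SQLite database stored on disk (no server required).
# "sales.sqlite" and the "orders" table are hypothetical names.
con <- dbConnect(RSQLite::SQLite(), "sales.sqlite")

orders <- tbl(con, "orders")  # a lazy reference; nothing is read into memory yet

# dplyr translates these verbs into SQL and the database does the work on disk.
monthly_totals <- orders %>%
  filter(year == 2014) %>%
  group_by(month) %>%
  summarise(total = sum(amount, na.rm = TRUE)) %>%
  collect()  # only the small summarized result comes back into R's memory
```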
By taking a small enough random sample of your data you can work with the data in memory and get results quickly. This approach speeds up each iteration of the (edit code)-(run code)-(evaluate results) loop. Data where observations are independent and identically distributed (IID) are perfectly suited for random sampling. When working with large panel data sets, sample individuals rather than rows, keeping every observation for each sampled individual in the resulting data set.
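Here is a minimal sketch of both approaches, assuming an in-memory data frame `df` with an individual identifier column `person_id` (both names are hypothetical):

```r
library(dplyr)

set.seed(42)  # make the sample reproducible

# IID observations: sample rows directly (a 0.1 percent sample).
row_sample <- df %>% slice_sample(prop = 0.001)

# Panel data: sample individuals, then keep every observation for each
# sampled individual ("person_id" is a hypothetical identifier column).
sampled_ids <- df %>%
  distinct(person_id) %>%
  slice_sample(prop = 0.001)

panel_sample <- df %>% semi_join(sampled_ids, by = "person_id")
```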
When observations are not IID, random sampling may not work for you. For instance, sampling data from an online dating site, where each observation is a potential date between two users, is more complicated. Hitsch, Hortaçsu and Ariely (2010) address this issue by focusing on one connected subgraph at a time (they looked at one subgraph of users based in Boston and another subgraph based in San Diego).
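To illustrate the idea (this is not the authors’ procedure, just a sketch), you can treat users as nodes and potential dates as edges, then keep only the observations that fall inside a single connected component. The `edges` data frame and its `user_a`/`user_b` columns are hypothetical:

```r
library(igraph)
library(dplyr)

# Each row of `edges` is a potential date between two users (hypothetical columns).
g <- graph_from_data_frame(edges %>% select(user_a, user_b), directed = FALSE)

comp <- components(g)                     # find connected components
biggest <- which.max(comp$csize)          # pick the largest component
members <- names(comp$membership)[comp$membership == biggest]

# Keep only observations where both users belong to that component.
subgraph_sample <- edges %>%
  filter(user_a %in% members & user_b %in% members)
```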
If you start out of the gate designing your code around MapReduce when your research question doesn’t require it, you’ll end up wasting a lot of time programming. If you can answer your research question using a random sample, do it; it will save you a lot of time. If a random sample won’t work with your data, is there another method you could use to shrink the data and answer the question of interest? Finally, if you’re certain you need to use a large data set, can you use SQL on Hadoop?