Wednesday, November 11, 2015

Where do I get data?

Getting started with data analysis of any kind can seem daunting. Without a frame of reference, it can be difficult to know where to start. If you are trying to learn time-series analysis techniques, it helps to have large sets of time-series data, such as a list of stock quotes over time, to analyze. If you are trying clustering techniques, it helps to have categorical data, such as a list of all car models and their characteristics: range, mpg, number of cylinders, various emissions, country of distribution, etc.

So, assuming you have the tools you need, there are a number of places where you can get data. Many sites, such as Twitter, Facebook, and YouTube, offer APIs that let you pull all sorts of information. If you are looking for data outside the social media area, it gets a bit more difficult, but there are a few great options out there.

First, there's the U.S. Government. There has been a large push toward data openness, and the result is data.gov, which claims to have over 198,000 datasets. Not all of the sources are available in an easily ingested format such as CSV, but over 8,200 are accessible via API. Also from the U.S. government, there are the NASA Open Datasets. Beyond the government, though, there are more options.

Quandl is a great resource for data, both free and premium. All data is available in XML, CSV, and JSON via APIs. Quandl directly supports libraries for Python, R, and Excel, and there are community-supported plugins for many other languages and programs, including Go, Java, and Ruby.
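Whichever source you pick, the first step is usually turning a CSV or JSON response into rows you can work with. Here is a minimal sketch of that step using only Python's standard library. The sample payloads, the column names (`date`, `close`), and the helper functions `rows_from_csv` and `rows_from_json` are all made up for illustration; they are not taken from Quandl or data.gov, and a real response will likely be shaped differently.

```python
import csv
import io
import json

# Hypothetical sample payloads (assumed for illustration only);
# real responses from any provider will differ in columns and layout.
CSV_TEXT = """date,close
2015-11-09,100.5
2015-11-10,101.25
2015-11-11,99.75
"""

JSON_TEXT = """[
  {"date": "2015-11-09", "close": 100.5},
  {"date": "2015-11-10", "close": 101.25},
  {"date": "2015-11-11", "close": 99.75}
]"""


def rows_from_csv(text):
    """Parse CSV text into a list of (date, close) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    # CSV values arrive as strings, so convert the price explicitly.
    return [(row["date"], float(row["close"])) for row in reader]


def rows_from_json(text):
    """Parse a JSON array of objects into the same (date, close) tuples."""
    return [(item["date"], float(item["close"])) for item in json.loads(text)]


if __name__ == "__main__":
    csv_rows = rows_from_csv(CSV_TEXT)
    json_rows = rows_from_json(JSON_TEXT)
    # Both formats should yield the same rows once parsed.
    print(csv_rows == json_rows)
    # A first, trivial piece of analysis: the average closing value.
    print(sum(close for _, close in csv_rows) / len(csv_rows))
```

Once the rows are in a common shape like this, it no longer matters much which format the source handed you, which is handy when mixing data from several places.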

Beyond these, there are always communities, such as /r/datasets and /r/data on reddit.

Happy data collecting!
