Data Science Websites

This page lists my favorite online classes and websites for learning about Data Science and keeping up with the latest news and cutting-edge techniques.

Classes/Learning

Machine Learning with Andrew Ng (Coursera)

This class is a great introduction to the basics of machine learning. The course uses a little mathematics, but nothing beyond basic calculus and linear algebra. Dr. Andrew Ng teaches machine learning better than anyone I know of, so I highly recommend this class.

He covers most supervised and unsupervised machine learning topics, but he chose not to include tree-based methods or Naive Bayes in his Coursera class. If you want to see him teach Naive Bayes, you can go here.

Introduction to Data Science with Bill Howe (Coursera)

If you are totally new to data science and don’t understand what it encompasses, this class is a great overview of the field. Bill Howe discusses the basics of various database types (relational vs. non-relational), Hadoop/MapReduce, visualization, statistics, and machine learning. After finishing the class (or watching all of the lectures), you should have a much better understanding of what a data scientist does.
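
To give a flavor of the MapReduce model mentioned above, here is a minimal single-machine sketch in plain Python. It does not use Hadoop at all, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own illustrative choices rather than anything from the course. It only shows the general map, group-by-key, and reduce pattern on the classic word-count problem.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["data science", "Data analysis"]))
```

In a real cluster, the map and reduce steps would run in parallel on different machines and the shuffle would move data across the network; here everything happens in one process purely for illustration.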

Computing for Data Analysis with Roger Peng (Coursera)

This class will help you understand the basics of R and how to use it for data analysis. If you have never used R before, it can help you get started; if you already know R fairly well, I would skip it. If you just want to see the course videos, Revolution Analytics has a link to all of them on YouTube.

Data Analysis with Jeff Leek (Coursera)

This class is the more advanced version of the previous class and is designed for people who are already somewhat familiar with R. What is nice about the class is that it teaches a lot of the statistics you should know, along with how to apply them in R. If you are looking for machine learning instruction, however, this class has very little of it, as most of the data analysis is done with classical statistical techniques.

Again, Revolution Analytics has a link to all of the course videos here.

Statistics One with Andrew Conway (Coursera)

This class is meant to be a nice review of the core statistics you may have forgotten since college. If you have never taken a statistics class before, this one is a good start. The instructor comes from a social science background, so many of the statistical techniques discussed are geared toward very small datasets. That said, there are a lot of great points here about experimental design that data scientists need to be aware of (especially in cases such as A/B testing!). As with the other data analysis classes, R is the primary language used.

All of the course videos can be found on YouTube here.
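
As a small illustration of the kind of statistics that shows up in A/B testing, here is a rough sketch of a pooled two-proportion z-test in Python. This is not taken from the course (which uses R); the function name and the example counts are hypothetical, and a real analysis would also need to think about sample size and experimental design up front.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test.

    conv_a, conv_b: number of conversions in groups A and B
    n_a, n_b:       number of visitors in groups A and B
    Returns (z, two_sided_p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_b - p_a) / se
    # Standard normal CDF written with the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests that the difference in conversion rates is unlikely to be due to chance alone, under the usual assumptions of the test (independent visitors and reasonably large samples).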

Hadoop Courses (Big Data University)

If you want to learn more about Hadoop via coursework, IBM runs several free courses. Almost the entire Hadoop stack is covered here (now including Spark). There are also several courses on other important topics such as relational databases and data analytics, but I haven’t taken any of those classes. You can concentrate on a particular segment of classes by going to the “Curriculum Map” at the top of the page (under Courses).

I personally did all of the Hadoop Core classes. The classes are very helpful and even provide a small sample Hadoop cluster you can practice with (but be aware: you will need a computer with a lot of memory to use it! At least 8 GB is recommended). You can take a multiple-choice quiz at the end, and if you get at least 60% of the questions correct, you pass the class and receive a certificate. IBM now also offers badges that you can display on your LinkedIn profile to show what you have learned.

Learning SQL (SQL Zoo)

Once I found this, I recommended it to all of the other interns during my work in the summer of 2014. It is by far the best way to learn SQL that I have found. It works so well because your queries run in the background on the website as you learn. The questions are interesting puzzles that introduce you to SQL in a more interactive manner. There are also quizzes at the end of each section to help you test what you have learned. After finishing the lessons, you should understand the vast majority of basic queries you will need to run. The lessons even cover some more advanced queries, and a few of them are very tough. You can find answers to the questions online if you look hard enough, but I would advise against this unless you are REALLY stuck.
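
If you would like to try a basic query on your own machine as well, here is a minimal sketch using Python’s built-in `sqlite3` module. It is not one of the SQL Zoo exercises: the `classes` table and the `classes_per_provider` helper are hypothetical, and the rows are simply the class names and providers from the list on this page.

```python
import sqlite3

# Illustrative rows only: (name, provider) pairs from the class list on this page.
ROWS = [
    ("Machine Learning", "Coursera"),
    ("Introduction to Data Science", "Coursera"),
    ("Computing for Data Analysis", "Coursera"),
    ("Data Analysis", "Coursera"),
    ("Statistics One", "Coursera"),
    ("Hadoop Courses", "Big Data University"),
    ("Learning SQL", "SQL Zoo"),
]


def classes_per_provider(rows):
    """Load the rows into an in-memory database and count classes per provider."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE classes (name TEXT, provider TEXT)")
        conn.executemany("INSERT INTO classes VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT provider, COUNT(*) AS n "
            "FROM classes "
            "GROUP BY provider "
            "ORDER BY n DESC, provider"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for provider, n in classes_per_provider(ROWS):
        print(provider, n)
```

The `SELECT ... GROUP BY ... ORDER BY` pattern here is typical of the basic queries mentioned above; the in-memory database keeps the example self-contained and leaves nothing behind on disk.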

Udacity Data Science courses (Udacity)

There are several interesting courses to choose from here. The course material itself is free, but you can pay extra if you would like personal help and a certificate. I will admit that I have not yet taken any of these, so I can’t vouch for their quality. The newer ‘Data Visualization and D3.js’ and ‘Real-Time Analytics with Apache Storm’ classes look very tempting!

Basic Command Line Use

I stumbled upon this website, and it gives a nice overview of how to clean data using the terminal, which will probably be faster than loading it into Python/R first. This is a good skill to develop, and the site presents several nice tutorials. For more detail on this, I think Jeroen Janssens’ book “Data Science at the Command Line” would be very appropriate. I haven’t read his book yet (it’s definitely on my to-do list when I have time!), but I have used his Data Science Toolbox, which has been very helpful as a lightweight virtual machine running Ubuntu.

News Sources

DataTau

Even though there isn’t very much conversation on the site yet, DataTau is my favorite! The site is simple and clean with no ads (similar to Hacker News), and you can always find something interesting to read. From recent publications on machine learning to insightful blog posts, this is a good way to keep up with the latest and greatest in data science. I check it daily for the newest information. Highly recommended for staying current on the newest technology and library releases.

Forbes: Data-Driven

For data scientists more focused on business (such as managers or quantitative MBAs), this site is probably closer to what you are looking for. It is not very technical, but it does offer some basic news on companies that make heavy use of data.

FiveThirtyEight

The reincarnated version of the website of Nate Silver, author of one of my favorite books, “The Signal and the Noise”. All of the articles here cover news with a data-driven focus. If you have any interest in advanced data analysis applied to sports, politics, economics, science, or other news of the day, this is a great place to go.