January 30, 2013

Coursera Data Analysis MOOC: First Impressions

On the spur of the moment I decided to enrol in Coursera's Data Analysis course. I've been curious about MOOCs (massive open on-line courses) for some time, so when I came across this one, I decided it was time to find out more. Plus, the course topic is well-suited to the kind of work I do.

The course is given by Jeff Leek, an Assistant Professor in Biostatistics from the Johns Hopkins Bloomberg School of Public Health. Jeff's introductory video is shown below.


The course is run over eight weeks and is delivered as a set of video lectures. Topics covered include:
  • The structure of a data analysis (steps in the process, knowing when to quit, etc.)
  • Types of data (census, designed studies, randomized trials)
  • Types of data analysis questions (exploratory, inferential, predictive, etc.)
  • How to write up a data analysis (compositional style, reproducibility, etc.)
  • Obtaining data from the Web (through downloads mostly)
  • Loading data into R from different file types
  • Plotting data for exploratory purposes (boxplots, scatterplots, etc.)
  • Exploratory statistical models (clustering)
  • Statistical models for inference (linear models, basic confidence intervals/hypothesis testing)
  • Basic model checking (primarily visually)
  • The prediction process
  • Study design for prediction
  • Cross-validation
  • A couple of simple prediction models
  • Basics of simulation for evaluating models
  • Ways you can fool yourself and how to avoid them (confounding, multiple testing, etc.)
Each lecture can be viewed in your Web browser or downloaded (MP4) for off-line viewing. The lectures are slide presentations with audio of the lecturer explaining the content. You can download the slides (PDF) and transcripts if you prefer.

A 10-question quiz must be completed by the end of each week. It has hard and soft deadlines. If you miss the soft deadline you can still submit answers before the hard deadline but a penalty is applied to your score. You can attempt each quiz four times.

Two peer assignments must be completed; one in week 3 (due at the end of week 4) the other in week 6 (due at the end of week 7).  The assignments are graded by your student peers, and you must grade at least four peer assignments to avoid a 20% penalty. Your grade is based on the median of the grades you receive from your peers.

An interesting aspect of the course is the forum, to which students can post questions. Prof. Leek obviously can't answer all the questions, as the course has 100,000 students. So, you can vote on questions and the lecturer responds to the top few. Students can help each other out by responding to questions too.

The course requires a working knowledge of R. I've been using R increasingly as part of my day-to-day work so am comfortable with this. Some (optional) background lectures on R are provided in the course material along with links to other resources.

Successful completion of the course conveys no official qualification or accreditation. I've enrolled purely for my own edification; to learn about MOOCs like coursera, and sharpen my data analysis skills.

I'll post follow ups as the course progresses.