February 22, 2013

Coursera Data Analysis MOOC: Half-Time Entertainment

It's week five of the Coursera Data Analysis MOOC, and it's been a busy time since the course commenced (see my first impressions). I've just completed the weekly quiz, so I have time to come up for air and post a progress report.

The course is similar in many respects to my time at uni: I've been late to lectures, and had to scramble to complete tests and assignments. As I mentioned previously, one of my main motivations was to learn about MOOCs. So, I wasn't too worried when by the middle of week one I hadn't been spoon-fed course material. I'd expected a flurry of emails with course information but my inbox was quiet. In fact, I needed to actually visit the Coursera Data Analysis Web site to attend class. By the time I did I found the course well under way, and I had a lot of catching up to do.

I also realised that, just like uni, turning up wasn't going to be enough; I needed to invest serious time in understanding what was being taught and applying it to tests and assignments. So, I pulled my finger out and put aside some other projects to clear time each week to devote to the course.

Having a weekly quiz with a hard deadline has been a useful motivator. It would have been easy to chuck it in - after all, enrollment is free - or let things slip until I had more time. With the quiz deadline I have a weekly goal that keeps me working on the course each day.

I've just completed the first assignment. It was an interesting project focussed on a data set from the Lending Club, a peer-to-peer loans service. We were given two weeks to submit our work. Following this we had a week to mark at least four of our peers' assignments (failing to do so incurs a 20% penalty on your own assignment). We were provided with a simple assessment template to guide us through marking.

This is the first time Coursera has presented the Data Analysis course, and there have been a few hiccups along the way. Lecture notes included a few typos, scheduling of deadlines needed to be fine-tuned, and the requirements of the assignments were changed due to security issues (running a stranger's R code is inherently risky).

Many of the changes have come about from feedback via the course forum. I've not had much time to participate in the forum other than occasionally scanning the top-voted posts.

I've found the course material challenging and rewarding. It's clear that data analysis requires a strong grounding in statistics. Prof. Leek has provided us with a tool kit for data analysis: techniques and how to apply them using R. However, the underlying mathematics is not covered (the course is only eight weeks). Prof. Leek has provided links to further resources that supply this background, but I haven't had time to delve into this material.

That being said, I am becoming more proficient with R, which is useful in my day-to-day work. And I have gained a better understanding of the techniques available to me for data analysis work.

I'll post another update at the end of the course in March.

February 1, 2013

Lines vs. Bars for Categorical Data

I recently commented on a thread started by Joey Cherdarchuk in the LinkedIn Data Visualization group. The thread discusses Joey's reworking of an infographic about social media demographics. Joey used diverging stacked bar charts to significantly improve upon the original, which used pie charts. You can read Joey's blog post in full here.

I suggested an alternative would be to use a simple line chart. This is a technique often advocated by Kaiser Fung on his excellent Junk Charts blog. Here's an example of his approach. Many people react negatively to this technique as you can see in the comments section of Kaiser's post. Here's his response:
"You won't be the only reader to feel this way. Over the years, I have had complaints from readers about lines connecting categorical data every time I put up such a chart. Here's my reasoning: follow your eyes as you read a dot plot, you are visually tracing the lines that I have drawn, why not just draw the lines?"
I happen to agree with Kaiser; using lines helps tie together the separate data points so you can more easily see trends and make comparisons.

I applied this treatment to the social demographics data from the original infographic. You can see the results below (interactive version here):

This approach certainly has its merits. You can clearly see that for most social media platforms, participation rates increase with age. Google+ is the obvious exception and the trend for Reddit is flat. As you'd expect, the trend is most stark for LinkedIn, the professional network.

The interactive version also has examples of the data plotted using point charts and bar charts (stacked and clustered). None of these, I feel, works as well as the simple line chart. For example, here's a clustered bar chart.

I think it's important not to reflexively rule out line charts when dealing with categorical data as the technique can yield useful insights.

Update (2013-02-22)

During further discussion on the LinkedIn Data Visualization group, Bill Droogendyk referenced an excellent article on the subject of visualizing quantitative data by one of my favourite visualization thought leaders, Stephen Few. The article, entitled "Quantitative vs. Categorical Data: A Difference Worth Knowing", discusses the different types of categorical data:
  • nominal
  • ordinal
  • interval
Using Few's nomenclature, the Age axis used in the charts above is an interval scale, for which Few recommends line (and bar) charts. Kaiser Fung's example uses an ordinal scale. At first glance some readers interpret it as nominal, overlooking the following treatment:
"I sorted the schools by the ratio of three-pointers to midrange jump shots."

By ranking the schools, the scale Fung uses is ordinal. Now here is where Fung and Few differ. Few advises against using line charts with ordinal scales, whereas Fung does so quite often.
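
To make the nominal-to-ordinal step concrete, here's a minimal Python sketch of that kind of ranking. The school names and shot counts are invented purely for illustration (they are not Fung's data), and `rank_by_ratio` is a hypothetical helper, not anything from his post.

```python
# Hypothetical shot counts per school; names and numbers are invented.
shots = {
    "School A": {"three_pointers": 120, "midrange": 200},
    "School B": {"three_pointers": 180, "midrange": 150},
    "School C": {"three_pointers": 90, "midrange": 300},
}


def ratio(counts):
    """Ratio of three-pointers to midrange jump shots for one school."""
    return counts["three_pointers"] / counts["midrange"]


def rank_by_ratio(data):
    """Return school names ordered by their ratio, lowest first."""
    return sorted(data, key=lambda name: ratio(data[name]))


ranked = rank_by_ratio(shots)

# On their own the school names are nominal: no order is implied.
# After sorting, position along the axis carries meaning (an ordinal scale).
for position, name in enumerate(ranked, start=1):
    print(f"{position}. {name}: {ratio(shots[name]):.2f}")
```

The point is only that the order along the axis is derived from the data rather than being arbitrary, which is what makes connecting the points with a line at least defensible.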

I sit on the fence: I reckon it's worth considering a line chart for categorical data (interval & ordinal) and seeing for yourself.