September 11, 2013

Over the Rainbow

About a year ago Visual.ly posted an open letter to NASA on their blog, asking them to avoid using the rainbow (spectral) colour scale for representing continuous data. The letter listed five problems with the rainbow colour scale:
  • Colour-blind people can't perceive the scale properly
  • Divisions between hues produce false visual artefacts
  • The order of the hues has no inherent meaning
  • Yellow appears much brighter than the other hues
  • It is more difficult to see detail than with scales that vary in brightness
I'm ashamed to admit that I've used the rainbow colour scale in my own work, often in response to pressure from users who are used to seeing the scale used elsewhere and want to apply it to their own data.
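The brightness problem is easy to check for yourself. The sketch below is my own illustration, not something from the Visual.ly letter: it computes the relative luminance of five fully saturated hues using the standard sRGB formula. The five-stop `RAINBOW` list is a hypothetical stand-in for a real rainbow scale, which would interpolate between many more colours.

```python
def srgb_to_linear(channel):
    """Linearise one sRGB channel given in the range 0..1."""
    if channel <= 0.04045:
        return channel / 12.92
    return ((channel + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an sRGB colour (0 = black, 1 = white)."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Hypothetical five-stop rainbow, ordered from the "low" end to the "high" end.
RAINBOW = [
    ("red", (1, 0, 0)),
    ("yellow", (1, 1, 0)),
    ("green", (0, 1, 0)),
    ("cyan", (0, 1, 1)),
    ("blue", (0, 0, 1)),
]

luminances = {name: relative_luminance(rgb) for name, rgb in RAINBOW}

for name, lum in luminances.items():
    print(f"{name:>6}: {lum:.3f}")
```

Yellow comes out far brighter than its neighbours, and brightness rises and falls along the scale rather than changing steadily in one direction, which is why a scale that varies mainly in brightness tends to show detail better.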

NASA has responded to Visual.ly's open letter in the form of Robert Simmon from the Earth Observatory. Robert has been on a similar crusade within NASA to eradicate use of the rainbow colour scale, apparently with some success.

Robert's response has been a series of six blog posts on how to use colour for data visualization. It is the best tutorial I've come across on this subject.
The use of colour is a vital but under-appreciated aspect of data visualization. It's all too easy to use the default colour scales provided by the tools we use to create visualizations. Unfortunately, these defaults are often inadequate. Rather than using the defaults, spend some time thinking about how you are using colour to represent data. If you refer to Robert Simmon's "Subtleties of Color" tutorial when doing so then you can't go wrong.

[ 2013-09-24 ] Robert posted this addendum

July 23, 2013

Mobile Visualization

One of my favourite podcasts is Data Stories. Who'd have thought a purely audio presentation of a visual subject would work? But it does, mostly due to its two charming hosts, Moritz Stefaner and Enrico Bertini, and the expert guests they interview.

The guest on episode #25 was Dominikus Baur, whose speciality is delivering visualizations on mobile, touch-based devices. Dominikus is part of the team that created Touchwave, an iOS toolkit for multi-touch interaction with stacked charts.

The podcast is well worth listening to if you're interested in developing visualizations for mobile devices. The guys discuss the challenges and opportunities presented by mobile platforms.



Small screens and limited processing power are obvious challenges. The latter motivated the choice of a native iOS implementation of Touchwave rather than a platform-neutral implementation based on HTML5/JavaScript.

Touch-based user interfaces, especially multi-touch, represent an opportunity for new and interesting ways of interacting with visualizations, compared with the traditional keyboard and pointer interfaces used with desktop and notebook PCs.

Dominikus mentioned the support for mobile devices provided by Tableau. I know that other visualization products, including Spotfire, Panopticon, QlikView and Dundas, support deployment of visualization on mobile devices. How well these implementations work I can't say as I've not used them (please leave a comment if you have some experience).

My own work with D3.js performs poorly on mobile devices. I developed these visualizations with desktop PC users in mind (large screens, pointer interfaces). They won't even load on my Android phone. On my Android tablet they'll load but performance is sluggish and interaction is awkward. In time, I expect the performance problem will be resolved as mobile processors improve. However, the interaction problems will remain.

There is a distinct lag between the adoption of mobile devices and the development of data visualization interfaces that work effectively on them. There is a clear need for new techniques, such as those developed by Dominikus and his colleagues, if we're to provide interactive visualizations that work effectively on mobile devices.

March 29, 2013

Coursera Data Analysis MOOC: Wrap-Up

The Coursera Data Analysis MOOC has concluded. You can read my earlier posts on the course (first impressions, half-time, graduation).

If you're interested in the course content then it's been made public. The video lectures have been published to Prof. Jeff Leek's YouTube channel, and the slide decks can be downloaded from GitHub.

Jeff was interviewed by Roger Peng on the Simply Statistics podcast, in which he reflects on his experience of the MOOC.


Jeff shared some interesting data regarding the course:
  • 102,000 students enrolled
  • 51,000 watched lectures
  • 20,000 answered quizzes
  • 5,500 completed & graded assignments
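
To put those numbers in perspective, here's a quick sketch that turns them into retention rates. The counts are the ones Jeff shared; the `retention` helper and the stage labels are just my own illustration.

```python
# Course funnel as shared by Jeff: (stage, number of students).
FUNNEL = [
    ("enrolled", 102_000),
    ("watched lectures", 51_000),
    ("answered quizzes", 20_000),
    ("completed & graded assignments", 5_500),
]


def retention(stages):
    """Return (stage, count, % of enrolled, % of previous stage) for each stage."""
    total = stages[0][1]
    previous = total
    rows = []
    for name, count in stages:
        rows.append((name, count, 100 * count / total, 100 * count / previous))
        previous = count
    return rows


for name, count, of_total, of_previous in retention(FUNNEL):
    print(f"{name:<32} {count:>7,} {of_total:6.1f}% of enrolled {of_previous:6.1f}% of previous stage")
```

Half of those who enrolled watched lectures, and a little over 5% went on to complete and grade assignments.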

March 15, 2013

Coursera Data Analysis MOOC: Graduation

I've completed Coursera's 2013 Data Analysis course. You can read my earlier posts on the course here and here.

I was initially motivated to enrol so I could learn about Massive Open Online Courses (MOOCs). Once the course started I realised it would take significant effort on my part to see it through. I could easily have given up but decided to invest the time needed to complete the course.

I'm glad I did because I gained the following:
  • knowledge of how Coursera works
  • a broad overview of the statistical techniques that can be used for data analysis
  • improved ability to use R - a tool I often use for work
How Coursera Works
I was out with friends one evening mid-way through the course, and mentioned I'd enrolled with Coursera. I said that the course was free, and was asked "What is Coursera's business model?" I didn't know at the time but I've since read that various revenue streams are being considered:
  • certification fees
  • introducing students to employers and recruiters
  • tutoring
  • sponsorship
  • tuition fees
According to Wikipedia, Coursera was not generating revenue as of March 2012.

The mechanics of a Coursera course are similar to those of a college or university course with the difference being that it takes place online and thousands of students are enrolled.
  • lectures: content is presented as video lectures.
  • quizzes: regular online, multiple-choice tests must be completed.
  • assignments: assignments are submitted online. The lecturer can't assess them all, so students mark their peers' work.
  • getting help: the lecturer can't answer all questions so students post queries to an online forum. Students help each other out with answers, and each course has a handful of knowledgeable TAs who monitor the forum and post replies. You can vote up a post - those with the most votes are handled with the highest priority.
  • wiki: each course has a wiki to which useful course-related information can be added.
  • meet ups: if you want to take things off-line, MeetUps can be organised to discuss the course face-to-face with fellow Courserians.
Data Analysis Course Content
Ultimately, the quality of a course, whether traditional or MOOC, hinges on its content. A friend of mine, who is a university maths lecturer, enrolled in a Coursera programming language course but found the content so poor he gave up.

Overall, the Data Analysis course was good quality. It was the first time Prof. Leek had given the course so there were a few mistakes in the course material. These were picked up by students, who posted corrections to the online forum.

There were also logistical difficulties for students in some time zones. To accommodate them, deadlines for quizzes and assignments were tweaked.

I expect the course will be given again, so future enrolees will enjoy the benefits of the road-testing performed by my cohort of students.

Data analysis is a very broad subject, so it was difficult for Prof. Leek to provide a detailed presentation of the techniques covered in the eight-week course. Instead, a basic introduction was presented for each technique, with examples of how to perform the analysis using R. Links to further resources were provided for those students with the time and inclination to delve deeper into the underlying mathematics. This was something I didn't have time for but at least I now know where to start.

Conclusion
Coursera offers a broad range of courses, and then there are courses offered by others. Having completed my first Coursera MOOC I'm tempted to enrol in another but they do require a significant investment of time and effort. For now I'm content to consolidate what I've learned and wait for something to come along that piques my interest sufficiently for me to put in the effort required.

February 22, 2013

Coursera Data Analysis MOOC: Half-Time Entertainment

It's week five of the Coursera Data Analysis MOOC, and it's been a busy time since the course commenced (see my first impressions). I've just completed the weekly quiz so have time to come up for air and post a progress report.

The course is similar in many respects to my time at uni: I've been late to lectures, and had to scramble to complete tests and assignments. As I mentioned previously, one of my main motivations was to learn about MOOCs. So, I wasn't too worried when by the middle of week one I hadn't been spoon-fed course material. I'd expected a flurry of emails with course information but my inbox was quiet. In fact, I needed to actually visit the Coursera Data Analysis Web site to attend class. By the time I did I found the course well under way, and I had a lot of catching up to do.

I also realised that, just like uni, turning up wasn't going to be enough; I needed to invest serious time in understanding what was being taught and applying it to tests and assignments. So, I pulled my finger out and put aside some other projects to clear time each week to devote to the course.

Having a weekly quiz with a hard deadline has been a useful motivator. It would have been easy to chuck it in - after all, enrolment is free - or let things slip until I had more time. With the quiz deadline I have a weekly goal that keeps me working on the course each day.

I've just completed the first assignment. It was an interesting project focussed on a data set from the Lending Club, a peer-to-peer loans service. We were given two weeks to submit our work. Following this we had a week to mark at least four of our peers' assignments (failure to do so applies a 20% penalty to your own assignment). We were provided with a simple assessment template to guide us through marking.

This is the first time Coursera has presented the Data Analysis course, and there have been a few hiccups along the way. Lecture notes included a few typos, scheduling of deadlines needed to be fine-tuned, and the requirements of the assignments were changed due to security issues (running a stranger's R code is inherently risky).

Many of the changes have come about from feedback via the course forum. I've not had much time to participate in the forum other than occasionally scanning the top-voted posts.

I've found the course material challenging and rewarding. It's clear that data analysis requires a strong grounding in statistics. Prof. Leek has provided us with a tool kit for data analysis: techniques and how to apply them using R. However, the underlying mathematics is not covered (the course is only eight weeks). Prof. Leek has provided links to further resources that provide this background information but I haven't had time to delve into this material.

That being said, I am becoming more proficient with R, which is useful in my day-to-day work. And I have gained a better understanding of the techniques available to me for data analysis work.

I'll post another update at the end of the course in March.

February 1, 2013

Lines vs. Bars for Categorical Data

I recently commented on a thread started by Joey Cherdarchuk in the LinkedIn Data Visualization group. The thread discusses Joey's reworking of an infographic about social media demographics. Joey used diverging stacked bar charts to significantly improve upon the original, which used pie charts. You can read Joey's blog post in full here.

I suggested an alternative would be to use a simple line chart. This is a technique often advocated by Kaiser Fung on his excellent Junk Charts blog. Here's an example of his approach. Many people react negatively to this technique as you can see in the comments section of Kaiser's post. Here's his response:
You won't be the only reader to feel this way. Over the years, I have had complaints from readers about lines connecting categorical data every time I put up such a chart. Here's my reasoning: follow your eyes as you read a dot plot, you are visually tracing the lines that I have drawn, why not just draw the lines?
I happen to agree with Kaiser; using lines helps tie together the separate data points so you can more easily see trends and make comparisons.

I applied this treatment to the social demographics data from the original infographic. You can see the results below (interactive version here):



This approach certainly has its merits. You can clearly see that for most social media platforms, participation rates increase with age. Google+ is the obvious exception and the trend for Reddit is flat. As you'd expect, the trend is most stark for LinkedIn, the professional network.

The interactive version also has examples of the data plotted using point charts and bar charts (stacked and clustered), none of which I feel works as well as the simple line chart. For example, here's a clustered bar chart.



I think it's important not to reflexively rule out line charts when dealing with categorical data as the technique can yield useful insights.

Update (2013-02-22)

During further discussion on the LinkedIn Data Visualization group, Bill Droogendyk referenced an excellent article on the subject of visualizing quantitative data by one of my favourite visualization thought leaders, Stephen Few. The article, entitled "Quantitative vs. Categorical Data: A Difference Worth Knowing", discusses the different types of categorical data:
  • nominal
  • ordinal
  • interval
Using Few's nomenclature, the Age axis used in the charts above is an interval scale, for which Few recommends line (and bar) charts. Kaiser Fung's example uses an ordinal scale. At first glance some interpret it as nominal but fail to notice the following treatment:
I sorted the schools by the ratio of three-pointers to midrange jump shots.

By ranking the schools, the scale Fung uses is ordinal. Now here is where Fung and Few differ. Few advises against using line charts with ordinal scales, whereas Fung does so quite often.

I sit on the fence: I reckon it's worth considering a line chart for categorical data (interval & ordinal) and seeing for yourself.

January 30, 2013

Coursera Data Analysis MOOC: First Impressions

On the spur of the moment I decided to enrol in Coursera's Data Analysis course. I've been curious about MOOCs (massive open on-line courses) for some time, so when I came across this one, I decided it was time to find out more. Plus, the course topic is well-suited to the kind of work I do.

The course is given by Jeff Leek, an Assistant Professor in Biostatistics from the Johns Hopkins Bloomberg School of Public Health. Jeff's introductory video is shown below.


The course is run over eight weeks and is delivered as a set of video lectures. Topics covered include:
  • The structure of a data analysis (steps in the process, knowing when to quit, etc.)
  • Types of data (census, designed studies, randomized trials)
  • Types of data analysis questions (exploratory, inferential, predictive, etc.)
  • How to write up a data analysis (compositional style, reproducibility, etc.)
  • Obtaining data from the Web (through downloads mostly)
  • Loading data into R from different file types
  • Plotting data for exploratory purposes (boxplots, scatterplots, etc.)
  • Exploratory statistical models (clustering)
  • Statistical models for inference (linear models, basic confidence intervals/hypothesis testing)
  • Basic model checking (primarily visually)
  • The prediction process
  • Study design for prediction
  • Cross-validation
  • A couple of simple prediction models
  • Basics of simulation for evaluating models
  • Ways you can fool yourself and how to avoid them (confounding, multiple testing, etc.)
Each lecture can be viewed in your Web browser or downloaded (MP4) for off-line viewing. The lectures are slide presentations with audio of the lecturer explaining the content. You can download the slides (PDF) and transcripts if you prefer.

A 10-question quiz must be completed by the end of each week. It has hard and soft deadlines. If you miss the soft deadline you can still submit answers before the hard deadline but a penalty is applied to your score. You can attempt each quiz four times.

Two peer assignments must be completed: one in week 3 (due at the end of week 4), the other in week 6 (due at the end of week 7). The assignments are graded by your student peers, and you must grade at least four peer assignments to avoid a 20% penalty. Your grade is based on the median of the grades you receive from your peers.
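
As a rough sketch of how I understand the peer grading to work, the snippet below takes the median of the grades received and applies the penalty if fewer than four assignments were marked. The grades are made up, and I'm assuming the 20% penalty is a simple multiplication of the median by 0.8; Coursera's actual calculation may differ.

```python
from statistics import median


def assignment_grade(peer_grades, assignments_marked, required=4, penalty=0.20):
    """Median of the peer grades, reduced by the penalty if too few peers were marked.

    Assumption: the penalty scales the grade by (1 - penalty).
    """
    grade = median(peer_grades)
    if assignments_marked < required:
        grade *= 1 - penalty
    return grade


# Hypothetical grades received from four peers.
grades = [70, 85, 90, 60]

print(assignment_grade(grades, assignments_marked=4))  # marked enough peers
print(assignment_grade(grades, assignments_marked=3))  # penalty applied
```

Using the median rather than the mean means a single unusually harsh or generous marker has little effect on your grade.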

An interesting aspect of the course is the forum, to which students can post questions. Prof. Leek obviously can't answer all the questions, as the course has 100,000 students. So, you can vote on questions and the lecturer responds to the top few. Students can help each other out by responding to questions too.

The course requires a working knowledge of R. I've been using R increasingly as part of my day-to-day work so am comfortable with this. Some (optional) background lectures on R are provided in the course material along with links to other resources.

Successful completion of the course conveys no official qualification or accreditation. I've enrolled purely for my own edification: to learn about MOOCs like Coursera, and sharpen my data analysis skills.

I'll post follow ups as the course progresses.

January 1, 2013

Comparison of Australian Car Values

I was recently in the market for new wheels, and so spent a bit of time researching the Australian car market at carsales.com.au and RedBook.com.au. It got me thinking about the rates of depreciation in value of different makes of car. So, I set about creating a chart that would help me visualize this kind of information.

The result is the interactive line chart shown below. You can use the interactive version of the visualization if you have a modern, standards-compliant browser (Firefox, Chrome, Opera, Safari, etc.) or you can try Chrome Frame (Internet Explorer).

Resale value (%) for several popular models of Australian car.


Interaction
The chart plots a line for each of several popular models of car. The lines can be made to represent several different values:
  • Sticker price: price when new
  • Resale value: price when selling the car on the private market
  • Resale value (%): the resale value as a percentage of the sticker price
You can also highlight individual models using the checkboxes or by moving the mouse cursor over a line.
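
The percentage measure is simple arithmetic. Here's a minimal sketch of the calculation; the function name and the prices are hypothetical and are not taken from the RedBook data or from the chart's own code.

```python
def resale_percentage(sticker_price, resale_value):
    """Resale value expressed as a percentage of the sticker price."""
    if sticker_price <= 0:
        raise ValueError("sticker price must be positive")
    return 100 * resale_value / sticker_price


# Hypothetical example: a car that cost $30,000 new and now sells for $12,000.
print(f"{resale_percentage(30_000, 12_000):.1f}%")
```

Expressing resale value this way lets you compare how well a cheap car and an expensive car hold their value, which the raw dollar figures can't show.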

Data
Obtaining the data was laborious. I first determined popular makes by looking at the numbers of cars for sale at carsales.com.au. I focussed on sedans, ignoring SUVs, vans, utes etc. For each popular make I selected a couple of popular models - small and large.

Then I visited RedBook.com.au to research price history but encountered a couple of hurdles. Firstly, it isn't possible to get 10 years of price data for an individual model because RedBook only publishes the sticker price and current resale values (not the resale value last year, the year before and so on). To overcome this I used the sticker price and current resale value for comparable entry-level models from 2001 - 2011.

The second problem was that for some makes (BMW, Kia, Mercedes and Nissan) it wasn't possible to find two models that had been sold in Australia every year for the last 10 years. And no single model of Hyundai (a very popular make) has been sold continuously for the last decade.

Insights
Once I had the data I was able to visualize it, and there were a couple of surprises. Firstly, sticker price has remained fairly stable across all makes and models. I expected this would have decreased more recently, especially for imported cars, with the appreciation in the value of the Australian dollar.

New car prices have remained fairly stable for the past decade in spite of the appreciation in value of the Australian dollar.



Resale values drop precipitously in the first few years of a car's life - no surprises there.

Resale values drop significantly in the first few years.





What did surprise me was that when resale value is expressed as a percentage of sticker price, small Japanese cars fared best. I had expected the European marques - Audi, BMW, Mercedes and Volkswagen - to top this ranking, or even popular family sedans, but it's the compact Japanese Mazda 3, Toyota Corolla, Subaru Impreza, Honda Civic and the Holden Barina (Japanese import) that top the list. The smallish VW Golf also holds its value well, as does the Merc.

Compact Japanese cars hold their value best over the 10 years considered.



Commodore vs. Falcon
In Australia there is a long-running rivalry between the Holden Commodore and Ford Falcon. Below are the charts comparing the two cars. Falcon and Commodore track each other closely for sticker price. However, the resale value of Falcon falls away from that of the Commodore in the first few years