March 28, 2014

A Distorted and Incomplete Picture

When I saw Ben Goldacre's latest book Bad Pharma: How Drug Companies Mislead Doctors and Harm Patients on the new releases bookshelf of my local library, I borrowed it immediately. Not because I thought it would inform my data visualization work but because I really enjoyed Ben's previous book Bad Science.

So, I was pleased when I read Bad Pharma to find that it focuses on data, the raw material we work with when creating visualizations. I was also deeply disturbed by the book given that it details how modern evidence-based medicine is broken. Ben provides a useful summary of Bad Pharma in the book's introduction:
Drugs are tested by the people who manufacture them, in poorly designed trials, on hopelessly small numbers of weird, unrepresentative patients, and analysed using techniques which are flawed by design, in such a way that they exaggerate the benefits of treatments. Unsurprisingly, these trials tend to produce results that favour the manufacturer. When trials throw up results that companies don't like, they are perfectly entitled to hide them from doctors and patients, so we only ever see a distorted picture of any drug's true effects. Regulators see most of the trial data, but only from early on in a drug's life, and even then they don't give this data to doctors or patients, or even to other parts of government. This distorted evidence is then communicated and applied in a distorted fashion. In their forty years of practice after leaving medical school, doctors hear about what works through ad hoc oral traditions, from sales reps, colleagues or journals. But those colleagues can be in the pay of drug companies – often undisclosed – and the journals are too. And so are the patient groups. And finally, academic papers, which everyone thinks of as objective, are often covertly planned and written by people who work directly for the companies, without disclosure. Sometimes whole academic journals are even owned outright by one drug company. Aside from all this, for several of the most important and enduring problems in medicine, we have no idea what the best treatment is, because it's not in anyone's financial interest to conduct any trials at all. These are ongoing problems, and although people have claimed to fix many of them, for the most part they have failed; so all these problems persist, but worse than ever, because now people can pretend that everything is fine after all.
But enough about medicine. What makes Bad Pharma interesting to a data visualization practitioner is not its charts and graphs (there are only a few in the book) but its discussion of data. The book's first chapter, Missing Data, describes how drug trials performed by pharmaceutical companies overwhelmingly produce results that are favourable to the companies. Goldacre argues that this arises for several reasons:
  • flawed experimental design: trials are designed in ways likely to produce a favourable outcome
  • flawed data analysis: see my post on Alex Reinhart's Statistics Done Wrong
  • publication bias: trials that produce unfavourable outcomes are simply not published, skewing published data towards favourable results
This reminds us to be circumspect about the data we visualize. We should ask:
  • How was the data collected?
  • How has the data been transformed or processed?
  • Is the data complete?
The answers to these questions are metadata that we need to communicate as part of any visualization we create. Without this metadata, we risk painting a distorted and incomplete picture of the data we are visualizing.

March 21, 2014

Lyra: the Interactive Visualization Design Environment

I recently spent some time using Lyra, an "interactive visualization design environment" that allows you to create visualizations without writing a single line of code. It's being developed by Arvind Satyanarayan, Kanit “Ham” Wongsuphasawat and Jeffrey Heer (think Prefuse, Protovis, D3, Vega) at the University of Washington's Interactive Data Lab.

Lyra is a bit like other interactive visualization design tools such as Tableau and Spotfire. However, under the hood it's powered by D3. Now, I enjoy coding visualizations directly using D3, but I realise not everyone shares my enthusiasm or has the time to learn it. Lyra gives you access to the expressiveness of D3 without requiring you to learn its API (or Javascript).

The Lyra application is shown below and consists of three panels:
  • the left-hand panel manages Data Pipelines, where you define and transform (sort, group, filter, window, apply formula) data sources
  • the centre panel displays your visualization, where you interactively select and modify visualization elements: marks (rectangles, symbols, arcs, areas, lines and text), axes and layers
  • the right-hand panel provides access to attributes of the elements in your visualization
Once you've created a visualization you can export it as an image (PNG or SVG) or a Vega specification.
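To give a flavour of the export format, below is a minimal hand-written sketch of a Vega bar-chart specification. The data values and field names are my own illustrative choices, not Lyra output, and the property names follow the early (v1-era) Vega examples — the exact spec Lyra exports may differ:

```json
{
  "width": 300,
  "height": 200,
  "data": [
    {
      "name": "table",
      "values": [
        {"x": "A", "y": 28}, {"x": "B", "y": 55}, {"x": "C", "y": 43}
      ]
    }
  ],
  "scales": [
    {"name": "x", "type": "ordinal", "range": "width",
     "domain": {"data": "table", "field": "data.x"}},
    {"name": "y", "range": "height", "nice": true,
     "domain": {"data": "table", "field": "data.y"}}
  ],
  "axes": [
    {"type": "x", "scale": "x"},
    {"type": "y", "scale": "y"}
  ],
  "marks": [
    {
      "type": "rect",
      "from": {"data": "table"},
      "properties": {
        "enter": {
          "x": {"scale": "x", "field": "data.x"},
          "width": {"scale": "x", "band": true, "offset": -1},
          "y": {"scale": "y", "field": "data.y"},
          "y2": {"scale": "y", "value": 0}
        }
      }
    }
  ]
}
```

A declarative specification like this is what makes Lyra's approach possible: the tool only has to manipulate a JSON document, not generate imperative drawing code.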

If you're familiar with D3 then you'll recognise some of its idioms in Lyra. For example, D3's data-binding mechanism is implemented by dragging-and-dropping data variables onto the attributes of visual elements.
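For readers unfamiliar with that idiom: D3's data join partitions a selection into enter, update and exit sets. The sketch below is my own plain-JavaScript toy (no DOM, no D3, and `join` is a hypothetical helper) that mimics the index-based matching `selection.data()` performs by default:

```javascript
// A toy version of D3's data join: given existing "marks" and a new data
// array, partition into update (data with an existing mark), enter (data
// with no mark yet) and exit (marks with no remaining data), matching by
// array index as selection.data() does by default.
function join(marks, data) {
  const n = Math.min(marks.length, data.length);
  return {
    update: data.slice(0, n).map((d, i) => ({ ...marks[i], datum: d })),
    enter: data.slice(n).map((d) => ({ datum: d })), // new marks to create
    exit: marks.slice(n),                            // old marks to remove
  };
}

const marks = [{ shape: "rect", datum: 1 }, { shape: "rect", datum: 2 }];
const { update, enter, exit } = join(marks, [10, 20, 30]);
console.log(update.length, enter.length, exit.length); // 2 1 0
```

In Lyra, dropping a data variable onto a mark attribute plays the role of the `.data()` call above: it declares the binding, and the tool maintains the join for you.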

If you want to try Lyra then you have several options.
Bear in mind that Lyra is alphaware. I did encounter a few issues, e.g. saving and recovering work didn't appear to work properly. The authors are interested in constructive feedback.

February 20, 2014

Statistics Done Wrong - Alex Reinhart

I've just finished reading Alex Reinhart's excellent Statistics Done Wrong, a guided tour of common statistical fallacies and misconceptions. It covers p-values, statistical power, statistical significance, pseudo-replication and stopping rules.

Statistics Done Wrong is written with all scientists in mind, assuming no knowledge of statistical methods. It's essential reading for all data scientists, including data visualization practitioners. Even though we're not data analysts, we need at least a basic level of statistical literacy.
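One of the fallacies the book covers, the stopping-rule problem, is easy to demonstrate for yourself. The simulation below is my own sketch (not taken from the book): it repeatedly tests a fair coin, "peeking" at the data after every 20 flips, and counts how often a nominal 5% z-test fires at any peek, compared with testing only once at the end:

```javascript
// Demonstrates how optional stopping inflates the false-positive rate.
// A small linear congruential generator keeps the run deterministic.
let seed = 42;
function rand() {
  seed = (1664525 * seed + 1013904223) >>> 0;
  return seed / 4294967296;
}

const SIMS = 2000, LOOKS = 5, STEP = 20; // peek after every 20 flips
let peekPositives = 0, fixedPositives = 0;

for (let s = 0; s < SIMS; s++) {
  let heads = 0, flips = 0, firedEarly = false;
  for (let look = 0; look < LOOKS; look++) {
    for (let i = 0; i < STEP; i++) { if (rand() < 0.5) heads++; flips++; }
    // Two-sided z-test for a fair coin at the current sample size.
    const z = (heads - flips / 2) / Math.sqrt(flips / 4);
    if (Math.abs(z) > 1.96) firedEarly = true; // "significant" at this peek
  }
  if (firedEarly) peekPositives++;
  // The honest procedure: a single test at the final sample size.
  const zFinal = (heads - flips / 2) / Math.sqrt(flips / 4);
  if (Math.abs(zFinal) > 1.96) fixedPositives++;
}

const peekRate = peekPositives / SIMS;
const fixedRate = fixedPositives / SIMS;
// The coin is fair, so both rates "should" be about 0.05 --
// but repeated peeking substantially inflates the first one.
console.log(`peeking: ${peekRate}, single look: ${fixedRate}`);
```

The single-look rate stays near the nominal 5%, while stopping at the first "significant" peek fires far more often on pure noise — exactly the kind of silent error Reinhart warns about.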

The problems Alex describes are rife in the scientific peer-reviewed literature. By coincidence I'm reading Ben Goldacre's Bad Pharma, which focusses on how clinical trials data is distorted by the pharmaceutical industry. Many of the issues Alex raises are seen in practice in Bad Pharma.

Statistics Done Wrong concludes with What Can Be Done? Here I quote the Your Job section:
Your task can be expressed in four simple steps:
  1. Read a statistics textbook or take a good statistics course. Practice.
  2. Plan your data analyses carefully and deliberately, avoiding the misconceptions and errors you have learned.
  3. When you find common errors in the scientific literature – such as a simple misinterpretation of p values – hit the perpetrator over the head with your statistics textbook. It’s therapeutic.
  4. Press for change in scientific education and publishing. It’s our research. Let’s not screw it up.

Statistics Done Wrong is on-line, free and should take you no more than an hour to read. Once you've read it, share it with your data scientist colleagues. And if you want to learn more about data analysis then I recommend Coursera's Data Analysis MOOC - read my account of it here.

January 26, 2014

Review: Infoactive (beta)

Infoactive is an on-line tool for creating infographics, similar to Infogram and Venngage - see my earlier review of these offerings.

Infoactive garnered considerable attention from the visualization community as a result of its highly successful Kickstarter campaign, which raised $55,109 (more than quadrupling its $12,000 target) from 1,448 backers. The promotional video clip is shown below.

I was one of those backers, which granted me early access to the Infoactive beta program. What follows are my impressions of the tool after a few hours experimenting with it.

At the outset, it's important to stress that at the time of writing Infoactive is in beta. I did encounter several problems that made working with the tool difficult. So, if you're expecting to start using Infoactive and be immediately productive then you're going to be somewhat disappointed.

With that out of the way let's focus on what you can do with Infoactive. The tool is very easy to use. A panel on the left-hand side of the page holds a palette of graphical elements that you can drag-and-drop onto your infographic canvas.

Two features distinguish Infoactive from its rivals:
  • Connect to live data sources: you can provide the URL of a public Google Drive spreadsheet to serve as your data source. If the data changes then so too does the Infographic connected to it.
  • Infographics created with Infoactive are interactive: this includes filtering and details on mouse-over events.
Below is a sample infographic created using Infoactive. I would have preferred to include my own example but due to some of the bugs I encountered I'm including the example created by the Infoactive team:


Two types of data source can be used: public Google Drive spreadsheets or CSV files uploaded to Infoactive. You can specify multiple data sources for each infographic, with each chart connected to a specific source. An editor is provided that allows you to modify cell values in each data source.


Several chart types are provided:
  • Line and area charts
  • Horizontal and vertical column charts
  • Pie and donut charts
  • Gauges
  • Maps
You can drag-and-drop charts into your infographic. Once in place, you can configure various attributes of the chart such as its title, data set and the data columns assigned to each axis.


Filters are a useful interactive element. When placed in an infographic they allow the user to focus on a subset of the data defined by a categorical data column. Charts associated with the filter are updated in response to the user's selections. You can configure the data source, data column and layout of each filter.


A variety of text blocks (header, sub-header, text, logo) is provided. Two default colour themes (classic, earth) are available - you can also create a custom colour palette.


Once you've created an infographic you can publish it. This provides you with a URL which displays the infographic on its own page (for sharing on social media), or an iframe for embedding in a web page.


It's early days for Infoactive. Many people have pledged support, so expectations are high. Similar tools are available but the live and interactive aspect of Infoactive infographics differentiate them from the others. Infoactive is a promising tool that is easy to use but work is needed to iron out the bugs.

September 11, 2013

Over the Rainbow

About a year ago, an open letter to NASA was posted asking them to avoid using the rainbow (spectral) colour scale for representing continuous data. The letter listed five problems with the rainbow colour scale:
  • Colour-blind people can't perceive the scale properly
  • Divisions between hues produce false visual artefacts
  • The order of the hues has no inherent meaning
  • Yellow appears much brighter than the other hues
  • It is more difficult to see detail than with scales that vary in brightness
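The brightness problem in particular is easy to quantify. The sketch below is my own (not from the open letter), using the standard sRGB relative-luminance weights; it shows that pure yellow is more than four times as luminant as pure red, and over twelve times as luminant as pure blue:

```javascript
// Relative luminance of a linear-RGB colour, per the sRGB/WCAG weights.
// For pure primaries and secondaries (channels 0 or 1) no gamma
// linearisation is needed, since 0 and 1 map to themselves.
function luminance([r, g, b]) {
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

const hues = {
  red:     [1, 0, 0],  // 0.2126
  yellow:  [1, 1, 0],  // 0.9278
  green:   [0, 1, 0],  // 0.7152
  cyan:    [0, 1, 1],  // 0.7874
  blue:    [0, 0, 1],  // 0.0722
  magenta: [1, 0, 1],  // 0.2848
};

for (const [name, rgb] of Object.entries(hues)) {
  console.log(name, luminance(rgb).toFixed(4));
}
```

Because lightness rises and falls as the rainbow scale sweeps through these hues, the scale produces a false bright band at yellow rather than a perceptually uniform ramp.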
I'm ashamed to admit that I've used the rainbow colour scale in my own work, often in response to pressure from users who are used to seeing the scale used elsewhere and want to apply it to their own data.

NASA has responded to the open letter in the form of Robert Simmon from the Earth Observatory. Robert has been on a similar crusade within NASA to eradicate use of the rainbow colour scale, apparently with some success.

Robert's response has been a series of six blog posts, "Subtleties of Color", on how to use colour for data visualization. It is the best tutorial I've come across on this subject.
The use of colour is a vital but under-appreciated aspect of data visualization. It's all too easy to use the default colour scales provided by the tools we use to create visualizations. Unfortunately, these defaults are often inadequate. Rather than using the defaults, spend some time thinking about how you are using colour to represent data. If you refer to Robert Simmon's "Subtleties of Color" tutorial when doing so then you can't go wrong.

[ 2013-09-24 ] Robert posted this addendum.

July 23, 2013

Mobile Visualization

One of my favourite podcasts is Data Stories. Who'd have thought a purely audio presentation of a visual subject would work? But it does, mostly due to its two charming hosts: Moritz Stefaner and Enrico Bertini, and the expert guests they interview.

The guest on episode #25 was Dominikus Baur, whose speciality is delivering visualizations on mobile, touch-based devices. Dominikus is part of the team that created Touchwave, an iOS toolkit for multi-touch interaction with stacked charts.

The podcast is well worth listening to if you're interested in developing visualizations for mobile devices. The guys discuss the challenges and opportunities presented by mobile platforms.

Small screens and limited processing power are obvious challenges. The latter motivated the choice of a native iOS implementation of Touchwave rather than a platform-neutral implementation based on HTML5/Javascript.

Touch-based user interfaces, especially multi-touch, represent an opportunity for new and interesting ways of interacting with visualizations, compared with the traditional keyboard and pointer interfaces used with desktop and notebook PCs.

Dominikus mentioned the support for mobile devices provided by Tableau. I know that other visualization products, including Spotfire, Panopticon, QlikView and Dundas, support deployment of visualization on mobile devices. How well these implementations work I can't say as I've not used them (please leave a comment if you have some experience).

My own work with D3.js performs poorly on mobile devices. I developed these visualizations with desktop PC users in mind (large screens, pointer interfaces). They won't even load on my Android phone. On my Android tablet they'll load, but performance is sluggish and interaction is awkward. In time, I expect the former problem will be resolved as the performance of mobile processors improves. However, the interaction problems will remain.

There is a distinct lag between the adoption of mobile devices and the development of data visualization interfaces that work effectively on them. There is a clear need for new techniques, such as those developed by Dominikus and his colleagues, if we're to provide interactive visualizations that work effectively on mobile devices.

March 29, 2013

Coursera Data Analysis MOOC: Wrap-Up

The Coursera Data Analysis MOOC has concluded. You can read my earlier posts on the course (first impressions, half-time, graduation).

If you're interested in the course content then it's been made public. The video lectures have been published to Prof. Jeff Leek's YouTube channel, and the slide decks can be downloaded from GitHub.

Jeff was interviewed by Roger Peng on the Simply Statistics podcast, in which he reflects on his experience of the MOOC.

Jeff shared some interesting data regarding the course:
  • 102,000 students enrolled
  • 51,000 watched lectures
  • 20,000 answered quizzes
  • 5,500 completed & graded assignments