November 24, 2014

Stephen King Screen Adaptations (Plotly)

Stephen King is a prolific author, whose books I've enjoyed reading since I was a teenager. His prodigious written output has spawned many screen adaptations for film and television, but in many cases I've been disappointed by the screen versions; see, for example, the dreadful "Under the Dome" TV mini-series.

I decided to look at how well-received King's films have been compared with his books. I found a list of screen adaptations, and for each looked up the book's rating on Goodreads and the movie's rating on IMDb. I necessarily omitted screenplays, movie sequels (not adapted from a King book) and short stories that contributed to only a portion of a movie. I then imported this data into Plotly and produced the chart shown below.
Mouse over a glyph to display details.
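If you'd like to build something similar, here's a rough sketch using the Plotly Python library (plotly.express) rather than the Plotly web app I actually used; the column names, the handful of rows and the ratings below are illustrative placeholders, not my dataset.

import pandas as pd
import plotly.express as px
# Hypothetical data: one row per adaptation (placeholder values).
data = pd.DataFrame({
    "title": ["The Green Mile", "The Mangler", "Trucks"],
    "book_rating": [4.4, 3.0, 3.2],      # Goodreads, out of 5
    "screen_rating": [8.6, 4.2, 5.0],     # IMDb, out of 10
    "source": ["Novel", "Short story", "Short story"],
    "format": ["Film", "Film", "TV"],
})
fig = px.scatter(
    data,
    x="book_rating", y="screen_rating",
    color="source",      # story type -> glyph colour
    symbol="format",     # film vs TV -> glyph shape (standing in for the tall/wide glyphs)
    hover_name="title",  # mouse over a glyph to display details
    labels={"book_rating": "Goodreads rating", "screen_rating": "IMDb rating"},
)
fig.show()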

The chart reveals a positive correlation between the ratings of King's books and their screen adaptations. Highly rated novels such as "The Green Mile", "Rita Hayworth & The Shawshank Redemption" and "The Body" produced well-regarded movies, whereas poorly rated stories such as "Trucks", "The Mangler" and "Tommyknockers" resulted in absolute stinkers on screen.

We can also see that TV adaptations (wide glyphs) were generally less well-received than film adaptations (tall glyphs). Similarly, short stories (orange glyphs) and their screen adaptations tend not to rate as highly as novels (blue glyphs) and novellas (green glyphs) and their screen adaptations.

Incidentally, this was my first time using Plotly. I was able to import my data and generate a scatter plot with relative ease. Customising it for my needs took a little longer as I was new to the tool. I'll definitely use Plotly again.

November 11, 2014

Visualizing how my personal tax was spent

I received my tax assessment yesterday. On the last page was the bar chart shown below, which visualizes where my "personal tax was spent, based on 2014-15 Budget estimates" (according to the caption). To the right of each bar is a dollar amount (obscured) that represents the portion of my taxes spent in each category.

Where my "personal tax was spent, based on 2014-15 Budget estimates".

I've not seen this chart on previous years' tax assessments. It provides a useful indication of where the Federal government expects to spend our personal taxes.

The chart is simple but effective. Sorting from largest to smallest is a good choice, as is the breakdown of the Welfare budget into sub-categories. I don't believe the colours encode any information. I'm glad they didn't use a (3D) pie chart, which so often blights public reports of budget expenditure.

I'll be interested to see what charts accompany my tax assessment next year. I'd also like to see some historical information, such as budgeted versus actual expenditure or the change in the amount of tax paid.

September 4, 2014

The Simpsons Social Network (Season 1)

I've been a fan of The Simpsons ever since Season #1 was first broadcast. So, I was recently thinking about visualizing the social network (no, not this one) of Simpsons characters.

Constructing the network of social relationships between various Simpsons characters would be a difficult and time-consuming process (does Lisa even have any friends?). So I opted for a different network that can be constructed programmatically: the network of character co-appearances. In this network, two characters are connected if they appear in the same episode of The Simpsons. This network is similar to the one constructed for film actors that allows us to determine six degrees of Kevin Bacon.

The Simpsons co-appearances network can be constructed by parsing the episode pages of Wikisimpsons. Mathematically speaking, the network is a graph. Each node of the graph represents a Simpsons character. An (undirected) edge connects each pair of nodes whose characters appear in the same episode. To each edge I add a weight: the number of episodes in which the pair of characters co-appear. I also label each node with the number of episodes in which its character appears.
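As a rough sketch of the construction step (my own scripts did the actual scraping; the parsed episode data below is a made-up placeholder, and networkx with a GEXF export is simply one convenient tooling choice):

from itertools import combinations
import networkx as nx
# Hypothetical parsed data: episode title -> characters appearing in it.
episodes = {
    "Simpsons Roasting on an Open Fire": ["Homer", "Marge", "Bart", "Lisa", "Barney"],
    "Bart the Genius": ["Homer", "Marge", "Bart", "Lisa", "Milhouse"],
    "Homer's Odyssey": ["Homer", "Marge", "Bart", "Lisa", "Mr. Burns", "Smithers"],
}
G = nx.Graph()
for characters in episodes.values():
    for character in characters:
        # Label each node with the number of episodes its character appears in.
        if character not in G:
            G.add_node(character, episodes=0)
        G.nodes[character]["episodes"] += 1
    # Add an undirected, weighted edge for every pair of co-appearing characters.
    for a, b in combinations(characters, 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)
# Export in a format Gephi can open directly.
nx.write_gexf(G, "simpsons_season1_coappearances.gexf")

The resulting .gexf file can then be opened in Gephi, which is where the visualization below picks up.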

Having constructed the graph, we can set about visualizing it. Visualizing a graph helps you understand the structure of the network, so the choice of graph-layout algorithm is critical. If you impose a hierarchical layout, you'll see hierarchies. If you impose a circular layout, you'll see circles.

For this reason I've used a force-directed layout, which attempts to position the nodes such that the distance between any pair of connected nodes is inversely proportional to the weight on the edge between them. This results in characters who co-appear often having their nodes positioned close together, while those that don't will have their nodes separated.
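You can see the same effect outside Gephi: networkx's force-directed spring_layout treats higher edge weights as stronger springs, pulling frequent co-stars together. This is just a stand-in for the ForceAtlas 2 layout I actually used.

import networkx as nx
# G is the weighted co-appearance graph from the sketch above.
pos = nx.spring_layout(G, weight="weight", seed=42)  # node -> (x, y) positions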

To do this I used Gephi, the "open source graph visualization platform". Gephi allows you to experiment with various layout algorithms and customize the appearance of your graph. You can easily apply different colour maps, labelling and rendering attributes to your graph's nodes and edges. Gephi has tools for filtering nodes and edges, and an arsenal of graph-theoretic indices can be calculated.

I constructed a co-appearances graph for Season 1 of the Simpsons and loaded it into Gephi. I applied the following settings:
  • Layout: ForceAtlas 2
  • Node size and colour: number of episodes in which a character makes an appearance
  • Edge colour: number of episodes in which the characters connected by the edge co-appear
The resulting graph is shown below. High-resolution renderings are also available (PNG, PDF, SVG).
Graph of Simpsons characters co-appearances in Season 1.

The graph shows us several things. The "central" characters - Homer, Marge, Bart, Lisa and Maggie Simpson - form a cluster at the centre of the graph. They have the largest, darkest nodes because they appear in every episode of Season 1.

Around this central cluster are positioned smaller, lighter nodes for characters who appear frequently but not in every episode: characters like Milhouse Van Houten, Moe Szyslak, Barney Gumble, Monty Burns and Waylon Smithers. Notice that Burns and Smithers, and Moe and Barney, are positioned close together as they often appear in the same episodes.
The central cluster of the Simpsons co-appearance graph.

On the outer edges of the graph are clusters of characters who appear together in a single episode. Below we see the cluster (of minor characters) for episode 7 "The Call of the Simpsons". Between these episode clusters are positioned characters who appear in two or three episodes.
Cluster of minor characters appearing in episode 7 "The Call of the Simpsons".
If you'd like to experiment with this graph you can download it from GitHub.

June 19, 2014

Australian Federal Budget 2014/15: Changes to Public Service Staffing Levels Visualized Using a D3.js Zoomable Treemap


A friend recently drew my attention to Ausviz, a site focussed on visualizations of Australian data, particularly data sets from data.gov.au. One of the first Ausviz visualizations I looked at uses a force-directed graph to visualize changes in public service staffing levels arising from the 2014/15 Federal Budget. The graph represents the hierarchy of ministries and departments, with the size and colour of leaf nodes encoding the change in departmental headcounts.

An alternative way of visualizing hierarchies is to use a treemap. The hierarchy is represented by a nested layout of rectangles. The size and colour of the rectangles are used to encode dimensions of the data.

So, taking inspiration from the Ausviz visualization I implemented a treemap to visualize the same data. The layout of rectangles represents the hierarchy of Federal Government ministries and departments. Rectangle sizes encode the numbers of staff in each department (2013/14 or 2014/15). Rectangle colours encode the changes in staffing levels (absolute or relative). The colour scale ranges from red (staff decrease) through white (no change) to green (staff increase).
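The actual treemap is built with D3.js (shown below), but the same encodings can be sketched quickly in Python with Plotly Express; the ministries, departments and staff numbers here are placeholders, not the budget figures.

import pandas as pd
import plotly.express as px
# Placeholder rows only; the real figures come from Budget Paper 4, Table 2.2.
df = pd.DataFrame({
    "ministry": ["Ministry A", "Ministry A", "Ministry B"],
    "department": ["Dept. X", "Dept. Y", "Dept. Z"],
    "staff_2014_15": [3200, 900, 1500],
    "change": [-250, 40, 0],  # absolute change in headcount
})
fig = px.treemap(
    df,
    path=["ministry", "department"],  # nested rectangles: ministry -> department
    values="staff_2014_15",           # rectangle size encodes staff numbers
    color="change",                   # rectangle colour encodes the change
    color_continuous_scale=["red", "white", "green"],
    color_continuous_midpoint=0,      # white = no change
)
fig.show()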

The treemap is shown below. An interactive version can be found here (fullscreen). You will need a "modern" browser to use the interactive version, which supports the following operations:
  • change the size encoding (2013/14 or 2014/15)
  • change the colour encoding (absolute or relative)
  • drill down into a ministry (click on a rectangle)
  • mouse over a rectangle to display a departmental tool-tip




The treemap allows us to quickly see where the biggest changes, both absolute and relative, are to occur:
  • Size: 2013/14; Change: absolute (we see the big winners and losers)
    • Gain: Dept. Foreign Affairs & Trade - 1659, 42%
    • Gain: Dept. Prime Minister & Cabinet - 1543, 200%
    • Gain: Dept. Defence - 604, 1%
    • Loss: Dept. Employment, Education & Workplace Relations - 3740, 100%
    • Loss: Australian Taxation Office - 2954, 13%
  • Size: 2014/15; Change: absolute (we see the new, large departments and agencies)
    • New: Dept. Employment - 1716
    • New: Dept. Education - 1823
    • New: National Disability Insurance Agency - 798
  • Size: 2013/14; Change: relative (we see shut down departments and agencies)
    • Gone: AusAID - 1982
    • Gone: Dept. Resources, Energy & Tourism - 655
    • Gone: Dept. Regional Australia, Local Govt. Arts & Sports - 482
    • Gone: Health Workforce Australia - 140
    • Gone: Clean Energy Finance Corp. - 50
    • Gone: Wine Australia Corp. - 49
    • Gone: Australian National Preventative Health Agency - 40
    • Gone: Climate Change Authority - 35
    • Gone: Telecommunications Universal Service Management Agency - 17
    • Gone: Grape & Wine R&D Corp. - 11
    • Gone: Sugar Development R&D Corp. - 8
  • Size: 2014/15; Change: relative (we see the new, small departments and agencies)
    • New: Australian Grape & Wine Authority - 55
There are better ways of visualizing changes of this kind, e.g. a bump chart or a sortable table, but the advantage of using a treemap is that it shows the structure of the public service.

The treemap was implemented using D3.js, and borrowed heavily from a couple of excellent examples.
Source data comes from Budget Paper 4, Table 2.2 (Average Staffing Table). Note the many footnotes associated with this data.

The source code is available on GitHub and is licensed under a Creative Commons Attribution 4.0 International License.

March 28, 2014

A Distorted and Incomplete Picture

When I saw Ben Goldacre's latest book Bad Pharma: How Drug Companies Mislead Doctors and Harm Patients on the new releases bookshelf of my local library, I borrowed it immediately. Not because I thought it would inform my data visualization work but because I really enjoyed Ben's previous book Bad Science.

So, I was pleased when I read Bad Pharma to find that it focuses on data, the raw material we work with when creating visualizations. I was also deeply disturbed by the book given that it details how modern evidence-based medicine is broken. Ben provides a useful summary of Bad Pharma in the book's introduction:
Drugs are tested by the people who manufacture them, in poorly designed trials, on hopelessly small numbers of weird, unrepresentative patients, and analysed using techniques which are flawed by design, in such a way that they exaggerate the benefits of treatments. Unsurprisingly, these trials tend to produce results that favour the manufacturer. When trials throw up results that companies don't like, they are perfectly entitled to hide them from doctors and patients, so we only ever see a distorted picture of any drug's true effects. Regulators see most of the trial data, but only from early on in a drug's life, and even then they don't give this data to doctors or patients, or even to other parts of government. This distorted evidence is then communicated and applied in a distorted fashion. In their forty years of practice after leaving medical school, doctors hear about what works through ad hoc oral traditions, from sales reps, colleagues or journals. But those colleagues can be in the pay of drug companies – often undisclosed – and the journals are too. And so are the patient groups. And finally, academic papers, which everyone thinks of as objective, are often covertly planned and written by people who work directly for the companies, without disclosure. Sometimes whole academic journals are even owned outright by one drug company. Aside from all this, for several of the most important and enduring problems in medicine, we have no idea what the best treatment is, because it's not in anyone's financial interest to conduct any trials at all. These are ongoing problems, and although people have claimed to fix many of them, for the most part they have failed; so all these problems persist, but worse than ever, because now people can pretend that everything is fine after all.
But enough about medicine. What makes Bad Pharma interesting to a data visualization practitioner is not its charts and graphs (there are only a few in the book) but its discussion of data. The book's first chapter, Missing Data, describes how drug trials performed by pharmaceutical companies overwhelmingly produce results that are favourable to the companies. Goldacre argues that this arises for several reasons:
  • flawed experimental design: trials are designed in ways likely to produce a favourable outcome
  • flawed data analysis: see my post on Alex Reinhart's Statistics Done Wrong
  • publication bias: trials that produce unfavourable outcomes are simply not published, skewing published data towards favourable results
This reminds us to be circumspect about the data we visualize. We should ask:
  • How was the data collected?
  • How has the data been transformed or processed?
  • Is the data complete?
The answers to these questions are metadata that we need to communicate as part of any visualization we create. Without it, we risk painting a distorted and incomplete picture of the data we are visualizing.

March 21, 2014

Lyra: the Interactive Visualization Design Environment

I recently spent some time using Lyra, an "interactive visualization design environment" that allows you to create visualizations without writing a single line of code. It's being developed by Arvind Satyanarayan, Kanit “Ham” Wongsuphasawat and Jeffrey Heer (think Prefuse, Protovis, D3, Vega) at the University of Washington's Interactive Data Lab.

Lyra is a bit like other interactive visualization design tools such as Tableau and Spotfire. However, under the hood it's powered by D3 (like Plot.ly). Now, I enjoy coding visualizations directly using D3, but I realise not everyone shares my enthusiasm or has the time to learn D3. Lyra gives you access to the expressiveness of D3 without requiring you to learn its API (or JavaScript).

The Lyra application is shown below and consists of three panels:
  • the left-hand panel manages Data Pipelines, where you define and transform (sort, group, filter, window, apply formula) data sources
  • the centre panel displays your visualization, where you interactively select and modify visualization elements: marks (rectangles, symbols, arcs, areas, lines and text), axes and layers
  • the right-hand panel provides access to attributes of the elements in your visualization
Once you've created a visualization you can export it as an image (PNG or SVG) or a Vega specification.





If you're familiar with D3 then you'll recognise some of its idioms in Lyra. For example, D3's data-binding mechanism is implemented by dragging-and-dropping data variables onto the attributes of visual elements.

If you want to try Lyra then you have several options. Bear in mind that Lyra is alphaware. I did encounter a few issues, e.g. saving and recovering work didn't appear to work properly. The authors are interested in constructive feedback.

February 20, 2014

Statistics Done Wrong - Alex Reinhart

I've just finished reading Alex Reinhart's excellent Statistics Done Wrong, a guided tour of common statistical fallacies and misconceptions. It covers p-values, statistical power, statistical significance, pseudo-replication and stopping rules.

Statistics Done Wrong is written with all scientists in mind, assuming no knowledge of statistical methods. It's essential reading for all data scientists, including data visualization practitioners. Even though we're not data analysts, we need at least a basic level of statistical literacy.

The problems Alex describes are rife in the scientific peer-reviewed literature. By coincidence I'm reading Ben Goldacre's Bad Pharma, which focusses on how clinical trials data is distorted by the pharmaceutical industry. Many of the issues Alex raises are seen in practice in Bad Pharma.

Statistics Done Wrong concludes with "What Can Be Done?". Here I quote the "Your Job" section:
Your task can be expressed in four simple steps:
  1. Read a statistics textbook or take a good statistics course. Practice.
  2. Plan your data analyses carefully and deliberately, avoiding the misconceptions and errors you have learned.
  3. When you find common errors in the scientific literature – such as a simple misinterpretation of p values – hit the perpetrator over the head with your statistics textbook. It’s therapeutic.
  4. Press for change in scientific education and publishing. It’s our research. Let’s not screw it up.

Statistics Done Wrong is online, free and should take you no more than an hour to read. Once you've read it, share it with your data scientist colleagues. And if you want to learn more about data analysis then I recommend Coursera's Data Analysis MOOC - read my account of it here.