Don’t beat up Statscan for one data error – Cross

More on StatsCan from Philip Cross, Statistics Canada's former chief economic analyst. Worth reading for some of the history and how the agency reacted to previous cuts:

People should get agitated about Statscan over substantive issues. Wring your hands that the CPI overstates price changes over long periods. Write your MP complaining that the Labour Force Survey doesn’t follow the U.S. practice and exclude 15-year-olds. Take to the barricades that for an energy superpower like Canada, measuring energy exports has become a monthly adventure, routinely revised by $1-billion a month. But don’t use the July employment incident to evaluate how the statistical system is functioning overall. They messed up one data point in one series. Big deal. Anyone who lets one data point affect their view of the economy should not be doing analysis. Move along, folks, nothing to see here.

Don’t beat up Statscan for one data error – The Globe and Mail.

The case of the disappearing Statistics Canada data

Good piece on Statistics Canada and the impact of some of the changes that have cut back or terminated long-standing data series:

Last year, Stephen Gordon railed against StatsCan’s attention deficit disorder, and its habit of arbitrarily terminating long-standing series and replacing them with new data that are not easily comparable.

For what appears to be no reason whatsoever, StatsCan has taken a data table that went back to 1991 and split it up into two tables that span 1991-2001 and 2001-present. Even worse, the older data have been tossed into the vast and rapidly expanding swamp of terminated data tables that threatens to swallow the entire CANSIM site. A few months ago, someone looking for SEPH wage data would get the whole series. Now, you’ll get data going back to 2001 and have to already know (StatsCan won’t tell you) that there are older data hidden behind the “Beware of the Leopard” sign…

Statistics Canada must be the only statistical agency in the world where the average length of a data series gets shorter with the passage of time. Its habit of killing off time series, replacing them with new, “improved” definitions and not revising the old numbers is a continual source of frustration to Canadian macroeconomists.

Others are keeping tabs on the vanishing data. The Canadian Social Research Newsletter for March 2 referred to the cuts as the CANSIM Crash Diet and tallied some of the terminations:

  • For the category “Aboriginal peoples”: 4 tables terminated out of a total of 7
  • For the category “Children and youth”: 89 tables terminated out of a total of 130
  • For the category “Families, households and housing”: 67 tables terminated out of a total of 112
  • For the category “Government”: 62 tables terminated out of a total of 141
  • For the category “Income, pensions, spending and wealth”: 41 tables terminated out of a total of 167
  • For the category “Seniors”: 13 tables terminated out of a total of 30

As far as Statistics Canada’s troubles go, this will never get the same level of attention as the mystery of the 200 jobs. But, as it relates to the long-term reliability of Canadian data, it’s just as serious.

Given my work using NHS data, particularly ethnic origin, visible minority and religion variables linked to social and economic outcomes, I am still at the exploration stage of finding out which data and linkages are available, and which are not.

The case of the disappearing Statistics Canada data

Neat Data Visualization: Net Neutrality

A good visualization that helps one understand the relationships and relative weight of the comments.

A Fascinating Look Inside Those 1.1 Million Open-Internet Comments 

Israel, Gaza, War & Data — i ❤ data — Medium

Twitter Mid-East solitudes

For data visualization geeks, as well as those more broadly interested in social networks and how they reinforce our existing views, this article by Gilad Lotan is a must-read (Haaretz, the left-wing Israeli newspaper, draws the most from both sides):

Facebook’s trending pages aggregate content that is heavily shared (“trending”) across the platform. If you’re already logged into Facebook, you’ll see a personalized view of the trend, highlighting your friends and their views on the trend. Give it a try.

Now open a separate browser window in incognito mode (in Chrome: File -> New Incognito Window) and navigate to the same page. Since the browser has no idea who you are on Facebook, you’ll get the raw, unpersonalized feed.

How are the two different?

Personalizing Propaganda

If you’re rooting for Israel, you might have seen videos of rocket launches by Hamas adjacent to Shifa Hospital. Alternatively, if you’re pro-Palestinian, you might have seen the following report on an alleged IDF sniper who admitted on Instagram to murdering 13 Gazan children. Israelis and their proponents are likely to see IDF videos such as this one detailing arms and tunnels found within mosques passed around in their social media feeds, while Palestinian groups are likely to pass around images displaying the sheer destruction caused by IDF forces to Gazan mosques. One side sees videos of rockets intercepted in the Tel-Aviv skies, and the other sees the lethal aftermath of a missile attack on a Gazan neighborhood.

The better we get at modeling user preferences, the more accurately we construct recommendation engines that fully capture user attention. In a way, we are building personalized propaganda engines that feed users content which makes them feel good and throws away the uncomfortable bits.

Worth reflecting upon. I try to follow a range of news and Twitter feeds to reduce this risk.

Israel, Gaza, War & Data — i ❤ data — Medium.

Professor goes to big data to figure out if Apple slows down old iPhones when new ones come out

Apple Slow iphones

A good illustration of the limits of big data and the risks of confusing correlation with causation. But big data and correlation can help us ask more informed questions:

The important distinction is of intent. In the benign explanation, a slowdown of old phones is not a specific goal, but merely a side effect of optimizing the operating system for newer hardware. Data on search frequency would not allow us to infer intent. No matter how suggestive, this data alone doesn’t allow you to determine conclusively whether my phone is actually slower and, if so, why.

In this way, the whole exercise perfectly encapsulates the advantages and limitations of “big data.” First, 20 years ago, determining whether many people experienced a slowdown would have required an expensive survey to sample just a few hundred consumers. Now, data from Google Trends, if used correctly, allows us to see what hundreds of millions of users are searching for, and, in theory, what they are feeling or thinking. Twitter, Instagram and Facebook all create what is evocatively called the “digital exhaust,” allowing us to uncover macro patterns like this one.

Second, these new kinds of data create an intimacy between the individual and the collective. Even for our most idiosyncratic feelings, such data can help us see that we aren’t alone. In minutes, I could see that many shared my frustration. Even if you’ve never gathered the data yourself, you’ve probably sensed something similar when Google’s autocomplete feature automatically suggests the next few words you are going to type: “Oh, lots of people want to know that, too?”

Finally, we see a big limitation: This data reveals only correlations, not conclusions. We are left with at least two different interpretations of the sudden spike in “iPhone slow” queries, one conspiratorial and one benign. It is tempting to say, “See, this is why big data is useless.” But that is too trite. Correlations are what motivate us to look further. If all that big data does – and it surely does more – is to point out interesting correlations whose fundamental reasons we unpack in other ways, that already has immense value.

And if those correlations allow conspiracy theorists to become that much more smug, that’s a small price to pay.
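
For anyone who wants to poke at the same kind of data themselves, a minimal sketch follows. It assumes the unofficial pytrends package (a third-party Google Trends client, not something used in the article itself), and the keyword and date range are only illustrative:

```python
# Sketch only: pull relative search interest for "iphone slow" from Google Trends,
# assuming the third-party pytrends library is installed (pip install pytrends).
import matplotlib.pyplot as plt
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(["iphone slow"], timeframe="2008-01-01 2014-08-01")
interest = pytrends.interest_over_time()  # DataFrame of relative volume, scaled 0-100

# Correlation, not causation: a spike near a release date suggests a pattern,
# but says nothing about whether phones actually slowed down, or why.
interest["iphone slow"].plot(title='Google Trends: "iphone slow"')
plt.show()
```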

Professor goes to big data to figure out if Apple slows down old iPhones when new ones come out

Pie Charts Are Terrible | Graph Graph

PieCharts_WorldBarColor

I am doing more and more charting to illustrate citizenship and multiculturalism issues. In consulting with people who have a better graphic sense than I do, I came across this convincing article and illustration against the use of pie charts.

It means I have to redo a number of the charts I have been working on, but it is always good to learn something new that helps tell the story (I haven’t read his recommended book yet):

Let’s start this off with some honesty.  I used to love pie charts.  I thought they were great, just like the way I used to think Comic Sans was the best font ever.

But then I had some #RealTalk, and I’ve been enlightened to the error of my ways, and I want to pass on what I’ve learned to show people why pie charts aren’t the best choice for visualization.  For my day job, part of my work involves creating visualizations out of business data for our customers.  I picked up a copy of “Information Dashboard Design,” a book by Stephen Few of Perceptual Edge.  If you’re at all interested in data visualization, I highly recommend his books, and on this site we attempt to use a lot of the principles in creating the visualizations we present to you.
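
To see the core objection for yourself, here is a minimal matplotlib sketch with made-up numbers (not drawn from the article): near-equal values are hard to rank by slice angle in a pie, but trivial to rank as bars on a common baseline.

```python
import matplotlib.pyplot as plt

# Hypothetical, near-equal category shares
labels = ["A", "B", "C", "D", "E"]
values = [23, 21, 20, 19, 17]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Pie: the reader has to compare angles and areas, which is error-prone
ax_pie.pie(values, labels=labels)
ax_pie.set_title("Pie: which slice is largest?")

# Bar: lengths on a common baseline make the ranking immediately clear
ax_bar.bar(labels, values)
ax_bar.set_title("Bar: the ranking is obvious")

plt.tight_layout()
plt.show()
```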

Pie Charts Are Terrible | Graph Graph.

‘Hawking index’ charts which bestsellers are the ones people never read

Fun example of innovative analysis (and for all those of you who claim to have read Piketty or other similar tomes):

Jordan Ellenberg, a mathematician at the University of Wisconsin, Madison, has just about proved this suspicion correct.

In a cheeky analysis of data from Kindle e-readers, Mr. Piketty’s daunting 700-page doorstopper emerged as the least read book of the summer, according to Prof. Ellenberg, who calls his ranking the Hawking Index in honour of Mr. Hawking’s tome, famous as the most unread book of all time.

As a result, he is tempted to rename it the “Piketty Index,” because Mr. Piketty scored even worse than Mr. Hawking.

As such, both stand as extreme case studies in aspirational reading. Like the Economist magazine’s Big Mac index of hamburger prices around the world, which is both silly and serious, Prof. Ellenberg’s Hawking Index is funny, in that it reveals the vanity of many book choices. But it also offers an interesting psychological perspective on reading that is born of good intentions, and dies of boredom on the dock or beach.

The calculation is simple, and as Prof. Ellenberg says, “quick and dirty.” It exploits a feature of Kindle that allows readers to highlight favourite quotes. It averages the page number of the five most highlighted passages in Kindle versions, and ranks that as a percentage of the total page count. Although it does not measure how far people read into a book, it makes a decent proxy for it.
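
The formula is simple enough to sketch in a few lines. This is my own back-of-the-envelope version of the calculation as described above, with made-up page numbers rather than Prof. Ellenberg’s actual data:

```python
def hawking_index(top_passage_pages, total_pages):
    """Average page number of the most-highlighted passages,
    expressed as a percentage of the book's total page count."""
    if not top_passage_pages or total_pages <= 0:
        raise ValueError("need highlighted passages and a positive page count")
    return 100.0 * sum(top_passage_pages) / (len(top_passage_pages) * total_pages)

# Illustrative only: five popular highlights clustered near the front
# of a 700-page book produce a very low score.
print(round(hawking_index([12, 18, 25, 30, 40], 700), 1))  # -> 3.6
```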

“Why do you buy a book? One reason is because you know you’re going to like it,” Prof. Ellenberg said. “Another reason might be, ‘Oh, I think this book will be good for me to read.’”

… He said his formula illustrates what mathematicians call the problem of inference, meaning he cannot say for sure these books are going unread, just that he has strong evidence for it.

“You can make some observation about the world, but there’s some underlying fact about the world that you’d like to know, and you want to kind of reverse engineer. You want to go backwards from what you observed to what you think is producing the data you see,” he said.

Other books reveal different insights into why people buy books they start but do not finish. Michael Ignatieff’s political memoir Fire And Ashes, for example, scores comparatively well for non-fiction at 44%, far better than Hillary Clinton’s Hard Choices, which barely cracked 2%. Lean In, the self-help book by Facebook executive Sheryl Sandberg, scored 12.3%.

In fiction, The Luminaries, by Canadian-born New Zealand author Eleanor Catton, which won last year’s Man Booker Prize, scores a mere 19%, and would score a lot lower if not for one highlighted quote near the end.

Prime Minister Stephen Harper’s book on hockey, A Great Game, curiously has no highlighted passages, so cannot be ranked on the Hawking Index (or, equivalently, ranks as low as is theoretically possible).

Fiction tended to score higher, likely reflecting the tendency for non-fiction authors to put quotable thesis statements in the introduction. The only novel that was down in the range of the non-fiction books was Infinite Jest by David Foster Wallace.

Prof. Ellenberg does not mean to disparage the low-ranking books, he said, noting that the reason people buy them in the first place is that they are rich in content.

“I think it’s good to do back of the envelope computations as long as you do them with the appropriate degree of humility, and understand what it is that they’re saying,” he said. “I think any statistical measure you make up, you take it as seriously as it deserves to be taken.”

‘Hawking index’ charts which bestsellers are the ones people never read

Better data alone won’t fix Canada’s economy – The Globe and Mail

Good piece on the need for a broader and more thoughtful approach to the use of data in a “big data” environment:

The bottom line is that being data-dependent doesn’t mean responding to every wiggle in data. Nor does it mean basing our decisions solely on data or models and nothing else. Yes, we need better data.

But that’s only a start. We also need to ask precise and well-posed questions – of ourselves in our analysis and of our policy makers in their choices – particularly as “big data” increases the availability of non-conventional data sources. In addition, we need to bring new approaches to bear on data, and clearly explain the results to non-specialists.

After concerns about jobs estimates during the Ontario election, let’s hope that the lesson learned by our politicians is not to withhold economic analysis in future campaigns. Instead, let’s hope it causes them to raise their game by presenting more credible analyses.

At the same time, let’s be realistic about what better data can accomplish. This means acknowledging that data give us imprecise measurements of reality, but when used responsibly and creatively, they help us make better choices and hold governments to account for their policy decisions.

Better data alone won’t fix Canada’s economy – The Globe and Mail.

Why Canada has a serious data deficit

More on the importance of good data by Barrie McKenna in the Globe’s Report on Business:

Prof. Gross [the C.D. Howe Institute researcher responsible for their study on Temporary Foreign Workers and their effect on increasing unemployment in AB and BC] acknowledged that perfect data is “very costly.”

So is bad data.

Employment Minister Jason Kenney recently imposed a moratorium on the use of temporary foreign workers in the restaurant industry, following embarrassing allegations of misuse by some McDonald’s franchises and other employers. And he has promised more reforms to come.

But who is to say that restaurants need imported foreign labour any less than hotels or coal mines, which are unaffected by the moratorium? And without better information, Mr. Kenney may compound his earlier decision to expand the program with an equally ill-considered move to shrink it.

The government’s troubles with the temporary foreign workers program are a classic case of bad data leading to dubious decision-making. Until recently, the government relied on inflated Finance Department job vacancy data, compiled in part by tracking job postings on Kijiji, a free classified-ad website. Statscan, meanwhile, was reporting that the national job vacancy rate was much smaller, and falling.

The problem goes way beyond temporary foreign workers. And it’s a data problem of the government’s own making. Ottawa has cut funds from important labour market research, slashed Statscan’s budget more savagely than many other departments, and scrapped a mandatory national census in favour of a less-accurate voluntary survey.

The Canadian government has demonstrated “a lack of commitment” to evidence-based decision-making and producing high-quality data, according to a global report on governance released last week by the Bertelsmann Foundation, a leading German think tank. The report ranked Canada in the middle of the pack and sliding on key measures of good governance compared with 40 other developed countries.

One of the disadvantages of being in government for almost 10 years is that decisions which may have appeared to be cost-free can come back to haunt you.

Why Canada has a serious data deficit – The Globe and Mail.

Meet Joe/Jose/Youssef Canada

A good overview of Canada based on the voluntary National Household Survey. While not as accurate as the mandatory long-form census cancelled by the current government (higher cost for poorer-quality data, less comparability with previous data), it captures the major trends at the national and provincial levels.

Meet Joe/Jose/Youssef Canada.