To truly modernize social programs, it will take big data and analytics

I may be biased, given some of the obstacles I faced when working on “citizen-centred service” in the early days of Service Canada, where even conceptual, pathfinder-style work on integrating disparate programs and services met resistance.

The complexity and coordination required, the organizational and even program stovepipes, and the sheer difficulty of developing and implementing such a change agenda make me a sceptic. After all, the government wasn’t even able to integrate pay services for its own employees with Phoenix, and has faced less visible problems and challenges with Shared Services Canada.

But of course, better use of big data, and better capacity to analyse the data, offer considerable potential to assess the effectiveness of current programs, identify gaps and improve outcomes:

Finding better ways of wiring for e-government is important and necessary. Nobody would disagree with the need for better computing, electronic communications and information management.

However, digital improvements will bring about only modest gains if they are applied to programs that, at core, are based on pre-computer technologies, as is the case with most of today’s social and health programs. Transformative changes in program objectives and designs based on big data and micro-analytic tools must be brought into the picture. We need to create a trove of “what works” data that will lead to individually tailored social programming.

Most of today’s social programs were designed decades ago and, reflecting the limits of the technology of the day, provide eligible individuals with standard services, products or income supports that are designed to address specific problems.

For example, an unemployed person might be assigned to a training course with a fixed duration and curriculum. A low-income senior will be provided with a top-up pension in an amount that is predetermined based on the individual’s annual income in the previous year. Someone diagnosed with a particular disease will receive a prescription for specific drugs. These benefits are provided by a variety of independent programs funded by different orders of government — and are often delivered by staff in the social work, education or health disciplines. It is difficult to coordinate or even communicate across these programs, which is why they are often referred to as program silos. Individuals who receive benefits are seen as recipients, clients, patients or students, not as citizens or partners.

It’s a reasonably efficient system that works reasonably well for most people most of the time. On balance, the results are positive, near the average of other OECD countries.

However, the system is seriously showing its age.

The underlying weakness shows up most starkly in the way the system deals with the most vulnerable. People who are most in need of health and social services or income support often face multiple obstacles in life. They might lack several types of skills, have inadequate housing and poor jobs, have differing degrees of family support and financial assets, and face a variety of health, disability and addiction issues. People with multiple needs can face an almost impenetrable array of separate programs, each with different terms and conditions and offering solutions that are partial at best. Even with the help of experts and case managers, it is often impossible to create sensible combined packages of benefits to meet individual needs.

For decades, service providers and groups representing the vulnerable have pointed out the problem of trying to shoehorn people into this complex system of fractured supports and benefits.

And for many years, policy documents relating to education, health and social policy have called for a more holistic approach, with benefits directly tailored to the diverse needs of individuals. These have been referred to as student-centred, citizen-centric and, more recently, individually driven approaches. In health, related aspirations are often referred to as precision medicine or personalized medicine, where medical treatments, practices or products are tailored to the individual patient.

However, the called-for changes have not occurred. Experiments, demonstrations and other initiatives that have attempted to cut across the boundaries of the program silos have proven difficult to sustain and have typically had little impact on the design of mainstream programming.

There are good reasons for the lack of success in moving outside traditional silos:

  • Traditional programming makes it relatively easy to provide ongoing funding, to ensure ministerial accountability and to provide the high professional standards that can ensure, for example, the quality of health and educational interventions.
  • No organization has a mandate to develop interventions that cross these traditional program boundaries.
  • The empirical data needed to assess the effectiveness of tailor-made, holistic interventions are underdeveloped and are certainly not yet strong enough to create the needed accountability arrangements. Strong accountability regimes — the monitoring and evaluation activities that ensure that money is spent effectively, transparently and in line with intended objectives — are essential if reform is to be sustained.

But the solution is on the near horizon: big data and predictive analytics. They offer the opportunity for all citizens to become real partners in the design and implementation of the social and health programs that affect their lives. They can provide transformative gains, particularly for people who are most vulnerable.

This technology is in use in other applications and can be applied to social policy. I discussed it in an IRPP essay I wrote in 2015, The Enabling Society. At its core, individuals would have access to information at the very time they need to make big social and health decisions, so they can make well-informed choices about which combination of training, social services, housing, income supports and health interventions is likely to work best for them. This information would be calculated from large data sets that record the experience of people who have been in similar circumstances and had similar aspirations in the past. This technology produces information that allows all dimensions of the system to work in harmony:

  • The “what works” information would also be available to case workers, teachers, health professionals and other front-line staff so they can become partners in helping individuals put together flexible packages of interventions that are most likely to meet an individual’s particular needs and aspirations, including benefits provided by programs originating in different disciplines and orders of government.
  • The same information would provide the designers and administrators of the many independent traditional programs with the tools to make improvements steadily and automatically over time based on feedback loops that routinely describe which features of the program are working best and for whom — and at what cost.
  • The same information would also support rigorous accountability regimes both for existing program silos and for the flexible arrangements that provide individually tailored packages of interventions.
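The essay does not spell out an algorithm, but one minimal way to picture the “what works” lookup described above is a similarity match against records of past cases: find people who were in comparable circumstances, then report how different intervention packages worked out for them. The sketch below is purely illustrative; every field name, value and the crude distance measure are hypothetical stand-ins for the rich, linked administrative data a real system would need.

```python
# Illustrative "what works" lookup: given a new client's characteristics, find
# past cases in similar circumstances and report which intervention packages
# were followed by the best outcomes for that group. All fields and figures
# are hypothetical; a real system would draw on linked, anonymized records.
import pandas as pd

# Hypothetical historical records: characteristics, intervention received, outcome.
history = pd.DataFrame([
    {"age": 24, "education_years": 10, "months_unemployed": 8,  "intervention": "skills_training",    "employed_12m_later": 1},
    {"age": 26, "education_years": 11, "months_unemployed": 10, "intervention": "skills_training",    "employed_12m_later": 1},
    {"age": 25, "education_years": 10, "months_unemployed": 9,  "intervention": "wage_subsidy",       "employed_12m_later": 0},
    {"age": 52, "education_years": 16, "months_unemployed": 4,  "intervention": "job_search_support", "employed_12m_later": 1},
    {"age": 55, "education_years": 15, "months_unemployed": 6,  "intervention": "skills_training",    "employed_12m_later": 0},
])

def what_works(client, k=3):
    """Return outcome rates by intervention among the k most similar past cases."""
    features = ["age", "education_years", "months_unemployed"]
    # Simple normalized distance; a production system would use richer matching.
    scaled = (history[features] - history[features].mean()) / history[features].std()
    target = (pd.Series(client)[features] - history[features].mean()) / history[features].std()
    distance = ((scaled - target) ** 2).sum(axis=1)
    similar = history.loc[distance.nsmallest(k).index]
    return similar.groupby("intervention")["employed_12m_later"].agg(["mean", "count"])

print(what_works({"age": 25, "education_years": 10, "months_unemployed": 9}))
```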

Such a system would result in huge gains on multiple fronts: in individual and social well-being, in effectiveness, in reduced cost, in the openness and accountability of public programs and in the ability of different orders of government to work together more harmoniously and in a way that treats citizens as main partners in shaping and delivering social programs.

A radically different approach along these lines, one that so dramatically changes the relationship between government and citizen, obviously cannot be attained overnight.

We should start small, in areas where mechanisms already exist to allow cooperation across jurisdictional and program borders and where the needed “what works” information is already well developed. There are a number of possible starting points.

For example, Employment and Social Development Canada could work with one or more provinces in introducing “what works” information into the daily operation of training and other employment programs on an experimental or demonstration basis under the authority of existing labour market agreements, which provide federal funding to support provincial and territorial employability initiatives. These agreements already allow considerable flexibility in the funding and development of innovative employment programs. As well, the needed “what works” data have already been developed and are already routinely used in the evaluation of these projects.

Once their practicality and effectiveness have been clearly demonstrated, these small initiatives could extend naturally and gradually to other areas and would, eventually, become the normal way we do business.

At the same time, the government of Canada should undertake a large-scale exercise to develop big data from administrative sources, such as anonymized information from tax files, employment insurance records and provincial training and social assistance files. It should also develop the associated analytic tools that will allow us to use these rich data to better understand individual behaviour and the kinds of social interventions that are likely to work best at the level of particular individuals.
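As a rough illustration of what “developing big data from administrative sources” involves mechanically, the sketch below links two hypothetical program extracts on a one-way hashed identifier. The file layouts, salt handling and matching are deliberately simplified assumptions; real linkage environments, such as Statistics Canada’s linked administrative files, rely on far stricter de-identification, probabilistic matching and access controls.

```python
# Minimal sketch of linking administrative files on an anonymized key, using
# hypothetical extracts that share a personal identifier. Illustrative only.
import hashlib
import pandas as pd

SALT = "replace-with-secret-salt"  # held by the data custodian, never shared

def anonymize(identifier: str) -> str:
    """One-way hash of a personal identifier so analysts never see the raw value."""
    return hashlib.sha256((SALT + identifier).encode("utf-8")).hexdigest()

# Hypothetical extracts from two separate programs.
tax_records = pd.DataFrame({
    "person_id": ["111-222-333", "444-555-666"],
    "employment_income": [18500, 42000],
})
training_records = pd.DataFrame({
    "person_id": ["111-222-333", "777-888-999"],
    "completed_training": [True, False],
})

# Replace the raw identifier with its hash before analysts ever touch the files.
for df in (tax_records, training_records):
    df["link_key"] = df["person_id"].map(anonymize)
    df.drop(columns="person_id", inplace=True)

# Join the de-identified files to study outcomes across programs.
linked = tax_records.merge(training_records, on="link_key", how="inner")
print(linked)
```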

Such fundamental but gradual changes in the purpose and design of programs need to go hand in hand with the deep reforms in digital processes that have been discussed in this Policy Options series, including a stronger capacity throughout all parts of government in the use of computers, electronic communications and information management. Process reforms could, in some cases, increase efficiency and improve service delivery and customer satisfaction. They could also provide somewhat better information about what programs and benefits are available and allow greater access to administrative data that have been collected. However, if such reforms are made in isolation, divorced from deep “what works” changes in program goals and designs, they risk creating expectations for change that cannot be met.

Those in the e-government community are not directly responsible for changing basic social policy directions or reforming the structure of social programs, but they can nevertheless play a pivotal role in the development and use of big data and predictive analytics. This might at minimum involve active support for reforms along the lines laid out in a paper by the Experts Panel on Income Security of the Council on Aging of Ottawa that describes the kind of micro-level data and microanalytic tools that are needed. Such support, along with process reform, could go a long way in finally enabling the real transition to the digital world.

Source: To truly modernize social programs, it will take big data and analytics

Identifying radical content online: Ryan Scrivens

I only wish we could use some of these analytical tools to better understand overall integration and the role that social networks play in either increasing integration or allowing individuals and groups to remain within their own community or group.

Violent extremists and those who subscribe to radical beliefs have left their digital footprints online since the inception of the World Wide Web. Notable examples include Anders Breivik, the Norwegian far-right terrorist convicted of killing 77 people in 2011, who was a registered member of a white supremacy web forum and had ties to a far-right wing social media site; Dylann Roof, the 21-year-old who murdered nine Black parishioners in Charleston, South Carolina, in 2015, and who allegedly posted messages on a white power website; and Aaron Driver, the Canadian suspected of planning a terrorist attack in 2016, who showed outright support for the so-called Islamic State on several social media platforms.

It should come as little surprise that, in an increasingly digital world, identifying signs of extremism online sits at the top of the priority list for counter-extremist agencies. Within this context, researchers have argued that successfully identifying radical content online, on a large scale, is the first step in reacting to it. Yet in little more than a decade, the number of individuals with access to the Internet is estimated to have more than tripled, from over 1 billion users in 2005 to more than 3.8 billion as of 2018. With all of these new users, more information has been generated, leading to a flood of data.

It is becoming increasingly difficult, nearly impossible really, to manually search for violent extremists, potentially violent extremists or even users who post radical content online because the Internet contains an overwhelming amount of information. These new conditions have necessitated the creation of guided data filtering methods, which may replace the laborious manual methods that traditionally have been used to identify relevant information.

Governments in Canada and around the globe have engaged researchers to develop advanced information technologies, machine-learning algorithms and risk-assessment tools to identify and counter extremism through the collection and analysis of big data available online. Whether this work involves finding radical users of interest, measuring digital pathways of radicalization or detecting virtual indicators that may prevent future terrorist attacks, the urgent need to pinpoint radical content online is one of the most significant policy challenges faced by law enforcement agencies and security officials worldwide.

We have been part of this growing field of research at the International CyberCrime Research Centre, hosted at Simon Fraser University’s School of Criminology. Our work has ranged from identifying radical authors in online discussion forums to understanding terrorist organizations’ online recruitment efforts on various online platforms. These experiences have provided us with insights we can offer regarding the policy implications of conducting large-scale data analyses of extremist content online.

First, there is much that practitioners and policy-makers can learn about extremist movements by studying their online activities. Online discussion forums of the radical right or social media accounts of radical Islamists, for example, are rich with information about how members of a particular movement communicate, how they construct their radical identities, and who they are targeting — discussions, behaviours and actions that can spill over into the offline realm. Exploring the dark corners of the Internet can be helpful in understanding or perhaps even predicting trends in activity or behaviour before they happen in the offline world. If, for example, analysts can track an author’s online activity or identify an online trend that is becoming more radical over time, analysts may be in a better position to assist law enforcement officials and the intelligence community. At the same time, it is important to note that online behaviour often does not translate into offline behaviour; authorities must proceed with caution to ascertain the specific nature of an instance of online activity and the potential threat it poses.

Second, practitioners and policy-makers can gain valuable information about extremist movements by utilizing computational tools to study radical online activities. Our research suggests that it is possible to identify radical topics, authors or even behaviours in online spaces that contain an overwhelming amount of information. Signs of extremism can be found by drawing upon keyword-retrieval software that identifies and counts a specific set of words, or sentiment analysis programs that classify and categorize opinions in a piece of text. Large-scale, semi-automated analyses can provide practitioners and policy-makers with a macro-level understanding of extremist movements online, ranging from their radical ideology to their actual activities. This understanding, in turn, can assist in the development of counter-narratives or deradicalization and disengagement programs to counter violent extremism.
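For readers unfamiliar with the tools mentioned above, here is a minimal, self-contained sketch of both ideas: counting hits against a keyword watch-list and scoring sentiment with a tiny lexicon. The watch-list, lexicon and posts are toy examples of my own; operational systems use curated vocabularies, trained classifiers and human review.

```python
# Toy sketch of keyword retrieval and lexicon-based sentiment scoring.
# Watch-list, lexicon and posts are invented for illustration only.
import re
from collections import Counter

WATCHLIST = {"enemy", "invasion", "traitor"}   # hypothetical watch-list terms
POSITIVE = {"good", "great", "support"}        # tiny illustrative sentiment lexicon
NEGATIVE = {"hate", "destroy", "enemy", "traitor"}

posts = [
    "They are the enemy and we must resist the invasion.",
    "Great meetup today, lots of support from the community.",
]

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

for post in posts:
    tokens = tokenize(post)
    hits = Counter(t for t in tokens if t in WATCHLIST)
    # Crude sentiment score: (positive hits - negative hits) / total tokens.
    score = (sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)) / len(tokens)
    print(f"keywords={dict(hits)} sentiment={score:+.2f} | {post}")
```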

We must caution practitioners and policy-makers that our work suggests there is no simple typology or behaviour that best describes radical online activity or what constitutes radical content online. Instead, extremism comes in many shapes and sizes and varies with the online platform: some radical platforms, for example, promote blatant forms of extremism while other platforms encourage their subscribers to tone down the rhetoric and present their extremist views in a subtler manner. Nonetheless, a useful starting point in identifying signs of extremism online is to go directly to the source: identifying topics of discussion that are indeed radical at the core — with language that describes the “enemies” of the extreme right, for example, such as derogatory terms about Jews, Blacks, Muslims or LGBTQ communities.

Lastly, in order to gain a broader understanding of online extremism or to improve the means by which researchers and practitioners “search for a needle in a haystack,” social scientists and computer scientists should collaborate with one another. Historically, large-scale data analyses have been conducted by computer scientists and technical experts, which can be problematic in the field of terrorism and extremism research. These experts tend to take a high-level methodological perspective, measuring levels of — or propensity toward — radicalization or ways of identifying violent extremists or predicting the next terrorist attack. But searching for radical material online without a fundamental understanding of the radicalization process or how extremists and terrorists use the Internet can be counterproductive. Social scientists, on the other hand, may be well-versed in terrorism and extremism research, but most tend to be ill-equipped to manage big data — from collecting to formatting to archiving large volumes of information. Bridging the computer science and social science approaches to build on the strengths of each discipline offers perhaps the best chance to construct a useful framework for assisting authorities in addressing the threat of violent extremism as it evolves in the online milieu.

via Identifying radical content online

We all thought having more data was better. We were wrong. – Recode

An interesting set of arguments against the use of big data in all circumstances, and for the value of small, focussed data sets:

For years, the mantra in the world of business software and enterprise IT has been “data is the new gold.” The idea was that companies of nearly every shape and size, across every industry imaginable, were essentially sitting on top of buried treasure that was just waiting to be tapped into. All they needed to do was to dig into the correct vein of their business data trove and they would be able to unleash valuable insights that could unlock hidden business opportunities, new sources of revenue, better efficiencies and much more.

Big software companies like IBM, Oracle, SAP and many more all touted these visions of data grandeur, and turned the concept of big data analytics, or just Big Data, into everyday business nomenclature.

Even now, analytics is also playing an important role in the Internet of Things, on both the commercial and industrial side, as well as on the consumer side. On the industrial side, companies are working to mine various datastreams for insights into how to improve their processes, while consumer-focused analytics show up in things like health and fitness data linked to wearables, and will soon be a part of assisted and autonomous driving systems in our cars.

Of course, the everyday reality of these grand ideas hasn’t always lived up to the hype. While there certainly have been many great success stories of companies reducing their costs or figuring out new business models, there are probably an equal (though unreported) number of companies that tried to find the gold in their data — and spent a lot of money doing so — but came up relatively empty.

The truth is, analytics is hard, and there’s no guarantee that analyzing huge chunks of data is going to translate into meaningful insights. Challenges may arise from applying the wrong tools to a given job, not analyzing the right data, or not even really knowing exactly what to look for in the first place. Regardless, it’s becoming clear to many organizations that a decade or more into the “big data” revolution, not everyone is hitting it rich.

Part of the problem is that some of the efforts are simply too big — at several different levels. Sometimes the goals are too grandiose, sometimes the datasets are too large, and sometimes the valuable insights are buried beneath a mound of numbers or other data that just really isn’t that useful. Implicit in the phrase “big data,” as well as the concept of data as gold, is that more is better. But in the case of analytics, there is a legitimate question worth considering: Is more data really better?

In the world of IoT, for example, many organizations are realizing that doing what I call “little data analytics” is actually much more useful. Instead of trying to mine through large datasets, these organizations are focusing their efforts on a simple stream of sensor-based data or other straightforward data collection work. For the untold number of situations across a range of industries where these kinds of efforts haven’t been done before, the results can be surprisingly useful. In some instances, these projects create nothing more than a single insight into a given process for which companies can quickly adjust — a “one and done” type of effort — but ongoing monitoring of these processes can ensure that the adjustments continue to run efficiently.

Of course, it’s easy to understand why nobody really wants to talk about little data. It’s not exactly a sexy, attention-grabbing topic, and working with it requires much less sophisticated tools — think Excel spreadsheet (or the equivalent) on a PC, for example. The analytical insights from these “little data” efforts are also likely to be relatively simple. However, that doesn’t mean they are less practical and valuable to an organization. In fact, building up a collection of these little data analytics could prove to be exactly what many organizations need. Plus, they’re the kind of results that can help justify the expenses necessary for companies to start investing in IoT efforts.
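To make the “little data” point concrete, the sketch below does the spreadsheet-level check described here: one stream of sensor readings and a single rule that flags drift outside a normal operating band. The readings and the three-sigma threshold are made-up assumptions; the point is only that the analysis needs nothing more sophisticated than an Excel formula.

```python
# "Little data" check: flag sensor readings that drift above a normal band.
# Readings and threshold are invented; the baseline is assumed to be normal operation.
from statistics import mean, stdev

readings = [71.2, 70.8, 71.5, 70.9, 71.1, 74.6, 75.2, 71.0, 70.7, 75.9]  # e.g., motor temperature in C

baseline = readings[:5]                          # assume the first readings reflect normal operation
upper_limit = mean(baseline) + 3 * stdev(baseline)

for i, value in enumerate(readings):
    if value > upper_limit:
        print(f"reading {i}: {value} C exceeds {upper_limit:.1f} C - check the equipment")
```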

To be fair, not all applications are really suited for little data analytics. Monitoring the real-time performance of a jet engine or even a moving car involves a staggering amount of data that’s going to continue to require the most advanced computing and big data analytics tools available.

But to get more real-world traction for IoT-based efforts, companies may want to change their approach to data analytics efforts and start thinking small.

Source: We all thought having more data was better. We were wrong. – Recode

How the parties collect your personal info — and why Trudeau doesn’t seem to mind: Delacourt

Great piece by Delacourt:

Numbers are definitely in fashion in the new Liberal government at the moment — and not just because the budget is landing next week.

A first-ever session on “behavioural economics” for public servants was filled to capacity last week, according to a Hill Times report. “Combining economics with behavioural psychology,” said PCO spokesperson Raymond Rivet, “this new tool can help governments make services more client-focused, increase uptake of programs, and improve regulatory compliance.”

Better government through behavioural economics — the idea was popularized by the 2009 book Nudge and almost immediately adopted through the establishment of a “nudge unit” by the British government in 2010. Justin Trudeau’s government is already borrowing the concept of “deliverology” from the Brits, so the ‘nudge’ was never going to be far behind. President Barack Obama, Trudeau’s new best friend, also has taken steps to introduce nudge theory to the U.S. government in recent years.

But the real motivation for data-based governance in the Trudeau government may have come from a source much closer to home — the recent election, specifically the Liberals’ extensive use of big data to win 184 seats last fall. Make no mistake: Trudeau’s Liberals may have won the election by promising intangibles like ‘hope’ and ‘change’, but they sealed the deal with a sophisticated data campaign and ground war.

So now that the Liberals have seen how mastery of the numbers can help win elections, we probably shouldn’t be too surprised that they see those same skills as useful for governing as well. Big-data politics is here to stay.

What’s missing from that equation, however — at least on the political side — is privacy protection. Late last week, while everyone’s attention was fixated on Washington, federal Privacy Commissioner Daniel Therrien reminded a Commons committee that all the political parties are amassing data on voters without any laws to guard citizens’ privacy.

“While the Privacy Act is probably not the best instrument to do this, Parliament should also consider regulating the collection, use and disclosure of personal information by political parties,” Therrien told the Commons committee on access to information and privacy.

A little more than a year ago, it seemed that a new Liberal government could be expected to agree with the privacy commissioner.

Recall last year’s conference on “digital governance” in Ottawa; on stage for one panel discussion were key strategists for the three main parties — Tim Powers for the Conservatives, Brad Lavigne for the New Democrats and Gerald Butts for the Liberals. Mr. Butts is, of course, now Trudeau’s principal secretary.

Fielding questions from the audience, the three were asked whether political databases should be subject to Canadian privacy laws. Powers and Lavigne demurred; only Butts seemed to be saying ‘yes’.

Here’s his lengthy quote, which appeared a few weeks later in an iPolitics column by Chris Waddell:

“Let’s not kid ourselves, political parties are public institutions of a sort. They are granted within national or sub-national legislation special status on a whole variety of fronts, whether they be the charitable deduction, the exemption from access to information — all those sorts of things,” Butts told the conference.

“We have created a whole body of law … or maybe we haven’t. Maybe we have just created a hole in our two bodies of law that allow political parties to exist out there in the ether. I think that is increasingly a problem and it is difficult for me to envision a future where it exists for much longer.”

That was a year ago. And unless I missed it, there’s nothing in any of Trudeau’s mandate letters to ministers about new privacy laws for political parties. And without giving away too much about the new chapters of my soon-to-be-re-released book on political marketing, I didn’t get the impression during our recent interview that Prime Minister Trudeau was greatly troubled by the collision between privacy protection and political databases.

It seems odd to me that citizens can get (often appropriately) worked up about “intrusive” government measures, whether it’s the census or the C-51 anti-terrorism law, and yet be mostly indifferent to what the chief electoral officer has called the “Wild West” of political data collection.

Even Conservatives who resented the gun registry didn’t seem to mind that their own party was keeping track of gun owners in its database, so that it could send them specially targeted fundraising messages from time to time. That’s just behavioural economics, applied to the political arena.

So far, British Columbia is the only province to take steps to put political databases in line with privacy protection. The provincial chief of elections in B.C., Keith Archer, notified political parties that they would not get access to the voters’ list — the raw material of any political database — if they failed to comply with privacy laws.

That step could — and should — be implemented in Ottawa, too. We’re in the era of big-data politics and behavioural-insight governance, and Canadians are entitled to some accountability about the data the governing party is collecting and using on them.

Not so long ago, one of Trudeau’s most senior advisers agreed with that idea. Maybe all it takes is a little nudge.

Source: How the parties collect your personal info — and why Trudeau doesn’t seem to mind – iPolitics

Social Assistance Receipt Among Refugee Claimants in Canada: Evidence from Linked Administrative Data Files

A good illustration of the benefits of evidence-based policy-making through linking administrative and economic data. The analysis is a bit dry, but it essentially shows that the number accessing social assistance declines with time yet remains above the Canadian average:

Focusing on the middle estimate [which excluded non-linked files], the receipt of SA in year t+1 among the 2005-to-2010 claimant cohorts generally ranged between 80% and 90% across family types, with rates highest among lone mothers and couples with more than two children. Similarly, the incidence of SA receipt generally ranged from about 80% to 90% across families in which the oldest member was between 19 to 24 and 55 to 64 years of age. Across provinces, the incidence of SA receipt in year t+1 was generally highest in Quebec, at over 85%, and lowest in Alberta, at under 60%.

SA receipt varied considerably across country of citizenship. Refugee claimants from countries such as Afghanistan, Colombia, the Democratic Republic of Congo, Eritrea, and Somalia all had relatively high SA rates (close to or above 90%) throughout most of the study period, while rates were lower among refugee claimants from Bangladesh, Haiti, India, and Jamaica (generally below 80%).

The rates of SA receipt tended to decline sharply in the years following the start of the refugee claim. Between years t+1 and t+2, rates fell by about 20 percentage points among most claimant cohorts, declining a further 15 percentage points between t+2 and t+3, and 10 percentage points between t+3 and t+4. By t+4, between 25% and 40% of refugee claimants received SA. However, it is important to recall that these figures pertain to the diminishing group of refugee claimants whose claims remained open up to that year. These figures are also well above the Canadian average of about 8%.

Among refugee claimant families that received SA in year t+1, the average total family income typically ranged from about $19,000 to $22,000, with SA benefits accounting for $8,000 to $11,000—or about 40% to 48%—of that total.

In aggregate terms, SA income paid to all recipients in Canada totaled $10 billion to $13 billion in most years. Given their relatively small size as a group, the dollar amount of SA paid to refugee claimant families amounted to between 1.9% and 4.4% of that total, depending on the year and on the treatment of unlinked cases.

Source: Social Assistance Receipt Among Refugee Claimants in Canada: Evidence from Linked Administrative Data Files

Turning to Big, Big Data to See What Ails the World

Good examples of how big data can help identify the more important issues and the consequent shift in focus from death to disability:

The disconnect between what we think causes the most suffering and what actually does persists today. It is partly a function of success. Diarrhea, pneumonia and childbirth deaths have greatly declined, and deaths from malaria and AIDS have fallen, although far less dramatically. (The charts here show the stunning improvement in health around the world. And here are similar charts tracking progress in hunger, poverty and violence — a big picture that’s an important counterpoint to the constant barrage of negative world news.) This success is partly due to changes made because of the first Global Burden reports.

The downside is that longer lives mean people are living long enough to develop diabetes and Alzheimer’s. “What decline we’re seeing from communicable diseases, we’re seeing a compensatory increase from diabetes,” Murray said. And neurological diseases such as Alzheimer’s now account for twice as many years lived with disability as cardiovascular and circulatory diseases together, Smith writes.

This is not simply because people are living longer. It’s also a function of worsening diet everywhere, as poor societies adopt the processed foods found in rich ones.

The most surprising information, though, came not in measuring deaths, but disability. “Major depression caused more total health loss in 2010 than tuberculosis,” Smith writes. Neck pain caused more health loss than any kind of cancer, and osteoarthritis caused more than natural disasters. For other findings that may surprise you, see the quiz.
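The “health loss” being compared here is measured in disability-adjusted life years (DALYs), which is how the Global Burden of Disease studies can rank depression or neck pain against fatal diseases. The worked example below uses the standard DALY formula with invented numbers, purely to show why a common, disabling but rarely fatal condition can outrank a more lethal one.

```python
# DALYs combine years of life lost to early death (YLL) with years lived with
# disability (YLD). The formula is standard GBD methodology; the numbers below
# are made up purely for illustration.
def dalys(deaths, years_lost_per_death, prevalent_cases, disability_weight):
    yll = deaths * years_lost_per_death           # years of life lost to premature death
    yld = prevalent_cases * disability_weight     # years lived with disability
    return yll + yld

# Hypothetical condition A: often fatal, relatively rare.
condition_a = dalys(deaths=10_000, years_lost_per_death=20, prevalent_cases=50_000, disability_weight=0.10)

# Hypothetical condition B: rarely fatal, but very common and disabling.
condition_b = dalys(deaths=500, years_lost_per_death=25, prevalent_cases=2_000_000, disability_weight=0.15)

print(f"condition A: {condition_a:,.0f} DALYs")   # 205,000
print(f"condition B: {condition_b:,.0f} DALYs")   # 312,500
```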

The report is a giant compilation of “who knew?”

Based on this information, countries and international organizations have been able to change how they spend their health resources, and some ambitious countries have done their own national Burden of Disease studies.

Iran, writes Smith, found that traffic injury was its leading preventable cause of health loss in 2003, and put money into building new roads and retraining police. It also targeted two other big problems its study found: suicide and heart disease.

Australia, responding to the high impact of depression, began offering cost-free short-term depression therapy.

Mexico was one of the countries making the most use of Global Burden of Disease data, after Julio Frenk became health minister in 2000. Frenk had been Murray’s boss at the W.H.O., and a participant in Murray’s work. He found that Mexico’s health system was targeting the communicable diseases that predominated in 1950, not what currently ailed Mexicans. In response, Frenk established universal health insurance (before that, 50 million were uninsured) and set coverage according to the burden of disease.

The program covered emergency care for car accidents, treatment of mental illness, cataracts, and breast and cervical cancer — all of which had been uncovered, even for people with insurance. “You want to cover those interactions that give you the highest gain,” he said.

Murray and company have now branched out beyond diagnosis to measuring treatment: How many people really have access to programs like anti-malaria bed nets or contraception? How much is being spent and what does it buy? Where are the most useful points of intervention? Meanwhile, data from the Global Burden reports is seeping further into health policy decisions around the world — data that saves suffering and money and lives.

via Turning to Big, Big Data to See What Ails the World – NYTimes.com.

Research based on social media data can contain hidden biases that ‘misrepresent real world,’ critics say

Good article on some of the limits in using social media for research, as compared to IRL (In Real Life):

One is ensuring a representative sample, a problem that is sometimes, but not always, solved by ever greater numbers. Another is that few studies try to “disentangle the human from the platform,” to distinguish the user’s motives from what the media are enabling and encouraging him to do.

Another is that data can be distorted by processes not designed primarily for research. Google, for example, stores only the search terms used after auto-completion, not the text the user actually typed. Another is simply that many social media are largely populated by non-human robots, which mimic the behaviour of real people.

Even the cultural preference in academia for “positive results” can conceal the prevalence of null findings, the authors write.

“The biases and issues highlighted above will not affect all research in the same way,” the authors write. “[But] they share in common the need for increased awareness of what is actually being analyzed when working with social media data.”

Research based on social media data can contain hidden biases that ‘misrepresent real world,’ critics say

9 Ugly Lessons About Sex From Big Data | TIME

Interesting example of big data and some reminders that we are not yet living in a post-racial society:

5. According to Rudder’s research, Asian men are the least desirable racial group to women… On OkCupid, users can rate each other on a 1 to 5 scale. While Asian women are more likely to give Asian men higher ratings, women of other races—black, Latina, white—give Asian men a rating between 1 and 2 stars less than what they usually rate men. Black and Latin men face similar discrimination from women of different respective races, while white men’s ratings remain mostly high among women of all races.

6. …And black women are the least desirable racial group to men. Pretty much the same story. Asian, Latin and white men tend to give black women 1 to 1.5 stars less, while black men’s ratings of black women are more consistent with their ratings of all races of women. But women who are Asian and Latina receive higher ratings from all men—in some cases, even more so than white women.

8. Your Facebook Likes can reveal your gender, race, sexuality and political views. A group of UK researchers found that based on someone’s Facebook Likes alone, they can tell if a user is gay or straight with 88% accuracy; lesbian or straight, 75%; white or black, 95%; man or woman, 93%; Democrat or Republican, 85%.

9 Ugly Lessons About Sex From Big Data | TIME.

Professor goes to big data to figure out if Apple slows down old iPhones when new ones come out


A good illustration of the limits of big data and the risks of confusing correlation with causation. But big data and correlation can help us ask more informed questions:

The important distinction is of intent. In the benign explanation, a slowdown of old phones is not a specific goal, but merely a side effect of optimizing the operating system for newer hardware. Data on search frequency would not allow us to infer intent. No matter how suggestive, this data alone doesn’t allow you to determine conclusively whether my phone is actually slower and, if so, why.

In this way, the whole exercise perfectly encapsulates the advantages and limitations of “big data.” First, 20 years ago, determining whether many people experienced a slowdown would have required an expensive survey to sample just a few hundred consumers. Now, data from Google Trends, if used correctly, allows us to see what hundreds of millions of users are searching for, and, in theory, what they are feeling or thinking. Twitter, Instagram and Facebook all create what is evocatively called the “digital exhaust,” allowing us to uncover macro patterns like this one.
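For anyone who wants to reproduce the basic check behind the column, the sketch below pulls the same weekly search series and compares interest in the two months after a typical September iPhone launch with the rest of the year. It assumes pytrends, an unofficial third-party Google Trends client (pip install pytrends), network access, and that September/October is a fair proxy for the post-release window.

```python
# Pull weekly Google Trends interest for "iphone slow" and ask whether spikes
# cluster around iPhone release months. Uses pytrends, an unofficial client
# whose interface may change; results depend on Google's normalized index.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["iphone slow"], timeframe="2014-01-01 2018-12-31")
interest = pytrends.interest_over_time()["iphone slow"]

# New iPhones shipped each September over this period; compare the two months
# after launch with the rest of the year.
post_release = interest.index.month.isin([9, 10])
print("mean interest, Sept-Oct:    ", round(interest[post_release].mean(), 1))
print("mean interest, other months:", round(interest[~post_release].mean(), 1))
# A large gap is a correlation consistent with the spike pattern; it still says
# nothing about whether any slowdown is deliberate.
```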

Second, these new kinds of data create an intimacy between the individual and the collective. Even for our most idiosyncratic feelings, such data can help us see that we aren’t alone. In minutes, I could see that many shared my frustration. Even if you’ve never gathered the data yourself, you’ve probably sensed something similar when Google’s autocomplete feature automatically suggests the next few words you are going to type: “Oh, lots of people want to know that, too?”

Finally, we see a big limitation: This data reveals only correlations, not conclusions. We are left with at least two different interpretations of the sudden spike in “iPhone slow” queries, one conspiratorial and one benign. It is tempting to say, “See, this is why big data is useless.” But that is too trite. Correlations are what motivate us to look further. If all that big data does – and it surely does more – is to point out interesting correlations whose fundamental reasons we unpack in other ways, that already has immense value.

And if those correlations allow conspiracy theorists to become that much more smug, that’s a small price to pay.

Professor goes to big data to figure out if Apple slows down old iPhones when new ones come out