Illustration: Tom Williams

We have all heard of the Facebook data scandal, of Cambridge Analytica, and of the Yahoo breaches disclosed in 2016. We said we'd had enough. Yet while we stand on the moral high ground as victims and talk extensively about the limits of our right to privacy, few of us have paid attention to what big data and machine learning actually are. Perhaps more dangerously, few of us care. This story hopes first to offer a crash course in the familiar (and unfamiliar) field of big data and machine learning, and then to give some of our own views on the recent scandals.

Now, take a deep breath. We are about to dive into the world of big data and machine learning.

To ease your mind, let's begin with an anecdote. A couple of months ago, when Rebecca was covering a big data and credit rating company for Forbes Asia, she was given a wonderful Big Data 101 course by the company's Chief Technology Officer. As a prospective literature student, she drew an analogy between a big data robot and a student of Romantic literature.

Imagine the machine as a college literature student specialising in Jane Austen's works. She owns (note: not reads, but owns) all of Jane Austen's books, from the first editions to the most recent Norton editions. Nor does she own just Austen's novels, but also her letters to her love interest, Tom Lefroy, and even her personal diaries.

One day the student is tasked with finding out which female character will marry the richest male character in Pride and Prejudice. The student has read some, but not all, of Jane Austen's novels, and she has never read Pride and Prejudice. Of course, she could read the book thoroughly and develop a well-founded argument. However, she is too lazy to read the entire book.

Based on her previous Austen reading, she knows that if a girl elopes with a young man, the girl usually does not marry the richest man in the end. The student quickly flips through Pride and Prejudice and spots a chapter describing Lydia's elopement. She then assumes that Lydia will not marry the richest man, without needing to know who Lydia is or what marriage meant to Regency society.

The same goes for machine learning. Let's say a machine is trying to decide whether a certain student is likely to take out a student loan. The machine learns from data points about past student loan applicants: their basic personal information, their financial background, their past occupations, the degree they are pursuing, their GPA, and so on. If the machine's owner (in real-world cases, a big data algorithm company) wants other data points and has them, it can also 'feed' the machine seemingly irrelevant ones, such as how often a student buys Innocent super smoothies from Tesco, Evian bottled water from Sainsbury's, or soya lattes from Pret.

From the data points of past applicants, the machine then produces an algorithm, assigning each factor (smoothies, lattes, GPA, and so on) a coefficient. Suppose the machine has learned from previous loan records that a person who buys a lot of Evian bottled water is less likely to take out a student loan than someone who has never bought it. The machine will then treat 'number of Evian bottles bought' as a relevant parameter with a negative effect on the chance of taking out a loan, and it will use that learned weight to score a new applicant and calculate his or her likelihood of taking out a student loan.
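
To make that concrete, here is a minimal sketch of the loan example in Python, assuming a scikit-learn-style setup. Every feature name and number below is invented purely for illustration; the point is only to show how 'learning' reduces to fitting coefficients.

```python
# A minimal, invented sketch of the student loan example using
# scikit-learn's logistic regression. A real system would train on
# far more applicants and far more factors.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a past applicant:
# [GPA, smoothies bought per month, Evian bottles bought per month].
X = np.array([
    [3.9, 0, 12],
    [2.8, 5, 0],
    [3.1, 2, 1],
    [3.6, 1, 9],
    [2.5, 7, 0],
    [3.3, 3, 2],
])
# 1 = took out a student loan, 0 = did not.
y = np.array([0, 1, 1, 0, 1, 1])

model = LogisticRegression().fit(X, y)

# The learned coefficients are the 'weights' assigned to each factor.
# A negative weight on Evian purchases means the model treats them as
# lowering the predicted chance of taking out a loan.
print(dict(zip(["gpa", "smoothies", "evian"], model.coef_[0])))

# Score a new applicant: the estimated probability of taking out a loan.
new_applicant = np.array([[3.4, 2.0, 8.0]])
print(model.predict_proba(new_applicant)[0][1])
```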

Yes, you may laugh at the result: how can buying Evian bottled water possibly be related to the chance of taking out loans? Well, that's just it. In the world of big data and machine learning, human beings do not get to say how weak the connection is; machines do.

Now, this is a brutal simplification of how machine learning works, but the key takeaway is that there is a crucial difference between a literature student and the machine: the machine does not have a built-in value system of what's good and what's bad. Humans do. The machine does not "judge." It simply builds connections between parameters and results.

The machine's predictions may be accurate, but they can also be unforgiving of one's past mistakes. The machine predicts a person's future behaviour based on what she, and people like her, have done in the past. Although these predictions are grounded in evidence, they may strike human beings as discriminatory, since for us the past does not always determine the future, and miracles do happen.

The machine learns from these data points just as a literature student does. Thus, the more relevant data points the machine learns from, the smarter it gets; that is, the more accurate its algorithms become.
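
A toy experiment illustrates the point, under the assumption that more examples simply mean more evidence for the fit. The dataset below is synthetic, generated on the spot rather than drawn from any real source.

```python
# Train the same classifier on growing slices of a synthetic dataset
# and watch held-out accuracy (usually) improve as it sees more examples.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

for n in (50, 500, 4000):
    model = LogisticRegression().fit(X_train[:n], y_train[:n])
    print(f"{n} training examples -> accuracy {model.score(X_test, y_test):.3f}")
```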

Facebook holds data points on its 2.1 billion users (as of Q4 2017). With these bits of information, Facebook's data system (note: not Mark Zuckerberg, but the data system itself) is able to build a quite detailed digital profile of each user. In other words, the machine knows you, in some ways.

And now, unfortunately, it appears that knowledge has been abused. But by whom, and for what purpose?

Cambridge Analytica is a political consulting firm that specialises in the collection and analysis of data to aid its clients. These clients tend to be right-wing political campaigns, as the company is partly owned by Robert Mercer, an American hedge-fund manager who supports conservative political causes.

Aleksandr Kogan is a data scientist who, in 2014, developed an app called thisisyourdigitallife. The app collected user data by giving users a personality test, and the data was ostensibly intended for academic purposes. However, the app was partially developed and maintained by Cambridge Analytica, which realised that Facebook had a backdoor of sorts: it allowed Cambridge Analytica to see the data not only of thisisyourdigitallife users but also of those users' friends, and of anyone taking an online test hosted via Facebook. Cambridge Analytica proceeded to harvest the data of millions of people. Precisely how many is not known; estimates have been revised upward from 50 million to at least 87 million people compromised by the breach.

Cambridge Analytica appears to have used this data to infer the political and cultural views of Facebook users and to target political advertisements at them. This method of voter targeting appears to have been used in the 2016 Brexit referendum and in the 2016 presidential and congressional campaigns in the United States, to support the pro-Leave and Republican campaigns respectively. The degree to which this illicit data affected the outcomes of those votes is as yet unclear, but the sheer scale of the breach certainly opens up the possibility that Cambridge Analytica tipped the scales in either or both.

In the aftermath of Christopher Wylie's revelations about what Cambridge Analytica had done while he was its director of research, the United States Senate and House of Representatives called upon Mark Zuckerberg, the founder and CEO of Facebook, to testify about the data breach and Facebook's response to it. The hearings also served as a way for lawmakers to air other grievances against Facebook that have piled up over the years. One such objection is that Facebook doesn't delete user data when a profile is deleted; the profile is merely hidden from public view.

Zuckerberg did not commit to changing that specific policy, but he did assure lawmakers that Facebook was doing all it could to ensure the privacy of its users, and that the loophole exploited by Cambridge Analytica has since been closed. This satisfied few, however, and there is now talk of an "Honest Ads Act" that would limit the scope and scale of political advertising online.

But despite the fury from lawmakers and the general public over this breach of privacy, the fundamental problem remains that data collection is what justifies Facebook's 74-billion-dollar net worth. Facebook generates revenue via the same method Cambridge Analytica used: collecting user data in order to tailor ads. The only difference is that Cambridge Analytica collected that data without users' consent. Yet most Facebook users are unaware of just how much information they have agreed to give away, and the same is true of most people who use search engines or social networks, all of which generate profits by monetising user data. This is why Facebook cannot protect user data without destroying its own business model. As a general rule, when an online service is free, you are the product.

Additionally, the admonishments given to Mark Zuckerberg by American politicians ring hollow when one realises that the United States government (and its allies, including the UK, in the Five Eyes program) is the biggest violator of private information in the world. Edward Snowden's revelations in 2013 showed that the volume of data recorded and stored by intelligence agencies from Americans' emails, phone calls, and internet activity is equally horrendous, if not more so. No company can compete with the surveillance and data-storage capabilities of the United States and its allies, and while Facebook can make a legitimate claim that its users agreed to share their data and can stop sharing it at any time by quitting Facebook, the situation is rather different with national governments. Poll after poll has shown that British and American citizens are uncomfortable with mass surveillance and would prefer such programs to end, or at least to be curtailed. Such curtailment has not occurred, nor can the surveillance be escaped, as quitting Britain is a great deal more difficult than quitting Facebook.

In a more general sense, the Cambridge Analytica scandal has only made clearer the fact that privacy, for all intents and purposes, is dead. It is not possible to function in modern society without being processed by algorithms checking to see whether you want bell-bottom pants, need an oil change, or want a communist revolution (to sell you Che T-shirts, for one algorithm; to put you in a security database, for another). The most we can hope for is for our information to be used in an accountable fashion. However, what counts as "accountable" has ceased to be a debate over morality and rights and has become a business negotiation. While the data collectors are busy making money, we, the users, are having too much fun devouring our friends' daily lives in minute detail.