To its credit, Facebook has made a number of key changes to its privacy and API stance since the dataset at the center of the Cambridge Analytica scandal was harvested. Ironically, these new restrictions on mass harvesting have provoked impassioned outcries from the very academic community that previously blasted Facebook for using data for research. At the same time, many academics I’ve spoken with have said the changes will have little impact on them, since they use web scraping tools designed to circumvent Facebook’s security measures or have already amassed such immense collections of Facebook data that they have no need for more. Yet, a closer look at Facebook’s relationship with the academic community and its tolerance of mass harvesting of user data over the years, as well as recent changes in that relationship, offers an intriguing perspective on the future of social media research.
As I noted repeatedly throughout the Cambridge Analytica story, the story of academics harvesting vast quantities of data from Facebook to use in profiling those users and redistributing that data to others is nothing new. In fact, the story described pretty much an average academic use case of Facebook data, abetted by university research ethics boards (IRBs) that largely view such research and data sharing as outside the purview of institutional ethical concern.
Last week the New York Times surveyed a small handful of the myriad Facebook datasets floating about the academic realm and the cavalier attitude of the academic world towards access and redistribution of users’ personal information. It also cited one researcher as acknowledging they had been approached by commercial entities interested in purchasing the Facebook data from them – an indication that the Cambridge Analytica story has done little to scare off commercial interest in such harvested datasets.
Last September the controversial Stanford University “gaydar” study made use of a massive Facebook dataset that had been previously harvested through a personality quiz process similar to the one behind the Kogan dataset at the center of the Cambridge Analytica scandal. Known as the myPersonality quiz, the Facebook dataset was freely available to other academics for use in IRB-approved research, and the long list of researchers who downloaded the data, along with the papers that resulted from it, stands testament to its widespread access.
The myPersonality dataset included a large collection of user profile images, which were available for download as a digital signature of 500 markers representing each image (the actual photographs themselves were not available to download) and were used as part of the “gaydar” study. When asked whether the myPersonality dataset had received explicit permission from Facebook to collect the facial images and whether Stanford had adhered to its research ethics rules that typically required signed legal permission from a social media website prior to harvesting data, Dr. Kosinski offered a statement through a university spokesperson at the time saying only that “the users gave the app permission to use their data.”
Indeed, this sentiment was at the heart of the Cambridge Analytica story as well: users clicked a button authorizing the use of their data for research so they should have realized that this permission would extend to hundreds of researchers more than a decade later, including using their face to predict highly sensitive and intimate attributes like sexual orientation.
Indeed, even the journal the study was published in disagreed with the assessment that users granting permission for research use of their data a decade ago conferred informed consent for their imagery to be used for the Stanford study, offering that the “owners of the images posted them for different purposes. It may therefore be deemed unlikely that all of them would have granted consent for the use of their images in your research work.” Yet, in the end, it was an APA journal that accepted the paper for publication, deeming such issues to have no impact on publication status until a public outcry prompted reconsideration and led fellow academics to condemn any attempt by the public to impose their ethical values and beliefs on the academic environs.
When asked whether the individuals who took the survey were asked to sign a model release authorizing their likeness to be redistributed (even as just a set of 500 markers instead of actual photographs) and how Stanford addressed the ethical questions of biometrics and privacy, Kosinski offered at the time only that “what people seem to be missing (sadly), however, is that anyone can easily record the images from the web.”
The university spokesperson declined to comment further other than emphasizing that the study had been peer reviewed and that the journal publishing it was operated by the American Psychological Association.
This exchange from last year takes on new meaning in light of the Cambridge Analytica story. Here we have the researchers’ belief that once a user grants permission for “research” use of their data, that is a blank check authorizing hundreds of researchers over decades to do whatever they feel like with the data and make it available to the entire global academic community. As additional justification, the argument goes that the images could have been harvested via web scraping anyway even without the users’ knowledge, so it doesn’t really matter.
Stanford, while previously detailing to me one of the most stringent research ethics processes in the nation, at the end of the day did not apply the extremely restrictive rules it had previously touted, instead opting to simply exempt the entire study from detailed ethical review.
At the time, I asked Facebook on September 7, 2017 for comment on its perspective on academic use of personality quizzes and similar apps to mass harvest user data. Unsurprisingly, the company never responded. On March 19, 2018 in light of Facebook’s actions against Cambridge Analytica, I asked the company again whether it planned to take action against other large datasets harvested from its platform via personality quizzes, mentioning myPersonality by name, such as requesting that they be deleted or requiring that researchers no longer make them widely available for download. Again, no response.
If Facebook was seriously concerned about academics mass harvesting private user data from its platform and sharing it far and wide, it certainly was not unaware of the myPersonality quiz, especially given that one of its own researchers was cited as having been approved to use it for his research.
It does seem that if Facebook was concerned that such applications were violations, it would have taken action in the aftermath of the much-discussed “gaydar” study last year amidst the discussion of the ethics of the myPersonality quiz dataset or at the very least would have issued a statement at the time raising concerns about the dataset. Instead, the company remained silent and did not comment on the widespread academic mass harvesting of its data.
Why did Facebook take no action for more than seven years, despite being fully aware of what the myPersonality quiz was and how widely its data was being redistributed? Why did it not at least issue a statement of concern when asked twice about the dataset in the past, including more than seven months ago? That’s an awfully long time for Facebook to accept the status quo before suddenly bursting into action claiming the app was in stark violation of its policies.
When I asked the company this week whether it planned to contact other academics that had mass harvested data from Facebook, either through then-approved mechanisms like quizzes or apps or through unapproved methods like web scraping, the company declined to comment other than by referring to a previous statement that “We are taking a hard look at the information apps can use when you connect them to Facebook, as well as other data practices. These other data practices include academic research.”
However, given that the company has historically declined to take action against violations of its policies, essentially sitting by and passively permitting academics to mass harvest its user data even when those activities were starkly in violation of its policies, it does raise the question of why it is acting now. Perhaps more importantly, given that academics have been accustomed to the company simply looking the other way when it comes to academics violating its terms of service, why would it have any expectation that an academic it permitted to violate its data access policies in acquiring their dataset would still adhere to the rest of those same policies that ban commercial resale or sharing of the data they received through violating its policies? In short, if the company does not enforce its rules, it creates an environment where academics become accustomed to the rules not applying to them and raises the question of why the company would feel academics would still adhere to other aspects of those rules. Again, the company declined to comment.
It is important to remember that while media reports may describe the Cambridge Analytica story as “improper harvesting of millions of [Facebook’s] users’ information by political consulting firm Cambridge Analytica,” the actual reality was that of a lone academic who ran a personality quiz with the full knowledge and approval of Facebook, and that the violation here was not the harvesting of data, but rather the sharing of that data from the academic to Cambridge Analytica.
As Facebook has emphasized, had Cambridge Analytica harvested the data itself, there would never have been any violation at all.
In the end, it remains to be seen where Facebook and the academic community go from here. Conversations with a number of academics suggest Facebook faces a losing battle here, given that technologically there is little it can do to prevent academics from mass harvesting its data through web scrapers. The only real avenue would be for the company to adopt the kind of data loss prevention (DLP) and suspicious data access monitoring that many academic data vendors use to weed out mass automated harvesting attempts. Most large vendors to the academic world anticipate that bad actors will attempt to mass harvest their data without permission and automatically flag and suspend users accessing higher-than-normal volumes of material. Yet, even here academics routinely get around these limitations by spreading the download task across large armies of undergraduate students, sometimes distributed across multiple institutions, each of which is assigned to download only a small amount of content per day, in a giant distributed cat-and-mouse game of staying under the vendors’ radar.
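The volume-based flagging that academic data vendors use can be sketched in a few lines. This is a hypothetical illustration, not any vendor's actual system: it flags any account whose daily download count exceeds a multiple of the median across all accounts, with the multiplier chosen arbitrarily here.

```python
def flag_bulk_downloaders(daily_counts, multiplier=5):
    """Flag accounts downloading far more than their peers.

    daily_counts: dict mapping account name -> items downloaded today.
    multiplier: hypothetical threshold; an account is flagged when its
    count exceeds `multiplier` times the median count across accounts.
    Returns the set of flagged account names.
    """
    if not daily_counts:
        return set()
    counts = sorted(daily_counts.values())
    median = counts[len(counts) // 2]
    threshold = max(1, median) * multiplier
    return {account for account, n in daily_counts.items() if n > threshold}


# A single scraper pulling 500 items stands out against typical users:
flagged = flag_bulk_downloaders({"a": 10, "b": 12, "c": 9, "scraper": 500})
```

Note that this per-account approach is exactly what the distributed evasion tactic described above defeats: if a harvesting effort is split across dozens of student accounts, each staying near the median, no individual account ever crosses the threshold.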
Yet, here too the precedent appears to go precisely in the opposite direction. In cases where Twitter has confirmed that an academic study directly violated its terms of service, the company has declined to take action, reinforcing that social media companies are loath to do anything that might upset their engagement with the academic community that provides them a steady stream of new employees and collaborators. In addition, the “sovereign immunity” enjoyed by many universities in the US provides additional complications, as does the negative publicity of suing universities, while the courts have not always ruled such terms of service to be binding.
In short, even when it tries to do better at privacy, by restricting the ways in which user data can be bulk harvested and misused, the company comes under fire – this time by academics rejecting any restrictions on their ability to turn Facebook’s two billion users into nonconsenting digital lab rats for sale or barter. Moreover, as all cybersecurity professionals know, once sensitive data is out in the wild, there is simply no way to stop its spread.
Putting this all together, it is ironic that the very academic community that has so energetically attacked Facebook’s privacy stance and approach to user research in the past is now so strongly attacking the company’s attempts to improve those issues on the grounds that granting users additional privacy protections will encroach on the academic community’s right to treat users as digital lab rats. Will privacy win out or will we accept our privacy-less fate? If Facebook’s stock price recovery is any indicator, this story was never about privacy.