The Center for Research in Computation and Society was founded to develop a new generation of ideas and technologies designed to address some of society’s most vexing problems. The Center brings computer scientists together with economists, psychologists, legal scholars, ethicists, neuroscientists, and other academic colleagues across the University and throughout the world, to address fundamental computational problems that cross disciplines, and to create new technologies informed by societal constraints to address those problems.
Research initiatives launched throughout industry and academia study the intersection of technology and society in two distinct ways: they investigate the effects of information technology on society, or they study ways to use existing technologies to solve societal problems. The Harvard Center for Research in Computation and Society is unique in its forward-looking scope and integrative approach: rather than merely examining the effects of existing technology on society, it supports research on innovative computer science and technology informed by those societal effects.
Harvard University provides the ideal venue for this cross-disciplinary research initiative because of the strength and depth of the faculty across schools and academic disciplines. The Computer Science Program at Harvard comprises world-class teams of researchers in artificial intelligence, cryptography and algorithms, learning theory, systems (including networks, databases, programming languages, and operating systems), and computational economics. The Center’s intellectual scope is broadened and strengthened through participation by faculty and associates across Harvard, including the economics, government, and psychology departments in the Faculty of Arts and Sciences, as well as the Law School, Kennedy School of Government, and Business School.
CRCS has several ongoing research projects, which may be of interest to prospective fellows.
Information technology, advances in statistical computing, and the deluge of data available through the Internet are transforming social science. With the ability to collect and analyze massive amounts of data on human behavior and interactions, social scientists can hope to uncover many more phenomena, with greater detail and confidence, than allowed by traditional means such as surveys and interviews. In addition to advancing the state of knowledge, the rich analysis of behavioral data can enable companies to better serve their customers, and governments their citizenry.
However, a major challenge for computational social science is maintaining the privacy of human subjects. At present, an individual social science researcher is left to devise her own privacy shields, such as stripping a dataset of “personally identifiable information” (PII). Such privacy shields are often ineffective, providing limited (or no) real-world privacy protection; indeed, there have been a number of cases in which the individuals in a supposedly anonymized dataset have been re-identified. At the same time, social scientists are increasingly analyzing complex forms of data, such as large social networks, spatial trajectories, and semistructured text, that are even less amenable to naive attempts at anonymization.
This project is a broad, multidisciplinary effort to enable the collection, analysis, and sharing of social science data while providing sufficient privacy for individual subjects. Bringing together computer science, social science, statistics, and law, the investigators seek to refine and develop definitions and measures of privacy and data utility, and design an array of technological, legal, and policy tools for social scientists to use when dealing with sensitive data.
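As one example of the modern, formal privacy definitions such a framework can build on, differential privacy bounds how much any single subject's data can affect a released statistic. The sketch below is purely illustrative, not the project's actual tooling: it answers a counting query using the standard Laplace mechanism, with noise scaled to the query's sensitivity.

```python
import math
import random

def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) noise via the inverse CDF."""
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    """Answer a counting query with epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing one subject changes the
    answer by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

responses = ["smoker", "non-smoker", "smoker", "smoker"]
print(private_count(responses, lambda r: r == "smoker", epsilon=0.5))
```

Smaller values of epsilon give stronger privacy guarantees at the cost of noisier answers, which is exactly the privacy-versus-utility trade-off the project seeks to measure and manage.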
These tools will be tested and deployed at the Harvard Institute for Quantitative Social Science’s Dataverse Network, an open-source digital repository that offers the largest catalogue of social science datasets in the world. Our aim is to provide social scientists with a technological and legal framework that embodies the modern computational understanding of privacy, and a reliable open infrastructure that aids in the management of confidential research data from collection through dissemination.
A Health Data Marketplace
A combination of money and outmoded privacy practices is driving the personal data that enables science back into the vaults of private industry. The first factor is money: PricewaterhouseCoopers predicts that sharing personal health information beyond the direct care of the patient will soon be a two-billion-dollar market. The second factor is outmoded privacy practices: for years, researchers relied on de-identification, the removal of explicit identifiers (e.g., name and address), as a way to provide privacy in data. This approach is too naive today because other data sources often contain some or all of the same values, allowing redacted identity information to be restored by linking datasets, as several highly publicized cases of re-identification attest.
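The linkage mechanism can be illustrated with a toy sketch (all records and field names below are synthetic): a health dataset stripped of names still shares quasi-identifiers with a public dataset, and a simple join restores the identities.

```python
# All records below are synthetic and the field names illustrative.

# A health dataset with explicit identifiers (names) removed, but with
# quasi-identifiers (zip code, date of birth, sex) retained.
anonymized_health = [
    {"zip": "02138", "dob": "1945-07-21", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1982-03-02", "sex": "M", "diagnosis": "asthma"},
]

# A public dataset (e.g. a voter roll) containing the same quasi-identifiers
# alongside names.
public_records = [
    {"name": "Alice Smith", "zip": "02138", "dob": "1945-07-21", "sex": "F"},
    {"name": "Bob Jones",   "zip": "02139", "dob": "1982-03-02", "sex": "M"},
]

def link(health, public):
    """Re-identify 'anonymous' health records by joining on quasi-identifiers."""
    key = lambda r: (r["zip"], r["dob"], r["sex"])
    names = {key(p): p["name"] for p in public}
    return [(names.get(key(h)), h["diagnosis"]) for h in health]

print(link(anonymized_health, public_records))
# → [('Alice Smith', 'hypertension'), ('Bob Jones', 'asthma')]
```

No explicit identifier was ever present in the health data, yet every record is re-identified by the join.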
We propose to create a new paradigm for data sharing in which a data subject assembles, controls, and receives compensation for sharing copies of his or her data. We term this a privacy marketplace: an electronic market in which personal data and services are exchanged and whose design provides privacy guarantees to the subjects of the data. Buyers are likely to be companies and researchers; sellers are individuals who receive compensation or services for sharing their own data. Privacy guarantees include risk-based compensation. While individuals selling personal data in a marketplace is not an entirely new idea, doing so in a way that integrates privacy guarantees into the market design is new, and its implementation is now both technically and financially feasible.
The premise of our proposed research is that we are at a pivotal moment in the history of health data, in which unprecedented amounts of personal health data will be shared, exacerbating the inadequacies of past legal approaches to privacy. At the same time, recent scientific advances in data privacy and expressive markets, together with recent health policy allowing individuals to own a copy of their own data, make a new approach possible. By fusing these directions in a cross-disciplinary pursuit, we have an opportunity to gain theoretical insights across computer science, law, and economics that were not possible before, to launch a new paradigm for sharing personal information, and to introduce a new form of privacy governance. Our goal is to produce a general theory of data marketplaces for sharing personal data (via contracts, etc.), guided by achieving greater transparency, accountability, and social utility than is possible with historical privacy approaches.
By “general theory” we mean a theory embodied in a computer model that directly computes privacy guarantees or privacy-risk compensation for data subjects in a data marketplace. Our research effort also includes designing and constructing an actual privacy marketplace, inviting public participation, and running a living lab to further design and test mechanisms for real-world data sharing.
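As a purely illustrative sketch of what such a computed risk compensation might look like, the snippet below prices a data release as a base price plus an expected-harm premium. The k-anonymity-style risk model and linear pricing are our own simplifying assumptions for illustration, not the marketplace's actual design.

```python
def reidentification_risk(group_size):
    """Worst-case risk under a k-anonymity-style model: a record hidden among
    k indistinguishable records is re-identified with probability at most 1/k."""
    return 1.0 / group_size

def risk_based_compensation(base_price, group_size, expected_harm):
    """Price a release as a base price plus the expected privacy loss:
    re-identification probability times the monetized harm if it occurs."""
    return base_price + reidentification_risk(group_size) * expected_harm

# A record hidden among 5 look-alikes, with $100 of harm if re-identified:
print(risk_based_compensation(10.0, 5, 100.0))  # → 30.0
```

Even this toy version makes the key design point concrete: sellers facing higher residual risk are owed higher compensation.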
If successful, our effort will provide new theoretical insights into privacy, markets, and privacy governance. It will also cross-fertilize work across areas of computer science: (i) formal notions of differential privacy from theoretical computer science; (ii) user-expressed policies, bidding languages, and methods for optimization and market clearing from artificial intelligence; and (iii) formal language and type theory for safe computation from programming languages. Beyond the scientific impact of better understanding privacy risk and preferences, (i) the living lab will provide a unique way for researchers and public health to access personal data and recruit subjects, and (ii) our marketplace design will encourage new forms of privacy governance.
Crowdsourcing describes the process of solving a problem by sharing it with a large group of people, each with diverse capabilities and interests, where a solution is constructed by selecting from or combining the inputs received from the crowd, perhaps with payments or other incentives to encourage participation. Recent examples that showcase the power of crowdsourcing include Galaxy Zoo, which crowdsources the classification of galaxies; Amazon Mechanical Turk (MTurk) for “human intelligence tasks” (HITs); TopCoder for code development; the Mozilla Foundation’s “bug bounty”; and the DARPA “red balloon challenge.” An emerging application is to use crowdsourcing to provide enhanced computer security. For example, the Conficker Working Group provided a platform where security researchers could share information and data on a particularly widespread botnet, and PhishTank is a volunteer effort to collect and verify reports of phishing websites and rapidly disseminate a feed to banks and browser blacklists. With crowdsourcing there is a natural tension: How can one use many people in the ‘crowd’ to make accurate decisions and provide useful, aggregate information without also allowing people (and computers) to mount attacks on such a system and render it untrustworthy? In enabling trustworthy crowdsourcing and data sharing, we will study the following three foundational problems:
• Input trust: Can reputation algorithms be designed for crowdsourcing, to identify which inputs are useful and filter the input into a system? How should resources be allocated between confirming earlier reports and collecting new information?
• Output trust: Can crowdsourcing be usefully formalized as a dynamic incentive system, with state transitions that depend on the combined actions of multiple participants, each with possibly divergent interests? Does this lead to a control framework, where incentives can be designed to promote the effective allocation of work?
• Time-critical mobilization: Some crowdsourcing tasks involve the activation of multiple workers within a short timeframe, e.g., to trace a security attack or for search and rescue. We seek to formalize this problem, and to design and analyze a mechanism that provably encourages (truthful) information sharing, including robustness against false-name attacks.
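As a minimal illustration of the input-trust question above, one could maintain a Beta-Bernoulli reputation for each worker and weight their reports by the posterior probability that they report correctly. This is a simplified sketch under our own assumptions, not a proposed design for the project.

```python
class Reputation:
    """Beta-Bernoulli reputation: start from a uniform Beta(1, 1) prior and
    update as a worker's reports are confirmed or refuted."""

    def __init__(self):
        self.alpha = 1  # confirmed reports + 1
        self.beta = 1   # refuted reports + 1

    def update(self, was_correct):
        if was_correct:
            self.alpha += 1
        else:
            self.beta += 1

    def weight(self):
        # Posterior mean probability that the worker's next report is correct.
        return self.alpha / (self.alpha + self.beta)

def accept_report(reports):
    """Weighted vote over (reputation, vote) pairs: accept the claim when
    reputation-weighted support exceeds half of the total weight."""
    support = sum(rep.weight() for rep, vote in reports if vote)
    total = sum(rep.weight() for rep, vote in reports)
    return support > total / 2
```

Under this scheme a newcomer's report counts for little until confirmed, which filters noisy or adversarial input, while the resource-allocation question (confirm old reports versus collect new ones) remains the open problem stated above.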
Crowdsourcing systems are computational, economic, and social in nature. Although they demonstrate the potential for wide application, they are inherently vulnerable to attack. We propose foundational algorithmic and theoretical work, seeking to understand how to achieve effective and robust solutions. Beyond applications of crowdsourcing and data sharing to security problems, crowdsourcing is being used for a wide variety of other applications, including collaborative science, software engineering, personalized health-data services, and language translation. Improved methods for achieving trustworthy “production” from crowdsourced systems will bring broad societal benefits.
Privacy & Security in Targeted Advertising
Internet advertising generated more than $22 billion in revenue in the United States in 2009, with around 47% coming from search advertising (ads adjacent to search engine results) and 35% from display (“banner”) advertising. In addition to providing great business value, the ability to target ads to users makes those ads more relevant, and thus more valuable, to consumers.
On the other hand, the explosion of targeted advertising threatens user privacy online, and it operates in a way that is largely invisible to end users and neither controlled nor understood by them. The main focus of this project is the need to protect business value while enabling meaningful user control over data about users’ online behavior.
We believe that it is possible to sustain the commercial value from targeted advertising while providing appropriate privacy and meaningful and manageable control to users. The broad goal of this project is to develop a technological and legal framework that can allow data to be collected and retained for targeted advertising, while providing users with control over the use of data and the ability to articulate preferences about data use vs. access to content and services.
We are interested in achieving impact by informing the public, policy makers, and the media about the economic value and privacy implications of targeted advertising. Making progress on our research agenda will require the combined efforts of many disciplines:
- Computer Science and Economics. The emerging discipline that studies problems at the intersection of computer science and economics, for example market design, computational mechanism design, and efficient preference elicitation, is directly relevant to these questions.
- Business. We will need to understand which forms of targeting and personalization are most valuable to advertisers and which properties of data are important to preserve in sustaining that value, and we will seek methods to protect innovation in online services and personalization.
- Law and Policy. Legal scholarship on property-rights versus contract-law approaches to protecting privacy, which also considers the role of norms and other voluntary enforcement and the development of new legal instruments as necessary, will be essential in developing a comprehensive, workable solution for targeted advertising.
Technologies for Personalized Adaptive Accessibility
In a world where efficient computer access is required in education, at work, in accessing basic services, and in maintaining social relationships, both the possibility and the efficiency of access are essential for equitable participation in society. Current one-size-fits-many accessibility solutions miss the opportunities that arise from careful consideration of an individual’s abilities and fail to address the sometimes dynamic aspect of those abilities, such as when a user’s activity or context causes a “situational impairment.” More fundamentally, existing assistive technologies tend to be designed on the assumption that software user interfaces, which are designed for the “average user,” are immutable and that users have to adapt themselves to the technology. This necessarily puts a ceiling on the efficiency of access enabled by these approaches.
Our long-term vision of Personalized Adaptive Accessibility aims to reverse this situation by providing users with interfaces that reflect each person’s unique abilities, devices, and environment. Complementing existing accessibility approaches, the Personalized Adaptive Accessibility approach promises not only to enable access, but also to make it substantially more efficient, by providing each user with interfaces designed to leverage what the user can do best rather than compensating for what the user cannot do, or cannot do well.
This vision extends to different kinds of abilities (motor, perceptual, cognitive, emotional) and to a range of factors that might affect them (permanent medical condition, temporary injury, age, choice of input devices, situational impairments caused by activity or context, education, access to technology). Because of the great variety of individual capabilities among computer users, manually designing interfaces for each one of them is impractical and not scalable. To address this challenge, we are developing novel techniques for automatically generating and adapting user interfaces, for enabling powerful user-driven customization of interfaces, for supporting collaborative community-based design, as well as for enabling application designers to create intentionally flexible and adaptable interfaces.
We are looking for Fellows with a strong interest in inventing novel technologies to enable more equitable access to computing and with expertise in any combination of the relevant fields, including accessibility, human-computer interaction, crowdsourcing, artificial intelligence, and machine learning. We also hope that the Fellow(s) who join us will take advantage of the unique interdisciplinary environment of CRCS and forge connections with other parts of the Harvard campus.
Language-Based Security and Privacy
The Language-Based Security group focuses on developing technology that helps detect or prevent implementation flaws in security- and privacy-critical software. Specifically, we are investigating the role of advanced type systems for tracking information flow in programs; new compiler techniques and hardware designs for enforcing security policies efficiently; new languages for specifying policies; and new techniques for minimizing the need to trust software and hardware.
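A toy dynamic-tainting sketch conveys the flavor of information-flow tracking (the group studies static type systems; this runtime version is merely illustrative): values carry confidentiality labels, combining values takes the join of their labels, and releasing a secret-labeled value at a public sink is rejected.

```python
from enum import Enum

class Label(Enum):
    PUBLIC = 0
    SECRET = 1

class Labeled:
    """A value tagged with a confidentiality label; combining two values
    takes the join (max) of their labels."""

    def __init__(self, value, label):
        self.value = value
        self.label = label

    def __add__(self, other):
        joined = Label(max(self.label.value, other.label.value))
        return Labeled(self.value + other.value, joined)

def release_to_public_channel(x):
    """A public sink: releasing SECRET-labeled data violates the policy."""
    if x.label is Label.SECRET:
        raise PermissionError("information-flow violation: secret data at public sink")
    return x.value

salary = Labeled(90000, Label.SECRET)
bonus = Labeled(1000, Label.PUBLIC)
total = salary + bonus  # the secret operand taints the sum: total is SECRET
print(release_to_public_channel(bonus))  # → 1000
# release_to_public_channel(total) would raise PermissionError
```

A static information-flow type system enforces the same policy at compile time, rejecting programs that could leak secrets before they ever run.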