Fascinating research on de-anonymizing code -- from either source code or compiled code:
Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, have found that code, like other forms of stylistic expression, is not anonymous. At the DefCon hacking conference Friday, the pair will present a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples. Their work could be useful in a plagiarism dispute, for instance, but it also has privacy implications, especially for the thousands of developers who contribute open source code to the world.
Abstract: Can online trackers and network adversaries de-anonymize web browsing data readily available to them? We show -- theoretically, via simulation, and through experiments on real user data -- that de-identified web browsing histories can be linked to social media profiles using only publicly available data. Our approach is based on a simple observation: each person has a distinctive social network, and thus the set of links appearing in one's feed is unique. Assuming users visit links in their feed with higher probability than a random user, browsing histories contain tell-tale marks of identity. We formalize this intuition by specifying a model of web browsing behavior and then deriving the maximum likelihood estimate of a user's social profile. We evaluate this strategy on simulated browsing histories, and show that given a history with 30 links originating from Twitter, we can deduce the corresponding Twitter profile more than 50% of the time. To gauge the real-world effectiveness of this approach, we recruited nearly 400 people to donate their web browsing histories, and we were able to correctly identify more than 70% of them. We further show that several online trackers are embedded on sufficiently many websites to carry out this attack with high accuracy. Our theoretical contribution applies to any type of transactional data and is robust to noisy observations, generalizing a wide range of previous de-anonymization attacks. Finally, since our attack attempts to find the correct Twitter profile out of over 300 million candidates, it is -- to our knowledge -- the largest scale demonstrated de-anonymization to date.
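To make the maximum-likelihood idea concrete, here is a toy sketch in Python. It is not the paper's model. It assumes a simple mixture in which each visited link comes from the user's own feed with probability q and from a uniform background over all known links otherwise. The names (log_likelihood, most_likely_profile, q, universe_size) and the sample feeds are all made up for illustration.

```python
import math

def log_likelihood(history, feed, universe_size, q=0.5):
    """Log-likelihood of a browsing history under one candidate's feed.

    Assumed toy model: with probability q a visit is a uniform draw from
    the candidate's feed, otherwise a uniform draw from all known links.
    """
    total = 0.0
    for link in history:
        p_feed = 1.0 / len(feed) if feed and link in feed else 0.0
        total += math.log(q * p_feed + (1.0 - q) / universe_size)
    return total

def most_likely_profile(history, feeds, universe_size, q=0.5):
    """Return the candidate whose feed best explains the history."""
    return max(
        feeds,
        key=lambda name: log_likelihood(history, feeds[name], universe_size, q),
    )

# Hypothetical feeds and history, for illustration only.
feeds = {
    "alice": {"a.example/1", "b.example/2", "c.example/3"},
    "bob": {"c.example/3", "d.example/4", "e.example/5"},
}
history = ["a.example/1", "c.example/3", "b.example/2", "z.example/9"]
best = most_likely_profile(history, feeds, universe_size=1000)
```

The history contains one link that appears in neither feed. The background term keeps that noisy observation from zeroing out any candidate, which is roughly the robustness property the abstract describes.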
This article, detailing the Australian and then worldwide investigation of a particularly heinous child-abuse ring, contains a lot of detail about the pedophiles' security practices and the police's investigative techniques. The abusers had a detailed manual on how to scrub metadata and avoid detection, but not everyone was perfect. The police used information from a single camera to narrow down the suspects. They also tracked a particular phrase one person used to find him.
This story shows an increasing sophistication of the police using small technical clues combined with standard detective work to investigate crimes on the Internet. A highly painful read, but interesting nonetheless.
Here's the story of how it was done. First, a fake ad on torrent listings linked the site to a Latvian bank account, an e-mail address, and a Facebook page.
Using basic website-tracking services, Der-Yeghiayan was able to uncover (via a reverse DNS search) the hosts of seven apparent KAT website domains: kickasstorrents.com, kat.cr, kickass.to, kat.ph, kastatic.com, thekat.tv and kickass.cr. This dug up two Chicago IP addresses, which were used as KAT name servers for more than four years. Agents were then able to legally gain a copy of the server's access logs (explaining why it was federal authorities in Chicago that eventually charged Vaulin with his alleged crimes).
Using similar tools, Homeland Security investigators also performed something called a WHOIS lookup on a domain that redirected people to the main KAT site. A WHOIS search can provide the name, address, email and phone number of a website registrant. In the case of kickasstorrents.biz, that was Artem Vaulin from Kharkiv, Ukraine.
Der-Yeghiayan was able to link the email address found in the WHOIS lookup to an Apple email address that Vaulin purportedly used to operate KAT. It's this Apple account that appears to tie all of the pieces of Vaulin's alleged involvement together.
On July 31st 2015, records provided by Apple show that the me.com account was used to purchase something on iTunes. The logs show that the same IP address was used on the same day to access the KAT Facebook page. After KAT began accepting Bitcoin donations in 2012, $72,767 was moved into a Coinbase account in Vaulin's name. That Bitcoin wallet was registered with the same me.com email address.
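The correlation step described here, the same IP address showing up on the same day in two unrelated sets of records, amounts to a join on (date, IP). A minimal sketch in Python follows. The record layout, the function name link_by_ip_and_day, and all the sample values are invented for illustration; the IP addresses come from the range reserved for documentation.

```python
from collections import defaultdict

def link_by_ip_and_day(records_a, records_b):
    """Return (date, ip, id_a, id_b) for records sharing a date and an IP.

    Each record is a (date, ip, identifier) tuple.
    """
    index = defaultdict(set)
    for date, ip, identifier in records_a:
        index[(date, ip)].add(identifier)

    links = []
    for date, ip, identifier in records_b:
        # .get() avoids creating empty entries in the defaultdict.
        for other in sorted(index.get((date, ip), ())):
            links.append((date, ip, other, identifier))
    return links

# Made-up records: one set from a store, one from a social media page.
store_purchases = [
    ("2020-01-01", "192.0.2.10", "user@example.com"),
    ("2020-01-02", "192.0.2.77", "other@example.com"),
]
page_logins = [
    ("2020-01-01", "192.0.2.10", "page-admin"),
    ("2020-01-03", "192.0.2.77", "page-admin"),
]
links = link_by_ip_and_day(store_purchases, page_logins)
```

A match like this only shows that two accounts shared an address on a given day. Shared or reassigned addresses can produce false links, which is why it counts as one clue among several rather than proof.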
Abstract: Perfect anonymization of data sets has failed. But the process of protecting data subjects in shared information remains integral to privacy practice and policy. While the deidentification debate has been vigorous and productive, there is no clear direction for policy. As a result, the law has been slow to adopt a holistic approach to protecting data subjects when data sets are released to others. Currently, the law is focused on whether an individual can be identified within a given set. We argue that the better locus of data release policy is on the process of minimizing the risk of reidentification and sensitive attribute disclosure. Process-based data release policy, which resembles the law of data security, will help us move past the limitations of focusing on whether data sets have been "anonymized." It draws upon different tactics to protect the privacy of data subjects, including accurate deidentification rhetoric, contracts prohibiting reidentification and sensitive attribute disclosure, data enclaves, and query-based strategies to match required protections with the level of risk. By focusing on process, data release policy can better balance privacy and utility where nearly all data exchanges carry some risk.
This research shows how to track e-commerce users better across multiple sessions, even when they do not provide unique identifiers such as user IDs or cookies.
Abstract: Targeting individual consumers has become a hallmark of direct and digital marketing, particularly as it has become easier to identify customers as they interact repeatedly with a company. However, across a wide variety of contexts and tracking technologies, companies find that customers cannot be consistently identified, which leads to a substantial fraction of anonymous visits in any CRM database. We develop a Bayesian imputation approach that allows us to probabilistically assign anonymous sessions to users, while accounting for a customer's demographic information, frequency of interaction with the firm, and activities the customer engages in. Our approach simultaneously estimates a hierarchical model of customer behavior while probabilistically imputing which customers made the anonymous visits. We present both synthetic and real data studies that demonstrate our approach makes more accurate inference about individual customers' preferences and responsiveness to marketing, relative to common approaches to anonymous visits: nearest-neighbor matching or ignoring the anonymous visits. We show how companies who use the proposed method will be better able to target individual customers, as well as infer how many of the anonymous visits are made by new customers.
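The core of the idea, probabilistically assigning an anonymous session to known customers, can be illustrated with plain Bayes' rule. This sketch in Python is much simpler than the hierarchical model in the abstract. It assumes each customer has a fixed share of visits (the prior) and a fixed probability for each activity (the likelihood), and it ignores demographics and new customers. The field names visit_share and activity_probs, and all the numbers, are hypothetical.

```python
def assign_session(session, customers, floor=1e-6):
    """Posterior probability that each known customer made the session.

    posterior is proportional to the customer's visit share times the
    product of that customer's probabilities for each observed activity.
    """
    weights = {}
    for name, profile in customers.items():
        weight = profile["visit_share"]
        for activity in session:
            # Unseen activities get a small floor instead of zero.
            weight *= profile["activity_probs"].get(activity, floor)
        weights[name] = weight

    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical customer profiles, for illustration only.
customers = {
    "c1": {
        "visit_share": 0.6,
        "activity_probs": {"browse": 0.7, "buy": 0.1, "search": 0.2},
    },
    "c2": {
        "visit_share": 0.4,
        "activity_probs": {"browse": 0.2, "buy": 0.6, "search": 0.2},
    },
}
posterior = assign_session(["buy", "buy", "search"], customers)
```

Here the less frequent visitor ends up with most of the posterior weight, because the session's activities fit that customer's habits far better. That is the advantage over matching on visit frequency alone.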
We are able to de-anonymize executable binaries of 20 programmers with 96% correct classification accuracy. In the de-anonymization process, the machine learning classifier trains on 8 executable binaries for each programmer to generate numeric representations of their coding styles. Such a high accuracy with this small amount of training data has not been reached in previous attempts. After scaling up the approach by increasing the dataset size, we de-anonymize 600 programmers with 52% accuracy. There has been no previous attempt to de-anonymize such a large binary dataset. The abovementioned executable binaries are compiled without any compiler optimizations, which are options to make binaries smaller and faster while transforming the source code more than plain compilation. As a result, compiler optimizations further normalize authorial style. For the first time in programmer de-anonymization, we show that we can still identify programmers of optimized executable binaries. While we can de-anonymize 100 programmers from unoptimized executable binaries with 78% accuracy, we can de-anonymize them from optimized executable binaries with 64% accuracy. We also show that stripping and removing symbol information from the executable binaries reduces the accuracy to 66%, which is a surprisingly small drop. This suggests that coding style survives complicated transformations.
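The general shape of this kind of attribution is: turn each sample into a vector of numeric style features, learn a representation per programmer, then assign an unknown sample to the closest one. The sketch below in Python uses a nearest-centroid classifier as a stand-in. It is not the classifier the researchers used, and the three-number feature vectors are invented placeholders rather than features extracted from real binaries.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def train(samples):
    """Map each author to the centroid of their training vectors."""
    return {author: centroid(vectors) for author, vectors in samples.items()}

def predict(model, vector):
    """Return the author whose centroid is nearest in Euclidean distance."""
    return min(model, key=lambda author: math.dist(model[author], vector))

# Invented feature vectors standing in for stylistic measurements.
samples = {
    "author_a": [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.85, 0.15, 0.1]],
    "author_b": [[0.1, 0.9, 0.7], [0.2, 0.8, 0.8], [0.15, 0.85, 0.9]],
}
model = train(samples)
guess = predict(model, [0.8, 0.1, 0.1])
```

Real systems differ mainly in how the features are extracted and in using a stronger classifier. The drop in accuracy for optimized or stripped binaries corresponds to those transformations pulling different authors' feature vectors closer together.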
The behavior of the researchers is reprehensible, but the real issue is that CERT Coordination Center (CERT/CC) has lost its credibility as an honest broker. The researchers discovered this vulnerability and submitted it to CERT. Neither the researchers nor CERT disclosed this vulnerability to the Tor Project. Instead, the researchers apparently used this vulnerability to deanonymize a large number of hidden service visitors and provide the information to the FBI.
Does anyone still trust CERT to behave in the Internet's best interests?
EDITED TO ADD (12/14): I was wrong. CERT did disclose to Tor.
Those of you unfamiliar with hacker culture might need an explanation of "doxing."
The word refers to the practice of publishing personal information about people without their consent. Usually it's things like an address and phone number, but it can also be credit card details, medical information, private e-mails -- pretty much anything an assailant can get his hands on.
Doxing is not new; the term dates back to 2001 and the hacker group Anonymous. But it can be incredibly offensive. In 2014, several women were doxed by male gamers trying to intimidate them into keeping silent about sexism in computer games.
Companies can be doxed, too. In 2011, Anonymous doxed the technology firm HBGary Federal. In the past few weeks we've witnessed the ongoing doxing of Sony.
Everyone from political activists to hackers to government leaders has now learned how effective this attack is. Everyone from common individuals to corporate executives to government leaders now fears this will happen to them. And I believe this will change how we think about computing and the Internet.