
Home | Linguistic Data Consortium



What's New:

Web pages feature DMPs
LDC enhances its user services
Staff Podcasts Accessible on the LDC Blog

How LDC Data Inspires Research

The New York Times Annotated Corpus illustrates how data published in LDC’s Catalog can become an important resource for the community. The New York Times is one of LDC’s earliest data providers; the billions of words of news text it has provided for language resources since the 1990s continue to be used today for research and technology development. Its contribution of the New York Times Annotated Corpus in 2008 opened a new dimension for research with summaries, tags and parsing tools for close to two million news articles spanning a twenty-year period. Researchers immediately recognized the significance of this resource. In its brief history in the Catalog, the corpus has become one of the top ten most distributed data sets and has inspired over 200 research papers.

The top ten list reflects how LDC data contributes to the work of our global community. It includes data sets published over two decades ago that are regarded as benchmark resources essential for new entrants to the field, as well as more recent releases that support users’ ever-growing and changing needs. The papers written about LDC data – over 13,000 unique publications that we’ve found to date – confirm the impact of the Consortium’s archive in supporting continued work and scientific progress.


© 1992-2018 Linguistic Data Consortium, The Trustees of the University of Pennsylvania. All Rights Reserved.