CiteSeerX Data

CiteSeerX data and metadata are available for others to use. The available data includes CiteSeerX metadata, databases, data sets of PDF files, and the extracted text of PDF files.

Currently, data is available only through shared folders on Google Drive. For more information, please contact us directly. Data released by CiteSeerX is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

CiteSeerX is compliant with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a standard proposed by the Open Archives Initiative to facilitate content dissemination. For data not mentioned here, please contact us through feedback.

To browse or download records programmatically from the CiteSeerX OAI collection, please use the harvest URL:

The archive may also be browsed through an OAI Repository Explorer, either by using the CiteSeerX archive identifier or by directly entering the harvest URL.
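As a rough illustration of programmatic harvesting, the following Python sketch pages through a repository using the ListRecords verb of the Open Archives Initiative Protocol for Metadata Harvesting and follows resumption tokens until the list is complete. The harvest URL is not reproduced on this page, so `base_url` is a placeholder that the caller must supply. The function names (`build_request_url`, `parse_list_records`, `harvest_identifiers`) are hypothetical helpers written for this example, not part of any CiteSeerX tooling, and the `oai_dc` metadata prefix is assumed because the protocol requires every repository to support it.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_request_url(base_url, verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = urllib.parse.urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


def parse_list_records(xml_text):
    """Extract record identifiers and the resumption token from one response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        header.findtext(f"{OAI_NS}identifier")
        for header in root.iter(f"{OAI_NS}header")
    ]
    token = root.findtext(f".//{OAI_NS}resumptionToken")
    # An absent or empty token means the list is complete.
    token = token.strip() if token and token.strip() else None
    return identifiers, token


def harvest_identifiers(base_url, metadata_prefix="oai_dc", fetch=None):
    """Yield record identifiers page by page until no token remains.

    base_url is a placeholder for the repository's harvest URL.
    fetch is an optional callable (url -> XML text), useful for testing.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url) as response:
                return response.read().decode("utf-8")

    url = build_request_url(base_url, "ListRecords",
                            metadataPrefix=metadata_prefix)
    while url:
        identifiers, token = parse_list_records(fetch(url))
        yield from identifiers
        # Follow-up requests carry only the verb and the resumption token.
        url = (build_request_url(base_url, "ListRecords",
                                 resumptionToken=token)
               if token else None)
```

Passing a custom `fetch` callable keeps the paging logic testable without network access. A real harvester would also want to handle OAI-PMH error responses and to pause between requests out of courtesy to the server.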

Here is a list of toolkits that can be used for OAI metadata harvesting.

Data Sets

Citation and Header Datasets

  • UMass Citation Field Extraction Dataset
    License: N/A
    Provides labels and segments for citations extracted from articles. The citations come from 5000 papers in four fields. Described in "A New Dataset for Fine-Grained Citation Field Extraction."
  • Cora Information Extraction
    License: N/A
    Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
  • CiteSeerX Citation Data
    License: N/A
    Tagged citation data from CiteSeerX.