Data Management Standards and Practices

SEAD is an end-to-end infrastructure for managing, sharing, curating, preserving, and publishing data. SEAD provides secure access-controlled Project Spaces in which teams and individual researchers can incrementally develop datasets, and then submit their data for publication in long-term repositories working with SEAD. SEAD's technology guides project teams through a publication process where data collections from the team's secure Project Space are curated, packaged, matched with a trusted long-term repository, given a Digital Object Identifier (DOI), and registered with DataONE.

The SEAD standards and practices pertaining to the five main areas required in an NSF data management plan are covered below:

  I. Types of data
  II. Data and metadata standards
  III. Policies for Access and Sharing and Provisions for Appropriate Protection/Privacy
  IV. Plans for Archiving and Preservation of Access
  V. Policies and Provisions for Re-Use and Re-Distribution

I. Types of data

Data can be uploaded to secure Project Spaces in SEAD and annotated via SEAD's web interface or, for projects that have large numbers of data files and/or complex processes, via SEAD's RESTful web service interface. One of the advantages to using SEAD is that its hosted, access-controlled Project Spaces allow your team to upload and share any and all materials as you create them and to later decide, at milestones within your project, which subsets of this information to publish and preserve for the longer term.

In SEAD, your team has the flexibility to decide which data and additional materials, such as experimental procedures, software, calibration information, test suites, and other forms of formal and informal documentation, to manage and preserve as records of your project. SEAD can support any decision you make regarding what to keep during the project, how and when to upload and annotate your data, and what and when to publish and preserve for the longer term. SEAD encourages and enables project teams to publish and preserve more of their data than just the final results and to consider including raw data, control experiments, calibration information, negative results, reports, notes, and other information. Many repositories working with SEAD support publication of new versions of data and derived datasets. Researchers can take advantage of these capabilities to incrementally enrich and expand their project's published data products over time.

II. Data and metadata standards

SEAD supports the publication of data in any format and allows project teams to use any metadata vocabulary or vocabularies. By default, SEAD uses terms from the Dublin Core (DC) vocabulary and utilizes the W3C Provenance specification. Together these provide a default option for recording the title, abstract, creator(s), creation date, and contact. Project Space administrators can add new project-specific terms and link to external community vocabularies, as desired. For example, researchers can add terms to describe the spatial, temporal, and thematic coverage of the data, as well as data derivation relationships. SEAD software also allows teams to describe data and document relationships between data files as necessary to make the data useful to others.

When data files in common formats (including images, movies, ESRI Shape files, CSV files, documents, and slides) are uploaded into SEAD Project Spaces, previews are generated and metadata stored within the file are automatically extracted and made visible to enhance your ability to understand the file contents. If your project uses other formats, we encourage you to contact SEAD about the possibility of adding additional extractors or previewers to your Project Space.

To support your data publication process, SEAD provides a Staging Area where you can review and dynamically compare your data and metadata against the requirements of specific repositories to identify any issues you may have with unsupported formats and/or missing required metadata. This process enables you to quickly update your submission to assure that it meets the policies of the repository you choose.
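The kind of comparison the Staging Area performs can be illustrated with a minimal sketch. This is not SEAD's implementation: the `RepositoryProfile` class, the `review_submission` function, and the idea that a repository's requirements reduce to a set of required metadata terms and accepted file extensions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class RepositoryProfile:
    """Hypothetical summary of one repository's submission requirements."""
    name: str
    required_metadata: frozenset    # metadata terms that must be present and non-empty
    accepted_extensions: frozenset  # lowercase suffixes such as ".csv"; empty means any format


def review_submission(metadata: dict, filenames: list, profile: RepositoryProfile) -> dict:
    """Report missing required metadata and files in formats the repository does not accept."""
    # A term counts as missing when it is absent or has an empty value.
    missing = sorted(
        term for term in profile.required_metadata if not metadata.get(term)
    )
    unsupported = []
    if profile.accepted_extensions:
        # Compare suffixes case-insensitively so "SITES.SHP" is treated like "sites.shp".
        unsupported = sorted(
            name for name in filenames
            if PurePath(name).suffix.lower() not in profile.accepted_extensions
        )
    return {"missing_metadata": missing, "unsupported_files": unsupported}
```

Running the same submission against several profiles would show which repositories it already satisfies and what would need to change for the others.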

III. Policies for Access and Sharing and Provisions for Appropriate Protection/Privacy

SEAD provides mechanisms for you to control access to your data. The SEAD team also works to assure that your data are secure within SEAD's services. If your data raise any specific access or privacy issues, we encourage you to contact the SEAD team to discuss your needs. SEAD's Project Spaces are access-controlled and use encrypted (https) communications. Project Spaces are created by project teams within SEAD's hosted services. Project Space administrators control access to a specific Project Space and the privileges granted to users within that space (e.g. to upload and annotate data, or to have view-only access). You have full control over who in your team has administrative rights within SEAD (and can therefore control who sees the data and who is able to edit/add/publish data). Teams also have the option to mark some or all of their data publicly visible, or to allow an extended team to have view-only access to the data, providing a “preprint” or “preview” mechanism for data before they are officially published.

Machines housing SEAD services are monitored using the standard software/procedures implemented by SEAD's parent organizations: the National Center for Supercomputing Applications (NCSA) at the University of Illinois, Indiana University, and the University of Michigan.

IV. Plans for Archiving and Preservation of Access

SEAD's design supports incremental upload and annotation of data, making it possible for project teams to manage more of their data, with more metadata, and earlier in their process. By allowing this active use of data during the project, SEAD technology enables researchers to add a layer of quality assurance that does not exist when data are submitted for publication and annotated only at the end of the project's life. SEAD's RESTful API can be used to avoid manual transcription errors, and many features of SEAD's user interface (including support for controlled vocabularies, use of type-ahead capabilities to encourage re-use of previously entered terms, previews, and metadata extraction) are designed to reduce input errors. Data within SEAD's Project Spaces are stored on redundant disk arrays and are periodically backed up.

Publication in SEAD involves automated data and metadata review, assignment of a persistent global identifier (by default a Digital Object Identifier (DOI)), and submission to one of several long-term repositories working with SEAD that have documented policies and practices for preserving data. By default, publication through SEAD results in the creation and preservation of Open Archives Initiative Object Reuse and Exchange (OAI-ORE) documentation of the contents, structure, and metadata as published along with the submitted data. Cryptographic hashes are generated as data are uploaded into SEAD Project Spaces and are transmitted with data throughout the publication process. They can be used at any point in the future to verify fixity, that is, that there have been no byte-level changes to the data since they were submitted to SEAD. SEAD also registers all published data with the DataONE federated data catalog, which provides faceted search over data from a broad range of projects.
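A fixity check of the kind described above can be sketched in a few lines. The text does not say which hash algorithm SEAD uses, so SHA-256 is an assumption here, and the function names `file_digest` and `verify_fixity` are illustrative rather than part of any SEAD interface.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks so large files are not loaded whole."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_fixity(path: Path, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Return True when the file's current digest matches the digest recorded at upload time."""
    return file_digest(path, algorithm) == recorded_digest.strip().lower()
```

A digest recorded at upload and stored alongside the publication lets anyone holding a copy of the file repeat this comparison later; a single changed byte produces a different digest.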

All repositories that accept data published through SEAD work to assure the preservation of the published datasets and provide continuing access to them. The technical means used to preserve the datasets, i.e. the 'raw bytes', vary across repositories, but include use of error-correction codes and/or multiple data copies, as well as ongoing migration of services and data to new platforms and storage media over time as required by best practice in digital preservation. SEAD's default repository at Indiana University packages data publications as storage-efficient zip archives compliant with the BagIt specification (used by the Library of Congress and a wide range of research data repositories), copies of which are stored in two geographically separate locations. Individual repositories working with SEAD may also be able to provide further review of data as they are published, longer retention periods, migration of data to new formats over time, and additional policies that may be important for your data publications. SEAD encourages researchers to discuss such specific needs with SEAD, its existing partners, and/or institutional or domain repositories that accept SEAD data packages. SEAD will work to preserve data in perpetuity and seeks a minimum planned data retention period of 5 years from partner repositories. Should SEAD and its partner institutions become unable to continue maintaining the data, efforts will be made to contact the project team that published the data and/or their institutions to transfer the publications to another provider. To date, SEAD has been used to publish collections including more than 2.2 million files representing more than 1.4 TB of information, with the oldest collections having been managed within SEAD without data loss for more than 4 years.

SEAD is directly involved in national and international efforts, including the Research Data Alliance and National Data Service Consortium, to provide universal enduring storage for research data and will work to make data published through SEAD compatible with such services as they emerge.
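To make the BagIt packaging mentioned in this section more concrete, the sketch below writes a minimal zipped bag using only the Python standard library. It is an illustration, not the packaging code used by SEAD or its default repository: the function name `write_bag_zip`, the choice of BagIt version 1.0, and the SHA-256 manifest are assumptions, and a production bag would normally also carry descriptive tag files such as bag-info.txt.

```python
import hashlib
import zipfile
from pathlib import Path


def write_bag_zip(source_dir: Path, zip_path: Path) -> list:
    """Package every file under source_dir as a zipped BagIt bag and return the manifest lines.

    zip_path should lie outside source_dir so the archive does not try to include itself.
    """
    bag_root = source_dir.name  # a serialized bag holds a single top-level directory
    manifest_lines = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in sorted(p for p in source_dir.rglob("*") if p.is_file()):
            # Payload files live under data/ inside the bag.
            payload_name = "data/" + file_path.relative_to(source_dir).as_posix()
            content = file_path.read_bytes()  # simplification: reads each file fully into memory
            digest = hashlib.sha256(content).hexdigest()
            manifest_lines.append(f"{digest}  {payload_name}")
            archive.writestr(f"{bag_root}/{payload_name}", content)
        # The bag declaration names the BagIt version and the encoding of the tag files.
        archive.writestr(
            f"{bag_root}/bagit.txt",
            "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        )
        # The payload manifest lists one checksum and path per line.
        archive.writestr(
            f"{bag_root}/manifest-sha256.txt",
            "".join(line + "\n" for line in manifest_lines),
        )
    return manifest_lines
```

Because the manifest travels inside the archive, a recipient can recompute each payload checksum and confirm that the bag is complete and unchanged without contacting the original publisher.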

V. Policies and Provisions for Re-Use and Re-Distribution

SEAD encourages the use of open data licenses, such as the Creative Commons (CC) licenses, for data publications. Teams can request a different license if it is supported by the repository chosen for publishing and archiving the data. Unless arranged beforehand, SEAD will make published data available under its open data license in an unrestricted manner and will not enforce access control or other license provisions. SEAD also works with repositories that support data embargoes and/or enforce access controls for sensitive, restricted-use data (e.g. data involving privacy issues). SEAD encourages researchers to discuss such specific capabilities with SEAD in advance.