Academic Publishing Navigator, 2025, Art. 19

Research Data Management 101: Ethical Sharing, Archiving, and Reuse

I. Foundational Concepts: RDM within the Research Lifecycle

Research Data Management (RDM) is a systematic framework essential for ensuring the longevity, utility, and accountability of scholarly outputs. This process of deciding and documenting how data is collected, organized, stored, and shared is critical because research data often possesses a lifespan significantly longer than the projects that generate it [1]. Effective RDM transcends simple file storage; it is an organizational discipline designed to anticipate long-term requirements and mitigate risks from the outset.

A. The RDM Imperative: Strategic Value and Policy Compliance

Proactive engagement with RDM practices yields measurable strategic advantages for researchers and institutions. Planning RDM needs in advance enables clear organization, saving considerable time and reducing the burden of last-minute preparation before publication or archiving [1]. Furthermore, rigorous management and documentation throughout the entire project maintain data integrity, allowing researchers and external collaborators to accurately understand and use the data in the future [1].

The strategic value of RDM extends into broader scholarly benefits. Sharing data can lead to new, unanticipated discoveries, promotes innovation, and accelerates new collaborations [1]. Moreover, it provides valuable research material for scholars with limited funding, encourages the validation and improvement of research methods, and reduces the substantial cost associated with duplicating data collection efforts [1].

The institutional drive for RDM is increasingly underpinned by policy compliance. Many funding agencies and academic journals now mandate that researchers produce and adhere to a detailed Data Management Plan (DMP) and/or a Data Sharing Plan [1, 2]. For example, the U.S. National Institutes of Health (NIH) requires all applicants generating scientific data to prepare a Data Management and Sharing (DMS) Plan [3]. This institutional requirement demonstrates that RDM is no longer merely a recommendation for good practice but a core institutional risk management function. The requirement for institutions to monitor and manage compliance with the DMS Plan reinforces that adherence to RDM standards is directly tied to securing and maintaining grant funding, necessitating sustained investment in infrastructure, advisory services, and staff training [3, 4].

B. The Research Data Lifecycle and the Centrality of Planning

The research data lifecycle is a conceptual model illustrating the stages of data management, guiding how data flows from conception to reuse [1]. Although often depicted sequentially, the individual phases of RDM frequently overlap, emphasizing that anticipatory decisions made early in the process determine the success of later stages [5].

The cycle begins with Planning, conducted before any data collection starts. This foundational step requires designing RDM measures in advance, planning resources for data processing and storage, and, critically, ensuring the correct implementation of legal and ethical provisions governing data collection and subsequent publication [2]. The subsequent phases include Data Collection (generating data through various methods), Preparation/Analysis (focusing on temporary storage, version control, and internal sharing among the project group), and the final stages of Archiving, Publication, and Reuse [2]. The ultimate goal of these later phases is to make the data Findable, Accessible, Interoperable, and Reusable (FAIR) to maximize external impact [2].

During the active management phase, rigorous documentation is crucial. Researchers must document the context of data collection, the methods used, the structure and organization of data files, and any validation or quality assurance steps performed [6]. They must also record the software used for analysis and any transformations applied to the raw data [6, 7]. This systematic documentation process ensures that the tacit knowledge held by the primary research team is externalized into accessible, structured protocols. This externalization is essential for validation, reproducibility, and, practically, maintaining project continuity, especially when personnel changes occur [1].
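
As a concrete illustration of this documentation habit, the sketch below appends a machine-readable provenance entry (the software environment plus one transformation step) to a running log. It is a minimal example rather than a tool named in the sources, and the file names and fields are hypothetical.

    import json
    import platform
    import sys
    from datetime import datetime, timezone

    def log_step(logfile, description, inputs, outputs):
        """Append one documented transformation step to a JSON-lines provenance log."""
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "python": sys.version.split()[0],    # software used for analysis
            "platform": platform.platform(),
            "step": description,                 # transformation applied to the data
            "inputs": inputs,
            "outputs": outputs,
        }
        with open(logfile, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

    # Hypothetical example: document a cleaning step applied to the raw data.
    log_step("provenance.jsonl",
             "Dropped records with missing consent flag",
             inputs=["raw/survey_2025.csv"],
             outputs=["clean/survey_2025_v2.csv"])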

C. The Data Management Plan (DMP): A Living Compliance Roadmap

The Data Management Plan (DMP) is the primary formal document outlining how data will be organized, stored, preserved, and disseminated throughout the entire research lifecycle [8]. It details how data security requirements will be met and specifies the long-term archiving strategy [8]. Due to the evolving nature of research, the DMP is considered a living document and must be adjustable as the project course changes [8].

DMPs serve a dual purpose: they guide researchers in improving project efficiency and reproducibility while also satisfying the mandatory compliance requirements imposed by funding bodies [8]. Tools like the DMPTool are often used to facilitate the creation and review of these plans using institution- and funder-specific templates [8].

Essential DMP Elements (Checklist Approach)

Regardless of the specific funder template, a comprehensive DMP generally requires addressing the following key sections:

  1. Data Description and Inventory: This section identifies the scientific data to be generated, detailing the type of data (numerical, image, text), how it will be collected, the format (including whether the format is open standard or proprietary), the estimated quantity and file size, and the source (newly generated or existing data) [9, 10].

  2. Standards and Documentation: This mandates the definition of robust data standards, including the use of metadata schemas standard to the field. It requires specifying file naming conventions, version control strategies, and the creation of essential human-readable documentation, such as data dictionaries, codebooks, or ReadMe files (a minimal example of such documentation follows this list) [6, 9].

  3. Storage, Security, and Backup: Researchers must specify local storage methods and locations, backup schedules, and how privacy, ethics, and legal concerns will be addressed through security measures [6].

  4. Preservation and Sharing: The plan must identify the chosen archive or repository, the preferred long-term preservation formats, and the strategy for dissemination and public access [6, 10].

  5. Roles and Oversight: The DMP must clearly designate who is responsible for managing the data currently and in the future, and who at the institution will monitor and manage compliance with the plan [3, 9].
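
To make item 2 concrete, the sketch below generates the two documentation artifacts it names: a data dictionary (one row per variable) and a ReadMe stub. The variables, file names, and naming convention shown are hypothetical local choices, not a standard prescribed by the cited sources.

    import csv

    # Hypothetical variables for a small survey dataset.
    variables = [
        {"name": "participant_id", "type": "string", "units": "",
         "description": "Pseudonymous participant identifier"},
        {"name": "age_group", "type": "string", "units": "years",
         "description": "Generalized age bracket, e.g. 21-40"},
        {"name": "score", "type": "float", "units": "points",
         "description": "Composite survey score"},
    ]

    with open("data_dictionary.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["name", "type", "units", "description"])
        writer.writeheader()
        writer.writerows(variables)

    with open("README.txt", "w", encoding="utf-8") as fh:
        fh.write("Dataset: <project title>\n"
                 "Collection context and methods: <fill in>\n"
                 "Files: survey.csv (variables described in data_dictionary.csv)\n"
                 "File naming convention: YYYYMMDD_project_description_vNN.ext\n")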

The following table summarizes the key stages of the RDM lifecycle and the corresponding deliverables that must be generated:

Table I: The Research Data Management Lifecycle Stages and Key Deliverables

RDM Stage | Description and RDM Measures | Key Deliverables/Tooling | Source(s)
Planning | Anticipatory design, resource allocation, and adherence to legal/ethical frameworks before collection starts; focus on anticipating storage and legal needs. | Data Management Plan (DMP); Ethics Review/IRB Protocol | [1]
Active Data Management | Collection, preparation, analysis, temporary storage, and secure sharing within the defined project group; focus on version control and quality assurance. | File Naming Conventions; Version Control System; Documentation of Analysis/Software | [2]
Ethical Vetting | Implementing consent, minimizing harm, data minimization, and employing necessary security controls for sensitive data. | Informed Consent Forms; Access Control Lists (Principle of Least Privilege); Encryption Protocols | [11]
Archiving and Preservation | Preparing data for long-term storage, curation, selecting preservation formats, and implementing measures to make data FAIR. | Repository Selection; Preferred Archival File Formats; Comprehensive Metadata Schema (e.g., DataCite) | [2]
Publication and Reuse | Making data visible, accessible, and citable to maximize impact; defining legal permissions for secondary use. | Persistent Identifier (DOI/Accession Number); Open/Controlled Access License (CC0, CC BY, DUC); Data Citation | [2]

II. Ethical Sharing and Sensitive Data Stewardship

Ethical considerations form a critical foundation for RDM, particularly when managing human-derived or other sensitive data, such as Personally Identifiable Information (PII) or Personal Health Information (PHI) [21]. These principles, which include voluntary participation, informed consent, and minimizing the potential for harm, must guide research design and practices from the proposal stage forward [6, 11].

A. Establishing Ethical Groundwork

The ethical responsibility to participants begins with ensuring Voluntary Participation—that subjects are free to opt in or out of the study at any point without coercion—and Informed Consent [11]. Informed consent is a pivotal technical mandate: participants must receive and understand all information regarding the study’s purpose, benefits, risks, funding, and crucially, how their data will be stored, protected, used, and potentially shared in a de-identified format [11, 12]. The agreements established during this consent procedure directly dictate the necessary extent of anonymity implemented in RDM practices [21].

A core distinction exists between anonymity and confidentiality. Anonymity means that personally identifiable data is never collected, making it impossible to link any individual participant to their data [11]. Confidentiality means that researchers collect identifiable data but commit to keeping that information hidden from everyone else, often by anonymizing the data before it is linked to other research results or shared externally [11].

Furthermore, RDM practices for sensitive data must adhere to Data Minimization, a key principle under regulations such as the European Union’s General Data Protection Regulation (GDPR) [13]. This involves collecting only the data strictly necessary for the research purpose, avoiding unnecessary requests for identifiable information (such as full names) if they are not integral to the study [12, 21].

B. Managing Sensitive Data Protocols

Sensitive data—which encompasses personal characteristics, location data, trade secrets, financial information, and certain biodiversity data [13]—requires dedicated security protocols. Protection against unwanted disclosure is necessary for legal, ethical, and proprietary reasons [13].

The management of sensitive data relies on several interconnected security principles:

  1. Encryption: All sensitive data must be encrypted [12].

  2. Secure Storage and Disposal: Data must be stored using only institutional and IRB-approved, sanctioned storage systems. Processes must also be established to ensure secure disposal once retention periods are met [12].

  3. Principle of Least Privilege: This mandates strictly controlling access, ensuring that only research team members listed on the IRB protocol have access to the data, and only to the minimum extent necessary to perform their specific roles [12].

  4. Institutional Approval: Research involving highly sensitive government-licensed datasets (e.g., those containing PII or PHI) requires formal institutional approval to guarantee compliance with security requirements [21].
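
The sources do not prescribe specific tooling for these controls. As one illustration of the encryption requirement in item 1, the sketch below encrypts a file with the symmetric Fernet scheme from the widely used Python cryptography package; in practice, key generation, storage, and the choice of system must follow institution- and IRB-approved procedures, and the file names here are placeholders.

    from cryptography.fernet import Fernet

    # Generate a symmetric key; in real use, manage keys through approved
    # institutional infrastructure and never store them beside the data.
    key = Fernet.generate_key()
    fernet = Fernet(key)

    with open("responses.csv", "rb") as fh:      # hypothetical sensitive file
        ciphertext = fernet.encrypt(fh.read())

    with open("responses.csv.enc", "wb") as fh:
        fh.write(ciphertext)

    # Later, holders of the key can recover the plaintext:
    # plaintext = Fernet(key).decrypt(ciphertext)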

C. Technical Mechanisms for Privacy Protection

De-identification is the primary method employed to protect participant privacy before data sharing occurs [12]. However, applying any anonymization strategy involves an inherent trade-off between maximizing privacy protection (reducing re-identification risk) and preserving data utility for future analysis [22, 23].

K-Anonymity and Differential Privacy

K-Anonymity is a technique designed to prevent re-identification attacks by ensuring that each record in a released dataset is indistinguishable from at least $K-1$ other records that share the same quasi-identifier values [24, 25]. This grouping makes it significantly more difficult for an attacker to link anonymous records with public information and identify specific individuals [24].

Implementation of K-anonymity commonly utilizes two methods [25]:

  • Suppression: Replacing certain sensitive values with an asterisk, potentially suppressing all or some values in a column [25].

  • Generalization: Replacing individual specific values (e.g., an age of "19") with a broader category (e.g., "$\le 20$") or using global recoding to group continuous or discrete numerical variables into predefined classes [24, 25].
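
A minimal sketch of these two operations in Python (assuming pandas is installed and using hypothetical column names: name as a direct identifier, age and zip_code as quasi-identifiers) is shown below; it finishes by checking that every quasi-identifier combination occurs at least $K$ times.

    import pandas as pd

    K = 5
    df = pd.read_csv("participants.csv")    # hypothetical input file

    # Suppression: replace the direct identifier column with '*'.
    df["name"] = "*"

    # Generalization: recode exact ages into broad classes (global recoding).
    bins = [0, 20, 40, 60, 120]
    labels = ["<=20", "21-40", "41-60", ">60"]
    df["age"] = pd.cut(df["age"], bins=bins, labels=labels)

    # Verify k-anonymity over the quasi-identifiers.
    quasi = ["age", "zip_code"]
    group_sizes = df.groupby(quasi, observed=True).size()
    if (group_sizes >= K).all():
        print(f"Dataset satisfies {K}-anonymity over {quasi}")
    else:
        print("Some equivalence classes are smaller than K; generalize further")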

A measured approach to applying K-anonymity is necessary because excessive generalization and suppression can skew analytical results, especially in high-dimensional datasets [25]. Research has shown that while re-identification risk can be reduced by up to $16.82\%$, the corresponding data quality difference may reach $4.80\%$ in some configurations, illustrating the crucial balance researchers must strike between privacy guarantees and data utility [23].

For contexts requiring the highest level of privacy, Differential Privacy is utilized. Unlike K-anonymity, which prevents identity linkage, differential privacy provides a mathematically proven guarantee by restricting the amount of information that any external party can learn about an individual, regardless of the attacker's background knowledge [24].
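
The sources describe the guarantee rather than an implementation, but the classic Laplace mechanism illustrates the idea: adding noise calibrated to a query's sensitivity and a privacy budget epsilon bounds what any output can reveal about a single individual. The query, data, and epsilon below are illustrative.

    import numpy as np

    def dp_count(values, predicate, epsilon=1.0):
        """Return a noisy count; the sensitivity of a counting query is 1."""
        true_count = sum(1 for v in values if predicate(v))
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        return true_count + noise

    ages = [19, 23, 31, 44, 52, 61, 19, 27]
    # Smaller epsilon = stronger privacy, noisier answer.
    print(dp_count(ages, lambda a: a < 30, epsilon=0.5))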

The Controlled Access Architecture

For the most sensitive human data, such as genomic information subject to policies like the NIH Genomic Data Sharing (GDS) Policy, the preferred method of sharing is through Controlled-Access Repositories (e.g., NIH's dbGaP) [12]. These repositories do not make data public but instead govern access through a sophisticated institutional framework:

  1. Deposit: Researchers deposit a de-identified dataset [12].

  2. Review: External researchers must apply for access. A Data Access Committee (DAC) reviews the request to ensure it aligns with legitimate scientific purposes, participant informed consent, and the established Data Use Limitations [12].

  3. Legal Mandate: Approved users must sign a legally binding Data Use Certification (DUC) or Data Use Agreement (DUA), which stipulates the terms of use and commits the researcher and their institution to protect the data [12].

This structure confirms that effective sensitive data stewardship is not purely a technical exercise in encryption but an active legal and governance process driven by the DAC and supported by legal agreements [26]. The following table summarizes the primary access control mechanisms implemented in RDM:

Table II: Mechanisms for Controlled Data Access

Access Mechanism | Access Requirements | Data Type Typically Handled | Key Safeguards/Policy
Open Access | None; public sharing. | Non-sensitive, highly reusable data. | Permissive licenses (CC0, CC BY)
Registered/Authenticated Access | User registration or authentication (e.g., institutional login). | Data requiring basic usage tracking/attribution or adherence to simple Terms of Use. | Registration/authentication procedure [27]
Controlled Access | Review and approval by a Data Access Committee (DAC). | Sensitive human data (PII, PHI, genomic data) [12]. | Data Use Certification (DUC)/Agreement (DUA) [12]; Principle of Least Privilege [12]
Metadata-Only Access | Only metadata is shared publicly. | Highly restricted/proprietary datasets where the data itself cannot be released. | Open metadata sharing as good practice, even if data is restricted [21]

III. Archiving and Digital Preservation

Archiving is the essential long-term strategy for data stewardship, focusing on guaranteeing the integrity, authenticity, and sustained availability of research data far beyond the project's funding timeline [14].

A. Selecting Trustworthy Data Repositories

The choice of repository is crucial for long-term preservation and future reuse. Repositories provide the necessary infrastructure to manage and maintain data [28].

Repository Hierarchy and Criteria

Primary consideration should be given to repositories that are discipline- or data-type-specific, as they are best equipped to support field-specific discovery, standards, and effective reuse by the relevant research community [14]. If no appropriate disciplinary repository exists, researchers should consider institutional repositories (which provide local stewardship and preservation) or generalist data repositories (such as Dryad or Figshare), which accept a wide variety of file types regardless of discipline [14, 28].

Trustworthy repositories must adhere to rigorous criteria established by funding agencies and scholarly bodies:

  • Unique Persistent Identifiers (PIDs): Datasets must be assigned a citable, unique persistent identifier, typically a Digital Object Identifier (DOI) or an accession number. PIDs ensure persistence and facilitate data discovery, reporting, and research assessment [14, 17, 19].

  • Long-Term Sustainability: The repository must possess a clear plan for long-term data management, including stable technical infrastructure, secure funding plans, and contingency protocols to maintain data integrity, authenticity, and availability during and after unforeseen events [14].

  • Open Access Principles: The repository must be open to all researchers within its scope, provide the option for data to be released under highly permissive licenses (specifically CC0 or CC BY), and must not charge readers access fees or subscription fees [19].

CoreTrustSeal Certification and Curation

Institutional and disciplinary repositories seeking to demonstrate accountability and reliability can pursue CoreTrustSeal certification [29]. This certification reflects adherence to requirements related to preservation, designated community definition, and curatorial standards [30].

A fundamental requirement of trustworthy archiving is the clear definition of the Designated Community—the specific group of users the repository intends to serve [30]. If the designated community is broad, the repository must offer extensive contextual documentation to ensure the data is comprehensible to all intended users, addressing potential tacit assumptions regarding language skills, software requirements, or operating systems [30]. This linkage between the administrative act of defining the user base and the technical function of providing rich metadata confirms that archiving is an active curatorial process, essential for achieving true interoperability and reuse.

B. Strategies for Long-Term Preservation

The technical longevity of archived data relies heavily on the selection of appropriate file formats. Preservation strategy mandates the use of non-proprietary, well-documented, and widely implemented formats, maximizing the likelihood that the data can be rendered accurately years or decades in the future, regardless of software evolution [31].

The following file formats are highly recommended for long-term digital preservation:

Table III: Preferred Archival File Formats for Long-Term Preservation

Content Type | Primary Preservation Format (Preferred) | Key Justification | Source(s)
Tabular/Statistical Data | Comma-Separated Values (.csv), Tab-Separated Values (.tsv), or delimited text (.txt) | Open standard, platform independent, vendor-neutral; minimizes data loss compared to proprietary spreadsheets. | [15]
Documents/Text | PDF/A (archival standard) | ISO standard specifically designed for long-term electronic document preservation. | [15]
Raster Images | TIFF (uncompressed) or PNG | High-fidelity, lossless formats that maintain image integrity over time. | [15]
Geospatial Data | GeoTIFF (.tiff) or Geography Markup Language (.gml) | Supports integrated location metadata and specialized geographical data structures. | [32]
Audio | Broadcast WAV (BWF, .wav) or AIFF | Non-proprietary, uncompressed digital audio standards preferred for archival quality. | [15]
Video | FFV1 in a Matroska container (.mkv) or Motion JPEG 2000 | Open codecs and standard containers, favored over proprietary video formats for stability. | [15]

By selecting these formats, institutions prioritize technical stability and vendor independence, critical factors in mitigating the risks associated with rapid technological obsolescence and ensuring the data remains accessible and authentic [14].
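
As a practical illustration of this migration step, the sketch below converts a proprietary spreadsheet into the preferred open tabular format (UTF-8 CSV, one file per sheet, since CSV cannot represent multiple sheets). It assumes pandas with an Excel reader engine such as openpyxl is installed; the file names are examples.

    import pandas as pd

    # sheet_name=None returns a dict mapping sheet names to DataFrames.
    sheets = pd.read_excel("results.xlsx", sheet_name=None)
    for sheet_name, frame in sheets.items():
        frame.to_csv(f"results_{sheet_name}.csv", index=False, encoding="utf-8")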

IV. Maximizing Reuse: Standards, Principles, and Governance

Maximizing the reuse of research data is a central goal of RDM and requires adhering to both technical standards for machine-actionability and ethical principles for human-centric governance.

A. The FAIR Principles: Enabling Technical Reuse

The FAIR Principles—Findable, Accessible, Interoperable, and Reusable—provide the technical framework for optimizing data publishing practices, repository standards, and analytical services globally [33, 34]. The fundamental objective of FAIR is to optimize data reuse by ensuring robust technical standards are met:

  • Findable (F): Data must be uniquely identifiable and locatable through the use of Persistent Identifiers (PIDs) and rich, descriptive metadata [33].

  • Accessible (A): Data must be retrievable via transparent access policies and appropriate protocols, even if access is restricted to controlled environments [33].

  • Interoperable (I): Data must be structured to work across platforms, tools, and domains, relying on standardized vocabularies, terminologies, ontologies, and machine-readable metadata [33, 35].

  • Reusable (R): Data must be accompanied by clear context, provenance (the history of its origin and processing), and explicit legal licensing to enable repeated use and validation [33].

Despite widespread adoption, implementation often falls short of achieving true FAIRness in practice, primarily due to uneven metadata quality, thin documentation, and unclear provenance [33]. To address this gap, automated assessment tools are employed: for instance, FAIR-Aware helps researchers self-assess their knowledge of FAIR requirements before deposit [36], while F-UJI provides an objective, automated evaluation and scoring of a dataset’s FAIRness based on established metrics [37].
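
For orientation only, a deliberately naive pre-deposit self-check in the same spirit might test that a metadata record carries the fields most often found missing. The record structure below is a local convention invented for illustration, not the scoring logic of FAIR-Aware or F-UJI, which should be used for real assessments.

    REQUIRED = {
        "identifier": "Findable: persistent identifier (e.g., DOI)",
        "description": "Findable/Reusable: rich description",
        "license": "Reusable: explicit license",
        "provenance": "Reusable: origin and processing history",
        "format": "Interoperable: open, documented format",
    }

    def self_check(record: dict) -> list[str]:
        """Return human-readable warnings for missing or empty fields."""
        return [hint for field, hint in REQUIRED.items() if not record.get(field)]

    # Hypothetical record that would fail three of the five checks.
    record = {"identifier": "10.1234/example.doi", "license": "CC0-1.0"}
    for warning in self_check(record):
        print("Missing:", warning)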

B. Indigenous Data Governance: The CARE Principles

While FAIR principles are data-centric, the CARE Principles for Indigenous Data Governance are people- and purpose-oriented, addressing the critical role of data in advancing Indigenous innovation and self-determination [38]. The CARE framework—Collective Benefit, Authority to Control, Responsibility, and Ethics—was developed by the Global Indigenous Data Alliance (GIDA) as an essential complement to FAIR [39].

The core contribution of CARE is shifting data governance from simple consultation to value-based relationships that promote equitable Indigenous participation [39].

The principle of Authority to Control (ATC) is particularly critical. It mandates that Indigenous Peoples’ rights and interests in their data must be recognized and their authority to control how those data—including data pertaining to lands, territories, and knowledge—are represented and identified must be empowered [40]. This requires acknowledging Indigenous Peoples’ collective and individual rights to free, prior, and informed consent in the collection and use of such data, including the development of data policies and protocols [40].

For researchers and institutions, this governance requirement acts as a precondition for ethical reuse. Achieving the technical standard of Reusability (R) under FAIR is insufficient if the necessary ethical and legal framework stipulated by CARE, particularly ATC, is not met. RDM service providers must therefore integrate mechanisms for Indigenous Data Governance (IDG) into their policy structures, ensuring that data stewardship practices address historical power imbalances and create value that supports Indigenous governance and citizen engagement [39].

The relationship between these two vital governance frameworks is summarized below:

Table IV: Comparative Analysis of Data Governance Principles

Principle Set | Core Orientation | Primary Focus/Objective | Key Concepts | Implication for Governance
FAIR | Data-centric [39] | Optimize technical reuse and machine actionability [33]. | Findable, Accessible, Interoperable, Reusable; persistent identifiers. | Focus on technical standards, metadata quality, and repository capability.
CARE | People- and purpose-oriented [38] | Advance Indigenous innovation and self-determination [38]. | Collective Benefit, Authority to Control, Responsibility, Ethics. | Focus on legal frameworks (Indigenous Data Sovereignty), institutional engagement, and ethical power-sharing [39].

C. Enabling Data Discovery and Attribution

The final steps in maximizing data reuse involve establishing persistent identity, clear attribution, and defining legal permissions.

Metadata and Documentation Standards

Metadata—the information that describes the who, what, when, where, why, and how of the research—is the essential component that facilitates search, retrieval, and appropriate use [41]. Metadata must be sufficiently detailed to allow a user to reconstruct the context of the data collection, evaluate its fitness for their purpose, and analyze it appropriately [41].

To achieve Interoperability (I), metadata must be structured using agreed-upon standards. The use of common terminologies, ontologies, and standardized formats is critical for machine-readability, enabling users to learn about and utilize data quickly using code [35]. A notable example is the DataCite Metadata Schema, which focuses on a list of core properties chosen for the accurate and consistent identification of a resource, typically a dataset, for citation and retrieval purposes [16]. The DataCite schema maintains compatibility by mapping its properties to the Dublin Core Metadata Initiative Schema, facilitating cross-disciplinary data discovery [42].
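
As an illustration, a minimal dataset record covering the mandatory DataCite kernel properties (identifier, creator, title, publisher, publication year, resource type) might be serialized as below. The property names follow the style of DataCite's JSON conventions but should be checked against the current schema before use, and all values are invented.

    import json

    # All values are placeholders for a hypothetical dataset.
    record = {
        "doi": "10.1234/example.dataset",
        "creators": [{"name": "Doe, Jane"}],
        "titles": [{"title": "Example survey dataset, wave 1"}],
        "publisher": "Example University Repository",
        "publicationYear": 2025,
        "types": {"resourceTypeGeneral": "Dataset"},
    }
    print(json.dumps(record, indent=2))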

Persistent Identifiers and Citation Requirements

Data must be recognized as legitimate, citable products of research [18]. This requires robust archiving and direct citation, just as literature is cited [43]. The Joint Declaration of Data Citation Principles (JDDCP) sets the standards for this practice, asserting that data citations must be both human-understandable and machine-actionable [18].

Persistent Identifiers (PIDs), such as the Digital Object Identifier (DOI) or accession numbers, are assigned by repositories at publication to provide a persistent link to the dataset, ensuring its enduring findability [17, 44].

Crucially, the JDDCP emphasizes the need for Specificity and Verifiability [18]. Citation metadata must include information on the data’s provenance and fixity sufficient to verify that the exact timeslice, version, or granular portion of data retrieved later is identical to what was originally cited [18]. Tools like the Universal Numerical Fingerprint (UNF) employ cryptographic technology to ensure that the unique alphanumeric identifier changes if any portion of the dataset is altered, fulfilling this fixity requirement [45].
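
The fixity idea can be illustrated with a plain cryptographic digest: record a checksum at deposit, then recompute and compare it at every later retrieval. Note that UNF additionally normalizes the data before hashing so that semantically identical files compare equal; the byte-level SHA-256 sketch below is a simpler stand-in, and the file name is hypothetical.

    import hashlib

    def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
        """Stream the file so large datasets need not fit in memory."""
        sha = hashlib.sha256()
        with open(path, "rb") as fh:
            while chunk := fh.read(chunk_size):
                sha.update(chunk)
        return sha.hexdigest()

    # Record this digest alongside the citation metadata at deposit time.
    print(file_digest("survey_2025_v2.csv"))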

Data Licensing for Legal Reuse

Data licensing provides the standardized legal mechanisms to grant the public permission to use copyrighted research materials, thereby removing legal barriers to reuse [46]. The Creative Commons (CC) framework is the standard for research data [46].

  • CC BY (Attribution): Requires users to attribute the original creator, a common requirement that ensures scholarly credit [18].

  • CC0 Public Domain Dedication: This dedication is preferred for maximizing open data reuse and accelerating scientific impact [19]. By dedicating the work to the public domain and waiving all copyright restrictions worldwide, CC0 eliminates legal friction related to derivative works and jurisdictional complexities [47]. Institutional policies that prioritize the dedication of data to the public domain (where legally and ethically permissible) strategically favor the maximization of scientific impact, collaboration, and unanticipated discoveries over maximizing attribution rights [1].

For sensitive data, licensing is coupled with access controls. Open access data may use CC0 or similar permissive licenses [48], while safeguarded data relies on bespoke licenses (e.g., Special Licence or End User Licence) combined with the legally binding Data Use Agreements (DUA) enforced by controlled-access repositories [20, 48].

V. Implementation Roadmap: Best Practices and Institutional Oversight

Effective RDM implementation requires a continuous, goal-oriented workflow for researchers and robust, sustainable support services from the host institution.

A. Institutional RDM Service Development

Institutions must develop and maintain comprehensive RDM services based on core capabilities, ensuring long-term sustainability and compliance readiness [4]. These capabilities include:

  • Policy and Sustainability: Developing and maintaining institutional RDM policies, coupled with solid business plans that detail staff investment, technological investment, and cost modeling to secure the sustainability of RDM services [4].

  • Support Services: Providing extensive advisory services and training to researchers and support staff in both online and in-person formats [4].

  • Data Management Planning: Offering specific support and infrastructure (e.g., access to the DMPTool) to help researchers effectively plan the data component of their projects and produce compliant Data Management Plan documentation [4, 8].

  • Active Data Management: Offering active data management services, including secure storage, collaboration support, scalability, and publishing mechanisms that adhere to open access principles [4].

  • Compliance and Assessment: Implementing processes for appraisal and risk assessment to identify valuable data and mitigate associated risks. This includes supporting assessments of datasets and RDM services to ensure compliance with FAIR principles [4]. Furthermore, participation in national and international initiatives, such as the European Open Science Cloud (EOSC), enhances the further development and linkage of research infrastructures necessary for data reuse [2, 4].

B. Actionable RDM Workflow for Researchers

Researchers can establish an effective RDM workflow through actionable, phased steps designed to integrate RDM into the research project structure [49].

  1. Define Clear Goals: Establish a clear objective for data management, ideally aiming for alignment with institutional and funder best practices, such as maximizing FAIR compliance [49].

  2. Understand Best Practices and Review Current Management: Review existing data management procedures to identify current gaps in data handling, technical capabilities, and adherence to protocols. Researchers must identify the type of data, the software used, and the necessary technical requirements for storage and analysis [6, 49].

  3. Standardize Organization and Documentation: Develop consistent organizational protocols, including defining file structures and creating a clear, descriptive naming system for files. File names should incorporate relevant details like dates, version numbers, and project numbers to facilitate easy sorting and filtering [49]. Create templates for documenting protocols (codebooks, data dictionaries) to ensure consistency [49]. A sketch of a mechanical naming check follows this list.

  4. Develop and Maintain the DMP: Use the DMP as the central roadmap, detailing storage, security, preservation, and sharing choices [50]. The plan should be continuously reviewed and adjusted throughout the life of the project to reflect changes in methodology or scope [49].
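
The naming convention in step 3 is easy to enforce mechanically. The sketch below validates file names against one hypothetical pattern combining date, project number, description, and version; the pattern itself is a local choice, not a standard drawn from the sources.

    import re

    # Hypothetical convention: YYYYMMDD_P<project>_<description>_v<NN>.<ext>
    PATTERN = re.compile(
        r"^(?P<date>\d{8})_P(?P<project>\d+)_(?P<desc>[a-z0-9-]+)_v(?P<ver>\d{2})\.\w+$"
    )

    def check_name(filename: str) -> bool:
        """Report whether a file name follows the project convention."""
        match = PATTERN.match(filename)
        if match:
            print(f"OK: {filename} -> {match.groupdict()}")
            return True
        print(f"Rename needed: {filename}")
        return False

    check_name("20250314_P019_survey-clean_v02.csv")   # OK
    check_name("final_data_NEW(2).xlsx")               # Rename needed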

VI. Conclusions and Recommendations

Research Data Management (RDM) is fundamentally shifting from an academic recommendation to a governance and risk management requirement. The data presented demonstrates a convergence of technical necessity (long-term preservation of digital assets) with strict compliance demands (funder mandates, ethical regulations).

Primary Conclusions:

  1. RDM as Institutional Risk Management: The requirement by major funding bodies, such as the NIH, for mandatory Data Management and Sharing Plans and institutional oversight of their execution means that RDM is directly tied to financial compliance and institutional accountability [3, 8]. Failure to establish robust RDM services, including adequate business planning and investment in advisory capabilities, constitutes a significant operational and financial risk [4].

  2. The Priority of Ethical Infrastructure: Ethical stewardship, particularly for human data, demands a complex legal and technical infrastructure that extends beyond simple encryption. Access to sensitive data is governed by the specialized architecture of Controlled Access Repositories, DAC review, and legally binding Data Use Agreements [12]. This sophisticated system is necessary to balance the inherent trade-off between maximizing data utility and minimizing re-identification risk through techniques like K-anonymity and differential privacy [22, 24].

  3. Governing Reuse through Dual Principles: Maximizing data impact requires adherence to two complementary governance frameworks. The FAIR Principles provide the technical road map for data publishing, focusing on discoverability and machine-actionability through persistent identifiers and structured metadata (e.g., DataCite) [16, 33]. Simultaneously, the CARE Principles address the ethical and political dimensions of reuse, asserting the right of Indigenous Peoples to control their data (Authority to Control) as a prerequisite for external engagement and sustainable benefit [39].

  4. Digital Preservation Mandates Open Standards: Long-term archival integrity is achieved through strategic technical decisions regarding repository trust (CoreTrustSeal certification, long-term funding plans) and file format selection [14, 29]. Preservation mandates the use of open, non-proprietary formats (e.g., CSV, PDF/A, TIFF) to ensure accessibility independent of proprietary software evolution [31, 32].

Recommendations for RDM Service Directors:

  1. Integrate Compliance and Ethics: RDM services should collaborate directly with institutional legal and ethics boards (IRB) to formally integrate Data Use Agreements (DUAs) and Data Access Committee (DAC) procedures into the sharing workflow for sensitive data.

  2. Prioritize Curation over Storage: Resources should be allocated not merely to acquiring storage, but to active data curation—specifically, training personnel to define Designated Communities, curate rich documentation (metadata and provenance), and ensure compliance with preservation format standards [15, 30].

  3. Encourage CC0 Strategy: Where ethically and legally permitted, encourage researchers to use the CC0 public domain dedication for non-sensitive data. This strategic decision aligns RDM policy with the goal of maximizing scientific impact and accountability by reducing legal friction for derivative use and collaboration [1, 47].


[1] Princeton University. (n.d.). Research data management handbook. URL: https://researchdata.princeton.edu/data-management-handbook
[2] University of Vienna. (n.d.). The research data lifecycle. URL: https://rdm.univie.ac.at/what-is-research-data-management/the-research-data-lifecycle/
[3] National Institutes of Health. (n.d.). Writing a data management and sharing (DMS) plan. URL: https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/dms/writing-dms-plan
[4] OpenAIRE. (n.d.). RDM service development checklist. URL: https://www.openaire.eu/rdm-service-development-checklist
[5] Karlsruhe Institute of Technology. (n.d.). The research data cycle. URL: https://www.rdm.kit.edu/english/researchdata_rdm_cycle.php
[6] University of Virginia Library. (2019). Research data management best practices. URL: https://library.virginia.edu/sites/default/files/rds-docs/RDM_Best_Practices_2019Final.pdf
[7] Washington University in St. Louis. (n.d.). Research data management (RDM) checklist. URL: https://beckerdms.wustl.edu/resources/research-data-management-rdm/
[8] University of Texas at Dallas. (n.d.). Data management plan. URL: https://data.utdallas.edu/data-management-plans/dmp/
[9] MIT Libraries. (n.d.). Write a data management plan. URL: https://libraries.mit.edu/data-management/plan/write/
[10] Intone. (n.d.). Data management plan checklist: Essential components. URL: https://intone.com/data-management-plan-checklist-essential-components/
[11] Scribbr. (n.d.). Ethical considerations in research: Types & examples. URL: https://www.scribbr.com/methodology/research-ethics/
[12] University at Buffalo. (n.d.). Key principles for sensitive data. URL: https://library.buffalo.edu/research/rds/education/sensitive-data.html
[13] OpenAIRE. (n.d.). Sensitive data guide. URL: https://www.openaire.eu/sensitive-data-guide
[14] National Institutes of Health. (n.d.). Selecting a data repository. URL: https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/dms/selecting-a-data-repository
[15] Smithsonian Institution Archives. (n.d.). Recommended preservation formats for electronic records. URL: https://siarchives.si.edu/what-we-do/digital-curation/recommended-preservation-formats-electronic-records
[16] DataCite. (n.d.). DataCite metadata schema kernel v3.0. URL: https://schema.datacite.org/meta/kernel-3.0/doc/DataCite-MetadataKernel_v3.0.pdf
[17] Princeton University. (n.d.). Persistent identifiers: DOIs, accession numbers, and ORCID. URL: https://researchdata.princeton.edu/research-lifecycle-guide/publishing-and-preservation/dois
[18] Data Citation Implementation Pilot (DCIP) Project. (2018). Implementing the Joint Declaration of Data Citation Principles: A practical roadmap for scholarly publishers. PLOS ONE, 13(11), e0206124. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC6244190/
[19] PLOS ONE. (n.d.). Recommended repositories for data deposit. URL: https://journals.plos.org/plosone/s/recommended-repositories
[20] National Institutes of Health. (n.d.). Data use certification (DUC) agreement. URL: https://grants.nih.gov/policy-and-compliance/policy-topics/sharing-policies/accessing-data/certification-agreement
[21] University of California, Santa Barbara. (n.d.). Anonymizing and protecting sensitive data. URL: https://rcd.ucsb.edu/resources/data-resources/anonymizing-protecting
[22] National Center for Biotechnology Information. (2024). Trade-offs in anonymizing speech data for individual-level clinical research (PMC12534620). URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC12534620/
[23] Dollinger, Z., Patzelt, J., Gatz, E., Bach, G., & Acs, A. (2022). Utility vs. risk: Anonymization points in data warehouse scenarios. 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), 1205–1214. URL: https://ieeexplore.ieee.org/document/9842582
[24] Wikipedia. (n.d.). K-anonymity. URL: https://en.wikipedia.org/wiki/K-anonymity
[25] Wikipedia. (n.d.). K-anonymity. URL: https://en.wikipedia.org/wiki/K-anonymity
[26] National Institutes of Health. (n.d.). NIH controlled-access data repositories (CADR) implementation guidebook. URL: https://grants.nih.gov/sites/default/files/flmngr/NIH-CADR-Implementation-Guidebook.pdf
[27] ELIXIR. (n.d.). Data sharing and access. URL: https://rdmkit.elixir-europe.org/sharing
[28] F1000 Research. (n.d.). How to choose a repository. URL: https://www.f1000.com/researcher_blog/how-to-choose-a-repository/
[29] CoreTrustSeal. (n.d.). CoreTrustSeal trustworthy data repositories requirements. URL: https://www.coretrustseal.org/why-certification/requirements/
[30] CoreTrustSeal. (2019). CoreTrustSeal trustworthy data repositories requirements: Extended guidance v2.0. URL: https://www.coretrustseal.org/wp-content/uploads/2019/11/2019-10-CoreTrustSeal-Extended-Guidance-v2_0.pdf
[31] Digital Preservation Coalition. (n.d.). File formats and standards. URL: https://www.dpconline.org/handbook/technical-solutions-and-tools/file-formats-and-standards
[32] MIT Libraries. (n.d.). Recommended file formats. URL: https://libraries.mit.edu/data-management/store/formats/
[33] Edet, W. (n.d.). Making FAIR real: How Codatta turns principles into practice. Medium. URL: https://medium.com/@winneredet2/making-fair-real-how-codatta-turns-principles-into-practice-f0e8cbcc7b6d
[34] Australian Research Data Commons. (n.d.). Making data FAIR. URL: https://ardc.edu.au/resource-hub/making-data-fair/
[35] National Institute of Allergy and Infectious Diseases. (2025). Understanding metadata: Key to data sharing and reuse. URL: https://www.niaid.nih.gov/research/understanding-metadata-key-data-sharing-and-reuse
[36] FAIR-IMPACT. (n.d.). FAIR assessment tools. URL: https://fair-impact.eu/fair-assessment-tools
[37] FAIR-IMPACT. (n.d.). FAIR assessment tools. URL: https://fair-impact.eu/fair-assessment-tools
[38] Wikipedia. (n.d.). CARE Principles for Indigenous Data Governance. URL: https://en.wikipedia.org/wiki/CARE_Principles_for_Indigenous_Data_Governance
[39] Ada Lovelace Institute. (n.d.). Operationalising Indigenous Data Governance: The CARE Principles. URL: https://www.adalovelaceinstitute.org/blog/care-principles-operationalising-indigenous-data-governance/
[40] Global Indigenous Data Alliance. (2019). CARE principles for indigenous data governance. URL: https://www.rd-alliance.org/wp-content/uploads/2024/03/CARE20Principles20for20Indigenous20Data20Governance_OnePagers_FINAL20Sept2006202019.pdf
[41] Cornell University. (n.d.). Metadata: The who, what, when, where, why, how of your research. URL: https://data.research.cornell.edu/data-management/storing-and-managing/metadata/
[42] DataCite. (n.d.). DataCite metadata schema and Dublin Core mapping. URL: https://schema.datacite.org/meta/kernel-4.4/doc/DataCite_DublinCore_Mapping.pdf
[43] FORCE11. (n.d.). Joint Declaration of Data Citation Principles. URL: https://force11.org/info/joint-declaration-of-data-citation-principles-final/
[44] Ghent University. (n.d.). Alternative (Persistent) identifiers. URL: https://onderzoektips.ugent.be/en/tips/00001743/
[45] Dataverse. (n.d.). Data citation standard. URL: https://dataverse.org/best-practices/data-citation
[46] Creative Commons. (n.d.). About CC licenses. URL: https://creativecommons.org/share-your-work/cclicenses/
[47] Creative Commons. (2015). Why Creative Commons uses CC0. URL: https://creativecommons.org/2015/02/25/why-creative-commons-uses-cc0/
[48] Digital Curation Centre. (n.d.). How to license research data. URL: https://www.dcc.ac.uk/guidance/how-guides/license-research-data
[49] SciNote. (n.d.). 7 steps to get started with research data management. URL: https://www.scinote.net/blog/7-steps-to-get-started-with-research-data-management/
[50] Harvard Medical School. (n.d.). Research data management checklist. URL: https://postdoc.hms.harvard.edu/file_url/202