Archival Methodology

Comprehensive whitepaper on our preservation standards and archival processes

Last Updated: January 2026

1. Mission and Purpose

This archive exists to preserve and provide public access to documents released by the U.S. Department of Justice on January 30, 2026, under the Epstein Files Transparency Act. Our mission is to maintain these records in perpetuity, ensuring they remain accessible to researchers, journalists, legal scholars, and the general public.

We believe that public records belong to the public. The documents in this archive represent a significant historical record that must be preserved with the highest standards of archival integrity. Our methodology is designed to ensure that every document is authentic, verifiable, and accessible for generations to come.

Core Principles:

  • Authenticity: Documents are preserved exactly as released by originating agencies, with no modifications to content
  • Integrity: Cryptographic hashes make any change to a document since acquisition detectable
  • Accessibility: Multiple formats and delivery methods ensure long-term public access
  • Transparency: Our processes, standards, and any corrections are fully documented and public
  • Privacy: We respect user privacy and comply with applicable data protection regulations
  • Sustainability: The archive is designed to persist beyond any single organization or technology platform

2. Document Collection and Sourcing

The foundation of any archive is the authenticity of its source materials. We employ rigorous standards for document acquisition to ensure every file in our collection can be traced directly to its official source.

2.1 Primary Sources

All documents in this archive originate from official government releases. The primary source is the U.S. Department of Justice release of January 30, 2026, which included approximately 3.5 million pages of records. Additional documents may be sourced from:

  • Federal court PACER (Public Access to Court Electronic Records) system
  • FOIA (Freedom of Information Act) responses
  • State court public records
  • Official press releases and public statements

2.2 Acquisition Process

Step-by-Step Collection:

  1. Source Identification: Official .gov URLs are documented before any download begins
  2. SSL/TLS Verification: We verify the authenticity of the source server's TLS certificate
  3. Download with Logging: All downloads are logged with timestamp, source URL, and file size
  4. Immediate Checksum Generation: SHA-256 hash is computed within seconds of download completion
  5. Backup Creation: Multiple copies are created on geographically distributed storage
  6. Chain of Custody Documentation: Complete provenance record is created for each document
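
As a rough illustration of steps 1, 3, and 4 above, the sketch below builds one log entry for a downloaded file. It is not the archive's actual tooling: the function name, the record's field names, and the example.gov URL are all assumptions made for this example, and the download itself is left out so the snippet runs without network access.

```python
import hashlib
import json
from datetime import datetime, timezone
from urllib.parse import urlparse


def build_provenance_record(source_url: str, payload: bytes) -> dict:
    """Return a log entry for one downloaded file (field names are hypothetical)."""
    parsed = urlparse(source_url)
    host = parsed.hostname or ""
    # Step 1: only official .gov sources fetched over HTTPS are accepted.
    if parsed.scheme != "https" or not host.endswith(".gov"):
        raise ValueError(f"not an official HTTPS .gov source: {source_url}")
    return {
        "source_url": source_url,
        # Step 3: timestamp and file size are logged with the source URL.
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(payload),
        # Step 4: the SHA-256 checksum is computed right after download.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


# Placeholder bytes stand in for a downloaded file.
record = build_provenance_record("https://example.gov/release/file.pdf", b"%PDF-1.7 sample")
print(json.dumps(record, indent=2))
```

A record of this kind could serve as the first entry in a document's chain of custody, with later entries appended as copies are made.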

2.3 Source Verification Standards

We do not accept documents from unofficial sources. Every document must be traceable to an official release. When third parties claim to have documents, we verify against official sources before inclusion. Documents that cannot be verified are not included in the archive.

3. Processing and Verification

Once documents are collected, they undergo a multi-stage verification process designed to ensure authenticity and prepare them for public access while maintaining their integrity.

3.1 Format Verification

Every document undergoes format verification to ensure it is a valid file that can be properly rendered. This includes:

  • PDF structure validation using industry-standard tools
  • File header verification to confirm file type matches extension
  • Corruption detection through multiple parsing attempts
  • Embedded object analysis for security screening
  • Page count verification against source metadata
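
The file header check in the list above can be sketched in a few lines. This is a minimal example under stated assumptions: the signature table covers only three formats, the function name is made up, and it inspects only the leading bytes, so it does not replace full PDF structure validation.

```python
from pathlib import Path

# Leading "magic" bytes for a few common formats (a deliberately small table).
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".png": b"\x89PNG\r\n\x1a\n",
    ".zip": b"PK\x03\x04",
}


def header_matches_extension(filename: str, data: bytes) -> bool:
    """Return True if the file's leading bytes match what its extension implies."""
    expected = SIGNATURES.get(Path(filename).suffix.lower())
    if expected is None:
        # Unknown extension: treat as unverified rather than guessing.
        return False
    return data.startswith(expected)


print(header_matches_extension("report.pdf", b"%PDF-1.7\n"))
print(header_matches_extension("report.pdf", b"PK\x03\x04"))
```

A file that fails this check would be a candidate for the manual review described elsewhere in this document.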

3.2 Cryptographic Integrity

We use SHA-256 cryptographic hashing to create a unique "fingerprint" for every document. This 64-character hexadecimal string will change if even a single byte of the document is modified. Users can independently verify that the document they access matches the original by computing the hash themselves.

Why SHA-256?

  • Cryptographically secure: No known practical method exists for creating a different file with the same hash
  • Widely supported: Available in all major operating systems and programming languages
  • NIST approved: Part of the Secure Hash Algorithm-2 family approved by the National Institute of Standards and Technology
  • Archival standard: Used by major archives including the Library of Congress and National Archives
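
For readers who want to verify a document independently, the following sketch computes a file's SHA-256 hash with Python's standard hashlib module and compares it to a published value. The function names and the chunk size are choices made for this example, not part of the archive's tooling.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large documents need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare a local file's hash with the 64-character value listed for it."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

Calling matches_published_hash with the path of a downloaded file and the checksum shown for that document returns True only if the two are byte-for-byte identical.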

3.3 Metadata Extraction

Metadata is extracted from each document using automated tools, then verified and supplemented by manual review:

  • Automated Extraction: File creation date, modification date, author fields, subject, page count, file size
  • Manual Verification: Sender, recipient, document type, date of content, subject classification
  • OCR Processing: Text extraction for searchability, with quality score assigned
  • Redaction Mapping: Location and extent of redactions documented for each page

3.4 Cross-Reference Verification

Documents are cross-referenced against related materials and external sources to verify consistency. Email threads are matched, dates are verified against known events, and references between documents are validated.

4. Organization and ID System

A consistent, logical organization system is essential for an archive of this scale. Our Document ID system provides a unique, permanent identifier for every document that encodes key information while remaining human-readable.

4.1 Document ID Structure

Format: PREFIX-YYYY-MM-DD-TYPE-NNN

  • PREFIX: Collection or person code (e.g., MME for Mette-Marit, GM for Ghislaine Maxwell)
  • YYYY-MM-DD: Document date in ISO 8601 format
  • TYPE: Document type code (E=Email, C=Correspondence, FM=Flight Manifest, DEP=Deposition)
  • NNN: Sequential number for documents with same prefix, date, and type
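
The ID format above lends itself to mechanical validation. The sketch below is one possible parser, written under assumptions the text does not state: that prefixes and type codes are uppercase letters only, and that NNN is exactly three digits. The ID in the demo line is made up and does not refer to a real record.

```python
import re
from datetime import date
from typing import NamedTuple

# PREFIX-YYYY-MM-DD-TYPE-NNN (character classes are assumptions, see above).
ID_PATTERN = re.compile(
    r"(?P<prefix>[A-Z]+)-(?P<date>\d{4}-\d{2}-\d{2})-(?P<type>[A-Z]+)-(?P<seq>\d{3})"
)


class DocumentId(NamedTuple):
    prefix: str
    doc_date: date
    doc_type: str
    sequence: int


def parse_document_id(text: str) -> DocumentId:
    """Split a Document ID into its parts, raising ValueError if malformed."""
    match = ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid Document ID: {text!r}")
    return DocumentId(
        prefix=match["prefix"],
        # fromisoformat also rejects impossible dates such as month 13.
        doc_date=date.fromisoformat(match["date"]),
        doc_type=match["type"],
        sequence=int(match["seq"]),
    )


print(parse_document_id("GM-2000-01-01-DEP-001"))
```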

4.2 Collection Organization

Documents are organized into collections based on their primary subject or source. Each collection has a designated prefix and maintains its own sequential numbering. Collections may overlap - a document can appear in multiple collections if relevant to multiple subjects.

4.3 Metadata Schema

We follow the Dublin Core Metadata Element Set as our foundation, extended with archive-specific fields. This internationally recognized standard ensures our metadata is interoperable with other archives and research systems.

Core Metadata Fields:

  • Title / Description
  • Creator / Author
  • Date Created
  • Date Released
  • Document Type
  • Format / MIME Type
  • Language
  • Source Agency
  • Rights / Access Level
  • Identifier (Document ID)
  • Checksum (SHA-256)
  • Page Count
  • Redaction Status
  • Related Documents

5. Quality Assurance

Quality assurance is not a single step but an ongoing process integrated throughout the archival workflow. We employ multiple layers of verification to catch errors before documents become publicly accessible.

5.1 Automated Quality Checks

  • File Integrity: Hash verification before and after any file operation
  • Metadata Completeness: Required fields must be populated before publication
  • Link Validation: Internal and external links are regularly tested
  • Search Index Consistency: Search results verified against database
  • Format Rendering: Automated PDF rendering tests across browsers
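
The metadata completeness check in the list above might look like the following sketch. Which fields count as required is an assumption here, as are the snake_case key names; the source only states that required fields must be populated before publication.

```python
# Hypothetical set of required keys; the real list may differ.
REQUIRED_FIELDS = (
    "title",
    "date_released",
    "document_type",
    "source_agency",
    "identifier",
    "sha256",
    "page_count",
)


def missing_required_fields(record: dict) -> list[str]:
    """Return the required keys that are absent or empty in a metadata record."""
    return [
        name
        for name in REQUIRED_FIELDS
        if record.get(name) in (None, "", [])
    ]


# A made-up draft record with no checksum or page count yet.
draft = {
    "title": "Sample record",
    "date_released": "2026-01-30",
    "document_type": "E",
    "source_agency": "U.S. Department of Justice",
    "identifier": "GM-2000-01-01-E-001",
}
print(missing_required_fields(draft))
```

A publication step could refuse any record for which this function returns a non-empty list.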

5.2 Manual Review Process

High-priority documents undergo additional manual review. This includes verification of metadata accuracy, readability assessment, and sensitivity screening. Any document flagged during automated checks receives mandatory manual review.

5.3 Community Feedback

We maintain channels for the public to report errors, inconsistencies, or concerns about any document. All reports are investigated, and corrections are made when warranted. The correction process itself is documented and transparent.

6. Digital Preservation

Digital preservation is not simply about storage - it requires active management to ensure long-term accessibility as technologies evolve. Our preservation strategy addresses the challenges of format obsolescence, storage media degradation, and organizational continuity.

6.1 Storage Infrastructure

Redundancy Model:

  • Geographic Distribution: Copies stored in multiple physical locations
  • Media Diversity: Different storage technologies to mitigate single-point failures
  • Regular Verification: Automated integrity checks against stored checksums
  • Backup Rotation: Multiple backup generations maintained
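
The regular verification item above amounts to a fixity audit: recompute each stored file's hash and compare it with the recorded checksum. The sketch below assumes a manifest shaped as a mapping from relative path to SHA-256 hex digest; that layout and the function name are illustrative, not a description of the production system.

```python
import hashlib
import tempfile
from pathlib import Path


def audit_replica(replica_dir: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Return {relative_path: problem} for files that are missing or altered."""
    problems = {}
    for rel_path, expected in manifest.items():
        target = replica_dir / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
            continue
        # Reads the whole file at once; very large files would be streamed.
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected:
            problems[rel_path] = "checksum mismatch"
    return problems


# Demo on a throwaway directory with one intact file and one absent file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.pdf").write_bytes(b"sample bytes")
    manifest = {
        "a.pdf": hashlib.sha256(b"sample bytes").hexdigest(),
        "b.pdf": "0" * 64,
    }
    print(audit_replica(root, manifest))
```

Running the same audit against every storage location lets a damaged copy be restored from one that still matches the manifest.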

6.2 Format Strategy

Documents are preserved in multiple formats to ensure future accessibility:

  • Original Format: Preserved exactly as released for authenticity verification
  • PDF/A: ISO-standardized archival PDF format for long-term preservation
  • Plain Text: OCR-extracted text for searchability and format independence
  • Metadata Package: JSON/XML metadata with complete provenance information

6.3 Technology Migration

We monitor format obsolescence and proactively migrate documents to current standards before legacy formats become unreadable. Migration decisions are documented, and original files are always preserved alongside migrated versions.

7. Privacy and Ethics

We recognize that public records can contain sensitive information about individuals who are mentioned only tangentially. Our ethical framework balances transparency in the public interest with respect for privacy.

7.1 Respecting Official Redactions

Redactions made by releasing agencies are preserved exactly as received. We do not attempt to remove, reverse, or circumvent redactions. Redactions typically protect:

  • Victims and witnesses entitled to privacy
  • Ongoing law enforcement investigations
  • National security information
  • Grand jury proceedings
  • Minor children mentioned in documents

7.2 User Privacy

We protect the privacy of archive users. We do not track which documents individuals access, do not require user accounts, and do not use tracking cookies or analytics that could identify users. See our Data Privacy page for complete details.

7.3 Ethical Use Guidelines

We encourage responsible use of archive materials. Researchers should verify information, provide context, avoid defaming individuals based on association alone, and respect the privacy of third parties who may be mentioned in documents.

8. Transparency and Corrections

Transparency is fundamental to trust. We document our processes, acknowledge limitations, and maintain public records of any corrections or amendments to the archive.

8.1 Correction Policy

When Corrections Occur:

  1. The originating agency issues an official correction
  2. A metadata error is identified and verified
  3. A document is found to have been incorrectly attributed or dated
  4. A technical error in processing is discovered

8.2 Version History

When documents are corrected, we preserve all versions. The original is never deleted. Version history includes the date of change, nature of correction, and reason. Users can access both current and historical versions.
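
One way to model this policy is an append-only history in which a correction adds a new version and never replaces an old one. The class and field names below are hypothetical, and the IDs, dates, and hash values are placeholders. The sketch only shows the shape such a record could take, with the three recorded attributes (date of change, nature of correction, reason) taken from the paragraph above.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Version:
    sha256: str
    changed_on: date
    nature: str  # nature of the correction
    reason: str


@dataclass
class DocumentHistory:
    document_id: str
    versions: list[Version] = field(default_factory=list)

    def add_version(self, version: Version) -> None:
        # Append only: earlier versions are never removed or overwritten.
        self.versions.append(version)

    @property
    def original(self) -> Version:
        return self.versions[0]

    @property
    def current(self) -> Version:
        return self.versions[-1]


history = DocumentHistory("GM-2000-01-01-E-001")
history.add_version(Version("a" * 64, date(2026, 1, 30), "initial acquisition", "official release"))
history.add_version(Version("b" * 64, date(2026, 2, 15), "replacement file", "agency issued correction"))
print(history.original.nature, "->", history.current.nature)
```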

8.3 Public Audit Trail

Major changes to the archive - new document releases, corrections, methodology updates - are documented in a public changelog. This audit trail allows the public to monitor the archive's evolution and hold us accountable to our standards.

9. Commitment to Standards

This methodology represents our commitment to archival best practices. We continuously evaluate our processes against evolving standards and welcome feedback from the archival community, researchers, and the public.

The documents in this archive represent an important historical record. Our responsibility is to preserve them with integrity, provide access with transparency, and maintain them for future generations. We take this responsibility seriously and are committed to earning and maintaining public trust through rigorous adherence to these standards.

Standards We Follow:

  • Dublin Core Metadata Initiative (DCMI)
  • Open Archival Information System (OAIS) Reference Model
  • PDF/A (ISO 19005) for long-term archival
  • SHA-2 family (FIPS 180-4) for cryptographic verification
  • GDPR and CCPA for data privacy
