gallery-organizer-web/specs/001-archive-curator/spec.md
Danilo Reyes 070a3633d8 init
2026-02-07 06:01:29 -06:00

Feature Specification: Archive Curator

Feature Branch: 001-archive-curator
Created: 2026-02-07
Status: Draft
Input: User description: "Build a web-based curator for a large local media archive generated by automated scraping, where each subdirectory represents one scraped user. The goal is to give a human fast, visual, and safe control over what stays, what goes, and what should never be downloaded again.

WHY THIS EXISTS
The archive has grown beyond what can be managed with a file manager. Many users are junk, duplicates, or low-value, while others must be preserved permanently. Deleting blindly is dangerous, and deleting a user without stopping future downloads causes the archive to refill itself. This system exists to make irreversible decisions deliberate, visible, and traceable.

CORE IDEAS
  • A “user directory” is the atomic unit of decision.
  • Each user directory must have an explicit state:
      * Untagged: not yet reviewed.
      * Whitelisted: the user is valuable and protected.
      * Blacklisted: the user is safe to delete.
      * Kept: explicitly preserved and removed from the decision pool.
  • The current state must always be visible and must be the single source of truth.

NON-NEGOTIABLE SAFETY
  • Whitelisted user directories must never be deletable by directory-level actions.
  • All destructive actions must: 1) show a preview of what will change, 2) require explicit human confirmation, 3) leave a permanent audit record.
  • The system must refuse to act outside explicitly configured root paths.
  • A global read-only mode must exist where nothing can be modified.
  • Destructive operations must be serialized and never run concurrently.
  • Symlinks must never be followed during deletion.

REQUIRED WORKFLOWS

MODE 1: WHITELIST MEDIA TRIAGE
Purpose: reclaim disk space without risking loss of important users.
Behavior:
  • Show one image or video at a time belonging only to whitelisted users.
  • Allow selection scope: all whitelisted users (random), a specific whitelisted user.
  • Provide two viewing orders: random, largest files first.
  • Actions per item: Keep (no change), Delete this file (confirmation required).
  • Always display: owning user, file size, media type, relative path.
  • Automatically advance after each decision.
Success condition: I can rapidly delete low-value files while being confident I cannot delete an entire whitelisted user.

MODE 2: UNTAGGED DIRECTORY COLLAGE REVIEW
Purpose: decide whether an entire user is worth keeping.
Behavior:
  • Select one untagged user directory at a time.
  • Display: directory name, total size, file count, a collage of randomly sampled files from that directory.
  • Provide a “resample” action to view a different random subset.
  • Decisions:
      Keep:
      * mark or move the directory as preserved,
      * remove it from the untagged pool,
      * record the decision.
      Delete:
      * before deletion, attempt to find the user in a plain-text download list file,
      * preview whether the user exists and what lines would be removed,
      * allow choosing whether to remove the user from the list file,
      * require a high-friction confirmation before deleting the directory,
      * perform selected actions and record them.

DOWNLOAD LIST FILE HANDLING
  • The download list file controls future scraping and must be treated as critical data.
  • User removal must be conservative and explicit.
  • Default behavior is exact-match removal only.
  • All edits must be previewed and performed atomically.
  • If no entry is found, this must be clearly shown and handled safely.

AUDITING AND TRACEABILITY
  • Every mutation must create an append-only audit entry.
  • Audit entries must capture: what action occurred, what paths were affected, whether the download list file was edited, when it happened.
  • Audit history must be viewable for verification.

CONFIGURATION BEHAVIOR
The system must allow configuration of:
  • untagged pool root,
  • whitelisted root,
  • kept root,
  • trash or staging area for deletions,
  • download list file path,
  • read-only mode,
  • whether deletions are staged or permanent by default.

OUT OF SCOPE
  • Automatic or unattended deletion.
  • Machine-generated keep/delete decisions.
  • Silent bulk operations.
  • Integration with the scraper beyond editing the download list file.

DONE MEANS
  • I can safely delete individual files from whitelisted users.
  • I can quickly decide keep or delete for untagged users using visual samples.
  • Deleting a user can also stop future downloads in a controlled, previewed way.
  • Whitelist protection is enforced and cannot be bypassed.
  • Every destructive action is previewed, confirmed, and auditable."

Clarifications

Session 2026-02-07

  • Q: What is the default list-file matching rule? → A: Case-insensitive match after trimming surrounding whitespace.
  • Q: What is the access control model? → A: No authentication; access limited to trusted local network.
  • Q: What is the audit retention policy? → A: Retain audit history indefinitely.
  • Q: What does “Kept” mean operationally? → A: Move to the kept root.
  • Q: How should multiple list-file matches be handled? → A: List all matches and allow selective removal.
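The matching rule and the multiple-match answer above can be illustrated with a minimal sketch. It assumes the list file has already been read into a list of lines; the names `normalize` and `find_matches` are hypothetical and not part of the specification, and `casefold()` is one possible reading of "case-insensitive".

```python
from typing import List, Tuple


def normalize(entry: str) -> str:
    """Trim surrounding whitespace and fold case for comparison."""
    return entry.strip().casefold()


def find_matches(lines: List[str], user: str) -> List[Tuple[int, str]]:
    """Return (1-based line number, original line) for every whole-line match."""
    target = normalize(user)
    if not target:
        return []  # an empty name must never match blank lines
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if normalize(line) == target
    ]


# Illustrative data only: two lines match "ALICE", "alice_2" does not.
lines = ["alice\n", "  Alice  \n", "alice_2\n", "BOB\n"]
matches = find_matches(lines, "ALICE")
```

Returning line numbers together with the untouched original text lets a preview show the exact lines and lets the curator pick which of several matches to remove.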

User Scenarios & Testing (mandatory)

User Story 1 - Untagged Directory Decisions (Priority: P1)

As a curator, I review one untagged user at a time using a visual collage so I can quickly decide to keep or delete the entire user directory.

Why this priority: This is the primary workflow to shrink the archive and make high-impact decisions safely.

Independent Test: Can be fully tested by selecting an untagged directory, making keep and delete decisions, and verifying that the directory leaves the untagged pool with an audit entry and correct list-file handling.

Acceptance Scenarios:

  1. Given an untagged directory with media, When I open it, Then I see the directory name, total size, file count, and a collage of random samples.
  2. Given an untagged directory, When I resample, Then the collage changes while staying within the same directory.
  3. Given an untagged directory, When I choose Keep and confirm, Then the directory is moved to the kept root, removed from the untagged pool, and the decision is recorded.
  4. Given an untagged directory, When I choose Delete, Then I see a preview of the directory deletion and any matching download list entries before I can confirm.
  5. Given a delete decision, When I decline list-file removal, Then the directory deletion proceeds without modifying the list file and the outcome is audited.
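Scenarios 1 and 2 (collage and resample) could be backed by something like the following sketch. The set of media suffixes, the default sample size of 12, and the function names are assumptions made for illustration; the specification does not define them.

```python
import os
import random
import tempfile
from pathlib import Path
from typing import List, Optional

# Assumed set of supported media types; the spec does not enumerate them.
MEDIA_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".mp4", ".webm"}


def list_media(directory: Path) -> List[Path]:
    """Collect media files under one user directory without following symlinks."""
    found = []
    for root, _dirs, files in os.walk(directory):  # followlinks defaults to False
        for name in files:
            path = Path(root) / name
            if not path.is_symlink() and path.suffix.lower() in MEDIA_SUFFIXES:
                found.append(path)
    return sorted(found)


def sample_collage(directory: Path, count: int = 12,
                   rng: Optional[random.Random] = None) -> List[Path]:
    """Pick up to `count` random media files; calling again acts as a resample."""
    rng = rng or random.Random()
    media = list_media(directory)
    return rng.sample(media, min(count, len(media)))


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    user_dir = Path(tmp) / "some_user"
    (user_dir / "nested").mkdir(parents=True)
    for name in ["a.jpg", "b.PNG", "nested/c.mp4", "notes.txt"]:
        (user_dir / name).write_bytes(b"x")
    media = list_media(user_dir)
    collage = sample_collage(user_dir, count=2, rng=random.Random(1))
    everything = sample_collage(user_dir, count=10)
```

Because every sample is drawn from `list_media(directory)`, a resample cannot leave the directory under review, and a directory with fewer files than the requested count simply yields all of them.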

User Story 2 - Whitelisted Media Triage (Priority: P2)

As a curator, I triage individual files inside whitelisted user directories so I can reclaim disk space without risking loss of the entire user.

Why this priority: It enables safe space recovery while preserving valuable users.

Independent Test: Can be tested by selecting whitelisted users, viewing items in random and size-prioritized order, deleting individual files with confirmation, and verifying the parent directory remains intact.

Acceptance Scenarios:

  1. Given whitelisted users exist, When I start triage, Then I see one item at a time with owning user, file size, media type, and relative path.
  2. Given triage mode, When I switch ordering, Then items are presented in random order or largest-first order as chosen.
  3. Given a whitelisted file, When I choose Delete and confirm, Then only the file is removed and the parent directory remains protected.
  4. Given triage mode, When I choose Keep, Then no mutation occurs and the next item is shown automatically.
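The scope and ordering choices in these scenarios can be sketched as two small pure functions. The `MediaItem` fields mirror what the triage view must display; the function names and the mode strings `"random"` and `"largest"` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MediaItem:
    user: str            # owning whitelisted user
    relative_path: str   # path relative to the user directory
    size_bytes: int
    media_type: str      # e.g. "image" or "video"


def select_scope(items: List[MediaItem], user: Optional[str] = None) -> List[MediaItem]:
    """All whitelisted users when `user` is None, otherwise one specific user."""
    return [item for item in items if user is None or item.user == user]


def order_items(items: List[MediaItem], mode: str,
                rng: Optional[random.Random] = None) -> List[MediaItem]:
    """Return a new list in random or largest-first order."""
    if mode == "largest":
        return sorted(items, key=lambda item: item.size_bytes, reverse=True)
    if mode == "random":
        shuffled = list(items)
        (rng or random.Random()).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown ordering mode: {mode!r}")


# Illustrative data only.
items = [
    MediaItem("alice", "a/1.jpg", 500, "image"),
    MediaItem("bob", "b/clip.mp4", 900, "video"),
    MediaItem("alice", "a/2.png", 10, "image"),
]
largest_first = order_items(items, "largest")
shuffled = order_items(select_scope(items, "alice"), "random", random.Random(0))
```

Neither function touches the filesystem, so switching order or choosing Keep involves no mutation.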

User Story 3 - Audit Visibility and Safe Configuration (Priority: P3)

As a curator, I can view recent audit history and configure required paths and safety modes so I can verify past actions and keep operations within safe boundaries.

Why this priority: It provides traceability and enforces safety constraints for all other workflows.

Independent Test: Can be tested by configuring roots and safety modes, attempting out-of-bounds actions, and verifying audit history visibility.

Acceptance Scenarios:

  1. Given a configured system, When I open the audit view, Then I can see recent mutations with action, paths, list-file edits, and timestamps.
  2. Given I enable read-only mode, When I attempt a destructive action, Then the action is blocked and I am informed no changes were made.
  3. Given configured root paths, When I attempt a destructive action outside those roots, Then the system refuses the action and records the refusal.
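Scenarios 2 and 3 both amount to a guard that runs before any mutation. A minimal sketch follows, assuming roots are passed in as paths; `RefusedAction`, `is_within_roots`, and `guard_mutation` are hypothetical names, and treating a root itself as out of bounds is an added assumption rather than a stated requirement.

```python
import tempfile
from pathlib import Path
from typing import Iterable


class RefusedAction(Exception):
    """Raised when a mutation is blocked; the caller is expected to audit it."""


def is_within_roots(target: Path, roots: Iterable[Path]) -> bool:
    """True only if the resolved target lies strictly inside a configured root."""
    resolved = target.resolve()
    # Assumption: a root itself is not a valid target, only paths beneath it.
    return any(root.resolve() in resolved.parents for root in roots)


def guard_mutation(target: Path, roots: Iterable[Path], read_only: bool) -> None:
    """Raise RefusedAction instead of letting an unsafe mutation proceed."""
    if read_only:
        raise RefusedAction("read-only mode is enabled; no changes were made")
    if not is_within_roots(target, roots):
        raise RefusedAction(f"{target} is outside the configured roots")


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "untagged"
    (root / "user_a").mkdir(parents=True)
    inside_ok = is_within_roots(root / "user_a", [root])
    outside_ok = is_within_roots(Path(tmp) / "elsewhere", [root])
    root_itself_ok = is_within_roots(root, [root])
    escape_ok = is_within_roots(root / "user_a" / ".." / ".." / "elsewhere", [root])
```

Resolving before comparing means `..` segments, and symlinks that point outside a root, are judged by where they actually lead.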

Edge Cases

  • What happens when an untagged directory is empty or contains only unsupported media?
  • How does the system handle missing or unreadable download list files?
  • What happens if a directory or file is a symlink during a delete action?
  • How does the system behave if read-only mode is enabled mid-session?
  • What happens when a directory is moved or deleted outside the tool while being viewed?

Scope & Non-Goals

In scope: human-driven review, safe deletion workflows, list-file preview/removal, and audit visibility for verification.
Out of scope: automatic or unattended deletion, machine-generated decisions, silent bulk operations, and integration with the scraper beyond list-file edits.

Requirements (mandatory)

Functional Requirements

  • FR-001: System MUST treat each user directory as the atomic unit of decision.
  • FR-002: System MUST maintain a single visible state per user directory: Untagged, Whitelisted, Blacklisted, or Kept.
  • FR-003: System MUST provide an untagged directory review view with a random-sample collage, directory name, total size, and file count.
  • FR-004: System MUST allow resampling the collage without changing the directory.
  • FR-005: System MUST allow Keep and Delete decisions for untagged directories and record the decision outcome.
  • FR-005a: Keep decisions MUST move the directory to the kept root and remove it from the untagged pool.
  • FR-006: System MUST attempt to locate matching entries in the download list file before deleting a directory and present a preview of potential removals.
  • FR-007: System MUST allow the user to choose whether to remove matching list-file entries, independently of directory deletion.
  • FR-007a: If multiple list-file matches are found, the system MUST list all matches and allow selective removal.
  • FR-008: System MUST provide whitelisted media triage that shows one item at a time with owning user, file size, media type, and relative path.
  • FR-009: System MUST allow triage scope to be all whitelisted users or a specific whitelisted user.
  • FR-010: System MUST support random and largest-first ordering in triage mode.
  • FR-011: System MUST auto-advance to the next item after Keep or Delete actions.
  • FR-012: System MUST provide a view of recent audit history.
  • FR-013: System MUST allow configuration of untagged root, whitelisted root, kept root, trash/staging area, download list file path, read-only mode, and deletion staging behavior.
  • FR-014: System MUST clearly state when no matching download list entry is found and proceed safely without list-file edits.
  • FR-015: System MUST match download list entries using case-insensitive comparison after trimming surrounding whitespace.
  • FR-016: System MUST operate without authentication and assume access is limited to a trusted local network.
  • FR-017: System MUST retain audit history indefinitely.
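The list-file requirements (FR-006 to FR-007a), combined with the atomic-edit requirement from the input description, might look like the sketch below. It assumes the curator has already selected 1-based line numbers from a preview; the function names are hypothetical, and the temp-file-plus-`os.replace` approach is one common way to get an atomic rewrite, not something the specification prescribes.

```python
import os
import tempfile
from pathlib import Path
from typing import List, Set


def read_lines(list_file: Path) -> List[str]:
    """Read lines with their original line endings preserved."""
    with open(list_file, encoding="utf-8", newline="") as handle:
        return handle.readlines()


def preview_removal(lines: List[str], selected: Set[int]) -> List[str]:
    """Return the exact lines (by 1-based number) that would be removed."""
    return [line for number, line in enumerate(lines, start=1) if number in selected]


def remove_lines_atomically(list_file: Path, selected: Set[int]) -> List[str]:
    """Rewrite the list file without the selected lines; return what was removed."""
    lines = read_lines(list_file)
    removed = preview_removal(lines, selected)
    kept = [line for number, line in enumerate(lines, start=1) if number not in selected]
    # Write to a temp file in the same directory, then swap it into place.
    # Note: mkstemp creates the file with restrictive permissions.
    fd, tmp_name = tempfile.mkstemp(dir=list_file.parent, prefix=list_file.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as tmp:
            tmp.writelines(kept)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, list_file)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return removed


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    list_file = Path(tmp_dir) / "download-list.txt"
    list_file.write_text("alice\nAlice\nbob\n", encoding="utf-8")
    shown = preview_removal(read_lines(list_file), {2})
    removed = remove_lines_atomically(list_file, {2})
    remaining = list_file.read_text(encoding="utf-8")
    leftovers = sorted(p.name for p in Path(tmp_dir).iterdir())
```

Passing an empty selection leaves the content unchanged, which covers the case where the curator declines list-file removal or no match exists.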

Safety & Data Preservation Requirements (mandatory for destructive actions)

  • SR-001: System MUST provide a dry-run preview for destructive actions.
  • SR-002: System MUST require explicit confirmation before destructive actions.
  • SR-003: System MUST append an audit record for every mutation.
  • SR-004: System MUST refuse to act outside configured root paths.
  • SR-005: System MUST NOT follow symlinks for destructive actions.
  • SR-006: System MUST provide a global read-only mode that disables mutations.
  • SR-007: System MUST default to two-stage deletion (trash/staging) unless permanent deletion is explicitly configured.
  • SR-008: System MUST serialize destructive operations and disallow concurrent deletes.
  • SR-009: Whitelisted directories MUST never be deletable by directory-level actions.
  • SR-010: List-file edits MUST be previewed and performed atomically. By default, removal targets only whole-line matches (no substring or fuzzy matching), compared case-insensitively after trimming surrounding whitespace.
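SR-005, SR-007, and SR-008 can be combined in one staged-deletion helper, sketched here under explicit assumptions: the trash area sits on the same filesystem as the roots (so `os.rename` suffices), the tool runs as a single process (so an in-process lock is enough to serialize), and the names `stage_for_deletion` and `RefusedAction` are hypothetical.

```python
import os
import tempfile
import threading
import time
from pathlib import Path

# Assumption: one server process, so a process-wide lock serializes deletes.
_destructive_lock = threading.Lock()


class RefusedAction(Exception):
    """Raised when a destructive action is refused."""


def stage_for_deletion(target: Path, trash_root: Path) -> Path:
    """Move `target` into the trash area instead of deleting it outright."""
    with _destructive_lock:  # only one destructive operation at a time
        if target.is_symlink():
            raise RefusedAction(f"{target} is a symlink; refusing to act on it")
        trash_root.mkdir(parents=True, exist_ok=True)
        destination = trash_root / f"{time.time_ns()}-{target.name}"
        # rename moves the directory entry itself and never walks its contents,
        # so symlinks inside the tree are not followed. It fails across filesystems.
        os.rename(target, destination)
        return destination


# Small demonstration in a throwaway directory (needs symlink support).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    real = base / "untagged" / "user_a"
    real.mkdir(parents=True)
    (real / "pic.jpg").write_bytes(b"x")
    link = base / "untagged" / "user_link"
    link.symlink_to(real, target_is_directory=True)

    try:
        stage_for_deletion(link, base / "trash")
        symlink_refused = False
    except RefusedAction:
        symlink_refused = True

    staged = stage_for_deletion(real, base / "trash")
    staged_has_file = (staged / "pic.jpg").is_file()
    original_gone = not real.exists()
```

Root-path and read-only checks are deliberately left out of this sketch; in a real flow they would run before the lock is taken, and the outcome would be written to the audit log.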

Key Entities (include if feature involves data)

  • User Directory: Folder containing media for one scraped user, with a single explicit state.
  • Directory State: One of Untagged, Whitelisted, Blacklisted, Kept, stored as the single source of truth for decisions.
  • Media Item: An image or video file within a user directory.
  • Download List Entry: A line in the download list file representing a user to be scraped.
  • Audit Entry: Append-only record of a mutation with action, paths, list-file changes, and timestamp.
  • Configuration: The set of roots, list-file path, read-only mode, and deletion staging preference that bounds operations.
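As an illustration of how the Directory State and Audit Entry entities might be represented, the sketch below uses an enum and a JSON Lines log opened in append mode. The field names, the JSON Lines format, and the file name `audit.jsonl` are assumptions; the specification only requires that entries be append-only and carry action, paths, list-file changes, and a timestamp.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path
from typing import List


class DirectoryState(str, Enum):
    UNTAGGED = "untagged"
    WHITELISTED = "whitelisted"
    BLACKLISTED = "blacklisted"
    KEPT = "kept"


@dataclass(frozen=True)
class AuditEntry:
    action: str                 # e.g. "keep", "delete_directory", "delete_file"
    paths: List[str]            # every path the action affected
    list_file_edited: bool
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_audit(log_path: Path, entry: AuditEntry) -> None:
    """Append one entry as a JSON line; existing lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(entry)) + "\n")


def read_audit(log_path: Path) -> List[dict]:
    """Load all entries in the order they were written."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "audit.jsonl"
    append_audit(log_path, AuditEntry("keep", ["/archive/untagged/user_a"], False))
    append_audit(log_path, AuditEntry("delete_directory", ["/archive/untagged/user_b"], True))
    history = read_audit(log_path)
```

Opening the log in append mode keeps earlier entries untouched, which fits indefinite retention; rotation or compaction would need a separate decision.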

Assumptions

  • Only one human curator operates the tool at a time.
  • The download list file is plain text with one user entry per line.
  • The curated archive resides on local storage accessible to the tool.

Dependencies

  • Access to the configured root paths and download list file on local storage.

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: 90% of untagged directory decisions (keep/delete) complete in under 60 seconds after opening the directory.
  • SC-002: A curator can complete at least 50 whitelisted file triage decisions in 10 minutes without directory-level deletion risk.
  • SC-003: 100% of destructive actions show a preview, require confirmation, and create a visible audit entry.
  • SC-004: 95% of attempted list-file removals report a clear preview of the exact lines to be removed before confirmation.
  • SC-005: Read-only mode prevents 100% of mutation attempts while still allowing browsing and review.