Data Pipelines

How YouTube, time-management, migration, and CMS data become runtime artifacts.

The site keeps expensive parsing out of runtime pages. Raw exports are processed offline into compact JSON files under public/data/, and client experiences read those artifacts directly.

Pipeline overview

| Experience | Raw/source data | Processor | Runtime artifact |
| --- | --- | --- | --- |
| YouTube Scholar | public/data/Youtube_Data/ Takeout files | scripts/process-youtube-data.js | public/data/youtube-scholar.json |
| Curiosity Velocity | YouTube Takeout history/channel data | scripts/process-youtube-data.js | public/data/youtube-curiosity-velocity.json |
| Time Management | Screen-time spreadsheets under public/data/time-management/ | scripts/time-management/*.py | public/data/time-management/analysis.json, activities_data.json |
| Bird Migration | public/data/bird_migration.csv | parser utilities and committed processed data | public/data/migration_lite.json |
| Blog/ranked reviews | WordPress CMS | GraphQL fetches | Static route output |

YouTube data

Run:

```bash
npm run youtube:data
```

This command runs scripts/process-youtube-data.js. Keep raw Takeout files and generated artifacts clearly separated. When adding support for another Takeout export shape, normalize it in the script rather than branching throughout React components.
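
One way to picture the normalization step: a single function maps each supported export shape onto one flat record, so components only ever see that record. The sketch is written in Python purely for illustration; the actual processor is JavaScript. Both input shapes and every field name here (`subtitles`, `channel`, `videoTitle`, `watchedAt`) are assumptions, not the script's real schema.

```python
from typing import Any


def normalize_watch_event(raw: dict[str, Any]) -> dict[str, str]:
    """Map one raw record (either assumed shape) onto a single flat schema."""
    # Shape A (assumed): the channel is nested under a "subtitles" list.
    subtitles = raw.get("subtitles") or []
    if subtitles:
        channel = subtitles[0].get("name", "")
    else:
        # Shape B (assumed): the channel is stored as a flat "channel" field.
        channel = raw.get("channel", "")

    # Title and timestamp may live under different keys depending on the shape.
    title = raw.get("title") or raw.get("videoTitle") or ""
    watched_at = raw.get("time") or raw.get("watchedAt") or ""
    return {"title": title, "channel": channel, "watchedAt": watched_at}


def normalize_all(records: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Normalize every record so downstream code handles one schema only."""
    return [normalize_watch_event(record) for record in records]
```

Adding a third export shape then means extending `normalize_watch_event` only; the output schema, and therefore the UI code, stays unchanged.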

Time-management data

The time-management pipeline is Python-based:

```bash
python3 scripts/time-management/explore_data.py
python3 scripts/time-management/merge_activities.py
python3 scripts/time-management/analyze_time_usage.py
python3 scripts/time-management/generate_detailed_analysis.py
```

Use the generated JSON files from public/data/time-management/ in the dashboard. Do not make the dashboard parse .xlsx files at runtime.
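
To illustrate why the dashboard can stay free of spreadsheet parsing, here is a minimal sketch of the kind of offline step that could produce a summary artifact. It assumes the spreadsheet rows have already been read into dictionaries with hypothetical `activity` and `minutes` keys; the function names and output shape are illustrative and are not taken from the scripts listed above.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarize_minutes(rows):
    """Total minutes per activity, largest total first (ties sorted by name)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["activity"]] += float(row["minutes"])
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [{"activity": name, "minutes": round(total, 1)} for name, total in ordered]


def write_artifact(summary, out_path):
    """Write the summary as compact JSON, creating parent folders if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Compact separators drop the whitespace the browser would never use.
    out_path.write_text(json.dumps(summary, separators=(",", ":")), encoding="utf-8")
```

The dashboard would then fetch the resulting JSON file and render it without touching the original spreadsheet.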

Static artifact rules

  • Commit small, public, demo-safe generated JSON when it is needed by static pages.
  • Do not commit private raw exports unless the repo is explicitly meant to publish them.
  • Prefer compact arrays or pre-aggregated summaries for visualization payloads.
  • Keep schema notes in this document when adding a new generated artifact.
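
As a rough illustration of the "compact arrays" rule, the sketch below converts a list of per-record objects into a columnar payload that states the field names once. The record fields (`date`, `minutes`, `category`) and the `fields`/`rows` layout are assumptions chosen for the example, not an existing artifact schema.

```python
import json


def to_columnar(records, fields):
    """Encode a list of dicts as {"fields": [...], "rows": [[...], ...]}."""
    return {
        "fields": list(fields),
        "rows": [[record[field] for field in fields] for record in records],
    }


def from_columnar(payload):
    """Inverse of to_columnar, useful for checking the encoding is lossless."""
    fields = payload["fields"]
    return [dict(zip(fields, row)) for row in payload["rows"]]


records = [
    {"date": "2024-01-01", "minutes": 95, "category": "video"},
    {"date": "2024-01-02", "minutes": 40, "category": "reading"},
    {"date": "2024-01-03", "minutes": 120, "category": "video"},
]
compact = to_columnar(records, ["date", "minutes", "category"])

# Compare serialized sizes; the columnar form avoids repeating keys per row.
verbose_size = len(json.dumps(records, separators=(",", ":")))
compact_size = len(json.dumps(compact, separators=(",", ":")))
```

The saving grows with the number of rows, since each extra row no longer repeats the key names. If a payload is encoded this way, the field order belongs in the schema notes for that artifact.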

Adding a new data story

  1. Create a route under src/app/projects/<story>/ or src/app/blog/post/interactive/<story>/.
  2. Add processing code under scripts/<story>/.
  3. Write generated runtime artifacts under public/data/<story>/.
  4. Keep raw inputs out of UI code.
  5. Document the input format, command, and output file here.
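
The processing step in this checklist could start from a skeleton like the following. It is a sketch under stated assumptions: the input is a CSV with hypothetical `label` and `value` columns, and the paths are supplied on the command line rather than hard-coded, so nothing here reflects an existing script in the repo.

```python
import argparse
import csv
import json
from pathlib import Path


def process(rows):
    """Placeholder transform: keep only the fields the page needs."""
    return [{"label": row["label"], "value": float(row["value"])} for row in rows]


def build_artifact(input_csv, output_json):
    """Read the raw CSV, transform it, and write a compact JSON artifact."""
    with open(input_csv, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    artifact = process(rows)
    out = Path(output_json)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(artifact, separators=(",", ":")), encoding="utf-8")
    return len(artifact)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Build a data-story artifact.")
    parser.add_argument("input_csv", help="path to the raw input file")
    parser.add_argument("output_json", help="path under public/data/<story>/")
    args = parser.parse_args(argv)
    count = build_artifact(args.input_csv, args.output_json)
    print(f"wrote {count} records to {args.output_json}")

# In a real script, call main() under an `if __name__ == "__main__":` guard.
```

Keeping the transform in `process` and the I/O in `build_artifact` makes the transform easy to test without touching the filesystem.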

Privacy boundary

Personal exports can reveal browsing, watch history, routines, and preferences. Treat raw exports as private by default. A public artifact should be anonymized, aggregated, or intentionally published as part of the narrative.
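
Two of these options, aggregation and a weak form of anonymization, can be sketched in a few lines. The event shape (a `watchedAt` ISO-8601 timestamp) and the function names are assumptions for illustration. Note that a salted hash is pseudonymization rather than true anonymization: it only helps if the salt stays private, and it does not hide patterns in the data itself.

```python
import hashlib
from collections import Counter


def pseudonymize(value, salt):
    """Replace an identifier with a short salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def daily_counts(events):
    """Aggregate timestamped events into per-day totals, dropping all detail."""
    # The first 10 characters of an ISO-8601 timestamp are the YYYY-MM-DD date.
    counts = Counter(event["watchedAt"][:10] for event in events)
    return [{"date": day, "count": counts[day]} for day in sorted(counts)]
```

Per-day counts are usually the safer choice for a public artifact, because individual titles, channels, and exact times never leave the raw export.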