Data engineering: LLM extraction pipelines over a unified entity graph
PodcastPal
A podcast discovery app built on one unified graph: live events, the host-and-guest relationship web, and the merch those people sell, each enriching the others.
- Role: Sole engineer, designer, and operator
- Year: 2026
- Status: In Development
- React 19 / Vite / TypeScript
- Python / FastAPI
- Gemini 2.5 Flash (extraction)
- PostgreSQL + PostGIS (Neon)
- Lambda / EventBridge / SQS ingestion
- D3.js / Leaflet
- AWS CDK / Caddy on EC2
5,500+ podcast episodes ingested and graphed
2026-04 · RSS ingestion + LLM extraction pipeline backfill
Challenge
Podcast events, the web of who hosts and guests on what, and the merch those people endorse are usually three separate features. PodcastPal treats them as three views of one shared entity graph, so each enriches the others. The engineering challenge is building that graph cheaply and correctly from messy third-party RSS feeds, without a graph database and without letting per-request LLM cost balloon.
Process
Ingestion is a batch pipeline decoupled from the request path. A scheduled Lambda scrapes RSS feeds into an S3 bronze tier, then SQS workers run Gemini 2.5 Flash extraction to build structured entities (podcasts, people, venues, events, merch). User-facing pages are served from Postgres, so there is no per-request model cost. The graph itself lives in PostgreSQL with PostGIS rather than a dedicated graph database. That is a deliberate cost decision: at the expected scale, recursive queries and junction tables handle the relationship web at almost no added cost, whereas a dedicated graph database would carry an idle cost. A per-call cost tracker and a budget ceiling keep extraction spend bounded. The lineage graph is visualized with D3, and events are mapped with Leaflet.
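As a rough illustration of the "junction tables plus recursive queries" approach, the sketch below walks a host-and-guest web with a recursive CTE. Everything in it is an assumption made for illustration: the `person`, `podcast`, and `appearance` tables, the sample rows, and the `connections` helper are hypothetical, and SQLite stands in for PostgreSQL so the example runs without a server. In Postgres the `WITH RECURSIVE` query has the same shape, though the parameter style depends on the driver.

```python
import sqlite3

# SQLite is a stand-in for PostgreSQL; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE podcast (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- Junction table: one row per (person, podcast, role) appearance.
    CREATE TABLE appearance (
        person_id  INTEGER NOT NULL REFERENCES person(id),
        podcast_id INTEGER NOT NULL REFERENCES podcast(id),
        role       TEXT NOT NULL CHECK (role IN ('host', 'guest')),
        PRIMARY KEY (person_id, podcast_id, role)
    );
    INSERT INTO person  VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy'), (4, 'Dee');
    INSERT INTO podcast VALUES (10, 'Show A'), (11, 'Show B'), (12, 'Show C');
    INSERT INTO appearance VALUES
        (1, 10, 'host'), (2, 10, 'guest'),
        (2, 11, 'host'), (3, 11, 'guest'),
        (4, 12, 'host');
""")

# Two people are linked when they appear on the same podcast.
# UNION (not UNION ALL) drops duplicate rows, and the depth bound
# guarantees the recursion terminates even though the graph has cycles.
REACHABLE = """
WITH RECURSIVE reach(person_id, depth) AS (
    SELECT :start, 0
    UNION
    SELECT b.person_id, reach.depth + 1
    FROM reach
    JOIN appearance a ON a.person_id = reach.person_id
    JOIN appearance b ON b.podcast_id = a.podcast_id
    WHERE reach.depth < :max_depth
)
SELECT p.name, MIN(reach.depth) AS hops
FROM reach
JOIN person p ON p.id = reach.person_id
GROUP BY p.id, p.name
ORDER BY hops, p.name
"""


def connections(start_id, max_depth=3):
    """Return (name, hops) for everyone within max_depth shared-podcast hops."""
    params = {"start": start_id, "max_depth": max_depth}
    return conn.execute(REACHABLE, params).fetchall()


print(connections(1))
```

With the sample rows, Ana reaches Ben through Show A and Cy through Ben's Show B, while Dee, who shares no show with anyone, is reachable only from herself. The depth bound is what keeps a query like this cheap on a plain relational database.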
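A per-call cost tracker with a budget ceiling could look something like the following minimal sketch. It is not the project's actual implementation: the `CostTracker` class, its method names, and especially the per-million-token prices are placeholders chosen for illustration, not real Gemini pricing. The idea is to check an estimate against the ceiling before each extraction call and record the actual token usage afterwards.

```python
from dataclasses import dataclass, field
from decimal import Decimal


class BudgetExceeded(RuntimeError):
    """Raised when a planned call would push spend past the ceiling."""


@dataclass
class CostTracker:
    ceiling_usd: Decimal
    # Placeholder prices per million tokens; NOT real Gemini pricing.
    input_price_per_mtok: Decimal = Decimal("0.10")
    output_price_per_mtok: Decimal = Decimal("0.40")
    spent_usd: Decimal = Decimal("0")
    calls: list = field(default_factory=list)

    def cost_of(self, input_tokens, output_tokens):
        """Price a call from its token counts."""
        total = (Decimal(input_tokens) * self.input_price_per_mtok
                 + Decimal(output_tokens) * self.output_price_per_mtok)
        return total / Decimal(1_000_000)

    def ensure_budget(self, input_tokens, max_output_tokens):
        """Call before the model request; raises if the estimate would not fit."""
        estimate = self.cost_of(input_tokens, max_output_tokens)
        if self.spent_usd + estimate > self.ceiling_usd:
            raise BudgetExceeded(
                f"estimated {estimate} USD would exceed ceiling {self.ceiling_usd} USD"
            )

    def record(self, label, input_tokens, output_tokens):
        """Call after the model request with the actual token usage."""
        cost = self.cost_of(input_tokens, output_tokens)
        self.spent_usd += cost
        self.calls.append((label, cost))
        return cost


tracker = CostTracker(ceiling_usd=Decimal("0.50"))
tracker.ensure_budget(1_000_000, 500_000)    # fits: nothing spent yet
tracker.record("episode-1", 1_000_000, 500_000)
print(tracker.spent_usd)
```

In a queue-driven pipeline, a worker that hits `BudgetExceeded` could leave the message unprocessed for a later run rather than calling the model, which is one way a ceiling keeps a backfill from overspending.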
Security was pressure-tested before any beta. The full stack went through a dual-model adversarial audit: two independent AI reviewers worked the same brief in isolation, and their findings were reconciled into a fix queue. Authorization, JWT validation, and SQL parameterization were confirmed clean, and the full review was closed out.
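For readers unfamiliar with what a SQL parameterization check looks for, here is a small sketch. The table, the `find_by_title` helper, and the hostile input are all hypothetical, and SQLite again stands in for PostgreSQL. The point is only that user input travels as a bound value, so it is never parsed as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE podcast (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO podcast (title) VALUES (?)",
    [("Show A",), ("Show B",)],
)


def find_by_title(title):
    """Look up podcasts by exact title using a bound parameter."""
    # The driver passes `title` as data; it is never spliced into the SQL text.
    return conn.execute(
        "SELECT id, title FROM podcast WHERE title = ?",
        (title,),
    ).fetchall()


# A classic injection attempt is treated as a literal (non-matching) title.
hostile = "Show A' OR '1'='1"
print(find_by_title("Show A"))
print(find_by_title(hostile))
```

Had the query been built with string formatting instead, the hostile input would have changed the `WHERE` clause and matched every row.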
Result
Deployed to production at podcastpal.io with over 5,500 episodes ingested and graphed. It is pre-launch and not yet open to customers, and an entity-resolution pass to improve graph data quality is designed and partly underway. The standout is the discipline: one unified graph as the product spine, built cost-consciously, and audited by two independent AI models before a single customer sees it.