Data engineering: LLM extraction pipelines over a unified entity graph

PodcastPal

A podcast discovery app built on one unified graph: live events, the host-and-guest relationship web, and the merch those people sell, each enriching the others.

Role: Sole engineer, designer, and operator
Year: 2026
Status: In Development
Live: podcastpal.io

React 19 / Vite / TypeScript
Python / FastAPI
Gemini 2.5 Flash (extraction)
PostgreSQL + PostGIS (Neon)
Lambda / EventBridge / SQS ingestion
D3.js / Leaflet
AWS CDK / Caddy on EC2

5,500+ podcast episodes ingested and graphed

2026-04 · RSS ingestion + LLM extraction pipeline backfill

Challenge

Podcast events, the web of who hosts and guests on what, and the merch those people endorse are usually three separate features. PodcastPal treats them as three views of one shared entity graph, so each enriches the others. The engineering challenge is building that graph cheaply and correctly from messy third-party RSS feeds, without a graph database and without letting per-request LLM cost balloon.

Process

Ingestion is a batch pipeline decoupled from the request path. A scheduled Lambda scrapes RSS feeds into an S3 bronze tier, and SQS workers then run Gemini 2.5 Flash extraction to build structured entities (podcasts, people, venues, events, merch). User-facing pages are served from Postgres, so no request triggers a model call. The graph itself lives in PostgreSQL with PostGIS rather than a dedicated graph database, a deliberate cost decision: at the expected scale, recursive queries and junction tables handle the relationship web at almost no added cost, whereas a dedicated graph database would carry an idle cost. A per-call cost tracker and a budget ceiling keep extraction spend bounded. On the front end, the lineage graph is visualized with D3 and events are mapped with Leaflet.
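To make the "recursive queries and junction tables" point concrete, here is a minimal sketch of a relationship walk expressed as a recursive CTE over a junction table. It is an illustration under stated assumptions, not PodcastPal's actual schema: the table names (person, podcast, appearance), the columns, and the connections helper are all hypothetical, and SQLite stands in for PostgreSQL so the example runs without a server. The WITH RECURSIVE form is the same in PostgreSQL, though parameter placeholders depend on the driver.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not PodcastPal's real tables.
SCHEMA = """
CREATE TABLE person  (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE podcast (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- Junction table: one row per (person, podcast, role) appearance.
CREATE TABLE appearance (
    person_id  INTEGER NOT NULL REFERENCES person(id),
    podcast_id INTEGER NOT NULL REFERENCES podcast(id),
    role       TEXT    NOT NULL CHECK (role IN ('host', 'guest')),
    PRIMARY KEY (person_id, podcast_id, role)
);
"""

# Walk person -> podcast -> person edges outward from one person.
# UNION (not UNION ALL) drops duplicate rows, and the hop limit bounds the walk.
REACHABLE = """
WITH RECURSIVE reach(person_id, hops) AS (
    SELECT :start, 0
    UNION
    SELECT b.person_id, reach.hops + 1
    FROM reach
    JOIN appearance AS a ON a.person_id  = reach.person_id
    JOIN appearance AS b ON b.podcast_id = a.podcast_id
    WHERE reach.hops < :max_hops
)
SELECT p.name, MIN(reach.hops) AS hops
FROM reach
JOIN person AS p ON p.id = reach.person_id
GROUP BY p.id, p.name
ORDER BY hops, p.name;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up people and shows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO person VALUES (?, ?)",
                     [(1, "Ana"), (2, "Ben"), (3, "Cy"), (4, "Dee")])
    conn.executemany("INSERT INTO podcast VALUES (?, ?)",
                     [(1, "Show A"), (2, "Show B"), (3, "Show C")])
    conn.executemany("INSERT INTO appearance VALUES (?, ?, ?)", [
        (1, 1, "host"),   # Ana hosts Show A
        (2, 1, "guest"),  # Ben guests on Show A
        (2, 2, "host"),   # Ben hosts Show B
        (3, 2, "guest"),  # Cy guests on Show B
        (4, 3, "host"),   # Dee hosts Show C and is connected to no one else
    ])
    return conn


def connections(conn: sqlite3.Connection, start_id: int, max_hops: int = 2):
    """Return (name, fewest hops) for everyone within max_hops of start_id."""
    rows = conn.execute(REACHABLE, {"start": start_id, "max_hops": max_hops})
    return [(name, hops) for name, hops in rows]


if __name__ == "__main__":
    print(connections(build_demo_db(), start_id=1))
    # [('Ana', 0), ('Ben', 1), ('Cy', 2)]
```

The hop limit is what keeps a query like this cheap on a relational store: the walk is bounded by depth rather than by graph size, which fits the "expected scale" argument above.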
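The per-call cost tracker and budget ceiling could look something like the sketch below. Everything here is an assumption for illustration: the CostTracker and BudgetExceeded names, the check-then-record flow, and especially the per-million-token prices, which are placeholders rather than real Gemini pricing. Token counts would come from whatever usage metadata the model response reports.

```python
from dataclasses import dataclass
from decimal import Decimal


class BudgetExceeded(RuntimeError):
    """Raised when the next call would push spend past the ceiling."""


@dataclass
class CostTracker:
    """Hypothetical per-call spend tracker with a hard budget ceiling."""
    ceiling_usd: Decimal
    input_price_per_mtok: Decimal   # placeholder price, USD per 1M input tokens
    output_price_per_mtok: Decimal  # placeholder price, USD per 1M output tokens
    spent_usd: Decimal = Decimal("0")
    calls: int = 0

    def cost_of(self, input_tokens: int, output_tokens: int) -> Decimal:
        """Price one call from its token counts."""
        total = (Decimal(input_tokens) * self.input_price_per_mtok
                 + Decimal(output_tokens) * self.output_price_per_mtok)
        return total / Decimal(1_000_000)

    def check(self, estimated_usd: Decimal = Decimal("0")) -> None:
        """Call before each extraction; refuses work that would breach the ceiling."""
        if self.spent_usd + estimated_usd > self.ceiling_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd} + estimated {estimated_usd} "
                f"exceeds ceiling {self.ceiling_usd}"
            )

    def record(self, input_tokens: int, output_tokens: int) -> Decimal:
        """Call after each extraction with the token counts actually used."""
        cost = self.cost_of(input_tokens, output_tokens)
        self.spent_usd += cost
        self.calls += 1
        return cost


if __name__ == "__main__":
    tracker = CostTracker(
        ceiling_usd=Decimal("0.01"),
        input_price_per_mtok=Decimal("1.00"),   # placeholder, not a real rate
        output_price_per_mtok=Decimal("2.00"),  # placeholder, not a real rate
    )
    estimate = tracker.cost_of(1000, 500)
    try:
        while True:
            tracker.check(estimate)    # stop before the call, not after
            tracker.record(1000, 500)  # pretend the model call happened here
    except BudgetExceeded:
        print(f"stopped after {tracker.calls} calls, spent ${tracker.spent_usd}")
```

In a queue-driven worker, a raised BudgetExceeded would typically leave the message unprocessed so the backlog can resume once the budget resets. Decimal is used instead of float so that many small per-call costs sum exactly.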

Security was pressure-tested before beta: the full stack went through a dual-model adversarial audit, in which two independent AI reviewers worked the same brief in isolation and their findings were reconciled into a fix queue. Authorization, JWT validation, and SQL parameterization were confirmed clean, and the review was fully closed out.

Result

PodcastPal is deployed to production at podcastpal.io with over 5,500 episodes ingested and graphed. It is pre-launch and not yet open to customers, and an entity-resolution pass to improve graph data quality is designed and partly underway. The standout is the discipline: one unified graph as the product spine, built cost-consciously and audited by two independent AI models before a single customer sees it.