
Debate Chatbot (RAG)

Grounded Q&A over the 2019-2020 U.S. Democratic primary debate transcripts using Pinecone retrieval and Anthropic (Claude) generation, returning citations for every answer.

What it does

This app answers questions by retrieving relevant debate transcript snippets at query time, then asking Claude to respond using only those sources. The response includes citations so the user can see exactly what the answer was grounded on.

Key features

  • Web chat UI at /
  • API endpoint at POST /chat
  • Source citations included in responses
  • One-command ingestion into Pinecone via scripts/ingest.py

Architecture

  1. Ingest transcripts into Pinecone as records with metadata (speaker/date/debate info).
  2. At query time: embed the question and retrieve top_k relevant records.
  3. Inject retrieved sources into a prompt and ask Claude to answer with citations.
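The query-time steps above can be sketched in Python. This is a minimal illustration, not the app's actual implementation: the metadata field names (`speaker`, `date`, `excerpt`) and the prompt wording are assumptions, and the Pinecone query and Anthropic call are left as comments so the sketch stays self-contained.

```python
def build_prompt(question, sources, max_context_chars=6000):
    """Format retrieved snippets as numbered sources and instruct the
    model to answer only from them, citing like [1]. Sources that would
    exceed the character budget are dropped.

    `sources` is assumed to be a list of Pinecone-style matches, each
    with a `metadata` dict (field names here are illustrative).
    """
    blocks, used = [], 0
    for i, src in enumerate(sources, start=1):
        meta = src.get("metadata", {})
        block = (f"[{i}] {meta.get('speaker', '?')} "
                 f"({meta.get('date', '?')}): {meta.get('excerpt', '')}")
        if used + len(block) > max_context_chars:
            break
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources inline like [1].\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )

# At query time (sketch, omitted here to stay dependency-free):
#   1. vector  = embed(question)                       # embedding model of choice
#   2. matches = index.query(vector=vector, top_k=8,
#                            include_metadata=True)    # Pinecone retrieval
#   3. prompt  = build_prompt(question, matches["matches"])
#   4. answer  = call Claude with `prompt` via the anthropic client
```

Keeping the prompt assembly in a pure function like this makes the context budget (`MAX_CONTEXT_CHARS`) easy to test independently of any network calls.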

API

GET /health

Returns:

{ "status": "ok" }

POST /chat

Request body:

{
  "question": "What did candidates say about Medicare for All?",
  "top_k": 8
}

Response:

  • answer: Markdown-formatted answer with inline citations like [1]
  • citations: the retrieved transcript snippets, each with metadata and an excerpt
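A minimal Python client for these endpoints could look like the sketch below. The citation field names (`speaker`, `date`, `excerpt`) are assumptions about the response shape, not confirmed by the API description above.

```python
import json
from urllib import request

def ask(question, top_k=8, base_url="http://127.0.0.1:8001"):
    """POST a question to /chat and return the parsed JSON response."""
    body = json.dumps({"question": question, "top_k": top_k}).encode()
    req = request.Request(
        f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

def render_citations(response):
    """Render the citations list as numbered lines.

    Field names (`speaker`, `date`, `excerpt`) are illustrative
    assumptions about each citation object.
    """
    lines = []
    for i, c in enumerate(response.get("citations", []), start=1):
        lines.append(f"[{i}] {c.get('speaker', '?')} ({c.get('date', '?')}): "
                     f"{c.get('excerpt', '')}")
    return "\n".join(lines)
```

Usage: `print(ask("What did candidates say about Medicare for All?")["answer"])`, then `print(render_citations(...))` to show what the answer was grounded on.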

Quickstart (local)

python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# set:
#   PINECONE_API_KEY
#   ANTHROPIC_API_KEY

# add dataset CSV (not committed in repo):
#   debate_transcripts_v3_2020-02-26.csv

python3 scripts/ingest.py
uvicorn backend.main:app --reload --host 127.0.0.1 --port 8001
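scripts/ingest.py itself is not shown here; the sketch below illustrates what record preparation for Pinecone could look like. The CSV column names (`speech`, `speaker`, `date`, `debate_name`) and the `embed` function are hypothetical placeholders, and the actual upsert call is left as a comment.

```python
from itertools import islice

def rows_to_records(rows, embed, id_prefix="debate"):
    """Turn CSV rows into Pinecone-style records: id, vector, metadata.

    `embed` is a caller-supplied embedding function (text -> list[float]);
    column names below are assumptions about the dataset, not documented.
    """
    records = []
    for i, row in enumerate(rows):
        text = row["speech"]  # hypothetical column name
        records.append({
            "id": f"{id_prefix}-{i}",
            "values": embed(text),
            "metadata": {
                "speaker": row.get("speaker", ""),
                "date": row.get("date", ""),
                "debate": row.get("debate_name", ""),
                "excerpt": text[:300],  # stored so citations can show a snippet
            },
        })
    return records

def batched(records, size=100):
    """Yield fixed-size batches; Pinecone upserts are typically batched."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch

# In scripts/ingest.py (sketch): for each batch, the upsert would be
#   index.upsert(vectors=batch, namespace=NAMESPACE)
# using the Pinecone client.
```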

Deployment

The project includes a Dockerfile and is deployed on AWS App Runner (health check path: /health).

Configuration

Minimum environment variables:

  • PINECONE_API_KEY
  • ANTHROPIC_API_KEY

Common options:

  • PINECONE_INDEX_NAME, PINECONE_NAMESPACE
  • PINECONE_INDEX_HOST (recommended: connecting to the index by host avoids a control-plane lookup)
  • TOP_K, MAX_CONTEXT_CHARS
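One way the app might load these variables is sketched below. The defaults (`debates`, `default`, `8`, `6000`) are illustrative assumptions, not documented values; only the two API keys are treated as required.

```python
import os

def load_settings(env=os.environ):
    """Read configuration from environment variables.

    Required: PINECONE_API_KEY, ANTHROPIC_API_KEY.
    All defaults below are hypothetical, for illustration only.
    """
    missing = [k for k in ("PINECONE_API_KEY", "ANTHROPIC_API_KEY")
               if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {', '.join(missing)}")
    return {
        "pinecone_api_key": env["PINECONE_API_KEY"],
        "anthropic_api_key": env["ANTHROPIC_API_KEY"],
        "index_name": env.get("PINECONE_INDEX_NAME", "debates"),
        "namespace": env.get("PINECONE_NAMESPACE", "default"),
        # Optional: connecting by host skips the control-plane lookup.
        "index_host": env.get("PINECONE_INDEX_HOST"),
        "top_k": int(env.get("TOP_K", "8")),
        "max_context_chars": int(env.get("MAX_CONTEXT_CHARS", "6000")),
    }
```

Failing fast on missing keys at startup keeps misconfiguration errors out of the request path.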
