Use free or zero-cost options when possible, e.g., OpenRouter’s free tier (https://openrouter.ai/docs/api-reference/limits), Groq (https://console.groq.com/docs/rate-limits), or your own paid API keys if you have them. For embedding models, free-tier options are available from Cohere, Voyage, Hugging Face, and others. Complete the following steps to fully develop, deploy, and evaluate your application:
- Environment and Reproducibility
○ Create a virtual environment (e.g., venv, conda).
○ List dependencies in requirements.txt (or environment.yml).
○ Provide a README.md with setup and run instructions.
○ Set fixed seeds where applicable (e.g., for deterministic chunking or evaluation sampling).
- Ingestion and Indexing
○ Parse and clean documents (handle PDF/HTML/Markdown/TXT).
○ Chunk documents (e.g., by headings or token windows with overlap).
○ Embed chunks with a free embedding model or a free-tier API.
○ Store the embedded document chunks in a local or lightweight vector database (e.g., Chroma) or, optionally, a cloud-hosted vector store (e.g., Pinecone).
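The token-window chunking step above can be sketched in pure Python. This is a minimal illustration, not a required implementation: it counts whitespace-delimited words as a rough stand-in for model tokens, and the parameter names (`window`, `overlap`) are ours, not from any particular library.

```python
def chunk_text(text, window=200, overlap=50):
    """Split text into overlapping word-window chunks.

    `window` and `overlap` are counted in whitespace-delimited words,
    a rough proxy for model tokens. Illustrative sketch only.
    """
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    words = text.split()
    chunks = []
    step = window - overlap  # advance by window minus overlap each time
    for start in range(0, len(words), step):
        piece = words[start:start + window]
        if piece:
            chunks.append(" ".join(piece))
        if start + window >= len(words):
            break  # last window already reached the end of the document
    return chunks
```

Each chunk repeats the last `overlap` words of its predecessor, so a sentence straddling a chunk boundary is still retrievable from at least one chunk. In a real pipeline you would chunk by tokens (using your embedding model's tokenizer) rather than words.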
- Retrieval and Generation (RAG)
○ To build your RAG pipeline you may use frameworks such as LangChain to handle retrieval, prompt chaining, and API calls, or implement these steps manually.
○ Implement top-k retrieval with optional re-ranking.
○ Build a prompting strategy that injects retrieved chunks (and citations/sources) into the LLM context.
○ Add basic guardrails:
■ Refuse to answer outside the corpus (“I can only answer questions about our policies”),
■ Limit output length,
■ Always cite source document IDs/titles in answers.
© 2025 Quantic Holdings, Inc. All rights reserved. 6/23/21 AI Engineering Project
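The context-injection and refusal guardrails described above can be combined in a single prompt-building function. The sketch below assumes retrieved chunks arrive as `(doc_id, title, text)` tuples from your vector-store query; the function name, tuple shape, and character budget are all illustrative choices, not part of any framework's API.

```python
def build_prompt(question, retrieved, max_context_chars=4000):
    """Assemble an LLM prompt from retrieved chunks with source citations.

    `retrieved` is a list of (doc_id, title, text) tuples, assumed to be
    the top-k results from your vector store. Returns None when nothing
    was retrieved, signalling the caller to emit the refusal message.
    """
    if not retrieved:
        return None  # guardrail: no relevant context found in the corpus
    context_parts = []
    used = 0
    for doc_id, title, text in retrieved:
        entry = f"[{doc_id}] {title}: {text}"
        if used + len(entry) > max_context_chars:
            break  # guardrail: cap the injected context size
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, reply: 'I can only answer questions about our policies.'\n"
        "Cite the source IDs in brackets (e.g., [doc1]) for every claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Returning `None` on an empty retrieval lets the application refuse before ever calling the LLM, which is cheaper and more reliable than asking the model to refuse on its own.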
- Web Application
○ Use Flask, Streamlit, or an alternative framework for the web app. LangChain is recommended for orchestration but is optional.
○ Endpoints/UI:
■ / - web chat interface with a text box for user input.
■ /chat - API endpoint that receives user questions (POST) and returns model-generated answers with citations and snippets (link to the source and show the snippet).
■ /health - returns a simple JSON status.
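A minimal Flask skeleton for the three routes above might look like the following. The `answer_question` stub stands in for your RAG pipeline and is purely illustrative; wire in your retriever and LLM call there.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def answer_question(question):
    """Placeholder for the RAG pipeline (illustrative stub only).

    Replace the body with retrieval + generation and return real
    answers, citations, and source snippets.
    """
    return {"answer": "(pending)", "citations": [], "snippets": []}

@app.route("/")
def index():
    # In a real app, render a chat page template instead of inline HTML.
    return "<form method='post' action='/chat'><input name='question'></form>"

@app.route("/chat", methods=["POST"])
def chat():
    payload = request.get_json(silent=True) or {}
    question = payload.get("question", "")
    if not question:
        return jsonify({"error": "missing 'question'"}), 400
    return jsonify(answer_question(question))

@app.route("/health")
def health():
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=False)
```

Keeping `/chat` as a JSON POST endpoint separate from the `/` UI makes it easy to smoke-test from CI with a single request, independent of the front end.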
- Deployment (Optional)
○ For production hosting, use the Render or Railway free tiers; you may alternatively use any other free-tier provider of your choice.
○ Configure environment variables (e.g., API keys, model endpoints, database settings).
○ Ensure the app is publicly accessible at a shareable URL.
○ If you do not deploy your application to production or staging, it must run and be demoed locally.
- CI/CD
○ Minimal automated testing is sufficient for this assignment (a build/run check, plus an optional smoke test).
○ Create a GitHub Actions workflow that, on push/PR:
■ Installs dependencies,
■ Runs a build/start check (e.g., python -m pip install -r requirements.txt and python -c "import app", or pytest -q if you add tests),
■ Optional: on success on main, deploy to your host (Render/Railway action or via webhook/API).
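A workflow satisfying the build/start check above could be as small as the following sketch (saved as, e.g., .github/workflows/ci.yml); the Python version and step layout are assumptions you should adapt to your project.

```yaml
# Illustrative GitHub Actions workflow sketch; adapt paths and versions.
name: ci
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: python -m pip install -r requirements.txt
      - run: python -c "import app"   # smoke check that the app module loads
```

An optional deploy job gated on the main branch (e.g., via a Render/Railway deploy hook URL stored as a repository secret) can be appended once the build job is green.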
- Evaluation of the LLM Application
○ Provide a small evaluation set of 15–30 questions covering various policy topics (PTO, security, expenses, remote work, holidays, etc.). Report:
■ Answer Quality (required):
- Groundedness: % of answers whose content is factually consistent with and fully supported by the retrieved evidence—i.e., the answer contains no information that is absent from or contradicted by the context.
- Citation Accuracy: % of answers whose listed citations correctly point to the specific passage(s) that support the information stated—i.e., the attribution is correct and not misleading.
- Exact/Partial Match (optional): % of answers that exactly or partially match a short gold answer you provide.
■ System Metrics (required):
- Latency (p50/p95) from request to answer for 10–20 queries.
■ Ablations (optional): compare retrieval k, chunk size, or prompt variants.
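The p50/p95 latency figures can be computed from a list of measured per-request timings with the standard library alone; the function name and millisecond unit here are our choices for illustration.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Compute p50/p95 from per-request latencies (milliseconds).

    Uses inclusive quantiles so small samples (10-20 queries, as the
    assignment requires) are interpolated rather than clipped.
    """
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94]}
```

Record each latency by timing the full request-to-answer round trip (e.g., with `time.perf_counter()` around the `/chat` call), then report both numbers; p95 exposes tail behavior that an average would hide.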
- Design Documentation
○ Briefly justify your design choices (embedding model, chunking strategy, retrieval k, prompt format, vector store).