🌌 Cardiology AI Assistant (ESC 2024)
⚡ Powered by Alibaba Qwen3-4B · ZeroGPU H200
Ask questions based on the 2024 ESC Medical Guidelines. The app uses RAG with MedCPT embeddings, multi-query expansion, CrossEncoder reranking, Qwen3-4B generation, and live evaluation metrics.
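The retrieval side of that pipeline can be sketched in miniature. This is a hypothetical illustration only: `expand_query` and `overlap_score` are simple stand-ins for the real LLM-based query expansion, MedCPT retriever, and CrossEncoder reranker, but the two-stage structure (pool candidates across expanded queries, then rerank against the original query) is the same.

```python
def expand_query(query: str) -> list[str]:
    # Stand-in for multi-query expansion: the real app would generate
    # paraphrased variants of the user question with an LLM.
    return [query, f"{query} recommendation", f"{query} guideline"]

def overlap_score(query: str, passage: str) -> float:
    # Stand-in scorer for both retrieval and reranking; the real app
    # uses MedCPT embeddings and a learned CrossEncoder here.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def retrieve_and_rerank(query: str, corpus: list[str],
                        k_retrieve: int = 4, k_final: int = 2) -> list[str]:
    # Stage 1: pool top candidates across all expanded queries.
    candidates: set[str] = set()
    for q in expand_query(query):
        ranked = sorted(corpus, key=lambda p: overlap_score(q, p), reverse=True)
        candidates.update(ranked[:k_retrieve])
    # Stage 2: rerank the pooled candidates against the original query.
    return sorted(candidates, key=lambda p: overlap_score(query, p),
                  reverse=True)[:k_final]

corpus = [
    "beta blockers in heart failure",
    "statins and cholesterol management",
    "aspirin dosing after stroke",
]
print(retrieve_and_rerank("beta blockers heart failure", corpus, k_final=1))
```

The reranked passages would then be concatenated into the context that Qwen3-4B sees at generation time.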
Example Questions
Your answer will appear here after submission.
Metrics will appear here once the answer is generated.
How each metric is computed
| Metric | Method | Interpretation |
|---|---|---|
| BERTScore F1 | Sentence-level cosine-sim F1 between answer sentences and top-60 context sentences using all-MiniLM-L6-v2 (forced CPU) | Measures how semantically similar the answer is to the source context |
| ROUGE-1 | Precision: fraction of answer unigrams that appear in the retrieved context | Are the words the model used actually in the retrieved passages? |
| ROUGE-2 | Precision: fraction of answer bigrams that appear in the retrieved context | Are the phrases the model used actually in the retrieved passages? |
| Semantic Similarity | Cosine similarity of full answer ↔ question embeddings | Does the answer embed in the same semantic space as the question? |
| Faithfulness | Fraction of answer sentences with cosine-sim ≥ 0.35 to any context sentence | Are answer claims grounded in retrieved text? |
| Answer Relevance | Cosine similarity of answer ↔ question embeddings | How directly does the answer respond to the question? |
| Context Recall | Fraction of top-60 context sentences with cosine-sim ≥ 0.35 to any answer sentence | How much of the retrieved evidence is used in the answer? |
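Faithfulness and Context Recall are mirror images of the same thresholded matching. The sketch below shows that logic with a bag-of-words cosine as a stand-in embedding; the app itself computes the cosine over all-MiniLM-L6-v2 sentence embeddings, but the 0.35-threshold bookkeeping is the same.

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    # Stand-in for the MiniLM sentence embedding used in the app.
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def faithfulness(answer_sents, context_sents, threshold=0.35):
    # Fraction of answer sentences whose best context match
    # clears the similarity threshold.
    grounded = sum(
        1 for a in answer_sents
        if max(cosine(embed(a), embed(c)) for c in context_sents) >= threshold
    )
    return grounded / len(answer_sents)

def context_recall(answer_sents, context_sents, threshold=0.35):
    # Mirror image: fraction of context sentences matched by some answer sentence.
    used = sum(
        1 for c in context_sents
        if max(cosine(embed(c), embed(a)) for a in answer_sents) >= threshold
    )
    return used / len(context_sents)
```

For example, an answer that restates one of two retrieved sentences verbatim scores 1.0 on faithfulness (every claim is grounded) but only 0.5 on context recall (half the evidence went unused).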
Why precision for ROUGE? The retrieved context is ~8,000 tokens; a correct ~60-token answer has only ~4% unigram recall against that pool — even if every word came from the context. Precision asks the right question: "Did the model use words that actually appear in the retrieved passages?"
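Concretely, ROUGE-1 precision reduces to a unigram membership check against the retrieved context. This is a simplified sketch (lowercase whitespace tokenization rather than a real tokenizer):

```python
def rouge1_precision(answer: str, context: str) -> float:
    # Fraction of answer unigrams that appear anywhere in the context.
    answer_tokens = answer.lower().split()
    context_vocab = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    hits = sum(1 for tok in answer_tokens if tok in context_vocab)
    return hits / len(answer_tokens)

context = "beta blockers reduce mortality in heart failure patients"
print(rouge1_precision("beta blockers reduce mortality", context))  # → 1.0
print(rouge1_precision("aspirin therapy", context))                 # → 0.0
```

An answer drawn entirely from the context scores 1.0 here, while recall against the same context would stay tiny no matter how good the answer is.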
All metrics are reference-free — they use the retrieved context and original query as the reference signal, so no annotated ground-truth is needed.