In one line
Maternity statistics are the numerators-over-denominators by which South Africa audits whether mothers and babies live or die; critical appraisal is the discipline of deciding whether a paper's claimed effect is real, large enough to matter, and transferable to your district hospital — the single skill that turns a guideline reader into a guideline maker.
Assessment
This objective assumes the Intermediate groundwork — read the inference toolkit and study-design hierarchy first. At Final level the task is not to define sensitivity or a confidence interval but to deconstruct a trial and defend whether it should change practice. RR/OR/NNT, sensitivity/specificity, ITT, the evidence hierarchy and multiplicity are assumed; this chapter works one level above — on the estimand, the right appraisal lens for each study type, the test characteristics that mislead, and the SA-transfer judgement.
Get the denominators exactly right. These are the indices that sound alike but are not interchangeable:
- Maternal mortality ratio (MMR) = maternal deaths per 100,000 live births — the risk per pregnancy. South Africa reports the institutional MMR (iMMR), restricted to facility deaths over facility live births (DHIS denominator).
- Maternal mortality rate (MMRate) = maternal deaths divided by the number of women of reproductive age (conventionally 15–49 years). It combines per-pregnancy risk with fertility, so it is not interchangeable with the ratio.
- A maternal death is the death of a woman while pregnant or within 42 days of termination of pregnancy, from any cause related to or aggravated by the pregnancy or its management, explicitly excluding incidental/accidental causes (the road-traffic death is not maternal). Maternal deaths split into direct (obstetric complications: haemorrhage, hypertension) and indirect (pre-existing or new disease aggravated by pregnancy: cardiac disease, HIV/TB); coincidental deaths are classified separately and are not counted as maternal.
- Perinatal mortality rate = stillbirths plus early neonatal deaths (first 7 days) per 1,000 births; SA defines a stillbirth as ≥26 weeks (or the relevant weight cut-off) showing no signs of life.
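The three indices above can be computed side by side to see why they are not interchangeable. This is a minimal sketch under stated assumptions: the counts are hypothetical (not SA data), the per-100,000 scaling of the MMRate is an assumed convention, "births" in the perinatal rate is taken as live births plus stillbirths, and the function names are illustrative only.

```python
def maternal_mortality_ratio(maternal_deaths, live_births):
    """Maternal deaths per 100,000 live births (risk per pregnancy)."""
    return maternal_deaths / live_births * 100_000


def maternal_mortality_rate(maternal_deaths, women_reproductive_age):
    """Maternal deaths per 100,000 women of reproductive age (assumed scaling)."""
    return maternal_deaths / women_reproductive_age * 100_000


def perinatal_mortality_rate(stillbirths, early_neonatal_deaths, live_births):
    """Stillbirths plus early neonatal deaths per 1,000 total births."""
    total_births = live_births + stillbirths  # assumption: total births as denominator
    return (stillbirths + early_neonatal_deaths) / total_births * 1_000


# Hypothetical district-year, for illustration only.
deaths, live_births, women_15_49 = 12, 10_000, 150_000
print(round(maternal_mortality_ratio(deaths, live_births), 1))    # 120.0
print(round(maternal_mortality_rate(deaths, women_15_49), 1))     # 8.0
print(round(perinatal_mortality_rate(200, 100, live_births), 1))  # 29.4
```

The same 12 deaths give 120 on one index and 8 on the other, which is why a figure quoted without its denominator cannot be compared with anything.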
The denominator subtleties that decide whether two numbers are comparable. Each of these is a place where two facilities, or two papers, can report different numbers from the same underlying deaths:
- The iMMR sees only what happens inside facilities: it counts facility deaths over facility live births, so a region that delivers nearly everyone in hospital (the SA norm) captures more of its deaths than one with high home birth, where community deaths stay invisible. Comparing iMMR across districts therefore compares case-mix and referral patterns as much as care quality; a tertiary unit that receives the sickest transfers will tend to post a higher iMMR than the district that sent them, which is why avoidability classification, not the raw ratio, is the unit of audit.
- Late maternal deaths (>42 days to 1 year) are excluded from the classical MMR but are real and rising as women survive the acute event into prolonged critical illness — a true ascertainment gap, not a definitional nicety.
- The pregnancy-related death (any death during pregnancy/puerperium regardless of cause) is a broader, cause-agnostic category used where cause-of-death coding is weak; do not conflate it with the maternal death, which requires the causal link.
- A near-miss (severe acute maternal morbidity: a woman who survived a life-threatening complication) enters the denominator of the mortality index = deaths / (deaths + near-misses). A falling iMMR with a rising near-miss count is good news (better rescue); a falling iMMR with a falling near-miss count may mean either genuine prevention or under-ascertainment. You cannot tell from mortality alone, which is the argument for auditing morbidity, not just death.
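The mortality index, deaths / (deaths + near-misses), can be tracked between audit periods with a few lines. A minimal sketch, assuming hypothetical facility counts; `mortality_index` is an illustrative name, not part of any audit software.

```python
def mortality_index(deaths, near_misses):
    """Deaths as a proportion of all life-threatening cases (deaths + near-misses)."""
    life_threatening = deaths + near_misses
    if life_threatening == 0:
        raise ValueError("no life-threatening cases recorded")
    return deaths / life_threatening


# Hypothetical facility, two audit periods (illustration only).
before = mortality_index(deaths=10, near_misses=40)
after = mortality_index(deaths=8, near_misses=72)
print(f"before {before:.2f}, after {after:.2f}")  # before 0.20, after 0.10
```

In this made-up example deaths fall only slightly while near-misses rise, and the index halves: more women reached the life-threatening state and survived it, which is the "better rescue" reading.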
Read the question type before the result. Diagnostic accuracy, therapeutic RCT, prognostic cohort, and economic evaluation each demand a different appraisal lens, each with its own reporting and risk-of-bias instrument. Identify the estimand — what quantity the trial actually estimates (population, endpoint, intercurrent-event handling, summary measure) — because a beautiful p-value answering the wrong question is worthless. The intercurrent-event strategy is the easily-overlooked piece: a "treatment-policy" estimand (count the outcome regardless of what happened after randomisation — the ITT spirit) answers a different clinical question from a "hypothetical" estimand (the outcome had everyone adhered), and a trial's headline can be true under one and false under the other.
| Study type | Reporting guideline | Risk-of-bias / appraisal tool | The lens that matters |
|---|---|---|---|
| Therapeutic RCT | CONSORT 2025 | Cochrane RoB 2 (randomisation, deviations, missing data, measurement, selective reporting) | Allocation concealment, ITT-by-estimand, fragility of the effect |
| Diagnostic accuracy | STARD | QUADAS-2 (patient selection, index test, reference standard, flow & timing) | Spectrum bias, an imperfect reference standard, who was verified |
| Observational (cohort/case-control) | STROBE | ROBINS-I or Newcastle–Ottawa Scale (confounding control, selection, information bias) | Residual confounding: association ≠ causation |
| Systematic review / meta-analysis | PRISMA 2020 | AMSTAR-2, plus GRADE for certainty | Heterogeneity, publication bias, were the right trials pooled |
| Prognostic model | TRIPOD | PROBAST | Development vs validation, calibration not just discrimination, overfitting |
Management
"Management" here is a reproducible appraisal sequence plus the governance of SA maternity data.
A structured appraisal — immediate → ongoing → judgement
Immediate (validity — can I believe it at all?)
| Threat | What to check |
|---|---|
| Selection bias | Randomised? Allocation concealment (the safeguard against subverting randomisation — distinct from blinding)? |
| Performance/detection bias | Blinding of participants, clinicians, outcome assessors |
| Attrition bias | Loss to follow-up; was analysis intention-to-treat (ITT) by the estimand, not per-protocol? |
| Reporting bias | Pre-registered protocol; primary outcome unchanged; CONSORT 2025 open-science items (registration, protocol/SAP, data sharing) |
Ongoing (the result itself)
- Convert relative to absolute in the patient's baseline-risk terms; demand the NNT/NNH and the 95% CI, not the point estimate alone. Worked through: an OR of 0.38 (a ~62% relative reduction) on a 4.3% baseline, with 1.6% in the treated arm, is only a ~2.7 percentage-point absolute reduction, an NNT of ~37 to prevent one event, and that same relative effect on a 0.4% baseline would be near-worthless (an NNT of roughly 400). The relative figure is constant; the clinical worth is not.
- For a composite primary endpoint, decompose it — a significant composite driven entirely by its softest component (e.g. "admission") while the hard component (death) is flat is a classic overstatement.
- For a surrogate endpoint (cervical length, biomarker), ask whether it is validated against the patient-important outcome; many are not.
- Probe fragility: the fragility index is the smallest number of patients whose event status would have to flip to render a significant result non-significant. A "positive" obstetric trial that turns null on 2–3 events is fragile — useful as a humility check, though criticised for tracking sample size and lacking an agreed threshold.
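The relative-to-absolute conversion and the fragility check can both be sketched in plain Python. This is an illustration under stated assumptions, not a validated tool: the 4.3% and 1.6% risks are the ones quoted in this chapter, the trial counts passed to `fragility_index` are hypothetical, the fragility loop follows the usual description (flip non-events to events in the arm with fewer events until a two-sided Fisher exact test loses significance at 0.05), and all function names are invented for the example.

```python
from math import comb


def arr_and_nnt(control_risk, treated_risk):
    """Absolute risk reduction and number needed to treat."""
    arr = control_risk - treated_risk
    return arr, 1 / arr


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    return sum(p for p in map(prob, range(low, high + 1)) if p <= observed * (1 + 1e-9))


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Event flips in the arm with fewer events needed to lose significance."""
    if events_a > events_b:
        events_a, n_a, events_b, n_b = events_b, n_b, events_a, n_a
    flips = 0
    while events_a < n_a and fisher_two_sided(
        events_a, n_a - events_a, events_b, n_b - events_b
    ) < alpha:
        events_a += 1  # one non-event becomes an event
        flips += 1
    return flips


arr, nnt = arr_and_nnt(0.043, 0.016)
print(f"ARR {arr:.3f}, NNT {nnt:.0f}")  # ARR 0.027, NNT 37

# Same relative effect carried to a 0.4% baseline.
relative_risk = 0.016 / 0.043
arr_low, nnt_low = arr_and_nnt(0.004, 0.004 * relative_risk)
print(f"NNT at 0.4% baseline: {nnt_low:.0f}")

# Hypothetical trial counts, not taken from any trial in this chapter.
print(fragility_index(10, 200, 25, 200))
```

A returned value of 0 means the result was not significant to begin with; a small positive value is the humility check described above.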
Judgement (does it apply to my patient?)
- External validity is the SA crux: a pre-eclampsia or PPH trial run in well-resourced settings may not transfer to a district hospital with different prevalence, comorbidity (notably HIV) and theatre access. The effect size can be real and irrelevant locally.
- Synthesise across studies with GRADE — rate certainty (high → very low) by risk of bias, inconsistency, indirectness, imprecision and publication bias — which is how the guidelines you cite were actually built.
Reading a diagnostic-accuracy paper at depth
Most O&G screening and point-of-care questions are diagnostic, not therapeutic, and the traps are different. Reading one well means doing four things:
- Refuse to live on sensitivity and specificity. They are properties of the test, in principle prevalence-independent (though they shift with disease spectrum), but they do not tell you what you actually want at the bedside: the post-test probability given this result. That needs likelihood ratios and Bayes. LR+ = sensitivity / (1 − specificity); LR− = (1 − sensitivity) / specificity. A rough literacy: LR+ >10 or LR− <0.1 is strong (rules in / rules out), 5–10 / 0.1–0.2 moderate, and anything near 1 is a test that moves the needle so little it is not worth doing. Bayes works on odds, not probabilities: pre-test odds × LR = post-test odds, with odds = p / (1 − p). The Fagan nomogram is the bedside way to go from pre-test probability through the LR to post-test probability without that arithmetic.
- Watch the prevalence trap. Predictive values (PPV/NPV) are not test properties — they swing with prevalence. A screen with 99% specificity throws mostly false positives when prevalence is 0.5% (the PPV collapses), which is exactly why a "highly specific" Down-syndrome or pre-eclampsia screen still needs a confirmatory step before anyone acts. A PPV quoted without the local prevalence is uninterpretable.
- Interrogate the reference standard. QUADAS-2 calls this out: if the "gold standard" is itself imperfect, or if only test-positives were verified (partial/differential verification bias), the accuracy is inflated. Spectrum bias is the other classic — a test validated on florid cases and healthy controls (case-control diagnostic design) looks brilliant and then disappoints in the real, ambiguous, district-clinic spectrum.
- Distinguish discrimination from calibration for any risk score. Discrimination (the c-statistic / AUC) asks whether the model ranks the sick above the well; calibration asks whether a predicted 10% risk really happens 10% of the time. A model can discriminate well (AUC 0.85) yet be badly mis-calibrated in a new population — and it is calibration that decides whether a threshold you act on is safe. This is why an imported model (fullPIERS, a fetal-growth chart, a VTE score) needs local validation, not just a good published AUC.
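The likelihood-ratio arithmetic and the prevalence trap can be made concrete in a few lines. A minimal sketch: the 99% specificity and 0.5% prevalence come from the text above, while the 90% sensitivity and the 20% comparison prevalence are assumptions chosen only for illustration, as are the function names.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Bayes on the odds scale: pre-test odds x LR = post-test odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


sens, spec = 0.90, 0.99  # sensitivity is an assumed value
lr_pos, lr_neg = likelihood_ratios(sens, spec)
print(f"LR+ {lr_pos:.0f}, LR- {lr_neg:.2f}")                  # LR+ 90, LR- 0.10
print(f"PPV at 0.5% prevalence: {ppv(sens, spec, 0.005):.2f}")  # 0.31
print(f"PPV at 20% prevalence:  {ppv(sens, spec, 0.20):.2f}")   # 0.96
print(f"Post-test probability:  {post_test_probability(0.005, lr_pos):.2f}")  # 0.31
```

The post-test probability after a positive result equals the PPV when the pre-test probability is the prevalence: the two are the same Bayes calculation, which is why a PPV quoted without its prevalence cannot be interpreted.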
Reading a meta-analysis at depth
- Heterogeneity is the first question, not the forest plot. I² quantifies the proportion of variance due to between-study differences rather than chance; a high I² (loosely, >50–75%) means the studies are answering subtly different questions and a single pooled number may be a mirage. A random-effects model assumes a distribution of true effects and is appropriate then — but it up-weights small studies, the very ones most prone to bias and small-study effects.
- Publication bias distorts the pool: small negative trials vanish, so the meta-analysis over-estimates benefit. Funnel-plot asymmetry (and tests such as Egger's) flags it; the honest reviewer reports it as a GRADE downgrade for publication bias.
- A single mega-trial often beats a meta-analysis of small ones — pooling underpowered, heterogeneous, differently-biased studies can launder noise into a confident-looking diamond. Magpie is the worked example: one large, simple, pragmatic trial settled a question a shelf of small trials and their meta-analyses could not.
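How I² falls out of the study-level estimates can be shown with a short calculation. This is a sketch under stated assumptions: it uses the standard inverse-variance fixed-effect pooled estimate, Cochran's Q, and I² = (Q − df) / Q truncated at zero; the log relative risks and variances are hypothetical, and the function names are illustrative.

```python
def pooled_fixed_effect(effects, variances):
    """Inverse-variance weighted mean of study effects (e.g. log relative risks)."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, weights


def i_squared(effects, variances):
    """I² (%) from Cochran's Q: share of variation beyond what chance would give."""
    pooled, weights = pooled_fixed_effect(effects, variances)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Hypothetical log relative risks from four small trials (illustration only).
consistent = i_squared([-0.5, -0.5, -0.5], [0.04, 0.04, 0.04])
scattered = i_squared([-0.8, 0.0, -0.1, -0.9], [0.01, 0.01, 0.01, 0.01])
print(f"consistent trials: I² = {consistent:.0f}%")  # 0%
print(f"scattered trials:  I² = {scattered:.0f}%")   # 95%
```

In the scattered set the pooled estimate sits between trials that disagree with each other, which is the situation where a single diamond is most likely to mislead.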
Governing the SA numbers
The NCCEMD Saving Mothers triennial report and the PPIP perinatal audit are South Africa's confidential-enquiry machinery; the NDoH National Integrated Maternal and Perinatal Care Guideline is the practice document built on them. Every morbidity-and-mortality meeting is applied descriptive epidemiology: compute your facility's iMMR and perinatal mortality rate, rank causes, and classify each death's avoidability (patient-, administrative- or provider-related) — the engine of quality improvement.
Beyond running it, the machinery has to be understood for where it is weak. The confidential enquiry model — anonymised, blame-free, expert assessor panels deriving system lessons rather than individual culpability — is itself the methodology, and its validity rests on complete, honest case ascertainment; under-reporting (a death never notified, a cause mis-coded) silently lowers the iMMR and is the single biggest threat to the numbers. Each death is classified on three axes that must be kept distinct: the final cause (what the woman died of), the avoidability judgement, and the level of substandard care with the point in the system where it failed — patient/community-related (delayed care-seeking, no antenatal attendance), administrative (no transport, no blood, no theatre, staff shortage), and provider/health-worker-related (failure to diagnose, wrong management, delay). That a majority of SA maternal deaths are judged potentially preventable, and that the preventable fraction sits heavily in the administrative and provider columns, is the argument that these are systems failures — the moral and managerial core of every M&M meeting. On the perinatal side, PPIP uses the modified Aberdeen / extended Wigglesworth classification to assign each death a primary obstetric cause and an avoidable factor; the persistently large "unexplained intrauterine death" category is partly a real biological gap and partly an investigation gap (no post-mortem, no placental histology), which is itself an auditable, fixable failing.
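The three-axis tally behind an M&M meeting (cause, avoidability, level of substandard care) can be sketched as a small register summary. Everything here is hypothetical: the five anonymised records, the field names and the helper functions are assumptions made for illustration and do not reflect the NCCEMD or PPIP data formats.

```python
from collections import Counter

# Hypothetical anonymised register for one facility-year (illustration only).
register = [
    {"cause": "obstetric haemorrhage", "avoidable": True, "level": "administrative"},
    {"cause": "non-pregnancy-related infection", "avoidable": False, "level": None},
    {"cause": "obstetric haemorrhage", "avoidable": True, "level": "provider"},
    {"cause": "hypertensive disorders", "avoidable": True, "level": "provider"},
    {"cause": "non-pregnancy-related infection", "avoidable": True, "level": "patient"},
]


def rank_causes(deaths):
    """Causes ordered by number of deaths, most frequent first."""
    return Counter(d["cause"] for d in deaths).most_common()


def preventable_fraction(deaths):
    """Share of deaths judged potentially preventable."""
    return sum(d["avoidable"] for d in deaths) / len(deaths)


def avoidable_by_level(deaths):
    """Where in the system the avoidable deaths were judged to have failed."""
    return Counter(d["level"] for d in deaths if d["avoidable"])


print(rank_causes(register))
print(f"{preventable_fraction(register):.0%} potentially preventable")  # 80%
print(avoidable_by_level(register))
```

Keeping the three axes as separate fields is the point of the sketch: a cause ranking, a preventable fraction and a level breakdown answer different questions and should not be collapsed into one label.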
The evidence & the controversy
The current SA picture is sobering and specific. The eighth Saving Mothers report (2020–2022) records a corrected iMMR of 126 per 100,000 live births for the triennium (up from 113.8 previously), distorted by COVID-19: the iMMR was ~30% and ~47% above the 2019 baseline of 98.8 in 2020 and 2021, then fell to 109.7 in 2022. Over the triennium the leading causes were non-pregnancy-related infections (NPRI, 29.1% — COVID-19 and HIV/TB-driven), obstetric haemorrhage (16.4%), hypertensive disorders (14.7%), medical and surgical disorders (14%) and early-pregnancy complications (7.3%). Crucially, 57.4% of deaths were assessed as potentially preventable — the moral force behind the audit. The core SA picture is an iMMR trend to interpret, a cause hierarchy to name, and a 57% preventability figure whose implications fall on systems (the "5 Hs" priorities), not just individual care. Globally the contrast sharpens the argument — the WHO/UN inter-agency estimate puts the 2023 MMR at 197 per 100,000 (≈260,000 deaths, ~70% in sub-Saharan Africa), against an SDG 3.1 target of <70 by 2030 that SA, like most of the region, will miss on current trajectory.
The interpretation of these numbers goes beyond recall. The COVID spike is a natural experiment in confounding and indirect mortality — most of the 2020–2021 excess was NPRI (COVID-19 itself plus deferred/disrupted care for HIV, TB and chronic disease), so reading the rise as a collapse in obstetric care would be wrong; the obstetric causes were comparatively stable while the indirect burden ballooned. The cause hierarchy is the appraisal argument for resource allocation: that NPRI dominates is the statistical case for HIV/TB integration into antenatal care being a maternal-survival intervention, not a parallel programme; that haemorrhage and hypertension together approach a third of deaths and are overwhelmingly preventable is the case for the cheap, protocol-driven, EML-stocked interventions (MgSO₄, oxytocin/TXA, BP control, the maternity early-warning chart) being where lives are actually saved. On the perinatal side, the SA stillbirth burden is large and partly hidden in the "unexplained" category — StatsSA recorded 15,908 stillbirths and 8,212 early neonatal deaths (24,120 perinatal deaths) in 2020 — and the appraisal point is that a big unexplained fraction is as much an investigation/ascertainment failing (missing post-mortems, placental histology and growth assessment) as a biological mystery.
The recurring appraisal controversies are methodological. Non-inferiority trials are one trap: a wide non-inferiority margin, a biased-toward-the-null ITT analysis, or high dropout can manufacture "non-inferiority" for a worse drug. Here the per-protocol analysis is the more conservative one (the reverse of superiority trials), and both should agree. Multiplicity undermines the subgroup claims that pepper obstetric papers: test enough subgroups and one turns "significant" by chance, so demand pre-specification and a formal interaction test before believing an HIV-positive or twin-pregnancy subgroup effect. Reporting reform is live: CONSORT 2025 (the first major update since 2010) added seven checklist items, revised three, deleted one and introduced a dedicated open-science section on registration, protocol/analysis-plan availability and data sharing, directly attacking outcome-switching and selective reporting, the failures that have inflated obstetric effect sizes. The defensible position holds these in tension: a single trial rarely settles practice, surrogate and composite endpoints flatter interventions, and SA practice must be anchored to local audit data and resource reality rather than imported point estimates.
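How the choice of margin alone can decide a non-inferiority verdict is easy to show numerically. A minimal sketch, assuming hypothetical event counts for an adverse outcome (higher is worse), a simple Wald 95% confidence interval for the risk difference, and the usual rule that the new treatment is declared non-inferior when the upper confidence bound lies below the margin; the names are illustrative.

```python
from math import sqrt


def risk_difference_ci(events_new, n_new, events_std, n_std, z=1.96):
    """Wald confidence interval for (risk on new) - (risk on standard)."""
    p_new, p_std = events_new / n_new, events_std / n_std
    diff = p_new - p_std
    se = sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    return diff - z * se, diff + z * se


def is_non_inferior(upper_bound, margin):
    """Non-inferior only if the worst plausible excess risk is below the margin."""
    return upper_bound < margin


# Hypothetical trial: 15% adverse outcomes on the new drug vs 12% on standard care.
lower, upper = risk_difference_ci(60, 400, 48, 400)
print(f"risk difference 95% CI: {lower:+.3f} to {upper:+.3f}")
print("margin 10 points:", is_non_inferior(upper, 0.10))  # True
print("margin  5 points:", is_non_inferior(upper, 0.05))  # False
```

The data are identical in both lines; only the margin changed, so the first appraisal question for any non-inferiority claim is who chose the margin and whether it is clinically defensible.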
Three further controversies are worth a sentence each because they recur:
- The OR-as-RR overstatement in common obstetric outcomes. Logistic regression and case-control designs report odds ratios; when the outcome is common (caesarean, GDM, pre-eclampsia at >10%), the OR exaggerates the RR, and a paper that quotes "OR 2.0" for a 30%-baseline outcome is describing a much smaller relative risk than the naïve reader assumes. Demand the absolute risks.
- "Statistically significant" vs "clinically meaningful" in huge cohorts. HAPO is the teaching case: with n=25,000 a trivially small, perfectly real glucose-outcome association reaches overwhelming significance, so the policy question (where to set a diagnostic threshold on a continuous risk) is a value judgement about acceptable trade-offs, not a fact the p-value can settle.
- Early stopping for benefit overstates effect size. Trials halted early at an interim "win" tend to over-estimate the treatment effect (random high); a result from a trial stopped early for benefit should be read with that upward bias in mind.
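The gap between an odds ratio and the relative risk it implies can be computed directly from the baseline risk. A minimal sketch: the conversion RR = OR / (1 − p0 + p0 × OR) is exact algebra for an unadjusted odds ratio with control-group risk p0, and only an approximation when applied to an adjusted OR; the function name is illustrative.

```python
def rr_from_or(odds_ratio, baseline_risk):
    """Relative risk implied by an odds ratio at a given control-group risk."""
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)


for baseline in (0.01, 0.10, 0.30):
    rr = rr_from_or(2.0, baseline)
    print(f"OR 2.0 at {baseline:.0%} baseline -> RR {rr:.2f}")
# OR 2.0 at 1% baseline -> RR 1.98
# OR 2.0 at 10% baseline -> RR 1.82
# OR 2.0 at 30% baseline -> RR 1.54
```

At a 1% baseline the OR and RR are almost the same number; at 30% the "doubling" reported as OR 2.0 is a relative risk of about 1.5, which is the overstatement the first bullet above describes.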
Landmark trials & key evidence
These studies are worth deconstructing as appraisal exercises, not merely recalling — each one shows a different way a headline number can mislead. Know them by name with their effect size and their flaw.
| Trial (year) | Question | Key finding | What it changed |
|---|---|---|---|
| Magpie (2002) | Does MgSO₄ prevent eclampsia in pre-eclampsia? | Eclampsia RR ≈0.42 (58% lower, 95% CI 40–71%) — ~11 fewer fits per 1000 women; maternal death RR 0.55; no fetal harm. A vast (n=10,141), pragmatic, ITT trial that settled a question decades of small studies could not. | Made MgSO₄ the global standard for pre-eclampsia with severe features/eclampsia prophylaxis; the SA/NDoH default. Appraisal lesson: a large simple trial with a hard endpoint beats a shelf of underpowered ones. |
| WOMAN (2017) | Does early tranexamic acid reduce death in PPH? | Death due to bleeding RR 0.81 (0.65–1.00, p=0.045); within 3 h RR 0.69 (0.52–0.91). But the composite primary endpoint (death-or-hysterectomy) was flat (RR 0.97) and hysterectomy unchanged — the win sat in one component only. | TXA 1 g IV within 3 h entered FIGO/WHO/NDoH PPH protocols. Appraisal lesson: decompose a composite — the original endpoint "failed" yet the trial is practice-changing on the component that mattered. |
| CRASH-2 (2010) | Does TXA reduce death in bleeding trauma? | All-cause mortality RR 0.91 (0.85–0.97); death due to bleeding RR 0.85 (0.76–0.96); benefit confined to early (<3 h) treatment. n=20,211. | The biological-plausibility and timing rationale that justified testing TXA in PPH (WOMAN). Appraisal lesson: a credible subgroup must be pre-specified (here, time-to-treatment) — and transferability across bleeding contexts needs its own trial. |
| CHIPS (2015) | Tight vs less-tight BP control in pregnancy hypertension? | No difference in the composite of pregnancy loss/high-level neonatal care (adjusted OR 1.02, 0.77–1.35), but less-tight control raised severe hypertension by about half in relative terms (40.6% vs 27.5%, p<0.001). | Supports treating to a diastolic ~85 mmHg target. Appraisal lesson: "no significant difference" in a primary composite is not "no effect"; a clinically vital secondary (severe hypertension) drove the recommendation. |
| ASPRE (2017) | Does aspirin 150 mg cut preterm pre-eclampsia in screen-positive women? | Preterm pre-eclampsia 1.6% vs 4.3%, OR 0.38 (0.20–0.74) — a ~62% relative reduction in a first-trimester-screened high-risk cohort. | Underpins first-trimester screen-and-treat aspirin prophylaxis. Appraisal lesson: a large relative effect on a low-baseline outcome is a small absolute one (~2.7% ARR, NNT ≈37) — and external validity hinges on reproducing the screening algorithm. |
| HAPO (2008) | Does sub-diabetic maternal glucose harm pregnancy? | Continuous, graded association of fasting/1-h/2-h glucose with macrosomia (e.g. adjusted OR ~1.38 per 1 SD), C-peptide, caesarean — with no threshold. n=25,505 blinded cohort. | Drove the IADPSG/WHO one-step GDM diagnostic cut-offs. Appraisal lesson: where risk is continuous, any diagnostic threshold is a chosen trade-off, not a biological boundary — the core of diagnostic-accuracy appraisal. |
| Term Breech Trial (2000) | Planned caesarean vs planned vaginal birth for term breech? | Composite perinatal/neonatal death-or-serious-morbidity RR 0.33 (0.19–0.56) favouring planned CS; no difference in serious maternal morbidity. | Shifted global practice toward planned CS for term breech. Appraisal lesson: a landmark RCT can be over-extrapolated — later analyses questioned external validity (case selection, intrapartum care quality, attenuation of the long-term difference), the textbook cautionary tale on generalisability. |
| ARRIVE (2018) | Induce low-risk nulliparas at 39 wk vs expectant management? | No significant fall in the perinatal composite (RR 0.80, 0.64–1.00) but lower caesarean (18.6% vs 22.2%, RR 0.84, 0.76–0.93). | Reopened the elective-39-week-induction debate. Appraisal lesson: the headline (less CS) was a secondary outcome, the primary was null, and the trial's well-resourced US setting limits transfer to a district hospital with constrained theatre and monitoring capacity — the SA external-validity question made concrete. |
Screening — appraising a screening programme, not just a test
Screening is where statistical literacy meets policy, and the right frame is the Wilson–Jungner criteria (an important condition, a recognisable latent stage, an acceptable and accurate test, an effective and available treatment that works better when applied early, an agreed policy on whom to treat, and an economically balanced, continuous programme) rather than only "is the test accurate?".
- A screen is judged on the programme's effect on the patient-important outcome, not on the test's accuracy. A test can be sensitive and specific yet the programme fail because the downstream treatment does not help, or because the harms of false positives (anxiety, invasive confirmation, over-treatment) outweigh the benefit.
- The biases that flatter every screening study must be named: lead-time bias (survival from diagnosis lengthens simply because diagnosis is earlier, with no change in the date of death — a statistical artefact, not a benefit), length-time bias (screening preferentially catches slow, indolent disease that was going to do well anyway), and at the extreme overdiagnosis (detecting disease that would never have harmed the woman). Only a randomised screening trial with a mortality or hard-morbidity endpoint escapes these — survival-from-diagnosis comparisons cannot.
- SA-specific screening realities: cervical screening runs on an HPV-DNA-first or cytology programme constrained by NHLS throughput and follow-up loss (a screen that is never acted on saves no one — programme failure, not test failure); GDM screening policy turns on the HAPO-derived thresholds and SA resource limits; pre-eclampsia first-trimester screening (the ASPRE algorithm) is powerful but its external validity hinges on reproducing the exact screening combination (maternal factors + MAP + uterine-artery Doppler + PlGF) — import the threshold without the algorithm and the test characteristics do not hold.
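Lead-time bias can be demonstrated with a toy cohort in which screening changes nothing except the date of diagnosis. The ages below are entirely hypothetical and the function name is illustrative; the sketch only shows the arithmetic of the artefact.

```python
def five_year_survival(ages_at_diagnosis, ages_at_death):
    """Proportion alive five years after diagnosis."""
    survived = [death - dx >= 5 for dx, death in zip(ages_at_diagnosis, ages_at_death)]
    return sum(survived) / len(survived)


# Hypothetical cohort: age at death is identical with or without screening.
ages_at_death = [48, 52, 55, 61]
clinical_dx = [45, 49, 51, 58]                # diagnosed on symptoms
screened_dx = [dx - 4 for dx in clinical_dx]  # same disease found 4 years earlier

print(five_year_survival(clinical_dx, ages_at_death))  # 0.0
print(five_year_survival(screened_dx, ages_at_death))  # 1.0
```

Five-year survival goes from 0% to 100% although no death was postponed by a single day, which is why survival-from-diagnosis cannot stand in for a mortality endpoint.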
Long-term, postnatal & follow-up — closing the audit loop
Statistics and appraisal do not end at the death certificate; the consultant's job is to feed the numbers back into care.
- Run the M&M meeting as a quality-improvement cycle, not a tribunal. Each maternal death and each perinatal death is classified (cause, avoidability, level of substandard care), the system lesson is extracted, a corrective action is assigned with an owner and a date, and the next meeting audits whether it happened — the audit loop is only closed when re-measurement shows change. A blame culture suppresses reporting and silently corrupts the very ascertainment the numbers depend on.
- Use the right denominator for your facility. A rising iMMR at a tertiary unit may reflect successful referral of the sickest women, not deteriorating care; benchmark against case-mix-comparable facilities and track the near-miss / mortality index alongside the death rate so a genuine improvement in rescue is not mistaken for a failure.
- Counsel families and feed back to community. Where a death or stillbirth is classified, honest disclosure and, for stillbirth, investigation (post-mortem, placental histology, growth review) both serve the family and shrink the "unexplained" category that weakens the next report.
- Recurrence and the next pregnancy are where individual statistics become counselling: a cause-specific recurrence risk (pre-eclampsia, abruption, a uterine scar) drives the next-pregnancy plan, and the appraisal habit — absolute risk in this woman's baseline terms, not an imported relative risk — is exactly what makes that counselling honest.
Worked viva — how to structure the answer
Examiners give a stem like "Here is a trial abstract: aspirin 150 mg vs placebo from 11–14 weeks in screen-positive women reduced preterm pre-eclampsia from 4.3% to 1.6%, OR 0.38 (0.20–0.74). Would you change your practice?" A high-scoring answer runs:
- Classify the study and name the lens — "This is a therapeutic RCT, so I appraise it with CONSORT/RoB-2: I want allocation concealment, blinding, ITT analysis and a pre-registered primary outcome."
- Translate the effect into absolute terms — "OR 0.38 is a ~62% relative reduction, but on a 4.3% baseline that is only a ~2.7% absolute reduction — about 37 women treated to prevent one preterm pre-eclampsia. The relative figure is impressive; the absolute one tells me the real yield."
- Probe validity and external validity — "The effect is real and large, but it was achieved in a population selected by a first-trimester screening algorithm (maternal factors, MAP, uterine-artery Doppler, PlGF). My district clinic may not reproduce that screen, so the applicable benefit could differ. I would also check the fragility and whether the primary outcome was pre-specified."
- Place it in the SA system — "Aspirin is cheap, EML-listed and safe, so even a modest absolute benefit is worth it for genuinely high-risk women; the binding constraint is identifying them, which in SA is usually clinical risk-factor screening rather than the full algorithm."
- State the decision — "Yes, I would offer aspirin to high-risk women, by clinical risk factors where the full screen is unavailable — but I would not over-claim the ASPRE absolute benefit for an unscreened population."
- Close with the appraisal principle — "One trial rarely settles practice; I would site this within the guideline synthesis (GRADE) that already informs the NDoH recommendation."
The same scaffold — classify → absolute effect → validity & transfer → SA system → decision → principle — works for any abstract.
Exam traps & red flags
- Quoting iMMR without its denominator or definition. State that it is facility deaths per 100,000 facility live births, that it is denominator-biased by case-mix and referral, and that COVID-19 inflated 2020–2021 — the context the figure is meaningless without.
- Confusing the ratio with the rate, or treating a coincidental death (assault, MVA) as maternal — it is excluded by definition.
- Reading "non-significant" as "no effect" in an underpowered SA study — almost always a power problem; check the CI width and sample size.
- Per-protocol vs ITT applied backwards. ITT is conservative for superiority; for non-inferiority it can falsely favour the new treatment, so per-protocol must corroborate.
- Swallowing a composite or surrogate endpoint whole — decompose it; an unvalidated surrogate is not a patient-important outcome.
- Believing an unplanned subgroup without pre-specification or interaction testing (multiplicity).
- Transplanting a high-resource trial result into a district hospital without weighing prevalence, HIV/TB comorbidity and theatre access — internal validity is not external validity.
- Treating the OR as the RR for a common outcome (it overstates), or quoting a relative risk reduction without the absolute figure: the commonest ways trial results mislead.
- Quoting PPV/NPV without the local prevalence — predictive values are not test properties and collapse at low prevalence; a "highly specific" screen still floods you with false positives in a rare disease.
- Living on sensitivity/specificity and never reaching the post-test probability — likelihood ratios and Bayes are what answer the bedside question.
- Praising a model's AUC while ignoring calibration — good discrimination in the development set does not mean the predicted risks are right in your population; an imported score needs local validation.
- Believing survival-from-diagnosis in a screening study — lead-time and length-time bias inflate it; only a randomised mortality-endpoint trial escapes.
- Trusting a meta-analysis with high heterogeneity or funnel-plot asymmetry — a tidy diamond can launder noise and publication bias into false confidence.
- Misclassifying HIV deaths. Distinguish HIV-related indirect maternal deaths (pregnancy aggravating HIV) from incidental HIV deaths — a real source of MMR estimation error in high-prevalence SA.
This appraisal discipline recurs across the Final — the aspirin evidence in pre-eclampsia-prevention-aspirin, antihypertensive trials in hypertension-in-pregnancy-antihypertensives, the FIGO/WOMAN-trial basis of postpartum-haemorrhage, and the diagnostic-accuracy reasoning in cervical-premalignancy-colposcopy all stand on exactly these foundations.
Evidence anchors
- Saving Mothers 2020–2022, Eighth Comprehensive Triennial Report (NCCEMD/NDoH) — corrected iMMR 126/100,000 for the triennium (98.8 in 2019, 109.7 in 2022); cause hierarchy NPRI 29.1%, OH 16.4%, HDP 14.7%, M&S 14%, early pregnancy 7.3%; 57.4% potentially preventable.
- National Integrated Maternal and Perinatal Care Guideline (NDoH) — the SA practice document built on this audit machinery.
- WHO maternal mortality fact sheet and Trends in maternal mortality 2000–2023 (WHO/UNICEF/UNFPA/World Bank/UNDESA) — global MMR 197/100,000 in 2023 (~260,000 deaths, ~70% sub-Saharan Africa).
- SDG indicator 3.1.1 metadata (UN) — formal definitions of MMR vs MMRate, maternal death (within 42 days; direct/indirect; excludes incidental causes), HIV-related indirect maternal deaths, and the <70 by 2030 target.
- Perinatal deaths in South Africa, 2020 (Statistics South Africa) — 15,908 stillbirths, 8,212 early neonatal deaths, 24,120 perinatal deaths in 2020; the scale of the SA perinatal burden behind the PPIP audit.
- CONSORT 2025 statement (PMC) — updated RCT reporting guideline; 7 new items, 3 revised, 1 deleted, new open-science section (registration, protocol/SAP, data sharing).
- QUADAS-2 — quality assessment of diagnostic accuracy studies (Ann Intern Med) — the four-domain (patient selection, index test, reference standard, flow & timing) risk-of-bias and applicability tool for diagnostic-accuracy appraisal.
- AMSTAR 2 — appraisal of systematic reviews (BMJ) — 16-domain critical-appraisal tool (7 critical) for systematic reviews of randomised and non-randomised studies.
- Fragility index — rationale, calculation and limitations (PMC) — smallest number of event changes that would render a significant result non-significant; criticised for tracking sample size and lacking an agreed interpretive threshold.
Standard appraisal definitions (RR/OR/NNT, sensitivity/specificity, ITT, GRADE, the evidence hierarchy, multiplicity) are textbook canon and are anchored as such in the Intermediate biomedical statistics chapter.
