In one line

Maternity statistics are the numerators-over-denominators by which South Africa audits whether mothers and babies live or die; critical appraisal is the discipline of deciding whether a paper's claimed effect is real, large enough to matter, and transferable to your district hospital — the single skill that turns a guideline reader into a guideline maker.

Assessment

This objective assumes the Intermediate groundwork — read the inference toolkit and study-design hierarchy first. At Final level the task is not to define sensitivity or a confidence interval but to deconstruct a trial and defend whether it should change practice. RR/OR/NNT, sensitivity/specificity, ITT, the evidence hierarchy and multiplicity are assumed; this chapter works one level above — on the estimand, the right appraisal lens for each study type, the test characteristics that mislead, and the SA-transfer judgement.

Get the denominators exactly right. These are the indices that sound alike but are not interchangeable:

Maternal mortality ratio (MMR) = maternal deaths per 100,000 live births — the risk per pregnancy. South Africa reports the institutional MMR (iMMR), restricted to facility deaths over facility live births (DHIS denominator).
Maternal mortality rate (MMRate) = maternal deaths per 100,000 women of reproductive age (conventionally 15 to 49 years) in a given period. It combines per-pregnancy risk with fertility, so it is not interchangeable with the ratio.
A maternal death is the death of a woman while pregnant or within 42 days of termination of pregnancy, from any cause related to or aggravated by the pregnancy or its management, explicitly excluding incidental/accidental causes (the road-traffic death is not maternal). Maternal deaths split into direct (obstetric complications: haemorrhage, hypertension) and indirect (pre-existing or new disease aggravated by pregnancy: cardiac disease, HIV/TB). Coincidental deaths may be recorded alongside them in an enquiry, but they fall outside the definition and do not belong in the MMR numerator.
Perinatal mortality rate (PNMR) = stillbirths plus early neonatal deaths (first 7 completed days) per 1,000 total births. Keep the reporting system explicit: Stats SA civil registration defines a stillbirth as a fetus of at least 26 weeks with no signs of life after birth, while PPIP/NaPeMMCo audit tables report by birthweight categories, especially all deliveries/approximately ≥500 g and ≥1,000 g.
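The difference between these indices is easiest to see when the same deaths are pushed through each denominator. The sketch below is illustrative only: every count is hypothetical, the helper name `per` is an invention for this example, and the per-100,000 scale for the rate is the conventional one rather than something fixed by this chapter.

```python
def per(n_events: int, denominator: int, scale: int) -> float:
    """Return events per `scale` units of the denominator."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return n_events / denominator * scale


# Hypothetical one-year counts for an illustrative district (not real data).
maternal_deaths = 12
live_births = 10_000
women_15_to_49 = 250_000
stillbirths = 200
early_neonatal_deaths = 100

# Ratio: deaths over live births (risk per pregnancy).
mmr = per(maternal_deaths, live_births, 100_000)

# Rate: the same deaths over women of reproductive age (risk x fertility).
mm_rate = per(maternal_deaths, women_15_to_49, 100_000)

# PNMR: the denominator is total births, so stillbirths are added back in.
total_births = live_births + stillbirths
pnmr = per(stillbirths + early_neonatal_deaths, total_births, 1_000)

print(f"MMR    = {mmr:.1f} per 100,000 live births")
print(f"MMRate = {mm_rate:.1f} per 100,000 women aged 15-49")
print(f"PNMR   = {pnmr:.1f} per 1,000 total births")
```

The same twelve deaths give two very different figures, which is why a paper that says "maternal mortality" without naming its denominator cannot be compared with anything.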

The denominator subtleties that decide whether two numbers are comparable. Each of these is a place where two facilities, or two papers, can report different numbers from the same underlying deaths:

The iMMR is an ascertainment-dependent figure: it counts facility deaths over facility live births, so it reads higher wherever institutional delivery rates are high and lower wherever community deaths go unseen. A region that delivers nearly everyone in hospital (the SA norm) captures more of its deaths than one with high home birth. Comparing iMMR across districts therefore compares case-mix and referral patterns as much as care quality; a tertiary unit that receives the sickest transfers will usually post a higher iMMR than the district that sent them, which is why avoidability classification, not the raw ratio, is the unit of audit.
Late maternal deaths (>42 days to 1 year) are excluded from the classical MMR but are real and rising as women survive the acute event into prolonged critical illness — a true ascertainment gap, not a definitional nicety.
The pregnancy-related death (any death during pregnancy/puerperium regardless of cause) is a broader, cause-agnostic numerator definition used where cause-of-death coding is weak; do not conflate it with the maternal death, which requires the causal link.
A near-miss (severe acute maternal morbidity: a woman who survived a life-threatening complication) enters the denominator of the mortality index = deaths / (deaths + near-misses), the share of life-threatening cases that end in death. A falling iMMR with a rising near-miss count is good news (better rescue); a falling iMMR with a falling near-miss count may mean either genuine prevention or under-ascertainment. You cannot tell from mortality alone, which is the argument for auditing morbidity, not just death.
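A minimal sketch of the mortality index, using two hypothetical audit years (the counts are invented for illustration). It shows the "better rescue" pattern described above: deaths fall while near-misses rise, so the index drops.

```python
def mortality_index(deaths: int, near_misses: int) -> float:
    """Deaths as a share of all life-threatening cases (deaths + near-misses)."""
    life_threatening = deaths + near_misses
    if life_threatening == 0:
        raise ValueError("no life-threatening cases recorded")
    return deaths / life_threatening


# Hypothetical audit data: (maternal deaths, near-misses) per year.
audit = {
    "year A": (20, 80),
    "year B": (12, 108),
}

for label, (deaths, near_misses) in audit.items():
    mi = mortality_index(deaths, near_misses)
    print(f"{label}: deaths={deaths}, near-misses={near_misses}, "
          f"mortality index={mi:.0%}")
```

If both counts had fallen together, the index could stay flat or move either way, and the audit would have to ask whether cases were prevented or simply not captured.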

Read the question type before the result. Diagnostic accuracy, therapeutic RCT, prognostic cohort, and economic evaluation each demand a different appraisal lens, each with its own reporting and risk-of-bias instrument. Identify the estimand — what quantity the trial actually estimates (population, endpoint, intercurrent-event handling, summary measure) — because a beautiful p-value answering the wrong question is worthless. The intercurrent-event strategy is the easily-overlooked piece: a "treatment-policy" estimand (count the outcome regardless of what happened after randomisation — the ITT spirit) answers a different clinical question from a "hypothetical" estimand (the outcome had everyone adhered), and a trial's headline can be true under one and false under the other.

Study type	Reporting guideline	Risk-of-bias / appraisal tool	The lens that matters
Therapeutic RCT	CONSORT 2025	Cochrane RoB 2 (randomisation, deviations, missing data, measurement, selective reporting)	Allocation concealment, ITT-by-estimand, fragility of the effect
Diagnostic accuracy	STARD	QUADAS-2 (patient selection, index test, reference standard, flow & timing)	Spectrum bias, an imperfect reference standard, who was verified
Observational (cohort/case-control)	STROBE	ROBINS-I or Newcastle-Ottawa Scale (confounding control, selection, information bias)	Residual confounding: association ≠ causation
Systematic review / meta-analysis	PRISMA 2020	AMSTAR-2, plus GRADE for certainty	Heterogeneity, publication bias, were the right trials pooled
Prognostic model	TRIPOD	PROBAST	Development vs validation, calibration not just discrimination, overfitting

Management

"Management" here is a reproducible appraisal sequence plus the governance of SA maternity data.

A structured appraisal — immediate → ongoing → judgement

Immediate (validity — can I believe it at all?)

Threat	What to check
Selection bias	Randomised? Allocation concealment (the safeguard against subverting randomisation — distinct from blinding)?
Performance/detection bias	Blinding of participants, clinicians, outcome assessors
Attrition bias	Loss to follow-up; was analysis intention-to-treat (ITT) by the estimand, not per-protocol?
Reporting bias	Pre-registered protocol; primary outcome unchanged; CONSORT 2025 open-science items (registration, protocol/SAP, data sharing)

Ongoing (the result itself)

Convert relative to absolute in the patient's baseline-risk terms; demand the NNT/NNH and the 95% CI, not the point estimate alone. Worked through: event risks of 1.6% versus 4.3% give a relative risk of about 0.37, a relative risk reduction of about 63%. A reported odds ratio of 0.38 is a separate measure that approximates the relative risk only because the outcome is uncommon, so report the odds ratio and the risk reduction separately rather than treating 1 − OR as the relative reduction. On a 4.3% baseline the absolute reduction is only about 2.7 percentage points, an NNT of about 37 to prevent one event; the same relative effect on a 0.4% baseline yields an absolute reduction of about 0.25 percentage points, an NNT of roughly 400. Treating the relative effect as constant across baselines is a working assumption; the clinical worth is never constant, because it scales with baseline risk.
For a composite primary endpoint, decompose it — a significant composite driven entirely by its softest component (e.g. "admission") while the hard component (death) is flat is a classic overstatement.
For a surrogate endpoint (cervical length, biomarker), ask whether it is validated against the patient-important outcome; many are not.
Probe fragility: the fragility index is the smallest number of patients whose event status would have to flip to render a significant result non-significant. A "positive" obstetric trial that turns null on 2–3 events is fragile — useful as a humility check, though criticised for tracking sample size and lacking an agreed threshold.
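The relative-to-absolute conversion in the worked example can be checked in a few lines. This is a sketch using only the two event risks quoted above; the function name `effect_measures` and the 0.4% comparison baseline are illustrative. The crude odds ratio from these two risks comes out near 0.36; a reported value such as 0.38 can differ, for instance if it comes from an adjusted model.

```python
def effect_measures(risk_treated: float, risk_control: float) -> dict:
    """Relative and absolute effect measures from two event risks (0-1 scale)."""
    rr = risk_treated / risk_control
    odds_treated = risk_treated / (1 - risk_treated)
    odds_control = risk_control / (1 - risk_control)
    arr = risk_control - risk_treated  # absolute risk reduction
    return {
        "rr": rr,
        "rrr": 1 - rr,
        "or": odds_treated / odds_control,
        "arr": arr,
        "nnt": 1 / arr,
    }


m = effect_measures(risk_treated=0.016, risk_control=0.043)
print(f"RR={m['rr']:.2f}  RRR={m['rrr']:.0%}  crude OR={m['or']:.2f}")
print(f"ARR={m['arr'] * 100:.1f} percentage points  NNT={m['nnt']:.0f}")

# Working assumption: the same relative effect applies at a lower baseline.
low_baseline = 0.004
arr_low = low_baseline * m["rrr"]
nnt_low = 1 / arr_low
print(f"At a 0.4% baseline: ARR={arr_low * 100:.2f} points  NNT={nnt_low:.0f}")
```

The relative risk is identical in both scenarios; only the baseline changes, and the NNT moves by an order of magnitude. A confidence interval would need the arm sizes, which are not given here.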

Judgement (does it apply to my patient?)

External validity is the SA crux: a pre-eclampsia or PPH trial run in well-resourced settings may not transfer to a district hospital with different prevalence, comorbidity (notably HIV) and theatre access. The effect size can be real and irrelevant locally.
Synthesise across studies with GRADE — rate certainty (high → very low) by risk of bias, inconsistency, indirectness, imprecision and publication bias — which is how the guidelines you cite were actually built.

Reading a diagnostic-accuracy paper at depth

Most O&G screening and point-of-care questions are diagnostic, not therapeutic, and the traps are different. Reading one well means doing four things:

Refuse to live on sensitivity and specificity. They are properties of the test and, for a given case spectrum, do not depend on prevalence, but they do not tell you what you actually want at the bedside: the post-test probability given this result. That needs likelihood ratios and Bayes. LR+ = sensitivity / (1 − specificity); LR− = (1 − sensitivity) / specificity. A rough literacy: LR+ >10 or LR− <0.1 is strong (rules in / rules out), 5–10 / 0.1–0.2 moderate, and anything near 1 is a test that moves the needle so little it is not worth doing. The arithmetic runs on odds, not probabilities: pre-test odds × LR = post-test odds, with odds = p / (1 − p). The Fagan nomogram is the bedside way to go from pre-test probability through the LR to post-test probability without doing that conversion by hand.
Watch the prevalence trap. Predictive values (PPV/NPV) are not test properties — they swing with prevalence. A screen with 99% specificity throws mostly false positives when prevalence is 0.5% (the PPV collapses), which is exactly why a "highly specific" Down-syndrome or pre-eclampsia screen still needs a confirmatory step before anyone acts. A PPV quoted without the local prevalence is uninterpretable.
Interrogate the reference standard. QUADAS-2 calls this out: if the "gold standard" is itself imperfect, or if only test-positives were verified (partial/differential verification bias), the accuracy is inflated. Spectrum bias is the other classic — a test validated on florid cases and healthy controls (case-control diagnostic design) looks brilliant and then disappoints in the real, ambiguous, district-clinic spectrum.
Distinguish discrimination from calibration for any risk score. Discrimination (the c-statistic / AUC) asks whether the model ranks the sick above the well; calibration asks whether a predicted 10% risk really happens 10% of the time. A model can discriminate well (AUC 0.85) yet be badly mis-calibrated in a new population — and it is calibration that decides whether a threshold you act on is safe. This is why an imported model (fullPIERS, a fetal-growth chart, a VTE score) needs local validation, not just a good published AUC.
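The likelihood-ratio arithmetic and the prevalence trap can be sketched together, since a PPV is simply the post-test probability after a positive result when the pre-test probability is the prevalence. The 90% sensitivity is an assumed figure for illustration; the 99% specificity and 0.5% prevalence are the values used in the text.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple:
    """Return (LR+, LR-) for a test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_prob: float, lr: float) -> float:
    """Bayes on the odds scale: pre-test odds x LR = post-test odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Assumed test characteristics (sensitivity is hypothetical).
sens, spec = 0.90, 0.99
lr_pos, lr_neg = likelihood_ratios(sens, spec)
print(f"LR+ = {lr_pos:.0f}, LR- = {lr_neg:.2f}")

# Same test, three prevalences: the PPV swings, the test does not.
for prevalence in (0.005, 0.05, 0.30):
    ppv = post_test_probability(prevalence, lr_pos)
    npv = 1 - post_test_probability(prevalence, lr_neg)
    print(f"prevalence {prevalence:.1%}: PPV = {ppv:.1%}, NPV = {npv:.2%}")
```

At 0.5% prevalence roughly two in three positives are false despite an LR+ of about 90, which is the numerical case for a confirmatory step before acting on a screen.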


Interpret and apply maternity statistics and critically appraise the primary literature in O&G
