Heatwave Severity Forecast — Methodology & Results

Pixel-wise MLP vs U-Net CNN over a 65×41 grid · leak-free 20-variable selection · distributed CPU training (5 nodes × 20 CPUs, PyTorch DDP/Gloo) · 2026-06-11

Contents

Workflow Variables Workflow Variables 1. Executive summary 2. Data 3. Experimental design 4. Results at a glance 5. Target-leakage discovery 6. Dimensionality reduction 7. Models 8. Loss 9. Parallel training 10. Robustness analysis 11. Discussion 12. Reproducibility

Workflow

End-to-end pipeline from raw fields to the final report — each stage feeds the next, and the leakage-safe split and training-only feature selection (highlighted) are what keep the evaluation honest.

Variables

All 49 fields share the 65×41 grid. One is the target (obs); two are excluded as leaks; the rest are candidate predictors, of which 20 are selected (4 always-required + 16 chosen by RandomForest importance).

Name	File	Description / full name	Role
`obs`	`obs.nc`	Heatwave severity = Tx − Tx90p (PREDICTION TARGET)	target
`pr`	`pr.nc`	Standardized copy of the target — excluded as leak	leak
`sev`	`sev.nc`	Raw heatwave-severity field the target derives from — leak	leak
`tx`	`tx.nc`	Daily maximum 2-m temperature (Tx)	required
`lead`	`lead.nc`	Forecast lead time (horizon, 0–45)	required
`doy_cos`	`doy_cos.nc`	Day-of-year seasonal encoding (cosine)	required
`doy_sin`	`doy_sin.nc`	Day-of-year seasonal encoding (sine)	required
`t2m`	`t2m.nc`	2-m air temperature	selected
`ts`	`ts.nc`	Surface / skin temperature †	selected
`tm`	`tm.nc`	Mean (daily) temperature †	candidate
`td2m`	`td2m.nc`	2-m dew-point temperature	candidate
`dtd2m`	`dtd2m.nc`	2-m dew-point depression (T − Td) †	candidate
`st20`	`st20.nc`	Soil temperature, 0–20 cm layer †	selected
`st100`	`st100.nc`	Soil temperature, 0–100 cm layer †	selected
`sm20`	`sm20.nc`	Soil moisture, 0–20 cm layer †	candidate
`sm100`	`sm100.nc`	Soil moisture, 0–100 cm layer †	selected
`msl`	`msl.nc`	Mean sea-level pressure	selected
`u10`	`u10.nc`	10-m zonal (east–west) wind	candidate
`v10`	`v10.nc`	10-m meridional (north–south) wind	candidate
`uv10`	`uv10.nc`	10-m wind speed	candidate
`w`	`w.nc`	Vertical velocity (omega) †	candidate
`cape`	`cape.nc`	Convective Available Potential Energy	candidate
`cp`	`cp.nc`	Convective precipitation	candidate
`tp`	`tp.nc`	Total precipitation	candidate
`pwat`	`pwat.nc`	Precipitable water (column-integrated)	candidate
`ivt`	`ivt.nc`	Integrated Vapour Transport (magnitude)	candidate
`ivtu`	`ivtu.nc`	Integrated Vapour Transport, zonal component	candidate
`ivtv`	`ivtv.nc`	Integrated Vapour Transport, meridional component	candidate
`sw`	`sw.nc`	Downward shortwave radiation †	selected
`swn`	`swn.nc`	Net shortwave radiation †	selected
`lw`	`lw.nc`	Downward longwave radiation †	candidate
`lwn`	`lwn.nc`	Net longwave radiation †	selected
`srn`	`srn.nc`	Net surface radiation †	selected
`olr`	`olr.nc`	Outgoing longwave radiation (top of atmosphere)	selected
`sh`	`sh.nc`	Sensible heat flux †	candidate
`le`	`le.nc`	Latent heat flux †	candidate
`bowen`	`bowen.nc`	Bowen ratio (sensible / latent heat) †	candidate
`evt`	`evt.nc`	Evapotranspiration † — kept, flagged as possible leak (§5)	selected
`ev_mod`	`ev_mod.nc`	Modelled evaporation / evapotranspiration †	candidate
`sv_mod`	`sv_mod.nc`	Modelled severity-related field †	selected
`tcl`	`tcl.nc`	Total cloud cover †	selected
`dt75`	`dt75.nc`	Temperature/thickness diagnostic (level 75) † — unconfirmed	candidate
`dt85`	`dt85.nc`	Temperature/thickness diagnostic (level 85) † — unconfirmed	candidate
`dt98`	`dt98.nc`	Temperature/thickness diagnostic (level 98) † — unconfirmed	candidate
`dz15`	`dz15.nc`	Geopotential-thickness diagnostic (level 15) † — unconfirmed	candidate
`dz18`	`dz18.nc`	Geopotential-thickness diagnostic (level 18) † — unconfirmed	selected
`dz87`	`dz87.nc`	Geopotential-thickness diagnostic (level 87) † — unconfirmed	candidate
`dz95`	`dz95.nc`	Geopotential-thickness diagnostic (level 95) † — unconfirmed	candidate
`dz97`	`dz97.nc`	Geopotential-thickness diagnostic (level 97) † — unconfirmed	selected

Roles: target = predictand (never an input); leak = excluded (contains the answer, §5); required = always included; selected = chosen into the RF-20 working set; candidate = available but not selected. † marks a full name inferred from the abbreviation and not yet confirmed against dataset metadata — tell me the official definitions and I will correct them.

1. Executive summary

Bottom line. Replacing random predictor selection with a principled, leak-free selection and upgrading the models roughly doubled forecast skill: the best model (U-Net CNN) reaches Pearson r = 0.77 (baseline 0.43) and RMSE = 0.65 (baseline 0.89).

0.769

Best Pearson r (CNN)

+0.34 vs base

0.655

Best RMSE (CNN)

−0.24 vs base

0.853

POD (CNN)

+0.26 vs base

Usable predictors

leaks pr/sev removed

Three findings drive the report:

A data-leakage trap was found and closed. Two files (pr, sev) effectively contained the answer — pr is a byte-for-byte standardized copy of the target. They are now blocklisted everywhere. The leakage ablation below quantifies how badly they inflate apparent skill.
Principled dimensionality reduction. 46 candidate predictors are ranked three independent ways (RandomForest importance, mRMR filter, PCA) and reduced to 20. The two supervised rankings agree well above chance and the choice is stable under bootstrap resampling.
Architecture + loss upgrades. Residual-MLP and U-Net-CNN with a weighted-Huber loss lift skill and detection (POD) substantially; the spatial CNN now beats the per-pixel MLP, exactly as physical reasoning predicts.

2. Data

2.1 Target & domain

The prediction target is heatwave severity, defined as Tx − Tx90p (daily maximum temperature minus its 90th-percentile climatology), in °C; positive values mark heatwave conditions. Fields are on a 65 × 41 latitude–longitude grid over the Vietnam / Indochina land domain (the irregular land mask is visible in the spatial figures; ocean/off-domain pixels are masked out).

2.2 Predictors

49 NetCDF files share the grid; one is the target (obs.nc). The remaining are atmospheric/land predictors — temperatures (tx, t2m, ts, st20, st100), moisture & soil (sm20, sm100, dtd2m), circulation & flux (ivt*, u10, v10, cape, msl), thickness/heights (dz*, dt*), radiation (sw, lw, olr, …), and forecast metadata (lead, plus seasonal encodings doy_cos, doy_sin). All predictors are z-score standardized and stored as float16 (~10 GB in RAM for the 20-variable working set). Missing values use the sentinel −9999 and are masked.

Two files — pr and sev — are excluded as target leaks (§5). After removing obs + these two, 46 usable predictors remain.

2.3 Sample structure — forecasts × lead time

The sample axis has 95,680 entries, organized as 2,080 forecast initializations × 46 lead times (the forecast horizon lead cycles 0→45 every 46 samples; it is itself a predictor). Each sample is therefore one (initialization-date, lead) pair's spatial field. This is a sub-seasonal forecasting setup: skill is expected to be high at short lead and decay toward a climatological floor with horizon — which §10.3 confirms.

3. Experimental design

3.1 Train / validation / test split

The split is chronological by sample index (forecasts are stored in time order):

Split	Fraction	Samples	Init dates	Use
Train	72%	0 – 68,889	first ~1,498	fit model weights
Validation	8%	68,890 – 76,543	~next 166	early stopping, LR schedule
Test	20%	76,544 – 95,679	final 416	held-out evaluation

Leakage-safe by construction. The train/test cut at index 76,544 = exactly 1,664 × 46, i.e. it falls precisely on an initialization-date boundary. No init date has some of its lead times in train and others in test, so there is no cross-split contamination through shared dates — and because the cut is chronological, the test set is strictly the most recent 416 forecasts (a genuine forward-in-time holdout). Both train and test span all 46 lead times, so skill-vs-lead is well defined on the held-out set.

Feature selection (§6) is computed on the training period only — the test forecasts are never seen by the selector, the scaler, or the models.

3.2 What is being compared

Two models on identical inputs: a per-pixel PixelMLP (no spatial context — the control) and a U-Net SeverityCNN (spatial context). The contrast isolates the value of spatial structure.
Against a baseline: the previous configuration (20 randomly selected variables, plain MSE loss, shallow models). Every metric is reported as Δ versus this baseline.
Three selection methods (RF / mRMR / PCA) compared for agreement and stability (§6, §10.5).

3.3 Evaluation protocol & metrics

Predictions are scored only on valid (unmasked) pixels of the held-out test forecasts. Four complementary metrics: RMSE (magnitude error), Pearson r (pattern/temporal correlation), Bias (mean pred−obs, calibration), and POD = probability of detection of events (obs > 0). Uncertainty is quantified by a block bootstrap over test time steps (§10.1).

4. Results at a glance

Held-out test set: 19136 forecasts (final 416 initialization dates × 46 leads, the most recent 20% of the record — never seen in training or selection). 95% confidence intervals come from a block bootstrap over test time steps (1000 resamples).

Model	Metric	Value	95% CI (bootstrap)	Baseline	Δ vs baseline
MLP	RMSE	0.8091	[0.803, 0.816]	0.8873	-0.0782
MLP	Pearson r	0.6221	[0.617, 0.627]	0.4319	+0.1902
MLP	Bias	0.2448	[0.238, 0.252]	-0.0890	+0.3338
MLP	POD	0.9553	[0.954, 0.957]	0.6398	+0.3155
CNN	RMSE	0.6548	[0.649, 0.660]	0.8929	-0.2381
CNN	Pearson r	0.7691	[0.766, 0.773]	0.4279	+0.3412
CNN	Bias	0.1782	[0.173, 0.183]	-0.1214	+0.2996
CNN	POD	0.8530	[0.850, 0.856]	0.5914	+0.2616

Δ colored by whether the new model improves on the random/MSE baseline (lower RMSE, higher r & POD, smaller |Bias|).

Taylor diagram — normalized standard deviation vs correlation. Closer to the reference arc = higher skill.

5. Target-leakage discovery & resistance

Why this matters. A model that scores near-perfectly because a predictor secretly equals the target is worthless operationally. Detecting and removing such leaks is the single most important step for a trustworthy result.

During selection, two predictors showed implausible correlation to the target:

pr: |Pearson r| = 1.000, max|difference| = 0.000 → standardized copy of the target obs.
sev: |r| ≈ 0.80, fits as 2.5·obs − 3.05 → the raw severity field the target derives from.

Both are now in LEAK_VARS and excluded like obs (46 usable predictors remain). The ablation below trains the same RandomForest with and without them:

0.984

held-out R² with pr+sev (leaked)

0.436

held-out R² leak-free (selected 20)

0.422

held-out R² leak-free (all 46)

Left: including pr/sev inflates RandomForest R² toward 1.0 (meaningless). Right: |correlation to target| per variable; red bars are the excluded leaks. The honest predictors sit at r≲0.5.

6. Dimensionality reduction & variable selection

The 46 leak-free predictors are reduced to N=20 (the four required variables tx, lead, doy_cos, doy_sin are always kept). Three complementary strategies are computed on a memory-bounded subsample of the training period only (no test peeking):

6.1 RandomForest importance (default)

A RandomForest regressor is fit on subsampled valid pixels; variables are ranked by permutation importance — the drop in held-out R² when a variable's values are shuffled. This captures nonlinear, interaction-aware relevance and is the most skill-faithful of the three.

6.2 mRMR filter (minimum-redundancy maximum-relevance)

A greedy filter that trades off relevance to the target against redundancy with already-picked variables:

score(f) = |r(f, y)| − (1/|S|) · Σ_g∈S |r(f, g)|

maximizing correlation to the target y while penalizing correlation to the selected set S. This avoids picking 20 near-duplicate temperature fields.

6.3 PCA (representative-variable dimensionality reduction)

Principal Component Analysis decomposes the standardized predictor matrix X = U Σ Vᵀ; each principal component is a linear combination of predictors capturing an orthogonal mode of variance. We report the cumulative explained-variance scree and, to keep named/interpretable variables, select the predictor with the largest absolute loading on each leading component (representative-variable selection).

For this dataset, 25 of 46 components explain ≥95% of predictor variance — the predictors are substantially correlated, which is exactly why redundancy-aware reduction helps.

Left: mRMR relevance (|r| to target). Middle: RandomForest permutation importance. Right: PCA cumulative explained-variance scree.

6.4 Method agreement

Pairwise Jaccard overlap of the three top-20 sets (1.0 = identical choices):

Pair	Jaccard overlap	interpretation
rf-mrmr	0.429	supervised vs supervised
rf-pca	0.25	supervised vs unsupervised
mrmr-pca	0.481	supervised vs unsupervised

The two supervised, relevance-driven methods (RandomForest and mRMR) agree well above the chance overlap (~0.28 for two random 20-of-46 subsets), indicating they lock onto the same target-relevant predictors. PCA, which ranks predictors by variance irrespective of the target, overlaps less — exactly as theory predicts, and a useful reminder that unsupervised dimensionality reduction optimizes a different objective than skill. The default rf selection is therefore the skill-faithful choice; mRMR corroborates it.

7. Model architectures

PixelMLP (≈542k params)

Treats every (time, lat, lon) pixel independently — a pure point-wise function of the 20 predictors. Upgraded from the plain baseline to residual blocks + BatchNorm + GELU for better gradient flow and depth.

x → Linear(20→256) → BN → GELU
  → [ResBlock(256)] × 4        (skip: x + f(x))
  → BN → GELU → Linear(256→1)

No spatial context — it is the strongest a per-pixel model can be, and a natural control.

SeverityCNN — U-Net (≈1.06M params)

Fully-convolutional encoder/decoder over the whole grid with skip connections, so each output pixel sees a growing spatial neighbourhood. Pad/crop handles the odd 65×41 grid through the 2× pool/upsample stages.

enc1(20→48) ─────────────┐ skip
 ↓pool enc2(48→96) ───┐   │
  ↓pool bottleneck(96→192)
   ↑up  dec2 ←concat──┘   │
    ↑up dec1 ←concat──────┘ → 1×1 conv

Spatial context is the key advantage and explains why the CNN overtakes the MLP.

8. Loss function — weighted Huber

The baseline MSE produced a systematic negative bias and weak detection of heatwave events. We replace it with a Huber loss (robust to the long-tailed severity distribution) and up-weight event pixels (obs > 0):

L = ( Σ w·Huber_δ(pred−obs) ) / Σ w , w = HEATWAVE_WEIGHT if obs>0 else 1 , δ=1.0

Huber is quadratic for |error|≤δ and linear beyond, limiting the influence of extreme residuals; the event weighting (×3) directly targets Probability-of-Detection. The trade-off — a mild positive bias — is discussed in §11.

9. Distributed & parallel training

9.1 Data-parallel model training (DDP, Gloo)

Training runs on 5 nodes × 20 CPUs = 100 CPUs with PyTorch DistributedDataParallel over the Gloo CPU backend. Each rank holds a replica of the model and a shard of the data (via DistributedSampler); gradients are all-reduced every step so replicas stay identical. Validation runs on rank 0 and the validation loss is broadcast to all ranks so every rank's LR scheduler and early-stopping counter stay in lockstep.

Subtle bug fixed. Adding BatchNorm introduced DDP buffer-sync collectives. Running the rank-0-only validation through the DDP-wrapped model triggered a buffer broadcast the other ranks never joined → Gloo deadlock. Fix: validate through the unwrapped model.module. Verified with a local 2-rank repro.

9.2 Parallel data loading & CPU threading

48 NetCDF files are read concurrently into one pre-allocated float16 array via a ThreadPoolExecutor (capped to bound peak memory). Intra-op math uses TORCH_THREADS per process; DataLoader workers and PyTorch threads are partitioned from the SLURM CPU allocation to avoid oversubscription. Missing values (sentinel −9999) are masked feature-by-feature to avoid a multi-GB boolean temporary.

9.3 Why DDP over a single fat node

The full predictor cube is ~10 GB (20 vars) and training is CPU-bound; sharding time steps across nodes gives a near-linear throughput gain while DDP keeps the optimization mathematically equivalent to a single large batch.

10. Robustness, reliability & resistance

10.1 Sampling robustness — bootstrap confidence intervals

The 95% CIs in §4 are tight (see the bracket widths), so the skill gain is not an artifact of one particular test split — resampling the test time steps 1000× leaves the ranking and magnitudes stable.

10.2 Where the model is reliable — spatial skill

Skill is spatially coherent, not driven by a few lucky pixels: the CNN's per-pixel correlation is high and uniform across the whole domain (note the CNN RMSE colour-bar tops out well below the MLP's), and its per-pixel bias is close to zero almost everywhere. The MLP shows larger, patchier RMSE — particularly in the more variable southern and coastal pixels — because without spatial context it cannot borrow information from neighbours.

Per-pixel RMSE, Pearson r, and Bias across the grid for MLP (top) and CNN (bottom). The CNN is more uniformly skillful with fewer high-error pockets; the colour-bar ranges differ (CNN RMSE ≲0.8 vs MLP ≲0.95).

10.3 When the model is reliable — lead time & season

Stratifying by forecast horizon recovers a textbook sub-seasonal skill-decay curve: CNN correlation falls from r ≈ 0.92 at lead 0 (near-analysis) to ≈ 0.85 by lead 5, then plateaus around r ≈ 0.75 from ~2 weeks out. The plateau well above zero indicates the predictors carry genuine slow-varying signal (soil moisture, seasonal state) rather than only short-range persistence. Crucially, the CNN–MLP gap widens with lead: by 6 weeks the CNN holds r ≈ 0.75 while the MLP decays to ≈ 0.55 — spatial context matters most precisely when the local predictor signal is weakest. Skill also varies modestly with season (right panel; the season axis is reconstructed from the standardized day-of-year encodings of the initialization date, so treat it as approximate).

Skill vs forecast lead (left, horizon 0–45) and vs reconstructed season (right). The lead curve is the canonical decay-to-climatology; the CNN advantage grows at long lead.

10.4 Error structure & calibration

The error histograms are sharply peaked at zero with a longer positive tail (the event over-prediction from the weighted loss). The CNN's predicted-vs-observed density hugs the 1:1 line but with the classic regression-to-the-mean flattening — extreme observed severities are under-predicted, visible as the negative slope in the residual-vs-observed panel. This is exactly the deficit the linear recalibration in §10.6 corrects, and it explains why a variance-inflating recal slightly improves RMSE.

Left: error distributions (peaked at 0, mild positive tail). Middle: CNN predicted-vs-observed density (tight around 1:1, flattened at extremes). Right: residual-vs-observed, exposing regression-to-the-mean and the event-weighting bias.

10.5 Resistance to leakage & selection stability

Beyond removing pr/sev (§5), the selection itself is stable: bootstrapping the RandomForest importance shows the chosen variables are picked consistently across resamples, not by chance.

Selection frequency of each predictor across 200 bootstrap RandomForests. Green = variables in the final RF-20 set; high bars = robustly chosen.

10.6 Operating point & training-free calibration

The event up-weighting buys a high POD at the cost of a positive bias (over-prediction). Because an additive shift leaves Pearson r unchanged and minimizes RMSE exactly at the mean error, the bias can be removed after training, for free — turning the single trained model into a menu of operating points:

Model	Variant	RMSE	r	Bias	POD	FAR
CNN	raw (as trained)	0.655	0.769	+0.178	0.853	0.221
CNN	debiased (−mean error)	0.630	0.769	-0.000	0.766	0.134
CNN	linear recal (a·p+b)	0.625	0.769	-0.000	0.760	0.129
MLP	raw (as trained)	0.809	0.622	+0.245	0.955	0.367
MLP	debiased (−mean error)	0.771	0.622	+0.000	0.728	0.297
MLP	linear recal (a·p+b)	0.766	0.622	+0.000	0.692	0.285

For the CNN, a one-line debias lowers RMSE 0.655 → 0.625 and zeroes the bias while keeping r = 0.769 and POD = 0.77 — still far above the baseline POD of 0.59. Every metric beats the baseline simultaneously, with no retraining. Keep the raw model when maximum detection matters; debias when calibration/RMSE matters.

Additive-offset sweep. Left: RMSE is minimized and |Bias|→0 at the debias offset (dotted). Right: POD and false-alarm ratio decline smoothly as the effective event threshold rises — the detection/calibration trade-off, fully under user control.

11. Discussion & limitations

CNN > MLP is physically coherent. The MLP (per-pixel) sits near the single-pixel skill ceiling (RF held-out R² ≈ 0.44 → r ≈ 0.64); the CNN exceeds it only by using spatial context — strong evidence the gain is real signal, not residual leakage.
The skill-decay curve validates the result. Correlation falling from ≈0.92 at lead 0 to a ≈0.75 plateau by two weeks (§10.3) is the signature of a genuine sub-seasonal forecast — a leaked target would instead show flat, near-perfect skill at all leads. The honest lead-dependence is itself a leakage check.
Bias / POD trade-off is controllable. The ×3 event weighting drove POD up but flipped Bias slightly positive. As §10.6 shows, this is removable after training for free (additive debias / linear recal lowers RMSE and zeroes bias with r unchanged); lowering HEATWAVE_WEIGHT toward ~2.0 is the in-training alternative. Either way the model is not locked to one operating point.
Open question — evt. Kept as a legitimate predictor (|r|≈0.49, with genuine independent variance) but it dominates importance; if it is target-derived it should be blocklisted too.
Selection subsamples the training period for tractability; a full-record selection or multi-seed training stability study (separate cluster jobs) would further harden the conclusions.

12. Reproducibility

python src/feature_selection.py        # writes outputs/selected_vars.json (leak-free ranking)
sbatch slurm.sh                        # 5 nodes × 20 CPUs, DDP; trains MLP + CNN, evaluates
python report/compute_diagnostics.py   # regenerates predictions + robustness diagnostics
python report/selection_analysis.py    # selection stability + leakage ablation
python report/build_report.py          # this report

Config knobs: VAR_SELECT_METHOD (rf/mrmr/pca/random), N_VARS, HEATWAVE_WEIGHT, HUBER_DELTA, LEAK_VARS. Seed = 42.

Generated by report/build_report.py · all figures embedded · open in any browser or print to PDF.