Heatwave Severity Forecast — Methodology & Results

Pixel-wise MLP vs U-Net CNN over a 65×41 grid · leak-free 20-variable selection · distributed CPU training (5 nodes × 20 CPUs, PyTorch DDP/Gloo) · 2026-06-11

Contents

Workflow

End-to-end pipeline from raw fields to the final report — each stage feeds the next, and the leakage-safe split and training-only feature selection (highlighted) are what keep the evaluation honest.

Processing workflow: data → leak removal → feature selection → leakage-safe temporal split → distributed training → held-out evaluation → diagnostics & report.
Processing workflow: data → leak removal → feature selection → leakage-safe temporal split → distributed training → held-out evaluation → diagnostics & report.

Variables

All 49 fields share the 65×41 grid. One is the target (obs); two are excluded as leaks; the rest are candidate predictors, of which 20 are selected (4 always-required + 16 chosen by RandomForest importance).

NameFileDescription / full nameRole
obsobs.ncHeatwave severity = Tx − Tx90p (PREDICTION TARGET)target
prpr.ncStandardized copy of the target — excluded as leakleak
sevsev.ncRaw heatwave-severity field the target derives from — leakleak
txtx.ncDaily maximum 2-m temperature (Tx)required
leadlead.ncForecast lead time (horizon, 0–45)required
doy_cosdoy_cos.ncDay-of-year seasonal encoding (cosine)required
doy_sindoy_sin.ncDay-of-year seasonal encoding (sine)required
t2mt2m.nc2-m air temperatureselected
tsts.ncSurface / skin temperature †selected
tmtm.ncMean (daily) temperature †candidate
td2mtd2m.nc2-m dew-point temperaturecandidate
dtd2mdtd2m.nc2-m dew-point depression (T − Td) †candidate
st20st20.ncSoil temperature, 0–20 cm layer †selected
st100st100.ncSoil temperature, 0–100 cm layer †selected
sm20sm20.ncSoil moisture, 0–20 cm layer †candidate
sm100sm100.ncSoil moisture, 0–100 cm layer †selected
mslmsl.ncMean sea-level pressureselected
u10u10.nc10-m zonal (east–west) windcandidate
v10v10.nc10-m meridional (north–south) windcandidate
uv10uv10.nc10-m wind speedcandidate
ww.ncVertical velocity (omega) †candidate
capecape.ncConvective Available Potential Energycandidate
cpcp.ncConvective precipitationcandidate
tptp.ncTotal precipitationcandidate
pwatpwat.ncPrecipitable water (column-integrated)candidate
ivtivt.ncIntegrated Vapour Transport (magnitude)candidate
ivtuivtu.ncIntegrated Vapour Transport, zonal componentcandidate
ivtvivtv.ncIntegrated Vapour Transport, meridional componentcandidate
swsw.ncDownward shortwave radiation †selected
swnswn.ncNet shortwave radiation †selected
lwlw.ncDownward longwave radiation †candidate
lwnlwn.ncNet longwave radiation †selected
srnsrn.ncNet surface radiation †selected
olrolr.ncOutgoing longwave radiation (top of atmosphere)selected
shsh.ncSensible heat flux †candidate
lele.ncLatent heat flux †candidate
bowenbowen.ncBowen ratio (sensible / latent heat) †candidate
evtevt.ncEvapotranspiration † — kept, flagged as possible leak (§5)selected
ev_modev_mod.ncModelled evaporation / evapotranspiration †candidate
sv_modsv_mod.ncModelled severity-related field †selected
tcltcl.ncTotal cloud cover †selected
dt75dt75.ncTemperature/thickness diagnostic (level 75) † — unconfirmedcandidate
dt85dt85.ncTemperature/thickness diagnostic (level 85) † — unconfirmedcandidate
dt98dt98.ncTemperature/thickness diagnostic (level 98) † — unconfirmedcandidate
dz15dz15.ncGeopotential-thickness diagnostic (level 15) † — unconfirmedcandidate
dz18dz18.ncGeopotential-thickness diagnostic (level 18) † — unconfirmedselected
dz87dz87.ncGeopotential-thickness diagnostic (level 87) † — unconfirmedcandidate
dz95dz95.ncGeopotential-thickness diagnostic (level 95) † — unconfirmedcandidate
dz97dz97.ncGeopotential-thickness diagnostic (level 97) † — unconfirmedselected

Roles: target = predictand (never an input); leak = excluded (contains the answer, §5); required = always included; selected = chosen into the RF-20 working set; candidate = available but not selected. marks a full name inferred from the abbreviation and not yet confirmed against dataset metadata — tell me the official definitions and I will correct them.

1. Executive summary

Bottom line. Replacing random predictor selection with a principled, leak-free selection and upgrading the models roughly doubled forecast skill: the best model (U-Net CNN) reaches Pearson r = 0.77 (baseline 0.43) and RMSE = 0.65 (baseline 0.89).
0.769
Best Pearson r (CNN)
+0.34 vs base
0.655
Best RMSE (CNN)
−0.24 vs base
0.853
POD (CNN)
+0.26 vs base
46
Usable predictors
leaks pr/sev removed

Three findings drive the report:

  1. A data-leakage trap was found and closed. Two files (pr, sev) effectively contained the answer — pr is a byte-for-byte standardized copy of the target. They are now blocklisted everywhere. The leakage ablation below quantifies how badly they inflate apparent skill.
  2. Principled dimensionality reduction. 46 candidate predictors are ranked three independent ways (RandomForest importance, mRMR filter, PCA) and reduced to 20. The two supervised rankings agree well above chance and the choice is stable under bootstrap resampling.
  3. Architecture + loss upgrades. Residual-MLP and U-Net-CNN with a weighted-Huber loss lift skill and detection (POD) substantially; the spatial CNN now beats the per-pixel MLP, exactly as physical reasoning predicts.

2. Data

2.1 Target & domain

The prediction target is heatwave severity, defined as Tx − Tx90p (daily maximum temperature minus its 90th-percentile climatology), in °C; positive values mark heatwave conditions. Fields are on a 65 × 41 latitude–longitude grid over the Vietnam / Indochina land domain (the irregular land mask is visible in the spatial figures; ocean/off-domain pixels are masked out).

2.2 Predictors

49 NetCDF files share the grid; one is the target (obs.nc). The remaining are atmospheric/land predictors — temperatures (tx, t2m, ts, st20, st100), moisture & soil (sm20, sm100, dtd2m), circulation & flux (ivt*, u10, v10, cape, msl), thickness/heights (dz*, dt*), radiation (sw, lw, olr, …), and forecast metadata (lead, plus seasonal encodings doy_cos, doy_sin). All predictors are z-score standardized and stored as float16 (~10 GB in RAM for the 20-variable working set). Missing values use the sentinel −9999 and are masked.

Two files — pr and sev — are excluded as target leaks (§5). After removing obs + these two, 46 usable predictors remain.

2.3 Sample structure — forecasts × lead time

The sample axis has 95,680 entries, organized as 2,080 forecast initializations × 46 lead times (the forecast horizon lead cycles 0→45 every 46 samples; it is itself a predictor). Each sample is therefore one (initialization-date, lead) pair's spatial field. This is a sub-seasonal forecasting setup: skill is expected to be high at short lead and decay toward a climatological floor with horizon — which §10.3 confirms.

3. Experimental design

3.1 Train / validation / test split

The split is chronological by sample index (forecasts are stored in time order):

SplitFractionSamplesInit datesUse
Train72%0 – 68,889first ~1,498fit model weights
Validation8%68,890 – 76,543~next 166early stopping, LR schedule
Test20%76,544 – 95,679final 416held-out evaluation
Leakage-safe by construction. The train/test cut at index 76,544 = exactly 1,664 × 46, i.e. it falls precisely on an initialization-date boundary. No init date has some of its lead times in train and others in test, so there is no cross-split contamination through shared dates — and because the cut is chronological, the test set is strictly the most recent 416 forecasts (a genuine forward-in-time holdout). Both train and test span all 46 lead times, so skill-vs-lead is well defined on the held-out set.

Feature selection (§6) is computed on the training period only — the test forecasts are never seen by the selector, the scaler, or the models.

3.2 What is being compared

3.3 Evaluation protocol & metrics

Predictions are scored only on valid (unmasked) pixels of the held-out test forecasts. Four complementary metrics: RMSE (magnitude error), Pearson r (pattern/temporal correlation), Bias (mean pred−obs, calibration), and POD = probability of detection of events (obs > 0). Uncertainty is quantified by a block bootstrap over test time steps (§10.1).

4. Results at a glance

Held-out test set: 19136 forecasts (final 416 initialization dates × 46 leads, the most recent 20% of the record — never seen in training or selection). 95% confidence intervals come from a block bootstrap over test time steps (1000 resamples).

ModelMetricValue95% CI (bootstrap)BaselineΔ vs baseline
MLPRMSE0.8091[0.803, 0.816]0.8873-0.0782
MLPPearson r0.6221[0.617, 0.627]0.4319+0.1902
MLPBias0.2448[0.238, 0.252]-0.0890+0.3338
MLPPOD0.9553[0.954, 0.957]0.6398+0.3155
CNNRMSE0.6548[0.649, 0.660]0.8929-0.2381
CNNPearson r0.7691[0.766, 0.773]0.4279+0.3412
CNNBias0.1782[0.173, 0.183]-0.1214+0.2996
CNNPOD0.8530[0.850, 0.856]0.5914+0.2616

Δ colored by whether the new model improves on the random/MSE baseline (lower RMSE, higher r & POD, smaller |Bias|).

Taylor diagram — normalized standard deviation vs correlation. Closer to the reference arc = higher skill.
Taylor diagram — normalized standard deviation vs correlation. Closer to the reference arc = higher skill.

5. Target-leakage discovery & resistance

Why this matters. A model that scores near-perfectly because a predictor secretly equals the target is worthless operationally. Detecting and removing such leaks is the single most important step for a trustworthy result.

During selection, two predictors showed implausible correlation to the target:

Both are now in LEAK_VARS and excluded like obs (46 usable predictors remain). The ablation below trains the same RandomForest with and without them:

0.984
held-out R² with pr+sev (leaked)
0.436
held-out R² leak-free (selected 20)
0.422
held-out R² leak-free (all 46)
Left: including pr/sev inflates RandomForest R² toward 1.0 (meaningless). Right: |correlation to target| per variable; red bars are the excluded leaks. The honest predictors sit at r≲0.5.
Left: including pr/sev inflates RandomForest R² toward 1.0 (meaningless). Right: |correlation to target| per variable; red bars are the excluded leaks. The honest predictors sit at r≲0.5.

6. Dimensionality reduction & variable selection

The 46 leak-free predictors are reduced to N=20 (the four required variables tx, lead, doy_cos, doy_sin are always kept). Three complementary strategies are computed on a memory-bounded subsample of the training period only (no test peeking):

6.1 RandomForest importance (default)

A RandomForest regressor is fit on subsampled valid pixels; variables are ranked by permutation importance — the drop in held-out R² when a variable's values are shuffled. This captures nonlinear, interaction-aware relevance and is the most skill-faithful of the three.

6.2 mRMR filter (minimum-redundancy maximum-relevance)

A greedy filter that trades off relevance to the target against redundancy with already-picked variables:

score(f) = |r(f, y)| − (1/|S|) · Σg∈S |r(f, g)|

maximizing correlation to the target y while penalizing correlation to the selected set S. This avoids picking 20 near-duplicate temperature fields.

6.3 PCA (representative-variable dimensionality reduction)

Principal Component Analysis decomposes the standardized predictor matrix X = U Σ Vᵀ; each principal component is a linear combination of predictors capturing an orthogonal mode of variance. We report the cumulative explained-variance scree and, to keep named/interpretable variables, select the predictor with the largest absolute loading on each leading component (representative-variable selection).

For this dataset, 25 of 46 components explain ≥95% of predictor variance — the predictors are substantially correlated, which is exactly why redundancy-aware reduction helps.
Left: mRMR relevance (|r| to target). Middle: RandomForest permutation importance. Right: PCA cumulative explained-variance scree.
Left: mRMR relevance (|r| to target). Middle: RandomForest permutation importance. Right: PCA cumulative explained-variance scree.

6.4 Method agreement

Pairwise Jaccard overlap of the three top-20 sets (1.0 = identical choices):

PairJaccard overlapinterpretation
rf-mrmr0.429supervised vs supervised
rf-pca0.25supervised vs unsupervised
mrmr-pca0.481supervised vs unsupervised

The two supervised, relevance-driven methods (RandomForest and mRMR) agree well above the chance overlap (~0.28 for two random 20-of-46 subsets), indicating they lock onto the same target-relevant predictors. PCA, which ranks predictors by variance irrespective of the target, overlaps less — exactly as theory predicts, and a useful reminder that unsupervised dimensionality reduction optimizes a different objective than skill. The default rf selection is therefore the skill-faithful choice; mRMR corroborates it.

7. Model architectures

PixelMLP (≈542k params)

Treats every (time, lat, lon) pixel independently — a pure point-wise function of the 20 predictors. Upgraded from the plain baseline to residual blocks + BatchNorm + GELU for better gradient flow and depth.

x → Linear(20→256) → BN → GELU
  → [ResBlock(256)] × 4        (skip: x + f(x))
  → BN → GELU → Linear(256→1)

No spatial context — it is the strongest a per-pixel model can be, and a natural control.

SeverityCNN — U-Net (≈1.06M params)

Fully-convolutional encoder/decoder over the whole grid with skip connections, so each output pixel sees a growing spatial neighbourhood. Pad/crop handles the odd 65×41 grid through the 2× pool/upsample stages.

enc1(20→48) ─────────────┐ skip
 ↓pool enc2(48→96) ───┐   │
  ↓pool bottleneck(96→192)
   ↑up  dec2 ←concat──┘   │
    ↑up dec1 ←concat──────┘ → 1×1 conv

Spatial context is the key advantage and explains why the CNN overtakes the MLP.

8. Loss function — weighted Huber

The baseline MSE produced a systematic negative bias and weak detection of heatwave events. We replace it with a Huber loss (robust to the long-tailed severity distribution) and up-weight event pixels (obs > 0):

L = ( Σ w·Huberδ(pred−obs) ) / Σ w , w = HEATWAVE_WEIGHT if obs>0 else 1 , δ=1.0

Huber is quadratic for |error|≤δ and linear beyond, limiting the influence of extreme residuals; the event weighting (×3) directly targets Probability-of-Detection. The trade-off — a mild positive bias — is discussed in §11.

9. Distributed & parallel training

9.1 Data-parallel model training (DDP, Gloo)

Training runs on 5 nodes × 20 CPUs = 100 CPUs with PyTorch DistributedDataParallel over the Gloo CPU backend. Each rank holds a replica of the model and a shard of the data (via DistributedSampler); gradients are all-reduced every step so replicas stay identical. Validation runs on rank 0 and the validation loss is broadcast to all ranks so every rank's LR scheduler and early-stopping counter stay in lockstep.

Subtle bug fixed. Adding BatchNorm introduced DDP buffer-sync collectives. Running the rank-0-only validation through the DDP-wrapped model triggered a buffer broadcast the other ranks never joined → Gloo deadlock. Fix: validate through the unwrapped model.module. Verified with a local 2-rank repro.

9.2 Parallel data loading & CPU threading

48 NetCDF files are read concurrently into one pre-allocated float16 array via a ThreadPoolExecutor (capped to bound peak memory). Intra-op math uses TORCH_THREADS per process; DataLoader workers and PyTorch threads are partitioned from the SLURM CPU allocation to avoid oversubscription. Missing values (sentinel −9999) are masked feature-by-feature to avoid a multi-GB boolean temporary.

9.3 Why DDP over a single fat node

The full predictor cube is ~10 GB (20 vars) and training is CPU-bound; sharding time steps across nodes gives a near-linear throughput gain while DDP keeps the optimization mathematically equivalent to a single large batch.

10. Robustness, reliability & resistance

10.1 Sampling robustness — bootstrap confidence intervals

The 95% CIs in §4 are tight (see the bracket widths), so the skill gain is not an artifact of one particular test split — resampling the test time steps 1000× leaves the ranking and magnitudes stable.

10.2 Where the model is reliable — spatial skill

Skill is spatially coherent, not driven by a few lucky pixels: the CNN's per-pixel correlation is high and uniform across the whole domain (note the CNN RMSE colour-bar tops out well below the MLP's), and its per-pixel bias is close to zero almost everywhere. The MLP shows larger, patchier RMSE — particularly in the more variable southern and coastal pixels — because without spatial context it cannot borrow information from neighbours.

Per-pixel RMSE, Pearson r, and Bias across the grid for MLP (top) and CNN (bottom). The CNN is more uniformly skillful with fewer high-error pockets; the colour-bar ranges differ (CNN RMSE ≲0.8 vs MLP ≲0.95).
Per-pixel RMSE, Pearson r, and Bias across the grid for MLP (top) and CNN (bottom). The CNN is more uniformly skillful with fewer high-error pockets; the colour-bar ranges differ (CNN RMSE ≲0.8 vs MLP ≲0.95).

10.3 When the model is reliable — lead time & season

Stratifying by forecast horizon recovers a textbook sub-seasonal skill-decay curve: CNN correlation falls from r ≈ 0.92 at lead 0 (near-analysis) to ≈ 0.85 by lead 5, then plateaus around r ≈ 0.75 from ~2 weeks out. The plateau well above zero indicates the predictors carry genuine slow-varying signal (soil moisture, seasonal state) rather than only short-range persistence. Crucially, the CNN–MLP gap widens with lead: by 6 weeks the CNN holds r ≈ 0.75 while the MLP decays to ≈ 0.55 — spatial context matters most precisely when the local predictor signal is weakest. Skill also varies modestly with season (right panel; the season axis is reconstructed from the standardized day-of-year encodings of the initialization date, so treat it as approximate).

Skill vs forecast lead (left, horizon 0–45) and vs reconstructed season (right). The lead curve is the canonical decay-to-climatology; the CNN advantage grows at long lead.
Skill vs forecast lead (left, horizon 0–45) and vs reconstructed season (right). The lead curve is the canonical decay-to-climatology; the CNN advantage grows at long lead.

10.4 Error structure & calibration

The error histograms are sharply peaked at zero with a longer positive tail (the event over-prediction from the weighted loss). The CNN's predicted-vs-observed density hugs the 1:1 line but with the classic regression-to-the-mean flattening — extreme observed severities are under-predicted, visible as the negative slope in the residual-vs-observed panel. This is exactly the deficit the linear recalibration in §10.6 corrects, and it explains why a variance-inflating recal slightly improves RMSE.

Left: error distributions (peaked at 0, mild positive tail). Middle: CNN predicted-vs-observed density (tight around 1:1, flattened at extremes). Right: residual-vs-observed, exposing regression-to-the-mean and the event-weighting bias.
Left: error distributions (peaked at 0, mild positive tail). Middle: CNN predicted-vs-observed density (tight around 1:1, flattened at extremes). Right: residual-vs-observed, exposing regression-to-the-mean and the event-weighting bias.

10.5 Resistance to leakage & selection stability

Beyond removing pr/sev (§5), the selection itself is stable: bootstrapping the RandomForest importance shows the chosen variables are picked consistently across resamples, not by chance.

Selection frequency of each predictor across 200 bootstrap RandomForests. Green = variables in the final RF-20 set; high bars = robustly chosen.
Selection frequency of each predictor across 200 bootstrap RandomForests. Green = variables in the final RF-20 set; high bars = robustly chosen.

10.6 Operating point & training-free calibration

The event up-weighting buys a high POD at the cost of a positive bias (over-prediction). Because an additive shift leaves Pearson r unchanged and minimizes RMSE exactly at the mean error, the bias can be removed after training, for free — turning the single trained model into a menu of operating points:

ModelVariantRMSErBiasPODFAR
CNNraw (as trained)0.6550.769+0.1780.8530.221
CNNdebiased (−mean error)0.6300.769-0.0000.7660.134
CNNlinear recal (a·p+b)0.6250.769-0.0000.7600.129
MLPraw (as trained)0.8090.622+0.2450.9550.367
MLPdebiased (−mean error)0.7710.622+0.0000.7280.297
MLPlinear recal (a·p+b)0.7660.622+0.0000.6920.285
For the CNN, a one-line debias lowers RMSE 0.655 → 0.625 and zeroes the bias while keeping r = 0.769 and POD = 0.77 — still far above the baseline POD of 0.59. Every metric beats the baseline simultaneously, with no retraining. Keep the raw model when maximum detection matters; debias when calibration/RMSE matters.
Additive-offset sweep. Left: RMSE is minimized and |Bias|→0 at the debias offset (dotted). Right: POD and false-alarm ratio decline smoothly as the effective event threshold rises — the detection/calibration trade-off, fully under user control.
Additive-offset sweep. Left: RMSE is minimized and |Bias|→0 at the debias offset (dotted). Right: POD and false-alarm ratio decline smoothly as the effective event threshold rises — the detection/calibration trade-off, fully under user control.

11. Discussion & limitations

12. Reproducibility

python src/feature_selection.py        # writes outputs/selected_vars.json (leak-free ranking)
sbatch slurm.sh                        # 5 nodes × 20 CPUs, DDP; trains MLP + CNN, evaluates
python report/compute_diagnostics.py   # regenerates predictions + robustness diagnostics
python report/selection_analysis.py    # selection stability + leakage ablation
python report/build_report.py          # this report

Config knobs: VAR_SELECT_METHOD (rf/mrmr/pca/random), N_VARS, HEATWAVE_WEIGHT, HUBER_DELTA, LEAK_VARS. Seed = 42.

Generated by report/build_report.py · all figures embedded · open in any browser or print to PDF.