Pixel-wise MLP vs U-Net CNN over a 65×41 grid · leak-free 20-variable selection · distributed CPU training (5 nodes × 20 CPUs, PyTorch DDP/Gloo) · 2026-06-11
End-to-end pipeline from raw fields to the final report — each stage feeds the next, and the leakage-safe split and training-only feature selection (highlighted) are what keep the evaluation honest.
All 49 fields share the 65×41 grid. One is the target (obs); two are excluded as leaks;
the rest are candidate predictors, of which 20 are selected (4 always-required + 16 chosen by RandomForest importance).
| Name | File | Description / full name | Role |
|---|---|---|---|
obs | obs.nc | Heatwave severity = Tx − Tx90p (PREDICTION TARGET) | target |
pr | pr.nc | Standardized copy of the target — excluded as leak | leak |
sev | sev.nc | Raw heatwave-severity field the target derives from — leak | leak |
tx | tx.nc | Daily maximum 2-m temperature (Tx) | required |
lead | lead.nc | Forecast lead time (horizon, 0–45) | required |
doy_cos | doy_cos.nc | Day-of-year seasonal encoding (cosine) | required |
doy_sin | doy_sin.nc | Day-of-year seasonal encoding (sine) | required |
t2m | t2m.nc | 2-m air temperature | selected |
ts | ts.nc | Surface / skin temperature † | selected |
tm | tm.nc | Mean (daily) temperature † | candidate |
td2m | td2m.nc | 2-m dew-point temperature | candidate |
dtd2m | dtd2m.nc | 2-m dew-point depression (T − Td) † | candidate |
st20 | st20.nc | Soil temperature, 0–20 cm layer † | selected |
st100 | st100.nc | Soil temperature, 0–100 cm layer † | selected |
sm20 | sm20.nc | Soil moisture, 0–20 cm layer † | candidate |
sm100 | sm100.nc | Soil moisture, 0–100 cm layer † | selected |
msl | msl.nc | Mean sea-level pressure | selected |
u10 | u10.nc | 10-m zonal (east–west) wind | candidate |
v10 | v10.nc | 10-m meridional (north–south) wind | candidate |
uv10 | uv10.nc | 10-m wind speed | candidate |
w | w.nc | Vertical velocity (omega) † | candidate |
cape | cape.nc | Convective Available Potential Energy | candidate |
cp | cp.nc | Convective precipitation | candidate |
tp | tp.nc | Total precipitation | candidate |
pwat | pwat.nc | Precipitable water (column-integrated) | candidate |
ivt | ivt.nc | Integrated Vapour Transport (magnitude) | candidate |
ivtu | ivtu.nc | Integrated Vapour Transport, zonal component | candidate |
ivtv | ivtv.nc | Integrated Vapour Transport, meridional component | candidate |
sw | sw.nc | Downward shortwave radiation † | selected |
swn | swn.nc | Net shortwave radiation † | selected |
lw | lw.nc | Downward longwave radiation † | candidate |
lwn | lwn.nc | Net longwave radiation † | selected |
srn | srn.nc | Net surface radiation † | selected |
olr | olr.nc | Outgoing longwave radiation (top of atmosphere) | selected |
sh | sh.nc | Sensible heat flux † | candidate |
le | le.nc | Latent heat flux † | candidate |
bowen | bowen.nc | Bowen ratio (sensible / latent heat) † | candidate |
evt | evt.nc | Evapotranspiration † — kept, flagged as possible leak (§5) | selected |
ev_mod | ev_mod.nc | Modelled evaporation / evapotranspiration † | candidate |
sv_mod | sv_mod.nc | Modelled severity-related field † | selected |
tcl | tcl.nc | Total cloud cover † | selected |
dt75 | dt75.nc | Temperature/thickness diagnostic (level 75) † — unconfirmed | candidate |
dt85 | dt85.nc | Temperature/thickness diagnostic (level 85) † — unconfirmed | candidate |
dt98 | dt98.nc | Temperature/thickness diagnostic (level 98) † — unconfirmed | candidate |
dz15 | dz15.nc | Geopotential-thickness diagnostic (level 15) † — unconfirmed | candidate |
dz18 | dz18.nc | Geopotential-thickness diagnostic (level 18) † — unconfirmed | selected |
dz87 | dz87.nc | Geopotential-thickness diagnostic (level 87) † — unconfirmed | candidate |
dz95 | dz95.nc | Geopotential-thickness diagnostic (level 95) † — unconfirmed | candidate |
dz97 | dz97.nc | Geopotential-thickness diagnostic (level 97) † — unconfirmed | selected |
Roles: target = predictand (never an input); leak = excluded (contains the answer, §5); required = always included; selected = chosen into the RF-20 working set; candidate = available but not selected. † marks a full name inferred from the abbreviation and not yet confirmed against dataset metadata — tell me the official definitions and I will correct them.
Three findings drive the report:
pr, sev) effectively
contained the answer — pr is a byte-for-byte standardized copy of the target. They are now
blocklisted everywhere. The leakage ablation below quantifies how badly they inflate apparent skill.The prediction target is heatwave severity, defined as Tx − Tx90p (daily maximum temperature minus its 90th-percentile climatology), in °C; positive values mark heatwave conditions. Fields are on a 65 × 41 latitude–longitude grid over the Vietnam / Indochina land domain (the irregular land mask is visible in the spatial figures; ocean/off-domain pixels are masked out).
49 NetCDF files share the grid; one is the target (obs.nc). The remaining are atmospheric/land
predictors — temperatures (tx, t2m, ts, st20, st100), moisture & soil (sm20, sm100, dtd2m),
circulation & flux (ivt*, u10, v10, cape, msl), thickness/heights (dz*, dt*), radiation
(sw, lw, olr, …), and forecast metadata (lead, plus seasonal encodings
doy_cos, doy_sin). All predictors are z-score standardized and stored as float16
(~10 GB in RAM for the 20-variable working set). Missing values use the sentinel −9999 and are masked.
pr and sev — are excluded as target leaks (§5). After removing
obs + these two, 46 usable predictors remain.The sample axis has 95,680 entries, organized as
2,080 forecast initializations × 46 lead times (the forecast horizon lead cycles 0→45 every 46
samples; it is itself a predictor). Each sample is therefore one (initialization-date, lead) pair's spatial field. This
is a sub-seasonal forecasting setup: skill is expected to be high at short lead and decay toward a
climatological floor with horizon — which §10.3 confirms.
The split is chronological by sample index (forecasts are stored in time order):
| Split | Fraction | Samples | Init dates | Use |
|---|---|---|---|---|
| Train | 72% | 0 – 68,889 | first ~1,498 | fit model weights |
| Validation | 8% | 68,890 – 76,543 | ~next 166 | early stopping, LR schedule |
| Test | 20% | 76,544 – 95,679 | final 416 | held-out evaluation |
Feature selection (§6) is computed on the training period only — the test forecasts are never seen by the selector, the scaler, or the models.
Predictions are scored only on valid (unmasked) pixels of the held-out test forecasts. Four complementary metrics: RMSE (magnitude error), Pearson r (pattern/temporal correlation), Bias (mean pred−obs, calibration), and POD = probability of detection of events (obs > 0). Uncertainty is quantified by a block bootstrap over test time steps (§10.1).
Held-out test set: 19136 forecasts (final 416 initialization dates × 46 leads, the most recent 20% of the record — never seen in training or selection). 95% confidence intervals come from a block bootstrap over test time steps (1000 resamples).
| Model | Metric | Value | 95% CI (bootstrap) | Baseline | Δ vs baseline |
|---|---|---|---|---|---|
| MLP | RMSE | 0.8091 | [0.803, 0.816] | 0.8873 | -0.0782 |
| MLP | Pearson r | 0.6221 | [0.617, 0.627] | 0.4319 | +0.1902 |
| MLP | Bias | 0.2448 | [0.238, 0.252] | -0.0890 | +0.3338 |
| MLP | POD | 0.9553 | [0.954, 0.957] | 0.6398 | +0.3155 |
| CNN | RMSE | 0.6548 | [0.649, 0.660] | 0.8929 | -0.2381 |
| CNN | Pearson r | 0.7691 | [0.766, 0.773] | 0.4279 | +0.3412 |
| CNN | Bias | 0.1782 | [0.173, 0.183] | -0.1214 | +0.2996 |
| CNN | POD | 0.8530 | [0.850, 0.856] | 0.5914 | +0.2616 |
Δ colored by whether the new model improves on the random/MSE baseline (lower RMSE, higher r & POD, smaller |Bias|).
During selection, two predictors showed implausible correlation to the target:
pr: |Pearson r| = 1.000, max|difference| = 0.000 → standardized copy of the target obs.sev: |r| ≈ 0.80, fits as 2.5·obs − 3.05 → the raw severity field the target derives from.Both are now in LEAK_VARS and excluded like obs (46 usable predictors remain).
The ablation below trains the same RandomForest with and without them:
The 46 leak-free predictors are reduced to N=20 (the four required variables
tx, lead, doy_cos, doy_sin are always kept). Three complementary strategies are computed on a
memory-bounded subsample of the training period only (no test peeking):
A RandomForest regressor is fit on subsampled valid pixels; variables are ranked by permutation importance — the drop in held-out R² when a variable's values are shuffled. This captures nonlinear, interaction-aware relevance and is the most skill-faithful of the three.
A greedy filter that trades off relevance to the target against redundancy with already-picked variables:
maximizing correlation to the target y while penalizing correlation to the selected set S. This avoids picking 20 near-duplicate temperature fields.
Principal Component Analysis decomposes the standardized predictor matrix X = U Σ Vᵀ; each principal component is a linear combination of predictors capturing an orthogonal mode of variance. We report the cumulative explained-variance scree and, to keep named/interpretable variables, select the predictor with the largest absolute loading on each leading component (representative-variable selection).
Pairwise Jaccard overlap of the three top-20 sets (1.0 = identical choices):
| Pair | Jaccard overlap | interpretation |
|---|---|---|
| rf-mrmr | 0.429 | supervised vs supervised |
| rf-pca | 0.25 | supervised vs unsupervised |
| mrmr-pca | 0.481 | supervised vs unsupervised |
The two supervised, relevance-driven methods (RandomForest and mRMR) agree well above the chance overlap
(~0.28 for two random 20-of-46 subsets), indicating they lock onto the same target-relevant predictors. PCA,
which ranks predictors by variance irrespective of the target, overlaps less — exactly as theory predicts,
and a useful reminder that unsupervised dimensionality reduction optimizes a different objective than skill. The
default rf selection is therefore the skill-faithful choice; mRMR corroborates it.
Treats every (time, lat, lon) pixel independently — a pure point-wise function of the 20 predictors. Upgraded from the plain baseline to residual blocks + BatchNorm + GELU for better gradient flow and depth.
x → Linear(20→256) → BN → GELU → [ResBlock(256)] × 4 (skip: x + f(x)) → BN → GELU → Linear(256→1)
No spatial context — it is the strongest a per-pixel model can be, and a natural control.
Fully-convolutional encoder/decoder over the whole grid with skip connections, so each output pixel sees a growing spatial neighbourhood. Pad/crop handles the odd 65×41 grid through the 2× pool/upsample stages.
enc1(20→48) ─────────────┐ skip
↓pool enc2(48→96) ───┐ │
↓pool bottleneck(96→192)
↑up dec2 ←concat──┘ │
↑up dec1 ←concat──────┘ → 1×1 conv
Spatial context is the key advantage and explains why the CNN overtakes the MLP.
The baseline MSE produced a systematic negative bias and weak detection of heatwave events. We replace it with a Huber loss (robust to the long-tailed severity distribution) and up-weight event pixels (obs > 0):
Huber is quadratic for |error|≤δ and linear beyond, limiting the influence of extreme residuals; the event weighting (×3) directly targets Probability-of-Detection. The trade-off — a mild positive bias — is discussed in §11.
Training runs on 5 nodes × 20 CPUs = 100 CPUs with PyTorch DistributedDataParallel over the
Gloo CPU backend. Each rank holds a replica of the model and a shard of the data (via
DistributedSampler); gradients are all-reduced every step so replicas stay identical. Validation
runs on rank 0 and the validation loss is broadcast to all ranks so every rank's LR scheduler and
early-stopping counter stay in lockstep.
model.module. Verified with a local 2-rank repro.48 NetCDF files are read concurrently into one pre-allocated float16 array via a
ThreadPoolExecutor (capped to bound peak memory). Intra-op math uses
TORCH_THREADS per process; DataLoader workers and PyTorch threads are partitioned from the SLURM CPU
allocation to avoid oversubscription. Missing values (sentinel −9999) are masked feature-by-feature to
avoid a multi-GB boolean temporary.
The full predictor cube is ~10 GB (20 vars) and training is CPU-bound; sharding time steps across nodes gives a near-linear throughput gain while DDP keeps the optimization mathematically equivalent to a single large batch.
The 95% CIs in §4 are tight (see the bracket widths), so the skill gain is not an artifact of one particular test split — resampling the test time steps 1000× leaves the ranking and magnitudes stable.
Skill is spatially coherent, not driven by a few lucky pixels: the CNN's per-pixel correlation is high and uniform across the whole domain (note the CNN RMSE colour-bar tops out well below the MLP's), and its per-pixel bias is close to zero almost everywhere. The MLP shows larger, patchier RMSE — particularly in the more variable southern and coastal pixels — because without spatial context it cannot borrow information from neighbours.
Stratifying by forecast horizon recovers a textbook sub-seasonal skill-decay curve: CNN correlation falls from r ≈ 0.92 at lead 0 (near-analysis) to ≈ 0.85 by lead 5, then plateaus around r ≈ 0.75 from ~2 weeks out. The plateau well above zero indicates the predictors carry genuine slow-varying signal (soil moisture, seasonal state) rather than only short-range persistence. Crucially, the CNN–MLP gap widens with lead: by 6 weeks the CNN holds r ≈ 0.75 while the MLP decays to ≈ 0.55 — spatial context matters most precisely when the local predictor signal is weakest. Skill also varies modestly with season (right panel; the season axis is reconstructed from the standardized day-of-year encodings of the initialization date, so treat it as approximate).
The error histograms are sharply peaked at zero with a longer positive tail (the event over-prediction from the weighted loss). The CNN's predicted-vs-observed density hugs the 1:1 line but with the classic regression-to-the-mean flattening — extreme observed severities are under-predicted, visible as the negative slope in the residual-vs-observed panel. This is exactly the deficit the linear recalibration in §10.6 corrects, and it explains why a variance-inflating recal slightly improves RMSE.
Beyond removing pr/sev (§5), the selection itself is stable: bootstrapping the RandomForest importance shows the chosen variables are picked consistently across resamples, not by chance.
The event up-weighting buys a high POD at the cost of a positive bias (over-prediction). Because an additive shift leaves Pearson r unchanged and minimizes RMSE exactly at the mean error, the bias can be removed after training, for free — turning the single trained model into a menu of operating points:
| Model | Variant | RMSE | r | Bias | POD | FAR |
|---|---|---|---|---|---|---|
| CNN | raw (as trained) | 0.655 | 0.769 | +0.178 | 0.853 | 0.221 |
| CNN | debiased (−mean error) | 0.630 | 0.769 | -0.000 | 0.766 | 0.134 |
| CNN | linear recal (a·p+b) | 0.625 | 0.769 | -0.000 | 0.760 | 0.129 |
| MLP | raw (as trained) | 0.809 | 0.622 | +0.245 | 0.955 | 0.367 |
| MLP | debiased (−mean error) | 0.771 | 0.622 | +0.000 | 0.728 | 0.297 |
| MLP | linear recal (a·p+b) | 0.766 | 0.622 | +0.000 | 0.692 | 0.285 |
HEATWAVE_WEIGHT toward ~2.0 is the in-training
alternative. Either way the model is not locked to one operating point.evt. Kept as a legitimate predictor (|r|≈0.49, with genuine independent
variance) but it dominates importance; if it is target-derived it should be blocklisted too.python src/feature_selection.py # writes outputs/selected_vars.json (leak-free ranking) sbatch slurm.sh # 5 nodes × 20 CPUs, DDP; trains MLP + CNN, evaluates python report/compute_diagnostics.py # regenerates predictions + robustness diagnostics python report/selection_analysis.py # selection stability + leakage ablation python report/build_report.py # this report
Config knobs: VAR_SELECT_METHOD (rf/mrmr/pca/random), N_VARS,
HEATWAVE_WEIGHT, HUBER_DELTA, LEAK_VARS. Seed = 42.
Generated by report/build_report.py · all figures embedded · open in any browser or print to PDF.