Configuring your training run

GenomEn uses a YAML configuration file to declaratively define your entire experiment (dataset, model, and training settings) in one place. Use the sample template in the repository as a starting point and adapt it to your phenotype and compute environment.

The configuration is divided into three main sections that control dataset preparation, model behavior, and training parameters:

  • DataSetConfig defines the dataset used for training, including the phenotype to predict, input file format, populations, covariates, and sampling strategies for both samples and genetic variants (e.g., LD-based or GWAS-based selection).

    Example
    DataSetConfig:
      phenotype_id: "HC337"             # must be a column in the master file
      classification: true              # true if phenotype is binary; otherwise false
      file_format: "plink"              # only supports plink for now
      populations: ["white_british"]    # list of population groups to include
      include_x_chromosome: false       # whether to include variants from the X chromosome
      maf_threshold: 0.05               # minimum minor allele frequency (MAF) required for variants to be retained
      sex: null                         # optional filtering by sex ("m" for male, "w" for female, or null for all)
      covar_config:
        include_covars: false           # whether to include covariates at all in the model
        covar_keys: ["age", "sex", "Global_PC1", "Global_PC2", "Global_PC3", "Global_PC4", "Global_PC5", "Global_PC6", "Global_PC7", "Global_PC8", "Global_PC9", "Global_PC10"]
      sample_sampling:
        strat: "stratify"               # strategy for sampling individuals; "random" or "stratify" (k:1 balanced classes)
        max_samples: 50                 # maximum number of samples to include per patch (higher values generally perform better)
        balance_pops: false             # whether to balance the number of samples per patch across populations
      variant_sampling:
        strat: "random"                 # options: "random", "window", "LD", "GWAS"
        max_features: 50                # maximum number of variants (features) to include per patch; used as window size when strat = "window"
        ld_config:                      # used when strat = "LD"
          prune_kb: 250                 # distance window in kilobases for LD pruning
          prune_step: 50                # step size (in variants) used during pruning
          prune_r2: 0.1                 # LD threshold for pruning
          tau: 0.1
          ld_window_kb: 1_000           # max window size for LD blocks in kb
          ld_window: 50_000             # max window size for LD blocks in number of variants
          eps: 0.0                      # epsilon parameter for epsilon-greedy sampling
          eps_schedule: "constant"      # epsilon annealing schedule ("constant" or "step")
          eps_step_size: 0.0            # step size for epsilon updates (if applicable)
        gwas_config:                    # used when strat = "GWAS"
          path: ""                      # path to a GWAS summary statistics file
          snps_column: "variant_id"     # column name for SNP identifiers
          pvalue_column: "pvalue"       # column name for raw p values
          nlogpvalue_column: "LOG10P"   # column name for negative log10 of p-values (used if provided)
          sep: '\s+'                    # delimiter (regex; whitespace by default)
          impute_val: 0.1               # imputation value for variants not in GWAS
        window_overlap_ratio: 0.0       # stride as percentage of window size for window sampling

  • GenomenModelConfig specifies the architecture and hyperparameters of the models used to process covariates and genotypes, as well as the ensemble and aggregation strategies for combining model outputs.

    Example
    GenomenModelConfig:
      covar_config:                     # configuration of covariate model
        covar_strat: "residualization"  # "residualization" or "predictive"
        model_config:                   # config of model used for covariate prediction
          model_name: lightgbm          # model name, check genomen/model/models.json for an overview of available models
          hyperparameters: {}           # hyperparameters of model
          balance_classes: true         # whether to balance classes in estimator loss
      geno_config:                      # configuration of genotype model
        n_estimators: 2                 # number of estimators
        compute_interactions: true      # whether to compute interaction values at training time (approximately 2x training time)
        preprocessing_config:           
          z_score_thresh: 3.0           # z-score threshold to filter outliers
          standard_labels: false        # whether to standardize labels
          feature_selection:
            method: "none"              # options: "none", "k_best", "percentile", "variance_threshold", "mutual_info", "rfe"
            k: 15_000                   # number of variants selected in case of method="k_best"
            percentile: 0.75            # percentile of variants selected in case of method="percentile"
            variance_threshold: 0.05    # variance threshold for variants in case of method="variance_threshold"
            score_func: "f_classif"     # scoring function used for scoring ("f_classif", "f_regression", or "chi2")
        model_config:
          model_name: lightgbm          # model name, check genomen/model/models.json for an overview of available models
          ensemble_estimator_names: []  # names of weak estimator models to be used in ensemble (model_name="ensemble")
          hyperparameters: {}           # hyperparameters of model
          balance_classes: true         # whether to balance classes in estimator loss
        aggregator_config:
          filter_strat: "geq-average"   # filtering strategy ("none", "positive", "geq-average", "top-p-percentile")
          agg_stat: "rank-mean"         # aggregation strategy ("mean", "rank-mean", "loss-weighted-average", "stacking")
          model_config: 
            model_name: lightgbm        # model name of stacking model, check genomen/model/models.json for an overview of available models
            hyperparameters: {}         # hyperparameters of model
            balance_classes: true       # whether to balance classes in estimator loss
          p: 0.75                       # p used for top-p-percentile filtering
          temp: 0.05                    # temperature for the softmax weights when agg_stat="loss-weighted-average"

  • TrainConfig controls how models are trained and evaluated, including compute backend, batch size, number of jobs, evaluation metric, early stopping, and logging options.

    Example
    TrainConfig:
      batch_size: 2                     # number of models trained in parallel
      n_jobs: 32                        # number of jobs that can be run in parallel
      backend: "cpu"                    # backend to use ("cpu" or "gpu")
      ram_mb: 16000                     # available RAM in megabytes
      scorer: "rocauc"                  # scoring function for early stopping ("r2", "rocauc", "pearson_corr")
      patience: 30                      # patience in number of batches
      seed: 42                          # seed for reproducibility
      log_with_wandb: false             # whether to log with wandb
      save_annotation: false            # whether to save annotation files (e.g., effect sizes or variant importance)
      save_model: false                 # whether to save model artifacts
      compute_shap: false               # whether to compute SHAP values

Together, these sections provide a flexible way to reproduce or customize GenomEn experiments—from input preprocessing to model training and interpretation.
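Before launching a run, it can help to sanity-check the file yourself. A minimal sketch using PyYAML (the parser choice and the `check_config` helper are assumptions for illustration, not part of GenomEn):

```python
import yaml  # PyYAML; any YAML parser works

# The three top-level sections GenomEn expects, per the reference above.
REQUIRED_SECTIONS = {"DataSetConfig", "GenomenModelConfig", "TrainConfig"}

def check_config(path):
    """Parse the YAML config and verify all top-level sections are present."""
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    missing = REQUIRED_SECTIONS - cfg.keys()
    if missing:
        raise ValueError(f"config is missing sections: {sorted(missing)}")
    return cfg
```

Running a check like this before training catches a misspelled section name early instead of midway through dataset preparation.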

Setting up a training run

Before training, point GenomEn to your YAML configuration file:

import genomen.utils as utils

utils.set_config_path("config.yml")
########## Welcome to Genomic Ensembling (GenomEn) - Polygenic risk and association beyond linearity ##########

Loading the dataset

Once everything is set up, load the dataset via the DataSet class. The helper function split allows you to split a DataSet into training, validation, and test sets (randomly or using a predefined split column in the master table).

from genomen.data import DataSet, split

dataset = DataSet()
train_set, test_set = split(dataset, test_size=0.2)
train_set, val_set = split(train_set, test_size=0.2)  # hold out a validation set for early stopping
INFO:DataSet:Looking for cached dataset...
INFO:DataSet:Found cached dataset. Proceeding to loading data...
INFO:DataSet:Got 479 cases in the train set (0.20 %). Balancing with k=5 (4790 samples per batch).
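Conceptually, a random split just shuffles the sample IDs and partitions them by the requested fraction. A rough sketch of those semantics (not GenomEn's actual implementation; `split` also supports a predefined split column):

```python
import random

def random_split(sample_ids, test_size=0.2, seed=42):
    """Shuffle sample IDs and partition them into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_size)
    return ids[n_test:], ids[:n_test]

train_ids, test_ids = random_split(range(1000), test_size=0.2)
# 800 training IDs, 200 test IDs, no overlap
```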

Training the model

Initialize and fit the GenomenModel. The fit method takes the training set together with a validation set, which drives the scorer and early stopping configured in TrainConfig.

from genomen.model import GenomenModel

model = GenomenModel()
model.fit(train_set, val_set)
INFO:genomen.model.model:Fitting covar model...
INFO:genomen.model.model:Validation covar-only score: 0.7755
INFO:genomen.model.model:Fitting geno model...
INFO:DataSet:Got 390 cases in the train set (0.21 %). Balancing with k=5 (3900 samples per batch).
Batch=7: Avg weak rocauc=0.4992 - Strong rocauc=0.5519 - Trained=16: 100%|███████████████████████| 8/8 [04:04<00:00, 30.62s/it]
INFO:GenoEstimator:Early stopping at batch 8. Best batch: 6 (12 estimators).
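The early stopping visible in the log follows the standard patience pattern: once the validation score has not improved for `patience` consecutive batches, training halts and the best batch is kept. A generic sketch of that logic (not GenomEn's internal code):

```python
def best_batch_with_patience(scores, patience):
    """Scan per-batch validation scores; stop once `patience` consecutive
    batches pass without improvement and return the best batch index."""
    best_idx, best_score, since_best = 0, float("-inf"), 0
    for i, score in enumerate(scores):
        if score > best_score:
            best_idx, best_score, since_best = i, score, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # early stop: no improvement for `patience` batches
    return best_idx

# With patience=2, training stops after two flat batches and keeps batch 1.
best = best_batch_with_patience([0.50, 0.60, 0.55, 0.54, 0.53], patience=2)
```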

Making predictions

Once the model is trained, you can use it to make predictions on new data.

geno_preds, covar_preds, preds = model.predict(test_set)

The call returns genotype-only, covariate-only, and final ensemble predictions, respectively.
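For binary phenotypes, the predictions can be scored against the held-out labels with ROC AUC, the same metric as the "rocauc" scorer. A minimal rank-based (Mann-Whitney) sketch, assuming no tied scores; how you retrieve the true labels from the test DataSet depends on its API:

```python
def roc_auc(y_true, y_score):
    """Rank-based (Mann-Whitney) ROC AUC; assumes binary labels and no ties."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank  # rank of each sample by predicted score
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# e.g. roc_auc(y_test, preds) to score the final ensemble predictions
```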