Configuring your training run

GenomEn uses a YAML configuration file to declaratively define your entire experiment (dataset, model, and training settings) in one place. Use the sample template in the repository as a starting point and adapt it to your phenotype and compute environment.

The configuration is divided into three main sections that control dataset preparation, model behavior, and training parameters:

  • DataSetConfig defines the dataset used for training, including the phenotype to predict, input file format, populations, covariates, and sampling strategies for both samples and genetic variants (e.g., LD-based or GWAS-based selection).

    Example
    DataSetConfig:
      phenotype_id: "HC337"             # must be a column in the master file
      classification: true              # true if phenotype is binary; otherwise false
      file_format: "plink"              # only supports plink for now
      populations: ["white_british"]    # list of population groups to include
      include_x_chromosome: false       # whether to include variants from the X chromosome
      maf_threshold: 0.05               # minimum minor allele frequency (MAF) required for variants to be retained
      sex: null                         # optional filtering by sex ("m" for male, "w" for female, or null for all)
      covar_config:
        include_covars: false           # whether to include covariates at all in the model
        covar_keys: ["age", "sex", "Global_PC1", "Global_PC2", "Global_PC3", "Global_PC4", "Global_PC5", "Global_PC6", "Global_PC7", "Global_PC8", "Global_PC9", "Global_PC10"]
      sample_sampling:
        strat: "stratify"               # strategy for sampling individuals; "random" or "stratify" (k:1 balanced classes)
        max_samples: 50                 # maximum number of samples to include per patch (higher values generally perform better)
        balance_pops: false             # whether to balance the number of samples per patch across populations
      variant_sampling:
        strat: "random"                 # options: "random", "window", "LD", "GWAS"
        max_features: 50                # maximum number of variants (features) to include per patch; used as window size when strat = "window"
        ld_config:                      # used when strat = "LD"
          prune_kb: 250                 # distance window in kilobases for LD pruning
          prune_step: 50                # step size (in variants) used during pruning
          prune_r2: 0.1                 # LD threshold for pruning
          tau: 0.1
          ld_window_kb: 1_000           # max window size for LD blocks in kb
          ld_window: 50_000             # max window size for LD blocks in number of variants
          eps: 0.0                      # epsilon parameter for epsilon-greedy sampling
          eps_schedule: "constant"      # epsilon annealing schedule ("constant" or "step")
          eps_step_size: 0.0            # step size for epsilon updates (if applicable)
        gwas_config:                    # used when strat = "GWAS"
          path: ""                      # path to a GWAS summary statistics file
          snps_column: "variant_id"     # column name for SNP identifiers
          pvalue_column: "pvalue"       # column name for raw p values
          nlogpvalue_column: "LOG10P"   # column name for negative log10 of p-values (used if provided)
          sep: '\s+'                    # delimiter (regex; whitespace by default)
          impute_val: 0.1               # imputation value for variants not in GWAS
        window_overlap_ratio: 0.0       # stride as percentage of window size for window sampling

  • GenomenModelConfig specifies the architecture and hyperparameters of the models used to process covariates and genotypes, as well as the ensemble and aggregation strategies for combining model outputs.

    Example
    GenomenModelConfig:
      covar_config:                     # configuration of covariate model
        covar_strat: "residualization"  # "residualization" or "predictive"
        model_config:                   # config of model used for covariate prediction
          model_name: lightgbm          # model name, check genomen/model/models.json for an overview of available models
          hyperparameters: {}           # hyperparameters of model
          balance_classes: true         # whether to balance classes in estimator loss
      geno_config:                      # configuration of genotype model
        n_estimators: 2                 # number of estimators
        compute_interactions: true      # whether to compute interaction values at training time (approximately 2x training time)
        preprocessing_config:           
          z_score_thresh: 3.0           # z-score threshold to filter outliers
          standard_labels: false        # whether to standardize labels
          feature_selection:
            method: "none"              # options: "none", "k_best", "percentile", "variance_threshold", "mutual_info", "rfe"
            k: 15_000                   # number of variants selected in case of method="k_best"
            percentile: 0.75            # percentile of variants selected in case of method="percentile"
            variance_threshold: 0.05    # variance threshold for variants in case of method="variance_threshold"
            score_func: "f_classif"     # scoring function used for scoring ("f_classif", "f_regression", or "chi2")
        model_config:
          model_name: lightgbm          # model name, check genomen/model/models.json for an overview of available models
          ensemble_estimator_names: []  # names of weak estimator models to be used in ensemble (model_name="ensemble")
          hyperparameters: {}           # hyperparameters of model
          balance_classes: true         # whether to balance classes in estimator loss
        aggregator_config:
          filter_strat: "geq-average"   # filtering strategy ("none", "positive", "geq-average", "top-p-percentile")
          agg_stat: "rank-mean"         # aggregation strategy ("mean", "rank-mean", "loss-weighted-average", "stacking")
          model_config: 
            model_name: lightgbm        # model name of stacking model, check genomen/model/models.json for an overview of available models
            hyperparameters: {}         # hyperparameters of model
            balance_classes: true       # whether to balance classes in estimator loss
          p: 0.75                       # p used for top-p-percentile filtering
          temp: 0.05                    # temperature for the softmax weights when agg_stat="loss-weighted-average"

  • TrainConfig controls how models are trained and evaluated, including compute backend, batch size, number of jobs, evaluation metric, early stopping, and logging options.

    Example
    TrainConfig:
      batch_size: 2                     # number of models trained in parallel
      n_jobs: 32                        # number of jobs that can be run in parallel
      backend: "cpu"                    # backend to use ("cpu" or "gpu")
      ram_mb: 16000                     # available RAM in megabytes
      scorer: "rocauc"                  # scoring function for early stopping ("r2", "rocauc", "pearson_corr")
      patience: 30                      # patience in number of batches
      seed: 42                          # seed for reproducibility
      log_with_wandb: false             # whether to log with wandb
      save_annotation: false            # whether to save annotation files (e.g., effect sizes or variant importance)
      save_model: false                 # whether to save model artifacts
      compute_shap: false               # whether to compute SHAP values

Together, these sections provide a flexible way to reproduce or customize GenomEn experiments—from input preprocessing to model training and interpretation.
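Before launching a run, it can help to sanity-check the file yourself. A minimal sketch using PyYAML (the parser choice and the `check_config` helper are assumptions for illustration, not part of GenomEn):

```python
import yaml  # PyYAML; any YAML parser works

# The three top-level sections GenomEn expects, per the reference above.
REQUIRED_SECTIONS = {"DataSetConfig", "GenomenModelConfig", "TrainConfig"}

def check_config(path):
    """Parse the YAML config and verify all top-level sections are present."""
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    missing = REQUIRED_SECTIONS - cfg.keys()
    if missing:
        raise ValueError(f"config is missing sections: {sorted(missing)}")
    return cfg
```

Running a check like this before training catches a misspelled section name early instead of midway through dataset preparation.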

Setting up a training run

Before training, point GenomEn to your YAML configuration file:

import genomen.utils as utils

utils.set_config_path("config.yml")
########## Welcome to Genomic Ensembling (GenomEn) - Polygenic risk and association beyond linearity ##########

Loading the dataset

Once everything is set up, load the dataset via the DataSet class. The helper function split allows you to split a DataSet into training, validation, and test sets (randomly or using a predefined split column in the master table).

from genomen.data import DataSet, split

dataset = DataSet()
train_set, test_set = split(dataset, test_size=0.2)
train_set, val_set = split(train_set, test_size=0.2)  # hold out a validation set for early stopping
INFO:DataSet:Looking for cached dataset...
INFO:DataSet:Found cached dataset. Proceeding to loading data...
INFO:DataSet:Got 479 cases in the train set (0.20 %). Balancing with k=5 (4790 samples per batch).
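Conceptually, a random split just shuffles the sample IDs and partitions them by the requested fraction. A rough sketch of those semantics (not GenomEn's actual implementation; `split` also supports a predefined split column):

```python
import random

def random_split(sample_ids, test_size=0.2, seed=42):
    """Shuffle sample IDs and partition them into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    ids = list(sample_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_size)
    return ids[n_test:], ids[:n_test]

train_ids, test_ids = random_split(range(1000), test_size=0.2)
# 800 training IDs, 200 test IDs, no overlap
```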

Training the model

Initialize and fit the GenomenModel. The fit method takes the training set together with a validation set, which drives the scorer and early stopping configured in TrainConfig.

from genomen.model import GenomenModel

model = GenomenModel()
model.fit(train_set, val_set)
INFO:genomen.model.model:Fitting covar model...
INFO:genomen.model.model:Validation covar-only score: 0.7755
INFO:genomen.model.model:Fitting geno model...
INFO:DataSet:Got 390 cases in the train set (0.21 %). Balancing with k=5 (3900 samples per batch).
Batch=7: Avg weak rocauc=0.4992 - Strong rocauc=0.5519 - Trained=16: 100%|███████████████████████| 8/8 [04:04<00:00, 30.62s/it]
INFO:GenoEstimator:Early stopping at batch 8. Best batch: 6 (12 estimators).
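The early stopping visible in the log follows the standard patience pattern: once the validation score has not improved for `patience` consecutive batches, training halts and the best batch is kept. A generic sketch of that logic (not GenomEn's internal code):

```python
def best_batch_with_patience(scores, patience):
    """Scan per-batch validation scores; stop once `patience` consecutive
    batches pass without improvement and return the best batch index."""
    best_idx, best_score, since_best = 0, float("-inf"), 0
    for i, score in enumerate(scores):
        if score > best_score:
            best_idx, best_score, since_best = i, score, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # early stop: no improvement for `patience` batches
    return best_idx

# With patience=2, training stops after two flat batches and keeps batch 1.
best = best_batch_with_patience([0.50, 0.60, 0.55, 0.54, 0.53], patience=2)
```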

Making predictions

Once the model is trained, you can use it to make predictions on new data.

geno_preds, covar_preds, preds = model.predict(test_set)

The call returns genotype-only, covariate-only, and final ensemble predictions, respectively.
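For binary phenotypes, the predictions can be scored against the held-out labels with ROC AUC, the same metric as the "rocauc" scorer. A minimal rank-based (Mann-Whitney) sketch, assuming no tied scores; how you retrieve the true labels from the test DataSet depends on its API:

```python
def roc_auc(y_true, y_score):
    """Rank-based (Mann-Whitney) ROC AUC; assumes binary labels and no ties."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank  # rank of each sample by predicted score
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# e.g. roc_auc(y_test, preds) to score the final ensemble predictions
```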