Configuring your training run
GenomEn uses a YAML configuration file to declaratively define your entire experiment—dataset, model, and training settings—in one place. Use the sample template in the repository to get started, then customize it to your phenotype and compute environment.
The configuration is divided into three main sections that control dataset preparation, model behavior, and training parameters:
DataSetConfig defines the dataset used for training, including the phenotype to predict, input file format, populations, covariates, and sampling strategies for both samples and genetic variants (e.g., LD-based or GWAS-based selection).

Example:
```yaml
DataSetConfig:
  phenotype_id: "HC337"            # must be a column in the master file
  classification: true             # true if phenotype is binary; otherwise false
  file_format: "plink"             # only supports plink for now
  populations: ["white_british"]   # list of population groups to include
  include_x_chromosome: false      # whether to include variants from the X chromosome
  maf_threshold: 0.05              # minimum minor allele frequency (MAF) required for variants to be retained
  sex: null                        # optional filtering by sex ("m" for male, "w" for female, or null for all)
  covar_config:
    include_covars: false          # whether to include covariates at all in the model
    covar_keys: ["age", "sex", "Global_PC1", "Global_PC2", "Global_PC3", "Global_PC4", "Global_PC5",
                 "Global_PC6", "Global_PC7", "Global_PC8", "Global_PC9", "Global_PC10"]
  sample_sampling:
    strat: "stratify"              # strategy for sampling individuals; "random" or "stratify" (k:1 balanced classes)
    max_samples: 50                # maximum number of samples to include per patch (higher -> better)
    balance_pops: false            # whether to balance the number of samples per patch across populations
  variant_sampling:
    strat: "random"                # options: "random", "window", "LD", "GWAS"
    max_features: 50               # maximum number of variants (features) per patch; used as window size when strat = "window"
    ld_config:                     # used when strat = "LD"
      prune_kb: 250                # distance window in kilobases for LD pruning
      prune_step: 50               # step size (in variants) used during pruning
      prune_r2: 0.1                # LD threshold for pruning
      tau: 0.1
      ld_window_kb: 1_000          # max window size for LD blocks in kb
      ld_window: 50_000            # max window size for LD blocks in number of variants
      eps: 0.0                     # epsilon parameter for epsilon-greedy sampling
      eps_schedule: "constant"     # epsilon annealing schedule ("constant" or "step")
      eps_step_size: 0.0           # step size for epsilon updates (if applicable)
    gwas_config:                   # used when strat = "GWAS"
      path: ""                     # path to a GWAS summary statistics file
      snps_column: "variant_id"    # column name for SNP identifiers
      pvalue_column: "pvalue"      # column name for raw p-values
      nlogpvalue_column: "LOG10P"  # column name for negative log10 of p-values (used if provided)
      sep: 's+'                    # delimiter (whitespace by default)
      impute_val: 0.1              # imputation value for variants not in GWAS
    window_overlap_ratio: 0.0      # stride as percentage of window size for window sampling
```

GenomenModelConfig specifies the architecture and hyperparameters of the models used to process covariates and genotypes, as well as the ensemble and aggregation strategies for combining model outputs.

Example:
```yaml
GenomenModelConfig:
  covar_config:                      # configuration of the covariate model
    covar_strat: "residualization"   # "residualization" or "predictive"
    model_config:                    # config of the model used for covariate prediction
      model_name: lightgbm           # model name; see genomen/model/models.json for an overview of available models
      hyperparameters: {}            # hyperparameters of the model
      balance_classes: true          # whether to balance classes in the estimator loss
  geno_config:                       # configuration of the genotype model
    n_estimators: 2                  # number of estimators
    compute_interactions: true       # whether to compute interaction values at training time (approximately 2x training time)
    preprocessing_config:
      z_score_thresh: 3.0            # z-score threshold to filter outliers
      standard_labels: false         # whether to standardize labels
      feature_selection:
        method: "none"               # options: "none", "k_best", "percentile", "variance_threshold", "mutual_info", "rfe"
        k: 15_000                    # number of variants selected when method = "k_best"
        percentile: 0.75             # percentile of variants selected when method = "percentile"
        variance_threshold: 0.05     # variance threshold for variants when method = "variance_threshold"
        score_func: "f_classif"      # scoring function ("f_classif", "f_regression", or "chi2")
    model_config:
      model_name: lightgbm           # model name; see genomen/model/models.json for an overview of available models
      ensemble_estimator_names: []   # names of weak estimator models used in the ensemble (model_name = "ensemble")
      hyperparameters: {}            # hyperparameters of the model
      balance_classes: true          # whether to balance classes in the estimator loss
  aggregator_config:
    filter_strat: "geq-average"      # filtering strategy ("none", "positive", "geq-average", "top-p-percentile")
    agg_stat: "rank-mean"            # aggregation strategy ("mean", "loss-weighted-average", "stacking")
    model_config:
      model_name: lightgbm           # model name of the stacking model; see genomen/model/models.json
      hyperparameters: {}            # hyperparameters of the model
      balance_classes: true          # whether to balance classes in the estimator loss
    p: 0.75                          # p used for top-p-percentile filtering
    temp: 0.05                       # temperature used for softmax when agg_stat = "loss-weighted-average"
```

TrainConfig controls how models are trained and evaluated, including compute backend, batch size, number of jobs, evaluation metric, early stopping, and logging options.

Example:
```yaml
TrainConfig:
  batch_size: 2           # number of models trained in parallel
  n_jobs: 32              # number of jobs that can be run in parallel
  backend: "cpu"          # backend to use ("cpu" or "gpu")
  ram_mb: 16000           # available RAM in MB
  scorer: "rocauc"        # scoring function for early stopping ("r2", "rocauc", "pearson_corr")
  patience: 30            # patience in number of batches
  seed: 42                # seed for reproducibility
  log_with_wandb: false   # whether to log with wandb
  save_annotation: false  # whether to save annotation files (e.g., effect sizes or variant importance)
  save_model: false       # whether to save model artifacts
  compute_shap: false     # whether to compute SHAP values
```
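A reasonable starting point for `batch_size` and `n_jobs` is to tie them to the machine's core count. The heuristic below is a generic sketch using only the standard library; it is an assumption for illustration, not a GenomEn default:

```python
import os

# Hypothetical sizing heuristic (not part of GenomEn): keep the total
# parallelism (models per batch x jobs per model) near the core count.
n_cores = os.cpu_count() or 1
batch_size = 2                          # models trained in parallel, as in the example above
n_jobs = max(1, n_cores // batch_size)  # jobs available to each training batch
print(f"cores={n_cores} -> batch_size={batch_size}, n_jobs={n_jobs}")
```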
Together, these sections provide a flexible way to reproduce or customize GenomEn experiments—from input preprocessing to model training and interpretation.
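Before launching a run, it can be worth verifying that a hand-edited config still contains all three sections. The snippet below is a minimal standalone sanity check; the `missing_sections` helper is hypothetical, not part of GenomEn:

```python
# Hypothetical helper (not part of GenomEn): check that a GenomEn-style
# YAML config defines the three required top-level sections.
REQUIRED_SECTIONS = ["DataSetConfig", "GenomenModelConfig", "TrainConfig"]

def missing_sections(config_text: str) -> list[str]:
    # Top-level YAML keys start at column 0 and end with ':'.
    top_level = {line.split(":")[0] for line in config_text.splitlines()
                 if line and not line[0].isspace() and ":" in line}
    return [s for s in REQUIRED_SECTIONS if s not in top_level]

example = """\
DataSetConfig:
  phenotype_id: "HC337"
TrainConfig:
  seed: 42
"""
print(missing_sections(example))  # -> ['GenomenModelConfig']
```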
Setting up a training run
Before training, point GenomEn to your YAML configuration file:
```python
import genomen.utils as utils

utils.set_config_path("config.yml")
```
########## Welcome to Genomic Ensembling (GenomEn) - Polygenic risk and association beyond linearity ##########

Loading the dataset
Once everything is set up, load the dataset via the `DataSet` class. The `split` helper divides a `DataSet` into training, validation, and test sets, either randomly or using a predefined split column in the master table.
```python
from genomen.data import DataSet, split

dataset = DataSet()
train_set, test_set = split(dataset, test_size=0.2)
```
INFO:DataSet:Looking for cached dataset...
INFO:DataSet:Found cached dataset. Proceeding to loading data...
INFO:DataSet:Got 479 cases in the train set (0.20 %). Balancing with k=5 (4790 samples per batch).

Training the model
Initialize and fit the `GenomenModel`. The `fit` method takes a training set and a validation set (here `val_set`, obtainable, e.g., via a further split of the training data); the validation set drives early stopping.
```python
from genomen.model import GenomenModel

model = GenomenModel()
model.fit(train_set, val_set)
```
INFO:genomen.model.model:Fitting covar model...
INFO:genomen.model.model:Validation covar-only score: 0.7755
INFO:genomen.model.model:Fitting geno model...
INFO:DataSet:Got 390 cases in the train set (0.21 %). Balancing with k=5 (3900 samples per batch).
Batch=7: Avg weak rocauc=0.4992 - Strong rocauc=0.5519 - Trained=16: 100%|███████████████████████| 8/8 [04:04<00:00, 30.62s/it]
INFO:GenoEstimator:Early stopping at batch 8. Best batch: 6 (12 estimators).

Making predictions
Once the model is trained, you can use it to make predictions on new data.
```python
geno_preds, covar_preds, preds = model.predict(test_set)
```

The call returns genotype-only, covariate-only, and final ensemble predictions, respectively.
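For binary phenotypes (`classification: true`), the returned prediction arrays can be scored with ROC AUC, the same metric used for early stopping above. Below is a self-contained sketch via the Mann-Whitney formulation, run on made-up labels and scores; in practice you would score `preds` against the test set's phenotype labels:

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability
    that a randomly chosen case outranks a randomly chosen control."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 1, 0]                 # made-up binary phenotype
scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]     # made-up ensemble predictions
print(roc_auc(labels, scores))              # -> 0.8888888888888888 (= 8/9)
```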