Comprehensive GA Feature Selection
Feature selection and hyperparameter tuning are usually done separately, but they interact: the best hyperparameters for 50 features are different from the best for 15. This tutorial shows a three-stage workflow that treats both together:
- Baseline — all 50 features, default hyperparameters
- GA feature selection — find the best 15-feature subset from 50 (30 real + 20 noise)
- Hyperparameter retune — tune a new model on the selected 15 features (faster evaluations, better budget)
Then a robustness check fits a completely different estimator on the selected features. If a second estimator also improves, the selection is genuinely informative and not an artifact of the scoring model.
Complementary to the simpler example
The Feature Selection (Noisy Data) example in the Examples section uses Iris + 12 noise features in a single stage. This tutorial uses breast cancer + 20 noise features and adds hyperparameter retuning and cross-estimator validation.
Prerequisites
pip install sklearn-genetic-optSetup
import warnings
from pprint import pprint
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn_genetic import (
EvolutionConfig, GAFeatureSelectionCV, GASearchCV,
OptimizationConfig, PopulationConfig, RuntimeConfig,
)
from sklearn_genetic.callbacks import ConsecutiveStopping, DeltaThreshold, TimerStopping
from sklearn_genetic.schedules import ExponentialAdapter, InverseAdapter
from sklearn_genetic.space import Categorical, Continuous, Integer
warnings.filterwarnings("ignore")
RANDOM_STATE = 42
rng = np.random.default_rng(RANDOM_STATE)Build the Dataset
Start with breast cancer's 30 real features, then add 20 independent Gaussian noise columns. A good selector should recover most real features and drop all noise ones.
data = load_breast_cancer(as_frame=True)
X_real = data.data # 30 real features
y = data.target
noise = pd.DataFrame(
rng.normal(size=(X_real.shape[0], 20)),
columns=[f"noise_{i:02d}" for i in range(20)],
)
X = pd.concat([X_real.reset_index(drop=True), noise], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, stratify=y, random_state=RANDOM_STATE
)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
real_feature_names = set(data.feature_names)
print(f"Real features: {X_real.shape[1]}")
print(f"Noise features: {noise.shape[1]}")
print(f"Total features: {X.shape[1]}")
print(f"Train: {X_train.shape}, Test: {X_test.shape}")Helpers
def evaluate(name, estimator, X_eval, y_eval):
predictions = estimator.predict(X_eval)
try:
probabilities = estimator.predict_proba(X_eval)[:, 1]
roc = round(roc_auc_score(y_eval, probabilities), 4)
except AttributeError:
roc = None
return {
"name": name,
"n_features": X_eval.shape[1],
"accuracy": round(accuracy_score(y_eval, predictions), 4),
"balanced_accuracy": round(balanced_accuracy_score(y_eval, predictions), 4),
"roc_auc": roc,
}
def make_svc():
return Pipeline([
("scaler", StandardScaler()),
("svc", SVC(kernel="rbf", C=2.0, gamma="scale",
probability=True, random_state=RANDOM_STATE)),
])Stage 1 — Baseline on All 50 Features
rf_baseline = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
rf_baseline.fit(X_train, y_train)
baseline_rf = evaluate("RF baseline (50 features)", rf_baseline, X_test, y_test)
svc_baseline = make_svc()
svc_baseline.fit(X_train, y_train)
baseline_svc = evaluate("SVC baseline (50 features)", svc_baseline, X_test, y_test)
print(baseline_rf)
# {'name': 'RF baseline (50 features)', 'n_features': 50, 'accuracy': 0.9474, ...}
print(baseline_svc)
# {'name': 'SVC baseline (50 features)', 'n_features': 50, 'accuracy': 0.9298, ...}The noise features dilute both models. The RF ignores them via feature importance; the SVC's kernel distance is degraded by irrelevant dimensions.
Stage 2 — GA Feature Selection
GAFeatureSelectionCV optimizes a binary mask over the 50 columns. Setting max_features=15 steers the search toward compact subsets.
selector = GAFeatureSelectionCV(
estimator=RandomForestClassifier(
n_estimators=100, random_state=RANDOM_STATE, n_jobs=1
),
cv=cv,
scoring="roc_auc",
max_features=15,
evolution_config=EvolutionConfig(
population_size=24,
generations=20,
crossover_probability=ExponentialAdapter(
initial_value=0.8, end_value=0.4, adaptive_rate=0.15
),
mutation_probability=InverseAdapter(
initial_value=0.30, end_value=0.08, adaptive_rate=0.25
),
tournament_size=3,
elitism=True,
keep_top_k=3,
),
population_config=PopulationConfig(initializer="smart"),
runtime_config=RuntimeConfig(
n_jobs=-1,
parallel_backend="auto",
use_cache=True,
verbose=True,
),
optimization_config=OptimizationConfig(
local_search=True,
local_search_top_k=2,
local_search_steps=1,
local_search_radius=0.15,
diversity_control=True,
diversity_threshold=0.30,
diversity_stagnation_generations=3,
diversity_mutation_boost=1.8,
random_immigrants_fraction=0.12,
fitness_sharing=True,
sharing_radius=0.40,
),
)
selection_callbacks = [
DeltaThreshold(threshold=0.001, generations=5, metric="fitness_best"),
ConsecutiveStopping(generations=8, metric="fitness_best"),
TimerStopping(total_seconds=120),
]
selector.fit(X_train, y_train, callbacks=selection_callbacks)
print(f"\nBest CV ROC AUC (selection): {selector.best_score_:.4f}")
print(f"Selected {selector.n_features_} of {X.shape[1]} features")Which Features Were Selected?
selected_mask = selector.support_
selected_names = X_train.columns[selected_mask].tolist()
summary = pd.DataFrame({
"feature": X_train.columns,
"selected": selected_mask,
"kind": ["real" if c in real_feature_names else "noise" for c in X_train.columns],
})
print("\nSelected features:")
print(summary[summary["selected"]].to_string(index=False))
n_real_selected = summary[summary["selected"] & (summary["kind"] == "real")].shape[0]
n_noise_selected = summary[summary["selected"] & (summary["kind"] == "noise")].shape[0]
print(f"\nReal features kept: {n_real_selected} / {X_real.shape[1]}")
print(f"Noise features kept: {n_noise_selected} / {noise.shape[1]}")Feature Selection Breakdown Chart
color_map = {"real": "steelblue", "noise": "salmon"}
colors = [color_map[k] for k in summary[summary["selected"]]["kind"]]
fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(range(len(selected_names)), [1] * len(selected_names), color=colors)
ax.set_xticks(range(len(selected_names)))
ax.set_xticklabels(selected_names, rotation=45, ha="right", fontsize=8)
ax.set_title("Selected Features — real (blue) vs noise (red)")
ax.set_ylabel("Selected")
ax.set_yticks([])
from matplotlib.patches import Patch
ax.legend(handles=[Patch(color="steelblue", label="real"), Patch(color="salmon", label="noise")])
plt.tight_layout()
plt.show()Feature Selection Telemetry
print(selector.fit_stats_)
history = pd.DataFrame(selector.history)
ax = history.plot(
x="gen",
y=["fitness_best", "fitness_max", "fitness"],
marker="o",
figsize=(9, 4),
)
ax.set_title("Feature Selection — Fitness over Generations")
ax.set_xlabel("Generation")
ax.set_ylabel("ROC AUC (CV)")
plt.tight_layout()
plt.show()Stage 3 — Hyperparameter Retune on Selected Features
With 15 features instead of 50, each cross-validation call is faster. The same search budget covers more candidates.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(f"Train shape after selection: {X_train_sel.shape}")
rf_param_grid = {
"n_estimators": Integer(40, 200),
"max_depth": Integer(2, 12),
"min_samples_split": Integer(2, 12),
"min_samples_leaf": Integer(1, 8),
"max_features": Categorical(["sqrt", "log2", None]),
}
ga_rf = GASearchCV(
estimator=RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=1),
param_grid=rf_param_grid,
scoring="roc_auc",
cv=cv,
evolution_config=EvolutionConfig(
population_size=20,
generations=15,
crossover_probability=ExponentialAdapter(
initial_value=0.8, end_value=0.4, adaptive_rate=0.15
),
mutation_probability=InverseAdapter(
initial_value=0.25, end_value=0.05, adaptive_rate=0.20
),
tournament_size=3,
elitism=True,
keep_top_k=2,
),
population_config=PopulationConfig(
initializer="smart",
warm_start_configs=[{
"n_estimators": 100,
"max_depth": 6,
"min_samples_split": 4,
"min_samples_leaf": 2,
"max_features": "sqrt",
}],
),
runtime_config=RuntimeConfig(n_jobs=-1, parallel_backend="auto", use_cache=True, verbose=True),
optimization_config=OptimizationConfig(
local_search=True, diversity_control=True, fitness_sharing=True,
),
)
ga_rf.fit(X_train_sel, y_train, callbacks=[
ConsecutiveStopping(generations=8, metric="fitness_best"),
TimerStopping(total_seconds=90),
])
print(f"\nBest CV ROC AUC (retune): {ga_rf.best_score_:.4f}")
pprint(ga_rf.best_params_)Robustness Validation — Second Estimator
If the selected features are genuinely informative (not overfitted to the RF scorer), an independent SVC should also benefit.
svc_on_selected = make_svc()
svc_on_selected.fit(X_train_sel, y_train)
svc_sel_metrics = evaluate(
"SVC on selected features", svc_on_selected, X_test_sel, y_test
)
rf_sel_metrics = evaluate(
"RF retuned (selected features)", ga_rf, X_test_sel, y_test
)
print("\nSVC baseline vs SVC on selected features:")
print(pd.DataFrame([baseline_svc, svc_sel_metrics]).to_string(index=False))If the SVC improves on the selected subset, it confirms the features are model-agnostic signal, not RF-specific artefacts.
Full Comparison Table
comparison = pd.DataFrame([
baseline_rf,
baseline_svc,
evaluate("RF retuned (selected features)", ga_rf, X_test_sel, y_test),
svc_sel_metrics,
])
print(comparison.to_string(index=False))Expected output (approximate):
name n_features accuracy balanced_accuracy roc_auc
RF baseline (50 features) 50 0.9474 0.9432 0.9891
SVC baseline (50 features) 50 0.9298 0.9252 0.9857
RF retuned (selected features) 15 0.9649 0.9613 0.9948
SVC on selected features 15 0.9532 0.9497 0.9921Both estimators improve after selection, confirming the selected subset carries the majority of predictive signal.
Feature Importance Before and After Selection
# Before: RF fitted on all 50 features
imp_all = pd.Series(
rf_baseline.feature_importances_, index=X_train.columns
).sort_values(ascending=False).head(20)
# After: RF fitted on selected features only
imp_sel = pd.Series(
ga_rf.best_estimator_.feature_importances_, index=selected_names
).sort_values(ascending=False)
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
colors_all = ["steelblue" if n in real_feature_names else "salmon" for n in imp_all.index]
imp_all.plot(kind="barh", ax=axes[0], color=colors_all[::-1])
axes[0].set_title("Top-20 Importances — All 50 Features")
axes[0].invert_yaxis()
colors_sel = ["steelblue" if n in real_feature_names else "salmon" for n in imp_sel.index]
imp_sel.plot(kind="barh", ax=axes[1], color=colors_sel[::-1])
axes[1].set_title("Importances — 15 Selected Features")
axes[1].invert_yaxis()
plt.tight_layout()
plt.show()Practical Notes
max_featuresinGAFeatureSelectionCVis a soft upper bound — the GA prefers subsets below it, but can exceed it if fitness requires. Checkselector.n_features_for the actual count.- Stage ordering matters — run feature selection before hyperparameter tuning when evaluations are expensive. Feature selection narrows the input space, making every subsequent CV call cheaper.
- Cross-estimator validation is the most reliable check that selected features are signal and not model-specific noise. If only the scoring estimator improves, suspect scorer-feature circularity.
use_cache=Truein feature selection is particularly impactful — many binary masks differ by only one or two features, and cached evaluations avoid redundant CV calls.- Diversity metrics in history — if
unique_individual_ratiodrops below 0.5 before generation 10, increaserandom_immigrants_fractionor widensharing_radius.
See Also
- Feature Selection (Noisy Data) — simpler single-stage example
- GAFeatureSelectionCV API
- Advanced Optimizer Control
- Adaptive Schedules
