Outlier Detection
GASearchCV supports scikit-learn outlier-detection estimators. These estimators fit on X only and never see y during training. The built-in scoring="roc_auc" string is a poor fit for them: it scores with decision_function, which is higher for inliers, so it reports an inverted AUC when y=1 marks outliers. Instead, you define a custom scorer that calls the estimator's anomaly scoring method and compares the result against ground-truth labels.
Prerequisites
- Completed Basic Usage
- Ground-truth labels for your anomalies (at minimum for evaluation; the estimator itself is unsupervised)
The Custom Scorer Pattern
The general pattern for any outlier estimator:
from sklearn.metrics import roc_auc_score

def outlier_roc_auc(estimator, X, y):
    # Replace .score_samples with the method appropriate for your estimator
    # (see table below)
    scores = -estimator.score_samples(X)
    return roc_auc_score(y, scores)

# A callable with the signature (estimator, X, y) is already a valid scorer,
# so it is passed to GASearchCV as-is.
scoring = outlier_roc_auc

Two details matter:
- Negate the score: score_samples and decision_function return lower values for anomalies, but roc_auc_score expects higher values for the positive class (y=1 = outlier). Negating aligns them. Omitting the negation causes the GA to silently optimise in the wrong direction.
- Pass the function directly, not through make_scorer: make_scorer wraps a metric with the signature (y_true, y_pred) and calls it with the estimator's predictions. A function that takes (estimator, X, y) would receive the wrong arguments and fail. The needs_proba flag is also unnecessary here (False was already its default) and has been deprecated since scikit-learn 1.4.
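To see why the negation matters, the sketch below fits an IsolationForest on a small synthetic dataset and computes ROC AUC with and without the sign flip. The data and variable names are illustrative assumptions, not part of the library. The two values sum to 1, so an un-negated scorer rewards exactly the models that detect worst.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

# Illustrative data: a dense inlier cloud plus a few scattered outliers
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
X_outliers = rng.uniform(low=-8.0, high=8.0, size=(10, 2))
X = np.vstack([X_normal, X_outliers])
y = np.array([0] * 190 + [1] * 10)  # 1 = outlier

model = IsolationForest(random_state=0).fit(X)
raw = model.score_samples(X)  # lower = more anomalous

auc_raw = roc_auc_score(y, raw)       # wrong orientation for y=1 = outlier
auc_negated = roc_auc_score(y, -raw)  # correct orientation

print(f"without negation: {auc_raw:.3f}")
print(f"with negation:    {auc_negated:.3f}")
```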
Which scoring method to use
| Estimator | Method | Notes |
|---|---|---|
| IsolationForest | score_samples | Always available; independent of contamination |
| LocalOutlierFactor(novelty=True) | score_samples | Requires novelty=True at construction |
| LocalOutlierFactor(novelty=False) | negative_outlier_factor_ | Attribute set after fit; only usable on training data |
| OneClassSVM | score_samples | Available since sklearn 0.24 |
| EllipticEnvelope | score_samples | Also has decision_function |
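As a quick sanity check of the table, the following sketch fits each estimator on the same illustrative data and scores one typical point and one obvious outlier. It assumes a recent scikit-learn release in which all four estimators expose score_samples. The probe points and the results dictionary are made up for the example.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
# First row: a typical point. Second row: an obvious outlier.
X_probe = np.array([[0.0, 0.0], [10.0, 10.0]])

estimators = {
    "IsolationForest": IsolationForest(random_state=0),
    "LocalOutlierFactor": LocalOutlierFactor(novelty=True),
    "OneClassSVM": OneClassSVM(),
    "EllipticEnvelope": EllipticEnvelope(random_state=0),
}

results = {}
for name, est in estimators.items():
    est.fit(X_train)
    typical, outlier = est.score_samples(X_probe)
    results[name] = (typical, outlier)
    print(f"{name:20s} typical={typical:.3f}  outlier={outlier:.3f}")
```

In every case the outlier should receive the lower score, which is why the scorer negates before calling roc_auc_score.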
LocalOutlierFactor default mode cannot score new data
Standard LOF (novelty=False) scores only the points it was fitted on and exposes those scores through negative_outlier_factor_. In this mode score_samples is not available, so calling it on test data raises an AttributeError. Use LocalOutlierFactor(novelty=True) when you need to score unseen points, which is always the case inside cross-validation.
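The difference between the two LOF modes can be checked directly. This is a minimal sketch on illustrative random data. It uses hasattr to probe for score_samples rather than triggering the exception.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
X_new = rng.normal(size=(5, 2))

lof_default = LocalOutlierFactor().fit(X_train)              # novelty=False
lof_novelty = LocalOutlierFactor(novelty=True).fit(X_train)

# Default mode: scores exist only for the training points.
train_scores = lof_default.negative_outlier_factor_
default_can_score_new = hasattr(lof_default, "score_samples")

# Novelty mode: unseen points can be scored.
novelty_can_score_new = hasattr(lof_novelty, "score_samples")
new_scores = lof_novelty.score_samples(X_new)

print("default mode exposes score_samples:", default_can_score_new)
print("novelty mode exposes score_samples:", novelty_can_score_new)
```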
IsolationForest Example
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn_genetic import EvolutionConfig, GASearchCV, PopulationConfig, RuntimeConfig
from sklearn_genetic.space import Continuous, Integer

# Synthetic dataset: two normal clusters + 5% uniform outliers
X_normal, _ = make_blobs(n_samples=475, centers=2, cluster_std=0.8, random_state=42)
rng = np.random.default_rng(42)
X_outliers = rng.uniform(low=-6, high=6, size=(25, 2))
X = np.vstack([X_normal, X_outliers])
y = np.array([0] * 475 + [1] * 25)  # 0 = normal, 1 = outlier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Stratify so every fold contains some outliers
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

def outlier_roc_auc(estimator, X, y):
    # Negate: score_samples is lower for anomalies, y=1 marks outliers
    scores = -estimator.score_samples(X)
    return roc_auc_score(y, scores)

search = GASearchCV(
    estimator=IsolationForest(random_state=42),
    param_grid={
        "n_estimators": Integer(50, 300),
        "max_samples": Continuous(0.05, 0.80),
        # contamination does not change score_samples (see Tips & Gotchas)
        "contamination": Continuous(0.01, 0.20),
        "max_features": Continuous(0.5, 1.0),
    },
    cv=cv,
    scoring=outlier_roc_auc,  # scorer callable passed directly
    evolution_config=EvolutionConfig(population_size=15, generations=12),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1, verbose=True),
)

# y_train is used only by the scorer; IsolationForest itself ignores it
search.fit(X_train, y_train)

print("Best CV ROC AUC:", round(search.best_score_, 4))
print("Best parameters:", search.best_params_)

LocalOutlierFactor Example
LOF requires novelty=True for use inside cross-validation. The scorer is identical.
from sklearn.neighbors import LocalOutlierFactor

# Reuses cv, outlier_roc_auc, X_train and y_train from the IsolationForest example
lof_search = GASearchCV(
    estimator=LocalOutlierFactor(novelty=True),
    param_grid={
        "n_neighbors": Integer(5, 50),
        "contamination": Continuous(0.01, 0.20),
        "leaf_size": Integer(10, 60),
    },
    cv=cv,
    scoring=outlier_roc_auc,  # scorer callable passed directly
    evolution_config=EvolutionConfig(population_size=15, generations=10),
    population_config=PopulationConfig(initializer="smart"),
    runtime_config=RuntimeConfig(n_jobs=-1, verbose=True),
)

lof_search.fit(X_train, y_train)

print("Best CV ROC AUC:", round(lof_search.best_score_, 4))
print("Best parameters:", lof_search.best_params_)

Tips & Gotchas
- Always negate the score: score_samples is lower for anomalies. Pass -score_samples to roc_auc_score when y=1 is the outlier class. Getting this wrong produces a valid-looking search that quietly minimises detection quality.
- Pass the scorer function directly: a callable with the signature (estimator, X, y) is already a scorer. make_scorer expects a metric with the signature (y_true, y_pred), so wrapping the function in it fails. The needs_proba flag is not needed either: False was already its default, and it has been deprecated since scikit-learn 1.4.
- Use StratifiedKFold: with 5% outliers, plain KFold can produce folds with very few anomalies, making the AUC estimate noisy and the fitness signal unreliable.
- contamination affects predict, not score_samples: tuning it moves the hard-decision threshold but does not change the ranking score, so a ROC AUC scorer gives the GA no signal for choosing it. If your downstream use only needs ranking (e.g., a top-K alert list), fix contamination based on domain knowledge and remove it from the search space.
- score_samples is independent of contamination for IsolationForest: the metric used to select a configuration is not circularly influenced by the contamination value.
- LOF with novelty=False cannot score new data: score_samples is unavailable in that mode, so scoring test data raises an error. Always set novelty=True when using LOF inside GASearchCV.
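Two of the points above, contamination and stratification, are easy to verify in isolation. The sketch below uses illustrative synthetic data with the same 475/25 class split as the examples. The variable names are assumptions made for the demonstration. It first compares two IsolationForest models that differ only in contamination, then counts the outliers that land in each test fold under KFold and StratifiedKFold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(size=(475, 2)),                      # inliers
    rng.uniform(low=-6.0, high=6.0, size=(25, 2)),  # outliers
])
y = np.array([0] * 475 + [1] * 25)

# 1) contamination moves the predict threshold but not the ranking score
low = IsolationForest(contamination=0.01, random_state=0).fit(X)
high = IsolationForest(contamination=0.20, random_state=0).fit(X)
same_scores = np.allclose(low.score_samples(X), high.score_samples(X))
flagged_low = int((low.predict(X) == -1).sum())
flagged_high = int((high.predict(X) == -1).sum())
print("identical score_samples:", same_scores)
print("points flagged by predict:", flagged_low, "vs", flagged_high)

# 2) outliers per test fold, plain vs stratified splitting
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
kfold_counts = [int(y[test].sum()) for _, test in kfold.split(X, y)]
strat_counts = [int(y[test].sum()) for _, test in strat.split(X, y)]
print("KFold outliers per fold:          ", kfold_counts)
print("StratifiedKFold outliers per fold:", strat_counts)
```

The stratified counts differ by at most one between folds, while the plain KFold counts depend on the shuffle.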
Next Steps
- Isolation Forest Tutorial — full end-to-end walkthrough with contour plots, ROC curve, and a 3-way comparison against baseline and random search.
- Callbacks — add early stopping to avoid over-tuning on a noisy anomaly scorer.
- Troubleshooting — diagnose flat fitness when the search space is too narrow.
