EpistasisLogisticRegression for binary phenotypes¶

Some genotype-phenotype experiments produce binary outcomes rather than continuous measurements: a protein either folds or it does not, a variant is viable or lethal, a sequence binds or fails to bind. EpistasisLogisticRegression handles these datasets by fitting a logistic regression classifier over the epistasis design matrix, predicting whether each genotype falls above or below a user-defined phenotype threshold.

When to use this model¶

Use EpistasisLogisticRegression when:

Your phenotype is binary (viable/nonviable, functional/nonfunctional, binding/non-binding).
You want to predict class membership, not a continuous phenotype value, from the genotype.
You have a natural cut-off threshold that separates the two classes.

If your phenotype is continuous, use EpistasisLinearRegression or one of the regularized linear models instead.

How it works¶

The classifier uses an internal two-step procedure adapted from the base classifier class:

An order-1 EpistasisLinearRegression (model.additive) is fit to the continuous phenotype values to learn each mutation's additive contribution.
The design matrix columns are scaled by the fitted additive coefficients. This projects the design matrix onto a per-mutation contribution scale, giving the logistic classifier interpretable, comparable features.
sklearn's LogisticRegression is fit on the projected matrix, using binarized labels (y > threshold -> 1, else 0).

Predictions and probabilities are made on this same projected feature space.

Note

The design matrix is the same Walsh-Hadamard or local-encoded matrix used by all other epistasis models, just scaled by the additive coefficients before being handed to the logistic solver. You do not need to build or transform the matrix yourself.

Constructor parameters¶

threshold (float, required): Phenotype cut-off. Genotypes with a continuous phenotype value strictly greater than threshold are assigned class 1 (viable); all others are class 0.
model_type (str, default "global"): Encoding for the design matrix: "global" (Hadamard) or "local" (biochemical).
C (float, default 1.0): Inverse regularization strength, following sklearn's convention. Smaller values apply stronger regularization. Equivalent to 1 / alpha in the linear regularized models.
max_iter (int, default 1000): Maximum solver iterations. Increase if the solver raises a convergence warning.
solver (str, default "lbfgs"): sklearn LogisticRegression solver. "lbfgs" is a good default for most problems. Other options include "liblinear", "newton-cg", "newton-cholesky", "sag", and "saga".
random_state (int | None, default None): Random seed for solvers that use randomness.

Workflow¶

The workflow is identical to the linear models: attach a GPM, fit, then predict.

Construct the classifier

from epistasis.models.classifiers import EpistasisLogisticRegression

model = EpistasisLogisticRegression(threshold=0.5, model_type="global")

Attach a genotype-phenotype map
```
model.add_gpm(gpm)
```
Fit
```
model.fit()
```
This fits the internal additive model and then the logistic classifier in sequence.

Predict class labels or probabilities

# Predicted class labels (0 or 1) for each genotype.
labels = model.predict()

# Class probabilities: shape (n_genotypes, 2).
# Column 0 = P(class 0), column 1 = P(class 1 / viable).
proba = model.predict_proba()

# Accuracy on the training data.
accuracy = model.score()
print(f"Training accuracy: {accuracy:.4f}")

Key methods¶

Method	Returns	Description
`fit(X=None, y=None)`	`self`	Fit the additive model then the logistic classifier.
`predict(X=None)`	`np.ndarray[int]`	Predicted class labels (0 or 1) for each genotype.
`predict_proba(X=None)`	`np.ndarray[float]`	Class probabilities, shape `(n_genotypes, 2)`.
`predict_log_proba(X=None)`	`np.ndarray[float]`	Log class probabilities, shape `(n_genotypes, 2)`.
`score(X=None, y=None)`	`float`	Fraction of correctly classified genotypes (accuracy).
`hypothesis(X=None, thetas=None)`	`np.ndarray[float]`	Probability of class 1 for each row of `X`.

Complete example¶

import gpmap
from epistasis.models.classifiers import EpistasisLogisticRegression

# Genotypes with continuous phenotypes; some above threshold, some below.
gpm = gpmap.GenotypePhenotypeMap(
    wildtype="AA",
    genotypes=["AA", "AB", "BA", "BB"],
    phenotypes=[0.1, 0.4, 0.6, 0.9],
)

# Classify as viable (1) if phenotype > 0.5, nonviable (0) otherwise.
model = EpistasisLogisticRegression(threshold=0.5, model_type="global")
model.add_gpm(gpm)
model.fit()

# Predicted labels for all genotypes in the GPM.
labels = model.predict()
print("Class labels:", labels)  # e.g. [0, 0, 1, 1]

# Probability of being viable (class 1).
proba = model.predict_proba()
print("P(viable):", proba[:, 1])

# Training accuracy.
print(f"Accuracy: {model.score():.4f}")

# Predict for new genotypes not in the original GPM.
new_labels = model.predict(X=["AB", "BB"])
new_proba = model.predict_proba(X=["AB", "BB"])

Other classifiers¶

EpistasisLogisticRegression is one of five classifiers in epistasis.models.classifiers. The others share the same add_gpm -> fit -> predict workflow and the same additive-projected feature space:

Discriminant analysis: EpistasisLDA (linear boundary) and EpistasisQDA (quadratic boundary).
EpistasisGaussianProcess: calibrated probabilistic classification.
EpistasisGaussianMixture: unsupervised viable / nonviable split.