gpmap.core¶
The core container module exposes GenotypePhenotypeMap and the schema validator validate_encoding_table.
GenotypePhenotypeMap¶
class GenotypePhenotypeMap:
def __init__(
self,
wildtype: str,
genotypes: list[str] | np.ndarray,
phenotypes: list[float] | np.ndarray,
*,
stdeviations: list[float] | np.ndarray | None = None,
n_replicates: list[int] | np.ndarray | None = None,
mutations: Mapping[int, list[str] | None] | None = None,
site_labels: list[str] | None = None,
include_binary: bool = True,
) -> None: ...
Parameters¶
wildtype(str, required)- Reference sequence. Must be a non-empty string. All
genotypesmust have this length. genotypes(list[str] | np.ndarray, required)- Observed sequences, each of length
len(wildtype). Will be coerced to a numpy object array of strings. phenotypes(list[float] | np.ndarray, required)- Measured phenotypes, length matching
genotypes. Coerced tofloat64. If the coercion changes any values (e.g., a NumPyInt64overflow), aUserWarningis emitted once. stdeviations(list[float] | np.ndarray | None, defaultNone)- Per-genotype standard deviations. Defaults to
np.nanfor every genotype when omitted. n_replicates(list[int] | np.ndarray | None, defaultNone)- Per-genotype replicate counts. Defaults to 1 for every genotype.
mutations(dict[int, list[str] | None] | None, defaultNone)- Per-site alphabets. Keys must be
0..L-1. ANonevalue marks a frozen site (only the wildtype letter is allowed). If omitted, the alphabet is inferred from observed letters. site_labels(list[str] | None, defaultNone)- Optional per-site labels. Defaults to
["0", "1", ..., "L-1"]. include_binary(bool, defaultTrue)- If
True, eagerly buildbinary_packedat construction. Set toFalseto defer the cost until the first access.
Attributes¶
| Attribute | Type | Notes |
|---|---|---|
wildtype |
str |
Reference sequence |
genotypes |
np.ndarray[object] |
Length n, dtype object of str |
phenotypes |
np.ndarray[float64] |
Length n |
stdeviations |
np.ndarray[float64] |
Length n, NaN by default |
n_replicates |
np.ndarray[int64] |
Length n, 1 by default |
mutations |
dict[int, list[str] \| None] |
Per-site alphabets |
site_labels |
list[str] |
Length L |
length |
int |
L, the wildtype length |
n |
int |
n, the number of genotypes |
encoding_table |
pd.DataFrame |
Lazy, schema-locked, cached |
binary_packed |
np.ndarray[uint8] |
Shape (n, n_bits), lazy, cached |
binary |
np.ndarray[object] |
Length n, dtype object of '0'/'1' strings, lazy, cached |
n_mutations |
np.ndarray[int64] |
Per-genotype Hamming weight |
data |
pd.DataFrame |
Lazy, cached, see the data column |
stdeviation_map |
StandardDeviationMap |
View returning raw stdeviations |
standard_error_map |
StandardErrorMap |
View returning std / sqrt(n_replicates) |
data columns¶
The data property returns a DataFrame with these columns, in order:
| Column | dtype |
|---|---|
genotypes |
string |
phenotypes |
float64 |
stdeviations |
float64 |
n_replicates |
Int64 |
binary |
string |
n_mutations |
Int64 |
from_dataframe¶
@classmethod
def from_dataframe(
cls,
df: pd.DataFrame,
*,
wildtype: str | None = None,
mutations: Mapping[int, list[str] | None] | None = None,
site_labels: list[str] | None = None,
include_binary: bool = True,
) -> GenotypePhenotypeMap
Hydrate a GenotypePhenotypeMap from a pandas DataFrame.
Required columns: genotypes, phenotypes. Optional: stdeviations, n_replicates. Other columns are ignored.
wildtype may be omitted if the DataFrame attrs carry a "wildtype" key. Otherwise it must be provided.
get_missing_genotypes¶
Returns a numpy object array of genotype strings that are in the full Cartesian product of mutations but not in self.genotypes. Respects the SpaceTooLargeError guard; for huge spaces, use enumerate_genotypes_str(..., allow_huge=True) directly.
validate_encoding_table¶
Raises SchemaError if the table is missing required columns, has NaN in site_index or binary_index_stop, or if binary_index_stop disagrees with the implied total bit count. Returns None on success.
Use this when you build an encoding table externally (or read one from a third-party source) and want to confirm it matches the schema before feeding it into genotypes_to_binary_packed.