Loading and saving
gpmap-v2 supports four file formats out of the box: JSON, CSV (with sidecar metadata), Excel, and pickle. Each round-trips losslessly: a map loaded from disk carries the same data and metadata as the one you saved.
Picking a format
- JSON: Self-contained, human-readable, schema-versioned. The default choice for sharing maps.
- CSV: Pandas-friendly. Stores the data columns in the CSV and the wildtype, alphabets, and site labels in a sidecar named after the CSV file (map.csv.meta.json for map.csv). Keep both files together.
- Excel: Two-sheet workbook, with data for the table and meta for the schema. Good for handing a map to a non-programmer collaborator.
What each format stores
| Format | Genotypes | Phenotypes | Stdevs | Replicates | Wildtype | Alphabets | Site labels |
|---|---|---|---|---|---|---|---|
| JSON | yes | yes | yes | yes | yes | yes | yes |
| CSV + sidecar | yes (CSV) | yes (CSV) | yes (CSV) | yes (CSV) | yes (sidecar) | yes (sidecar) | yes (sidecar) |
| Excel | yes (data) | yes (data) | yes (data) | yes (data) | yes (meta) | yes (meta) | yes (meta) |
| Pickle | yes | yes | yes | yes | yes | yes | yes |

Pickle stores the full container, including caches.
JSON schema version
JSON files written by to_json carry a top-level "schema_version": "1". Legacy v1 files without this field still load, but issue a UserWarning:

```pycon
>>> read_json("v1_legacy_map.json")
UserWarning: JSON file has no schema_version; treating as v1 legacy format
```
If you want to migrate a v1 file to the v2 format, just re-save it after loading:
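```python
from gpmap import read_json, to_json

# Loading a legacy file emits the UserWarning shown above.
gpm = read_json("v1_legacy_map.json")
# Re-saving writes the current format, including schema_version.
to_json(gpm, "v2_map.json")
```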
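The legacy check described in this section can be pictured with a small standard-library sketch. The helper name `is_legacy` is hypothetical and this is not gpmap-v2's actual implementation; it only illustrates how a loader might detect a missing `schema_version` and warn without failing.

```python
import warnings


def is_legacy(payload):
    """Return True (and warn) when a parsed JSON map lacks schema_version.

    Hypothetical helper for illustration; gpmap-v2's real check may differ.
    """
    if "schema_version" in payload:
        return False
    warnings.warn(
        "JSON file has no schema_version; treating as v1 legacy format",
        UserWarning,
        stacklevel=2,
    )
    return True


# A versioned payload passes silently; an unversioned one warns but still loads.
print(is_legacy({"schema_version": "1", "wildtype": "AA"}))
print(is_legacy({"wildtype": "AA"}))
```

Because the signal is a warning rather than an exception, a batch migration can run with `warnings.simplefilter("error")` to find every legacy file, or leave the default filter in place to just log them.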
CSV sidecar layout
to_csv(gpm, "map.csv") writes two files:
```text
map.csv            # genotypes, phenotypes, stdeviations, n_replicates
map.csv.meta.json  # {schema_version, wildtype, mutations, site_labels}
```
read_csv looks for the sidecar at map.csv.meta.json (current convention) or map.meta.json (older convention) and raises FileNotFoundError if neither exists.
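The lookup can be sketched with pathlib. This is an illustration, not the library's code: `find_sidecar` is a hypothetical helper, and trying the current convention before the older one is an assumption, since the text above does not state the order.

```python
import tempfile
from pathlib import Path


def find_sidecar(csv_path):
    """Return the first existing sidecar for csv_path (hypothetical helper)."""
    csv_path = Path(csv_path)
    candidates = [
        # Current convention: map.csv -> map.csv.meta.json
        csv_path.with_name(csv_path.name + ".meta.json"),
        # Older convention: map.csv -> map.meta.json
        csv_path.with_name(csv_path.stem + ".meta.json"),
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    tried = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"no sidecar found for {csv_path}; tried {tried}")


# Demo: only an old-style sidecar exists, so the fallback path is returned.
with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "map.csv"
    (Path(tmp) / "map.meta.json").write_text("{}", encoding="utf-8")
    print(find_sidecar(csv_file).name)  # map.meta.json
```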
Keep CSV and sidecar together
The CSV file alone is not enough to reconstruct the map: wildtype and per-site alphabets live in the sidecar. If you only have the CSV, you can still load it via pd.read_csv and rebuild manually with GenotypePhenotypeMap.from_dataframe(df, wildtype="...", mutations=...).
Excel layout
to_excel writes a workbook with two sheets:
| Sheet | Columns |
|---|---|
| data | genotypes, phenotypes, stdeviations, n_replicates |
| meta | key, value (rows for schema_version, wildtype, mutations, site_labels) |
read_excel expects this layout. In the meta sheet, mutations and site_labels are stored as JSON-encoded strings so Excel does not mangle them.
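To make the key/value encoding concrete, here is a minimal sketch of how nested metadata could be flattened into meta-sheet rows and recovered again. The helper names and the sample values are illustrative assumptions; the actual sheet is written by to_excel and may differ in detail.

```python
import json

# Keys whose values are nested structures and therefore stored as JSON text.
JSON_KEYS = ("mutations", "site_labels")


def to_meta_rows(meta):
    """Flatten metadata into (key, value) rows with plain string cells."""
    rows = []
    for key, value in meta.items():
        if key in JSON_KEYS:
            value = json.dumps(value)  # e.g. a list becomes '["1", "2"]'
        rows.append((key, value))
    return rows


def from_meta_rows(rows):
    """Rebuild the metadata dict, decoding the JSON-encoded cells."""
    meta = {}
    for key, value in rows:
        meta[key] = json.loads(value) if key in JSON_KEYS else value
    return meta


# Illustrative metadata only; not taken from a real map.
example = {
    "schema_version": "1",
    "wildtype": "AA",
    "mutations": {"0": ["A", "T"], "1": ["A", "G"]},
    "site_labels": ["1", "2"],
}
rows = to_meta_rows(example)
print(rows[2])
print(from_meta_rows(rows) == example)
```

Storing each nested value as one opaque string keeps Excel from reinterpreting list or dictionary contents as numbers, dates, or separate cells.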
Pickle compatibility
Pickle preserves the full Python object, including cached binary arrays. This is the fastest format and also the most fragile:
- A pickle written with one version of gpmap-v2 is only guaranteed to load on the same major version.
- Class paths inside the gpmap package are stable within a major version. An internal refactor that moves a class between modules would break pickles across minor versions, so we treat any such move as a breaking change.
For long-term storage, prefer JSON or CSV.
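The reason class paths matter is that pickle records a class by its module and name rather than by copying its definition. The short check below uses a standard-library class as a stand-in for a gpmap class to show that both strings are embedded in the pickled bytes.

```python
import pickle
from collections import OrderedDict

# Pickle stores a reference to the class (module + qualified name),
# not the class body. Loading re-imports that exact path.
payload = pickle.dumps(OrderedDict(a=1))

has_module = b"collections" in payload
has_class_name = b"OrderedDict" in payload
print(has_module, has_class_name)
```

If the class later lives in a different module, unpickling looks up the old path and fails, which is why moving a class is a compatibility break for stored pickles.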
DataFrame round-trip
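If you already have a pd.DataFrame, hydrate it directly with GenotypePhenotypeMap.from_dataframe(df, wildtype="..."), the same constructor used to rebuild a map from a bare CSV.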
This is also how to round-trip through any format pandas supports natively (parquet, feather, HDF, ...). You provide the wildtype on the way in; everything else (alphabets, site labels) is inferred or defaulted.
When you save a DataFrame externally, store the wildtype and per-site alphabets in your own sidecar; gpmap-v2 only owns the JSON/CSV/Excel sidecar conventions.
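A do-it-yourself sidecar can be as small as the sketch below. Everything here is an assumption for illustration: the helper names, the `.meta.json` suffix, and the idea that alphabets are keyed by integer site index. The one firm point is that JSON object keys are always strings, so integer keys must be restored on load.

```python
import json
import tempfile
from pathlib import Path


def write_own_sidecar(data_path, wildtype, mutations):
    """Write wildtype and per-site alphabets next to an externally saved table."""
    sidecar = Path(str(data_path) + ".meta.json")
    body = {"wildtype": wildtype, "mutations": mutations}
    sidecar.write_text(json.dumps(body, indent=2), encoding="utf-8")
    return sidecar


def read_own_sidecar(sidecar):
    """Read the sidecar back, turning site keys into integers again."""
    meta = json.loads(Path(sidecar).read_text(encoding="utf-8"))
    # json.dumps converted the integer keys to strings; undo that here.
    meta["mutations"] = {int(k): v for k, v in meta["mutations"].items()}
    return meta


# Illustrative values; "map.parquet" stands for a file you saved via pandas.
with tempfile.TemporaryDirectory() as tmp:
    path = write_own_sidecar(
        Path(tmp) / "map.parquet", "AA", {0: ["A", "T"], 1: ["A", "G"]}
    )
    restored = read_own_sidecar(path)
    print(restored["wildtype"], sorted(restored["mutations"]))
```

The restored wildtype (and, if needed, the alphabets) can then be passed to GenotypePhenotypeMap.from_dataframe along with the DataFrame you loaded.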
Performance notes
| Format | Round-trip time (n=65k genotypes, L=16) | Notes |
|---|---|---|
| Pickle | ~50 ms | full object dump |
| JSON | ~150 ms | one large indented dictionary |
| CSV | ~80 ms | data only, sidecar is trivial |
| Excel | ~1 s | openpyxl overhead dominates |
For programmatic pipelines, CSV is the sweet spot: pandas-friendly, version-stable, and fast. JSON is the best human-readable archival format.
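Timings like these depend on the machine, so it is worth measuring on your own hardware. The harness below is a rough sketch, not the benchmark behind the table: it assumes a binary alphabet (so L=16 gives 2^16 = 65,536 genotypes), uses a plain dict as a stand-in for a map, and times in-memory serialization only, without disk I/O.

```python
import itertools
import json
import pickle
import time

L = 16
# All binary genotypes of length L: 2**16 = 65,536 strings.
genotypes = ["".join(bits) for bits in itertools.product("01", repeat=L)]
# Toy phenotype (count of mutated sites); stands in for real measurements.
table = {
    "genotypes": genotypes,
    "phenotypes": [float(g.count("1")) for g in genotypes],
}


def time_round_trip(dumps, loads, obj):
    """Return (seconds, restored_object) for one serialize/deserialize cycle."""
    start = time.perf_counter()
    restored = loads(dumps(obj))
    return time.perf_counter() - start, restored


json_seconds, json_restored = time_round_trip(
    lambda o: json.dumps(o, indent=2), json.loads, table
)
pickle_seconds, pickle_restored = time_round_trip(pickle.dumps, pickle.loads, table)

print(f"n={len(genotypes)}")
print(f"json:   {json_seconds * 1000:.1f} ms")
print(f"pickle: {pickle_seconds * 1000:.1f} ms")
```

To compare the library's own formats, swap the lambdas for calls that write and read a real map; a single run is noisy, so repeat the measurement and take the median.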