Python and Biology
General Bioinformatics Frameworks
These libraries provide broad biological functionality (sequence analysis, file parsing, etc.):
- Biopython - The de facto standard library for bioinformatics in Python. Tools for parsing FASTA/GenBank, working with sequences, BLAST, alignments, phylogenetics, etc.
- Scikit-bio - Bioinformatics and computational biology toolkit built on NumPy. Focus on microbiome analysis, sequence data, diversity metrics, and statistics.
- BioPandas - Extends pandas for structural biology data (PDB, mmCIF). Integrates molecular data frames with protein 3D structure parsing.
Genomics / Transcriptomics
Libraries for working with genome sequences, reads, and expression data:
- pysam - Python interface to SAMtools and BCFtools. For reading/writing SAM/BAM/VCF files efficiently.
- pybedtools - Python wrapper around BEDTools. For genome interval operations (overlaps, unions, intersections).
- HTSeq - For analyzing high-throughput sequencing (RNA-Seq, etc.). Counts reads, annotates features, handles GFF/GTF formats.
- anndata - Data structure for annotated data matrices (used in single-cell). Foundation for many single-cell analysis tools (e.g. scanpy).
- scanpy - Analysis toolkit for single-cell RNA-seq data. Supports clustering, visualization, trajectory inference, etc.
- pyranges - Fast genomic interval operations in pure Python. Like pybedtools, but implemented with pandas for speed. Better to use pyranges_1.
- cyvcf2 - Fast VCF parser and query tool.
Proteomics and Structural Biology
Working with proteins, molecular dynamics, and structural data:
- MDAnalysis - Analyze molecular dynamics trajectories. Supports many formats (GROMACS, AMBER, etc.).
- ProDy - For protein structure and dynamics analysis.
- PyMOL API - Python interface to PyMOL for visualizing 3D molecular structures.
- biotite - Structural bioinformatics and sequence analysis; fast and modern.
Phylogenetics and Evolution
Tools for working with trees, evolutionary analysis, and alignments:
- ETE Toolkit - Phylogenetic tree analysis, visualization, and manipulation.
- DendroPy - For phylogenetic computing with trees and taxon data.
- toytree - Lightweight and fast tree visualization and manipulation.
Systems Biology & Pathways
Libraries for metabolic models, pathways, and systems simulation:
- COBRApy - Constraint-based modeling for metabolic networks.
- Tellurium - Modeling and simulation of biochemical networks.
- libSBML - Work with SBML (Systems Biology Markup Language).
- PySCeS - Python Simulator for Cellular Systems.
Microbiome and Metagenomics
- QIIME 2 API - Framework for microbiome analysis and reproducibility.
- biom-format - Read/write BIOM tables (Biological Observation Matrix format).
Population Genetics
- msprime - Efficient simulation of population genetic data.
- tskit - Tree sequence toolkit for storing and analyzing genealogical data.
Cheminformatics
(Overlaps with computational chemistry, but widely used in drug design and molecular biology)
- RDKit - Standard for cheminformatics: molecular fingerprints, descriptors, SMILES parsing.
- Open Babel (Pybel) - Chemical toolbox for converting between molecular file formats.
More Python libraries used in biology
- pyrosetta.distributed - The Python interface to Rosetta in cluster / distributed computing contexts.
- pyrosetta.bindings - Bindings for Rosetta energy functions only - used almost nowhere outside protein design labs.
Genomics & Sequencing
- dnaio - Fast FASTA/FASTQ reader/writer used under the hood by cutadapt.
- cutadapt (Python API) - Adapter trimming library - most people use the CLI, but it has a Python API.
- python-edlib - Ultra-fast library for edit distance / approximate sequence alignment (bindings for Edlib C library).
- parasail-python - Bindings to the Parasail SIMD-accelerated pairwise alignment routines.
- mappy - Python bindings to minimap2 for ultra-fast genome mapping.
- pyfaidx - FASTA indexing and fast random access (like samtools faidx, but in Python).
- screed - Indexed FASTA/FASTQ reader optimized for streaming very large datasets.
- sourmash - Implements MinHash comparisons for genomic sketching, metagenomics, and large-scale sequence similarity.
- khmer - K-mer counting, compression, and probabilistic data structures for huge genomes.
- xopen - Handles compressed files (bgzip, gz) more efficiently than standard Python readers; used in many seq tools.
Neuroscience & Neurobiology
- neo - Data model library for electrophysiology experiment formats.
- elephant - Statistical analysis for spiking neural data (built on Neo).
- PyNWB / HDMF - Work with the Neurodata Without Borders (NWB) data standard.
- brian2 - Spiking neural network simulator used in theoretical neuroscience.
- pyabf - Reading Axon Binary File (ABF) electrophysiology files.
- pylake - Control and analysis for optical tweezer experiments (LUMICKS instruments)
Structural biology / Chemical biology
- PyEMMA - Markov state modeling for protein conformational dynamics.
- MSMBuilder - Machine learning library for analyzing molecular dynamics trajectories.
- MDtraj - Specialized molecular dynamics trajectory analysis toolkit.
- molecool - Small library for manipulating molecular structures intended for teaching but used in niche workflows.
- pdbfixer - Automatically fix missing atoms/residues in PDB files - used before MD simulations.
Systems biology & Modelling
- cobra-me - For modeling macromolecular expression models ("ME models") - very specialized.
- PEtab / pyPESTO - Parameter estimation for systems biology models.
- sbnet (Systems Biology Notebook) - Experimental toolkit for rule-based modeling workflows.
- Bionetgen Python API - Bindings for BioNetGen rule-based modeling platform.
Microbiology / Metagenomics
- anvi'o (Python API) - High-dimensional microbial & metagenomic analysis framework. CLI is famous; the Python API is lesser known.
- genomeview - Python library for visualizing aligned reads and annotations directly from BAM/VCF/FASTA.
- anvi-snakemake / anvio-structure - Extensions for structural metagenomics
- mmgenome2 (Python parts) - Toolkit for metagenome binning; consists partly of Python libraries.
Population genetics
- allel (scikit-allel) - Analysis of large-scale population genetics data (VCF, Zarr-based).
- fwdpy11 - Forward-time evolutionary simulations (selection, demography).
- moments - Demographic inference from site-frequency spectra.
- dadi - Another demographic inference tool (Diffusion Approximations for Demographic Inference).
Proteomics
- pyteomics - Parsing MS/MS mass spectrometry formats (mzML, mzXML, MGF).
- alphapept - End-to-end proteomics pipeline with a Python engine.
- ms_deisotope - Deisotoping + charge state deconvolution for mass spec signals.
- GlyPy - Glycoinformatics library - glycan structures & mass spec data.
Phylogenetics
- biopython-phylip - Sxtensions for interacting with PHYLIP tools.
- augur (Nextstrain) - Python API for phylogenetic pipelines (used heavily in genomic epidemiology).
- phylopandas - Integrates phylogenetic sequences & metadata into a pandas-like API.
- pastml - Phylogenetic ancestral state reconstruction in Python.
Microscopy / Image analysis
- aicsimageio - Read microscopy data from proprietary formats (CZI, OME-TIFF, etc.).
- czifile - Read Zeiss CZI microscopy images.
- cellpose (Python API) - Deep learning-based cell segmentation.
- napari (with bioimage-specific plugins) - Interactive viewer - not strictly a “biology" library, but most of its ecosystem is microscopy-oriented.
Animal tracking / Behavioral biology
- dlc (DeepLabCut) - Pose estimation software used for tracking animal behavior.
- SLEAP - Multi-animal pose tracking using deep learning.
Plant biology
- PlantCV - Computer vision toolkit for plant phenotyping.
Bioinformatics adjacent / "Strange but real"
- skmer - Genome distance estimation without assembly using spectral k-mer methods.
- phylter - Detect suspicious sequences in phylogenetic datasets (contaminants and outliers).
- pyslim - Manipulate tree sequences produced by SLiM evolutionary simulations.
- pp-sketchlib - MinHash sketches for petabase-scale genomics (used by BIGSI databases).
- pyswift - Machine-learning-based annotation for Schizosaccharomyces pombe datasets