Primary Stack: Python • BioPython • Jupyter
This project implements comprehensive multiple sequence alignment (MSA) analysis workflows, including consensus sequence calculation, position-specific scoring matrices (PSSM), sequence profile visualization, and Markov chain simulation for biological sequence analysis.
Analyzing biological sequences requires computing consensus sequences from multiple alignments, generating position-specific scoring matrices (PSSMs), and creating sequence logo visualizations. Additionally, understanding stochastic processes like Markov chains helps model biological phenomena such as state transitions.
The project implements three main components:
- Consensus Sequence Calculation: Processes multiple sequence alignments (MSA) from FASTA files to derive consensus sequences using BioPython's AlignInfo module.
- Sequence Profile Analysis: Generates position-specific scoring matrices (PSSMs) and converts them to probability-based profiles, then visualizes them as sequence logos using logomaker.
- Markov Chain Simulation: Implements a discrete-time Markov chain simulator for modeling state transitions with custom transition matrices.
- Python 3.x: Core programming language
- BioPython: Sequence alignment and analysis
- logomaker: Sequence logo visualization
- numpy: Numerical computations
- Jupyter Notebook: Interactive analysis (markov_simulation.ipynb)
msa-consensus-pssm-and-markov/
├── consensus_sequence.py # Consensus sequence calculation from MSA
├── sequence_profile.py # PSSM generation and sequence logo creation
├── markov_simulation.py # Markov chain simulator
├── markov_simulation.ipynb # Interactive Markov chain analysis
├── msa_output.fasta # Multiple sequence alignment results
├── raw_sequences.fasta # Input raw sequence data
├── requirements.txt # Python dependencies
└── README.md # This file
- Ensure you have Python 3.7+ installed:
python --version
- Install required dependencies:
pip install -r requirements.txt
The requirements include:
- biopython>=1.79: Sequence alignment and analysis
- logomaker: Sequence logo visualization
- numpy>=1.21.0: Numerical operations
- matplotlib>=3.3.0: Plotting support
- jupyter>=1.0.0: Interactive notebook
Calculate a consensus sequence from a multiple sequence alignment:
python consensus_sequence.py msa_output.fasta
Output: Prints the consensus sequence to stdout.
Generate PSSM, create probability-based profiles, and produce sequence logos:
python sequence_profile.py msa_output.fasta
Outputs:
- sequence_analysis_sequence_profile.csv: Position-specific probability matrix
- sequence_analysis_sequence_logo.png: Visual sequence logo (50x5 inches)
Custom Output Prefix: Set the MSA_OUTPUT_PREFIX environment variable to customize output filenames:
# Linux/Mac
export MSA_OUTPUT_PREFIX="my_analysis"
python sequence_profile.py msa_output.fasta
# Windows (Command Prompt)
set MSA_OUTPUT_PREFIX=my_analysis
python sequence_profile.py msa_output.fasta
# Windows (PowerShell)
$env:MSA_OUTPUT_PREFIX="my_analysis"
python sequence_profile.py msa_output.fasta
Run the Markov chain simulation as a script, or open the notebook for interactive analysis:
python markov_simulation.py
jupyter notebook markov_simulation.ipynb
The simulation models state transitions (e.g., city visits) using a transition matrix and tracks the frequency of each state over 100 steps.
- raw_sequences.fasta: Raw biological sequences in FASTA format
  - Contains unaligned sequences for initial processing
- msa_output.fasta: Multiple sequence alignment results
  - Generated by an MSA tool (e.g., Clustal Omega, MUSCLE)
  - Used as input for consensus and profile analysis
- Reads MSA files in FASTA format
- Uses BioPython's AlignInfo.SummaryInfo
- Applies dumb_consensus with threshold=0, so a column's most frequent residue is accepted no matter how small its share (the least stringent setting)
- Handles single alignment files (typical for Clustal Omega output)
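The thresholded majority vote described above can be sketched in plain Python. This is an illustrative stand-in, not BioPython's implementation: the function name `majority_consensus`, the toy alignment, and the choice to mark ties and empty columns with the ambiguous character are assumptions made for the example. Here `threshold` is the minimum share of a column that the top residue must hold.

```python
from collections import Counter


def majority_consensus(aligned, threshold=0.0, ambiguous="X"):
    """Column-wise consensus over equal-length aligned sequences.

    Hypothetical helper for illustration only.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must share one length")

    consensus = []
    for column in zip(*aligned):
        # Gap characters do not vote (assumption for this sketch).
        counts = Counter(res for res in column if res not in "-.")
        if not counts:
            consensus.append(ambiguous)
            continue
        total = sum(counts.values())
        ranked = counts.most_common()
        top_residue, top_count = ranked[0]
        # A tie for first place has no single winner.
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        if not tied and top_count / total >= threshold:
            consensus.append(top_residue)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


toy_alignment = ["MSFKI", "MSFRI", "MAF-I"]  # made-up data
print(majority_consensus(toy_alignment, threshold=0.0))  # MSFXI
print(majority_consensus(toy_alignment, threshold=0.7))  # MXFXI
```

In the toy data the second column is S in two of three sequences, so it survives a threshold of 0 but not 0.7; the fourth column is a K/R tie and is ambiguous either way.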
- Generates position-specific scoring matrices (PSSM)
- Converts raw counts to probability distributions
- Exports as CSV with sorted amino acid headers
- Creates high-resolution sequence logos (chemistry color scheme)
- Configurable for different sequence types
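A minimal sketch of the count-to-probability step and the CSV export, using only the standard library. The helper names, the `position` column, the three-decimal formatting, and the decision to exclude gaps from the counts are all assumptions for illustration; the actual sequence_profile.py may handle these differently.

```python
import csv
import io
from collections import Counter


def probability_profile(aligned):
    """Return (sorted residue alphabet, per-column probability rows)."""
    # Sorted alphabet gives stable, sorted CSV headers.
    alphabet = sorted({res for seq in aligned for res in seq if res != "-"})
    rows = []
    for column in zip(*aligned):
        # Gaps are excluded from the counts (assumption for this sketch).
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        rows.append([counts[res] / total if total else 0.0 for res in alphabet])
    return alphabet, rows


def profile_to_csv(alphabet, rows):
    """Serialize the profile as CSV text, one row per alignment column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["position"] + alphabet)
    for position, row in enumerate(rows, start=1):
        writer.writerow([position] + [f"{p:.3f}" for p in row])
    return buffer.getvalue()


toy_alignment = ["ACDA", "ACEA", "GC-A"]  # made-up data
alphabet, rows = probability_profile(toy_alignment)
print(profile_to_csv(alphabet, rows))
```

Each row sums to 1 whenever the column contains at least one residue, which is the property a sequence logo built from probabilities relies on.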
- Discrete-time Markov chain implementation
- Customizable transition matrices
- Simulates N steps starting from any state
- Tracks and reports state frequencies
- Example: Models traveler movement between cities
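The simulator's behavior can be sketched as follows. To keep the example dependency-free it uses the standard library's `random.choices` in place of `numpy.random.choice`; the transition probabilities and the `simulate` function are hypothetical and are not taken from markov_simulation.py.

```python
import random
from collections import Counter

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "Berlin": {"Berlin": 0.2, "Munich": 0.5, "Hamburg": 0.3},
    "Munich": {"Berlin": 0.4, "Munich": 0.2, "Hamburg": 0.4},
    "Hamburg": {"Berlin": 0.3, "Munich": 0.5, "Hamburg": 0.2},
}


def simulate(transitions, start, steps, seed=None):
    """Record `steps` visits (the start state included) and count them."""
    for state, row in transitions.items():
        if abs(sum(row.values()) - 1.0) > 1e-9:
            raise ValueError(f"row for {state!r} does not sum to 1")
    rng = random.Random(seed)  # private generator, so the seed is local
    state = start
    visits = Counter()
    for _ in range(steps):
        visits[state] += 1
        row = transitions[state]
        # The next state depends only on the current one (Markov property).
        state = rng.choices(list(row), weights=list(row.values()), k=1)[0]
    return visits


visits = simulate(TRANSITIONS, start="Berlin", steps=100, seed=42)
for city in TRANSITIONS:
    print(f"{city}: {visits[city]}")
```

Passing the same seed reproduces the same walk; the exact counts depend on the random generator, so they will not match output produced with numpy.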
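- Threshold parameter set to 0, the least stringent setting: the most frequent residue is accepted at every position, however low its frequency
- Alternative: Raise the threshold in dumb_consensus() to require stronger agreement; positions that fall below it receive the ambiguous character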
- Generated with logomaker's chemistry color scheme
- Figure size: 50x5 inches for high detail
- Stack order: small_on_top for visual clarity
- Uses numpy.random.choice with transition probabilities
- Set random seed for reproducible results: np.random.seed(42)
Import Errors:
pip install biopython logomaker numpy matplotlib jupyter
BioPython Version: Verify installation:
python -c "import Bio; print(Bio.__version__)"
Requires BioPython >=1.79
File Not Found: Ensure FASTA files are in the project directory or provide full paths:
python consensus_sequence.py /path/to/alignment.fasta
Logomaker Missing:
The logomaker package is required for sequence logo generation:
pip install logomaker
Output File Customization:
By default, output files are named sequence_analysis_sequence_profile.csv and sequence_analysis_sequence_logo.png.
To customize the prefix, set the MSA_OUTPUT_PREFIX environment variable before running:
# Linux/Mac
export MSA_OUTPUT_PREFIX="my_project"
# Windows (Command Prompt)
set MSA_OUTPUT_PREFIX=my_project
Consensus sequence printed to stdout
Example: MSFKILVACGLLL...
Files created:
- sequence_analysis_sequence_profile.csv (CSV with probability matrix)
- sequence_analysis_sequence_logo.png (Sequence logo visualization)
total number of cities visited: 100
Time in each city visited:
Berlin: 32
Munich: 38
Hamburg: 30