Multiple Sequence Alignment Analysis

Primary Stack: Python • BioPython • Jupyter

Description

This project implements comprehensive multiple sequence alignment (MSA) analysis workflows, including consensus sequence calculation, position-specific scoring matrices (PSSM), sequence profile visualization, and Markov chain simulation for biological sequence analysis.

Problem

Analyzing biological sequences requires computing consensus sequences from multiple alignments, generating position-specific scoring matrices (PSSMs), and creating sequence logo visualizations. Additionally, understanding stochastic processes like Markov chains helps model biological phenomena such as state transitions.

Approach

The project implements three main components:

Consensus Sequence Calculation: Processes multiple sequence alignments (MSA) from FASTA files to derive consensus sequences using BioPython's AlignInfo module.
Sequence Profile Analysis: Generates position-specific scoring matrices (PSSMs) and converts them to probability-based profiles, then visualizes them as sequence logos using logomaker.
Markov Chain Simulation: Implements a discrete-time Markov chain simulator for modeling state transitions with custom transition matrices.

Tech Stack

Python 3.x: Core programming language
BioPython: Sequence alignment and analysis
logomaker: Sequence logo visualization
numpy: Numerical computations
Jupyter Notebook: Interactive analysis (markov_simulation.ipynb)

Project Structure

msa-consensus-pssm-and-markov/
├── consensus_sequence.py           # Consensus sequence calculation from MSA
├── sequence_profile.py             # PSSM generation and sequence logo creation
├── markov_simulation.py            # Markov chain simulator
├── markov_simulation.ipynb         # Interactive Markov chain analysis
├── msa_output.fasta                # Multiple sequence alignment results
├── raw_sequences.fasta             # Input raw sequence data
├── requirements.txt                # Python dependencies
└── README.md                        # This file

Setup / Installation

Ensure you have Python 3.7+ installed:

python --version

Install required dependencies:

pip install -r requirements.txt

The requirements include:

biopython>=1.79 - Sequence alignment and analysis
logomaker - Sequence logo visualization
numpy>=1.21.0 - Numerical operations
matplotlib>=3.3.0 - Plotting support
jupyter>=1.0.0 - Interactive notebook

Usage

1. Consensus Sequence Calculation

Calculate a consensus sequence from a multiple sequence alignment:

python consensus_sequence.py msa_output.fasta

Output: Prints the consensus sequence to stdout.

2. Sequence Profile Analysis

Generate PSSM, create probability-based profiles, and produce sequence logos:

python sequence_profile.py msa_output.fasta

Outputs:

sequence_analysis_sequence_profile.csv - Position-specific probability matrix
sequence_analysis_sequence_logo.png - Visual sequence logo (50x5 inches)
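
The CSV layout is not spelled out here, so the following is only a sketch of how a probability matrix could be written with sorted residue headers. The `write_profile_csv` helper, the leading `position` column, and the four-decimal formatting are all assumptions for illustration, not the script's confirmed behavior.

```python
import csv
import io


def write_profile_csv(handle, alphabet, rows):
    """Write one row per alignment position with a sorted residue header."""
    writer = csv.writer(handle)
    header = sorted(alphabet)
    # Map each sorted residue back to its column in the input rows.
    order = [alphabet.index(res) for res in header]
    writer.writerow(["position"] + header)  # "position" column is hypothetical
    for position, row in enumerate(rows, start=1):
        writer.writerow([position] + [f"{row[i]:.4f}" for i in order])


# Demo with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_profile_csv(buffer, ["S", "M", "A"], [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5]])
print(buffer.getvalue())
```

Passing an open file handle (opened with `newline=""`) instead of the `StringIO` buffer would write the same content to disk.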

Custom Output Prefix: Set the MSA_OUTPUT_PREFIX environment variable to customize output filenames:

# Linux/Mac
export MSA_OUTPUT_PREFIX="my_analysis"
python sequence_profile.py msa_output.fasta

# Windows (Command Prompt)
set MSA_OUTPUT_PREFIX=my_analysis
python sequence_profile.py msa_output.fasta

# Windows (PowerShell)
$env:MSA_OUTPUT_PREFIX="my_analysis"
python sequence_profile.py msa_output.fasta
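
As a rough illustration of how a script could resolve this variable, the sketch below builds both output filenames from `MSA_OUTPUT_PREFIX`. The `output_paths` helper is hypothetical, and treating an empty value the same as an unset one is an assumption; the default prefix and filename suffixes are taken from the output names listed above.

```python
import os

DEFAULT_PREFIX = "sequence_analysis"  # matches the default filenames listed above


def output_paths(environ=None):
    """Return (csv_path, png_path) for the current output prefix."""
    env = os.environ if environ is None else environ
    # Assumption: an empty value falls back to the default, like an unset one.
    prefix = env.get("MSA_OUTPUT_PREFIX") or DEFAULT_PREFIX
    return f"{prefix}_sequence_profile.csv", f"{prefix}_sequence_logo.png"


# A plain dict stands in for the real environment in this demo.
print(output_paths({"MSA_OUTPUT_PREFIX": "my_analysis"}))
print(output_paths({}))
```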

3. Markov Chain Simulation

Command-line version:

python markov_simulation.py

Interactive Jupyter notebook:

jupyter notebook markov_simulation.ipynb

The simulation models state transitions (e.g., city visits) using a transition matrix and tracks the frequency of each state over 100 steps.
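
A minimal sketch of this kind of simulation follows. It uses the standard library's `random.choices` in place of `numpy.random.choice` so that it runs without extra dependencies. The transition probabilities are made up for illustration, and counting the starting city as the first visit is an assumption.

```python
import random
from collections import Counter

# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "Berlin": {"Berlin": 0.2, "Munich": 0.5, "Hamburg": 0.3},
    "Munich": {"Berlin": 0.4, "Munich": 0.3, "Hamburg": 0.3},
    "Hamburg": {"Berlin": 0.3, "Munich": 0.4, "Hamburg": 0.3},
}


def simulate(transitions, start, steps, seed=None):
    """Walk the chain for `steps` visits and count time spent in each state."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    state = start
    visits = Counter()
    for _ in range(steps):
        visits[state] += 1
        row = transitions[state]
        # Draw the next state according to the current state's row.
        state = rng.choices(list(row), weights=list(row.values()), k=1)[0]
    return visits


visits = simulate(TRANSITIONS, "Berlin", steps=100, seed=42)
print("Total visits:", sum(visits.values()))
for city, count in visits.items():
    print(f"{city}: {count}")
```

Passing the same `seed` reproduces the same walk; leaving it as `None` gives a different walk on each run.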

Data Files

Input Files

raw_sequences.fasta: Raw biological sequences in FASTA format
- Contains unaligned sequences for initial processing

Output Files

msa_output.fasta: Multiple sequence alignment results
- Generated from MSA tool (e.g., Clustal Omega, MUSCLE)
- Used as input for consensus and profile analysis
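
Before running the consensus or profile scripts, it can help to confirm that the file really is an alignment, meaning every row has the same length once gaps are included. The sketch below is a dependency-free check; `read_fasta` and `is_aligned` are hypothetical helpers and not part of this project, and the example sequences are invented.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a {header: sequence} dict."""
    records, header = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


def is_aligned(records):
    """True when every sequence has the same length (gaps included)."""
    return len({len(seq) for seq in records.values()}) == 1


example = [">seq1", "MSF-KIL", ">seq2", "MSFAKIL", ">seq3", "MS--KIL"]
records = read_fasta(example)
print(is_aligned(records))
```

For a real file, pass the open handle instead of the list, for example `read_fasta(open("msa_output.fasta"))`.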

Key Features

Consensus Sequence Module

Reads MSA files in FASTA format
Uses BioPython's AlignInfo.SummaryInfo
Applies dumb_consensus with threshold=0, so the most common residue in each column is reported without a minimum-frequency cutoff (the least stringent setting)
Handles single alignment files (typical for Clustal Omega output)
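
The majority-rule idea behind a threshold-based consensus can be sketched in plain Python. This is not BioPython's implementation. The `simple_consensus` function is a hypothetical stand-in, and its handling of gaps (ignored) and of ties (reported as the ambiguous character) is an assumption about reasonable behavior.

```python
from collections import Counter

GAP_CHARS = {"-", "."}


def simple_consensus(sequences, threshold=0.0, ambiguous="X"):
    """Majority-rule consensus over equal-length aligned sequences."""
    consensus = []
    for column in zip(*sequences):
        counts = Counter(res for res in column if res not in GAP_CHARS)
        if not counts:
            consensus.append(ambiguous)  # column contains only gaps
            continue
        ranked = counts.most_common()
        top_residue, top_count = ranked[0]
        # Assumption: a tie for first place is treated as ambiguous.
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        frequency = top_count / sum(counts.values())
        if not tied and frequency >= threshold:
            consensus.append(top_residue)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


aligned = ["MSFKIL", "MSF-IL", "MAFKVL"]
print(simple_consensus(aligned))                 # MSFKIL
print(simple_consensus(aligned, threshold=0.7))  # MXFKXL
```

With the threshold at 0, the two-out-of-three columns still yield a residue; raising it to 0.7 turns those columns into `X`, which shows why a zero threshold is the permissive end of the range.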

Sequence Profile Module

Generates position-specific scoring matrices (PSSM)
Converts raw counts to probability distributions
Exports as CSV with sorted amino acid headers
Creates high-resolution sequence logos (chemistry color scheme)
Configurable for different sequence types
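
The count-to-probability conversion can be illustrated without any libraries. The sketch below is an assumption-laden stand-in for the real script: `probability_profile` is a hypothetical helper, and excluding gap characters from both the counts and the denominator is a choice made here, not something the project confirms.

```python
from collections import Counter


def probability_profile(sequences, gap_chars="-."):
    """Return (alphabet, rows): one probability row per alignment column."""
    alphabet = sorted(
        {res for seq in sequences for res in seq if res not in gap_chars}
    )
    rows = []
    for column in zip(*sequences):
        counts = Counter(res for res in column if res not in gap_chars)
        total = sum(counts.values())
        # An all-gap column gets zeros rather than a division by zero.
        rows.append([counts[res] / total if total else 0.0 for res in alphabet])
    return alphabet, rows


alphabet, rows = probability_profile(["MSFK", "MSF-", "MAFK"])
print(alphabet)
for row in rows:
    print([round(p, 2) for p in row])
```

Each row sums to 1 for any column that contains at least one residue, which is the property a sequence logo tool expects from a probability matrix.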

Markov Chain Simulator

Discrete-time Markov chain implementation
Customizable transition matrices
Simulates N steps starting from any state
Tracks and reports state frequencies
Example: Models traveler movement between cities

Reproducibility Notes

Consensus Sequence

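Threshold parameter set to 0, so the most common residue in each column is accepted however weak its majority (the least stringent setting)
Alternative: Raise the threshold in dumb_consensus() to require stronger agreement before a residue is called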

Sequence Logos

Generated with logomaker chemistry color scheme
Figure size: 50x5 inches for high detail
Stack order: small_on_top for visual clarity

Markov Simulation

Uses numpy.random.choice with transition probabilities
Set random seed for reproducible results:
```
import numpy as np
np.random.seed(42)
```

Troubleshooting

Import Errors:

pip install biopython logomaker numpy matplotlib jupyter

BioPython Version: Verify installation:

python -c "import Bio; print(Bio.__version__)"

Requires BioPython >=1.79

File Not Found: Ensure FASTA files are in the project directory or provide full paths:

python consensus_sequence.py /path/to/alignment.fasta

Logomaker Missing: The logomaker package is required for sequence logo generation:

pip install logomaker

Output File Customization: By default, output files are named sequence_analysis_sequence_profile.csv and sequence_analysis_sequence_logo.png. To customize the prefix, set the MSA_OUTPUT_PREFIX environment variable before running:

export MSA_OUTPUT_PREFIX="my_project"  # Linux/Mac
set MSA_OUTPUT_PREFIX=my_project       # Windows CMD

Expected Outputs

consensus_sequence.py

Consensus sequence printed to stdout
Example: MSFKILVACGLLL...

sequence_profile.py

Files created:
- sequence_analysis_sequence_profile.csv (CSV with probability matrix)
- sequence_analysis_sequence_logo.png (Sequence logo visualization)

markov_simulation.py / .ipynb

Example output (counts vary between runs unless a random seed is set):

total number of cities visited: 100
Time in each city visited:
Berlin: 32
Munich: 38
Hamburg: 30
