This repository contains the source code for processing TCGA, GTEx, CCLE and GDSC data and building hierarchical models for predicting the tissue and subtype origins of cancer cell lines.
The following publications refer to this repository:
1)
The file container.def is an Apptainer definition file for building a container with the Python environment needed to run the Python code. Instructions for building and using the container are found here.
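As a rough illustration of the build-and-run workflow, the sketch below assembles the two Apptainer commands typically involved: `apptainer build` to create an image from container.def, and `apptainer exec` to run a script inside it. The image name container.sif and the choice of model_training_script.py as the script to run are assumptions made for this example. The instructions referenced above remain the authoritative source, and building may need extra options depending on the system.

```python
import shlex

DEF_FILE = "container.def"  # definition file named in this README
IMAGE = "container.sif"     # assumed image name, not taken from the repository


def apptainer_commands(def_file=DEF_FILE, image=IMAGE,
                       script="model_training_script.py"):
    """Return the build and run commands as argument lists (nothing is executed)."""
    build = ["apptainer", "build", image, def_file]
    run = ["apptainer", "exec", image, "python", script]
    return [build, run]


# Print the commands so they can be reviewed before running them in a shell.
for cmd in apptainer_commands():
    print(shlex.join(cmd))
```

To execute a command from Python instead of copying it to a shell, each argument list could be passed to `subprocess.run(cmd, check=True)`.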
Required R packages are listed in R_requirements.txt.
The suggested order for running the scripts is:
- Data download (using R)
  - Data_download_tcga_hg38.R
  - Subtype_data_download.R
- Data processing
  - Data_preparation.ipynb
  - scRNA_preprocessing.ipynb
- Model training and results
  - model_training_script.py
- Result analysis
  - Uncertainty_analysis.ipynb
  - batch_correction_quantification.ipynb
  - biomarker_analysis.ipynb
  - computational_biomarker_analysis.ipynb
  - Feature_selection_post_hoc.ipynb
  - Applicability_domain_analysis.ipynb
- Visualizations
  - Data_visualization.ipynb
Modify the paths as needed and follow other instructions contained in the notebooks.
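The suggested order above can be captured in a small helper that prints the stages in sequence and flags any file it cannot find. This is only a sketch: the file names are copied from the list, but the assumption that they all sit directly in the repository root is not confirmed by this README, so adjust `repo_root` or the names as needed.

```python
from pathlib import Path

# Stages and files in the suggested execution order, copied from the list above.
PIPELINE = [
    ("Data download (using R)",
     ["Data_download_tcga_hg38.R", "Subtype_data_download.R"]),
    ("Data processing",
     ["Data_preparation.ipynb", "scRNA_preprocessing.ipynb"]),
    ("Model training and results",
     ["model_training_script.py"]),
    ("Result analysis",
     ["Uncertainty_analysis.ipynb",
      "batch_correction_quantification.ipynb",
      "biomarker_analysis.ipynb",
      "computational_biomarker_analysis.ipynb",
      "Feature_selection_post_hoc.ipynb",
      "Applicability_domain_analysis.ipynb"]),
    ("Visualizations",
     ["Data_visualization.ipynb"]),
]


def missing_files(repo_root="."):
    """Return listed files not found under repo_root (assumed flat layout)."""
    root = Path(repo_root)
    return [name for _, files in PIPELINE for name in files
            if not (root / name).is_file()]


def print_plan(repo_root="."):
    """Print every stage and file in order, marking files that are missing."""
    missing = set(missing_files(repo_root))
    step = 1
    for stage, files in PIPELINE:
        print(stage)
        for name in files:
            status = "missing" if name in missing else "found"
            print(f"  {step:2d}. {name} [{status}]")
            step += 1


print_plan()
```

The helper does not run anything; it only reports the order and which files are present, which can be a quick check before starting the R scripts and notebooks by hand.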
- Juho Mikkonen (juho.mikkonen@uef.fi)
- Vittorio Fortino (vittorio.fortino@uef.fi)