Title: | Predict Cancer Subtypes Based on TCGA Data using Machine Learning Method |
---|---|
Description: | Provide functionality for cancer subtyping using nearest centroids or machine learning methods based on TCGA data. |
Authors: | Dadong Zhang <[email protected]> |
Maintainer: | Dadong Zhang <[email protected]> |
License: | GPL-3 |
Version: | 1.0.0 |
Built: | 2025-02-17 04:31:38 UTC |
Source: | https://github.com/dadongz/oncosubtype |
Predict the subtypes of a selected cancer type using centroids from published papers
centroids_subtype(data, disease = "LUSC")
data |
data set to subtype: a numeric matrix with features as row names and samples as column names |
disease |
character string naming the disease whose subtypes are predicted; currently supports 'LUSC' and 'LUAD' |
An object of class "SubtypeClass" with four slots: the genes used for prediction, the predicted subtypes of the samples, a matrix of prediction scores, and the method.
## Not run:
library(OncoSubtype)
library(SummarizedExperiment)  # provides assays() and rowData()
data <- get_median_centered(example_fpkm)
data <- assays(data)$centered
rownames(data) <- rowData(example_fpkm)$external_gene_name
centroids_subtype(data, disease = 'HNSC')
## End(Not run)
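As a rough illustration of nearest-centroid subtyping, the base R sketch below scores each sample against each centroid column and assigns the best-scoring subtype. It assumes Pearson correlation as the score, which this manual does not confirm; `nearest_centroid`, the toy genes, and the subtype names are hypothetical and not part of OncoSubtype.

```r
# Toy centroid table: 5 genes x 2 hypothetical subtypes
centroids <- matrix(
  c( 1,  2, -1, -2,  0,
    -1, -2,  1,  2,  0),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5), c("subtypeA", "subtypeB"))
)

# Toy median-centered expression: genes in rows, samples in columns
expr <- matrix(
  c( 0.9,  2.1, -1.2, -1.8,  0.1,
    -1.1, -1.9,  0.8,  2.2, -0.1,
     1.2,  1.7, -0.7, -2.3,  0.3),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5), c("s1", "s2", "s3"))
)

# Hypothetical helper: correlation-based nearest centroid (an assumption,
# not necessarily the scoring used by centroids_subtype())
nearest_centroid <- function(expr, centroids) {
  # Only genes present in both the data and the centroid table can be scored
  shared <- intersect(rownames(expr), rownames(centroids))
  # cor() on two matrices returns a samples x subtypes score matrix
  scores <- cor(expr[shared, , drop = FALSE], centroids[shared, , drop = FALSE])
  # Assign each sample to the subtype with the highest score
  calls <- colnames(scores)[max.col(scores, ties.method = "first")]
  names(calls) <- rownames(scores)
  list(genes = shared, subtypes = calls, scores = scores)
}

res <- nearest_centroid(expr, centroids)
print(res$subtypes)
```

The returned list mirrors the pieces described for the result object (genes, subtype calls, score matrix), but it is a plain list rather than a "SubtypeClass" instance.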
example FPKM data
example_fpkm
SummarizedExperiment object
Select highly variable genes from an expression matrix
get_hvg(data, top = 1000)
data |
a (normalized) matrix with rownames of features and colnames of samples |
top |
number of top highly variable genes to output |
A subset of the input matrix containing the top-ranked genes by variance
## Not run:
library(OncoSubtype)
library(SummarizedExperiment)  # provides assays()
data <- get_median_centered(example_fpkm)
data <- assays(data)$centered
get_hvg(data)
## End(Not run)
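The selection step can be pictured with a few lines of base R: compute a variance per row, rank the rows, and keep the top ones. This is a minimal sketch under that assumption; `top_variable_rows` is a hypothetical helper, and the package's `get_hvg()` may differ in details such as tie handling.

```r
# Hypothetical helper: keep the `top` rows with the largest variance
top_variable_rows <- function(mat, top = 1000) {
  row_var <- apply(mat, 1, var)                  # one variance per feature
  n_keep <- min(top, nrow(mat))                  # never ask for more rows than exist
  keep <- order(row_var, decreasing = TRUE)[seq_len(n_keep)]
  mat[keep, , drop = FALSE]
}

# Toy matrix: features in rows, samples in columns
mat <- rbind(
  flat = c(1, 1, 1, 1),
  low  = c(1, 2, 1, 2),
  high = c(0, 10, 0, 10)
)

hv <- top_variable_rows(mat, top = 2)
print(rownames(hv))
```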
Convert an expression matrix to median-centered values
get_median_centered(data, log2 = TRUE)
data |
a numeric matrix or 'S4' object |
log2 |
logical, if 'TRUE' |
A median-centered expression matrix, or the input object with a new slot "centered"
## Not run:
get_median_centered(example_fpkm)
## End(Not run)
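For intuition, median centering can be sketched in base R as an optional log2 transform followed by subtracting each feature's median across samples. The pseudocount of 1 and the per-row (per-gene) centering are assumptions made for this illustration; `median_center` is a hypothetical helper, not the package function.

```r
# Hypothetical helper: optional log2(x + pseudocount), then per-gene centering
median_center <- function(mat, log_transform = TRUE, pseudocount = 1) {
  if (log_transform) {
    mat <- log2(mat + pseudocount)               # pseudocount avoids log2(0)
  }
  gene_medians <- apply(mat, 1, median)          # one median per row (gene)
  sweep(mat, 1, gene_medians, FUN = "-")         # subtract it from every sample
}

# Toy FPKM-like matrix: genes in rows, samples in columns
fpkm <- matrix(
  c(0, 3, 7,
    1, 1, 15),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("g1", "g2"), c("s1", "s2", "s3"))
)

centered <- median_center(fpkm)
print(centered)
```

After centering, every row has a median of zero, so values read as expression relative to the typical sample for that gene.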
Predict the subtypes of selected cancer type
get_rf_pred(train_set, test_set, method = "rf", seed = NULL)
train_set |
training set with samples as row names; the first column is named 'mRNA_subtype' and the remaining columns hold the expression values of the features. |
test_set |
test set with rownames of features and colnames of samples. |
method |
character string naming the method to use; currently supports 'rf'. |
seed |
integer seed to use. |
a matrix with column names of subtypes and predicted probabilities.
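Note that the two inputs have opposite orientations: the training set has samples in rows, while the test set has features in rows. The sketch below shows one way to align them in base R and, if the randomForest package happens to be installed, to obtain class probabilities. Whether `get_rf_pred()` uses randomForest directly is an assumption here, and all data and object names are made up for the example.

```r
# Toy training set: samples in rows, first column 'mRNA_subtype'
train_set <- data.frame(
  mRNA_subtype = factor(c("A", "A", "A", "B", "B", "B")),
  gene1 = c(2.1, 1.8, 2.4, -1.9, -2.2, -2.0),
  gene2 = c(-1.0, -1.3, -0.8, 1.1, 0.9, 1.4),
  gene3 = c(0.1, -0.2, 0.0, 0.2, -0.1, 0.3),
  row.names = paste0("train", 1:6)
)

# Toy test set: features in rows, samples in columns
test_set <- matrix(
  c( 2.0, -2.1,
    -1.1,  1.2,
     0.5,  0.4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "geneX"), c("test1", "test2"))
)

# Keep only features present in both sets, then put test samples in rows
features <- intersect(colnames(train_set)[-1], rownames(test_set))
test_df <- as.data.frame(t(test_set[features, , drop = FALSE]))

# Optional: fit a random forest only when the package is available (assumption)
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(123)  # fixed seed for reproducible trees
  fit <- randomForest::randomForest(
    x = train_set[, features, drop = FALSE],
    y = train_set$mRNA_subtype
  )
  # Matrix with one row per test sample and one column per subtype
  prob <- predict(fit, newdata = test_df, type = "prob")
  print(prob)
}
```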
HNSC predictor centroids from https://www.nature.com/articles/nature14129
hnsc_centroids
A tibble with 728 features and four subtypes.
Downloads a specified dataset from a GitHub repository if it is not already present in the specified local directory, then loads the dataset into the global environment. This function is designed to help manage package size by storing data externally and loading it on-demand.
load_dataset_from_github(disease, local_dir = path.expand(getwd()))
disease |
A character string specifying the disease, which corresponds
to the name of the dataset to be loaded (e.g., "LUSC"). The function constructs
the filename as |
local_dir |
An optional character string specifying the path to the directory
where datasets should be stored locally. If not provided, defaults to a
subdirectory named |
Invisible NULL. The function is used for its side effect of loading a dataset into the global environment; it does not return the dataset itself.
## Not run:
load_dataset_from_github("LUSC")
## End(Not run)
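The download-if-missing pattern described above can be sketched with base R alone. Everything named here is hypothetical: `fetch_and_load`, the `file_name` argument, and the placeholder URL are illustrative and do not reflect the package's real file naming or repository layout. The demo pre-populates the cache directory so that no network access is needed.

```r
# Hypothetical helper: download a saved R data file once, then load it
fetch_and_load <- function(file_name, base_url,
                           local_dir = getwd(), envir = globalenv()) {
  local_path <- file.path(local_dir, file_name)
  if (!file.exists(local_path)) {
    # Only reached when the file is not cached locally
    dir.create(local_dir, recursive = TRUE, showWarnings = FALSE)
    download.file(paste0(base_url, "/", file_name),
                  destfile = local_path, mode = "wb")
  }
  load(local_path, envir = envir)  # side effect: objects appear in `envir`
  invisible(NULL)                  # the dataset itself is not returned
}

# Demo with a pre-populated cache directory, so the download branch is skipped
demo_dir <- tempfile("datasets")
dir.create(demo_dir)
toy_centroids <- data.frame(gene = c("g1", "g2"), subtypeA = c(1, -1))
save(toy_centroids, file = file.path(demo_dir, "toy.rda"))

demo_env <- new.env()
out <- fetch_and_load("toy.rda",
                      base_url = "https://example.invalid/data",  # placeholder URL
                      local_dir = demo_dir, envir = demo_env)
print(ls(demo_env))
```

Loading into a dedicated environment, as in the demo, avoids overwriting objects in the global environment; the documented function loads into the global environment instead.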
LUAD predictor centroids from Wilkerson (2012)
luad_centroids
A tibble with 506 features and three subtypes: bronchioid, magnoid, and squamoid.
LUSC predictor centroids from Wilkerson (2010)
lusc_centroids
A tibble with 208 features and four subtypes: primitive, classical, secretory, and basal.
Predict the subtypes of selected cancer type using machine learning
ml_subtype(data, disease = "LUSC", method = "rf", removeBatch = TRUE, seed = NULL)
data |
data set to subtype: a numeric matrix with features as row names and samples as column names |
disease |
character string naming the disease whose subtypes are predicted; currently supports 'LUSC', 'LUAD', and 'BLCA'. |
method |
character string naming the method to use; currently supports 'rf'. |
removeBatch |
whether to perform batch effect correction using |
seed |
integer seed to use. |
An object of class "SubtypeClass" with four slots: the genes used for prediction, the predicted subtypes of the samples, a matrix of prediction scores, and the method.
Wilkerson MD, Yin X, Hayes D, et al. (2010). “Lung Squamous Cell Carcinoma mRNA Expression Subtypes Are Reproducible, Clinically Important, and Correspond to Normal Cell Types.” Clin Cancer Res, 16(19), 4864-4875.
Wilkerson MD, Yin X, Hayes D, et al. (2012). “Differential pathogenesis of lung adenocarcinoma subtypes involving sequence mutations, copy number, chromosomal instability, and methylation.” PLoS One, 7(5), e36530.
The Cancer Genome Atlas Network (2015). “Comprehensive genomic characterization of head and neck squamous cell carcinomas.” Nature, 517(7536), 576-582.
## Not run:
library(OncoSubtype)
library(SummarizedExperiment)  # provides assays() and rowData()
data <- get_median_centered(example_fpkm)
data <- assays(data)$centered
rownames(data) <- rowData(example_fpkm)$external_gene_name
ml_subtype(data, disease = 'LUAD', method = 'rf', seed = 123)
## End(Not run)
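The result holds both a score matrix and per-sample subtype calls. One plausible way the two relate, assumed here rather than confirmed by the manual, is that each sample is assigned the subtype with the highest score. The score values below are invented for illustration; only the LUSC subtype names come from the centroid documentation.

```r
# Hypothetical score matrix: samples in rows, LUSC subtypes in columns
scores <- matrix(
  c(0.10, 0.60, 0.20, 0.10,
    0.55, 0.15, 0.20, 0.10,
    0.05, 0.10, 0.15, 0.70),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"),
                  c("primitive", "classical", "secretory", "basal"))
)

# Assumed rule: pick the column with the largest score in each row
calls <- colnames(scores)[max.col(scores, ties.method = "first")]
names(calls) <- rownames(scores)
print(calls)
```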
Plot heatmap of the train set or test set
PlotHeat(object, set = "test", ...)
object |
a SubtypeClass object |
set |
options could be 'test', 'train' or 'both'. Default 'test'. |
... |
Parameters passed to |
a pheatmap object
## Not run:
library(OncoSubtype)
library(SummarizedExperiment)  # provides assays() and rowData()
data <- get_median_centered(example_fpkm)
data <- assays(data)$centered
rownames(data) <- rowData(example_fpkm)$external_gene_name
object <- ml_subtype(data, disease = 'LUSC')
PlotHeat(object, set = 'both', fontsize = 10, show_rownames = FALSE, show_colnames = FALSE)
## End(Not run)
Set the SubtypeClass
An object of class SubtypeClass with three empty slots