--- title: "Introduction to the R package statVisual" author: - affiliation: Sanofi name: Wenfei Zhang, Weiliang Qiu, Xuan Lin, Donghui Zhang date: "`r Sys.Date()`" output: html_document: number_sections: true toc: true header-includes: - \usepackage{setspace} - \doublespacing vignette: | %\VignetteIndexEntry{Introduction to statVisual} %\VignetteKeyword{statVisual} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8](inputenc) --- ```{r setup, include = FALSE} knitr::opts_chunk$set( echo = FALSE, collapse = TRUE, comment = "#>", fig.width = 12, fig.height = 8, fig.align = "center" ) ``` # Introduction Biomarkers are widely used in pharmaceutical industry for drug discovery and development at various stages, from preclinical animal study to phase I- III and post market clinical trials, and can be used for target identification, diseased diagnostics, patient stratification, treatment prediction and etc. High-throughput assay technology enables the collection of various types of biomarkers, such as gene expression and various omics biomarkers. In high-throughput assays, a large number of biomarkers are measured in a single experiment, but subject to substantial known or unknown variability. Hence biomarkers from high-throughput assays yield two major characteristics, including high dimensionality of the data and relative large data variability. Due to the two characteristics, it is critical and challenging to visualize the biomarker data and corresponding statistical results to ensure the data quality and the reliability of downstream statistical analysis results. Therefore we developed this R package as a visualization tool for generating the commonly used plots in analyzing biomarkers from high-throughput assays, including data quality control, individual biomarker analysis and multivariate analysis. The tool also included an analysis pipeline for analyzing biomarkers in the setting of two groups comparison with the flexibility to extend to customized or project specific analysis. The R package *statVisual* provide novel solutions to the users by utilizing many powerful R base functions and R packages. For example, the function *hist* in the R package *graphics* can draw the histogram for a set of observations. However, to visualize histograms for two or more groups of observations in one figure, the users need to write their own code. The R package *statVisual* provides the function *Hist* to help the users to obtain such figure in one command. This vignette illustrates the usages of the functions provided by the R package *statVisual*. # System Requirements R (>=3.5.0) is required to run the R package *statVisual* properly. # Load Packages The following R packages are required to be installed before installing and running *statVisual*: ```{r eval = T, echo=T, message=F} # packages in Bioconductor library(Biobase) # base package for Bioconductor library(limma) # linear models for continuous omics data library(pvca) # principal variance component analysis # packages in CRAN library(dplyr) # data manipulation and pipe operation library(factoextra) # extract and visualize results of multivariate data analysis library(forestplot) # forest plot library(gbm) # generalized boosted regression models library(GGally) # extension to 'ggplot2' library(ggdendro) # dendrogram for data clustering library(ggfortify) # data visualization tools for statistical analysis results library(ggplot2) # create graphics based on "The Grammer of Graphics" library(ggrepel) # tidy text display in ggplot library(glmnet) # cross validation plot for glmnet library(grDevices) # R graphics devices and support for colors and fonts library(gridExtra) # Grid graphics library(knitr) # dynamic report generation library(methods) # formal methods and classes library(pROC) # display and analyze ROC curves library(randomForest) # Random forest variable importance library(reshape2) # flexibly reshape data library(rmarkdown) # dynamic documents for R library(rpart.plot) # plots for recursive partitioning for classification, regression and survival trees library(tibble) # simple data frames library(stats) # basic statistical functions ``` To load *statVisual* package, please type the following R statement: ```{r eval = T, echo=T, message=F} library(statVisual) ``` To check the information about the *statVisual* package, please type the following R statement: ```{r eval = F, echo = T, message = F} library(help = statVisual) ``` To find the usage of a function (e.g., *Hist*) in *statVisual*, please use the *help* function or use *?*. For example, ```{r eval = F, echo = T, message = F} help(Hist) ?Hist ``` # Available functions and datasets ## Available functions Below is a list of currently available functions in *statVisual* for plotting. * Analysis focusing on one outcome variable: + Hist - Compare groups based on histograms + Den - Compare groups based on density plots * Analysis focusing on two outcome variables: + XYscatter - Compare groups based on scatter plots + stackedBarPlot - Compare groups based on bar plots + BiAxisErrBar - Compare groups based on bi-axis error bar plots * Analysis of longitudinal data: + LinePlot - Compare groups based on trajectory plots (commonly used to visualize tumor volume data) + Box - Compare groups based on boxplots across time + ErrBar - Compare groups based on dotplots across time + barPlot - Compare groups based on barplots across time * Analysis focusing on pattern detection: + Dendro - Compare groups based on dendrogram of hierarchical clustering + iprcomp - Calculate principal components (missing values allowed) + PCA_score - Scatter plot of the 2 specified PCs + Heat - Heatmap with row names colored by groups + PVCA - PVCA plot + Volcano - Volcano Plot with option to label significant results * Analysis focusing on prediction: + BoxROC - Compare boxplots with ROC curve + cv_glmnet_plot - Cross validation plot of glmnet + ImpPlot - Plot of variable importance * The overall wrapper function: + statVisual - the wrapper function ## Available datasets * diffCorDat: A simulated dataset for illustrating differential correlations between cases and controls The simulated data set *diffCorDat* contains expression levels of 2 gene probes for 50 cases and 50 controls. The expression levels of probe1 are generated from $N(0, 1)$. The expression levels of probe2 for controls are also generated from $N(0, 1)$. The expression levels of probe 2 for cases are generated from the formula \begin{equation} probe2_{i} = -probe1_{i} + e_i, i=1, \ldots, nCases, \end{equation} where $e_i\sim N(0, 0.3^2)$. That is, the expression levels of probe 1 and probe 2 are negatively correlated in cases, but not correlated in controls. To load *diffCorDat*, we can use the following R code: ```{r eval = T, echo = T, message = F} data(diffCorDat) print(dim(diffCorDat)) print(diffCorDat[1:2,]) ``` * esSim: A simulated gene expression dataset for differential expression analysis The dataset *esSim* was generated based on the R code in the manual of the function \textit{lmFit} of the R Bioconductor package \textit{limma}. There are 100 probes and 20 samples (10 controls and 10 cases). The first 3 probes are over-expressed in cases. The 4-th and 5-th probes are under-expressed in cases. The remaining 95 probes are non-differentially expressed between cases and controls. Expression levels for 100 probes were first generated from normal distribution with mean 0 and standard deviation varying between probes ($sd=0.3\sqrt{4/\chi^2_4}$). For the 3 OE probes, we add 2 to the expression levels of the 10 cases. For the 2 UE probes, we subtract 2 from the expression levels of the 10 cases. To load *esSim*, we can use the following R code: ```{r eval = T, echo = T, message = F} data(esSim) print(dim(esSim)) print(esSim) ``` * genoSim: A simulated genotype dataset *genoSim* is an ExpressionSet object containing genotype data of 10 SNPs for 100 subjects (50 cases and 50 controls). Eight of SNPs have same minor allele frequency ($MAF=0.2$) between cases and controls. The other 2 SNPs have the different MAFs between cases and controls ($MAF_{cases}=0.4$ and $MAF_{controls}=0.2$). We assume Hardy-Weinberg Equilibrium. That is, the genotype for wild-type homozygote is $(1-MAF)^2$; the genotype for heterozygote is $2*MAF*(1-MAF)$; and the genotype for mutant homozygote is $MAF^2$. The phenotype of the ExpressionSet object contains two variables: subject id ($sid$) and case-control status ($grp$). The feature data contains two variables: snp id ($snp$) and SNP significance status ($memSNPs$). * longDat: A simulated dataset for longitudinal data analysis The dataset *longDat* is generated from the following mixed effects model for repeated measures: \begin{equation} y_{ij}=\beta_{0i}+\beta_1 t_{j} + \beta_2 grp_{2i} + \beta_3 grp_{3i} + \beta_4 \times\left(t_{j}\times grp_{2i}\right) + \beta_5 \times\left(t_{j}\times grp_{3i}\right) +\epsilon_{ij}, \end{equation} where $y_{ij}$ is the outcome value for the $i$-th subject measured at $j$-th time point $t_{j}$, $grp_{2i}$ is a dummy variable indicating if the $i$-th subject is from group 2, $grp_{3i}$ is a dummy variable indicating if the $i$-th subject is from group 3, $\beta_{0i}\sim N\left(\beta_0, \sigma_b^2\right)$, $\epsilon_{ij}\sim N\left(0, \sigma_e^2\right)$, $i=1,\ldots, n, j=1, \ldots, m$, $n$ is the number of subjects, and $m$ is the number of time points. When $t_j=0$, the expected outcome value is \begin{equation} E\left(y_{ij}\right)=\beta_0+\beta_2 dose_{2i} + \beta_3 dose_{3i}. \end{equation} Hence, we have at baseline \begin{equation} E\left(y_{ij}\right)=\beta_0,\; \mbox{for dose 1 group},\\ E\left(y_{ij}\right)=\beta_0 + \beta_2,\; \mbox{for dose 2 group},\\ E\left(y_{ij}\right)=\beta_0 + \beta_3,\; \mbox{for dose 3 group}. \end{equation} For dose 1 group, the expected outcome values across time is \begin{equation} E\left(y_{ij}\right)=\beta_0+\beta_1 t_{j}. \end{equation} We also can get the expected difference of outcome values between dose 2 group and dose 1 group, between dose 3 group and dose 1 group, and between dose 3 group and dose 2 group: \begin{equation} E\left(y_{ij} - y_{i'j}\right) =\beta_2+\beta_4 t_{j},\;\mbox{for subject $i$ in dose 2 group and subject $i'$ in dose 1 group}, \end{equation} \begin{equation} E\left(y_{kj} - y_{i'j}\right) =\beta_3+\beta_5 t_{j},\;\mbox{for subject $k$ in dose 3 group and subject $i'$ in dose 1 group}, \end{equation} \begin{equation} E\left(y_{kj} - y_{ij}\right) =\left(\beta_3 -\beta_2\right)+\left(\beta_5-\beta_4\right) t_{j},\;\mbox{for subject $k$ in dose 3 group and subject $i$ in dose 2 group}. \end{equation} We set $n=90$, $m=6$, $\beta_0=5$, $\beta_1=0$, $\beta_2=0$, $\beta_3=0$, $\beta_4=2$, $\beta_5=-2$, $\sigma_e=1$, $\sigma_b=0.5$, and $t_{j}=j$, $j=0, \ldots, m-1$. That is, the trajectories for dose 1 group are horizontal with mean intercept at $5$, the trajectories for dose 2 group are linearly increasing with slope $2$ and mean intercept $5$, and the trajectories for dose 3 group are linearly decreasing with slope $-2$ and mean intercept $5$. To load *longDat*, we can use the following R code: ```{r eval = T, echo = T, message = F} data(longDat) print(dim(longDat)) print(longDat[1:2,]) ``` \pagebreak # Analysis focusing on one outcome variable: ## Compare Groups Based on Histograms A common task in statistical comparison is to compare the mean values among groups. The reason to comparing the summary statistics (means) is to simplify the problem of comparing two distributions since it is hard to numerically compare two distributions. However, we can easily compare two distributions by visualizing the empirical distributions (e.g., histograms). To compare histograms across groups, we can use the function *Hist*: ```{r message = F, eval = T, echo = T, warning = F} # expression data dat = exprs(esSim) print(dim(dat)) print(dat[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat[1:2,]) # feature data fDat = fData(esSim) print(dim(fDat)) print(fDat[1:2,]) # choose the first probe which is over-expressed in cases pDat$probe1 = dat[1,] # check histograms of probe 1 expression in cases and controls pDat$grp=factor(pDat$grp) print(table(pDat$grp, useNA = "ifany")) ``` ```{r message = F, eval = F, echo = T, warning = F} Hist( data = pDat, y = 'probe1', group = 'grp') ``` We also can use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'Hist', data = pDat, y = 'probe1', group = 'grp') ``` \pagebreak ## Compare Groups Based on Density Plots We can compare two distribution by comparing the estimated density functions. The function *Den* is used to visualize the differences of densities across groups. ```{r message = F, eval = F, echo = T, warning = F} Den( data = pDat, y = 'probe1', group = 'grp') ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'Den', data = pDat, y = 'probe1', group = 'grp') ``` \pagebreak # Analysis focusing on two outcome variables: ## Compare Groups Based on Scatter Plots The correlation is an important statistic to evaluate the linear relationship between two continuous variables. To check the linearity and to visualize the strength of the linear relationship, we can draw scatter plot. Some time, it is of interest to compare the correlations among groups. The function *XYscatter* can help the comparison by display the scatter plots across groups in one figure: For example, to check if the relationship between Sepal length vs. Sepal width is the same across different species, we can use the R code: ```{r message = F, eval = F, echo = T, warning = F} XYscatter( data = diffCorDat, x = 'probe1', y = 'probe2', group = 'grp', title = 'Scatter Plot: probe1 vs probe2') ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'XYscatter', data = diffCorDat, x = 'probe1', y = 'probe2', group = 'grp', title = 'Scatter Plot: probe1 vs probe2') ``` \pagebreak ## Compare Groups Based on Bar Plots For two categorical variables, we can use the function *stackedBarPlot* to show their association. For example, in the ExpressionSet object *genoSim*, there are simulated genotypes of 10 SNPs for 50 cases and 50 control. If we would like to know if the pattern of the genotypes of the SNP 1 in cases is the same as that in controls, we can draw bar plots. ```{r message = F, eval = T, echo = T, warning = F} data(genoSim) pDat = pData(genoSim) geno = exprs(genoSim) pDat$snp1 = geno[1,] print(table(pDat$snp1, pDat$grp, useNA="ifany")) ``` ```{r message = F, eval = F, echo = T, warning = F} stackedBarPlot(dat = pDat, catVar = "snp1", group = "grp", xlab = "snp1", ylab = "Count", group.lab = "grp", title = "Stacked barplots of counts for SNP1", catVarLevel = NULL) ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'stackedBarPlot', dat = pDat, catVar = "snp1", group = "grp", xlab = "snp1", ylab = "Count", group.lab = "grp", title = "Stacked barplots of counts for SNP1", catVarLevel = NULL) ``` Note that the input parameter *catVarLevel* can be used to change the order of the values of *catVar* shown in x-axis. For example, ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'stackedBarPlot', dat = pDat, catVar = "snp1", group = "grp", xlab = "snp1", ylab = "Count", group.lab = "grp", title = "Stacked barplots of counts for SNP1", catVarLevel = c(2, 0, 1)) ``` \pagebreak ## Compare Groups Based on Bi-Axis Error Bar Plots Some time, we would like to compare two outcomes with different scales across groups in one figure using error bar plots. The function *BiAxisErrBar* can do this task. Each bar plot displays mean \eqn{+/-}{+/-} standard error. ```{r message = F, eval = T, echo = T, warning = F} library(tidyverse) library(ggplot2) print(head(mtcars)) print(table(mtcars$gear, useNA="ifany")) ``` ```{r message = F, eval = F, echo = T, warning = F} BiAxisErrBar( dat = mtcars, group = "gear", y.left = "mpg", y.right = "wt") ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = "BiAxisErrBar", dat= mtcars, group = "gear", y.left = "mpg", y.right = "wt") ``` \pagebreak # Analysis of Longitudinal Data ## Compare Groups Based on Trajectory Plots In clinical trial, it is common to collect data at multiple time points. Therefore, it is natural to compare groups based on the trajectories of individual subjects across time. The function *LinePlot* can do this task: ```{r message = F, eval = F, echo = T,warning = F} LinePlot( data = longDat, x = 'time', y = 'y', sid = 'sid', group = 'grp') ``` We also can use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T,warning = F} statVisual(type = "LinePlot", data = longDat, x = 'time', y = 'y', sid = 'sid', group = 'grp') ``` \pagebreak ## Compare Groups Based on Box Plots Across time If there are many individuals in a longitudinal dataset, the trajectory plot might look messy. In this case, we develop the function *Box* to compare groups using boxplots at each time point. In addition, line segments are used to connect the mean/median of each boxplot of the same group across time to show the differences between the mean trajectories. Note that this function is suitable for the scenarios where observations of all subjects are measured at a few fixed time points. ```{r message = F, eval = T, echo = T,warning = F} library(dplyr) ``` ```{r message = F, eval = F, echo = T,warning = F} Box( data = longDat, x = 'time', y = 'y', group = 'grp', title = "Boxplots across time") ``` Or we can use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T,warning = F} statVisual(type = 'Box', data = longDat, x = 'time', y = 'y', group = 'grp', title = "Boxplots across time") ``` \pagebreak ## Compare Groups Based on Dot Plots Across Time Similarly, we can use dotplots to replace boxplots and add error bar (mean +/- se) for each dotplot. The reason why not add error bar to boxplot in *Box* function is to avoid confusing between the error bars and bars in boxplots. The function to compare groups based on dot plots across time is *ErrBar*. Note that this function is suitable for the scenarios where observations of all subjects are measured at a few fixed time points. ```{r message = F, eval = F, echo = T, warning = F} ErrBar( data = longDat, x = 'time', y = 'y', group = 'grp', title = "Dot plots across time") ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'ErrBar', data = longDat, x = 'time', y = 'y', group = 'grp', title = "Dot plots across time") ``` \pagebreak ## Compare Groups Based on Bar Plots Across Time Similarly, we can use barplots to replace boxplots and add error bar (mean +/- se) for each barplot. The reason why not add error bar to boxplot in *Box* function is to avoid confusing between the error bars and bars in boxplots. The function to compare groups based on bar plots across time is *barPlot*. Note that this function is suitable for the scenarios where observations of all subjects are measured at a few fixed time points. ```{r message = F, eval = F, echo = T, warning = F} barPlot( data = longDat, x = 'time', y = 'y', group = 'grp', title = "Bar plots across time") ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'barPlot', data = longDat, x = 'time', y = 'y', group = 'grp', title = "Bar plots across time") ``` \pagebreak # Analysis focusing on pattern detection: ## Compare Groups Based on Dendrogram of Hierarchical Clustering Hierarchical clustering is an exploratory tool to find patterns in data. Dendrogram can be used to visualize the hierarchical clustering results. By coloring the nodes of the dendrogram based on the group information, we can check if the clusters identified by the hierarchical clustering match with the known groups. The wrapper function *Dendro* can do this task: ```{r message = F, eval = T, echo = T, warning = F} library(ggdendro) data(esSim) dat = exprs(esSim) print(dim(dat)) print(dat[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat[1:2,]) # choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe) pDat$probe1 = dat[1,] pDat$probe2 = dat[2,] pDat$probe3 = dat[3,] pDat$probe4 = dat[4,] pDat$probe5 = dat[5,] pDat$probe6 = dat[6,] print(pDat[1:2, ]) # check histograms of probe 1 expression in cases and controls pDat$grp=factor(pDat$grp) print(table(pDat$grp, useNA = "ifany")) ``` ```{r message = F, eval = F, echo = T, warning = F} Dendro( x = pDat[, c(3:8)], group = pDat$grp) ``` We also can use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T,warning = F} statVisual(type = 'Dendro', x = pDat[, c(3:8)], group = pDat$grp) ``` \pagebreak ## Scatter Plot of the First 2 PCs The scatter plot of 2 specified principal components (PCs), e.g., the first 2 PCs, can also be used to detect patterns (e.g., batch effects) in data. By coloring the data points in the scatter plots based on the group information, we can check if the clusters identified by the PCA plot match with the known groups. The wrapper function *PCA_Score* can do this task. Note that he R function *prcomp* could not handle data with missing values. We developed an improved function *iprcomp* so that it can handle missing values by replacing the missing values in the dataset by median of the corresponding variable. This is just a temporary solution. The user can use their own imputation method before calling R function *prcomp*. We first check if *iprcomp* could capture the pattern in the original data without missing value. ```{r message = F, eval = T, echo = T, warning = F} # generate simulated data set.seed(1234567) dat.x = matrix(rnorm(500), nrow = 100, ncol = 5) dat.y = matrix(rnorm(500, mean = 2), nrow = 100, ncol = 5) dat = rbind(dat.x, dat.y) grp = c(rep(0, 100), rep(1, 100)) print(dim(dat)) res = iprcomp(dat, center = TRUE, scale. = FALSE) # for each row, set one artificial missing value dat.na=dat nr=nrow(dat.na) nc=ncol(dat.na) for(i in 1:nr) { posi=sample(x=1:nc, size=1) dat.na[i,posi]=NA } res.na = iprcomp(dat.na, center = TRUE, scale. = FALSE) ## # pca plot ## par(mfrow = c(3,1)) # original data without missing values plot(x = res$x[,1], y = res$x[,2], xlab = "PC1", ylab = "PC2") # perturbed data with one NA per probe # the pattern of original data is captured plot(x = res.na$x[,1], y = res.na$x[,2], xlab = "PC1", ylab = "PC2", main = "with missing values") par(mfrow = c(1,1)) ``` It looks like *iprcomp* captures the original pattern by replacing missing values with meidans of corresponding variables. More thorough investigations are warranted. We next draw pca plot based on the *esSim* dataset. ```{r message = F, eval = T, echo = T, warning = F} data(esSim) dat = exprs(esSim) print(dim(dat)) print(dat[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat[1:2,]) # choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe) pDat$probe1 = dat[1,] pDat$probe2 = dat[2,] pDat$probe3 = dat[3,] pDat$probe4 = dat[4,] pDat$probe5 = dat[5,] pDat$probe6 = dat[6,] print(pDat[1:2, ]) # check histograms of probe 1 expression in cases and controls pDat$grp=factor(pDat$grp) print(table(pDat$grp, useNA = "ifany")) library(factoextra) pca.obj = iprcomp(pDat[, c(3:8)], scale. = TRUE) # scree plot factoextra::fviz_eig(pca.obj, addlabels = TRUE) ``` ```{r message = F, eval = F, echo = T, warning = F} PCA_score(prcomp_obj = pca.obj, dims = c(1, 3), data = pDat, color = 'grp', loadings = FALSE) ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'PCA_score', prcomp_obj = pca.obj, dims = c(1, 2), data = pDat, color = 'grp', loadings = FALSE) ``` \pagebreak ## Heatmap with Row Names Colored by Groups Heatmap can be used to visualize the patterns among data. Usually results of bi-clustering will be superimposed to the heatmap. To check if the bi-clustering results match the known group information, we can color the nodes (i.e., rownames of the heatmap) by groups. The function *Heat* can do this task: ```{r message = F, eval = T, echo = T, warning = F} data(esSim) dat = exprs(esSim) print(dim(dat)) print(dat[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat[1:2,]) # choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe) pDat$probe1 = dat[1,] pDat$probe2 = dat[2,] pDat$probe3 = dat[3,] pDat$probe4 = dat[4,] pDat$probe5 = dat[5,] pDat$probe6 = dat[6,] print(pDat[1:2, ]) pDat$grp=factor(pDat$grp) ``` ```{r message = F, eval = F, echo = T, warning = F} Heat( data = pDat[, c(2:8)], group = 'grp') ``` We also can use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T,warning = F} statVisual(type = 'Heat', data = pDat[, c(2:8)], group = 'grp') ``` \pagebreak ## PVCA plot Principal variance component analysis (PVCA) is proposed to estimate the variability of experimental effects including batch by hybridizing two popular data analysis methods: principal component analysis (PCA) and variance components analysis (VCA). It can be used as a screening tool to determine which sources of variability (biological, technical or other) are most prominent in a given microarray dataset (https://www.niehs.nih.gov/research/resources/software/biostatistics/pvca/index.cfm). The function *PVCA* draws the plot of the weighted average proportion variance versus effects: ```{r message = F, eval = T, echo = T, warning = F} library(pvca) # create a fake Batch variable data(esSim) esSim$Batch=c(rep("A", 4), rep("B", 6), rep("C", 10)) ``` ```{r message = F, eval = F, echo = T, warning = F} PVCA( clin_data = pData(esSim), clin_subjid = "sid", gene_data = exprs(esSim), batch.factors = c("grp", "Batch")) ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'PVCA', clin_data = pData(esSim), clin_subjid = "sid", gene_data = exprs(esSim), batch.factors = c("grp", "Batch")) ``` \pagebreak ## Volcano Plot Volcano plot can be used to check if the significant results are reasonable or not. Intuitively, the significant results (with low p-value) should have large absolute values of regression coefficients (in linear regression) or $log2$(odds ratios) (in logistic regression). To draw the plot of the relationship between fold change (odds ratio) vs. $-log10$(p value) with the option to label significant results, we can use the function *Volcano*: ```{r message = F, eval = T, echo = T, warning = F} library(ggrepel) library(limma) library(ggrepel) library(limma) # load the simulated dataset data(esSim) print(esSim) # expression levels y = exprs(esSim) print(dim(y)) print(y[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat) # design matrix design = model.matrix(~grp, data = pDat) print(design) options(digits = 3) # Ordinary fit fit <- lmFit(y, design) fit2 <- eBayes(fit) # get result data frame resFrame = topTable(fit2,coef = 2, number = nrow(esSim)) print(dim(resFrame)) print(resFrame[1:2,]) resFrame$sigFlag = resFrame$adj.P.Val < 0.05 resFrame$probe = rownames(resFrame) # make sure set NA to genes non-differentially expressed resFrame$probe[which(resFrame$sigFlag == FALSE)] = NA print(resFrame[1:2,]) print(table(resFrame$sigFlag, useNA = "ifany")) ``` ```{r message = F, eval = F, echo = T, warning = F} Volcano( resFrame = resFrame, stats = 'logFC', p.value = 'P.Value', group = 'sigFlag', rowname.var = 'probe', point.size = 1) ``` We also can use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T,warning = F} statVisual(type = 'Volcano', resFrame = resFrame, stats = 'logFC', p.value = 'P.Value', group = 'sigFlag', rowname.var = 'probe', point.size = 1) ``` \pagebreak # Analysis focusing on prediction: ## Compare Box Plots with ROC Curve To compare two distributions, we also can use parallel boxplots. Usually, the ultimate purpose of comparing two distribution is to evaluate if the variable can be used to predict the groups with high accuracy. Hence, we develop the function *BoxROC* to put parallel boxplots and ROC curve in the same figure. The area under ROC curve is also shown in the figure. ```{r message = F, eval = T, echo = T, warning = F} library(dplyr) library(gridExtra) data(esSim) print(esSim) # expression data dat = exprs(esSim) print(dim(dat)) print(dat[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat[1:2,]) pDat$grp = factor(pDat$grp) # choose the first probe which is over-expressed in cases pDat$probe1 = dat[1,] # check histograms of probe 1 expression in cases and controls print(table(pDat$grp, useNA = "ifany")) ``` ```{r message = F, eval = F, echo = T, warning = F} BoxROC( data = pDat, group = 'grp', y = 'probe1', point.size = 1) ``` We also can use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'BoxROC', data = pDat, group = 'grp', y = 'probe1', point.size = 1) ``` \pagebreak ## Cross validation plot for glmnet Cross validation plot can be used to visualize the estimated performance as a function of Lagrange multiplier. For continuous endpoint, mean square error (MSE) is used as performance metric. The function *cv_glmnet_plot* can do this task: ```{r message = F, eval = T, echo = T, warning = F} library(dplyr) library(tibble) library(glmnet) data(esSim) print(esSim) # expression data dat = exprs(esSim) print(dim(dat)) print(dat[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat[1:2,]) # feature data fDat = fData(esSim) print(dim(fDat)) print(fDat[1:2,]) # choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe) pDat$probe1 = dat[1,] pDat$probe2 = dat[2,] pDat$probe3 = dat[3,] pDat$probe4 = dat[4,] pDat$probe5 = dat[5,] pDat$probe6 = dat[6,] print(pDat[1:2, ]) # check histograms of probe 1 expression in cases and controls print(table(pDat$grp, useNA = "ifany")) pDat$grp = factor(pDat$grp) ``` ```{r message = F, eval = F, echo = T, warning = F} cv_glmnet_plot(x = as.matrix(pDat[, c(3:8)]), y = pDat$grp, family = "binomial") ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = "cv_glmnet_plot", x = as.matrix(pDat[, c(3:8)]), y = pDat$grp, family = "binomial") ``` \pagebreak ## Variable Importance plot The function *ImpPlot* can be used to visualize the relative importance of different variables in predicting outcome for random forest model. ```{r message = F, eval = T, echo = T, warning = F} library(dplyr) library(randomForest) library(tibble) data(esSim) dat = exprs(esSim) print(dim(dat)) print(dat[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat[1:2,]) # choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe) pDat$probe1 = dat[1,] pDat$probe2 = dat[2,] pDat$probe3 = dat[3,] pDat$probe4 = dat[4,] pDat$probe5 = dat[5,] pDat$probe6 = dat[6,] print(pDat[1:2, ]) pDat$grp=factor(pDat$grp) rf_m = randomForest( x = pDat[, c(3:8)], y = pDat$grp, importance = TRUE, proximity = TRUE ) ``` ```{r message = F, eval = F, echo = T, warning = F} ImpPlot(rf_m) ``` We can also use the wrapper function *statVisual*: ```{r message = F, eval = T, echo = T, warning = F} statVisual(type = 'ImpPlot', rf_m) ``` # Discussion The R package *statVisual* that we developed extends existing R plot functions to help visualizing differences among groups for TM/BM applications. We notice that R package *GGally* provides a powerful visualization function *ggpairs* that can put pairwise histograms, estimated densities, scatter plots, boxplots, correlation coefficients in one figure. For example, ```{r message = F, eval = T, echo = T, warning = F} library(GGally) data(esSim) dat = exprs(esSim) print(dim(dat)) print(dat[1:2,]) # phenotype data pDat = pData(esSim) print(dim(pDat)) print(pDat[1:2,]) # choose the first 6 probes (3 OE probes, 2 UE probes, and 1 NE probe) pDat$probe1 = dat[1,] pDat$probe2 = dat[2,] pDat$probe3 = dat[3,] pDat$probe4 = dat[4,] pDat$probe5 = dat[5,] pDat$probe6 = dat[6,] print(pDat[1:2, ]) pDat$grp=factor(pDat$grp) ggpairs(data = pDat, mapping = ggplot2::aes_string(color = 'grp'), columns = c('probe1', 'probe5', 'probe6'), upper = list(continuous = "cor", combo = "box_no_facet", discrete = "facetbar", na = "na"), lower = list(continuous = "points", combo = "facethist", discrete = "facetbar", na = "na"), diag = list(continuous = "densityDiag", discrete = "barDiag", na = "naDiag"), xlab = 'X', ylab = 'Y', title = 'Title') ``` *GGally* also provides a function *ggcorr* to superimpose correlations onto the heatmap. For instance, ```{r message = F, eval = T, echo = T, warning = F} ggcorr(data = pDat[, c(3:8)], method = 'pairwise', label = TRUE, label_round = 2, label_size = 4) ``` Our R package *statVisual* is a useful complement to *GGally* since *statVisual* provides many functions (e.g., *BoxROC*, *LinePlot*, *Box*, *ErrBar*) that *GGally* does not provide. We welcome comments and suggestions to improve *statVisual*. In future, after appropriate improvements (e.g., replacing the internal datasets with public available datasets or simulated datasets) we will submit *statVisual* to CRAN (https://cran.r-project.org/) and submit a manuscript to a peer-reviewed journal so that other researchers can use this R package to facilitate their data analyses for TM/BM applications.