---
title: "Joint Modeling  (e.g., Multiomics) with fido::Orthus"
author: "Justin Silverman"
output: 
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{Joint Modeling  (e.g., Multiomics) with fido::Orthus}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
bibliography: bibliography.bib
---


```r
library(fido)
library(phyloseq)
library(dplyr)

set.seed(48482)
```

If you have not already done so, I would read through the *pibble* vignette before this one. 

# fido::orthus, the model
fido can be used for jointly modeling multivariate count data and multivariate Gaussian data. For example, this would be a reasonable model to jointly model 16S microbiome data and metabolomics data jointly. Because of the "two-headed" nature of this model, e.g., two observed data-sets, I named this model *orthus*, [a two-headed dog and brother of Cerberus in Greek Mythology](https://en.Wikipedia.org/wiki/Orthrus). The *orthus* model can be written as

$$
\begin{align}
Y_j & \sim \text{Multinomial}\left(\pi_j \right)  \\
\pi_j & = \phi^{-1}(\eta_j) \\
\begin{bmatrix}\eta_j \\ Z_j \end{bmatrix} &\sim N(\Lambda X, \Sigma) \\
\Lambda &\sim  N(\Theta, \Sigma, \Gamma) \\
\Sigma &\sim W^{-1}(\Xi, \upsilon) 
\end{align}
$$

Note this looks nearly identical to the *pibble* model but we have appended the second (Gaussian)
dataset ($Z$) onto $\eta$. In doing this, the definition of $\Lambda$ changes (it is now larger
with the bottom rows dictating how the covariates $X$ influence the second dataset). Similarly, 
$\Sigma$ now is much larger and can be though of as 
$$
\Sigma = \begin{bmatrix} \Sigma_{(\eta, \eta)} & \Sigma_{(\eta, Z)} \\
                          \Sigma_{(Z, \eta)} & \Sigma_{(Z, Z)}\end{bmatrix}
$$
where $\Sigma_{(\eta, \eta)}$ describes the covariance between log-ratios (e.g., the covariance
among the multinomial categories in log-ratio space), $\Sigma_{(Z, Z)}$ describes the covariance 
between the dimensions of $Z$ (e.g., between metabolites if Z is metabolomics data), and 
$\Sigma_{(\eta, Z)} = \Sigma_{(Z, \eta)}^T$ represents the covariance between log-ratios and dimensions of $Z$ (e.g., between microbial taxa and metabolites). Similar to $\Sigma$ and $\Lambda$, the parameters $\Xi$
and $\Theta$ undergo a similar expansion to accommodate the second dataset. 

# Joint modeling of Microbial 16S data and Metabolomics
To demonstrate *orthus* I will perform a toy analysis on data from @kashyap2013 and made available by @callahan2016 as part of their recently published microbiome data analysis workflow [@callahan2016]. I follow the data preprocessing of @callahan2016 we just don't drop taxa but instead amalgamate those that don't pass filtering to a category called "other". I do this to maintain the proper variance in the multinomial model.


```r
metab_path <- system.file("extdata/Kashyap2013", "metabolites.csv", package="fido")
microbe_path <- system.file("extdata/Kashyap2013", "microbe.rda", package="fido")
metab <- read.csv(metab_path, row.names = 1)
metab <- as.matrix(metab)
microbe <- get(load(microbe_path))

## Preprocessing ##
# Metabolite Preprocessing
keep_ix <- rowSums(metab == 0) <= 3
metab <- metab[keep_ix, ]

# 16S Preprocesing - plus some weirdness to rename amalgamated category to "other"
keep_ix <- taxa_sums(microbe) > 4
keep_ix <- keep_ix & (rowSums(otu_table(microbe)>2)>3)
microbe <- merge_taxa(microbe, taxa_names(microbe)[!keep_ix])
nms <- taxa_names(microbe)
rnm <- which(taxa_names(microbe)==taxa_names(microbe)[!keep_ix][1])
nms[rnm] <- "other"
taxa_names(microbe) <- nms
rm(nms, rnm)

# bit of preprocessing 
metab <- log10(1 + metab)
```

Now I am going to do just a bit of processing to get data into a format for orthus.
Note I have no extra metadata so we are just going to use an intercept in our model 
at this time. 


```r
Y <- otu_table(microbe, taxa_are_rows=TRUE)
Z <- metab #(metabolites are rows)
X <- matrix(1, 1, phyloseq::nsamples(microbe))

# save dims for easy reference
N <- ncol(Y)
P <- nrow(Z)
Q <- nrow(X)
D <- nrow(Y)
```


Now I am going to set up the priors. My priors are going to be similar to that of *pibble*
but now we need to think about a prior for the covariance among the metabolites and between the metabolites
and the log-ratios of the taxa. Remember, that priors must be defined in the $ALR_D$ (e.g., 
ALR with the reference being the D-th taxa; this may be changed in the future to 
make specifying priors more user friendly).

I am going to form our prior for $\Sigma$ by specifying $\upsilon$ and $\Xi$. 
I will specify that I have weak prior belief that the taxa are independent in terms of their log absolute abundance. 
We can translate this statement about covariance of log absolute abundance into a statement
about log-ratio covariance by pre- and post-multiplying by the $ALR_D$ contrast matrix (which 
I refer to as $GG$ below). Additionally, I believe that there is likely no 
substantial covariance between the taxa and the metabolites and I assume the metabolites are likely independent. 


```r
upsilon <- (D-1+P)+10 # weak-ish prior on covariance over joint taxa and metabolites
Xi <- diag(D-1+P)
GG <- cbind(diag(D-1), -1)
Xi[1:(D-1), 1:(D-1)] <- GG%*%diag(D) %*% t(GG)
Xi <- Xi * (upsilon-D-P) # this scales Xi to have the proper mean we wanted
image(Xi)
```

```{r echo=FALSE, fig.align='center', fig.cap='', out.width='50%'}
knitr::include_graphics('https://github.com/jsilve24/fido/raw/master/vignettes/orthus-prior-structure.png')
```


Note the structure of this prior, everything is independent but there is a moderate positive covariance
between the log-ratios based on their shared definition in terms of the $D$-th taxa. 

The other parts of the prior are less interesting. We are going to 
state that our mean for $\Lambda$ is centered about $\mathbf{0}$ and that the signal-to-noise ratio in the data is approximately 1 (this later part is specified by $\Gamma=I$). 


```r
Gamma <- diag(Q)
Theta <- matrix(0, D-1+P, Q)
```


Finally I fit the model. 

```r
fit <- orthus(Y, Z, X, Theta=Theta, Gamma=Gamma, Xi=Xi, upsilon=upsilon, n_samples=1000)
```

Next we are going to transform the log-ratios from $ALR_D$ to the $CLR$. 
I have written all the transformation functions, *e.g.*, `to_clr` etc... to work 
on `orthusfit` objects in a similar manner to how they work on `pibblefit` objects. 
For `orthusfit` objects they only transform the log-ratio components of parameters
leaving the other parts of inferred model parameters 
(*i.e.*, the parts associated with the metabolites) untouched. 


```r
fit <- to_clr(fit)
print(fit)
#> orthusfit Object: 
#>   Number of Samples:		 12 
#>   Number of Categories:		 114 
#>   Number of Zdimensions:	 405 
#>   Number of Covariates:		 1 
#>   Number of Posterior Samples:	 1000 
#>   Contains Samples of Parameters:Eta  Lambda  Sigma
#>   Coordinate System:		 clr
```

# Investigate Model Results

There are a ton of ways to visualize the inferred model. I could make network diagrams
relating taxa to taxa, taxa to metabolites and metabolites to metabolites. I could 
look at a low dimensional representation of joint covariance to create something very much
akin to canonical correlation analysis (CCA). I could look at how well the metabolites
predict the taxa and vice-versa. But for the sake of simplicity I will do something much simpler. 
Here I am just going to find a list of taxa metabolite covariances that the model is very confident about. 

```r
# First just look ath the cross-covariances fit by the model
# (covariance between taxa in CLR coordinates and metabolites)
# This requires that we extract the corner of Sigma. 
xcor <- fit$Sigma[1:D, D:(D-1+P),]

# Initial preprocessing to speed up computation of posterior intervals
# As there are a lot of cross-covariance terms we are going to first 
# weed down the list of things we have to look at by first pass 
# selecting only those taxa that have a large posterior mean for the covariance
xcor.mean <- apply(xcor, c(1,2), mean)
to.analyze <- fido::gather_array(xcor.mean, cov, taxa, metabolite) %>% 
  arrange(-abs(cov)) %>% 
  .[1:1000,] %>% 
  mutate(tm =paste0(taxa, "_", metabolite))

# Subset Covariance to those we are interested in and calculate posterior 
# confidence intervals. 
xcor.summary <- fido::gather_array(xcor, cov, taxa, metabolite, iter) %>%
  mutate(tm=paste0(taxa, "_", metabolite)) %>% 
  filter(tm %in% to.analyze$tm) %>% 
  mutate(taxa = rownames(Y)[taxa], metabolite = rownames(Z)[metabolite]) %>% 
  group_by(taxa, metabolite) %>% 
  fido:::summarise_posterior(cov) %>% 
  arrange(mean) %>% 
  filter(taxa != 'other') # we don't care about these

# Select those covariances where the model has high certainty (95%) that
# the true covariance is not zero. 
xcor.summary %>% 
  filter(sign(p2.5)==sign(p97.5)) %>% 
  filter(abs(mean) > 2)
#> # A tibble: 218 x 8
#> # Groups:   taxa [17]
#>    taxa  metabolite   p2.5   p25   p50  mean   p75  p97.5
#>    <chr> <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1 722   206.0445922 -6.55 -3.96 -3.08 -3.33 -2.32 -1.51 
#>  2 7816  206.0445922 -5.63 -3.23 -2.45 -2.65 -1.85 -1.05 
#>  3 722   290.9298419 -5.18 -3.09 -2.42 -2.62 -1.85 -1.23 
#>  4 18182 380.1846197 -5.20 -3.07 -2.38 -2.55 -1.83 -1.03 
#>  5 722   181.4504354 -5.19 -3.04 -2.36 -2.55 -1.80 -1.13 
#>  6 722   177.0565368 -4.98 -3.01 -2.30 -2.50 -1.78 -1.08 
#>  7 722   180.072273  -5.03 -3.06 -2.30 -2.49 -1.71 -0.986
#>  8 19517 380.1846197 -5.15 -3.06 -2.33 -2.49 -1.71 -0.952
#>  9 2943  380.1846197 -4.89 -2.97 -2.34 -2.49 -1.83 -1.03 
#> 10 722   176.0343919 -4.93 -2.95 -2.25 -2.48 -1.79 -1.09 
#> # ... with 208 more rows
```

So it looks there there are a few hundred covariances that 
we can be fairly confident about. 

# Qualifications and Caution
Please note, I performed this analysis to demonstrate the use of *orthus* which 
is a model that I have been repeatedly asked for. I think its a cool model and could 
be quite useful in the right circumstances. But I would like to point out a few 
philosophical points about the analysis I performed above. 

First, I performed this analysis just to demonstrate *orthus*. I really don't know 
the data showcased here. What is metabolite `206.0445922`? I have no idea. 
For some reason this is how the metabolites in that dataset were named. For the 
same reason I have left the taxa indexed by sequence variant number. 

Second (and more important), identifying relationships between taxa and metabolites
(or between any two high-dimensional multivariate data-sets) is really difficult! 
Here we are looking at just 114 taxa and 
405 but this leads to 
46170 possible covariances and 
here we only have 12 samples! Yes *orthus* is a Bayesian model, and Yes, Bayesian 
models can be quite useful when there are more parameters than samples, but there is a limit of 
reasonability. Really, Bayesian models are great when you can perfectly capture your
prior beliefs with your prior. But how often can that really be done perfectly?
As such I would caution users, to use *orthus* carefully. Consider which metabolites 
and taxa you really care about and if you can, isolate your analyses to those. 

Alright, that's probably enough philosophizing for an R package Vignette. I hope you enjoy
*orthus*. 

# References