Coracle is an artificial intelligence framework to identify microbes associated with a continuous physiological variable.
In our case, we measured the standardized thermal tolerance of many corals using CBASS assays with subsequent
ED50 modeling (Voolstra et al. 2020; Voolstra et al. 2021) and examined prokaryote association using 16S rRNA metabarcoding.
We wanted to determine whether specific prokaryotes (bacteria) were indicative of increased thermal tolerance,
so we started Coracle to answer this question.
In principle, any continuous phenotypic variable can be queried against a microbial assemblage.
The framework is designed to make the most of smaller datasets and thus sacrifices runtime efficiency on larger datasets.
Coracle uses an ensemble approach that combines different preprocessing steps and different machine learning methods
and integrates their results into one comprehensible score.
It is meant to be a decision-'maker' that picks prokaryote candidates for further examination.
In the following form you are asked to upload your data in two files.
First, the continuous physiological variable file should include one column that specifies the sample IDs
and one column that holds the values of the target variable.
A column header is necessary but can be empty.
The second file is a prokaryote abundance file that should include the sample IDs in the first column.
Each following column represents one microbial group, with the 'group' name (taxonomic annotation) as the column header
and the bacterial abundance as values.
Microbial abundance and target variable should have the same number of rows and the
same sequence of sample IDs.
Data tables can be uploaded as comma-separated files (the .csv extension is required)
or as tab-delimited files (either .tsv or .txt).
If you try to upload other file types or the dimensions of your files do not match, an error will be shown.
Example files can be found in the Tutorial.
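The requirements above can be checked locally before uploading. The following is a minimal sketch, assuming pandas is available; the helper names `read_table` and `check_inputs` and the example values are hypothetical and not part of Coracle itself.

```python
import io

import pandas as pd


def read_table(source, filename):
    """Read a .csv (comma) or .tsv/.txt (tab) table with sample IDs in column 1."""
    if filename.endswith(".csv"):
        sep = ","
    elif filename.endswith((".tsv", ".txt")):
        sep = "\t"
    else:
        raise ValueError(f"unsupported file type: {filename}")
    return pd.read_csv(source, sep=sep, index_col=0)


def check_inputs(y, x):
    """Return a list of problems; an empty list means the two tables line up."""
    problems = []
    if y.shape[1] != 1:
        problems.append("target file must hold exactly one value column")
    if len(y) != len(x):
        problems.append(f"row counts differ: {len(y)} vs {len(x)}")
    elif list(y.index) != list(x.index):
        problems.append("sample IDs differ or are in a different order")
    return problems


# Tiny in-memory stand-ins for the two uploads (made-up values).
y_csv = io.StringIO("sample,ED50\nS1,35.2\nS2,36.1\nS3,34.8\n")
x_csv = io.StringIO("sample,Fam_A,Fam_B\nS1,10,0\nS3,4,7\nS2,3,12\n")

y = read_table(y_csv, "y.csv")
x = read_table(x_csv, "x_fam.csv")

print(check_inputs(y, x))               # the sample order differs here
print(check_inputs(y, x.loc[y.index]))  # reordering x by y's IDs resolves it
```

Reordering the abundance table with `x.loc[y.index]` is usually the simplest way to bring both files into the same sample sequence.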
The runtime of Coracle grows with both the number of samples (n) and the number of bacterial groups (k).
Although in the worst case complexity is shown to be [n²k³log(k)], in practice it seems to be around [nk²].
Thus, the number of bacterial groups is the main driver of the runtime, and we strongly suggest aggregating
to higher taxonomic levels.
The current version of Coracle is limited in both the runtime (24h) and the maximal number of microbial groups (10000).
We therefore recommend starting at the family or order level.
Insights at the lowest levels (ASV/OTU) are possible by feeding Coracle only with the members
of microbial groups that were successful in a previous run (e.g. at family or order level).
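Restricting a second run to the members of previously successful groups can be sketched as follows. This is only an illustration with made-up data: the tables `x` and `tax` and the list `selected_families` are hypothetical stand-ins for your own abundance table, taxonomy and first-run results.

```python
import pandas as pd

# Made-up example: ASV abundances (samples x ASVs) and their taxonomy.
x = pd.DataFrame(
    {"ASV1": [5, 0, 2], "ASV2": [1, 3, 0], "ASV3": [0, 8, 4], "ASV4": [7, 7, 1]},
    index=["S1", "S2", "S3"],
)
tax = pd.DataFrame(
    {"Family": ["Fam_A", "Fam_B", "Fam_A", "Fam_C"]},
    index=["ASV1", "ASV2", "ASV3", "ASV4"],
)

# Hypothetical outcome of a first, family-level Coracle run.
selected_families = ["Fam_A"]

# Keep only the ASVs that belong to one of the selected families.
keep = tax.index[tax["Family"].isin(selected_families)]
x_sub = x.loc[:, x.columns.isin(keep)]

print(list(x_sub.columns))
```

The reduced table `x_sub` keeps all samples but far fewer columns, which is what keeps the runtime of an ASV-level run manageable.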
Upload your datafiles and run Coracle
UniCor is a feature score for quantitative, hierarchical datasets.
It combines a feature’s association with the target and its uniqueness relative to other features in the same group.
The score is computed from feature–target correlation and the average feature–feature correlation and lies in the range −0.5 to 1.
UniCorP applies UniCor in a bottom-up propagation across the hierarchy.
At each level it evaluates features within their parent group, selects UNICORNs using a top-k rule
(k highest scores per level) and propagates selected features to the next higher level (e.g., species → genus).
This repeats until the highest level is reached, enriching upper levels with informative features.
Scoring can use Pearson or Spearman correlation. An optional preprocessing step for scoring can apply relative abundance
or the centered log-ratio (CLR) transformation to reflect compositional structure.
These settings affect scoring only; propagation is determined by the selection rule at each level.
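The two preprocessing options and the two correlation types can be illustrated with a short sketch. It shows the standard definitions only and is not UniCorP's implementation. The pseudocount of 0.5 used to avoid log(0) is an assumption, as are the example values; Spearman correlation is computed here as Pearson correlation on ranks.

```python
import numpy as np
import pandas as pd


def relative_abundance(counts):
    """Divide each sample (row) by its total so that every row sums to 1."""
    return counts.div(counts.sum(axis=1), axis=0)


def clr(counts, pseudocount=0.5):
    """Centered log-ratio per sample: log(x) minus the row mean of log(x).

    The pseudocount is an assumption made here to handle zero counts.
    """
    logged = np.log(counts + pseudocount)
    return logged.sub(logged.mean(axis=1), axis=0)


# Made-up counts (samples x features) and target values.
counts = pd.DataFrame(
    {"A": [10, 0, 5], "B": [30, 20, 5], "C": [60, 80, 10]},
    index=["S1", "S2", "S3"],
)
target = pd.Series([35.2, 36.1, 34.8], index=counts.index)

rel = relative_abundance(counts)
transformed = clr(counts)

# Feature-target correlation: Pearson, and Spearman as Pearson on ranks.
pearson = transformed.corrwith(target)
spearman = transformed.rank().corrwith(target.rank())
print(pearson.round(2), spearman.round(2), sep="\n")
```

After CLR every sample is centered at zero, so values describe a feature relative to the geometric mean of its sample rather than its raw count.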
In the following form you are asked to upload your data in three files.
The first two files (target variable and feature set) are prepared as follows:
First, the continuous physiological target variable file should include one column that specifies the sample IDs
and one column that holds the values of the target variable.
A column header is necessary.
The second file is the continuous feature matrix (e.g. a prokaryote abundance file). It should include the 'sample' IDs in the first column;
each following column represents one feature (OTU/ASV), with the 'feature' ID (taxonomic annotation, lowest hierarchical level) as the column header
and the feature values (bacterial abundance) as numeric values.
The hierarchical (e.g. taxonomic) structure should be prepared in a third file, with the 'feature' IDs (OTU/ASV) in the first column and the
complete hierarchical information in the following columns. The column headers should represent the different
hierarchical levels and should be in either ascending or descending order.
It is recommended to fill null values within the hierarchy with the next higher annotation.
Feature set and target variable should have the same number of rows and the
same sequence of sample IDs (case sensitive).
Feature set and hierarchical information should have the same feature IDs (case sensitive).
Data tables can be uploaded as comma-separated files (the .csv extension is required)
or as tab-delimited files (either .tsv or .txt).
The file types of all uploaded files have to match.
Example files can be found in the Tutorial.
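Two of the preparation steps above, filling gaps in the hierarchy and checking that the feature IDs agree, can be sketched as follows. The example assumes pandas and a hierarchy file whose columns run from the highest to the lowest level; for the opposite column order, `bfill` would take the place of `ffill`. All names and values are made up.

```python
import pandas as pd

# Made-up hierarchy, columns ordered from highest to lowest level.
tax = pd.DataFrame(
    {
        "Phylum": ["Proteobacteria", "Proteobacteria", "Bacteroidota"],
        "Family": ["Rhodobacteraceae", None, "Flavobacteriaceae"],
        "Genus": ["Ruegeria", None, None],
    },
    index=["ASV1", "ASV2", "ASV3"],
)

# Fill each missing annotation with the next higher annotation in the same row.
filled = tax.ffill(axis=1)
print(filled)

# Made-up feature matrix; its column headers must match the hierarchy's IDs.
x = pd.DataFrame(
    {"ASV1": [5, 0], "ASV2": [1, 3], "ASV3": [0, 8]},
    index=["S1", "S2"],
)

# Case-sensitive comparison of feature IDs in both directions.
missing_in_tax = set(x.columns) - set(filled.index)
missing_in_x = set(filled.index) - set(x.columns)
print(missing_in_tax, missing_in_x)
```

Both sets should be empty before uploading; any ID listed in either of them points to a feature that is present in only one of the two files.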
In principle, UniCorP follows the cost of computing pairwise correlations across all features,
which scales as
$O(n \cdot m^2)$
with n = number of samples and m = number of features.
With hierarchical grouping, correlations are computed within groups, giving a per-level cost of
$O\left(n \cdot \sum_{g} m_g^2\right)$
where $m_g$ is the size of group g. Under approximately equal group sizes, the sum becomes
$O(n \cdot m^2 / G)$
with G = number of groups, which is substantially smaller than $O(n \cdot m^2)$.
The selection rule influences group sizes across levels.
With top-k per level, the number of propagated features is fixed by k, which stabilizes group sizes and
runtime across levels. Overall, runtime grows linearly with the number of samples and with the number of hierarchical levels processed.
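A small worked example makes the saving from grouping concrete. It assumes that the cost of pairwise correlations is proportional to the number of samples times the squared number of features being compared; constants are ignored and the numbers are made up.

```python
def full_cost(n, m):
    """Cost units for correlating all m features with each other over n samples."""
    return n * m ** 2


def grouped_cost(n, group_sizes):
    """Cost units when correlations are only computed within each group."""
    return n * sum(size ** 2 for size in group_sizes)


n, m, G = 100, 1000, 50        # samples, features, groups (made-up numbers)
equal_groups = [m // G] * G    # 50 groups of 20 features each

print(full_cost(n, m))                                       # 100000000
print(grouped_cost(n, equal_groups))                         # 2000000
print(full_cost(n, m) // grouped_cost(n, equal_groups))      # 50
```

With equally sized groups the saving equals the number of groups G. Uneven groups reduce the saving, because the largest group dominates the sum of squared group sizes.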
Upload your datafiles and run UniCorP
Hierarchical Coracle (HiCoracle) extends Coracle with hierarchical feature selection.
It combines a bottom-up enrichment step with a top-down skimming step before final modeling at the lowest level.
Bottom-up (UniCorP): Starting at the lowest (most specific) hierarchical level,
UniCorP identifies uniquely correlated features (UNICORNs) within each group and propagates only the selected features upward.
This repeats level by level until the highest (least specific) level is enriched.
Top-down skimming (TDS): From the enriched highest level, HiCoracle selects informative groups and propagates
them downward through the hierarchy. This reduces the number of features that reach the lowest level.
Modeling: At the lowest level, the reduced feature set is analyzed with Coracle to quantify associations between
features and the continuous target variable.
In the following form you are asked to upload your data in three files.
The first two files (target variable and feature set) are prepared as follows:
First, the continuous physiological target variable file should include one column that specifies the sample IDs
and one column that holds the values of the target variable.
A column header is necessary.
The second file is the continuous feature matrix (e.g. a prokaryote abundance file). It should include the 'sample' IDs in the first column;
each following column represents one feature (OTU/ASV), with the 'feature' ID (taxonomic annotation, lowest hierarchical level) as the column header
and the feature values (bacterial abundance) as numeric values.
The hierarchical (e.g. taxonomic) structure should be prepared in a third file, with the 'feature' IDs (OTU/ASV) in the first column and the
complete hierarchical information in the following columns. The column headers should represent the different
hierarchical levels and should be in either ascending or descending order.
It is recommended to fill null values within the hierarchy with the next higher annotation.
Feature set and target variable should have the same number of rows and the
same sequence of sample IDs (case sensitive).
Feature set and hierarchical information should have the same feature IDs (case sensitive).
Data tables can be uploaded as comma-separated files (the .csv extension is required)
or as tab-delimited files (either .tsv or .txt).
The file types of all uploaded files have to match.
Example files can be found in the Tutorial.
Runtime complexity. HiCoracle is less sensitive to very large feature sets than Coracle
because it first restricts computations to within-group correlations during bottom-up propagation and then
limits the number of features passed downward during top-down skimming.
Bottom-up (UniCorP): The dominant cost per level is computing correlations within groups, which scales as
$O\left(n \cdot \sum_{g} m_g^2\right)$
with n samples and $m_g$ features in group g.
Using top-k selection per level keeps the number of propagated features fixed and stabilizes runtime.
Top-down skimming (TDS): Runtime is controlled by the cap on features kept per level (n_features).
Larger caps select more children and increase cost. Smaller caps reduce cost but pass fewer features to the next level.
Note:
For simplicity we set top_k = n_features to reach stable runtime and resolution at the lowest level.
Optimal values depend on the dataset and the structure of its hierarchy.
Upload your datafiles and run HiCoracle
In the following we show how to use Coracle and give a short tutorial
on the data handling required for our tools.
Code examples are given for R and Python, and example datasets
for all relevant steps are available for download. Depending on your dataset, not all steps
may be necessary, so feel free to skip irrelevant steps.
First we provide the original dataset:
The dataset consists of the 16S OTU abundance file
of the CBASS84 study (Voolstra et al. 2021).
The data table includes the sample IDs in the first column and the
associated ED50 thermal tolerance values (°C) in the second column.
The third and fourth columns contain metadata, and the remaining columns hold the ASVs,
with the ASV IDs as column headers and their absolute abundances for each sample ID as rows.
These tables can be downloaded as comma-separated files (.csv).
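The snippets in the following steps use two tables named ASV and tax that have to be loaded first. A minimal Python sketch of this loading step is shown here, assuming pandas and that both files carry their IDs in the first column. The helper `load_tables` is hypothetical, and the in-memory tables are made-up stand-ins; with the real data you would pass the paths of the downloaded .csv files.

```python
### 1. Load the ASV table and the taxonomic information
import io

import pandas as pd


def load_tables(asv_source, tax_source):
    """Read both tables and use their first column (IDs) as the row index."""
    ASV = pd.read_csv(asv_source, index_col=0)
    tax = pd.read_csv(tax_source, index_col=0)
    return ASV, tax


# Made-up stand-ins for the downloaded files; replace with your file paths.
asv_csv = io.StringIO(
    "sample,ED50,meta1,meta2,ASV1,ASV2\n"
    "S1,35.2,a,b,10,0\n"
    "S2,36.1,a,c,3,12\n"
)
tax_csv = io.StringIO("ASV,Family\nASV1,Fam_A\nASV2,Fam_B\n")

ASV, tax = load_tables(asv_csv, tax_csv)
print(ASV.columns.tolist())
```

Because the sample IDs become the row index, ED50 and the two metadata columns are the first three remaining columns, which is why the later steps select the ASVs with `iloc[:, 3:]`.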
In the next step we split the ASV dataset to obtain our target variable in a separate file:
### 2. Split Target Variable and feature set
y = ASV["ED50"].to_frame()
x = ASV.iloc[:, 3:]
Python
### 2. Split Target Variable and feature set
y <- ASV[, "ED50", drop = FALSE]  # drop = FALSE keeps y as a data frame with the sample IDs as row names
x <- ASV[, -c(1:3)]
R
You can run UniCorP and HiCoracle directly from these three files:
the feature matrix (x), the continuous target variable (y), and the taxonomic hierarchy (tax).
For Coracle analyses, we aggregate to a higher (less specific) taxonomic level (e.g., Family) to reduce dimensionality,
since Coracle works best with at most a few hundred features.
In order to access different taxonomic levels we have to merge
the taxonomic information with the ASV dataset (without ED50 and metadata):
### 3. Combine ASV with taxonomic information
merged = tax.merge(ASV.iloc[:,3:].transpose(), right_index=True, left_index=True)
Python
### 3. Combine ASV with taxonomic information
merged <- cbind(tax, t(ASV[, 4:ncol(ASV)]))
R
... and aggregate the absolute abundances according
to the groups of one of the taxonomic levels, if necessary.
In this case we aggregate at the family level to get a good
tradeoff between the number of features, the resolution of our dataset
and the corresponding performance of our models.
### 4. Aggregate according to taxonomic level (e.g. Family)
ASV_family = merged.groupby(["Family"]).sum()
# Split ASV data from taxonomic information
# (iloc[4:] assumes that four other taxonomy columns remain after grouping)
ASV_family = ASV_family.transpose().iloc[4:, :].astype("int32")
ASV_family.to_csv(directory + "x_fam.csv")
Python
### 4. Aggregate according to taxonomic level (e.g. Family)
library(dplyr)
ASV_family <- merged %>%
  group_by(Family) %>%
  summarize(across(where(is.numeric), ~ sum(.x, na.rm = TRUE))) %>%
  t() %>%
  as.data.frame()
# Set the first row (family names) as column headers
colnames(ASV_family) <- as.character(ASV_family[1, ])
ASV_family <- ASV_family[-1, ]
write.csv(ASV_family, file = paste0(directory, "x_fam.csv"), row.names = TRUE)
R
We can now run Coracle with the files
y (ED50/target variable) and x_fam (abundance at family level).
Both files meet all requirements:
microbial abundance and target variable have the same number of rows and share the same sequence of sample IDs.
Now we can upload the prepared data tables
(x_fam at the family level),
enter an email address (to which the results will be sent)
and click on run Coracle.
Coracle might take a few minutes to run.
If you leave the tab open, a landing page will be loaded once Coracle is finished.
There you can take a first look at your results, read a short explanation and download your files as a .csv file.
Additionally, the explanation and a download link for your results will be sent to the email address provided.
No registration is necessary. Timeout errors can occur while waiting for the result page to load.