Flow cytometry bioinformatics

Flow cytometry and related methods allow the quantification of multiple independent biomarkers on large numbers of single cells.

It is also possible to characterize data in more comprehensive ways, for example with the density-guided binary space partitioning technique known as probability binning, or with combinatorial gating.

[9] The rapid increase in the dimensionality of flow cytometry data, coupled with the development of high-throughput robotic platforms capable of assaying hundreds to thousands of samples automatically, has created a need for improved computational analysis methods.

A simplified, if not strictly accurate, way of considering flow cytometry data is as a matrix of M measurements by N cells, where each element corresponds to the amount of a measured molecule on one cell.
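As a minimal sketch of this matrix view, the following Python fragment stores one row per cell and one column per measurement (the transpose of the M by N layout described above). The channel names and intensity values are made up for illustration and do not come from any real instrument.

```python
# Hypothetical channel names and made-up intensities, for illustration only.
channels = ["FSC-A", "SSC-A", "CD3", "CD4"]   # M = 4 measurements (columns)
events = [                                    # N = 3 cells (rows)
    [52000.0, 18000.0, 910.0, 640.0],
    [48000.0, 21000.0, 15.0, 8.0],
    [61000.0, 35000.0, 870.0, 12.0],
]


def column(matrix, names, channel):
    """Return the values of one measurement across all cells."""
    j = names.index(channel)
    return [row[j] for row in matrix]


n_cells, n_measurements = len(events), len(channels)
cd3 = column(events, channels, "CD3")
```

Most of the analysis steps described in this article can be thought of as operations on the rows (selecting cells) or on the columns (selecting or combining measurements) of such a matrix.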

One approach is to visualize summary statistics, such as the empirical distribution functions of single dimensions of technical or biological replicates, to ensure that they are similar.
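The comparison of empirical distribution functions can be sketched as follows. The function names are hypothetical, and the largest vertical gap between two curves is used here as one possible numeric summary of how similar two replicates are on a single dimension; it is an illustration, not a prescribed quality-control procedure.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function x -> fraction of values less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def max_ecdf_gap(a, b):
    """Largest vertical distance between the empirical distribution
    functions of a and b (the two-sample Kolmogorov-Smirnov statistic)."""
    fa, fb = ecdf(a), ecdf(b)
    # Both curves are step functions that only change at observed values,
    # so it is enough to evaluate the gap at those values.
    return max(abs(fa(x) - fb(x)) for x in sorted(set(a) | set(b)))
```

A gap close to 0 suggests that the two replicates have nearly identical distributions on that dimension, while a gap close to 1 suggests that they barely overlap.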

[23] With this method, higher standard deviation can indicate outliers, although this is a relative measure as the absolute value depends partly on the number of bins.
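The sketch below assumes that the method in question is a binning "fingerprint": bins are fitted to a reference sample by recursive median splits (a simple form of probability binning), and each sample is then summarized by how its events are distributed over those bins. All function names are hypothetical, and splitting on the highest-variance dimension is an illustrative choice rather than the behaviour of any particular package.

```python
from statistics import median, pstdev, pvariance


def fit_bins(events, depth):
    """Recursively split events at the median of the highest-variance dimension.
    A leaf is None; an internal node is (dim, cut, low_subtree, high_subtree)."""
    if depth == 0 or len(events) < 2:
        return None
    dim = max(range(len(events[0])),
              key=lambda d: pvariance([e[d] for e in events]))
    cut = median(e[dim] for e in events)
    low = [e for e in events if e[dim] <= cut]
    high = [e for e in events if e[dim] > cut]
    return (dim, cut, fit_bins(low, depth - 1), fit_bins(high, depth - 1))


def bin_index(tree, event, depth):
    """Walk the tree and return the bin number in the range [0, 2**depth)."""
    idx = 0
    for _ in range(depth):
        if tree is None:          # early leaf: pad the index with zeros
            idx *= 2
            continue
        dim, cut, low, high = tree
        go_high = event[dim] > cut
        idx = idx * 2 + int(go_high)
        tree = high if go_high else low
    return idx


def fingerprint(tree, events, depth):
    """Observed count in each bin divided by the count expected if the
    events were spread evenly over all bins."""
    counts = [0] * (2 ** depth)
    for e in events:
        counts[bin_index(tree, e, depth)] += 1
    expected = len(events) / len(counts)
    return [c / expected for c in counts]


def fingerprint_sd(tree, events, depth):
    """Standard deviation of the fingerprint; larger values mean the sample
    fills the reference bins less evenly."""
    return pstdev(fingerprint(tree, events, depth))


# Made-up data: an evenly spread reference and a sample piled into one corner.
reference = [(x, y) for x in range(8) for y in range(8)]
outlier = [(0, 0)] * 16
tree = fingerprint_tree = fit_bins(reference, 2)
```

Because the expected count per bin shrinks as the number of bins grows, the same sample yields a different standard deviation at a different depth, which is why the value is only meaningful relative to other samples binned in the same way.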

Normalization methods to remove technical variance, frequently derived from image registration techniques, are thus a critical step in many flow cytometry analyses.

[24] The complexity of raw flow cytometry data (dozens of measurements for thousands to millions of cells) makes answering questions directly using statistical tests or supervised learning difficult.

Thus, a critical step in the analysis of flow cytometric data is to reduce this complexity to something more tractable while establishing common features across samples.

The regions on these plots can be sequentially separated, based on fluorescence intensity, by creating a series of subset extractions, termed "gates".
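Sequential gating can be illustrated with a short Python sketch in which each gate is a rectangle on two channels and each gate is applied only to the events that passed the previous one. The channel names, thresholds and event values are made up for illustration, and real gates are often polygons or ellipses rather than rectangles.

```python
def rectangle_gate(events, x, y, x_range, y_range):
    """Keep the events whose values on channels x and y fall inside the rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [e for e in events
            if x_lo <= e[x] <= x_hi and y_lo <= e[y] <= y_hi]


# Made-up events; each dictionary is one cell.
events = [
    {"id": 1, "FSC": 50, "SSC": 20, "CD3": 900, "CD4": 800},
    {"id": 2, "FSC": 55, "SSC": 15, "CD3": 950, "CD4": 10},
    {"id": 3, "FSC": 90, "SSC": 80, "CD3": 5, "CD4": 600},
    {"id": 4, "FSC": 45, "SSC": 25, "CD3": 20, "CD4": 5},
]

# First gate on the scatter channels, second gate on the surviving subset.
lymphocytes = rectangle_gate(events, "FSC", "SSC", (30, 70), (5, 30))
cd4_t_cells = rectangle_gate(lymphocytes, "CD3", "CD4", (500, 1e5), (500, 1e5))
```

The order of the gates defines a hierarchy: event 3 has a high CD4 value but never reaches the second gate, because it was excluded by the first.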

These gates can be produced using software such as FlowJo,[28] FCS Express,[29] WinMDI,[30] CytoPaint (aka Paint-A-Gate),[31] VenturiOne, Cellcion, CellQuest Pro, Cytospec,[32] and Kaluza.

In datasets with a low number of dimensions and limited cross-sample technical and biological variability (e.g., clinical laboratories), manual analysis of specific cell populations can produce effective and reproducible results.

[36] To address this issue, principal component analysis has been used to summarize the high-dimensional datasets using a combination of markers that maximizes the variance of all data points.
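The idea of a marker combination that maximizes variance can be sketched by computing the first principal component. The fragment below uses power iteration on the sample covariance matrix, chosen here only because it needs nothing beyond the standard library; practical analyses would normally rely on a linear algebra library. The function name and the default iteration count are assumptions made for this example.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.
    rows holds one list of marker values per cell; at least two cells with
    non-zero variance are assumed. Power iteration can fail if the start
    vector happens to be orthogonal to the leading component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the markers (m x m).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    v = [1.0 / math.sqrt(m)] * m          # unit-length start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered cell onto the direction found.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores


# Made-up data in which the second marker is always twice the first.
direction, scores = first_principal_component([[0, 0], [1, 2], [2, 4], [3, 6]])
```

In this toy dataset all of the variance lies along a single line, so the direction returned is proportional to (1, 2) and one score per cell is enough to describe the data.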

Density-based down-sampling and clustering were used to better represent rare populations and to control the time and memory complexity of the minimum spanning tree construction process.
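The effect of density-based down-sampling on rare populations can be sketched as follows. This is an illustrative simplification, not the published procedure: local density is taken to be the number of events within a fixed radius, and each event is kept with a probability inversely proportional to that density. The function names, the `radius` and `target_density` parameters and the data are all assumptions made for the example.

```python
import random


def local_density(points, radius):
    """For each point, count the points (including itself) within radius.
    This brute-force version needs O(N^2) distance computations."""
    r2 = radius * radius
    return [sum(1 for q in points
                if sum((a - b) ** 2 for a, b in zip(p, q)) <= r2)
            for p in points]


def density_downsample(points, radius, target_density, seed=0):
    """Keep each point with probability min(1, target_density / density)."""
    rng = random.Random(seed)
    densities = local_density(points, radius)
    return [p for p, d in zip(points, densities)
            if rng.random() < min(1.0, target_density / d)]


# Made-up data: one abundant population and one rare population.
dense = [(i * 0.001, 0.0) for i in range(200)]
rare = [(10 + i * 0.001, 10.0) for i in range(5)]
kept = density_downsample(dense + rare, radius=1.0, target_density=5)
```

Events in sparse regions are always kept, while events in crowded regions are thinned, so the rare population makes up a much larger share of the reduced dataset that is passed on to clustering and tree construction.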

The FlowCAP (Flow Cytometry: Critical Assessment of Population Identification Methods) project, with active participation from most academic groups with research efforts in the area, is providing a way to objectively cross-compare state-of-the-art automated analysis approaches.

[54] After identification of the cell population of interest, a cross-sample analysis can be performed to identify phenotypical or functional variations that are correlated with an external variable (e.g., a clinical outcome).
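A cross-sample analysis of this kind can be sketched by correlating the frequency of each population, measured in every sample, with the external variable. The population names, frequencies and outcome values below are made up, and the Pearson correlation is used only as one simple example of a measure of association; a real study would also need a significance test and a correction for multiple testing.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up frequencies of two populations in four samples, and a made-up
# external variable with one value per sample.
frequencies = {
    "CD4+ T cells": [0.10, 0.20, 0.30, 0.40],
    "B cells": [0.30, 0.10, 0.40, 0.20],
}
outcome = [1.0, 2.0, 3.0, 4.0]

# Rank the populations by the strength of their association with the outcome.
ranked = sorted(frequencies,
                key=lambda name: abs(pearson(frequencies[name], outcome)),
                reverse=True)
```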

flowType and RchyOptimyx (as discussed above) expand this technique by adding the ability to explore the impact of independent markers on the overall correlation with the external outcome.

ISAC is considering replacing FCS with a flow cytometry specific version of the Network Common Data Form (netCDF) file format.

[66] It captures relations among data, metadata, analysis files and other components, and includes support for audit trails, versioning and digital signatures.

The lack of gating interoperability has traditionally been a bottleneck preventing reproducibility of flow cytometry data analysis and the usage of multiple analytical tools.

To address this shortcoming, ISAC developed Gating-ML, an XML-based mechanism to formally describe gates and related data (scale) transformations.

[10] The draft recommendation version of Gating-ML was approved by ISAC in 2008, and it is partially supported by tools such as FlowJo, the flowUtils and CytoML libraries in R/Bioconductor, and FlowRepository.

This new version offers slightly less flexibility in terms of the power of gating description; however, it is also significantly easier to implement in software tools.

Although it was originally designed for the field of flow cytometry, it is applicable in any domain that needs to capture either fuzzy or unambiguous classifications of virtually any kind of object.

AutoGate[68] performs compensation, gating, preview of clusters, exhaustive projection pursuit (EPP), multidimensional scaling and phenogram, and produces a visual dendrogram to express HiD readiness.

A web-based interface provides easy access to these tools and allows the creation of automated analysis pipelines enabling reproducible research.

Firstly, CytoBank, which is a complete web-based flow cytometry data storage and analysis platform, has been made available to the public in a limited form.

A representative of this class of datasets is a study which includes analysis of two bone marrow samples using more than 30 surface or intracellular markers under a wide range of different stimulations.

[86] This was echoed by another group of Pacific Biosciences and Stanford University researchers, who suggested that cloud computing could enable centralized, standardized, high-throughput analysis of flow cytometry experiments.

[87] This article was adapted from the following source under a CC BY 4.0 license (2013) (reviewer reports): Kieran O'Neill; Nima Aghaeepour; Josef Spidlen; Ryan Brinkman (5 December 2013).

Schematic diagram of a flow cytometer, showing focusing of the fluid sheath, laser, optics (in simplified form, omitting focusing), photomultiplier tubes (PMTs), analogue-to-digital converter, and analysis workstation
Representation of flow cytometry data from an instrument with three scatter channels and 13 fluorescent channels. Only the values for the first 30 (of hundreds of thousands of) cells are shown.
An example pipeline for analysis of FCM data and some of the Bioconductor packages relevant to each step.
Two-dimensional scatter plots covering all three combinations of three chosen dimensions. The colors show the comparison of the consensus of eight independent manual gates (polygons) and automated gates (colored dots). The consensus of the manual gates and the algorithms were produced using the CLUE package.[25] Figure reproduced from [26].
Cell populations in a high-dimensional mass-cytometry dataset manually gated after dimension reduction using 2D layout for a minimum spanning tree. Figure reproduced from the data provided in [40].
An example of frequency difference gating, created using the flowFP Bioconductor package. The dots represent individual events in an FCS file. The rectangles represent the bins.
Overview of the flowType/RchyOptimyx pipeline for identification of correlates of protection against HIV: First, tens of thousands of cell populations are identified by combining one-dimensional partitions (panel one). The cell populations are then analyzed using a statistical test (and Bonferroni's method for multiple testing correction) to identify those correlated with the survival information. The third panel shows a complete gating hierarchy describing all possible strategies for gating that cell population. This graph can be mined to identify the "best" gating strategy (i.e., the one in which the most important markers appear earlier). These hierarchies for all selected phenotypes are demonstrated in panel 4. In panel 5, these hierarchies are merged into a single graph that summarizes the entire dataset and demonstrates the trade-off between the number of markers involved in each phenotype and the significance of the correlation with the clinical outcome (e.g., as measured by the Kaplan–Meier estimator in panel 6). Figure reproduced in part from [53] and [54].