How the data were prepared

The CMap dataset consists of a collection of transcriptional expression data from cultured human cells treated with bioactive small molecules and genetic perturbagens.

Each of the cell lines tested (15 diverse cell lines to date) is treated with a suite of over 20,000 chemical and genetic perturbagens. The diversity of the cell lines and perturbagens is intended to make the data useful across a wide range of biological research questions.

Gene-expression data from experiments in which genes are either knocked out or overexpressed are cataloged in the CMap database.

Notes on data generation and collection

Invariant Set Scaling

To eliminate artifacts (non-biological sample variation) from the data, we use a rank invariant-set scaling procedure. We assume that there exist sets of genes whose expression is stable across diverse samples and that can therefore be applied consistently to standardize the samples.

To define such gene sets, we analyzed a large compendium of expression profiles and identified genes whose expression is relatively invariant (coefficient of variation, CV < 10%). To obtain invariant control genes across the entire range of gene expression, we picked genes with low CVs at 10 different absolute levels of expression. Furthermore, because no one gene is perfectly invariant, we chose 8 genes at each of the 10 levels, to yield a total of 80 control genes. These 80 control genes are measured in every sample and are used to construct a calibration curve for each sample. The calibration curve serves the dual purpose of normalizing the sample and assessing its quality.
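
The selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used to build the control set: the function name select_invariant_genes, the use of equal-sized bins of mean expression to define the 10 levels, and the choice of the lowest-CV genes within each bin are all assumptions made for the example. Only the CV < 10% cutoff, the 10 levels, and the 8 genes per level come from the text.

```python
import numpy as np

def select_invariant_genes(expr, n_levels=10, genes_per_level=8, max_cv=0.10):
    """Pick low-CV control genes spread across the expression range.

    expr: 2-D array (genes x samples) of positive, linear-scale expression.
    Returns a list of n_levels index arrays, ordered from low to high expression.
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1)
    cv = expr.std(axis=1) / mean          # coefficient of variation per gene

    # Keep only genes below the CV cutoff (10% in the text).
    candidates = np.flatnonzero(cv < max_cv)

    # ASSUMPTION: define the levels as equal-sized bins of mean expression.
    ordered = candidates[np.argsort(mean[candidates])]
    bins = np.array_split(ordered, n_levels)

    # ASSUMPTION: within each bin, take the genes with the lowest CV.
    return [b[np.argsort(cv[b])[:genes_per_level]] for b in bins]
```

With the default arguments and enough candidate genes, the function returns 10 groups of 8 gene indices, i.e. 80 control genes in total.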

Normalization Procedure

The invariant set scaling procedure is applied on a per-sample basis as follows:

  1. The sample is log2 transformed. This helps stabilize the variance and linearize the shape of the calibration curve.
  2. A sample calibration curve is computed using the median expression of the 8 invariant genes at each of the 10 pre-defined invariant levels.
  3. The entire sample is rescaled using a reference calibration curve computed from a large compendium of expression profiles. Specifically, the relationship between the sample and the reference curve is modeled as y = ax^b + c, where x is the unscaled data, and a, b, and c are constants determined empirically from the sample using non-linear regression. Each data point in the sample is then rescaled using the estimated constants.
  4. The rescaled data are thresholded to lie in the range [0, 15].
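
The four steps above can be sketched as follows. This is an illustrative reading of the procedure under stated assumptions, not the production implementation. The function names (fit_power_law, invariant_set_scale) and argument names are hypothetical. The fit treats the sample calibration curve as x and the reference calibration curve as y. The non-linear regression is replaced here by a simple stand-in: the exponent b is scanned over a grid, and a and c are solved by linear least squares for each candidate b. Raw intensities are assumed to be greater than 1, so that the log2 values are positive.

```python
import numpy as np

def fit_power_law(x, y, b_grid=np.linspace(0.1, 3.0, 291)):
    """Fit y = a * x**b + c. Stand-in for a general non-linear solver:
    scan b over a grid and solve for a and c by linear least squares."""
    best = None
    for b in b_grid:
        design = np.column_stack([x ** b, np.ones_like(x)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], b, coef[1])
    _, a, b, c = best
    return a, b, c

def invariant_set_scale(sample, level_indices, reference_curve, lo=0.0, hi=15.0):
    """Rescale one sample against a reference calibration curve.

    sample: 1-D array of raw intensities (ASSUMED > 1).
    level_indices: one index array per invariant level (e.g. 10 arrays of 8).
    reference_curve: reference value for each invariant level (log2 scale).
    """
    # Step 1: log2 transform.
    logged = np.log2(np.asarray(sample, dtype=float))
    # Step 2: sample calibration curve = median of the invariant genes per level.
    sample_curve = np.array([np.median(logged[idx]) for idx in level_indices])
    # Step 3: fit y = a*x^b + c from sample curve to reference curve, then
    # apply the fitted constants to every data point in the sample.
    a, b, c = fit_power_law(sample_curve, np.asarray(reference_curve, dtype=float))
    rescaled = a * logged ** b + c
    # Step 4: threshold to the range [lo, hi] ([0, 15] in the text).
    return np.clip(rescaled, lo, hi)
```

After scaling, the medians of the invariant genes at each level should lie close to the reference curve; large residuals between the two would be one way to flag a poor-quality sample, in line with the quality-assessment role of the calibration curve.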