CATO scores for functional noncoding variants

Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo
Matt Maurano et al., Nat. Genetics 2015

CATO (Contextual Analysis of TF Occupancy) scores

CATO scores pre-computed for dbSNP

V1.1: dbSNP142 -- we provide several data sets. If you do not know where to begin, start with the first one.
- All 13.4M SNPs overlapping DHSs [ starch (231MB) | bgzip (460MB) + tabix index ]
- All 34.1M SNPs overlapping motifs, regardless of whether they overlap a DHS [ starch (461MB) ]
  These may be used to derive scores for variants in DHSs from samples not used in our catalog, but see notes #1-2 below regarding cell-type specificity.
- All 104M SNPs in dbSNP142 used as input (no scores, just the SNPs) [ starch (539MB) ]
Files are in hg19 BED format with a header line. The format has changed relative to Data S3: the alleles are now relative to the plus strand (rather than to the motif), and a new Cell_types column includes the names of cell types having a DNase hotspot over each SNP. Finally, an annotation bug affecting the PhastCons and CpG island-overlap values for some SNPs has been corrected, resulting in minor changes to the scores of these SNPs.
V1.0 (obsolete; for archival only): dbSNP138 - 7.0M SNPs overlapping DHSs. [ bgzip + tabix index ]
File is in hg19 BED format with a header line (same columns as Data S3; indeed this is a superset).

Notes:

1. We recommend considering the CATO score for SNPs outside a DHS to be 0.
2. While CATO scores are not themselves cell-type specific, the cell types with a DHS are listed in the Cell_types column. For studies focusing on a predefined subset of relevant cell types, SNPs without a DHS in the appropriate cell type should be treated as having score = 0. This can also be done by intersecting the CATO scores with the DHS tracks available below using `bedops -e`.
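
The cell-type restriction described in the notes above can be sketched with plain awk, as an alternative to the `bedops -e` route. This is a minimal sketch, not the authors' procedure: the helper name `cato_mask_by_celltype` is hypothetical, the name of the score column is passed in by the caller because it is not stated here, and the Cell_types column is assumed to hold a comma-separated list. Columns are located through the header line rather than by position.

```shell
# Hypothetical helper: set the score to 0 for SNPs that have no DHS in the
# requested cell type.
# Usage: cato_mask_by_celltype CELLTYPE SCORE_COLUMN_NAME < scores.bed
# Assumptions (not from the source): tab-delimited input whose first line is
# the header, and a comma-separated Cell_types column.
cato_mask_by_celltype() {
    awk -F '\t' -v OFS='\t' -v ct="$1" -v scorecol="$2" '
        NR == 1 {
            # Locate the two columns by name in the header line.
            for (i = 1; i <= NF; i++) {
                if ($i == "Cell_types") ctcol = i
                if ($i == scorecol) sc = i
            }
            if (!ctcol || !sc) {
                print "required column not found in header" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        {
            # Exact match against each listed cell type.
            n = split($ctcol, types, ",")
            hit = 0
            for (j = 1; j <= n; j++) if (types[j] == ct) hit = 1
            if (!hit) $sc = 0
            print
        }'
}
```

Rows for SNPs with a DHS in the requested cell type pass through unchanged; all other rows keep their coordinates and annotations but carry a score of 0.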

R linear models

RData object containing two lists of models which can be used with the predict() function in R:

results.fits.enriched is a list of models for 313 PWMs. PWM names are in the list names.
results.fit.scale.enriched contains another list with models enabling the conversion of the raw logistic score into a common score normalized by the proportion of overlapping SNPs demonstrating allelic imbalance.

Master list of DHSs

These master lists were used for MCV (standing for "multi-cell verified" and indicating cell-type selectivity) calculations (in conjunction with `bedmap --count`).
The files are compressed in starch format; install the BEDOPS package and run `unstarch aisamples.dhs.hotspots.starch` to extract them.

hotspots (1.2GB) delineate the full hypersensitive region (no FDR thresholding)
hotspot peaks (1.2GB) delineate fixed 150-bp regions around the cleavage maxima (1% FDR)

The .bed name (4th) column contains the sample name. To make individual .bed files per cell type, try:

unstarch aisamples.dhs.hotspots.starch | awk -F "\t" 'BEGIN {OFS="\t"} {print > ($4 ".bed")}'

The 5th column contains the filtered DNase-seq tag count in the listed sample, and the 6th column is the count normalized to 1M tags per sample. These master lists include a selected set of malignant or immortalized lines that were utilized in certain analyses.
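
As a rough illustration of how the sample-name column supports a multi-cell count, the sketch below reports how many distinct samples have a master-list element covering a single position. It is an assumption-laden stand-in for the `bedmap --count` approach mentioned above, not a reimplementation of the MCV calculation: the helper name `count_samples_at` is hypothetical, and the position is taken to be 0-based, following the BED convention of half-open intervals.

```shell
# Hypothetical helper: count distinct samples (4th column) whose element
# covers one position.
# Usage: unstarch aisamples.dhs.hotspots.starch | count_samples_at CHROM POS
# Assumption: POS is 0-based and intervals are half-open [start, end).
count_samples_at() {
    awk -F '\t' -v chrom="$1" -v pos="$2" '
        $1 == chrom && ($2 + 0) <= (pos + 0) && (pos + 0) < ($3 + 0) && !($4 in seen) {
            seen[$4] = 1   # count each sample only once
            n++
        }
        END { print n + 0 }'
}
```

This scans the whole stream for every query, so it is only practical for spot checks; for many SNPs at once, the `bedmap --count` route is the better fit.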

TF Clusters

We have organized PWMs from major databases (TRANSFAC, JASPAR, UniPROBE, Taipale) into clusters of similar sequence specificities by clustering TOMTOM similarity scores:

Visualization of TF clusters [HTML; slow to load | PDF] showing clusters with weblogos to represent each PWM.
Weblogos are available in PNG and EPS formats for each PWM and its reverse complement (i.e. *.rc).
List of PWMs, including TF gene name, cluster name.
Strand and offset are provided from the TOMTOM alignment to enable a consistent visualization of motifs in a given cluster.

Genotypes

Partially filtered genotypes [ VCF (3GB) | .vcf.tbi | .vcfidx ]

Note that these genotypes represent an intermediate analysis file. Please see the description in the Online Methods for further details, and consider restricting any analysis to the SNPs in Supplementary Data Set 1.

Allelic imbalance per cell type

Format: one file per cell type, in bed3+boolean format (the boolean indicates whether the site was imbalanced in that cell type). snps.multicell.bed contains bed3 data describing all SNPs tested in at least one cell type [ tgz ]

These are the data from Fig. 3b-e.
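
Once the archive is unpacked, the per-cell-type files can be tallied with a short awk pass. The following is a sketch under stated assumptions rather than a description of the actual files: the helper name `count_imbalanced` is hypothetical, the boolean in the 4th column is assumed to be written as 1/0 or TRUE/FALSE, and the files are assumed to have no header line.

```shell
# Hypothetical helper: for each bed3+boolean file, print the file name, the
# number of imbalanced sites, and the number of sites tested.
# Usage: count_imbalanced FILE [FILE...]
# Assumptions (not from the source): the 4th column is 1/0 or TRUE/FALSE
# (any case), and there is no header line.
count_imbalanced() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 { order[++nfiles] = FILENAME }   # remember input order
        { total[FILENAME]++ }
        $4 == "1" || toupper($4) == "TRUE" { hits[FILENAME]++ }
        END {
            for (i = 1; i <= nfiles; i++) {
                f = order[i]
                print f, hits[f] + 0, total[f]
            }
        }' "$@"
}
```

The output is one tab-separated line per input file, in the order the files were given.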

Errata and comments on the published manuscript

Supplementary Data Set 3 posted on the Nature Genetics website (ng.3432-S7.zip) has an incorrect file extension, which may cause trouble with some decompression software. The file should be named ng.3432-S7.txt.gz
The Online Methods states that we used a fifth-order hidden Markov model as the background model for the fimo motif scans. While we did specify a fifth-order background model in the fimo command line, fimo uses only the zeroth term and so the effect was simply to set the G+C content value.
The Online Methods incorrectly states one of the terms in the CATO logistic regression model: "Width of DHS" should be "log(Width of DHS)".
