Regulatory Build

The Regulatory Build provides a single 'best guess' set of regulatory features, based on the information contained in the Ensembl funcgen database. Output and supporting data from the Regulatory Build are available in 'Region in Detail' and the various regulation displays. Configuration is available under the 'Regulation' menu item.

The rest of this document details the methodology and data used in this process.

Regulatory Feature Construction

The 'Regulatory Build' is performed by overlap analysis of annotations from data sets in a two-stage, cell-type-aware manner.

In stage one, core regions are identified across all available cell types using 'focus' features, which are chosen to define a set of potential binding sites. These tend to be broad-coverage, narrowly focused marks which are likely candidates for different types of regulatory elements or motifs. Focus feature types include DNase1, which is known to mark accessible chromatin; transcription factor binding sites (TFBSs); and CTCF, which characterises 'insulator/enhancer' elements. As such, the core regions of regulatory features are likely to be positioned on or around any potential regulatory motif. Core regions are extended only in the case of direct overlap with another focus feature. To maintain resolution and to avoid chaining of regulatory features across regions of dense regulatory elements, a 2 kb cut-off is imposed. A focus feature that exceeds this cut-off is treated as an attribute feature (see below) and so does not extend the core region.
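The stage-one merging rule can be sketched as follows. This is a minimal illustration rather than the Ensembl implementation: the function name `build_core_regions`, the tuple representation of features, and the reading of the 2 kb cut-off as a limit on the length of each individual focus feature are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates on one chromosome

# The 2 kb cut-off. Applying it to each focus feature's own length is an
# assumption of this sketch.
MAX_FOCUS_LENGTH = 2000


def build_core_regions(focus: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """Merge directly overlapping focus features into core regions.

    Returns (core_regions, demoted). Demoted features exceed the cut-off and
    would be handled as attribute features instead of extending a core.
    """
    kept = sorted(f for f in focus if f[1] - f[0] + 1 <= MAX_FOCUS_LENGTH)
    demoted = sorted(f for f in focus if f[1] - f[0] + 1 > MAX_FOCUS_LENGTH)

    cores: List[Interval] = []
    for start, end in kept:
        if cores and start <= cores[-1][1]:
            # Direct overlap with the current core: extend it.
            cores[-1] = (cores[-1][0], max(cores[-1][1], end))
        else:
            # No overlap: start a new core region.
            cores.append((start, end))
    return cores, demoted
```

For example, the focus features (100, 300) and (250, 600) merge into one core region (100, 600), while a 3.1 kb feature such as (900, 4000) is set aside and does not extend any core.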

Stage two extends the structure in a cell-type-specific manner, using 'attribute' features. Attribute features do not define a binding site and are sometimes longer-ranging feature types which are useful for classification, such as histone modifications. If core data exists for a given cell type, a Regulatory Feature is seeded using the core region defined in stage one. The arms or bounds are defined by overlap of attribute features with respect to the core region. Directly overlapping attribute features are said to have one degree of separation. Attributes with two degrees of separation are only included if they are entirely contained within another longer associated attribute feature. This is done to capture information adjacent to and indirectly associated with the core region, whilst avoiding longer-range and potentially anomalous associations.

For some cell lines where there is no core data available, but there is substantial other attribute data present, a projection build method is employed. This involves projecting the core region defined by the other cell lines to the 'sparse' cell line. The attribute extension detailed above is then carried out using the projected core region.
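A rough sketch of the stage-two extension for a single core region and one cell type is given below. The function and type names are hypothetical, and the treatment of 'two degrees of separation' as 'does not touch the core but lies entirely inside a longer, directly overlapping attribute feature' is our reading of the description above, not a statement of the actual build code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates on one chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base pair."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer: Interval, inner: Interval) -> bool:
    """True if inner lies entirely within outer and outer is the longer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1] and outer != inner


def extend_core(core: Interval, attributes: List[Interval]) -> Tuple[Interval, List[Interval]]:
    """Return the feature bounds and the attribute features that support them."""
    # One degree of separation: attribute features overlapping the core directly.
    first = [a for a in attributes if overlaps(a, core)]
    # Two degrees: kept only when entirely contained in a longer first-degree feature.
    second = [a for a in attributes
              if not overlaps(a, core) and any(contains(f, a) for f in first)]
    included = first + second
    start = min([core[0]] + [a[0] for a in included])
    end = max([core[1]] + [a[1] for a in included])
    return (start, end), included
```

With a core of (1000, 1200) and attribute features (900, 1500), (1300, 1400) and (1450, 1800), the bounds become (900, 1500): the second feature is retained because it sits inside the first, whereas the third only partially overlaps it and is ignored.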

These two stages give rise to regulatory FeatureSets for the core 'MultiCell' features and for each available cell type.

Regulatory Feature Annotation

Regulatory Features (regfeats) are classified by considering their position on the genome in relation to other classes of genomic feature (e.g. genes, repeats, intergenic regions), together with the combination of regulatory attributes they possess, as coded in their binary_string. In the binary string each position corresponds to a particular focus or attribute feature, and a value of 1 indicates that the regulatory feature overlaps this particular type of focus or attribute feature. A set of randomly distributed features (mockfeats), matched to the regfeats in length and chromosome, is also generated so that we can judge whether the placement of regfeats in relation to the genomic features is non-random.
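As a small illustration of the binary string encoding, the sketch below builds the string for one regulatory feature. The list `FEATURE_TYPES` and its ordering are invented for the example; the real build defines its own set and order of positions.

```python
from typing import Iterable, List

# Illustrative positions only; not the ordering used by the actual build.
FEATURE_TYPES: List[str] = ["DNase1", "CTCF", "H3K4me3", "H3K36me3"]


def binary_string(overlapped: Iterable[str],
                  feature_types: List[str] = FEATURE_TYPES) -> str:
    """Encode which feature types a regulatory feature overlaps, one bit per type."""
    present = set(overlapped)
    return "".join("1" if t in present else "0" for t in feature_types)
```

Under this ordering, a regulatory feature overlapping DNase1 and H3K4me3 would be encoded as `1010`.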

The first step in the procedure is to determine which genomic features (genfeats) each regfeat overlaps. A single common base pair is sufficient to consider two features overlapping. We do the same with the mockfeats. (Strictly speaking this is not the first step: we know from experience that certain regulatory features are most probably artefacts and that others contain no usable information, so these are filtered out before the procedure begins, and the mockfeats correspond only to the filtered set of regfeats.)
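The overlap rule and the generation of matched mock features might look like the sketch below. The names `features_overlap` and `make_mockfeats`, the `(chromosome, start, end)` tuples and the uniform random placement are assumptions for illustration; the source only states that mockfeats match the regfeats in length and chromosome.

```python
import random
from typing import Dict, List, Tuple

Feature = Tuple[str, int, int]  # (chromosome, start, end), inclusive coordinates


def features_overlap(a: Feature, b: Feature) -> bool:
    """A single shared base pair on the same chromosome counts as an overlap."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def make_mockfeats(regfeats: List[Feature],
                   chrom_lengths: Dict[str, int],
                   seed: int = 0) -> List[Feature]:
    """Place one random feature per regfeat, matched in length and chromosome."""
    rng = random.Random(seed)
    mocks = []
    for chrom, start, end in regfeats:
        length = end - start + 1
        # Choose a start so that the mock feature fits on the chromosome.
        mock_start = rng.randint(1, chrom_lengths[chrom] - length + 1)
        mocks.append((chrom, mock_start, mock_start + length - 1))
    return mocks
```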

Next we create a set of patterns of attributes we wish to evaluate. Currently this is all the patterns which occur in the display labels more than once, plus all the patterns which can be created by re-setting one bit of the existing patterns from 1 to 0.
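The pattern set described above can be sketched in a few lines. The function name `candidate_patterns` is hypothetical, and the input is assumed to be the list of binary strings, one per regulatory feature.

```python
from collections import Counter
from typing import Iterable, Set


def candidate_patterns(binary_strings: Iterable[str]) -> Set[str]:
    """Patterns seen more than once, plus each of those with one set bit cleared."""
    counts = Counter(binary_strings)
    seeds = {pattern for pattern, n in counts.items() if n > 1}

    derived = set()
    for pattern in seeds:
        for i, bit in enumerate(pattern):
            if bit == "1":
                # Reset this single bit from 1 to 0.
                derived.add(pattern[:i] + "0" + pattern[i + 1:])
    return seeds | derived
```

Given the strings `110`, `110`, `011` and `101`, only `110` occurs more than once, so the candidate set is `110`, `010` and `100`.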

For each pattern, we look at all the regulatory features which have the same or more bits set. If there are more than 100 such regfeats we count the number of times these features overlap each class of genfeat. We do the same count with the set of mockfeats which correspond to the regfeats. If >50% of the regfeats overlap a particular class of genfeat and the chi-squared statistic calculated using the mockfeat count as the 'expected' value is >8.0 (P < 0.005) we record that this pattern is associated with this class of genfeat.

If the pattern IS associated with a genfeat we collect a second set of patterns which have this pattern's bits PLUS any other bits set. For each of these patterns we look at all the regulatory features which have the same or more bits set and we count the number of times these features overlap each class of genfeat. If less than 50% of the regfeats overlap we record that this second pattern is not associated with the class of genfeat involved.
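The association test for one pattern and one class of genfeat can be sketched as below. The exact form of the chi-squared statistic is not given in this document; the two-cell form used here (overlapping versus not overlapping, one degree of freedom, for which 7.88 is the 0.005 critical value) is an assumption, as are the function names and the input layout.

```python
from typing import List, Tuple

# (binary_string, regfeat overlaps genfeat class, matched mockfeat overlaps it)
Record = Tuple[str, bool, bool]


def matches(pattern: str, bits: str) -> bool:
    """True if bits has at least every bit set that pattern has set."""
    return all(b == "1" for p, b in zip(pattern, bits) if p == "1")


def chi_squared(observed: int, expected: int, total: int) -> float:
    """Two-cell statistic with the mockfeat count as the expected value."""
    if expected in (0, total):
        # Degenerate expectation: the ratio is undefined (simplified handling).
        return float("inf") if observed != expected else 0.0
    diff = observed - expected
    return diff ** 2 / expected + diff ** 2 / (total - expected)


def is_associated(pattern: str, records: List[Record],
                  min_feats: int = 100, min_frac: float = 0.5,
                  min_chi2: float = 8.0) -> bool:
    selected = [r for r in records if matches(pattern, r[0])]
    n = len(selected)
    if n <= min_feats:  # more than 100 regfeats are required
        return False
    observed = sum(1 for r in selected if r[1])
    expected = sum(1 for r in selected if r[2])
    return observed / n > min_frac and chi_squared(observed, expected, n) > min_chi2
```

As a worked case, 200 matching regfeats of which 150 overlap a genfeat class, against 40 overlapping mockfeats, give 110²/40 + 110²/160 = 378.125, well above the 8.0 threshold.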

Having determined all the associated and non-associated patterns for each class of genfeat, we look at all the regfeats and use the 'associated' and then 'not-associated' patterns to set or unset a flag indicating whether the particular regfeat is associated with a particular class of genfeat. During this process it is possible for a given regfeat to be associated with more than one class of genfeat and some of these can be contradictory. This is particularly the case where all or nearly all the bits are set.
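The set-then-unset logic for a single regfeat and a single class of genfeat might be expressed as in the sketch below. The function names are hypothetical, and the conflict-resolution rules mentioned next are not modelled.

```python
from typing import Iterable


def matches(pattern: str, bits: str) -> bool:
    """True if bits has at least every bit set that pattern has set."""
    return all(b == "1" for p, b in zip(pattern, bits) if p == "1")


def flag_regfeat(bits: str,
                 associated: Iterable[str],
                 not_associated: Iterable[str]) -> bool:
    """Set the flag from the 'associated' patterns, then unset it if any
    'not-associated' pattern also matches."""
    flag = any(matches(p, bits) for p in associated)
    if flag and any(matches(p, bits) for p in not_associated):
        flag = False
    return flag
```

For instance, with `100` recorded as associated and `101` as not associated, a regfeat with string `110` is flagged, while `101` and `111` are not.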

Finally, for the purposes of the regulatory build, there is a set of rules which (1) resolve conflicts amongst the above flags and (2) assign a regulatory feature_type to the regfeat. The following types are currently in use:

At present only cell-type-specific regulatory features are classified, as different cell types may give conflicting signals reflecting their unique combination of regulatory and transcriptional states.

These data sets can be displayed along the chromosome in 'Region in Detail', displayed for a gene in the 'Regulation View' or mined from the functional genomics database.

Transcription Factor Binding Site Annotation

For each transcription factor (TF) which has both a ChIP-seq data set in the functional genomics database and a publicly available position weight matrix (PWM) we have annotated the position of putative TF binding sites within the peaks called using the ChIP-seq reads.

Initially PWMs are mapped to the genome using the find_pssm_dna program from the MOODS software (1) with the -f flag set and a permissive threshold of 0.001. We then filter these mappings using a log odds score threshold. The threshold is derived per PWM by considering the occurrence of mappings in a sample of randomly positioned 'background' sequences matched in terms of size and chromosome to the ChIP-seq peaks for this TF. We select the threshold such that the proportion of these background peaks containing a mapping is approximately 5%.
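The per-PWM threshold selection can be sketched as follows. This is only an illustration of the idea: the function name `select_threshold` and the input (the best log odds score found in each background sequence, or `None` where the PWM has no mapping) are assumptions, and this is not the code used in the build.

```python
from typing import List, Optional


def select_threshold(best_scores: List[Optional[float]],
                     target: float = 0.05) -> Optional[float]:
    """Pick a log odds score threshold such that roughly `target` of the
    background sequences still contain a mapping at or above it."""
    scored = sorted((s for s in best_scores if s is not None), reverse=True)
    if not scored:
        return None  # no background mappings, so nothing to calibrate against
    # Number of background sequences allowed to retain a mapping.
    allowed = round(len(best_scores) * target)
    k = max(1, min(allowed, len(scored)))
    return scored[k - 1]
```

With 100 background sequences whose best scores are 1 to 100, the selected threshold is 96, leaving exactly 5 sequences (5%) with a mapping that passes.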

Only mappings which overlap the corresponding ChIP-seq peaks are included in the functional genomics database.

PWMs are taken from JASPAR (2).

1. Janne Korhonen, Petri Martinmaki, Cinzia Pizzi, Pasi Rastas, Esko Ukkonen. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics, Vol. 25, No. 23. (1 December 2009), pp. 3181-3182.

2. Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 2008 Jan;36(Database issue):D102-6.

Source data

Regulatory features are generated using a variety of genome-wide epigenomic data sets. The vast majority of features are derived from ChIP-seq data. To offer a uniform set of features, we processed the raw reads from each data set ourselves. Reads from pooled replicates were aligned to the current genome assembly using bwa (Li H and Durbin R, 2010) with default parameters. All matches to the mitochondrial genome were filtered out, and the resulting alignments were passed to the SWEMBL peak caller (S. Wilder et al, in preparation). Peaks for most data sets were called using a strict set of parameters (-f 150 -R 0.015), derived using CTCF as a reference data set. These parameters are too strict for data sets that have a broad distribution of reads, so we applied a more relaxed set of parameters (-f 150 -R 0.0025) to the DNase1 data sets. The resulting peaks were then filtered to remove those falling in problematic areas identified in the ENCODE project.
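The two filtering steps (dropping mitochondrial matches before peak calling and discarding peaks in problematic areas afterwards) can be illustrated with the sketch below. The sequence names in `MITO_NAMES`, the tuple layouts and the function names are assumptions for the example; alignment and peak calling themselves are performed by bwa and SWEMBL and are not reproduced here.

```python
from typing import List, Tuple

Alignment = Tuple[str, int]    # (sequence name, position)
Region = Tuple[str, int, int]  # (sequence name, start, end), inclusive

# Names under which the mitochondrial genome may appear (an assumption).
MITO_NAMES = {"MT", "chrM", "M"}


def drop_mitochondrial(alignments: List[Alignment]) -> List[Alignment]:
    """Remove all matches to the mitochondrial genome."""
    return [a for a in alignments if a[0] not in MITO_NAMES]


def drop_excluded(peaks: List[Region], excluded: List[Region]) -> List[Region]:
    """Remove peaks sharing at least one base pair with an excluded region."""
    def hits(p: Region, e: Region) -> bool:
        return p[0] == e[0] and p[1] <= e[2] and e[1] <= p[2]
    return [p for p in peaks if not any(hits(p, e) for e in excluded)]
```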

Correspondence of Transcription Factors to ENSEMBL Genes and Jaspar Matrices (Human)

Transcription Factor  Ensembl Gene     Jaspar Matrix(ces)
ATF3                  ENSG00000162772
Ap2alpha              ENSG00000137203  MA0003.1 PB0085.1 PB0189.1
Ap2gamma              ENSG00000087510  PB0087.1 PB0191.1
BAF155                ENSG00000173473
BAF170                ENSG00000139613
BATF                  ENSG00000156127
BCL11A                ENSG00000119866
BCL3                  ENSG00000069399
BHLHE40               ENSG00000134107  PB0007.1 PB0111.1
Bdp1                  ENSG00000145734
Brf1                  ENSG00000185024
Brf2                  ENSG00000104221
Brg1                  ENSG00000127616
CTCF                  ENSG00000102974  MA0139.1
Cfos                  ENSG00000170345  MA0099.1 MA0099.2
Cjun                  ENSG00000177606  MA0099.2
Cmyc                  ENSG00000136997  MA0059.1 MA0147.1
E2F1                  ENSG00000101412  MA0024.1
E2F4                  ENSG00000205250
E2F6                  ENSG00000169016
EBF                   ENSG00000164330  MA0154.1
Egr1                  ENSG00000120738  MA0162.1 PB0010.1 PB0114.1
FOSL2                 ENSG00000075426
GTF2B                 ENSG00000137947
Gabp                  ENSG00000154727  MA0062.1 MA0062.2 PB0020.1 PB0124.1
Gata1                 ENSG00000102145  MA0035.1 MA0035.2 MA0140.1
Gata2                 ENSG00000179348  MA0036.1
HEY1                  ENSG00000164683
IRF4                  ENSG00000137265  PB0034.1 PB0138.1
Ini1                  ENSG00000099956
Jund                  ENSG00000130522
Max                   ENSG00000125952  MA0058.1 PB0043.1 PB0147.1 PL0007.1 PL0014.1
NELFe                 ENSG00000204356
NFKB                  ENSG00000109320  MA0105.1
Nfe2                  ENSG00000123405
Nfya                  ENSG00000001167  MA0060.1
Nfyb                  ENSG00000120837
Nrf1                  ENSG00000106459
Nrsf                  ENSG00000084093  MA0138.2 MA0138.1
POU2F2                ENSG00000028277  PH0144.1
PU1                   ENSG00000066336  MA0080.1 MA0080.2 PB0058.1 PB0162.1
Pax5                  ENSG00000196092  MA0014.1 MA0239.1
Pbx3                  ENSG00000167081
RPC155                ENSG00000148606
RXRA                  ENSG00000186350  MA0016.1 PB0057.1 PB0161.1 MA0065.2 MA0074.1 MA0115.1 MA0159.1
Rad21                 ENSG00000164754
SETDB1                ENSG00000143379
SIX5                  ENSG00000177045
SP1                   ENSG00000185591  MA0079.1 MA0079.2
SRebp1                ENSG00000072310
SRebp2                ENSG00000198911
Sin3Ak20              ENSG00000169375
Sirt6                 ENSG00000077463
Srf                   ENSG00000112658  MA0083.1 PB0078.1
TAF1                  ENSG00000147133
TFIIIC-110            ENSG00000115207
Tcf12                 ENSG00000140262
Tr4                   ENSG00000177463
USF1                  ENSG00000158773  MA0093.1
XRCC4                 ENSG00000152422
Yy1                   ENSG00000100811  MA0095.1
ZBTB33                ENSG00000177485
ZNF274                ENSG00000171606
ZZZ3                  ENSG00000036549
Znf263                ENSG00000006194
p300                  ENSG00000100393

Correspondence of Transcription Factors to ENSEMBL Genes and Jaspar Matrices (Mouse)

Transcription Factor  Ensembl Gene        Jaspar Matrix(ces)
CTCF                  ENSMUSG00000005698  MA0139.1
Cmyc                  ENSMUSG00000022346  MA0147.1
E2F1                  ENSMUSG00000027490  MA0024.1
Esrrb                 ENSMUSG00000021255  MA0141.1
Klf4                  ENSMUSG00000003032  MA0039.2
Nanog                 ENSMUSG00000012396
Oct4                  ENSMUSG00000012396
STAT3                 ENSMUSG00000004040  MA0144.1
Smad1                 ENSMUSG00000031681
Sox2                  ENSMUSG00000074637  MA0143.1
Suz12                 ENSMUSG00000017548
Tcfcp2l1              ENSMUSG00000026380  MA0145.1
Zfx                   ENSMUSG00000079509  MA0146.1
nMyc                  ENSMUSG00000037169  MA0104.2
p300                  ENSMUSG00000055024


The current release comprises the following datasets:

Human Regulatory Build version 10

CD4
Focus Sets      Data type  Reference
CTCF            ChIP-Seq   3
Attribute Sets  Data type  Reference
H2AK5ac         ChIP-Seq   4
H2AK9ac         ChIP-Seq   4
H2AZ            ChIP-Seq   3
H2BK120ac       ChIP-Seq   4
H2BK12ac        ChIP-Seq   4
H2BK20ac        ChIP-Seq   4
H2BK5ac         ChIP-Seq   4
H2BK5me1        ChIP-Seq   3
H3K14ac         ChIP-Seq   4
H3K18ac         ChIP-Seq   4
H3K23ac         ChIP-Seq   4
H3K27ac         ChIP-Seq   4
H3K27me1        ChIP-Seq   3
H3K27me2        ChIP-Seq   3
H3K27me3        ChIP-Seq   3
H3K36ac         ChIP-Seq   4
H3K36me1        ChIP-Seq   3
H3K36me3        ChIP-Seq   3
H3K4ac          ChIP-Seq   4
H3K4me1         ChIP-Seq   3
H3K4me2         ChIP-Seq   3
H3K4me3         ChIP-Seq   3
H3K79me1        ChIP-Seq   3
H3K79me2        ChIP-Seq   3
H3K79me3        ChIP-Seq   3
H3K9ac          ChIP-Seq   4
H3K9me1         ChIP-Seq   3
H3K9me2         ChIP-Seq   3
H3K9me3         ChIP-Seq   3
H3R2me1         ChIP-Seq   3
H3R2me2         ChIP-Seq   3
H4K12ac         ChIP-Seq   4
H4K16ac         ChIP-Seq   4
H4K20me1        ChIP-Seq   3
H4K20me3        ChIP-Seq   3
H4K5ac          ChIP-Seq   4
H4K8ac          ChIP-Seq   4
H4K91ac         ChIP-Seq   4
H4R3me2         ChIP-Seq   3
PolII           ChIP-Seq   3
GM06990
Focus Sets      Data type  Reference
CTCF            ChIP-Seq   5
DNase1          Dnase-Seq  5
Attribute Sets  Data type  Reference
H3K27me3        ChIP-Seq   5
H3K36me3        ChIP-Seq   5
H3K4me3         ChIP-Seq   5
GM12878
Focus Sets      Data type  Reference
BATF            ChIP-Seq   10
BCL11A          ChIP-Seq   10
BCL3            ChIP-Seq   10
CTCF            ChIP-Seq   7
CTCF            ChIP-Seq   6
Cfos            ChIP-Seq   9
Cmyc            ChIP-Seq   6
DNase1          Dnase-Seq  6
DNase1          Dnase-Seq  5
EBF             ChIP-Seq   10
Egr1            ChIP-Seq   10
FAIRE           FAIRE-Seq  6
Gabp            ChIP-Seq   10
IRF4            ChIP-Seq   10
Jund            ChIP-Seq   9
Max             ChIP-Seq   9
NFKB            ChIP-Seq   9
Nrsf            ChIP-Seq   10
POU2F2          ChIP-Seq   10
PU1             ChIP-Seq   10
Pax5            ChIP-Seq   10
Pbx3            ChIP-Seq   10
Rad21           ChIP-Seq   9
SP1             ChIP-Seq   10
Sin3Ak20        ChIP-Seq   10
Srf             ChIP-Seq   10
TAF1            ChIP-Seq   10
Tcf12           ChIP-Seq   10
Tr4             ChIP-Seq   9
USF1            ChIP-Seq   10
Yy1             ChIP-Seq   9
ZBTB33          ChIP-Seq   10
ZZZ3            ChIP-Seq   9
p300            ChIP-Seq   10
Attribute Sets  Data type  Reference
H3K27ac         ChIP-Seq   7
H3K27me3        ChIP-Seq   7
H3K27me3        ChIP-Seq   5
H3K36me3        ChIP-Seq   7
H3K36me3        ChIP-Seq   5
H3K4me1         ChIP-Seq   7
H3K4me2         ChIP-Seq   7
H3K4me3         ChIP-Seq   7
H3K9ac          ChIP-Seq   7
H4K20me1        ChIP-Seq   7
PolII           ChIP-Seq   10
PolII           ChIP-Seq   6
PolII           ChIP-Seq   9
PolIII          ChIP-Seq   9
H1ESC
Focus Sets      Data type  Reference
CTCF            ChIP-Seq   7
Cmyc            ChIP-Seq   6
DNase1          Dnase-Seq  6
DNase1          Dnase-Seq  5
FAIRE           FAIRE-Seq  6
Nrsf            ChIP-Seq   10
TAF1            ChIP-Seq   10
Attribute Sets  Data type  Reference
H3K27ac         ChIP-Seq   11
H3K27me3        ChIP-Seq   7
H3K27me3        ChIP-Seq   11
H3K27me3        ChIP-Seq   11
H3K27me3        ChIP-Seq   11
H3K36me3        ChIP-Seq   7
H3K36me3        ChIP-Seq   11
H3K36me3        ChIP-Seq   11
H3K36me3        ChIP-Seq   11
H3K4me1         ChIP-Seq   7
H3K4me1         ChIP-Seq   11
H3K4me1         ChIP-Seq   11
H3K4me1         ChIP-Seq   11
H3K4me2         ChIP-Seq   7
H3K4me3         ChIP-Seq   7
H3K4me3         ChIP-Seq   11
H3K4me3         ChIP-Seq   11
H3K4me3         ChIP-Seq   11
H3K9ac          ChIP-Seq   7
H3K9ac          ChIP-Seq   11
H3K9ac          ChIP-Seq   11
H3K9ac          ChIP-Seq   11
H3K9me3         ChIP-Seq   11
H3K9me3         ChIP-Seq   11
H3K9me3         ChIP-Seq   11
H4K20me1        ChIP-Seq   7
PolII           ChIP-Seq   10
PolII           ChIP-Seq   6
HUVEC
Focus Sets      Data type  Reference
CTCF            ChIP-Seq   7
CTCF            ChIP-Seq   6
CTCF            ChIP-Seq   5
Cjun            ChIP-Seq   9
Cmyc            ChIP-Seq   6
DNase1          Dnase-Seq  6
DNase1          Dnase-Seq  5
FAIRE           FAIRE-Seq  6
Max             ChIP-Seq   9
Attribute Sets  Data type  Reference
H3K27ac         ChIP-Seq   7
H3K27me3        ChIP-Seq   7
H3K27me3        ChIP-Seq   5
H3K36me3        ChIP-Seq   7
H3K36me3        ChIP-Seq   5
H3K4me1         ChIP-Seq   7
H3K4me2         ChIP-Seq   7
H3K4me3         ChIP-Seq   7
H3K4me3         ChIP-Seq   5
H3K9ac          ChIP-Seq   7
H3K9me1         ChIP-Seq   7
H4K20me1        ChIP-Seq   7
PolII           ChIP-Seq   7
PolII           ChIP-Seq   6
PolII           ChIP-Seq   9
HeLa
Focus Sets      Data type  Reference
Ap2alpha        ChIP-Seq   9
Ap2gamma        ChIP-Seq   9
BAF155          ChIP-Seq   9
BAF170          ChIP-Seq   9
Bdp1            ChIP-Seq   9
Brf1            ChIP-Seq   9
Brf2            ChIP-Seq   9
Brg1            ChIP-Seq   9
CTCF            ChIP-Seq   6
CTCF            ChIP-Seq   5
Cfos            ChIP-Seq   9
Cjun            ChIP-Seq   9
Cmyc            ChIP-Seq   6
Cmyc            ChIP-Seq   9
DNase1          Dnase-Seq  6
DNase1          Dnase-Seq  5
E2F1            ChIP-Seq   9
E2F4            ChIP-Seq   9
E2F6            ChIP-Seq   9
FAIRE           FAIRE-Seq  6
Gabp            ChIP-Seq   10
Ini1            ChIP-Seq   9
Jund            ChIP-Seq   9
Max             ChIP-Seq   9
Nrf1            ChIP-Seq   9
RPC155          ChIP-Seq   9
TAF1            ChIP-Seq   10
TFIIIC-110      ChIP-Seq   9
Tr4             ChIP-Seq   9
Attribute Sets  Data type  Reference
H3K27me3        ChIP-Seq   5
H3K36me3        ChIP-Seq   5
H3K4me3         ChIP-Seq   5
PolII           ChIP-Seq   10
PolII           ChIP-Seq   6
PolII           ChIP-Seq   9
HepG2
Focus Sets      Data type  Reference
BHLHE40         ChIP-Seq   10
CTCF            ChIP-Seq   7
CTCF            ChIP-Seq   6
CTCF            ChIP-Seq   5
Cmyc            ChIP-Seq   6
DNase1          Dnase-Seq  6
DNase1          Dnase-Seq  5
FAIRE           FAIRE-Seq  6
FOSL2           ChIP-Seq   10
Gabp            ChIP-Seq   10
HEY1            ChIP-Seq   10
Jund            ChIP-Seq   10
RXRA            ChIP-Seq   10
SRebp1          ChIP-Seq   9
SRebp2          ChIP-Seq   9
Sin3Ak20        ChIP-Seq   10
USF1            ChIP-Seq   10
ZBTB33          ChIP-Seq   10
p300            ChIP-Seq   10
Attribute Sets  Data type  Reference
H3K27ac         ChIP-Seq   7
H3K27me3        ChIP-Seq   5
H3K36me3        ChIP-Seq   7
H3K36me3        ChIP-Seq   5
H3K4me2         ChIP-Seq   7
H3K4me3         ChIP-Seq   7
H3K4me3         ChIP-Seq   5
H3K9ac          ChIP-Seq   7
H4K20me1        ChIP-Seq   7
PolII           ChIP-Seq   6
PolII           ChIP-Seq   9
IMR90
Focus Sets      Data type  Reference
DNase1          Dnase-Seq  11
Attribute Sets  Data type  Reference
H2AK5ac         ChIP-Seq   11
H2BK120ac       ChIP-Seq   11
H2BK12ac        ChIP-Seq   11
H2BK15ac        ChIP-Seq   11
H2BK20ac        ChIP-Seq   11
H3K14ac         ChIP-Seq   11
H3K18ac         ChIP-Seq   11
H3K23ac         ChIP-Seq   11
H3K27ac         ChIP-Seq   11
H3K27me3        ChIP-Seq   11
H3K36me3        ChIP-Seq   11
H3K4ac          ChIP-Seq   11
H3K4me1         ChIP-Seq   11
H3K4me2         ChIP-Seq   11
H3K4me3         ChIP-Seq   11
H3K56ac         ChIP-Seq   11
H3K79me1        ChIP-Seq   11
H3K79me2        ChIP-Seq   11
H3K9ac          ChIP-Seq   11
H3K9me3         ChIP-Seq   11
H4K20me1        ChIP-Seq   11
H4K5ac          ChIP-Seq   11
H4K8ac          ChIP-Seq   11
H4K91ac         ChIP-Seq   11
K562
Focus Sets      Data type  Reference
ATF3            ChIP-Seq   9
Bdp1            ChIP-Seq   9
Brf1            ChIP-Seq   9
Brf2            ChIP-Seq   9
Brg1            ChIP-Seq   9
CTCF            ChIP-Seq   7
CTCF            ChIP-Seq   6
CTCF            ChIP-Seq   5
Cfos            ChIP-Seq   9
Cjun            ChIP-Seq   9
Cmyc            ChIP-Seq   6
Cmyc            ChIP-Seq   9
DNase1          Dnase-Seq  6
DNase1          Dnase-Seq  5
Egr1            ChIP-Seq   10
FAIRE           FAIRE-Seq  6
GTF2B           ChIP-Seq   9
Gabp            ChIP-Seq   10
HEY1            ChIP-Seq   10
Ini1            ChIP-Seq   9
Jund            ChIP-Seq   9
Max             ChIP-Seq   9
NELFe           ChIP-Seq   9
Nfe2            ChIP-Seq   9
Nfya            ChIP-Seq   9
Nfyb            ChIP-Seq   9
Nrsf            ChIP-Seq   10
PU1             ChIP-Seq   10
Rad21           ChIP-Seq   9
SIX5            ChIP-Seq   10
SP1             ChIP-Seq   10
Sin3Ak20        ChIP-Seq   10
Sirt6           ChIP-Seq   9
Srf             ChIP-Seq   10
TAF1            ChIP-Seq   10
TFIIIC-110      ChIP-Seq   9
USF1            ChIP-Seq   10
XRCC4           ChIP-Seq   9
Attribute Sets  Data type  Reference
Gata1           ChIP-Seq   9
H3K27ac         ChIP-Seq   7
H3K27me3        ChIP-Seq   7
H3K27me3        ChIP-Seq   5
H3K36me3        ChIP-Seq   7
H3K36me3        ChIP-Seq   5
H3K4me1         ChIP-Seq   7
H3K4me2         ChIP-Seq   7
H3K4me3         ChIP-Seq   7
H3K4me3         ChIP-Seq   5
H3K9ac          ChIP-Seq   7
H3K9me1         ChIP-Seq   7
H4K20me1        ChIP-Seq   7
PolII           ChIP-Seq   7
PolII           ChIP-Seq   10
PolII           ChIP-Seq   6
PolII           ChIP-Seq   9
PolIII          ChIP-Seq   9
Znf263          ChIP-Seq   9
K562b (no Regulatory Features built, but data is available).
E2F4            ChIP-Seq   9
E2F6            ChIP-Seq   9
Gata1           ChIP-Seq   9
Gata2           ChIP-Seq   9
SETDB1          ChIP-Seq   9
Tr4             ChIP-Seq   9
Yy1             ChIP-Seq   9
ZNF274          ChIP-Seq   9
Znf263          ChIP-Seq   9
NHEK
Focus Sets      Data type  Reference
CTCF            ChIP-Seq   7
CTCF            ChIP-Seq   6
CTCF            ChIP-Seq   5
DNase1          Dnase-Seq  6
DNase1          Dnase-Seq  5
FAIRE           FAIRE-Seq  6
Attribute Sets  Data type  Reference
H3K27ac         ChIP-Seq   7
H3K27me3        ChIP-Seq   7
H3K27me3        ChIP-Seq   5
H3K36me3        ChIP-Seq   7
H3K36me3        ChIP-Seq   5
H3K4me1         ChIP-Seq   7
H3K4me2         ChIP-Seq   7
H3K4me3         ChIP-Seq   7
H3K4me3         ChIP-Seq   5
H3K9ac          ChIP-Seq   7
H3K9me1         ChIP-Seq   7
H4K20me1        ChIP-Seq   7
PolII           ChIP-Seq   7

Mouse Regulatory Build version 6

ES
Focus Sets      Data type  Reference
DNase1          Dnase-Seq  12
CTCF            ChIP-Seq   13
Cmyc            ChIP-Seq   13
E2F1            ChIP-Seq   13
Esrrb           ChIP-Seq   13
Klf4            ChIP-Seq   13
Nanog           ChIP-Seq   13
Oct4            ChIP-Seq   13
STAT3           ChIP-Seq   13
Smad1           ChIP-Seq   13
Sox2            ChIP-Seq   13
Suz12           ChIP-Seq   13
Tcfcp2l1        ChIP-Seq   13
Zfx             ChIP-Seq   13
nMyc            ChIP-Seq   13
p300            ChIP-Seq   13
Attribute Sets  Data type  Reference
H3              ChIP-Seq   14
H3K4me3         ChIP-Seq   14
H3K9me3         ChIP-Seq   14
H3K27me3        ChIP-Seq   14
H3K36me3        ChIP-Seq   14
H4K20me3        ChIP-Seq   14
PolII           ChIP-Seq   14
ES Hybrid *
Attribute Sets  Data type  Reference
H3K36me3        ChIP-Seq   14
H3K4me3         ChIP-Seq   14
H3K9me3         ChIP-Seq   14
MEF *
Attribute Sets  Data type  Reference
H3K27me3        ChIP-Seq   14
H3K36me3        ChIP-Seq   14
H3K4me3         ChIP-Seq   14
H3K9me3         ChIP-Seq   14
NPC *
Attribute Sets  Data type  Reference
H3K27me3        ChIP-Seq   14
H3K36me3        ChIP-Seq   14
H3K4me3         ChIP-Seq   14
H3K9me3         ChIP-Seq   14

* The Mouse Regulatory Features for ESHyb, MEF and NPC were built using ES Focus features.

References for datasets

1. Genome-wide identification of DNaseI hypersensitive sites was performed by Greg Crawford and Terry Furey (Duke University) using a whole genome DNase-sequencing protocol (Crawford et al., Genome Research 2006).
DNase-sequencing was performed using the Illumina (Solexa) sequencing by synthesis method from a DNase treated library generated from the GM06990 cell line (Crawford and Furey, unpublished).

2. Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, Green RD, Zhang MQ, Lobanenkov VV, Ren B. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell. 2007;128:1231-1245.

3. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823-837.

4. Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, Zhao K. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet. 2008 Jul;40(7):897-903. Epub 2008 Jun 15. PMID: 18552846

5. These data were produced as part of the ENCODE project and are used in accordance with its data release policy. They were generated by the UW ENCODE group.

6. These data were produced as part of the ENCODE project and are used in accordance with its data release policy. The data and annotations were created by a collaboration of multiple institutions.

7. These data were produced as part of the ENCODE project and are used in accordance with its data release policy. The ChIP-seq data were generated at the Broad Institute and in the Bradley E. Bernstein lab at the Massachusetts General Hospital/Harvard Medical School.

8. Raha D, Wang Z, Moqtaderi Z, Wu L et al. Close association of RNA polymerase II and many transcription factors with Pol III genes. Proc Natl Acad Sci USA 2010 Feb 23;107(8):3639-44. PMID: 20139302

9. These data were produced as part of the ENCODE project and are used in accordance with its data release policy. They were generated and analyzed by the labs of Michael Snyder, Mark Gerstein and Sherman Weissman at Yale University; Peggy Farnham at UC Davis; and Kevin Struhl at Harvard.

10. These data were produced as part of the ENCODE project and are used in accordance with its data release policy. They were provided by the Myers Lab at the HudsonAlpha Institute for Biotechnology.

11. These data were produced as part of the Epigenomics Roadmap and are used in accordance with its data release policy. More information: http://nihroadmap.nih.gov/epigenomics/

12. DNase1-sequencing data were produced as a collaboration between Ensembl, David Adams (Wellcome Trust Sanger Institute), and Greg Crawford (Duke University).

13. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008 Jun 13;133(6):1106-17. PMID: 18555785

14. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007 Aug 2;448(7153):553-60. PMID: 17603471