Custom data sets
If you want to filter or customise your download, please try
BioMart,
a web-based querying tool.
FTP Download
API Code
If you do not have access to CVS, you can obtain our latest API code as a gzipped tarball:
Download complete API for this release
Note: the API version must match the version of the databases you are accessing, so please
use CVS to obtain an earlier API version if you are querying older databases.
Database dumps
Entire databases can be downloaded from our FTP site in a
variety of formats. Please be aware that some of these files
can run to many gigabytes of data.
Looking for MySQL dumps to install databases locally? See our
web installation instructions
for full details.
Each directory on
ftp.ensembl.org contains a
README file,
explaining the directory structure.
[[SCRIPT::EnsEMBL::Web::Document::HTML::FTPtable]]
To facilitate storage and download, all databases are compressed with
GNU Zip (gzip, *.gz).
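Because the dumps are gzip compressed, they can be read as a stream without first unpacking them to disk. The following is a minimal Python sketch of that idea, not an Ensembl tool: the function name count_fasta_records and the file name example.fa.gz are made up for illustration, and the tiny file written at the top merely stands in for a real download.

```python
import gzip
import os
import tempfile


def count_fasta_records(path):
    """Count FASTA records in a gzip-compressed file without unpacking it to disk."""
    count = 0
    # Mode "rt" decompresses on the fly and yields text lines.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            if line.startswith(">"):  # each FASTA record starts with a '>' header line
                count += 1
    return count


# Build a tiny stand-in for a downloaded dump so the sketch runs on its own.
# The file name is hypothetical, not a real Ensembl file.
demo_path = os.path.join(tempfile.mkdtemp(), "example.fa.gz")
with gzip.open(demo_path, "wt", encoding="ascii") as out:
    out.write(">seq1\nACGT\n>seq2\nGGCC\n")

record_count = count_fasta_records(demo_path)
print(record_count)
```

Streaming in this way keeps memory use flat, which matters for files that run to many gigabytes.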
About the data
The following types of data dumps are available on the FTP site.
- FASTA
- FASTA sequence databases of Ensembl gene, transcript and protein
model predictions. Since the
FASTA format does not permit sequence annotation,
these database files are mainly intended for use with local sequence
similarity search algorithms. Each directory has a README file with a
detailed description of the header line format and the file naming
conventions.
- DNA
- Masked
and unmasked genome sequences associated with the assembly (contigs,
chromosomes etc.).
- The header line in FASTA dump files containing DNA sequence
consists of the following attributes:
coord_system:version:name:start:end:strand
This coordinate-system string is used in the Ensembl API to retrieve
slices with the SliceAdaptor.
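As an illustration of the coordinate-system string described above, the Python sketch below splits such a string into its six attributes. It is an assumption-laden example rather than part of the Ensembl API: the names SliceLocation and parse_location are invented here, the sample value is hypothetical, and the string is assumed to have already been extracted from the full header line.

```python
from typing import NamedTuple


class SliceLocation(NamedTuple):
    """The six colon-separated attributes: coord_system:version:name:start:end:strand."""
    coord_system: str
    version: str
    name: str
    start: int
    end: int
    strand: int


def parse_location(location: str) -> SliceLocation:
    """Split a coordinate-system string into typed fields (hypothetical helper)."""
    fields = location.strip().split(":")
    if len(fields) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(fields)}")
    coord_system, version, name, start, end, strand = fields
    # start, end and strand are numeric; the other attributes stay as text.
    return SliceLocation(coord_system, version, name, int(start), int(end), int(strand))


# Hypothetical sample value, not taken from a real dump.
example = parse_location("chromosome:EXAMPLE1:X:1:1000:1")
print(example.name, example.end - example.start + 1)
```

The sketch assumes that none of the attributes contains a colon; a string with fewer or more than six fields is rejected rather than guessed at.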
- cDNA
- cDNA sequences for Ensembl or ab initio predicted genes.
- Peptides
- Protein sequences for Ensembl or ab initio predicted genes.
- RNA
- Non-coding RNA gene predictions.
- Flatfile
- Flat files allow more extensive sequence annotation by means of
feature tables, and thus contain the genome sequence as annotated by
the automated Ensembl genome annotation pipeline. Each nucleotide
sequence record in a flat file represents a 1Mb slice of the genome
sequence. Flat files are broken into chunks of 1000 sequence records
for easier downloading.
- EMBL
- Ensembl database dumps in EMBL nucleotide
sequence database format
- GenBank
- Ensembl database dumps
in GenBank nucleotide sequence
database format
- MySQL
- All Ensembl MySQL databases are available in text format, as are
the SQL table definition files. These can be imported into any SQL
database for a local installation of a mirror site. Generally, the
FTP directory tree contains one directory per database. For more
information about these databases and their Application Programming
Interfaces (or APIs) see the API section.
- GTF
- Gene sets for each species. These files include annotations of
both coding and non-coding genes. This file format is
described here.
- EMF flatfile dumps (variation and comparative data)
-
Alignments of resequencing data are available for several species as
Ensembl Multi Format (EMF) flatfile dumps. The accompanying README
file describes the file format.
The same format is also used to dump whole-genome multiple alignments,
as well as the gene-based multiple alignments and phylogenetic trees used
to infer Ensembl orthologues and paralogues. These files are available
in the ensembl_compara database, which can be found in the mysql
directory.
- BED format files (comparative data)
-
Constrained elements calculated using GERP are available in BED
format. For more information see the accompanying README file.
BED format is a simple line-based format. The first 3 mandatory columns
are:
- chromosome name (may start with 'chr' for compliance with UCSC)
- start position. This is a 0-based position
- end position.
More information on the BED file format...
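The three mandatory BED columns described above can be read with a few lines of Python. The sketch below is illustrative only: parse_bed_line is a made-up helper, and it relies on two points that go beyond the text above, namely that the BED end position is exclusive (so a 0-based BED interval becomes a 1-based inclusive one by adding 1 to the start only) and that stripping a leading 'chr' is an acceptable way to normalise UCSC-style names.

```python
def parse_bed_line(line, strip_chr_prefix=True):
    """Return (chromosome, start, end) from a BED line as 1-based inclusive coordinates."""
    fields = line.split()  # BED columns are separated by tabs or spaces
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    chrom = fields[0]
    start0 = int(fields[1])  # 0-based start
    end = int(fields[2])     # exclusive end (assumption taken from the BED convention)
    # Assumption: drop a UCSC-style 'chr' prefix to get the bare chromosome name.
    if strip_chr_prefix and chrom.startswith("chr"):
        chrom = chrom[3:]
    # Shifting only the start converts 0-based half-open to 1-based inclusive.
    return chrom, start0 + 1, end


# Hypothetical input line, not taken from a real file.
region = parse_bed_line("chr1\t0\t100\n")
print(region)
```

Any further columns on the line are ignored here, and names that differ by more than the prefix between naming schemes would need a proper lookup table instead.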
- Tarball
-
The entire Ensembl API is bundled into a single gzipped TAR file. This is updated daily.
- Miscellaneous scripts
-
The following example scripts for some common tasks can be used
with Ensembl APIs.
-
Assembly_mapper_1.0
Converts coordinates from one assembly version to another.
-
ID_mapper_1.0
Maps old Ensembl stable identifiers for genes, transcripts and
translations to their current versions.
-
Variant_effect_predictor
Predicts the positions and potential effects of sequence variations in
the context of transcripts and their translations.