Ensembl Schema Documentation

Introduction

This document describes the tables that make up the Ensembl 'Funcgen' schema. Tables are grouped logically by their function, and the purpose of each table is explained. This document refers to version 62 of the Ensembl Funcgen schema.

A simplified entity relationship diagram of the schema is available here.

List of the tables:

Main feature tables

Set tables

Array design tables

Experiment tables

Ancillary tables

Core tables

Core-like tables



Main feature tables

These define the various genomic features and their associated tables.


regulatory_feature

The table contains the features resulting from the regulatory build process.


regulatory_attribute

Denormalised table defining links between a regulatory_feature and its constituent 'attribute' features.
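As a rough illustration of what "denormalised" means for a link table like this, the sketch below stores the name of the table holding each attribute feature alongside its ID, so that one link table can point at several feature tables. It uses an in-memory SQLite database. The column names (regulatory_feature_id, attribute_feature_id, attribute_feature_table) and all sample rows are assumptions made for this example, not definitions taken from this document.

```python
import sqlite3

# Toy stand-in for the link table; column names are assumed, not authoritative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE regulatory_attribute (
    regulatory_feature_id   INTEGER NOT NULL,
    attribute_feature_id    INTEGER NOT NULL,
    attribute_feature_table TEXT    NOT NULL,  -- which feature table the ID refers to
    PRIMARY KEY (regulatory_feature_id, attribute_feature_table, attribute_feature_id)
);
INSERT INTO regulatory_attribute VALUES
    (1, 10, 'annotated'),
    (1, 11, 'annotated'),
    (1, 20, 'motif'),
    (2, 12, 'annotated');
""")


def attributes_by_table(conn, regulatory_feature_id):
    """Group the attribute feature IDs of one regulatory feature by source table."""
    rows = conn.execute(
        "SELECT attribute_feature_table, attribute_feature_id "
        "FROM regulatory_attribute "
        "WHERE regulatory_feature_id = ? "
        "ORDER BY attribute_feature_table, attribute_feature_id",
        (regulatory_feature_id,),
    )
    grouped = {}
    for table, feature_id in rows:
        grouped.setdefault(table, []).append(feature_id)
    return grouped


print(attributes_by_table(conn, 1))
```

The trade-off in this kind of design is that the database cannot enforce a foreign key on attribute_feature_id, because the target table varies per row; the calling code has to resolve it.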


annotated_feature

Represents a genomic feature as the result of an analysis, e.g. a ChIP or DNase1 peak call.


motif_feature

The table contains genomic alignments of binding_matrix PWMs.


associated_motif_feature

The table provides links between motif_features and annotated_features representing peaks of the relevant transcription factor.


binding_matrix

Contains information defining a specific binding matrix (PWM) as defined by the linked analysis, e.g. Jaspar.


external_feature

The table contains imports from externally curated resources e.g. cisRED, miRanda, VISTA, redFLY etc.


result_feature

Represents the mapping of a raw/normalised signal. This is optimised for the web display in two ways:

    1. Data compression by collection into different sized windows or bins.

    2. For array data it also provides an optimised view of a probe_feature and associated result.
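The first optimisation, collecting signal into windows of different sizes, can be sketched as follows. This is only an illustration of the general binning idea: the window sizes (50 and 100), the use of the mean as the per-window summary, and the function name bin_scores are all assumptions, not details of how the table is actually populated.

```python
from collections import defaultdict


def bin_scores(scores, window_size):
    """Collapse (position, score) pairs into fixed-size windows.

    Positions are 1-based. Returns (window_start, window_end, mean_score)
    tuples for every window that contains at least one score.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for position, score in scores:
        index = (position - 1) // window_size  # 0-based window index
        sums[index] += score
        counts[index] += 1
    return [
        (index * window_size + 1, (index + 1) * window_size, sums[index] / counts[index])
        for index in sorted(sums)
    ]


# Hypothetical per-position signal values.
signal = [(1, 2.0), (40, 4.0), (51, 1.0), (99, 3.0), (150, 6.0)]

# Larger windows give fewer rows, which is what makes zoomed-out views cheap.
for window_size in (50, 100):
    print(window_size, bin_scores(signal, window_size))
```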


probe_feature

The table contains genomic alignments of probe entries.



Set tables

Sets are containers for distinct sets of raw and/or processed data.


data_set

Defines the highest-level data container for associating the result of an analysis with the input data to that analysis, e.g. sequence alignments (Input/ResultSet) and peak calls (FeatureSet).


supporting_set

Defines the association between a data_set and its underlying/supporting data.


feature_set

Container for genomic features defined by the result of an analysis, e.g. peak calls or regulatory features.


result_set

Container for raw/signal data, used as input to an analysis or for visualisation of the raw signal, e.g. a wiggle track.


result_set_input

Link table between a result_set and its constituents, which can vary between an array experiment (experimental_chip/channel) and a sequencing experiment (input_set). Note the joint primary key, as inputs can be re-used between result sets.
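To illustrate why a joint primary key suits a link table whose inputs can be shared, here is a minimal sketch using an in-memory SQLite database. The column names (result_set_id, table_id, table_name) and the exact composition of the key are assumptions for the example; the real table definition may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE result_set_input (
    result_set_id INTEGER NOT NULL,
    table_id      INTEGER NOT NULL,  -- ID of the input record
    table_name    TEXT    NOT NULL,  -- which table the input lives in
    PRIMARY KEY (result_set_id, table_id, table_name)
)
""")


def link_input(conn, result_set_id, table_id, table_name):
    """Link an input to a result set; return False if that link already exists."""
    try:
        conn.execute(
            "INSERT INTO result_set_input VALUES (?, ?, ?)",
            (result_set_id, table_id, table_name),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(link_input(conn, 1, 7, "input_set"))  # first use of input 7
print(link_input(conn, 2, 7, "input_set"))  # same input re-used by another result set
print(link_input(conn, 1, 7, "input_set"))  # exact duplicate link is rejected
print(link_input(conn, 1, 7, "channel"))    # same ID, different source table
```

Because the key spans the result set and the input, the same input can appear under several result sets, while a repeated link within one result set violates the key.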


dbfile_registry

This generic table contains a simple registry of paths to support flat file (DBFile) access. This should be left joined from the relevant adaptor e.g. ResultSetAdaptor.
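The advice to left join can be sketched with a toy example. A left join keeps records that have no registered file, whereas an inner join would silently drop them. The schema below (a two-column result_set table and a dbfile_registry keyed on table_id and table_name with a path column) and the sample path are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE result_set (
    result_set_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE dbfile_registry (
    table_id   INTEGER NOT NULL,
    table_name TEXT    NOT NULL,
    path       TEXT    NOT NULL,
    PRIMARY KEY (table_id, table_name)
);
INSERT INTO result_set VALUES (1, 'set_with_file'), (2, 'set_without_file');
INSERT INTO dbfile_registry VALUES (1, 'result_set', '/hypothetical/path/set_1.col');
""")


def result_sets_with_paths(conn):
    """Return every result set, with its file path or None if none is registered."""
    return conn.execute(
        "SELECT rs.name, dr.path "
        "FROM result_set rs "
        "LEFT JOIN dbfile_registry dr "
        "  ON dr.table_id = rs.result_set_id AND dr.table_name = 'result_set' "
        "ORDER BY rs.result_set_id"
    ).fetchall()


for name, path in result_sets_with_paths(conn):
    print(name, path)
```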


input_set

Defines a distinct set of input data which is not imported into the DB, but used for some analysis e.g. a BAM file.


input_subset

Defines a file from an input_set, required for import tracking and recovery.



Array design tables


array

Contains information defining an array or array set.


array_chip

Represents the individual array chip design as part of an array or array set.


probe_set

The table contains information about probe sets.


probe

Defines individual probe designs across one or more array_chips. Note: The probe sequence is not stored.


probe_design

Stores data from array design analyses.



Experiment tables

These define the experimental metadata and raw data.


experiment

Stores high-level metadata about individual experiments.


experimental_group

Contains experimental group info i.e. who produced data sets.


mage_xml

Contains MAGE-XML for array based experiments.


experimental_chip

Represents the physical instance of an array_chip used in an experiment.


channel

Represents an individual channel from an experimental_chip.


result

Contains a score or intensity value for an associated probe location on a particular experimental_chip.



Ancillary tables

These contain data types which are used across many of the above tables. They are quite often denormalised to store generic associations to several tables, which avoids the need for multiple sets of similar tables.


feature_type

Contains information about different types/classes of feature e.g. Brno nomenclature, Transcription Factor names etc.


associated_feature_type

Link table providing many-to-many mapping for feature_type entries.


cell_type

Contains information about cell/tissue types.


experimental_design

Denormalised link table to allow many to many design_type associations.


design_type

Contains extra information about experimental designs, preferably ontology terms.


status

Denormalised table associating funcgen records with a status.


status_name

Simple table to predefine the names of statuses.
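How a generic status table and a lookup table of status names might work together can be sketched as below. The column names (table_id, table_name, status_name_id, name) and the sample status names are assumptions chosen for this example and are not taken from this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE status_name (
    status_name_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE
);
CREATE TABLE status (
    table_id       INTEGER NOT NULL,  -- ID of the record that has the status
    table_name     TEXT    NOT NULL,  -- table that record belongs to
    status_name_id INTEGER NOT NULL REFERENCES status_name (status_name_id),
    PRIMARY KEY (table_id, table_name, status_name_id)
);
INSERT INTO status_name VALUES (1, 'DISPLAYABLE'), (2, 'IMPORTED');
INSERT INTO status VALUES
    (5, 'feature_set', 1),
    (5, 'feature_set', 2),
    (5, 'result_set',  2);
""")


def statuses_for(conn, table_name, table_id):
    """List the status names attached to one record of the given table."""
    rows = conn.execute(
        "SELECT sn.name "
        "FROM status s "
        "JOIN status_name sn ON sn.status_name_id = s.status_name_id "
        "WHERE s.table_name = ? AND s.table_id = ? "
        "ORDER BY sn.name",
        (table_name, table_id),
    )
    return [row[0] for row in rows]


print(statuses_for(conn, "feature_set", 5))
print(statuses_for(conn, "result_set", 5))
```

Note that the same numeric ID (5 here) refers to different records depending on table_name, which is why both columns are needed to identify the record.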



Core tables

These are exact clones of the corresponding core schema tables. See core schema docs for more details.


analysis

Usually describes a program and some database that together are used to create a feature on a piece of sequence. Each feature is marked with an analysis_id. The most important column is logic_name, which is used by the webteam to render a feature correctly on contigview (or even retrieve the right feature). Logic_name is also used in the pipeline to identify the analysis which has to run in a given status of the pipeline. The module column tells the pipeline which Perl module does the whole analysis, typically a RunnableDB module.


analysis_description

Allows the storage of a textual description of the analysis, as well as a "display label", primarily for the EnsEMBL web site.


meta

Stores data about the data in the current schema. Unlike other tables, data in the meta table is stored as key-value pairs. These data include details about the database, RegulatoryBuild and patches. The species_id field of the meta table is used in multi-species databases and makes it possible to have species-specific meta key-value pairs. The species-specific meta key-value pairs need to be repeated for each species_id. Entries in the meta table that are not specific to any one species, such as the schema.version key and any other schema-related information, must have their species_id field set to NULL. The default species_id, and the only species_id value allowed in single-species databases, is 1.
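The species_id rules described above can be sketched with a small in-memory SQLite example. The column names (meta_id, species_id, meta_key, meta_value), the key example.key and the sample values are assumptions for illustration; only the schema.version key and the NULL/default-1 behaviour come from the description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meta (
    meta_id    INTEGER PRIMARY KEY,
    species_id INTEGER DEFAULT 1,   -- NULL marks species-independent entries
    meta_key   TEXT NOT NULL,
    meta_value TEXT NOT NULL
);
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
    (NULL, 'schema.version', '62'),
    (1,    'example.key',    'value for species 1'),
    (2,    'example.key',    'value for species 2');
""")


def meta_values(conn, meta_key, species_id=1):
    """Fetch values for a key, matching the given species or species-independent rows."""
    rows = conn.execute(
        "SELECT meta_value FROM meta "
        "WHERE meta_key = ? AND (species_id = ? OR species_id IS NULL) "
        "ORDER BY meta_id",
        (meta_key, species_id),
    )
    return [row[0] for row in rows]


print(meta_values(conn, "schema.version", species_id=2))  # species-independent entry
print(meta_values(conn, "example.key"))                   # default species_id of 1
print(meta_values(conn, "example.key", species_id=2))
```

Note that the species-specific key has to be inserted once per species, as the description requires, while the schema-level key is stored a single time with a NULL species_id.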


meta_coord

Describes which co-ordinate systems the different feature tables use.


identity_xref

Describes how well a particular xref object matches the EnsEMBL object.


external_synonym

Some xref objects can be referred to by more than one name. This table relates names to xref IDs.


external_db

Stores data about the external databases in which the objects described in the xref table are stored.


ontology_xref

This table associates ontology terms/accessions with Ensembl objects (primarily EFO/SO). NOTE: Currently not in use.


unmapped_reason

Describes the reason why a mapping failed.



Core-like tables

These are almost exact clones of the corresponding core schema tables. Some contain extra fields or different enum values to support the funcgen schema.


xref

Holds data about objects which are external to EnsEMBL, but need to be associated with EnsEMBL objects. Information about the database that the external object is stored in is held in the external_db table entry referred to by the external_db column.


object_xref

Describes links between Ensembl objects and objects held in external databases. The Ensembl object can be one of several types; the type is held in the ensembl_object_type column. The ID of the particular Ensembl gene, translation or whatever is given in the ensembl_id column. The xref_id points to the entry in the xref table that holds data about the external object. Each Ensembl object can be associated with zero or more xrefs. An xref object can be associated with one or more Ensembl objects.
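The many-to-many relationship described for object_xref can be sketched as below, using an in-memory SQLite database. The columns ensembl_object_type, ensembl_id and xref_id are named in the description; the display_label column on xref, the object type values and all sample rows are hypothetical and included only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE xref (
    xref_id       INTEGER PRIMARY KEY,
    display_label TEXT NOT NULL          -- assumed column, for illustration
);
CREATE TABLE object_xref (
    ensembl_id          INTEGER NOT NULL,
    ensembl_object_type TEXT    NOT NULL,
    xref_id             INTEGER NOT NULL REFERENCES xref (xref_id),
    PRIMARY KEY (ensembl_id, ensembl_object_type, xref_id)
);
INSERT INTO xref VALUES (1, 'label_A'), (2, 'label_B');
INSERT INTO object_xref VALUES
    (100, 'ProbeSet',          1),
    (100, 'ProbeSet',          2),
    (200, 'RegulatoryFeature', 1);
""")


def xrefs_for_object(conn, object_type, ensembl_id):
    """Labels of all xrefs linked to one Ensembl object (possibly none)."""
    rows = conn.execute(
        "SELECT x.display_label "
        "FROM object_xref ox JOIN xref x ON x.xref_id = ox.xref_id "
        "WHERE ox.ensembl_object_type = ? AND ox.ensembl_id = ? "
        "ORDER BY x.xref_id",
        (object_type, ensembl_id),
    )
    return [row[0] for row in rows]


def objects_for_xref(conn, xref_id):
    """(type, id) pairs of all Ensembl objects linked to one xref."""
    return conn.execute(
        "SELECT ensembl_object_type, ensembl_id "
        "FROM object_xref WHERE xref_id = ? "
        "ORDER BY ensembl_object_type, ensembl_id",
        (xref_id,),
    ).fetchall()


print(xrefs_for_object(conn, "ProbeSet", 100))
print(xrefs_for_object(conn, "ProbeSet", 300))
print(objects_for_xref(conn, 1))
```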


unmapped_object

Describes why a particular external entity was not mapped to an Ensembl one.


coord_system

Stores information about the available co-ordinate systems for the species identified through the species_id field. For each species, there must be one co-ordinate system that has the attribute "top_level" and one that has the attribute "sequence_level". NOTE: This has been extended from the core implementation to support multiple assemblies by referencing multiple core DBs.


seq_region

Stores information about sequence regions from various core DBs.