DIMA is a Domain Interaction MAp and aims at becoming a comprehensive resource for functional and physical interactions among conserved protein-domains. We want to provide easy access to published methods for predicting domain-domain interactions on one hand and experimental data on the other hand.
The advantage of such an integrated service is twofold:
If you have any questions, comments, suggestions or criticism, please contact < dima-server AT wzw.tum.de >.
An overview of the DIMA database. Domain interactions are predicted by four computational methods: CMM (correlated mutations), DIPD (domain interaction prediction in a discriminative way), DPEA (domain pair exclusion algorithm), and DPROF (domain phylogenetic profiling). DIPD constructs the PPI and non-PPI datasets required for machine learning based on the IntAct database of protein interactions, while CMM and DPEA use both PPIs from IntAct and functionally linked orthologous groups of proteins (COGs) from the STRING database as input. DPROF is carried out on genomes from the PEDANT databases using orthologous relationships extracted from the SIMAP database. Structurally known domain-domain interactions are imported from the iPfam and 3did databases. The Negatome database is used to filter out unlikely physical interactions between domains. Users can search for domain interactions by single or multiple identifiers, domain description, or protein sequence. DIMA results are presented as a concise table and displayed using a dynamic graphical representation of the local domain neighborhood. Results of the domain phylogenetic profiling are displayed in a separate tab.
Up to now we have implemented the following datasets and methods:
Experimentally supported data (iPFAM and 3did) is directly imported from the respective databases. The prediction approaches (Profile, DPEA, CMM and DIPD), on the other hand, are not simply imported from the datasets provided with the original publications. We apply these algorithms to up-to-date input datasets in order to provide predictions based on current knowledge for each release. A list of non-interacting domain pairs was extracted from Negatome and used to filter all DDIs generated by different computational methods.
Links and references to all methods used and a few resources not directly included in DIMA can be found under Links.
Some of the methods used by DIMA can be configured to use different parameters. If you choose to skip this step, all results produced by the server will be based on default settings. Sticking with defaults is probably a good idea in the beginning but as you get more familiar with DIMA and you are trying to get answers to more specific questions you probably want to modify the settings. You can find a "Preferences" link in the navigation bar at the top of every DIMA page.
On the Preferences page, you can select all methods and datasets you would like to be used for searching and network generation and modify parameters of some methods.
The Domain Pair Exclusion Method (DPEA) derives the most likely domain-domain interactions from a large body of experimentally supported protein-protein interactions (Riley et al. 2005). DIMA uses protein interaction data from IntAct for this purpose. DPEA yields S-scores, which represent a modified likelihood statistic. The authors of DPEA suggested a score cutoff of 3.0 in their publication, and DIMA applies this value by default.
When applying DPEA to predicted protein interactions (DPEA-STRING) great care should be taken to avoid excessive false-positive predictions. DIMA uses STRING predictions (von Mering et al. 2007) excluding "database" and "experimental" scores (i.e. only real predictions). Only predictions with a resulting STRING-score ≥ 0.9 (i.e. very high confidence) are used for DPEA analysis. We then apply a default threshold of 3.0 to the results. Of course, this value can be changed by the user.
Domain phylogenetic profiling is based on profile-strings indicating the presence ('1') or absence ('0') of a domain in the selected genomes. As for protein phylogenetic profiling, the rationale is that proteins/domains that depend on each other for an important cellular function generally need to be present together or not at all in a given genome/proteome. Three basic parameters determine which domain profiles are considered related:
The distance between two bit-strings can be measured in different ways.
We offer here only mutual information as a measure of profile similarity. Since mutual information is a measure of similarity rather than distance, related domains ought to have profiles of high mutual information and the "distance threshold" is interpreted as a lower limit. Mutual information performs quite well but is somewhat less intuitive (in our opinion).
Mutual information takes values in the range [0, 1], where 1 indicates identical profiles. Keep in mind that mutual information is a measure of similarity, not distance, so good scores exceed the threshold. You may want to start out with thresholds on the order of 0.9 – 0.95. The default threshold provided here is 0.6.
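As a rough illustration of the measure described above, the following sketch computes the Shannon mutual information (in bits) between two presence/absence profile strings and compares it to a lower-limit threshold. It is not DIMA's actual code: the function names `mutual_information` and `related` are hypothetical, and no extra normalization is applied, so for binary profiles the value reaches exactly 1 only when the two profiles are identical (or exact complements) and contain equally many '1's and '0's.

```python
from collections import Counter
from math import log2


def mutual_information(profile_a: str, profile_b: str) -> float:
    """Mutual information (bits) between two equal-length '0'/'1' profile strings."""
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(profile_a)
    count_a = Counter(profile_a)                    # marginal counts, domain A
    count_b = Counter(profile_b)                    # marginal counts, domain B
    count_ab = Counter(zip(profile_a, profile_b))   # joint counts per genome
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with counts to avoid repeated divisions
        mi += p_ab * log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def related(profile_a: str, profile_b: str, threshold: float = 0.6) -> bool:
    """Similarity semantics: profiles are related if MI is at least the threshold."""
    return mutual_information(profile_a, profile_b) >= threshold


if __name__ == "__main__":
    print(mutual_information("1100", "1100"))  # identical, balanced profiles
    print(mutual_information("1100", "1010"))  # statistically independent profiles
    print(related("1100", "1100"))
```

Because the threshold acts as a lower limit, raising it makes the selection stricter, which is the opposite of how a distance threshold behaves.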
The selection of genomes to be profiled has a significant impact on the results. Depending on your goals you may want a broad selection of organisms or a specific group. If you select a group of closely related organisms for profiling, all domains that these proteomes have in common will end up with zero entropy (h=0) because their profiles will consist entirely of "1"s, i.e. they will be removed by the entropy filter. Very closely related organisms will give additional (undue?) weight to whatever set of domains they contain. Usually, it is a good idea to use a large number of organisms from the widest phylogenetic spectrum possible. Anything below 30 organisms will perform about as well as consulting a soothsayer...
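The zero-entropy effect can be made concrete with a small sketch. It assumes the entropy in question is the Shannon entropy of the fraction of genomes containing the domain; the helper names, the `min_entropy` parameter and the domain labels in the example are hypothetical and only serve as illustration.

```python
from math import log2


def profile_entropy(profile: str) -> float:
    """Shannon entropy (bits) of a '0'/'1' presence/absence profile."""
    if not profile:
        raise ValueError("profile must not be empty")
    p = profile.count("1") / len(profile)   # fraction of genomes with the domain
    if p in (0.0, 1.0):
        return 0.0                          # all '0's or all '1's carry no signal
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def entropy_filter(profiles: dict, min_entropy: float = 0.0) -> dict:
    """Keep only profiles whose entropy exceeds min_entropy (hypothetical cutoff)."""
    return {name: s for name, s in profiles.items()
            if profile_entropy(s) > min_entropy}


if __name__ == "__main__":
    profiles = {
        "domain_A": "11111111",   # present in every selected genome
        "domain_B": "11001010",   # informative profile
    }
    print(entropy_filter(profiles))   # only domain_B survives
```

A domain shared by all selected genomes has h=0 and is dropped, which is why a narrow selection of closely related organisms leaves few informative profiles.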
The iPFAM and 3did datasets contain domain pairs which have been found to form contacts according to the structure data of the PDB database. As this is experimental data rather than a prediction method, there are no parameters for it.
The domain interaction prediction in a discriminative way (DIPD) method for predicting DDIs from PPIs utilizes both PPIs and non-PPIs to construct the domain combinations and then formulates the DDI prediction as a feature selection problem in machine learning.
For the DIPD method, the PPI and non-PPI datasets are constructed based on the IntAct database. The non-PPI dataset is generated randomly from the PPI dataset, and a filter is then used to exclude: (i) known PPIs, (ii) pairs whose two interacting partners do not belong to the same species, (iii) pairs that do not contain any domain pairs derived from known PPIs, and (iv) all orthologous group interactions.
After the PPI and non-PPI datasets have been produced, a filter method is first applied to reduce redundancy, and a sequential feature selection method is then employed to select the informative features, namely domain interactions. To obtain the list of domain interactions, we apply a default threshold of 3.0 (unbalanced correlation score) to the result. This threshold can be changed by the user.
OMES (Observed Minus Expected Squared) algorithm: First, for every possible pair of columns i, j a list containing all distinct pairs of amino acids is generated. Any pairs having a gap at either i or j are discarded from the analysis. Frequencies of all amino acids at all positions in the set of N sequences are then estimated. Finally, co-variation between positions is calculated by comparing the expected co-occurrence of each two residues x, y in each two columns i, j (N_{ex}) to the frequency with which they actually appear together (N_{obs}) using the χ^{2} goodness-of-fit test. The expected number of sequences (N_{ex}) that contain amino acid x at position i and amino acid y at position j is based on the frequencies of x and y at positions i and j, respectively.
In this formulation, L represents the number of distinct residue pairs that can be found at positions i and j; N_{valid} is the number of sequences without gaps in columns i and j; N_{xi} is the number of times residue x occurs in column i; and N_{yj} is the number of times residue y occurs in column j.
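The OMES steps can be sketched in a few lines of Python. Since the formula itself is not reproduced here, the sketch assumes a commonly used formulation: N_{ex} = N_{xi} · N_{yj} / N_{valid}, and a score equal to the sum of (N_{obs} − N_{ex})^{2} over the L distinct residue pairs, divided by N_{valid}. Counting N_{xi} and N_{yj} only over gap-free sequences is also an assumption, and the function name `omes_score` is hypothetical.

```python
from collections import Counter


def omes_score(column_i: str, column_j: str, gap: str = "-") -> float:
    """OMES-style covariation score for two alignment columns (assumed formulation)."""
    if len(column_i) != len(column_j):
        raise ValueError("columns must come from the same alignment")
    # Discard sequences that have a gap at either position.
    pairs = [(x, y) for x, y in zip(column_i, column_j) if x != gap and y != gap]
    n_valid = len(pairs)
    if n_valid == 0:
        return 0.0
    n_obs = Counter(pairs)                   # N_obs for each distinct pair (x, y)
    n_xi = Counter(x for x, _ in pairs)      # N_xi: residue counts in column i
    n_yj = Counter(y for _, y in pairs)      # N_yj: residue counts in column j
    score = 0.0
    for (x, y), observed in n_obs.items():   # loop over the L distinct pairs
        expected = n_xi[x] * n_yj[y] / n_valid   # N_ex under independence
        score += (observed - expected) ** 2
    return score / n_valid


if __name__ == "__main__":
    print(omes_score("AAAA", "CCCC"))   # fully conserved columns: no covariation
    print(omes_score("AACC", "DDEE"))   # perfectly coupled columns
```

Fully conserved columns score zero because the single observed pair occurs exactly as often as expected, whereas coupled substitutions raise the score.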
The OMES algorithm is based on the χ^{2} goodness-of-fit test. This is a nonparametric statistical hypothesis test used to test whether a sample of data came from a population with a specified distribution; under the null hypothesis, its test statistic approximately follows a chi-square distribution. Therefore, the P-value for a correlation score x can be calculated as:
P = 1 - F(x)
where F(x) is the cumulative distribution function (cdf) of the chi-square distribution. The distribution function thus provides the probability that a value less than or equal to x is observed by chance. The chi-square distribution takes one parameter k, a positive integer that specifies the number of degrees of freedom (df). We calculated the P-values with df = 1 throughout, as proposed by Larson and colleagues. The obtained P-values were combined using Fisher's combined probability test.
Fisher's combined probability test uses P-values from k independent tests to calculate a test statistic:
χ^{2} = -2 Σ_{i=1}^{k} ln(p_{i})
where p_{i} designates an individual P-value to be integrated. If the null hypotheses of all k tests are true, χ^{2} will have a χ^{2}_{2k} distribution, i.e. a chi-square distribution with 2k degrees of freedom. The P-value for χ^{2} itself can then be interpolated from a chi-square table using 2k degrees of freedom.
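The P-value calculation and its combination by Fisher's method can be sketched with the Python standard library alone. The sketch uses two closed forms: for df = 1 the upper tail probability of the chi-square distribution equals erfc(sqrt(x/2)), and for an even number of degrees of freedom 2k it equals exp(-x/2) times the sum of (x/2)^i / i! for i = 0 … k-1, so no table lookup is needed. All function names are hypothetical, and this is an illustration rather than the code used by DIMA.

```python
from math import erfc, exp, log, sqrt


def chi2_pvalue_df1(x: float) -> float:
    """P = 1 - F(x) for a chi-square distribution with one degree of freedom."""
    if x <= 0:
        return 1.0
    return erfc(sqrt(x / 2.0))


def chi2_pvalue_even_df(x: float, df: int) -> float:
    """P = 1 - F(x) for a chi-square distribution with an even, positive df."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):      # accumulate (x/2)^i / i! for i = 1 .. k-1
        term *= half / i
        total += term
    return exp(-half) * total


def fisher_combined(p_values):
    """Return (test statistic, combined P-value) for k independent P-values."""
    p_values = list(p_values)
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    statistic = -2.0 * sum(log(p) for p in p_values)
    return statistic, chi2_pvalue_even_df(statistic, 2 * len(p_values))


if __name__ == "__main__":
    print(chi2_pvalue_df1(3.841))          # close to 0.05
    print(fisher_combined([0.05, 0.05]))   # two concordant tests reinforce each other
```

Combining a single P-value returns that P-value unchanged, which is a quick sanity check on the implementation.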
The ELSC (Explicit Likelihood of Subset Covariation) algorithm measures how many possible subsets of size n would have the composition found in column j.
where N_{x,j} is the number of residues of type x at position j in the unperturbed MSA; n_{x,j} is the number of residues of type x at position j in the subset MSA defined by the perturbation of column i.
In the McBASC algorithm, the correlated mutations are calculated as:
where s_{i} is the standard deviation of s_{ikl} about the mean <s_{i}> and the indices k,l run from 1 to the number of sequences in the family (N).
Searching for information with DIMA is easy. If you already know the PFAM or InterPro identifier of your domain of interest, you simply type it into the respective search form and hit the "Search" button.
Usually you will not have an ID to work with but maybe the common name of the domain – or a few words that should be in the description. In that case you can use the second search form which will initiate a full text search of the domain names and short descriptions. You can now choose from a list of candidates and launch the actual DIMA query with them.
Finally, you may have a piece of protein sequence which contains one or more domains you would like to learn more about. Just paste the sequence into the third search form and the server will analyze it for PFAM domain hits. These will be presented to you so you can select which domains should go into the DIMA query. Please be patient when using this function: as we first need to analyze your sequence, it may take a while before you get results.
If you are not really sure what to expect from DIMA and don't have a specific search in mind you can have a look at the example links in the search forms using the link at the top of the search page.
Of course, many domains have no predicted neighbors – so try a few different ones in order to get an impression of DIMA. Lack of neighbors has many reasons: iPFAM is considered very reliable but at the same time it's quite incomplete. Domain profiling has considerable blind-spots, too: low entropy domains will never yield a signal, the critical organisms have not been sequenced (or never will be), your threshold settings are too strict to detect it, the related domains are not covered by PFAM... Of course, the combination of different methods improves the chance to detect a connection considerably – that's the point of the DIMA resource after all – but we still have a long way to go for better coverage. We suggest you take positive predictions as good hints for a relationship while absence of a predicted connection doesn't necessarily mean much.
The results of your search are presented in a table under the tab "Results". In each row of the table you will find the PFAM domain ID (hyperlinked to query this domain), a PFAM button linking to the respective PFAM entry, a short domain description, a link to the respective InterPro Entry and indicators signifying which method or data supports this domain relationship.
The first line of each table is special: it contains the query domain (highlighted in red). Some methods (like domain profiling) will always support self-neighborhood while others (such as iPFAM) support this link only if an actual interaction has been observed.
Clicking on the PFAM domain ID will start a DIMA search for the respective domain.
You can also click on the "Profiling" tab in order to see additional information about the results of the respective methods.
For the domain profiling technique, you will be presented with a table containing profile entropies, actual distances and a graphical representation of the profile in which green areas represent '1' while '0' is shown in red. Positioning the mouse cursor over the profile bar will, in most browsers, show the name of the respective organism. While the graphical representation is good for an overview, detailed analysis requires looking at the raw output, which can be downloaded or viewed by clicking the respective link (data in tab-separated format).
For a graphical representation of the domain neighborhood, you can click the "Network" tab. Depending on the choice you made in the preferences form, you will get a layout with circular nodes. The query node(s) are shown in red, and the edge(s) with the score(s) are shown in different colors.
For larger analyses you may want to see the entire domain network predicted by DIMA. As this operation takes considerably longer than individual domain queries, we do not offer this as an interactive service. Nevertheless, you can request a network to be calculated using your preferences settings. Click the "Compute Networks" link in the navigation bar to do so. First adjust/review your settings in preferences, then enter your email address and hit "submit". Your request will be queued and executed depending on server load. The resulting domain network will be delivered to you by email.
When requesting entire networks, please be very careful to review your settings (preferences)! E.g. overly permissive thresholds can result in HUGE output files which put an undue burden on both our servers and your mailbox.