Documentation

Protein Contact is a collection of data and visualization resources to facilitate computational structural biology studies.



Figure. Visualizing structural contacts as circles.



Visualizing PDB codes.

You can visualize structural atomic contacts as circles, allowing for a more compact representation than 3D visualization.

Input.

We offer two ways to visualize a PDB: you either specify a PDB code or supply your own PDB file. In either case, you also need to provide the chain(s) that you wish to visualize. If you supply a single chain, only intra-molecular contacts will be visualized. If you supply more than one chain, only inter-molecular contacts will be visualized. You can also choose whether to visualize HETATOM entries; by default these are not visualized.

Output.

The chains are projected onto circle segments. Clicking on a chain expands it to show its constituent residues, again projected onto circle segments (see Figure above). Connections between residues indicate contacts. The contacts shown can be adjusted by distance cutoff using the radio-button dialog above the representation. If the visualization was created from a PDB code rather than a supplied file, the chemical nature of the bond is also available alongside the distance cutoffs. Hovering over a residue displays its name. Hovering over a connecting arc displays the residues involved and temporarily hides all other contacts.

Sharing.

The output can be shared by copying the shareable link available from the top tab 'share'. The visualization can also be embedded in a website using an iframe; the code for it can be found in the same tab.



Figure. Common steps in computational structural bioinformatic analyses. We provide a service which applies such commonly used filtering steps using our pre-computed data.



Introduction.

We have identified several filtering steps that appear to be common to many computational structural bioinformatics analyses (see Figure above). We offer this data for download (see the Data tab) and provide an interface to query it directly to create a filtered list of PDB chains.

Input.

Our PDB filtering endpoint is available here. From here you have a set of filters (all optional) to choose from:

  • Resolution: specify the resolution cutoff for the structures. Structures with a numerically larger (worse) resolution value are excluded.
  • Molecule type: do you want proteins or nucleotides? If the chain contains both, it will always be returned.
  • Chain length: what is the minimum length of the chain to be returned?
  • Organism: would you like to constrain the results to a specific organism? A record is still returned if some other organism is also cited in it.
  • Structural family: would you like to constrain the results to a specific structural family (CATH)?
  • Clustering identity cutoff: would you like to cull the sequences by sequence identity? Calculated using CD-HIT, separately for nucleotides and proteins.
  • Remove noncanonicals: would you like chains with non-canonical residues/nucleotides to be removed? Non-canonical residues/nucleotides are those outside the standard 20 amino acids and the standard five nucleotides respectively.
  • Remove discontinuous: would you like chains with discontinuities to be removed? Calculated using DSSP.
  • Bad residue threshold: how many residues with a 'bad' B-factor are you willing to tolerate per chain? A bad residue is defined as one with at least one atom with a B-factor above 80 or nil altogether.
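
The 'bad residue' definition can be made concrete with a minimal sketch. It assumes the atoms of a chain are given as (residue id, B-factor) pairs, with None standing in for a nil B-factor. The function names and the input layout are hypothetical and serve only as an illustration; they are not necessarily how the pre-computed values were produced.

```python
# Hypothetical helpers illustrating the 'bad residue' definition:
# a residue is bad if any of its atoms has a B-factor above 80 or no B-factor at all.
def count_bad_residues(atoms, threshold=80.0):
    """atoms: iterable of (residue_id, b_factor) pairs; b_factor may be None (nil)."""
    bad = set()
    for residue_id, b_factor in atoms:
        if b_factor is None or b_factor > threshold:
            bad.add(residue_id)
    return len(bad)


def passes_bad_residue_filter(atoms, tolerance=0):
    """True if the chain has no more bad residues than the tolerated number."""
    return count_bad_residues(atoms) <= tolerance


# Made-up atoms for four residues of one chain.
example_atoms = [
    ("A10", 25.3), ("A10", 31.0),  # all atoms fine
    ("A11", 85.2), ("A11", 40.0),  # one atom above 80, so the residue is bad
    ("A12", None),                 # nil B-factor, so the residue is bad
    ("A13", 80.0),                 # exactly 80 is not above 80
]
print(count_bad_residues(example_atoms))
```

With a tolerance of 2 this example chain would be kept; with the default tolerance of 0 it would be dropped.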
Output.

Results of filtering are available as a searchable table and immediately available for download. The downloadable results are available as a mapping from selected PDBs and chains to the information attached to them:

"2FWPA": { //chain id -- in the format, PDBchain
	"c": "prot",//type of molecule (prot/nucl)
	"r": "1.85",//resolution of the molecule
	"l": 159,//length of the chain
	"o": "ACETOBACTER ACETI",//organism of the PDB
	"f": {//structural families mapped to portions of chain
		"3.40.50.7700": ["20-178:A"]
	}
},
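
As an illustration of working with the downloaded mapping, the sketch below orders chains by resolution and splits the identifiers into PDB code and chain. It assumes that the downloaded file is plain JSON (without the explanatory // comments shown above) and that PDB codes are four characters long; the helper names are hypothetical.

```python
import json

# A tiny stand-in for the downloaded results, reusing the entry shown above.
results_text = """
{
  "2FWPA": {"c": "prot", "r": "1.85", "l": 159, "o": "ACETOBACTER ACETI",
            "f": {"3.40.50.7700": ["20-178:A"]}}
}
"""


def split_chain_id(chain_id):
    """Split an identifier such as '2FWPA' into (PDB code, chain)."""
    return chain_id[:4], chain_id[4:]


def best_resolved(results, limit=10):
    """Return up to `limit` chain ids ordered from best to worst resolution."""
    # Resolution is stored as a string, so convert it before comparing.
    return sorted(results, key=lambda cid: float(results[cid]["r"]))[:limit]


results = json.loads(results_text)
for chain_id in best_resolved(results):
    pdb_code, chain = split_chain_id(chain_id)
    entry = results[chain_id]
    # "f" may be empty when no structural family is annotated.
    print(pdb_code, chain, entry["r"], entry["l"], sorted(entry["f"] or {}))
```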
						


Figure. Summary of the data we collect from the Protein Data Bank.



Introduction

We collect X-ray structure data from the PDB and pre-process it using widely used protocols. The data is available for download here. It is stored at two levels of abstraction, PDB-file-level and residue-level; in the downloaded file these are stored in the pdb_level and residue_level folders respectively (we make a parser available for the pdb_level data here). Additionally, there is the raw_pdbs file, which contains the corresponding structures from the PDB with one caveat: if an alternative residue was encountered at any point, only the first one was kept. The PDB-file-level data describes the quality of the structure (resolution, organisms) and the constituent chains (discontinuities, length etc.); it is meant to simplify selecting sets of structures to analyze by providing broad-brush filters. The residue-level data holds the information to be analysed: atomic contacts, including interaction chemistry, and surface accessibilities.
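
To locate a single entry in the pdb_level folder without traversing it, the folder layout used by the loading script further down can be sketched as follows. The function name is hypothetical, and the assumption that file names use the PDB code exactly as typed (for example in lower case) should be checked against the downloaded data.

```python
from os.path import join


def pdb_level_path(release_db, pdb_code):
    """Path of the pdb_level JSON file for a PDB code, mirroring the loading script."""
    file_name = pdb_code + '.json'
    # Files are grouped into subfolders named after the 2nd and 3rd
    # characters of the file name, e.g. '1ahw.json' lives in 'ah'.
    return join(release_db, 'pdb_level', file_name[1:3], file_name)


print(pdb_level_path('.', '1ahw'))
```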

PDB-level

We collect the following data for each PDB file:

  • Resolution: collected from the RCSB PDB.
  • Organism information: collected from the RCSB PDB. These are simply the annotations as given in the PDB file, without attempting to determine correctness and if different chains come from
  • Individual chain information, collected for each constituent chain, this includes:
    • Chain discontinuities, as calculated by DSSP: lists the residue ids after which there appears to be a chain discontinuity.
    • Noncanonical residues/nucleotides: any amino acids or nucleotides which are not among the standard 20 or the standard five respectively.
    • Bad B-factor: number of residues with at least one atom with a B-factor of nil or above 80.
    • Molecule type: protein (prot) or nucleotide (nucl) -- it is possible to have both.
    • Structural family, as defined by CATH.
    • Cluster: CD-HIT sequence clusters, available at sequence identity cutoffs of 70%, 80%, 90% and 99%. Calculated for proteins and for nucleotides separately, without a length filter.

The bulk data are JSON-formatted as follows for each PDB (this data can be parsed using the file here):

{
	"chains": "ABCDEF", //Chains in this PDB file.
	"resolution": "3",  //Resolution of this PDB file.
	"organisms": "HOMO SAPIENS MUS MUSCULUS", //Organisms found in this PDB file.

	"chain_details": { //Details of individual chains.
		"A": { //chain identifier
			"clustering": { //cluster membership for different identity cutoffs.
				"0.7": 15379,
				"0.9": 21690,
				"0.8": 17503,
				"0.99": 27320
			},
			"non_canonical": 0, //number of non-canonical residues.
			"cath": { //cath structural family with residue limits for domains.
				"2.60.40.10": ["1-108:A", "109-210:A"]
			},
			"length": 214, //chain length.
			"classes": ["prot"], //what kind of chain is it? nucleotide/protein
			"disconts": [], //discontinuities found in the PDB file.
			"bad_bf_residues": 15 //number of residues with 'bad' B-factors.
		},
		.
		.
		.
	}
}
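
The domain boundaries in the "cath" entry are strings such as "1-108:A". A minimal sketch for turning them into numbers is given below. It assumes the start-end:chain layout seen in the example above and plain integer residue numbers (no insertion codes); the function names are hypothetical.

```python
import re

# Matches domain boundaries such as "1-108:A" (start-end:chain);
# residue numbers may be negative.
_RANGE = re.compile(r'^(-?\d+)-(-?\d+):(\w+)$')


def parse_cath_range(text):
    """Turn '1-108:A' into (1, 108, 'A')."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError('Unrecognised domain range: %r' % text)
    start, end, chain = match.groups()
    return int(start), int(end), chain


def domains_for_chain(chain_details):
    """Flatten the 'cath' entry of one chain into (family, start, end, chain) tuples."""
    cath = chain_details.get('cath') or {}  # 'cath' may be missing or null
    return [(family,) + parse_cath_range(domain_range)
            for family in sorted(cath)
            for domain_range in cath[family]]


chain_a = {"cath": {"2.60.40.10": ["1-108:A", "109-210:A"]}, "length": 214}
for domain in domains_for_chain(chain_a):
    print(domain)
```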

						

To make your life easier, we provide the following Python script to load the pdb_level data and do some parsing:

import json
from os.path import join,isfile
import pprint
import os
import random

release_db = '.'#where you downloaded the data file.

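#Load the CATH family names (tab-separated file: CATH id, name).
def load_cath_file():
	cath = {}
	with open('../code/cath-b-newest-names') as cath_file:
		for line in cath_file:
			#Drop the trailing newline so that it does not end up in the name.
			cath_id,name = line.rstrip('\n').split('\t')
			cath[cath_id] = name
	return cath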


#Traverses the folder with PDB information to create a parsable python object.
def load_data():
	
	d = {}

	#List of organisms.
	organisms = {}

	cath = load_cath_file()

	#CATH
	cath_data = {}

	for root, dirs, files in os.walk(join(release_db,'pdb_level')):
		print(len(d))
		for pdb_file in files:
			data = json.load(open(join(release_db,'pdb_level',pdb_file[1:3],pdb_file)))
			pdb_code = pdb_file.replace('.json','')
			d[pdb_code] = data
			#Get organism data.
			try: 
				organisms[d[pdb_code]['organisms']]+=1
			except KeyError:
				organisms[d[pdb_code]['organisms']]=1
			#Get structural family data.
			for chain in d[pdb_code]['chain_details']:
				if d[pdb_code]['chain_details'][chain]['cath'] == None:
					continue
				for str_family in d[pdb_code]['chain_details'][chain]['cath']:
					
					try:
						cath_data[str_family] = cath[str_family]
					except KeyError:
						print("Failed fetching name",str_family)
						
	
	return d,organisms,cath_data

#Load the data.
data,organisms,cath = load_data()

#Create a constrained list of PDB chains given parameters.
#Parameters
#* resolution : float, resolution cutoff; structures with a larger (worse) value are skipped.
#* exclude_discontinuous : should we exclude chains which are discontinuous.
#* structural_family : list of CATH entries to match against.
#* clustering_cutoff : cluster cutoff, one of '0.7','0.8','0.9','0.99' (a string, since JSON keys are strings).
#* bad_residue_tolerance : number of residues with nil B-factors or B-factors greater than 80 that are tolerated within a chain.
#* exclude_non_canon : exclude chains containing non-canonical residues.
#* organism : organism keyword, matched as a substring of the organisms annotation.
#* min_length : minimum length of the chain
#* molecule_type : 'prot' or 'nucl'; only chains whose classes include this type are kept.
def create_list(resolution=None,exclude_discontinuous=False,structural_family=None,clustering_cutoff=None,bad_residue_tolerance=0,exclude_non_canon=True,organism=None,min_length=None,molecule_type=None):

	#Format:
	#Mapping from chain id (PDB code + chain) to the information attached to that chain.
	results = {}

	#Keeps track of clusters we have encountered.
	clusters = {}
	
	for pdb in data:
		
		#Check resolution
		if resolution!=None and float(resolution) < float(data[pdb]['resolution']):
			continue
		for chain in data[pdb]['chain_details']:
			
			#Identifies for results.
			_id =  str(pdb+chain)

			#Length filter.
			length = data[pdb]['chain_details'][chain]['length']
			if min_length !=None:
				if length < min_length:
					continue

			#Check discontinuities
			if len(data[pdb]['chain_details'][chain]['disconts']) >0 and exclude_discontinuous == True:
				
				continue
			#Exclude chains with excessive numbers of bad residues.
			if  int(data[pdb]['chain_details'][chain]['bad_bf_residues'])>bad_residue_tolerance:
				continue
			#Exclude chains containing non-canonical residues/nucleotides.
			if  int(data[pdb]['chain_details'][chain]['non_canonical'])>0 and exclude_non_canon==True:
				print("Skipping because of non-canonicals")
				continue
			#Structural family filter.
			if structural_family!=None:
				#If we do not have family annotation and we want to filter by family, ignore right away.
				if data[pdb]['chain_details'][chain]['cath'] == None:
					continue
				found_family=False
				for family in data[pdb]['chain_details'][chain]['cath']:
					if family in structural_family:
						found_family = True
						break
				if found_family == False:
					
					continue
			
			#Deal with molecule classes.
			molecule_class = ''
			if molecule_type!=None:
				if molecule_type not in data[pdb]['chain_details'][chain]['classes']:
					continue
			if len(data[pdb]['chain_details'][chain]['classes']) == 1:
				molecule_class = data[pdb]['chain_details'][chain]['classes'][0]
			else:
				molecule_class = data[pdb]['chain_details'][chain]['classes']
			#Clustering filter.
			if clustering_cutoff !=None:
				if molecule_class!='prot' and molecule_class !='nucl':#Means we have a double class or no class at all -- clustering was molecule specific.
					continue
				print(pdb)
				cluster = data[pdb]['chain_details'][chain]['clustering'+molecule_class][clustering_cutoff]
				if cluster not in clusters:
					clusters[cluster] = {'id':_id,'r':resolution}
				else:
					continue#For now clusters are added on the first come basis.

			#organism filter.
			if organism!=None:
				if organism not in data[pdb]['organisms']:
					continue
			#If we jumped all the hurdles, add chain to results.
			results[_id] = {'r':str(data[pdb]['resolution']),'o':data[pdb]['organisms'],'f':data[pdb]['chain_details'][chain]['cath'],'l':length,'c':molecule_class}

	

	return results

if __name__ == '__main__':
	
	#Example usage:
	result = create_list(resolution=5.0,structural_family=['2.60.40.10'],exclude_discontinuous=True,exclude_non_canon=True,organism='HOMO SAPIENS',min_length=10,molecule_type='prot')
	print("Results",result)
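
The script above keeps the first chain it encounters for each cluster. As one possible alternative, the sketch below picks the best-resolved chain per cluster instead. It is an illustration only: the function name is hypothetical, the cluster key is taken to be "clustering" as in the JSON example above (it is a parameter in case your data uses a different key), and the toy PDB codes and values are made up.

```python
def best_per_cluster(data, cutoff='0.9', clustering_key='clustering'):
    """Map each cluster id to the (pdb, chain, resolution) with the best resolution."""
    best = {}
    for pdb, entry in data.items():
        resolution = float(entry['resolution'])
        for chain, details in entry['chain_details'].items():
            clustering = details.get(clustering_key) or {}
            if cutoff not in clustering:
                continue  # no cluster assigned at this cutoff
            cluster = clustering[cutoff]
            # A numerically smaller resolution value means a better structure.
            if cluster not in best or resolution < best[cluster][2]:
                best[cluster] = (pdb, chain, resolution)
    return best


# Toy input in the pdb_level format; codes and values are made up.
toy = {
    'xxx1': {'resolution': '3', 'chain_details': {'A': {'clustering': {'0.9': 21690}}}},
    'xxx2': {'resolution': '1.8', 'chain_details': {'B': {'clustering': {'0.9': 21690}}}},
}
print(best_per_cluster(toy))
```

Note that the cutoff is passed as a string, because JSON object keys are always strings.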
	
Residue-level

We collect the following atomic interaction data for each PDB file:

  • Individual residue data: for each residue we record the following information:
    • Residue identifier: chain, residue number and insertion of the residue.
    • Type: amino acid, nucleotide or HETATOM name for the molecule.
    • Canonical information: whether the residue is a canonical amino acid or nucleotide (one of the traditional 20 amino acids or one of the five traditional nucleotides).
    • Common solvent: whether it is a common solvent (water is excluded from this).
    • Hetatom: whether the residue is in fact a hetatom.
    • Accessible Surface Area: surface accessibility of side chains, calculated using FreeSASA. Uncomplexed accessibilities are calculated for each chain in the PDB separately. Complexed accessibilities are calculated for the entire PDB file.
  • Residue-residue interactions: for each interaction we store the following information:
    • Residue-residue interaction identifiers: chain, residue number and insertion of the two interacting residues.
    • Residue-residue distances: we record the minimal atom distance, the maximal atom distance and the Ca distance (if applicable).
    • Chemical characteristics: we record whether the residues are involved in any of the following interaction types, as defined by Arpeggio: clash, covalent, vdw_clash, vdw, proximal, hydrogen_bond, weak_hydrogen_bond, halogen_bond, ionic, metal_complex, aromatic, hydrophobic, carbonyl, polar, weak_polar.

The bulk data are JSON-formatted as follows for each PDB:


	'contacts': //list of contacts in this structure
		.
		.
		.
		"('P111', 'P12')": //IDs of residues in the interaction.
				{'basic': // basic distance information
					  {
						'ca': 5.0, //Calpha distance (if applicable)
						'max': 8.5, //maximum distance between atoms in the residue/nucleotide.
						'min': 3.2 //minimum distance between atoms in the residue/nucleotide
					   },
                      		  'chem': { //chemical bond information
					   'hydrophobic': [ //name of the interaction
						['P/12/CG2', 'P/111/CB'], //list of atoms involved in the interaction.
                                                .
						.
						.,
					    .
					    .
					    .
                                	   
		.
		.
		.

	'chains': //list of chains.
		'H': //chain identifier
		.
		.
		.
		'H99': {'aa': 'ASP', //residue/nucleotide/HET molecule name.
			  'canon': 'true', //is it a canonical residue/nucleotide?
			  'common_solvent': 'false', //is it a common solvent?
			  'het': 'false', //is it a hetatom entry?
			  'surface': {'sc_absolute_complexed': 13.6, //side chain absolute accessibility -- complexed
				       'sc_absolute_diff': 0.0, //side chain absolute accessibility difference between complexed and uncomplexed.
				       'sc_absolute_uncomplexed': 13.6, //side chain absolute accessibility -- uncomplexed
				       'sc_relative_complexed': 13.5, //side chain relative accessibility -- complexed
				       'sc_relative_diff': 0.0, //side chain relative accessibility difference between complexed and uncomplexed
				       'sc_relative_uncomplexed': 13.5}}} //side chain relative accessibility uncomplexed
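
As an illustration of how the contact records might be used, the sketch below selects contacts by minimum atomic distance and, optionally, by interaction type. It assumes that the 'contacts' entry has been loaded into a Python dictionary shaped like the example above, with keys that are string representations of residue-id pairs; the function names and the 4.0 default cutoff are hypothetical choices.

```python
import ast


def parse_contact_key(key):
    """Turn a key such as "('P111', 'P12')" into the tuple ('P111', 'P12')."""
    return ast.literal_eval(key)


def select_contacts(contacts, max_min_distance=4.0, interaction=None):
    """Yield (residue_a, residue_b, info) for contacts within the distance cutoff."""
    for key, info in contacts.items():
        if info['basic']['min'] > max_min_distance:
            continue
        # 'chem' may be absent when no chemical annotation is available.
        if interaction is not None and interaction not in (info.get('chem') or {}):
            continue
        residue_a, residue_b = parse_contact_key(key)
        yield residue_a, residue_b, info


# The single contact from the example above.
contacts = {
    "('P111', 'P12')": {
        'basic': {'ca': 5.0, 'max': 8.5, 'min': 3.2},
        'chem': {'hydrophobic': [['P/12/CG2', 'P/111/CB']]},
    },
}
for a, b, info in select_contacts(contacts, interaction='hydrophobic'):
    print(a, b, info['basic']['min'])
```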

					
API

Processed PDB data are also available through an API. The format is identical to that of the bulk files. Both PDB-level and residue-level data are returned (see the bulk data tab).

http://proteincontact.org/api?pdb=1ahw

Please do not attempt to download all the data via the API; the same data is available as a bulk download for this purpose.
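
A minimal sketch of querying the API from Python is shown below. It assumes that the response body is UTF-8 encoded JSON, in line with the bulk format; the helper names and the one-second pause between requests are illustrative choices, not requirements of the service.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = 'http://proteincontact.org/api'


def build_api_url(pdb_code):
    """Build the query URL for one PDB code."""
    return API_URL + '?' + urlencode({'pdb': pdb_code})


def fetch_pdb(pdb_code, timeout=30):
    """Fetch and decode the API response for one PDB code."""
    with urlopen(build_api_url(pdb_code), timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


def fetch_many(pdb_codes, pause=1.0):
    """Fetch a handful of entries, pausing between requests to keep the load low."""
    results = {}
    for code in pdb_codes:
        results[code] = fetch_pdb(code)
        time.sleep(pause)
    return results


# Only the URL is built here; call fetch_pdb('1ahw') to perform the request.
print(build_api_url('1ahw'))
```

For more than a few entries, the bulk download remains the better option.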