Practical 3: Introduction to protein structure

Bert de Groot

Contents

Reading PDB files
X-ray crystallography
NMR spectroscopy
References

In the last lectures we learned how protein structures are determined experimentally. Today, we'll focus on how computational techniques are employed to aid structure determination.

There are two main techniques for solving protein structures: x-ray crystallography and Nuclear Magnetic Resonance (NMR). As can be seen from the current PDB holdings, more than 180,000 protein structures have been solved so far, and are available from the Protein Data Bank. About 88% of these are x-ray crystallographic structures, the rest are NMR structures (7%) and cryoEM structures (5%). As can also be seen on the PDB holdings list, the number of solved proteins grows even faster, due to advancements in structure determination techniques. Nevertheless, the number of known protein sequences is three times larger (currently about 565,000, as available from the UniProtKB/Swiss-Prot database).

Reading PDB files

Experimentally solved protein structures are stored at the Protein Data Bank, from which individual protein structures can be retrieved as so-called PDB files. Before we will turn to the structure determination itself, let us have a closer look at a typical PDB file, to see what can be learned about the background of the structure (experimental conditions etc.) and the structural quality (the resolution, coordinate uncertainty). We will focus on PDB entry 1DWR (an x-ray structure of myogloblin complexed with carbon monoxid) as it can be downloaded from the Protein Data bank.

PDB file format

The initial lines of a PDB entry contain information on:

The protein, date of deposition, PDB ID code

The biological source of the marcomolecule

List of authors who sent the entry to the PDB

References to the literature that describe the structure in detail

Some basic information regarding the crystallographic data

Especially critical to check are the resolution and the R-factor (particularly the free-R factor), that contain information on how well the deposited structure matches the measured data (x-ray reflection intensities in this case).

HEADER    OXYGEN TRANSPORT                        11-DEC-99   1DWR              
TITLE     MYOGLOBIN (HORSE HEART) WILD-TYPE COMPLEXED WITH CO                   
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: MYOGLOBIN;                                                 
COMPND   3 CHAIN: A                                                             
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 ORGANISM_SCIENTIFIC: EQUUS CABALLUS;                                 
SOURCE   3 ORGANISM_COMMON: HORSE;                                              
SOURCE   4 ORGANISM_TAXID: 9796;                                                
SOURCE   5 ORGAN: HEART                                                         
KEYWDS    OXYGEN TRANSPORT, RESPIRATORY PROTEIN                                 
EXPDTA    X-RAY DIFFRACTION                                                     
AUTHOR    K.CHU,J.VOJTECHOVSKY,B.H.MCMAHON,R.M.SWEET,J.BERENDZEN,               
AUTHOR   2 I.SCHLICHTING                                                        
REVDAT   3   24-FEB-09 1DWR    1       VERSN                                    
REVDAT   2   29-APR-05 1DWR    1       REMARK HET    HETNAM FORMUL              
REVDAT   2 2                           HETATM                                   
REVDAT   1   03-MAR-00 1DWR    0                                                
JRNL        AUTH   K.CHU,J.VOJTECHOVSKY,B.H.MCMAHON,R.M.SWEET,                  
JRNL        AUTH 2 J.BERENDZEN,I.SCHLICHTING                                    
JRNL        TITL   CRYSTAL STRUCTURE OF A NEW LIGAND BINDING                    
JRNL        TITL 2 INTERMEDIATE IN WILDTYPE CARBONMONOXY MYOGLOBIN              
JRNL        REF    NATURE                        V. 403   921 2000              
JRNL        REFN                   ISSN 0028-0836                               
JRNL        PMID   10706294                                                     
JRNL        DOI    10.1038/35002641                              
[..] 
REMARK   2                                                                      
REMARK   2 RESOLUTION.    1.45 ANGSTROMS.                                       
REMARK   3                                                                      
REMARK   3 REFINEMENT.                                                          
REMARK   3   PROGRAM     : X-PLOR 3.851                                         
REMARK   3   AUTHORS     : BRUNGER                                              
REMARK   3                                                                      
REMARK   3  DATA USED IN REFINEMENT.                                            
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 1.45                           
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 20                             
REMARK   3   DATA CUTOFF            (SIGMA(F)) : 0.0                            
REMARK   3   DATA CUTOFF HIGH         (ABS(F)) : NULL                           
REMARK   3   DATA CUTOFF LOW          (ABS(F)) : NULL                           
REMARK   3   COMPLETENESS (WORKING+TEST)   (%) : 96.1                           
REMARK   3   NUMBER OF REFLECTIONS             : 23794               
REMARK   3  
REMARK   3  FIT TO DATA USED IN REFINEMENT.                                     
REMARK   3   CROSS-VALIDATION METHOD          : THROUGHOUT                      
REMARK   3   FREE R VALUE TEST SET SELECTION  : RANDOM                          
REMARK   3   R VALUE            (WORKING SET) : 0.211                           
REMARK   3   FREE R VALUE                     : 0.255                           
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.0                             
REMARK   3   FREE R VALUE TEST SET COUNT      : NULL                            
REMARK   3   ESTIMATED ERROR OF FREE R VALUE  : NULL                       
[..]
REMARK   3  RMS DEVIATIONS FROM IDEAL VALUES.                                   
REMARK   3   BOND LENGTHS                 (A) : 0.013                           
REMARK   3   BOND ANGLES            (DEGREES) : 1.93

Following the introductory material, some specific information regarding the protein and its crystalline form are provided, including:

The protein sequence

The secondary structure elements

Disulfide bonds (where present)

Dimensions of the unit cell

The space group and related information are given.

SEQRES   1 A  153  GLY LEU SER ASP GLY GLU TRP GLN GLN VAL LEU ASN VAL          
SEQRES   2 A  153  TRP GLY LYS VAL GLU ALA ASP ILE ALA GLY HIS GLY GLN          
SEQRES   3 A  153  GLU VAL LEU ILE ARG LEU PHE THR GLY HIS PRO GLU THR        
[..]
FORMUL   6  HOH   *132(H2 O1)                                                   
HELIX    1   1 SER A    3  ASP A   20  1                                  18    
HELIX    2   2 ASP A   20  HIS A   36  1                                  17    
HELIX    3   3 HIS A   36  GLU A   41  1                                   6    
[..]
CRYST1   63.600   28.800   35.600  90.00 106.50  90.00 P 1 21 1      2

Then come the actual atom coordinates (or structure), whose listing takes up most of the average PDB file.
Each listing begins with "ATOM" and is followed by:

The atom number

The atom name (which is uniquely defined for each type of amino acid)

The type of residue

The chain identifier: In this case "A" (each molecule in the model as a unique one)

The residue's sequence number in the protein

The x, y and z coordinates

The occupancy (the fraction of atoms in the crystal that occupies the specified coordinates, usually one)

The temperature factor (or B factor), a measure for the coordinate uncertainty

ATOM      1  N   GLY A   1      -2.316  16.963  14.230  1.00 20.85           N  
ATOM      2  CA  GLY A   1      -2.992  16.384  15.439  1.00 19.00           C  
ATOM      3  C   GLY A   1      -2.137  15.253  16.013  1.00 18.18           C  
ATOM      4  O   GLY A   1      -2.132  14.129  15.477  1.00 18.59           O  
ATOM      5  N   LEU A   2      -1.387  15.560  17.074  1.00 15.98           N  
ATOM      6  CA  LEU A   2      -0.561  14.582  17.790  1.00 13.52           C  
ATOM      7  C   LEU A   2      -1.065  14.383  19.213  1.00 12.59           C

This goes on for a while, until the end of the peptide chain, which is marked by the "TER" line. If there are any other molecules that co-crystallized with the protein (such as solvent molecules or ligands) they are listed as "hetero-atoms" near the end of the file.

HETATM 1203 FE   HEM A 154      14.347  28.659   5.074  1.00  8.13          FE  
HETATM 1204  CHA HEM A 154      15.659  31.898   5.315  1.00  5.90           C  
HETATM 1205  CHB HEM A 154      13.490  28.753   8.433  1.00  6.10           C  
HETATM 1206  CHC HEM A 154      13.145  25.505   4.903  1.00  5.24           C  
HETATM 1207  CHD HEM A 154      15.262  28.616   1.824  1.00  6.55           C

Go back to Contents

X-ray crystallography

The main technique for determining protein structures is x-ray crystallography. Since the first protein structure (myoglobin) was solved by this technique by John Kendrew and Max Perutz in the late fifties, several thousand others followed. As can be appreciated from the picture on the right, which shows John Kendrew with the structural model of myoglobin, at that time the determination of a structure the size of a protein, without the aid of a computer, was a formidable task.

It is important to note that in both x-ray crystallography and NMR, protein structures are not measured directly in the experiment. Rather, a set of data is collected (a diffraction pattern or a NMR spectrum), from which a model of the protein structure is derived. To appreciate the difference between data and structure, we'll now look at two different structures of the same protein, and the corresponding x-ray crystallographic data. For this, we will concentrate on the bacterial light driven proton pump bacteriorhodopsin. Click here for more background information on bR.

First take a look two bR structures on the Protein Data Bank, with PDB entries 1BRR and 1QHJ, using a browser on your PC.

Let's download the two structures to the remote CIP-Pool machine:

curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/1BRR.pdb
curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/1QHJ.pdb

View the structures with pymol:

pymol 1BRR.pdb

In pymol console, type these:

hide all
show cartoon, chain A

To highlight the dye (the light sensor in the protein interior):

show spheres, resn RET and chain A
color gold, resn RET and chain A

Have a close look at the structure, before repeating the procedure for 1QHJ.pdb in a different window, such that you can compare the two structures.

Question: By looking at the structures, which of the two structures would you prefer, in terms of coordinate accuracy?

Remember, so far we only looked at the coordinates, which represent a model that was optimized against the actually measured data. So, let us now have a look at the data. In x-ray crystallography, data are collected by measuring a diffraction pattern that is obtained from x-rays reflected by a protein crystal. As mentioned in the lecture, this diffraction pattern itself does not suffice to determine the complete structure since only the amplitudes of the diffracted waves were collected, not their phases. In x-ray crystallography, however, there are a number of tricks available (e.g. isomorphous replacement, molecular replacement) but we will not go into that in detail here. What is important to remember is that eventually, an atomic electron density map is obtained.

Question: Why do primarily the electrons of a molecular sample contribute to the diffraction of x-rays? answer.

Using the browser on your PC, visit the PDBe server to view the electron density map 1BRR. Enter the PDB code (1BRR), and wait for the summary page to load. Several plots with information on this structure are available. Feel free to browse around to check the meaning of the individual plots.

Now open this molecular viewer.

Make sure the Source is set to PDB, and enter 1BRR in the "PDB Id(s)" field (left panel). Click Apply
Click Enable in the Volume Streaming subpanel of the right panel. This will allow displaying the XRD electron density.
The idendity of a residue is displayed in the bottom right corner if the cursor is hovering above it
The local electron density map is being displayed after clicking a residue. You can also click on the one-letter code of the sequence on the top panel.
Move the mouse along the sequence to position 80 until you see the sequence fragment "WARYA", and click on the "Y" (take a look at this chart if you are unfamiliar with the 3 and 1 letter codes)
You can adjust the settings for the density in the Volume Streaming subpanel, including enabling/disabling density with the eye icon. Playing around with the isosurface slider allows one to grasp the density better. Reset to default on the bottom left of the subpanel.

2F0-Fc: the general density combining diffraction data and atomic model
Fo-Fc (+ve = positive): Where there is electron density in the diffraction data, but not in the model
Fo-Fc (-ve = negative): Where there are atoms according to the model, but nothing according to the diffraction data
Interested in more details on what the maps are? See here

You should now see the six-membered aromatic ring of the tyrosine. Do you find the electron density for the aromatic ring convincing?
Now shift the focus to residue S35, the "S" in the sequence "SDPDA".
How is the fit between the model and the data here? Hint: use the negative map
To see the retinal, the light sensor in the center of the protein, choose 3:RETINAL at the very top, in the Sequence of third dropdown menu.

Note:You can display the control help by clicking the question mark at the top left and familiarize yourself with the controls.

Now repeat the procedure for entry 1QHJ. How is the fit for residue Y83? And for S35? And the retinal?

Question: Based on the data and on the model structures, would you say there is a large impact of the resolution of the data on the accuracy of the structural model?

Question: What ranges of resolution do you think belong to low, medium and high resolution structures. What are the typical structural features do you expect to be resolved, respectively. answer.

The highest resolution x-ray crystallographic structures have a resolution of approx. 0.8 Angstrom or even somewhat better. To see an example of such a dataset, look at the density for structure 2B97. Note that you can zoom into the map by clicking "Shift" on the keyboard and moving the mouse with the left button pressed. Do you recognize the difference in appearance?

Question: Although the resolution of this structure is rather high, at 0.75 Angstrom, the hydrogen atoms (e.g. on the side chains) are still difficult to see. Why is this?

A measure for the coordinate uncertainty of the individual atoms due to the thermal motion in the crystal is given by the temperature factor (or B factor).
Low B-factors (< 30) correspond to well-defined parts of the structure, whereas high B-factors (> 80) might indicate highly disordered parts of the structure or even mis-interpreted parts of the model.

Question: How do the temperature factors of a crystallographic structure in principle compare to the flexibilities of a protein in a MD simulation? answer.

Go back to Contents

NMR

The other main technique for determining protein structures is NMR. In contrast to x-ray crystallography, no crystals are required for an NMR experiment. Rather, the structure is determined of the protein in solution. Therefore, it has the advantage that the protein can be studied in its native environment. On the other hand, the resolution of an NMR structure is usually lower and there is a size limitation of a few hundred amino acids for structure determination using NMR.

It would go beyond the scope of this course to explain the NMR experiment in detail. We will therefore only briefly touch on the experimental setup and then focus on the structure building and refinement step based on the obtained data. The NMR signal is recorded as a nuclear magnetic resonance spectrum of predominantly the hydrogen atoms after the sample has been subjected to a (number of) strong magnetic pulse(s). Mainly hydrogen atoms give rise to the signal, because of the magnetic spin properties of the hydrogen nucleus (a proton). The naturally occurring isotopes of the other elements that are found in proteins, carbon (¹²C) and oxygen (¹⁶O), have a zero nuclear magnetic moment. Nitrogen (¹⁴N) does have a non-zero magnetic moment, but can usually not be used in NMR, for reasons that would go beyond the scope of this course to explain. These elements, therefore, can only be utilized in NMR experiments when chemically replaced by a specific isotope, like ¹³C or ¹⁵N. The most structurally relevant information is usually obtained from a so called NOESY experiment (Nuclear Overhauser Enhancement SpectroscopY). The Nuclear Overhauser Effect or Nuclear Overhauser Enhancement is the change (enhancement) of the signal intensity from a given nucleus as a result of exciting or saturating the resonance frequency of another nucleus. Since this effect is distance-dependent, it can be used to derive the distance between an interacting pair of protons. In practice, protons closer than 6A apart can be identified this way.

Now, we will calculate a model of the structure of a small protein, the B1 domain of protein G, from the proton-proton distance information obtained from a NOESY experiment. Download the data file containing the distance information.

curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/restraints.tbl

You can have a look at the file (with the program "more" or "less" or a browser or editor of your choice) to assure yourself that there are indeed only distance bounds listed in this file. Additionally, we need an initial guess of the structure. Since we don't know the structure yet, we have to start from an unstructured peptide chain. Download the peptide chain and have a look at the structure with:

curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/proteinG.pdb
pymol proteinG.pdb

Finally, we need something called a molecular topology, a chemical description of the protein: which atoms the molecule contain, which atoms are covalently bonded to each other, etc. This molecular topology file is available. Download it with:

curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/proteinG.mtf

Now, in principle, we have all the data to attempt to build a structure that is in agreement with all experimentally determined distances. The remaining things we still need are an input file and a parameter file for the CNS program. Dowload the input and the parameter file with:

curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/anneal.inp
curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/protein-allhdg5-4.param

Note that, in contrast to x-ray crystallography, where a single structure is presented, to reflect the fact that the NMR experiment probes an ensemble of protein molecules in solution, an NMR structure is usually represented by an ensemble of structures, that all fulfill the NMR data.

Launch the calculation by first typing

source /usr/opt/cbp/cns_1.3/.cns_solve_env_sh

and then

cns_solve < anneal.inp

10 structures of the B1 domain of protein G will now be calculated by simulated annealing. This is a computationally intensive calculation, as the structure is slowly, dynamically transformed from the extended starting conformation to the real structure, by a slow-cooling simulation, also called simulated annealing. As the calculation is running, we can have a look at how exactly such a calculation proceeds, and how the final structure is generated from the starting guess. Open another terminal window, download the pdb and open the just downloaded file in pymol:

curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/sa.pdb
pymol sa.pdb

and then press the "play" button at the bottom right of the main pymol window, to see an animation of the simulated annealing structure calculation procedure. If the movie plays too fast, on the menu, under Movie -> Frame Rate, choose a different frame rate.

When the CNS structure calculation has finished, go back to that window and collect the results:

cat anneala_*.pdb > anneal.pdb

to combine all ten generated structures into one file. View the result with:

pymol anneal.pdb

Question: Which parts of the structure are well-defined, and which parts show more ambiguity?

There is also an x-ray structure available of the B1 domain of protein G, available under the PDB code 1PGB. Download it from the Protein Data Bank and compare it to the just calculated NMR structure.

curl -O http://www3.mpibpc.mpg.de/groups/de_groot/compbio/p3/1PGB.pdb

Question: What are the main differences between the NMR and x-ray structures of the B1 domain of protein G? hint

Question: Which limitations do you think have NMR and x-ray crystallography, respectively? answer.

Optional

change the initial temperature of the annealing simulation in the file anneal.inp

change the final temperature of the annealing simulation in the file anneal.inp

Question: How do you expect these settings to change the results?

Go back to Contents

Further references

Principles of protein structure and basic in biophysics and biochemistry:

Cantor and Schimmel, Biophysical Chemistry Part I: The conformation of biological macromolecules
PDB website tutorial

Advanced reading:
K Henzler-Wildman and D Kern. Dynamic personalities of proteins Nature 450: 964-972 (2007). [link]

Go back to Contents

For questions or feedback please contact Bert de Groot / bgroot@gwdg.de

Show answers