LUTE (Local Unpruned Tuple Expansion): Accurate Continuously Flexible Protein Design with General Energy Functions and Rigid Rotamer-Like Efficiency - PMC

Abstract

Most protein design algorithms search over discrete conformations and an energy function that is residue-pairwise, that is, a sum of terms that depend on the sequence and conformation of at most two residues. Although modeling of continuous flexibility and of non-residue-pairwise energies significantly increases the accuracy of protein design, previous methods to model these phenomena add a significant asymptotic cost to design calculations. We now remove this cost by modeling continuous flexibility and non-residue-pairwise energies in a form suitable for direct input to highly efficient, discrete combinatorial optimization algorithms such as DEE/A or branch-width minimization. Our novel algorithm performs a local unpruned tuple expansion (LUTE), which can efficiently represent both continuous flexibility and general, possibly nonpairwise energy functions to an arbitrary level of accuracy using a discrete energy matrix. We show using 47 design calculation test cases that LUTE provides a dramatic speedup in both single-state and multistate continuously flexible designs.*

Keywords: : algorithms, combinatorial optimization, drug design, machine learning, protein structure

1. Introduction

Protein design algorithms compute protein sequences that will perform a desired function (Donald, 2011). They generally do this by minimizing the energy of a desired binding or structural state [or some combination thereof (Leaver-Fay et al., 2011; Hallen and Donald, 2015)] with respect to sequence (Desmet et al., 1992; Floudas et al., 1999; Kuhlman and Baker, 2000; Georgiev et al., 2008, 2014; Karanicolas and Kuhlman, 2009; Donald, 2011; Gainza et al., 2012). Given a model of the conformational space of a protein and its energy function, which maps conformations to their energies, this is a well-defined computational problem (Donald, 2011).

Previously, this minimization problem has been most efficient to solve if two restrictions are imposed on the model. First, the conformational space of the protein is modeled as discrete. Specifically, each residue takes on conformations from a discrete set [typically, experimentally observed side chain conformations known as rotamers (Janin et al., 1978)]. Hence, we optimize with respect to the amino acid type and rotamer of each residue. Second, the energy function is assumed to be residue pairwise, that is, it is assumed to be a sum of terms that each depend on the amino acid types and conformations of at most two residues.

A large body of efficient algorithms has been developed for this restricted case of the protein design problem, many of which offer provable accuracy. In particular, the dead-end elimination (DEE) algorithm (Desmet et al., 1992) removes rotamers that provably cannot be part of the global minimum energy conformation (GMEC). The A* algorithm from artificial intelligence (Hart et al., 1968) finds the optimal conformation using these unpruned rotamers (Leach and Lemon, 1998). This DEE/A* framework has been generalized to model free energies for each sequence instead of simply GMECs (Lilien et al., 2005; Georgiev et al., 2008) (the algorithm). It has also been generalized to optimize combinations of stability and specificity by minimizing, with respect to sequence, a linear combination of the conformationally optimized energies of several bound and unbound states of a protein, instead of just the energy of a single state (Hallen and Donald, 2015) (the COMETS algorithm). Several methods in addition to DEE/A* have also been used to address the protein design problem. Some of these, such as Metropolis Monte Carlo and simulated annealing (Lee and Levitt, 1991; Kuhlman and Baker, 2000), lack provable guarantees of accuracy, and thus may miss the optimal conformation significantly (Simoncini et al., 2015). Other algorithms with provable accuracy are also available, largely building on techniques from integer linear programming (Kingsford et al., 2005; Roberts et al., 2015) and weighted constraint satisfaction (Traoré et al., 2013; Roberts et al., 2015; Traoré et al., 2016). Notably, treewidth- and branch width-based algorithms, such as TreePack (Xu and Berger, 2006) and BWM (Jou et al., 2015), solve this problem with provable accuracy in polynomial time for systems whose residue interaction graph has treewidth or branch width bounded by a constant (Jou et al., 2015).

However, proteins are actually continuously flexible, and continuous flexibility both in the side chains (Gainza et al., 2012) and backbone (Hallen et al., 2013) has been shown to result in significantly lower energies and biologically better sequences (Gainza et al., 2012; Hallen et al., 2013). Although a residue side chain will usually be found in the vicinity of the modal conformation for a rotamer, its dihedral angles will often differ from this mode by 10° or more (Janin et al., 1978). These continuous adjustments are often critical for determining what conformations are sterically feasible (Gainza et al., 2012). Thus, incorporation of continuous flexibility modeling substantially increases the accuracy of designs. The minDEE and iMinDEE methods (Georgiev et al., 2008; Gainza et al., 2012) do this for continuous side chain flexibility, and DEEPer (Hallen et al., 2013) for simultaneous continuous side chain and backbone flexibility. These methods replace the traditional discrete rotamers used in DEE/A* with voxels in the conformation space of each residue, called residue conformations (RCs). An RC is defined as an amino acid type together with bounds on each of the conformational degrees of freedom of the residue (e.g., side chain dihedrals) (Hallen et al., 2013). The modal conformation for a rotamer is usually found at the center of this voxel. In this model, the conformation space of an entire protein is a union of voxels, each of which is constructed as the cross-product of single-residue voxels. Thus, each voxel in the conformation space of the entire protein is represented by a list of RCs, one for each residue being modeled as flexible. RCs are constructed to be small enough that we can use local minimization to find the optimal energy within the voxel. This applies to both the single-residue and entire-protein voxels.

However, the global minimum energy in this model could not previously be computed directly by DEE/A*. Instead, DEE/A* was used to enumerate RC lists (protein conformational voxels) in order of a lower bound on minimized energy (Georgiev et al., 2008; Gainza et al., 2012; Hallen et al., 2013). Subsequently, the optimal energy for each RC list with a sufficiently low-energy lower bound was computed by minimization. The lower bound was computed from minimized pairwise interaction energies (Georgiev et al., 2008). This minimization was accelerated significantly by precomputing polynomials to approximate the energy landscape, using the EPIC algorithm (Hallen et al., 2015). However, minimization was still the bottleneck in continuously flexible designs and prevented them from approaching the efficiency of designs with discrete flexibility. In essence, these previous methods modeled continuous flexibility by modifying DEE/A* and making it do much more work. In contrast, LUTE achieves much greater efficiency by representing continuous flexibility in a form suitable for direct input into DEE/A*.

We must also address the question of the energy function. The energy landscape of a real protein is not residue pairwise, or otherwise exactly described solely as the sum of local terms. There is, however, ample evidence that protein interactions are local in a more general sense (Zhang and Zhang, 2003; Flocke and Bartlett, 2004; Vizcarra et al., 2008; Hallen et al., 2015)—that is, that the cross-derivative of the energy with respect to conformational degrees of freedom of two residues will tend to zero fairly quickly as the distance between the residues increases. These properties are also observed for more realistic energy functions that return an energy for the entire protein, rather than breaking the energy into terms as molecular mechanics does. For example, the Poisson–Boltzmann model for implicit solvation (Sitkoff et al., 1994) and quantum-chemical models return an energy for the entire system on which they are run. Thus, a viable approach to modeling protein energies more realistically is to infer local terms from full-protein energies. Vizcarra et al. (2008) apply this approach to the Poisson–Boltzmann model, calculating pairwise energies from differences in full-protein conformational states and achieving a pairwise energy matrix that quite accurately matches the Poisson–Boltzmann energies of full conformations.

However, their method can only accommodate rotamer pairs, does not support continuous flexibility, and can only be used when substituting a single rotamer into a conformation is possible while maintaining the conformation of the other residues. This is impossible when residues share conformational degrees of freedom, which is typically needed for backbone flexibility (Georgiev and Donald, 2007; Hallen et al., 2013), and may also cause problems in the case of steric clashes. Also, DEE/A* has been generalized to accommodate higher-than-pairwise energy terms if these terms are computed explicitly for particular tuples, for example, triples of residues (LuCore et al., 2015). However, most energy functions modeling higher-than-pairwise effects, including Poisson–Boltzmann, return a single energy for the entire system, rather than a sum of explicit local terms as required by algorithms such as those in LuCore et al. (2015).

Hence, today's protein and drug designers are faced with a choice. They can neglect continuous flexibility and energy terms that are not explicitly local (e.g., explicitly pairwise), thus incurring significant error. Or they can pay a massive overhead to incorporate them—by enumerating many conformations (for continuous flexibility) or searching exhaustively (for nonpairwise energy functions). We now offer a way around this dilemma. We construct an energy function that is an explicit, discrete sum of local energy terms, which are associated with tuples of RCs. This function maps RC lists, which represent voxels in the conformation space of a protein, to energies. But it will approximate, to arbitrary accuracy, the minimized voxel energy, which can be computed with any energy function: no need for residue pairwiseness or any other local representation. Computing this approximation is a machine-learning problem, and we solve it with a least-squares method.

Our approach has some resemblance to cluster expansion methods, which have previously been used in quantum mechanics (Čížek, 2009) and to represent optimized energies for protein sequences (Grigoryan et al., 2006, 2009). However, as discussed in Hallen et al. (2015), approximations of energy surfaces can be much more compact if unrealistically high-energy regions of conformational space are excluded from the approximation (and from the subsequent conformational search). Thus, unlike cluster expansion methods, we exclude pruned tuples of RCs, making our derived energy function a local unpruned tuple expansion (LUTE). Because conformational and sequence search using the LUTE energy function is a discrete optimization problem of the type solved by DEE/A*, BWM*, and other very efficient algorithms, it allows designs to run quickly using these algorithms, while still approximating continuous flexibility and highly realistic energy functions to a high level of accuracy.

We have implemented LUTE in the OSPREY (Georgiev et al., 2008, 2009; Gainza et al., 2013) open-source protein design package, which has yielded many designs that performed well experimentally—in vitro (Stevens et al., 2006; Gorczynski et al., 2007; Chen et al., 2009; Frey et al., 2010; Georgiev et al., 2012; Roberts et al., 2012; Rudicell et al., 2014) and in vivo (Gorczynski et al., 2007; Frey et al., 2010; Roberts et al., 2012; Rudicell et al., 2014) as well as in nonhuman primates (Rudicell et al., 2014). OSPREY contains a wide array of flexible modeling options and provably accurate design algorithms (Georgiev et al., 2009; Gainza et al., 2013), allowing LUTE to be used for many types of designs.

By presenting LUTE, this article makes the following contributions:

1. A method to represent continuous flexibility and general energy functions to arbitrary accuracy in an LUTE that can be used directly as input to discrete combinatorial search algorithms such as DEE/A*.
1. A free implementation of LUTE in our laboratory's open-source OSPREY protein-design software package (Georgiev et al., 2008, 2009; Chen et al., 2009; Frey et al., 2010), available for download (Georgiev et al., 2009) as free software (Georgiev et al., 2009), supporting representation of both continuous side chain and backbone flexibility and of molecular mechanics and Poisson–Boltzmann energy functions.
1. Integration of LUTE with the DEE/A* (Leach and Lemon, 1998), iMinDEE (Gainza et al., 2012), BWM* (Jou et al., 2015), and COMETS (Hallen and Donald, 2015) algorithms for sequence and conformational search.
1. Bounds on the time and space complexity of protein design calculations that model continuous flexibility and/or use energy functions with nonlocal terms. The time and space complexity are exponential merely in the branch width w of the residue interaction graph, and thus the designs can be done in polynomial time for systems whose branch width is bounded by a constant.