
Contents lists available at ScienceDirect

Computer Physics Communications

www.elsevier.com/locate/cpc

Implementation of a Bayesian secondary structure estimation method for the SESCA circular dichroism analysis package ✩,✩✩

Gabor Nagy, Helmut Grubmüller

Department of Theoretical and Computational Biophysics, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany

Article info

Article history:
Received 18 December 2020
Received in revised form 30 March 2021
Accepted 26 April 2021
Available online 6 May 2021

Dataset link: https://www.mpibpc.mpg.de/sesca

Keywords: Spectra of biomolecules; CD spectroscopy; Protein secondary structure; Bayesian statistics; Models of proteins

Abstract

Circular dichroism spectroscopy is a structural biology technique frequently applied to determine the secondary structure composition of soluble proteins. Our recently introduced computational analysis package SESCA aids the interpretation of protein circular dichroism spectra and enables the validation of proposed corresponding structural models. To further these aims, we present the implementation and characterization of a new Bayesian secondary structure estimation method in SESCA, termed SESCA_bayes. SESCA_bayes samples possible secondary structures using a Monte Carlo scheme, driven by the likelihood of estimated scaling errors and non-secondary-structure contributions of the measured spectrum. SESCA_bayes provides an estimated secondary structure composition and separate uncertainties on the fraction of residues in each secondary structure class. It also assists efficient model validation by providing a posterior secondary structure probability distribution based on the measured spectrum. Our presented study indicates that SESCA_bayes estimates the secondary structure composition with a significantly smaller uncertainty than its predecessor, SESCA_deconv, which is based on spectrum deconvolution. Further, the mean accuracy of the two methods in our analysis is comparable, but SESCA_bayes provides more accurate estimates for circular dichroism spectra that contain considerable non-SS contributions.

Program summary

Program Title: SESCA_bayes
CPC Library link to program files: https://doi.org/10.17632/5nnsbn6ync.1
Developer's repository link: https://www.mpibpc.mpg.de/sesca
Licensing provisions: GPLv3
Programming language: Python

Nature of problem: The circular dichroism spectrum of a protein is strongly correlated with its secondary structure composition. However, determining the secondary structure from a spectrum is hindered by non-secondary structure contributions and by scaling errors due to the uncertainty of the protein concentration. If not taken properly into account, these experimental factors can cause considerable errors when conventional secondary-structure estimation methods are used. Because these errors combine with errors of the proposed structural model in a non-additive fashion, it is difficult to assess how much uncertainty the experimental factors introduce to model validation approaches based on circular dichroism spectra.

Solution method: For a given measured circular dichroism spectrum, the SESCA_bayes algorithm applies Bayesian statistics to account for scaling errors and non-secondary structure contributions and to determine the conditional secondary structure probability distribution. This approach relies on fast spectrum predictions based on empirical basis spectrum sets and joint probability distribution maps for scaling factors and non-secondary structure distributions. Because SESCA_bayes estimates the most probable secondary structure composition based on a probability-weighted sample distribution, it avoids the typical fitting errors that occur during conventional spectrum deconvolution methods. It also estimates the uncertainty of circular dichroism based model validation more accurately than previous methods of the SESCA analysis package.

✩ The review of this paper was arranged by Prof. Stephan Fritzsche.

✩✩ This paper and its associated computer program are available via the Computer Physics Communications homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).

* Corresponding author. E-mail address: hgrubmu@gwdg.de (H. Grubmüller).

https://doi.org/10.1016/j.cpc.2021.108022

0010-4655/© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

G. Nagy and H. Grubmüller, Computer Physics Communications 266 (2021) 108022

1. Introduction

Circular dichroism (CD) spectroscopy in the far ultraviolet (UV) range (175-260 nm) is an established method to study the structure of proteins in solution [1,2], because of the conformation-dependent characteristic CD signal of peptide bonds that comprise the backbone of all proteins and oligo-peptides. In particular, the CD spectrum is known to change with the secondary structure (SS) of proteins, and markedly different spectra are observed for proteins rich in α-helices, β-sheets, and disordered regions [3,4]. Because of these characteristic signals, it is common to interpret CD spectra by decomposing them into a set of basis spectra that each represent the average CD signal of pure (secondary) structure elements.

The CD analysis package SESCA (Structure-based Empirical Spectrum Calculation Approach) [5] allows several empirical basis spectrum sets to be used in two methods. The first method predicts a theoretical CD spectrum from a proposed SS composition, which is typically obtained from a model structure or structural ensemble. The second method fits a measured CD spectrum to estimate the protein SS composition. Both methods can be used to validate protein structural models. The accuracy and precision of validation methods are mainly limited by scaling errors due to the uncertainty of the measured protein concentration and by non-SS contributions that are not represented in the basis spectra. We have previously quantified the uncertainty caused by these deviations between measured CD spectra and their predicted SS signals [6].

The same study also revealed a potential caveat in the current SS estimation method used in SESCA. In this deconvolution method, a linear combination of selected basis spectra is used to approximate a measured CD spectrum of the protein of interest. The coefficients of the approximation with the smallest deviation are used to estimate the fraction of SS elements in the protein under the measurement conditions. Unfortunately, the interference caused by non-SS contributions may increase the deviation from the measured spectrum for some SS compositions and decrease it for others, which may lead to significant errors in deconvolution-based SS estimates.

To alleviate this problem, we developed and implemented a new SS estimation method for SESCA. The Python module SESCA_bayes determines the likelihood of putative SS compositions using a Bayesian inference framework for a given measured CD spectrum and a basis spectrum set. This method uses the expected joint probability distribution of deviations caused by scaling errors and non-SS contributions, and thus fully accounts for the uncertainty caused by these two experimental factors. Here, we describe the theoretical background, general workflow, as well as input and output parameters of this implementation. Further, we assess the accuracy and precision of this method through a series of sample applications.

2. Theory: Bayesian SS probabilities

Our goal using this method is to determine the conditional probability P(SS|CD) of SS compositions given a previously measured CD spectrum. According to Bayes' rule [7], this probability can be inferred according to

$$P(\mathrm{SS}_j \mid \mathrm{CD}) \propto P(\mathrm{CD} \mid \mathrm{SS}_j) \cdot P(\mathrm{SS}_j), \tag{1}$$

where P(CD|SS_j) is the probability of observing the measured spectrum for a protein with a given SS composition j (i.e., the likelihood function) and P(SS_j) is the prior probability of the given SS composition of the protein.

As shown in Fig. 1 (top), the likelihood P(CD|SS_j) is determined in five steps. First, the SS signal is predicted from the SS composition of interest (C_ji) using an appropriate basis spectrum set (B_il), as discussed in our previous study [5]. Second, if the basis set provides side chain corrections based on the protein sequence, they are added to the predicted spectrum. Third, the measured CD spectrum is rescaled to minimize the root-mean-square deviation (RMSD) from the predicted spectrum using

$$\min \mathrm{RMSD}_j(\alpha_j) = \sqrt{\frac{\sum_{l=1}^{L} \left( S^{\mathrm{comp}}_{jl} - \alpha_j \cdot S^{\mathrm{exp}}_{l} \right)^2}{L}}, \tag{2}$$

where S^exp_l and S^comp_jl are the CD intensities of the measured CD spectrum and the spectrum computed for SS_j at wavelength l, respectively. The obtained scaling factor α_j quantifies and eliminates deviations from scaling errors of the measured spectrum, whereas the RMSD from the rescaled spectrum (RMSD_j) quantifies the average deviation due to unaccounted non-SS contributions. Fourth, once RMSD_j and α_j (collectively, the CD deviations) are computed, the likelihood of such deviations is determined from the joint probability distribution (P_RMSD,α, see below) to estimate the likelihood of observing the measured CD spectrum for the given SS composition, P(CD|SS_j). Finally, to compute the posterior probability P(SS_j|CD) of SS composition j, the CD spectrum likelihood is multiplied by the prior SS probability.
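The rescaling step in Eq. (2) can be illustrated with a minimal sketch. It assumes that the optimal scaling factor is obtained by ordinary least squares, which has a closed-form solution; whether SESCA_bayes uses this closed form or a numerical search is not stated here. The function name `rescale_and_rmsd` and the toy spectra are hypothetical.

```python
import numpy as np

def rescale_and_rmsd(s_comp, s_exp):
    """Return (alpha, rmsd) that minimize Eq. (2) for one predicted spectrum.

    s_comp: predicted CD intensities for one SS composition
    s_exp:  measured CD intensities at the same wavelengths
    """
    s_comp = np.asarray(s_comp, dtype=float)
    s_exp = np.asarray(s_exp, dtype=float)
    # Assumption: least-squares optimum, from d/d(alpha) sum((s_comp - alpha*s_exp)^2) = 0
    alpha = np.dot(s_comp, s_exp) / np.dot(s_exp, s_exp)
    rmsd = np.sqrt(np.mean((s_comp - alpha * s_exp) ** 2))
    return alpha, rmsd

# Toy data: the "measured" spectrum is the predicted one at half the intensity,
# i.e. a pure scaling error without any non-SS contribution.
s_comp = np.array([10.0, -4.0, -8.0, -2.0])
alpha, rmsd = rescale_and_rmsd(s_comp, 0.5 * s_comp)
print(alpha, rmsd)
```

In this toy case the scaling factor absorbs the whole deviation, so the remaining RMSD vanishes; a non-zero RMSD after rescaling is what the text attributes to non-SS contributions.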

3. Methods

3.1. Joint probability distributions

We computed discrete joint-probability distribution functions for SESCA_bayes that can be used to determine CD spectrum likelihoods. These probability distributions were computed from CD deviations extracted from SS estimations of previously measured CD spectra. Reference CD spectra were taken from the SP175 reference set [8], which contains 71 synchrotron radiation CD (SR-CD) spectra of globular proteins with varying SS compositions. The CD spectrum of Jacalin (SP175/41) was discarded from the data set due to issues reported during the measurement and its unusually large estimated CD deviations.

The joint probability distribution functions of CD deviations were constructed as the sum of 70 two-dimensional Gaussian functions, each representing the estimated scaling factors and non-SS contributions of a reference spectrum from the SP175 set. The mean and the variance of these Gaussian functions were determined by averaging over multiple RMSD_j and α_j values obtained for each CD spectrum from SS estimations using four different basis spectrum sets. This approach yielded likelihood functions that were defined for a wide range of possible CD deviations, and took the uncertainty due to discretization errors of the basis spectrum determination into account.
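As an illustration of this construction, the sketch below sums two-dimensional Gaussians on an (RMSD, α) grid and normalizes the result into a discrete probability map. It assumes axis-aligned Gaussians (no covariance between RMSD and α) and uses two made-up reference entries rather than the 70 SP175 spectra; the function name, grid ranges, and bin counts are hypothetical.

```python
import numpy as np

def joint_probability_map(means, variances, rmsd_edges, alpha_edges):
    """Sum of axis-aligned 2D Gaussians evaluated at bin centers, normalized to sum to one."""
    rmsd_edges = np.asarray(rmsd_edges, dtype=float)
    alpha_edges = np.asarray(alpha_edges, dtype=float)
    rmsd_centers = 0.5 * (rmsd_edges[:-1] + rmsd_edges[1:])
    alpha_centers = 0.5 * (alpha_edges[:-1] + alpha_edges[1:])
    rmsd_grid, alpha_grid = np.meshgrid(rmsd_centers, alpha_centers, indexing="ij")
    density = np.zeros_like(rmsd_grid)
    for (mu_r, mu_a), (var_r, var_a) in zip(means, variances):
        # One Gaussian per reference spectrum (assumption: diagonal covariance)
        norm = 1.0 / (2.0 * np.pi * np.sqrt(var_r * var_a))
        exponent = -0.5 * ((rmsd_grid - mu_r) ** 2 / var_r
                           + (alpha_grid - mu_a) ** 2 / var_a)
        density += norm * np.exp(exponent)
    return density / density.sum()

# Two hypothetical reference entries: (mean RMSD in kMRE, mean scaling factor)
means = [(1.0, 1.0), (2.0, 1.2)]
variances = [(0.04, 0.01), (0.09, 0.02)]
prob = joint_probability_map(means, variances,
                             np.linspace(0.0, 4.0, 41), np.linspace(0.0, 2.0, 41))
print(prob.shape, prob.sum())
```

Looking up the bin that contains a given (RMSD_j, α_j) pair in such a map would then give the likelihood used in Section 2.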

In SESCA there are two types of basis sets: those that are solely based on SS compositions, and those that also include side chain corrections. Because the average size of CD deviations differs for these two basis set types, we determined two probability distributions, shown in Fig. 2. The joint probability distribution function for basis sets without side-chain corrections (left) was calculated from CD deviations estimated using the basis sets DS-dT, DSSP-1, HBSS-3, and DS5-4. For basis sets including side-chain corrections, the joint probability of CD deviations (right) was computed using the basis sets DS-dTSC3, DSSP-1SC3, HBSS-3SC1, and DS5-4SC1. For clarity, the figure shows both a linear (top row) and a logarithmic (bottom row) representation of the CD deviation likelihood. For both likelihoods, the one-dimensional probability distribution of RMSD_j was also calculated, which can be used to estimate the secondary structure from CD spectra without regard to the applied scaling factors, although these estimates naturally have a lower precision.

Fig. 1. Secondary structure probability calculation scheme. The figure depicts the algorithm to compute the posterior probability of a given secondary structure j, based on its prior probability and the deviations between its predicted CD signal and a given measured CD spectrum. Input data are depicted as blue parallelograms, operations as white rectangles, and decisions as white diamonds.

3.2. Synthetic spectra

To test the accuracy of the Bayesian SS estimation method, eight synthetic CD spectra were created using a linear combination of the three basis spectra from the DS-dT basis set (as discussed in our previous study [5]). To this aim, the coefficients shown in Table 1 for the basis spectra α-helix, β-strand, and Other for each spectrum were used. For five of the eight synthetic spectra (k = 1 to 5), random coefficients were generated from uniformly distributed random numbers between zero and one, subsequently normalized to sum up to one. For the sixth synthetic spectrum (k = 6), the coefficients 0.3, 0.4, and 0.3 as well as the non-SS contributions (see below) were adopted from our previous study [6] for comparison. Synthetic spectra seven and eight were generated using low α-helix and β-strand contents, respectively, to represent intrinsically disordered proteins (IDP).
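The random coefficients for spectra 1 to 5 can be generated along the following lines. This is a minimal sketch using NumPy's random generator; the seed and the function name are arbitrary choices, and the resulting numbers are not those of Table 1.

```python
import numpy as np

def random_ss_composition(n_classes, rng):
    """Draw uniform random numbers in [0, 1) and normalize them to sum to one."""
    raw = rng.uniform(0.0, 1.0, size=n_classes)
    return raw / raw.sum()

rng = np.random.default_rng(seed=0)   # arbitrary seed, for reproducibility only
coeffs = random_ss_composition(3, rng)  # alpha-helix, beta-strand, Other
print(coeffs, coeffs.sum())
```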

To model the effects of experimental deviations from the ideal SS signal, the CD spectra were modified by adding non-SS signals and scaling errors. The size of these CD deviations for each synthetic spectrum was quantified by the scaling factors α_k and the root-mean-squared intensities of non-SS signals RMSI_k listed in Table 1. Synthetic spectrum 1 (k = 1) was a positive control without any CD deviations (α_k = 1.0, RMSI_k = 0.0 kMRE); spectra 2 and 6 included small (0.4 kMRE) and large (3.5 kMRE) non-SS deviations, respectively, but no scaling errors. CD deviations for spectra 3, 4, 5, 7 and 8 were drawn from the marginal distributions of experimentally observed scaling factors and non-SS contributions using rejection sampling.

Non-SS signals were generated as sums of Gaussian functions using

$$S^{\mathrm{nonSS}}_{kl} = \sum_{g=1}^{G} \frac{I_g}{\sqrt{2\pi\sigma_g^2}} \cdot e^{-\frac{(\lambda_l - \mu_g)^2}{2\sigma_g^2}}, \tag{3}$$

where the non-SS signal S^nonSS_kl of spectrum k at wavelengths λ_l from 178 to 269 nm was computed from the following randomly chosen parameters. The number of Gaussians G was chosen from the range 1 to 5, the relative peak intensity I_g for Gaussian g was chosen between -20.0 and 20.0, with a peak position μ_g chosen from 178 to 241 nm, and peak half-widths σ_g chosen between 2 and 37 nm. Once the parameters were determined, the non-SS signal at every wavelength (using 1 nm spacing) was calculated, and the non-SS signal intensity was rescaled to match the previously defined RMSI values in Table 1.

Fig. 2. The panels depict the heat map representation of two likelihood functions provided for Bayesian SS estimation with SESCA. The estimated joint-probability distributions are shown for basis spectra that A) predict CD signals solely from SS information (left) and B) also include CD corrections from sequence-based side-chain information (right). Panels on the top and bottom show the same probability distributions using a linear and logarithmic color scale, respectively.

Table 1
SS compositions and CD deviations of model proteins. Columns show the index and name of the respective model protein, the fraction of its amino acids assigned to the SS classes α-helix, β-strand, and Other, as well as scaling factors α_k and root-mean-squared intensities RMSI_k of non-SS signals to quantify scaling errors and non-SS contributions in the protein CD spectrum, respectively. Synth denotes synthetic spectrum in the protein's name, whereas Lysm, Dqd-1, Sub-C, and MeV-Nt abbreviate Lysozyme, Dehydroquinate dehydratase I, Subtilisin Carlsberg, and Measles Virus Nucleoprotein C-terminal domain, respectively. Note that SS fractions, scaling factors, and non-SS contributions for all synthetic proteins (k = 1-8) were parameters used to generate their CD spectrum, whereas for reference proteins (k = 9-12), all values were computed based on their measured spectra and protein data bank structures (193L, 2DHQ, 1KU8, and 1SCD, respectively). For the disordered MeV-Nt, the reference parameters were determined from the CD spectrum by Longhi et al. and a molecular dynamics ensemble by Robustelli et al. [9].

| k | protein | α-helix | β-strand | Other | α_k | RMSI_k |
|---|---------|---------|----------|-------|-----|--------|
| 1 | Synth-1 | 0.11 | 0.40 | 0.49 | 1.0 | 0.0 |
| 2 | Synth-2 | 0.41 | 0.20 | 0.39 | 1.0 | 0.4 |
| 3 | Synth-3 | 0.43 | 0.10 | 0.47 | 1.2 | 2.0 |
| 4 | Synth-4 | 0.27 | 0.26 | 0.47 | 1.6 | 2.7 |
| 5 | Synth-5 | 0.00 | 0.33 | 0.67 | 1.4 | 0.7 |
| 6 | Synth-6 | 0.30 | 0.40 | 0.30 | 1.0 | 3.6 |
| 7 | Synth-7 | 0.11 | 0.00 | 0.89 | 1.1 | 2.3 |
| 8 | Synth-8 | 0.00 | 0.08 | 0.92 | 1.3 | 1.3 |
| 9 | Lysm | 0.35 | 0.03 | 0.62 | 1.1 | 1.0 |
| 10 | Dqd-1 | 0.43 | 0.18 | 0.39 | 1.1 | 2.9 |
| 11 | Jacalin | 0.01 | 0.28 | 0.71 | 0.3 | 3.2 |
| 12 | Sub-C | 0.30 | 0.12 | 0.58 | 0.4 | 1.2 |
| 13 | MeV-Nt | 0.08 | 0.03 | 0.89 | 1.7 | 1.8 |
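A possible way to generate such a non-SS signal is sketched below. It evaluates Eq. (3) on a 1 nm wavelength grid and rescales the result to a target RMSI. Drawing each parameter from a uniform distribution over the stated range is an assumption, as are the function names, the seed, and the target value of 2.0 kMRE.

```python
import numpy as np

def gaussian_sum(wavelengths, intensities, centers, widths):
    """Evaluate Eq. (3): a sum of G Gaussians at every wavelength."""
    wl = np.asarray(wavelengths, dtype=float)[:, None]   # shape (L, 1)
    amp = np.asarray(intensities, dtype=float)[None, :]  # shape (1, G)
    mu = np.asarray(centers, dtype=float)[None, :]
    sigma = np.asarray(widths, dtype=float)[None, :]
    terms = amp / np.sqrt(2.0 * np.pi * sigma**2) * np.exp(-((wl - mu) ** 2) / (2.0 * sigma**2))
    return terms.sum(axis=1)

def rescale_to_rmsi(signal, target_rmsi):
    """Rescale a signal so that its root-mean-squared intensity equals target_rmsi."""
    rmsi = np.sqrt(np.mean(signal**2))
    if rmsi == 0.0:
        raise ValueError("signal is identically zero and cannot be rescaled")
    return signal * (target_rmsi / rmsi)

rng = np.random.default_rng(seed=0)                    # arbitrary seed
wavelengths = np.arange(178, 270)                      # 178..269 nm, 1 nm spacing
n_gauss = rng.integers(1, 6)                           # G in 1..5 (upper bound exclusive)
intensities = rng.uniform(-20.0, 20.0, size=n_gauss)   # assumption: uniform draws
centers = rng.uniform(178.0, 241.0, size=n_gauss)
widths = rng.uniform(2.0, 37.0, size=n_gauss)

raw_signal = gaussian_sum(wavelengths, intensities, centers, widths)
non_ss = rescale_to_rmsi(raw_signal, target_rmsi=2.0)  # hypothetical target in kMRE
print(np.sqrt(np.mean(non_ss**2)))
```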

The final synthetic spectra were computed by determining the SS signals first, by adding the appropriately scaled non-SS signal contributions in a second step, and finally by rescaling the resulting CD spectrum according to the indicated scaling factor.

4. Algorithm overview

Our newly implemented Python module SESCA_bayes.py performs a Monte-Carlo (MC) sampling in SS space to determine the most probable SS composition of a protein based on its measured CD spectrum. Fig. 3 shows the flowchart of the algorithm, which is divided into three phases: preparation, sampling, and evaluation.

4.1. Preparation and input parameters

In the preparation phase, input, output, and run parameters are read based on the user-provided command line arguments. If SESCA_bayes.py is used as a Python module, an array of arguments can be processed by the function Read_Args and passed to the Main function to run the algorithm. Arguments in SESCA are identified by preceding command flags (marked by the "@" character in the first position). There are four input files, shown as blue parallelograms in Fig. 3, that SESCA_bayes accepts, each read in white-space separated data blocks stored as simple ASCII text files.

The CD spectrum file (specified using the @spect flag) should contain two columns, wavelength in nanometers (nm) and CD signal intensity in 1000 mean residue ellipticity (kMRE) units. This file must be specified for SESCA_bayes, and if no command flags are provided, the first argument is automatically recognized as a CD spectrum file.
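For illustration, a spectrum file in this two-column format can be parsed as follows. The sketch is independent of SESCA's own reader; treating lines that start with "#" as comments is an assumption, and `read_cd_spectrum` is a hypothetical helper.

```python
import io
import numpy as np

def read_cd_spectrum(source):
    """Read wavelength (nm) and CD intensity (kMRE) from a white-space separated text source."""
    # Assumption: comment lines start with '#'; ndmin=2 keeps single-row files two-dimensional
    data = np.loadtxt(source, comments="#", ndmin=2)
    if data.shape[1] < 2:
        raise ValueError("expected at least two columns: wavelength and intensity")
    return data[:, 0], data[:, 1]

# In-memory stand-in for a spectrum file; the numbers are made up
example = io.StringIO(
    "# wavelength(nm)  intensity(kMRE)\n"
    "260.0   0.12\n"
    "259.0   0.15\n"
    "258.0   0.21\n"
)
wavelengths, intensities = read_cd_spectrum(example)
print(wavelengths, intensities)
```

The same function accepts a file path in place of the `io.StringIO` object, since `numpy.loadtxt` handles both.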

Fig. 3. Schematic workflow of the Bayesian secondary structure estimation module in SESCA. The scheme depicts input data files as blue parallelograms; data on the sampled SS compositions are shown as a red parallelogram. Operations are depicted as white rectangles, and decisions are shown as white diamonds. Posterior probability calculation operations (see Fig. 1) are highlighted as yellow rectangles on the scheme.

The side-chain correction file (specified by @corr) is an optional file to add baseline or sequence-dependent side-chain corrections to the predicted CD spectrum, which are independent of the SS composition. If the basis spectrum set has basis spectra to calculate side-chain contributions, these signals can be computed before running SESCA_bayes, and added as a correction.

The Bayesian parameter file (@par) contains several data blocks, most importantly the binned probability distribution function of CD deviations P_RMSD,α (likelihood function), prior probability distributions for the SS composition P(SS_j) and scaling factors P(α_j), as well as the MC step parameters. If no parameter file is provided by the user, SESCA_bayes.py uses one of two default parameter files (Bayes_2D_SC.dat and Bayes_2D_noSC.dat) found in the "libs" sub-directory of SESCA, depending on whether a side chain correction file was provided or not. These files contain one of the two likelihood functions shown in Fig. 2, and uniform prior SS probability distributions. A more detailed description of the parameter blocks is provided in the examples sub-directory (example 5).

The basis set file (@lib) contains several data blocks for CD spectrum calculations, including a block where the CD intensity of 3-6 basis spectra at each wavelength (175-269 nm) is provided. Several derived basis sets are available in the libs sub-directory, and a detailed description of the data blocks is given in example 1.

In addition to the input files, SESCA_bayes recognizes several additional command flags to modify program behavior. The number of initial SS compositions for the MC sampling phase is specified by @size. The number of MC steps per initial SS composition is set by @iter. The @scale flag allows the user to control whether the measured CD spectrum is rescaled before determining the deviation from the predicted CD spectra or not. In the absence of these command flags, the values 100, 500, and 1 (yes) are used for the SS estimation.

Finally, three command flags control the output of SESCA_bayes.py; providing a "0" argument to any of these flags disables writing the associated output. The command flag @write specifies the file name for the primary output, and if no command flags are given, SESCA_bayes automatically recognizes the second argument as the primary output file. This file contains a summary of the input parameters, binned posterior probability distributions for the SS compositions and scaling factors, as well as the most probable SS fractions and their uncertainties. The command flags @proj and @data allow the user to print secondary output files. The @proj flag specifies a file name for a heat-map-style two-dimensional projection of the posterior SS distribution. The projection is made along two SS fractions selected using the @pdim flag. Finally, the flag @data specifies a file name for printing all the sampled SS compositions the primary output is computed from, along with their estimated CD deviations, prior and posterior probabilities. By default, only the primary output file is printed into 'SS_est.out', and no secondary output is written.

4.2. Monte Carlo sampling

To determine the most probable SS composition of the protein based on its CD spectrum, sampling of the SS space is required. To this aim, SESCA_bayes uses an MC sampling scheme starting from N (set by @size) initial SS compositions, drawn from the prior SS distribution using rejection sampling. As the center part of Fig. 3 shows, at every step t of the MC sampling phase, a change on each of the SS compositions (C_ji,t) is attempted. The change is realized by transferring a given SS fraction between two randomly chosen SS classes, yielding a new SS composition (C'_ji,t). The amount of the transferred SS fraction from the donor class to the acceptor class is determined based on the distribution specified in the Bayesian parameter file. If no distribution is provided, the fraction is drawn from a Gaussian distribution with a mean of 0.05 and variance of 0.1. To remain in the space of possible SS compositions, the transferred SS fraction cannot exceed the current fraction assigned to the donor class, and classes that currently have a fraction of zero assigned to them cannot be selected as donors.
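The move described above could be sketched as follows. Folding negative Gaussian draws to positive values and capping the transfer at the donor's current fraction are assumptions about details the text leaves open; `propose_move` is a hypothetical name, not a function of SESCA_bayes.py.

```python
import numpy as np

def propose_move(composition, rng, mean=0.05, variance=0.1):
    """Transfer a random SS fraction from a donor class to a different acceptor class."""
    comp = np.asarray(composition, dtype=float)
    # Classes with a fraction of zero cannot donate
    donors = np.flatnonzero(comp > 0.0)
    donor = rng.choice(donors)
    acceptors = [i for i in range(comp.size) if i != donor]
    acceptor = rng.choice(acceptors)
    # Assumption: negative draws are folded to positive transfer amounts
    amount = abs(rng.normal(mean, np.sqrt(variance)))
    # Assumption: the transfer is capped at the donor's current fraction
    amount = min(amount, comp[donor])
    new_comp = comp.copy()
    new_comp[donor] -= amount
    new_comp[acceptor] += amount
    return new_comp

rng = np.random.default_rng(seed=0)   # arbitrary seed
start = np.array([0.0, 0.5, 0.5])     # toy composition: alpha-helix, beta-strand, Other
print(propose_move(start, rng))
```

By construction every proposed composition stays non-negative and keeps a total of one, so the chain never leaves the space of valid SS compositions.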

After the changes are attempted, the posterior probabilities P'_jt of the new SS compositions are calculated (see Section 2) and compared to the posterior probabilities (P_jt) of the SS compositions before the change. The attempted change is accepted or rejected by applying the Metropolis criterion to the ratio of posterior probabilities, i.e., the change is accepted if the ratio P'_jt/P_jt is larger than a randomly generated number between zero and one. If the change is accepted, C'_ji,t is added to the sampled SS distribution and used as the initial SS composition C_ji,t+1 in the next MC step; otherwise C_ji,t is added to the sampled SS distribution (again) and is used in the next MC step. This procedure is repeated until the specified number of MC attempts is reached, and yields N×t_max sampled SS compositions. The sampled SS compositions resemble the prior SS distribution during the initial MC steps but converge towards an SS distribution weighted by the posterior SS probabilities.
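The acceptance rule and the bookkeeping of sampled states can be sketched generically. The example below applies the criterion to a toy one-dimensional target instead of SS compositions; `metropolis_accept` and `run_chain` are hypothetical helpers, and accepting any move away from a zero-probability state is an added assumption.

```python
import numpy as np

def metropolis_accept(p_new, p_old, rng):
    """Accept if the posterior ratio exceeds a uniform random number in [0, 1)."""
    if p_old == 0.0:
        return True  # assumption: always leave a zero-probability state
    return bool(p_new / p_old > rng.uniform(0.0, 1.0))

def run_chain(start, posterior, propose, n_steps, rng):
    """Run one Metropolis chain and return the list of sampled states."""
    current, p_current = start, posterior(start)
    samples = []
    for _ in range(n_steps):
        candidate = propose(current, rng)
        p_candidate = posterior(candidate)
        if metropolis_accept(p_candidate, p_current, rng):
            current, p_current = candidate, p_candidate
        # A rejected move adds the unchanged state to the sample again
        samples.append(current)
    return samples

# Toy demonstration: unnormalized standard normal target, Gaussian random-walk proposal
rng = np.random.default_rng(seed=0)
target = lambda x: np.exp(-0.5 * x * x)
step = lambda x, generator: x + generator.normal(0.0, 1.0)
samples = run_chain(3.0, target, step, 5000, rng)
print(len(samples), np.mean(samples))
```

Only the ratio of posterior values enters the criterion, which is why the unnormalized product of likelihood and prior is sufficient for the sampling.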

4.3. Sample evaluation

The sampled SS distribution is analyzed in the evaluation phase, as shown in the bottom part of Fig. 3. To avoid the over-representation of very low posterior probability SS compositions, a fraction of the initially sampled SS compositions may be discarded from the final SS distribution. This fraction can be set by the user through the @discard flag; otherwise, the initial 5% of SS compositions is discarded. The remaining probability-weighted ensemble of possible SS compositions is used to compute the estimated SS composition C^est_ji for the protein and the estimated scaling factor α^est_j, as well as to approximate the discrete posterior probability distribution for both quantities.

The estimated SS composition is determined by computing the mean and standard deviation (SD) of each SS fraction over the sampled SS compositions. Similarly, the most probable scaling factor is computed as the mean and SD of scaling factors estimated for the sampled SS compositions. The discrete probability distribution for both scaling factors and SS compositions is computed by binning all sampled SS compositions and scaling factors using the parameters extracted from the prior distributions provided in the Bayesian parameter file. The number of sampled SS compositions and scaling factors in each bin is normalized by the final sample size to obtain the discrete probability distributions. The computed estimates, their uncertainties and the discrete probability distributions are all written in the primary output file (defined by the @write flag) and returned as output by the SESCA_bayes module. If requested (@data flag), the sampled SS compositions can be printed in a separate file. Finally, the two-dimensional projection of the posterior SS distribution along two chosen SS fractions can also be written into a separate output file (@proj flag), formatted as a human-readable heat map that can be easily processed into images using, e.g., Python's Matplotlib module [10] or external visualization programs.
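The evaluation step can be sketched as follows, assuming the samples are stored in sampling order as rows of an array. For brevity the sketch bins each SS fraction separately on a fixed grid between zero and one, rather than using the bin parameters from the Bayesian parameter file; `evaluate_samples` and the Dirichlet toy data are hypothetical.

```python
import numpy as np

def evaluate_samples(samples, discard=0.05, n_bins=20):
    """Return mean, SD, and per-class discrete distributions of sampled SS compositions."""
    samples = np.asarray(samples, dtype=float)
    # Drop the initial fraction of the sample (5% unless stated otherwise)
    n_skip = int(round(discard * len(samples)))
    kept = samples[n_skip:]
    mean = kept.mean(axis=0)
    sd = kept.std(axis=0)
    # Assumption: fixed, equally spaced bins on [0, 1] for every SS class
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts = np.array([np.histogram(kept[:, i], bins=edges)[0]
                       for i in range(kept.shape[1])])
    # Normalize bin counts by the final sample size
    return mean, sd, counts / len(kept)

# Toy sample standing in for the MC output: 1000 three-class compositions
rng = np.random.default_rng(seed=1)
toy_samples = rng.dirichlet([4.0, 2.0, 6.0], size=1000)
mean, sd, hist = evaluate_samples(toy_samples)
print(mean, sd)
```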

5. Testing the algorithm

5.1. Accuracy and precision

The accuracy and precision of the Bayesian SS estimation were tested using the 13 CD spectra listed in Table 1. Eight of these spectra (k = 1-8) are synthetic spectra that were generated from a given SS composition, but modified by adding artificial non-SS signals and scaling errors (see Section 3.2) to emulate CD deviations in real measured spectra. Four of the remaining five CD spectra (k = 9-12) are measured spectra from the SP175 set [8], for which the estimated SS compositions are compared to those extracted from the (protein data bank) structure of the reference protein. The last CD spectrum (k = 13) was measured for the intrinsically disordered C-terminal domain of the Measles virus Nucleoprotein by Longhi et al. [11]. Because this domain is disordered, there was no experimental reference structure available for it, and therefore we used a molecular dynamics ensemble of Robustelli et al. [9] as reference. This ensemble was generated using the Amber99SB-disp force field and was validated by small angle X-ray scattering (SAXS) and nuclear magnetic resonance (NMR) experiments. Table 1 lists the (estimated) CD deviations of all 13 CD spectra, quantified by the scaling factors α_k and the root-mean-square intensity (RMSI_k) of non-SS signals in each spectrum.

To test the accuracy of SESCA_bayes, we estimated the SS composition of the above 13 CD spectra using the same DS-dT basis set with three SS classes (α-helix, β-strand, and Other) that was used to generate the synthetic spectra. The obtained Bayesian estimates for the test set are summarized in Table 2. This table includes the mean and SD (in parentheses) of SS fractions of the sampled posterior distributions, as well as the total SS deviation from the reference SS compositions, computed according to

$$\Delta \mathrm{SS}_k = \frac{1}{2} \sum_{i=1}^{F} \left| C^{\mathrm{est}}_{ki} - C^{\mathrm{ref}}_{ki} \right|, \tag{4}$$

where C^est_ki are the estimated SS fractions, C^ref_ki are the reference SS fractions listed in Table 1, and the sum runs over the F SS classes.

The obtained SS fractions show a fairly consistent uncertainty of 0.03 to 0.06. As expected, 35 of 39 SS fractions are within two SD of their reference value, with no significant difference in accuracy between synthetic and measured CD spectra, or globular and disordered protein models. In addition, the calculated total SS deviations (ΔSS_k) from the reference structures range between 0.03 and 0.12, and ten of thirteen values are also smaller than the estimated uncertainty of the estimation (two SD) that was calculated from the individual SD of SS fractions (σ_ki) according to

$$\sigma_k = \frac{1}{2} \sqrt{\sum_{i=1}^{F} \sigma_{ki}^2}. \tag{5}$$
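As a worked example of Eqs. (4) and (5), the sketch below recomputes the total SS deviation and its uncertainty for Synth-1 from the rounded values in Tables 1 and 2. The function names are hypothetical. Because the inputs are rounded to two decimals, the results (about 0.065 and 0.044) agree with the tabulated 0.06 (0.04) only to within rounding.

```python
import numpy as np

def total_ss_deviation(c_est, c_ref):
    """Eq. (4): half the summed absolute difference between two SS compositions."""
    return 0.5 * np.sum(np.abs(np.asarray(c_est) - np.asarray(c_ref)))

def total_ss_uncertainty(sd):
    """Eq. (5): half the root of the summed squared SDs of the SS fractions."""
    return 0.5 * np.sqrt(np.sum(np.asarray(sd) ** 2))

# Synth-1: estimated fractions and SDs (Table 2), reference fractions (Table 1)
c_est = [0.14, 0.44, 0.43]
c_ref = [0.11, 0.40, 0.49]
sd = [0.05, 0.06, 0.04]

deviation = total_ss_deviation(c_est, c_ref)
uncertainty = total_ss_uncertainty(sd)
print(deviation, uncertainty)
```

The factor of one half in Eq. (4) compensates for the fact that every misassigned residue is counted twice, once in the class that lost it and once in the class that gained it.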

Table 2
Bayesian secondary structure estimates. The table lists the index and name of the model protein, the estimated fraction of its amino acids assigned to SS classes α-helix, β-strand, and Other, as well as the total SS deviation ΔSS_k from the reference SS compositions shown in Table 1. The uncertainty (standard deviation) of each SS fraction and deviation is given in parentheses. Estimates that are more than 2 SD away from their reference value are highlighted in red.

| k | protein | α-helix | β-strand | Other | ΔSS_k |
|---|---------|---------|----------|-------|-------|
| 1 | Synth-1 | 0.14 (0.05) | 0.44 (0.06) | 0.43 (0.04) | 0.06 (0.04) |
| 2 | Synth-2 | 0.49 (0.06) | 0.19 (0.06) | 0.32 (0.06) | 0.08 (0.05) |
| 3 | Synth-3 | 0.45 (0.06) | 0.12 (0.05) | 0.43 (0.05) | 0.03 (0.04) |
| 4 | Synth-4 | 0.22 (0.04) | 0.21 (0.05) | 0.57 (0.04) | 0.09 (0.04) |
| 5 | Synth-5 | 0.03 (0.04) | 0.26 (0.03) | 0.71 (0.05) | 0.07 (0.04) |
| 6 | Synth-6 | 0.36 (0.05) | 0.32 (0.04) | 0.31 (0.06) | 0.08 (0.04) |
| 7 | Synth-7 | 0.15 (0.04) | 0.01 (0.04) | 0.84 (0.05) | 0.05 (0.04) |
| 8 | Synth-8 | 0.03 (0.04) | 0.12 (0.04) | 0.85 (0.05) | 0.07 (0.04) |
| 9 | Lysm | 0.38 (0.05) | 0.04 (0.05) | 0.57 (0.05) | 0.05 (0.04) |
| 10 | Dqd-2 | 0.48 (0.06) | 0.06 (0.05) | 0.47 (0.05) | 0.12 (0.05) |
| 11 | Jacalin | 0.01 (0.04) | 0.31 (0.06) | 0.68 (0.06) | 0.03 (0.05) |
| 12 | Sub-C | 0.26 (0.05) | 0.13 (0.04) | 0.61 (0.04) | 0.04 (0.04) |
| 13 | MeV-Nt | 0.07 (0.05) | 0.13 (0.04) | 0.80 (0.05) | 0.10 (0.04) |

5.2. Comparison to deconvolution

Next, we compare the accuracy and precision of the Bayesian estimates to that of SS estimates obtained through spectrum deconvolution. To this aim, we estimated SS compositions with the deconvolution module of SESCA (SESCA_deconv) for the same 13 CD spectra (Table 1), using the same DS-dT basis spectrum set. The deconvolution was carried out using the most accurate protocol (method D2) tested previously [6]. This method constrains the basis spectrum coefficients to positive values, but normalizes them to unity only after the search for the best approximation. The SS compositions obtained using SESCA_deconv are listed in Table 3, along with the total SS deviations from reference SS compositions (found in Table 1). The total SS deviation of deconvolution estimates (ΔSS_k) ranges from 0.0 to 0.29. The mean SS deviation for the whole set (0.07) is similar to that of the Bayesian estimates (0.07), but shows a significantly larger scatter (0.09 vs. 0.04). All three CD spectra with larger than average SS deviations (k = 3, 4, 10) have large non-SS contributions (2.0-2.9 kMRE), which is in line with our previous findings that non-SS contributions may be detrimental to the accuracy of deconvolution methods.

Although the SESCA_deconv module does not provide information on the uncertainty of individual SS fractions, many SESCA basis sets (including DS-dT) include a calibration curve to estimate the expected total SS deviation if the true SS composition is unknown. This curve was computed from 4·10^5 synthetic spectrum-structure combinations, which were binned according to their estimated non-SS contributions (RMSD_j), to provide an expected mean and SD of SS deviations for a given (rescaled) RMSD. Comparing the true SS deviations of the deconvolution results with their estimated values shows that these estimates correctly describe the precision of the deconvolution method: 8 of 13 ΔSS_k values are within 1 SD of the estimated total deviation, and all 13 fall within 2 SD. However, the average uncertainty of the deconvolution (0.09) is again considerably larger than that of the Bayesian SS estimates (0.04), and it increases with increasing non-SS contributions.

In summary, Bayesian SS estimation and spectrum deconvolution provide SS estimates that, in most cases, have a similar accuracy. However, Bayesian SS estimates are considerably more precise when significant non-SS contributions are present in the measured spectrum. Further, the Bayesian approach provides uncertainties for each individual SS fraction as well as for the optimal scaling factor of the measured CD spectrum, which is an additional advantage of the new method.

Table 3
Secondary structure estimates based on spectrum deconvolution. The table lists the index and name of the model protein, the estimated fraction of its amino acids assigned to SS classes α-helix, β-strand, and Other, as well as the total SS deviation ΔSS_k from the reference SS compositions shown in Table 1. The values in parentheses after ΔSS_k show the mean and SD of the estimated total SS deviation computed from the rescaled RMSD between the measured (generated) CD spectrum and the predicted spectrum of the SS estimate.

| k | protein | α-helix | β-strand | Other | ΔSS_k |
|---|---------|---------|----------|-------|-------|
| 1 | Synth-1 | 0.11 | 0.40 | 0.49 | 0.00 (0.00±0.02) |
| 2 | Synth-2 | 0.41 | 0.20 | 0.39 | 0.00 (0.05±0.03) |
| 3 | Synth-3 | 0.72 | 0.03 | 0.24 | 0.29 (0.16±0.09) |
| 4 | Synth-4 | 0.19 | 0.22 | 0.59 | 0.12 (0.08±0.05) |
| 5 | Synth-5 | 0.01 | 0.33 | 0.66 | 0.01 (0.06±0.04) |
| 6 | Synth-6 | 0.31 | 0.31 | 0.37 | 0.08 (0.14±0.09) |
| 7 | Synth-7 | 0.14 | 0.00 | 0.86 | 0.03 (0.13±0.08) |
| 8 | Synth-8 | 0.02 | 0.03 | 0.95 | 0.06 (0.07±0.04) |
| 9 | Lysm | 0.34 | 0.06 | 0.60 | 0.03 (0.07±0.05) |
| 10 | Dqd-2 | 0.51 | 0.04 | 0.45 | 0.14 (0.09±0.06) |
| 11 | Jacalin | 0.00 | 0.35 | 0.65 | 0.07 (0.20±0.09) |
| 12 | Sub-C | 0.25 | 0.13 | 0.62 | 0.05 (0.07±0.05) |
| 13 | MeV-Nt | 0.09 | 0.10 | 0.81 | 0.08 (0.07±0.04) |

5.3.Examplespectrumanalysis

To further investigate the differences between the two methods, we analyzed the SS estimates for the CD spectrum with the largest difference between the deconvolution and Bayesian SS estimates. Fig. 4A shows the obtained posterior SS probability distribution for synthetic spectrum 3, which contains larger than average non-SS contributions (2.02 kMRE). The heatmap shown in Fig. 4A illustrates that the most likely SS compositions are indeed clustered around the SS composition the synthetic spectrum was created from (shown as a red cross), with the highest posterior probability regions (shown in dark green) located in the immediate (ΔSS_k < 0.05) vicinity of the correct SS composition. However, the SS composition determined by deconvolution (purple cross) has a much higher α-helix content and it is not in a high-probability region in the Bayesian SS estimation.

To examine why the two algorithms evaluate the proposed SS compositions differently, in Fig. 4B we computed the predicted CD signals of the two estimated SS compositions, rescaled them, and compared them to the synthetic spectrum, as is done during the deconvolution process. The figure shows that with the proper scaling factor both SS compositions approximate the synthetic spectrum well, but the deconvolution estimate (purple dashed line, RMSD_j: 1.31 kMRE) fits slightly better than the Bayesian estimate (blue dashed line, RMSD_j: 1.71 kMRE).
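The deconvolution-style evaluation described above can be illustrated with a minimal Python sketch that rescales a predicted CD signal to a measured spectrum and reports the remaining RMSD. The least-squares scaling factor is an assumption made here for illustration and is not necessarily the fitting procedure used in SESCA_deconv; the function names and the toy intensities are hypothetical and are not taken from the SESCA code or from the spectra in this study.

```python
import math


def fit_scale(target, signal):
    """Least-squares factor a that minimizes sum((target - a * signal)**2).

    Assumption: a plain least-squares fit stands in for the actual
    rescaling procedure, which is not specified in this section.
    """
    numerator = sum(t * s for t, s in zip(target, signal))
    denominator = sum(s * s for s in signal)
    return numerator / denominator


def rmsd(a, b):
    """Root-mean-square deviation between two equally sampled spectra."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def deconvolution_style_score(measured, predicted):
    """Rescale the PREDICTED signal to the measured spectrum; quality is RMSD only."""
    alpha = fit_scale(measured, predicted)
    rescaled = [alpha * p for p in predicted]
    return alpha, rmsd(measured, rescaled)


# Toy intensities (arbitrary units, hypothetical values)
measured = [2.0, -4.0, 6.0, 1.0]
predicted = [1.0, -2.0, 3.0, 0.0]
alpha, fit_rmsd = deconvolution_style_score(measured, predicted)
```

In this scheme the scaling factor is a free parameter: any value is accepted as long as it lowers the RMSD, and the size of the factor itself does not enter the quality measure.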

In contrast, the Bayesian SS estimation rescales the synthetic CD spectrum to match the predicted spectra, and evaluates the likelihood of the SS compositions based on the joint probability of their non-SS contributions RMSD_j and their scaling factors α_j, as shown in Fig. 4C. Although the two estimates have a comparable RMSD in this method as well, the deconvolution estimate requires a scaling factor (α_j: 1.99) to achieve a good agreement that is shown to be very unlikely according to the joint-probability map in Fig. 2A. Comparing the two estimated SS signals (dashed lines) to the SS signal of the true SS composition (in red) illustrates how considering scaling factors improves the precision of SESCA_bayes. In this case, eliminating SS compositions with unlikely scaling factors from the sampled distribution allowed a fairly accurate (RMSD: 0.99 kMRE) approximation of the true SS signal.
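The effect of including the scaling factor in the evaluation can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `toy_log_joint` is a made-up placeholder density (Gaussian in ln α, exponential in RMSD) and does not reproduce the joint-probability map of Fig. 2A, the least-squares rescaling is likewise an assumption, and all names and numbers are hypothetical.

```python
import math


def fit_scale(target, signal):
    """Least-squares factor a that minimizes sum((target - a * signal)**2)."""
    numerator = sum(t * s for t, s in zip(target, signal))
    denominator = sum(s * s for s in signal)
    return numerator / denominator


def rmsd(a, b):
    """Root-mean-square deviation between two equally sampled spectra."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def toy_log_joint(rmsd_value, alpha, sigma_ln_alpha=0.2, rmsd_scale=1.0):
    """Placeholder log-density, NOT the empirical map used by SESCA_bayes.

    Assumption: scaling factors far from 1 and large RMSDs are both penalized.
    """
    if alpha <= 0.0:
        return float("-inf")
    return -0.5 * (math.log(alpha) / sigma_ln_alpha) ** 2 - rmsd_value / rmsd_scale


def bayes_style_score(measured, predicted, log_joint=toy_log_joint):
    """Rescale the MEASURED spectrum to the prediction; score RMSD and alpha jointly."""
    alpha = fit_scale(predicted, measured)
    rescaled = [alpha * m for m in measured]
    residual = rmsd(rescaled, predicted)
    return alpha, residual, log_joint(residual, alpha)


# Toy data (hypothetical): candidate A needs almost no rescaling,
# candidate B fits perfectly but only after doubling the measured spectrum.
measured = [1.0, -2.0, 3.0, -1.0]
candidate_a = [1.1, -1.9, 2.8, -1.2]
candidate_b = [2.0, -4.0, 6.0, -2.0]

alpha_a, rmsd_a, score_a = bayes_style_score(measured, candidate_a)
alpha_b, rmsd_b, score_b = bayes_style_score(measured, candidate_b)
```

With these toy numbers, candidate B has the lower RMSD but needs a scaling factor of 2, so under the placeholder density it receives the lower joint score. This mirrors, qualitatively only, why a composition that fits well after an unlikely rescaling can be disfavored in the Bayesian evaluation.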


G. Nagy and H. Grubmuller Computer Physics Communications 266 (2021) 108022

Fig. 4. SS estimation for a synthetic CD spectrum. The figure compares the true SS composition (shown in red) with SS compositions obtained from Bayesian SS estimation (in blue) and spectrum deconvolution (in purple). A) shows the posterior probability distribution of sampled SS compositions in a heatmap representation and indicates the true SS composition and the deconvolution SS estimate as crosses. The SS compositions, estimated scaling factors, and SS deviations are also listed in a tabulated format on the top. The difference in how the two estimates are evaluated by B) the deconvolution and C) the Bayesian SS estimation is also shown. During deconvolution, the predicted CD signal of SS estimates is rescaled to match the measured CD spectrum, and the measure of quality is solely the RMSD. In the Bayesian approach, the measured spectrum is rescaled to match the predicted SS signals, and both the RMSDs and the scaling factors are used to determine the most likely SS composition.

License

SESCA is freely available under the GNU General Public License version 3 (GPLv3).

CRediT authorship contribution statement

G.N. designed and performed the computational analysis, and implemented code improvements. H.G. is the corresponding author, and assisted with the conceptualization. Both authors contributed to writing the manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Code availability

The new SESCA implementation based on this study is available at: https://www.mpibpc.mpg.de/sesca.

Acknowledgements

The authors would like to thank K. Blom and P. Kellers for their aid in editing the manuscript, as well as S. Longhi and S. Piana for providing the measured CD spectrum and molecular dynamics ensemble of MeV-Nt, respectively.

Funding: This research project was funded and supported by the Alexander von Humboldt Foundation and the Max Planck Society.

References

[1] G.D. Fasman (Ed.), Circular Dichroism and the Conformational Analysis of Biomolecules,SpringerUS,Boston,MA,1996,http://link.springer.com/10.1007/ 978-1-4757-2508-7.

[2]N.J.Greenfield,Anal.Biochem.235 (1)(1996)1–10.

[3] S.M. Kelly, T.J. Jess, N.C. Price, Biochim. Biophys. Acta, Proteins Proteomics 1751 (2)(2005)119–139,https://doi.org/10.1016/j.bbapap.2005.06.005,http://

linkinghub.elsevier.com/retrieve/pii/S1570963905001792.

[4]S.Brahms,J.Brahms,J.Mol.Biol.138(1980)147–178.

[5] G.Nagy,M.Igaev,N.C.Jones,S.V.Hoffmann,H.Grubmüller,J.Chem.Theory Comput.(Aug.2019),https://doi.org/10.1021/acs.jctc.9b00203,http://pubs.acs. org/doi/10.1021/acs.jctc.9b00203.

[6] G. Nagy, H. Grubmüller, Eur. Biophys. J. 49 (6) (2020) 497–510, https://

doi.org/10.1007/s00249-020-01457-6,http://link.springer.com/10.1007/s00249- 020-01457-6.

[7]A.Gelman,J.B.Carlin,BayesianDataAnalysis,3rdedition,TextsinStatistical Science,ChapmanandHall/CRC,2014.

[8] J.G. Lees, A.J. Miles, F. Wien, B.A. Wallace, Bioinformatics 22 (16) (2006) 1955–1962, https://doi.org/10.1093/bioinformatics/btl327, http://

bioinformatics.oxfordjournals.org/cgi/doi/10.1093/bioinformatics/btl327.

[9]P. Robustelli, S. Piana, D.E. Shaw, Proc. Natl. Acad. Sci. 115 (21) (2018) E4758–E4766.

[10] J.D.Hunter, Comput. Sci. Eng. 9 (3) (2007) 90–95, https://doi.org/10.1109/ MCSE.2007.55.

[11] S. Longhi, V. Receveur-Bréchot, D. Karlin, K. Johansson, H. Dar- bon, D. Bhella, R. Yeo, S. Finet, B. Canard, J. Biol. Chem. 278 (20) (2003) 18638–18648, https://doi.org/10.1074/jbc.M300518200, https://

linkinghub.elsevier.com/retrieve/pii/S0021925819549730.
