The relationship between protein domains and homopeptides in the Plasmodium falciparum proteome

The proteome of the malaria parasite Plasmodium falciparum is notable for the pervasive occurrence of homopeptides or low-complexity regions (i.e., regions that are made from a small subset of amino-acid residue types). The most prevalent of these are made from residues encoded by adenine/thymidine (AT)-rich codons, in particular asparagine. We examined homopeptide occurrences within protein domains in P. falciparum. Homopeptide enrichments occur for hydrophobic (e.g., valine), or small residues (alanine or glycine) in short spans (<5 residues), but these enrichments disappear for longer lengths. We observe that short asparagine homopeptides (<10 residues long) have a dramatic relative depletion inside protein domains, indicating some selective constraint to keep them from forming. We surmise that this is possibly linked to co-translational protein folding, although there are specific protein domains that are enriched in longer asparagine homopeptides (≥10 residues) indicating a functional linkage for specific poly-asparagine tracts. Top gene ontology functional category enrichments for homopeptides associated with diverse protein domains include “vesicle-mediated transport”, and “DNA-directed 5′-3′ RNA polymerase activity”, with various categories linked to “binding” evidencing significant homopeptide depletions. Also, in general homopeptides are substantially enriched in the parts of protein domains that are near/in IDRs. The implications of these findings are discussed.


INTRODUCTION
Plasmodium falciparum (Pf) is a single-celled protozoan that causes malaria in humans. Malaria causes hundreds of thousands of deaths every year, with~405,000 in 2018 (Global Malaria Programme, 2019). Treatment for malaria is confounded by its ability to adapt quickly to drugs and to the human immune system; its antigenic diversity is a major problem for vaccine development (Ferreira, Da Silva Nunes & Wunderlich, 2004;Ferreira et al., 2003;Freitas-Junior et al., 2000). The complete genome sequence of Pf contains >5,000 protein-coding genes (Gardner et al., 2002). Early analysis indicated that low-complexity regions (LCRs) (i.e., regions that consist mostly of a small subset of amino-acid types) or homopeptides (runs of single amino acids) are a prominent feature of the encoded proteins, with more than half of proteins being low-complexity over most of their sequences (Pizzi & Frontali, 2001). Asparagine-rich regions are the most abundant (An & Harrison, 2016;Pizzi & Frontali, 2001). The LCRs have been postulated to have a function primarily at the nucleotide level (Xue & Forsdyke, 2003). Their abundance depends largely on genomic A+T or G+C content (DePristo, Zilversmit & Hartl, 2006;Xue & Forsdyke, 2003), and they also acquire further low-complexity insertions and deletions according to a power-law rule: that is, longer LCRs acquire longer insertions/ deletions (DePristo, Zilversmit & Hartl, 2006). Pf LCRs can be classified into three distinct types, including a high G+C type that is linked to recombination hotspots (Zilversmit et al., 2010). There is a pattern of enrichment of long intergenic poly(AT) tracts in Plasmodium species, some of which are immediately adjacent to genes and run into them (Russell et al., 2014). Although asparagine is preferred in LCRs of Pf, a different residue type with AT-rich codons (lysine) is more prominent in the CVK group, which is a set of four primate-infecting plasmodia (Chaudhry et al., 2018). As well as being sites of polymorphic variation themselves (Chaudhry et al., 2018), Pf LCRs are linked to increased single-nucleotide polymorphism in their vicinity (Haerty & Golding, 2011).
Although homopeptides are sites of such polymorphic variation, earlier work showed that some homopeptides are deeply conserved across orthologs from bacteria and eukaryotes, suggesting ancient origin and functional essentiality (Faux et al., 2005). In general, homopeptides are more conserved in bacteria, than in archaea and eukaryotes, and there is a correlation between repeat length differences and species divergence (Uthayakumar et al., 2012). Homopeptides increase the functional versatility of proteins, and facilitate spatial organization of proteins in a repeat-dependent way (Chavali et al., 2017). They are also significantly linked to many human diseases (Lobanov et al., 2016).
Low-complexity regions rich in hydrophilic residues are significantly associated with protein intrinsic disorder (Delucchi et al., 2020;Romero et al., 2001). Previous surveys have shown that 10-40% of Pf LCR residues are predicted as intrinsically disordered in tracts ≥40 residues long, and that >60% of sequences have such a tract (Feng et al., 2006;Mohan et al., 2008). Such annotated disordered regions in Pf are significantly depleted of predicted MHC-binding peptides, which has implications for vaccine development, since many vaccine target proteins are intrinsically disordered (Guy et al., 2015).
Low-complexity regions rich in asparagine (and/or glutamine) are common in domains that form prions (i.e., self-propagating amyloid particles) (Harbi & Harrison, 2014a;Harbi et al., 2012;Harrison, 2017;Su & Harrison, 2019). In budding yeast (Saccharomyces cerevisiae), propagation of these particles can be sustained during budding, mating and laboratory protocols (Harbi & Harrison, 2014b). Predicted prions have been detected in all the domains of life (Espinosa Angarica, Ventura & Sancho, 2013), including thousands in viruses and phages (Tetz & Tetz, 2017;, and tens of thousands in bacteria (Harrison, 2019). Pf has prion-like domains (that arise in asparagine-rich LCRs) in 10-24% of its proteins (Singh et al., 2004;An & Harrison, 2016;Pallares et al., 2018). Just like Pf, there are Saccharomycetes fungi that have high proportions of prion-like proteins with poly-asparagine in them (An, Fitzpatrick & Harrison, 2016). There is some evidence that asparagine-rich LCRs act as "tRNA sponges" that slow down the translation rate of proteins to aid in co-translational folding (Filisetti et al., 2013;Frugier et al., 2010). This is a type of parallel function at the DNA/RNA level for which there is increasing evidence for intrinsically disordered regions (IDRs) in proteins (Pancsa & Tompa, 2016).
Here, we investigate the relationship between homopeptides and both defined protein domains and intrinsic disorder in Pf proteins. We observe significant depletions in homopeptides for specific types of amino acid, in particular asparagine and aspartate. Homopeptides are substantially enriched in the parts of protein domains that are near or in IDRs.

Source data
The UniProt (Boeckmann et al., 2003) reference proteome for Plasmodium falciparum strain 3D7 was downloaded from www.uniprot.org in January 2019. Protein domain annotations for Pf were taken from the Pfam database (El-Gebali et al., 2019). For comparative analysis, three further reference proteomes were downloaded from the same source for the following: another P. falciparum strain (FCH/4), P. yoelii (strain 17XNL), and P. vivax (strain Salvador I).

Annotation
Homopeptides were defined as repetitions of one amino-acid type with a minimum length of three residues (Fig. 1).
Proteins were annotated for intrinsic disorder using the DISOPRED3 and IUPRED2a programs (Dosztanyi et al., 2005;Huntley et al., 2015;Ward et al., 2004). IUPRED2a operates on inputted single sequences, and predicts intrinsic disorder by estimating inter-residue interaction energies (Erdos & Dosztanyi, 2020). It was the best performing single-sequence method for intrinsic disorder annotation, with an area-under-curve (AUC) value of 0.83 for the ROC curve in a recent assessment (Nielsen & Mulder, 2019); DISOPRED3 also had a value of AUC = 0.83 in this assessment, and was one of the best performing methods that use evolutionary information as input. Only regions of predicted disorder ≥30 residues long were considered. A 30-residue length cut-off was used since this is a common threshold or boundary value used in characterizing intrinsicallydisordered regions, or in training algorithms for prediction of intrinsic disorder (Atkins et al., 2015). Thus, we used the default "long" parameter choice for the IUPRED2a program. Also, the DISOPRED3 program was run with a 2% expected false positive rate for the algorithm training set, which is in is the author's recommended parameter range (Jones & Cozzetto, 2015). The results from either intrinsic disorder annotation program were considered separately.

Enrichments & depletions
Because we wish to examine the effects of protein domain structure and intrinsic disorder on the occurrence of homopeptides in Pf, we checked whether there is any deviation from random placement for homopeptides within protein domains and annotated intrinsic disorder. These can be either enrichments relative to background populations or depletions. The background populations were either the whole proteome or the set of protein domain annotations as described below. Enrichments and depletions for homopeptides in protein domains were calculated as depicted (Fig. 1). These were determined for individual amino-acid types in homopeptides using Eq. (1) below for hypergeometric probability, for sampling with replacement: with the sample counts given by: k = number of residues in homopeptides in domains of one amino-acid type n = number of residues in homopeptides in domains and the background counts given by the quantities: K = number of residues in homopeptides in the proteome of one amino-acid type N = number of residues in homopeptides in the proteome Figure 1 Schematic of the analysis. Homopeptides were defined as ≥3 consecutive amino acids of the same type in a sequence. Protein domains near intrinsically disordered regions (IDRs) were determined using a 10-residues buffer. Also, if the IDR is within a protein domain or otherwise overlaps it, a 10-residue buffer is considered on either side of the IDR as shown. Enrichments/depletions for the amount of homopeptides in specific protein domain types were also calculated with the sample counts given by: k = number of residues in homopeptides in one domain type n = number of residues in homopeptides in all domain types and the background counts given by the quantities: K = number of residues in one domain type N = number of residues in all domain-types Enrichments and depletions for protein domains overlapping or near annotated IDRs were also calculated with the sample counts given by: k = number of residues in one domain type which near or inside disordered regions n = number of residues in all domain types which near or inside disordered regions and the background counts given by the quantities: K = number of residues in one domain type N = number of residues in all domain-types Proximity to IDRs was determined using a 10-residue buffer at either end of the annotated IDRs (Fig. 1).
Gene ontology (GO) category enrichments were also analyzed for specific types of homopeptide enrichment (Huntley et al., 2015). These were calculated by mapping the protein domains onto GO categories, and re-totalling the numbers of residues per GO category rather than per domain.
All enrichments/depletions were calculated using hypergeometric probability with appropriate Bonferroni corrections for multiple hypothesis testing. For example, the Bonferroni correction for enrichments/depletions of homopeptides of individual amino acids in protein domains was P = 0.05/20 = 0.0025, since there are 20 different amino acids being sampled from the same background population.

Propensity
A propensity for homopeptides of a specific amino-acid type to occur in protein domains (P dom ) was calculated as: The values of k, n, K and N are as listed above just below Eq. (1). This was calculated for the homopeptide threshold ≥3 residues.

Enrichments and depletions of homopeptides in protein domains
Homopeptides are abundant and pervasive in the Pf proteome, yet it is not clear from a structural perspective which homopeptides are more tolerated in protein domains. We analyzed the preferences of homopeptides of each specific amino-acid type for insertion into protein domains, using three different length thresholds for homopeptides ( Table 1). The statistical enrichments/depletions are listed, as well as the fraction of the homopeptide populations for each amino acid, and the propensity (P dom ) of Table 1 Homopeptides enriched/depleted amino-acid types in protein domains sorted by P-values.

Amino
Acid ( *** P dom is the propensity of homopeptides of a specific amino-acid type to occur in protein domains. It is calculated as described in "Methods". homopeptides of each amino acid for protein domains, calculated as described in "Methods". For the minimum homopeptide threshold of ≥3 residues length, the most enriched amino acids include the major aliphatic hydrophobic residues valine, isoleucine and leucine, which is to be expected because of the extensive hydrophobic cores of protein domains. Also, the small residues alanine and glycine exhibit highly significant enrichments. For these amino-acid types, the enrichments are only for short homopeptides (of size 3 or 4 residues), since the enrichments disappear for longer homopeptide thresholds (Table 1). Positively-charged homopeptides and other hydrophilic homopeptides are also generally enriched (lysine, arginine, serine, and threonine), while negatively-charged homopeptides are significantly depleted or show no preferences. Lysine homopeptides are the second most abundant in the proteome and are made from AT-rich codons; their relationships with specific protein domains are discussed below. Some amino acids show an enrichment, with a comparable propensity for structural domains (P dom ) as for other amino acids, but these are not significant. Most strikingly though, short asparagine homopeptides are highly significantly depleted within protein domains.
Histograms of homopeptide length also indicate that within protein domains, homopeptides generally lack the longer homopeptide lengths (≥10 residues) that make up the majority of homopeptides outside of protein domains (Fig. 2).
For the longest homopeptide threshold (≥10 residues length), the number of amino-acid types which are comparatively tolerated in protein domains dramatically decreases to one (asparagine; Table 1). Keeping in mind that the enrichment calculations are effectively based on the comparison of different amino-acid types, there should always be at least one enriched amino-acid type unless there are completely no homopeptides at all at a certain threshold. Poly-asparagine homopeptides are depleted in domains until the threshold is extended to 10, which leaves it as the only one existing in domains. The enrichment observed for longer polyasparagine tracts (≥10 residues, Table 1) arises from a small number of specific protein domains that may have a functional linkage for these polyasparagine tracts (Table S1), for example, for specific protein interactions.
The enrichment/depletion results for individual amino-acid types are little affected by the boundary definition of protein domains (i.e., chopping off 3, 5 or 7 residues from the ends of the domains, Table S3), with just some enrichments for glutamine and glutamate becoming significant for these shortened domains. This indicates that these homopeptides significantly occur near the ends of protein domains.
Since lysine and arginine homopeptides are in general significantly enriched in protein domains, and asparagine and aspartate significantly depleted, we examined which individual protein domains are linked to these trends (Table S1). Despite the substantial general depletion of asparagine homopeptides within protein domains, there are 87 individual domains with significant enrichment of asparagine homopeptides, including the low-copy-number Sin-like region and the SacI homology domain; these also stand out when we restrict the analysis to polyasparagine tracts ≥10 residues long (Table S1). The most prominent lysine homopeptide enrichments are for Rifin and PfEMP DBL domains. Specific domains are also linked to hydrophobic or small-residue homopeptides, such as glycine homopeptides arising for ribosomal proteins (Table S1).

Gene ontology enrichments
We examined the enrichments and depletions of GO functional categories associated with homopeptides in protein domains (Table S2). Some top enrichments of GO functional categories for homopeptides in protein domains include: GO:0003899 (DNA-directed 5′-3′ RNA polymerase activity), an enrichment caused by seven different protein domains, and GO:0042578 (phosphoric ester hydrolase activity), which is unique to the SacI homology domain (which is involved in clathrin-mediated endocytosis; three copies in Pf) ( Table S2). Inspection of other GO category enrichments indicate that they are also caused by diverse protein domains, for example, GO:0016192 (vesicle-mediated transport) which is linked to homopeptide enrichments in 11 different protein domains, pointing to specific functional significance for homopeptides in the interaction of these proteins. Nonetheless, in general "protein binding" is significantly depleted in the list (Table S2), as are the other high-level "RNA-binding" and "GTP-binding" terms. The relationship between specific protein domain homopeptides and intrinsic disorder Intrinsically disordered regions tend to have homopeptides and low-complexity sequences in them (Romero et al., 2001). We surmised that the relationship of different protein domains with homopeptides might be caused by their proximity to or overlap with IDRs of proteins. In general, homopeptides are enriched in the parts of protein domains that are near or in IDRs (Table 2; results for either the DISOPRED3 or IUPred2A program are shown). Also, there is only one individual protein domain type that is significantly depleted in homopeptides near/in IDRs, with the remainder of significant deviations being enrichments ( Table 2).

Comparison of trends in other plasmodia
The trends observed for Pf strain 3D7 were validated by analysis of another Pf strain (FCH/4) that was picked from the UniProt reference proteome list (Boeckmann et al., 2003) (Table S4). There is just one small change with enrichments of glutamate homopeptides in protein domains becoming significant (Table S4). Comparisons were also made with proteomes of P. yoelii, a malaria parasite of rodents, and P. vivax, a member of the CVK group of primate-infecting plasmodia (Chaudhry et al., 2018). P. yoelii has an overall approximately even predominance of N and K homopeptides, and P. vivax has predominance of K homopeptides rather than of N homopeptides (Table S4). The depletion of N homopeptides in protein domains is maintained in P. yoelii, but there is no significant depletion/enrichment in P. vivax. K homopeptides also become significantly depleted within protein domains in P. yoelii, despite their similar overall levels to Pf (29% in P. yoelii vs 23% in Pf). The results for homopeptide enrichments in parts of protein domains overlapping IDRs (Table 2) also remain highly significant for these three other Plasmodia proteomes (P-values~0.0).

Homopeptide trends
Homopeptide enrichments within protein domains, such as for hydrophobic (L, I or V) or small (A or G) residues, disappear at longer lengths (≥5 residues). This indicates a limit to their toleration within protein domain cores, for example, because they are not so easily accommodated in regular secondary structures. Notes: * These P-value results are not affected by chopping off 3, 5 or 7 residues from the ends of the protein domains, as for Table S3. ** Total number of individual protein domains that have enrichment of homopeptides within their parts that are near or overlapping IDRs. *** Total number of individual protein domains that have depletion of homopeptides within their parts that are near or overlapping IDRs.
We observed a substantial significant relative depletion of short asparagine runs (<10 residues long) in protein domains. Plasmodia have acquired great amounts of N homopeptide tracts during evolution, but statistically these have not been appearing or "landing" within domains. The lack of short intra-domain asparagine runs may be because they interfere with protein folding in some way. For example, they may slow down co-translational protein folding due to a lack of asparaginyl-tRNAs, since levels of asparaginyl-tRNAs in Pf are normal despite the high amounts of asparagine in their coding sequences (Filisetti et al., 2013;Frugier et al., 2010). Thus asparagine homopeptides may be "tRNA sponges" that soak up tRNAs and slow down translation and co-translational folding (Filisetti et al., 2013;Frugier et al., 2010). It is possible that homopeptides, and in particular poly-asparagine homopeptides may make protein domains more prone to misfolding. Although generally slower translation is thought to aid in correct co-translational folding (Waudby, Dobson & Christodoulou, 2019), sometimes faster translation is more desirable through segments that are prone to misfolding (O'Brien, Vendruscolo & Dobson, 2014), or for translational efficiency at buried residue sites or sites that are vulnerable to structurally disruptive mutations (Wang et al., 2015;Zhou, Weems & Wilke, 2009). However, experiments with Pf chaperone Hsp110c, have shown that Pf has cellular mechanisms that are designed to prevent aggregation linked to asparagine tracts (Muralidharan & Goldberg, 2013). A few specific protein domains have enrichment of long polyasparagine tracts. Such tracts may have a specific functional role in these proteins, perhaps for protein or nucleic-acid interaction. Another possibility might be that correct folding of these specific domains is not affected by slow rates of translation, thus N homopeptides can arise in them because of their general abundance in the proteome. Interestingly, the significant depletion of N homopeptides is maintained in the rodent malaria pathogen P. yoelii, but there is an absence of significant depletion/ enrichment in the more distantly related P. vivax malaria pathogen from the CVK group. This may indicate that the N homopeptides in the P. vivax protein domains are reduced mainly to those that have functional roles.

Gene ontology enrichments
In the GO enrichments/depletion analysis we see a general trend for depletion of functional categories associated with "binding" (protein binding; RNA binding; GTP binding; ion binding). This suggests that homopeptides may be selected against in structured interaction interfaces, perhaps since they introduce a lack of interaction specificity, or increase the likelihood of off-target binding. Also, the GO results indicate that homopeptide occurrences may be useful information for the discrimination of protein function from the analysis of sequences (Huntley et al., 2014(Huntley et al., , 2015Jiang et al., 2016;Mutowo-Meullenet et al., 2013).

Intrinsic disorder
We surmise that homopeptides in protein domains are enriched in the parts of the domains near or in IDRs because IDRs generally have more tolerance for insertions/ deletions, and are the main determinants of changes in protein length over evolution (Light et al., 2013). Also, our results indicate that the parts of protein domains that can become intrinsically disordered are enriched in homopeptides relative to other domain parts. This may be an important part of encrypting their ability to transition structurally (Narasumani & Harrison, 2015).

CONCLUSIONS
The most pervasive homopeptide in Plasmodium falciparium, poly-asparagine, is substantially depleted within protein domains, whereas other homopeptides that we might expect in the hydrophobic core of domains, such as poly-leucine or -valine or -isoleucine, and other generally abundant homopeptides (such as lysine) are enriched at the shorter homopeptide lengths studied. We hypothesize that generally poly-asparagine formation is repressed inside protein domains because its occurrence may slow co-translational folding (Filisetti et al., 2013;Frugier et al., 2010), which might be problematic within a protein domain that has only partially been translated (Scenarios in which both transient fast and slow folding may be problematic for co-translational protein folding are possible (O'Brien, Vendruscolo & Dobson, 2014)). Further experimental work is needed to investigate these hypotheses. In general, homopeptides are depleted for functional categories associated with diverse types of binding, indicating that they may interfere with specificity in structured interfaces. However, the parts of protein domains that can become intrinsically disordered have homopeptide enrichment relative to other parts of domains. Since some domains fold upon binding to other proteins, and the parts of protein domains that overlap intrinsic disorder have such homopeptide enrichment, these results suggest that protein homopeptides may be useful in effecting such structural transitions.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the Natural Sciences and Engineering Research Council of Canada. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: Natural Sciences and Engineering Research Council of Canada.