A genetic evolutionary mechanism that produces mucin function

 Last Update: 2022-09-09
 Source: Internet
 Author: User

Summary

How new gene functions evolve is a fundamental question.

Brief introduction

Independent, parallel evolution leading to similar genetic variation has been discussed as a common driver of convergent responses to adaptive pressures (1).

Recent studies have shown that mucin genes are grouped by function rather than by shared evolutionary ancestry and may be particularly prone to convergent evolution (8, 9).

Most functionally similar genes arise through duplication of a common ancestral gene (17).

Results and discussion

Multiple instances of de novo mucin evolution at the SCPP locus

To establish a basis for studying the evolution of mucins, we constructed a simple but conservative bioinformatics approach that identifies putative mucin genes in a given genome by searching the available gene annotations, and confirms mucin properties by verifying the presence of exonic repeats rich in proline (P), threonine (T), and serine (S).

Figure 1 New and previously known mucin genes

The phylogenetic development on the left describes the relationship between


We found that four of the ferret-specific mucins are located in the secretory calcium-binding phosphoprotein (SCPP) locus (flanked in humans by CSN1S1 at the 5′ end and by another gene at the 3′ end).

Orphan mucin genes in the SCPP locus evolved independently

The evolution of genes within the SCPP locus has been discussed in the context of calcium-binding proteins, which are important for the mineralization of bones and teeth and are major protein components of milk and saliva (19).

Next, we asked whether the putative mucin genes we identified encode proteins with functional mucin properties (Figure 2A).

Figure 2 Validation of mucin function

(A) Simplified model


Muc10 as an example of de novo mucin evolution

The identification of multiple new mucin genes in the SCPP locus provides a unique opportunity to ask whether these genes evolved through neofunctionalization after gene duplication (17), as de novo genes from noncoding sequences (23–25), or through other mechanisms (Figure 3A).

Figure 3 A proline-rich protein has evolved into a salivary mucin

(A) We considered several plausible evolutionary mechanisms: gene duplication, the evolution of coding sequence from noncoding regions that contain repeats, and the acquisition of repeats by an existing protein


To distinguish among the candidate models of mucin evolution (Figure 3A), we first asked whether Muc10 is the product of a recent duplication event.

We first tested this hypothesis by manually comparing the peptide sequences of human PROL1 and of mouse and rat MUC10 (encoded by the Prol1 gene). The comparison showed that these proteins have about 60% and 33% homology at the 5′ and 3′ ends, respectively, with the former corresponding to the signal peptide (Figure 3B).
Homology does not extend to the middle region of the MUC10 protein, which contains at least nine repeats, each 39 base pairs (bp) (13 amino acids) long and about 85% identical to one another.
These exonic repeats are not present in any primate PROL1 protein (Figure 3B).
Further analysis showed that these repeats are rich in T and S amino acids (Figure 2B). To ensure the validity of these observations, we amplified and sequenced the repeat region of Prol1 from mouse samples (C57BL/6J strain).

The sequenced repeat region is identical to the mouse reference genome sequence (sequence file S1), but no homologous sequence exists in nonrodent genomes, which further supports the conclusion that these repeats were gained in the ancestor of mice and rats.

Along with the gain of mucin function, the tissue expression pattern of the orthologous PROL1 and Muc10 genes also changed substantially.
In particular, PROL1 is expressed mainly in the human lacrimal gland and is almost absent from other tissues.
In contrast, MUC10 in mice and rats is abundantly expressed in saliva (31) and shows almost no expression in the lacrimal glands (32).
It appears that the regulation of Muc10 in the ancestor of mice and rats evolved to produce strong salivary gland-specific expression.

To explain the expression pattern of Muc10 in mice, we considered two scenarios (Figure 3C).
First, it is plausible that the regulation of Muc10, derived from its PROL1 precursor, took over that of MUC7 after MUC7 was lost in the mouse and rat lineage.
In this case, we expect the expression of human MUC7 and mouse Muc10 to be similar.
Second, it is possible that the regulation of Muc10 evolved independently in the mouse and rat lineage, resulting in an expression pattern different from that of human MUC7. To distinguish between these two scenarios, we performed immunohistochemical staining of MUC10 and MUC7 in mouse and human salivary gland tissues (Figure 3D).
Consistent with previous studies (31), we found that MUC10 is expressed only in the submandibular glands in mice, while in humans MUC7 is expressed in both the submandibular and sublingual glands.
In addition, while MUC10 is expressed in all cell types in the mouse submandibular gland, MUC7 is expressed by specific cell populations in the gland.
Overall, at the tissue and cell level, the expression patterns of MUC10 and MUC7 differ, suggesting that the regulatory mechanism of Muc10 is likely to have evolved independently in the mouse lineage.

Lineage-specific mucin evolved from proline-rich precursors

Based on the transition from Prol1 to Muc10 in the rodent lineage, we hypothesized that other new mucins may also have evolved from proline-rich proteins.
Specifically, we were interested in three genes, namely PROL1 (recently renamed OPRPN), SMR3A (previously PROL5), and SMR3B (previously PROL3). They are adjacent to one another in the SCPP locus and may share a common ancestry.
To test whether these genes are precursors of new mucins, we searched for sequence homology between these three proteins and the 28 newly identified lineage-specific mucins.
We found at least five lineage-specific mucins that show sequence similarity to proline-rich nonmucin proteins in closely related species (Figure 4, Figure S4, and Table S1).

We also found that these mucins retain the signal peptides of their precursors (60% to 84% amino acid homology) but evolved TS-rich repeats in a lineage-specific manner (Figure 4).
For example, as in mice and rats, rhinoceros PROL1 has significantly higher T and S content than PROL1 in other species (Wilcoxon test; P < 0.002198) (Figure 2B).
However, PROL1 in rhinoceros and MUC10 in the mouse and rat lineage show little sequence homology in their repeats, suggesting that the T and S richness of these proteins is unlikely to be identical by descent.
The emergence of new gene functions is often considered a rare phenomenon.
It is therefore noteworthy that in two distant mammalian lineages, rhinoceros and mouse, evolution has produced a new mucin gene from the same ancestral gene, PROL1. These observations are consistent with an evolutionary scenario in which the ancestral secreted proline-rich protein PROL1 independently acquired mucin function in two different lineages, rather than through neofunctionalization after gene duplication or de novo gene evolution from noncoding sequences.

Figure 4 Evolution of lineage-specific mucins from proline-rich proteins (mucinization).

Three examples of lineage-specific mucinization events.
Branches of the phylogeny on which mucinization may have occurred are indicated by red dots.
Homologous regions are shown in the alignments together with their BLAST e-values.
The proposed mechanism by which a nonmucin precursor protein (top) gives rise to its homologous mucin (bottom) is shown schematically for three examples, namely rhinoceros, cat, and cattle (marked by stars on the phylogeny).
Exonic repeats are represented by small boxes.
The number of repeats and the number of nucleotides per repeat are shown in bold below the corresponding repeat region.
The intensity of blue indicates approximate PTS richness.


Our observations offer several avenues for future research.
For example, we found two new mucins in the pangolin genome, corresponding to PROL1 and SMR3A/B, that are enriched in T- and S-rich exonic repeats (Figure S4).

This is an interesting observation, as these lineage-specific mucins may have contributed to the unusual stickiness of pangolin saliva, a property that was most likely selected to suit the animal's insectivorous habits (33).
Thus, our findings suggest that the recurrent evolution of mucin genes follows the mechanism we outlined for the evolution of MUC10 in the mouse and rat lineage, in which T- and S-rich exonic repeats are gained by a secreted proline-rich protein (Figure 3A).
In conclusion, we propose that the presence of proline-rich secreted proteins in the SCPP locus promotes the evolution of mucins.

Rapid evolution of mucin exon repeat sequences

In our previous analysis of MUC7 in mammals, we found that its exonic repeats retain their T and S content but differ greatly in copy number within and between species (28).
Our results for MUC7 contrast with other exonic repeats in the genome, which occur in more than 10% of all protein-coding genes and are generally highly conserved at both the nucleotide and copy number levels (34, 35).
Based on these results, we hypothesized that mucin exonic repeats vary in copy number in response to various selective pressures, including dietary and pathogenic changes, that modulate the overall glycosylation of mucins.
If this hypothesis is true, we expect to observe substantial copy number variation in mucin repeats between species, while the T and S content of individual repeats remains unchanged over time.

We first investigated the copy number variation of mucin repeats among mammals (Figure 4 and Table S1).
We found that the number of mucin repeats varies substantially, from 3 in Muc19-like in seals to 42 in Muc2-like/Smr3a in carnivores, independent of repeat length (Figure S5) or of the mechanism of copy number change (Figure S6).
In addition, we found several examples in which the copy number of particular repeats evolved in a species-specific way.
For example, a maximum likelihood tree of the individual Muc10 repeats in mice and rats separates the repeats of each species into distinct clusters with high confidence (Figure S6).
This finding suggests that exonic repeat copy numbers expanded independently in the mouse and rat lineages.
We have previously reported lineage-specific gains and losses of repeat copies in primate MUC7 (28).
Overall, the copy number variation of mucin exonic repeats that we observed is consistent with the adaptive hypothesis described above.

Next, we examined our second expectation, namely that the T and S content of mucin exonic repeats remains unchanged over the course of evolution.
We focused on Muc10 in rodents and the Muc2-like mucin in felines, for which a reliable alignment of individual repeat units is possible.
By counting synonymous and nonsynonymous nucleotide differences between repeat units, we observed that nonsynonymous changes affecting T and S amino acids occurred less frequently than expected from the number of synonymous changes (R² < 0.15; Figure S7).
This finding suggests that the T and S content of the repeats has been maintained and does not follow neutral expectations.
For amino acids other than T and S, we observed the expected neutral ratio of nonsynonymous differences (R² > 0.65; Figure S7).
Overall, as exemplified by the exonic repeats of MUC7 (28), Muc10, and Muc2-like, mucin repeats adaptively retain their T and S content, indicating that lineage-specific mucins evolve under selective constraint to retain O-glycosylation.
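The comparison of repeat units described above can be illustrated with a minimal sketch. It assumes two repeat units that are already aligned, in frame, and of equal length, and it simply classifies each differing codon as synonymous, nonsynonymous involving T or S, or nonsynonymous involving other amino acids. This is a simplified codon-by-codon count for illustration only, not the procedure behind Figure S7; the function name and output keys are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_differences(unit_a, unit_b):
    """Count codon differences between two aligned, in-frame repeat units.

    Hypothetical helper: differing codons are counted as synonymous,
    nonsynonymous involving T or S, or nonsynonymous involving other residues.
    """
    if len(unit_a) != len(unit_b) or len(unit_a) % 3 != 0:
        raise ValueError("units must be aligned, in frame, and of equal length")
    counts = {"synonymous": 0, "nonsynonymous_TS": 0, "nonsynonymous_other": 0}
    for i in range(0, len(unit_a), 3):
        codon_a = unit_a[i:i + 3].upper()
        codon_b = unit_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
        if aa_a == aa_b:
            counts["synonymous"] += 1
        elif aa_a in "TS" or aa_b in "TS":
            counts["nonsynonymous_TS"] += 1
        else:
            counts["nonsynonymous_other"] += 1
    return counts
```

Summing these counts over all pairs of repeat units within a gene would give the kind of synonymous versus nonsynonymous tallies that the comparison above relies on.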

Lineage-specific mucins contribute to variation in the mammalian salivary glycoproteome

Previous studies of mucins, mainly in humans, have classified mucins as membrane-bound or secreted (36, 37).
Given that the SCPP gene family consists primarily of genes encoding secreted proteins, we hypothesized that lineage-specific mucins that evolved in this locus would also be secreted.
We tested this hypothesis bioinformatically and found that all new lineage-specific mucins were predicted to be secreted (see Materials and Methods; Table S1).

In addition, we did not find transmembrane domains in any of the lineage-specific mucins, which supports the view that they are secreted proteins.

We confirmed previous work (26) showing that the SCPP mucins MUC7 and MUC10 are abundantly and specifically expressed in the salivary glands of humans and mice, respectively (Figure 3D).
Therefore, we investigated whether other lineage-specific mucins are also expressed in the salivary glands.
With the exception of MUC7 in humans and MUC10 in mice, immunohistochemistry or Western blot analysis of lineage-specific mucins is difficult because validated antibodies are not commercially available.
However, despite the limited cross-species expression data from salivary glands, we were able to detect salivary gland expression of some lineage-specific mucins, including a bat mucin, a cow mucin-like protein, and the new pangolin gene_9802, using available RNA sequencing (RNA-seq) data (Figures S4 and S8).
To further investigate the expression of mucin genes in saliva, we performed liquid chromatography-mass spectrometry (LC-MS) analysis of the whole saliva of humans, mice, rats, pigs, cattle, dogs, and ferrets (see Materials and Methods; Figure 5A).
In addition to mucins known to be expressed in saliva, such as MUC5B, MUC7, MUC19, and MUC10, we also detected several known mucins not previously reported in saliva, such as MUC4, MUC21, MUC13, MUC2, and MUC16 (Figure 5A).
In addition, we found that eight lineage-specific mucins are secreted in the saliva of dogs, ferrets, and cows (Figure 5, A and B, and Table S2).

Figure 5 Comparison of salivary mucins in different mammals.

(A) Liquid chromatography-mass spectrometry analysis of whole-saliva mucins in different mammals.
Mucins not previously known to be expressed in saliva are shown as dark blue boxes.
Lineage-specific mucins identified in this study are shown as magenta boxes.
A gray box indicates a mucin with previously known expression in saliva, while a light gray box indicates that the gene is present in the genome of the species but no expression was detected in saliva.
An empty box indicates that the species lacks the corresponding gene.
Gene annotations are those provided by the respective assemblies.
Longer gene names, marked with an asterisk, are abbreviated (PROGLY: proteoglycan-like; MUC2: MUC2-like; MUC5AC: MUC5AC-like; MUCC.1: MUCC.1-like).

(B) Graphical representation of the data in (A), showing the total number of mucins expressed in the whole saliva (WS) of humans, mice, rats, dogs, ferrets, cattle, and pigs (magenta rectangle).
Lineage-specific mucins found within the SCPP locus are indicated by black borders.
(C) Whole saliva of the above mammals was separated by SDS-PAGE, and glycosylated proteins were visualized with periodic acid-Schiff staining.
Gel bands analyzed by LC-MS are circled.
A gray circle indicates a band in which no mucin was identified.
The vertical banner below each gel lane lists the identified mucins, which correspond to the bands in the gel.
Magenta highlights indicate lineage-specific orphan mucins, while blue highlights indicate known mucins that had not previously been identified in saliva.
MW, molecular weight.


To verify experimentally whether the retention of T and S amino acids observed at the sequence level in lineage-specific mucins translates into protein glycosylation, we separated salivary proteins by tris-acetate-based SDS-polyacrylamide gel electrophoresis (SDS-PAGE), followed by periodic acid-Schiff (PAS) staining, which reveals glycosylated proteins (see Materials and Methods; Figure 5C) (27, 29).
By comparing the electrophoretic band patterns of salivary proteins in pigs, cattle, ferrets, dogs, rats, mice, and humans, we detected a high degree of diversity in glycosylated protein bands among the species studied.
To confirm at the amino acid sequence level that strongly stained bands represent mucins, we excised PAS-stained bands individually and analyzed them by mass spectrometry (see Materials and Methods; Figure 5C).
We were able to confirm the abundant expression of most mucins identified by LC-MS in saliva (Figure 5C and Table S2).
Among the lineage-specific mucins, in addition to MUC7 and MUC10, we identified SMR3A in the saliva of dogs and ferrets, the proteoglycan-like protein in dog saliva, and MUC5AC-like in ferret saliva, all of which were bioinformatically predicted to be glycosylated.

An unexpected but interesting result of the SDS-PAGE analysis was the high degree of variation in the glycosylated protein content of mammalian saliva samples.
Our current method is limited in its ability to distinguish mucins from other glycoproteins.
Therefore, linking glycoprotein variation among mammals to mucins remains a hypothesis that requires further research, perhaps using recently available methods of mucin purification (38).
Nonetheless, previous studies have shown that, within our SDS-PAGE size range, the major glycosylated proteins in human saliva that stain most intensely with PAS are MUC5B and MUC7 (27, 39, 40).
Our findings therefore provide evidence that at least some of the observed differences are driven by mucins.
For example, ferret saliva produces at least four times as many glycosylated bands as human saliva (Figure 5B).
This is consistent with our finding that, among the species we surveyed, ferrets had the largest number of lineage-specific mucins (Figure 1).
In addition to lineage-specific mucins, we found that multiple mucin genes with homologs in almost all mammals are expressed in a species-specific manner in ferret saliva.
These observations in ferrets provide another line of evidence that the high diversity of mucins in mammalian saliva evolved both by acquiring new mucin genes and by repurposing existing mucins for expression and secretion in saliva (Figure 5B).

Establishment of a model of mucin evolution

We documented multiple instances of independent evolution of mucin function in different mammals and showed that most of these newly discovered mucins are located within the SCPP locus.
It is unusual that the recurrent evolution of this gene function at a particular locus does not occur through duplication of the entire gene.
Therefore, we constructed a model of mucin evolution (Figure 6) in which nonmucin genes encoding proline-rich secreted proteins act as the building blocks of new mucins.
This hypothesis makes biological sense because proline-rich proteins are structurally (rigid because of the abundance of proline) and functionally (secreted) similar to mucins.
They differ from mucins only in lacking the T- and S-rich exonic repeats that are the main targets of O-glycosylation.
These genes therefore have the potential to acquire mucin function rapidly through the gain of exonic repeats.
Our study provides an initial and conservative map with an emphasis on the SCPP locus.
We conducted a parallel analysis of the recently available, biochemically guided "mucinome" database and came to similar conclusions, while identifying other candidates for lineage-specific mucin formation (Figure S9).
A more thorough effort is therefore needed to extend this analysis to other species and loci.

Figure 6 Evolutionary assembly line of mucinization.

The top chromosome shows a hypothetical locus of secreted-protein genes, in which the overall regulatory architecture leads to expression in glands and secretory tissues.
In the case of the SCPP locus, in addition to being expressed in glandular tissue, these genes encode proline-rich secreted proteins.
The next step is the gain of repeats encoding serine- and threonine-rich peptides (gray and blue boxes).
Then, the existing post-translational modification machinery attaches O-glycans to the newly formed TS-rich repeats.
Finally, the new gene function is maintained in the population, provided that it leads to an environmental adaptation, such as pathogen clearance or, in the unique case of pangolins, increased saliva viscosity to suit their specific dietary niche, i.e., trapping ants with long, sticky tongues.


Our proposed model of mucin evolution has three broader implications.
First, it points to exon duplication as a major driver of rapid evolution and functional diversity (41).
Second, it reveals proline-rich proteins as precursors of mucins.
Third, it argues that glycosylation is a possible force in the adaptive evolution of mammals (42).
Our model is consistent with the growing recognition of recurrence, convergence, and reversal as common themes of molecular evolution (43).

Beyond these mechanistic insights, our findings raise a question: What adaptive forces lead to the retention of new mucin genes? One clue comes from the expression of these mucins in saliva.
In humans, mucin function in saliva is associated with pathogen binding, mucus layer formation, facilitation of digestion, and the viscosity and lubricity of saliva.
It is therefore reasonable to suggest that the new mucins may have beneficial effects on immunity, diet, and the mechanical properties of saliva.
Previous work, including our own, has shown that O-glycans on mucins interact with pathogens (39).
Secreted mucins are thought to act as decoys (21) that saturate pathogen receptors in secretions, thus preventing pathogens from binding to the tissue surface.
They can also "tame" pathogenic behaviors, promoting more symbiotic interactions between microbes and host organisms (44, 45).
The overall density, size, structure, and spatial distribution of mucin O-glycans determine the range of interactions with pathogens (39, 46), so that individual mucins may evolve to target specific microorganisms (47).
For example, sialic acid residues, as terminal components of mucin O-glycans, provide molecular motifs for the recognition of specific pathogens (48, 49). These motifs often change in an evolutionary arms race (49, 50).
Thus, lineage-specific mucins may bind to, or be bound by, specific pathogens in a lineage-specific manner, and changes in the copy number of their exonic repeats can fine-tune glycosylation, which may help keep up with changing pathogenic pressures.

The evolution of mucins may also be related to the digestion and perception of different foods by different species.
Mucins in saliva can interact directly with food, altering how it is perceived (51, 52).
In addition, mucins can interact with, and may alter, the microbial composition of the gastrointestinal tract (53) and thus affect digestion (54).
It has been suggested that oral and gut microbes compete in their interactions with gastrointestinal mucins (55).
Thus, because of selective pressures shaped jointly by diet and the gastrointestinal microbiota, some mucins may be adaptively maintained in a particular lineage.
Mucins also play a key role in determining the physical properties of body fluids and their function in forming tissue barriers.
Therefore, an exciting area of future research will be to study the relationship between the salivary expression of new mucins and the physical properties of saliva, such as viscosity, lubricity, and spinnbarkeit (56).

In summary, our study establishes a mechanism by which the shared functional and structural properties of a gene cluster promote the recurrent emergence of mucin function in otherwise evolutionarily unrelated genes.
Our findings provide mechanistic insights into the de novo formation of mucins and into how this generates diversity in the mucin repertoire.
We also open up avenues for future work to characterize the function, formation mechanisms, and adaptive effects of mucins and, at a broader level, to study the evolution of new gene functions.

Materials and methods

Preliminary identification of candidate mucins

Gene and protein annotations were downloaded from the National Center for Biotechnology Information (NCBI) genome database at ftp://ftp.ncbi.nih.gov/genomes/. Putative mucins were extracted from this dataset by searching for the keywords "muc", "mucin", "mucin like", and "mucin domain containing" (accessed May 26, 2021).
Each of the species queried (human, mouse, cow, and ferret) contained some putative mucin genes that are not annotated in the mucin database at www.medkem.gu.se/mucinbiology/databases/ (accessed May 26, 2021).
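The keyword search described above can be sketched as follows. This is only an illustration: the record layout (a dictionary with `symbol` and `product` fields) and the function name are assumptions, since the format in which the annotations were parsed is not specified.

```python
# Keywords listed in the text for the initial annotation search.
KEYWORDS = ("muc", "mucin", "mucin like", "mucin domain containing")


def find_candidate_mucins(annotations, keywords=KEYWORDS):
    """Return annotation records whose symbol or product name contains a keyword.

    `annotations` is assumed to be an iterable of dicts with hypothetical
    'symbol' and 'product' keys; matching is case-insensitive and treats
    hyphens as spaces so that "mucin-like" matches "mucin like".
    """
    hits = []
    for record in annotations:
        text = f"{record['symbol']} {record['product']}".lower().replace("-", " ")
        if any(keyword in text for keyword in keywords):
            hits.append(record)
    return hits
```

Because "muc" is a substring of the other keywords, the search is deliberately permissive at this stage; the later BLAST and repeat analyses are what narrow the list.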

BLAST search for homologous sequences

Once we had a list of candidate mucin genes from the keyword search above, we used NCBI BLAST to determine the presence or absence of each candidate mucin in the reference genomes of human, mouse, cow, and ferret.
This step allowed us to verify annotations and to distinguish lineage-specific genes from genes with homologs in other species.
Briefly, protein sequences were downloaded from UniProt and NCBI.
These sequences were searched against each species using BLASTp (nonredundant protein sequences).
The BLAST algorithm (57) scoring parameters were as follows: matrix, BLOSUM62; gap costs, existence 11 and extension 1; compositional adjustments, conditional compositional score matrix adjustment, as described elsewhere (58).
BLAST hits were evaluated based on max score, total score, query coverage (>30%), e-value (<0.01), and percent identity (>20%).
Next, we identified the gene annotation in the region of the corresponding reference genome with the highest homology to the candidate protein sequence.
In addition, we used the NCBI and UCSC Genome Browsers to compare the genomic locations of these putative genes relative to other known mucin genes to determine syntenic locations.
We note, with regard to Figure 1, that our pipeline is conservative and relies on the accuracy of gene annotation and the quality of the assembly.
We believe that while our main observations will hold, further validation is needed to construct a final map of mammalian mucin content.
For example, tandem repeat sequences are particularly difficult to assemble and may therefore be missing from some reference genomes.
The recently released human assembly from the T2T Consortium (59), arguably the most accurate mammalian reference genome, revealed two additional mucins in the human genome, MUC3B and MUC22-like.
These are not included in our dataset.
It is therefore clear that future assemblies of other mammals based on long-read sequencing will compensate for these shortcomings and expand our understanding of mucins.
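The hit-evaluation thresholds stated above (query coverage >30%, e-value <0.01, percent identity >20%) can be expressed as a simple filter. The sketch below assumes the BLASTp results have already been parsed into records; the `BlastHit` class and its field names are hypothetical, and the ranking by max score and total score is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Hypothetical container for one parsed BLASTp hit."""
    subject_id: str
    query_coverage: float    # percent of the query covered by the hit
    evalue: float
    percent_identity: float


def passes_thresholds(hit, min_coverage=30.0, max_evalue=0.01, min_identity=20.0):
    """Apply the coverage, e-value, and identity cutoffs given in the text."""
    return (
        hit.query_coverage > min_coverage
        and hit.evalue < max_evalue
        and hit.percent_identity > min_identity
    )


def filter_hits(hits):
    """Keep only the hits that satisfy all three cutoffs."""
    return [hit for hit in hits if passes_thresholds(hit)]
```

A candidate with no surviving hits in a given species would then be treated as absent from that species, subject to the assembly and annotation caveats noted above.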

Study of mucin properties

We set up a two-pronged pipeline to confirm the mucin properties of these putative mucin candidates.
An important feature of mucins is a domain of repeated sequences within the open reading frame (8).
In our pipeline, we used Tandem Repeats Finder to search for repeat sequences in the candidate mucins of all four of our mammalian query species (60).
The algorithm identifies repeat motifs in a given sequence.
One problem is that the motif can be difficult to define (for example, a single tandem repeat array can contain multiple repeat motifs; e.g., Figure S6).
For consistency, we reported all motifs (tandemly repeated ≥3 times) using the longest motif unit in our analysis.

Next, we located domains rich in proline, threonine, and serine, an important feature of mucins.
We used a Perl script called PTSpred (61).
PTSpred slides a window (50 to 200 amino acids) along a given protein sequence and calculates the percentage of proline, threonine, and serine residues within the window.
We used the recommended thresholds to identify PTS domains.
New (lineage-specific) mucins were identified by requiring all of the following features: predicted O-glycosylation sites making up more than 4% of each peptide, a TS content greater than 20% in the peptide sequence, repeat sequences contained within the gene, and proline-, threonine-, and serine-rich amino acid sequences clustered within the exonic repeats.
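The sliding-window calculation and the combined criteria described above can be sketched in a few lines. This is not the PTSpred script itself: the function names, the default window of 50 residues, and the way the four criteria are passed in are assumptions made for illustration, and PTSpred's recommended thresholds are not reproduced.

```python
def pts_window_percentages(protein, window=50, residues="PTS"):
    """Percentage of P, T, and S residues in every window along the protein."""
    protein = protein.upper()
    if len(protein) < window:
        return []
    count = sum(1 for aa in protein[:window] if aa in residues)
    percentages = [100.0 * count / window]
    for i in range(window, len(protein)):
        # Slide by one residue: add the incoming residue, drop the outgoing one.
        count += (protein[i] in residues) - (protein[i - window] in residues)
        percentages.append(100.0 * count / window)
    return percentages


def ts_percent(protein):
    """Overall percentage of T and S residues in the protein."""
    protein = protein.upper()
    if not protein:
        return 0.0
    return 100.0 * sum(1 for aa in protein if aa in "TS") / len(protein)


def meets_mucin_criteria(oglyc_site_percent, ts_pct, has_genic_repeats, pts_in_repeats):
    """Combine the four requirements listed in the text (hypothetical helper)."""
    return (
        oglyc_site_percent > 4.0
        and ts_pct > 20.0
        and has_genic_repeats
        and pts_in_repeats
    )
```

For instance, `max(pts_window_percentages(sequence))` gives the PTS content of the richest window, which could then be compared with whichever threshold is chosen.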

Determination of the secretory potential of proteins

To identify signal peptides in the protein sequences, we used SignalP 5.0 (62), available at www.cbs.dtu.dk/services/SignalP/, with standard parameters.
In addition, we searched for known mucin domains [such as von Willebrand factor-like, epidermal growth factor-like, and sea urchin sperm protein, enterokinase, and agrin (SEA) domains (8)] using Pfam 32.0 (https://pfam.xfam.org/) (63).
The algorithm uses multiple sequence alignments and hidden Markov models to predict these regions.
We also used TMHMM to look for transmembrane helices in the new mucins (www.cbs.dtu.dk/services/TMHMM/) (64).
In addition, to determine the likelihood that the new mucins are secreted, we used the SRTpred server (65), available at https://webs.iitd.edu.in/raghava/srtpred/home.html. In short, this tool uses machine learning algorithms to measure the secretion potential of proteins, with positive values indicating secretion.
We also validated these results with OutCyte (available at www.outcyte.com/) (66), which likewise uses machine learning to estimate secretion potential.
Specifically, a score of 0.5 or higher indicates likely secretion.
Table S1 reports the results for SRTpred and OutCyte.

Determination of protein O-glycosylation potential

We predicted O-glycosylation sites with SPRINT-Gly (available at https://doi.org/10.1093/bioinformatics/btz215) (67).
This deep neural network method predicts the likelihood that a T or S residue will be O-glycosylated based on the amino acid sequence in each given window.
Briefly, the algorithm scans the T and S residues in each protein sequence and generates a window containing the 4 upstream and 4 downstream amino acids around each identified T or S residue.
It then assigns a probability of O-glycosylation based on this window and on previously confirmed O-glycosylated peptides in humans and mice.
To further support the SPRINT-Gly predictions of potential O-glycosylation sites, we used NetOGlyc 4.0 (available at www.cbs.dtu.dk/services/NetOGlyc/) (68), which estimates potential O-glycosylation across mammalian species and was trained on O-glycosylation experiments in human cell lines.
The results of the two algorithms are consistent.
However, we found that SPRINT-Gly provides a more stringent prediction of O-glycosylation, so we chose to use the results of this more conservative algorithm in the figures.
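The windowing step described above, in which each T or S residue is placed at the center of a window with 4 residues on either side, can be illustrated as follows. This sketch only generates the windows; it does not reproduce the SPRINT-Gly model or its scoring, and the padding of windows at the sequence ends with "X" is our assumption, since the text does not say how terminal residues are handled.

```python
def ts_windows(protein, flank=4, pad="X"):
    """Return (position, window) pairs for every T or S residue.

    Each window holds `flank` residues upstream and downstream of the
    central T or S. Positions are 0-based. Padding with `pad` at the
    sequence ends is an assumption made for this illustration.
    """
    protein = protein.upper()
    padded = pad * flank + protein + pad * flank
    windows = []
    for pos, aa in enumerate(protein):
        if aa in "TS":
            # In the padded string, the residue sits at index pos + flank.
            windows.append((pos, padded[pos:pos + 2 * flank + 1]))
    return windows
```

Each window would then be passed to a predictor that returns a probability of O-glycosylation for the central residue.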

Identification of additional lineage-specific mucins and their possible congeners

As described in the main text, we identified a region of 250 to 300 kb (depending on species) in the SCPP locus, defined by the CSN3 and AMTN genes, as a hotspot for lineage-specific mucins. We then expanded our search for lineage-specific genes within this locus to other mammals (49 mammals in total). Specifically, we identified the gene annotations in this hotspot region and downloaded the corresponding protein sequences. We then classified the genes by running these protein sequences through our mucin detection pipeline, including the determination of exonic repeats and of the O-glycosylation potential of these repeats, as described above. Next, we used BLAST, with the same parameters as in the initial screening above, to search for homologous sequences of each candidate mucin in other mammalian species. This process allowed us to identify 28 lineage-specific mucins, as listed in Table S1.
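As a rough sketch of how the homolog search results could be screened, the snippet below filters BLAST tabular output (the standard 12-column `-outfmt 6` layout) by e value. The cutoff `max_evalue`, the function name, and the example rows are assumptions for illustration; the actual search parameters are those of the initial screening and are not restated here.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def significant_hits(tabular_text, max_evalue=1e-3):
    """Return (query, subject, e value) for hits at or below max_evalue.

    max_evalue is a hypothetical cutoff, not the authors' parameter.
    """
    reader = csv.DictReader(
        io.StringIO(tabular_text), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    hits = [
        (row["qseqid"], row["sseqid"], float(row["evalue"]))
        for row in reader
        if float(row["evalue"]) <= max_evalue
    ]
    # Most significant hit (smallest e value) first.
    return sorted(hits, key=lambda hit: hit[2])


# Made-up rows for illustration only; these are not results from the study.
example_rows = [
    ["candidate_1", "subject_A", "62.1", "29", "11", "0",
     "1", "29", "1", "29", "6e-08", "48.5"],
    ["candidate_1", "subject_B", "41.0", "30", "17", "1",
     "1", "30", "4", "32", "0.8", "22.3"],
]
example_text = "\n".join("\t".join(row) for row in example_rows)
print(significant_hits(example_text))
```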

Identifying precursors of lineage-specific mucins

We wanted to test the hypothesis that at least some lineage-specific mucins evolved from existing genes that lacked TS-rich repeat sequences, in the way that MUC10 evolved from a proline-rich ancestral protein (Figure 3). To do this, we combined gene annotations, BLAST searches, and RNA-seq maps to thoroughly search for precursor protein sequences of the 28 lineage-specific mucins in mammals. Notably, every precursor we identified was a proline-rich protein. Because of the repetitive nature of lineage-specific proteins, this search was not straightforward. First, repetitive content increases the uncertainty of BLAST similarity searches, thereby reducing statistical power. Second, because PTS-rich repeats are so abundant, false-positive BLAST hits are possible. Therefore, to avoid including repeats in the initial BLAST search, we used the first 30 amino acids, which roughly correspond to the signal peptide of secreted proteins. Next, we manually compared lineage-specific mucins with their putative ancestral homologs to identify specific regions of sequence similarity, as shown in Figure 4. We describe our search for each lineage-specific protein, and the proline-rich precursors we identified, in detail below. Overall, our pipeline is conservative, and other lineage-specific mucins may also have proline-rich precursors that we did not detect in this study.
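The query construction described above, in which only the first 30 amino acids are used so that the repeats are excluded, can be sketched as below. The function name, the FASTA header format, and the placeholder sequence are assumptions; only the 30-residue length comes from the text.

```python
def n_terminal_query(name, protein_sequence, length=30):
    """Build a FASTA record holding the first `length` residues.

    Whitespace and line breaks in the input are removed first, so a
    multi-line sequence pasted from a FASTA file can be used directly.
    """
    sequence = "".join(protein_sequence.split()).upper()
    return f">{name}_first{length}\n{sequence[:length]}\n"


# Placeholder sequence, not a real protein: a short non-repetitive
# N terminus followed by a made-up PTS-rich repeat.
placeholder = "MKLLAVFLLSLWAGCQAETPVDRRGPPLPQ" + "TTSAPTTS" * 5
print(n_terminal_query("candidate_mucin", placeholder), end="")
```

The resulting record can be submitted as a protein BLAST query; the downstream repeat region is deliberately left out of the search.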

Carnivore MUC2-like mucin

To determine the ancestral origin of the carnivore lineage-specific mucin (annotated as MUC2 in cats, but as SMR3A in ferrets and dogs; Figure S2, line 7), we used BLAST to search the first 30 amino acids of the MUC2-like protein sequence in cat (domestic cat, felCat9) against human (taxid: 9606, hg38). We started with a search against the human genome because gene annotation and protein sequence accuracy are best for humans, and there may be unknown biases in other species. We found significant hits to the SMR3A and SMR3B genes (e = 6 × 10^{-8}). We then manually compared human SMR3A and SMR3B with the cat MUC2-like protein sequence and found that SMR3A had two highly similar regions, whereas SMR3B had only one. We then used BLAST again to verify these individual trimmed regions (see sequence file S1 for the alignments and Figure 4 for the e values; e < 10^{-30}). Incidentally, the new assembly released during the revision now annotates this cat gene as SMR3A.

Even-toed ungulate mucin

We were able to trace the one lineage-specific mucin found in even-toed ungulates (cattle, sheep, camels, alpacas, and antelopes; Figure S2, line 1) to the ancestral proline-rich SMR3B protein. Similar to the pipeline above, we first used BLAST to search the first 30 amino acids of this lineage-specific mucin against human and found a significant hit to the SMR3B gene (e = 0.001). We then narrowed our search to an outgroup lineage, the odd-toed ungulates (taxid: 9787). The most significant hit was SMR3B in donkey (e = 3 × 10^{-12}). We verified that donkey SMR3B does not contain repeats. Next, we manually aligned the cow MUC2 and donkey SMR3B sequences and retrieved BLAST e values for the individual sections, which are reported in Figure 4 and sequence file S1.

Rodent MUC10

We found that the first 30 amino acids of the protein produced a BLAST hit to human PROL1 (e = 0.046). Following the previous examples, we compared the amino acid sequences in mice and humans and used BLAST searches to identify regions of similarity and assess their significance. We found that doing the same with mice produced lower e values, which are now reported in Figure 3B. Notably, gene annotations have led to confusion about the evolutionary origin of these genes. For example, consistent with our results, the latest gene annotation of the mouse reference genome refers to MUC10 as PROL1. However, the latest human gene annotation update refers to PROL1 in humans as OPRPN.

Rhino PROL1

When we used BLAST to search the first 30 amino acids of rhino PROL1 against human, we did not find any significant hits. Instead, relying on the gene annotation in the reference genome, we compared rhino PROL1 with human PROL1 (now OPRPN). We found multiple well-aligned sections, which we interrogated in detail using BLAST, and found that some of them produced highly significant hits (e < 10^{-6}). We report these in Figure 4 and sequence file S1.

Sequence amplification and validation

Mouse Prol1/Muc10 genomic sequences were amplified by polymerase chain reaction (PCR) and Sanger sequenced using standard methods. The primer sequences and sequencing results can be found in sequence file S1. Our sequenced region and the mouse (mm10) reference genome did not differ in the number of repeats or in nucleotide sequence.

Phylogenetic and synonymous and non-synonymous site analysis

Lineage-specific mucin sequences found in rodents (Muc10) and cats (MUC2-like) were downloaded from NCBI. The repeats contained in the repeat domain were manually curated in TextWrangler and aligned with CLUSTALW (69) in MEGA (70). A maximum likelihood phylogenetic tree was constructed using 100 bootstrap replicates. The repeat sequences were then analyzed with MEGA's pairwise distance calculation to determine changes at synonymous and nonsynonymous sites within and between rodents and felines.

RNA-seq data mining

The RNA-seq data used to construct Figure S8 were taken from the RNA-seq exon coverage track in the NCBI Genome Data Viewer (www.ncbi.nlm.nih.gov/genome/gdv/). This database contains comprehensive RNA-seq data from a variety of tissues and species. To determine whether a gene has observable tissue expression, we used a "housekeeping" gene, PSMB2, which is known to be expressed in all tissues of all placental mammals (71). If a gene is expressed within an order of magnitude of PSMB2, we consider the gene "expressed" in that tissue.
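The expression criterion above can be written as a simple rule. The sketch below is one interpretation, assuming that "within an order of magnitude" means a coverage level of at least one tenth of the PSMB2 level in the same tissue, so that genes expressed more strongly than PSMB2 also count as expressed. The function name, the `fold` parameter, and the example numbers are hypothetical.

```python
def is_expressed(gene_level, psmb2_level, fold=10.0):
    """Call a gene expressed if it reaches at least 1/fold of PSMB2.

    Both levels are coverage values from the same tissue and species.
    The one-sided reading of the order-of-magnitude rule is an
    assumption made for this sketch.
    """
    if gene_level <= 0 or psmb2_level <= 0:
        # Without detectable coverage for either gene, no call is made.
        return False
    return gene_level >= psmb2_level / fold


# Made-up coverage values, for illustration only.
print(is_expressed(gene_level=5, psmb2_level=40))
print(is_expressed(gene_level=3, psmb2_level=40))
```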

Saliva collection

Saliva samples were collected from individual humans, mice, rats, pigs, cows, dogs, and ferrets and stored at -80 °C. Human subjects: Human saliva was collected by passive drooling according to a protocol approved by the University at Buffalo Human Subjects Institutional Review Board (IRB) (Study No. 030-505616). Informed consent was obtained from all human participants. Samples from other mammals were collected in collaboration with colleagues and other research institutions. For a more detailed description of the collection methods used for the different mammalian species, see (iii).

SDS-PAGE isolation of salivary proteins and PAS staining of glycosylated components

Samples were denatured under reducing conditions: 4× tris-acetate buffer (NuPAGE, Invitrogen, Carlsbad, CA) and 2.5% β-mercaptoethanol (by sample volume) were added, and the samples were boiled in water for 10 min. Equal amounts of total protein (15 μg per lane) were separated by SDS-PAGE using 3 to 8% gradient tris-acetate mini gels (NuPAGE, Invitrogen, Carlsbad, CA). Glycosylated protein bands were visualized by PAS staining as previously described (40). Stained gels were imaged in transparency mode using a flatbed scanner (ImageScanner III, GE Healthcare).

Saliva sample preparation for mass spectrometry

Saliva samples were prepared using surfactant-assisted precipitation/on-pellet digestion (71). Briefly, 50 μg of protein was taken from each saliva sample, and SDS was added to a final concentration of 0.5%. Samples were sequentially reduced with 10 mM dithiothreitol (DTT) at 56 °C for 30 min and alkylated with 25 mM iodoacetamide (IAM) at 37 °C for 30 min, both in a covered thermomixer (Eppendorf). Six volumes of chilled acetone were then added to the sample with vigorous vortexing, and the mixture was incubated at -20 °C for 3 h. After centrifugation at 18,000 g for 30 min at 4 °C, the samples were decanted, and the pelleted protein was gently washed with 500 μl of methanol. After 1 min of air drying, 40 μl of 50 mM tris-formic acid (tris-FA; pH 8.4) was added to the pellet, followed by a total volume of 10 μl of trypsin [0.25 μg/μl, dissolved in 50 mM tris-FA (pH 8.4)] for tryptic digestion at 37 °C for 6 h with continuous shaking. Digestion was stopped by adding 0.5 μl of FA, and the digest was centrifuged at 18,000 g for 30 min at 4 °C. The supernatant was carefully transferred to LC vials for analysis.
.

Excision of protein gel bands and preparation for mass spectrometry

Excised gel band samples were prepared using in-gel digestion. Gel bands were first cut into smaller cubes (1 to 2 mm per side) with a clean scalpel and then transferred to new Eppendorf tubes. Gel cubes were dehydrated by incubation in 500 μl of acetonitrile (ACN) for 5 min with continuous rotation, after which the liquid was discarded (all dehydration steps below followed the same procedure unless otherwise specified). After overnight incubation in 500 μl of 50% ACN in 50 mM tris-FA (pH 8.4), the gel cubes were dehydrated three times and held in a thermomixer at 37 °C for 5 min to completely evaporate the remaining ACN. Samples were sequentially reduced with 100 μl of 10 mM DTT at 56 °C for 30 min and alkylated with 100 μl of 25 mM IAM at 37 °C for 30 min, both with continuous shaking in a covered thermomixer. The gel cubes were then dehydrated three times and incubated in 200 μl of trypsin (0.0125 μg/μl in tris-FA) on ice for 30 min. Excess trypsin was then removed and replaced with 200 μl of tris-FA, and the samples were incubated overnight at 37 °C with continuous shaking. Digestion was stopped by adding 20 μl of 5% FA and incubating for 15 min with constant vortexing, and the liquid was then transferred to a new tube. The gel cubes were dehydrated sequentially with 500 μl of 50% ACN in 50 mM tris-FA and with 500 μl of ACN for 15 min each, and the liquid from the three steps was combined. The protein digest was dried in a SpeedVac and reconstituted in 50 μl of 1% ACN and 0.05% trifluoroacetic acid in ddH2O with gentle vortexing for 10 min. Samples were centrifuged at 18,000 g at 4 °C for 30 min, and the supernatant was carefully transferred to LC vials for analysis.
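As a quick arithmetic check of the trypsin amounts stated in the two digestion protocols above, the sketch below computes the enzyme mass from volume and concentration and, for the in-solution digest, the substrate-to-enzyme ratio relative to the 50 μg of protein used per sample. The function names are hypothetical; no ratio is computed for the in-gel digest because the protein amount per excised band is not stated.

```python
def trypsin_mass_ug(volume_ul, concentration_ug_per_ul):
    """Mass of trypsin (in micrograms) delivered by a given volume."""
    return volume_ul * concentration_ug_per_ul


def substrate_to_enzyme_ratio(protein_ug, trypsin_ug):
    """Protein-to-trypsin mass ratio (e.g. 20.0 means 20:1 by mass)."""
    return protein_ug / trypsin_ug


# On-pellet digestion: 10 ul of trypsin at 0.25 ug/ul for 50 ug of protein.
in_solution_trypsin = trypsin_mass_ug(10, 0.25)
in_solution_ratio = substrate_to_enzyme_ratio(50, in_solution_trypsin)

# In-gel digestion: 200 ul of trypsin at 0.0125 ug/ul per sample.
in_gel_trypsin = trypsin_mass_ug(200, 0.0125)

print(f"in-solution trypsin: {in_solution_trypsin:.2f} ug")
print(f"in-solution protein:trypsin ratio: {in_solution_ratio:.0f}:1")
print(f"in-gel trypsin: {in_gel_trypsin:.2f} ug")
```

Under these numbers, both protocols deliver the same mass of trypsin per sample, although excess trypsin solution is removed before the overnight in-gel incubation.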

LC-MS analysis

The LC-MS system consisted of a Dionex UltiMate 3000 nano LC system, a Dionex UltiMate 3000 micro LC system with a WPS-3000 autosampler, and an Orbitrap Fusion Lumos mass spectrometer. A large inner diameter (i.d.) trapping column (300 μm i.d. × 5 mm) was installed before the nano LC column (75 μm i.d. × 65 cm, packed with 2.5 μm Xselect CSH C18 material) for large-volume sample loading, cleanup, and delivery. For each sample, 4 μl of derived peptides was injected for LC-MS analysis. Mobile phases A and B were 0.1% FA in 2% ACN and 0.1% FA in 88% ACN, respectively. The 180-min LC gradient profile was as follows: 4% B for 3 min, 4 to 11% B for 5 min, 11 to 32% B for 117 min, 32 to 50% B for 10 min, 50 to 97% B for 1 min, 97% B for 17 min, and then equilibration at 4% B for 27 min. The mass spectrometer was operated in data-dependent acquisition mode with a maximum duty cycle of 3 s. MS1 spectra were acquired with the Orbitrap over a mass/charge ratio (m/z) range of 400 to 1500 at 120k resolution. Automatic gain control and maximum injection time were set to 175% and 50 ms, respectively, and dynamic exclusion was set to 60 s with a ±10 ppm tolerance. Precursor ions were isolated with the quadrupole using a 1.2-Th m/z window and fragmented by higher-energy collisional dissociation at 30% energy. MS2 spectra were acquired in the ion trap at the rapid scan rate with a maximum injection time of 35 ms. Detailed LC-MS settings and related information are described in a previous publication by Shen et al. (72).
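The step durations listed for the LC gradient can be checked against the stated 180-min run time with a short calculation. The step labels below are paraphrased from the gradient description, and reading the first and last steps as holds at 4% B is an assumption.

```python
# (label, duration in minutes), transcribed from the gradient description.
GRADIENT_STEPS = [
    ("hold at 4% B", 3),
    ("4 to 11% B", 5),
    ("11 to 32% B", 117),
    ("32 to 50% B", 10),
    ("50 to 97% B", 1),
    ("hold at 97% B", 17),
    ("re-equilibrate at 4% B", 27),
]


def cumulative_end_times(steps):
    """Return (label, elapsed minutes at the end of the step) per step."""
    elapsed = 0
    end_times = []
    for label, minutes in steps:
        elapsed += minutes
        end_times.append((label, elapsed))
    return end_times


total_minutes = sum(minutes for _, minutes in GRADIENT_STEPS)
for label, end in cumulative_end_times(GRADIENT_STEPS):
    print(f"{label}: ends at {end} min")
print("total:", total_minutes, "min")
```

The durations sum to 180 min, matching the stated gradient length, with the main separation (11 to 32% B) ending at 125 min.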

LC-MS files were searched against the UniProt protein sequence databases and the putative mucin sequences of the corresponding species predicted in this study (sequence file S1) (Swiss-Prot: Homo sapiens and Mus musculus; Swiss-Prot + TrEMBL: Rattus norvegicus, Bos taurus, Canis lupus familiaris, Mustela putorius furo, and Sus scrofa) using Sequest HT embedded in Proteome Discoverer 1.4 (Thermo Fisher Scientific). To estimate and control the false discovery rate (FDR), a target-decoy search approach with a concatenated database of forward and reverse protein sequences was applied. Search parameters included (i) precursor ion mass tolerance: 20 ppm; (ii) product ion mass tolerance: 0.8 Da; (iii) maximum number of missed cleavages per peptide: 2; (iv) fixed modification: cysteine carbamidomethylation; and (v) dynamic modifications: methionine oxidation and peptide N-terminal acetylation. Peptide filtering, protein inference and grouping, and FDR control were performed in Scaffold v5.0.0.0 (Proteome Software Inc.). Protein identification criteria included 1% protein/peptide FDR and ≥2 peptides per protein. Lists of proteins containing relative protein abundance (spectral counts) and sequence coverage were exported from Scaffold and manually curated in R using custom scripts. The parameters described here, including the 0.8-Da mass tolerance for MS2, have been routinely used in the field [see, e.g., (73)]. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (74) via the PRIDE partner repository with the dataset identifier pxd03197.
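To illustrate the two identification criteria above, the sketch below shows a simplified target-decoy FDR estimate (decoy matches divided by target matches above a score cutoff) and a filter that keeps proteins supported by at least two distinct peptides. This is a generic textbook version and not the procedure implemented in Scaffold or Proteome Discoverer; the function names, score values, and protein identifiers are made up.

```python
from collections import defaultdict


def estimate_fdr(psms, score_cutoff):
    """Estimate FDR as decoys / targets among matches at or above the cutoff.

    psms is an iterable of (score, is_decoy) pairs, where is_decoy marks
    matches to reversed sequences. Simplified estimate for illustration.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in psms:
        if score >= score_cutoff:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    return decoys / targets if targets else 0.0


def proteins_with_min_peptides(peptide_hits, min_peptides=2):
    """Keep proteins identified by at least min_peptides distinct peptides.

    peptide_hits is an iterable of (protein_id, peptide_sequence) pairs.
    """
    peptides_by_protein = defaultdict(set)
    for protein_id, peptide in peptide_hits:
        peptides_by_protein[protein_id].add(peptide)
    return sorted(
        protein_id
        for protein_id, peptides in peptides_by_protein.items()
        if len(peptides) >= min_peptides
    )


# Made-up example data, for illustration only.
example_psms = [(5.0, False), (4.0, False), (3.5, True), (3.0, False), (1.0, True)]
example_hits = [("P1", "AAK"), ("P1", "CCR"), ("P2", "DDK"), ("P2", "DDK")]
print(estimate_fdr(example_psms, score_cutoff=3.0))
print(proteins_with_min_peptides(example_hits))
```

In practice, the score cutoff would be lowered or raised until the estimated FDR meets the chosen threshold (1% in the criteria above).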

Parallel investigation of lineage-specific mucin evolution using the mucinome database

Our pipeline uses the general definition of mucins, namely proteins that contain highly O-glycosylated T- and S-rich repeat sequences, as the starting point for the bioinformatic analysis. Recently, however, a biochemically guided mucin classification (38) was published, providing an alternative starting database for human mucins. We conducted a parallel analysis of the top 50 genes ranked by the "mucin" attribute in this database. Specifically, of these 50 genes, we identified 28 that fit our definition of a mucin in humans (i.e., tandem repeats rich in T and S). All of these genes have previously been identified as having very high levels of O-glycosylation, so we did not perform additional analysis of this property. Of the 28 putative mucin genes, 15 were already included in our previous analysis, including well-described human mucin genes such as MUC5B and MUC2. In addition, based on our definition and the biochemical properties in the mucinome database, we identified 13 genes that were not previously labeled as mucin genes but exhibit all the characteristics of a mucin gene. Furthermore, we found that 6 of these 13 genes have conserved mucin repeat domains across the mammals we studied, whereas 7 may have evolved mucin repeat domains in a lineage-specific manner (Figure S9). These results provide additional candidates for future studies to validate the functional and evolutionary relevance of these putative mucin genes.

Statistics

P values in Figure 2 and Figure S9 were determined using the Wilcoxon test. All other statistical tests performed are mentioned in the relevant methods sections above.

Charts and analytics

All statistical analyses were performed using R. All data graphs and figures were created using R in RStudio, Keynote, and BioRender.

Ethics

Human subjects: Human saliva was collected by passive drooling according to a protocol approved by the University at Buffalo Human Subjects IRB (Study No. 030-505616). Informed consent was obtained from all human participants. Animal experimentation: Samples from other animals were collected in collaboration with colleagues and other research institutions. Samples from all live animals used in this study were collected using minimally invasive methods, such as saliva collection kits or passive drooling. For a description of the sample sources and collection methods, see (iii) in the Acknowledgments section.
