The growing use of single-cell multiomics analysis in biological research has advanced the understanding of cellular heterogeneity and cell subpopulations.
In particular, the increasing availability of protocols for cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) has greatly accelerated research in this area.
CITE-seq is a single-cell multiomics technique that simultaneously profiles RNA gene expression and cell surface proteins. It can reveal cellular heterogeneity missed by single-modality single-cell RNA sequencing (scRNA-seq), and is now widely used in biomedical research, especially on immune-related diseases and on other diseases such as influenza and COVID-19.
One challenge in CITE-seq analysis is the need to integrate multiple CITE-seq and scRNA-seq datasets, which increases the information content but also makes the computation harder.
In addition, CITE-seq data is more expensive to generate than scRNA-seq data.
One potential solution is to learn the relationship between RNA and protein from large reference datasets, and then use it to predict protein expression in scRNA-seq data.
Both Seurat 4 and totalVI offer this capability, but they are computationally expensive and have limitations.
Recently, a research team from the University of Pennsylvania published an article entitled "A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation" in Nature Machine Intelligence.
The research team developed a versatile deep learning method, sciPENN, that supports the integration of CITE-seq and scRNA-seq data, protein expression prediction for scRNA-seq, protein expression imputation for CITE-seq, quantification of prediction uncertainty, and transfer of cell type labels from CITE-seq to scRNA-seq.
Comprehensive evaluation across multiple datasets shows that sciPENN outperforms other current methods of its kind.
The article was published in Nature Machine Intelligence.
The model architecture of sciPENN is shown in Figure 1. Its overall goal is to learn from one or more CITE-seq reference datasets.
When the protein panels of the CITE-seq reference datasets do not overlap exactly, sciPENN can impute the missing proteins for each reference dataset.
After learning from the CITE-seq reference data, sciPENN can predict all proteins in the scRNA-seq query dataset and integrate multiple datasets into a common embedding space.
sciPENN can estimate the mean expression of each protein, quantify the uncertainty of the estimate, and optionally transfer cell type labels from the CITE-seq reference data to the scRNA-seq query data.
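One common way to train on reference datasets whose protein panels only partially overlap is to leave unmeasured proteins out of the loss. The sketch below illustrates that general idea only; it is an assumption for illustration, not the sciPENN implementation, and the function name `masked_mse` and the toy values are hypothetical.

```python
def masked_mse(pred, target):
    """Mean squared error over measured entries only.

    `target` uses None for a protein that was not measured in the
    dataset a cell came from (hypothetical encoding for this sketch).
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t is not None:  # skip proteins missing from this panel
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


# Two cells, three proteins; each cell's dataset lacks one protein.
pred = [[1.0, 2.0, 3.0], [0.0, 1.0, 2.0]]
target = [[1.0, None, 5.0], [None, 1.0, 2.0]]
loss = masked_mse(pred, target)
```

Because the missing entries never contribute to the loss, a model trained this way still produces an output for every protein, which is what makes imputation of the unmeasured ones possible.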
Figure 1. An overview of the sciPENN methodology.
As reference data, the research team used the human peripheral blood mononuclear cell (PBMC) dataset reported in the Seurat 4 paper, which contains 161,764 cells and 224 proteins.
For the test set, they used the mucosa-associated lymphoid tissue (MALT) dataset generated by 10x Genomics, which contains 8,412 cells.
Of the 17 proteins in the MALT dataset, 10 overlap with the PBMC dataset.
The research team analyzed these data with sciPENN, Seurat 4, and totalVI (Figure 2).
First, the PBMC CITE-seq reference data and the MALT scRNA-seq query data were co-embedded into a latent space using each method (Figure 2).
Because the PBMC reference and the MALT query data differ so much, none of the three methods could fully mix the two datasets in the latent embedding space, even though each applies its own batch correction strategy.
However, sciPENN integrated the two datasets best, achieving a partial mixing of the two in the latent embedding.
The research team also tested the accuracy of protein expression prediction for the three methods, quantifying it by correlation and root mean square error (RMSE).
The results show that sciPENN achieves the highest prediction accuracy for all proteins.
This high accuracy allows sciPENN to accurately recover protein expression patterns.
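The two accuracy metrics mentioned above can be computed per protein as in the following minimal sketch. It assumes Pearson correlation (the article does not say which correlation was used), and the protein names and values are hypothetical toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def score_proteins(truth, pred):
    """Return {protein: (correlation, rmse)} across cells."""
    return {p: (pearson(truth[p], pred[p]), rmse(truth[p], pred[p]))
            for p in truth}


# Toy data: measured vs. predicted expression for four cells.
truth = {"CD3": [0.0, 1.0, 2.0, 3.0], "CD19": [1.0, 0.0, 1.0, 0.0]}
pred = {"CD3": [0.5, 1.5, 2.5, 3.5], "CD19": [0.0, 1.0, 0.0, 1.0]}
scores = score_proteins(truth, pred)
```

The toy example shows why both metrics are reported: the hypothetical CD3 prediction has perfect correlation but a nonzero RMSE because it is shifted by a constant, so correlation captures the pattern while RMSE captures the scale.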
Figure 2. Protein expression prediction in the MALT dataset using the Seurat 4 PBMC dataset as a reference.
For a better-matched query and reference, the research team used a CITE-seq dataset of human blood monocytes and dendritic cells (the monocyte dataset), withholding the measured protein expression of the test set as ground truth (Figure 3).
The analysis shows that sciPENN achieves complete mixing of the two datasets in the embedding.
totalVI achieves almost complete mixing, with only a small non-overlapping region; Seurat 4 does not completely mix the two datasets.
Figure 3. Prediction of protein expression in the monocyte dataset.
Next, the research team randomly split the complete PBMC data into a training half and a test half, selected three protein markers of CD8 T cell subtypes (CD45RA, CD44-2, and CD38-1), and examined how well sciPENN recovers the expression trends of these marker proteins (Figure 4).
CD45RA is a distinct marker for CD8 Naive cells, CD44-2 is a clear marker for CD8 TEM3 and CD8 TCM2, and CD38-1 is a clear marker for CD8 TCM2.
The results show that sciPENN's protein predictions accurately recover these trends, allowing researchers to use sciPENN predictions alone to identify the cell subtypes in which a protein is highly expressed.
totalVI and Seurat 4 performed slightly worse than sciPENN, with Seurat 4 underestimating CD44-2 expression in CD8 TEM3 and totalVI underestimating CD38-1 expression in CD8 Naive 2.
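The random half-and-half split described above can be sketched as follows. This is a minimal illustration under assumed names (`split_in_half`, the `cell_…` identifiers, and the seed are hypothetical), not the procedure used in the paper.

```python
import random


def split_in_half(cell_ids, seed=0):
    """Shuffle cell identifiers and return (training half, test half)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


all_cells = [f"cell_{i}" for i in range(10)]
train, test = split_in_half(all_cells)
```

In such an evaluation, the measured protein values of the test half are hidden from the model and used only to score its predictions.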
Figure 4. Protein expression prediction and cell type label transfer in the PBMC dataset.
Finally, the team examined the ability of sciPENN to predict protein expression in PBMC and H1N1 RNA-seq data. totalVI was not included in this comparison because its loss function rapidly decayed to NaN.
The research team divided the proteins predicted in each test dataset into three categories: present only in Hanifa, present only in Sanger, and present in both.
The results showed that sciPENN predicted proteins shared by both datasets more accurately than proteins unique to one dataset.
These results highlight the importance of combining multiple CITE-seq datasets for protein expression prediction.
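The three-way grouping of proteins described above amounts to set operations on the two protein panels. The sketch below uses hypothetical panel contents and a hypothetical function name purely for illustration.

```python
def categorize_proteins(panel_a, panel_b):
    """Group proteins by which reference panel(s) measured them."""
    a, b = set(panel_a), set(panel_b)
    return {
        "only_a": sorted(a - b),  # measured in the first dataset only
        "only_b": sorted(b - a),  # measured in the second dataset only
        "both": sorted(a & b),    # measured in both datasets
    }


# Toy panels; the real panels are much larger.
groups = categorize_proteins(["CD3", "CD4", "CD8"], ["CD4", "CD8", "CD19"])
```

Prediction accuracy can then be summarized separately for each group to compare shared and dataset-specific proteins.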
Figure 5. Protein expression prediction in the H1N1 dataset using the Seurat 4 PBMC dataset as a reference.
In summary, the research team developed the sciPENN deep learning model, which can predict and impute protein expression, integrate multiple CITE-seq datasets, and quantify the uncertainty of its predictions and imputations.
sciPENN can learn from multiple CITE-seq datasets with partially overlapping protein panels, impute the proteins missing from each constituent CITE-seq dataset, and then predict protein expression in external scRNA-seq datasets.
In addition, sciPENN provides more reliable and accurate results than totalVI and Seurat 4 while also being highly scalable and computationally efficient, making it a strong tool for integrated CITE-seq and scRNA-seq data analysis.
References:
Lakkis, J., Schroeder, A., Su, K. et al. A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation. Nat Mach Intell (2022). https://doi.org/10.1038/s42256-022-00545-w