    Nature sub-journal: sciPENN, a multi-purpose deep learning method that predicts and imputes protein expression for scRNA-seq and CITE-seq data

    • Last Update: 2023-01-04
    • Source: Internet
    • Author: User

    The growing popularity of single-cell multiomics analysis in biological research has advanced the understanding of cellular heterogeneity and cell subpopulations.
    In particular, the increasing availability of cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq) protocols has greatly accelerated progress in this area.
    CITE-seq is a single-cell multiomics technique that simultaneously profiles RNA gene expression and cell surface proteins, so it can reveal cellular heterogeneity that single-modality single-cell RNA sequencing (scRNA-seq) misses.
    It is now widely used in biomedical research, especially on immune-related diseases and on other diseases such as influenza and COVID-19.

    One challenge in CITE-seq analysis is the need to integrate multiple CITE-seq and scRNA-seq datasets, which increases the information content but also adds computational difficulty.
    In addition, CITE-seq data are more expensive to generate than scRNA-seq data.
    One potential solution is to learn the relationship between RNA and protein from large reference datasets and then predict protein expression for scRNA-seq data.
    Both Seurat 4 and totalVI support this kind of prediction, but they are computationally expensive and have limitations.

    Recently, a research team from the University of Pennsylvania published an article in Nature Machine Intelligence entitled "A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation".
    The team developed sciPENN, a versatile deep learning method that supports CITE-seq and scRNA-seq data integration, protein expression prediction for scRNA-seq data, protein expression imputation for CITE-seq data, quantification of prediction and imputation uncertainty, and transfer of cell type labels from CITE-seq to scRNA-seq data.
    A comprehensive evaluation across multiple datasets shows that sciPENN outperforms other current methods of its kind.

    The article was published in Nature Machine Intelligence

    The model architecture of sciPENN is shown in Figure 1. Its overall goal is to learn from one or more CITE-seq reference datasets.
    When the protein panels of the CITE-seq reference datasets do not overlap completely, sciPENN can impute the missing proteins for each reference dataset.
    After learning from the CITE-seq reference data, sciPENN can predict all proteins for the scRNA-seq query dataset and integrate the multiple datasets into a common embedding space.
    sciPENN can estimate the mean expression of each protein, quantify the uncertainty of that estimate, and optionally transfer cell type labels from the CITE-seq reference data to the scRNA-seq query data.
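
    One common way to train on references whose protein panels only partially overlap is to leave unmeasured proteins out of the loss. The sketch below illustrates that idea with a masked mean squared error. It is an assumption made for illustration, not sciPENN's actual loss function, and the names `masked_mse`, `predicted`, and `observed` are hypothetical.

```python
from typing import Optional, Sequence


def masked_mse(
    predicted: Sequence[Sequence[float]],
    observed: Sequence[Sequence[Optional[float]]],
) -> float:
    """Mean squared error over measured (non-None) protein values only."""
    total, count = 0.0, 0
    for pred_row, obs_row in zip(predicted, observed):
        for pred, obs in zip(pred_row, obs_row):
            if obs is None:  # protein absent from this cell's panel
                continue
            total += (pred - obs) ** 2
            count += 1
    if count == 0:
        raise ValueError("no measured protein values to score")
    return total / count


# Toy data: two cells, three proteins; None marks an unmeasured protein.
predicted = [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5]]
observed = [[1.0, None, 2.0], [None, 1.5, 3.5]]
loss = masked_mse(predicted, observed)
```

    Because the unmeasured entries contribute nothing to the loss, a model trained this way still outputs a value for every protein, and those outputs can serve as imputations for the proteins a given reference dataset lacks.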

    Figure 1. An overview of the sciPENN methodology.

    The research team used the human peripheral blood mononuclear cell (PBMC) dataset reported in the Seurat 4 paper, which contains 161,764 cells and 224 proteins.
    For the test set, they used the mucosa-associated lymphoid tissue (MALT) dataset generated by 10x Genomics, which contains 8,412 cells.
    Of the 17 proteins in the MALT dataset, 10 overlap with the PBMC dataset.

    The research team analyzed these data with sciPENN, Seurat 4, and totalVI (Figure 2).

    First, each method was used to co-embed the PBMC CITE-seq reference data and the MALT scRNA-seq query data into a latent space (Figure 2).

    Because the PBMC reference data and the MALT query data differ greatly, none of sciPENN, totalVI, and Seurat 4 can fully mix the two datasets in the latent embedding space, even though all three methods apply internal batch correction.
    However, sciPENN integrates the two datasets best, achieving a partial mixing of them in the latent embedding.
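
    The degree of mixing in an embedding can be made concrete with a simple nearest-neighbour score. The sketch below is an illustrative assumption, not the measure used in the paper: for each cell it takes the `k` nearest cells in the embedding and records the fraction that come from the other dataset. The function name `mixing_score` and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence


def mixing_score(
    embedding: Sequence[Sequence[float]],
    dataset_labels: Sequence[str],
    k: int = 2,
) -> float:
    """Average fraction of each cell's k nearest neighbours from another dataset."""
    n = len(embedding)
    if not 0 < k < n:
        raise ValueError("k must be between 1 and the number of cells minus 1")
    fractions = []
    for i in range(n):
        # Brute-force Euclidean neighbour search; fine for a toy example.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(embedding[i], embedding[j]),
        )[:k]
        other = sum(dataset_labels[j] != dataset_labels[i] for j in neighbours)
        fractions.append(other / k)
    return sum(fractions) / n


# Two well-separated clusters, one per dataset: no mixing at all.
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
separated_labels = ["reference"] * 3 + ["query"] * 3
separated_score = mixing_score(separated, separated_labels, k=2)

# Cells from the two datasets alternate along a line: fully interleaved.
interleaved = [(0, 0), (1, 0), (2, 0), (3, 0)]
interleaved_labels = ["reference", "query", "reference", "query"]
interleaved_score = mixing_score(interleaved, interleaved_labels, k=1)
```

    A score near 0 means each cell's neighbours come almost entirely from its own dataset, and higher scores indicate stronger mixing. For real data with tens of thousands of cells, an approximate nearest-neighbour index would replace the brute-force search.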

    The research team also evaluated the accuracy of protein expression prediction for the three methods, quantified by correlation and root mean square error (RMSE).
    The results show that sciPENN achieves the highest prediction accuracy across all proteins.
    This high accuracy allows sciPENN to accurately recover protein expression patterns.
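
    As a minimal sketch of how these two metrics can be computed for each protein, assuming the true and predicted expression values are available as one list per protein, the example below uses Pearson correlation and RMSE. The choice of Pearson correlation, the function names, and the toy values are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def rmse(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def per_protein_metrics(
    truth: Dict[str, Sequence[float]],
    predicted: Dict[str, Sequence[float]],
) -> Dict[str, Tuple[float, float]]:
    """Map each protein measured in both inputs to (correlation, RMSE)."""
    shared = sorted(set(truth) & set(predicted))
    return {p: (pearson(truth[p], predicted[p]), rmse(truth[p], predicted[p]))
            for p in shared}


# Toy values for four cells; the protein names are placeholders.
truth = {"protein_A": [1.0, 2.0, 3.0, 4.0], "protein_B": [2.0, 2.0, 4.0, 4.0]}
predicted = {"protein_A": [1.5, 2.5, 3.5, 4.5], "protein_B": [4.0, 4.0, 2.0, 2.0]}
metrics = per_protein_metrics(truth, predicted)
```

    The two metrics are complementary: `protein_A` is predicted with a perfect correlation but a constant offset that only RMSE reveals, while `protein_B` has the trend reversed, which shows up as a negative correlation.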

    Figure 2. Protein expression prediction in the MALT dataset using the Seurat 4 PBMC dataset as a reference.

    To consider a setting with a better balance between the query and reference datasets, the research team used a human blood monocyte and dendritic cell CITE-seq dataset (the monocyte dataset), holding out the true protein expression of the test set for evaluation (Figure 3).

    The analysis shows that sciPENN mixes the two datasets completely in the embedding.
    totalVI achieves almost complete mixing, with only a small non-overlapping region, whereas Seurat 4 does not fully mix the two datasets.

    Figure 3. Prediction of protein expression in the monocyte dataset.

    Next, the research team randomly split the complete PBMC data into a training half and a test half, selected three protein markers of CD8 T cell subtypes (CD45RA, CD44-2, and CD38-1), and examined whether sciPENN could recover the expression trends of these marker proteins (Figure 4).

    CD45RA is a clear marker for CD8 Naive cells, CD44-2 is a clear marker for CD8 TEM3 and CD8 TCM2, and CD38-1 is a clear marker for CD8 TCM2.

    The results show that sciPENN's protein predictions accurately recover these trends, so researchers can use sciPENN predictions alone to identify the cell subtypes in which a protein is highly expressed.
    totalVI and Seurat 4 performed slightly worse than sciPENN: Seurat 4 underestimated CD44-2 expression in CD8 TEM3, and totalVI underestimated CD38-1 expression in CD8 Naive 2.

    Figure 4. Protein expression prediction and cell type marker transfer in PBMC datasets.

    Finally, the team examined the ability of sciPENN to predict protein expression in PBMC and H1N1 RNA-seq data.
    totalVI was not included in this comparison because its loss function quickly degenerated to non-numeric (NaN) values.
    The research team divided the proteins predicted in each test dataset into three categories: found only in the Haniffa dataset, found only in the Sanger dataset, and found in both.
    The results showed that sciPENN predicted the shared proteins more accurately than the dataset-specific proteins.
    These results highlight the importance of combining multiple CITE-seq datasets for protein expression prediction.
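
    The three-way grouping of proteins described above amounts to a set comparison of two protein panels. A minimal sketch follows, in which the function name `categorize_proteins` and the panel contents are hypothetical placeholders rather than the real panels of either dataset.

```python
from typing import Dict, Iterable, List


def categorize_proteins(
    panel_a: Iterable[str], panel_b: Iterable[str]
) -> Dict[str, List[str]]:
    """Group proteins by whether they appear in panel A only, B only, or both."""
    a, b = set(panel_a), set(panel_b)
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "both": sorted(a & b),
    }


# Placeholder panels for two reference datasets.
panel_a = ["protein_1", "protein_2", "protein_3"]
panel_b = ["protein_2", "protein_3", "protein_4"]
groups = categorize_proteins(panel_a, panel_b)
```

    Prediction accuracy can then be summarized separately for each group, which is how a comparison between shared and dataset-specific proteins would be set up.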

    Figure 5. Protein expression prediction in the H1N1 dataset using the Seurat 4 PBMC dataset as a reference.

    In summary, the research team developed sciPENN, a deep learning model that can predict and impute protein expression, integrate multiple CITE-seq datasets, and quantify the uncertainty of its predictions and imputations.
    sciPENN can learn from multiple CITE-seq datasets with partially overlapping protein panels, impute the proteins missing from each constituent CITE-seq dataset, and then predict protein expression in external scRNA-seq datasets.
    In addition, sciPENN provides more reliable and accurate results than totalVI and Seurat 4 while remaining highly scalable and computationally efficient, making it an ideal tool for integrated CITE-seq and scRNA-seq data analysis.

    References:

    Lakkis, J., Schroeder, A., Su, K. et al. A multi-use deep learning method for CITE-seq and single-cell RNA-seq data integration with cell surface protein prediction and imputation. Nat Mach Intell (2022). https://doi.org/10.1038/s42256-022-00545-w


    This article is an English version of an article which is originally in the Chinese language on echemi.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to service@echemi.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
