Data preparation¶
Output from kallisto-bustools (kp-python)¶
In this example we use raw sequencing data stored in .fastq
format, from 1000 PBMC,
the data can be accessed at
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3
Note
This is by no means a tutorial for processing scRNA-seq data. We only demonstrate the workflow of connecting upstream analysis software and DCS.
Download the data and unpack the .tar
file (~5.17 GB):
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar
tar -xvf pbmc_1k_v3_fastqs.tar
To process sequencing data one could use kallisto bus
tool to generate BUS file
following by bustools count
to generate count matrices from a BUS file.
However, we prefer to use kb-python
, a package that wraps the kallisto
and bustools
single-cell RNA-seq workflow (Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature biotechnology, 34(5), 525)
kb-python
can be installed with pip
.
kb count -i kallisto_index/homo_sapiens/transcriptome.idx \
-g kallisto_index/homo_sapiens/transcripts_to_genes.txt \
-x 10xv3 \
--filter \
-t 4 \
pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz \
pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz \
pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz \
pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
Output from kb count command above
[2020-11-20 14:32:51,136] INFO Using index kallisto_index/homo_sapiens/transcriptome.idx to generate BUS file to . from
[2020-11-20 14:32:51,136] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz
[2020-11-20 14:32:51,136] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz
[2020-11-20 14:32:51,136] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz
[2020-11-20 14:32:51,136] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz
[2020-11-20 14:36:33,477] INFO Sorting BUS file ./output.bus to ./tmp/output.s.bus
[2020-11-20 14:36:57,118] INFO Whitelist not provided
[2020-11-20 14:36:57,118] INFO Copying pre-packaged 10XV3 whitelist to .
[2020-11-20 14:36:57,675] INFO Inspecting BUS file ./tmp/output.s.bus
[2020-11-20 14:37:07,641] INFO Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with whitelist ./10xv3_whitelist.txt
[2020-11-20 14:37:29,264] INFO Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus
[2020-11-20 14:37:47,478] INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus
[2020-11-20 14:37:59,662] INFO Filtering with bustools
[2020-11-20 14:37:59,662] INFO Generating whitelist ./filter_barcodes.txt from BUS file ./output.unfiltered.bus
[2020-11-20 14:37:59,790] INFO Correcting BUS records in ./output.unfiltered.bus to ./tmp/output.unfiltered.c.bus with whitelist ./filter_barcodes.txt
[2020-11-20 14:38:14,344] INFO Sorting BUS file ./tmp/output.unfiltered.c.bus to ./output.filtered.bus
[2020-11-20 14:38:30,918] INFO Generating count matrix ./counts_filtered/cells_x_genes from BUS file ./output.filtered.bus
The output directory that we are interested in is counts_filtered/
. Rename it:
mv counts_filtered/ kb_1k_PBMC_output/
This will generate counts data in the directory kb_1k_PBMC_output/
.
Output from CellRanger¶
Here we use CellRanger-processed data stored in .mtx
format, from 1000 PBMC,
the data can be accessed at
https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3
Download the data and unpack the .tar.gz
file (~9 MB):
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz
tar -xzf pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz && mv filtered_feature_bc_matrix/ cellranger_1k_PBMC_output/
These two commands will prepare the processed counts data in the directory cellranger_1k_PBMC_output/
.
Import from kallisto-bustools (kp-python)¶
import DigitalCellSorter
from DigitalCellSorter.core import readMTXdata
# Read the MTX data
df = readMTXdata(dataDir='kb_1k_PBMC_output/', origin='kb-python')
# (Optional) Convert gene names to HUGO
DCS = DigitalCellSorter.DigitalCellSorter()
DCS.prepare(df)
DCS.convert('ensembl', 'hugo')
# Check the DCS data
print(DCS.df_expr)
Import from CellRanger¶
import DigitalCellSorter
from DigitalCellSorter.core import readMTXdata
# Read the MTX data
df = readMTXdata(dataDir='cellranger_1k_PBMC_output/', origin='cellranger')
# (Optional) Convert gene names to HUGO
DCS = DigitalCellSorter.DigitalCellSorter()
DCS.prepare(df)
DCS.convert('ensembl', 'hugo')
# Check the DCS data
print(DCS.df_expr)
Function readMTXdata¶
Function to read data in MTX format (see usage examples above).
-
readMTXdata
(dataDir, origin, fileMatrix=None, fileBarcodes=None, fileGenes=None, headerRows=None, sampleName=None, stripGeneVersions=True, saveData=True, dropGeneDuplicates=True, dropCellDuplicates=True)[source]¶ Read MTX format into pandas DataFrame compatible with DCS input format
- Parameters
- dataDir: str
Path to gene expression counts data
- origin: str
- Name of the software where the data was generated. Supported options are:
‘kb-python’ for kallisto-bustools ‘cellranger’ for cellRanger
- fileMatrix: str, Default None
Name of the matrix file
- fileBarcodes: str, Default None
Name of the cell barcodes file
- fileGenes: str, Default None
Name of the genes file
- headerRows: list, Default None
List of rows in matrix file to skip
- sampleName: str, Default None
Name of the data sample to include in the batch level
- stripGeneVersions: boolean, Default True
Remove ensembl gene version. E.g. “ENSG00000236246.1” –> “ENSG00000236246”
- saveData: boolean, Default True
Whether to save data in hdf format. If True then the data is saved to a compressed hdf at the same location as matrix data
- dropGeneDuplicates: boolean, Default True
Whether to remove gene duplicates (keep first)
- dropGeneDuplicates: boolean, Default True
Whether to remove barcode duplicates (keep first)
- Returns:
- pandas.DataFrame
Table that has genes in rows and cells in columns
- Usage:
df = readMTX(dataDir=’filtered_feature_bc_matrix/’, origin=’cellranger’) #df = readMTX(dataDir=’counts_filtered/’, origin=’kb-python’)
DCS = DigitalCellSorter.DigitalCellSorter() DCS.prepare(df) DCS.convert(‘ensembl’, ‘hugo’) print(DCS.df_expr)
Human Cell Atlas tools¶
Set of generic tools for retrieving, loading, and preparation of Human Cell Atlas (HCA) datasets is contained in this module.
Example:
import os
import DigitalCellSorter.ReadPrepareDataHCA as prep
# Example URL of a relatvely small dataset of scRNA-seq of human pancreas
url = "https://data.humancellatlas.org/project-assets/project-matrices/cddab57b-6868-4be4-806f-395ed9dd635a.homo_sapiens.mtx.zip"
# Path of directories where the data will be placed
extractPath = os.path.join(os.path.join(os.path.dirname(__file__), ''), 'data', os.path.splitext(os.path.basename(url))[0])
# Download data and unpack it to a specified directory
prep.getHCAdataByURL(url, extractPath)
# Record *.h5 files of individual donor IDs
IDs = prep.recordFilesOfIndividualDonors(extractPath, organName='islet of Langerhans')
# Load ready-to-use dataset of the first donor ID
df = prep.getDataframeByDonorID(extractPath, IDs[0])
# Print the shape of just loaded dataset
print(df.shape)
Submodule ReadPrepareDataHCA
Functions:
Prepare data from Human Cell Atlas (HCA) preview dataset h5 data file. |
|
|
|
|
Get pandas.DataFrame by Donor ID |
|
Download and extract data from Human Cell Atlas Portal |
|
|
|
Unpickle object from a (binary) file |
|
Record h5 files of HCA individual donors in a dataset |
|
Pickle object into a (binary) file |
-
getHCAdataByURL
(url, extractPath, extractData=True)[source]¶ Download and extract data from Human Cell Atlas Portal
- Parameters:
- url: str
URL of the data of interest
- extractPath: str
Path where to save and extract data to
- extractData: boolean, Default True
Whether to extract downloaded data
- Returns:
None
- Usage:
getHCAdataByURL(url, extractPath)
-
recordFilesOfIndividualDonors
(extractPath, organName=None, donorIDcolumn='donor_organism.provenance.document_id', organColumn='derived_organ_parts_label', useHogoGeneNames=True)[source]¶ Record h5 files of HCA individual donors in a dataset
- Parameters:
- extractPath: str
Path of directories where HCA matrix files were downloaded and extracted. See function getHCAdataByURL() for detail.
- organName: str, Default None
Name of the organ name. E.g. ‘pancreas’, ‘bone marrow’, etc.
- donorIDcolumn: str, Default donor_organism.provenance.document_id
Column with unique IDs of donors in the file. Another option is ‘specimen_from_organism.provenance.document_id’ IDs at samples level is needed.
- organColumn: str, Default ‘derived_organ_parts_label’
‘derived_organ_label’
‘derived_organ_parts_label’ This option is ignored when organName parameter is None.
- useHogoGeneNames: boolean, Default True
Whether to use HUGO gene names.
- Returns:
- list
List of donor IDs
- Usage:
recordFilesOfIndividualDonors(extractPath, organName=’retina’)
-
getDataframeByDonorID
(extractPath, donorID)[source]¶ Get pandas.DataFrame by Donor ID
- Parameters:
- extractPath: str
Path of directories where HCA matrix files were downloaded and extracted. See function getHCAdataByURL() for detail.
- donorID: str
Donor ID.
- Returns:
- pandas.DataFrame
Matrix corresponding to the Donor ID
- Usage:
getDataframeByDonorID(extractPath, donorID)
-
PrepareDataOnePatient_PREVIEW_DATASET
(filename, patient, saveFolderName, useAllData=True, cellsLimitToUse=1000, randomlySample=True, randomSeed=0)[source]¶ Prepare data from Human Cell Atlas (HCA) preview dataset h5 data file. The user can download the file ica_bone_marrow_h5.h5 from https://preview.data.humancellatlas.org/ (Raw Counts Matrix - Bone Marrow) and place in folder data. The file is ~485Mb and contains all 378000 cells from 8 bone marrow donors (BM1-BM8). Note: this data file is no longer available at HCA data server, however, some users may have a copy of it and need to extract data from it.
- Parameters:
- filename: str
Path and name of the file to store binary data in
- patient: str
Identifier of the patient: ‘BM1’, ‘BM2’, ‘BM3’, ‘BM4’, ‘BM5’, ‘BM6’, ‘BM7’ or ‘BM8’
- saveFolderName: str
Path where to save prepared data file
- useAllData: boolean, Default True
Whether to use all data or a subset
- cellsLimitToUse: int, Default 1000
Number of cells to use if useAllData=False
- randomlySample: boolean, Default True
Whether to sample cell randomly of pick top number
- randomSeed: int, Default 0
Random seed
- Returns:
None
- Usage:
PrepareDataOnePatient(os.path.join(‘data’, ‘ica_bone_marrow_h5.h5’), ‘BM1’, os.path.join(‘data’, ‘’), useAllData=False, cellsLimitToUse=5000)