Identify transcripts in PacBio single-molecule sequencing data.
A workflow, written in Workflow Description Language (WDL), for running scalable de novo isoform discovery on PacBio HiFi data.
Workflow Inputs
The type of run (single-cell or bulk) is determined by whether the `barcodes_txt` file is provided. If it is provided, the single-cell IsoSeq pipeline, including barcode correction, will run. If it is not provided, the bulk IsoSeq pipeline will run.
| Input | Description |
|---|---|
| batch_id | Batch name; used for naming files |
| hifi_reads | Array of HiFi reads in BAM format |
| primers_fasta | FASTA file containing forward and reverse primer sequences. Used to demultiplex and refine reads. |
| reference | Reference data and associated files. See the IsoSeq docs for more information. |
| adapters_fasta | Optional file containing fasta adapter sequences, ordered in the expected order of the adapters within the reads. If this file is provided, skera will be run first to segment the reads from the sample movie BAM. Required if the movie BAMs were generated using MAS-Seq. |
| barcodes_txt | Optional file containing valid whitelisted barcode sequences. If provided, the single-cell IsoSeq pipeline will be run. Otherwise, bulk IsoSeq will run. |
| tag_design | Optional UMI/Barcode design. If not provided and the single-cell pipeline is run, the |
| backend | Backend where the workflow will be executed |
| zones | Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’. |
| aws_spot_queue_arn | Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and |
| aws_on_demand_queue_arn | Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and |
| container_registry | Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used. |
| preemptible | If set to |
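
The way the optional inputs steer the run can be sketched in a few lines of Python. This is only an illustration of the rules stated above, not code from the workflow: the function names, the stage labels, and the example file names are all assumptions made for the sketch.

```python
from typing import List, Optional


def plan_isoseq_run(
    adapters_fasta: Optional[str] = None,
    barcodes_txt: Optional[str] = None,
) -> List[str]:
    """Return illustrative stage labels implied by the optional inputs."""
    stages: List[str] = []
    # skera segments the reads first, but only when adapters are supplied.
    if adapters_fasta:
        stages.append("skera")
    # The presence of barcodes_txt alone selects single-cell over bulk.
    stages.append("single-cell" if barcodes_txt else "bulk")
    return stages


def zones_required(backend: str) -> bool:
    """zones must be set when the backend is 'AWS' or 'GCP'."""
    return backend in {"AWS", "GCP"}


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the sketch.
    print(plan_isoseq_run())
    print(plan_isoseq_run(adapters_fasta="adapters.fasta",
                          barcodes_txt="barcodes.txt"))
    print(zones_required("GCP"))
```

A check like this can be useful before submitting a run, since a missing `barcodes_txt` silently selects the bulk pipeline rather than raising an error.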
Workflow Outputs
Common outputs are always produced. Depending on which pipeline is run, either single-cell or bulk outputs are produced in addition to the common outputs.
| Output | Description |
|---|---|
| Common | |
| Single-cell IsoSeq | |
| Bulk IsoSeq | |
References
Reference datasets are hosted publicly for use in the pipeline.
Containers
Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.
The Docker image used by a particular step of the workflow can be identified by looking at the `docker` key in the `runtime` block for the given task.
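
To confirm that an image reference taken from a task's `docker` key is pinned by digest rather than by tag, a small check along the following lines may help. It is a minimal sketch that assumes the standard `name@sha256:<64 hex characters>` digest form; the function name and the example references are hypothetical and do not come from this workflow.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    """Return True if the image reference ends in a sha256 digest."""
    return _DIGEST_RE.search(image) is not None


if __name__ == "__main__":
    # Hypothetical references, used only to exercise the check.
    pinned = "quay.io/example/tool@sha256:" + "a" * 64
    tagged = "quay.io/example/tool:1.0.0"
    print(is_digest_pinned(pinned))
    print(is_digest_pinned(tagged))
```

A digest identifies one exact image, whereas a tag can later be moved to point at a different image, which is why digest references make runs reproducible.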