IsoSeq Isoform Discovery

Identify transcripts in PacBio single-molecule sequencing data.

A workflow, written in Workflow Description Language (WDL), for running scalable de novo isoform discovery on PacBio HiFi data.

Workflow Inputs

The type of run (single-cell or bulk) is determined by whether or not the barcodes_txt file is provided. If provided, the single-cell IsoSeq pipeline, including barcode correction, will run. If not provided, the bulk IsoSeq pipeline will run.
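The selection rule above can be sketched in Python as follows. This is only an illustration: the function name `select_pipeline` and the dictionary form of the inputs are assumptions made for the example, and the workflow itself makes this decision in WDL.

```python
from typing import Mapping, Optional


def select_pipeline(inputs: Mapping[str, Optional[str]]) -> str:
    """Return which IsoSeq pipeline would run for the given inputs.

    Hypothetical helper: it mirrors the documented rule that providing
    barcodes_txt selects the single-cell pipeline (with barcode
    correction) and omitting it selects the bulk pipeline.
    """
    barcodes_txt = inputs.get("barcodes_txt")
    return "single-cell" if barcodes_txt else "bulk"


# The file name below is a placeholder, not a file shipped with the workflow.
print(select_pipeline({"barcodes_txt": "barcodes.txt"}))  # single-cell
print(select_pipeline({}))                                # bulk
```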

Input Description

batch_id

Batch name; used for naming files

hifi_reads

Array of HiFi reads in BAM format

primers_fasta

FASTA file containing forward and reverse primer sequences. Used to demultiplex and refine reads.

reference

Reference data and associated files. See the IsoSeq docs for more information.

name	Reference name; used to name outputs (e.g., “GRCh38”)
fasta	Reference genome and index
annotation_gtf	Annotation file for the reference genome in GTF format
cage_bed	CAGE peaks in BED format
intropolis_tsv	Intropolis data in custom format
polyA_list	polyA motif list in custom format

adapters_fasta

Optional FASTA file containing adapter sequences, listed in the order the adapters are expected to appear within the reads. If this file is provided, skera will be run first to segment the reads from the sample movie BAM. Required if the movie BAMs were generated using MAS-Seq.

barcodes_txt

Optional file containing the whitelist of valid barcode sequences. If provided, the single-cell IsoSeq pipeline will be run. Otherwise, bulk IsoSeq will run.

tag_design

Optional UMI/barcode design. If not provided and the single-cell pipeline is run, the default `isoseq tag` design (T-8U-10B) will be used.

backend

Backend where the workflow will be executed

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, tasks run on preemptible instances where possible; when the backend is GCP, on-demand VMs are used only for tasks that run for more than 24 hours. If set to false, on-demand VMs are used for every task. Ignored if backend is set to HPC.
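The backend-related inputs depend on one another, so it can help to check them before submitting a run. The following is a minimal sketch of such a check. The function name `check_backend_inputs`, the plain-dictionary input format, and the example values are assumptions made for illustration; the workflow does not ship this helper. Only the conditions stated in the descriptions above are checked.

```python
from typing import Any, Dict, List


def check_backend_inputs(inputs: Dict[str, Any]) -> List[str]:
    """Return a list of problems with backend-related inputs (empty if none)."""
    problems: List[str] = []
    backend = inputs.get("backend")
    preemptible = inputs.get("preemptible")

    # zones is required for the cloud backends named in this README.
    if backend in ("AWS", "GCP") and not inputs.get("zones"):
        problems.append("zones is required when backend is AWS or GCP")

    # On AWS, the required queue ARN depends on the preemptible setting.
    if backend == "AWS":
        if preemptible is True and not inputs.get("aws_spot_queue_arn"):
            problems.append("aws_spot_queue_arn is required when preemptible is true")
        if preemptible is False and not inputs.get("aws_on_demand_queue_arn"):
            problems.append("aws_on_demand_queue_arn is required when preemptible is false")

    return problems


# Placeholder values; the zone name is illustrative only.
example = {"backend": "AWS", "preemptible": True, "zones": "example-zone-a"}
for problem in check_backend_inputs(example):
    print(problem)
```

With the example inputs shown, the only problem reported is the missing `aws_spot_queue_arn`.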

Workflow Outputs

Common outputs are produced regardless of which pipeline is run; depending on which pipeline is run, either single-cell or bulk outputs will be produced in addition to common outputs.

Output	Description

Common

refine_metadata	Metadata output from the polyA and concatemer removal step
refine_summary_jsons	Summary JSON output from the polyA and concatemer removal step
refine_report_csvs	Report CSV output from the polyA and concatemer removal step
aligned_bam	Reads aligned to the reference genome in BAM format
collapse_read_stat	Read stats output from the transcript collapse step
collapse_report_json	Report JSON output from the transcript collapse step
sorted_gff	Sorted GFF output by `pigeon`
classification_summary_txt	Summary file output by `pigeon` transcript classification
classification_report_json	Report JSON output by `pigeon` transcript classification
classification_txt	Classification file output by `pigeon` transcript classification
junctions_txt	A junctions txt file containing every junction for each isoform. Follows the SQANTI3 junction file convention.
filtered_reasons_txt	The filter tool outputs a txt file containing the reasons an isoform was filtered. See the `pigeon` documentation for reasons an isoform can be filtered.
filtered_report_json	Filtered report JSON file
filtered_classification_txt	Filtered classification txt file following the SQANTI3 classification file convention, with two added columns: `fl_assoc` and `cell_barcodes`.
filtered_junctions_txt	Filtered junctions txt file
filtered_gff	Filtered GFF output by `pigeon filter`
gene_saturation_txt	A txt file containing the read count and number of unique genes found in a subsampled number of reads

Single-cell IsoSeq

corrected_bam	Barcode-corrected BAM
corrected_summary_json	Corrected barcode report JSON
bcstats_json	Stats for group barcodes in JSON format
bcstats_tsv	Stats for group barcodes in TSV format
seurat_tar	Files required to run tertiary analysis with Seurat.

Bulk IsoSeq

clustered_bam	Clustered BAM and index
cluster_report_csv	Clustering report in CSV format
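As a rough illustration of how the common and pipeline-specific outputs combine, the sketch below lists the output names from the tables above and reports which expected outputs are absent from a run. The helper names `expected_outputs` and `missing_outputs` are hypothetical, and the assumption that an execution engine reports output keys with a workflow-name prefix (for example `wf.aligned_bam`, where `wf` is a placeholder) may not match your setup.

```python
from typing import Iterable, List

# Output names copied from the tables in this README.
COMMON_OUTPUTS = [
    "refine_metadata", "refine_summary_jsons", "refine_report_csvs",
    "aligned_bam", "collapse_read_stat", "collapse_report_json",
    "sorted_gff", "classification_summary_txt", "classification_report_json",
    "classification_txt", "junctions_txt", "filtered_reasons_txt",
    "filtered_report_json", "filtered_classification_txt",
    "filtered_junctions_txt", "filtered_gff", "gene_saturation_txt",
]
SINGLE_CELL_OUTPUTS = [
    "corrected_bam", "corrected_summary_json", "bcstats_json",
    "bcstats_tsv", "seurat_tar",
]
BULK_OUTPUTS = ["clustered_bam", "cluster_report_csv"]


def expected_outputs(single_cell: bool) -> List[str]:
    """Common outputs plus the outputs of whichever pipeline ran."""
    extra = SINGLE_CELL_OUTPUTS if single_cell else BULK_OUTPUTS
    return COMMON_OUTPUTS + extra


def missing_outputs(produced: Iterable[str], single_cell: bool) -> List[str]:
    """Expected output names that do not appear among the produced keys.

    Any prefix before the last '.' is dropped, on the assumption that
    keys may be reported as '<workflow name>.<output name>'.
    """
    produced_names = {key.rsplit(".", 1)[-1] for key in produced}
    return [name for name in expected_outputs(single_cell)
            if name not in produced_names]


# 'wf' is a placeholder prefix, not the real workflow name.
print(missing_outputs(["wf.clustered_bam"], single_cell=False))
```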

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io registry. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
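One way to list the image used by each task is to scan a WDL file for the docker key inside each runtime block. The sketch below does this with a simple line-based scan; it is not a full WDL parser and assumes each `task` declaration, `runtime {` opener, and closing `}` sits on its own line. The function name `docker_images_by_task` and the example task, registry, and digest are made up for illustration.

```python
import re
from typing import Dict, Optional

TASK_RE = re.compile(r"^\s*task\s+(\w+)\s*\{")
RUNTIME_RE = re.compile(r"^\s*runtime\s*\{")
DOCKER_RE = re.compile(r'^\s*docker\s*:\s*"([^"]+)"')


def docker_images_by_task(wdl_text: str) -> Dict[str, str]:
    """Map each task name to the docker value found in its runtime block."""
    images: Dict[str, str] = {}
    current_task: Optional[str] = None
    in_runtime = False
    for line in wdl_text.splitlines():
        task_match = TASK_RE.match(line)
        if task_match:
            current_task = task_match.group(1)
            in_runtime = False
        elif RUNTIME_RE.match(line):
            in_runtime = True
        elif in_runtime and line.strip() == "}":
            in_runtime = False  # end of the runtime block
        elif in_runtime and current_task is not None:
            docker_match = DOCKER_RE.match(line)
            if docker_match:
                images[current_task] = docker_match.group(1)
    return images


# Hypothetical task text; the image reference is a placeholder.
EXAMPLE_WDL = '''
task example_task {
  command <<<
    echo "hello"
  >>>

  runtime {
    docker: "registry.example/image@sha256:0123abcd"
  }
}
'''
print(docker_images_by_task(EXAMPLE_WDL))
```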