IsoSeq Isoform Discovery

Identify transcripts in PacBio single-molecule sequencing data.

A workflow, written in Workflow Description Language (WDL), for running scalable de novo isoform discovery on PacBio HiFi data.

IsoSeq workflow diagram

Workflow Inputs

The type of run (single-cell or bulk) is determined by whether or not the barcodes_txt file is provided. If provided, the single-cell IsoSeq pipeline, including barcode correction, will run. If not provided, the bulk IsoSeq pipeline will run.
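
The selection logic described above can be sketched in a few lines of Python. This is only an illustration of how the optional inputs steer the run, not the workflow's actual WDL; the function name plan_run and the returned dictionary keys are hypothetical, while the input names and the default tag design are taken from the inputs table.

```python
from typing import Optional

# Default tag design quoted in the inputs table for the single-cell pipeline.
DEFAULT_TAG_DESIGN = "T-8U-10B"


def plan_run(barcodes_txt: Optional[str] = None,
             adapters_fasta: Optional[str] = None,
             tag_design: Optional[str] = None) -> dict:
    """Summarize which branches the optional inputs select (hypothetical helper)."""
    single_cell = barcodes_txt is not None
    return {
        # barcodes_txt present -> single-cell (with barcode correction); else bulk.
        "pipeline": "single-cell" if single_cell else "bulk",
        # adapters_fasta present -> skera segments the reads first.
        "run_skera": adapters_fasta is not None,
        # tag_design only applies to single-cell runs; fall back to the default.
        "tag_design": (tag_design or DEFAULT_TAG_DESIGN) if single_cell else None,
    }
```

For example, plan_run(barcodes_txt="barcodes.txt") reports the single-cell pipeline with the default tag design, while plan_run() reports a bulk run without skera.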

Input Description
batch_id

Batch name; used for naming files

hifi_reads

Array of HiFi reads in BAM format

primers_fasta

FASTA file containing forward and reverse primer sequences. Used to demultiplex and refine reads.

reference

Reference data and associated files. See the IsoSeq docs for more information.

name Reference name; used to name outputs (e.g., “GRCh38”)
fasta Reference genome and index
annotation_gtf Annotation file for the reference genome in GTF format
cage_bed CAGE peaks in BED format
intropolis_tsv Intropolis data in custom format
polyA_list polyA motif list in custom format

adapters_fasta

Optional FASTA file of adapter sequences, listed in the order the adapters are expected to appear within the reads. If this file is provided, skera will be run first to segment the reads from the sample movie BAM. Required if the movie BAMs were generated using MAS-Seq.

barcodes_txt

Optional file containing the whitelist of valid barcode sequences. If provided, the single-cell IsoSeq pipeline will be run. Otherwise, bulk IsoSeq will run.

tag_design

Optional UMI/barcode design. If not provided and the single-cell pipeline is run, the default isoseq tag design (T-8U-10B) will be used.

backend

Backend where the workflow will be executed

zones

Zones where compute will take place; required if backend is set to ‘AWS’ or ‘GCP’.

aws_spot_queue_arn

Queue ARN for the spot batch queue; required if backend is set to ‘AWS’ and preemptible is set to true

aws_on_demand_queue_arn

Queue ARN for the on demand batch queue; required if backend is set to ‘AWS’ and preemptible is set to false

container_registry

Container registry where workflow images are hosted. If left blank, PacBio’s public Quay.io registry will be used.

preemptible

If set to true, tasks run on preemptible instances where possible; when the backend is set to GCP, on-demand VMs are used only for tasks that run for more than 24 hours. If set to false, on-demand VMs are used for every task. Ignored if backend is set to HPC.
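
The conditional requirements among the backend-related inputs (zones, the two AWS queue ARNs, and preemptible) can be checked before submitting a run. The following is a minimal sketch that assumes the inputs have been collected into a flat Python dictionary keyed by the names in the table above; the function name validate_backend_inputs is hypothetical, and a real inputs JSON may use differently prefixed keys.

```python
def validate_backend_inputs(inputs: dict) -> list:
    """Return a list of problems with backend-related inputs (empty if none).

    Assumes a flat dict keyed by the input names from the table; this is an
    illustrative check, not part of the workflow itself.
    """
    problems = []
    backend = inputs.get("backend")
    preemptible = inputs.get("preemptible")

    # zones is required on the cloud backends named in the table.
    if backend in ("AWS", "GCP") and not inputs.get("zones"):
        problems.append("zones is required when backend is AWS or GCP")

    if backend == "AWS":
        # Spot queue is needed for preemptible runs, on-demand queue otherwise.
        if preemptible is True and not inputs.get("aws_spot_queue_arn"):
            problems.append("aws_spot_queue_arn is required when preemptible is true")
        if preemptible is False and not inputs.get("aws_on_demand_queue_arn"):
            problems.append("aws_on_demand_queue_arn is required when preemptible is false")

    return problems
```

An empty list means none of the rules stated in the table were violated; it does not confirm that the values themselves (for example, a queue ARN) are valid.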

Workflow Outputs

Common outputs are produced regardless of which pipeline is run. Either the single-cell or the bulk outputs are produced in addition, depending on the pipeline.

Output Description
Common
refine_metadata Metadata output from the polyA and concatemer removal step
refine_summary_jsons Summary JSON output from the polyA and concatemer removal step
refine_report_csvs Report CSV output from the polyA and concatemer removal step
aligned_bam Reads aligned to the reference genome in BAM format
collapse_read_stat Read stats output from the transcript collapse step
collapse_report_json Report JSON output from the transcript collapse step
sorted_gff Sorted GFF file output by pigeon
classification_summary_txt Summary file output by pigeon transcript classification
classification_report_json Report JSON output by pigeon transcript classification
classification_txt Classification file output by pigeon transcript classification
junctions_txt A junctions txt file containing every junction for each isoform. Follows the SQANTI3 junction file convention.
filtered_reasons_txt A txt file output by the filter tool, listing the reasons each isoform was filtered. See the pigeon documentation for the reasons an isoform can be filtered.
filtered_report_json Filtered report JSON file
filtered_classification_txt Filtered classification txt file following the SQANTI3 classification file convention, with two added columns: fl_assoc and cell_barcodes.
filtered_junctions_txt Filtered junctions txt file
filtered_gff Filtered GFF file output by pigeon filter
gene_saturation_txt A txt file containing the read count and the number of unique genes found in a subsampled number of reads
Single-cell IsoSeq
corrected_bam Barcode-corrected BAM
corrected_summary_json Corrected barcode report JSON
bcstats_json Stats for group barcodes in JSON format
bcstats_tsv Stats for group barcodes in TSV format
seurat_tar Files required to run tertiary analysis with Seurat.
Bulk IsoSeq
clustered_bam Clustered BAM and index
cluster_report_csv Clustering report in CSV format
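
To check that a finished run produced everything listed above, the output names can be compared against the keys of an outputs mapping. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that output keys may carry a dotted prefix (such as a workflow name), so only the part after the last dot is matched.

```python
# Output names copied from the table above.
COMMON_OUTPUTS = [
    "refine_metadata", "refine_summary_jsons", "refine_report_csvs",
    "aligned_bam", "collapse_read_stat", "collapse_report_json",
    "sorted_gff", "classification_summary_txt", "classification_report_json",
    "classification_txt", "junctions_txt", "filtered_reasons_txt",
    "filtered_report_json", "filtered_classification_txt",
    "filtered_junctions_txt", "filtered_gff", "gene_saturation_txt",
]
SINGLE_CELL_OUTPUTS = [
    "corrected_bam", "corrected_summary_json", "bcstats_json",
    "bcstats_tsv", "seurat_tar",
]
BULK_OUTPUTS = ["clustered_bam", "cluster_report_csv"]


def expected_outputs(single_cell: bool) -> list:
    """Common outputs plus the outputs specific to the selected pipeline."""
    extra = SINGLE_CELL_OUTPUTS if single_cell else BULK_OUTPUTS
    return COMMON_OUTPUTS + extra


def missing_outputs(outputs: dict, single_cell: bool) -> list:
    """Names of expected outputs that do not appear among the mapping's keys."""
    # Assumption: keys may be prefixed (e.g. "<workflow>.<name>"); keep the last part.
    present = {key.rsplit(".", 1)[-1] for key in outputs}
    return [name for name in expected_outputs(single_cell) if name not in present]
```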

References

Reference datasets are hosted publicly for use in the pipeline.

Containers

Docker image definitions used by this workflow can be found in the wdl-dockerfiles repository. Images are hosted in PacBio’s quay.io. Docker images used in the workflow are pinned to specific versions by referring to their digests rather than tags.

The Docker image used by a particular step of the workflow can be identified by looking at the docker key in the runtime block for the given task.
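
As a rough illustration of locating the docker key and confirming that an image is referenced by digest, the following Python sketch scans WDL text with a regular expression. The task shown is hypothetical (the image name and all-zero digest are placeholders, not images used by this workflow), and a regular expression is only an approximation of real WDL parsing.

```python
import re

# Matches a `docker: "..."` entry as it might appear in a WDL runtime block.
DOCKER_KEY = re.compile(r'docker\s*:\s*"([^"]+)"')

# Placeholder digest and hypothetical task, for illustration only.
FAKE_DIGEST = "0" * 64
EXAMPLE_TASK = (
    "task example {\n"
    "  command <<< echo hello >>>\n"
    "  runtime {\n"
    '    docker: "quay.io/example/image@sha256:' + FAKE_DIGEST + '"\n'
    "  }\n"
    "}\n"
)


def docker_images(wdl_text: str) -> list:
    """Return every image string assigned to a docker key in the given text."""
    return DOCKER_KEY.findall(wdl_text)


def is_digest_pinned(image: str) -> bool:
    """True if the image reference ends in an @sha256:<64 hex chars> digest."""
    return re.search(r"@sha256:[0-9a-f]{64}$", image) is not None


if __name__ == "__main__":
    for image in docker_images(EXAMPLE_TASK):
        print(image, is_digest_pinned(image))
```

A tag-based reference such as quay.io/example/image:latest would be reported as not pinned, since a tag can be moved to a different image while a digest cannot.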
