Tutorial
This tutorial demonstrates how to process SATAY (Saturation Transposon Analysis in Yeast) data using the test dataset included with satay-tools.
Setting Up Your Workspace
For detailed installation instructions, see Installation.
Before running the tutorial, activate the satay-tools environment, cd into satay_tools, and create an output directory for your results:
conda activate satay-tools
cd satay_tools
output_dir=~/test_out # Change this to desired location
mkdir -p $output_dir
Pipeline Overview
The SATAY analysis pipeline consists of four main steps:
align: Map reads to reference genome using STAR aligner
map: Identify transposon insertion sites from aligned reads
merge: Combine insertion counts across samples into count matrices
analyze: Perform statistical analysis to identify fitness-altering mutations
Step 1: Align Reads
Align SATAY sequencing reads to the reference genome:
satay align \
-f tests/test_data/medium_dataset/ \
-o $output_dir \
-g ref/GCF_000146045.2_R64_genomic.fna.gz
Note
This step requires Linux because the STAR aligner does not work on macOS (see Installation). The remaining steps work on either platform.
Parameters
-f, --fastq-dir: Directory containing FASTQ files (can be gzipped). Reads are processed as single-end.
-o, --output-dir: Output directory for BAM files
-g, --genome-fasta: Reference genome in FASTA format. The S. cerevisiae R64 genome (NCBI assembly GCF_000146045.2) is bundled under ref/; supply your own FASTA here to use a different genome.
Optional parameters:
-t, --threads: Number of threads for alignment (default: 4)
-r, --limit-bam-sort-ram: Max RAM in bytes for STAR BAM sorting (default: 2000000000). Increase if STAR reports a BAM-sorting RAM error.
Note
Only single-end reads are supported. On the first run, if a STAR genome index is not already present next to the FASTA, satay-tools builds one automatically (tuned for the small yeast genome, with a 16 GB index-build RAM cap). The index is written alongside the reference and reused on subsequent runs.
Outputs
The align step produces:
{sample}.bam: Merged and sorted BAM files for each sample
Step 2: Map Insertion Sites
Identify transposon insertion sites from aligned BAM files:
satay map \
-b $output_dir \
-o $output_dir \
-s 20190221.A-2_noaF \
-a ref/GCF_000146045.2.genes.gff.gz
Parameters
-b, --bam-dir: Directory containing BAM files from the align step
-o, --output-dir: Output directory for insertion site files
-s, --sample-name: Sample identifier to process. Must be part of the BAM file name
-a, --gff: Annotation file(s) in GFF or BED format. May be given multiple times to count over several interval sets.
Outputs
The map step produces:
{sample}_*.cnts: Per-gene insertion and read counts
{sample}.bed: Filtered alignments in BED format
{sample}.bed.insertions.sorted.merged.filtered: High-confidence insertion sites
process_samples.log: Processing log file
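Because -s selects a single sample, processing every BAM file from the align step means one satay map call per sample. The following is a minimal sketch of such a loop, assuming the BAM files are named {sample}.bam as described in the align outputs and that sample names contain no whitespace. list_samples and map_all_samples are hypothetical helpers written for this tutorial, not part of satay-tools.

```shell
# Hypothetical helper (not part of satay-tools): print one sample name per line,
# assuming the align step wrote {sample}.bam files into the given directory.
list_samples() {
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue   # no BAM files: skip the unexpanded pattern
        basename "$bam" .bam
    done
}

# Run the map step once per sample, using only the flags documented above.
# Assumes sample names contain no whitespace.
map_all_samples() {
    bam_dir=$1
    gff=$2
    for sample in $(list_samples "$bam_dir"); do
        satay map -b "$bam_dir" -o "$bam_dir" -s "$sample" -a "$gff"
    done
}
```

With these defined, map_all_samples "$output_dir" ref/GCF_000146045.2.genes.gff.gz would process each sample in turn. If the directory holds other .bam files, filter the list before looping.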
Step 3: Merge Counts
Combine insertion site data from multiple samples into one count matrix:
satay merge \
-d $output_dir \
-a ref/GCF_000146045.2.genes.gff.gz \
-n test1
Parameters
-d, --counts-dir: Directory containing count files from the map step
-a, --gff: Name of the annotation file that was used for counting
-n, --name: Name prefix for output files
Optional parameters:
--format: Format of the annotation file, gff or bed (default: gff)
Outputs
The merge step produces two count matrices, where {name} is the value passed
to -n (test1 in this example) and {date} is the current date:
{date}_{name}_transposon_counts.csv: Number of unique transposon insertions per gene per sample
{date}_{name}_read_counts.csv: Total read depth per gene per sample
These matrices have genes as rows and samples as columns, suitable for downstream
statistical analysis. Pass one of them (typically the transposon counts) as the
--counts-file for the analyze step.
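Since the {date} prefix changes from run to run, it can be convenient to locate the matrix with a wildcard rather than typing the full name. The sketch below assumes only the file name pattern given above; latest_counts is a hypothetical helper, and the exact format of {date} is deliberately left unspecified.

```shell
# Hypothetical helper (not part of satay-tools): print the most recently
# modified transposon count matrix for a given -n name.
latest_counts() {
    dir=$1
    name=$2
    # The {date} prefix format is not assumed; a wildcard matches any value.
    ls -t "$dir"/*_"$name"_transposon_counts.csv 2>/dev/null | head -n 1
}
```

For the example in this tutorial, counts_file=$(latest_counts "$output_dir" test1) stores the path, which can then be passed to the analyze step. The helper prints nothing if no matching file exists, so check that the variable is non-empty before using it.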
Step 4: Statistical Analysis
Perform differential abundance analysis to identify genes with altered fitness between conditions:
satay analyze \
-f tests/test_data/test_merged_counts.txt \
-s tests/test_data/test-metadata.csv \
-o $output_dir \
-c conc \
-b "0"
Note
This step normally takes a count matrix produced by merge (one of the
*_transposon_counts.csv / *_read_counts.csv files). The tutorial
instead uses the bundled tests/test_data/test_merged_counts.txt and
test-metadata.csv because the tutorial’s own samples are all the same
baseline condition, which cannot support a differential comparison. The
bundled files provide multiple conditions so the analysis runs end to end.
Parameters
-f, --counts-file: Merged count matrix from the merge step
-s, --sample_data: Sample metadata file (CSV format)
-o, --output-dir: Output directory for analysis results
-c, --comp-col: Column name in metadata that defines the contrasts (experimental condition)
-b, --baseline: Baseline/reference value in --comp-col; all other values are compared to it
Optional parameters:
-a, --gff: GFF file used to annotate results with gene names
-l, --filter: Drop genes/intervals with fewer than this many counts across samples (default: 100)
--alpha: FDR cutoff for DESeq2 (default: 0.05)
--sample-id-col: Column in the metadata holding sample IDs; must match the count matrix columns (default: sample_id)
--ids: GFF fields to keep when annotating (default: locus_tag gene)
-t, --threads: Number of CPUs for DESeq2 inference (default: min(8, available cores))
Sample Metadata Format
The metadata file is a CSV with one row per sample. It must contain a sample-ID
column (named sample_id by default, configurable with --sample-id-col)
whose values match the column names of the count matrix, plus the column passed
to --comp-col holding the condition/grouping variable. For example:
sample_id,conc
20190221.A-1_noaF,0
20190221.A-1_4nMaF,4
20190221.A-2_noaF,0
20190221.A-2_4nMaF,4
Here --comp-col conc defines the contrasts and --baseline 0 sets the
reference level, so each non-zero concentration is compared against 0.
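A quick sanity check before running analyze is to confirm that the comparison column exists and that the baseline value actually occurs in it. The sketch below is one way to do that with awk; check_metadata is a hypothetical helper, and it assumes a simple CSV with a header row and no quoted fields containing commas.

```shell
# Hypothetical helper (not part of satay-tools): verify that a metadata CSV
# has the given column and that the baseline value appears in that column.
# Assumes a header row and no quoted fields containing commas.
check_metadata() {
    meta=$1
    col=$2
    baseline=$3
    awk -F, -v col="$col" -v base="$baseline" '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { status = 2; exit }        # column missing: report in END
            next
        }
        ($c "") == (base "") { found = 1 }      # compare as strings, not numbers
        END {
            if (status == 2) { print "column not found: " col; exit 2 }
            if (!found)      { print "baseline not found: " base; exit 1 }
            print "ok"
        }
    ' "$meta"
}
```

For the tutorial files, check_metadata tests/test_data/test-metadata.csv conc 0 prints ok when both conditions hold and otherwise returns a non-zero exit status with a short message.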
Outputs
The analyze step produces:
Differential abundance results table
Key columns in the results table:
log2FoldChange: Effect size (positive = enriched, negative = depleted)
padj: Adjusted p-value (FDR-corrected)
baseMean: Average normalized count across samples
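To pull out the significant genes, the table can be filtered on padj. The sketch below is illustrative only: it assumes the results are written as a comma-separated file with a header row containing a padj column, which this tutorial does not confirm, and significant_hits is a hypothetical helper. Rows whose padj is missing or non-numeric (for example NA) are skipped.

```shell
# Hypothetical helper (not part of satay-tools): print the header plus every
# row whose padj is below a cutoff (default 0.05, mirroring --alpha).
# Assumes a comma-separated table with a header row that includes "padj".
significant_hits() {
    table=$1
    alpha=${2:-0.05}
    awk -F, -v alpha="$alpha" '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "padj") p = i
            print
            next
        }
        # Keep rows with a numeric padj below the cutoff; NA/NaN/empty are skipped.
        p && $p ~ /^[0-9.eE+-]+$/ && $p + 0 < alpha + 0
    ' "$table"
}
```

The sign of log2FoldChange in the surviving rows then indicates whether a gene is enriched or depleted relative to the baseline.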