Tutorial

This tutorial demonstrates how to process SATAY (Saturation Transposon Analysis in Yeast) data using the test dataset included with satay-tools.

Setting Up Your Workspace

For detailed installation instructions, see Installation.

Before running the tutorial, activate the satay-tools environment, cd into satay_tools, and create an output directory for your results:

conda activate satay-tools
cd satay_tools
output_dir=~/test_out # Change this to desired location
mkdir -p $output_dir
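The setup above can be verified before starting. The following is a minimal sketch using only standard shell utilities; the function name check_workspace is hypothetical and not part of satay-tools. It warns if the satay command is not on PATH and confirms that the output directory exists and is writable.

```shell
# Workspace sanity check (a sketch, not part of satay-tools).
output_dir="${output_dir:-$HOME/test_out}"   # same default as the tutorial

check_workspace() {
  local dir="$1" status=0
  # satay should be on PATH once the conda environment is active
  if ! command -v satay >/dev/null 2>&1; then
    echo "warning: 'satay' not found on PATH; is the satay-tools environment active?" >&2
    status=1
  fi
  # the output directory must exist (it is created if missing) and be writable
  if ! mkdir -p "$dir" || [ ! -w "$dir" ]; then
    echo "error: cannot write to $dir" >&2
    status=1
  fi
  return "$status"
}

check_workspace "$output_dir" || echo "workspace check reported problems" >&2
```

A non-zero return value means at least one of the two checks failed; the messages on stderr say which.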

Pipeline Overview

The SATAY analysis pipeline consists of four main steps:

align: Map reads to reference genome using STAR aligner
map: Identify transposon insertion sites from aligned reads
merge: Combine insertion counts across samples into count matrices
analyze: Perform statistical analysis to identify fitness-altering mutations
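The four steps can be chained in a single script. The sketch below repeats the exact commands used later in this tutorial and wraps them in a hypothetical run helper (not part of satay-tools) that only prints each command unless DRY_RUN=0 is set, so the sequence can be reviewed before anything is executed.

```shell
# Dry-run wrapper: prints each command by default; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"
output_dir="${output_dir:-$HOME/test_out}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: align reads to the reference genome
run satay align -f tests/test_data/medium_dataset/ -o "$output_dir" \
  -g ref/GCF_000146045.2_R64_genomic.fna.gz
# Step 2: map insertion sites for one sample
run satay map -b "$output_dir" -o "$output_dir" -s 20190221.A-2_noaF \
  -a ref/GCF_000146045.2.genes.gff.gz
# Step 3: merge per-sample counts into matrices
run satay merge -d "$output_dir" -a ref/GCF_000146045.2.genes.gff.gz -n test1
# Step 4: statistical analysis on the bundled test matrix and metadata
run satay analyze -f tests/test_data/test_merged_counts.txt \
  -s tests/test_data/test-metadata.csv -o "$output_dir" -c conc -b 0
```

Each step is described in detail in the sections that follow.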

Step 1: Align Reads

Align SATAY sequencing reads to the reference genome:

satay align \
  -f tests/test_data/medium_dataset/ \
  -o $output_dir \
  -g ref/GCF_000146045.2_R64_genomic.fna.gz

Note

This step requires Linux — the STAR aligner does not work on macOS (see Installation). The remaining steps work on either platform.

Parameters

-f, --fastq-dir: Directory containing FASTQ files (can be gzipped). Reads are processed as single-end.
-o, --output-dir: Output directory for BAM files
-g, --genome-fasta: Reference genome in FASTA format. The S. cerevisiae R64 genome (NCBI assembly GCF_000146045.2) is bundled under ref/; supply your own FASTA here to use a different genome.

Optional parameters:

-t, --threads: Number of threads for alignment (default: 4)
-r, --limit-bam-sort-ram: Max RAM in bytes for STAR BAM sorting (default: 2000000000). Increase if STAR reports a BAM-sorting RAM error.
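Because -r takes a value in bytes, it can be easier to compute it from a budget in gigabytes. The sketch below assumes decimal gigabytes (10^9 bytes), which matches the form of the documented default of 2000000000; the 4 GB budget is only an illustrative value. It prints the resulting command rather than running it.

```shell
# Convert a RAM budget in gigabytes to the byte count expected by -r.
ram_gb=4                                       # illustrative budget; adjust to your machine
ram_bytes=$(( ram_gb * 1000 * 1000 * 1000 ))   # decimal GB, like the default 2000000000

# Use all available cores when nproc exists, else the documented default of 4.
threads=$(nproc 2>/dev/null || echo 4)

# Print the align command with the optional flags filled in.
printf 'satay align -f %s -o %s -g %s -t %s -r %s\n' \
  tests/test_data/medium_dataset/ "${output_dir:-$HOME/test_out}" \
  ref/GCF_000146045.2_R64_genomic.fna.gz "$threads" "$ram_bytes"
```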

Note

Only single-end reads are supported. On the first run, if a STAR genome index is not already present next to the FASTA, satay-tools builds one automatically (tuned for the small yeast genome, with a 16 GB index-build RAM cap). The index is written alongside the reference and reused on subsequent runs.

Outputs

The align step produces:

{sample}.bam: Merged, sorted BAM file (one per sample)
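Before moving on, it can help to confirm that align actually wrote a BAM file for each sample. The following is a minimal sketch using only standard shell tools; check_bams is a hypothetical helper, and it only checks that the files exist and are non-empty, not that they are valid BAM.

```shell
# List BAM files in a directory and flag empty ones.
# Returns non-zero if no BAM files are found or any of them is empty.
check_bams() {
  local dir="$1" bam found=0 bad=0
  for bam in "$dir"/*.bam; do
    [ -e "$bam" ] || continue          # the glob matched nothing
    found=$((found + 1))
    if [ -s "$bam" ]; then
      echo "ok     $(basename "$bam")"
    else
      echo "empty  $(basename "$bam")"
      bad=$((bad + 1))
    fi
  done
  echo "$found BAM file(s), $bad empty"
  [ "$found" -gt 0 ] && [ "$bad" -eq 0 ]
}

check_bams "${output_dir:-$HOME/test_out}" || echo "align output looks incomplete" >&2
```

If samtools happens to be installed, samtools quickcheck offers a stricter integrity check of the same files.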

Step 2: Map Insertion Sites

Identify transposon insertion sites from aligned BAM files:

satay map \
  -b $output_dir \
  -o $output_dir \
  -s 20190221.A-2_noaF \
  -a ref/GCF_000146045.2.genes.gff.gz

Parameters

-b, --bam-dir: Directory containing BAM files from align step
-o, --output-dir: Output directory for insertion site files
-s, --sample-name: Sample identifier to process. Must be part of the BAM file name.
-a, --gff: Annotation file(s) in GFF or BED format. May be given multiple times to count over several interval sets.
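The command above processes a single sample, while the later merge step combines several. A minimal sketch of generating one map command per sample follows. It assumes every BAM in the output directory is named {sample}.bam, as listed in the align outputs; sample_names is a hypothetical helper, not part of satay-tools. The commands are printed rather than executed.

```shell
# Print the sample names implied by the BAM files in a directory
# (file name with the .bam suffix removed).
sample_names() {
  local dir="$1" bam
  for bam in "$dir"/*.bam; do
    [ -e "$bam" ] || continue          # the glob matched nothing
    basename "$bam" .bam
  done
}

output_dir="${output_dir:-$HOME/test_out}"

# Print one map command per sample; remove 'echo' to run them.
sample_names "$output_dir" | while IFS= read -r sample; do
  echo satay map -b "$output_dir" -o "$output_dir" -s "$sample" \
    -a ref/GCF_000146045.2.genes.gff.gz
done
```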

Outputs

The map step produces:

{sample}_*.cnts: Per-gene insertion and read counts
{sample}.bed: Filtered alignments in BED format
{sample}.bed.insertions.sorted.merged.filtered: High-confidence insertion sites
process_samples.log: Processing log file

Step 3: Merge Counts

Combine insertion site data from multiple samples into one count matrix:

satay merge \
  -d $output_dir \
  -a ref/GCF_000146045.2.genes.gff.gz \
  -n test1

Parameters

-d, --counts-dir: Directory containing count files from map step
-a, --gff: Name of the annotation file that was used for counting
-n, --name: Name prefix for output files

Optional parameters:

--format: Format of the annotation file, gff or bed (default: gff)

Outputs

The merge step produces two count matrices, where {name} is the value passed to -n (test1 in this example) and {date} is the current date:

{date}_{name}_transposon_counts.csv: Number of unique transposon insertions per gene per sample
{date}_{name}_read_counts.csv: Total read depth per gene per sample

These matrices have genes as rows and samples as columns, suitable for downstream statistical analysis. Pass one of them (typically the transposon counts) as the --counts-file for the analyze step.
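A quick way to inspect a merged matrix is to count its rows and columns. The sketch below assumes a comma-separated file with one header row and gene identifiers in the first column. It also assumes the matrices are found in a directory held in matrix_dir; this tutorial does not state where merge writes them, so that variable is a placeholder to adjust. matrix_shape is a hypothetical helper.

```shell
# Report the number of gene rows and sample columns in a merged count matrix.
# Assumes comma-separated values, one header row, gene IDs in column 1.
matrix_shape() {
  awk -F',' '
    NR == 1 { cols = NF - 1 }
    END     { printf "%d genes x %d samples\n", NR - 1, cols }' "$1"
}

# Placeholder location: adjust to wherever merge wrote its output.
matrix_dir="${matrix_dir:-${output_dir:-$HOME/test_out}}"

# The {date} prefix varies, so match it with a glob.
for f in "$matrix_dir"/*_test1_transposon_counts.csv; do
  [ -e "$f" ] || continue
  echo "$(basename "$f"): $(matrix_shape "$f")"
done
```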

Step 4: Statistical Analysis

Perform differential abundance analysis to identify genes with altered fitness between conditions:

satay analyze \
  -f tests/test_data/test_merged_counts.txt \
  -s tests/test_data/test-metadata.csv \
  -o $output_dir \
  -c conc \
  -b "0"

Note

This step normally takes a count matrix produced by merge (one of the *_transposon_counts.csv / *_read_counts.csv files). The tutorial instead uses the bundled tests/test_data/test_merged_counts.txt and test-metadata.csv because the tutorial’s own samples all come from the same baseline condition, which cannot support a differential comparison. The bundled files provide multiple conditions so the analysis runs end to end.

Parameters

-f, --counts-file: Merged count matrix from merge step
-s, --sample_data: Sample metadata file (CSV format)
-o, --output-dir: Output directory for analysis results
-c, --comp-col: Column name in metadata that defines the contrasts (experimental condition)
-b, --baseline: Baseline/reference value in --comp-col; all other values are compared to it

Optional parameters:

-a, --gff: GFF file used to annotate results with gene names
-l, --filter: Drop genes/intervals with fewer than this many counts across samples (default: 100)
--alpha: FDR cutoff for DESeq2 (default: 0.05)
--sample-id-col: Column in the metadata holding sample IDs; must match the count matrix columns (default: sample_id)
--ids: GFF fields to keep when annotating (default: locus_tag gene)
-t, --threads: Number of CPUs for DESeq2 inference (default: min(8, available cores))
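To get a feel for the effect of -l / --filter before running the analysis, the number of genes below a threshold can be previewed. This sketch assumes a comma-separated matrix with a header row and gene IDs in the first column, and it interprets "counts across samples" as the row sum, which is an assumption rather than confirmed behaviour of analyze. count_below is a hypothetical helper, and the demo matrix contains made-up values.

```shell
# Count how many genes have a row total below a threshold.
# Assumption: the filter applies to the sum of counts over all sample columns.
count_below() {
  awk -F',' -v min="$2" '
    NR > 1 {
      total = 0
      for (i = 2; i <= NF; i++) total += $i
      if (total < min) low++
    }
    END { printf "%d of %d genes below %d\n", low, NR - 1, min }' "$1"
}

# Demo on a made-up three-gene matrix (row totals 110, 15 and 99).
demo=$(mktemp)
cat > "$demo" <<'EOF'
gene,sampleA,sampleB
GENE1,60,50
GENE2,10,5
GENE3,0,99
EOF
count_below "$demo" 100
rm -f "$demo"
```

On a real matrix, call count_below with the path to the merged counts file and the threshold you intend to pass to -l.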

Sample Metadata Format

The metadata file is a CSV with one row per sample. It must contain a sample-ID column (named sample_id by default, configurable with --sample-id-col) whose values match the column names of the count matrix, plus the column passed to --comp-col holding the condition/grouping variable. For example:

sample_id,conc
20190221.A-1_noaF,0
20190221.A-1_4nMaF,4
20190221.A-2_noaF,0
20190221.A-2_4nMaF,4

Here --comp-col conc defines the contrasts and --baseline 0 sets the reference level, so each non-zero concentration is compared against 0.
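A metadata file can be checked against these requirements before running analyze. The following sketch assumes a plain comma-separated file without quoted fields; check_metadata is a hypothetical helper, not part of satay-tools. It confirms that the comparison column exists, lists how many samples fall under each value, and reports whether the chosen baseline occurs at all.

```shell
# Check that a metadata CSV has the comparison column and contains the baseline.
# Exit status: 0 = ok, 1 = baseline value absent, 2 = column missing.
check_metadata() {
  local file="$1" col="$2" baseline="$3"
  awk -F',' -v col="$col" -v base="$baseline" '
    { sub(/\r$/, "") }                     # tolerate Windows line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      if (!c) { print "missing column: " col; exit 2 }
      next
    }
    NF { n[$c]++ }                         # count samples per condition value
    END {
      if (!c) exit 2
      for (v in n) print col "=" v ": " n[v] " sample(s)"
      if (!(base in n)) { print "baseline " base " not found in " col; exit 1 }
    }' "$file"
}

# Demo using the example metadata shown above.
meta=$(mktemp)
cat > "$meta" <<'EOF'
sample_id,conc
20190221.A-1_noaF,0
20190221.A-1_4nMaF,4
20190221.A-2_noaF,0
20190221.A-2_4nMaF,4
EOF
check_metadata "$meta" conc 0 && echo "metadata looks usable"
rm -f "$meta"
```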

Outputs

The analyze step produces:

Differential abundance results table

Key columns in the results table:

log2FoldChange: Effect size as a log2 fold change relative to the baseline (positive = enriched, negative = depleted)
padj: Adjusted p-value (FDR-corrected)
baseMean: Average normalized count across samples
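The key columns can be used to pull out significant genes from the results table. The sketch below assumes the table is a comma-separated file without quoted fields whose header names the padj and log2FoldChange columns; the file name and exact layout written by analyze are not stated in this tutorial, so treat both as assumptions. significant_hits is a hypothetical helper, and the demo rows are made-up values.

```shell
# Print the header plus rows with padj below alpha,
# sorted so the most depleted genes (most negative log2FoldChange) come first.
significant_hits() {
  local file="$1" alpha="${2:-0.05}"
  head -n 1 "$file"
  awk -F',' -v alpha="$alpha" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "padj") p = i
        if ($i == "log2FoldChange") l = i
      }
      next
    }
    # skip NA or empty padj values; prefix each row with its sort key
    p && l && $p ~ /^[0-9.eE+-]+$/ && ($p + 0) < (alpha + 0) { print $l "," $0 }
  ' "$file" | sort -t',' -k1,1g | cut -d',' -f2-
}

# Demo on a made-up results table.
demo=$(mktemp)
cat > "$demo" <<'EOF'
gene,baseMean,log2FoldChange,padj
GENE1,100,1.5,0.01
GENE2,200,-2.0,0.001
GENE3,50,0.1,0.8
GENE4,10,-3,NA
EOF
significant_hits "$demo" 0.05
rm -f "$demo"
```

A log2FoldChange of -2.0 corresponds to a four-fold depletion relative to the baseline, and +1.0 to a two-fold enrichment.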