Tutorial
========

This tutorial demonstrates how to process SATAY (Saturation Transposon Analysis in Yeast) 
data using the test dataset included with satay-tools.


Setting Up Your Workspace
-------------------------

For detailed installation instructions, see :doc:`../installation`.

Before running the tutorial, activate the ``satay-tools`` environment, ``cd`` into the ``satay_tools`` directory, and create an output directory for your results:

.. code-block:: bash

   conda activate satay-tools
   cd satay_tools
   output_dir=~/test_out  # Change this to your preferred location
   mkdir -p "$output_dir"


Pipeline Overview
-----------------

The SATAY analysis pipeline consists of four main steps:

1. **align**: Map reads to reference genome using STAR aligner
2. **map**: Identify transposon insertion sites from aligned reads
3. **merge**: Combine insertion counts across samples into count matrices
4. **analyze**: Perform statistical analysis to identify fitness-altering mutations
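
The four steps can be chained in a small shell script. The following is a minimal sketch, not part of satay-tools: it reuses the paths and flags shown in this tutorial, and the ``run`` helper and ``DRY_RUN`` variable are hypothetical names introduced here. By default it only prints each command; set ``DRY_RUN=0`` to execute them.

.. code-block:: bash

   set -eu  # stop at the first failing step or unset variable

   # Hypothetical helper: print the command in dry-run mode, otherwise run it.
   run() {
     if [ "${DRY_RUN:-1}" = "1" ]; then
       printf '+ %s\n' "$*"
     else
       "$@"
     fi
   }

   output_dir="${output_dir:-$HOME/test_out}"
   genome=ref/GCF_000146045.2_R64_genomic.fna.gz
   annotation=ref/GCF_000146045.2.genes.gff.gz

   run satay align -f tests/test_data/medium_dataset/ -o "$output_dir" -g "$genome"
   run satay map -b "$output_dir" -o "$output_dir" -s 20190221.A-2_noaF -a "$annotation"
   run satay merge -d "$output_dir" -a "$annotation" -n test1
   # The tutorial runs analyze on bundled test files rather than on the merge output.
   run satay analyze -f tests/test_data/test_merged_counts.txt \
     -s tests/test_data/test-metadata.csv -o "$output_dir" -c conc -b "0"

Previewing the commands first makes it easy to check paths before committing to the comparatively slow alignment step.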

Step 1: Align Reads
-------------------

Align SATAY sequencing reads to the reference genome:

.. code-block:: bash

   satay align \
     -f tests/test_data/medium_dataset/ \
     -o $output_dir \
     -g ref/GCF_000146045.2_R64_genomic.fna.gz

.. note::

   This step requires **Linux** — the STAR aligner does not work on macOS
   (see :doc:`../installation`). The remaining steps work on either platform.

Parameters
^^^^^^^^^^

* ``-f, --fastq-dir``: Directory containing FASTQ files (can be gzipped). Reads are processed as single-end.
* ``-o, --output-dir``: Output directory for BAM files
* ``-g, --genome-fasta``: Reference genome in FASTA format. The *S. cerevisiae* R64 genome (NCBI assembly ``GCF_000146045.2``) is bundled under ``ref/``; supply your own FASTA here to use a different genome.

Optional parameters:

* ``-t, --threads``: Number of threads for alignment (default: 4)
* ``-r, --limit-bam-sort-ram``: Max RAM in bytes for STAR BAM sorting (default: 2000000000). Increase if STAR reports a BAM-sorting RAM error.

.. note::

   Only single-end reads are supported. On the first run, if a STAR genome
   index is not already present next to the FASTA, satay-tools builds one
   automatically (tuned for the small yeast genome, with a 16 GB index-build
   RAM cap). The index is written alongside the reference and reused on
   subsequent runs.
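
Because ``-r`` is given in bytes, it can help to compute the value rather than type it out. The sketch below is only an illustration: the 8 threads and 4 GB sort buffer are assumed values, not recommendations, and the ``command -v`` guard simply skips the alignment when ``satay`` is not on the ``PATH``.

.. code-block:: bash

   threads=8                                            # assumed value
   sort_ram_gb=4                                        # assumed value
   sort_ram_bytes=$((sort_ram_gb * 1000 * 1000 * 1000)) # -r expects bytes

   echo "threads=$threads limit-bam-sort-ram=$sort_ram_bytes"

   # Only attempt the alignment when satay is installed and active.
   if command -v satay >/dev/null 2>&1; then
     satay align \
       -f tests/test_data/medium_dataset/ \
       -o "${output_dir:-$HOME/test_out}" \
       -g ref/GCF_000146045.2_R64_genomic.fna.gz \
       -t "$threads" \
       -r "$sort_ram_bytes"
   fi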

Outputs
^^^^^^^

The align step produces:

* ``{sample}.bam``: One merged, sorted BAM file per sample


Step 2: Map Insertion Sites
---------------------------

Identify transposon insertion sites from aligned BAM files:

.. code-block:: bash

   satay map \
     -b $output_dir \
     -o $output_dir \
     -s 20190221.A-2_noaF \
     -a ref/GCF_000146045.2.genes.gff.gz

Parameters
^^^^^^^^^^

* ``-b, --bam-dir``: Directory containing BAM files from align step
* ``-o, --output-dir``: Output directory for insertion site files
* ``-s, --sample-name``: Sample identifier to process; it must match part of the BAM file name
* ``-a, --gff``: Annotation file(s) in GFF or BED format. May be given multiple times to count over several interval sets.

Outputs
^^^^^^^

The map step produces:

* ``{sample}_*.cnts``: Per-gene insertion and read counts
* ``{sample}.bed``: Filtered alignments in BED format
* ``{sample}.bed.insertions.sorted.merged.filtered``: High-confidence insertion sites
* ``process_samples.log``: Processing log file
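
The command above handles one sample, whereas the merge step expects counts for every sample. One way to cover them all is to derive each sample name from its BAM file name. This is a sketch under two assumptions: the BAM directory holds only the per-sample ``{sample}.bam`` files written by ``align``, and ``map_all_samples`` is a hypothetical helper name. It prints the commands; remove the ``echo`` to run them.

.. code-block:: bash

   # Print one `satay map` command per BAM file in a directory.
   # $1 = directory with BAM files, $2 = annotation file
   map_all_samples() {
     bam_dir=$1
     annotation=$2
     for bam in "$bam_dir"/*.bam; do
       [ -e "$bam" ] || continue          # no BAM files: the glob stays literal
       sample=$(basename "$bam" .bam)     # {sample}.bam -> {sample}
       echo satay map -b "$bam_dir" -o "$bam_dir" -s "$sample" -a "$annotation"
     done
   }

   map_all_samples "${output_dir:-$HOME/test_out}" ref/GCF_000146045.2.genes.gff.gz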

Step 3: Merge Counts
--------------------

Combine insertion site data from multiple samples into one count matrix:

.. code-block:: bash

   satay merge \
     -d $output_dir \
     -a ref/GCF_000146045.2.genes.gff.gz \
     -n test1

Parameters
^^^^^^^^^^

* ``-d, --counts-dir``: Directory containing count files from map step
* ``-a, --gff``: Name of the annotation file that was used for counting
* ``-n, --name``: Name prefix for output files

Optional parameters:

* ``--format``: Format of the annotation file, ``gff`` or ``bed`` (default: ``gff``)

Outputs
^^^^^^^

The merge step produces two count matrices, where ``{name}`` is the value passed
to ``-n`` (``test1`` in this example) and ``{date}`` is the current date:

* ``{date}_{name}_transposon_counts.csv``: Number of unique transposon insertions per gene per sample
* ``{date}_{name}_read_counts.csv``: Total read depth per gene per sample

These matrices have genes as rows and samples as columns, suitable for downstream
statistical analysis. Pass one of them (typically the transposon counts) as the
``--counts-file`` for the ``analyze`` step.
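
A quick shape check can confirm that the matrix contains the expected number of genes and samples before it is passed on. The sketch below assumes a comma-separated file with one header row and the gene identifier in the first column; that layout is an assumption here, so adjust the separator if your file differs. ``matrix_shape`` is a hypothetical helper name.

.. code-block:: bash

   # Report the number of gene rows and sample columns in a count matrix.
   matrix_shape() {
     awk -F',' '
       NR == 1 { cols = NF - 1 }   # header: every column except the gene ID
       END     { printf "%d genes x %d samples\n", NR - 1, cols }
     ' "$1"
   }

   # Apply it to any transposon count matrix written with -n test1.
   for f in "${output_dir:-$HOME/test_out}"/*_test1_transposon_counts.csv; do
     if [ -e "$f" ]; then
       matrix_shape "$f"
     fi
   done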

Step 4: Statistical Analysis
----------------------------

Perform differential abundance analysis to identify genes with altered fitness between conditions:

.. code-block:: bash

   satay analyze \
     -f tests/test_data/test_merged_counts.txt \
     -s tests/test_data/test-metadata.csv \
     -o $output_dir \
     -c conc \
     -b "0"

.. note::

   This step normally takes a count matrix produced by ``merge`` (one of the
   ``*_transposon_counts.csv`` / ``*_read_counts.csv`` files). The tutorial
   instead uses the bundled ``tests/test_data/test_merged_counts.txt`` and
   ``test-metadata.csv`` because the tutorial's own samples all share the same
   baseline condition, which cannot support a differential comparison. The
   bundled files provide multiple conditions so the analysis runs end to end.

Parameters
^^^^^^^^^^

* ``-f, --counts-file``: Merged count matrix from merge step
* ``-s, --sample_data``: Sample metadata file (CSV format)
* ``-o, --output-dir``: Output directory for analysis results
* ``-c, --comp-col``: Column name in metadata that defines the contrasts (experimental condition)
* ``-b, --baseline``: Baseline/reference value in ``--comp-col``; all other values are compared to it

Optional parameters:

* ``-a, --gff``: GFF file used to annotate results with gene names
* ``-l, --filter``: Drop genes/intervals with fewer than this many counts across samples (default: 100)
* ``--alpha``: FDR cutoff for DESeq2 (default: 0.05)
* ``--sample-id-col``: Column in the metadata holding sample IDs; must match the count matrix columns (default: ``sample_id``)
* ``--ids``: GFF fields to keep when annotating (default: ``locus_tag gene``)
* ``-t, --threads``: Number of CPUs for DESeq2 inference (default: min(8, available cores))

Sample Metadata Format
^^^^^^^^^^^^^^^^^^^^^^

The metadata file is a CSV with one row per sample. It must contain a sample-ID
column (named ``sample_id`` by default, configurable with ``--sample-id-col``)
whose values match the column names of the count matrix, plus the column passed
to ``--comp-col`` holding the condition/grouping variable. For example:

.. code-block:: text

   sample_id,conc
   20190221.A-1_noaF,0
   20190221.A-1_4nMaF,4
   20190221.A-2_noaF,0
   20190221.A-2_4nMaF,4

Here ``--comp-col conc`` defines the contrasts and ``--baseline 0`` sets the
reference level, so each non-zero concentration is compared against ``0``.
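
Since the sample IDs must match the count matrix columns, it can be worth comparing the two files before running ``analyze``. The following is a rough sketch, not a satay-tools feature. It assumes both files are comma-separated and that the first column of the matrix holds the gene identifier; ``check_sample_ids`` is a hypothetical helper name.

.. code-block:: bash

   # Compare sample IDs in the count matrix header with the metadata column.
   # $1 = count matrix, $2 = metadata CSV, $3 = ID column (default: sample_id)
   check_sample_ids() {
     counts=$1
     metadata=$2
     id_col=${3:-sample_id}
     tmp_dir=$(mktemp -d)

     # IDs from the matrix header, skipping the leading gene-ID column.
     head -n 1 "$counts" | tr -d '\r' | tr ',' '\n' | tail -n +2 | sort > "$tmp_dir/matrix"

     # IDs from the metadata column whose header equals id_col.
     tr -d '\r' < "$metadata" | awk -F',' -v col="$id_col" '
       NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
       c       { print $c }
     ' | sort > "$tmp_dir/meta"

     # comm -3 keeps lines unique to one input; empty output means a full match.
     mismatches=$(comm -3 "$tmp_dir/matrix" "$tmp_dir/meta")
     rm -r "$tmp_dir"

     if [ -n "$mismatches" ]; then
       echo "Sample IDs present in only one file:" >&2
       printf '%s\n' "$mismatches" >&2
       return 1
     fi
     echo "Sample IDs match"
   }

Called as ``check_sample_ids my_counts.csv my_metadata.csv`` (both file names are placeholders), it prints ``Sample IDs match`` or lists the IDs found in only one of the files and returns a non-zero status.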

Outputs
^^^^^^^

The analyze step produces:

* Differential abundance results table

Key columns in the results table:

* ``log2FoldChange``: Effect size (positive = enriched, negative = depleted)
* ``padj``: Adjusted p-value (FDR-corrected)
* ``baseMean``: Average normalized count across samples
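
To pull out significant hits, the table can be filtered on ``padj`` and ordered by ``log2FoldChange``. The sketch below assumes the results table is a comma-separated file with a header row; the file format is not specified above, so treat that as an assumption. Columns are located by name, and ``significant_hits`` is a hypothetical helper name.

.. code-block:: bash

   # Print rows with padj below a cutoff, most depleted genes first.
   # $1 = results table, $2 = cutoff (default: 0.05)
   significant_hits() {
     results=$1
     alpha=${2:-0.05}
     head -n 1 "$results"                 # keep the header row
     awk -F',' -v alpha="$alpha" '
       NR == 1 {
         for (i = 1; i <= NF; i++) {
           if ($i == "padj") p = i
           if ($i == "log2FoldChange") l = i
         }
         next
       }
       # Skip empty or non-numeric padj values (for example NA or NaN).
       p && l && $p ~ /^[0-9.eE+-]+$/ && $p + 0 < alpha + 0 { print $l "\t" $0 }
     ' "$results" | LC_ALL=C sort -g -k1,1 | cut -f2-
   }

The fold change is temporarily prepended as a sort key and removed again by ``cut``, so the printed rows are unchanged copies of the original lines. ``sort -g`` is available in GNU and BSD ``sort``.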