Getting Started Tutorial
========================

This brief tutorial walks through a basic pipeline for *k*-Seq experiment data analysis using the
``k-seq`` package. The pipeline includes the following steps:

1. `Data preprocessing`_
2. `Sequence quantification`_
3. `Kinetic model fitting`_

.. Some `additional analysis`_ tools are also included to help design your experiments and perform additional
.. characterization of the data.

.. seealso::

    :doc:`installation` for setting up the environment and installing the ``k-seq`` package.

    :doc:`k-seq package <k_seq>` for the complete API documentation of the package.

Data preprocessing
******************

Sequencing data (FASTQ files) are first processed in the following steps:

1. trim/join paired-end reads and deduplicate the joined reads into sequence counts
2. compose multiple count files into a count table
3. filter sequences or samples for downstream analysis

Join paired-end reads and count unique sequences using ``EasyDIVER``
--------------------------------------------------------------------

We integrated the `EasyDIVER <https://github.com/ichen-lab-ucsb/EasyDIVER>`_ pipeline in the script
``fastq-to-counts.sh``, where paired-end reads are joined using
`PANDASeq <https://github.com/neufeld/pandaseq>`_, reads from the same sample are pooled, and the joined
reads are further deduplicated into counts (number of reads) for each unique sequence.

``fastq-to-counts.sh`` requires that all FASTQ files be organized in a folder with file names formatted as
``[sample-name]_L[lane number]_R[1/2]_001.fastq.gz``. ``sample-name`` is the string identifier for an
experimental sample and ``lane number`` indicates the lane of the flow cell. ``R[1/2]`` indicates the
direction of the paired-end reads (``R1`` for forward reads and ``R2`` for reverse reads); each sample-lane
combination is expected to have matching files in both directions.

Here is an example input folder containing paired-end reads from two samples, ``R4A-0A_S7`` and
``R4B-0A_S21``, each with 4 lanes::

    R4A-0A_S7_L001_R1_001.fastq.gz    R4B-0A_S21_L001_R1_001.fastq.gz
    R4A-0A_S7_L001_R2_001.fastq.gz    R4B-0A_S21_L001_R2_001.fastq.gz
    R4A-0A_S7_L002_R1_001.fastq.gz    R4B-0A_S21_L002_R1_001.fastq.gz
    R4A-0A_S7_L002_R2_001.fastq.gz    R4B-0A_S21_L002_R2_001.fastq.gz
    R4A-0A_S7_L003_R1_001.fastq.gz    R4B-0A_S21_L003_R1_001.fastq.gz
    R4A-0A_S7_L003_R2_001.fastq.gz    R4B-0A_S21_L003_R2_001.fastq.gz
    R4A-0A_S7_L004_R1_001.fastq.gz    R4B-0A_S21_L004_R1_001.fastq.gz
    R4A-0A_S7_L004_R2_001.fastq.gz    R4B-0A_S21_L004_R2_001.fastq.gz
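Before running the pipeline, it can be worth verifying that the input folder follows this naming
convention. Below is a minimal sketch in Python; the regular expression and the folder path are
illustrative and not part of the ``k-seq`` package:

.. code-block:: python
    :linenos:

    import re
    from pathlib import Path

    # expected pattern: [sample-name]_L[lane number]_R[1/2]_001.fastq.gz
    PATTERN = re.compile(r'^(?P<sample>.+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$')

    for path in sorted(Path('/path/to/input/folder').glob('*.fastq.gz')):
        match = PATTERN.match(path.name)
        if match is None:
            print(f'Unexpected file name: {path.name}')
        else:
            print(match.group('sample'), match.group('lane'), match.group('read'))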
To run the ``EasyDIVER`` pipeline:

.. code-block:: shell
    :linenos:

    # -p/-q: forward/reverse adapter sequences for trimming
    # -c:    use complete matching in joining and discard reads with errors
    # -a:    join the paired-end reads first before trimming the adapters,
    #        for reads with heavily overlapping regions
    # -T:    number of threads to run in parallel for PANDASeq
    > fastq-to-counts.sh \
        -i /path/to/input/folder \
        -o /path/to/output/folder \
        -p CTACGAATTC \
        -q CTGCAGTGAA \
        -c \
        -a \
        -T 12

    > ls /path/to/output/folder
    counts    fastas    fastqs    histos    log.txt

The output folder contains the following files:

- ``counts``: deduplicated count files named ``[sample-name]_counts.txt``
- ``fastqs``: joined FASTQ files named ``[sample-name].joined.fastq``
- ``fastas``: joined FASTA files named ``[sample-name].joined.fasta``
- ``histos``: files with length histogram statistics of the joined reads, named
  ``[sample-name]_counts_histo.txt``

We will use the count files for each sample under the ``counts`` folder, which have the following format::

    number of unique sequences = 981824
    total number of molecules = 10096876

    GGGGGGGGATTCATGACTATT    1
    GGGGGGGAGTAGGACTGCAAA    1
    GGGGGGGAAGACTCCGGAACG    1
    GGGGGGGAACGCATTTCACGG    1
    GGGGGGGACGTTCACCGGCAA    1
    ...

Load count files
----------------

We next use the ``k-seq`` data class :class:`k_seq.data.SeqData` to load and pool count files into a
Python object with a merged count table for further analysis.

In this example, we use :meth:`SeqData.from_count_files <k_seq.data.SeqData.from_count_files>` to load all
count files provided in the ``k-seq`` repository (the ``k-seq/data/counts`` folder). We filter files in
the folder to load only count files whose names contain ``test-d``, so that the two RNA quantification
data files (``total-rna.csv`` and ``spike-in-rna.csv``) are not included. The independent variable (x,
initial BYO concentration) is provided in the file ``sample-substrate-concentration.csv`` and can be
passed to the dataset using the ``x_values`` argument. See
:meth:`SeqData.from_count_files <k_seq.data.SeqData.from_count_files>` for the complete usage.

.. tip::

    Files are usually named with information relevant to the sample, such as experimental condition,
    replicate number, sample name, etc. The loading function
    :meth:`SeqData.from_count_files <k_seq.data.SeqData.from_count_files>` can automatically extract
    these metadata from file names using the argument ``name_template``. The example count files are
    named as ``test-d-{cond: A, B, C, ...}{exp_rep: 1, 2, 3}_S{sample}_counts.txt`` and the samples are
    named as ``{experimental condition}{replicate}``. Use ``name_template`` to locate the metadata
    (using ``{}``) and the sample name (using ``[]``) in the file name, with the corresponding data type
    (``int``, ``float``, or ``str`` by default). See below for example code.

Example code:

.. code-block:: python
    :linenos:

    import pandas as pd
    from k_seq.data import SeqData

    sub_c = pd.read_csv('/path/to/sample-substrate-concentration.csv', index_col=1, squeeze=True)

    dataset = SeqData.from_count_files(
        count_files='path/to/counts',
        pattern_filter='test-d-',
        name_template='test-d-[{cond}{exp_rep, int}]_S{sample, int}_counts.txt',
        sort_by='sample',
        x_values=sub_c,
        x_unit='M',                  # substrate concentration unit: M
        input_sample_name=['R0'],
        note='Example k-seq data'
    )

``dataset`` is a :class:`k_seq.data.SeqData` object and contains a count table
(:class:`k_seq.data.SeqTable`, a subclass of ``pd.DataFrame``) at ``dataset.table.original``.
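After loading, a few quick checks with regular pandas operations can confirm that the dataset looks as
expected. A minimal sketch, assuming the merged table stores sequences as rows and samples as columns:

.. code-block:: python
    :linenos:

    # the merged count table behaves like a pd.DataFrame
    print(dataset.table.original.shape)   # (number of unique sequences, number of samples)
    print(dataset.table.original.head())

    # the independent variable (initial BYO concentration) attached to the samples
    print(dataset.x_values)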
Filter sequences and samples
----------------------------

Depending on the experimental design and quality control criteria, sequences or samples sometimes need to
be filtered. ``k-seq`` provides several filters for this purpose. For example, sequences containing
ambiguous nucleotides ('N') or with undesired lengths might need to be removed, and the spike-in sequence
should be excluded from fitting. We filter these sequences and save a filtered table under
``dataset.table``:

.. code-block:: python
    :linenos:

    from k_seq.data.filters import NoAmbiguityFilter, SeqLengthFilter, SpikeInFilter

    dataset.table.filtered = SeqLengthFilter(min_len=21, max_len=21)(
        NoAmbiguityFilter()(
            dataset.table.original
        )
    )
    dataset.table.filtered = SpikeInFilter(
        center_seq='AAAAACAAAAACAAAAACAAA',
        radius=2,
        dist_type='edit',              # consider all sequences within edit distance 2 of center_seq as spike-in
        target=dataset.table.filtered  # target table needed to calculate each sequence's edit distance to center_seq
    )(dataset.table.filtered)

Sequence quantification
***********************

Once a count table with the sequences of interest is prepared, we can further quantify the dependent
variable in the kinetic model (e.g., fraction reacted) for fitting.

Quantify the sequences
----------------------

Depending on the design of the *k*-Seq experiment, different quantification methods can be used. The
``k-seq`` package implements two common quantification approaches: 1) by total sequence amount and 2) by
the amount of an external standard (i.e., a spike-in sequence).

The total sequence (DNA/RNA/protein) amount in each sample can be measured using appropriate experimental
techniques (e.g., qPCR for RNA amount). The quantity of each sequence in a sample is then calculated as
the ratio of the sequence's counts over the total number of reads in the sample, multiplied by the total
sequence amount. In our example, the total RNA amount for each sample is saved in ``total-rna.csv``, and
we can pass it to ``dataset`` using the method
:meth:`SeqData.add_sample_total <k_seq.data.SeqData.add_sample_total>`. Please note that, in order to
calculate the fraction of reads for each sequence in a sample, we need to specify a ``full_table``, which
in our case is the filtered table, so that the added spike-in sequences (and the small number of
discarded sequences) are excluded.

.. code-block:: python
    :linenos:

    total_rna = pd.read_csv('path/to/total-rna.csv', index_col=0, squeeze=True)

    # add sample total info
    dataset.add_sample_total(
        total_amounts=total_rna,
        full_table=dataset.table.filtered,
        unit='ng'
    )
    # quantify filtered sequences
    dataset.table.filtered_ng = dataset.sample_total(target=dataset.table.filtered)

Alternatively, we can quantify the RNA amount with a spike-in sequence using the method
:meth:`SeqData.add_spike_in <k_seq.data.SeqData.add_spike_in>`. The amount of a sequence is calculated as
the ratio of the sequence's counts over the counts of the spike-in sequences, multiplied by the spike-in
amount. In this case, we choose the original table, which still contains the spike-in reads, as the
``base_table``.

.. code-block:: python
    :linenos:

    spike_in_amount = pd.read_csv('path/to/spike-in-rna.csv', index_col=0, squeeze=True)

    dataset.add_spike_in(
        spike_in_seq='AAAAACAAAAACAAAAACAAA',
        spike_in_amount=spike_in_amount,
        base_table=dataset.table.original,
        dist_type='edit',
        radius=2
    )
    dataset.table.filtered_ng_spikein = dataset.spike_in(target=dataset.table.filtered)

The values in the tables ``filtered_ng`` and ``filtered_ng_spikein`` are sequence amounts in ng.
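With both quantifications at hand, a quick consistency check between them can be useful. The sketch below
compares the two tables created above using plain pandas; the log transform and the pseudocount (``1e-6``)
are illustrative choices, not part of ``k-seq``:

.. code-block:: python
    :linenos:

    import numpy as np

    ng_total = dataset.table.filtered_ng            # quantified by total RNA amount
    ng_spikein = dataset.table.filtered_ng_spikein  # quantified by spike-in standard

    # per-sample correlation of log10 amounts; the pseudocount avoids log(0)
    log_total = np.log10(ng_total + 1e-6)
    log_spikein = np.log10(ng_spikein + 1e-6)
    print(log_total.corrwith(log_spikein))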
Calculate the dependent variable
--------------------------------

The final step of preprocessing is to calculate the dependent variable (y) in the kinetic model, in our
case the fraction of sequences reacted. Given that sequences have been quantified (in ng), the fraction
reacted is the amount of a sequence in the reacted sample divided by the amount of that sequence in the
input sample. While this can be done with regular ``pd.DataFrame`` arithmetic, we also include
:class:`ReactedFractionNormalizer <k_seq.data.transform.ReactedFractionNormalizer>` with additional
functions (e.g., taking the median of multiple input replicates; see the function documentation for
details).

.. code-block:: python
    :linenos:

    from k_seq.data.transform import ReactedFractionNormalizer

    dataset.table.reacted_fraction = ReactedFractionNormalizer(input_samples='R0')(dataset.table.filtered_ng)

The table ``reacted_fraction`` contains the fraction reacted in each sample for each sequence, ready for
fitting kinetic models.
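For reference, the same normalization can also be written with plain ``pd.DataFrame`` arithmetic, as
mentioned above. A minimal sketch, assuming ``R0`` is the only input sample (as configured earlier);
``reacted_fraction_manual`` is an illustrative name:

.. code-block:: python
    :linenos:

    ng = dataset.table.filtered_ng

    # divide each reacted sample column by the input sample column
    reacted_fraction_manual = ng.drop(columns='R0').div(ng['R0'], axis=0)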
Kinetic model fitting
*********************

Least-squares fitting with bootstrapping
----------------------------------------

For each sequence, we perform least-squares fitting to the kinetic model using the
`curve_fit <https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html>`_
function from the ``scipy`` package. Several bootstrapping schemes are implemented to robustly quantify
the estimation uncertainty. All these functions are enclosed in the class
:class:`BatchFitter <k_seq.estimate.least_squares_batch.BatchFitter>`.

Fitting thousands to hundreds of thousands of sequences from a *k*-Seq experiment can be computationally
heavy, especially with bootstrapped uncertainty quantification. Several optimizations are included in
``BatchFitter`` for large datasets:

1. Read counts are discrete numbers, and low-abundance sequences are likely to have the same number of
   reads in all samples. Sequences containing identical *k*-Seq data are deduplicated through hashing,
   and only one fitting is performed for each group.
2. During the fitting of a large dataset, results are streamed to disk for each sequence. If the fitting
   is interrupted, it can restart by scanning the output folder and continue with only the unfitted
   sequences.
3. ``BatchFitter`` performs fittings in parallel through multi-threading. Large jobs can be deployed to a
   multi-core High Performance Computing (HPC) node for faster fitting.

In this example, we fit sequences in the table ``reacted_fraction`` to pseudo-first-order kinetics. The
kinetic model can be defined as a Python function following the format expected by ``curve_fit``: the
function takes the independent variable as its first argument and the model parameters as the remaining
arguments. Any model that can be expressed in this format can be passed to ``BatchFitter``. The fitter
can perform a single point estimate using all available data points and/or fit bootstrapped data points
multiple times to estimate the uncertainty.

In addition to the model parameters *k* and *A*, we also calculate the product *kA* from each fitting
result, in both point estimation and bootstrapping. A dictionary of functions (here ``kA``) can be passed
to the argument ``metrics`` to calculate and save additional metrics after each fitting. These functions
should take a single argument containing the list of fitted model parameters (ordered according to the
model function arguments). Please see the example below and
:class:`BatchFitter <k_seq.estimate.least_squares_batch.BatchFitter>` for more details on fitting
configuration.

.. code-block:: python
    :linenos:

    import numpy as np
    from k_seq.estimate.least_squares_batch import BatchFitter

    def pseudo_first_order(c, k, A):
        alpha = 0.479   # coefficient accounting for substrate (BYO) degradation
        t = 90          # reaction time (min)
        y = A * (1 - np.exp(-alpha * t * k * c))
        return y

    def kA(param):
        k, A = param[0], param[1]
        return k * A

    fitter = BatchFitter(
        y_dataframe=dataset.table.reacted_fraction,
        x_data=dataset.x_values,
        model=pseudo_first_order,
        x_label='Substrate (M)',       # name for x values
        y_label='Fraction reacted',    # name for y values
        bounds=[(0, 0), (np.inf, 1)],  # parameter bounds for fitting, format: [(k_min, A_min), (k_max, A_max)]
        metrics={'kA': kA},            # additional metrics to calculate after each fitting
        bootstrap_num=10,              # bootstrap 10 times (10 is for test purposes; > 100 is recommended)
        bs_record_num=5,               # save only 5 bootstrap fitting records to save disk space
        bs_method='data',              # bootstrap by resampling (x, y) data points
        large_dataset=True,            # if True, stream results to the save_to folder during fitting
        save_to='path/to/fitting',     # output directory for fitting results
        rnd_seed=23
    )

    fitter.fit(
        parallel_cores=12,    # run fitting in parallel
        point_estimate=True,  # run point estimation using all data points
        bootstrap=True        # run bootstrapping with the configuration specified above
    )

Fitting results
---------------

All fitting results are saved to the directory specified by the ``save_to`` argument. A typical output
folder contains the following content:

- ``summary.csv``: a table containing the summarized output of the fitting. In our example, it includes,
  for each sequence, the results from point estimation (the *k*, *A*, and *kA* values) and the results
  from bootstrapping (the mean and the 2.5%, 50% (median), and 97.5% percentiles of the estimated *k*,
  *A*, and *kA* values, with the prefix ``bs_``).
- ``seqs/``: a folder containing the fitting details:

  - ``seq_to_hash.json``: a JSON file containing the sequence-to-hash mapping
  - ``[hash_value].json``: JSON files containing the fitting details for each hashed sequence, including
    the data, fitting configuration, point estimation results, and bootstrap (uncertainty) results with
    the selected records from bootstrapped samples

Results are also accessible in Python: ``fitter.summary()`` returns the fitting summary, and the fitting
details are available under ``fitter.results``.

.. Additional analysis
.. *********************

.. note:: This tutorial is a work in progress and is frequently updated.
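To explore the output, the summary table can be inspected with pandas. A minimal sketch, assuming
``fitter.summary()`` returns the summary as a ``pd.DataFrame`` with point-estimate columns named ``k``,
``A``, and ``kA`` (check ``summary.columns`` for the exact names in your output):

.. code-block:: python
    :linenos:

    summary = fitter.summary()   # fitting summary as a table

    # alternatively, reload the streamed results from disk:
    # import pandas as pd
    # summary = pd.read_csv('path/to/fitting/summary.csv', index_col=0)

    # example: the 10 sequences with the highest estimated kA
    print(summary.sort_values('kA', ascending=False).head(10))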