Getting Started Tutorial¶

This brief tutorial walks through a basic pipeline for k-Seq experiment data analysis using k-seq package. The pipeline includes following steps:

Data preprocessing
Sequence quantification
Kinetic model fitting

Data preprocessing¶

Sequencing data (FASTQ files) are first processed in the following steps:

trim/join paired-end reads and deduplicate joined reads as sequence counts
compose multiple count files as a count table
filter sequences or samples for downstream analysis

Join paired-end reads and count unique sequences using `EasyDIVER`¶

We integrated EasyDIVER pipeline in the script fastq-to-counts.sh where paired-end reads are joined using PANDASeq, reads from sample samples are pooled, and joined reads are further deduplicated into counts (number of reads) for each unique sequence.

fastq-to-counts.sh requires all FASTQ files are organized in a folder with file names formatted as: [sample-name]_L[lane number]_R[1/2]_001.fastq.gz. sample-name is the string identifier for an experimental sample and lane number indicate the lane of the flow-cell. R[1/2] indicate the direction of paired-end reads (R1 for forward reads and R2 for reverse reads) that each sample-lane expects to have matched files in both directions.

Here is an example input file folder contains paired-end reads from two samples R4A-0A_S7 and R4B-0A_S21 with 4 lanes:

R4A-0A_S7_L001_R1_001.fastq.gz  R4B-0A_S21_L001_R1_001.fastq.gz
R4A-0A_S7_L001_R2_001.fastq.gz  R4B-0A_S21_L001_R2_001.fastq.gz
R4A-0A_S7_L002_R1_001.fastq.gz  R4B-0A_S21_L002_R1_001.fastq.gz
R4A-0A_S7_L002_R2_001.fastq.gz  R4B-0A_S21_L002_R2_001.fastq.gz
R4A-0A_S7_L003_R1_001.fastq.gz  R4B-0A_S21_L003_R1_001.fastq.gz
R4A-0A_S7_L003_R2_001.fastq.gz  R4B-0A_S21_L003_R2_001.fastq.gz
R4A-0A_S7_L004_R1_001.fastq.gz  R4B-0A_S21_L004_R1_001.fastq.gz
R4A-0A_S7_L004_R2_001.fastq.gz  R4B-0A_S21_L004_R2_001.fastq.gz

To run EasyDIVER pipeline:

> fastq-to-counts.sh \
      -i /path/to/input/folder \
      -o /path/to/output/folder \
      -p CTACGAATTC \              # forward adapter sequence for trimming
      -q CTGCAGTGAA \              # reverse adapter sequence for trimming
      -c \                         # use completely matching in joining and discard reads with error
      -a \                         # join the paired-end reads first before trimming the adapters for reads with heavily overlapped regions
      -T 12                        # number of threads to run in parallel for pandaSeq
> ls /path/to/output/folder
counts  fastas  fastqs  histos  log.txt

The output folder contains following files:

counts: deduplicated count files named as [sample-name]_counts.txt
fastqs: joined FASTQ files named as [sample-name].joined.fastq
fastas: joined FASTA files named as [sample-name].joined.fasta
histos: files with the joined reads length histogram statistics, named as [sample-files]_counts_histo.txt

We will use the count files for each sample under the counts folder, which has following format:

number of unique sequences =     981824
total number of molecules =    10096876

GGGGGGGGATTCATGACTATT                                                                 1
GGGGGGGAGTAGGACTGCAAA                                                                 1
GGGGGGGAAGACTCCGGAACG                                                                 1
GGGGGGGAACGCATTTCACGG                                                                 1
GGGGGGGACGTTCACCGGCAA                                                                 1
...

Load count files¶

We next use k-seq data class k_seq.data.SeqData to load and pool count files into a Python object with merged count table for further analysis.

In this example, we use SeqData.from_count_files to load all count files provided in the k-seq repository (k-seq/data/counts folder). We filter files in the folder to load only count files whose file names contain test-d and two DNA quantification data files (total-rna.csv and spike-in-rna.csv) should not be included. The independent variable (x, initial BYO concentration) is provided in the file sample-substrate-concentration.csv, which can be passed to the dataset using x_values argument. See SeqData.from_count_files for the complete usage.

Tip

Files are usually named with relevant information to the sample, such as such as experimental condition, replicate number, sample name, etc. The loading function SeqData.from_count_files can automatically extract these metadata from file names using argument name_template. The examples count files are named as test-d-{exp_cond: A, B, C, ...}{rep_id: 1, 2, 3}-_S{sample_id}_counts.txt and the samples are named as {experimental condition}{rep}. Use name_template to locate the metadata (using {}) and sample name (using []) in file name with corresponding data type (int, float, or str by default). See below for example code.

Example code:

import pandas as pd
from k_seq.data import SeqData

sub_c = pd.read_csv('/path/to/sample-substrate-concentration.csv', index_col=1, squeeze=True)

dataset = SeqData.from_count_files(
    count_files='path/to/counts',
    pattern_filter='test-d-',
    name_template='test-d-[{cond}{exp_rep, int}]_S{sample, int}_counts.txt',
    sort_by='sample',
    x_values=sub_c,
    x_unit='M',    # substrate concentration unit: M
    input_sample_name=['R0'],
    note='Example k-seq data'
)

dataset is a k_seq.data.SeqData object and contains a count table (k_seq.data.SeqTable, a subclass of pd.DataFrame) at dataset.table.original.

Filter sequences and samples¶

Depending on the experimental design and quality control criteria, sequences or samples sometimes need to be filtered. k-seq provides several filters for this purpose. For example, sequences contain ambiguous nucleotides (‘N’) or with undesired length might need to be removed, and the spike-in sequence should be excluded from fitting. We filter these sequences and saved a filtered table under dataset.table:

from k_seq.data.filters import NoAmbiguityFilter, SeqLengthFilter, SpikeInFilter

dataset.table.filtered = SeqLengthFilter(min_len=21, max_len=21)(
    NoAmbiguityFilter()(
        dataset.table.original
    )
)

dataset.table.filtered = SpikeInFilter(
    center_seq='AAAAACAAAAACAAAAACAAA',
    radius=2, dist_type='edit',       # we consider all sequences within 2 edit distance to center_seq as spike-in
    target=dataset.table.filtered     # target table needed to calculate sequence edit distance to center_seq
)(dataset.table.filtered)

Sequence quantification¶

Once a count table with sequences of interests are prepared, we can further quantify the dependent variable in the kinetic model (e.g., fraction reacted) for fitting.

Quantify the sequence¶

Depending on the design of the k-Seq experiment, different quantification methods could be used. The k-seq package implemented two common quantification approaches: 1) by total sequence amount and 2) by amount of external standards (i.e., spike-in sequence).

Total sequence (DNA/RNA/protein) amount in each sample can be measured using proper experimental technics (e.g., qPCR for RNA amount). The quantity of each sequence in a sample can be calculated as the ratio of sequence’s counts over the total number of reads in the sample, multiplied by the total sequence amount. In our examples, the total RNA amount for each sample is saved in total-rna.csv, and we can pass it to dataset using method SeqData.add_sample_total. Please note that in order to calculate the fraction of reads for each sequence in a sample, we need to specify a full_table, which is, in our case, the filtered table to exclude added spike-in sequence (and a small number of sequences discarded).

total_rna = pd.read_csv('path/to/total-rna.csv', index_col=0, squeeze=True)

# add sample total info
dataset.add_sample_total(
    total_amounts=total_rna,
    full_table=dataset.table.filtered,
    unit='ng'
)
# quantify fitlered seqeunces
dataset.table.filtered_ng = dataset.sample_total(target=dataset.table.filtered)

Alternatively, we can quantify RNA amount with spike-in sequence using method SeqData.add_spike_in. The amount of a sequence is calculated as the ratio of the sequence’s counts over the counts of spike-in sequences, multiplied by the spike-in amount. In this case, we choose a base_table as the original table with spike-read reads.

spike_in_amount = pd.read_csv('path/to/spike-in-rna.csv', index_col=0, squeeze=True)

dataset.add_spike_in(
    spike_in_seq='AAAAACAAAAACAAAAACAAA',
    spike_in_amount=spike_in_amount,
    base_table=dataset.table.original,
    dist_type='edit',
    radius=2
)

dataset.table.filtered_ng_spikein = dataset.spike_in(target=dataset.table.filtered)

The value in the table filtered_ng and filtered_ng_spikin are sequence mass with unit ‘ng’.

Calculate the dependent variable¶

The final step in the preprocessing is to calculate the dependent variable (y) in the kinetic model, in our case, the fraction of sequences reacted. Given sequences have been quantify (as ng), the fraction reacted is the amount of sequence in the reacted sample divided by the amount of sequence in the input sample. While this can be done by regular pd.DataFrame arithmetics, we also include ReactedFractionNormalizer with additional functions (e.g., taking the median of multiple input replicates, see function documentation for details).

from k_seq.data.transform import ReactedFractionNormalizer

dataset.table.reacted_fraction = ReactedFractionNormalizer(input_samples='R0')(dataset.table.filtered_ng)

The table reacted_fraction contains the fractions reacted in samples for each sequences to fit kinetic models.

Kinetic model fitting¶

Least-squares fitting with bootstrapping¶

For each sequence, we perform least-squares fitting to the kinetic model using curve_fit function from scipy package. Several bootstrapping schemas are implemented to robustly quantify the estimation uncertainty. All these functions are enclosed in the class BatchFitter.

Fitting thousands to hundreds of thousands sequence from k-Seq experimental could be computationally heavy, especially with bootstrapping uncertainty quantify. Several optimizations are included in BatchFitter for large datasets:

Count reads are discrete numbers and lower abundant sequences are likely to contain same number of reads in all samples. Such sequences containing same data from k-Seq are deduplicated through hashing and only one fitting are performed for them.
During the fitting of a large dataset, results are streamed to disk for each sequence. If the fitting were interrupted, it can restart by scanning the output folder and continue fitting only unfitted sequences.
BatchFitter perform fittings in parallel through multi-threading. Large jobs can be deployed to a High Performance Computing (HPC) node with multi-cores for faster fitting.

In this example, we fit sequences in the table reacted_fraction to the pseudo-first-order kinetics. The kinetic model can be defined using python function following the format of curve_fit. The function has the independent variable as the first argument and model parameters as the remaining arguments. Any model can be represented in such format can be passed to BatchFitter. The fitter can perform single point estimation using all available data points and/or fit bootstrapped data points for multiple times to estimate uncertainty. In addition to model parameters k and A, we also intended to calculate the product kA from each fitting results in point estimation and bootstrapping. A dictionary of functions (here kA) can be passed to argument metric to calculate and save additional metrics after each fitting. These functions should take a single argument containing a list of fitted model parameters (ordered according to model function arguments). Please see below example and BatchFitter for more details on fitting configuration.

from k_seq.estimate.least_squares_batch import BatchFitter

def pseudo_first_order(c, k, A):
    alpha = 0.479
    t = 90
    y = A * (1 - np.exp(- alpha * t * k * c))
    return y

def kA(param):
    k, A = param[0], param[1]
    return k * A

fitter = BatchFitter(
    y_dataframe=dataset.table.reacted_fraction,
    x_data=dataset.x_values,
    model=pseudo_first_order,
    x_label='Substrate (M)',         # name for x values
    y_label='Fraction reacted',      # name for y values
    bounds=[(0, 0), (np.inf, 1)],    # value bound for fitting, format: [(k_min, A_min), (k_max, A_max)]
    metrics={'kA': kA},              # additional metrics to calculate after each fitting
    bootstrap_num=10,                # bootstrap for 10 times (10 is for test purpose, recommend > 100)
    bs_record_num=5,                 # save only 5 bootstrap fitting results to save disk space
    bs_method='data',                # bootstrap by resampling (x, y) data points
    large_dataset=True,              # if True, stream results to save_to folder during fitting
    save_to=FIG_OUTPUT_DIR/'fitting',
    rnd_seed=23
)

fitter.fit(
    parallel_cores=12,               # run fitting in parallel
    point_estimate=True,             # run point estimation using all data points
    bootstrap=True                   # run bootstrapping with configuration specified above
)

Fitting results¶

All fitting results are saved to the directory specified by the save_to arguments, a typical output folder contains following content:

summary.csv: a table contain summarized output from fitting. In out example, it includes columns of sequence, results from point estimation (k, A, and kA value), results from bootstrapping (mean, 2.5%, 50% or median, 97.5% of estimated k, A, and kA values, with prefix ‘bs_’ )
seqs/ folder contains fitting details:
- seq_to_hash.json: JSON file contains sequence to hash mapping
- [hash_value].json: list of JSON files contain the fitting details for hashed sequences, including data, fitting configuration, point estimation results, bootstrap (uncertainty) results with selected records from bootstrapped samples.

Results are also accessible in Python: fitter.summary() returns the fitting summary and fitting details are available under fitter.results.

Note

This tutorial is under working progress and is frequently updated