Getting Started Tutorial¶
This brief tutorial walks through a basic pipeline for k-Seq experiment data analysis using k-seq
package. The
pipeline includes following steps:
See also
Installation for setting up the environment and installing k-seq package. k-seq package for the complete API documentation of the package.
Data preprocessing¶
Sequencing data (FASTQ files) are first processed in the following steps:
trim/join paired-end reads and deduplicate joined reads as sequence counts
compose multiple count files as a count table
filter sequences or samples for downstream analysis
Join paired-end reads and count unique sequences using EasyDIVER
¶
We integrated EasyDIVER pipeline in the script fastq-to-counts.sh
where paired-end reads are joined using PANDASeq, reads from sample samples
are pooled, and joined reads are further deduplicated into counts (number of reads) for each unique sequence.
fastq-to-counts.sh
requires all FASTQ files are organized in a folder with file names formatted as:
[sample-name]_L[lane number]_R[1/2]_001.fastq.gz
. sample-name
is the string identifier for an experimental
sample and lane number
indicate the lane of the flow-cell. R[1/2]
indicate the direction of paired-end reads
(R1
for forward reads and R2
for reverse reads) that each sample-lane expects to have matched files in both
directions.
Here is an example input file folder contains paired-end reads from two samples R4A-0A_S7
and R4B-0A_S21
with 4 lanes:
R4A-0A_S7_L001_R1_001.fastq.gz R4B-0A_S21_L001_R1_001.fastq.gz
R4A-0A_S7_L001_R2_001.fastq.gz R4B-0A_S21_L001_R2_001.fastq.gz
R4A-0A_S7_L002_R1_001.fastq.gz R4B-0A_S21_L002_R1_001.fastq.gz
R4A-0A_S7_L002_R2_001.fastq.gz R4B-0A_S21_L002_R2_001.fastq.gz
R4A-0A_S7_L003_R1_001.fastq.gz R4B-0A_S21_L003_R1_001.fastq.gz
R4A-0A_S7_L003_R2_001.fastq.gz R4B-0A_S21_L003_R2_001.fastq.gz
R4A-0A_S7_L004_R1_001.fastq.gz R4B-0A_S21_L004_R1_001.fastq.gz
R4A-0A_S7_L004_R2_001.fastq.gz R4B-0A_S21_L004_R2_001.fastq.gz
To run EasyDIVER
pipeline:
1 2 3 4 5 6 7 8 9 10 | > fastq-to-counts.sh \
-i /path/to/input/folder \
-o /path/to/output/folder \
-p CTACGAATTC \ # forward adapter sequence for trimming
-q CTGCAGTGAA \ # reverse adapter sequence for trimming
-c \ # use completely matching in joining and discard reads with error
-a \ # join the paired-end reads first before trimming the adapters for reads with heavily overlapped regions
-T 12 # number of threads to run in parallel for pandaSeq
> ls /path/to/output/folder
counts fastas fastqs histos log.txt
|
The output folder contains following files:
counts
: deduplicated count files named as[sample-name]_counts.txt
fastqs
: joined FASTQ files named as[sample-name].joined.fastq
fastas
: joined FASTA files named as[sample-name].joined.fasta
histos
: files with the joined reads length histogram statistics, named as[sample-files]_counts_histo.txt
We will use the count files for each sample under the counts
folder, which has following format:
number of unique sequences = 981824
total number of molecules = 10096876
GGGGGGGGATTCATGACTATT 1
GGGGGGGAGTAGGACTGCAAA 1
GGGGGGGAAGACTCCGGAACG 1
GGGGGGGAACGCATTTCACGG 1
GGGGGGGACGTTCACCGGCAA 1
...
Load count files¶
We next use k-seq
data class k_seq.data.SeqData
to load and pool count files
into a Python object with merged count table for further analysis.
In this example, we use SeqData.from_count_files
to load all
count files provided in the k-seq
repository
(k-seq/data/counts folder).
We filter files in the folder to load only count files whose file names contain test-d
and two
DNA quantification data files (total-rna.csv
and spike-in-rna.csv
) should not be included.
The independent variable (x, initial BYO concentration) is provided in the file sample-substrate-concentration.csv
,
which can be passed to the dataset using x_values
argument.
See SeqData.from_count_files
for the complete usage.
Tip
Files are usually named with relevant information to the sample, such as such as experimental condition,
replicate number, sample name, etc. The loading function
SeqData.from_count_files
can automatically extract
these metadata from file names using argument name_template
. The examples count files are named as
test-d-{exp_cond: A, B, C, ...}{rep_id: 1, 2, 3}-_S{sample_id}_counts.txt
and the samples are named
as {experimental condition}{rep}
. Use name_template
to locate the metadata (using {}
) and sample
name (using []
) in file name with corresponding data type (int
, float
, or str
by default).
See below for example code.
Example code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | import pandas as pd
from k_seq.data import SeqData
sub_c = pd.read_csv('/path/to/sample-substrate-concentration.csv', index_col=1, squeeze=True)
dataset = SeqData.from_count_files(
count_files='path/to/counts',
pattern_filter='test-d-',
name_template='test-d-[{cond}{exp_rep, int}]_S{sample, int}_counts.txt',
sort_by='sample',
x_values=sub_c,
x_unit='M', # substrate concentration unit: M
input_sample_name=['R0'],
note='Example k-seq data'
)
|
dataset
is a k_seq.data.SeqData
object and contains
a count table (k_seq.data.SeqTable
, a subclass of pd.DataFrame
)
at dataset.table.original
.
Filter sequences and samples¶
Depending on the experimental design and quality control criteria, sequences or samples sometimes need to be filtered.
k-seq
provides several filters for this purpose. For example, sequences contain ambiguous nucleotides (‘N’) or
with undesired length might need to be removed, and the spike-in sequence should be excluded from fitting. We filter
these sequences and saved a filtered table under dataset.table
:
1 2 3 4 5 6 7 8 9 10 11 12 13 | from k_seq.data.filters import NoAmbiguityFilter, SeqLengthFilter, SpikeInFilter
dataset.table.filtered = SeqLengthFilter(min_len=21, max_len=21)(
NoAmbiguityFilter()(
dataset.table.original
)
)
dataset.table.filtered = SpikeInFilter(
center_seq='AAAAACAAAAACAAAAACAAA',
radius=2, dist_type='edit', # we consider all sequences within 2 edit distance to center_seq as spike-in
target=dataset.table.filtered # target table needed to calculate sequence edit distance to center_seq
)(dataset.table.filtered)
|
Sequence quantification¶
Once a count table with sequences of interests are prepared, we can further quantify the dependent variable in the kinetic model (e.g., fraction reacted) for fitting.
Quantify the sequence¶
Depending on the design of the k-Seq experiment, different quantification methods could be used. The k-seq
package implemented two common quantification approaches: 1) by total sequence amount and 2) by amount of
external standards (i.e., spike-in sequence).
Total sequence (DNA/RNA/protein) amount in each sample can be measured using proper experimental technics
(e.g., qPCR for RNA amount). The quantity of each sequence in a sample can be calculated as the ratio of sequence’s
counts over the total number of reads in the sample, multiplied by the total sequence amount.
In our examples, the total RNA amount for each sample is saved in total-rna.csv
, and we can pass it to
dataset
using method SeqData.add_sample_total
. Please note
that in order to calculate the fraction of reads for each sequence in a sample, we need to specify a full_table
,
which is, in our case, the filtered table to exclude added spike-in sequence (and a small number of sequences discarded).
1 2 3 4 5 6 7 8 9 10 | total_rna = pd.read_csv('path/to/total-rna.csv', index_col=0, squeeze=True)
# add sample total info
dataset.add_sample_total(
total_amounts=total_rna,
full_table=dataset.table.filtered,
unit='ng'
)
# quantify fitlered seqeunces
dataset.table.filtered_ng = dataset.sample_total(target=dataset.table.filtered)
|
Alternatively, we can quantify RNA amount with spike-in sequence using method
SeqData.add_spike_in
. The amount of a sequence is calculated as the
ratio of the sequence’s counts over the counts of spike-in sequences, multiplied by the spike-in amount. In this case,
we choose a base_table
as the original table with spike-read reads.
1 2 3 4 5 6 7 8 9 10 11 | spike_in_amount = pd.read_csv('path/to/spike-in-rna.csv', index_col=0, squeeze=True)
dataset.add_spike_in(
spike_in_seq='AAAAACAAAAACAAAAACAAA',
spike_in_amount=spike_in_amount,
base_table=dataset.table.original,
dist_type='edit',
radius=2
)
dataset.table.filtered_ng_spikein = dataset.spike_in(target=dataset.table.filtered)
|
The value in the table filtered_ng
and filtered_ng_spikin
are sequence mass with unit ‘ng’.
Calculate the dependent variable¶
The final step in the preprocessing is to calculate the dependent variable (y) in the kinetic model, in our case,
the fraction of sequences reacted. Given sequences have been quantify (as ng), the fraction reacted is the amount of
sequence in the reacted sample divided by the amount of sequence in the input sample. While this can be done by
regular pd.DataFrame
arithmetics, we also include
ReactedFractionNormalizer
with additional functions (e.g.,
taking the median of multiple input replicates, see function documentation for details).
1 2 3 | from k_seq.data.transform import ReactedFractionNormalizer
dataset.table.reacted_fraction = ReactedFractionNormalizer(input_samples='R0')(dataset.table.filtered_ng)
|
The table reacted_fraction
contains the fractions reacted in samples for each sequences to fit
kinetic models.
Kinetic model fitting¶
Least-squares fitting with bootstrapping¶
For each sequence, we perform least-squares fitting to the kinetic model using
curve_fit
function from scipy
package.
Several bootstrapping schemas are implemented to robustly quantify the estimation uncertainty. All these functions
are enclosed in the class BatchFitter
.
Fitting thousands to hundreds of thousands sequence from k-Seq experimental could be computationally heavy, especially
with bootstrapping uncertainty quantify. Several optimizations are included in BatchFitter
for large datasets:
Count reads are discrete numbers and lower abundant sequences are likely to contain same number of reads in all samples. Such sequences containing same data from k-Seq are deduplicated through hashing and only one fitting are performed for them.
During the fitting of a large dataset, results are streamed to disk for each sequence. If the fitting were interrupted, it can restart by scanning the output folder and continue fitting only unfitted sequences.
BatchFitter
perform fittings in parallel through multi-threading. Large jobs can be deployed to a High Performance Computing (HPC) node with multi-cores for faster fitting.
In this example, we fit sequences in the table reacted_fraction
to the pseudo-first-order kinetics. The kinetic model
can be defined using python function following the format of curve_fit
. The function has the independent variable
as the first argument and model parameters as the remaining arguments. Any model can be represented in such format can
be passed to BatchFitter
. The fitter can perform single point estimation using all available data points and/or fit
bootstrapped data points for multiple times to estimate uncertainty.
In addition to model parameters k and A, we also intended to calculate the
product kA from each fitting results in point estimation and bootstrapping. A dictionary of functions (here kA
)
can be passed to argument metric
to calculate and save additional metrics after each fitting.
These functions should take a single argument containing a list of fitted model parameters (ordered according to
model function arguments). Please see below example and
BatchFitter
for more details on fitting configuration.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | from k_seq.estimate.least_squares_batch import BatchFitter
def pseudo_first_order(c, k, A):
alpha = 0.479
t = 90
y = A * (1 - np.exp(- alpha * t * k * c))
return y
def kA(param):
k, A = param[0], param[1]
return k * A
fitter = BatchFitter(
y_dataframe=dataset.table.reacted_fraction,
x_data=dataset.x_values,
model=pseudo_first_order,
x_label='Substrate (M)', # name for x values
y_label='Fraction reacted', # name for y values
bounds=[(0, 0), (np.inf, 1)], # value bound for fitting, format: [(k_min, A_min), (k_max, A_max)]
metrics={'kA': kA}, # additional metrics to calculate after each fitting
bootstrap_num=10, # bootstrap for 10 times (10 is for test purpose, recommend > 100)
bs_record_num=5, # save only 5 bootstrap fitting results to save disk space
bs_method='data', # bootstrap by resampling (x, y) data points
large_dataset=True, # if True, stream results to save_to folder during fitting
save_to=FIG_OUTPUT_DIR/'fitting',
rnd_seed=23
)
fitter.fit(
parallel_cores=12, # run fitting in parallel
point_estimate=True, # run point estimation using all data points
bootstrap=True # run bootstrapping with configuration specified above
)
|
Fitting results¶
All fitting results are saved to the directory specified by the save_to
arguments, a typical output folder contains
following content:
summary.csv
: a table contain summarized output from fitting. In out example, it includes columns of sequence, results from point estimation (k, A, and kA value), results from bootstrapping (mean, 2.5%, 50% or median, 97.5% of estimated k, A, and kA values, with prefix ‘bs_’ )seqs/
folder contains fitting details:seq_to_hash.json
: JSON file contains sequence to hash mapping[hash_value].json
: list of JSON files contain the fitting details for hashed sequences, including data, fitting configuration, point estimation results, bootstrap (uncertainty) results with selected records from bootstrapped samples.
Results are also accessible in Python: fitter.summary()
returns the fitting summary and fitting details are available
under fitter.results
.
Note
This tutorial is under working progress and is frequently updated