k_seq.data
: data preprocessing¶
Modules for data handling, including:
preprocessing
: core module for data pre-processing, from count files to SequenceSeq for the estimator
io
: module containing utility functions to read, write, and convert different file formats
analysis
: module containing functions for extra analysis of sequencing samples or reads, for sample investigation and pipeline quality control
simu
: module containing code to generate the simulated data used in the analysis in the paper (TODO: add paper citation)
-
k_seq.data.
axis_mapper
(axis)¶ Map the name of a data table axis to its axis number:
axis 0: seq, sequence, sequences
axis 1: sample, observation, obs
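A minimal sketch of such a name-to-number mapping, using only the aliases listed above. The function name `map_axis`, the case-insensitive matching, and the error handling are illustrative assumptions, not taken from the k_seq source.

```python
# Hypothetical re-implementation for illustration; not the k_seq source.
_AXIS_ALIASES = {
    0: ("seq", "sequence", "sequences"),
    1: ("sample", "observation", "obs"),
}


def map_axis(axis):
    """Return 0 or 1 for an axis given as a number or as one of the aliases."""
    if axis in (0, 1):
        return int(axis)
    for number, names in _AXIS_ALIASES.items():
        if isinstance(axis, str) and axis.lower() in names:
            return number
    raise ValueError(f"Unknown axis: {axis!r}")


print(map_axis("sequences"))  # 0
print(map_axis("obs"))        # 1
```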
k_seq.data.preprocess¶
This module contains the following preprocessing functions for sequencing data:
1. Preprocess DNA sequencing results from fastq.gz files using the EasyDIVER pipeline (https://github.com/ichen-lab-ucsb/EasyDIVER), including joining paired-end reads, trimming primers, pooling the same sample across different lanes, and de-duplicating to generate count files of unique sequences.
2. Parse count files and load them as a
SeqTable
object
-
k_seq.data.preprocess.
fastq_to_count
(fastq_root, output_path='pipeline-output', threads=12, forward_primer=None, reverse_primer=None, pandas_abs_match=True, join_first=True)¶ Call fastq_to_count.sh subprocess for raw reads joining, pooling, and de-duplicating using the EasyDIVER pipeline (https://github.com/ichen-lab-ucsb/EasyDIVER)
- Parameters
fastq_root (path) – path to the root folder containing paired-end raw FASTQ.gz files
output_path (path) – path to save output. By default it saves to ‘pipeline-output’ folder
threads (int) – number of threads to run pandaSeq for joining. Default 12 in the signature above (documented as using all CPUs by default)
forward_primer (str) – specify the forward primer for pandaSeq joining and trimming
reverse_primer (str) – specify the reverse primer for pandaSeq joining and trimming
pandas_abs_match (bool) – whether to enforce absolute matching in pandaSeq joining. Default True.
join_first (bool) – whether to join before trimming in pandaSeq. Suitable for heavily overlapped paired-end reads. Default True.
Example
TODO: add an example
-
k_seq.data.preprocess.
load_Seqtable_from_count_files
(count_files, file_list=None, pattern_filter=None, black_list=None, name_template=None, sort_by=None, x_values=None, x_unit=None, input_sample_name=None, sample_metadata=None, note=None, dry_run=False)¶ Create a
SeqData
instance from count files. Multiple sources of count files are supported; indicate them in count_files
- Parameters
count_files (str, or list of dict) – root directory to search for count files. To load multiple sources, provide a list of dict as keyword arguments in the format of [{‘count_files’: path/to/count/files (required), ‘name_template’: string for name template, ‘pattern_filter’: string for pattern filter, ‘file_list’: list of file to load, ‘black_list’: list of files to exclude}]
file_list (list of str) – optional, only includes the count files with names in the file_list
pattern_filter (str) – optional, filter file names based on this pattern; wildcards
*/?
are allowed
black_list (list of str) – optional, file names included in black_list will be excluded
name_template (str) – naming convention to extract metadata. Use
[...]
to include the region of sample_name, use
{domain_name[, int/float]}
to indicate the region of a domain to extract as metadata; including
int
or
float
will convert the domain value to int/float if applicable, otherwise string
sort_by (str) – sort the order of samples based on the given domain
dry_run (bool) – only return the parsed count file names and metadata without actual reading in data
x_values (list-like, or dict) – optional. value for controlled variables. If list-like, should have same length and order as samples; if dict, should have sample names as key
x_unit (str) – optional. Unit for controlled variable. e.g. ‘uM’
input_sample_name (list of str) – optional. Indicate input samples (unreacted)
sample_metadata (dict of objects) – optional. Extra sample metadata
note (str) – Note for dataset/seq_table
Example
Example on metadata extraction from pattern:
>>> metadata = extract_metadata(
...     sample_name="R4B-1250A_S16_counts.txt",
...     template="R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt"
... )
>>> metadata
{'name': 'B-1250A_S16', 'exp_rep': 'B', 'concentration': 1250.0, 'seq_rep': 'A', 'id': 16}
- Notice: two back-to-back domains can only be parsed if one of them is numeric and the other is alphabetic; a missing value will raise an error
Valid: matching '-A1-' to '-{sample}{replicate, int}-' gives {'sample': 'A', 'replicate': 1}
Not valid: matching '-A-' to '-{sample}{replicate, int}-' will cause an error
Not valid: matching '-AA-' to '-{sample}{replicate}-' will cause an error
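The template syntax above can be illustrated by translating a template into a regular expression. This is a rough sketch under stated assumptions, not the k_seq implementation: the name `extract_metadata_sketch` is hypothetical, string domains are assumed to contain letters only, and numeric domains digits (with an optional decimal part for float). Unlike the documented behaviour, this sketch does not raise an error for two adjacent alphabetic domains.

```python
import re

# Hypothetical sketch, not the k_seq implementation.
_DOMAIN = re.compile(r"\{(\w+)(?:,\s*(int|float))?\}")
_VALUE_PATTERNS = {None: r"[A-Za-z]+", "int": r"[0-9]+", "float": r"[0-9]+(?:\.[0-9]+)?"}
_CONVERTERS = {None: str, "int": int, "float": float}


def _literal(text):
    """Escape literal text; '[' and ']' delimit the sample-name region."""
    parts = []
    for ch in text:
        if ch == "[":
            parts.append("(?P<name>")
        elif ch == "]":
            parts.append(")")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def extract_metadata_sketch(sample_name, template):
    """Build a regex from the template and convert each matched domain."""
    pattern, converters, pos = "", {"name": str}, 0
    for m in _DOMAIN.finditer(template):
        domain, kind = m.group(1), m.group(2)
        pattern += _literal(template[pos:m.start()])
        pattern += f"(?P<{domain}>{_VALUE_PATTERNS[kind]})"
        converters[domain] = _CONVERTERS[kind]
        pos = m.end()
    pattern += _literal(template[pos:])
    match = re.fullmatch(pattern, sample_name)
    if match is None:
        raise ValueError(f"{sample_name!r} does not match template {template!r}")
    return {key: converters[key](value) for key, value in match.groupdict().items()}


print(extract_metadata_sketch(
    "R4B-1250A_S16_counts.txt",
    "R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt",
))
```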
-
k_seq.data.preprocess.
read_count_file
(file_path, as_dict=False, number_only=False)¶ Read a single count file generated from EasyDIVER pipeline with format:
Count file format:
number of unique sequences = 2825
total number of molecules = 29348173
AAAAAAAACACCACACA 2636463
AATATTACATCATCTATC 86763
...
- Parameters
file_path (str) – full path to the count file
as_dict (bool) – return a dictionary instead of a pd.DataFrame
number_only (bool) – only return number of unique seqs and total counts if True
- Returns
unique_seqs (int): number of unique sequences in the count file
total_counts (int): number of total reads in the count file
sequence_counts (pd.DataFrame): with
sequence
as index and
counts
as the first column
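As an illustration of the count file layout shown above, the following sketch parses the two header lines and the sequence/count pairs from a file-like object. The function name `parse_count_file` and the tolerance for blank lines and extra columns are assumptions made for this example; it returns a plain dict rather than a pd.DataFrame.

```python
import io


def parse_count_file(handle):
    """Parse header values and sequence counts (layout assumed from the text above)."""
    unique_seqs = total_counts = None
    counts = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank separator lines
        if "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "number of unique sequences":
                unique_seqs = int(value)
            elif key == "total number of molecules":
                total_counts = int(value)
        else:
            seq, count = line.split()[:2]  # ignore any extra columns
            counts[seq] = int(count)
    return unique_seqs, total_counts, counts


example = io.StringIO(
    "number of unique sequences = 2\n"
    "total number of molecules = 5\n"
    "\n"
    "AAAC 3\n"
    "AATT 2\n"
)
print(parse_count_file(example))  # (2, 5, {'AAAC': 3, 'AATT': 2})
```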
k_seq.data.seq_data¶
Submodule of SeqData, a feature-rich class wrapping seq_table for sequencing data manipulation. This module contains methods
for data pre-processing from count files to CountFile
for the estimator.
For absolute quantification, it accepts absolute amounts (e.g. measured by qPCR) or reacted fractions.
.. todo:: - write an output function for each class as a JSON file and a readable folder
-
class
k_seq.data.seq_data.
SeqData
(data, data_unit=None, sample_list=None, seq_list=None, data_note=None, use_sparse=True, seq_metadata=None, sample_metadata=None, grouper=None, x_values=None, x_unit=None, note=None, dataset_metadata=None)¶ Bases:
object
Data instance to store k-seq result
-
seq_table
¶ accessor for the stored tables, including
original
created during initialization. Stored tables should be
pd.DataFrame
or
SeqTable
- Type
- x_values (list-like, or dict): optional. Values for controlled variables. If list-like, should have the same length and order as samples; if dict, should have sample names as keys
x_unit (str): optional. Unit for controlled variable, e.g. ‘uM’
seqs
samples
metadata (AttrScope): accessor for metadata for the dataset, includes
sample (AttrScope): collection of metadata for samples if applicable
seq (AttrScope): collection of metadata for seqs if applicable
created_time (timestamp): datetime when the instance was created
note (str): note for the dataset
other dataset metadata objects could be added
analysis (AttrScope): accessor to some pre-built analyses
- Plugins:
grouper (GrouperCollection): collection of
Grouper
to slice subtables
spike_in (SpikeInNormalizer): optional. Accessor to the normalizer using spike-in
sample_total (TotalAmountNormalizer): optional. Accessor to the normalizer using the total sample amount of seqs
analysis (SeqDataAnalyzer): built-in analysis tools to analyze a SeqData object
-
__init__
(data, data_unit=None, sample_list=None, seq_list=None, data_note=None, use_sparse=True, seq_metadata=None, sample_metadata=None, grouper=None, x_values=None, x_unit=None, note=None, dataset_metadata=None)¶ Initialize a SeqData object
Args:
data (pd.DataFrame or np.ndarray): 2-D data with indices as sequences and columns as samples. If data is a pd.DataFrame, the values in its index and columns are used as sequences and samples; if data is a 2-D np.ndarray, sample_list and seq_to_fit are needed, with the same length and order as data
data_unit (str): the unit of seq_table values, e.g. counts, ng, M. Default counts.
sample_list (list-like): list of samples, should match the columns in the seq_table data
seq_to_fit:
data_note (str): note for the data seq_table
use_sparse (bool): whether to store the seq_table values as a sparse matrix
seq_metadata (dict of objects): optional. Extra seq metadata
grouper (dict of list, or dict of dict of list): optional. dict of list (Type 1) or dict of dict of list (Type 2) to create the grouper plugin
x_values (list-like, or dict): optional. Values for controlled variables. If list-like, should have the same length and order as samples; if dict, should have sample names as keys
x_unit (str): optional. Unit for controlled variable, e.g. ‘uM’
note (str): note for dataset/seq_table
dataset_metadata (dict of objects): optional. Extra dataset metadata
-
add_grouper
(**kwargs)¶ Add a GrouperCollection accessor to SeqData if it does not exist yet, then add Groupers to the accessor. Each Grouper instance is initialized from a keyword argument whose value is a dictionary of:
- group (list or dict): list creates a Type 0 Grouper (single group) and dict creates a Type 1 Grouper
(multiple groups)
target (pd.DataFrame): optional, target table
axis (0 or 1): axis to apply the grouper
-
add_sample_total
(total_amounts, full_table, unit=None)¶ - Add TotalAmountNormalizer to quantify sequences with their total amount in each sample
as sample_total
- Parameters
total_amounts (dict or pd.Series) – Total DNA amount for samples measured in experiment
full_table (pd.DataFrame) – seq_table in which the total amounts were measured and to which the amounts are normalized
unit (str) – Unit of amount measured
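One plausible reading of total-amount quantification is that each sequence receives a share of the measured sample amount proportional to its share of the counts. The sketch below illustrates that reading with plain dicts; the function name `quantify_by_total_amount` and the proportional formula are assumptions, not the TotalAmountNormalizer source.

```python
def quantify_by_total_amount(counts, total_amounts):
    """Scale counts so that each sample sums to its measured total amount.

    counts: {sample: {sequence: count}}; total_amounts: {sample: amount}.
    """
    amounts = {}
    for sample, seq_counts in counts.items():
        total_count = sum(seq_counts.values())
        factor = total_amounts[sample] / total_count  # amount per count
        amounts[sample] = {seq: c * factor for seq, c in seq_counts.items()}
    return amounts


counts = {"s1": {"AAA": 30, "AAC": 10}, "s2": {"AAA": 5, "AAC": 15}}
total_amounts = {"s1": 200.0, "s2": 100.0}  # e.g. ng of DNA per sample
print(quantify_by_total_amount(counts, total_amounts))
```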
-
add_spike_in
(base_table, spike_in_seq, spike_in_amount, radius=2, unit=None, dist_type='edit')¶ Add SpikeInNormalizer to quantify seq amount using spike-in sequence as accessor
spike_in
to the instance
- Parameters
base_table (pd.DataFrame) – base_table includes spike-in sequences to calculate normalization factor
spike_in_seq (str) – center sequence for spike-in
spike_in_amount (list-like, dict, or pd.Series) – added spike-in amount. A dict or pd.Series should be keyed by the samples in base_table; a list-like should have the same length as the number of samples (columns) in base_table
radius (int) – radius of the spike-in peak; seqs at a distance less than or equal to radius from the center are counted as spike-in seqs
unit (str) – unit of spike-in amount
dist_type (str) – ‘edit’ or ‘hamming’ as distance measure. Default ‘edit’ to include insertion / deletion
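The spike-in idea can be sketched as follows: sum the counts of all sequences within `radius` edits of the center sequence, then divide the known spike-in amount by that sum to get an amount-per-count factor. The helper names and the exact formula are assumptions for illustration and may differ from SpikeInNormalizer.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def spike_in_factor(seq_counts, center, spike_in_amount, radius=2):
    """Amount per count, using all sequences within `radius` edits of `center`."""
    peak = sum(c for s, c in seq_counts.items() if edit_distance(s, center) <= radius)
    return spike_in_amount / peak


sample = {"AAAAC": 90, "AAAAG": 8, "AAAA": 2, "GGTTT": 400}
factor = spike_in_factor(sample, center="AAAAC", spike_in_amount=1.0, radius=2)
amounts = {seq: count * factor for seq, count in sample.items()}
print(factor, amounts["GGTTT"])
```

Here the three sequences near the center contribute 100 counts to the spike-in peak, so the unrelated sequence with 400 counts is assigned four times the spike-in amount.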
-
static
from_count_files
(**kwargs)¶ Create a
SeqData
instance from a folder of count files
- Args:
count_files (str): root directory to search for count files
file_list (list of str): optional, only includes the count files with names in the file_list
pattern_filter (str): optional, filter file names based on this pattern; wildcards
*/?
are allowed
black_list (list of str): optional, file names included in black_list will be excluded
name_template (str): naming convention to extract metadata. Use
[...]
to include the region of sample_name, use
{domain_name[, int/float]}
to indicate the region of a domain to extract as metadata; including
int
or
float
will convert the domain value to int/float if applicable, otherwise string
sort_by (str): sort the order of samples based on the given domain
dry_run (bool): only return the parsed count file names and metadata without actually reading in data
x_values (list-like, or dict): optional. Values for controlled variables. If list-like, should have the same length and order as samples; if dict, should have sample names as keys
x_unit (str): optional. Unit for controlled variable, e.g. ‘uM’
input_sample_name (list of str): optional. Indicate input samples (unreacted)
sample_metadata (dict of objects): optional. Extra sample metadata
note (str): note for dataset/seq_table
Example
Example on metadata extraction from pattern:
>>> metadata = extract_metadata(
...     sample_name="R4B-1250A_S16_counts.txt",
...     template="R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt"
... )
>>> metadata
{'name': 'B-1250A_S16', 'exp_rep': 'B', 'concentration': 1250.0, 'seq_rep': 'A', 'id': 16}
- Notice: two back-to-back domains can only be parsed if one of them is numeric and the other is alphabetic; a missing value will raise an error
Valid: matching '-A1-' to '-{sample}{replicate, int}-' gives {'sample': 'A', 'replicate': 1}
Not valid: matching '-A-' to '-{sample}{replicate, int}-' will cause an error
Not valid: matching '-AA-' to '-{sample}{replicate}-' will cause an error
-
from_json
()¶ TODO: add json
-
static
from_pickle
(path)¶
-
property
samples
¶
-
property
seqs
¶
-
to_json
()¶ More generalized JSON file TODO: add to_json and from_json
-
to_pickle
(path)¶
-
update_analysis
()¶ Update accessor to SeqDataAnalyzer
-
-
class
k_seq.data.seq_data.
SeqTable
(*args, **kwargs)¶ Bases:
pandas.core.frame.DataFrame
Extended
pd.DataFrame
with added properties and functions for SeqTable
- Additional Attributes:
unit (str): unit of entries in this seq_table
note (str): note for this seq_table
samples (pd.Series): samples in the seq_table
seqs (pd.Series): sequences in the seq_table
- Additional Methods:
about: print a summary of the seq_table
analysis: accessor to SeqTableAnalyzer
-
__init__
(*args, **kwargs)¶ Initialize a SeqTable instance
Additional kwargs:
unit:
note (str): note for dataset/seq_table
use_sparse (bool): whether to store the seq_table values as a sparse matrix
-
about
()¶ Quick view of SeqTable
-
property
density
¶
-
describe
(percentiles=None, include=None, exclude=None)¶ return major stable statistics
-
filter_axis
(filter, axis=0, remove_empty=False, inplace=False)¶ Filter seq_table along one axis
- Parameters
filter (callable) – a callable applied to each row/column that returns a bool value
axis (0 or 1) – the axis to filter. 0: row, seqs; 1: column, sample
remove_empty (bool) – whether to remove empty columns/rows after filtering
inplace (bool) – whether to change the seq_table in place. If False, return a new seq_table
-
property
samples
¶
-
property
seqs
¶
-
update_analysis
()¶ update the accessor to SeqTableAnalyzer
-
k_seq.data.seq_data.
slice_table
(table, axis, keys, remove_empty=False)¶ Slice a pd.DataFrame seq_table with a list of key values or filter functions returning True/False along the given axis. Optionally remove all-zero entries
- Parameters
seq_table (pd.DataFrame) – seq_table to slice
keys – list of keys to preserve. If callable, it is applied to each row/column of seq_table and returns a bool to preserve (True) or discard (False)
axis (0 or 1) – which axis to filter
remove_empty (bool) – whether to remove empty columns/rows after filtering
k_seq.data.seq_data_analyzer¶
-
class
k_seq.data.seq_data_analyzer.
SeqDataAnalyzer
(seq_data)¶
-
class
k_seq.data.seq_data_analyzer.
SeqTableAnalyzer
(seq_table)¶
-
k_seq.data.seq_data_analyzer.
cross_table_compare
(base_table, compare_table, samples=None, ax=None, figsize=None, color_map=None, save_fig_to=None)¶
-
k_seq.data.seq_data_analyzer.
rep_variance_scatter
(seq_table, grouper, xaxis=None, subsample=None, xlog=True, ylog=True, xlim=None, ylim=None, group_title_pos=None, xlabel=None, ylabel=None, label_map=None, figsize=None, save_fig_to=None)¶ A scatter plot for measured value variance in replicates for each sequence
-
k_seq.data.seq_data_analyzer.
sample_entropy_scatterplot
(seq_table, black_list=None, normalize=False, base=2, color_map=None, figsize=None, scatter_kwargs=None, save_fig_to=None)¶
-
k_seq.data.seq_data_analyzer.
sample_info
(seq_data)¶ Summarize sample info for a SeqData, with info on total amount and spike-in
- Returns
A pd.DataFrame showing the summary for samples
-
k_seq.data.seq_data_analyzer.
sample_overview
(seq_table, axis=1)¶ Summarize sequences for a given seq_table, with info of unique seqs, total amount
- Returns
A pd.DataFrame showing the summary for sequences
-
k_seq.data.seq_data_analyzer.
sample_overview_plots
(seq_table, plot_unique_seq=True, plot_total_counts=True, plot_spike_in_frac=True, color_map=None, black_list=None, figsize=None, label_mapper=None, save_fig_to=None)¶ Overview plot(s) of unique seqs, total counts and spike-in fractions in the samples
- Parameters
seq_table (SeqData) – sample set to survey
plot_unique_seq (bool) – plot bar plot for unique sequences if True
plot_total_counts (bool) – plot bar plot for total counts if True
plot_spike_in_frac (bool) – plot scatter plot for spike in fraction if True
color_map (dict) – {sample_name: color} for all plots
black_list (list of str) – list of sample names to exclude from the plots
sep_plot (bool) – plot separate plots for unique sequences, total counts and spike_in fractions if True
label_mapper (dict or callable) – alternative labels for samples
save_fig_to (str) – save the figure to this location if not None
-
k_seq.data.seq_data_analyzer.
sample_rel_abun_hist
(seq_table, black_list=None, bins=None, x_log=True, y_log=False, ncol=None, nrow=None, figsize=None, hist_kwargs=None, save_fig_to=None)¶ todo: add pool counts composition curve for straight forward visualization
-
k_seq.data.seq_data_analyzer.
sample_spike_in_ratio_scatterplot
(seq_table, black_list=None, ax=None, save_fig_to=None, figsize=None, label_mapper=None, scatter_kwargs=None)¶ Scatter plot of spike in ratio in the pool
-
k_seq.data.seq_data_analyzer.
sample_total_reads_barplot
(seq_table, black_list=None, logy=False, ax=None, save_fig_to=None, figsize=None, x_label=None, y_label='Total reads', fontsize=14, label_mapper=None, barplot_kwargs=None)¶ Barplot of total counts (sum over sequences) in each sample
-
k_seq.data.seq_data_analyzer.
sample_unique_seqs_barplot
(seq_table, black_list=None, logy=False, ax=None, save_fig_to=None, figsize=None, x_label=None, y_label='Unique sequences', fontsize=14, label_mapper=None, barplot_kwargs=None)¶ Barplot of unique seqs in each sample
-
k_seq.data.seq_data_analyzer.
seq_length_dist
(seq_table, axis=0, ax=None, figsize=(6, 3), bins=20, logx=False, logy=False, hist_kwargs=None, save_fig_to=None)¶ Histogram of the length distribution of unique sequences
-
k_seq.data.seq_data_analyzer.
seq_mean_value_detected_samples_scatterplot
(seq_table, figsize=5, ylabel='counts', ylog=True, subsample=None, color='#1F77B4', marker_size=5, scatter_kwargs=None)¶ Joint plot of the mean value (e.g. count) of a sequence across samples and the number of samples in which it is detected, with one scatter plot (x: number of samples detected, y: mean value)
and two histograms showing the distribution in each dimension
-
k_seq.data.seq_data_analyzer.
seq_overview
(seq_table, axis=0)¶ Summarize sequences in seq_table, with info on seq length, samples detected, mean, sd
- Returns
A pd.DataFrame showing the summary for sequences
-
k_seq.data.seq_data_analyzer.
seq_variance
(seq_table, grouper)¶ Get the spread (standard deviation) of sequence abundance across replicates, provided by grouper
- Returns
if single group, returns a pd.DataFrame with columns (‘mean’, ‘sd’)
if multiple groups, returns two pd.DataFrame (mean, sd) with columns of each group
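For the single-group case, the per-sequence mean and standard deviation across replicates can be sketched with the standard library. The function name `replicate_spread`, the dict-based table, and the use of the sample standard deviation (n - 1 in the denominator) are assumptions; the source does not state which estimator is used.

```python
from statistics import mean, stdev


def replicate_spread(seq_values, group):
    """Per-sequence mean and sample standard deviation over the samples in `group`.

    seq_values: {sequence: {sample: value}}; group: list of replicate sample names.
    """
    summary = {}
    for seq, by_sample in seq_values.items():
        values = [by_sample[s] for s in group]
        summary[seq] = {"mean": mean(values), "sd": stdev(values)}
    return summary


table = {
    "AAA": {"r1": 10, "r2": 12, "r3": 14},
    "AAC": {"r1": 5, "r2": 5, "r3": 5},
}
print(replicate_spread(table, ["r1", "r2", "r3"]))
```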
k_seq.data.filters module¶
k_seq.data.grouper module¶
Groupers slice seq_table into groups, e.g. input samples, reacted samples, different concentrations.
TODO: simplify code
-
class
k_seq.data.grouper.
Grouper
(group, target=None, axis=1)¶ Bases:
object
Grouper of samples/sequences
- Two types of grouper accepted:
Type 0 (list): initialize with group as list-like. This defines a single set of samples/sequences
Type 1 (dict): initialize with group as dict. This defines a collection of groups of samples/sequences
-
target
¶ accessor for seq_table to group
- Type
pd.DataFrame
-
axis
¶ axis to apply grouping (0 for index, 1 for columns)
- Type
0 or 1
-
group
¶ dictionary with structure {group_name: group_members}
- Type
list or dict
-
type
¶ type of the grouper
- Type
0 or 1
-
__init__
(group, target=None, axis=1)¶ Initialize a Grouper instance
- Parameters
group (list or dict) – list creates a Type 0 Grouper (single group) and dict creates a Type 1 Grouper (multiple groups)
target (pd.DataFrame) – optional, target seq_table
axis (0 or 1) – axis to apply the grouper
-
get_table
(group=None, target=None, axis=None, remove_zero=False)¶ Return a sub-seq_table from target given group
-
split
(target=None, remove_zero=False)¶
-
class
k_seq.data.grouper.
GrouperCollection
(**kwargs)¶ Bases:
k_seq.utility.func_tools.AttrScope
A collection of groupers
-
add
(**kwargs)¶ add a grouper
-
-
k_seq.data.grouper.
get_group
(table, group, axis=1, remove_empty=False)¶ Slice or split the table based on the group
- Parameters
table (pd.DataFrame or SeqTable) – target table to group
group (list-like or dict of list) – single group or a set of groups
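The difference between slicing with a single group and splitting with a set of groups can be sketched on a plain dict-of-dicts table. The name `get_group_sketch` and the `{sample: {sequence: value}}` layout are illustrative assumptions; the real function operates on pd.DataFrame or SeqTable objects and also supports `axis` and `remove_empty`.

```python
def get_group_sketch(table, group):
    """Slice (list) or split (dict of lists) a {sample: {sequence: value}} table by samples."""
    if isinstance(group, dict):
        # A set of named groups: return one sub-table per group
        return {name: get_group_sketch(table, members) for name, members in group.items()}
    # A single group: return one sub-table
    return {sample: table[sample] for sample in group}


table = {
    "input": {"AAA": 50, "AAC": 50},
    "rep1": {"AAA": 80, "AAC": 20},
    "rep2": {"AAA": 75, "AAC": 25},
}
print(get_group_sketch(table, ["rep1", "rep2"]))
print(get_group_sketch(table, {"input": ["input"], "reacted": ["rep1", "rep2"]}))
```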
k_seq.data.variant_pool module¶
Function for variant pool design
-
k_seq.data.variant_pool.
combination
(n, k)¶
-
k_seq.data.variant_pool.
d_mutant_fraction
(d, mutation_rate, length=21, letter_book_size=4)¶ Relative abundance of a single d-th order mutant
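A worked sketch of this quantity, assuming each position mutates independently with probability `mutation_rate` and mutates uniformly to one of the other `letter_book_size - 1` letters. Under that assumption a specific mutant with d substitutions has relative abundance (rate / (s - 1))^d * (1 - rate)^(L - d). The function names below are hypothetical, and the library may use a different model.

```python
from math import comb


def single_mutant_fraction(d, mutation_rate, length=21, letter_book_size=4):
    """Relative abundance of one specific sequence carrying d substitutions."""
    per_letter = mutation_rate / (letter_book_size - 1)
    return per_letter ** d * (1 - mutation_rate) ** (length - d)


def count_d_mutants(d, length=21, letter_book_size=4):
    """Number of distinct sequences with exactly d substitutions."""
    return comb(length, d) * (letter_book_size - 1) ** d


# Sanity check: summed over all mutation orders, the fractions cover the whole pool.
total = sum(count_d_mutants(d) * single_mutant_fraction(d, 0.09) for d in range(22))
print(round(total, 12))  # 1.0
```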
-
k_seq.data.variant_pool.
neighbor_effect_error
(xi, d, eta=0.09, L=21)¶ Fraction of reads from neighboring sequences due to sequencing error
-
k_seq.data.variant_pool.
neighbor_effect_observation
(xi, d, eta=0.09, L=21)¶ Get the ratio of observed abundance for a d-th order mutant, considering the neighbor effect under given sequencing error rate (xi)
-
k_seq.data.variant_pool.
num_of_seq
(d, length=21, letter_book_size=4)¶ Expected number of
k_seq.data.simu module¶
Module contains code to simulate data
-
class
k_seq.data.simu.
DistGenerators
¶ Bases:
object
A collection of random value generators for commonly used distributions. uniq_seq_num independent draws from the distribution are returned
- Available distributions:
lognormal
uniform
compo_lognormal
-
static
compo_lognormal
(size, loc=None, scale=None, c95=None, seed=None)¶ Sample a pool composition from a log-normal distribution specified by loc and scale, or by c95
Example
scale = 0 means an evenly distributed pool in which all components have relative abundance 1/uniq_seq_num
- Parameters
size (int) – uniq_seq_num of the pool
loc (float) – center of log-normal distribution
scale (float) – log variance of the distribution
c95 ([float, float]) – 95% percentile of log-normal distribution
seed – random seed
-
static
lognormal
(size=None, loc=None, scale=None, c95=None, seed=None)¶ Sample from a log-normal distribution specified by loc and scale, or by c95
- Parameters
size (int) – number of values to draw
loc (float) – center of log-normal distribution, default 0
scale (float) – log variance of the distribution, default 0
c95 ([float, float]) – 95% percentile of log-normal distribution
seed – random seed
- Returns
a draw from distribution with given uniq_seq_num
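A sketch of how loc/scale and c95 could relate, under two assumptions that the text does not confirm: c95 is read as the central 95% interval (low, high) of the distribution, and scale is treated as the standard deviation of the logarithm. The function names are hypothetical, and the standard-library `random` module stands in for whatever generator the library uses.

```python
import math
import random


def lognormal_params_from_c95(c95):
    """Convert an assumed central 95% interval (low, high) into (loc, scale) of ln(X)."""
    low, high = c95
    loc = (math.log(low) + math.log(high)) / 2
    scale = (math.log(high) - math.log(low)) / (2 * 1.959964)  # 97.5% normal quantile
    return loc, scale


def sample_composition(size, loc=0.0, scale=0.0, c95=None, seed=None):
    """Draw `size` log-normal values and normalize them to sum to 1."""
    if c95 is not None:
        loc, scale = lognormal_params_from_c95(c95)
    rng = random.Random(seed)
    draws = [rng.lognormvariate(loc, scale) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


print(sample_composition(4, scale=0.0))  # even pool: [0.25, 0.25, 0.25, 0.25]
print(sample_composition(4, c95=(0.01, 1.0), seed=23))
```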
-
static
uniform
(low=None, high=None, size=None, seed=None)¶ Sample from a uniform distribution
-
class
k_seq.data.simu.
PoolParamGenerator
¶ Bases:
object
Functions to generate parameters for a set of sequences in a sequence pool
-
- sample_from_iid_dist
-
- sample_from_dataframe
- Returns
pd.DataFrame with columns of parameters and rows of simulated parameter for each sequence
-
classmethod
sample_from_dataframe
(df, uniq_seq_num, replace=True, weights=None, seed=None)¶ Simulate parameters by resampling rows of a given dataframe
- Parameters
df – dataframe containing parameters as columns to sample from; needs to have p0 as one column for a heterogeneous pool
uniq_seq_num (int) – number of unique sequences from simulation
-
static
sample_from_iid_dist
(uniq_seq_num, seed=None, **param_generators)¶ Simulate the seq parameters from individual draws of distributions
Parameter:
p0: initial fraction of each sequence, for an uneven pool
other parameters depend on the model, e.g. the first-order model needs to include k and A
- Accepted parameter input:
list-like: if its length does not match the expected uniq_seq_num, resample with replacement
generator: given generator returns a random parameter
callable: if it takes uniq_seq_num as an argument, it needs to return a vector of uniq_seq_num sampled parameters, or a generator producing such a vector; if it does not take uniq_seq_num as an argument, it needs to return a single sample
Args:
p0 (list-like, generator, or callable): reserved argument for initial pool composition (fraction)
uniq_seq_num (int): number of unique sequences from simulation
seed (int): random seed for repeatability
param_generators (kwargs of list-like, generator, or callable): parameter generators depending on the model
- Returns
a n_row = uniq_seq_num pd.DataFrame contains generated parameters
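The three accepted input kinds described above (list-like, generator, callable) can be dispatched roughly as in the sketch below. The helper name `draw_parameter` and the use of `inspect.signature` to detect a `uniq_seq_num` argument are assumptions for illustration, not the library's code.

```python
import inspect
import random
from itertools import islice


def draw_parameter(source, uniq_seq_num, rng):
    """Return a list of `uniq_seq_num` values from a list-like, generator, or callable."""
    if callable(source):
        if "uniq_seq_num" in inspect.signature(source).parameters:
            source = source(uniq_seq_num=uniq_seq_num)  # returns a vector or a generator
        else:
            return [source() for _ in range(uniq_seq_num)]  # one sample per call
    if inspect.isgenerator(source):
        return list(islice(source, uniq_seq_num))
    values = list(source)
    if len(values) != uniq_seq_num:
        values = rng.choices(values, k=uniq_seq_num)  # resample with replacement
    return values


rng = random.Random(0)
print(draw_parameter([0.1, 0.5], 4, rng))                               # resampled list
print(draw_parameter(lambda: rng.random(), 2, rng))                     # callable, single sample
print(draw_parameter(lambda uniq_seq_num: [1.0] * uniq_seq_num, 3, rng))  # callable, vector
```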
-
-
class
k_seq.data.simu.
SimulationResults
(dataset_dir, result_dir)¶ Bases:
object
Class to load simulation result
-
__init__
(dataset_dir, result_dir)¶ Survey estimation results
- load fitting results from result_dir/fit_summary.csv
- load truth and input count info from dataset_dir/truth.csv and input_counts
Optional to include:
- input_counts: counts of sequences in the input pool
- mean_counts: mean counts in all samples (input and reacted)
- Returns
results: table of estimated k, A, kA
truth: table of true k, A, p0, ka, and input_counts
seq_list: list of indices of sequences that could be estimated
-
get_est_results
(param, pred_type='point_est')¶ Return the estimation (pred) and truth of given parameter
-
get_fold_range
(param)¶ Return the ratio of 97.5-percentile to 2.5-percentile
-
get_uncertainty_accuracy
(param, pred_type='bs_ci95')¶ Return the accuracy of uncertainty estimation if uncertainty range includes the truth
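The two summary statistics named above can be sketched in plain Python: the fold range as the ratio of the 97.5th to the 2.5th percentile of a set of estimates, and the uncertainty accuracy as the fraction of intervals that contain the true value. The function names, the linear-interpolation percentile, and the inputs are assumptions made for this example; they are not the SimulationResults methods.

```python
def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def fold_range(estimates):
    """Ratio of the 97.5th percentile to the 2.5th percentile."""
    return percentile(estimates, 97.5) / percentile(estimates, 2.5)


def coverage(intervals, truth):
    """Fraction of (low, high) intervals that contain the matching true value."""
    hits = [low <= t <= high for (low, high), t in zip(intervals, truth)]
    return sum(hits) / len(hits)


print(fold_range(list(range(1, 42))))                    # 20.0
print(coverage([(1, 3), (2, 4), (5, 6)], [2, 5, 5.5]))   # 2 of 3 intervals cover the truth
```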
-
-
k_seq.data.simu.
get_pct_gaussian_error
(rate)¶ Return a function to apply Gaussian error proportional to the value
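A minimal sketch of such an error-function factory, assuming the noise has a standard deviation equal to `rate` times the value. The name `pct_gaussian_error`, the `seed` argument, and the use of the standard-library generator are illustrative assumptions.

```python
import random


def pct_gaussian_error(rate, seed=None):
    """Return f(value) that adds Gaussian noise with sd = rate * |value|."""
    rng = random.Random(seed)

    def apply(value):
        return value + rng.gauss(0.0, rate * abs(value))

    return apply


add_error = pct_gaussian_error(0.1, seed=23)
print([add_error(100.0) for _ in range(3)])  # values scattered around 100 with sd 10
```

Returning a closure keeps the error rate fixed while letting the same function be applied to any amount, which matches how `total_amount_error` is described as accepting "any error function on the DNA amount".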
-
k_seq.data.simu.
simulate_counts
(uniq_seq_num, x_values, total_reads, p0_generator=None, kinetic_model=None, count_model=None, total_amount_error=None, param_sample_from_df=None, weights=None, replace=True, reps=1, seed=None, note=None, save_to=None, **param_generators)¶ Simulate sequencing count dataset given kinetic and count model
- Procedure:
Parameters for each unique sequence are sampled from param_sample_from_df and kwargs (param_generators). It is an even pool if p0 is not provided. No repeated parameters.
The reacted amount / fraction of sequences is simulated for each controlled variable in x_values.
Counts with the given total_reads are simulated for the input pool and the reacted pools.
- Parameters
uniq_seq_num (int) – Number of unique sequences from simulation
x_values (list-like) – list of controlled variables in each experiment setup; a negative value means it is an initial pool
total_reads –
p0 (list-like, generator, or callable) – reserved argument for initial pool composition (fraction)
kinetic_model (callable) – model the amount of sequences in reaction given input variables. Default BYOModel.amount_first_order
count_model (callable) – model the sequencing counts w.r.t. total reads and pool composition. Default MultiNomial model.
param_sample_from_df (pd.DataFrame) – optional to sample sequences from given table
weights (list or str) – weights/col of weight for sampling from table
total_amount_error (float or callable) – float as the standard deviation for a fixed Gaussian error, or any error function on the DNA amount. Use 0 for no introduced error.
conv_reps –
seed (int) – random seed for repeatability
save_to (str) – optional, path to save the simulation results with x, Y, truth csv file and a pickled SeqData object
param_generator –
- Returns
x (pd.DataFrame): c, n value for samples
Y (pd.DataFrame): simulated sequence counts for given samples
param_table (pd.DataFrame): table listing the parameters for simulated sequences
seq_table (data.SeqData): a SeqData object that stores all the data
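The procedure above can be sketched end to end in plain Python. The kinetic form used here, A * (1 - exp(-alpha * t * k * c)), is an assumption about what a first-order model looks like, with alpha and t taken from the values quoted elsewhere in this documentation; the multinomial draw is emulated with `random.choices`. Function names and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def reacted_fraction(k, A, c, alpha=0.479, t=90):
    """Assumed first-order model: A * (1 - exp(-alpha * t * k * c))."""
    return A * (1 - math.exp(-alpha * t * k * c))


def multinomial_counts(composition, total_reads, rng):
    """Draw total_reads reads; composition maps sequence -> relative abundance."""
    seqs = list(composition)
    reads = rng.choices(seqs, weights=[composition[s] for s in seqs], k=total_reads)
    drawn = Counter(reads)
    return {s: drawn.get(s, 0) for s in seqs}


rng = random.Random(23)
params = {  # hypothetical true parameters for two sequences
    "seq1": {"p0": 0.5, "k": 1000.0, "A": 0.9},
    "seq2": {"p0": 0.5, "k": 10.0, "A": 0.2},
}
for c in (-1, 2e-6, 250e-6):  # a negative value marks the input pool
    if c < 0:
        pool = {s: p["p0"] for s, p in params.items()}
    else:
        pool = {s: p["p0"] * reacted_fraction(p["k"], p["A"], c) for s, p in params.items()}
    print(c, multinomial_counts(pool, total_reads=1000, rng=rng))
```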
-
k_seq.data.simu.
simulate_on_byo_doped_condition_from_exp_results
(dataset, fitting_res, uniq_seq_num=None, x_values=None, total_reads=None, sequencing_depth=40, n_input=1, table_name='original', total_dna_error_rate=0.1, seed=23, plot_dist=False, save_to=None)¶ Simulate a k-seq count dataset based on the experimental conditions of the BYO-doped pool, where
t: reaction time (90 min)
alpha: degradation ratio of BYO (0.479)
x_values: controlled BYO concentration points: 1 input pool with triple sequencing depth, and 5 BYO concentrations with triplicates:
[-1 (input pool), 2e-6, 2e-6, 2e-6, 10e-6, 10e-6, 10e-6, 50e-6, 50e-6, 50e-6, 250e-6, 250e-6, 250e-6, 1260e-6, 1260e-6, 1260e-6]
- Parameters for each sequence are sampled from previous point estimate results:
point_est_csv: load point estimate results to extract estimated k and A
seqtable_path: path to load the input sample SeqData object to get p0 information
Returns:
x (pd.DataFrame): controlled variable (c, n) for samples
Y (pd.DataFrame): simulated sequence counts for given samples
param_table (pd.DataFrame): table listing the parameters for simulated sequences, including p0, k, A, kA
truth (pd.DataFrame): true values of parameters (e.g. p0, k, A) for simulated sequences
seq_table (data.SeqData): a SeqData object that stores all the data
-
k_seq.data.simu.
simulate_w_byo_doped_condition_from_param_dist
(uniq_seq_num, depth, p0_loc, p0_scale, k_95, total_dna_error_rate=0.1, seed=23, save_to=None, plot_dist=True)¶ Deprecated. Simulate a k-seq count dataset similar to the experimental conditions of the BYO-doped pool, where
t: reaction time (90 min)
alpha: degradation ratio of BYO (0.479)
x_values: controlled BYO concentration points: 1 input pool with triple sequencing depth, and 5 BYO concentrations with triplicates:
[-1 (input pool), 2e-6, 2e-6, 2e-6, 10e-6, 10e-6, 10e-6, 50e-6, 50e-6, 50e-6, 250e-6, 250e-6, 250e-6, 1260e-6, 1260e-6, 1260e-6]
- Parameters for each sequence are sampled from the given distributions defined by the arguments:
p0: log normal from exp(N(p0_loc, p0_scale))
k: log normal from k_95 95-percentile for k
A: uniform from [0, 1]
Other args:
uniq_seq_num (int): number of unique sequences from simulation
depth (int or float): sequencing depth, defined as mean reads per sequence
total_amount_error (float or callable): float as the standard deviation for a fixed Gaussian error, or any error function on the DNA amount. Use 0 for no introduced error.
save_to (str): optional, path to save the simulation results with x, Y, truth csv file and a pickled SeqData object
plot_dist (bool): whether to plot pairwise figures of the distributions of simulated parameters (p0, A, k, kA)
Returns:
x (pd.DataFrame): controlled variable (c, n) for samples
Y (pd.DataFrame): simulated sequence counts for given samples
param_table (pd.DataFrame): table listing the parameters for simulated sequences, including p0, k, A, kA
truth (pd.DataFrame): true values of parameters (e.g. p0, k, A) for simulated sequences
seq_table (data.SeqData): a SeqData object that stores all the data