k_seq.data: data preprocessing

Modules for data handling, including:

  • preprocessing: core module for data preprocessing, from count files to SequenceSeq for the estimator

  • io: utility functions for reading, writing, and converting between file formats

  • analysis: functions for additional analysis of sequencing samples or reads, for sample investigation and quality control of the sample pipeline

  • simu: code to generate the simulated data used in the analysis in the paper (TODO: add paper citation)

k_seq.data.axis_mapper(axis)

Map the name of a data table axis to its axis number:

  • axis 0: seq, sequence, sequences

  • axis 1: sample, observation, obs
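The mapping above can be illustrated with a minimal sketch. The function name axis_mapper_sketch, the case-insensitive matching, and the error raised for unknown names are assumptions for illustration, not the package's implementation.

```python
def axis_mapper_sketch(axis):
    """Hypothetical stand-in: map an axis name (or number) to 0 or 1."""
    names = {
        0: 0, "seq": 0, "sequence": 0, "sequences": 0,
        1: 1, "sample": 1, "observation": 1, "obs": 1,
    }
    # Assumption: names are matched case-insensitively
    key = axis.lower() if isinstance(axis, str) else axis
    if key not in names:
        raise ValueError(f"Unknown axis: {axis!r}")
    return names[key]


print(axis_mapper_sketch("sequences"), axis_mapper_sketch("obs"))
```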

k_seq.data.preprocess

This module contains the following preprocessing functions for sequencing data:

  1. Preprocess DNA sequencing results from fastq.gz files using the EasyDIVER pipeline (https://github.com/ichen-lab-ucsb/EasyDIVER), including joining paired-end reads, trimming primers, pooling the same sample across lanes, and deduplicating to generate count files of unique sequences.

  2. Parse count files and load them as a SeqTable object

k_seq.data.preprocess.fastq_to_count(fastq_root, output_path='pipeline-output', threads=12, forward_primer=None, reverse_primer=None, pandas_abs_match=True, join_first=True)

Call fastq_to_count.sh as a subprocess to join, pool, and de-duplicate raw reads using the EasyDIVER pipeline (https://github.com/ichen-lab-ucsb/EasyDIVER)

Parameters
  • fastq_root (path) – path to the root folder that contains the paired-end raw FASTQ.gz files

  • output_path (path) – path to save output. By default it saves to ‘pipeline-output’ folder

  • threads (int) – number of threads to run pandaSeq for joining. By default it uses all CPUs

  • forward_primer (str) – specify the forward primer for pandaSeq joining and trimming

  • reverse_primer (str) – specify the reverse primer for pandaSeq joining and trimming

  • pandas_abs_match (bool) – whether to enforce absolute matching in pandaSeq joining. Default True.

  • join_first (bool) – whether to join before trimming in pandaSeq. Suitable for heavily overlapped paired-end reads. Default True.

Example

TODO: add an example

k_seq.data.preprocess.load_Seqtable_from_count_files(count_files, file_list=None, pattern_filter=None, black_list=None, name_template=None, sort_by=None, x_values=None, x_unit=None, input_sample_name=None, sample_metadata=None, note=None, dry_run=False)

Create a SeqData instance from count files.

Multiple sources of count files are supported; specify them in count_files

Parameters
  • count_files (str, or list of dict) – root directory to search for count files. To load multiple sources, provide a list of dicts of keyword arguments in the format [{'count_files': path/to/count/files (required), 'name_template': string for name template, 'pattern_filter': string for pattern filter, 'file_list': list of files to load, 'black_list': list of files to exclude}]

  • file_list (list of str) – optional, only includes the count files with names in the file_list

  • pattern_filter (str) – optional, filter file names based on this pattern, wildcards */? are allowed

  • black_list (list of str) – optional, file names included in black_list will be excluded

  • name_template (str) – naming convention used to extract metadata. Use [...] to mark the region of sample_name and {domain_name[, int/float]} to mark a domain to extract as metadata. Including int or float converts the domain value to int/float if applicable; otherwise it is kept as a string

  • sort_by (str) – sort the order of samples based on given domain

  • dry_run (bool) – only return the parsed count file names and metadata without actually reading the data

  • x_values (list-like, or dict) – optional. Values for controlled variables. If list-like, should have the same length and order as samples; if dict, should have sample names as keys

  • x_unit (str) – optional. Unit for controlled variable. e.g. ‘uM’

  • input_sample_name (list of str) – optional. Indicate input samples (unreacted)

  • sample_metadata (dict of objects) – optional. Extra sample metadata

  • note (str) – Note for dataset/seq_table

Example

Example of metadata extraction from a pattern:

>>> metadata = extract_metadata(
...     sample_name="R4B-1250A_S16_counts.txt",
...     template="R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt"
... )

>>> metadata
{
    'name': 'B-1250A_S16',
    'exp_rep': 'B',
    'concentration': 1250.0,
    'seq_rep': 'A',
    'id': 16
}
Notice: two back-to-back domains can only be parsed if one of them is numeric and the other is alphabetic, and a missing value will raise an error.

Valid: matching '-A1-' to '-{sample}{replicate, int}-' gives {'sample': 'A', 'replicate': 1}

Not valid: matching '-A-' to '-{sample}{replicate, int}-' will cause an error; matching '-AA-' to '-{sample}{replicate}-' will cause an error
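The template syntax described above can be sketched with a regular expression. This is not the package's extract_metadata; the function name extract_metadata_sketch and the patterns used for each domain type (digits for int/float, letters for plain domains) are assumptions for illustration. The sketch does not try to detect the ambiguous back-to-back case of two alphabetic domains.

```python
import re

# Assumed patterns per domain type; the real parser may differ
_PATTERNS = {"int": r"\d+", "float": r"\d+(?:\.\d+)?", "str": r"[A-Za-z]+"}
_CASTS = {"int": int, "float": float, "str": str}


def extract_metadata_sketch(sample_name, template):
    """Hypothetical parser for '[...]' name regions and '{domain[, type]}' fields."""
    casts = {}
    regex = ""
    # Split into '[', ']', '{...}' tokens and the literal text between them
    for token in re.split(r"(\[|\]|\{[^{}]*\})", template):
        if token == "[":
            regex += "(?P<name>"
        elif token == "]":
            regex += ")"
        elif token.startswith("{") and token.endswith("}"):
            domain, _, kind = (part.strip() for part in token[1:-1].partition(","))
            kind = kind or "str"
            casts[domain] = _CASTS[kind]
            regex += f"(?P<{domain}>{_PATTERNS[kind]})"
        else:
            regex += re.escape(token)
    match = re.fullmatch(regex, sample_name)
    if match is None:
        raise ValueError(f"{sample_name!r} does not match template {template!r}")
    # Without a '[...]' region, fall back to the full file name
    metadata = {"name": match.groupdict().get("name", sample_name)}
    for domain, cast in casts.items():
        metadata[domain] = cast(match.group(domain))
    return metadata


print(extract_metadata_sketch(
    "R4B-1250A_S16_counts.txt",
    "R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt",
))
```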

k_seq.data.preprocess.read_count_file(file_path, as_dict=False, number_only=False)

Read a single count file generated by the EasyDIVER pipeline.

Count file format:

number of unique sequences = 2825
total number of molecules = 29348173

AAAAAAAACACCACACA               2636463
AATATTACATCATCTATC              86763
...
Parameters
  • file_path (str) – full path to the count file

  • as_dict (bool) – return a dictionary instead of a pd.DataFrame

  • number_only (bool) – only return number of unique seqs and total counts if True

Returns
  • unique_seqs (int) – number of unique sequences in the count file

  • total_counts (int) – number of total reads in the count file

  • sequence_counts (pd.DataFrame) – with sequence as index and counts as the first column
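A minimal sketch of parsing the count file format shown above, using only the standard library. The function name parse_count_lines is hypothetical, the header lines are assumed to be the two "key = value" lines in the example, and a plain dict stands in for the pd.DataFrame. The sample text is toy data.

```python
import io


def parse_count_lines(lines):
    """Hypothetical parser: return (unique_seqs, total_counts, {sequence: count})."""
    unique_seqs = total_counts = None
    sequence_counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip the blank separator line
        if "=" in line:
            # Header lines, e.g. "number of unique sequences = 2825"
            key, _, value = line.partition("=")
            if "unique" in key:
                unique_seqs = int(value)
            elif "total" in key:
                total_counts = int(value)
            continue
        seq, count = line.split()
        sequence_counts[seq] = int(count)
    return unique_seqs, total_counts, sequence_counts


toy_file = io.StringIO(
    "number of unique sequences = 2\n"
    "total number of molecules = 2723226\n"
    "\n"
    "AAAAAAAACACCACACA               2636463\n"
    "AATATTACATCATCTATC              86763\n"
)
print(parse_count_lines(toy_file))
```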

k_seq.data.seq_data

Submodule for SeqData, a feature-rich class built on seq_table for manipulating sequencing data. This module contains methods for data preprocessing, from count files to CountFile for the estimator. For absolute quantification, it accepts an absolute amount (e.g. measured by qPCR) or a reacted fraction.

TODO: write an output function for each class as a JSON file and a readable folder

class k_seq.data.seq_data.SeqData(data, data_unit=None, sample_list=None, seq_list=None, data_note=None, use_sparse=True, seq_metadata=None, sample_metadata=None, grouper=None, x_values=None, x_unit=None, note=None, dataset_metadata=None)

Bases: object

Data instance to store k-seq result

seq_table

Accessor for the stored tables, including the original table created during initialization. Stored tables should be pd.DataFrame or SeqTable

Type

AttrScope

Other attributes:

  • x_values (list-like, or dict) – optional. Values for controlled variables. If list-like, should have the same length and order as samples; if dict, should have sample names as keys

  • x_unit (str) – optional. Unit for controlled variable. e.g. 'uM'

  • seqs

  • samples

  • metadata (AttrScope) – accessor for metadata of the dataset, which includes:

      • sample (AttrScope) – collection of metadata for samples, if applicable

      • seq (AttrScope) – collection of metadata for seqs, if applicable

      • created_time (timestamp) – datetime when the instance was created

      • note (str) – note for the dataset

      • other dataset metadata objects could be added

  • analysis (AttrScope) – accessor to some pre-built analyses

Plugins:

  • grouper (GrouperCollection) – collection of Groupers to slice subtables

  • spike_in (SpikeInNormalizer) – optional. Accessor to the normalizer using spike-in

  • sample_total (TotalAmountNormalizer) – optional. Accessor to the normalizer using the total sample amount of seqs

  • analysis (SeqDataAnalyzer) – built-in analysis tools to analyze a SeqData object

__init__(data, data_unit=None, sample_list=None, seq_list=None, data_note=None, use_sparse=True, seq_metadata=None, sample_metadata=None, grouper=None, x_values=None, x_unit=None, note=None, dataset_metadata=None)

Initialize a SeqData object

Parameters
  • data (pd.DataFrame or np.ndarray) – 2-D data with sequences as indices and samples as columns. If data is a pd.DataFrame, the values in its index and columns are used as sequences and samples; if data is a 2-D np.ndarray, sample_list and seq_list are needed, with the same length and order as data

  • data_unit (str) – the unit of seq_table values, e.g. counts, ng, M. Default counts.

  • sample_list (list-like) – list of samples, which should match the columns of the seq_table data

  • seq_list – list of sequences, needed when data is a np.ndarray

  • data_note (str) – note for the data seq_table

  • use_sparse (bool) – whether to store the seq_table values as a sparse matrix

  • seq_metadata (dict of objects) – optional. Extra seq metadata

  • grouper (dict of list, or dict of dict of list) – optional. A dict of list (Type 1) or a dict of dict of list (Type 2) used to create the grouper plugin

  • x_values (list-like, or dict) – optional. Values for controlled variables. If list-like, should have the same length and order as samples; if dict, should have sample names as keys

  • x_unit (str) – optional. Unit for controlled variable. e.g. 'uM'

  • note (str) – note for dataset/seq_table

  • dataset_metadata (dict of objects) – optional. Extra dataset metadata

add_grouper(**kwargs)

Add a GrouperCollection accessor to the SeqData if it does not exist yet, then add Groupers to the accessor. Each Grouper instance is initialized from keyword arguments with a dictionary of:

  • group (list or dict) – a list creates a Type 0 Grouper (single group) and a dict creates a Type 1 Grouper (multiple groups)

  • target (pd.DataFrame) – optional, target table

  • axis (0 or 1) – axis to apply the grouper

add_sample_total(total_amounts, full_table, unit=None)

Add a TotalAmountNormalizer, available as the accessor sample_total, to quantify sequences using the total amount in each sample

Parameters
  • total_amounts (dict or pd.Series) – Total DNA amount for samples measured in experiment

  • full_table (pd.DataFrame) – seq_table on which the total amounts were measured and to which values are normalized

  • unit (str) – Unit of amount measured
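The idea of quantifying by total amount can be sketched as follows. This is an assumed reading of the normalizer (each sequence receives a share of the measured total amount proportional to its count in the full table), with plain dicts standing in for pd.DataFrame and pd.Series; quantify_by_total_amount is a hypothetical name.

```python
def quantify_by_total_amount(counts, total_amounts):
    """Hypothetical sketch.

    counts: {sample: {sequence: count}} taken from the full table
    total_amounts: {sample: total DNA amount measured in the experiment}
    """
    amounts = {}
    for sample, seq_counts in counts.items():
        total_count = sum(seq_counts.values())
        factor = total_amounts[sample] / total_count  # amount per read
        amounts[sample] = {seq: n * factor for seq, n in seq_counts.items()}
    return amounts


counts = {"s1": {"AAA": 30, "CCC": 70}}
print(quantify_by_total_amount(counts, {"s1": 200.0}))
```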

add_spike_in(base_table, spike_in_seq, spike_in_amount, radius=2, unit=None, dist_type='edit')

Add a SpikeInNormalizer, available as the accessor spike_in of the instance, to quantify seq amounts using a spike-in sequence

Parameters
  • base_table (pd.DataFrame) – table that includes the spike-in sequences, used to calculate the normalization factor

  • spike_in_seq (str) – center sequence for spike-in

  • spike_in_amount (list-like, dict, or pd.Series) – added spike-in amount. A dict or pd.Series should be keyed by the samples in base_table; a list-like should have the same length as the number of samples (columns) in base_table

  • radius (int) – radius of the spike-in peak; seqs at a distance less than or equal to radius from the center are spike-in seqs

  • unit (str) – unit of spike-in amount

  • dist_type (str) – ‘edit’ or ‘hamming’ as distance measure. Default ‘edit’ to include insertion / deletion
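A minimal sketch of the spike-in idea for a single sample: sequences within radius (edit distance) of the center sequence are pooled as the spike-in peak, and the known spike-in amount divided by the peak's count gives a per-read normalization factor. The function names and the exact definition of the factor are assumptions, not the SpikeInNormalizer implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance (substitution, insertion, deletion) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def spike_in_factor(counts, spike_in_seq, spike_in_amount, radius=2):
    """Hypothetical sketch: amount per read for one sample, from its spike-in peak."""
    peak = sum(n for seq, n in counts.items()
               if edit_distance(seq, spike_in_seq) <= radius)
    return spike_in_amount / peak


counts = {"AAAAAA": 90, "AAAAAT": 10, "CCCCCC": 400}
factor = spike_in_factor(counts, "AAAAAA", spike_in_amount=5.0)
print({seq: n * factor for seq, n in counts.items()})
```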

static from_count_files(**kwargs)

Create a SeqData instance from a folder of count files

Parameters
  • count_files (str) – root directory to search for count files

  • file_list (list of str) – optional, only includes the count files with names in the file_list

  • pattern_filter (str) – optional, filter file names based on this pattern, wildcards */? are allowed

  • black_list (list of str) – optional, file names included in black_list will be excluded

  • name_template (str) – naming convention used to extract metadata. Use [...] to mark the region of sample_name and {domain_name[, int/float]} to mark a domain to extract as metadata. Including int or float converts the domain value to int/float if applicable; otherwise it is kept as a string

  • sort_by (str) – sort the order of samples based on the given domain

  • dry_run (bool) – only return the parsed count file names and metadata without actually reading the data

  • x_values (list-like, or dict) – optional. Values for controlled variables. If list-like, should have the same length and order as samples; if dict, should have sample names as keys

  • x_unit (str) – optional. Unit for controlled variable. e.g. 'uM'

  • input_sample_name (list of str) – optional. Indicate input samples (unreacted)

  • sample_metadata (dict of objects) – optional. Extra sample metadata

  • note (str) – note for dataset/seq_table

Example

Example of metadata extraction from a pattern:

>>> metadata = extract_metadata(
...     sample_name="R4B-1250A_S16_counts.txt",
...     template="R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt"
... )

>>> metadata
{
    'name': 'B-1250A_S16',
    'exp_rep': 'B',
    'concentration': 1250.0,
    'seq_rep': 'A',
    'id': 16
}
Notice: two back-to-back domains can only be parsed if one of them is numeric and the other is alphabetic, and a missing value will raise an error.

Valid: matching '-A1-' to '-{sample}{replicate, int}-' gives {'sample': 'A', 'replicate': 1}

Not valid: matching '-A-' to '-{sample}{replicate, int}-' will cause an error; matching '-AA-' to '-{sample}{replicate}-' will cause an error

from_json()

TODO: add json

static from_pickle(path)
property samples
property seqs
to_json()

More generalized JSON file TODO: add to_json and from_json

to_pickle(path)
update_analysis()

Update accessor to SeqDataAnalyzer

class k_seq.data.seq_data.SeqTable(*args, **kwargs)

Bases: pandas.core.frame.DataFrame

Extended pd.DataFrame with added properties and functions for SeqTable

Additional Attributes:

  • unit (str) – unit of entries in this seq_table

  • note (str) – note for this seq_table

  • samples (pd.Series) – samples in the seq_table

  • seqs (pd.Series) – sequences in the seq_table

Additional Methods:

  • about – print a summary of the seq_table

  • analysis – accessor to SeqTableAnalyzer

__init__(*args, **kwargs)

Initialize a SeqTable instance. Additional kwargs:

  • unit

  • note (str) – note for dataset/seq_table

  • use_sparse (bool) – whether to store the seq_table values as a sparse matrix

about()

Quick view of SeqTable

property density
describe(percentiles=None, include=None, exclude=None)

Return major table statistics

filter_axis(filter, axis=0, remove_empty=False, inplace=False)

Filter seq_table along one axis

Parameters
  • filter (callable) – a callable applied to each row/column that returns a bool value

  • axis (0 or 1) – the axis to filter. 0: row, seqs; 1: column, sample

  • remove_empty (bool) – whether to remove the empty columns/rows after filtering

  • inplace (bool) – whether to change the seq_table in place. If False, return a new seq_table

property samples
property seqs
update_analysis()

update the accessor to SeqTableAnalyzer

k_seq.data.seq_data.slice_table(table, axis, keys, remove_empty=False)

Slice a pd.DataFrame seq_table along the given axis with a list of key values or filter functions returning True/False. Optionally remove all-zero entries.

Parameters
  • seq_table (pd.DataFrame) – seq_table to slice

  • keys – list of keys to preserve. If callable, it is applied to each row/column of seq_table and returns a bool: preserve (True) or discard (False)

  • axis (0 or 1) – which axis to filter

  • remove_empty (bool) – whether to remove the empty columns/rows after filtering

k_seq.data.seq_data_analyzer

class k_seq.data.seq_data_analyzer.SeqDataAnalyzer(seq_data)

Bases: k_seq.utility.func_tools.FuncToMethod

class k_seq.data.seq_data_analyzer.SeqTableAnalyzer(seq_table)

Bases: k_seq.utility.func_tools.FuncToMethod

k_seq.data.seq_data_analyzer.cross_table_compare(base_table, compare_table, samples=None, ax=None, figsize=None, color_map=None, save_fig_to=None)
k_seq.data.seq_data_analyzer.rep_variance_scatter(seq_table, grouper, xaxis=None, subsample=None, xlog=True, ylog=True, xlim=None, ylim=None, group_title_pos=None, xlabel=None, ylabel=None, label_map=None, figsize=None, save_fig_to=None)

A scatter plot for measured value variance in replicates for each sequence

k_seq.data.seq_data_analyzer.sample_entropy_scatterplot(seq_table, black_list=None, normalize=False, base=2, color_map=None, figsize=None, scatter_kwargs=None, save_fig_to=None)
k_seq.data.seq_data_analyzer.sample_info(seq_data)

Summarize sample info for a SeqData, with info on total amount and spike-in

Returns

A pd.DataFrame showing the summary for samples

k_seq.data.seq_data_analyzer.sample_overview(seq_table, axis=1)

Summarize the sequences in each sample of a given seq_table, with info on unique seqs and total amount

Returns

A pd.DataFrame showing the summary

k_seq.data.seq_data_analyzer.sample_overview_plots(seq_table, plot_unique_seq=True, plot_total_counts=True, plot_spike_in_frac=True, color_map=None, black_list=None, figsize=None, label_mapper=None, save_fig_to=None)

Overview plot(s) of unique seqs, total counts and spike-in fractions in the samples

Parameters
  • seq_table (SeqData) – sample set to survey

  • plot_unique_seq (bool) – plot bar plot for unique sequences if True

  • plot_total_counts (bool) – plot bar plot for total counts if True

  • plot_spike_in_frac (bool) – plot scatter plot for spike in fraction if True

  • color_map (dict) – {sample_name: color} for all plots

  • black_list (list of str) – list of sample names to exclude from the plots

  • sep_plot (bool) – plot separate plots for unique sequences, total counts and spike_in fractions if True

  • label_mapper (dict or callable) – alternative labels for samples

  • save_fig_to (str) – save figure to the directory if not None

k_seq.data.seq_data_analyzer.sample_rel_abun_hist(seq_table, black_list=None, bins=None, x_log=True, y_log=False, ncol=None, nrow=None, figsize=None, hist_kwargs=None, save_fig_to=None)

todo: add pool counts composition curve for straight forward visualization

k_seq.data.seq_data_analyzer.sample_spike_in_ratio_scatterplot(seq_table, black_list=None, ax=None, save_fig_to=None, figsize=None, label_mapper=None, scatter_kwargs=None)

Scatter plot of spike in ratio in the pool

k_seq.data.seq_data_analyzer.sample_total_reads_barplot(seq_table, black_list=None, logy=False, ax=None, save_fig_to=None, figsize=None, x_label=None, y_label='Total reads', fontsize=14, label_mapper=None, barplot_kwargs=None)

Barplot of total counts (sum over sequences) in each sample

k_seq.data.seq_data_analyzer.sample_unique_seqs_barplot(seq_table, black_list=None, logy=False, ax=None, save_fig_to=None, figsize=None, x_label=None, y_label='Unique sequences', fontsize=14, label_mapper=None, barplot_kwargs=None)

Barplot of unique seqs in each sample

k_seq.data.seq_data_analyzer.seq_length_dist(seq_table, axis=0, ax=None, figsize=(6, 3), bins=20, logx=False, logy=False, hist_kwargs=None, save_fig_to=None)

histogram of length distribution of unique sequences

k_seq.data.seq_data_analyzer.seq_mean_value_detected_samples_scatterplot(seq_table, figsize=5, ylabel='counts', ylog=True, subsample=None, color='#1F77B4', marker_size=5, scatter_kwargs=None)

Joint plot of the mean value (e.g. count) of a sequence across samples and the number of samples in which it is detected, with one scatter plot (x: number of samples detected, y: mean value) and two histograms showing the distribution in each dimension

k_seq.data.seq_data_analyzer.seq_overview(seq_table, axis=0)

Summarize the sequences in seq_table, with info on seq length, samples detected, mean, and sd

Returns

A pd.DataFrame showing the summary for sequences

k_seq.data.seq_data_analyzer.seq_variance(seq_table, grouper)

Get the spread (standard deviation) of sequence abundance across replicates, provided by grouper

Returns

  • if a single group, returns a pd.DataFrame with columns ('mean', 'sd')

  • if multiple groups, returns two pd.DataFrame (mean, sd) with a column for each group
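For a single group of replicates, the computation described above can be sketched with the standard library. The dict layout, the name seq_variance_sketch, and the use of the sample standard deviation (n - 1 denominator) are assumptions; the package may compute the spread differently.

```python
from statistics import mean, stdev


def seq_variance_sketch(table, group):
    """Hypothetical sketch.

    table: {sample: {sequence: value}}
    group: list of replicate sample names (at least two)
    """
    summary = {}
    for seq in table[group[0]]:
        # Treat a sequence missing from a replicate as 0
        values = [table[sample].get(seq, 0) for sample in group]
        summary[seq] = {"mean": mean(values), "sd": stdev(values)}
    return summary


table = {
    "rep1": {"AAA": 1, "CCC": 4},
    "rep2": {"AAA": 2, "CCC": 4},
    "rep3": {"AAA": 3, "CCC": 4},
}
print(seq_variance_sketch(table, ["rep1", "rep2", "rep3"]))
```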

k_seq.data.filters module

k_seq.data.grouper module

Groupers slice seq_table into groups, e.g. input samples, reacted samples, different concentrations.

TODO: simplify code

class k_seq.data.grouper.Grouper(group, target=None, axis=1)

Bases: object

Grouper of samples/sequences

Two types of grouper accepted:

  • Type 0 (list): initialize with group as list-like. This defines a single set of samples/sequences

  • Type 1 (dict): initialize with group as dict. This defines a collection of groups of samples/sequences

target

accessor for seq_table to group

Type

pd.DataFrame

axis

axis to apply grouping (0 for index, 1 for columns)

Type

0 or 1

group

dictionary with structure {group_name: group_members}

Type

list or dict

type

type of the grouper

Type

0 or 1

__init__(group, target=None, axis=1)

Initialize a Grouper instance

Parameters
  • group (list or dict) – a list creates a Type 0 Grouper (single group) and a dict creates a Type 1 Grouper (multiple groups)

  • target (pd.DataFrame) – optional, target seq_table

  • axis (0 or 1) – axis to apply the grouper

get_table(group=None, target=None, axis=None, remove_zero=False)

Return a sub-seq_table from target given group

split(target=None, remove_zero=False)
class k_seq.data.grouper.GrouperCollection(**kwargs)

Bases: k_seq.utility.func_tools.AttrScope

A collection of groupers

add(**kwargs)

add a grouper

k_seq.data.grouper.get_group(table, group, axis=1, remove_empty=False)

Slice or split the table based on the group

Parameters
  • table (pd.DataFrame or SeqTable) – target table to group

  • group (list-like or dict of list) – single group or a set of groups
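The difference between a single group (list) and a set of groups (dict of lists) can be sketched on a plain dict-of-dicts table grouped along the sample axis. The name get_group_sketch and the handling of remove_empty (dropping sequences that are zero in every selected sample) are assumptions for illustration.

```python
def get_group_sketch(table, group, remove_empty=False):
    """Hypothetical sketch.

    table: {sample: {sequence: value}}
    group: list of sample names, or {group_name: list of sample names}
    """
    def take(samples):
        sub = {name: dict(table[name]) for name in samples}
        if remove_empty:
            # Keep only sequences detected in at least one selected sample
            detected = {seq for col in sub.values() for seq, v in col.items() if v != 0}
            sub = {name: {seq: v for seq, v in col.items() if seq in detected}
                   for name, col in sub.items()}
        return sub

    if isinstance(group, dict):
        # A set of groups: split into one sub-table per group
        return {label: take(members) for label, members in group.items()}
    # A single group: slice one sub-table
    return take(group)


table = {
    "input": {"AAA": 5, "CCC": 0},
    "r1": {"AAA": 3, "CCC": 0},
    "r2": {"AAA": 0, "CCC": 2},
}
print(get_group_sketch(table, ["input", "r1"], remove_empty=True))
print(get_group_sketch(table, {"input": ["input"], "reacted": ["r1", "r2"]}))
```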

k_seq.data.variant_pool module

Function for variant pool design

k_seq.data.variant_pool.combination(n, k)
k_seq.data.variant_pool.d_mutant_fraction(d, mutation_rate, length=21, letter_book_size=4)

Relative abundance of a single d-th order mutant
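As a worked illustration, one common model for a doped pool assumes every position mutates independently with probability mutation_rate, with equal chance for each of the other letters. Under that assumption (which may not be the exact formula used by d_mutant_fraction), the abundance of one specific d-th order mutant and the number of distinct d-th order mutants can be sketched as below; both function names are hypothetical.

```python
from math import comb


def single_mutant_fraction(d, mutation_rate, length=21, letter_book_size=4):
    """Assumed model: abundance of ONE specific sequence with d mutations."""
    per_letter = mutation_rate / (letter_book_size - 1)  # chance of one specific substitution
    return per_letter ** d * (1 - mutation_rate) ** (length - d)


def mutant_class_size(d, length=21, letter_book_size=4):
    """Number of distinct sequences exactly d substitutions from the center."""
    return comb(length, d) * (letter_book_size - 1) ** d


# Sanity check: summing over all mutation orders recovers the whole pool
total = sum(single_mutant_fraction(d, 0.09) * mutant_class_size(d) for d in range(22))
print(round(total, 12))
```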

k_seq.data.variant_pool.neighbor_effect_error(xi, d, eta=0.09, L=21)

Fraction of reads from neighboring sequences due to sequencing error

k_seq.data.variant_pool.neighbor_effect_observation(xi, d, eta=0.09, L=21)

Get the ratio of observed abundance for a d-th order mutant, considering the neighbor effect under given sequencing error rate (xi)

k_seq.data.variant_pool.num_of_seq(d, length=21, letter_book_size=4)

Expected number of

k_seq.data.simu module

Module contains code to simulate data

class k_seq.data.simu.DistGenerators

Bases: object

A collection of random value generators from commonly used distributions. uniq_seq_num independent draws from the distribution are returned

Available distributions:

  • lognormal

  • uniform

  • compo_lognormal

static compo_lognormal(size, loc=None, scale=None, c95=None, seed=None)

Sample a pool composition from a log-normal distribution specified by loc and scale, or by c95

Example

scale = 0 means an evenly distributed pool in which all components have relative abundance 1/uniq_seq_num

Parameters
  • size (int) – uniq_seq_num of the pool

  • loc (float) – center of log-normal distribution

  • scale (float) – log variance of the distribution

  • c95 ([float, float]) – 95% percentile of log-normal distribution

  • seed – random seed

static lognormal(size=None, loc=None, scale=None, c95=None, seed=None)

Sample from a log-normal distribution specified by loc and scale, or by c95

Parameters
  • size (int) – number of values to draw

  • loc (float) – center of log-normal distribution, default 0

  • scale (float) – log variance of the distribution, default 0

  • c95 ([float, float]) – 95% percentile of log-normal distribution

  • seed – random seed

Returns

a draw from distribution with given uniq_seq_num
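A minimal sketch of how the two ways of specifying the distribution could relate. It assumes that c95 holds the 2.5 and 97.5 percentiles, that loc and scale are the mean and standard deviation of the underlying normal, and that a composition is obtained by normalizing the draws to sum to 1. These are assumptions, not confirmed by the source, and all function names are hypothetical.

```python
import math
import random

Z_975 = 1.959964  # 97.5th percentile of the standard normal


def lognormal_params_from_c95(c95):
    """Assumed conversion from a 95% interval [low, high] to (loc, scale)."""
    low, high = c95
    loc = (math.log(low) + math.log(high)) / 2
    scale = (math.log(high) - math.log(low)) / (2 * Z_975)
    return loc, scale


def draw_lognormal(size, loc=0.0, scale=1.0, c95=None, seed=None):
    """Draw `size` values from exp(N(loc, scale))."""
    rng = random.Random(seed)
    if c95 is not None:
        loc, scale = lognormal_params_from_c95(c95)
    return [rng.lognormvariate(loc, scale) for _ in range(size)]


def compo_lognormal_sketch(size, loc=0.0, scale=1.0, c95=None, seed=None):
    """Normalize log-normal draws so that they sum to 1 (a pool composition)."""
    draws = draw_lognormal(size, loc=loc, scale=scale, c95=c95, seed=seed)
    total = sum(draws)
    return [value / total for value in draws]


print(lognormal_params_from_c95((0.1, 10.0)))
print(compo_lognormal_sketch(4, seed=23))
```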

static uniform(low=None, high=None, size=None, seed=None)

Sample from a uniform distribution

class k_seq.data.simu.PoolParamGenerator

Bases: object

Functions to generate parameters for a set of sequences in a sequence pool

- sample_from_iid_dist
- sample_from_dataframe
Returns

pd.DataFrame with columns of parameters and rows of simulated parameter for each sequence

classmethod sample_from_dataframe(df, uniq_seq_num, replace=True, weights=None, seed=None)

Simulate parameters by resampling rows of a given dataframe

Parameters
  • df – dataframe that contains parameters as columns to sample from; needs to have p0 as one column for a heterogeneous pool

  • uniq_seq_num (int) – number of unique sequences from simulation

static sample_from_iid_dist(uniq_seq_num, seed=None, **param_generators)

Simulate the seq parameters from individual draws of distributions

Parameter:

  • p0: initial fraction of each sequence, for an uneven pool

  • other parameters depend on the model, e.g. the first-order model needs to include k and A

Accepted parameter input:
  • list-like: if its length does not match the expected uniq_seq_num, resample with replacement

  • generator: the given generator returns a random parameter

  • callable: if it takes uniq_seq_num as an argument, it needs to return a vector of uniq_seq_num sampled parameters or a generator that generates such a vector; if it does not take uniq_seq_num as an argument, it needs to return a single sample

Parameters
  • p0 (list-like, generator, or callable) – reserved argument for initial pool composition (fraction)

  • uniq_seq_num (int) – number of unique sequences from simulation

  • seed (int) – random seed for repeatability

  • param_generators (kwargs of list-like, generator, or callable) – parameter generators depending on the model

Returns

a pd.DataFrame with uniq_seq_num rows that contains the generated parameters

class k_seq.data.simu.SimulationResults(dataset_dir, result_dir)

Bases: object

Class to load simulation result

__init__(dataset_dir, result_dir)

Survey estimation results:

  • load fitting results from result_dir/fit_summary.csv

  • load truth and input count info from dataset_dir/truth.csv and input_counts

Optional to include:

  • input_counts: counts of sequences in the input pool

  • mean_counts: mean counts in all samples (input and reacted)

Returns
  • results – table of estimated k, A, kA

  • truth – table of true k, A, p0, ka, and input_counts

  • seq_list – list of indices of the sequences that could be estimated

get_est_results(param, pred_type='point_est')

Return the estimation (pred) and truth of given parameter

get_fold_range(param)

Return the ratio of 97.5-percentile to 2.5-percentile

get_uncertainty_accuracy(param, pred_type='bs_ci95')

Return the accuracy of the uncertainty estimation, i.e. whether the uncertainty range includes the truth
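The two summaries described for get_fold_range and get_uncertainty_accuracy can be sketched on plain lists. The linear-interpolation percentile and the reading of accuracy as the fraction of intervals that contain the true value are assumptions for illustration; the function names are hypothetical.

```python
def percentile(values, q):
    """Percentile with linear interpolation between the two closest ranks."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    return data[lower] + (data[upper] - data[lower]) * (pos - lower)


def fold_range(values):
    """Ratio of the 97.5th percentile to the 2.5th percentile."""
    return percentile(values, 97.5) / percentile(values, 2.5)


def ci_coverage(intervals, truth):
    """Fraction of (low, high) intervals that include the corresponding true value."""
    hits = sum(low <= t <= high for (low, high), t in zip(intervals, truth))
    return hits / len(truth)


print(fold_range(range(1, 42)))
print(ci_coverage([(0, 2), (1, 3), (5, 6)], [1, 4, 5.5]))
```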

k_seq.data.simu.get_pct_gaussian_error(rate)

Return a function to apply Gaussian error proportional to the value
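A minimal sketch of such an error function: a closure that perturbs a value with Gaussian noise whose standard deviation is rate times the value. The name pct_gaussian_error_sketch and the seed argument are additions for illustration, not part of the documented signature.

```python
import random


def pct_gaussian_error_sketch(rate, seed=None):
    """Hypothetical sketch: return a function applying Gaussian error proportional to the value."""
    rng = random.Random(seed)

    def apply_error(value):
        # Standard deviation scales with the magnitude of the value
        return rng.gauss(value, abs(value) * rate)

    return apply_error


add_error = pct_gaussian_error_sketch(rate=0.1, seed=23)
print([add_error(100.0) for _ in range(3)])
```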

k_seq.data.simu.simulate_counts(uniq_seq_num, x_values, total_reads, p0_generator=None, kinetic_model=None, count_model=None, total_amount_error=None, param_sample_from_df=None, weights=None, replace=True, reps=1, seed=None, note=None, save_to=None, **param_generators)

Simulate sequencing count dataset given kinetic and count model

Procedure:
  1. Parameters for each unique sequence are sampled from param_sample_from_df and kwargs (param_generators). It is an even pool if p0 is not provided. No repeated parameters.

  2. Simulate the reacted amount / fraction of sequences with each controlled variable in x_values

  3. Simulate counts with the given total_reads for the input pool and the reacted pools.

Parameters
  • uniq_seq_num (int) – Number of unique sequences from simulation

  • x_values (list-like) – list of controlled variables in each experiment setup; a negative value means it is an initial pool

  • total_reads

  • p0 (list-like, generator, or callable) – reserved argument for initial pool composition (fraction)

  • kinetic_model (callable) – model the amount of sequences in reaction given input variables. Default BYOModel.amount_first_order

  • count_model (callable) – model the sequencing counts w.r.t. total reads and pool composition. Default MultiNomial model.

  • param_sample_from_df (pd.DataFrame) – optional to sample sequences from given table

  • weights (list or str) – weights/col of weight for sampling from table

  • total_amount_error (float or callable) – float as the standard deviation for a fixed Gaussian error, or any error function on the DNA amount. Use 0 for no introduced error.

  • conv_reps

  • seed (int) – random seed for repeatability

  • save_to (str) – optional, path to save the simulation results with x, Y, truth csv file and a pickled SeqData object

  • param_generator

Returns

  • x (pd.DataFrame) – c, n values for samples

  • Y (pd.DataFrame) – simulated sequence counts for the given samples

  • param_table (pd.DataFrame) – table listing the parameters for simulated sequences

  • seq_table (data.SeqData) – a SeqData object that stores all the data
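The three-step procedure can be sketched with the standard library. The kinetic form A * (1 - exp(-alpha * t * k * c)) is an assumed first-order model (with alpha = 0.479 and t = 90 borrowed from the BYO-doped conditions listed elsewhere in this module), and counts are drawn by multinomial sampling of reads. The function names and the dict layout are hypothetical, and total-amount error is not modeled.

```python
import math
import random
from collections import Counter


def reacted_fraction(k, A, c, alpha=0.479, t=90):
    """Assumed first-order kinetic model for the reacted fraction."""
    return A * (1 - math.exp(-alpha * t * k * c))


def simulate_counts_sketch(params, x_values, total_reads, seed=None):
    """Hypothetical sketch.

    params: {sequence: {"p0": ..., "k": ..., "A": ...}}
    x_values: controlled variable per sample; a negative value marks an input pool
    """
    rng = random.Random(seed)
    seqs = list(params)
    table = {}
    for i, c in enumerate(x_values):
        if c < 0:
            # Input pool: composition is the initial fraction p0
            weights = [params[s]["p0"] for s in seqs]
        else:
            # Reacted pool: initial fraction times reacted fraction
            weights = [params[s]["p0"] * reacted_fraction(params[s]["k"], params[s]["A"], c)
                       for s in seqs]
        # Multinomial sampling: each read is assigned to a sequence by weight
        reads = Counter(rng.choices(seqs, weights=weights, k=total_reads))
        table[f"sample_{i}"] = {s: reads.get(s, 0) for s in seqs}
    return table


params = {
    "AAA": {"p0": 0.7, "k": 100.0, "A": 0.9},
    "CCC": {"p0": 0.3, "k": 10.0, "A": 0.5},
}
print(simulate_counts_sketch(params, [-1, 2e-6, 250e-6], total_reads=1000, seed=23))
```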

k_seq.data.simu.simulate_on_byo_doped_condition_from_exp_results(dataset, fitting_res, uniq_seq_num=None, x_values=None, total_reads=None, sequencing_depth=40, n_input=1, table_name='original', total_dna_error_rate=0.1, seed=23, plot_dist=False, save_to=None)

Simulate a k-seq count dataset based on the experimental conditions of the BYO-doped pool:

  • t: reaction time (90 min)

  • alpha: degradation ratio of BYO (0.479)

  • x_values: controlled BYO concentration points: 1 input pool with triple sequencing depth and 5 BYO concentrations in triplicate:

    [-1 (input pool), 2e-6, 2e-6, 2e-6, 10e-6, 10e-6, 10e-6, 50e-6, 50e-6, 50e-6, 250e-6, 250e-6, 250e-6, 1260e-6, 1260e-6, 1260e-6]

The parameters for each sequence were sampled from previous point estimate results:
  • point_est_csv: load point estimate results to extract the estimated k and A

  • seqtable_path: path to load the input sample SeqData object to get p0 information

Returns
  • x (pd.DataFrame) – controlled variable (c, n) for samples

  • Y (pd.DataFrame) – simulated sequence counts for the given samples

  • param_table (pd.DataFrame) – table listing the parameters for simulated sequences, including p0, k, A, kA

  • truth (pd.DataFrame) – true values of parameters (e.g. p0, k, A) for simulated sequences

  • seq_table (data.SeqData) – a SeqData object that stores all the data

k_seq.data.simu.simulate_w_byo_doped_condition_from_param_dist(uniq_seq_num, depth, p0_loc, p0_scale, k_95, total_dna_error_rate=0.1, seed=23, save_to=None, plot_dist=True)

Deprecated. Simulate a k-seq count dataset similar to the experimental conditions of the BYO-doped pool:

  • t: reaction time (90 min)

  • alpha: degradation ratio of BYO (0.479)

  • x_values: controlled BYO concentration points: 1 input pool with triple sequencing depth and 5 BYO concentrations in triplicate:

    [-1 (input pool), 2e-6, 2e-6, 2e-6, 10e-6, 10e-6, 10e-6, 50e-6, 50e-6, 50e-6, 250e-6, 250e-6, 250e-6, 1260e-6, 1260e-6, 1260e-6]

The parameters for each sequence were sampled from the given distributions defined by the arguments:
  • p0: log normal from exp(N(p0_loc, p0_scale))

  • k: log normal from k_95 95-percentile for k

  • A: uniform from [0, 1]

Other parameters
  • uniq_seq_num (int) – number of unique sequences from simulation

  • depth (int or float) – sequencing depth, defined as mean reads per sequence

  • total_amount_error (float or callable) – float as the standard deviation for a fixed Gaussian error, or any error function on the DNA amount. Use 0 for no introduced error.

  • save_to (str) – optional, path to save the simulation results with x, Y, truth csv file and a pickled SeqData object

  • plot_dist (bool) – whether to plot pairwise figures of the distributions of the simulated parameters (p0, A, k, kA)

Returns
  • x (pd.DataFrame) – controlled variable (c, n) for samples

  • Y (pd.DataFrame) – simulated sequence counts for the given samples

  • param_table (pd.DataFrame) – table listing the parameters for simulated sequences, including p0, k, A, kA

  • truth (pd.DataFrame) – true values of parameters (e.g. p0, k, A) for simulated sequences

  • seq_table (data.SeqData) – a SeqData object that stores all the data

k_seq.data.visualzation module