k_seq.data: data preprocessing

Modules for data handling, including:

  • preprocessing: core module for data pre-processing, from count files to SequenceSeq for the estimator

  • io: utility functions for reading, writing, and converting between file formats

  • analysis: functions for additional analysis of sequencing samples or reads, for sample investigation and pipeline quality control

  • simu: code to generate the simulated data used in the analysis in the paper (TODO: add paper citation)


Map the name of a data table axis to its axis number:

  • axis 0: seq, sequence, sequences

  • axis 1: sample, observation, obs
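The alias mapping above can be pictured as a small lookup table. The following is only a sketch: the names AXIS_ALIASES and resolve_axis are hypothetical and are not taken from k_seq.

```python
# Hypothetical lookup table built from the aliases listed above.
AXIS_ALIASES = {
    "seq": 0, "sequence": 0, "sequences": 0,
    "sample": 1, "observation": 1, "obs": 1,
}


def resolve_axis(axis):
    """Return 0 or 1 for an axis given as an int or as one of the aliases."""
    if axis in (0, 1):
        return int(axis)
    try:
        return AXIS_ALIASES[str(axis).lower()]
    except KeyError:
        raise ValueError(f"Unknown axis name: {axis!r}") from None


print(resolve_axis("obs"), resolve_axis("sequence"))
```

Accepting both integers and names lets callers write axis='sample' where a bare 1 would be harder to read.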


This module contains the following preprocessing functions for sequencing data:

  1. Preprocess DNA sequencing results from fastq.gz files using the EasyDIVER pipeline (https://github.com/ichen-lab-ucsb/EasyDIVER), including joining paired-end reads, trimming primers, pooling the same sample across lanes, and deduplicating to generate count files of unique sequences.

  2. Parse count files and load them as a SeqTable object

k_seq.data.preprocess.fastq_to_count(fastq_root, output_path='pipeline-output', threads=12, forward_primer=None, reverse_primer=None, pandas_abs_match=True, join_first=True)

Call fastq_to_count.sh as a subprocess to join, pool, and de-duplicate raw reads using the EasyDIVER pipeline (https://github.com/ichen-lab-ucsb/EasyDIVER)

  • fastq_root (path) – path to the root folder that contains the paired-end raw FASTQ.gz files

  • output_path (path) – path to save output. By default it saves to ‘pipeline-output’ folder

  • threads (int) – number of threads used to run pandaSeq for joining. Default 12 (per the signature above)

  • forward_primer (str) – specify the forward primer for pandaSeq joining and trimming

  • reverse_primer (str) – specify the reverse primer for pandaSeq joining and trimming

  • pandas_abs_match (bool) – whether to enforce absolute matching in pandaSeq joining. Default True.

  • join_first (bool) – whether to join before trimming in pandaSeq. Suitable for heavily overlapped paired-end reads. Default True.


TODO: add an example

k_seq.data.preprocess.load_Seqtable_from_count_files(count_files, file_list=None, pattern_filter=None, black_list=None, name_template=None, sort_by=None, x_values=None, x_unit=None, input_sample_name=None, sample_metadata=None, note=None, dry_run=False)

Create a SeqData instance from count files.

Multiple sources of count files are supported; specify them in count_files.

  • count_files (str, or list of dict) – root directory to search for count files. To load multiple sources, provide a list of dicts of keyword arguments in the format [{'count_files': path/to/count/files (required), 'name_template': name template string, 'pattern_filter': pattern filter string, 'file_list': list of files to load, 'black_list': list of files to exclude}]

  • file_list (list of str) – optional, only includes the count files with names in the file_list

  • pattern_filter (str) – optional, filter file names based on this pattern, wildcards */? are allowed

  • black_list (list of str) – optional, file names included in black_list will be excluded

  • name_template (str) – naming convention used to extract metadata. Use [...] to mark the region of sample_name, and use {domain_name[, int/float]} to mark a domain region to extract as metadata. Including int or float converts the domain value to int/float if applicable; otherwise it stays a string

  • sort_by (str) – sort the order of samples based on given domain

  • dry_run (bool) – only return the parsed count file names and metadata, without actually reading the data

  • x_values (list-like, or dict) – optional. value for controlled variables. If list-like, should have same length and order as samples; if dict, should have sample names as key

  • x_unit (str) – optional. Unit for controlled variable. e.g. ‘uM’

  • input_sample_name (list of str) – optional. Indicate input samples (unreacted)

  • sample_metadata (dict of objects) – optional. Extra sample metadata

  • note (str) – Note for dataset/seq_table


Example of metadata extraction from a pattern:

>>> metadata = extract_metadata(
...     sample_name="R4B-1250A_S16_counts.txt",
...     template="R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt"
... )
>>> metadata
{
    'name': 'B-1250A_S16',
    'exp_rep': 'B',
    'concentration': 1250.0,
    'seq_rep': 'A',
    'id': 16
}

Notice: two back-to-back domains can only be parsed if one of them is numeric and the other is alphabetic; a missing value will raise an error.

Valid: matching '-A1-' to '-{sample}{replicate, int}-' gives {'sample': 'A', 'replicate': 1}

Not valid: matching '-A-' to '-{sample}{replicate, int}-' will cause an error

Not valid: matching '-AA-' to '-{sample}{replicate}-' will cause an error
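One way to picture the name_template syntax is as a translation into a regular expression with named groups. The sketch below is not the library's extract_metadata: the helper names are hypothetical, and it assumes that plain domains contain only letters and that numeric domains are unsigned. Unlike the behavior described above, it does not reject two adjacent alphabetic domains.

```python
import re

# Assumed regex fragments per domain type; plain domains are treated as letters only.
_FRAGMENTS = {"str": r"[A-Za-z]+", "int": r"\d+", "float": r"\d+(?:\.\d+)?"}
_CASTS = {"str": str, "int": int, "float": float}
_TOKEN = re.compile(r"\[|\]|\{\s*(\w+)\s*(?:,\s*(int|float))?\s*\}")


def compile_template(template):
    """Translate a name template into a regex plus a {domain: cast} map."""
    parts, casts, pos = [], {"name": str}, 0
    for token in _TOKEN.finditer(template):
        parts.append(re.escape(template[pos:token.start()]))  # literal text
        if token.group(0) == "[":
            parts.append("(?P<name>")   # start of the sample-name region
        elif token.group(0) == "]":
            parts.append(")")           # end of the sample-name region
        else:
            domain, kind = token.group(1), token.group(2) or "str"
            parts.append(f"(?P<{domain}>{_FRAGMENTS[kind]})")
            casts[domain] = _CASTS[kind]
        pos = token.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts)), casts


def extract_metadata_sketch(sample_name, template):
    """Return the parsed domains, raising ValueError when the name does not fit."""
    regex, casts = compile_template(template)
    match = regex.fullmatch(sample_name)
    if match is None:
        raise ValueError(f"{sample_name!r} does not match {template!r}")
    return {domain: casts[domain](value)
            for domain, value in match.groupdict().items()}


metadata = extract_metadata_sketch(
    "R4B-1250A_S16_counts.txt",
    "R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt",
)
print(metadata)
```

Because each domain becomes a character-class group, a numeric domain next to an alphabetic one has an unambiguous boundary, which is consistent with the back-to-back rule in the notice above.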

k_seq.data.preprocess.read_count_file(file_path, as_dict=False, number_only=False)

Read a single count file generated by the EasyDIVER pipeline.

Count file format:

number of unique sequences = 2825
total number of molecules = 29348173

AAAAAAAACACCACACA               2636463
AATATTACATCATCTATC              86763

  • file_path (str) – full path to the count file

  • as_dict (bool) – return a dictionary instead of a pd.DataFrame

  • number_only (bool) – only return number of unique seqs and total counts if True


Returns

  • unique_seqs (int) – number of unique sequences in the count file

  • total_counts (int) – total number of reads in the count file

  • sequence_counts (pd.DataFrame) – table with sequences as index and counts as the first column
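The count file layout shown above is simple enough to parse by hand. The following is a minimal standard-library sketch, not the implementation of read_count_file: parse_count_file is a hypothetical name, and it returns a plain dict where the library returns a pd.DataFrame.

```python
import io


def parse_count_file(handle, number_only=False):
    """Parse the count file layout shown above from an open text handle."""
    header = {}
    sequence_counts = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # blank separator line
        if "=" in line:                   # header line: "<description> = <int>"
            key, value = line.split("=", 1)
            header[key.strip()] = int(value)
        else:                             # data line: "<sequence> <count>"
            seq, count = line.split()
            sequence_counts[seq] = int(count)
    unique_seqs = header["number of unique sequences"]
    total_counts = header["total number of molecules"]
    if number_only:
        return unique_seqs, total_counts
    return unique_seqs, total_counts, sequence_counts


example = io.StringIO(
    "number of unique sequences = 2825\n"
    "total number of molecules = 29348173\n"
    "\n"
    "AAAAAAAACACCACACA               2636463\n"
    "AATATTACATCATCTATC              86763\n"
)
unique_seqs, total_counts, sequence_counts = parse_count_file(example)
print(unique_seqs, total_counts, len(sequence_counts))
```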


Submodule of SeqData, a feature-rich class built around seq_table for manipulating sequencing data. This module contains methods for data pre-processing from count files to CountFile for the estimator. For absolute quantification, it accepts an absolute amount (e.g. measured by qPCR) or a reacted fraction.

TODO: write an output function for each class as a JSON file and a readable folder

class k_seq.data.seq_data.SeqData(data, data_unit=None, sample_list=None, seq_list=None, data_note=None, use_sparse=True, seq_metadata=None, sample_metadata=None, grouper=None, x_values=None, x_unit=None, note=None, dataset_metadata=None)

Bases: object

Data instance to store k-seq result


Attributes:

  • accessor for the stored tables, including the original table created during initialization. Stored tables should be pd.DataFrame or SeqTable

  • x_values (list-like, or dict) – optional. Values of the controlled variable. If list-like, should have the same length and order as samples; if dict, should have sample names as keys

  • x_unit (str) – optional. Unit of the controlled variable, e.g. 'uM'

  • seqs

  • samples

  • metadata (AttrScope) – accessor for metadata of the dataset, which includes:

      • sample (AttrScope) – collection of metadata for samples, if applicable

      • seq (AttrScope) – collection of metadata for seqs, if applicable

      • created_time (timestamp) – datetime at which the instance was created

      • note (str) – note for the dataset

      • other dataset metadata objects can be added

  • analysis (AttrScope) – accessor to some pre-built analyses

  • grouper (GrouperCollection) – collection of Groupers to slice subtables

  • spike_in (SpikeInNormalizer) – optional. Accessor to the normalizer using spike-in

  • sample_total (TotalAmountNormalizer) – optional. Accessor to the normalizer using the total amount of seqs in each sample

  • analysis (SeqDataAnalyzer) – built-in analysis tools to analyze a SeqData object

__init__(data, data_unit=None, sample_list=None, seq_list=None, data_note=None, use_sparse=True, seq_metadata=None, sample_metadata=None, grouper=None, x_values=None, x_unit=None, note=None, dataset_metadata=None)

Initialize a SeqData object

Args:

  • data (pd.DataFrame or np.ndarray) – 2-D data with sequences as rows (index) and samples as columns. If data is a pd.DataFrame, its index and column values are used as sequences and samples; if data is a 2-D np.ndarray, sample_list and seq_list are needed, with the same length and order as data

  • data_unit (str) – unit of seq_table values, e.g. counts, ng, M. Default counts.

  • sample_list (list-like) – list of samples, should match the columns of the seq_table data

  • seq_list (list-like) – list of sequences, should match the rows (index) of the seq_table data

  • data_note (str) – note for the data seq_table

  • use_sparse (bool) – whether to store the seq_table values as a sparse matrix

  • seq_metadata (dict of objects) – optional. Extra seq metadata

  • grouper (dict of list, or dict of dict of list) – optional. dict of list (Type 1) or dict of dict of list (Type 2) used to create the grouper plugin

  • x_values (list-like, or dict) – optional. Values of the controlled variable. If list-like, should have the same length and order as samples; if dict, should have sample names as keys

  • x_unit (str) – optional. Unit of the controlled variable, e.g. 'uM'

  • note (str) – note for dataset/seq_table

  • dataset_metadata (dict of objects) – optional. Extra dataset metadata


Add a GrouperCollection accessor to the SeqData if it does not exist yet, then add Groupers to the accessor. Each Grouper instance is initialized from keyword arguments with a dictionary of:

  • group (list or dict) – a list creates a Type 0 Grouper (single group) and a dict creates a Type 1 Grouper (multiple groups)

  • target (pd.DataFrame) – optional, target table

  • axis (0 or 1) – axis to apply the grouper

add_sample_total(total_amounts, full_table, unit=None)

Add a TotalAmountNormalizer, available as the accessor sample_total, to quantify sequences using the total amount in each sample

  • total_amounts (dict or pd.Series) – total DNA amount of each sample, as measured in the experiment

  • full_table (pd.DataFrame) – seq_table in which the total amounts were measured and to which values are normalized

  • unit (str) – Unit of amount measured
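As an illustration of quantification by total amount, the sketch below scales each sample's counts so that they sum to the measured total amount. It assumes a simple proportional rule (amount = count / total count × total amount); the actual TotalAmountNormalizer may differ, and normalize_by_total_amount is a hypothetical name.

```python
def normalize_by_total_amount(counts, total_amounts):
    """Convert counts to amounts, assuming proportional scaling per sample.

    counts: {sample: {seq: count}}, a stand-in for the columns of full_table
    total_amounts: {sample: measured total DNA amount}
    """
    amounts = {}
    for sample, seq_counts in counts.items():
        total_count = sum(seq_counts.values())
        factor = total_amounts[sample] / total_count   # amount per read
        amounts[sample] = {seq: c * factor for seq, c in seq_counts.items()}
    return amounts


counts = {"s1": {"AAA": 30, "AAC": 10}, "s2": {"AAA": 5, "AAC": 15}}
amounts = normalize_by_total_amount(counts, {"s1": 2.0, "s2": 1.0})
print(amounts)
```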

add_spike_in(base_table, spike_in_seq, spike_in_amount, radius=2, unit=None, dist_type='edit')

Add a SpikeInNormalizer, available as the accessor spike_in on the instance, to quantify sequence amounts using a spike-in sequence

  • base_table (pd.DataFrame) – table that includes the spike-in sequences, used to calculate the normalization factor

  • spike_in_seq (str) – center sequence of the spike-in

  • spike_in_amount (list-like, dict, or pd.Series) – added spike-in amount. A dict or pd.Series should have the samples in base_table as keys; a list-like should have the same length as the number of samples (columns) in base_table

  • radius (int) – radius of the spike-in peak; sequences whose distance from the center is less than or equal to radius are treated as spike-in sequences

  • unit (str) – unit of spike-in amount

  • dist_type (str) – ‘edit’ or ‘hamming’ as distance measure. Default ‘edit’ to include insertion / deletion
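The radius and dist_type parameters can be illustrated with a small sketch that sums the reads within the spike-in peak and derives an amount-per-read factor. This is an assumption about how such a normalizer could work, not the SpikeInNormalizer code; all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def hamming_distance(a, b):
    """Number of mismatches; sequences of different length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(x != y for x, y in zip(a, b))


def spike_in_norm_factor(seq_counts, spike_in_seq, spike_in_amount,
                         radius=2, dist_type="edit"):
    """Amount per read, estimated from reads within `radius` of the spike-in."""
    dist = edit_distance if dist_type == "edit" else hamming_distance
    peak = sum(count for seq, count in seq_counts.items()
               if dist(seq, spike_in_seq) <= radius)
    if peak == 0:
        raise ValueError("no spike-in reads found within the given radius")
    return spike_in_amount / peak


sample = {"AAAAAA": 90, "AAAAAT": 8, "AAAAA": 2, "CCGGTT": 100}
factor_edit = spike_in_norm_factor(sample, "AAAAAA", 50.0, radius=1)
factor_hamming = spike_in_norm_factor(sample, "AAAAAA", 50.0, radius=1,
                                      dist_type="hamming")
print(factor_edit, factor_hamming)
```

The shorter sequence AAAAA is counted only with the edit distance, which shows why 'edit' is the choice that includes insertions and deletions.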

static from_count_files(**kwargs)

Create a SeqData instance from a folder of count files


  • count_files (str) – root directory to search for count files

  • file_list (list of str) – optional, only includes the count files with names in the file_list

  • pattern_filter (str) – optional, filter file names based on this pattern, wildcards */? are allowed

  • black_list (list of str) – optional, file names included in black_list will be excluded

  • name_template (str) – naming convention used to extract metadata. Use [...] to mark the region of sample_name, and use {domain_name[, int/float]} to mark a domain region to extract as metadata. Including int or float converts the domain value to int/float if applicable; otherwise it stays a string

  • sort_by (str) – sort the order of samples based on the given domain

  • dry_run (bool) – only return the parsed count file names and metadata, without actually reading the data

  • x_values (list-like, or dict) – optional. Values of the controlled variable. If list-like, should have the same length and order as samples; if dict, should have sample names as keys

  • x_unit (str) – optional. Unit of the controlled variable, e.g. 'uM'

  • input_sample_name (list of str) – optional. Indicate input samples (unreacted)

  • sample_metadata (dict of objects) – optional. Extra sample metadata

  • note (str) – note for dataset/seq_table


Example of metadata extraction from a pattern:

>>> metadata = extract_metadata(
...     sample_name="R4B-1250A_S16_counts.txt",
...     template="R4[{exp_rep}-{concentration, float}{seq_rep}_S{id, int}]_counts.txt"
... )
>>> metadata
{
    'name': 'B-1250A_S16',
    'exp_rep': 'B',
    'concentration': 1250.0,
    'seq_rep': 'A',
    'id': 16
}

Notice: two back-to-back domains can only be parsed if one of them is numeric and the other is alphabetic; a missing value will raise an error.

Valid: matching '-A1-' to '-{sample}{replicate, int}-' gives {'sample': 'A', 'replicate': 1}

Not valid: matching '-A-' to '-{sample}{replicate, int}-' will cause an error

Not valid: matching '-AA-' to '-{sample}{replicate}-' will cause an error


TODO: add json

static from_pickle(path)
property samples
property seqs

More generalized JSON file TODO: add to_json and from_json


Update accessor to SeqDataAnalyzer

class k_seq.data.seq_data.SeqTable(*args, **kwargs)

Bases: pandas.core.frame.DataFrame

Extended pd.DataFrame with added properties and functions for SeqTable

Additional Attributes:

  • unit (str) – unit of entries in this seq_table

  • note (str) – note for this seq_table

  • samples (pd.Series) – samples in the seq_table

  • seqs (pd.Series) – sequences in the seq_table

Additional Methods:

  • about – print a summary of the seq_table

  • analysis – accessor to SeqTableAnalyzer

__init__(*args, **kwargs)

Initialize a SeqTable instance. Additional kwargs:

  • unit

  • note (str) – note for dataset/seq_table

  • use_sparse (bool) – whether to store the seq_table values as a sparse matrix


Quick view of SeqTable

property density
describe(percentiles=None, include=None, exclude=None)

return major stable statistics

filter_axis(filter, axis=0, remove_empty=False, inplace=False)

Filter seq_table along one axis

  • filter (callable) – a callable applied to each row/column that returns a bool value

  • axis (0 or 1) – the axis to filter. 0: row, seqs; 1: column, sample

  • remove_empty (bool) – whether to remove the empty columns/rows after filtering

  • inplace (bool) – whether to change the seq_table in place. If False, return a new seq_table

property samples
property seqs

update the accessor to SeqTableAnalyzer

k_seq.data.seq_data.slice_table(table, axis, keys, remove_empty=False)

Slice a pd.DataFrame seq_table along the given axis with a list of key values or with filter functions returning True/False. Optionally remove all-zero entries.

  • table (pd.DataFrame) – seq_table to slice

  • keys – list of keys to preserve. If callable, it is applied to each row/column of the seq_table and returns a bool to preserve (True) or discard (False)

  • axis (0 or 1) – which axis to filter

  • remove_empty (bool) – whether to remove the empty columns/rows after filtering


class k_seq.data.seq_data_analyzer.SeqDataAnalyzer(seq_data)

Bases: k_seq.utility.func_tools.FuncToMethod

class k_seq.data.seq_data_analyzer.SeqTableAnalyzer(seq_table)

Bases: k_seq.utility.func_tools.FuncToMethod

k_seq.data.seq_data_analyzer.cross_table_compare(base_table, compare_table, samples=None, ax=None, figsize=None, color_map=None, save_fig_to=None)
k_seq.data.seq_data_analyzer.rep_variance_scatter(seq_table, grouper, xaxis=None, subsample=None, xlog=True, ylog=True, xlim=None, ylim=None, group_title_pos=None, xlabel=None, ylabel=None, label_map=None, figsize=None, save_fig_to=None)

A scatter plot for measured value variance in replicates for each sequence

k_seq.data.seq_data_analyzer.sample_entropy_scatterplot(seq_table, black_list=None, normalize=False, base=2, color_map=None, figsize=None, scatter_kwargs=None, save_fig_to=None)

Summarize sample info for a SeqData, with info on total amount and spike-in

Returns: a pd.DataFrame showing the summary for samples

k_seq.data.seq_data_analyzer.sample_overview(seq_table, axis=1)

Summarize sequences for a given seq_table, with info on unique seqs and total amount

Returns: a pd.DataFrame showing the summary for sequences

k_seq.data.seq_data_analyzer.sample_overview_plots(seq_table, plot_unique_seq=True, plot_total_counts=True, plot_spike_in_frac=True, color_map=None, black_list=None, figsize=None, label_mapper=None, save_fig_to=None)

Overview plot(s) of unique seqs, total counts and spike-in fractions in the samples

  • seq_table (SeqData) – sample set to survey

  • plot_unique_seq (bool) – plot bar plot for unique sequences if True

  • plot_total_counts (bool) – plot bar plot for total counts if True

  • plot_spike_in_frac (bool) – plot scatter plot for spike in fraction if True

  • color_map (dict) – {sample_name: color} for all plots

  • black_list (list of str) – list of sample names to exclude from the plots

  • sep_plot (bool) – plot separate plots for unique sequences, total counts and spike_in fractions if True

  • label_mapper (dict or callable) – alternative labels for samples

  • save_fig_to (str) – save the figure to this directory if not None

k_seq.data.seq_data_analyzer.sample_rel_abun_hist(seq_table, black_list=None, bins=None, x_log=True, y_log=False, ncol=None, nrow=None, figsize=None, hist_kwargs=None, save_fig_to=None)

todo: add pool counts composition curve for straight forward visualization

k_seq.data.seq_data_analyzer.sample_spike_in_ratio_scatterplot(seq_table, black_list=None, ax=None, save_fig_to=None, figsize=None, label_mapper=None, scatter_kwargs=None)

Scatter plot of spike in ratio in the pool

k_seq.data.seq_data_analyzer.sample_total_reads_barplot(seq_table, black_list=None, logy=False, ax=None, save_fig_to=None, figsize=None, x_label=None, y_label='Total reads', fontsize=14, label_mapper=None, barplot_kwargs=None)

Barplot of total counts (sum over sequences) in each sample

k_seq.data.seq_data_analyzer.sample_unique_seqs_barplot(seq_table, black_list=None, logy=False, ax=None, save_fig_to=None, figsize=None, x_label=None, y_label='Unique sequences', fontsize=14, label_mapper=None, barplot_kwargs=None)

Barplot of unique seqs in each sample

k_seq.data.seq_data_analyzer.seq_length_dist(seq_table, axis=0, ax=None, figsize=(6, 3), bins=20, logx=False, logy=False, hist_kwargs=None, save_fig_to=None)

Histogram of the length distribution of unique sequences

k_seq.data.seq_data_analyzer.seq_mean_value_detected_samples_scatterplot(seq_table, figsize=5, ylabel='counts', ylog=True, subsample=None, color='#1F77B4', marker_size=5, scatter_kwargs=None)

Joint plot of the mean value (e.g. count) of a sequence across samples against the number of samples in which it is detected, with one scatter plot (x: number of samples detected, y: mean value) and two histograms showing the distribution in each dimension

k_seq.data.seq_data_analyzer.seq_overview(seq_table, axis=0)

Summarize sequences in seq_table, with info on seq length, samples detected, mean, and sd

Returns: a pd.DataFrame showing the summary for sequences

k_seq.data.seq_data_analyzer.seq_variance(seq_table, grouper)

Get the spread (standard deviation) of sequence abundance across replicates, provided by grouper


Returns: if a single group, a pd.DataFrame with columns ('mean', 'sd'); if multiple groups, two pd.DataFrames (mean, sd) with one column per group
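The per-group mean and standard deviation described for seq_variance can be sketched with the standard library. This is an illustration only: it uses nested dicts in place of a SeqTable, treats a sequence missing from a replicate as 0, and uses the sample standard deviation, all of which are assumptions rather than documented behavior.

```python
from statistics import mean, stdev


def seq_variance_sketch(table, groups):
    """table: {sample: {seq: value}}; groups: {group_name: [sample, ...]}.

    Returns (means, sds), each shaped {group_name: {seq: statistic}}.
    """
    means, sds = {}, {}
    for name, samples in groups.items():
        seqs = set().union(*(table[s] for s in samples))
        means[name], sds[name] = {}, {}
        for seq in sorted(seqs):
            values = [table[s].get(seq, 0) for s in samples]
            means[name][seq] = mean(values)
            # The sample standard deviation needs at least two replicates.
            sds[name][seq] = stdev(values) if len(values) > 1 else float("nan")
    return means, sds


table = {
    "a1": {"AAA": 10, "AAC": 2},
    "a2": {"AAA": 14, "AAC": 2},
    "a3": {"AAA": 12},
}
means, sds = seq_variance_sketch(table, {"rep_a": ["a1", "a2", "a3"]})
print(means["rep_a"]["AAA"], sds["rep_a"]["AAA"])
```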

k_seq.data.filters module

k_seq.data.grouper module

Groupers slice seq_table into groups, e.g. input samples, reacted samples, different concentrations

TODO: simplify code

class k_seq.data.grouper.Grouper(group, target=None, axis=1)

Bases: object

Grouper of samples/sequences

Two types of grouper are accepted:

  • Type 0 (list): initialize with group as list-like. This defines a single set of samples/sequences

  • Type 1 (dict): initialize with group as dict. This defines a collection of groups of samples/sequences


Attributes:

  • accessor for seq_table to group

  • axis to apply grouping (0 for index, 1 for columns). Type: 0 or 1

  • dictionary with structure {group_name: group_members}. Type: list or dict

  • type of the grouper. Type: 0 or 1

__init__(group, target=None, axis=1)

Initialize a Grouper instance

  • group (list or dict) – a list creates a Type 0 Grouper (single group) and a dict creates a Type 1 Grouper (multiple groups)

  • target (pd.DataFrame) – optional, target seq_table

  • axis (0 or 1) – axis to apply the grouper

get_table(group=None, target=None, axis=None, remove_zero=False)

Return a sub-seq_table from target given group

split(target=None, remove_zero=False)
class k_seq.data.grouper.GrouperCollection(**kwargs)

Bases: k_seq.utility.func_tools.AttrScope

A collection of groupers


add a grouper

k_seq.data.grouper.get_group(table, group, axis=1, remove_empty=False)

Slice or split the table based on the group

  • table (pd.DataFrame or SeqTable) – target table to group

  • group (list-like or dict of list) – single group or a set of groups
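The difference between a single group (list-like) and a set of groups (dict of lists) can be illustrated as follows. The sketch uses nested dicts as a stand-in for a table with samples as columns (axis=1); get_group_sketch is a hypothetical name and does not reproduce the library function.

```python
def get_group_sketch(table, group, remove_empty=False):
    """table: {sample: {seq: value}}, i.e. samples are the columns (axis=1).

    A list-like group returns one sub-table; a dict of lists returns
    {group_name: sub-table}.
    """
    def take(samples):
        sub = {s: dict(table[s]) for s in samples}
        if remove_empty:
            # Keep only sequences that are non-zero in at least one selected sample.
            kept = {q for col in sub.values() for q, v in col.items() if v != 0}
            sub = {s: {q: v for q, v in col.items() if q in kept}
                   for s, col in sub.items()}
        return sub

    if isinstance(group, dict):
        return {name: take(members) for name, members in group.items()}
    return take(list(group))


table = {
    "input": {"AAA": 5, "AAC": 0},
    "r1": {"AAA": 3, "AAC": 0},
    "r2": {"AAA": 0, "AAC": 4},
}
single = get_group_sketch(table, ["input"], remove_empty=True)
split = get_group_sketch(table, {"input": ["input"], "reacted": ["r1", "r2"]})
print(single, sorted(split))
```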

k_seq.data.variant_pool module

Functions for variant pool design

k_seq.data.variant_pool.combination(n, k)
k_seq.data.variant_pool.d_mutant_fraction(d, mutation_rate, length=21, letter_book_size=4)

Relative abundance of a single d-th order mutant
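For a doped pool in which every position mutates independently, the relative abundance of one specific d-th order mutant is commonly written as (m/(k-1))^d (1-m)^(L-d), with mutation rate m, alphabet size k, and length L. The sketch below assumes that this is the intended formula; the function names are hypothetical, and the counting helper is included only to check that the fractions sum to one.

```python
from math import comb


def d_mutant_fraction_sketch(d, mutation_rate, length=21, letter_book_size=4):
    """Assumed formula: (m / (k - 1))**d * (1 - m)**(L - d)."""
    per_letter = mutation_rate / (letter_book_size - 1)
    return per_letter ** d * (1 - mutation_rate) ** (length - d)


def count_d_mutants(d, length=21, letter_book_size=4):
    """Number of distinct sequences exactly d substitutions from the center."""
    return comb(length, d) * (letter_book_size - 1) ** d


# Summing over all mutation orders should recover the whole pool.
total = sum(count_d_mutants(d) * d_mutant_fraction_sketch(d, 0.09)
            for d in range(21 + 1))
print(count_d_mutants(1), total)
```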

k_seq.data.variant_pool.neighbor_effect_error(xi, d, eta=0.09, L=21)

Fraction of reads from neighboring sequences due to sequencing error

k_seq.data.variant_pool.neighbor_effect_observation(xi, d, eta=0.09, L=21)

Get the ratio of observed abundance for a d-th order mutant, considering the neighbor effect under given sequencing error rate (xi)

k_seq.data.variant_pool.num_of_seq(d, length=21, letter_book_size=4)

Expected number of

k_seq.data.simu module

Module contains code to simulate data

class k_seq.data.simu.DistGenerators

Bases: object

A collection of random value generators for commonly used distributions. Each returns uniq_seq_num independent draws from the distribution.

Available distributions:

  • lognormal

  • uniform

  • compo_lognormal

static compo_lognormal(size, loc=None, scale=None, c95=None, seed=None)

Sample a pool composition from a log-normal distribution specified by loc and scale, or by c95

scale = 0 means an evenly distributed pool in which all components have relative abundance 1/uniq_seq_num

  • size (int) – uniq_seq_num of the pool

  • loc (float) – center of log-normal distribution

  • scale (float) – log variance of the distribution

  • c95 ([float, float]) – 95% percentile of log-normal distribution

  • seed – random seed
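A pool composition of this kind can be sketched by drawing log-normal values and normalizing them to sum to one. This is an illustration under assumptions: it treats scale as the standard deviation of the underlying normal distribution, ignores c95, and uses the standard-library random module; compo_lognormal_sketch is a hypothetical name.

```python
import math
import random


def compo_lognormal_sketch(size, loc=0.0, scale=1.0, seed=None):
    """Draw `size` log-normal values and normalize them into fractions."""
    rng = random.Random(seed)
    if scale == 0:
        # Degenerate case: every component gets the same value.
        draws = [math.exp(loc)] * size
    else:
        draws = [rng.lognormvariate(loc, scale) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


even_pool = compo_lognormal_sketch(4, scale=0)
uneven_pool = compo_lognormal_sketch(1000, loc=0.0, scale=1.0, seed=23)
print(even_pool, sum(uneven_pool))
```

With scale = 0 every component has relative abundance 1/size, matching the even pool described above.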

static lognormal(size=None, loc=None, scale=None, c95=None, seed=None)

Sample from a log-normal distribution specified by loc and scale, or by c95

  • size (int) – number of values to draw

  • loc (float) – center of log-normal distribution, default 0

  • scale (float) – log variance of the distribution, default 0

  • c95 ([float, float]) – 95% percentile of log-normal distribution

  • seed – random seed


Returns: a draw from the distribution with the given uniq_seq_num

static uniform(low=None, high=None, size=None, seed=None)

Sample from a uniform distribution

class k_seq.data.simu.PoolParamGenerator

Bases: object

Functions to generate parameters for a set of sequences in a sequence pool:

- sample_from_iid_dist
- sample_from_dataframe

Returns: a pd.DataFrame with parameters as columns and the simulated parameters of each sequence as rows

classmethod sample_from_dataframe(df, uniq_seq_num, replace=True, weights=None, seed=None)

Simulate parameters by resampling rows of a given dataframe

  • df – dataframe containing parameters as columns to sample from; needs to have p0 as one column for a heterogeneous pool

  • uniq_seq_num (int) – number of unique sequences in the simulation

static sample_from_iid_dist(uniq_seq_num, seed=None, **param_generators)

Simulate the seq parameters from individual draws of distributions

Parameter:

p0: initial fraction of each sequence, for an uneven pool. Other parameters depend on the model, e.g. a first-order model needs to include k and A

Accepted parameter input:
  • list-like: if its length does not match the expected uniq_seq_num, resample with replacement

  • generator: the given generator returns a random parameter

  • callable: if it takes uniq_seq_num as an argument, it needs to return a uniq_seq_num vector of sampled parameters or a generator that generates a uniq_seq_num vector; if it does not take uniq_seq_num as an argument, it needs to return a single sample

Args:

  • p0 (list-like, generator, or callable) – reserved argument for the initial pool composition (fraction)

  • uniq_seq_num (int) – number of unique sequences in the simulation

  • seed (int) – random seed for repeatability

  • param_generators (kwargs of list-like, generator, or callable) – parameter generators, depending on the model

Returns: a pd.DataFrame with n_row = uniq_seq_num containing the generated parameters
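The three accepted input kinds (list-like, generator, callable) can be dispatched as in the sketch below. It is a simplified illustration, not the library code: it returns a dict of lists instead of a pd.DataFrame, it does not handle a callable that returns a generator, and the helper names are hypothetical.

```python
import inspect
import random
from types import GeneratorType


def draw_parameter(source, uniq_seq_num, rng):
    """Turn one parameter source into a list of uniq_seq_num values."""
    if isinstance(source, GeneratorType):
        return [next(source) for _ in range(uniq_seq_num)]
    if callable(source):
        if "uniq_seq_num" in inspect.signature(source).parameters:
            return list(source(uniq_seq_num=uniq_seq_num))   # vector at once
        return [source() for _ in range(uniq_seq_num)]       # one sample per call
    values = list(source)
    if len(values) == uniq_seq_num:
        return values
    return rng.choices(values, k=uniq_seq_num)               # resample with replacement


def sample_from_iid_dist_sketch(uniq_seq_num, seed=None, **param_generators):
    rng = random.Random(seed)
    return {name: draw_parameter(source, uniq_seq_num, rng)
            for name, source in param_generators.items()}


params = sample_from_iid_dist_sketch(
    5,
    seed=23,
    p0=[0.1, 0.3],                                  # list-like, wrong length: resampled
    k=(10.0 ** i for i in range(5)),                # generator
    A=lambda uniq_seq_num: [0.5] * uniq_seq_num,    # callable taking uniq_seq_num
)
print(params)
```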

class k_seq.data.simu.SimulationResults(dataset_dir, result_dir)

Bases: object

Class to load simulation result

__init__(dataset_dir, result_dir)

Survey estimation results:

  - load fitting results from result_dir/fit_summary.csv
  - load truth and input count info from dataset_dir/truth.csv and input_counts

Optional to include:

  - input_counts: counts of sequences in the input pool
  - mean_counts: mean counts in all samples (input and reacted)

Attributes:

  • table of estimated k, A, kA

  • truth: table of true k, A, p0, kA, and input_counts

  • seq_list: list of indices of the sequences that could be estimated


get_est_results(param, pred_type='point_est')

Return the estimation (pred) and truth of given parameter


Return the ratio of 97.5-percentile to 2.5-percentile

get_uncertainty_accuracy(param, pred_type='bs_ci95')

Return the accuracy of the uncertainty estimation, i.e. whether the uncertainty range includes the truth
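Checking whether an uncertainty range includes the truth amounts to a per-sequence comparison, optionally summarized as a coverage fraction. The sketch below illustrates this with plain lists; the function name and the choice to also return the coverage fraction are assumptions, not part of the documented method.

```python
def uncertainty_coverage(lower, upper, truth):
    """Flag, per sequence, whether lower <= truth <= upper, plus the hit rate."""
    hits = [lo <= t <= hi for lo, hi, t in zip(lower, upper, truth)]
    return hits, sum(hits) / len(hits)


# Lower and upper bounds of a 95% interval for three sequences, with true values.
hits, coverage = uncertainty_coverage(
    lower=[0.8, 1.5, 0.1],
    upper=[1.2, 2.5, 0.3],
    truth=[1.0, 3.0, 0.2],
)
print(hits, coverage)
```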


Return a function to apply Gaussian error proportional to the value

k_seq.data.simu.simulate_counts(uniq_seq_num, x_values, total_reads, p0_generator=None, kinetic_model=None, count_model=None, total_amount_error=None, param_sample_from_df=None, weights=None, replace=True, reps=1, seed=None, note=None, save_to=None, **param_generators)

Simulate sequencing count dataset given kinetic and count model

  1. Parameters for each unique sequence are sampled from param_sample_from_df and kwargs (param_generators). It is an even pool if p0 is not provided. No repeated parameters.

  2. Simulate the reacted amount / fraction of sequences for each controlled variable in x_values

  3. Counts with the given total_reads are simulated for the input pool and the reacted pools.

  • uniq_seq_num (int) – Number of unique sequences from simulation

  • x_values (list-like) – list of controlled variables in each experiment setup; a negative value means it is an initial pool

  • total_reads

  • p0 (list-like, generator, or callable) – reserved argument for initial pool composition (fraction)

  • kinetic_model (callable) – model the amount of sequences in reaction given input variables. Default BYOModel.amount_first_order

  • count_model (callable) – model of the sequencing counts w.r.t. total reads and pool composition. Default MultiNomial model.

  • param_sample_from_df (pd.DataFrame) – optional to sample sequences from given table

  • weights (list or str) – weights/col of weight for sampling from table

  • total_amount_error (float or callable) – float as the standard deviation for a fixed Gaussian error, or any error function on the DNA amount. Use 0 for no introduced error.

  • conv_reps

  • seed (int) – random seed for repeatability

  • save_to (str) – optional, path to save the simulation results with x, Y, truth csv file and a pickled SeqData object

  • param_generator


Returns

  • x (pd.DataFrame) – c, n value for samples

  • Y (pd.DataFrame) – simulated sequence counts for given samples

  • param_table (pd.DataFrame) – table listing the parameters for simulated sequences

  • seq_table (data.SeqData) – a SeqData object to store all the data
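The simulation steps above can be sketched end to end with the standard library. Everything below is an assumption made for illustration: the first-order form A(1 - exp(-alpha t k c)) is a guess at what BYOModel.amount_first_order computes, multinomial sampling is emulated with random.choices, and the function names are hypothetical.

```python
import math
import random


def first_order_fraction(k, A, c, alpha=0.479, t=90):
    """Assumed first-order model: A * (1 - exp(-alpha * t * k * c))."""
    return A * (1.0 - math.exp(-alpha * t * k * c))


def multinomial_counts(composition, total_reads, rng):
    """Draw total_reads reads with probabilities proportional to composition."""
    counts = [0] * len(composition)
    picks = rng.choices(range(len(composition)), weights=composition, k=total_reads)
    for idx in picks:
        counts[idx] += 1
    return counts


def simulate_counts_sketch(p0, k, A, x_values, total_reads, seed=None):
    """Return one list of counts per entry of x_values."""
    rng = random.Random(seed)
    samples = []
    for c in x_values:
        if c < 0:
            amounts = list(p0)   # negative value marks an input (unreacted) pool
        else:
            amounts = [p * first_order_fraction(ki, Ai, c)
                       for p, ki, Ai in zip(p0, k, A)]
        samples.append(multinomial_counts(amounts, total_reads, rng))
    return samples


counts = simulate_counts_sketch(
    p0=[0.5, 0.3, 0.2],
    k=[100.0, 1000.0, 10000.0],
    A=[0.5, 0.8, 1.0],
    x_values=[-1, 2e-6, 250e-6],
    total_reads=1000,
    seed=23,
)
print(counts)
```

Each row sums to total_reads, so only the relative composition of a sample carries information, which is why a total amount or spike-in is needed for absolute quantification.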

k_seq.data.simu.simulate_on_byo_doped_condition_from_exp_results(dataset, fitting_res, uniq_seq_num=None, x_values=None, total_reads=None, sequencing_depth=40, n_input=1, table_name='original', total_dna_error_rate=0.1, seed=23, plot_dist=False, save_to=None)

Simulate a k-seq count dataset based on the experimental condition of the BYO-doped pool, where:

  • t: reaction time (90 min)

  • alpha: degradation ratio of BYO (0.479)

  • x_values: controlled BYO concentration points: 1 input pool with triple sequencing depth and 5 BYO concentrations in triplicate:

    [-1 (input pool), 2e-6, 2e-6, 2e-6, 10e-6, 10e-6, 10e-6, 50e-6, 50e-6, 50e-6, 250e-6, 250e-6, 250e-6, 1260e-6, 1260e-6, 1260e-6]

Parameters for each sequence are sampled from previous point estimate results:
  • point_est_csv: load point estimate results to extract the estimated k and A

  • seqtable_path: path to load the input sample SeqData object to get p0 information

Returns

  • x (pd.DataFrame) – controlled variable (c, n) for samples

  • Y (pd.DataFrame) – simulated sequence counts for given samples

  • param_table (pd.DataFrame) – table listing the parameters for simulated sequences, including p0, k, A, kA

  • truth (pd.DataFrame) – true values of parameters (e.g. p0, k, A) for simulated sequences

  • seq_table (data.SeqData) – a SeqData object to store all the data

k_seq.data.simu.simulate_w_byo_doped_condition_from_param_dist(uniq_seq_num, depth, p0_loc, p0_scale, k_95, total_dna_error_rate=0.1, seed=23, save_to=None, plot_dist=True)

Deprecated. Simulate a k-seq count dataset similar to the experimental condition of the BYO-doped pool, where:

  • t: reaction time (90 min)

  • alpha: degradation ratio of BYO (0.479)

  • x_values: controlled BYO concentration points: 1 input pool with triple sequencing depth and 5 BYO concentrations in triplicate:

    [-1 (input pool), 2e-6, 2e-6, 2e-6, 10e-6, 10e-6, 10e-6, 50e-6, 50e-6, 50e-6, 250e-6, 250e-6, 250e-6, 1260e-6, 1260e-6, 1260e-6]

Parameters for each sequence are sampled from the distributions defined by the arguments:
  • p0: log-normal from exp(N(p0_loc, p0_scale))

  • k: log-normal from k_95, the 95-percentile for k

  • A: uniform from [0, 1]

Other args:

  • uniq_seq_num (int) – number of unique sequences in the simulation

  • depth (int or float) – sequencing depth, defined as mean reads per sequence

  • total_amount_error (float or callable) – float as the standard deviation for a fixed Gaussian error, or any error function on the DNA amount. Use 0 for no introduced error.

  • save_to (str) – optional, path to save the simulation results with x, Y, truth csv file and a pickled SeqData object

  • plot_dist (bool) – whether to plot pairwise figures of the distributions of simulated parameters (p0, A, k, kA)

Returns

  • x (pd.DataFrame) – controlled variable (c, n) for samples

  • Y (pd.DataFrame) – simulated sequence counts for given samples

  • param_table (pd.DataFrame) – table listing the parameters for simulated sequences, including p0, k, A, kA

  • truth (pd.DataFrame) – true values of parameters (e.g. p0, k, A) for simulated sequences

  • seq_table (data.SeqData) – a SeqData object to store all the data

k_seq.data.visualzation module