k_seq.estimate: least squares fitting

This module contains the submodules used for estimation:

k_seq.estimate.least_squares

Least-squares fitting of a kinetic model.

Several features are included:
  • point estimation using scipy.optimize.curve_fit

  • option to exclude zero values in fitting

  • option to set initial parameter values

  • weighted fitting using customized weights

  • confidence interval estimation using bootstrap
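
A minimal usage sketch (hedged: the kinetic model, data values, and option choices below are illustrative, not part of the library):

    import numpy as np
    from k_seq.estimate.least_squares import SingleFitter

    # Illustrative pseudo-first-order kinetic model: y = A * (1 - exp(-k * t))
    def kinetic_model(t, k, A):
        return A * (1 - np.exp(-k * np.array(t)))

    x = [0.25, 0.5, 1.0, 2.0, 4.0]      # e.g. substrate concentrations or times (illustrative)
    y = [0.05, 0.11, 0.18, 0.29, 0.41]  # observed reacted fractions (illustrative)

    fitter = SingleFitter(x_data=x, y_data=y, model=kinetic_model,
                          bounds=[[0, 0], [np.inf, 1]],   # [lower, upper] bounds for (k, A)
                          bootstrap_num=100, bs_record_num=20, bs_method='data')
    fitter.fit(point_estimate=True, bootstrap=True)
    print(fitter.results.point_estimation.params)  # estimated parameters as a pd.Series
    print(fitter.results.uncertainty.summary)      # bootstrap summary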

class k_seq.estimate.least_squares.FitResults(estimator, model=None)

Bases: object

A class to store, process, and visualize the fitting results of a single estimator; it contains almost all information needed for further analysis

estimator

proxy to the estimator

Type

Estimator

model

model used in fitting

Type

callable

data

a scope storing the fitting data, including x_data (pd.Series), y_data (pd.Series), sigma (pd.Series), x_label (str, name for x data), and y_label (str, name for y data)

Type

AttrScope

point_estimation

a scope storing point estimation results, including params (pd.Series, the parameter estimates with extra metrics calculated) and pcov (pd.DataFrame, the covariance matrix of the estimated parameters)

Type

AttrScope

uncertainty

a scope storing uncertainty estimation results, including summary (pd.Series, summary of each parameter or metric from records), bs_records (pd.DataFrame, stored bootstrapping records), and rep_results (pd.DataFrame, results from fitting replicates)

Type

AttrScope

convergence

a scope storing convergence test results, including summary (pd.DataFrame, summary of each parameter or metric from records) and records (pd.DataFrame, records of repeated fitting results)

Type

AttrScope

__init__(estimator, model=None)
Parameters

estimator (SingleFitter) – estimator used to generate this fitting result

classmethod from_json(json_path, tarfile=None, gzip=True, estimator=None, data=None)

Load fitting results from JSON records, with the option to load from tar.gz files. Note: no estimator info is included if estimator is None

Parameters
  • json_path (str) – path to the JSON file, or the file name inside the tarball if tarfile is given

  • tarfile (str) – if not None, the JSON file is inside a tarfile (.tar/.tar.gz)

  • gzip (bool) – if True, the tarfile is compressed with gzip (.tar.gz); if False, the tarfile is not compressed (.tar)

  • estimator (SingleFitter) – optional. Recover the estimator instance.

  • data (dict) – supplied dictionary to add data info
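
A loading sketch (hedged: the file names are placeholders):

    from k_seq.estimate.least_squares import FitResults

    # Load one sequence's result stored inside a gzipped tarball of per-sequence JSON files
    result = FitResults.from_json('seq_1.json', tarfile='results.tar.gz', gzip=True)
    print(result.point_estimation.params)
    print(result.uncertainty.summary)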

plot_fitting_curves(model=None, plot_on='bootstrap', subsample=20, x_lim=(-3e-05, 0.002), y_lim=None, x_label=None, y_label=None, legend=False, legend_loc='upper left', fontsize=12, params=None, ax=None, **kwargs)

Plot fitted curves for aminoacylation ribozyme fitting results; obj should be a FitResults instance

plot_loss_heatmap(model=None, plot_on='bootstrap', subsample=20, param1_range=(0.01, 10000.0), param2_range=(0.001, 1), legend=False, legend_loc='upper left', colorbar=True, resolution=101, fontsize=12, param_name=None, add_lines=None, line_label=True, ax=None, **kwargs)

Plot the 2-D heatmap of loss function

to_json(path=None)

Convert results into a JSON string/file with the structure:

{
  point_estimation: {params: jsonfy(pd.Series),
                     pcov: jsonfy(pd.DataFrame)},
  uncertainty: {summary: jsonfy(pd.Series),
                bs_records: jsonfy(pd.DataFrame),
                rep_results: jsonfy(pd.Series)},
  convergence: {summary: jsonfy(pd.DataFrame),
                records: jsonfy(pd.DataFrame)},
  data: {x_data: jsonfy(pd.Series),
         y_data: jsonfy(pd.Series),
         sigma: jsonfy(pd.Series),
         x_label: str,
         y_label: str}
}

to_series()

Convert point_estimation, uncertainty (if available), and convergence (if available) results to a pd.Series with flattened info; e.g., entries will include [param1, param2, bs_param1_mean, bs_param1_std, bs_param1_2.5%, …, param1_range]
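
For example, results from several fitted estimators could be collected into one table (hedged sketch; `fitters` is assumed to be a dict mapping sequence names to fitted SingleFitter instances):

    import pandas as pd

    summary_table = pd.DataFrame(
        {name: f.results.to_series() for name, f in fitters.items()}
    ).T
    summary_table.to_csv('fitting_summary.csv')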

class k_seq.estimate.least_squares.SingleFitter(x_data, y_data, model, name=None, x_label=None, y_label=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, save_to=None, overwrite=False, verbose=1)

Bases: k_seq.estimate._estimator.Estimator

Use scipy.optimize.curve_fit to fit a model for a sequence's time/concentration series. It can conduct point estimation, bootstrap uncertainty estimation, and empirical CI estimation

x_data

list of x values in fitting

Type

list

y_data

y values in fitting

Type

list, pd.Series

model

model to fit

Type

callable

parameter
silent
name

Optional. Estimator’s name

Type

str

bootstrap

proxy to the bootstrap object

Type

Bootstrap

results

proxy to the FitResult object

Type

FitResult

config

namespace for the fitting configuration, containing:

Type

AttrScope

opt_method

Optimization methods in scipy.optimize. Default ‘trf’

Type

str

exclude_zero

Whether to exclude zero/missing data in fitting. Default False.

Type

bool

init_guess

Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None

Type

list of float or generator

rnd_seed

random seed used in fitting for reproducibility

Type

int

sigma

Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

Type

list, pd.Series, or pd.DataFrame

bounds

Optional, [[lower bounds], [higher bounds]] for each parameter

Type

2 by m list

metric
curve_fit_kwargs

other keyword parameters to pass to scipy.optimize.curve_fit

Type

dict

bootstrap_config

namespace for the bootstrap configuration, containing:

Type

AttrScope

bootstrap_num

Number of bootstrap samples to perform; 0 means no bootstrap

Type

int

bs_record_num

Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

Type

int

bs_method

Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

Type

str

__init__(x_data, y_data, model, name=None, x_label=None, y_label=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, save_to=None, overwrite=False, verbose=1)

Initialize a SingleFitter instance

Parameters
  • x_data (list) – list of x values in fitting

  • y_data (list, pd.Series) – y values in fitting

  • model (callable) – model to fit

  • name (str) – Optional. Estimator’s name

  • x_label (str) – name of x data

  • y_label (str) – name of y data

  • sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

  • bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter

  • init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None

  • opt_method (str) – Optimization methods in scipy.optimize. Default ‘trf’

  • exclude_zero (bool) – Whether to exclude zero/missing data in fitting. Default False.

  • metrics (dict of callable) – Optional. Extra metric/parameters to calculate for each estimation

  • rnd_seed (int) – random seed used in fitting for reproducibility

  • curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit

  • replicates (list of list) – List of lists of sample names for each replicate

  • bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap

  • bs_record_num (int) – Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

  • bs_method (str) – Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

  • bs_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • grouper (dict or Grouper) – Indicate the grouping of samples

  • record_full (bool) – if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

  • conv_reps (int) – number of repeated fitting from perturbed initial points for convergence test

  • conv_init_range (list of 2-tuple) – a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

  • conv_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • save_to (str) – optional. If not None, save results to the given path when fitting finishes

  • overwrite (bool) – if True, overwrite the save_to file if it already exists; if False, read the existing results and skip estimation. Default False

  • verbose (0, 1, 2) – set the verbose level. 0: WARNING, 1: INFO, 2: DEBUG
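
A configuration sketch continuing the module example above (hedged: the initial ranges, curve_fit keyword, and paths are illustrative):

    fitter = SingleFitter(
        x_data=x, y_data=y, model=kinetic_model,
        conv_reps=20,
        conv_init_range=[(1e-3, 1e2), (0, 1)],   # (min, max) init range per parameter (illustrative)
        curve_fit_kwargs={'max_nfev': 10000},    # forwarded to scipy.optimize.curve_fit
        rnd_seed=23,
        save_to='fit_output/seq_1_result.json',  # placeholder path; results written when fitting finishes
    )
    fitter.fit(point_estimate=True, convergence_test=True)
    print(fitter.results.convergence.summary)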

convergence_test(**kwargs)

Empirically estimate convergence by repeated fittings

Parameters
  • conv_reps (int) – number of repeated fittings from perturbed initial points for the convergence test

  • conv_init_range (list of 2-tuple) – a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

  • conv_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

Returns

summary (pd.Series)

records (pd.DataFrame): full records of the repeated fittings

fit(point_estimate=True, replicates=False, bootstrap=False, convergence_test=False, **kwargs)

Run fitting, with configuration taken from the object

Parameters
  • point_estimate (bool) – whether to do point estimation, default True

  • replicates (bool) – whether to use replicates for uncertainty estimation, default False

  • bootstrap (bool) – whether to do bootstrap, default False

  • convergence_test (bool) – whether to do a convergence test, default False

classmethod from_json(file_path, model)

Create an estimator from a saved JSON file

Parameters
  • file_path (str) – path to saved json file

  • model (callable) – since callables are not JSON-serializable, the model needs to be reassigned

Notes

bs_stats, conv_stats currently can not be recovered

point_estimate(**kwargs)

Fitting using scipy.optimize.curve_fit. Arguments will be inferred from the instance's attributes if not provided

Parameters
  • model (callable) – model to fit

  • x_data (list) – list of x values in fitting

  • y_data (list, pd.Series) – y values in fitting

  • sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

  • bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter

  • metrics (dict of callable) – Optional. Extra metric/parameters to calculate for each estimation

  • init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None

  • curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit

Returns: A dictionary containing the least-squares fitting results
  • params: pd.Series of estimated parameters

  • pcov: pd.DataFrame of the covariance matrix

  • metrics: None or pd.Series of calculated metrics
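
For example (hedged: assumes the keys mirror the documented return dictionary):

    out = fitter.point_estimate()
    print(out['params'])   # pd.Series of estimated parameters (plus extra metrics, if configured)
    print(out['pcov'])     # pd.DataFrame covariance matrix
    print(out['metrics'])  # None, or pd.Series of calculated metrics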

run_bootstrap(bs_record_num=None, **kwargs)

Use bootstrap to estimate uncertainty

Parameters
  • bs_method (str) – Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

  • bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap

  • grouper (dict or Grouper) – Indicate the grouping of samples

  • bs_record_num (int) – Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

  • bs_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • record_full (bool) – if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

Returns

summary (pd.Series)

results (pd.DataFrame): records, subsampled if 0 <= bs_record_num <= bootstrap_num

run_replicates(replicates=None)

Use replicates to estimate uncertainty

Parameters

replicates (list of list) – list of lists of sample names for each replicate

Returns

a pd.DataFrame of results from each replicate

summary()

Return a pd.Series as the fitting summary with flattened info

to_dict()

Save estimator configuration as a dictionary

Returns

Dict of configurations for the estimator

to_json(save_to_file=None)

Save the estimator configuration as a json file, except for model, bs_stats, conv_stats as these are not json-able

k_seq.estimate.least_squares_batch

Least-squares fitting for a batch of sequences

class k_seq.estimate.least_squares_batch.BatchFitResults(estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)

Bases: object

Parse, store, and visualize BatchFitter results. Only the results are saved (separate from each estimator); the corresponding estimator can be found by sequence. Two data storage strategies are used:

  1. smaller datasets saved as results.pkl: the pickled file is passed, and the results will be loaded to self.summary, self.bs_record, self.conv_record

  2. larger datasets saved in a results/ folder: self.summary will be loaded, and self.bs_record and self.conv_record will be linked

estimator

proxy to the BatchFitter

model

model function

Type

callable

data

contains data information: x_data (list-like): x values for fitting; y_dataframe (pd.DataFrame): a table of y_data for the sequences; sigma (pd.DataFrame): a table of sigma values for each sequence in fitting

Type

AttrScope

large_data

if True, it will not load all bootstrap or convergence results

Type

bool

summary

summarized results with each sequence as index

Type

pd.DataFrame

bs_record()

get bootstrap results {seq: SingleFitter.results.uncertainty.records}

conv_record()

get convergence results {seq: SingleFitter.results.convergence.records}

summary_to_csv()

export summary dataframe as csv file

to_json()

storage strategy for large files: save results as a folder of json files

to_pickle()

storage strategy for small files: save results as pickled dictionary

from_pickle()

load from a pickled dictionary

from_json()

load from a folder of json files

load_result()

overall method that infers whether to load BatchFitResults from a pickled file or a folder


__init__(estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)

Init a BatchFitResults instance

Parameters

estimator (BatchFitter) – corresponding estimator

bs_record(seqs=None)

Retrieve bootstrap records

conv_record(seqs=None)

Retrieve convergence records

classmethod from_json(path_to_folder, estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)

Load results from folder of results with json format

classmethod from_pickle(path_to_pickle, estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)

Create a BatchFitResults instance with results loaded from pickle. Notice: this could take a very long time if the pickled file is large; consider using to_json for large datasets

static generate_summary(result_folder_path, n_core=1, save_to=None)

Generate a summary CSV file from a given result folder. This can be used if the summary was not successfully generated during fitting.

The result folder should have the structure:
  • seqs/
    - [seq name or hash].json
    - [if hash] seq_to_hash.json

Parameters
  • result_folder_path (str) – path to the root of results folder

  • n_core (int) – number of threads to process in parallel. Default 1

  • save_to (str) – save CSV file to local path

Returns

pd.DataFrame of summary

get_FitResult(seq=None)

Get FitResults from a JSON file

classmethod load_result(result_path, estimator=None)
summary_to_csv(path)

Save summary table as csv file

to_json(output_dir)
Save results in JSON format, with the structure:

output_dir/
|- summary.csv
|- seqs/
   |- seq1.json
   |- seq2.json

Notes

Bootstrap and convergence records should already be streamed as separate JSON files under /seqs/

Parameters

output_dir (str) – path of folder to save results

to_pickle(output_dir, bs_record=True, conv_record=True)

Save fitting results as a single pickled dict, suitable for small dataset. For large dataset to_json is preferred

Parameters
  • output_dir (str) – path to saved results, should have suffix of .pkl

  • bs_record (bool) – if output bs_record, default True

  • conv_record (bool) – if output conv_record, default True
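
A save/load sketch (hedged: `batch_results` stands for a BatchFitResults instance from a finished BatchFitter run, and the paths are placeholders):

    from k_seq.estimate.least_squares_batch import BatchFitResults

    # Small dataset: a single pickled dict
    batch_results.to_pickle('fit_output/results.pkl')
    reloaded = BatchFitResults.from_pickle('fit_output/results.pkl')

    # Large dataset: a folder of JSON files (summary.csv + seqs/*.json)
    batch_results.to_json('fit_output/results')
    reloaded = BatchFitResults.from_json('fit_output/results')
    print(reloaded.summary.head())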

class k_seq.estimate.least_squares_batch.BatchFitter(y_dataframe, x_data, model, x_label=None, y_label=None, seq_to_fit=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, note=None, large_dataset=None, verbose=1, save_to=None)

Bases: k_seq.estimate._estimator.Estimator

Least-squares fitting for batch of sequences

y_dataframe

Table of y values for sequences (rows) to fit kinetic models

Type

pd.DataFrame

model

model to fit

Type

callable

x_data

list of x values in fitting

Type

list

seq_to_fit

pick top n sequences in the table for fitting or only fit selected sequences

Type

int or list of seq

sigma

Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

Type

list, pd.Series, or pd.DataFrame

note

note about this fitting job

Type

str

results

accessor to fitting results

Type

BatchFitResult

fit_params

collection of arguments passed to each single-sequence fitting, including:
  x_data (list): list of x values in fitting
  model (callable): model to fit
  bounds (2 by m list): Optional, [[lower bounds], [higher bounds]] for each parameter
  init_guess (list of float or generator): initial guesses for the parameters; random values from 0 to 1 will be used if None
  opt_method (str): optimization method in scipy.optimize. Default ‘trf’
  exclude_zero (bool): whether to exclude zero/missing data in fitting. Default False
  metrics (dict of callable): Optional. Extra metrics/parameters to calculate for each estimation
  rnd_seed (int): random seed used in fitting for reproducibility
  curve_fit_kwargs (dict): other keyword arguments to pass to scipy.optimize.curve_fit
  replicates (list of list): list of lists of sample names for each replicate
  bootstrap_num (int): number of bootstrap samples to perform; 0 means no bootstrap
  bs_record_num (int): number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)
  bs_method (str): bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)
  bs_stats (dict of callable): stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
  grouper (dict or Grouper): indicates the grouping of samples
  record_full (bool): if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False
  conv_reps (int): number of repeated fittings from perturbed initial points for the convergence test
  conv_init_range (list of 2-tuple): a list of (min, max) 2-tuples with the same length as model parameters; if None, all parameters are initialized from (0, 1) with a random uniform draw
  conv_stats (dict of callable): stats functions that take the full record table and return a single value, dict, or pd.Series
  overwrite:

Type

AttrScope

__init__(y_dataframe, x_data, model, x_label=None, y_label=None, seq_to_fit=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, note=None, large_dataset=None, verbose=1, save_to=None)

Initialize a BatchFitter

Parameters
  • y_dataframe (pd.DataFrame) – Table of y values for sequences (rows) to fit kinetic models

  • x_data (list) – list of x values in fitting

  • model (callable) – model to fit

  • seq_to_fit (int or list of seq) – pick top n sequences in the table for fitting or only fit selected sequences

  • sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

  • bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter

  • init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None

  • opt_method (str) – Optimization methods in scipy.optimize. Default ‘trf’

  • exclude_zero (bool) – Whether to exclude zero/missing data in fitting. Default False.

  • metrics (dict of callable) – Optional. Extra metric/parameters to calculate for each estimation

  • rnd_seed (int) – random seed used in fitting for reproducibility

  • curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit

  • bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap

  • bs_record_num (int) – Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

  • bs_method (str) – Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

  • bs_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • grouper (dict or Grouper) – Indicate the grouping of samples

  • record_full (bool) – if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

  • conv_reps (int) – number of repeated fitting from perturbed initial points for convergence test

  • conv_init_range (list of 2-tuple) – a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

  • conv_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • note (str) – optional notes for the estimator

  • large_dataset (bool) – whether to trigger the strategy for working with large datasets (e.g. > 1000 seqs). If True, sequences with the same reacted fractions in each concentration will be deduplicated and results will be streamed to the hard drive

  • save_to (str) – if not None, load results from the path

fit(parallel_cores=1, point_estimate=True, replicates=False, bootstrap=False, convergence_test=False, overwrite=False)

Run the estimation

Parameters
  • parallel_cores (int) – number of parallel cores to use. Default 1

  • point_estimate (bool) – whether to perform point estimation, default True

  • replicates (bool) – whether to perform replicate-based uncertainty estimation, default False

  • bootstrap (bool) – whether to perform bootstrap uncertainty estimation, default False

  • convergence_test (bool) – whether to perform a convergence test, default False

  • overwrite (bool) – whether to overwrite existing results when streaming to disk. Default False.

If an output path is given, fitting results are streamed directly to disk: a folder named by seq/hash will be created with the pickled dict of fitting results.
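
A batch-fitting sketch (hedged: the model and data values are illustrative):

    import numpy as np
    import pandas as pd
    from k_seq.estimate.least_squares_batch import BatchFitter

    # Illustrative model, same form as in the SingleFitter example
    def kinetic_model(t, k, A):
        return A * (1 - np.exp(-k * np.array(t)))

    # One row per sequence, one column per sample/condition (illustrative values)
    y = pd.DataFrame([[0.04, 0.10, 0.17, 0.30],
                      [0.02, 0.05, 0.09, 0.15]],
                     index=['seq_1', 'seq_2'],
                     columns=['s1', 's2', 's3', 's4'])
    x = [0.25, 0.5, 1.0, 2.0]

    batch = BatchFitter(y_dataframe=y, x_data=x, model=kinetic_model,
                        bootstrap_num=100, bs_record_num=20,
                        large_dataset=False, note='demo batch fitting')
    batch.fit(parallel_cores=2, point_estimate=True, bootstrap=True)
    print(batch.results.summary.head())
    batch.save_model('fit_output', results=True, tables=True)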

classmethod load_model(model_path, y_dataframe=None, sigma=None, result_path=None)

Create a model from pickled config file

Parameters
  • model_path (str) – path to the pickled model configuration file or the saved folder

  • y_dataframe (pd.DataFrame or str) – y_data table for fitting

  • sigma (pd.DataFrame or str) – optional sigma table for fitting

  • result_path (str) – path to fitting results

Returns

a BatchFitter instance

save_model(output_dir, results=True, bs_record=True, conv_record=True, tables=True)

Save the model to a given directory. model_config will be saved as a pickled dictionary to recover the model, except for y_dataframe and sigma, which are too large.

Parameters
  • output_dir (str) – path to save the model, create if the path does not exist

  • results (bool) – whether to also save estimation results, to be loaded by BatchFitResults. Default True

  • bs_record (bool) – if save bootstrap records, default True

  • conv_record (bool) – if save convergence records, default True

  • tables (bool) – if save tables (y_dataframe, sigma) in the folder. Default True

save_results(result_path, bs_record=True, conv_record=True)

Save results to disk as JSON or pickle. JSON is preferred for speed, readability, compatibility, and security

property save_to
summary(save_to=None)

k_seq.estimate.bootstrap

Uncertainty estimation through bootstrap

class k_seq.estimate.bootstrap.Bootstrap(estimator, bootstrap_num, bs_record_num, bs_method, grouper=None, bs_stats=None, record_full=False)

Bases: object

Perform bootstrap for fitting uncertainty estimation

Three types of bootstrap are supported:
  • rel_res: resample the percent residuals, based on the assumption that variance is proportional to the mean (from the data property)

  • data: directly resample data points

  • stratified: resample within groups; grouper is required

estimator

accessor to the associated estimator

Type

EstimatorBase type

bs_method

Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

Type

str

bootstrap_num

Number of bootstrap samples to perform; 0 means no bootstrap

Type

int

bs_record_num

Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

Type

int

bs_stats

a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

Type

dict of callable

grouper

Indicate the grouping of samples

Type

dict or Grouper

record_full

if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

Type

bool

__init__(estimator, bootstrap_num, bs_record_num, bs_method, grouper=None, bs_stats=None, record_full=False)
Parameters
  • estimator (Estimator) – the estimator that generates the results

  • bootstrap_num (int) – number of bootstrap samples to perform; 0 means no bootstrap

  • bs_record_num (int) – number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

  • bs_method (str) – bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

  • bs_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • grouper (dict or Grouper) – indicates the grouping of samples

  • record_full (bool) – if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

property bs_method
run()

Perform bootstrap with arguments indicated in instance attributes

Returns

summary (pd.Series): summarized results for each parameter and metric from bootstrap

records (pd.DataFrame): records of bootstrapped results, one bootstrapped result per row
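
A direct-use sketch (hedged: `fitter` is a configured SingleFitter as in the earlier examples, and the two documented return values are assumed to come back as a (summary, records) tuple):

    from k_seq.estimate.bootstrap import Bootstrap

    bs = Bootstrap(fitter, bootstrap_num=500, bs_record_num=50, bs_method='data')
    summary, records = bs.run()
    print(summary)          # per-parameter/metric summary
    print(records.head())   # one bootstrapped fit per row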

k_seq.estimate.replicates

Uncertainty estimation using replicates

class k_seq.estimate.replicates.Replicates(estimator, replicates)

Bases: object

property n_replicates
run()

Perform fitting for replicates

k_seq.estimate.convergence

Module to assess the convergence of fitting, e.g. model identifiability

class k_seq.estimate.convergence.ConvergenceTester(estimator, conv_reps=10, conv_init_range=None, conv_stats=None)

Bases: object

Apply repeated fitting on an Estimator with perturbed initial values to test empirical convergence. The convergence test results are stored separately, as these are tests distinct from the estimation

conv_reps

number of repeated fitting from perturbed initial points for convergence test

Type

int

estimator

estimator for fitting

Type

Estimator

conv_init_range

a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

Type

list of 2-tuple

conv_stats

a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

Type

dict of callable

run()

Run the convergence test and return a summary and full records

__init__(estimator, conv_reps=10, conv_init_range=None, conv_stats=None)

Apply convergence test to given estimator

Parameters
  • estimator (Estimator) – estimator for fitting

  • conv_reps (int) – number of repeated fitting from perturbed initial points for convergence test

  • conv_init_range (list of 2-tuple) – a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

  • conv_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

run()

Run convergence test, report a summary and full records

Returns

summary (pd.Series): the mean, sd, and range of each reported parameter, plus the conv_stats results

records (pd.DataFrame): the full records
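
A usage sketch (hedged: `fitter` is a SingleFitter as in the earlier examples; the (summary, records) tuple return and the init ranges are assumptions):

    from k_seq.estimate.convergence import ConvergenceTester

    tester = ConvergenceTester(fitter, conv_reps=10,
                               conv_init_range=[(1e-3, 1e2), (0, 1)])
    summary, records = tester.run()
    print(summary)   # mean, sd, and range of each parameter across repeated fits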

k_seq.estimate.model_ident

Modules for model identifiability analysis

class k_seq.estimate.model_ident.ParamMap(model, sample_n, x_data, save_to, param1_name, param1_range, param2_name, param2_range, param1_log=False, param2_log=False, model_kwargs=None, bootstrap_num=100, bs_record_num=50, bs_method='data', bs_stats=None, grouper=None, record_full=False, conv_reps=20, conv_stats=None, conv_init_range=None, fitting_kwargs=None, seed=23)

Bases: object

Generate a 2-D convergence map for randomly sampled data points from a given parameter range. It simulates sample_n sequence samples with random parameters selected from the ranges param1_range and param2_range, optionally on a log scale
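
A sketch of building an identifiability map (hedged: the parameter names, ranges, error level, and output path are illustrative, and kinetic_model refers to the model defined in the earlier examples):

    from k_seq.estimate.model_ident import ParamMap

    pmap = ParamMap(model=kinetic_model, sample_n=1000, x_data=x, save_to='param_map',
                    param1_name='k', param1_range=(1e-2, 1e4), param1_log=True,
                    param2_name='A', param2_range=(1e-3, 1), param2_log=False,
                    bootstrap_num=100, conv_reps=20, seed=23)
    pmap.simulate_samples(grid=True, rel_err=0.2)  # simulate samples with 20% relative error
    pmap.fit()                                     # batch fit the simulated samples
    pmap.plot_map(gridsize=50)                     # draw the 2-D map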

fit(**kwargs)

Batch fit simulated result

get_metric_values(metric, finite_only=False)

Returns a pd.Series containing the metric values

classmethod load_result(result_path, model=None)
plot_map(metric=None, metric_label=None, scatter=False, gridsize=50, figsize=(5, 5), ax=None, cax_pos=(0.91, 0.58, 0.03, 0.3), **plot_kwargs)
simulate_samples(grid=True, const_err=None, rel_err=None, y_enforce_positive=True)

Simulate a set of samples (param1 and param2)

k_seq.estimate.model_ident.gamma(df)

Get the metric gamma = log10(sigma_k * mu_A / sigma_kA)

k_seq.estimate.model_ident.kendall_log(records)
k_seq.estimate.model_ident.pearson(records)
k_seq.estimate.model_ident.pearson_log(records)
k_seq.estimate.model_ident.remove_nan(df)
k_seq.estimate.model_ident.spearman(records)
k_seq.estimate.model_ident.spearman_log(records)

k_seq.estimate.visualizer