k_seq.estimate: least squares fitting

This module contains the submodules used for estimation:

k_seq.estimate.least_squares

Least-squares fitting of a kinetic model.

Several features are included:
  • point estimation using scipy.optimize.curve_fit

  • option to exclude zero values in fitting

  • option to set initial parameter values

  • weighted fitting using customized weights

  • confidence interval estimation using bootstrap
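
A minimal usage sketch (hedged: the kinetic model, data values, and option choices below are illustrative, not part of the library):

    import numpy as np
    from k_seq.estimate.least_squares import SingleFitter

    # Illustrative pseudo-first-order kinetic model: y = A * (1 - exp(-k * t))
    def kinetic_model(t, k, A):
        return A * (1 - np.exp(-k * np.array(t)))

    x = [0.25, 0.5, 1.0, 2.0, 4.0]      # e.g. substrate concentrations or times (illustrative)
    y = [0.05, 0.11, 0.18, 0.29, 0.41]  # observed reacted fractions (illustrative)

    fitter = SingleFitter(x_data=x, y_data=y, model=kinetic_model,
                          bounds=[[0, 0], [np.inf, 1]],   # [lower, upper] bounds for (k, A)
                          bootstrap_num=100, bs_record_num=20, bs_method='data')
    fitter.fit(point_estimate=True, bootstrap=True)
    print(fitter.results.point_estimation.params)  # estimated parameters as a pd.Series
    print(fitter.results.uncertainty.summary)      # bootstrap summary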

class k_seq.estimate.least_squares.FitResults(estimator, model=None)

Bases: object

A class to store, process, and visualize the fitting results of a single estimator; it contains almost all information needed for further analysis

estimator

proxy to the estimator

Type

Estimator

model

model used in fitting

Type

callable

data

a scope storing the fitting data, including x_data (pd.Series), y_data (pd.Series), sigma (pd.Series), x_label (str, name for x data), and y_label (str, name for y data)

Type

AttrScope

point_estimation

a scope storing point estimation results, including params (pd.Series, the parameter estimates with extra metrics calculated) and pcov (pd.DataFrame, the covariance matrix of the estimated parameters)

Type

AttrScope

uncertainty

a scope storing uncertainty estimation results, including summary (pd.Series, summary of each parameter or metric from records), bs_records (pd.DataFrame, stored bootstrapping records), and rep_results (pd.DataFrame, results from fitting replicates)

Type

AttrScope

convergence

a scope storing convergence test results, including summary (pd.DataFrame, summary of each parameter or metric from records) and records (pd.DataFrame, records of repeated fitting results)

Type

AttrScope

__init__(estimator, model=None)
Parameters

estimator (SingleFitter) – estimator used to generate this fitting result

classmethod from_json(json_path, tarfile=None, gzip=True, estimator=None, data=None)

Load fitting results from JSON records, with the option to load from tar.gz files. Note: no estimator info is included if estimator is None

Parameters
  • json_path (str) – path to the JSON file, or the file name inside the tarball if tarfile is given

  • tarfile (str) – if not None, the JSON file is inside a tarfile (.tar/.tar.gz)

  • gzip (bool) – if True, the tarfile is compressed with gzip (.tar.gz); if False, the tarfile is not compressed (.tar)

  • estimator (SingleFitter) – optional. Recover the estimator instance.

  • data (dict) – supplied dictionary to add data info
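
A loading sketch (hedged: the file names are placeholders):

    from k_seq.estimate.least_squares import FitResults

    # Load one sequence's result stored inside a gzipped tarball of per-sequence JSON files
    result = FitResults.from_json('seq_1.json', tarfile='results.tar.gz', gzip=True)
    print(result.point_estimation.params)
    print(result.uncertainty.summary)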

plot_fitting_curves(model=None, plot_on='bootstrap', subsample=20, x_lim=(-3e-05, 0.002), y_lim=None, x_label=None, y_label=None, legend=False, legend_loc='upper left', fontsize=12, params=None, ax=None, **kwargs)

Plot fitted curves for aminoacylation ribozyme fitting results; obj should be a FitResults instance

plot_loss_heatmap(model=None, plot_on='bootstrap', subsample=20, param1_range=(0.01, 10000.0), param2_range=(0.001, 1), legend=False, legend_loc='upper left', colorbar=True, resolution=101, fontsize=12, param_name=None, add_lines=None, line_label=True, ax=None, **kwargs)

Plot the 2-D heatmap of loss function

to_json(path=None)

Convert results into a JSON string/file with the structure:

{
  point_estimation: {params: jsonfy(pd.Series),
                     pcov: jsonfy(pd.DataFrame)},
  uncertainty: {summary: jsonfy(pd.Series),
                bs_records: jsonfy(pd.DataFrame),
                rep_results: jsonfy(pd.Series)},
  convergence: {summary: jsonfy(pd.DataFrame),
                records: jsonfy(pd.DataFrame)},
  data: {x_data: jsonfy(pd.Series),
         y_data: jsonfy(pd.Series),
         sigma: jsonfy(pd.Series),
         x_label: str,
         y_label: str}
}

to_series()

Convert point_estimation, uncertainty (if available), and convergence (if available) results to a pd.Series with flattened info; e.g., entries will include [param1, param2, bs_param1_mean, bs_param1_std, bs_param1_2.5%, …, param1_range]
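
For example, results from several fitted estimators could be collected into one table (hedged sketch; `fitters` is assumed to be a dict mapping sequence names to fitted SingleFitter instances):

    import pandas as pd

    summary_table = pd.DataFrame(
        {name: f.results.to_series() for name, f in fitters.items()}
    ).T
    summary_table.to_csv('fitting_summary.csv')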

class k_seq.estimate.least_squares.SingleFitter(x_data, y_data, model, name=None, x_label=None, y_label=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, save_to=None, overwrite=False, verbose=1)

Bases: k_seq.estimate._estimator.Estimator

Use scipy.optimize.curve_fit to fit a model for a sequence's time/concentration series. It can conduct point estimation, bootstrap uncertainty estimation, and empirical CI estimation

x_data

list of x values in fitting

Type

list

y_data

y values in fitting

Type

list, pd.Series

model

model to fit

Type

callable

parameter
silent
name

Optional. Estimator’s name

Type

str

bootstrap

proxy to the bootstrap object

Type

Bootstrap

results

proxy to the FitResult object

Type

FitResult

config

namespace for the fitting configuration, containing:

Type

AttrScope

opt_method

Optimization methods in scipy.optimize. Default ‘trf’

Type

str

exclude_zero

Whether to exclude zero/missing data in fitting. Default False.

Type

bool

init_guess

Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None

Type

list of float or generator

rnd_seed

random seed used in fitting for reproducibility

Type

int

sigma

Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

Type

list, pd.Series, or pd.DataFrame

bounds

Optional, [[lower bounds], [higher bounds]] for each parameter

Type

2 by m list

metric
curve_fit_kwargs

other keyword parameters to pass to scipy.optimize.curve_fit

Type

dict

bootstrap_config

namespace for the bootstrap configuration, containing:

Type

AttrScope

bootstrap_num

Number of bootstrap samples to perform; 0 means no bootstrap

Type

int

bs_record_num

Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

Type

int

bs_method

Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

Type

str

__init__(x_data, y_data, model, name=None, x_label=None, y_label=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, save_to=None, overwrite=False, verbose=1)

Initialize a SingleFitter instance

Parameters
  • x_data (list) – list of x values in fitting

  • y_data (list, pd.Series) – y values in fitting

  • model (callable) – model to fit

  • name (str) – Optional. Estimator’s name

  • x_label (str) – name of x data

  • y_label (str) – name of y data

  • sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

  • bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter

  • init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None

  • opt_method (str) – Optimization methods in scipy.optimize. Default ‘trf’

  • exclude_zero (bool) – Whether to exclude zero/missing data in fitting. Default False.

  • metrics (dict of callable) – Optional. Extra metric/parameters to calculate for each estimation

  • rnd_seed (int) – random seed used in fitting for reproducibility

  • curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit

  • replicates (list of list) – List of lists of sample names for each replicate

  • bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap

  • bs_record_num (int) – Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

  • bs_method (str) – Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

  • bs_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • grouper (dict or Grouper) – Indicate the grouping of samples

  • record_full (bool) – if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

  • conv_reps (int) – number of repeated fitting from perturbed initial points for convergence test

  • conv_init_range (list of 2-tuple) – a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

  • conv_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • save_to (str) – optional. If not None, save results to the given path when fitting finishes

  • overwrite (bool) – if True, overwrite the save_to file if it already exists; if False, read the existing results and skip estimation. Default False

  • verbose (0, 1, 2) – set the verbose level. 0: WARNING, 1: INFO, 2: DEBUG
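
A configuration sketch continuing the module example above (hedged: the initial ranges, curve_fit keyword, and paths are illustrative):

    fitter = SingleFitter(
        x_data=x, y_data=y, model=kinetic_model,
        conv_reps=20,
        conv_init_range=[(1e-3, 1e2), (0, 1)],   # (min, max) init range per parameter (illustrative)
        curve_fit_kwargs={'max_nfev': 10000},    # forwarded to scipy.optimize.curve_fit
        rnd_seed=23,
        save_to='fit_output/seq_1_result.json',  # placeholder path; results written when fitting finishes
    )
    fitter.fit(point_estimate=True, convergence_test=True)
    print(fitter.results.convergence.summary)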

convergence_test(**kwargs)

Empirically estimate convergence by repeated fittings

Parameters
  • conv_reps (int) – number of repeated fittings from perturbed initial points for the convergence test

  • conv_init_range (list of 2-tuple) – a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

  • conv_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

Returns

summary (pd.Series)

records (pd.DataFrame): full records of the repeated fittings

fit(point_estimate=True, replicates=False, bootstrap=False, convergence_test=False, **kwargs)

Run fitting, with configuration taken from the object

Parameters
  • point_estimate (bool) – whether to do point estimation, default True

  • replicates (bool) – whether to use replicates for uncertainty estimation, default False

  • bootstrap (bool) – whether to do bootstrap, default False

  • convergence_test (bool) – whether to do a convergence test, default False

classmethod from_json(file_path, model)

Create an estimator from a saved JSON file

Parameters
  • file_path (str) – path to saved json file

  • model (callable) – since callables are not JSON-serializable, the model needs to be reassigned

Notes

bs_stats, conv_stats currently can not be recovered

point_estimate(**kwargs)

Fitting using scipy.optimize.curve_fit. Arguments will be inferred from the instance's attributes if not provided

Parameters
  • model (callable) – model to fit

  • x_data (list) – list of x values in fitting

  • y_data (list, pd.Series) – y values in fitting

  • sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

  • bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter

  • metrics (dict of callable) – Optional. Extra metric/parameters to calculate for each estimation

  • init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None

  • curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit

Returns: A dictionary containing the least-squares fitting results
  • params: pd.Series of estimated parameters

  • pcov: pd.DataFrame of the covariance matrix

  • metrics: None or pd.Series of calculated metrics
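
For example (hedged: assumes the keys mirror the documented return dictionary):

    out = fitter.point_estimate()
    print(out['params'])   # pd.Series of estimated parameters (plus extra metrics, if configured)
    print(out['pcov'])     # pd.DataFrame covariance matrix
    print(out['metrics'])  # None, or pd.Series of calculated metrics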

run_bootstrap(bs_record_num=None, **kwargs)

Use bootstrap to estimate uncertainty

Parameters
  • bs_method (str) – Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

  • bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap

  • grouper (dict or Grouper) – Indicate the grouping of samples

  • bs_record_num (int) – Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

  • bs_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • record_full (bool) – if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

Returns

summary (pd.Series)

results (pd.DataFrame): records, subsampled if 0 <= bs_record_num <= bootstrap_num

run_replicates(replicates=None)

Use replicates to estimate uncertainty

Parameters

replicates (list of list) – list of lists of sample names for each replicate

Returns

a pd.DataFrame of results from each replicate

summary()

Return a pd.Series as the fitting summary with flattened info

to_dict()

Save estimator configuration as a dictionary

Returns

Dict of configurations for the estimator

to_json(save_to_file=None)

Save the estimator configuration as a json file, except for model, bs_stats, conv_stats as these are not json-able

k_seq.estimate.least_squares_batch

Least-squares fitting for a batch of sequences

class k_seq.estimate.least_squares_batch.BatchFitResults(estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)

Bases: object

Parse, store, and visualize BatchFitter results. Only the results are saved (separate from each estimator); the corresponding estimator can be found by sequence. Two data storage strategies are used:

  1. smaller datasets saved as results.pkl: the pickled file is passed, and the results will be loaded to self.summary, self.bs_record, self.conv_record

  2. larger datasets saved in a results/ folder: self.summary will be loaded, and self.bs_record and self.conv_record will be linked

estimator

proxy to the BatchFitter

model

model function

Type

callable

data

contains data information: x_data (list-like): x values for fitting; y_dataframe (pd.DataFrame): a table of y_data for the sequences; sigma (pd.DataFrame): a table of sigma values for each sequence in fitting

Type

AttrScope

large_data

if True, it will not load all bootstrap or convergence results

Type

bool

summary

summarized results with each sequence as index

Type

pd.DataFrame

bs_record()

get bootstrap results {seq: SingleFitter.results.uncertainty.records}

conv_record()

get convergence results {seq: SingleFitter.results.convergence.records}

summary_to_csv()

export summary dataframe as csv file

to_json()

storage strategy for large files: save results as a folder of json files

to_pickle()

storage strategy for small files: save results as pickled dictionary

from_pickle()

load from a pickled dictionary

from_json()

load from a folder of json files

load_result()

overall method that infers whether to load BatchFitResults from a pickled file or a folder


__init__(estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)

Init a BatchFitResults instance

Parameters

estimator (BatchFitter) – corresponding estimator

bs_record(seqs=None)

Retrieve bootstrap records

conv_record(seqs=None)

Retrieve convergence records

classmethod from_json(path_to_folder, estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)

Load results from folder of results with json format

classmethod from_pickle(path_to_pickle, estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)

Create a BatchFitResults instance with results loaded from pickle. Notice: this could take a very long time if the pickled file is large; consider using to_json for large datasets

static generate_summary(result_folder_path, n_core=1, save_to=None)

Generate a summary CSV file from a given result folder. This can be used if the summary was not successfully generated during fitting.

The result folder should have the structure:
  • seqs/
    - [seq name or hash].json
    - [if hash] seq_to_hash.json

Parameters
  • result_folder_path (str) – path to the root of results folder

  • n_core (int) – number of threads to process in parallel. Default 1

  • save_to (str) – save CSV file to local path

Returns

pd.DataFrame of summary

get_FitResult(seq=None)

Get FitResults from a JSON file

classmethod load_result(result_path, estimator=None)
summary_to_csv(path)

Save summary table as csv file

to_json(output_dir)
Save results in JSON format, with the structure:

output_dir/
|- summary.csv
|- seqs/
   |- seq1.json
   |- seq2.json

Notes

Bootstrap and convergence records should already be streamed as separate JSON files under /seqs/

Parameters

output_dir (str) – path of folder to save results

to_pickle(output_dir, bs_record=True, conv_record=True)

Save fitting results as a single pickled dict, suitable for small dataset. For large dataset to_json is preferred

Parameters
  • output_dir (str) – path to saved results, should have suffix of .pkl

  • bs_record (bool) – if output bs_record, default True

  • conv_record (bool) – if output conv_record, default True
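
A save/load sketch (hedged: `batch_results` stands for a BatchFitResults instance from a finished BatchFitter run, and the paths are placeholders):

    from k_seq.estimate.least_squares_batch import BatchFitResults

    # Small dataset: a single pickled dict
    batch_results.to_pickle('fit_output/results.pkl')
    reloaded = BatchFitResults.from_pickle('fit_output/results.pkl')

    # Large dataset: a folder of JSON files (summary.csv + seqs/*.json)
    batch_results.to_json('fit_output/results')
    reloaded = BatchFitResults.from_json('fit_output/results')
    print(reloaded.summary.head())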

class k_seq.estimate.least_squares_batch.BatchFitter(y_dataframe, x_data, model, x_label=None, y_label=None, seq_to_fit=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, note=None, large_dataset=None, verbose=1, save_to=None)

Bases: k_seq.estimate._estimator.Estimator

Least-squares fitting for batch of sequences

y_dataframe

Table of y values for sequences (rows) to fit kinetic models

Type

pd.DataFrame

model

model to fit

Type

callable

x_data

list of x values in fitting

Type

list

seq_to_fit

pick top n sequences in the table for fitting or only fit selected sequences

Type

int or list of seq

sigma

Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

Type

list, pd.Series, or pd.DataFrame

note

note about this fitting job

Type

str

results

accessor to fitting results

Type

BatchFitResult

fit_params

collection of arguments passed to each single-sequence fitting, including:
  x_data (list): list of x values in fitting
  model (callable): model to fit
  bounds (2 by m list): Optional, [[lower bounds], [higher bounds]] for each parameter
  init_guess (list of float or generator): initial guesses for the parameters; random values from 0 to 1 will be used if None
  opt_method (str): optimization method in scipy.optimize. Default ‘trf’
  exclude_zero (bool): whether to exclude zero/missing data in fitting. Default False
  metrics (dict of callable): Optional. Extra metrics/parameters to calculate for each estimation
  rnd_seed (int): random seed used in fitting for reproducibility
  curve_fit_kwargs (dict): other keyword arguments to pass to scipy.optimize.curve_fit
  replicates (list of list): list of lists of sample names for each replicate
  bootstrap_num (int): number of bootstrap samples to perform; 0 means no bootstrap
  bs_record_num (int): number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)
  bs_method (str): bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)
  bs_stats (dict of callable): stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
  grouper (dict or Grouper): indicates the grouping of samples
  record_full (bool): if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False
  conv_reps (int): number of repeated fittings from perturbed initial points for the convergence test
  conv_init_range (list of 2-tuple): a list of (min, max) 2-tuples with the same length as model parameters; if None, all parameters are initialized from (0, 1) with a random uniform draw
  conv_stats (dict of callable): stats functions that take the full record table and return a single value, dict, or pd.Series
  overwrite:

Type

AttrScope

__init__(y_dataframe, x_data, model, x_label=None, y_label=None, seq_to_fit=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, note=None, large_dataset=None, verbose=1, save_to=None)

Initialize a BatchFitter

Parameters
  • y_dataframe (pd.DataFrame) – Table of y values for sequences (rows) to fit kinetic models

  • x_data (list) – list of x values in fitting

  • model (callable) – model to fit

  • seq_to_fit (int or list of seq) – pick top n sequences in the table for fitting or only fit selected sequences

  • sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) of data points for weighted fitting

  • bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter

  • init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None

  • opt_method (str) – Optimization methods in scipy.optimize. Default ‘trf’

  • exclude_zero (bool) – Whether to exclude zero/missing data in fitting. Default False.

  • metrics (dict of callable) – Optional. Extra metric/parameters to calculate for each estimation

  • rnd_seed (int) – random seed used in fitting for reproducibility

  • curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit

  • bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap

  • bs_record_num (int) – Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

  • bs_method (str) – Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

  • bs_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • grouper (dict or Grouper) – Indicate the grouping of samples

  • record_full (bool) – if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

  • conv_reps (int) – number of repeated fitting from perturbed initial points for convergence test

  • conv_init_range (list of 2-tuple) – a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

  • conv_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • note (str) – optional notes for the estimator

  • large_dataset (bool) – whether to trigger the strategy for working with large datasets (e.g. > 1000 seqs). If True, sequences with the same reacted fractions in each concentration will be deduplicated and results will be streamed to the hard drive

  • save_to (str) – if not None, load results from the path

fit(parallel_cores=1, point_estimate=True, replicates=False, bootstrap=False, convergence_test=False, overwrite=False)

Run the estimation

Parameters
  • parallel_cores (int) – number of parallel cores to use. Default 1

  • point_estimate (bool) – whether to perform point estimation, default True

  • replicates (bool) – whether to perform replicate-based uncertainty estimation, default False

  • bootstrap (bool) – whether to perform bootstrap uncertainty estimation, default False

  • convergence_test (bool) – whether to perform a convergence test, default False

  • overwrite (bool) – whether to overwrite existing results when streaming to disk. Default False.

If an output path is given, fitting results are streamed directly to disk: a folder named by seq/hash will be created with the pickled dict of fitting results.
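
A batch-fitting sketch (hedged: the model and data values are illustrative):

    import numpy as np
    import pandas as pd
    from k_seq.estimate.least_squares_batch import BatchFitter

    # Illustrative model, same form as in the SingleFitter example
    def kinetic_model(t, k, A):
        return A * (1 - np.exp(-k * np.array(t)))

    # One row per sequence, one column per sample/condition (illustrative values)
    y = pd.DataFrame([[0.04, 0.10, 0.17, 0.30],
                      [0.02, 0.05, 0.09, 0.15]],
                     index=['seq_1', 'seq_2'],
                     columns=['s1', 's2', 's3', 's4'])
    x = [0.25, 0.5, 1.0, 2.0]

    batch = BatchFitter(y_dataframe=y, x_data=x, model=kinetic_model,
                        bootstrap_num=100, bs_record_num=20,
                        large_dataset=False, note='demo batch fitting')
    batch.fit(parallel_cores=2, point_estimate=True, bootstrap=True)
    print(batch.results.summary.head())
    batch.save_model('fit_output', results=True, tables=True)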

classmethod load_model(model_path, y_dataframe=None, sigma=None, result_path=None)

Create a model from pickled config file

Parameters
  • model_path (str) – path to the pickled model configuration file or the saved folder

  • y_dataframe (pd.DataFrame or str) – y_data table for fitting

  • sigma (pd.DataFrame or str) – optional sigma table for fitting

  • result_path (str) – path to fitting results

Returns

a BatchFitter instance

save_model(output_dir, results=True, bs_record=True, conv_record=True, tables=True)

Save the model to a given directory. model_config will be saved as a pickled dictionary to recover the model, except for y_dataframe and sigma, which are too large.

Parameters
  • output_dir (str) – path to save the model, create if the path does not exist

  • results (bool) – whether to also save estimation results, to be loaded by BatchFitResults. Default True

  • bs_record (bool) – if save bootstrap records, default True

  • conv_record (bool) – if save convergence records, default True

  • tables (bool) – if save tables (y_dataframe, sigma) in the folder. Default True

save_results(result_path, bs_record=True, conv_record=True)

Save results to disk as JSON or pickle. JSON is preferred for speed, readability, compatibility, and security

property save_to
summary(save_to=None)

k_seq.estimate.bootstrap

Uncertainty estimation through bootstrap

class k_seq.estimate.bootstrap.Bootstrap(estimator, bootstrap_num, bs_record_num, bs_method, grouper=None, bs_stats=None, record_full=False)

Bases: object

Perform bootstrap for fitting uncertainty estimation

Three types of bootstrap are supported:
  • rel_res: resample the percent residuals, based on the assumption that variance is proportional to the mean (from the data property)

  • data: directly resample data points

  • stratified: resample within groups; grouper is required

estimator

accessor to the associated estimator

Type

EstimatorBase type

bs_method

Bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

Type

str

bootstrap_num

Number of bootstrap samples to perform; 0 means no bootstrap

Type

int

bs_record_num

Number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

Type

int

bs_stats

a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

Type

dict of callable

grouper

Indicate the grouping of samples

Type

dict or Grouper

record_full

if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

Type

bool

__init__(estimator, bootstrap_num, bs_record_num, bs_method, grouper=None, bs_stats=None, record_full=False)
Parameters
  • estimator (Estimator) – the estimator that generates the results

  • bootstrap_num (int) – number of bootstrap samples to perform; 0 means no bootstrap

  • bs_record_num (int) – number of bootstrap results to store; a negative number stores all results (not recommended due to memory consumption)

  • bs_method (str) – bootstrap method; choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)

  • bs_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

  • grouper (dict or Grouper) – indicates the grouping of samples

  • record_full (bool) – if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.

property bs_method
run()

Perform bootstrap with arguments indicated in instance attributes

Returns

summary (pd.Series): summarized results for each parameter and metric from bootstrap

records (pd.DataFrame): records of bootstrapped results, one bootstrapped result per row
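
A direct-use sketch (hedged: `fitter` is a configured SingleFitter as in the earlier examples, and the two documented return values are assumed to come back as a (summary, records) tuple):

    from k_seq.estimate.bootstrap import Bootstrap

    bs = Bootstrap(fitter, bootstrap_num=500, bs_record_num=50, bs_method='data')
    summary, records = bs.run()
    print(summary)          # per-parameter/metric summary
    print(records.head())   # one bootstrapped fit per row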

k_seq.estimate.replicates

Uncertainty estimation using replicates

class k_seq.estimate.replicates.Replicates(estimator, replicates)

Bases: object

property n_replicates
run()

Perform fitting for replicates

k_seq.estimate.convergence

Module to assess the convergence of fitting, e.g. model identifiability

class k_seq.estimate.convergence.ConvergenceTester(estimator, conv_reps=10, conv_init_range=None, conv_stats=None)

Bases: object

Apply repeated fitting on an Estimator with perturbed initial values to test empirical convergence. The convergence test results are stored separately, as these are tests distinct from the estimation

conv_reps

number of repeated fitting from perturbed initial points for convergence test

Type

int

estimator

estimator for fitting

Type

Estimator

conv_init_range

a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

Type

list of 2-tuple

conv_stats

a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

Type

dict of callable

run()

Run the convergence test and return a summary and full records

__init__(estimator, conv_reps=10, conv_init_range=None, conv_stats=None)

Apply convergence test to given estimator

Parameters
  • estimator (Estimator) – estimator for fitting

  • conv_reps (int) – number of repeated fitting from perturbed initial points for convergence test

  • conv_init_range (list of 2-tuple) – a list of (min, max) 2-tuples with the same length as model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw

  • conv_stats (dict of callable) – a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series

run()

Run convergence test, report a summary and full records

Returns

summary (pd.Series): the mean, sd, and range of each reported parameter, plus the conv_stats results

records (pd.DataFrame): the full records
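
A usage sketch (hedged: `fitter` is a SingleFitter as in the earlier examples; the (summary, records) tuple return and the init ranges are assumptions):

    from k_seq.estimate.convergence import ConvergenceTester

    tester = ConvergenceTester(fitter, conv_reps=10,
                               conv_init_range=[(1e-3, 1e2), (0, 1)])
    summary, records = tester.run()
    print(summary)   # mean, sd, and range of each parameter across repeated fits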

k_seq.estimate.model_ident

Modules for model identifiability analysis

class k_seq.estimate.model_ident.ParamMap(model, sample_n, x_data, save_to, param1_name, param1_range, param2_name, param2_range, param1_log=False, param2_log=False, model_kwargs=None, bootstrap_num=100, bs_record_num=50, bs_method='data', bs_stats=None, grouper=None, record_full=False, conv_reps=20, conv_stats=None, conv_init_range=None, fitting_kwargs=None, seed=23)

Bases: object

Generate a 2-D convergence map for randomly sampled data points from a given parameter range. It simulates sample_n sequence samples with random parameters selected from the ranges param1_range and param2_range, optionally on a log scale
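
A sketch of building an identifiability map (hedged: the parameter names, ranges, error level, and output path are illustrative, and kinetic_model refers to the model defined in the earlier examples):

    from k_seq.estimate.model_ident import ParamMap

    pmap = ParamMap(model=kinetic_model, sample_n=1000, x_data=x, save_to='param_map',
                    param1_name='k', param1_range=(1e-2, 1e4), param1_log=True,
                    param2_name='A', param2_range=(1e-3, 1), param2_log=False,
                    bootstrap_num=100, conv_reps=20, seed=23)
    pmap.simulate_samples(grid=True, rel_err=0.2)  # simulate samples with 20% relative error
    pmap.fit()                                     # batch fit the simulated samples
    pmap.plot_map(gridsize=50)                     # draw the 2-D map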

fit(**kwargs)

Batch fit simulated result

get_metric_values(metric, finite_only=False)

Returns a pd.Series containing the metric values

classmethod load_result(result_path, model=None)
plot_map(metric=None, metric_label=None, scatter=False, gridsize=50, figsize=(5, 5), ax=None, cax_pos=(0.91, 0.58, 0.03, 0.3), **plot_kwargs)
simulate_samples(grid=True, const_err=None, rel_err=None, y_enforce_positive=True)

Simulate a set of samples (param1 and param2)

k_seq.estimate.model_ident.gamma(df)

Get the metric gamma = log10(sigma_k * mu_A / sigma_kA)

k_seq.estimate.model_ident.kendall_log(records)
k_seq.estimate.model_ident.pearson(records)
k_seq.estimate.model_ident.pearson_log(records)
k_seq.estimate.model_ident.remove_nan(df)
k_seq.estimate.model_ident.spearman(records)
k_seq.estimate.model_ident.spearman_log(records)

k_seq.estimate.visualizer