k_seq.estimate: least-squares fitting¶
This module contains the submodules needed for estimation.
k_seq.estimate.least_squares¶
Least-squares fitting to a kinetic model. Several features are included:
- point estimation using scipy.optimize.curve_fit
- option to exclude zeros in fitting
- option to initialize parameter values
- weighted fitting with customized weights
- confidence interval estimation using bootstrap
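The point-estimation step can be sketched directly with scipy. This is an illustrative example, not k_seq code: the kinetic model, data values, and weights below are placeholders chosen only to show how `exclude_zero`-style filtering, `sigma` weighting, bounds, and a random initial guess fit together in a `curve_fit` call.

```python
import numpy as np
from scipy.optimize import curve_fit

def kinetic_model(x, A, k):
    # illustrative first-order model: A * (1 - exp(-k * x))
    return A * (1 - np.exp(-k * x))

x_data = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
y_data = np.array([0.0, 0.31, 0.48, 0.63, 0.69])

# exclude_zero=True would drop zero/missing y values before fitting
mask = y_data > 0
x_fit, y_fit = x_data[mask], y_data[mask]

# sigma supplies per-point weights (larger sigma -> lower weight)
sigma = np.full_like(y_fit, 0.05)

rng = np.random.default_rng(23)
params, pcov = curve_fit(
    kinetic_model, x_fit, y_fit,
    p0=rng.uniform(0, 1, size=2),    # random initial guess in (0, 1)
    sigma=sigma,
    bounds=([0, 0], [1, np.inf]),    # [[lower bounds], [upper bounds]]
    method='trf',                    # the default opt_method
)
A_est, k_est = params
```

`params` and `pcov` correspond to the `params`/`pcov` entries stored in `point_estimation`.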
-
class
k_seq.estimate.least_squares.
FitResults
(estimator, model=None)¶ Bases:
object
A class to store, process, and visualize fitting results for a single estimator; it contains almost all information needed for further analysis
-
estimator
¶ proxy to the estimator
- Type
Estimator
-
model
¶ model used in fitting
- Type
callable
-
data
¶ a scope that stores the fitting data, includes x_data (pd.Series), y_data (pd.Series), sigma (pd.Series), x_label (str): name for x data, and y_label (str): name for y data
- Type
-
point_estimation
¶ a scope that stores point estimation results, includes params (pd.Series): the estimated parameters with extra metric calculations, and pcov (pd.DataFrame): covariance matrix for the estimated parameters
- Type
-
uncertainty
¶ a scope that stores uncertainty estimation results, includes summary (pd.Series): summary of each parameter or metric from records, bs_records (pd.DataFrame): stored bootstrapping records, and rep_results (pd.DataFrame): results from fitting of replicates
- Type
-
convergence
¶ a scope that stores convergence test results, includes summary (pd.DataFrame): summary of each parameter or metric from records, and records (pd.DataFrame): records of repeated fitting results
- Type
-
__init__
(estimator, model=None)¶ - Parameters
estimator (SingleFitter) – estimator used to generate this fitting result
-
classmethod
from_json
(json_path, tarfile=None, gzip=True, estimator=None, data=None)¶ Load fitting results from JSON records, with the option to load from tar.gz files. Note: no estimator info is available if estimator is None
- Parameters
json_path (str) – path to the json file, or the file name inside the tarball if tarfile is not None
tarfile (str) – if not None, the json file is inside a tarfile (.tar/.tar.gz)
gzip (bool) – if True, the tarfile is compressed with gzip (.tar.gz); if False, the tarfile is not compressed (.tar)
estimator (SingleFitter) – optional. Recover the estimator instance.
data (dict) – supplied dictionary to add data info
-
plot_fitting_curves
(model=None, plot_on='bootstrap', subsample=20, x_lim=(-3e-05, 0.002), y_lim=None, x_label=None, y_label=None, legend=False, legend_loc='upper left', fontsize=12, params=None, ax=None, **kwargs)¶ Plot fitting curves for aminoacylation ribozyme fitting results; obj should be a FitResults instance
-
plot_loss_heatmap
(model=None, plot_on='bootstrap', subsample=20, param1_range=(0.01, 10000.0), param2_range=(0.001, 1), legend=False, legend_loc='upper left', colorbar=True, resolution=101, fontsize=12, param_name=None, add_lines=None, line_label=True, ax=None, **kwargs)¶ Plot the 2-D heatmap of the loss function
-
to_json
(path=None)¶ Convert results into a json string/file with the structure {
- point_estimation: { params: jsonfy(pd.Series)
pcov: jsonfy(pd.DataFrame) }
- uncertainty: { summary: jsonfy(pd.Series)
bs_records: jsonfy(pd.DataFrame) rep_results: jsonfy(pd.Series) }
- convergence: { summary: jsonfy(pd.DataFrame)
records: jsonfy(pd.DataFrame) }
- data: {x_data: jsonfy(pd.Series)
y_data: jsonfy(pd.Series), sigma: jsonfy(pd.Series), x_label: str, y_label: str}
}
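The "jsonfy" round trip assumed by the structure above can be sketched with pandas' own JSON serialization. This is a minimal illustration, not the actual k_seq implementation; the parameter names and values are placeholders.

```python
import io
import json
import pandas as pd

# hypothetical point-estimation results
params = pd.Series({'A': 0.72, 'k': 1.05})
pcov = pd.DataFrame([[1e-4, 2e-6], [2e-6, 3e-3]],
                    index=['A', 'k'], columns=['A', 'k'])

# nest the pandas json strings inside a plain dict, then dump once
record = {
    'point_estimation': {'params': params.to_json(), 'pcov': pcov.to_json()},
    'data': {'x_label': 'x', 'y_label': 'y'},
}
text = json.dumps(record)

# recover the nested pandas objects from the json string
loaded = json.loads(text)
params_back = pd.read_json(io.StringIO(loaded['point_estimation']['params']),
                           typ='series')
pcov_back = pd.read_json(io.StringIO(loaded['point_estimation']['pcov']))
```

Storing each pandas object as its own JSON string keeps the outer record a plain, portable JSON document.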
-
to_series
()¶ Convert point_estimation, uncertainty (if possible), and convergence (if possible) to a series with flattened info: e.g., columns will include [param1, param2, bs_param1_mean, bs_param1_std, bs_param1_2.5%, …, param1_range]
-
-
class
k_seq.estimate.least_squares.
SingleFitter
(x_data, y_data, model, name=None, x_label=None, y_label=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, save_to=None, overwrite=False, verbose=1)¶ Bases:
k_seq.estimate._estimator.Estimator
Use scipy.optimize.curve_fit to fit a model for a sequence's time/concentration series. It can conduct point estimation, uncertainty estimation by bootstrap, and empirical CI estimation
-
x_data
¶ list of x values in fitting
- Type
list
-
y_data
¶ y values in fitting
- Type
list, pd.Series
-
model
¶ model to fit
- Type
callable
-
parameter
¶
-
silent
¶
-
name
¶ Optional. Estimator’s name
- Type
str
-
results
¶ proxy to the FitResults object
- Type
FitResults
-
opt_method
¶ Optimization method in scipy.optimize. Default ‘trf’
- Type
str
-
exclude_zero
¶ Whether to exclude zero/missing data in fitting. Default False.
- Type
bool
-
init_guess
¶ Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None
- Type
list of float or generator
-
rnd_seed
¶ random seed used in fitting for reproducibility
- Type
int
-
sigma
¶ Optional, same size as y_data/y_dataframe. Sigma (variance) for data points, for weighted fitting
- Type
list, pd.Series, or pd.DataFrame
-
bounds
¶ Optional, [[lower bounds], [higher bounds]] for each parameter
- Type
2 by m list
-
metric
¶
-
curve_fit_kwargs
¶ other keyword parameters to pass to scipy.optimize.curve_fit
- Type
dict
-
bootstrap_num
¶ Number of bootstrap samples to perform; 0 means no bootstrap
- Type
int
-
bs_record_num
¶ Number of bootstrap results to store. A negative number means store all results, which is not recommended due to memory consumption
- Type
int
-
bs_method
¶ Bootstrap method, choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)
- Type
str
-
__init__
(x_data, y_data, model, name=None, x_label=None, y_label=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, save_to=None, overwrite=False, verbose=1)¶ Initialize a SingleFitter instance
- Parameters
x_data (list) – list of x values in fitting
y_data (list, pd.Series) – y values in fitting
model (callable) – model to fit
name (str) – Optional. Estimator’s name
x_label (str) – name of x data
y_label (str) – name of y data
sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) for data points, for weighted fitting
bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter
init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None
opt_method (str) – Optimization method in scipy.optimize. Default ‘trf’
exclude_zero (bool) – Whether to exclude zero/missing data in fitting. Default False.
metrics (dict of callable) – Optional. Extra metrics/parameters to calculate for each estimation
rnd_seed (int) – random seed used in fitting for reproducibility
curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit
replicates (list of list) – list of lists of sample names for each replicate
bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap
bs_record_num (int) – Number of bootstrap results to store. A negative number means store all results, which is not recommended due to memory consumption
bs_method (str) – Bootstrap method, choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)
bs_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
grouper (dict or Grouper) – indicates the grouping of samples
record_full (bool) – whether to record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.
conv_reps (int) – number of repeated fittings from perturbed initial points for the convergence test
conv_init_range (list of 2-tuple) – a list of (min, max) ranges with the same length as the model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw
conv_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
save_to (str) – optional. If not None, save results to the given path when fitting finishes
overwrite (bool) – if True, overwrite the save_to file if it already exists; if False, read results and skip estimation. Default False
verbose (0, 1, 2) – set the verbosity level. 0: WARNING, 1: INFO, 2: DEBUG
-
convergence_test
(**kwargs)¶ Empirically estimate convergence by repeated fittings
- Parameters
conv_reps (int) – number of repeated fittings from perturbed initial points for the convergence test
conv_init_range (list of 2-tuple) – a list of (min, max) ranges with the same length as the model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw
conv_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
- Returns
summary: pd.Series
records: pd.DataFrame of full records
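The convergence test described above can be sketched as repeated refits from random initial points, summarizing the spread of the estimates. This is an assumed, simplified version of the logic; the model and data are illustrative placeholders.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def model(x, A, k):
    # illustrative first-order model
    return A * (1 - np.exp(-k * x))

x = np.array([0.5, 1.0, 2.0, 4.0])
y = np.array([0.31, 0.48, 0.63, 0.69])

rng = np.random.default_rng(7)
records = []
for _ in range(10):                      # conv_reps repeated fittings
    p0 = rng.uniform(0, 1, size=2)       # conv_init_range defaults to (0, 1)
    popt, _ = curve_fit(model, x, y, p0=p0,
                        bounds=([0, 0], [1, np.inf]), method='trf')
    records.append(popt)

records = pd.DataFrame(records, columns=['A', 'k'])
summary = records.agg(['mean', 'std'])   # small std suggests convergence
```

If the standard deviation across repeats is near zero, the optimizer reaches the same solution from different starting points, i.e. the fit has converged.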
-
fit
(point_estimate=True, replicates=False, bootstrap=False, convergence_test=False, **kwargs)¶ Run fitting; configuration is taken from the object
- Parameters
point_estimate (bool) – whether to do point estimation, default True
replicates (bool) – whether to use replicates for uncertainty estimation, default False
bootstrap (bool) – whether to do bootstrap, default False
convergence_test (bool) – whether to do a convergence test, default False
-
classmethod
from_json
(file_path, model)¶ Create an estimator from a saved json file
- Parameters
file_path (str) – path to saved json file
model (callable) – since a callable is not json-able, it needs to be reassigned
Notes
bs_stats and conv_stats currently cannot be recovered
-
point_estimate
(**kwargs)¶ Fit using scipy.optimize.curve_fit. Arguments are inferred from the instance’s attributes if not provided
- Parameters
model (callable) – model to fit
x_data (list) – list of x values in fitting
y_data (list, pd.Series) – y values in fitting
sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) for data points, for weighted fitting
bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter
metrics (dict of callable) – Optional. Extra metrics/parameters to calculate for each estimation
init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None
curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit
- Returns: A dictionary containing least-squares fitting results
params: pd.Series of estimated parameters
pcov: pd.DataFrame of the covariance matrix
metrics: None or pd.Series of calculated metrics
-
run_bootstrap
(bs_record_num=None, **kwargs)¶ Use bootstrap to estimate uncertainty
- Parameters
bs_method (str) – Bootstrap method, choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)
bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap
grouper (dict or Grouper) – indicates the grouping of samples
bs_record_num (int) – Number of bootstrap results to store. A negative number means store all results, which is not recommended due to memory consumption
bs_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
record_full (bool) – whether to record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.
- Returns
summary: pd.Series
results: pd.DataFrame, subsampled if 0 <= bs_record_num <= bootstrap_num
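One common reading of the ‘pct_res’ scheme is sketched below: resample the percent residuals around the point-estimate curve, rebuild synthetic samples, and refit. This is an assumption about the resampling form, not k_seq’s exact implementation, and the model and data are illustrative placeholders.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def model(x, A, k):
    # illustrative first-order model
    return A * (1 - np.exp(-k * x))

x = np.array([0.5, 1.0, 2.0, 4.0])
y = np.array([0.31, 0.48, 0.63, 0.69])

# point estimate first
popt, _ = curve_fit(model, x, y, p0=[0.5, 0.5],
                    bounds=([0, 0], [1, np.inf]), method='trf')
y_hat = model(x, *popt)
pct_res = (y - y_hat) / y_hat            # percent residuals

rng = np.random.default_rng(0)
boot = []
for _ in range(50):                      # bootstrap_num
    resampled = rng.choice(pct_res, size=len(x), replace=True)
    y_boot = y_hat * (1 + resampled)     # synthetic sample around the fit
    p_boot, _ = curve_fit(model, x, y_boot, p0=popt,
                          bounds=([0, 0], [1, np.inf]), method='trf')
    boot.append(p_boot)

boot = pd.DataFrame(boot, columns=['A', 'k'])
ci = boot.quantile([0.025, 0.5, 0.975])  # empirical 95% CI per parameter
```

The quantiles of the bootstrapped parameter table give the empirical confidence intervals reported in the summary.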
-
run_replicates
(replicates=None)¶ Use replicates to estimate uncertainty
- Parameters
replicates (list of list) – list of lists of sample names for each replicate
- Returns
a pd.DataFrame of results from each replicate
-
summary
()¶ Return a pd.Series as a fitting summary with flattened info
-
to_dict
()¶ Save estimator configuration as a dictionary
- Returns
Dict of configurations for the estimator
-
to_json
(save_to_file=None)¶ Save the estimator configuration as a json file, except for model, bs_stats, and conv_stats, as these are not json-able
-
k_seq.estimate.least_squares_batch¶
Least-squares fitting for a batch of sequences
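At its core, batch fitting applies the same least-squares fit to each sequence (row) of a y_dataframe and collects a per-sequence summary table. The sketch below illustrates that loop; the model, sequence names, and data are hypothetical placeholders, not k_seq’s actual implementation.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def model(x, A, k):
    # illustrative first-order model
    return A * (1 - np.exp(-k * x))

x_data = np.array([0.5, 1.0, 2.0, 4.0])
# rows are sequences, columns are x conditions
y_dataframe = pd.DataFrame(
    [[0.31, 0.48, 0.63, 0.69],
     [0.10, 0.18, 0.30, 0.38]],
    index=['seq_1', 'seq_2'], columns=x_data,
)

rows = {}
for seq, y in y_dataframe.iterrows():    # one fit per sequence (row)
    popt, _ = curve_fit(model, x_data, y.to_numpy(), p0=[0.5, 0.5],
                        bounds=([0, 0], [1, np.inf]), method='trf')
    rows[seq] = {'A': popt[0], 'k': popt[1]}

summary = pd.DataFrame.from_dict(rows, orient='index')
```

The resulting summary table, indexed by sequence, mirrors the BatchFitResults.summary attribute described below.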
-
class
k_seq.estimate.least_squares_batch.
BatchFitResults
(estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)¶ Bases:
object
Parse, store, and visualize BatchFitter results. Only results are saved (separate from each estimator); the corresponding estimator should be found by sequence. Two data storage strategies are used:
- smaller datasets are saved as results.pkl: the pickled file is parsed, and the results are loaded to self.summary, self.bs_record, and self.conv_record
- larger datasets are saved in a results/ folder: self.summary is loaded, while self.bs_record and self.conv_record are linked
-
estimator
¶ proxy to the BatchFitter
-
model
¶ model function
- Type
callable
-
data
¶ contains data information: x_data (list-like): x data for fitting; y_dataframe (pd.DataFrame): a table of y data for sequences; sigma (pd.DataFrame): a table of sigma values for each sequence in fitting
- Type
-
large_data
¶ if True, it will not load all bootstrap or convergence results
- Type
bool
-
summary
¶ summarized results with each sequence as index
- Type
pd.DataFrame
-
bs_record
()¶ get bootstrap results {seq: SingleFitter.results.uncertainty.records}
-
summary_to_csv
()¶ export summary dataframe as csv file
-
to_json
()¶ storage strategy for large files: save results as a folder of json files
-
to_pickle
()¶ storage strategy for small files: save results as pickled dictionary
-
from_pickle
()¶ load from a pickled dictionary
-
from_json
()¶ load from a folder of json files
-
load_result
()¶ overall method to load BatchFitResults from either a pickled file or a folder
Analysis:
-
__init__
(estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)¶ Init a BatchFitResults instance
- Parameters
estimator (BatchFitter) – corresponding estimator
-
bs_record
(seqs=None)¶ Retrieve bootstrap records
-
conv_record
(seqs=None)¶ Retrieve convergence records
-
classmethod
from_json
(path_to_folder, estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)¶ Load results from folder of results with json format
-
classmethod
from_pickle
(path_to_pickle, estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)¶ Create a BatchFitResults instance with results loaded from pickle. Notice:
this could take a very long time if the pickled file is large; to_json is suggested for large datasets
-
static
generate_summary
(result_folder_path, n_core=1, save_to=None)¶ Generate a summary csv file from a given result folder. This can be used if the summary was not successfully generated during fitting
- Result folder should have a structure of:
seqs/
[seq name or hash].json
[if hash] seq_to_hash.json
- Parameters
result_folder_path (str) – path to the root of results folder
n_core (int) – number of threads to process in parallel. Default 1
save_to (str) – save CSV file to local path
- Returns
pd.DataFrame of summary
-
get_FitResult
(seq=None)¶ Get FitResults from a JSON file
-
classmethod
load_result
(result_path, estimator=None)¶
-
summary_to_csv
(path)¶ Save summary table as csv file
-
to_json
(output_dir)¶ -
Notes
Bootstrap and convergence records should already be streamed as separate JSON files under /seqs/
- Parameters
output_dir (str) – path of folder to save results
-
to_pickle
(output_dir, bs_record=True, conv_record=True)¶ Save fitting results as a single pickled dict, suitable for small datasets; for large datasets, to_json is preferred
- Parameters
output_dir (str) – path to saved results; should have the suffix .pkl
bs_record (bool) – if output bs_record, default True
conv_record (bool) – if output conv_record, default True
-
class
k_seq.estimate.least_squares_batch.
BatchFitter
(y_dataframe, x_data, model, x_label=None, y_label=None, seq_to_fit=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, note=None, large_dataset=None, verbose=1, save_to=None)¶ Bases:
k_seq.estimate._estimator.Estimator
Least-squares fitting for batch of sequences
-
y_dataframe
¶ Table of y values for sequences (rows) to fit kinetic models
- Type
pd.DataFrame
-
model
¶ model to fit
- Type
callable
-
x_data
¶ list of x values in fitting
- Type
list
-
seq_to_fit
¶ pick top n sequences in the table for fitting or only fit selected sequences
- Type
int or list of seq
-
sigma
¶ Optional, same size as y_data/y_dataframe. Sigma (variance) for data points, for weighted fitting
- Type
list, pd.Series, or pd.DataFrame
-
note
¶ note about this fitting job
- Type
str
-
results
¶ accessor to fitting results
- Type
BatchFitResult
-
fit_params
¶ collection of arguments passed to each single-sequence fitting, includes:
x_data (list): list of x values in fitting
model (callable): model to fit
bounds (2 by m list): Optional, [[lower bounds], [higher bounds]] for each parameter
init_guess (list of float or generator): Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None
opt_method (str): Optimization method in scipy.optimize. Default ‘trf’
exclude_zero (bool): Whether to exclude zero/missing data in fitting. Default False.
metrics (dict of callable): Optional. Extra metrics/parameters to calculate for each estimation
rnd_seed (int): random seed used in fitting for reproducibility
curve_fit_kwargs (dict): other keyword parameters to pass to scipy.optimize.curve_fit
replicates (list of list): list of lists of sample names for each replicate
bootstrap_num (int): Number of bootstrap samples to perform; 0 means no bootstrap
bs_record_num (int): Number of bootstrap results to store. A negative number means store all results, which is not recommended due to memory consumption
bs_method (str): Bootstrap method, choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)
bs_stats (dict of callable): a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
grouper (dict or Grouper): indicates the grouping of samples
record_full (bool): whether to record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.
conv_reps (int): number of repeated fittings from perturbed initial points for the convergence test
conv_init_range (list of 2-tuple): a list of (min, max) ranges with the same length as the model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw
conv_stats (dict of callable): a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
overwrite:
- Type
-
__init__
(y_dataframe, x_data, model, x_label=None, y_label=None, seq_to_fit=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, note=None, large_dataset=None, verbose=1, save_to=None)¶ Initialize a BatchFitter
- Parameters
y_dataframe (pd.DataFrame) – Table of y values for sequences (rows) to fit kinetic models
x_data (list) – list of x values in fitting
model (callable) – model to fit
seq_to_fit (int or list of seq) – pick top n sequences in the table for fitting or only fit selected sequences
sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_data/y_dataframe. Sigma (variance) for data points, for weighted fitting
bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter
init_guess (list of float or generator) – Initial guesses for the estimated parameters; random values from 0 to 1 will be used if None
opt_method (str) – Optimization method in scipy.optimize. Default ‘trf’
exclude_zero (bool) – Whether to exclude zero/missing data in fitting. Default False.
metrics (dict of callable) – Optional. Extra metrics/parameters to calculate for each estimation
rnd_seed (int) – random seed used in fitting for reproducibility
curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit
bootstrap_num (int) – Number of bootstrap samples to perform; 0 means no bootstrap
bs_record_num (int) – Number of bootstrap results to store. A negative number means store all results, which is not recommended due to memory consumption
bs_method (str) – Bootstrap method, choose from ‘pct_res’ (resample percent residuals), ‘data’ (resample data), or ‘stratified’ (resample within replicates)
bs_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
grouper (dict or Grouper) – indicates the grouping of samples
record_full (bool) – whether to record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.
conv_reps (int) – number of repeated fittings from perturbed initial points for the convergence test
conv_init_range (list of 2-tuple) – a list of (min, max) ranges with the same length as the model parameters. If None, all parameters are initialized from (0, 1) with a random uniform draw
conv_stats (dict of callable) – a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
note (str) – optional notes for the estimator
large_dataset (bool) – whether to trigger the strategy for working on a large dataset (e.g. > 1000 seqs). If True, sequences with the same reacted fractions in each concentration are deduplicated and results are streamed to the hard drive
save_to (str) – if not None, load results from the path
-
fit
(parallel_cores=1, point_estimate=True, replicates=False, bootstrap=False, convergence_test=False, overwrite=False)¶ Run the estimation
- Parameters
parallel_cores (int) – number of parallel cores to use. Default 1
point_estimate (bool) – whether to perform point estimation, default True
bootstrap (bool) – whether to perform bootstrap uncertainty estimation, default False
replicates (bool) – whether to perform replicate-based uncertainty estimation, default False
convergence_test (bool) – whether to perform a convergence test, default False
stream_to – directly stream fitting results to disk if an output path is given; will create a folder named by seq/hash with a pickled dict of fitting results
overwrite (bool) – whether to overwrite existing results when streaming to disk. Default False.
-
classmethod
load_model
(model_path, y_dataframe=None, sigma=None, result_path=None)¶ Create a model from pickled config file
- Parameters
model_path (str) – path to the pickled model configuration file or the saved folder
y_dataframe (pd.DataFrame or str) – y_data table for fitting
sigma (pd.DataFrame or str) – optional sigma table for fitting
result_path (str) – path to fitting results
- Returns
a BatchFitter instance
-
save_model
(output_dir, results=True, bs_record=True, conv_record=True, tables=True)¶ Save the model to a given directory; model_config is saved as a pickled dictionary to recover the model, except for y_dataframe and sigma, which are too large
- Parameters
output_dir (str) – path to save the model, create if the path does not exist
results (bool) – whether to also save estimation results, to be loaded by BatchFitResults. Default True
bs_record (bool) – if save bootstrap records, default True
conv_record (bool) – if save convergence records, default True
tables (bool) – if save tables (y_dataframe, sigma) in the folder. Default True
-
save_results
(result_path, bs_record=True, conv_record=True)¶ Save results to disk as JSON or pickle. JSON is preferred for speed, readability, compatibility, and security
-
property
save_to
¶
-
summary
(save_to=None)¶
-
k_seq.estimate.least_squares_batch¶
Least-squares fitting for a batch of sequences
-
class
k_seq.estimate.least_squares_batch.
BatchFitResults
(estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)¶ Bases:
object
Parse, store, and visualize BatchFitter results Only save results (separate from each estimator), corresponding estimator should be found by sequence We used two data storage strategies:
- smaller dataset that was saved as
results.pkl
: the pickled file is passed, and the results will be loaded to self.summary, self.bs_record, self.conv_record
- smaller dataset that was saved as
- larger dataset that was saved in
results/
folder: self. summary will be loaded, self.bs_record and self.conv_record will be linked
- larger dataset that was saved in
-
estimator
¶ proxy to the BatchFitter
-
model
¶ model function
- Type
callable
-
data
¶ contains data information: x_data (list-like): x_data for fitting y_dataframe (pd.DataFrame): a table of y_data for sequences simga (pd.DataFrame): a table representing each sigma for sequence in fitting
- Type
-
large_data
¶ if True, it will not load all bootstrap or convergence results
- Type
bool
-
summary
¶ summarized results with each sequence as index
- Type
pd.DataFrame
-
bs_record
()¶ get bootstrap results {seq: SingleFitter.results.uncertainty.records}
-
summary_to_csv
()¶ export summary dataframe as csv file
-
to_json
()¶ storage strategy for large files: save results as a folder of json files
-
to_pickle
()¶ storage strategy for small files: save results as pickled dictionary
-
from_pickle
()¶ load from a picked dictionary
-
from_json
()¶ load from a folder of json files
-
load_result
()¶ overall method to infer either load BatchFitResults from pickled or a folder
Analysis:
-
__init__
(estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)¶ Init a BatchFitResults instance :param estimator: corresponding estimator :type estimator: BatchFitter
-
bs_record
(seqs=None)¶ Retrieve bootstrap records
-
conv_record
(seqs=None)¶ Retrieve convergence records
-
classmethod
from_json
(path_to_folder, estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)¶ Load results from folder of results with json format
-
classmethod
from_pickle
(path_to_pickle, estimator=None, model=None, x_data=None, y_dataframe=None, sigma=None)¶ Create a BatchFitResults instance with results loaded from pickle Notice:
this could take a very long time if the pickled file is large, suggest to use to_json for large dataset
-
static
generate_summary
(result_folder_path, n_core=1, save_to=None)¶ Generate a summary csv file from given result folder. This could be used if summary was not successfully generated during fitting
- Result folder should have a structure of:
seqs - [seq name or hash].json - [if hash] seq_to_hash.json
- Parameters
result_folder_path (str) – path to the root of results folder
n_core (int) – number of threads to process in parallel. Default 1
save_to (str) – save CSV file to local path
- Returns
pd.DataFrame of summary
-
get_FitResult
(seq=None)¶ Get FitResults from a JSON file
-
classmethod
load_result
(result_path, estimator=None)¶
-
summary_to_csv
(path)¶ Save summary table as csv file
-
to_json
(output_dir)¶ -
Notes
Bootstrap and convergence records should already be streamed as separate JSON files under /seqs/
- Parameters
output_dir (str) – path of folder to save results
-
to_pickle
(output_dir, bs_record=True, conv_record=True)¶ Save fitting results as a single pickled dict, suitable for small dataset. For large dataset to_json is preferred
- Parameters
output_dir (str) – path to saved results, should have suffix of
.pkl
bs_record (bool) – if output bs_record, default True
conv_record (bool) – if output conv_record, default True
-
class
k_seq.estimate.least_squares_batch.
BatchFitter
(y_dataframe, x_data, model, x_label=None, y_label=None, seq_to_fit=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, note=None, large_dataset=None, verbose=1, save_to=None)¶ Bases:
k_seq.estimate._estimator.Estimator
Least-squares fitting for batch of sequences
-
y_dataframe
¶ Table of y values for sequences (rows) to fit kinetic models
- Type
pd.DataFrame
-
model
¶ model to fit
- Type
callable
-
x_data
¶ list of x values in fitting
- Type
list
-
seq_to_fit
¶ pick top n sequences in the table for fitting or only fit selected sequences
- Type
int or list of seq
-
sigma
¶ Optional, same size as y_data/y_dataframe.Sigma (variance) for data points for weighted fitting
- Type
list, pd.Series, or pd.DataFrame
-
note
¶ note about this fitting job
- Type
str
-
results
¶ accessor to fitting results
- Type
BatchFitResult
-
fit_params
¶ collection of arguments pass to each single seq fitting, includes: x_data (list): list of x values in fitting model (callable): model to fit bounds (2 by m list ): Optional, [[lower bounds], [higher bounds]] for each parameter init_guess (list of float or generator): Initial guess estimate parameters, random value from 0 to 1 will be use if None opt_method (str): Optimization methods in scipy.optimize. Default ‘trf’ exclude_zero (bool): If exclude zero/missing data in fitting. Default False. metrics (dict of callable): Optional. Extra metric/parameters to calculate for each estimation rnd_seed (int): random seed used in fitting for reproducibility curve_fit_kwargs (dict): other keyword parameters to pass to scipy.optimize.curve_fit replicates (list of list): List of list of sample names for each replicates bootstrap_num (int): Number of bootstrap to perform, 0 means no bootstrap bs_record_num (int): Number of bootstrap results to store. Negative number means store all results.Not recommended due to memory consumption bs_method (str): Bootstrap method, choose from ‘pct_res’ (resample percent residue),’data’ (resample data), or ‘stratified’ (resample within replicates) bs_stats (dict of callable): a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series grouper (dict or Grouper): Indicate the grouping of samples record_full (bool): if record the x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False. conv_reps (int): number of repeated fitting from perturbed initial points for convergence test conv_init_range (list of 2-tuple): a list of two tuple range (min, max) with same length as model parameters. 
If None, all parameters are initialized from (0, 1) with random uniform draw conv_stats (dict of callable): a dict of stats functions to input the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series overwrite:
- Type
-
__init__
(y_dataframe, x_data, model, x_label=None, y_label=None, seq_to_fit=None, sigma=None, bounds=None, init_guess=None, opt_method='trf', exclude_zero=False, metrics=None, rnd_seed=None, curve_fit_kwargs=None, replicates=None, bootstrap_num=0, bs_record_num=0, bs_method='pct_res', bs_stats=None, grouper=None, record_full=False, conv_reps=0, conv_init_range=None, conv_stats=None, note=None, large_dataset=None, verbose=1, save_to=None)¶ Initialize a BatchFitter
- Parameters
y_dataframe (pd.DataFrame) – Table of y values for sequences (rows) to fit kinetic models
x_data (list) – list of x values in fitting
model (callable) – model to fit
seq_to_fit (int or list of seq) – fit only the top n sequences in the table, or only the selected sequences
sigma (list, pd.Series, or pd.DataFrame) – Optional, same size as y_dataframe. Sigma (variance) for data points, used for weighted fitting
bounds (2 by m list) – Optional, [[lower bounds], [higher bounds]] for each parameter
init_guess (list of float or generator) – initial guesses for parameters; random values from 0 to 1 will be used if None
opt_method (str) – optimization method in scipy.optimize. Default 'trf'
exclude_zero (bool) – whether to exclude zero/missing data points in fitting. Default False.
metrics (dict of callable) – Optional. Extra metric/parameters to calculate for each estimation
rnd_seed (int) – random seed used in fitting for reproducibility
curve_fit_kwargs (dict) – other keyword parameters to pass to scipy.optimize.curve_fit
replicates (list of list) – list of lists of sample names for each replicate
bootstrap_num (int) – number of bootstrap samples to perform; 0 means no bootstrap
bs_record_num (int) – number of bootstrap results to store; a negative number means store all results (not recommended due to memory consumption)
bs_method (str) – bootstrap method, choose from 'pct_res' (resample percent residuals), 'data' (resample data), or 'stratified' (resample within replicates)
bs_stats (dict of callable) – stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
grouper (dict or Grouper) – indicates the grouping of samples
record_full (bool) – whether to record x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.
conv_reps (int) – number of repeated fittings from perturbed initial points for the convergence test
conv_init_range (list of 2-tuple) – a list of (min, max) tuples with the same length as the model parameters. If None, all parameters are initialized with uniform random draws from (0, 1)
conv_stats (dict of callable) – stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
note (str) – optional notes for the estimator
large_dataset (bool) – whether to trigger the strategy for large datasets (e.g. > 1000 seqs). If True, sequences with identical reacted fractions at each concentration are deduplicated, and results are streamed to the hard drive
save_to (str) – if not None, load results from the path
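The per-sequence point estimation described above can be sketched directly with scipy.optimize.curve_fit. The kinetic model and parameter names (k, A) below are illustrative placeholders, not k_seq's exact model; the sketch only shows how init_guess, bounds, opt_method, and sigma map onto curve_fit arguments:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative first-order kinetic model (fraction reacted vs. x);
# parameter names (k, A) are placeholders, not k_seq's own.
def model(x, k, A):
    return A * (1 - np.exp(-k * x))

x_data = np.array([0.25, 1.25, 6.25, 31.25])
rng = np.random.default_rng(23)  # rnd_seed for reproducibility
y_data = model(x_data, 0.5, 0.8) + rng.normal(0, 0.01, size=x_data.size)

params, pcov = curve_fit(
    model, x_data, y_data,
    p0=rng.uniform(0, 1, size=2),      # init_guess: random in (0, 1) if None
    bounds=([0, 0], [np.inf, 1]),      # [[lower bounds], [upper bounds]]
    method='trf',                      # opt_method default
    sigma=None,                        # per-point sigma enables weighted fitting
)
print(params)  # point estimates for (k, A); pcov is their covariance matrix
```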
-
fit
(parallel_cores=1, point_estimate=True, replicates=False, bootstrap=False, convergence_test=False, overwrite=False)¶ Run the estimation
- Parameters
parallel_cores (int) – number of parallel cores to use. Default 1
point_estimate (bool) – whether to perform point estimation. Default True
bootstrap (bool) – whether to perform bootstrap uncertainty estimation. Default False
replicates (bool) – whether to use replicates for uncertainty estimation. Default False
convergence_test (bool) – whether to perform the convergence test. Default False
stream_to – if an output path is given, stream fitting results directly to disk; a folder named by seq/hash is created containing a pickled dict of fitting results
overwrite (bool) – whether to overwrite existing results when streaming to disk. Default False.
-
classmethod
load_model
(model_path, y_dataframe=None, sigma=None, result_path=None)¶ Create a model from pickled config file
- Parameters
model_path (str) – path to the pickled model configuration file or the saved folder
y_dataframe (pd.DataFrame or str) – y_data table for fitting
sigma (pd.DataFrame or str) – optional sigma table for fitting
result_path (str) – path to fitting results
- Returns
a BatchFitter instance
-
save_model
(output_dir, results=True, bs_record=True, conv_record=True, tables=True)¶ Save the model to a given directory. model_config will be saved as a pickled dictionary to recover the model, except for y_dataframe and sigma, which are too large
- Parameters
output_dir (str) – path to save the model, create if the path does not exist
results (bool) – whether to also save estimation results, to be loaded by BatchFitResults. Default True
bs_record (bool) – if save bootstrap records, default True
conv_record (bool) – if save convergence records, default True
tables (bool) – if save tables (y_dataframe, sigma) in the folder. Default True
-
save_results
(result_path, bs_record=True, conv_record=True)¶ Save results to disk as JSON or pickle. JSON is preferred for speed, readability, compatibility, and security
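The preference for JSON over pickle can be shown with a minimal round-trip; the results payload below is a hypothetical schema for illustration, not k_seq's actual output format:

```python
import json

# Hypothetical results payload; field names are illustrative, not k_seq's schema.
results = {"seq_1": {"k": 0.52, "A": 0.79}, "seq_2": {"k": 1.31, "A": 0.65}}

# JSON: human-readable, portable across versions, and safe to load
# from untrusted sources.
text = json.dumps(results, indent=2)
restored = json.loads(text)
assert restored == results

# pickle.load, by contrast, can execute arbitrary code embedded in a
# malicious file, which is why JSON is the preferred on-disk format.
```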
-
property
save_to
¶
-
summary
(save_to=None)¶
-
k_seq.estimate.bootstrap¶
Uncertainty estimation through bootstrap
-
class
k_seq.estimate.bootstrap.
Bootstrap
(estimator, bootstrap_num, bs_record_num, bs_method, grouper=None, bs_stats=None, record_full=False)¶ Bases:
object
Perform bootstrap for fitting uncertainty estimation
- Three types of bootstrap are supported:
- pct_res: resample the percent residuals, based on the assumption that variance is proportional to the mean (from the data property)
data: directly resample data points
stratified: resample within groups; grouper is required
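The 'pct_res' scheme can be sketched in plain numpy/scipy, assuming an illustrative kinetic model (not k_seq's own): fit once, resample the percent residuals with replacement, rebuild synthetic samples, and refit each one:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, k, A):  # illustrative model, not k_seq's own
    return A * (1 - np.exp(-k * x))

rng = np.random.default_rng(0)
x = np.array([0.25, 1.25, 6.25, 31.25])
y = model(x, 0.5, 0.8) * (1 + rng.normal(0, 0.05, x.size))  # noise ~ mean

# Initial fit gives the fitted curve and the percent residuals.
params, _ = curve_fit(model, x, y, p0=[1, 1], bounds=(0, [np.inf, 1]))
y_hat = model(x, *params)
pct_res = (y - y_hat) / y_hat

records = []
for _ in range(200):  # bootstrap_num
    res_bs = rng.choice(pct_res, size=x.size, replace=True)
    y_bs = y_hat * (1 + res_bs)  # rebuild a synthetic sample
    p_bs, _ = curve_fit(model, x, y_bs, p0=params, bounds=(0, [np.inf, 1]))
    records.append(p_bs)

records = np.array(records)
summary = records.std(axis=0)  # bootstrap SD of each parameter
print(summary)
```

The 'data' method would instead resample (x, y) pairs jointly, and 'stratified' would apply the same resampling within each group defined by grouper.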
-
estimator
¶ accessor to the associated estimator
- Type
EstimatorBase type
-
bs_method
¶ Bootstrap method, choose from 'pct_res' (resample percent residuals), 'data' (resample data), or 'stratified' (resample within replicates)
- Type
str
-
bootstrap_num
¶ Number of bootstrap samples to perform; 0 means no bootstrap
- Type
int
-
bs_record_num
¶ Number of bootstrap results to store. A negative number means store all results (not recommended due to memory consumption)
- Type
int
-
bs_stats
¶ a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
- Type
dict of callable
-
record_full
¶ whether to record x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.
- Type
bool
-
__init__
(estimator, bootstrap_num, bs_record_num, bs_method, grouper=None, bs_stats=None, record_full=False)¶ - Parameters
estimator (Estimator) – the estimator that generates the results
bootstrap_num (int) – number of bootstrap samples to perform; 0 means no bootstrap
bs_record_num (int) – number of bootstrap results to store; a negative number means store all results (not recommended due to memory consumption)
bs_method (str) – bootstrap method, choose from 'pct_res' (resample percent residuals), 'data' (resample data), or 'stratified' (resample within replicates)
bs_stats (dict of callable) – stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
grouper (dict or Grouper) – indicates the grouping of samples
record_full (bool) – whether to record x_value and y_value for each bootstrapped sample; if False, only parameters and metrics are recorded. Default False.
-
property
bs_method
¶
-
run
()¶ Perform bootstrap with arguments indicated in instance attributes
- Returns
summary (pd.Series) – summarized results for each parameter and metric from the bootstrap records
records (pd.DataFrame) – records of bootstrapped results; each row is one bootstrap result
k_seq.estimate.replicates¶
Uncertainty estimation using replicates
k_seq.estimate.convergence¶
Module to assess the convergence of fitting, e.g. model identifiability
-
class
k_seq.estimate.convergence.
ConvergenceTester
(estimator, conv_reps=10, conv_init_range=None, conv_stats=None)¶ Bases:
object
Apply repeated fitting on an Estimator with perturbed initial values to test empirical convergence. Convergence test results are stored separately, as these tests are distinct from the estimation itself
-
conv_reps
¶ number of repeated fittings from perturbed initial points for the convergence test
- Type
int
-
estimator
¶ estimator for fitting
- Type
Estimator
-
conv_init_range
¶ a list of (min, max) tuples with the same length as the model parameters. If None, all parameters are initialized with uniform random draws from (0, 1)
- Type
list of 2-tuple
-
conv_stats
¶ a dict of stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
- Type
dict of callable
-
run
()¶ Run the convergence test and return a summary and full records
-
__init__
(estimator, conv_reps=10, conv_init_range=None, conv_stats=None)¶ Apply a convergence test to the given estimator
- Parameters
estimator (Estimator) – estimator for fitting
conv_reps (int) – number of repeated fittings from perturbed initial points for the convergence test
conv_init_range (list of 2-tuple) – a list of (min, max) tuples with the same length as the model parameters. If None, all parameters are initialized with uniform random draws from (0, 1)
conv_stats (dict of callable) – stats functions that take the full record table (pd.DataFrame with parameters and metrics as columns) and return a single value, dict, or pd.Series
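The procedure can be sketched with scipy.optimize.curve_fit, using an illustrative model and init range (not k_seq's own): refit repeatedly from randomly drawn initial points and inspect the spread of the resulting estimates:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, k, A):  # illustrative model, not k_seq's own
    return A * (1 - np.exp(-k * x))

rng = np.random.default_rng(23)
x = np.array([0.25, 1.25, 6.25, 31.25])
y = model(x, 0.5, 0.8) + rng.normal(0, 0.01, x.size)

conv_reps = 10
conv_init_range = [(0, 1), (0, 1)]  # default: uniform draws from (0, 1)

records = []
for _ in range(conv_reps):
    # Perturbed initial point for this repeat.
    p0 = [rng.uniform(lo, hi) for lo, hi in conv_init_range]
    p_fit, _ = curve_fit(model, x, y, p0=p0, bounds=(0, np.inf))
    records.append(p_fit)

records = np.array(records)
# If fitting converges to a single optimum, the spread across repeats is tiny;
# a large spread suggests local optima or a poorly identified model.
spread = records.max(axis=0) - records.min(axis=0)
print(spread)
```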
-
run
()¶ Run convergence test, report a summary and full records
- Returns
summary – a pd.Series containing the mean, sd, and range for each reported parameter, plus conv_stats results
records – a pd.DataFrame containing the full records
- Return type
summary, records
-
k_seq.estimate.model_ident¶
Modules for model identifiability analysis
-
class
k_seq.estimate.model_ident.
ParamMap
(model, sample_n, x_data, save_to, param1_name, param1_range, param2_name, param2_range, param1_log=False, param2_log=False, model_kwargs=None, bootstrap_num=100, bs_record_num=50, bs_method='data', bs_stats=None, grouper=None, record_full=False, conv_reps=20, conv_stats=None, conv_init_range=None, fitting_kwargs=None, seed=23)¶ Bases:
object
Generate a 2D convergence map for data points randomly sampled from a given parameter range. It simulates sample_n sequence samples with random parameters drawn from (param1_range, param2_range), optionally on a log scale
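The sampling step can be sketched as follows; the ranges, log flags, and parameter names below are illustrative choices, not ParamMap's actual defaults:

```python
import numpy as np

rng = np.random.default_rng(23)  # seed, as in ParamMap's seed argument
sample_n = 100
param1_range, param2_range = (1e-2, 1e2), (0.1, 1.0)  # illustrative ranges
param1_log, param2_log = True, False                  # optional log-scale sampling

def draw(n, lo, hi, log):
    """Draw n values uniformly from (lo, hi), optionally on a log10 scale."""
    if log:
        return 10 ** rng.uniform(np.log10(lo), np.log10(hi), n)
    return rng.uniform(lo, hi, n)

k_vals = draw(sample_n, *param1_range, param1_log)
A_vals = draw(sample_n, *param2_range, param2_log)
# Each (k, A) pair seeds one simulated sequence sample, which is then
# batch-fitted to build the 2D convergence/identifiability map.
print(k_vals.min(), k_vals.max())
```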
-
fit
(**kwargs)¶ Batch fit the simulated results
-
get_metric_values
(metric, finite_only=False)¶ Returns a pd.Series containing the metric values
-
classmethod
load_result
(result_path, model=None)¶
-
plot_map
(metric=None, metric_label=None, scatter=False, gridsize=50, figsize=(5, 5), ax=None, cax_pos=(0.91, 0.58, 0.03, 0.3), **plot_kwargs)¶
-
simulate_samples
(grid=True, const_err=None, rel_err=None, y_enforce_positive=True)¶ Simulate a set of samples (param1 and param2)
-
-
k_seq.estimate.model_ident.
gamma
(df)¶ Get the metric gamma = log10(sigma_k * mu_A / sigma_kA)
-
k_seq.estimate.model_ident.
kendall_log
(records)¶
-
k_seq.estimate.model_ident.
pearson
(records)¶
-
k_seq.estimate.model_ident.
pearson_log
(records)¶
-
k_seq.estimate.model_ident.
remove_nan
(df)¶
-
k_seq.estimate.model_ident.
spearman
(records)¶
-
k_seq.estimate.model_ident.
spearman_log
(records)¶