CAST package

CAST module

CAST.CAST_MARK(coords_raw_t, exp_dict_t, output_path_t, task_name_t=None, gpu_t=None, args=None, epoch_t=None, if_plot=True, graph_strategy='convex')

Runs CAST Mark, which captures common spatial features across samples using a self-supervised graph neural network.

Saves the Delaunay triangulation for each sample, the training log, and the final embeddings and model files to the output path.

If if_plot is true, prints the Delaunay triangulation for each sample and the training log.

Parameters:
  • coords_raw_t (dict[str, array-like]) – A dictionary with the sample names as keys and the spatial coordinates as values (castable to a np.array).

  • exp_dict_t (dict[str, array-like]) – A dictionary with the sample names as keys and the gene expression data as values (castable to a torch.Tensor).

  • output_path_t (str) – The output path.

  • task_name_t (str, optional (default: 'task1')) – The task name, used to save the log file.

  • gpu_t (int, optional) – The GPU ID. If omitted, the GPU ID is set to 0 if a GPU is available.

  • args (Args, optional) – The parameters for the model. If omitted, the default parameters are used (see models.model_GCNII.Args for more information).

  • epoch_t (int, optional) – The number of epochs for training. If omitted, the number of epochs in args is used (default 400 if args is not provided).

  • if_plot (bool, optional (default: True)) – Whether or not to plot the results.

  • graph_strategy ('convex' | 'delaunay', optional (default: 'convex')) –

    The strategy used to construct the Delaunay graph for each sample.

    Convex will use Voronoi polygons clipped to the convex hull of the points and their rook spatial weights matrix (with libpysal).

    Delaunay will use the Delaunay triangulation (with scipy).

Returns:

embed_dict – A dictionary with the sample names as keys and their corresponding graph embeddings as values.

Return type:

dict[str, torch.Tensor]

CAST.CAST_STACK(coords_raw, embed_dict, output_path, graph_list, params_dist=None, tmp1_f1_idx=None, mid_visual=False, sub_node_idxs=None, rescale=False, corr_q_r=None, if_embed_sub=False, early_stop_thres=None)

Runs CAST Stack, which aligns the spatial coordinates of the samples in the graph_list based on their graph embeddings using gradient-descent-based rigid registration and free-form deformation (FFD).

Saves the final coordinates, intermediate results, and the registration parameters to the output path.

Prints the intermediate results if mid_visual is True.

Parameters:
  • coords_raw (dict[str, np.array]) – A dictionary with sample names as keys and their spatial coordinates as values.

  • embed_dict (dict[str, torch.Tensor]) – A dictionary with sample names as keys and their graph embeddings as values (such as the output from CAST Mark).

  • output_path (str) – The output folder path.

  • graph_list (list[str]) – A two-member list of the query sample name and the reference sample name. The query sample is aligned to the reference sample.

  • params_dist (reg_params, optional) –

    The registration parameters. If omitted, the following default parameters are used:

    • dataname: the query sample name

    • gpu: 0

    • iterations: 500

    • dist_penalty1: 0

    • bleeding: 500

    • d_list: [3,2,1,1/2,1/3]

    • attention_params: [None,3,1,0]

    • dist_penalty2: [0]

    • alpha_basis_bs: [500]

    • meshsize: [8]

    • iterations_bs: [400]

    • attention_params_bs: [[tmp1_f1_idx,3,1,0]]

    • mesh_weight: [None]

  • tmp1_f1_idx (array-like, optional) –

    The attention region for the FFD (a True/False index of all the cells of the query sample).

    If tmp1_f1_idx or params_dist is omitted, no attention region is used.

  • mid_visual (bool, optional (default: False)) – Whether to plot the intermediate results.

  • sub_node_idxs (dict[str, np.array], optional) –

    A dictionary where keys are sample names and values are bitmasks indicating whether each cell should be used for alignment.

    If omitted, all coordinates are used for alignment.

  • rescale (bool, optional (default: False)) – Whether to rescale the coordinates (by 22340 / the sample max).

  • corr_q_r (np.array, optional) – The correlation matrix between the query and reference graph embeddings. If omitted, the correlation matrix is calculated.

  • if_embed_sub (bool, optional (default: False)) – Whether to use a subset of the embeddings (defined by sub_node_idxs) to calculate the correlation matrix and plot the results.

  • early_stop_thres (float, optional) – The early stopping threshold for detecting a plateau in affine gradient descent. If omitted, no early stopping is done.

Returns:

coords_final – A dictionary with the sample names as keys and the final transformed coordinates as values.

Return type:

dict[str, np.array]

CAST.CAST_PROJECT(sdata_inte, source_sample, target_sample, coords_source, coords_target, scaled_layer='log2_norm1e4_scaled', raw_layer='raw', batch_key='protocol', use_highly_variable_t=True, ifplot=True, n_components=50, umap_n_neighbors=50, umap_n_pcs=30, min_dist=0.01, spread_t=5, k2=1, source_sample_ctype_col='level_2', output_path='', umap_feature='X_umap', pc_feature='X_pca_harmony', integration_strategy='Harmony', ave_dist_fold=3, save_result=True, ifcombat=True, alignment_shift_adjustment=50, color_dict=None, adjust_shift=False, metric_t='cosine', working_memory_t=1000)

Runs CAST Project, an unsupervised, label-free method to project single cells from the query samples onto a reference sample toward spatially resolved single-cell multi-omics.

Saves the projected dataset and distribution information to the output path.

Parameters:
  • sdata_inte (anndata) – The integrated dataset.

  • source_sample (str) – The source sample name.

  • target_sample (str) – The target sample name.

  • coords_source (array-like) – The coordinates of the source sample.

  • coords_target (array-like) – The coordinates of the target sample.

  • scaled_layer (str, optional (default: 'log2_norm1e4_scaled')) – The name of the layer in sdata_inte.layers to use to find the scaled data.

  • raw_layer (str, optional (default: 'raw')) – The name of the layer in sdata_inte.layers to use to find the raw data.

  • batch_key (str, optional (default: 'protocol')) – The column name of the samples in sdata_inte.obs.

  • use_highly_variable_t (bool, optional (default: True)) – Whether to use highly variable genes.

  • ifplot (bool, optional (default: True)) – Whether to plot the result.

  • n_components (int, optional (default: 50)) – The n_components parameter in sc.pp.pca for Harmony integration (ignored if integration_strategy is not ‘Harmony’).

  • umap_n_neighbors (int, optional (default: 50)) – The n_neighbors parameter in sc.pp.neighbors for Harmony integration (ignored if integration_strategy is not ‘Harmony’).

  • umap_n_pcs (int, optional (default: 30)) – The n_pcs parameter in sc.pp.neighbors for Harmony integration (ignored if integration_strategy is not ‘Harmony’).

  • min_dist (float, optional (default: 0.01)) – The min_dist parameter in sc.tl.umap for Harmony integration (ignored if integration_strategy is not ‘Harmony’).

  • spread_t (int, optional (default: 5)) – The spread parameter in sc.tl.umap for Harmony integration (ignored if integration_strategy is not ‘Harmony’).

  • k2 (int, optional (default: 1)) – The number of nearest neighbors to consider.

  • source_sample_ctype_col (str, optional (default: 'level_2')) – The column name of the cell type annotation in the source sample (in sdata_inte.obs).

  • output_path (str, optional (default: '')) – The output path.

  • umap_feature (str, optional (default: 'X_umap')) – The column name in sdata_inte.obsm to use for the UMAP data for visualization and saving.

  • pc_feature (str, optional (default: 'X_pca_harmony')) – The column name in sdata_inte.obsm to use for the principal component data of the samples.

  • integration_strategy ('Harmony' | None, optional (default: 'Harmony')) – Whether to run Harmony integration.

  • ave_dist_fold (int, optional (default: 3)) – A multiplicative factor on the average distance for the physical distance threshold.

  • save_result (bool, optional (default: True)) – Whether to save the results.

  • ifcombat (bool, optional (default: True)) – Whether to use combat when doing the Harmony integration (ignored if integration_strategy is not ‘Harmony’).

  • alignment_shift_adjustment (int, optional (default: 50)) – An additive factor on the average distance for the physical distance threshold.

  • color_dict (dict, optional) – The color dictionary for cell type annotation in visualizations.

  • adjust_shift (bool, optional (default: False)) – Whether to shift the coordinates of the source cells by the median shift between the target and source cells for each cell type (ignored if source_sample_ctype_col is not given).

  • metric_t (str, optional (default: 'cosine')) – The metric for pairwise distance calculation.

  • working_memory_t (int, optional (default: 1000)) – The sought maximum memory for the chunked pairwise distance calculations.

Returns:

  • sdata_ref (anndata) – The integrated anndata object with the raw and normalized projected data as layers.

  • list[np.ndarray, np.ndarray, np.ndarray, np.ndarray] – The indices of the k-nearest neighbors, their corresponding weights, cosine distances, and physical distances for each cell.
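
The ave_dist_fold and alignment_shift_adjustment descriptions above suggest a physical distance threshold of the form ave_dist_fold * (average distance) + alignment_shift_adjustment. The sketch below illustrates that arithmetic only. It assumes the "average distance" is the mean nearest-neighbor distance between cells, which the text does not state, and the function names are hypothetical and not part of CAST.

```python
import math

def mean_nearest_neighbor_distance(coords):
    """Average distance from each point to its closest other point."""
    total = 0.0
    for i, (xi, yi) in enumerate(coords):
        nearest = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(coords)
            if j != i
        )
        total += nearest
    return total / len(coords)

def physical_distance_threshold(coords, ave_dist_fold=3, alignment_shift_adjustment=50):
    # Assumed form: a multiplicative factor on the average distance plus an additive term.
    return ave_dist_fold * mean_nearest_neighbor_distance(coords) + alignment_shift_adjustment

# Four cells on a 10 x 10 square: every nearest-neighbor distance is 10.
target = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
threshold = physical_distance_threshold(target)
print(threshold)
```

With the defaults, a larger ave_dist_fold widens the search radius in proportion to cell spacing, while alignment_shift_adjustment adds a fixed margin regardless of spacing.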

Subpackages

CAST.CAST_Mark module

CAST_Mark.train_seq(graphs, args, dump_epoch_list, out_prefix, model)

Trains a model_GCNII.CCA_SSG model (using the CCA-SSG approach) for CAST Mark.

Parameters:
  • graphs (List[Tuple(str, dgl.Graph, torch.Tensor)]) – List of 3-member tuples, where each tuple represents one tissue sample. The tuple elements are the sample name, a DGL graph object, and a feature matrix.

  • args (model_GCNII.Args) – The Args object containing the training parameters.

  • dump_epoch_list (List[int]) – The epochs at which training snapshots are dumped for debugging.

  • out_prefix (str) – File name prefix for the snapshot files

  • model (model_GCNII.CCA_SSG) – The untrained GNN model.

Returns:

  • Dict[str, torch.Tensor] – The graph embeddings for each sample.

  • List[float] – The loss value per epoch.

  • model_GCNII.CCA_SSG – The trained GNN model.

CAST_Mark.delaunay_dgl(sample_name, df, output_path, if_plot=True, strategy_t='convex')

Constructs a Delaunay graph from a given dataframe.

Parameters:
  • sample_name (str) – The name of the sample.

  • df (array-like (castable to np.array)) – An array containing the coordinates of the points.

  • output_path (str) – The path to save the plot (if if_plot is True).

  • if_plot (bool, optional (default: True)) – Whether to display and save the graph.

  • strategy_t ('convex' | 'delaunay', optional (default: 'convex')) –

    The strategy to construct the Delaunay graph.

    Convex will use Voronoi polygons clipped to the convex hull of the points and their rook spatial weights matrix (with libpysal).

    Delaunay will use the Delaunay triangulation (with scipy).

Returns:

The Delaunay graph in the DGL format.

Return type:

dgl.DGLGraph

CAST.CAST_Stack module

class CAST_Stack.reg_params(theta_r1: float = 0, d_list: list[float] = <factory>, mirror_t: list[float] | None = None, translation_params: list[float] | None = None, theta_r2: float = 0, alpha_basis: list[float] = <factory>, iterations: int = 500, dist_penalty1: float = 0, attention_params: list[float] = <factory>, ifrigid: bool = False, mesh_trans_list: list[float] = <factory>, attention_region: list[float] = <factory>, attention_params_bs: list[float] = <factory>, mesh_weight: list[float] = <factory>, iterations_bs: list[float] = <factory>, alpha_basis_bs: list[float] = <factory>, meshsize: list[float] = <factory>, img_size_bs: list[float] = <factory>, dist_penalty2: list[float] = <factory>, PaddingRate_bs: float = 0, dataname: str = '', bleeding: float = 500, diff_step: float = 5, min_qr2: float = 0, mean_q: float = 0, mean_r: float = 0, gpu: int = 0)

Bases: object

Parameters for the registration process.

  • Prelocation parameters: theta_r1, d_list, mirror_t, translation_params

  • Affine parameters: theta_r2, alpha_basis, iterations, dist_penalty1, attention_params, ifrigid

  • FFD parameters: mesh_trans_list, attention_region, attention_params_bs, mesh_weight, iterations_bs, alpha_basis_bs, meshsize, img_size_bs, dist_penalty2, PaddingRate_bs

  • Common parameters: dataname, bleeding, diff_step, min_qr2, mean_q, mean_r, gpu, device

theta_r1: float = 0

Initial rotation angle for pre-location.

d_list: list[float]

The scale values to evaluate during pre-location. For example, if the list contains 2, the function evaluates whether a two-fold increase of the coordinates reduces loss.

mirror_t: list[float] = None

The mirror transformation for the pre-location. The elements of d_list will be multiplied by the elements of mirror_t to evaluate the cost function.
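
As a rough illustration of how d_list and mirror_t combine, the sketch below enumerates every product of a scale value and a mirror factor. It assumes a plain Cartesian product and uses the [1, -1] default documented for prelocate. Which axes the signed factors apply to is not specified here, and scale_candidates is a hypothetical helper, not a CAST function.

```python
from itertools import product

def scale_candidates(d_list, mirror_t=(1, -1)):
    """Return every product of a scale value and a mirror factor."""
    return [d * m for d, m in product(d_list, mirror_t)]

# Default d_list from CAST_STACK: [3, 2, 1, 1/2, 1/3]
candidates = scale_candidates([3, 2, 1, 1 / 2, 1 / 3])
print(candidates)
```

Each negative candidate corresponds to a flipped version of the same scale, so a mirrored query sample can still be matched during pre-location.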

translation_params: list[float] = None

A description of the evenly-spaced grid used for translations considered during pre-location. The first two elements are multiplicative factors for the grid boundaries (the first element for x and the second element for y), and the third element is the step size for the grid. If omitted, no translation is done in pre-location.

theta_r2: float = 0

Initial rotation angle for affine transformation.

alpha_basis: list[float]

The coefficients for updating the affine transformation during gradient descent. The first two elements are the coefficients for the scaling, the third element is the coefficient for the rotation, and the last two elements are the coefficients for the translation.

iterations: int = 500

The number of iterations for the affine transformation gradient descent.

We compared the results of aligning S4 to S1 using 100 and 500 iterations. The results showed that in the 100-step task, the DG region of the query sample exhibited a small shift to the one in the reference sample and the five parameters did not converge, in contrast to the 500-step task. Thus, 500 steps were necessary in this case.

dist_penalty1: float = 0

Distance penalty in affine transformation. When the distance of the query cell to the nearest neighbor in the reference sample is greater than a distance threshold (by default, average cell distance), CAST Stack will multiply the initial cost function of those cells by dist_penalty1.

If omitted, no distance penalty will be added.

attention_params: list[float]

The attention mechanism used to increase the penalty of some cells. If attention_params is omitted, dist_penalty is 0, or the first element is None, no attention mechanism is applied.

  • The first element describes the attention region: an np.ndarray of True/False values for each cell in the query sample, or None.

  • The second element is the double penalty: the cost function of cells with attention will be multiplied by this value.

  • The third element is the additional penalty for the attention cells: the cost function of cells with attention will be multiplied by this value.

  • The last element is the additional penalty for the cells with distance penalty and attention (penalty_inc_both). The initial cost function value of these cells will be multiplied by (penalty_inc_both/dist_penalty + 1).

ifrigid: bool = False

If True, the affine transformation must have a uniform scaling factor. If False, the affine transformation will be performed without constraints.

mesh_trans_list: list[float]

A list of the transformed mesh grids for each round of FFD, used to apply the FFD transformation to the query sample.

attention_region: list[float]

The region to apply the attention mechanism. The True/False index of all the cells of the query sample or None.

attention_params_bs: list[float]

The attention mechanism to increase the penalty of the cells for calculating the loss during BSpline gradient descent. Refer to attention_params.

mesh_weight: list[float]

The weight matrix for the mesh grid (used as a multiplicative factor for the gradient). This is an np.ndarray of the same shape as the mesh grid; otherwise it is set to 1.

iterations_bs: list[float]

Number of iterations of the FFD. We compared the results of aligning S4 to S1 using 50 and 400 iterations. The results showed that the CA1 region could be better aligned in the 400-step task than the 50-step task.

alpha_basis_bs: list[float]

The learning rate for the FFD transformation.

meshsize: list[float]

The mesh size for the FFD (as the number of meshgrid cells in each dimension). A smaller value of meshsize generally gives coarse-grained adjustment, while a higher value of meshsize could adjust more details. We observed that the S4 to S1 alignment task with a meshsize of 4 exhibits poor alignment performance compared to the tasks with a meshsize of 8 or a meshsize of 10.

img_size_bs: list[float]

The size of the image to transform for each round of the FFD transformation.

dist_penalty2: list[float]

Distance penalty parameter in FFD. Refer to dist_penalty1.

PaddingRate_bs: float = 0

The padding between the image size and the sample coordinates.

dataname: str = ''

Name of the dataset.

bleeding: float = 500

When the reference sample is larger than the query sample, for efficient computation, only the region of the reference sample within bleeding distance of the query sample will be considered when calculating the cost function.
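
One way to picture the bleeding parameter is as a margin around the query sample. The sketch below keeps only the reference points that fall inside the query bounding box expanded by bleeding on every side. The bounding-box interpretation is an assumption (the text only says "within bleeding distance"), and crop_reference is a hypothetical helper.

```python
def crop_reference(coords_q, coords_r, bleeding=500):
    """Keep reference points inside the query bounding box expanded by `bleeding`."""
    xs = [x for x, _ in coords_q]
    ys = [y for _, y in coords_q]
    x_lo, x_hi = min(xs) - bleeding, max(xs) + bleeding
    y_lo, y_hi = min(ys) - bleeding, max(ys) + bleeding
    return [
        (x, y) for x, y in coords_r
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi
    ]

query = [(0, 0), (100, 100)]
reference = [(50, 50), (550, 50), (700, 50), (-600, 0)]
kept = crop_reference(query, reference, bleeding=500)
print(kept)
```

Reference cells far from the query never become nearest neighbors, so dropping them reduces the number of pairwise distances without changing the cost.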

diff_step: float = 5

The distance used for approximating the gradient. The gradient is approximated by the difference between the cost function values obtained after shifting the query sample coordinates by +diff_step and by -diff_step.
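
The description above corresponds to a finite-difference gradient. A minimal sketch of a symmetric (central) difference follows, assuming the difference is divided by 2 * diff_step. The normalization CAST actually uses is not stated here, and toy_cost is a made-up quadratic used only for illustration.

```python
def central_difference_gradient(cost, x, y, diff_step=5.0):
    """Approximate d(cost)/dx and d(cost)/dy with symmetric differences."""
    dx = (cost(x + diff_step, y) - cost(x - diff_step, y)) / (2 * diff_step)
    dy = (cost(x, y + diff_step) - cost(x, y - diff_step)) / (2 * diff_step)
    return dx, dy

def toy_cost(x, y):
    # Hypothetical quadratic cost with its minimum at (10, -20).
    return (x - 10) ** 2 + (y + 20) ** 2

grad = central_difference_gradient(toy_cost, 0.0, 0.0, diff_step=5.0)
print(grad)
```

A larger diff_step smooths over small-scale noise in the cost surface at the price of a coarser gradient estimate.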

min_qr2: float = 0

The minimum value of the query sample; this is subtracted from the query sample before FFD transformation.

mean_q: float = 0

The mean value of the query sample coordinates.

mean_r: float = 0

The mean value of the reference sample coordinates.

gpu: int = 0

The GPU device number. If -1, the CPU will be used.

device: str

The device for the computation. This is set automatically based on the GPU number.
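
The relationship between gpu and device can be sketched as below. The device strings follow the usual PyTorch naming ("cpu", "cuda:0"), which is an assumption about how reg_params formats them. The fallback to the CPU when CUDA is unavailable is also an assumption, and device_from_gpu is a hypothetical helper.

```python
def device_from_gpu(gpu, cuda_available=True):
    """Map a GPU number to a device string; -1 selects the CPU."""
    if gpu == -1 or not cuda_available:
        return "cpu"
    return f"cuda:{gpu}"

print(device_from_gpu(0))
print(device_from_gpu(-1))
```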

CAST_Stack.get_range(sp_coords)

Gets the x and y range of a set of coordinates.

Parameters:

sp_coords (array-like) – The coordinates of the sample.

Returns:

  • xrng (float) – The range of the x coordinates.

  • yrng (float) – The range of the y coordinates.
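
For orientation, a pure-Python stand-in for this helper might look like the following. It assumes "range" means maximum minus minimum along each axis; get_range_sketch is not the library function.

```python
def get_range_sketch(sp_coords):
    """Return (x range, y range) as max minus min along each axis."""
    xs = [p[0] for p in sp_coords]
    ys = [p[1] for p in sp_coords]
    return max(xs) - min(xs), max(ys) - min(ys)

xrng, yrng = get_range_sketch([(1, 2), (4, 10), (3, -2)])
print(xrng, yrng)
```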

CAST_Stack.prelocate(coords_q, coords_r, cov_anchor_it, bleeding, output_path, d_list=[1, 2, 3], prefix='test', ifplot=True, index_list=None, translation_params=None, mirror_t=None)

Performs the pre-location step of the registration process: it finds the best cost function value across different fixed affine transformation parameters to reduce computation time and avoid local minima in full affine gradient descent.

Parameters:
  • coords_q (torch.Tensor) – The query sample coordinates.

  • coords_r (torch.Tensor) – The reference sample coordinates.

  • cov_anchor_it (2D array-like) – The covariance matrix of the query and reference sample.

  • bleeding (int) – When the reference sample is larger than the query sample, for efficient computation, only the region of the reference sample within bleeding distance of the query sample is considered when calculating the cost function.

  • output_path (str) – The path to save the visualizations if ifplot is True.

  • d_list (list[float], optional (default: [1,2,3])) – The scale values to evaluate during pre-location. For example, if the list contains 2, the function evaluates whether a two-fold increase of the coordinates reduces loss.

  • prefix (str, optional (default: 'test')) – The prefix of the saved plots if ifplot is True.

  • ifplot (bool, optional (default: True)) – If True, the visualization of the pre-location will be saved at output_path.

  • index_list (list[np.array[bool]], optional) – A mask indicating which cells to consider in the query and reference samples when calculating loss. If omitted, all cells will be used.

  • translation_params (list[float], optional) –

    A description of the evenly-spaced grid used for translations considered during pre-location. The first two elements are multiplicative factors for the grid boundaries (the first element for x and the second element for y), and the third element is the step size for the grid.

    If omitted, no translation is done in pre-location.

  • mirror_t (list[float], optional (default: [1,-1])) – The mirror transformations for the pre-location. The elements of d_list will be multiplied by the elements of mirror_t.

Returns:

theta – The optimal affine parameters found during pre-location. theta has five elements: the first two are the coefficients for scaling in x and y, the third is the rotation in degrees, and the last two are the coefficients for translation in x and y.

Return type:

torch.Tensor
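
To make the five-element theta layout concrete, the sketch below applies [scale_x, scale_y, rotation_degrees, shift_x, shift_y] to a list of points. The order of operations (scale, then rotate about the origin, then translate) is an assumption. CAST may center the coordinates or compose the steps differently, and apply_theta is a hypothetical helper.

```python
import math

def apply_theta(coords, theta):
    """Scale, rotate (degrees, about the origin), then translate each point."""
    sx, sy, rot_deg, tx, ty = theta
    rad = math.radians(rot_deg)
    cos_r, sin_r = math.cos(rad), math.sin(rad)
    out = []
    for x, y in coords:
        xs, ys = x * sx, y * sy
        out.append((xs * cos_r - ys * sin_r + tx, xs * sin_r + ys * cos_r + ty))
    return out

# Double the size, rotate by 90 degrees, then shift by (5, -1).
moved = apply_theta([(1.0, 0.0)], [2.0, 2.0, 90.0, 5.0, -1.0])
print(moved)
```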

CAST_Stack.Affine_GD(coords_query_it_raw, coords_ref_it, cov_anchor_it, output_path, bleeding=500, dist_penalty=0, diff_step=50, alpha_basis=array([[0.], [0.], [0.2], [2.], [2.]]), iterations=50, prefix='test', attention_params=[None, 3, 1, 0], scale_t=1, coords_log=False, index_list=None, mid_visual=False, early_stop_thres=1, ifrigid=False)

Given the query and reference samples, calculates the optimal affine transformation using gradient descent.

Parameters:
  • coords_query_it_raw (torch.Tensor) – The query sample coordinates.

  • coords_ref_it (torch.Tensor) – The reference sample coordinates.

  • cov_anchor_it (2D array-like) – The covariance matrix of the query and reference sample.

  • bleeding (int, optional (default: 500)) – When the reference sample is larger than the query sample, for efficient computation, only the region of the reference sample within bleeding distance of the query sample will be considered when calculating the cost function.

  • dist_penalty (float, optional (default: 0)) –

    Distance penalty in affine transformation. When the distance of the query cell to the nearest neighbor in the reference sample is greater than a distance threshold (by default, average cell distance), CAST Stack will multiply the initial cost function of those cells by dist_penalty.

    If omitted, no distance penalty will be added.

  • diff_step (float, optional (default: 50)) – The distance used for approximating the gradient. The gradient is approximated by the difference between the cost function values obtained after shifting the query sample coordinates by +diff_step and by -diff_step.

  • alpha_basis (np.array, optional (default: np.reshape(np.array([0,0,1/5,2,2]),[5,1]))) – The initial coefficients for calculating the learning rate for each parameter in the affine transformation. These constants are then multiplied by a decreasing function for each iteration. Following the format of theta, the first two elements are the coefficients for the scaling, the third element is the coefficient for the rotation, and the last two elements are the coefficients for the translation.

  • iterations (int, optional (default: 50)) – The number of iterations for the affine transformation. We compared the results of aligning S4 to S1 using 100 and 500 iterations. The results showed that in the 100-step task, the DG region of the query sample exhibited a small shift to the one in the reference sample and the five parameters did not converge, in contrast to the 500-step task. Thus, 500 steps were necessary in this case.

  • prefix (str, optional (default: 'test')) – The prefix of the file names for the saved plots if mid_visual is True.

  • attention_params (list[float], optional) –

    The attention mechanism used to increase the penalty of some cells. If attention_params is omitted, dist_penalty is 0, or the first element is None, no attention mechanism is applied.

    • The first element describes the attention region: an np.ndarray of True/False values for each cell in the query sample, or None.

    • The second element is the double penalty: the cost function of cells with attention will be multiplied by this value.

    • The third element is the additional penalty for the attention cells: the cost function of cells with attention will be multiplied by this value.

    • The last element is the additional penalty for the cells with distance penalty and attention (penalty_inc_both). The initial cost function value of these cells will be multiplied by (penalty_inc_both/dist_penalty + 1).

  • scale_t (float, optional (default: 1)) – The scaling factor for the visualization (if mid_visual is true).

  • coords_log (bool, optional (default: False)) – If True, the coordinates of the query sample will be saved for each iteration.

  • index_list (list[np.ndarray[bool]], optional) – A mask indicating which cells to consider in the query and reference samples when calculating loss. If omitted, all cells will be used.

  • mid_visual (bool, optional (default: False)) – If True, the function plots and saves intermediate results (the initial positions and the results after every 20 iterations of gradient descent).

  • early_stop_thres (float, optional (default: 1)) – The threshold for early stopping. If the cost function does not change more than the threshold for five consecutive iterations, the registration process will be stopped.

  • ifrigid (bool, optional (default: False)) – If True, the affine transformation must have a uniform scaling factor. If False, the affine transformation will be performed without constraints.

Returns:

  • The cost function value for each iteration.

  • The gradient of the cost function for each iteration.

  • The parameters for the affine transformation for each iteration. Each parameter vector has five elements: the first two are the coefficients for scaling in x and y, the third is the rotation in degrees, and the last two are the coefficients for translation in x and y.

  • The coordinates of the query sample for each iteration (if coords_log is True).

Return type:

list[list[float], list[torch.Tensor], list[torch.Tensor], list[np.array]]
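
The early_stop_thres rule described above (stop when the cost changes by no more than the threshold for five consecutive iterations) can be sketched as a plateau check over the cost history. The exact comparison used inside Affine_GD is not shown in this reference, so should_stop and its patience argument are illustrative assumptions.

```python
def should_stop(cost_history, early_stop_thres=1.0, patience=5):
    """True when the last `patience` step-to-step changes are all within the threshold."""
    if len(cost_history) < patience + 1:
        return False
    recent = cost_history[-(patience + 1):]
    return all(abs(b - a) <= early_stop_thres for a, b in zip(recent, recent[1:]))

still_moving = [100, 50, 49.8, 49.7, 49.6, 49.5]
plateaued = still_moving + [49.4]
print(should_stop(still_moving), should_stop(plateaued))
```

In the first history the large drop from 100 to 50 is still inside the five-step window, so the check only fires once one more small step has been recorded.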

CAST_Stack.BSpline_GD(coords_q, coords_r, cov_anchor_it, iterations, output_path, bleeding, dist_penalty=0, alpha_basis=1000, diff_step=50, mesh_size=5, prefix='test', mesh_weight=None, attention_params=[None, 3, 1, 0], scale_t=1, coords_log=False, index_list=None, mid_visual=False, max_xy=None)

Given the query and reference samples, calculates the optimal FFD transformation using gradient descent.

Parameters:
  • coords_q (torch.Tensor) – The query sample coordinates.

  • coords_r (torch.Tensor) – The reference sample coordinates.

  • cov_anchor_it (2D array-like) – The covariance matrix of the query and reference sample.

  • iterations (int) – The number of iterations for gradient descent. We compared the results of aligning S4 to S1 using 50 and 400 iterations. The results showed that the CA1 region could be better aligned in the 400-step task than the 50-step task.

  • output_path (str) – The path to save the visualization if mid_visual is True.

  • bleeding (int) – When the reference sample is larger than the query sample, for efficient computation, only the region of the reference sample within bleeding distance of the query sample will be considered when calculating the cost function.

  • dist_penalty (float, optional) –

    Distance penalty in the FFD transformation. When the distance of the query cell to the nearest neighbor in the reference sample is greater than a distance threshold (by default, average cell distance), CAST Stack will multiply the initial cost function of those cells by dist_penalty.

    If omitted, no distance penalty will be added.

  • alpha_basis (float, optional (default: 1000)) – The learning rate for the FFD transformation.

  • diff_step (float, optional (default: 50)) – The distance used for approximating the gradient. The gradient is approximated by the difference between the cost function values obtained after shifting the query sample coordinates by +diff_step and by -diff_step.

  • mesh_size (int, optional (default: 5)) – The mesh size for the FFD (as the number of meshgrid cells in each dimension). A smaller mesh_size generally gives coarse-grained adjustment, while a larger mesh_size could adjust more details. We observed that the S4 to S1 alignment task with a mesh_size of 4 exhibits poor alignment performance compared to the tasks with a mesh_size of 8 or a mesh_size of 10.

  • prefix (str, optional (default: 'test')) – The prefix of the saved plots if mid_visual is True.

  • mesh_weight (np.array, optional (default: 1)) – The weight matrix for the mesh grid (used as a multiplicative factor for the gradient). This is an np.ndarray of the same shape as the mesh grid; otherwise it is set to 1.

  • attention_params (list[float], optional) –

    The attention mechanism used to increase the penalty of some cells. If attention_params is omitted, dist_penalty is 0, or the first element is None, no attention mechanism is applied.

    • The first element describes the attention region: an np.ndarray of True/False values for each cell in the query sample, or None.

    • The second element is the double penalty: the cost function of cells with attention will be multiplied by this value.

    • The third element is the additional penalty for the attention cells: the cost function of cells with attention will be multiplied by this value.

    • The last element is the additional penalty for the cells with distance penalty and attention (penalty_inc_both). The initial cost function value of these cells will be multiplied by (penalty_inc_both/dist_penalty + 1).

  • scale_t (float, optional (default: 1)) – The scaling factor for the visualization (if mid_visual is true).

  • coords_log (bool, optional (default: False)) – If True, the coordinates of the query sample will be saved for each iteration.

  • index_list (list[list[bool]]) – A mask indicating which cells to consider in the query and reference samples when calculating loss. If omitted, all cells will be used.

  • mid_visual (bool, optional (default: False)) – If True, the function plots and saves intermediate results (the initial positions and the results after every 20 iterations of gradient descent).

  • max_xy (torch.Tensor, optional) – The maximum x and y coordinates for the mesh. If omitted, the maximum x and y coordinates of the query sample are used.

Returns:

  • The coordinates of the query sample after the FFD transformation.

  • The mesh grid for each iteration.

  • The gradient for each iteration.

  • The cost function value per iteration.

  • The coordinates of the query sample for each iteration (if coords_log is True).

Return type:

list[np.array]

CAST_Stack.J_cal(coords_q, coords_r, cov_mat, bleeding=10, dist_penalty=0, attention_params=[None, 3, 1, 0])

Calculates the cost function based on the covariance matrix and the distance between the query and reference samples (applying bleeding, distance penalty, and the attention mechanism).

Parameters:
  • coords_q (torch.Tensor) – The query sample coordinates.

  • coords_r (torch.Tensor) – The reference sample coordinates.

  • cov_mat (2D array-like) – The covariance matrix of the query and reference sample.

  • bleeding (int, optional (default: 10)) – When the reference sample is larger than the query sample, for efficient computation, only the region of the reference sample within bleeding distance of the query sample will be considered when calculating the cost function.

  • dist_penalty (float, optional) –

    Distance penalty in affine transformation. When the distance of the query cell to the nearest neighbor in the reference sample is greater than a distance threshold (by default, average cell distance), CAST Stack will multiply the initial cost function of those cells by dist_penalty.

    If omitted, no distance penalty will be added.

  • attention_params (list[float], optional (default: [None, 3,1,0])) –

    The attention mechanism, which increases the penalty for selected cells. No attention mechanism is applied if attention_params is omitted, if dist_penalty is 0, or if the first element is None.

    • The first element describes the attention region: either None or an np.ndarray of True/False values, one per cell in the query sample.

    • The second element is the double penalty: the cost function of cells with attention will be multiplied by this value.

    • The third element is the additional penalty for the attention cells: the cost function of cells with attention will be multiplied by this value.

    • The last element is the additional penalty for the cells with distance penalty and attention (penalty_inc_both). The initial cost function value of these cells will be multiplied by (penalty_inc_both/dist_penalty + 1).

Returns:

The cost function array - the scaled covariance scores between each point in coords_q and its closest point in coords_r.

Return type:

np.array

CAST_Stack.alpha_init(alpha_basis, it, dev)

Multiplies the alpha_basis values by 5/(it/40 + 1)^0.6 (a decreasing function of it) to get the alpha value for the current iteration.

Parameters:
  • alpha_basis (torch.Tensor) – The initial coefficients of the learning rate for each element of theta in Affine transformation.

  • it (int) – The current iteration.

  • dev (str) – The device for the computation.

Returns:

The learning rates for the current iteration.

Return type:

torch.Tensor
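
The schedule described above can be sketched in a few lines. This is a minimal illustration of the stated formula only, written with NumPy rather than torch; the helper name alpha_schedule and the example values are hypothetical, and the device handling of the real function is omitted.

```python
import numpy as np

def alpha_schedule(alpha_basis, it):
    """Scale alpha_basis by 5 / (it / 40 + 1) ** 0.6 (hypothetical helper)."""
    alpha_basis = np.asarray(alpha_basis, dtype=float)
    return alpha_basis * 5.0 / (it / 40.0 + 1.0) ** 0.6

basis = np.array([1.0, 1.0, 0.5, 10.0, 10.0])  # illustrative values only
alpha_0 = alpha_schedule(basis, 0)      # the factor is exactly 5 at it = 0
alpha_400 = alpha_schedule(basis, 400)  # the factor shrinks as it grows
```

Because the factor decreases monotonically with it, later iterations take smaller steps for every element of theta.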

CAST_Stack.dJ_dt_cal(coords_q, coords_r, diff_step, dev, cov_anchor_it, bleeding, dist_penalty, attention_params)

Calculates the gradient of the loss with respect to x and y for each cell in the query sample.

Parameters:
  • coords_q (torch.Tensor) – The query sample coordinates.

  • coords_r (torch.Tensor) – The reference sample coordinates.

  • diff_step (float) – The step size used to approximate the gradient. The gradient is approximated by the difference between the cost function values obtained after shifting the query sample coordinates by ±diff_step.

  • dev (str) – The device for the computation.

  • cov_anchor_it (2D array-like) – The covariance matrix of the query and reference sample.

  • bleeding (int) – When the reference sample is larger than the query sample, for efficient computation, only the region of the reference sample within bleeding distance of the query sample will be considered when calculating the cost function.

  • dist_penalty (float) –

    Distance penalty in affine transformation. When the distance from a query cell to its nearest neighbor in the reference sample is greater than a distance threshold (by default, the average cell distance), CAST Stack will multiply the initial cost function of that cell by dist_penalty.

    If omitted, no distance penalty will be added.

  • attention_params (list[float]) –

    The attention mechanism, which increases the penalty for selected cells. No attention mechanism is applied if attention_params is omitted, if dist_penalty is 0, or if the first element is None.

    • The first element describes the attention region: either None or an np.ndarray of True/False values, one per cell in the query sample.

    • The second element is the double penalty: the cost function of cells with attention will be multiplied by this value.

    • The third element is the additional penalty for the attention cells: the cost function of cells with attention will be multiplied by this value.

    • The last element is the additional penalty for the cells with distance penalty and attention (penalty_inc_both). The initial cost function value of these cells will be multiplied by (penalty_inc_both/dist_penalty + 1).

Returns:

The stacked gradient of the loss with respect to x and y for each cell in the query sample.

Return type:

torch.Tensor
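
The ±diff_step approximation can be illustrated with a central difference. The sketch below is an assumption about the general technique, not the library's code: finite_diff_grad and toy_cost are hypothetical names, NumPy stands in for torch, and the real function's normalization and output layout may differ.

```python
import numpy as np

def finite_diff_grad(cost_fn, coords_q, diff_step):
    """Approximate d(cost)/dx and d(cost)/dy per cell by central differences.

    cost_fn maps an (N, 2) array to an (N,) array of per-cell costs
    (a hypothetical stand-in for the cost function).
    """
    coords_q = np.asarray(coords_q, dtype=float)
    grad = np.zeros_like(coords_q)
    for axis in range(2):
        shift = np.zeros(2)
        shift[axis] = diff_step
        # Cost after moving every cell by +diff_step and -diff_step along one axis.
        plus = cost_fn(coords_q + shift)
        minus = cost_fn(coords_q - shift)
        grad[:, axis] = (plus - minus) / (2 * diff_step)
    return grad

def toy_cost(coords):
    """Toy per-cell cost: squared distance to the origin (gradient is 2 * coords)."""
    return (coords ** 2).sum(axis=1)

pts = np.array([[1.0, 2.0], [-3.0, 0.5]])
grad = finite_diff_grad(toy_cost, pts, diff_step=1e-3)
```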

CAST_Stack.dJ_dtheta_cal(xi, yi, dJ_dxy_mat, theta, dev, ifrigid=False)

Calculates the gradient of the loss with respect to the affine transformation parameters.

Parameters:
  • xi (torch.Tensor) – The x coordinates of the query sample.

  • yi (torch.Tensor) – The y coordinates of the query sample.

  • dJ_dxy_mat (torch.Tensor) – The gradient of the loss with respect to x and y for each cell in the query sample.

  • theta (torch.Tensor) – The parameters of the affine transformation, ordered as: scaling in x, scaling in y, rotation in degrees, translation in x, translation in y.

  • dev (str) – The device for the computation.

  • ifrigid (bool, optional (default: False)) – If True, the affine transformation must have a uniform scaling factor. If False, the affine transformation will be performed without constraints.

Returns:

The gradient of the loss with respect to the affine transformation parameters.

  • dxy_da: The gradient of the loss with respect to the scaling factor in x. {x * cos(rad_phi), x * sin(rad_phi)} if ifrigid is False, else {x * cos(rad_phi) - y * sin(rad_phi), y * cos(rad_phi) + x * sin(rad_phi)}.

  • dxy_dd: The gradient of the loss with respect to the scaling factor in y. {-y * sin(rad_phi), y * cos(rad_phi)} if ifrigid is False, else the same value as dxy_da: {x * cos(rad_phi) - y * sin(rad_phi), y * cos(rad_phi) + x * sin(rad_phi)}.

  • dxy_dphi: The gradient of the loss with respect to the rotation. {-d * y * cos(rad_phi) - a * x * sin(rad_phi), a * x * cos(rad_phi) - d * y * sin(rad_phi)}

  • dxy_dt1: The gradient of the loss with respect to the translation in x. {1, 0}

  • dxy_dt2: The gradient of the loss with respect to the translation in y. {0, 1}

Return type:

torch.Tensor
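
To show how the partial derivatives listed above combine with dJ_dxy_mat, here is a hedged chain-rule sketch for the non-rigid case. It assumes theta = [a, d, phi, t1, t2], sums the contributions over all cells, and takes the rotation derivative per radian; the name dJ_dtheta_sketch is hypothetical, NumPy replaces torch, and the real function may scale or reduce the terms differently.

```python
import numpy as np

def dJ_dtheta_sketch(xi, yi, dJ_dxy, theta):
    """Chain rule: sum over cells of dJ/dx * dx/dparam + dJ/dy * dy/dparam.

    dJ_dxy has shape (N, 2); theta = [a, d, phi_degrees, t1, t2].
    """
    a, d, phi_deg, _, _ = theta
    phi = np.deg2rad(phi_deg)
    cos, sin = np.cos(phi), np.sin(phi)
    ones, zeros = np.ones_like(xi), np.zeros_like(xi)
    # Each entry is (dx/dparam, dy/dparam), following the list above.
    partials = [
        (xi * cos, xi * sin),                                         # dxy_da
        (-yi * sin, yi * cos),                                        # dxy_dd
        (-d * yi * cos - a * xi * sin, a * xi * cos - d * yi * sin),  # dxy_dphi
        (ones, zeros),                                                # dxy_dt1
        (zeros, ones),                                                # dxy_dt2
    ]
    return np.array([
        np.sum(dJ_dxy[:, 0] * dx + dJ_dxy[:, 1] * dy) for dx, dy in partials
    ])

xi = np.array([1.0, 2.0])
yi = np.array([0.5, -1.0])
dJ_dxy = np.array([[0.1, 0.2], [0.3, -0.4]])
grad_theta = dJ_dtheta_sketch(xi, yi, dJ_dxy, theta=[1.0, 1.0, 0.0, 0.0, 0.0])
```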

CAST_Stack.theta_renew(theta, dJ_dtheta, alpha, ifrigid=False)

Updates theta based on the gradient dJ_dtheta and the learning rate alpha.

Parameters:
  • theta (torch.Tensor) – The current parameters of the affine transformation, ordered as: scaling in x, scaling in y, rotation in degrees, translation in x, translation in y.

  • dJ_dtheta (torch.Tensor) – The gradient of the cost function with respect to the affine transformation parameters.

  • alpha (float) – The learning rate.

  • ifrigid (bool, optional (default: False)) – If True, the affine transformation must have a uniform scaling factor. If False, the affine transformation will be performed without constraints.

Returns:

The updated parameters of the affine transformation in the same format as theta.

Return type:

torch.Tensor

CAST_Stack.affine_trans_t(theta, coords_t)

Applies the affine transformation (defined by parameters theta) to the coordinates.

Parameters:
  • theta (torch.Tensor) – The parameters of the affine transformation, ordered as: scaling in x, scaling in y, rotation in degrees, translation in x, translation in y.

  • coords_t (torch.Tensor) – The coordinates to be transformed.

Returns:

The transformed coordinates.

Return type:

torch.Tensor
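
A minimal sketch of such a transformation follows, assuming theta = [a, d, phi, t1, t2] and the form implied by the partial derivatives listed under dJ_dtheta_cal. It uses NumPy instead of torch, and the name affine_sketch is hypothetical; the library's actual implementation may differ in details such as how the coordinates are centered.

```python
import numpy as np

def affine_sketch(theta, coords):
    """Scale by (a, d), rotate by phi degrees, then translate by (t1, t2)."""
    a, d, phi_deg, t1, t2 = theta
    phi = np.deg2rad(phi_deg)
    x, y = coords[:, 0], coords[:, 1]
    x_new = a * x * np.cos(phi) - d * y * np.sin(phi) + t1
    y_new = a * x * np.sin(phi) + d * y * np.cos(phi) + t2
    return np.stack([x_new, y_new], axis=1)

pts = np.array([[1.0, 0.0], [0.0, 1.0]])
rotated = affine_sketch([1.0, 1.0, 90.0, 0.0, 0.0], pts)  # pure 90-degree rotation
shifted = affine_sketch([2.0, 2.0, 0.0, 5.0, -1.0], pts)  # scale by 2, then translate
```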

CAST_Stack.torch_Bspline(uv, kl)

Calculates the B-spline basis functions for the given uv and kl.

Parameters:
  • uv (torch.Tensor) – The uv coordinates - the input for the B-spline basis functions (the coordinates of each cell within its containing mesh grid tile).

  • kl (torch.Tensor) – The kl values - the indices for the B-spline basis functions.

Returns:

The result of the B-spline basis functions.

Return type:

torch.Tensor
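
The source does not state which basis is used. As an illustration only, the sketch below assumes the standard uniform cubic B-spline basis that is commonly used for free-form deformation, with four basis functions per axis (which would be consistent with the 2 x 16 index tensor described for B_matrix). The name cubic_bspline_basis is hypothetical and NumPy stands in for torch.

```python
import numpy as np

def cubic_bspline_basis(u, k):
    """Uniform cubic B-spline basis B_k(u) for k in {0, 1, 2, 3}, 0 <= u < 1."""
    u = np.asarray(u, dtype=float)
    if k == 0:
        return (1 - u) ** 3 / 6
    if k == 1:
        return (3 * u ** 3 - 6 * u ** 2 + 4) / 6
    if k == 2:
        return (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6
    if k == 3:
        return u ** 3 / 6
    raise ValueError("k must be 0, 1, 2, or 3")

u = np.linspace(0.0, 0.99, 5)
weights = np.stack([cubic_bspline_basis(u, k) for k in range(4)])
total = weights.sum(axis=0)  # the four basis functions form a partition of unity
```

In a 2D FFD under this assumption, the weight of each of the 16 surrounding control points is the product of one basis value in u and one in v.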

CAST_Stack.BSpline_GD_preparation(max_xy, mesh_size, dev, mesh_weight)

Initializes the mesh grid, mesh weight, B-spline basis functions, and the gradient of the FFD transformation.

Parameters:
  • max_xy (torch.Tensor) – The maximum x and y coordinates for the mesh.

  • mesh_size (int) – The mesh size for the FFD (as the number of mesh grid cells in each dimension). A smaller mesh_size generally gives coarse-grained adjustment, while a larger mesh_size can adjust finer details. We observed that the S4 to S1 alignment task with a mesh_size of 4 exhibits poor alignment performance compared to the same task with a mesh_size of 8 or 10.

  • dev (str) – The device for the computation.

  • mesh_weight (np.ndarray, optional (default: 1)) – The weight matrix for the mesh grid (used as a multiplicative factor for the gradient). If given, this is an np.ndarray of the same shape as the mesh grid; otherwise it is set to 1.

Returns:

  • np.array – The mesh grid.

  • torch.Tensor | 1 – The mesh weights.

  • torch.Tensor – The indices for the B-spline basis functions.

  • torch.Tensor – The initial gradient of the FFD transformation (zeros).

  • torch.Tensor – The mesh grid size.

CAST_Stack.BSpline_GD_uv_ij_calculate(coords_query_it, delta, dev)

Gets the uv and ij coordinates for each query cell.

Parameters:
  • coords_query_it (torch.Tensor) – The coordinates of the query sample.

  • delta (torch.Tensor) – The mesh grid size (as a 2x1 tensor whose values represent physical lengths in x and y).

  • dev (str) – The device for the computation.

Returns:

  • torch.Tensor – The uv coordinates (the coordinates of each query cell within its containing mesh grid tile).

  • torch.Tensor – The ij coordinates (the position of each query cell's containing tile within the mesh grid).

CAST_Stack.B_matrix(uv_t, kls_t)

Calculates the result of the B-spline basis functions for the given uv and kl.

Parameters:
  • uv_t (torch.Tensor) – The uv coordinates - the input for the B-spline basis functions (the coordinates of each cell within its containing mesh grid tile).

  • kls_t (torch.Tensor) – The kl values - the indices for the B-spline basis functions (as a 2 x 16 torch.Tensor).

Returns:

The concatenated result of the B-spline basis functions indicated by kls_t (as a 16 * N[idx] torch.Tensor).

Return type:

torch.Tensor

CAST_Stack.get_dxy_ffd(ij, result_B_t, mesh, dJ_dxy_mat, mesh_weight, alpha_basis)

Calculates the gradient of the FFD transformation with respect to x and y for each point in the mesh grid.

Parameters:
  • ij (torch.Tensor) – The ij coordinates - the position of all cells within their containing mesh grid tile.

  • result_B_t (torch.Tensor) – The concatenated result of the B-spline basis functions (generated by B_matrix).

  • mesh (2D array-like) – The mesh grid (used only to determine the size of the gradient tensor).

  • dJ_dxy_mat (torch.Tensor) – The gradient of the loss with respect to x and y for each cell in the query sample.

  • mesh_weight (torch.Tensor | float) – The weight matrix for the mesh grid (used as a multiplicative factor for the gradient).

  • alpha_basis (float) – The learning rate for the FFD transformation.

Returns:

The gradient of the FFD transformation with respect to x and y for each point in the mesh grid.

Return type:

torch.Tensor

CAST_Stack.BSpline_renew_coords(uv_t, kls_t, ij_t, mesh_trans)

Updates the coordinates of the query sample based on the FFD transformation.

Parameters:
  • uv_t (torch.Tensor) – The uv coordinates - the input for the B-spline basis functions (the coordinates of each query cell within its containing mesh grid tile).

  • kls_t (torch.Tensor) – The kl values - the indices for the B-spline basis functions.

  • ij_t (torch.Tensor) – The ij coordinates - the position of each query cell within their containing mesh grid.

  • mesh_trans (torch.Tensor) – The transformed mesh grid.

Returns:

The updated coordinates of the query sample.

Return type:

torch.Tensor

CAST_Stack.reg_total_t(coords_q, coords_r, params_dist)

Applies the affine and FFD transformation described in params_dist to the query sample.

Parameters:
  • coords_q (array-like) – The query sample coordinates.

  • coords_r (array-like) – The reference sample coordinates (used to recenter the transformed query sample to the reference sample mean).

  • params_dist (reg_params) – The parameters for the affine and FFD transformations.

Returns:

  • torch.Tensor – The coordinates of the query sample after the affine and FFD transformation.

  • torch.Tensor – The coordinates of the query sample after the affine and FFD transformation, recentered to the mean of the reference sample.

CAST_Stack.FFD_Bspline_apply_t(coords_q, params_dist, round_t=0)

Applies one round of the FFD transformations described in params_dist to the query sample.

Parameters:
  • coords_q (torch.Tensor) – The query sample coordinates.

  • params_dist (reg_params) – The parameters for the FFD transformation - the mesh_trans_list attribute should describe the FFD to apply.

  • round_t (int, optional (default: 0)) – The round of the FFD transformation to apply.

Returns:

The query sample after the FFD transformation.

Return type:

torch.Tensor

CAST_Stack.rescale_coords(coords_raw, graph_list, rescale=False)

Rescales the coordinates to a max of 22340.

Parameters:
  • coords_raw (dict[str, np.array]) – Dictionary mapping the sample names to the raw coordinates.

  • graph_list (list[str]) – The list of sample names.

  • rescale (bool, optional (default: False)) – If True, rescale the coordinates. If False, no rescaling is applied.

Returns:

  • dict[str, np.array] – Dictionary mapping the sample names to the rescaled coordinates.

  • float – The rescale factor for the second sample in graph_list (or 1 if rescale = False).

CAST_Stack.mesh_plot(mesh_t, coords_q_t, mesh_trans_t=None)

Plots the mesh and the query sample.

Parameters:
  • mesh_t (np.array) – The mesh (plotted in blue).

  • coords_q_t (np.array) – The coordinates of the query (plotted in blue).

  • mesh_trans_t (np.array, optional) – The transformed mesh (plotted in orange). If omitted, no transformed mesh is plotted.

CAST_Stack.plot_mid(coords_q, coords_r, output_path='', filename=None, title_t=['ref', 'query'], s_t=8, scale_bar_t=None)

Plots the coordinates of two samples in the same plot.

Parameters:
  • coords_q (np.array) – The query sample coordinates.

  • coords_r (np.array) – The reference sample coordinates.

  • output_path (str, optional) – The path to save the plot. The plot will only be saved if filename is provided.

  • filename (str, optional) – The name of the file to save the plot (without .pdf). If omitted, the plot will not be saved.

  • title_t (list[str], optional (default: ['ref','query'])) – The labels for the samples in the plot.

  • s_t (int, optional (default: 8)) – The size of the points in the plot.

  • scale_bar_t (list[float], optional) – The length (in data units) and label of the scale bar for the plot. If omitted, no scale bar is added.

CAST_Stack.corr_heat(coords_q, coords_r, corr, output_path, title_t=['Corr in ref', 'Anchor in query'], filename=None, scale_bar_t=None)

Plots 20 random points in the query sample and their correlated points in the reference sample.

Parameters:
  • coords_q (np.array) – The query sample coordinates.

  • coords_r (np.array) – The reference sample coordinates.

  • corr (np.array) – The correlation matrix between the query and reference sample.

  • output_path (str) – The path to save the plot. The plot will only be saved if filename is provided.

  • title_t (list[str], optional (default: ['Corr in ref','Anchor in query'])) – The labels for the two panes of each plot.

  • filename (str, optional) – The name of the file to save the plot. If omitted, the plot will not be saved.

  • scale_bar_t (list[float], optional) – The length (in data units) and label of the scale bar for the plot. If omitted, no scale bar is added.

CAST_Stack.prelocate_loss_plot(J_t, output_path, prefix='test')

Plots and saves the loss during prelocation, where the x-axis is the iteration number and the y-axis is the loss value.

Parameters:
  • J_t (list[float]) – The loss values for each iteration.

  • output_path (str) – The path to save the plot.

  • prefix (str, optional (default: 'test')) – The prefix for the file name. The name of the file will be <prefix>_prelocate_loss.pdf.

CAST_Stack.register_result(coords_q, coords_r, cov_anchor_t, bleeding, embed_stack, output_path, k=8, prefix='test', scale_t=1, index_list=None)

Plots and saves three figures for the registration results.

1: The query and reference coordinates on the same axes.

2: The query and reference coordinates colored by the cell type.

3: The query coordinates colored by similarity score (the cost function value) to the reference sample.

Parameters:
  • coords_q (np.array) – The query sample coordinates.

  • coords_r (np.array) – The reference sample coordinates.

  • cov_anchor_t (np.array) – The covariance matrix of the query and reference sample.

  • bleeding (int) – When the reference sample is larger than the query sample, for efficient computation, only the region of the reference sample within bleeding distance of the query sample will be considered when calculating the cost function.

  • embed_stack (2D array-like) – The embedding of the query and reference samples.

  • output_path (str) – The path to save the plots.

  • k (int, optional (default: 8)) – The number of clusters for the K-means clustering in plot 2.

  • prefix (str, optional (default: 'test')) – The prefix for the file names.

  • scale_t (float, optional (default: 1)) – The scale factor for the coordinates.

  • index_list (list[np.array], optional) – A mask indicating which cells to plot in the query and reference samples. If omitted, all cells will be used.

CAST_Stack.affine_reg_params(it_theta, similarity_score, iterations, output_path, prefix='test')

Plots the affine transformation parameters during the registration process.

Parameters:
  • it_theta (list[array-like]) – The affine transformation parameters for each iteration. Each entry is ordered as: scaling in x, scaling in y, rotation in degrees, translation in x, translation in y.

  • similarity_score (list[float]) – The similarity score for each iteration.

  • iterations (int) – The number of iterations (used only in the filename).

  • output_path (str) – The path to save the plot.

  • prefix (str (default: 'test')) – The prefix for the file name. The name of the file will be <prefix>_params_Affine_GD_<iterations>its.pdf.

CAST_Stack.CAST_STACK_rough(coords_raw_list, ifsquare=True, if_max_xy=True, percentile=None)

Roughly scales the coordinates using the range of the x and y coordinates. If ifsquare is True, the coordinates will be scaled by a uniform factor in the x and y directions (the maximum value of the range of x and y). Otherwise, the x and y coordinates will be scaled by their respective maximum values.

Parameters:
  • coords_raw_list (list[np.array]) – List of numpy arrays, where each array is the coordinates of a layer.

  • ifsquare (bool, optional (default: True)) – If True, the coordinates will be scaled by a uniform factor in the x and y directions (the maximum value of the range of x and y). Otherwise, the x and y coordinates will be scaled by their respective maximum values.

  • if_max_xy (bool, optional (default: True)) – If False, the coordinates will be divided by the maximum value of the range of x and y.

  • percentile (list[float] | float | None, optional (default: None)) – If not None, the min and max for calculating the range will be calculated based on the percentile of the coordinates for each slice (ignoring coordinates outside of the percentile values).

Returns:

List of numpy arrays, where each array is the scaled coordinates of a layer.

Return type:

list[np.array]

CAST_Stack.coords_minus_mean(coord_t)

Subtracts the mean from a set of coordinates.

Parameters:

coord_t (np.array) – The coordinates.

Returns:

The coordinates after subtracting the mean.

Return type:

np.array

CAST_Stack.coords_minus_min(coord_t)

Subtracts the min from a set of coordinates.

Parameters:

coord_t (np.array) – The coordinates.

Returns:

The coordinates after subtracting the min.

Return type:

np.array

CAST_Stack.max_minus_value(corr)

Inverts a correlation matrix by subtracting it from its maximum value.

Parameters:

corr (np.array) – The correlation matrix.

Returns:

The inverted correlation matrix.

Return type:

np.array
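
As a small illustration of this inversion, assuming the maximum is taken over the whole matrix rather than per row (the source does not say which), the operation amounts to the following; the example values are made up.

```python
import numpy as np

corr = np.array([[0.9, 0.1],
                 [0.4, 0.7]])
# The highest correlation maps to 0 and lower correlations map to larger values.
inverted = corr.max() - corr
```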

CAST_Stack.coords_minus_min_t(coord_t)

Subtracts the column-wise minimum value from a set of coordinates.

Parameters:

coord_t (torch.Tensor) – The coordinates.

Returns:

The coordinates after subtracting the column-wise minimum value.

Return type:

torch.Tensor

CAST_Stack.max_minus_value_t(corr)

Inverts a correlation matrix by subtracting it from its maximum value.

Parameters:

corr (torch.Tensor) – The correlation matrix.

Returns:

The inverted correlation matrix.

Return type:

torch.Tensor

CAST_Stack.corr_dist(query_np, ref_np, nan_as='min')

Calculates the correlation distance between two samples.

Parameters:
  • query_np (np.ndarray) – The query sample.

  • ref_np (np.ndarray) – The reference sample.

  • nan_as (str (default: 'min')) – If ‘min’, replace NaN values with the minimum value in the distance matrix.

Returns:

The correlation distance matrix between the query and reference samples.

Return type:

np.array
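
The sketch below illustrates one common definition of correlation distance (1 minus the Pearson correlation between rows) together with the NaN handling described for nan_as='min'. It is an assumption about the technique, not the library's implementation; corr_dist_sketch is a hypothetical name and the arrays are toy data.

```python
import numpy as np

def corr_dist_sketch(query, ref, nan_as="min"):
    """Pairwise correlation distance (1 - Pearson r) between rows of query and ref."""
    q = query - query.mean(axis=1, keepdims=True)
    r = ref - ref.mean(axis=1, keepdims=True)
    # Rows with zero variance have zero norm and produce NaN after division.
    with np.errstate(invalid="ignore", divide="ignore"):
        q = q / np.linalg.norm(q, axis=1, keepdims=True)
        r = r / np.linalg.norm(r, axis=1, keepdims=True)
    dist = 1.0 - q @ r.T
    if nan_as == "min":
        # Replace undefined distances with the smallest defined distance.
        dist[np.isnan(dist)] = np.nanmin(dist)
    return dist

query = np.array([[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]])  # second row has zero variance
ref = np.array([[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
dist = corr_dist_sketch(query, ref)
```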

CAST_Stack.region_detect(embed_dict_t, coords0, k=20)

Detects and plots the KMeans clustering results of the embeddings in the query sample.

Parameters:
  • embed_dict_t (2D array-like) – The embedding of the query sample.

  • coords0 (array-like) – The coordinates of the query sample.

  • k (int (default: 20)) – The number of regions to detect.

Returns:

The KMeans clustering labels for the query sample.

Return type:

np.array

CAST.CAST_Projection module

CAST_Projection.space_project(sdata_inte, idx_source, idx_target, raw_layer, source_sample, target_sample, coords_source, coords_target, output_path, source_sample_ctype_col, target_cell_pc_feature=None, source_cell_pc_feature=None, k2=1, ifplot=True, umap_feature='X_umap', ave_dist_fold=2, batch_t='', alignment_shift_adjustment=50, color_dict=None, adjust_shift=False, metric_t='cosine', working_memory_t=1000)

Projects the source cells to the target cells based on the k-nearest neighbors in the PCA space and physical distance.

Parameters:
  • sdata_inte (anndata) – The integrated anndata object.

  • idx_source (np.ndarray) – The indices of the source cells.

  • idx_target (np.ndarray) – The indices of the target cells.

  • raw_layer (str) – The layer name of the raw data.

  • source_sample (str) – The name of the source sample.

  • target_sample (str) – The name of the target sample.

  • coords_source (array-like) – The coordinates of the source cells.

  • coords_target (array-like) – The coordinates of the target cells.

  • output_path (str) – The path to save the output files.

  • source_sample_ctype_col (str | None) – The column name of the cell type annotation in the source sample. If None, the projection will be performed as a single sample without cell type annotation.

  • target_cell_pc_feature (array-like, optional) – The principal components of the target cells.

  • source_cell_pc_feature (array-like, optional) – The principal components of the source cells.

  • k2 (int, optional (default: 1)) – The number of nearest neighbors to consider.

  • ifplot (bool, optional (default: True)) – Whether to generate evaluation plots.

  • umap_feature (str, optional (default: 'X_umap')) – The column name in sdata_inte.obsm to use for the UMAP for visualization and saving.

  • ave_dist_fold (int, optional (default: 2)) – A multiplicative factor on the average distance to use for the physical distance threshold.

  • batch_t (str, optional (default: '')) – The batch name used in naming the output files.

  • alignment_shift_adjustment (int, optional (default: 50)) – An additive factor on the average distance for the physical distance threshold.

  • color_dict (dict, optional) – The color dictionary for visualizing the cell type annotations.

  • adjust_shift (bool, optional (default: False)) – Whether to shift the coordinates of the source cells by the median shift between the target and source cells for each cell type (ignored if source_sample_ctype_col is not given).

  • metric_t (str | callable, optional (default: 'cosine')) – The metric to use for the pairwise distance calculations. See sklearn.metrics.pairwise_distances_chunked for more information.

  • working_memory_t (int, optional (default: 1000)) – The sought maximum memory for the chunked pairwise distance calculations.

Returns:

  • anndata – The integrated anndata object with the raw and normalized projected data as layers.

  • list[np.ndarray, np.ndarray, np.ndarray, np.ndarray] – The indices of the k-nearest neighbors, their corresponding weights, cosine distances, and physical distances for each cell.

CAST_Projection.average_dist(coords, quantile_t=0.99, working_memory_t=1000, strategy_t='convex')

Finds the average distance between the cells after filtering for distances in the top quantile_t.

Parameters:
  • coords (array-like) – The coordinates of the cells.

  • quantile_t (float, optional (default: 0.99)) – The quantile used to filter the Delaunay graph edges. This is not applied if the number of cells is less than 5.

  • working_memory_t (int, optional (default: 1000)) – The sought maximum memory for the chunked pairwise distance calculations.

  • strategy_t ('convex' | 'delaunay', optional (default: 'convex')) –

    The strategy to use for generating the Delaunay graph.

    Convex will use Voronoi polygons clipped to the convex hull of the points and their rook spatial weights matrix (with libpysal).

    Delaunay will use the Delaunay triangulation (with scipy).

Returns:

The average distance, the quantile_t quantile of the edge distances, the edge distances, and the Delaunay graph. On a dataset of fewer than 5 cells, the average distance is calculated directly and the other return values are empty strings.

Return type:

float, float, np.ndarray, np.ndarray

CAST_Projection.group_shift(feat_target, feat_source, coords_target_t, coords_source_t, working_memory_t=1000, pencentile_t=0.8, metric_t='cosine')

Calculates the median shift between the target and source cells.

Parameters:
  • feat_target (array-like) – The target features, used to calculate the pairwise distance between target and source cells.

  • feat_source (array-like) – The source features, used to calculate the pairwise distance between target and source cells.

  • coords_target_t (np.ndarray) – The coordinates of the target cells.

  • coords_source_t (np.ndarray) – The coordinates of the source cells.

  • working_memory_t (int, optional (default: 1000)) – The sought maximum memory for the chunked pairwise distance calculations.

  • pencentile_t (float, optional (default: 0.8)) – The percentile of the pairwise distances to use as anchor points.

  • metric_t (str, optional (default: 'cosine')) – The metric to use for the pairwise distance calculations. See sklearn.metrics.pairwise_distances_chunked for more information.

Returns:

The median shift between the target and source cells.

Return type:

np.ndarray

CAST_Projection.physical_dist_priority_project(feat_target, feat_source, coords_target, coords_source, source_feat=None, k2=1, k_extend=20, pdist_thres=200, working_memory_t=1000, metric_t='cosine')

Gets the indices, weights, and distances of the k-nearest neighbors for each cell in the target space.

Parameters:
  • feat_target (array-like) – The target features, used to calculate the pairwise distance between target and source cells.

  • feat_source (array-like) – The source features, used to calculate the pairwise distance between target and source cells.

  • coords_target (array-like) – The coordinates of the target cells.

  • coords_source (array-like) – The coordinates of the source cells.

  • source_feat (scipy.sparse matrix, optional) – If provided, also returns the weighted average of these source features.

  • k2 (int, optional (default: 1)) – The number of nearest neighbors to consider for each cell.

  • k_extend (int, optional (default: 20)) – For points without k2 neighbors within the physical distance threshold, extend the search to k_extend neighbors.

  • pdist_thres (int, optional (default: 200)) – The physical distance threshold for the nearest neighbors search.

  • working_memory_t (int, optional (default: 1000)) – The sought maximum memory for the chunked pairwise distance calculations.

  • metric_t (str, optional (default: 'cosine')) – The metric to use for the pairwise distance calculations. See sklearn.metrics.pairwise_distances_chunked for more information.

Returns:

The indices of the k-nearest neighbors, their corresponding weights, cosine distances, and physical distances for each cell. If source_feat is provided, also returns the weighted average of the source features.

Return type:

np.ndarray, np.ndarray, np.ndarray, np.ndarray (, np.ndarray)

CAST_Projection.sparse_mask(idw_t, ind: ~numpy.ndarray, n_cols: int, dtype=<class 'numpy.float64'>)

Creates a CSR matrix from the given non-zero values and their corresponding indices.

Parameters:
  • idw_t (np.ndarray) – The non-zero values to set in the matrix.

  • ind (np.ndarray) – The indices of the non-zero values with shape (num data points, indices), in the format of the output of the numpy.argpartition function.

  • n_cols (int) – The number of columns in the output matrix.

  • dtype (type, optional (default: np.float64)) – The data type of the output matrix.

Returns:

The CSR matrix with the given non-zero values and indices.

Return type:

scipy.sparse.csr_matrix

CAST_Projection.cosine_IDW(cosine_dist_t, k2=5, eps=1e-06, need_filter=True, ifavg=False)

Compute the weights for the k-nearest neighbors of a target cell using the inverse distance weighting method or the average weight.

Parameters:
  • cosine_dist_t (np.ndarray) – The cosine distance between the target cell and its neighbors.

  • k2 (int, optional (default: 5)) – The number of nearest neighbors to consider.

  • eps (float, optional (default: 1e-6)) – A small constant to prevent dividing by zero.

  • need_filter (bool, optional (default: True)) – Whether to filter for only the k-nearest neighbors. If True, only consider the k2 nearest neighbors.

  • ifavg (bool, optional (default: False)) – Whether to use the average weight for all the neighbors. If False, use the IDW method.

Returns:

  • np.ndarray – The indices of the k2-nearest neighbors (in cosine_dist_t) if need_filter is True, otherwise 0.

  • np.ndarray – The cell weights as a 1D array. If ifavg is True, a uniform 1/k2 weight for k2 neighbors. Otherwise the IDW weights for the k2 neighbors if need_filter is True or for all cells if need_filter is False.

  • np.ndarray – The cosine distances for the k2-nearest neighbors if need_filter is True, otherwise cosine_dist_t.
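
The need_filter=True behaviour described above can be sketched as follows. This is a hedged illustration for a single target cell, assuming numpy.argpartition is used to pick the k2 smallest distances; cosine_idw_sketch is a hypothetical name and the real function may order or shape its outputs differently.

```python
import numpy as np

def cosine_idw_sketch(cosine_dist, k2=5, eps=1e-6, ifavg=False):
    """Pick the k2 smallest distances and weight them (need_filter=True case)."""
    cosine_dist = np.asarray(cosine_dist, dtype=float)
    # Indices of the k2 nearest neighbors; their order is not guaranteed.
    idx = np.argpartition(cosine_dist, k2 - 1)[:k2]
    nearest = cosine_dist[idx]
    if ifavg:
        weights = np.full(k2, 1.0 / k2)  # uniform weights
    else:
        inv = 1.0 / (nearest + eps)      # inverse distance weighting
        weights = inv / inv.sum()
    return idx, weights, nearest

dists = np.array([0.5, 0.1, 0.9, 0.2])
idx, weights, nearest = cosine_idw_sketch(dists, k2=2)
_, avg_weights, _ = cosine_idw_sketch(dists, k2=2, ifavg=True)
```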

CAST_Projection.IDW(df_value, eps=1e-06)

Calculates the normalized, reciprocal weights for an array.

Parameters:
  • df_value (np.ndarray) – The array to take the weights from.

  • eps (float, optional (default: 1e-6)) – A small constant to prevent dividing by zero.

Returns:

The normalized, reciprocal weights for each element.

Return type:

np.ndarray
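
A minimal sketch of normalized reciprocal weights for a 1D array is shown below. The name idw_sketch is hypothetical, and how the real function normalizes multi-dimensional input is not stated in the source, so only the 1D case is covered.

```python
import numpy as np

def idw_sketch(values, eps=1e-6):
    """Normalized reciprocal weights: w_i = (1 / (v_i + eps)) / sum_j (1 / (v_j + eps))."""
    inv = 1.0 / (np.asarray(values, dtype=float) + eps)
    return inv / inv.sum()

weights = idw_sketch([0.0, 0.1, 0.4])  # eps keeps the zero distance finite
```

Smaller input values receive larger weights, and the weights always sum to 1.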

CAST_Projection.evaluation_project(physical_dist, project_ind, coords_target, coords_source, y_true_t, y_pred_t, y_source, output_path, source_sample_ctype_col, umap_target=None, umap_source=None, source_sample=None, target_sample=None, cdists=None, batch_t='', exclude_group='Other', color_dict=None, umap_examples=False)

Generates and saves evaluation plots for the projection results, such as the physical distance and cosine distance histograms, the confusion matrix, a 3D link plot, and UMAP examples plot (see cdist_check).

Parameters:
  • physical_dist (array-like) – The physical distances between the target cells and their k-nearest neighbors.

  • project_ind (array-like) – The indices of the k-nearest neighbors for each cell in the target space.

  • coords_target (array-like) – The coordinates of the target cells.

  • coords_source (array-like) – The coordinates of the source cells.

  • y_true_t (np.ndarray) – The true cell type labels of the target cells.

  • y_pred_t (np.ndarray) – The predicted cell type labels of the target cells based on the projection results.

  • y_source (array-like) – The cell type labels of the source cells.

  • output_path (str) – The path to save the output files.

  • source_sample_ctype_col (str) – The column name of the cell type annotation in the source sample. If omitted, no confusion matrix will be plotted and visualizations won’t include cell type information.

  • umap_target (array-like, optional) – The UMAP coordinates of the target cells.

  • umap_source (array-like, optional) – The UMAP coordinates of the source cells.

  • source_sample (str, optional) – The name of the source sample, used as a label on the UMAP examples plot (if umap_examples is True).

  • target_sample (str, optional) – The name of the target sample, used as a label on the UMAP examples plot (if umap_examples is True).

  • cdists (array-like, optional) – The cosine distances between the target cells and their k-nearest neighbors.

  • batch_t (str, optional (default: '')) – The batch name for naming the output files.

  • exclude_group (str | None, optional (default: 'Other')) – The group to exclude from the confusion matrix. If None, exclude no groups.

  • color_dict (dict, optional) – The color dictionary for visualizing the cell type annotations (only applied if source_sample_ctype_col is given).

  • umap_examples (bool, optional (default: False)) – Whether to generate the UMAP examples plot.

CAST_Projection.cdist_hist(data_t, range_t=None, step=None)

Generates a histogram of the given data.

Parameters:
  • data_t (array-like) – The data to plot.

  • range_t (tuple, optional) – The range of the x-axis (inclusive on both sides). If omitted, the entire data range is included.

  • step (float, optional) – The step size of the x-axis. If omitted, the step size is automatically determined by matplotlib.pyplot.hist.
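One way to turn an inclusive range and a step size into histogram bin edges is sketched below. How the package builds its bins is not stated here, so the use of np.linspace and the helper name bin_edges_sketch are assumptions for illustration only.

```python
import numpy as np

def bin_edges_sketch(range_t, step):
    """Illustrative bin edges covering range_t inclusively in increments of step."""
    lo, hi = range_t
    # Number of bins implied by the step size; linspace avoids floating-point drift.
    n_bins = int(round((hi - lo) / step))
    return np.linspace(lo, hi, n_bins + 1)

edges = bin_edges_sketch((0.0, 1.0), 0.25)
# The same edges could be passed as bins= to matplotlib.pyplot.hist.
counts, _ = np.histogram([0.1, 0.3, 0.3, 1.0], bins=edges)
```

The last bin is closed on the right, so a value equal to the upper end of range_t is still counted.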

CAST_Projection.confusion_mat_plot(y_true_t, y_pred_t, filter_thres=None, withlabel=True, fig_x=60, fig_y=20)

Generates a confusion matrix plot.

Parameters:
  • y_true_t (np.ndarray) – The true cell type labels.

  • y_pred_t (np.ndarray) – The predicted cell type labels from the projection results.

  • filter_thres (int, optional) – A threshold to filter out cell types with low cell counts.

  • withlabel (bool, optional (default: True)) – Whether to include value labels on the diagonal of the confusion matrix.

  • fig_x (int, optional (default: 60)) – The width of the figure.

  • fig_y (int, optional (default: 20)) – The height of the figure.
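The counts behind a confusion matrix of true versus predicted labels can be sketched with NumPy alone. This shows the underlying tabulation only; it does not reproduce the plotting, filter_thres, or labeling behavior of confusion_mat_plot, and confusion_counts_sketch is a hypothetical name.

```python
import numpy as np

def confusion_counts_sketch(y_true, y_pred):
    """Illustrative tabulation: rows are true labels, columns are predicted labels."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {label: i for i, label in enumerate(labels)}
    mat = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        mat[index[t], index[p]] += 1
    return labels, mat

y_true = np.array(["A", "A", "B", "B", "B"])
y_pred = np.array(["A", "B", "B", "B", "A"])
labels, mat = confusion_counts_sketch(y_true, y_pred)
```

The diagonal holds the correctly projected cells for each cell type, and each row sums to the number of cells with that true label.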

CAST_Projection.cdist_check(cdist_t, cdist_idx, umap_coords0, umap_coords1, labels_t=['query', 'ref'], random_seed_t=2, figsize_t=[40, 32], output_path_t=None)

Generates the UMAP examples plot: for 20 randomly sampled points (one subplot each), highlights the target point and its nearest neighbor in the reference sample.

Parameters:
  • cdist_t (array-like) – The cosine distances between the target cells and their k-nearest neighbors (used to title the subplots with their distance values).

  • cdist_idx (array-like) – The indices of the k-nearest neighbors for each cell in the target space.

  • umap_coords0 (array-like) – The UMAP coordinates of the query cells.

  • umap_coords1 (array-like) – The UMAP coordinates of the reference cells.

  • labels_t (list[str], optional (default: ['query','ref'])) – The labels for the query and reference samples.

  • random_seed_t (int, optional (default: 2)) – The random seed used to sample the 20 random points.

  • figsize_t (list[float], optional (default: [40,32])) – The size of the figure.

  • output_path_t (str, optional) – The path to save the final plot. If omitted, the plot will not be saved.

Generates a 3D link plot for the projection results - the target cells are displayed in a plane above the source cells and sample_n links between corresponding target and source cells are drawn.

Parameters:
  • assign_mat (array-like) – The indices of the k-nearest neighbors for each cell in the target space.

  • coords_target (array-like) – The coordinates of the target cells.

  • coords_source (array-like) – The coordinates of the source cells.

  • k (int) – The number of nearest neighbors to consider. As of the current implementation, this must be 1.

  • figsize_t (list, optional (default: [15,20])) – The size of the figure.

  • sample_n (int, optional (default: 1000)) – The number of links to sample and display.

  • link_color_mask (np.ndarray, optional) – A boolean mask to color the links based on their corresponding values. If omitted, all links will be colored with color_true.

  • color_target (str, optional (default: "#9295CA")) – The color of the target cells (displayed in a plane above the source cells).

  • color_source (str, optional (default: '#E66665')) – The color of the source cells (displayed in a plane below the target cells).

  • color_true (str, optional (default: "#999999")) – The color of the links that are True in the link_color_mask. If link_color_mask is omitted, this color will be used for all links.

  • color_false (str, optional (default: "#999999")) – The color of the links that are False in the link_color_mask. If link_color_mask is omitted, this color will not be used.

  • remove_background (bool, optional (default: True)) – Whether to remove the axes of the plot.

CAST.utils module

utils.coords2adjacentmat(coords, output_mode='adjacent', strategy_t='convex')

Given a spatial data matrix, generates the Delaunay graph in the specified format.

Parameters:
  • coords (np.ndarray) – The spatial data matrix with the coordinate position for each cell.

  • output_mode ('adjacent' | 'raw' | 'adjacent_sparse', optional (default: 'adjacent')) –

    The output format of the Delaunay graph.

    If ‘adjacent’, the function will return the adjacency matrix.

    If ‘raw’, the function will return the raw Delaunay graph.

    If ‘adjacent_sparse’, the function will return the adjacency matrix in sparse format.

  • strategy_t ('convex' | 'delaunay', optional (default: 'convex')) –

    The strategy to use for generating the Delaunay graph.

    Convex will use Voronoi polygons clipped to the convex hull of the points and their rook spatial weights matrix (with libpysal).

    Delaunay will use the Delaunay triangulation (with SciPy).

Returns:

delaunay_graph – The Delaunay graph in the specified format.

Return type:

ndarray | nx.Graph | scipy.sparse.csr_matrix
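To illustrate how a triangulation maps to a dense adjacency matrix, the sketch below converts a list of triangles (vertex index triples) into a symmetric 0/1 matrix. In practice the triangles might come from scipy.spatial.Delaunay(coords).simplices; here they are written by hand so the example needs only NumPy. The helper name is hypothetical and this is not the package's implementation.

```python
import numpy as np

def triangles_to_adjacency_sketch(n_points, simplices):
    """Illustrative: mark every pair of vertices that shares a triangle edge."""
    adj = np.zeros((n_points, n_points), dtype=int)
    for a, b, c in simplices:
        for i, j in ((a, b), (b, c), (a, c)):
            adj[i, j] = 1
            adj[j, i] = 1  # the graph is undirected, so keep the matrix symmetric
    return adj

# A unit square split into two triangles along the 0-2 diagonal.
simplices = np.array([[0, 1, 2], [0, 2, 3]])
adj = triangles_to_adjacency_sketch(4, simplices)
```

Points 1 and 3 lie on opposite corners without a shared edge, so their entry stays 0.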

utils.hv_cutoff(max_col, threshold_cell_num=2000)

Returns the highest integer threshold such that at least threshold_cell_num cells have a max_col value greater than the threshold.

Parameters:
  • max_col (np.ndarray) – The data to threshold.

  • threshold_cell_num (int, optional (default: 2000)) – The minimum number of cells whose max_col value must exceed the threshold.

Returns:

thres_t – The highest integer threshold such that at least threshold_cell_num cells have a max_col value greater than the threshold.

Return type:

int
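The threshold search described above can be sketched as a simple scan over integer thresholds. The function name highest_threshold_sketch and the choice to return None when no threshold qualifies are assumptions for illustration; the package may handle that edge case differently.

```python
import numpy as np

def highest_threshold_sketch(values, min_count):
    """Illustrative: largest integer t with at least min_count values greater than t."""
    values = np.asarray(values)
    best = None
    for t in range(0, int(values.max()) + 1):
        if np.sum(values > t) >= min_count:
            best = t
        else:
            # The count can only shrink as t grows, so stop at the first failure.
            break
    return best

thres = highest_threshold_sketch([1, 2, 3, 4, 5, 6], 3)
```

With values 1 through 6 and a minimum count of 3, the result is 3, because three values (4, 5, 6) exceed 3 but only two exceed 4.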

utils.detect_highly_variable_genes(sdata, batch_key='batch', n_top_genes=4000, count_layer='count')

Finds the genes that have high expression in at least one cell in all batches.

Parameters:
  • sdata (AnnData) – Annotated data matrix.

  • batch_key (str, optional (default: 'batch')) – The column name of the samples in sdata.obs.

  • n_top_genes (int, optional (default: 4000)) – The number of genes to keep - the result uses the smallest integer threshold to get at least this many genes.

  • count_layer (str, optional (default: 'count')) – The layer in sdata.layers to use for the count data. If count_layer is ‘.X’, the function will use sdata.X for the count data.

Returns:

highly_variable_genes – A boolean array of which genes are highly variable across all batches.

Return type:

np.ndarray
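The idea of keeping genes that are highly expressed in at least one cell of every batch can be sketched as a per-batch maximum followed by a logical AND across batches. In this sketch the per-batch thresholds are supplied directly, which is an assumption; the description above indicates they are derived from n_top_genes. The function name is hypothetical.

```python
import numpy as np

def genes_high_in_all_batches_sketch(counts, batches, thresholds):
    """Illustrative: counts is cells x genes, batches labels each cell."""
    keep = np.ones(counts.shape[1], dtype=bool)
    for batch in np.unique(batches):
        # Highest count reached by each gene within this batch.
        max_per_gene = counts[batches == batch].max(axis=0)
        # A gene survives only if it passes the threshold in every batch.
        keep &= max_per_gene > thresholds[batch]
    return keep

counts = np.array([[5, 0, 1],
                   [0, 3, 0],
                   [4, 0, 6],
                   [1, 0, 2]])
batches = np.array(["s1", "s1", "s2", "s2"])
mask = genes_high_in_all_batches_sketch(counts, batches, {"s1": 2, "s2": 2})
```

Only the first gene exceeds the threshold in both batches, so it is the only one kept.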

utils.extract_coords_exp(sdata, batch_key='batch', cols='spatial', count_layer='count', data_format='norm1e4', ifcombat=False, if_inte=False)

Extracts the spatial data and gene expression data for each sample.

Parameters:
  • sdata (AnnData) – Annotated data matrix.

  • batch_key (str, optional (default: 'batch')) – The column name of the samples in sdata.obs.

  • cols ('spatial' | list, optional (default: 'spatial')) – The column name of the coordinates. If cols is ‘spatial’, the function will use sdata.obsm[‘spatial’] for the coordinates. Otherwise, the function will use sdata.obs[cols] for the coordinates.

  • count_layer (str, optional (default: 'count')) – The layer in sdata.layers to use for the count data. If count_layer is ‘.X’, the function will use sdata.X for the count data.

  • data_format (str, optional (default: 'norm1e4')) – The name of the layer in sdata.layers to use for the data.

  • ifcombat (bool, optional (default: False)) – Whether or not to use ComBat to correct for batch effects.

  • if_inte (bool, optional (default: False)) – Whether or not to perform Harmony integration.

Returns:

  • coords_raw (dict[str,np.ndarray]) – The spatial data matrix with the coordinate position of each cell, indexed by sample name.

  • exps (dict[str,np.ndarray]) – The gene expression data for each coordinate in coords_raw, indexed by sample name.
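The shape of the two returned dictionaries can be illustrated by splitting plain arrays on a batch label. This sketch leaves out AnnData, layer selection, ComBat, and Harmony entirely; split_by_sample_sketch is a hypothetical helper, not part of the package.

```python
import numpy as np

def split_by_sample_sketch(coords, exp, batch_labels):
    """Illustrative: group rows of coords and exp by their sample name."""
    coords_raw, exps = {}, {}
    for sample in np.unique(batch_labels):
        mask = batch_labels == sample
        coords_raw[str(sample)] = coords[mask]
        exps[str(sample)] = exp[mask]
    return coords_raw, exps

coords = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
exp = np.array([[1, 0], [0, 1], [2, 2]])
batch_labels = np.array(["a", "b", "a"])
coords_raw, exps = split_by_sample_sketch(coords, exp, batch_labels)
```

Each dictionary is keyed by sample name, and the rows of coords_raw[name] and exps[name] stay aligned cell by cell.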

utils.Harmony_integration(sdata_inte, scaled_layer, use_highly_variable_t, batch_key, umap_n_neighbors, umap_n_pcs, min_dist, spread_t, source_sample_ctype_col, output_path, n_components=50, ifplot=True, ifcombat=False)

Performs Harmony integration on the data, runs PCA and constructs a neighborhood graph and UMAP.

Parameters:
  • sdata_inte (AnnData) – Annotated data matrix.

  • scaled_layer (str) – The name of the layer in sdata.layers to use for the scaled data.

  • use_highly_variable_t (bool) – Whether or not to use highly variable genes for PCA. If true, the algorithm will use highly variable genes only, stored in sdata_inte.var[‘highly_variable’].

  • batch_key (str) – The column name of the samples in sdata.obs.

  • umap_n_neighbors (int) – The number of neighbors to use for the UMAP.

  • umap_n_pcs (int) – The number of PCs to use for the UMAP.

  • min_dist (float) – The minimum distance to use for the UMAP.

  • spread_t (int) – The spread to use for the UMAP.

  • source_sample_ctype_col (str) – The key in sdata.obs to split batches on.

  • output_path (str) – The path to save the output.

  • n_components (int, optional (default: 50)) – The number of components to use for the Harmony integration.

  • ifplot (bool, optional (default: True)) – Whether or not to plot the UMAP results.

  • ifcombat (bool, optional (default: False)) – Whether or not to initially run ComBat to correct for batch effects.

Returns:

sdata_inte – The annotated data matrix, now supplemented with the neighborhood graph and UMAP information.

Return type:

AnnData

utils.random_sample(coords_t, nodenum, seed_t=2)

Returns nodenum random indices from coords_t.

Parameters:
  • coords_t (ndarray | torch.Tensor) – The random indices will be sampled from the length of the first axis.

  • nodenum (int) – The number of random indices to return.

  • seed_t (int | None, optional (default: 2)) – The seed for the random package.

Returns:

sub_node_idx – 1D nodenum-length sorted array of random indices.

Return type:

np.ndarray
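A seeded draw of sorted, non-repeating row indices can be sketched with the standard random module. The helper below takes the number of rows directly instead of an array, and its name is hypothetical; it illustrates the documented behavior rather than reproducing the package code.

```python
import random
import numpy as np

def random_indices_sketch(n_rows, nodenum, seed_t=2):
    """Illustrative: nodenum distinct indices in [0, n_rows), sorted ascending."""
    rng = random.Random(seed_t)  # a local generator keeps the global state untouched
    return np.sort(rng.sample(range(n_rows), nodenum))

sub_node_idx = random_indices_sketch(10, 3)
```

Calling the function again with the same seed returns the same indices, which is what makes downstream subsampling reproducible.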

utils.sub_node_sum(coords_t, exp_t, nodenum=1000, vis=True, seed_t=2)

Randomly selects nodenum nodes from coords_t and returns the subnode expression matrix for the nearest neighbor of each chosen point.

Parameters:
  • coords_t (np.ndarray) – The spatial data matrix with the coordinate of each cell.

  • exp_t (np.ndarray) – The gene expression data for each coordinate in coords_t.

  • nodenum (int, optional (default: 1000)) – The number of nodes to return. If this is larger than the total number of nodes, the original data will be returned.

  • vis (bool, optional (default: True)) – Whether or not to visualize the subnodes and their neighbors before returning them.

  • seed_t (int, optional (default: 2)) – The seed for the random package (for reproducibility).

Returns:

  • exp_t_sub (ndarray) – The subnode expression matrix of the nearest neighbor of each chosen point.

  • sub_node_idx (ndarray) – The nodenum-length sorted array of indices chosen from coords_t.

utils.nearest_neighbors_idx(coord1, coord2, mode_t='knn')

Finds each coord2 point’s nearest neighbor in coord1.

Parameters:
  • coord1 (array-like) – The reference array of coordinates.

  • coord2 (array-like) – The query array of coordinates.

  • mode_t (str, optional (default: 'knn')) – The method used to find the nearest neighbor. If ‘knn’, a KNN classifier is used; otherwise, pairwise distances are used.

Returns:

close_idx – For each point in coord2, the index of its nearest neighbor in coord1.

Return type:

ndarray
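The pairwise-distance route can be sketched with NumPy broadcasting: compute every query-to-reference distance and take the argmin per query row. This is an illustration of that approach with a hypothetical helper name, not the package's code, and it does not cover the 'knn' mode.

```python
import numpy as np

def nearest_idx_sketch(coord1, coord2):
    """Illustrative: for each row of coord2, the index of the closest row in coord1."""
    coord1 = np.asarray(coord1, dtype=float)
    coord2 = np.asarray(coord2, dtype=float)
    # Shape (n_query, n_reference, n_dims) via broadcasting.
    diff = coord2[:, None, :] - coord1[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)  # squared distances preserve the ordering
    return dist2.argmin(axis=1)

reference = [[0, 0], [10, 0], [0, 10]]
query = [[1, 1], [9, 1]]
close_idx = nearest_idx_sketch(reference, query)
```

The dense distance matrix grows with the product of the two array lengths, so this form is practical only for moderately sized inputs.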

utils.non_zero_center_scale(sdata_t_X)

Scales the data by dividing each column by its standard deviation without centering (i.e. without subtracting the mean).

Parameters:

sdata_t_X (ndarray) – The data matrix to scale.

Returns:

sdata_t_X – The scaled data matrix.

Return type:

ndarray
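Scaling without centering can be sketched as a column-wise division by the standard deviation. Using the population standard deviation (ddof=0) and leaving zero-variance columns unchanged are assumptions made for this illustration; the helper name is hypothetical.

```python
import numpy as np

def scale_without_centering_sketch(x):
    """Illustrative: divide each column by its standard deviation, no mean subtraction."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)   # population standard deviation per column (assumed)
    std[std == 0] = 1.0   # avoid dividing constant columns by zero (assumed)
    return x / std

x = np.array([[1.0, 2.0],
              [5.0, 2.0]])
scaled = scale_without_centering_sketch(x)
```

Because the mean is not subtracted, zeros in the input remain zeros in the output.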

utils.sub_data_extract(sample_list, coords_raw, exps, nodenum_t=20000, if_non_zero_center_scale=True)

For each sample in sample_list, extracts nodenum_t random nodes and returns their coordinates and expression matrix.

Parameters:
  • sample_list (list[str]) – The list of samples to extract sub-data from.

  • coords_raw (dict[str,np.ndarray]) – The spatial data matrix with the cell coordinate positions, indexed by sample.

  • exps (dict[str,np.ndarray]) – The gene expression data for each coordinate in coords_raw.

  • nodenum_t (int, optional (default: 20000)) – The number of nodes to return.

  • if_non_zero_center_scale (bool, optional (default: True)) – Whether or not to scale the expression matrix by dividing each column by its standard deviation without centering.

Returns:

  • coords_sub (dict[str,np.ndarray]) – The sub-node coordinates for each sample in sample_list.

  • exp_sub (dict[str,np.ndarray]) – The sub-node expression matrix for each sample in sample_list.

  • sub_node_idxs (dict[str,np.ndarray]) – The sub-node indices for each sample in sample_list.

utils.preprocess_fast(sdata1, mode='customized', target_sum=10000.0, base=2, zero_center=True, regressout=False)

Converts the data to a CSR matrix and preprocesses it in multiple ways: total counts, log transform, scale, and regress out (if regressout is True).

Parameters:
  • sdata1 (AnnData) – Annotated data matrix.

  • mode ('customized' | 'default' , optional (default: 'customized')) –

    The mode of preprocessing. If ‘default’, the function will use the default preprocessing parameters, regardless of any other parameters passed in.

    If ‘customized’, the user can specify the target sum, base, zero center, and regressout parameters.

  • target_sum (float, optional (default: 1e4)) – The target sum for the normalization (if ‘mode’ is ‘customized’).

  • base (int, optional (default: 2)) – The base for the log transformation (if ‘mode’ is ‘customized’).

  • zero_center (bool, optional (default: True)) – Whether or not to zero center the data.

  • regressout (bool, optional (default: False)) – Whether or not to use scanpy.pp.regress_out on the total counts.

Returns:

sdata1 – The data matrix with the preprocessing layers added in the ‘layers’ attribute.

Return type:

AnnData
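The first two steps, total-count normalization and log transformation, can be sketched on a dense array as follows. The log(1 + x) form and the handling of empty cells are assumptions for illustration, the scaling and regress-out steps are omitted, and normalize_log_sketch is a hypothetical name.

```python
import numpy as np

def normalize_log_sketch(counts, target_sum=1e4, base=2):
    """Illustrative: scale each cell to target_sum total counts, then log-transform."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # leave empty cells as all zeros (assumed)
    norm = counts / totals * target_sum
    # log(1 + x) in the requested base (assumed form of the log transform).
    return np.log1p(norm) / np.log(base)

out = normalize_log_sketch([[1, 3], [0, 0]], target_sum=4, base=2)
```

With target_sum=4 the first cell is already normalized, so its values become log2(1 + 1) = 1 and log2(1 + 3) = 2.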

utils.cell_select(coords_t, s=0.5, c=None, output_path_t=None)

Displays an interactive ipywidget to select cells by drawing a polygon on the plot.

Click the “Finish Polygon” button to finish drawing the polygon.

Click the “Clear Polygon” button to clear the polygon.

Parameters:
  • coords_t (ndarray) – The spatial data matrix with the coordinate position for each cell.

  • s (float, optional (default: 0.5)) – The size of the scatter plot points.

  • c (str, optional (default: None)) – The color of the scatter plot points.

  • output_path_t (str, optional (default: None)) – The path to save the output plot.

utils.get_neighborhood_rad(coords_centroids, coords_candidate, radius_px, dist=None)

utils.delta_cell_cal(coords_tgt, coords_ref, ctype_tgt, ctype_ref, radius_px)

Calculates the delta cell counts between target cells and reference cells based on their coordinates and cell types.

Parameters:
  • coords_tgt (np.array) – The coordinates of niche centroids (target cells).

  • coords_ref (np.array) – The coordinates of reference cells.

  • ctype_tgt (np.array) – The cell types of niche centroids.

  • ctype_ref (np.array) – The cell types of reference cells.

  • radius_px (float) – The radius of the neighborhood.

Returns:

  • df_delta_cell_tgt (pandas.DataFrame) – The raw cell type counts of target cells given niche centroids.

  • df_delta_cell_ref (pandas.DataFrame) – The raw cell type counts of reference cells given niche centroids.

  • df_delta_cell (pandas.DataFrame) – The delta cell counts (delta_cell_tgt - delta_cell_ref).

Examples

  • coords_tgt = coords_final['injured']

  • coords_ref = coords_final['normal']

  • ctype_tgt = sdata.obs['Annotation'][right_idx]

  • ctype_ref = sdata.obs['Annotation'][left_idx]

  • radius_px = 1000

  • df_delta_cell_tgt, df_delta_cell_ref, df_delta_cell = delta_cell_cal(coords_tgt, coords_ref, ctype_tgt, ctype_ref, radius_px)
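The computation behind the three returned tables can be sketched with NumPy: for every niche centroid, count the cells of each type within the radius in the target and in the reference sample, then subtract. Plain arrays are used instead of DataFrames, and the inclusive radius test, the centroid counting itself among the target cells, and the helper name are all assumptions for illustration.

```python
import numpy as np

def counts_within_radius_sketch(centroids, candidates, ctypes, radius, categories):
    """Illustrative: per-centroid counts of each cell type among nearby candidates."""
    centroids = np.asarray(centroids, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    ctypes = np.asarray(ctypes)
    # Distance from every centroid to every candidate cell.
    dist = np.linalg.norm(centroids[:, None, :] - candidates[None, :, :], axis=2)
    within = dist <= radius
    # One column per category, one row per centroid.
    return np.stack([(within & (ctypes == c)).sum(axis=1) for c in categories], axis=1)

coords_tgt = [[0, 0], [10, 0]]
ctype_tgt = ["A", "B"]
coords_ref = [[0, 1], [1, 0], [10, 1]]
ctype_ref = ["A", "B", "B"]
categories = ["A", "B"]

counts_tgt = counts_within_radius_sketch(coords_tgt, coords_tgt, ctype_tgt, 2, categories)
counts_ref = counts_within_radius_sketch(coords_tgt, coords_ref, ctype_ref, 2, categories)
delta = counts_tgt - counts_ref
```

A negative entry in delta means that the reference sample has more cells of that type around the niche centroid than the target sample does.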

utils.delta_exp_cal(coords_tgt, coords_ref, exp_tgt, exp_ref, radius_px, valid_tgt_idx=None, valid_ref_idx=None)

Calculates the delta gene expression between target cells and reference cells based on their coordinates and gene expression.

Parameters:
  • coords_tgt (np.array) – The coordinates of niche centroids (target cells).

  • coords_ref (np.array) – The coordinates of reference cells.

  • exp_tgt (np.array) – The gene expression of target cells.

  • exp_ref (np.array) – The gene expression of reference cells.

  • radius_px (float) – The radius of the neighborhood.

Returns:

  • df_delta_exp_tgt (np.array)

  • delta_exp_ref (np.array)

  • delta_exp (np.array)

utils.delta_exp_sigplot(p_values, avg_differences, abs_10logp_cutoff=None, abs_avg_diff_cutoff=None, sig=True)
utils.delta_exp_statistics(delta_exp_tgt, delta_exp_ref)

CAST.visualize module

visualize.kmeans_plot_multiple(embed_dict_t, graph_list, coords, taskname_t, output_path_t, k=20, dot_size=10, scale_bar_t=None, minibatch=True, plot_strategy='sep', axis_off=False)

Plots and returns the KMeans clustering results for multiple samples.

Parameters:
  • embed_dict_t (dict[str,torch.Tensor]) – A dictionary mapping the sample names to their embeddings (such as the output of CAST Mark).

  • graph_list (list[str]) – A list of the sample names.

  • coords (dict[str,array-like]) – A dictionary mapping the sample names to their coordinates.

  • taskname_t (str) – The name of the task for the output file name. The file will be named “{taskname}_trained_k{k}.pdf” where k is the number of clusters.

  • output_path_t (str) – The path to save the plot.

  • k (int (default = 20)) – The number of clusters for KMeans.

  • dot_size (int (default = 10)) – The size of the points in the plot.

  • scale_bar_t (list[float,str], optional) – The length (in data units) and label of the scale bar for the plot. If omitted, no scale bar is added.

  • minibatch (bool (default = True)) – Whether to use MiniBatchKMeans for clustering.

  • plot_strategy (str (default = 'sep')) – The strategy to plot the clustering results. If ‘sep’, each sample is plotted in a separate subplot. Else, all samples are plotted in the same subplot.

  • axis_off (bool (default = False)) – Whether to turn off the axis in the plot(s).

Returns:

cell_label – The labels of the cells in the KMeans clustering.

Return type:

np.array

visualize.add_scale_bar(length_t, label_t)

Adds a scale bar to the current matplotlib plot.

Parameters:
  • length_t (float) – The length of the scale bar in data coordinates.

  • label_t (str) – The label for the scale bar.

visualize.plot_mid_v2(coords_q, coords_r=None, output_path='', filename=None, title_t=['ref', 'query'], s_t=8, scale_bar_t=None)

Plots the coordinates of one or two samples on the same axes.

Parameters:
  • coords_q (np.array) – The coordinates of the query sample.

  • coords_r (np.array, optional) – The coordinates of the reference sample. If omitted, only the query sample is plotted.

  • output_path (str) – The path to save the plot (if filename is included).

  • filename (str, optional) – The name of the output file. If omitted, the plot is not saved.

  • title_t (list[str] (default = ['ref','query'])) – The labels for the query (and reference) samples.

  • s_t (int (default = 8)) – The size of the points in the plot.

  • scale_bar_t (list[float, str], optional) – The length (in data units) and label of the plot’s scale bar. If omitted, no scale bar is added.

visualize.plot_mid(coords_q, coords_r, output_path='', filename=None, title_t=['ref', 'query'], s_t=8, scale_bar_t=None, axis_off=False)

Plots the coordinates of two samples on the same axes.

Parameters:
  • coords_q (np.array) – The coordinates of the query sample.

  • coords_r (np.array) – The coordinates of the reference sample.

  • output_path (str) – The path to save the plot (if filename is included).

  • filename (str) – The name of the output file. If omitted, the plot is not saved.

  • title_t (list[str] (default = ['ref','query'])) – The labels for the query and reference samples.

  • s_t (int (default = 8)) – The size of the points in the plot.

  • scale_bar_t (list[float, str], optional) – The length (in data units) and label of the plot’s scale bar. If omitted, no scale bar is added.

  • axis_off (bool (default = False)) – Whether to turn off the axis in the plot.

Plots links between the query cells and their k nearest neighbors in the reference sample.

Parameters:
  • all_cosine_knn_inds_t (np.array) – The indices of each query cell’s k nearest neighbors in the reference sample.

  • coords_q (np.array) – The coordinates of the query cells.

  • coords_r (np.array) – The coordinates of the reference cells.

  • k (int) – The number of nearest neighbors to visualize.

  • figsize_t (list[float], optional (default = [15,20])) – The size of the plot.

  • scale_bar_t (list[float,str], optional) – The length (in data units) and label of the plot’s scale bar. If omitted, no scale bar is added.

visualize.dsplot(coords0, coords_plaque_t, s_cell=10, s_plaque=40, col_cell='#999999', col_plaque='red', cmap_t='vlag', alpha=1, vmax_t=None, title=None, scale_bar_200=None, output_path_t=None, coords0_mask=None)