Usage
=====

Import the package
-----------------

.. code-block:: python

    from Harmonics import *


Initialize the model
-------------------

.. code-block:: python

    model = Harmonics_Model(
        adata_list,
        slice_name_list,
        cond_list=cond_list,
        cond_name_list=cond_name_list,
        concat_label='slice_name',
        proportion_label=None,
        refine_k=0,
        seed=1234, 
        parallel=True,
        verbose=True,
    )

**Parameters**

- **adata_list**: list of `anndata` objects. Whole dataset for condition-agnostic analysis, control group for case-control study, or reference data for label transfer.  

- **slice_name_list**: list of strings, names of slices corresponding to `adata_list`.  

- **cond_list**: Default: `None`. list of `anndata` objects or `None`. `None` for condition-agnostic analysis, case group for case-control study, or query data for label transfer.

- **cond_name_list**: Default: `None`. list of strings or `None`, names of slices corresponding to `cond_list`.  

- **concat_label**: Default: `'slice_name'`. string, key in `.obs` to store slice names.  

- **proportion_label**: Default: `None`. string or `None`, key in `.obsm` storing cell type proportions for low-resolution data.  

- **refine_k**: Default: `0`. int, number of cell types with highest proportion used for niche refinement; set `0` to skip refinement.  

- **seed**: Default: `1234`. int, random seed.  

- **parallel**: Default: `True`. bool, whether to run computations in parallel.  

- **verbose**: Default: `True`. bool, whether to print progress messages.  


Construct cell representations
------------------------------

.. code-block:: python

    model.preprocess(
        ct_key='celltype',
        spatial_key='spatial',
        method='joint',
        n_step=3, 
        n_neighbors=20,
        radius='auto',
        cut_percentage=99,
        kernel=None,
    )

**Parameters**

- **ct_key**: Default: `'celltype'`. string, key in `.obs` storing the cell type information.  

- **spatial_key**: Default: `'spatial'`. string, key in `.obsm` storing the spatial coordinates.  

- **method**: Default: `'joint'`. string or `None`, method for graph construction. Options:  

  - ``'joint'``: n-step hop Delaunay triangulation with graph completion to at least `n_neighbors` per cell.  

  - ``'delaunay'``: n-step hop Delaunay triangulation.  

  - ``'knn'``: connect `n_neighbors` neighbors per cell.  

  - ``'radius'``: connect all cells within a specified radius. If `radius='auto'`, the radius is set to the median distance to the `n_neighbors`-th nearest neighbor across cells.  

  - ``None``: directly use cell type composition (for low-resolution data).  

- **n_step**: Default: `3`. int, number of steps for n-step Delaunay triangulation.  

- **n_neighbors**: Default: `20`. int, minimum number of neighbors per cell when method is `'joint'` or `'knn'`.  

- **radius**: Default: `'auto'`. float or `'auto'`, radius used when `method='radius'`. If set to `'auto'`, the radius is automatically determined as the median distance to the `n_neighbors`-th nearest neighbor.  

- **cut_percentage**: Default: `99`. int, percentage of shortest edges to keep in the Delaunay triangulation adjacency graph. 

- **kernel**: Default: `None`. string or `None`, optional weighting scheme used when computing microenvironment cell type distributions. Options:  

  - ``None``: use unweighted averaging over neighboring cells.  

  - ``'gaussian'``: use an adaptive Gaussian kernel to assign distance-dependent weights to neighbors before averaging.  


Over-clustering initialization (whole dataset / control group / reference)
---------------------------------------------------------------------------

.. code-block:: python

    model.initialize_clusters(
        dim_reduction=True,
        explained_var=None,
        n_components=None,
        n_components_max=100,
        standardize=True,
        method='kmeans',
        Qmax=20,
    )

**Parameters**

- **dim_reduction**: Default: `True`. bool, whether to perform dimensionality reduction (PCA) before clustering.  

- **explained_var**: Default: `None`. float or `None`, target cumulative explained variance for dimensionality reduction.  

- **n_components**: Default: `None`. int or `None`, number of components to retain after dimensionality reduction. If `None`, retain no more than `n_components_max`.  

- **n_components_max**: Default: `100`. int, maximum number of components allowed during reduction.  

- **standardize**: Default: `True`. bool, whether to z-score normalize each feature before dimensionality reduction.  

- **method**: Default: `'kmeans'`. string, clustering method for initialization. Options: `'kmeans'`, `'gmm'`.  

- **Qmax**: Default: `20`. int, number of clusters for initialization.  


Perform HDM to find solution
----------------------------

.. code-block:: python

    model.hier_dist_match(
        assign_metric='jsd',
        weighted_merge=True,
        max_iters=100,
        tol=1e-4,
        Qmin=2,
    )

**Parameters**

- **assign_metric**: Default: `'jsd'`. string, metric used to evaluate distribution similarity between niches.  

- **weighted_merge**: Default: `True`. bool, whether to use weighted JSD (WJSD) during merging phase.  

- **max_iters**: Default: `100`. int, maximum number of iterations for convergence.  

- **tol**: Default: `1e-4`. float, tolerance for convergence.  

- **Qmin**: Default: `2`. int, minimum number of niches to consider.  


Select the solution
-------------------

.. code-block:: python

    adata_list, adata_concat = model.select_solution(
        n_niche=None,
        niche_key='niche_label',
        auto=True,
        metric='jsd_v2',
        threshold=0.1,
        return_adata=True,
        plot=True,
        save=False,
        fig_size=(10, 6),
        save_dir=None,
        file_name='score_vs_nichecount_basic.pdf',
    )

**Parameters**

- **n_niche**: Default: `None`. int or `None`, number of niches to select. If `None`, solution is selected automatically using `metric`.  

- **niche_key**: Default: `'niche_label'`. string, key in `.obs` to store niche assignment results.  

- **auto**: Default: `True`. bool, whether to automatically determine the solution if `n_niche=None`.  

- **metric**: Default: `'jsd_v2'`. string, metric used for solution selection. Supported options are:  

  - ``'jsd'``: minimum pairwise Jensen-Shannon divergence between niches.  

  - ``'wjsd'``: weighted minimum pairwise Jensen-Shannon divergence between niches.  

  - ``'jsd_v2'``: bootstrap-based minimum Jensen-Shannon divergence with confidence intervals.   

- **threshold**: Default: `0.1`. float, threshold for selecting solution based on `metric`.  

- **return_adata**: Default: `True`. bool, whether to return an `anndata` object with niche assignments.  

- **plot**: Default: `True`. bool, whether to plot the minJSD curve.  

- **save**: Default: `False`. bool, whether to save the minJSD plot.  

- **fig_size**: Default: `(10, 6)`. tuple, figure size for plotting.  

- **save_dir**: Default: `None`. string or `None`, directory to save the plot.  

- **file_name**: Default: `'score_vs_nichecount_basic.pdf'`. string, name of the saved plot file.  


Over-clustering initialization (case group)
-------------------------------------------

.. code-block:: python

    model.initialize_clusters_cond(
        assign_metric='jsd',
        threshold=0.1,
        min_cell_per_niche=100,
        dim_reduction=True,
        explained_var=None,
        n_components=None,
        n_components_max=100,
        standardize=True,
        method='kmeans',
        Rmax=10,
    )

**Parameters**

- **assign_metric**: Default: `'jsd'`. string, metric used for evaluating distribution similarity when assigning cells to BCNs.  

- **threshold**: Default: `0.1`. float, minimum divergence threshold below which cells are assigned to BCNs.  

- **min_cell_per_niche**: Default: `100`. int, minimum average number of cells per new niche.  

- **dim_reduction**: Default: `True`. bool, whether to perform dimensionality reduction (PCA) before clustering.  

- **explained_var**: Default: `None`. float or `None`, target cumulative explained variance for dimensionality reduction.  

- **n_components**: Default: `None`. int or `None`, number of components to retain after dimensionality reduction. If `None`, retain no more than `n_components_max`.  

- **n_components_max**: Default: `100`. int, maximum number of components allowed during reduction.  

- **standardize**: Default: `True`. bool, whether to z-score normalize each feature before dimensionality reduction.  

- **method**: Default: `'kmeans'`. string, clustering method for initialization.  

- **Rmax**: Default: `10`. int, number of clusters for initialization.  


Perform HDM to find solution (case group)
-----------------------------------------

.. code-block:: python

    model.hier_dist_match_cond(
        assign_metric='jsd',
        weighted_merge=True,
        max_iters=100,
        tol=1e-4,
    )

**Parameters**

- **assign_metric**: Default: `'jsd'`. string, metric used to evaluate distribution similarity between niches.  

- **weighted_merge**: Default: `True`. bool, whether to use weighted JSD (WJSD) during merging phase.  

- **max_iters**: Default: `100`. int, maximum number of iterations for convergence.  

- **tol**: Default: `1e-4`. float, tolerance for convergence.  


Select the solution (case group)
--------------------------------

.. code-block:: python

    cond_list, cond_concat = model.select_solution_cond(
        n_csn=None,
        niche_key='niche_label',
        csn_key='csn_label',
        auto=True,
        metric='jsd_v2',
        threshold=0.1,
        return_adata=True,
        plot=True,
        save=False,
        fig_size=(10, 6),
        save_dir=None,
        file_name='score_vs_nichecount_cond.pdf',
    )

**Parameters**

- **n_csn**: Default: `None`. int or `None`, number of CSNs to select. If `None`, solution is selected automatically using `metric`.  

- **niche_key**: Default: `'niche_label'`. string, key in `.obs` to store niche assignment results.  

- **csn_key**: Default: `'csn_label'`. string, key in `.obs` to store CSN assignment results.  

- **auto**: Default: `True`. bool, whether to automatically determine the solution if `n_csn=None`.  

- **metric**: Default: `'jsd_v2'`. string, metric used for solution selection. Supported options are:  

  - ``'jsd'``: minimum pairwise Jensen-Shannon divergence between niches.  

  - ``'wjsd'``: weighted minimum pairwise Jensen-Shannon divergence between niches.  

  - ``'jsd_v2'``: bootstrap-based minimum Jensen-Shannon divergence with confidence intervals.   

- **threshold**: Default: `0.1`. float, threshold for selecting solution based on `metric`.  

- **return_adata**: Default: `True`. bool, whether to return an `anndata` object with CSN assignments.  

- **plot**: Default: `True`. bool, whether to plot the minJSD curve.  

- **save**: Default: `False`. bool, whether to save the minJSD plot.  

- **fig_size**: Default: `(10, 6)`. tuple, figure size for plotting.  

- **save_dir**: Default: `None`. string or `None`, directory to save the plot.  

- **file_name**: Default: `'score_vs_nichecount_cond.pdf'`. string, name of the saved plot file.  


Label transfer
--------------

.. code-block:: python

    trans_list, trans_concat = model.label_transfer(
        assign_metric='jsd',
        niche_key='niche_label',
        return_adata=True,
    )

**Parameters**

- **assign_metric**: Default: `'jsd'`. string, metric used to evaluate distribution similarity when assigning cells to niches.  

- **niche_key**: Default: `'niche_label'`. string, key in `.obs` to store niche assignment results.  

- **return_adata**: Default: `True`. bool, whether to return an `anndata` object with niche assignments.  


Cell type enrichment test
-------------------------

.. code-block:: python

    ct_results = ct_enrichment_test(
        niche_dist,
        cell_count_niche,
        idx2ct_dict,
        niche_summary,
        method='fisher',
        alpha=0.05,
        fdr_method='fdr_by',
        log2fc_threshold=1,
        prop_threshold=0.01,
        verbose=True,
        eps=1e-10,
    )

**Parameters**

- **niche_dist**: array-like or sparse matrix, shape `(n_niche, n_celltype)`, representing the proportion of each cell type in each niche.  

- **cell_count_niche**: array-like, shape `(n_niche,)`, number of cells in each niche.  

- **idx2ct_dict**: dict, mapping from cell type indices to cell type names.  

- **niche_summary**: list of strings, names or labels for each niche.  

- **method**: Default: `'fisher'`. string, statistical test method. Options:  

  - ``'fisher'``: two-sided Fisher's exact test.  

  - ``'fisher_greater'``: one-sided Fisher's exact test (greater).  

  - ``'chi2'``: chi-square test.  

- **alpha**: Default: `0.05`. float, significance level for multiple testing correction.  

- **fdr_method**: Default: `'fdr_by'`. string, method for false discovery rate correction.  

- **log2fc_threshold**: Default: `1`. float, minimum log2 fold-change required for enrichment.  

- **prop_threshold**: Default: `0.01`. float, minimum proportion of cell type in niche required for enrichment.  

- **verbose**: Default: `True`. bool, whether to print progress messages.  

- **eps**: Default: `1e-10`. float, small value to avoid division by zero.  

**Returns**

- **ct_results**: `pandas.DataFrame` containing enrichment results with columns:  

  - ``niche_idx``: index of the niche  

  - ``niche``: name of the niche  

  - ``celltype_idx``: index of the cell type  

  - ``celltype``: name of the cell type  

  - ``oddsratio`` or ``chi2_stat``: test statistic  

  - ``p-value``: raw p-value  

  - ``q-value``: FDR-corrected p-value  

  - ``log2fc``: log2 fold-change  

  - ``prop``: proportion of cell type in niche  

  - ``enrichment``: bool, whether cell type is significantly enriched.


Cell-cell interaction enrichment test
-------------------------------------------

.. code-block:: python

    cci_results, test_norm_list, bg_norm_list, test_edge_count_list, bg_edge_count_list = cci_enrichment_test(
        adata_list,
        niche_key,
        ct_key,
        niche_summary=None,
        spatial_key='spatial',
        cut_percentage=99,
        method='fisher',
        alpha=0.05,
        fdr_method='fdr_by',
        log2fc_threshold=1,
        prop_threshold=0.01,
        verbose=True,
        eps=1e-10,
    )

**Parameters**

- **adata_list**: `anndata` object or list of `anndata` objects, input datasets to test CCI enrichment.  

- **niche_key**: string, key in `.obs` representing niche labels.  

- **ct_key**: string, key in `.obs` representing cell type labels.  

- **niche_summary**: Default: `None`. list of niche names to test; if `None`, all unique niches in `adata_list` are used.  

- **spatial_key**: Default: `'spatial'`. string, key in `.obsm` representing spatial coordinates.  

- **cut_percentage**: Default: `99`. float, percentage of shortest edges to retain in Delaunay adjacency graph.  

- **method**: Default: `'fisher'`. string, statistical test method. Options:  

  - ``'fisher'``: two-sided Fisher's exact test.  

  - ``'fisher_greater'``: one-sided Fisher's exact test (greater).  

- **alpha**: Default: `0.05`. float, significance level for multiple testing correction.  

- **fdr_method**: Default: `'fdr_by'`. string, method for false discovery rate correction.  

- **log2fc_threshold**: Default: `1`. float, minimum log2 fold-change required for enrichment.  

- **prop_threshold**: Default: `0.01`. float, minimum proportion of cell type pairs in niche required for enrichment.  

- **verbose**: Default: `True`. bool, whether to print progress messages.  

- **eps**: Default: `1e-10`. float, small value to avoid division by zero.  

**Returns**

- **cci_results**: `pandas.DataFrame` containing CCI enrichment results with columns:  

  - ``niche_idx``: index of the niche  

  - ``niche``: name of the niche  

  - ``ct1_idx``, ``ct2_idx``: indices of interacting cell types  

  - ``ct1``, ``ct2``: names of interacting cell types  

  - ``test_edge_count``, ``bg_edge_count``: number of observed and background edges  

  - ``test_edge_prop``, ``bg_edge_prop``: proportion of observed and background edges  

  - ``oddsratio``: odds ratio from statistical test  

  - ``p-value``: raw p-value  

  - ``q-value``: FDR-corrected p-value  

  - ``log2fc``: log2 fold-change  

  - ``enrichment``: bool, whether interaction is significantly enriched  

- **test_norm_list**: list of normalized test adjacency matrices for each niche.  

- **bg_norm_list**: list of normalized background adjacency matrices for each niche.  

- **test_edge_count_list**: list of total test edges per niche.  

- **bg_edge_count_list**: list of total background edges per niche.  


Niche-niche colocalization enrichment test
---------------------------------

.. code-block:: python

    df_results, edge_prop_mtx, n1_count = nnc_enrichment_test(
        adata_list,
        niche_key,
        niche_summary=None,
        spatial_key='spatial',
        cut_percentage=99,
        method='fisher',
        alpha=0.05,
        fdr_method='fdr_by',
        log2fc_threshold=1,
        prop_threshold=0.01,
        verbose=True,
        eps=1e-10,
    )

**Parameters**

- **adata_list**: `anndata` object or list of `anndata` objects, input datasets for NNC enrichment testing.  

- **niche_key**: string, key in `.obs` representing niche labels.  

- **niche_summary**: Default: `None`. list of niche names to test; if `None`, all unique niches in `adata_list` are used.  

- **spatial_key**: Default: `'spatial'`. string, key in `.obsm` representing spatial coordinates.  

- **cut_percentage**: Default: `99`. float, percentage of shortest edges to retain in Delaunay adjacency graph.  

- **method**: Default: `'fisher'`. string, statistical test method. Options:  

  - ``'fisher'``: two-sided Fisher's exact test.  

  - ``'fisher_greater'``: one-sided Fisher's exact test (greater).  

  - ``'chi2'``: chi-square test with continuity correction.  

- **alpha**: Default: `0.05`. float, significance level for multiple testing correction.  

- **fdr_method**: Default: `'fdr_by'`. string, method for false discovery rate correction.  

- **log2fc_threshold**: Default: `1`. float, minimum log2 fold-change required for enrichment.  

- **prop_threshold**: Default: `0.01`. float, minimum proportion of edges between niches required for enrichment.  

- **verbose**: Default: `True`. bool, whether to print progress messages.  

- **eps**: Default: `1e-10`. float, small value to avoid division by zero.  

**Returns**

- **df_results**: `pandas.DataFrame` containing NNC enrichment results with columns:  

  - ``niche1_idx``, ``niche2_idx``: indices of interacting niches. Niche 1 is source niche and niche 2 is target niche.  

  - ``niche1``, ``niche2``: names of interacting niches. Niche 1 is source niche and niche 2 is target niche.  

  - ``edge_count``: number of edges observed between niche pairs  

  - ``edge_prop``: proportion of edges between niche pairs for the source niche

  - ``oddsratio`` or ``chi2_stat``: statistic from the test  

  - ``p-value``: raw p-value  

  - ``q-value``: FDR-corrected p-value  

  - ``log2fc``: log2 fold-change  

  - ``enrichment``: bool, whether the interaction is significantly enriched  

- **edge_prop_mtx**: numpy array, normalized edge proportions between all niche pairs.  

- **n1_count**: numpy array, total outgoing edges for each niche.  


Niche-niche colocalization matrix
--------------------------------

.. code-block:: python

    edge_prop_mtx, n1_count = cal_nnc_mtx(
        adata,
        niche_key,
        niche_summary=None,
        adj_mtx_key=None,
        spatial_key=None,
        cut_percentage=99,
        reserve_nonexist=False,
    )

**Parameters**

- **adata**: `anndata` object, input dataset used to compute the niche-niche colocalization matrix.  

- **niche_key**: string, key in `.obs` representing niche labels.  

- **niche_summary**: Default: `None`. list of niche names or labels. If `None`, all unique niche labels in `adata.obs[niche_key]` are used.  

- **adj_mtx_key**: Default: `None`. string or `None`, key in `.obsp` storing a precomputed adjacency matrix. If provided, this adjacency matrix is used directly.  

- **spatial_key**: Default: `None`. string or `None`, key in `.obsm` storing spatial coordinates. If `adj_mtx_key` is `None`, a Delaunay adjacency graph is constructed from these coordinates.  

- **cut_percentage**: Default: `99`. float, percentage threshold for retaining Delaunay edges when constructing the adjacency graph from spatial coordinates. Only used when `spatial_key` is provided and `adj_mtx_key` is `None`.  

- **reserve_nonexist**: Default: `False`. bool, whether to retain niches with zero outgoing niche-niche edges by setting their total edge count to 1 before normalization, thereby avoiding division by zero.  

**Returns**

- **edge_prop_mtx**: `numpy.ndarray`, shape `(n_niche, n_niche)`, row-normalized niche-niche colocalization matrix. Each entry represents the proportion of outgoing edges from one niche to another niche, excluding self-niche edges.  

- **n1_count**: `numpy.ndarray`, shape `(n_niche,)`, total number of outgoing edges from each niche to other niches before normalization.  


Niche-niche colocalization patterns differential test between groups
---------------------------------------------------

.. code-block:: python

    df_nnc_between_groups = nnc_between_groups_test(
        g1_list,
        g2_list,
        niche_labels,
        min_valid=3,
        alpha=0.05,
        alternative="two-sided",
        fdr_method="fdr_by",
    )

**Parameters**

- **g1_list**: array-like, shape `(n_sample1, n_niche, n_niche)`, collection of niche-niche colocalization matrices from group 1.  

- **g2_list**: array-like, shape `(n_sample2, n_niche, n_niche)`, collection of niche-niche colocalization matrices from group 2.  

- **niche_labels**: list of strings, niche names or labels corresponding to the rows and columns of the colocalization matrices.  

- **min_valid**: Default: `3`. int, minimum number of valid samples required in each group for testing a niche-niche pair. If either group has fewer than `min_valid` non-NaN values for a given pair, the corresponding test is not performed.  

- **alpha**: Default: `0.05`. float, significance level for multiple testing correction.  

- **alternative**: Default: `'two-sided'`. string, alternative hypothesis for the Mann-Whitney U test. Options typically include:  

  - ``'two-sided'``: tests whether the two groups differ.  

  - ``'greater'``: tests whether values in group 1 tend to be greater than those in group 2.  

  - ``'less'``: tests whether values in group 1 tend to be less than those in group 2.  

- **fdr_method**: Default: `'fdr_by'`. string, method for false discovery rate correction.  

**Returns**

- **df_nnc_between_groups**: `pandas.DataFrame` containing niche-niche colocalization comparison results with columns:  

  - ``niche1``: source niche  

  - ``niche2``: target niche  

  - ``mean1``: mean colocalization value in group 1  

  - ``mean2``: mean colocalization value in group 2  

  - ``delta_mean``: difference in mean colocalization value between group 1 and group 2  

  - ``n1_valid``: number of valid samples in group 1  

  - ``n2_valid``: number of valid samples in group 2  

  - ``p_value``: raw p-value from the Mann-Whitney U test  

  - ``q_value``: FDR-corrected p-value  

  - ``rejected``: bool, whether the comparison is significant after multiple testing correction.