\documentclass[12pt,a4paper,twoside,openany]{book}
%\usepackage[work]{MCthesis}            % you can select this to work

\usepackage[final]{MCthesis}            % this is for the final submission

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage[utf8]{inputenc}         % allow utf-8 input
% ------------------------------------------
\usepackage[T1]{fontenc}            % use 8-bit T1 fonts
\usepackage{hyperref}               % hyperlinks

% Hiding the red boxes around links
\hypersetup{
    hidelinks
}

\usepackage{url}                    % simple URL typesetting
\usepackage{booktabs}               % professional-quality tables
\usepackage{nicefrac}               % compact symbols for 1/2, etc.
\usepackage{microtype}              % microtypography
\usepackage{xcolor}                 % colors
\usepackage{amssymb,amsmath,amsfonts,amsbsy, amsthm}
\usepackage{mathrsfs}
\usepackage{bm}                     % bold vectors
\usepackage{listings}               % to include R-Code
\usepackage[section]{placeins}      % forces floats to appear in the section they are placed in
\usepackage[]{apacite}              % bibliography example
\usepackage[toc,page]{appendix}
\usepackage{framed}
\usepackage{float}
\usepackage{algorithm}              % For your pseudo-code
\usepackage{algpseudocode}          % For your pseudo-code
\usepackage[font=small, labelfont=bf, textfont=it]{caption}
\usepackage{enumitem}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\usepackage{blindtext}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SET VARIABLES FOR TITLE PAGE
\title{Optimal Transport in Linear Independent Component Analysis}
\author{Ashutosh Jha}
\fromCity{Tuebingen}
\matnr{6639615}
\date{\today} % submission date

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SET VARIABLES FOR SUPERVISION AND EXAMINERS
\supervisor{Prof. Dr. Joachim Grammig\textsuperscript{1}}
           {Prof. Dr. Michel Besserve\textsuperscript{2,3}}
           {Dr. Simon Buchholz\textsuperscript{2}}

\reviewer{Prof. Dr. Joachim Grammig\textsuperscript{1}}
         {Prof. Dr. Augustin Kelava\textsuperscript{4}
          \\[1.5cm] % Adds space before the affiliations
          \multicolumn{2}{l}{\small \textsuperscript{1} Chair of Statistics, Econometrics and Empirical Economics, University of Tübingen} \\
          \multicolumn{2}{l}{\small \textsuperscript{2} Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen} \\
          \multicolumn{2}{l}{\small \textsuperscript{3} Institute for Machine Learning and AI, TU Braunschweig} \\
          \multicolumn{2}{l}{\small \textsuperscript{4} Methods Center, University of Tübingen}
         }
         {}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% --- Better Float Parameters ---
\setcounter{topnumber}{2}
\setcounter{bottomnumber}{2}
\setcounter{totalnumber}{4}
\renewcommand{\topfraction}{0.85}
\renewcommand{\bottomfraction}{0.85}
\renewcommand{\textfraction}{0.15}
\renewcommand{\floatpagefraction}{0.8}
% -------------------------------

\begin{document}

% The \maketitle command in your MCthesis package automatically handles 
% the title page, declaration, \frontmatter, \tableofcontents, and \mainmatter.
\maketitle

\clearpage
\listoffigures
\clearpage
\listoftables
\cleardoublepage

% -----------------------------------------------------------------
% CHAPTER 1: INTRODUCTION
% -----------------------------------------------------------------
\chapter{Introduction}

Blind source separation, specifically formulated as independent component analysis (ICA), is a fundamental problem in modern statistics \cite{jutten1985}. 
The primary objective of ICA is to recover a set of hidden, statistically independent source signals from an observed set of mixtures. 
In this thesis, the problem is restricted to the linear mixture setting, wherein the observations are assumed to be generated 
by an unknown, invertible linear combination of the latent sources.

To achieve exact separation, ICA imposes strict statistical independence among the recovered components. 
This approach is frequently contrasted with principal component analysis (PCA). While PCA diagonalizes the covariance matrix 
to ensure that the resulting components are uncorrelated, ICA enforces the mathematical condition of full statistical independence \cite[Chapter 1]{hyvarinen2001}. 
This rigorous criterion allows ICA to systematically disentangle complex latent variables that remain mixed under simple decorrelation.

\section{Existing Methodologies}

In practice, the standard ICA pipeline begins by centering and whitening the observed mixture. 
Whitening is a necessary preprocessing step that removes linear correlations and scales the marginal variances to unity. 
Once the data is whitened, the search for independent components is geometrically restricted to finding 
an optimal orthogonal rotation matrix on the Stiefel manifold, the space of orthogonal matrices \cite{absil2008optimization}.

Traditional ICA algorithms navigate this rotational search by optimizing surrogate contrast functions to identify non-Gaussian projections \cite[Chapter 8]{hyvarinen2001}. 
These classical methods rely on surrogate information-theoretic measures, such as negentropy approximations or cumulant-based matching, 
to quantify the non-Gaussianity of the unmixed components \cite{amari1996new}. 

While computationally efficient, these parametric approximations evaluate specific statistical moments rather than the full probability density. 
Consequently, they fail to capture the complete geometric structure of the underlying empirical distributions, 
which can lead to optimization failures in heterogeneous mixtures.

\section{Applications}

The ability to blindly separate independent sources without parametric assumptions regarding their underlying distributions has widespread utility across diverse scientific domains. 
In neuroscience, ICA serves as the standard clinical technique for isolating and removing high-amplitude biological artifacts, such as ocular and muscular signals, 
from continuous electroencephalogram (EEG) and magnetoencephalogram (MEG) recordings \cite[Chapter 21]{hyvarinen2001}.

Beyond biological signal processing, the framework is foundational in digital communications and computer vision. 
In telecommunications, ICA is deployed to separate overlapping transmissions in code-division multiple access (CDMA) systems, 
while in image processing, it is utilized to extract localized, oriented, and bandpass features from natural images, 
effectively mimicking the processing functions of the mammalian primary visual cortex \cite[Chapters 22-23]{hyvarinen2001}.

Furthermore, the methodology is increasingly utilized in advanced econometrics and market microstructure. 
In financial price discovery, ICA facilitates the identification of latent structural shocks, allowing researchers to 
uniquely resolve the orthogonal ambiguities inherent in standard vector error correction models \cite{zema2025}. 
Across all these distinct domains, the validity of the downstream scientific or economic analysis depends strictly on the 
reliable mathematical convergence of the underlying ICA optimization algorithm.

\section{Main Contribution}

The primary methodological contribution of this thesis is the formulation and investigation of an optimal transport framework for independent component analysis (OT-ICA). 
The core objective reduces to searching the Stiefel manifold for the rotation matrix that maximizes the non-Gaussianity of the projected data.

To achieve this, we propose utilizing the Wasserstein distance as the objective function for the rotational search. 
Our objective is to find the orthogonal candidates that maximize the Wasserstein distance to a standard normal distribution. 
Because the Wasserstein metric evaluates the true global geometry of the empirical distribution using order statistics \cite{villani2003}, 
it provides a non-parametric and highly sensitive measure of distance to Gaussianity. 
We find that this successfully circumvents the structural blind spots of standard surrogate measures.

\section{Structure of the Thesis}

The remainder of this thesis is structured as follows. Chapter 2 establishes the theoretical foundations of linear ICA, information geometry, 
and optimal transport. Chapter 3 formalizes the OT-ICA framework, proving that the squared Wasserstein ($W_2^2$) distance reliably bounds mixture non-Gaussianity. 
Chapter 4 details the algorithmic enhancements required to optimize this metric efficiently on the Stiefel manifold. 
Chapter 5 presents our empirical evaluation, demonstrating the failure modes of standard proxy optimization and how OT-ICA addresses these limitations. 
Finally, Chapter 6 concludes the thesis with a summary of findings and directions for future research.

% -----------------------------------------------------------------
% CHAPTER 2: THEORETICAL FOUNDATIONS
% -----------------------------------------------------------------
\chapter{Theoretical Foundations}

This chapter establishes the mathematical and conceptual foundations necessary for linear independent component analysis. 
We first formalize the linear mixing model and its identifiability conditions, subsequently detailing the statistical and 
thermodynamic intuitions that motivate the framework. Following this, we introduce an information-geometric perspective linking 
independence to non-Gaussianity, and finally, formalize the principles of optimal transport that govern the proposed optimization metric.

\section{Linear Independent Component Analysis (ICA)}

The foundational assumption of the linear Independent Component Analysis model is that the observed signals are generated by 
a linear combination of statistically independent latent sources. We formalize this generative model and its required constraints as follows:

\vspace{0.5em}
\noindent\textbf{Theorem 2.1 (The Linear ICA Model and Identifiability, \cite{comon1994}).} \textit{Let the observed data be denoted by 
a random vector $\mathbf{X} \in \mathbb{R}^d$, and the latent sources by $\mathbf{S} \in \mathbb{R}^d$. The linear ICA model assumes:
\begin{enumerate}[label=\alph*), itemsep=0pt]
    \item \textbf{Linear Mixing:} $\mathbf{X} = \mathbf{A} \mathbf{S}$, where $\mathbf{A} \in \mathbb{R}^{d \times d}$ is an unknown, invertible mixing matrix.
    \item The components $s_1, s_2, \dots, s_d$ of $\mathbf{S}$ are mutually statistically independent.
    \item At most one of the source components $s_i$ follows a Gaussian distribution.
\end{enumerate}
Under these conditions, the mixing matrix $\mathbf{A}$ can be uniquely identified from the observations $\mathbf{X}$, 
up to a permutation and scaling of its columns.}
\vspace{0.5em}

Expanded into vector notation, the structural mixing represents the observed variables as
\begin{equation}
    \begin{bmatrix}
    x_1 \\
    x_2 \\
    \vdots \\
    x_d
    \end{bmatrix}
    = 
    \begin{bmatrix}
    a_{11} & a_{12} & \cdots & a_{1d} \\
    a_{21} & a_{22} & \cdots & a_{2d} \\
    \vdots & \vdots & \ddots & \vdots \\
    a_{d1} & a_{d2} & \cdots & a_{dd}
    \end{bmatrix}
    \begin{bmatrix}
    s_1 \\
    s_2 \\
    \vdots \\
    s_d
    \end{bmatrix}.
\end{equation}
The core objective of ICA is to estimate an unmixing matrix $\mathbf{B} \approx \mathbf{A}^{-1}$ such that the recovered components $\mathbf{Z} = \mathbf{B}\mathbf{X}$ 
are as statistically independent as possible, thereby reconstructing the original sources $\mathbf{S}$.
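
For concreteness, the following minimal NumPy sketch draws data from this generative model; the dimensionality, sample size, and source distributions below are arbitrary illustrative choices rather than part of the method.

\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 10_000

# Independent, non-Gaussian sources (one row per component):
# Laplace, uniform, and centered exponential.
S = np.vstack([
    rng.laplace(size=N),
    rng.uniform(-np.sqrt(3), np.sqrt(3), size=N),
    rng.exponential(size=N) - 1.0,
])

A = rng.normal(size=(d, d))   # unknown, invertible mixing matrix
X = A @ S                     # observed mixtures X = A S
\end{lstlisting}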

\subsection{The Insufficiency of Principal Component Analysis}
Principal Component Analysis (PCA) is frequently utilized as a baseline technique to decorrelate observed data. 
This corresponds to finding the orthogonal matrix $\mathbf{V}^\top$ derived from the eigenvalue decomposition of 
the data's covariance matrix. However, rendering a sample strictly decorrelated is mathematically insufficient for 
achieving blind source separation when the underlying data is non-Gaussian. 

As noted by MacKay \citeyear{mackay2003}, PCA is fundamentally an $L_2$-norm variance maximization technique; 
it is inherently sensitive to the scaling of the variables and heavily influenced by outliers. More critically, 
PCA relies exclusively on second-order statistics. Once the data is decorrelated and scaled (whitened) such that $\mathbb{E}[\mathbf{X}\mathbf{X}^\top] = \mathbf{I}_d$, 
the PCA criterion is satisfied for any arbitrary orthogonal rotation. While decorrelation implies strict independence 
for multivariate Gaussian distributions, higher-order dependencies remain intact for non-Gaussian data. Consequently, 
ICA must transcend second-order moments, traversing the space of orthogonal matrices to locate the  
rotation that eliminates these higher-order dependencies and achieves statistical independence.

\subsection{Identifiability and Ambiguities}
The non-Gaussianity constraint defined in Theorem 2.1 is a strict mathematical requirement. If two or more sources are Gaussian, 
any orthogonal transformation of those specific sources yields another set of strictly independent Gaussian variables, 
rendering the mixing matrix $\mathbf{A}$ impossible to determine uniquely. 

Furthermore, even when the non-Gaussianity assumption is satisfied, the model contains two inherent ambiguities. 
The first is a scaling ambiguity: because both $\mathbf{A}$ and $\mathbf{S}$ are unknown, any scalar multiplier assigned 
to a source $s_i$ can be perfectly cancelled by dividing the corresponding column of $\mathbf{A}$ by the same scalar. 
Therefore, the exact variances of the unobserved sources cannot be inferred. The second is a permutation ambiguity: 
the model treats the sum of sources symmetrically, meaning the indexing order of the recovered components is arbitrary. 

\subsection{Centering and Whitening}
To simplify the optimization landscape, the observed data $\mathbf{X}$ is subjected to a two-step preprocessing phase consisting of centering and whitening. 
Centering ensures that the data has a zero mean by subtracting the expectation, $\mathbb{E}[\mathbf{X}] = \mathbf{0}$.
Given centered data, the covariance matrix simplifies to $\mathrm{Cov}(\mathbf{X}) = \mathbb{E}[\mathbf{X}\mathbf{X}^\top]$. 
Whitening transforms the observed vector into a new vector $\widetilde{\mathbf{X}}$ whose components are uncorrelated and possess unit variance. 
This is achieved using the eigenvalue decomposition of the covariance matrix
\begin{equation}
    \mathrm{Cov}(\mathbf{X}) = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^\top
\end{equation}
where $\mathbf{V}$ is the orthogonal matrix of eigenvectors and $\mathbf{\Lambda}$ is the diagonal matrix of eigenvalues $\lambda_j$. 
The whitening transformation matrix $\mathbf{W}$ is constructed by scaling the eigenvectors by the inverse square root of the eigenvalues
\begin{equation}
    \mathbf{W} = \mathbf{\Lambda}^{-1/2} \mathbf{V}^\top.
\end{equation}
Applying this transformation yields the whitened data $\widetilde{\mathbf{X}} = \mathbf{W}\mathbf{X}$, where the covariance is the identity matrix $\mathbf{I}_d$.
After whitening, if we define the unmixing operation on the whitened data as $\widetilde{\mathbf{Z}} = \mathbf{U} \widetilde{\mathbf{X}}$, the covariance of the recovered components is
\begin{equation}
    \mathrm{Cov}(\widetilde{\mathbf{Z}}) = \mathbf{U} \mathrm{Cov}(\widetilde{\mathbf{X}}) \mathbf{U}^\top = \mathbf{U} \mathbf{I}_d \mathbf{U}^\top = \mathbf{U}\mathbf{U}^\top.
\end{equation}
Because the independent components must have unit variance, we require $\mathbf{U}\mathbf{U}^\top = \mathbf{I}_d$. 
This reveals that the unmixing matrix $\mathbf{U}$ must be perfectly orthogonal, thus reducing the ICA optimization to a search over the Stiefel manifold, a space of orthogonal matrices.
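
As a concrete illustration, a minimal NumPy sketch of the two preprocessing steps is given below, assuming the observations are stored as a $d \times N$ matrix; a production implementation would additionally guard against near-zero eigenvalues.

\begin{lstlisting}[language=Python]
import numpy as np

def center_and_whiten(X):
    """Center the d x N data matrix X and whiten it via the eigenvalue
    decomposition Cov(X) = V Lambda V^T, returning (X_tilde, W)."""
    Xc = X - X.mean(axis=1, keepdims=True)      # centering: E[X] = 0
    cov = Xc @ Xc.T / Xc.shape[1]               # empirical covariance
    eigval, V = np.linalg.eigh(cov)             # eigendecomposition
    W = np.diag(eigval ** -0.5) @ V.T           # W = Lambda^{-1/2} V^T
    return W @ Xc, W                            # whitened data has Cov = I_d
\end{lstlisting}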

\subsection{Statistical and Thermodynamic Intuitions}
The identifiability limitation in Theorem 2.1 translates the challenge of finding independent components into 
a distinct optimization objective: maximizing non-Gaussianity. The theoretical justification for this approach is rooted in the Central Limit Theorem. 

\vspace{0.5em}
\noindent\textbf{Theorem 2.2 (Central Limit Theorem Intuition).} \textit{Let $s_1, s_2, \dots, s_d$ be independent random variables. 
Under general regularity conditions, the distribution of their normalized sum tends toward a Gaussian distribution as $d$ increases \cite{billingsley1995probability}.}
\vspace{0.5em}

While Theorem 2.2 formally applies to limits of sums, its consequence in ICA is that any observed 
linear mixture of independent, non-Gaussian sources will inherently appear more Gaussian than the underlying sources themselves \cite[Chapter 1]{hyvarinen2001}.
Therefore, by extracting the unmixed components that are the least Gaussian, the algorithm effectively isolates the original independent signals.

The use of non-Gaussianity as a proxy for independence is further understood by an information theory and statistical mechanics perspective. 
For any continuous random variable with a fixed finite variance, it is a foundational mathematical result that the Gaussian distribution 
uniquely possesses the maximum possible Shannon entropy \cite[Chapter 8]{cover2006elements}. In thermodynamics, 
isolated physical systems naturally evolve toward states of maximum entropy. Viewed through this interdisciplinary lens, 
the Central Limit Theorem serves as the statistical manifestation of this physical principle: the mathematical process of 
linearly mixing independent sources inherently increases the overall entropy of the system, inevitably driving the joint 
distribution toward the high-entropy Gaussian state. 

\section{An Information Geometry Perspective on ICA}

\subsection{Product and Gaussian Manifolds}
Information geometry provides a framework for quantifying the relationship between independence and non-Gaussianity 
by analyzing probability distributions as points on geometric manifolds. 
Following the perspective introduced by Cardoso \citeyear{cardoso2022}, ICA can be understood as orthogonally projecting 
the empirical distribution onto two distinct subspaces. The first is the Product manifold $\mathcal{P}$, 
which contains all distributions where the marginal components are strictly independent. The second is the 
Gaussian manifold $\mathcal{G}$, which encompasses all multivariate Gaussian distributions with zero mean and a fixed covariance structure. 
In this perspective, standard decorrelation merely ensures that the covariance is diagonal, whereas full ICA seeks a linear transform 
that minimizes the divergence between the empirical distribution and its projection onto the Product manifold \cite{cardoso2022}.

\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.85\textwidth]{pythagorean_ig_ica.png}
    \caption[Information Geometry of ICA]{The Pythagorean geometry of Independent Component Analysis. 
    The empirical joint distribution $P_Y$ is orthogonally projected onto the Product manifold $\mathcal{P}$ (yielding the independent marginals $P_Y^P$) 
    and the Gaussian manifold $\mathcal{G}$ (yielding the best Gaussian fit $\mathcal{N}(\operatorname{Cov} Y)$). The intersection of these approximations yields 
    the fully factorized Gaussian target $\mathcal{N}(\operatorname{diag\,Cov} Y)$. The Kullback-Leibler divergences between these nodes geometrically represent distinct statistical properties: 
    Mutual Information $\mathscr{I}(Y)$, Joint Non-Gaussianity $G(Y)$, and Correlation $C(Y)$. Figure adapted from Cardoso \citeyear{cardoso2022}.}
    \label{fig:cardoso_geometry}
\end{figure}

\subsection{Kullback-Leibler Pythagorean Identities}
Distances between these distributions are measured using the Kullback-Leibler (KL) divergence \cite{cover2006elements}. For any random vector $Y$ with a joint density $P_Y$, 
the mutual information $\mathscr{I}(Y)$ measures the distance to the closest independent product distribution, denoted $P_Y^P = \prod_{i} P_{Y_i}$. In integral form,
\begin{equation}
    \mathscr{I}(Y) = D_{\mathrm{KL}}\left(P_Y \,\Vert \, P_Y^P\right) = \int P_Y(y) \log \frac{P_Y(y)}{\prod_{i} P_{Y_i}(y_i)} \, dy.
\end{equation}
A Pythagorean identity exists on the Product manifold. For any candidate product distribution $Q = \prod_{i} q_i$, 
the divergence splits precisely into the mutual information and the marginal divergences. Expanding the logarithm reveals this additive structure:
\begin{align}\label{eq:pythagoras_product}
    D_{\mathrm{KL}}\left(P_Y \Vert Q\right) &= \int P_Y(y) \log \left( \frac{P_Y(y)}{P_Y^P(y)} \frac{P_Y^P(y)}{Q(y)} \right) dy \nonumber \\
    &= \int P_Y(y) \log \frac{P_Y(y)}{P_Y^P(y)} \, dy + \sum_i \int P_{Y_i}(y_i) \log \frac{P_{Y_i}(y_i)}{q_i(y_i)} \, dy \nonumber \\
    &= \mathscr{I}(Y) + \sum_{i} D_{\mathrm{KL}}\left(P_{Y_i} \Vert q_i\right).
\end{align}
A similar Pythagorean identity holds for the Gaussian manifold $\mathcal{G}$. 
Let $\mathcal{N}(\Sigma)$ denote an arbitrary zero-mean multivariate Gaussian distribution parameterized by a covariance matrix $\Sigma$. 
Furthermore, let $\mathcal{N}(\operatorname{Cov} Y)$ denote the specific Gaussian distribution that shares the exact covariance structure 
of the empirical data $Y$. Geometrically, $\mathcal{N}(\operatorname{Cov} Y)$ represents the orthogonal projection of $P_Y$ onto the 
Gaussian manifold, yielding the best possible Gaussian approximation of the data. 

The KL divergence from the empirical distribution $P_Y$ to the arbitrary Gaussian target $\mathcal{N}(\Sigma)$ decomposes 
into the divergence to this best Gaussian fit and the divergence between the two Gaussian models:
\begin{equation}\label{eq:pythagoras_gaussian}
    D_{\mathrm{KL}}\big(P_Y \Vert \mathcal{N}(\Sigma)\big) = D_{\mathrm{KL}}\big(P_Y \Vert \mathcal{N}(\operatorname{Cov} Y)\big) + D_{\mathrm{KL}}\big(\mathcal{N}(\operatorname{Cov} Y) \Vert \mathcal{N}(\Sigma)\big).
\end{equation}

\subsection{Independence, Correlation, and Non-Gaussianity}
These geometric identities can be combined to relate independence directly to non-Gaussianity. 
We define the non-Gaussianity of the joint variable $Y$ as $G(Y) = D_{\mathrm{KL}}(P_Y \,\Vert\, \mathcal{N}(\operatorname{Cov} Y))$, 
and the scalar measure of correlation as $C(Y) = D_{\mathrm{KL}}(\mathcal{N}(\operatorname{Cov} Y) \,\Vert\, \mathcal{N}(\operatorname{diag\,Cov} Y))$. 

To establish the fundamental balance between these quantities, we evaluate the KL divergence from the empirical distribution $P_Y$ 
to a fully factorized Gaussian target $Q = \mathcal{N}(\operatorname{diag\,Cov} Y)$. 
We can compute this divergence via two distinct geometric paths shown in Figure \ref{fig:cardoso_geometry}. 

Following the Pythagorean identity on the Product manifold (Equation~\ref{eq:pythagoras_product}):
\begin{equation}
    D_{\mathrm{KL}}\big(P_Y \Vert \mathcal{N}(\operatorname{diag\,Cov} Y)\big) = \mathscr{I}(Y) + \sum_{i} G(Y_i).
\end{equation}
Following the Pythagorean identity on the Gaussian manifold (Equation~\ref{eq:pythagoras_gaussian}):
\begin{equation}
    D_{\mathrm{KL}}\big(P_Y \Vert \mathcal{N}(\operatorname{diag\,Cov} Y)\big) = G(Y) + C(Y).
\end{equation}
Equating these two paths yields Cardoso's unified theorem \cite{cardoso2022}:
\begin{equation}
    \mathscr{I}(Y) + \sum_{i} G(Y_i) = G(Y) + C(Y).
\end{equation}

To demonstrate how maximizing marginal non-Gaussianities minimizes mutual information, we analyze the terms under 
the constraints of the ICA pipeline. Crucially, the joint non-Gaussianity $G(Y)$ is invariant under any invertible 
linear transformation \cite{cardoso2022}. Furthermore, once the data has been whitened, the covariance matrix is forced to be the identity matrix. 
This implies that $\operatorname{Cov}(Y) = \operatorname{diag\,Cov}(Y)$, which strictly forces the correlation term to zero ($C(Y) = 0$). 

Applying these constraints, the relationship rearranges into a direct subtraction:
\begin{align}
    \mathscr{I}(Y) + \sum_{i} G(Y_i) &= G(Y) + 0 \nonumber \\
    \mathscr{I}(Y) &= G(Y) - \sum_{i} G(Y_i).
\end{align}
Because $G(Y)$ remains constant under orthogonal rotation, minimizing the mutual information $\mathscr{I}(Y)$, i.e., finding independent components, 
is mathematically equivalent to maximizing the sum of the marginal non-Gaussianities $\sum_i G(Y_i)$. 
This shows that centering and whitening constrain the data to a subspace where true statistical 
independence can be achieved solely by maximizing non-Gaussianity.

\section{Optimal Transport Theory}

\subsection{The Monge Problem: Push-Forwards and Pull-Backs}
While information geometry utilizes Kullback-Leibler divergence, optimal transport provides an alternative metric 
space parameterized by the physical geometry of probability mass. First formulated by Gaspard Monge \citeyear{monge1781}, 
the problem was originally conceptualized as finding the most cost-effective way to move a distribution of mass.

Mathematically, this is framed as finding a transport map $T$ that associates points from a continuous source measure $\mu$ 
to a target measure $\nu$. We express this spatial modification of probability mass using the \textit{push-forward} operator $T_{\#}$. 
A measure $\nu$ is the push-forward of $\mu$ by a map $T$, denoted $\nu = T_{\#}\mu$, if for any measurable set $B$ in the target space:
\begin{equation}
    \nu(B) = \mu(T^{-1}(B)).
\end{equation}

\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.6\textwidth]{ot_pushforward_concept.png}
    \caption[Optimal Transport Push-Forward]{The Monge transport map $T$ pushes forward the probability mass from 
    the source measure $\mu$ to construct the target measure $\nu$, strictly conserving total mass during the transformation.}
    \label{fig:ot_pushforward}
\end{figure}

As illustrated in Figure \ref{fig:ot_pushforward}, the push-forward operator acts on measures, effectively 
pushing mass forward along the map $T$. It is conceptually distinct from the \textit{pull-back} operator $T^{\#}$, which acts on functions. 
For a continuous, bounded test function $g: \mathcal{Y} \rightarrow \mathbb{R}$, the pull-back evaluates the function via 
the composition mapping ($T^{\#} g = g \circ T$); the two operators are adjoint in the sense that $\int g \, d(T_{\#}\mu) = \int (g \circ T) \, d\mu$. 
With this notation in place, the Monge problem seeks to minimize a transportation cost function $c(x,y)$ subject to the strict constraint that $T_{\#}\mu = \nu$:
\begin{equation}
    \min_{T} \left\{ \int_{\mathcal{X}} c(x, T(x)) d\mu(x) \quad : \quad T_{\#}\mu = \nu \right\}.
\end{equation}

\subsection{The Kantorovich Relaxation}
The fundamental limitation of the Monge problem is that the map $T$ must be strictly deterministic; mass from 
a single source location cannot be split to multiple targets. Consequently, if the target distribution has more 
discrete points than the source distribution, a valid Monge map may not even exist. 

To resolve this degeneracy, Kantorovich \citeyear{kantorovich1942} relaxed the deterministic requirement, 
proposing instead a probabilistic transport that permits mass splitting. 
The Kantorovich formulation replaces the deterministic map $T$ with a joint probability distribution, 
or coupling, $\pi \in \Pi(\mu, \nu)$, which dictates exactly how fractions of mass are distributed from the source to the target.

\begin{figure}[htbp]
    \centering
    \includegraphics[width=1.0\textwidth]{ot_coupling_cases.png}
    \caption[The Three Scenarios of Kantorovich Optimal Transport]{A schematic view of the input measures and 
    optimal couplings across the three primary scenarios of Kantorovich Optimal Transport: Discrete, Semi-discrete, and Continuous. 
    The top row depicts the marginal distributions, while the bottom row visualizes the structural nature of the corresponding coupling spaces. 
    Figure adapted from Peyr{\'e} and Cuturi \citeyear{peyre2019computational}.}
    \label{fig:ot_coupling_cases}
\end{figure}

As illustrated in Figure \ref{fig:ot_coupling_cases}, this relaxation is universally applicable across three fundamental statistical scenarios. 
In the discrete setting depicted in the first column, the coupling manifests as a transportation matrix mapping distinct point masses. 
In the semi-discrete setting (second column), mass from a continuous density is optimally partitioned into specific discrete target regions. 
In the fully continuous setting (third column), the coupling forms a joint density over the continuous supports. 
By allowing fractional transport in all domains, Kantorovich transformed the non-convex combinatorial 
Monge problem into a relatively tractable linear optimization problem.

\subsection{The Wasserstein Distances ($W_1$ and $W_2$)}
The minimal cost required to transform one probability distribution into another is formalized as the Wasserstein distance. 
For measures on $\mathbb{R}^d$, utilizing the Euclidean cost function $c(x,y) = \|x - y\|^p$, 
the general $W_p$ distance is defined as the infimum over the space of admissible couplings $\Pi(\mu, \nu)$:
\begin{equation}
    W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p \, d\pi(x, y) \right)^{1/p}.
\end{equation}

While the integral definition above correctly formalizes optimal transport in high dimensions, computing the optimal coupling $\pi$ 
analytically is rarely tractable in arbitrary multivariate settings. The concept of sorting data to construct 
a cumulative distribution function (CDF) is strictly a one-dimensional phenomenon, as there is no natural, 
universal ordering of coordinates in $\mathbb{R}^d$ for $d > 1$. 

However, in the specific one-dimensional setting, the geometry of optimal transport drastically simplifies \cite[Chapter 2]{villani2003}. 
Because the data can be unambiguously sorted, the $W_1$ distance analytically collapses to the integral of 
the absolute difference between the univariate cumulative distribution functions $F$ and $G$:
\begin{equation}
    W_1(\mu, \nu) = \int_{-\infty}^\infty |F(x) - G(x)| \, dx.
\end{equation}

Similarly, the squared Wasserstein distance ($p=2$), which heavily penalizes local geometric distortions, 
takes on a highly efficient analytical form in one dimension. It is most conveniently expressed using the quantile functions, 
which are the exact inverses of the cumulative distributions ($F^{-1}$ and $G^{-1}$):
\begin{equation}
    W_2^2(\mu, \nu) = \int_0^1 |F^{-1}(t) - G^{-1}(t)|^2 \, dt.
\end{equation}

By circumventing the need to construct multi-dimensional couplings, the 1D $W_2^2$ metric provides an extremely 
computationally efficient mechanism for measuring structural divergence. When evaluating empirical linear projections 
against standard normal noise, the 1D squared Wasserstein distance serves as the primary optimization function for the framework utilized in this thesis.
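
To make this computation concrete, the following minimal NumPy/SciPy sketch evaluates the empirical one-dimensional $W_2^2$ distance to the standard normal by sorting; the midpoint quantile grid used here is one common discretization of the Gaussian target (Chapter 4 replaces it with an exact analytical target).

\begin{lstlisting}[language=Python]
import numpy as np
from scipy.stats import norm

def w22_to_gaussian(y):
    """Empirical W_2^2 between a 1D sample y and N(0,1) via matched quantiles."""
    N = len(y)
    y_sorted = np.sort(y)                            # empirical quantile function
    q = norm.ppf((np.arange(1, N + 1) - 0.5) / N)    # N(0,1) quantiles at bin midpoints
    return np.mean((y_sorted - q) ** 2)              # discretized quantile integral
\end{lstlisting}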

% -----------------------------------------------------------------
% CHAPTER 3: THE OPTIMAL TRANSPORT ICA (OT-ICA) FRAMEWORK
% -----------------------------------------------------------------
\chapter{The Optimal Transport ICA (OT-ICA) Framework}

\section{Optimal Transport Distance as a Contrast for ICA}
As established in the preceding chapter, the process of centering and whitening restricts the search for independent 
components to the manifold of orthogonal matrices. Furthermore, Cardoso's unified theorem \cite{cardoso2022} demonstrates 
that within this whitened space, the mutual information of the projections is minimized precisely when the sum of their 
marginal non-Gaussianities is maximized. Traditional Independent Component Analysis algorithms approach this optimization 
by utilizing proxy contrast functions, such as kurtosis or negentropy approximations. While computationally inexpensive, 
these proxies evaluate specific statistical moments rather than the full probability density.

This thesis formulates ICA directly as an optimal transport problem, utilizing the squared Wasserstein distance, $W_2^2$, 
as a direct measure of non-Gaussianity. We formalize this framework as follows:

\vspace{0.5em}
\noindent\textbf{Definition 3.1 (The OT-ICA Optimization Objective).} \textit{Let $\mathbf{X}$ be the whitened observed mixture, and 
let $\mathbf{u} \in S_n$ be a candidate projection vector on the unit sphere. Let the empirical distribution of the projected data be 
denoted by $\mathbb{P}_{\mathbf{u}^\top \mathbf{X}}$, and let the standard normal distribution be defined as $\Gamma \equiv \mathcal{N}(0,1)$. 
The Optimal Transport ICA (OT-ICA) objective is to find the orthogonal rotation matrix whose column vectors $\mathbf{u}$ maximize the transport cost to $\Gamma$:}
\begin{equation}
    \hat{\mathbf{u}} = \arg\max_{\mathbf{u} \in S_n} W_2^2\big(\mathbb{P}_{\mathbf{u}^\top \mathbf{X}}, \Gamma\big).
\end{equation}
\vspace{0.5em}

By framing the problem in this manner, we leverage the optimal transport metric's ability to evaluate geometric deviations from Gaussianity, providing a sensitive contrast function.

\section{Theoretical Motivation: Bounding Mixture Non-Gaussianity}
To mathematically justify the use of $W_2^2$ as an objective function for ICA, we demonstrate that maximizing this distance 
recovers the true independent sources. By the Central Limit Theorem, a mixture of non-Gaussian sources is more Gaussian 
than the constituent sources themselves. In optimal transport terms, this implies that the Wasserstein distance between a mixture 
and a Gaussian distribution is bounded by the distances of its independent components.

\subsection{The $W_2$ Upper Bound}

\vspace{0.5em}
\noindent\textbf{Theorem 3.1 ($W_2$ Upper Bound).} \textit{Let $\mathbf{Z} = (Z_1, \dots, Z_d)^\top$ be 
a vector of mutually independent latent sources, and let $\mathbf{w}$ be a weight vector on the unit sphere ($\sum_{i=1}^d w_i^2 = 1$). 
The squared Wasserstein distance between the linear mixture $\mathbf{w} \cdot \mathbf{Z}$ and the 
standard normal distribution $\Gamma$ is bounded by the weighted sum of the source distances to $\Gamma$:}
\begin{equation}
    W_2^2(\mathbf{w} \cdot \mathbf{Z}, \Gamma) \le \sum_{i=1}^d w_i^2 W_2^2(Z_i, \Gamma).
\end{equation}

\textit{Proof.} We construct a specific joint probability distribution, or coupling, using a common source of randomness. 
Let $\mathbf{N} = (N_1, \dots, N_d)^\top$ be a vector of $d$ independent standard normal variables, $\mathbf{N} \sim \mathcal{N}(0, \mathbf{I}_d)$, 
assumed to be statistically independent of the sources $\mathbf{Z}$. For each independent latent source $Z_i$, 
there exists an optimal transport map $T_i$ that pushes the standard normal distribution forward onto the distribution of $Z_i$, 
denoted as $(T_i)_{\#} \Gamma = P_{Z_i}$. 

We construct a random variable $X$ representing the mixture using the transport maps evaluated on the common noise vector:
\begin{equation}
    X = \sum_{i=1}^d w_i T_i(N_i).
\end{equation}
Because $T_i(N_i)$ follows the distribution of $Z_i$, $X$ models the weighted mixture $\mathbf{w} \cdot \mathbf{Z}$. 
To compare this mixture to a Gaussian distribution, we construct a target variable $Y$ using the same underlying noise vector $\mathbf{N}$ and 
weights $\mathbf{w}$:
\begin{equation}
    Y = \sum_{i=1}^d w_i N_i.
\end{equation}
Given that the squared weights sum to unity ($\sum_i w_i^2 = 1$) and the components $N_i$ are independent standard normals, 
the resulting variable $Y$ follows the standard normal distribution, $Y \sim \Gamma$. 
The pair $(X, Y)$ creates a valid coupling between the mixture distribution and the Gaussian target.

The squared Wasserstein distance is the infimum of the expected squared cost over all valid couplings. 
Therefore, the optimal distance is less than or equal to the cost of our constructed coupling $(X, Y)$, yielding the inequality:
\begin{equation}
    W_2^2(\mathbf{w} \cdot \mathbf{Z}, \Gamma) \le \mathbb{E}\left[|X - Y|^2\right].
\end{equation}
Expanding the squared difference yields:
\begin{align}
    |X - Y|^2 &= \left| \sum_{i=1}^d w_i T_i(N_i) - \sum_{i=1}^d w_i N_i \right|^2 \nonumber \\
              &= \left( \sum_{i=1}^d w_i \big(T_i(N_i) - N_i\big) \right)^2.
\end{align}
Squaring the summation produces diagonal terms ($i=j$) and cross terms ($i \neq j$):
\begin{equation}
    |X - Y|^2 = \sum_{i=1}^d w_i^2 \big(T_i(N_i) - N_i\big)^2 + \sum_{i \neq j} w_i w_j \big(T_i(N_i) - N_i\big)\big(T_j(N_j) - N_j\big).
\end{equation}
Taking the expectation simplifies the expression. Because the underlying noise variables $N_i$ and $N_j$ are independent for $i \neq j$, 
each cross expectation factorizes as $\mathbb{E}[T_i(N_i) - N_i]\,\mathbb{E}[T_j(N_j) - N_j]$; since the centered sources $T_i(N_i) \sim Z_i$ and the noise $N_i$ both have zero mean, 
every such factor vanishes. We are left with the expectation of the diagonal terms:
\begin{equation}
    \mathbb{E}\left[|X - Y|^2\right] = \sum_{i=1}^d w_i^2 \mathbb{E}\left[ \big(T_i(N_i) - N_i\big)^2 \right].
\end{equation}
Recognizing that $\mathbb{E}[ (T_i(N_i) - N_i)^2 ]$ is the definition of the squared Wasserstein distance between 
the individual source $Z_i$ and $\Gamma$, we arrive at the bound:
\begin{equation}
    W_2^2(\mathbf{w} \cdot \mathbf{Z}, \Gamma) \le \sum_{i=1}^d w_i^2 W_2^2(Z_i, \Gamma). \quad \blacksquare
\end{equation}
This property demonstrates that mixing independent sources mathematically reduces the Wasserstein distance to Gaussianity. 
Maximizing the left-hand side forces the weight vector $\mathbf{w}$ to collapse toward a canonical axis, isolating a single independent source.
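
The bound can also be checked numerically. The sketch below reuses the illustrative \texttt{w22\_to\_gaussian} routine from Chapter 2; the Laplace sources and the weight vector are arbitrary choices satisfying the unit-norm constraint.

\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
Z = rng.laplace(size=(2, N)) / np.sqrt(2)   # two unit-variance Laplace sources
w = np.array([0.6, 0.8])                    # w_1^2 + w_2^2 = 1

lhs = w22_to_gaussian(w @ Z)                                    # mixture side of the bound
rhs = sum(wi**2 * w22_to_gaussian(Zi) for wi, Zi in zip(w, Z))  # weighted source side
print(lhs, "<=", rhs)                       # Theorem 3.1 (strict for non-Gaussian sources)
\end{lstlisting}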

\subsection{Comonotonicity and the Proof of Strict Inequality}
Optimizing ICA requires establishing that mixtures are strictly less non-Gaussian than the original sources. 
We determine the conditions under which the upper bound becomes a strict inequality.

\vspace{0.5em}
\noindent\textbf{Theorem 3.2 (Strict $W_2$ Inequality for Non-Gaussian Sources).} \textit{Assuming at most one source $Z_i$ is Gaussian, 
and the source distributions possess smooth and strictly positive densities, the upper bound derived in Theorem 3.1 is 
a strict inequality for any valid mixture where multiple weights $w_i$ are non-zero:}
\begin{equation}
    W_2^2(\mathbf{w} \cdot \mathbf{Z}, \Gamma) < \sum_{i=1}^d w_i^2 W_2^2(Z_i, \Gamma).
\end{equation}

\textit{Proof.} In a one-dimensional setting, the $W_2$ distance equals the expected squared difference of a specific coupling 
if and only if the random variables are comonotonic. By Brenier's Theorem \cite{brenier1991}, this implies there exists a strictly increasing, 
deterministic function mapping one variable to the other. For the constructed mixture variable $X$ and the Gaussian target $Y$ to 
be comonotonic with respect to the shared noise vector $\mathbf{N} \in \mathbb{R}^d$, their gradients must be parallel at 
every point in the probability space. This requires the existence of a scalar function $\lambda(\mathbf{N})$ such that:
\begin{equation}
    \nabla X(\mathbf{N}) = \lambda(\mathbf{N}) \nabla Y(\mathbf{N}).
\end{equation}

Assuming the source distributions possess smooth and strictly positive densities, Caffarelli's regularity theory \cite{caffarelli1992} 
guarantees that the optimal transport maps $T_i$ are continuously differentiable. We compute the gradients of our 
constructed variables with respect to $\mathbf{N}$. The gradient of the Gaussian target is the weight vector:
\begin{equation}
    Y(\mathbf{N}) = \sum_{i=1}^d w_i N_i \implies \nabla Y = \mathbf{w}.
\end{equation}
The gradient of the source mixture leverages the derivatives of the individual transport maps:
\begin{equation}
    X(\mathbf{N}) = \sum_{i=1}^d w_i T_i(N_i) \implies \nabla X = \begin{bmatrix} w_1 T'_1(N_1) \\ w_2 T'_2(N_2) \\ \vdots \\ w_d T'_d(N_d) \end{bmatrix}.
\end{equation}

Substituting these gradients into the comonotonicity condition yields the system of equations:
\begin{equation}
    \nabla X = \lambda(\mathbf{N}) \mathbf{w}.
\end{equation}
To determine if this condition holds, we differentiate both the Left-Hand Side (LHS) and the Right-Hand Side (RHS) 
with respect to $\mathbf{N}$ to compute their Hessian matrices.

On the LHS, because $X$ separates into independent terms $w_i T_i(N_i)$, the cross-derivatives $\frac{\partial^2 X}{\partial N_i \partial N_j}$ 
are zero for all $i \neq j$. This yields a diagonal matrix containing the second derivatives of the transport maps. On the RHS, 
we differentiate the product $\lambda(\mathbf{N}) \mathbf{w}$. Applying the product rule yields the outer product of $\mathbf{w}$ 
and $\nabla \lambda$, forming a Rank-1 matrix.
Equating the matrices:
\begin{equation}
    \underbrace{
    \begin{bmatrix} 
    w_1 T''_1(N_1) & 0 & \cdots & 0 \\
    0 & w_2 T''_2(N_2) & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & w_d T''_d(N_d)
    \end{bmatrix}
    }_{\text{Hessian of } X \text{ (Diagonal)}}
    =
    \underbrace{
    \begin{bmatrix}
    w_1 \frac{\partial \lambda}{\partial N_1} & w_1 \frac{\partial \lambda}{\partial N_2} & \cdots & w_1 \frac{\partial \lambda}{\partial N_d} \\
    w_2 \frac{\partial \lambda}{\partial N_1} & w_2 \frac{\partial \lambda}{\partial N_2} & \cdots & w_2 \frac{\partial \lambda}{\partial N_d} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_d \frac{\partial \lambda}{\partial N_1} & w_d \frac{\partial \lambda}{\partial N_2} & \cdots & w_d \frac{\partial \lambda}{\partial N_d}
    \end{bmatrix}
    }_{\text{Outer Product } \mathbf{w} (\nabla \lambda)^\top \text{ (Rank-1)}}
\end{equation}

Examining any off-diagonal entry located at row $i$ and column $j$ ($i \neq j$), the diagonal LHS matrix dictates that 
this entry must be exactly zero. The RHS Rank-1 matrix evaluates this entry as $w_i \frac{\partial \lambda}{\partial N_j}$, establishing the requirement:
\begin{equation}
    w_i \frac{\partial \lambda}{\partial N_j} = 0 \quad \text{for all } i \neq j.
\end{equation}
Because we assume a mixture where multiple component weights are non-zero ($w_i \neq 0$), it follows that $\frac{\partial \lambda}{\partial N_j} = 0$. 
Since this holds across all dimensions, the gradient of the scaling function is zero ($\nabla \lambda = \mathbf{0}$). 
Consequently, $\lambda(\mathbf{N})$ is a global scalar constant, $\lambda$.

Equating the diagonal elements implies $T'_i(N_i) = \lambda$. Integrating this relationship indicates 
the optimal transport maps are linear functions of the form:
\begin{equation}
    T_i(x) = \lambda x + c.
\end{equation}
Applying a purely linear transformation to Gaussian noise $N_i$ produces another Gaussian distribution. 
The identifiability assumption of ICA dictates that the original latent sources $Z_i$ are non-Gaussian, 
which requires their optimal transport mappings from a Gaussian base to be non-linear. Therefore, the linear map requirement 
contradicts the non-Gaussianity of the sources. Because the condition for equality cannot be met, the relationship is a strict inequality. 

\begin{equation}
    W_2^2(\mathbf{w} \cdot \mathbf{Z}, \Gamma) < \sum_{i=1}^d w_i^2 W_2^2(Z_i, \Gamma). \quad \blacksquare
\end{equation} 

\section{Empirical Landscape: $W_1$ versus $W_2^2$}
Having established the theoretical foundations of the $W_2^2$ metric, we evaluate its practical suitability for manifold optimization. 
In the context of ICA, gradient ascent algorithms must climb the objective landscape to locate the maxima of the distance metric 
and thereby isolate the independent sources.

While the $W_1$ distance (Earth Mover's Distance) is robust to outliers, its reliance on an $L_1$ absolute penalty 
introduces geometric challenges when applied to finite empirical samples. We demonstrate why the squared $L_2$ 
penalty of the $W_2^2$ metric is preferable for gradient-based convergence.

\subsection{Empirical Gradients and Derivative Volatility}
To observe this behavior, we generated a synthetic bivariate mixture of Student-t distributions ($\nu=4$) and 
evaluated both the raw distances and their first-order numerical gradients across a continuous sweep of 
rotation angles $\theta \in [0, \pi]$. Because the unmixing matrix restricts the search to orthogonal vectors, 
the two true independent components are geometrically separated by exactly $\pi/2$ radians ($90^\circ$).
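
The following simplified NumPy/SciPy sketch reproduces the structure of this sweep; for illustration it takes the whitened data to coincide with the sources (identity mixing), so the true components lie exactly at $\theta = 0$ and $\theta = \pi/2$, and it uses \texttt{scipy.stats.wasserstein\_distance} for the empirical $W_1$.

\begin{lstlisting}[language=Python]
import numpy as np
from scipy.stats import norm, wasserstein_distance

rng = np.random.default_rng(2)
N = 5_000
S = rng.standard_t(df=4, size=(2, N))
S /= S.std(axis=1, keepdims=True)                 # approximately whitened sources

thetas = np.linspace(0.0, np.pi, 181)
q = norm.ppf((np.arange(1, N + 1) - 0.5) / N)     # discretized N(0,1) quantiles
gauss_sample = rng.normal(size=N)                 # reference sample for W_1

w1_curve, w22_curve = [], []
for theta in thetas:
    u = np.array([np.cos(theta), np.sin(theta)])  # projection direction on the circle
    y = u @ S
    w1_curve.append(wasserstein_distance(y, gauss_sample))
    w22_curve.append(np.mean((np.sort(y) - q) ** 2))
\end{lstlisting}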

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{w1_vs_w2_derivatives.png} 
    \caption[Gradient Landscapes of $W_1$ and $W_2^2$]{The empirical optimization landscapes (top) and their normalized first derivatives (bottom). 
    While both metrics peak near the true independent components ($\theta \approx 0, \pi/2, \pi$), their gradient behaviors differ. 
    The $W_2^2$ gradient smoothly approaches zero, whereas the $W_1$ gradient exhibits jagged volatility on finite samples.}
    \label{fig:w1_vs_w2_derivatives}
\end{figure}

The empirical results, visualized in Figure \ref{fig:w1_vs_w2_derivatives}, illustrate the behaviors of 
the two metrics during continuous gradient ascent. Because the $W_1$ cost scales only linearly with distance, 
the resulting optimization landscape lacks the local curvature needed to smooth out finite-sample noise.

The consequence of this lack of curvature is evident in the bottom gradient profile. Because $W_1$ relies on 
piecewise linear sorting mechanics without a squaring penalty, its empirical derivative fluctuates frequently 
as the projection angle rotates. Consequently, first-order gradient ascent solvers will encounter a jagged derivative surface, 
which hinders convergence to the true peak. This instability is exacerbated in higher dimensions, where the $W_1$ 
derivative surface becomes even more jagged and difficult to traverse. Furthermore, this volatility limits 
the utility of second-order quasi-Newton approximations (such as L-BFGS), as a stable Hessian approximation cannot be reliably estimated 
from such a volatile gradient.

In contrast, the squared penalty of the $W_2^2$ metric acts as a natural smoother by penalizing local geometric distortions. 
As seen in the gradient profiles, the $W_2^2$ derivative behaves continuously, decelerating predictably as it approaches the optimum. 
This retention of curvature provides a reliable signal for gradient-based solvers, allowing the OT-ICA framework to converge consistently.


% -----------------------------------------------------------------
% CHAPTER 4: ALGORITHMIC DESIGN AND OPTIMIZATION
% -----------------------------------------------------------------
\chapter{Algorithmic Enhancements}

\section{The Baseline OT-ICA Algorithm}
Having established the theoretical optimality of the squared Wasserstein distance ($W_2^2$) for Independent Component Analysis, 
the challenge transitions to maximizing this objective efficiently. The baseline OT-ICA algorithm proceeds by first centering and whitening 
the observed mixture $\mathbf{X} \in \mathbb{R}^{d \times N}$, yielding the whitened data $\widetilde{\mathbf{X}}$. The algorithm then initializes 
an orthogonal unmixing matrix $\mathbf{W} \in \mathbb{R}^{d \times d}$. In the forward pass, the data is projected onto the candidate components, 
$\mathbf{Y} = \mathbf{W}\widetilde{\mathbf{X}}$. For each row (component) of $\mathbf{Y}$, the empirical cumulative distribution function is 
constructed by sorting the $N$ observations. The $W_2^2$ distance is then computed by matching these sorted empirical quantiles to the 
corresponding quantiles of a standard normal distribution. 

While theoretically sound, the standard implementation of this gradient-based optimization loop faces computational and geometric limitations. 
The exact optimal transport matching requires sorting at every iteration, presenting a computational bottleneck. 
Furthermore, standard gradient updates do not inherently respect the strict orthogonality constraint of the mixing matrix, 
and discrete approximations of the target Gaussian introduce gradient noise. To make OT-ICA computationally competitive with existing methods, 
several specific algorithmic and geometric enhancements are required.

\section{Computational Bottlenecks and Solutions}

\subsection{Exact Analytical Gaussian Targets}
One source of instability in the baseline algorithm is the discrete approximation of the target Gaussian distribution. 
Traditionally, the target quantiles are approximated by evaluating the inverse cumulative distribution function at evenly spaced intervals, 
such as $q_i = \Phi^{-1}((i - 0.5)/N)$. However, representing a continuous, infinite-tailed Gaussian distribution with a finite set of discrete points 
introduces approximation noise. When the algorithm computes the gradient of the $W_2^2$ distance, this noise translates into a non-smooth optimization landscape, 
hindering convergence as the components approach independence.

To eliminate this approximation noise, we replace the point-sampled targets with an exact analytical formulation. 
We compute the expected value of a standard normal variable within each discrete quantile bin. 

\vspace{0.5em}
\noindent\textbf{Definition 4.1 (Exact Analytical Gaussian Target).} \textit{Let the uniform probability bin edges for $N$ samples be 
defined as $p_i = \frac{i}{N}$ for $i = 0, \dots, N$, with corresponding Gaussian domain boundaries $z_i = \Phi^{-1}(p_i)$, where $\Phi$ is 
the standard normal cumulative distribution function. The exact analytical target value $T_i$ for the $i$-th sorted sample is the conditional expectation within that interval:}
\begin{equation}
    T_i = N \int_{z_{i-1}}^{z_i} x \phi(x) \, dx
\end{equation}
\textit{where $\phi(x)$ is the standard normal probability density function. Evaluating this integral yields the closed-form target:}
\begin{equation}
    T_i = N \big( \phi(z_{i-1}) - \phi(z_i) \big).
\end{equation}
\vspace{0.5em}

By computing this analytical target once during initialization, the algorithm evaluates the $W_2^2$ distance against an exact discrete representation of a Gaussian, 
ensuring smooth gradients near the optimum.
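
A minimal SciPy sketch of Definition 4.1 is given below; the unbounded first and last bins are handled automatically since $\phi(\pm\infty) = 0$.

\begin{lstlisting}[language=Python]
import numpy as np
from scipy.stats import norm

def analytical_gaussian_target(N):
    """Per-bin conditional expectations of N(0,1): T_i = N (phi(z_{i-1}) - phi(z_i))."""
    z = norm.ppf(np.arange(N + 1) / N)     # bin boundaries z_i = Phi^{-1}(i/N), incl. +-inf
    phi = norm.pdf(z)                      # phi(-inf) = phi(+inf) = 0
    return N * (phi[:-1] - phi[1:])        # sorted targets T_1, ..., T_N
\end{lstlisting}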

\subsection{Batched Vectorization for Restarts}
Unlike Principal Component Analysis, which yields a unique global solution via singular value decomposition, 
the ICA optimization landscape is non-convex and characterized by local optima. Consequently, 
identifying the true independent components requires multiple random initializations (restarts) to explore the space. 

Because evaluating the $W_2^2$ objective requires an $\mathcal{O}(N \log N)$ sorting operation for each candidate component, 
executing these random restarts sequentially via standard iterative loops is computationally expensive. To resolve this, we utilize a batched vectorization strategy. 
Rather than maintaining a single weight matrix, we aggregate $B$ random restarts into a single high-dimensional tensor $\mathbf{W}_{\text{batch}} \in \mathbb{R}^{B \times d \times d}$. 

This formulation allows the projection $\mathbf{Y} = \mathbf{W}_{\text{batch}} \widetilde{\mathbf{X}}$ to be computed simultaneously for all $B$ trajectories. 
The elements of the resulting batch tensor can be sorted in parallel. 
The pre-computed exact analytical target vector $\mathbf{T}$ is subsequently broadcast-subtracted across the entire batch to compute the distances and gradients in a single step.
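
A minimal NumPy sketch of this batched forward pass is shown below; random data stands in for the whitened mixture, and the target vector reuses the illustrative \texttt{analytical\_gaussian\_target} routine sketched above.

\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(3)
B, d, N = 32, 3, 10_000

# B random orthogonal initializations stacked into one (B, d, d) tensor.
W_batch = np.stack([np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(B)])

X_tilde = rng.normal(size=(d, N))          # stand-in for the whitened d x N data
Y = W_batch @ X_tilde                      # (B, d, N): projections for all restarts at once
Y_sorted = np.sort(Y, axis=-1)             # parallel sort along the sample axis
T = analytical_gaussian_target(N)          # exact Gaussian target (broadcast below)
scores = ((Y_sorted - T) ** 2).mean(axis=-1).sum(axis=-1)   # W_2^2 summed over components
best = np.argmax(scores)                   # restart with the largest transport cost
\end{lstlisting}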

\section{Optimization on the Stiefel Manifold}

\subsection{The Riemannian Gradient}
Because the input data is whitened, the unmixing matrix $\mathbf{W}$ must remain orthogonal, satisfying $\mathbf{W}\mathbf{W}^\top = \mathbf{I}$. 
The geometric space of all such orthogonal matrices forms the Stiefel manifold, denoted $\mathcal{S}$. If we compute the standard Euclidean gradient of the squared Wasserstein distance, 
$\mathbf{G} = \nabla_{\mathbf{W}} W_2^2$, this vector points into the unconstrained ambient space $\mathbb{R}^{d \times d}$. 
Updating the matrix using this Euclidean gradient would violate the orthogonality constraint.

To navigate the geometry of the Stiefel manifold, we project the Euclidean gradient onto the tangent space of the manifold at the current point $\mathbf{W}$ (Figure \ref{fig:stiefel_manifold}). 

\begin{figure}[tbp]
    \centering
    \includegraphics[width=0.8\textwidth]{stiefel_manifold_projection.png}
    \caption[Riemannian Optimization on the Stiefel Manifold]{Geometric sketch of optimization on the Stiefel manifold $\mathcal{S}$. 
    The Euclidean gradient $\mathbf{G}$ is projected onto the tangent space $T_{\mathbf{W}}\mathcal{S}$ to yield the Riemannian gradient $\nabla_{\mathcal{S}} \mathbf{W}$. 
    Following a step along the tangent plane, a retraction maps the estimate back to the orthogonal manifold surface at $\mathbf{W}_{\text{new}}$.}
    \label{fig:stiefel_manifold}
\end{figure}

\vspace{0.5em}
\noindent\textbf{Definition 4.2 (Riemannian Gradient on the Stiefel Manifold).} \textit{Given a Euclidean gradient $\mathbf{G}$, 
the Riemannian gradient $\nabla_{\mathcal{S}} \mathbf{W}$ that projects $\mathbf{G}$ onto the tangent space of the Stiefel manifold 
at $\mathbf{W}$ is given by \cite[Chapter 3]{absil2008optimization}:}
\begin{equation}
    \nabla_{\mathcal{S}} \mathbf{W} = \mathbf{G} - \frac{1}{2}\big(\mathbf{G}\mathbf{W}^\top + \mathbf{W}\mathbf{G}^\top \big)\mathbf{W}.
\end{equation}
\vspace{0.5em}

Geometrically, the term $(\mathbf{G}\mathbf{W}^\top + \mathbf{W}\mathbf{G}^\top)$ measures the symmetric violation of orthogonality induced 
by the Euclidean gradient. By subtracting this violation, we ensure that the update step aligns with the curvature of the manifold, preserving the decorrelation of the components.
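
A direct NumPy transcription of Definition 4.2 reads:

\begin{lstlisting}[language=Python]
import numpy as np

def riemannian_gradient(G, W):
    """Project the Euclidean gradient G onto the tangent space of the
    Stiefel manifold at the orthogonal matrix W (Definition 4.2)."""
    sym = G @ W.T + W @ G.T          # symmetric measure of the orthogonality violation
    return G - 0.5 * sym @ W
\end{lstlisting}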

\subsection{Retraction via Symmetric Decorrelation}
Although the Riemannian gradient ensures the update step is tangential to the Stiefel manifold, 
moving along this tangent plane for a finite step size $\eta$ causes the new matrix $\mathbf{W}_{\text{step}} = \mathbf{W} - \eta \nabla_{\mathcal{S}} \mathbf{W}$ 
to depart from the curved manifold surface. To restore exact orthogonality, the matrix must be mapped back onto the manifold 
via a retraction \cite[Chapter 4]{absil2008optimization}.

%While the Gram-Schmidt orthogonalization process is a common approach, it is asymmetric; it fixes the first vector and modifies subsequent vectors to fit, 
%which introduces bias in simultaneous optimization. Instead, we utilize symmetric decorrelation as the retraction mechanism.

\vspace{0.5em}
\noindent\textbf{Definition 4.3 (Symmetric Decorrelation Retraction).} \textit{Let $\mathbf{W}_{\text{step}}$ be the unmixing matrix after a tangent space update. 
The retraction mapping the matrix back to the orthogonal Stiefel manifold surface is:}
\begin{equation}
    \mathbf{W}_{\text{new}} = (\mathbf{W}_{\text{step}}\mathbf{W}_{\text{step}}^\top)^{-1/2} \mathbf{W}_{\text{step}}.
\end{equation}
%\vspace{0.5em}

This operation computes the overlap covariance of the stepped matrix and applies a uniform adjustment to all vectors simultaneously. 
This ensures that no single independent component is prioritized during the retraction process, maintaining the stability of the batched optimization.
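
A minimal NumPy sketch of this retraction, computing the matrix inverse square root through an eigendecomposition of the overlap covariance:

\begin{lstlisting}[language=Python]
import numpy as np

def symmetric_retraction(W_step):
    """Map a stepped matrix back onto the Stiefel manifold:
    W_new = (W_step W_step^T)^{-1/2} W_step (Definition 4.3)."""
    eigval, V = np.linalg.eigh(W_step @ W_step.T)   # overlap covariance (symmetric PD)
    inv_sqrt = (V * eigval ** -0.5) @ V.T           # matrix inverse square root
    return inv_sqrt @ W_step
\end{lstlisting}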

\section{Total Complexity and Statistical Efficiency}
Having detailed the sorting-based objective and the Riemannian optimization steps, we formalize the total computational cost of the OT-ICA framework. 
For a dataset of dimension $d$ with $N$ samples, $K$ random restarts, and $T$ optimization iterations, the per-iteration complexity is dominated by:
\begin{equation}
    \mathcal{O}\big(K \cdot (d \cdot N \log N + d^3)\big).
\end{equation}
The $N \log N$ term stems from the sorting of projections, while the $d^3$ term represents the cost of the symmetric decorrelation (matrix inverse square root) 
used for retraction on the Stiefel manifold. For typical ICA applications where the number of samples significantly exceeds the dimensionality 
($N \gg d$), this $d^3$ cost is computationally negligible. It becomes a bottleneck only in extreme high-dimensional regimes, a scenario that is 
independently precluded by the statistical limits of the framework.

Beyond temporal complexity, we must account for the statistical sample complexity. While the 1D projection approach maintains a convergence rate 
of $\mathcal{O}(N^{-1/2})$, the global identification of the mixing matrix is implicitly subject to the curse of dimensionality. Theoretical results in optimal transport establish 
that the empirical Wasserstein distance in $d$ dimensions converges at a rate of $\mathcal{O}(N^{-1/d})$ \cite{fournier2015rate}. 
In practice, as $d$ increases, the number of samples $N$ required to 
accurately resolve the maxima of non-Gaussianity on the manifold surface grows. This necessitates a corresponding increase in the number of parallel restarts $K$, creating 
a trade-off between statistical resolution and computational feasibility.

\section{Navigating Discrete Optimization Landscapes}
While the OT-ICA framework demonstrates convergence on continuous distributions, data often contains discrete signals, such as binary states 
or count-based events. Applying continuous optimal transport directly to discrete mixtures introduces geometric properties 
that complicate the optimization landscape.

\subsection{The Discontinuous Landscape of Discrete CDFs}
The exact analytical Gaussian target developed in Section 4.2 relies on matching the empirical cumulative distribution function (CDF) of the projected data. 
For a continuous variable, this empirical CDF approximates a strictly increasing curve. However, for highly discrete data, any one-dimensional 
projection creates a step-wise CDF. 

This discreteness disrupts the continuous geometry of the $W_2$ metric due to the mechanics of sorting. Evaluating the Wasserstein distance requires 
sorting the projected points. As the optimization algorithm rotates the unmixing matrix $\mathbf{W}$, dense clusters of discrete data points shift relative to one another, 
causing the sorting order to change abruptly. Every change in the sorting permutation causes a discontinuous shift in the gradient direction. 
Consequently, the theoretically smooth $W_2$ landscape becomes a non-differentiable surface filled with local optima. 
The optimizer frequently converges to spurious peaks (often corresponding to diagonal projections of the discrete hypercube), leading to unmixing failures.

\subsection{Continuous Smoothing via Gaussian Dithering}
To restore the smooth gradients of the Stiefel manifold without resorting to computationally intensive $\mathcal{O}(N^2)$ entropic regularization techniques 
(such as Sinkhorn distances), we implement a Gaussian dithering step. 

Adapted from signal processing techniques used to decorrelate quantization error \cite{schuchman1964dither}, we inject a small amount of continuous, zero-mean Gaussian noise 
($\sigma_{\text{dither}} \approx 0.01$) into the projected components prior to the sorting operation in the forward pass. 
Mathematically, this operation convolves the discrete empirical distribution with a narrow Gaussian kernel (Figure \ref{fig:gaussian_dithering}).

This deforms the step-wise discrete CDF into a strictly increasing, continuous curve. By eliminating non-differentiable sorting ties, 
Gaussian dithering restores stable gradients while preserving the $\mathcal{O}(N \log N)$ computational efficiency of the 1D optimal transport formulation.

This dithering process does not constitute the injection of restrictive prior knowledge regarding the source distributions. 
Rather, it acts as an uninformed regularizer applied uniformly to the 1D projections $\mathbf{w}^\top \mathbf{X}$. 
For latent sources that are already continuous, this small perturbation ($\sigma_{\text{dither}} \approx 0.01$) is 
statistically negligible and preserves the underlying Wasserstein geometry. For discrete sources, it prevents sorting failures.
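
The mechanism can be illustrated in isolation: for a discrete projection, dithering breaks the ties that make the sorting permutation unstable. The binomial source used below is purely illustrative.

\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(3, 0.5, size=5_000).astype(float)     # discrete 1D projection
print(np.unique(y).size)      # only a handful of distinct values -> sorting ties

y_dithered = y + rng.normal(0.0, 0.01, size=y.shape)   # Gaussian dithering
print(np.unique(y_dithered).size)  # ties broken: one distinct value per sample

# The sorted, dithered projection defines a strictly increasing empirical
# quantile function, so the W_2 gradient no longer jumps abruptly whenever
# the sorting permutation changes.
y_sorted = np.sort(y_dithered)
\end{lstlisting}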

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{gaussian_dithering_cdf.png}
    \caption[Gaussian Dithering of Discrete CDFs]{The effect of algorithmic Gaussian Dithering. 
    The non-differentiable staircase of a discrete empirical CDF (gray) is convolved with a narrow Gaussian kernel, 
    yielding a strictly increasing, continuous curve (black) that restores stable Wasserstein gradients.}
    \label{fig:gaussian_dithering}
\end{figure}

From a theoretical perspective, this dithering process represents a form of kernel density regularization. Mathematically, 
adding Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma_{\text{dither}}^2)$ to the projection $Y$ is equivalent to convolving the empirical measure $\mu_N$ with 
a Gaussian kernel $\phi_{\sigma}$:
\begin{equation}
    \hat{\mu}_{\sigma} = \mu_N * \phi_{\sigma}
\end{equation}
This operation transforms the discrete measure into a smooth density. By Brenier's theorem \cite{brenier1991} and Caffarelli's regularity results \cite{caffarelli1992}, 
the resulting optimal transport map is unique and continuously differentiable. Furthermore, this approach is analogous 
to entropic regularization used in Sinkhorn distances, where a blur parameter is introduced to make the Wasserstein distance differentiable. 
By utilizing 1D dithering, we approximate the smoothness of Sinkhorn divergence while retaining the computational efficiency of sorted exact transport.

\subsection{Escaping Local Minima with Stochastic Mini-Batching}
Even with a smoothed CDF, the global optimization landscape of a discrete mixture remains non-convex. Evaluating the exact $W_2$ distance using the 
full dataset yields deterministic convergence to the nearest local optimum. In a non-convex landscape riddled with spurious optima induced by discreteness, 
this deterministic behavior prevents global convergence.

To address this, we utilize stochastic mini-batching. At each optimization step, the algorithm evaluates 
the Wasserstein gradients using a randomly sampled subset of the data (e.g., $N_{\text{batch}} = 512$ or $1024$). This stochastic subsampling introduces gradient noise, 
acting similarly to simulated annealing. A local optimum of the full-dataset objective may present a different gradient profile for a random subset, 
so the injected gradient noise allows the optimizer to escape spurious local optima. The combination of Gaussian dithering to smooth 
the discrete steps and stochastic mini-batching to navigate the non-convex topology enables OT-ICA to separate mixtures containing discrete sources.

\section{OT-ICA Algorithm Overview}
The theoretical components and algorithmic enhancements are synthesized into the final OT-ICA training procedure. 
The pseudo-code for extracting independent non-Gaussian components from a mixed dataset using stochastic Riemannian optimization is detailed below (Algorithm \ref{alg:ot_ica}).

\begin{algorithm}[H]
\caption{The Optimal Transport ICA (OT-ICA) Algorithm}
\label{alg:ot_ica}
\begin{algorithmic}[1]
\Require Observed mixture $\mathbf{X} \in \mathbb{R}^{d \times N}$, iterations $K$, learning rate $\eta$, batch size $B$, dither noise $\sigma_{\text{dither}}$
\Ensure Estimated unmixing matrix $\mathbf{W}_{\text{final}} \in \mathbb{R}^{d \times d}$
\State \textbf{Phase 1: Preprocessing \& Target Generation}
\State Center the data: $\mathbf{X} \gets \mathbf{X} - \mathbb{E}[\mathbf{X}]$
\State Whiten the data: $\widetilde{\mathbf{X}} \gets (\mathbf{X}\mathbf{X}^\top)^{-1/2} \mathbf{X}$ \Comment{Yields unit variance, uncorrelated data}
\State Compute analytical target $\mathbf{T} \in \mathbb{R}^{B}$ via exact Gaussian CDF integration
\State \textbf{Phase 2: Initialization}
\State Initialize $\mathbf{W} \in \mathbb{R}^{d \times d}$ randomly from $\mathcal{N}(0,1)$
\State Apply symmetric decorrelation:
\Statex \qquad $\mathbf{W} \gets (\mathbf{W}\mathbf{W}^\top)^{-1/2} \mathbf{W}$ \Comment{Initialize on Stiefel manifold}
\State \textbf{Phase 3: Stochastic Riemannian Optimization}
\For{$k = 1$ to $K$}
    \State Sample stochastic mini-batch $\widetilde{\mathbf{X}}_{\text{batch}} \in \mathbb{R}^{d \times B}$ from $\widetilde{\mathbf{X}}$
    \State Project data onto candidates: $\mathbf{Y} \gets \mathbf{W} \widetilde{\mathbf{X}}_{\text{batch}}$
    \State Apply Gaussian Dithering: $\mathbf{Y}_{\text{dither}} \gets \mathbf{Y} + \mathcal{N}(0, \sigma_{\text{dither}}^2)$
    \State Sort each row of $\mathbf{Y}_{\text{dither}}$ ascending to yield empirical quantiles $\mathbf{Y}_{\text{sorted}}$
    \State Compute Wasserstein objective (maximize non-Gaussianity):
    \Statex \qquad $\mathcal{L} = \frac{1}{B} \sum_{i=1}^{d} ||\mathbf{Y}_{\text{sorted}}^{(i)} - \mathbf{T}||_2^2$
    \State Backpropagate to compute Euclidean gradient: $\mathbf{G} = \nabla_{\mathbf{W}} \mathcal{L}$
    \State Project to tangent space (Riemannian Gradient): 
    \Statex \qquad $\nabla_{\mathcal{S}} \mathbf{W} = \mathbf{G} - \frac{1}{2}\big(\mathbf{G}\mathbf{W}^\top + \mathbf{W}\mathbf{G}^\top \big)\mathbf{W}$
    \State Update matrix along tangent plane:
    \Statex \qquad $\mathbf{W}_{\text{step}} = \mathbf{W} + \eta \nabla_{\mathcal{S}} \mathbf{W}$ \Comment{Gradient Ascent}
    \State Retract to Stiefel manifold via symmetric decorrelation: 
    \Statex \qquad $\mathbf{W} \gets (\mathbf{W}_{\text{step}}\mathbf{W}_{\text{step}}^\top)^{-1/2} \mathbf{W}_{\text{step}}$
    \State Optional: Apply learning rate decay $\eta \gets \eta \cdot 0.99$
\EndFor
\State \textbf{Phase 4: Finalization}
\State $\mathbf{W}_{\text{final}} \gets \mathbf{W} (\mathbf{X}\mathbf{X}^\top)^{-1/2}$ \Comment{Map back to original data space}
\State \Return $\mathbf{W}_{\text{final}}$
\end{algorithmic}
\end{algorithm}
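
A self-contained sketch of the optimization loop (Phases 2 and 3) in PyTorch is given below. It is intended purely as an illustration of the mechanics: the exact analytical target of Section 4.2 is replaced by a plain grid of standard-normal quantiles, random restarts are omitted, and the function name \texttt{ot\_ica\_whitened} and all hyperparameter defaults are illustrative.

\begin{lstlisting}[language=Python]
import torch

def sym_decorr(M: torch.Tensor) -> torch.Tensor:
    """Symmetric decorrelation (polar retraction): (M M^T)^{-1/2} M."""
    vals, vecs = torch.linalg.eigh(M @ M.T)
    inv_sqrt = vecs @ torch.diag(vals.clamp_min(1e-12).rsqrt()) @ vecs.T
    return inv_sqrt @ M

def ot_ica_whitened(X_white: torch.Tensor, n_iter: int = 500, lr: float = 0.05,
                    batch: int = 512, sigma_dither: float = 0.01) -> torch.Tensor:
    """Sketch of the stochastic Riemannian loop on whitened data X_white of
    shape (d, N); hyperparameter defaults are illustrative only."""
    d, N = X_white.shape
    # A plain mid-point quantile grid stands in for the exact analytical target.
    p = (torch.arange(batch, dtype=X_white.dtype) + 0.5) / batch
    target = (2.0 ** 0.5) * torch.erfinv(2.0 * p - 1.0)

    W = sym_decorr(torch.randn(d, d, dtype=X_white.dtype))  # start on the manifold
    for _ in range(n_iter):
        idx = torch.randint(0, N, (batch,))
        Xb = X_white[:, idx]                                 # stochastic mini-batch
        W.requires_grad_(True)
        Y = W @ Xb + sigma_dither * torch.randn(d, batch, dtype=X_white.dtype)
        Y_sorted, _ = torch.sort(Y, dim=1)                   # empirical quantiles per row
        loss = ((Y_sorted - target) ** 2).mean()             # proportional to the W_2^2 objective
        (G,) = torch.autograd.grad(loss, W)
        with torch.no_grad():
            rgrad = G - 0.5 * (G @ W.T + W @ G.T) @ W        # Riemannian gradient
            W = sym_decorr(W + lr * rgrad)                   # ascent step + retraction
    return W.detach()
\end{lstlisting}

In practice, this loop is wrapped in multiple random restarts, and the returned matrix is composed with the whitening transform as in Phase 4 of Algorithm \ref{alg:ot_ica}.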


\section{Limitations of Fixed-Point Algorithms for OT-ICA}
A widely used algorithm for ICA, FastICA \cite[Chapter 8]{hyvarinen2001}, achieves efficient convergence by employing a Newton (second-order) fixed-point step. 
FastICA achieves this by maximizing proxy contrast functions that are static and easily differentiable. 
A natural question is whether a similar fixed-point iteration can be derived for the $W_2^2$ objective. 

\vspace{0.5em}
\noindent\textbf{Theorem 4.1 Intractability of Exact Newton Updates.} \textit{For an empirical distribution mapped to a continuous target via sorting, 
computing an exact, closed-form Hessian $\mathbf{H} = \nabla^2_{\mathbf{w}} W_2^2$ is analytically intractable without introducing external density estimation mechanisms. 
The Map Derivative $\nabla_{\mathbf{w}} T_{\mathbf{w}}$ inherently relies upon the rate of change of cumulative probability mass, which strictly requires the continuous 
Probability Density Function (PDF) evaluated at the quantile boundaries.}
\vspace{0.5em}

\textit{Proof.} To implement a Newton step, the exact Hessian matrix of the squared Wasserstein distance, $\mathbf{H} = \nabla^2_{\mathbf{w}} W_2^2$, is required. 
We first derive the first derivative (the gradient) of the objective with respect to the weight vector $\mathbf{w}$. Utilizing the envelope theorem for 
optimal transport \cite[Section 7.2]{santambrogio2015optimal}, the variation of the optimal map evaluates to zero within the cost function, 
allowing us to differentiate the squared cost directly:
\begin{align}
    \nabla_{\mathbf{w}} \left( \frac{1}{2} W_2^2 \right) &= \nabla_{\mathbf{w}} \left( \frac{1}{2} \mathbb{E}\Big[ \big( \mathbf{w}^\top \mathbf{X} - T_{\mathbf{w}}(\mathbf{w}^\top \mathbf{X}) \big)^2 \Big] \right) \nonumber \\
    &= \mathbb{E}\Big[ \big( \mathbf{w}^\top \mathbf{X} - T_{\mathbf{w}}(\mathbf{w}^\top \mathbf{X}) \big) \nabla_{\mathbf{w}} (\mathbf{w}^\top \mathbf{X}) \Big] \nonumber \\
    &= \mathbb{E}\Big[ \mathbf{X} \big( \mathbf{w}^\top \mathbf{X} - T_{\mathbf{w}}(\mathbf{w}^\top \mathbf{X}) \big) \Big].
\end{align}
To compute the Hessian, we differentiate the optimal transport map $T_{\mathbf{w}}(\mathbf{w}^\top \mathbf{X})$ with respect to $\mathbf{w}$. 
Applying the total derivative yields two terms:
\begin{equation}
    \nabla_{\mathbf{w}} [T_{\mathbf{w}}(\mathbf{w}^\top \mathbf{X})] = \underbrace{T'_{\mathbf{w}}(\mathbf{w}^\top \mathbf{X}) \mathbf{X}}_{\text{Argument Derivative}} + \underbrace{(\nabla_{\mathbf{w}} T_{\mathbf{w}})(\mathbf{w}^\top \mathbf{X})}_{\text{Map Derivative}}.
\end{equation}
By Caffarelli's regularity theory \cite{caffarelli1992}, the transport map $T_{\mathbf{w}}$ is continuously differentiable, guaranteeing the existence of the Argument Derivative. 
The limitation lies within the Map Derivative. 

In the 1D optimal transport setting, the transport map to a standard Gaussian relies on the empirical cumulative distribution function, $F_{\mathbf{w}}$, 
of the projected data $y = \mathbf{w}^\top \mathbf{X}$. Expanding the Map Derivative yields:
\begin{equation}
    (\nabla_{\mathbf{w}} T_{\mathbf{w}})(y) = \nabla_{\mathbf{w}} \big[ \Phi^{-1}(F_{\mathbf{w}}(y)) \big] = \frac{1}{\phi(\Phi^{-1}(F_{\mathbf{w}}(y)))} \nabla_{\mathbf{w}} F_{\mathbf{w}}(y).
\end{equation}
The term $\nabla_{\mathbf{w}} F_{\mathbf{w}}(y)$ represents the rate of change of the cumulative probability mass as the projection 
direction $\mathbf{w}$ is rotated. Evaluating this derivative amounts to tracking the probability mass that crosses the boundary $y$ as $\mathbf{w}$ rotates, 
which strictly requires the underlying probability density function (PDF) evaluated at that boundary, rendering the closed-form calculation intractable without explicit density estimation. $\blacksquare$

This theoretical requirement negates the computational advantage of the OT-ICA formulation. The efficiency of 1D optimal 
transport stems from its reliance on sorting (empirical CDFs), circumventing continuous density estimation entirely. 
Requiring the PDF to compute the Hessian necessitates density approximation methods like Kernel Density Estimation (KDE) or 
Normalizing Flows. These methods introduce statistical noise, hyperparameter tuning, and significant computational overhead.

Because computing a true Newton-based fixed-point step for exact OT-ICA is analytically intractable, the chosen strategy for 
the OT-ICA framework utilizes Quasi-Newton methods, specifically the Limited-memory BFGS (L-BFGS) algorithm \cite{liu1989limited} 
adapted for the Stiefel manifold \cite[Chapter 6]{absil2008optimization}. L-BFGS approximates the inverse Hessian curvature by 
observing the history of first-order Wasserstein gradients. This approach allows the algorithm to achieve convergence while preserving 
the sorting-based methodology of the optimal transport formulation.


% -----------------------------------------------------------------
% CHAPTER 5: EXPERIMENTS AND Application
% -----------------------------------------------------------------
\chapter{Experimental Evaluation and Applications}

\section{Evaluation Metrics and Methodology}
\subsection{The Amari Performance Index}
To empirically evaluate the success of an Independent Component Analysis (ICA) algorithm, a quantitative metric is required to measure 
how closely the estimated unmixing matrix $\mathbf{W}_{\text{est}}$ inverts the true mixing matrix $\mathbf{A}$. 
Because ICA is subject to inherent scaling and permutation ambiguities, we utilize the Amari Performance Index \cite{amari1996new}. 

\vspace{0.5em}
\noindent\textbf{Definition 5.1 Amari Performance Index.} \textit{Let $\mathbf{P} = \mathbf{W}_{\text{est}} \mathbf{A}$ be the global transfer matrix. 
For a $d \times d$ matrix $\mathbf{P}$ with elements $p_{ij}$, the Amari error measures the structural divergence from a generalized permutation matrix:}
\begin{equation}
    E(\mathbf{P}) = \frac{1}{2d} \sum_{i=1}^d \left( \sum_{j=1}^d \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \frac{1}{2d} \sum_{j=1}^d \left( \sum_{i=1}^d \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right).
\end{equation}
\vspace{0.5em}

This formulation normalizes each row and column by its maximum absolute value. If $\mathbf{P}$ is a generalized permutation matrix, 
indicating successful source separation, the index evaluates to zero. The experimental evaluations employ specific heuristic thresholds 
to interpret the Amari error (Table \ref{tab:amari_thresholds}).

\begin{table}[htbp]
    \centering
    \begin{tabular}{@{}lp{10cm}@{}}
        \toprule
        \textbf{Amari Error ($E$)} & \textbf{Interpretation / Quality of Separation} \\
        \midrule
        $E < 0.1$ & Excellent to near-perfect source separation. \\
        \addlinespace
        $0.1 \le E < 0.3$ & Good separation; the structural shape of the underlying sources is highly recoverable. \\
        \addlinespace
        $0.3 \le E < 0.5$ & Moderate separation; some structural details may be lost. \\
        \addlinespace
        $E \ge 0.5$ & Severe degradation; multiple sources remain significantly mixed. \\
        \addlinespace
        $E \ge 1.0$ & The estimated matrix is effectively a random orthogonal projection with no meaningful relation to the true sources. \\
        \bottomrule
    \end{tabular}
    \caption{Heuristic thresholds for interpreting the Amari Performance Index.}
    \label{tab:amari_thresholds}
\end{table}
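
For concreteness, Definition 5.1 admits a direct NumPy transcription, included here only to make the metric explicit:

\begin{lstlisting}[language=Python]
import numpy as np

def amari_index(W_est: np.ndarray, A: np.ndarray) -> float:
    """Amari performance index of P = W_est @ A (Definition 5.1);
    evaluates to zero for a generalized permutation matrix."""
    P = np.abs(W_est @ A)
    d = P.shape[0]
    row_term = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_term = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float((row_term.sum() + col_term.sum()) / (2.0 * d))
\end{lstlisting}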

\subsection{Computational Resource Regimes}
Because OT-ICA relies on first-order gradient descent over the non-convex Stiefel manifold, its capacity to find 
the global optimum depends on the computational resources allocated to search the space, particularly as dimensionality increases. 
Conversely, FastICA utilizes a Newton-based fixed-point iteration that converges rapidly under standard conditions. 
FastICA requires extended computational limits in specific edge cases where its structural assumptions are violated, 
causing the solver to oscillate. 

To accommodate OT-ICA's scaling requirements, and to evaluate whether FastICA's non-convergence in certain cases is due 
to mathematical properties rather than standard optimization timeouts, high-dimensional tests in 
this chapter are evaluated under two computational regimes. The hyperparameter configurations establish specific boundaries for these regimes (Table \ref{tab:compute_regimes}).

\begin{table}[htbp]
    \centering
    \small
    \begin{tabular}{@{} p{4.0cm} p{5.5cm} p{4.5cm} @{}}
        \toprule
        \textbf{Regime \& Objective} & \textbf{OT-ICA Configuration} & \textbf{FastICA Configuration} \\
        \midrule
        \textbf{Low Compute Baseline} \newline \textit{Represents standard, efficient execution.} & 
        \textbf{Restarts:} $\min(4d, 150)$ \newline
        \textbf{Deflation Phase:} 150 iterations \newline
        \textbf{Symmetric Phase:} 200 iterations & 
        \textbf{Max Iterations:} 1,000 \newline 
        \textit{Standard limit for fixed-point convergence.} \\
        \addlinespace
        \addlinespace
        \textbf{High Compute Regime} \newline \textit{Evaluates structural convergence independent of timeouts.} & 
        \textbf{Restarts:} $\min(15d, 600)$ \newline
        \textbf{Deflation Phase:} 300 iterations \newline
        \textbf{Symmetric Phase:} 500 iterations & 
        \textbf{Max Iterations:} 5,000 \newline 
        \textit{Extended to ensure non-convergence is structural.} \\
        \bottomrule
    \end{tabular}
    \caption[Standardized Computational Regimes]{Summary of the computational regimes used to evaluate algorithmic scaling. 
    These iteration limits and restart thresholds separate standard optimization timeouts from mathematical convergence issues.}
    \label{tab:compute_regimes}
\end{table}

\section{OT-ICA Methodology Validation}
We first verify that OT-ICA successfully recovers a mixture of non-Gaussian distributions in a standard, 
low-dimensional environment. We generated $N=10,000$ samples from four continuous independent sources: 
Laplace (super-Gaussian), Uniform (sub-Gaussian), Student-t (continuous heavy-tailed), and a Beta distribution parameterized at $(0.5, 0.5)$ 
(bimodal arcsine distribution). These sources were linearly combined using a $4 \times 4$ mixing matrix $\mathbf{A}$. 

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{otica_baseline_recovery.png} 
    \caption[Baseline Source Recovery via OT-ICA]{Empirical probability distributions of the 4D baseline validation. 
    The original generated sources are represented by gray histograms, overlaid with the sources extracted by the OT-ICA algorithm (orange curves). 
    The framework recovers continuous non-Gaussian structures, navigating both super-Gaussian peaks and sub-Gaussian bimodality.}
    \label{fig:otica_baseline_recovery}
\end{figure}

The OT-ICA algorithm successfully extracted the four structural signals from the mixture (Figure \ref{fig:otica_baseline_recovery}). 
To quantify this separation numerically, we evaluate the true mixing matrix $\mathbf{A}$, the estimated unmixing matrix $\mathbf{W}_{\text{est}}$, 
and the global transfer matrix $\mathbf{P} = \mathbf{W}_{\text{est}}\mathbf{A}$. Because ICA is subject to permutation and sign ambiguities, 
comparing the estimated unmixing matrix to the true mixing matrix directly is challenging. 
The global transfer matrix $\mathbf{P}$ addresses this; if the algorithm inverted the mixing process, 
$\mathbf{P}$ approximates a generalized permutation matrix.

\begin{table}[tbp]
    \centering
    \small
    \begin{tabular}{cc}
        \toprule
        \textbf{True Mixing Matrix ($\mathbf{A}$)} & \textbf{Estimated Unmixing Matrix ($\mathbf{W}_{\text{est}}$)} \\
        \midrule
        $
        \begin{bmatrix}
         1.058 &  1.520 & -0.249 &  1.012 \\
         0.042 & -0.486 &  0.388 & -0.523 \\
        -0.154 & -1.572 & -0.304 & -1.234 \\
        -0.006 &  0.503 &  0.053 &  0.928 
        \end{bmatrix}
        $ 
        & 
        $
        \begin{bmatrix}
         0.154 & -1.943 &  0.608 & -0.456 \\
        -0.092 & -0.284 & -0.632 & -1.977 \\
         0.146 &  0.719 &  1.097 &  1.691 \\
         1.021 &  1.204 &  0.828 &  0.677 
        \end{bmatrix}
        $ \\
        \midrule
        \multicolumn{2}{c}{\textbf{Global Transfer Matrix ($\mathbf{P} = \mathbf{W}_{\text{est}}\mathbf{A}$)}} \\
        \midrule
        \multicolumn{2}{c}{
        $
        \begin{bmatrix}
        -0.010 & -0.007 & -1.000 & -0.002 \\
         0.000 & -0.003 &  0.000 & -1.000 \\
         0.006 & -1.000 & -0.002 & -0.012 \\
         1.000 &  0.006 & -0.003 &  0.010 
        \end{bmatrix}
        $
        } \\
        \midrule
        \multicolumn{2}{c}{\textbf{Amari Performance Index}} \\
        \midrule
        \multicolumn{2}{c}{OT-ICA Error: \textbf{0.0155} \quad | \quad FastICA Error: \textbf{0.0188}} \\
        \bottomrule
    \end{tabular}
    \caption[Baseline Matrices and Amari Error]{The ground truth mixing matrix, estimated unmixing matrix, and resulting global transfer matrix for the OT-ICA baseline validation. 
    The transfer matrix approximates a generalized permutation matrix, resulting in a comparable Amari error to FastICA.}
    \label{tab:baseline_matrices}
\end{table}

The global transfer matrix $\mathbf{P}$ forms a generalized permutation matrix (Table \ref{tab:baseline_matrices}). 
This confirms that OT-ICA extracted the independent sources, 
yielding an Amari error below the 0.1 threshold. For comparison, the FastICA algorithm applied to the same 4D mixture achieved a comparable Amari error.

\section{Computational \& Temporal Scaling}
To evaluate computational scaling in higher dimensions, we benchmarked OT-ICA against FastICA using a linear mixture of continuous super-Gaussian Laplacian sources. 
We evaluated both algorithms across increasing dimensions with a fixed sample size of $N=10,000$. Because OT-ICA relies on gradient-based optimization over the Stiefel manifold, 
its capacity to find global optima depends on computational resources. We evaluated both algorithms under the Low Compute and High Compute regimes.

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{laplace_amari_ot_ica_scaling.png} 
    \caption[Separation Accuracy on Laplacian Sources]{Amari error comparison between FastICA and OT-ICA across dimensions for Laplacian sources. 
    While OT-ICA degrades in the Low Compute regime as the search space on the Stiefel manifold expands, the High Compute regime achieves separation ($E < 0.3$), 
    matching FastICA's capabilities.}
    \label{fig:laplace_amari_scaling}
\end{figure}

FastICA performs well for lower dimensions, maintaining low error regardless of the compute regime (Figure \ref{fig:laplace_amari_scaling}). 
The Laplacian distribution aligns with FastICA's default \textit{logcosh} contrast function. While OT-ICA scales similarly to FastICA, 
it requires the High Compute regime to navigate the higher-dimensional optimization landscape. As dimensionality approaches $d=40$, 
the statistical resolution of the joint distribution for a sample size of $N=10,000$ degrades. Both algorithms fail to maintain the $E < 0.3$ threshold, 
demonstrating a statistical limitation of the fixed sample size.

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{laplace_time_ot_ica_scaling.png} 
    \caption[Execution Time Scaling]{Total execution time (in seconds) for FastICA and OT-ICA across dimensions. 
    OT-ICA exhibits higher computational costs, particularly in the High Compute regime (right) where parallel restarts are required.}
    \label{fig:laplace_time_scaling}
\end{figure}

The computational cost for OT-ICA is significantly higher than FastICA (Figure \ref{fig:laplace_time_scaling}). FastICA's Newton-based 
fixed-point iteration allows rapid convergence. In contrast, OT-ICA's first-order gradient descent and parallel restarts cause execution times 
to scale more steeply. This difference in algorithmic efficiency raises the question of whether a gradient-based framework like 
OT-ICA provides advantages over FastICA in specific settings.

While FastICA efficiently separates homogeneous mixtures of continuous distributions, empirical data is often heterogeneous. 
The following sections explore how specific structural properties of non-Gaussian sources affect FastICA's convergence.

\section{Algorithmic Limitations of FastICA}
We examine two specific mathematical conditions that affect FastICA's convergence on certain non-Gaussian independent components, 
resulting from its reliance on proxy contrast functions and Newton-based iteration.

\subsection{The Zero Negentropy Condition}
FastICA approximates negentropy $J(x)$ using a non-quadratic contrast function $G(\cdot)$, such as the \textit{logcosh} function $G(x) = \frac{1}{a} \log \cosh(ax)$. 
The approximation is given by:
\begin{equation}
    J(x) \propto \big( \mathbb{E}[G(x)] - \mathbb{E}[G(\nu)] \big)^2
\end{equation}
where $\nu$ is a standard Gaussian random variable. The algorithm seeks projections that maximize this squared difference. 

If a latent independent source possesses a specific distribution such that its expected value under the contrast function equals that of 
a Gaussian ($\mathbb{E}[G(x)] = \mathbb{E}[G(\nu)]$), the approximated negentropy evaluates to zero. In this scenario, 
the objective function provides no gradient signal, and the algorithm does not extract the source.

\subsection{Empirical Results: Zero Negentropy}
To empirically demonstrate this condition, we constructed a symmetric Trimodal Gaussian Mixture parameterized by peak 
locations $\{-b, 0, b\}$ and variance $\sigma^2$. We solved for parameters such that the expected value of the \textit{logcosh} function matched 
the theoretical baseline of a standard normal distribution. This distribution maintains unit variance, representing an independent non-Gaussian 
source that yields an approximated negentropy of zero.
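
The condition can be checked numerically by Monte Carlo; the sketch below estimates $\mathbb{E}[G(x)] - \mathbb{E}[G(\nu)]$ for the \textit{logcosh} contrast under an illustrative trimodal parameterization (the solved parameter values used in the experiments are not reproduced here).

\begin{lstlisting}[language=Python]
import numpy as np

def logcosh_gap(x: np.ndarray, n_ref: int = 1_000_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[G(x)] - E[G(nu)] for G(u) = log cosh(u)."""
    rng = np.random.default_rng(seed)
    G = lambda u: np.log(np.cosh(u))
    return float(G(x).mean() - G(rng.standard_normal(n_ref)).mean())

# Illustrative (not the solved) trimodal mixture with peaks {-b, 0, b}:
rng = np.random.default_rng(1)
b, sigma = 1.6, 0.3
centers = rng.choice([-b, 0.0, b], size=200_000, p=[0.25, 0.5, 0.25])
x = centers + sigma * rng.standard_normal(200_000)
x = x / x.std()               # rescale to unit variance
print(logcosh_gap(x))         # a value near zero implies zero proxy negentropy
\end{lstlisting}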

We generated linear mixtures of this Trimodal Gaussian source across dimensions ($d \in [5, 25]$) with $N=10,000$ samples. 
We separated the evaluation into the Low Compute and High Compute regimes.

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{zero_negentropy_amari_compute.png} 
    \caption[Performance on Zero-Negentropy Failure Mode]{Amari error comparison between FastICA and OT-ICA for the zero-negentropy Trimodal Gaussian distribution. 
    FastICA lacks convergence ($E > 0.5$) across all regimes. OT-ICA unmixes the sources, though it requires the High Compute regime for higher dimensions.}
    \label{fig:zero_negentropy_amari}
\end{figure}

The empirical results demonstrate that FastICA yields high Amari errors ($E>0.5$) for this distribution (Figure \ref{fig:zero_negentropy_amari}). 
Providing FastICA with High Compute resources does not improve performance, confirming that the lack of convergence is related to the properties of 
the proxy objective function rather than iteration limits.

OT-ICA achieves source separation, bypassing parametric approximations by utilizing the empirical CDF via optimal transport. While OT-ICA shows 
performance degradation in the Low Compute regime for $d>15$ due to the non-convexity of the Stiefel manifold, the High Compute regime restores 
accurate unmixing ($E<0.3$) up to $d=20$. 

\subsection{The Vanishing Curvature Condition}
FastICA utilizes a Newton (second-order) fixed-point iteration step. The mathematical stability of this convergence relies on a condition governing 
the algorithmic curvature of the optimization landscape, as formalized by Hyv{\"a}rinen, Karhunen, and Oja \cite[Chapter 8]{hyvarinen2001} (Appendix \ref{app:fastica_proof}).

For the fixed-point iteration to converge to a true independent component $s_i$, the expectation governing the second-order Taylor expansion of 
the iteration must not vanish. Defining $g(x) = G'(x)$ and $g'(x) = G''(x)$, where $G$ is the contrast function, convergence requires:
\begin{equation}
    \mathbb{E}\big[s_i g(s_i) - g'(s_i)\big] \neq 0
\end{equation}
For the standard \textit{logcosh} contrast function (with $a=1$), this equates to $g(x) = \tanh(x)$ and $g'(x) = 1 - \tanh^2(x)$. 

If a non-Gaussian source distribution causes this expectation to evaluate to zero, the denominator of the FastICA update rule approaches zero. 
The fixed-point iteration fails to converge under these conditions.

\subsection{Empirical Results: Vanishing Curvature}
We engineered a continuous source distribution to test this curvature condition. We analyzed the target function: 
\begin{equation}
    h(x) = x \tanh(x) - 1 + \tanh^2(x).
\end{equation}
We constructed a Trimodal Gaussian Mixture parameterized by symmetrically spaced peak locations $\{-b, 0, b\}$ and variance $\sigma^2$. 
Solving the integral equation $\int h(x) \text{PDF}(x) dx = 0$ subject to the unit variance constraint ($\mathbb{E}[X^2] = 1$) determines parameters 
that balance the regions of negative and positive curvature.
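
Numerically, candidate parameters can be verified by evaluating $\int h(x) \text{PDF}(x) dx$ by quadrature; the sketch below uses illustrative rather than the solved parameter values.

\begin{lstlisting}[language=Python]
import numpy as np
from scipy import integrate, stats

def curvature_expectation(b: float, sigma: float, p_center: float) -> float:
    """E[h(X)] with h(x) = x tanh(x) - 1 + tanh(x)^2 for a symmetric
    trimodal Gaussian mixture with peaks {-b, 0, b}."""
    weights = [(1.0 - p_center) / 2.0, p_center, (1.0 - p_center) / 2.0]
    means = [-b, 0.0, b]
    pdf = lambda x: sum(w * stats.norm.pdf(x, m, sigma)
                        for w, m in zip(weights, means))
    h = lambda x: x * np.tanh(x) - 1.0 + np.tanh(x) ** 2
    value, _ = integrate.quad(lambda x: h(x) * pdf(x), -np.inf, np.inf)
    return value

# Illustrative parameters; the thesis solves for E[h(X)] = 0 under E[X^2] = 1.
print(curvature_expectation(b=1.8, sigma=0.4, p_center=0.5))
\end{lstlisting}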

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{vanishing_curvature_trimodal_pdf.png} 
    \caption[The Vanishing Curvature Counterexample]{The engineered Trimodal Gaussian Mixture. The central mass generates negative algorithmic curvature, 
    while the side peaks generate positive algorithmic curvature. The parameters $b$ and $p$ are solved such that these regions cancel ($\int h(x) \text{PDF}(x) dx = 0$) 
    while maintaining unit variance.}
    \label{fig:vanishing_curvature}
\end{figure}

This distribution is non-Gaussian, but its theoretical algorithmic curvature evaluates to zero (Figure \ref{fig:vanishing_curvature}). 
We evaluated FastICA and OT-ICA across dimensions ($d \in [5, 25]$) at $N=10,000$ under both compute regimes.

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{vanishing_curvature_amari_compute.png} 
    \caption[Performance on Trimodal Failure Mode]{Amari error comparison between FastICA and OT-ICA across dimensions for the engineered Trimodal distribution. 
    FastICA lacks convergence ($E > 1.0$) across all regimes.}
    \label{fig:amari_trimodal_failure}
\end{figure}

FastICA returned unmixing matrices with high Amari errors ($E > 0.3$) across dimensions (Figure \ref{fig:amari_trimodal_failure}). 
The High Compute Regime did not improve FastICA's performance, indicating the non-convergence is structural.

OT-ICA unmixed lower dimensions ($d < 20$) in the Low Compute regime and degraded as $d$ increased. In the High Compute regime, 
OT-ICA maintained separation ($E < 0.3$) through 20 dimensions. This suggests OT-ICA's performance is limited by the expanding 
non-convex search space rather than local density approximations.

\section{The Generalized Hybrid Mixture Stress Test}
To evaluate OT-ICA and FastICA on heterogeneous signals, we tested a generalized hybrid mixture.

\subsection{Experimental Setup}
We generated linear mixtures at dimensions $d=30$ and $d=40$ with $N=10,000$. One source was fixed as a standard Gaussian distribution, $\mathcal{N}(0,1)$. 
The remaining $(d-1)$ independent sources were randomly drawn from eight distributions: Laplace, Bernoulli, Uniform, Student-t, Poisson, Binomial, Chi-squared, and Exponential. 

FastICA was evaluated with a limit of $10,000$ iterations, while OT-ICA utilized its baseline parameters (Table \ref{tab:compute_regimes}).

\subsection{Empirical Results: Hybrid Mixture Stress Test}
OT-ICA maintained moderate to good separation ($E < 0.5$) across the dimensions for all but discrete-only mixtures (Figure \ref{fig:hybrid_mixture_results}). 

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{general_hybrid_test.png} 
    \caption[General Hybrid Mixture Stress Test]{Amari error comparison for the Generalized Hybrid Mixture. OT-ICA maintains stability across the tested dimensions.}
    \label{fig:hybrid_mixture_results}
\end{figure}

FastICA exhibited instability in three of the five hybrid mixtures: full hybrid, zero Gaussian, and discrete-only. At $d=30$, 
FastICA triggered non-convergence warnings. Providing an extended $10,000$ iteration limit resulted in Amari errors remaining high ($E>0.5$). 
This indicates that FastICA's fixed-point update steps may not converge reliably when processing mixtures containing both super-Gaussian and sub-Gaussian signals simultaneously.

\section{Discrete Only Mixtures: A Unique Challenge}
We isolated specific discrete distributions from the discrete-only mixtures to observe their impact on optimization. We generated mixtures 
containing one continuous Gaussian source and $(d-1)$ sources of a single discrete type (Bernoulli, Poisson, or Binomial).

\subsection{Empirical Results: Discrete Only Mixtures}
FastICA and OT-ICA separated binary Bernoulli mixtures, but both algorithms reported high Amari errors ($E > 1.0$) for multi-step count-based 
geometries of standard Poisson ($\lambda=3$) and Binomial ($n=10$) distributions (Figure \ref{fig:discrete_isolation}). 

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{isolated_discrete_test.png} 
    \caption[Performance on Isolated Discrete Pools]{Amari error for mixtures composed entirely of a single discrete distribution type. 
    Both algorithms separate binary (Bernoulli) data but show higher error on count data (Poisson, Binomial).}
    \label{fig:discrete_isolation}
\end{figure}

\subsection{Optimization Properties of Discrete Data}
To analyze this behavior, we calculated the squared Wasserstein distance ($W_2^2$) and the Logcosh proxy negentropy 
of several source distributions relative to a standard normal reference.
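
A sorting-based estimator of the $W_2^2$ column is sketched below; the sample sizes, seeds, and printed values are illustrative and do not reproduce the exact figures of Table \ref{tab:w2_sensitivity}.

\begin{lstlisting}[language=Python]
import numpy as np
from scipy.stats import norm

def w2_sq_to_gaussian(samples: np.ndarray) -> float:
    """Sorting-based estimate of W_2^2 between standardized samples and a
    standard normal target (quantile matching at mid-points)."""
    x = np.sort((samples - samples.mean()) / samples.std())
    q = norm.ppf((np.arange(x.size) + 0.5) / x.size)   # Gaussian quantile grid
    return float(np.mean((x - q) ** 2))

rng = np.random.default_rng(0)
print(w2_sq_to_gaussian(rng.poisson(3.0, 100_000)))    # milder non-Gaussianity
print(w2_sq_to_gaussian(rng.poisson(0.5, 100_000)))    # harsher, more non-Gaussian
\end{lstlisting}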

\begin{table}[htbp]
    \centering
    \begin{tabular}{lcc}
        \toprule
        \textbf{Distribution} & \textbf{$W_2^2$ Distance} & \textbf{Logcosh Proxy Negentropy} \\
        \midrule
        Laplace (Continuous Baseline) & 0.0398 & 0.001214 \\
        Binomial (Standard, $n=10$) & 0.0344 & 0.000031 \\
        Poisson (Standard, $\lambda=3.0$) & 0.0513 & 0.000000 \\
        Binomial (Harsh, $n=2$) & 0.2022 & 0.000267 \\
        Poisson (Harsh, $\lambda=0.5$) & 0.3384 & 0.000085 \\
        \bottomrule
    \end{tabular}
    \caption[Non-Gaussianity Sensitivity Analysis]{Comparison of non-Gaussianity metrics across distributions.}
    \label{tab:w2_sensitivity}
\end{table}

Adjusting the parameters (Poisson $\lambda=0.5$, Binomial $n=2$) increases the non-Gaussianity measured by $W_2^2$ (Table \ref{tab:w2_sensitivity}). 
The Logcosh proxy evaluates to lower relative values for these discrete structures compared to the continuous Laplace baseline. 

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{all_optimization_surfaces.png}
    \caption[Optimization Surface of Discrete Mixtures]{The 1D optimization landscapes for discrete mixtures. The step-function geometry of 
    discrete data affects the gradient properties of both the OT-ICA surface and the FastICA proxy approximation.}
    \label{fig:optimization_surface}
\end{figure}

The 1D optimization surfaces were mapped for both metrics (Figure \ref{fig:optimization_surface}). For FastICA (dashed blue line), 
the \textit{logcosh} proxy landscape exhibits only low-amplitude gradient variation (on the order of $10^{-5}$). For OT-ICA (solid orange line), 
the geometric matching reflects the step-function nature of the empirical CDF, creating flat plateaus where the gradient approaches zero. 
This non-smooth landscape can stall the L-BFGS solver prior to identifying the components.

\subsection{Empirical Results: Harsh Discrete Environments}
We evaluated both algorithms on the harsher, skewed parameterizations identified in Table \ref{tab:w2_sensitivity}.

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{binomial_poisson_fast_vs_ot_ica.png} 
    \caption[Performance on Harsh Discrete Environments]{Amari Error for skewed discrete parameterizations. 
    OT-ICA achieves separation on the Poisson distribution but does not converge on the symmetric Binomial.}
    \label{fig:harsh_discrete}
\end{figure}

For the skewed Poisson mixture ($\lambda=0.5$), FastICA did not converge (Figure \ref{fig:harsh_discrete}). 
OT-ICA achieved separation ($E \approx 0.15$), as the skewed nature of the Poisson distribution creates a narrower plateau at 
the minimum of the optimization surface (Figure \ref{fig:optimization_surface}). For the symmetric Binomial mixture ($n=3$), 
OT-ICA stalled, resulting in an Amari error of $E > 2.0$ due to the wider plateau at the optimum.

\section{Synthesis: The Case for OT-ICA}
The evaluations indicate a distinction between theoretical identifiability and optimization behavior. 
While statistical non-Gaussianity permits source separation, extracting discrete sources requires navigating a non-smooth optimization surface. 
As observed in the pure discrete environments, step-functions create zero-gradient plateaus that stall first-order solvers, 
despite the objective function correctly identifying the non-Gaussianity.

In heterogeneous environments containing both continuous and discrete variables, the continuous distributions smooth the overall 
joint Cumulative Distribution Function, allowing the gradient solver to utilize the optimal transport metric across the manifold (Figure \ref{fig:hybrid_mixture_results}). 

FastICA's reliance on local negentropy approximations creates specific convergence limitations, such as the zero-negentropy 
condition and vanishing algorithmic curvature, which affect its performance in generalized hybrid mixtures. 
Because OT-ICA explicitly maps the full empirical geometry via the CDF, it bypasses these specific parametric approximations. 
Its performance is instead constrained by computational resources and the non-convexity of the Stiefel manifold. 
For homogeneous mixtures of standard continuous distributions, FastICA provides a computational advantage. 
In heterogeneous environments where proxy assumptions do not hold, OT-ICA provides a reliable alternative methodology capable of 
resolving signals that traditional proxy solvers cannot separate.

\section{Application: EEG Artifact Removal}

\subsection{The Challenge of Volume Conduction}
In electroencephalography (EEG) recordings, the human skull acts as a linear volume conductor. 
Electrical signals generated by brain regions and non-cerebral sources (such as the ocular muscles) mix linearly before reaching the scalp electrodes. 
This satisfies the linear mixing model of ICA: $\mathbf{X} = \mathbf{A}\mathbf{S}$.

Eye blinks (ocular artifacts) possess high amplitude and mask underlying brain waves. Because eye blinks are sparse and super-Gaussian, 
ICA is utilized to isolate and remove them. We apply the OT-ICA framework to a clinical dataset to isolate these artifacts from continuous brain wave recordings.

\subsection{Empirical Results on Clinical EEG Data}
We utilized the \texttt{mne} library to process a clinical MEG/EEG dataset. The data was band-pass filtered between 1 Hz and 40 Hz, 
and we isolated a 10-second window (10.0s to 20.0s) containing eye-blinks across five frontal EEG channels. The data was standardized before applying the OT-ICA algorithm.
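
A minimal sketch of this preprocessing pipeline with the \texttt{mne} API is shown below; the use of the bundled \texttt{sample} recording and the selection of the first five EEG channels are illustrative stand-ins for the clinical dataset and frontal montage described above.

\begin{lstlisting}[language=Python]
import os
import mne

# Illustrative: the bundled MNE "sample" recording stands in for the clinical data.
data_dir = str(mne.datasets.sample.data_path())
raw = mne.io.read_raw_fif(os.path.join(data_dir, "MEG", "sample",
                                        "sample_audvis_raw.fif"), preload=True)

raw.filter(l_freq=1.0, h_freq=40.0)      # band-pass filter, 1 to 40 Hz
raw.crop(tmin=10.0, tmax=20.0)           # 10-second blink-containing window
eeg = raw.get_data(picks="eeg")[:5]      # first five EEG channels (illustrative)

X = (eeg - eeg.mean(axis=1, keepdims=True)) / eeg.std(axis=1, keepdims=True)
# X (5 x T) is the standardized mixture handed to the OT-ICA unmixing routine.
\end{lstlisting}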

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{eeg_artifact_removal.png}
    \caption[EEG Artifact Removal via OT-ICA]{Application of OT-ICA to EEG data. \textbf{Top (Plot A):} 
    Five frontal EEG channels containing eye-blink artifacts. \textbf{Bottom (Plot B):} The independent components recovered by OT-ICA. 
    The ocular artifact is isolated into a single independent component (Comp 1).}
    \label{fig:eeg_artifact_removal}
\end{figure}

The raw frontal electrodes contain synchronized high-amplitude spikes (Figure \ref{fig:eeg_artifact_removal}, Plot A). 
The OT-ICA framework separated these signals. The ocular artifact is isolated into a single independent component (Comp 1), 
leaving the underlying continuous signals in the remaining components (Figure \ref{fig:eeg_artifact_removal}, Plot B).

\section{Application: Price Discovery and Information Shares}

\subsection{Non-Gaussianity in Econometric Market Microstructure}
We apply OT-ICA to price discovery in financial econometrics. In fragmented financial markets, determining which market 
contributes most to price discovery involves measuring information shares. Standard vector error correction models (VECMs) 
identify information shares through the uncorrelatedness of price innovations. However, identification based solely on uncorrelatedness 
pins the structural innovations down only up to an orthogonal transformation \cite{zema2025}. 

By utilizing the non-Gaussianity of latent market shocks, ICA can resolve this orthogonal ambiguity, providing a measurement of market information shares. 

\subsection{Economic Identification Strategy}
To address the permutation and sign ambiguities of ICA, we impose two economic identification constraints.

First, to resolve the permutation ambiguity, we enforce a Dominant Diagonal constraint. The structural shock assigned to Market $i$ must 
contribute the maximum variance to that market's price innovations. We implement this using the Hungarian Algorithm for optimal bipartite matching.

Second, to resolve the sign ambiguity, we enforce a Positive Impact constraint based on positive spillovers. A structural innovation 
representing 'good news' is assumed to have a positive initial impact on its own market price. Columns of the unmixing matrix violating this logic are sign-flipped. 
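
A compact sketch of these two constraints, using the Hungarian solver from \texttt{scipy}, is given below; the matrix \texttt{impact\_est} denotes the estimated structural impact matrix and is purely illustrative.

\begin{lstlisting}[language=Python]
import numpy as np
from scipy.optimize import linear_sum_assignment

def apply_economic_constraints(impact_est: np.ndarray) -> np.ndarray:
    """Resolve the ICA permutation and sign ambiguities via the Dominant
    Diagonal and Positive Impact constraints."""
    # Dominant Diagonal: match each market to the shock with maximal absolute
    # impact (optimal bipartite matching via the Hungarian algorithm).
    _, cols = linear_sum_assignment(-np.abs(impact_est))
    matched = impact_est[:, cols]

    # Positive Impact: flip any shock whose own-market impact is negative.
    signs = np.sign(np.diag(matched))
    signs[signs == 0] = 1.0
    return matched * signs
\end{lstlisting}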

\subsection{Empirical Results on Simulated Market Data}
We simulated a fragmented market scenario targeting three markets with pre-defined Information Shares of $12\%$, $24\%$, and $64\%$. 
Market innovations were generated using Student-t distributions to satisfy the non-Gaussianity requirements. We executed 500 Monte Carlo simulations comprising 5,000 observations each.

A representative run demonstrates the correlation between the generative shocks and the empirical shocks estimated by OT-ICA (Figure \ref{fig:shock_recovery}).

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{shock_recovery_scatter.png}
    \caption[Recovery of Structural Shocks via OT-ICA]{Pairplot demonstrating the recovery of true structural shocks versus estimated shocks 
    from OT-ICA for a single Monte Carlo run. The linear alignment indicates unmixing of the latent market innovations.}
    \label{fig:shock_recovery}
\end{figure}

Applying the Dominant Diagonal and Positive Impact constraints, we mapped the statistical components to their corresponding markets and calculated empirical information shares.

The estimated information shares converge to their theoretical targets across all 500 simulations (Table \ref{tab:simulated_is_results}). 

\begin{table}[htbp]
    \centering
    \begin{tabular}{@{}lccc@{}}
        \toprule
        \textbf{Market / Source} & \textbf{True IS} & \textbf{Estimated IS (Mean)} & \textbf{Standard Deviation} \\
        \midrule
        Market 1 & 0.1200 & 0.1212 & 0.0161 \\
        \addlinespace
        Market 2 & 0.2400 & 0.2399 & 0.0245 \\
        \addlinespace
        Market 3 & 0.6400 & 0.6389 & 0.0260 \\
        \bottomrule
    \end{tabular}
    \caption[Simulation Results for Information Share Recovery]{Monte Carlo simulation results comparing true Information Shares (IS) against 
    the estimated IS recovered via OT-ICA over 500 runs.}
    \label{tab:simulated_is_results}
\end{table}

The empirical density of the estimated information shares is centered around the true parameter values across all simulation runs (Figure \ref{fig:is_distribution}).

\begin{figure}[tbp]
    \centering
    \includegraphics[width=1.0\textwidth]{econ_applicatn_is_distribution.png} 
    \caption[Empirical Distribution of Estimated Information Shares]{Empirical density distributions of the estimated Information Shares 
    across 500 Monte Carlo simulations. The vertical dashed lines represent the true theoretical values.}
    \label{fig:is_distribution}
\end{figure}


% -----------------------------------------------------------------
% CHAPTER 6: CONCLUSION
% -----------------------------------------------------------------
\chapter{Conclusion}

\section{Summary of Contributions}
This thesis evaluated the Optimal Transport Independent Component Analysis (OT-ICA) framework, transitioning linear source separation from 
local proxy approximations to global geometric matching. By formulating the search for independent components as a minimization of 
the squared Wasserstein distance ($W_2^2$) to a standard normal distribution on the Stiefel manifold, we addressed specific 
mathematical vulnerabilities present in established proxy algorithms.

Our theoretical analysis established that the squared Wasserstein distance bounds mixture non-Gaussianity, providing a direct geometric objective 
for extracting latent sources. Empirically, we demonstrated that while Newton-based algorithms like FastICA fail to converge in the presence 
of structural blind spots, such as zero-negentropy topologies and vanishing curvature, OT-ICA maintains stable convergence. 
By mapping the full empirical cumulative distribution function rather than relying on point-estimates of density, OT-ICA successfully resolved 
generalized hybrid mixtures where proxy solvers oscillate indefinitely.

We introduced specific algorithmic adaptations to make this optimal transport framework computationally viable. 
These included exact analytical Gaussian targets to eliminate discretization noise, batched vectorization to handle the non-convex search space, 
and Gaussian dithering to process the discontinuous optimization landscapes of discrete data. Finally, we demonstrated the practical application of this framework, 
utilizing OT-ICA to isolate sparse artifacts in clinical EEG recordings and to resolve orthogonal ambiguities in econometric price discovery. 

\section{Limitations}
The OT-ICA framework is subject to constraints regarding computational efficiency. Because calculating the exact Hessian of 
the Wasserstein distance relies on continuous density estimation, OT-ICA must utilize first-order Quasi-Newton methods (L-BFGS) 
combined with parallel random restarts. This results in execution times that scale steeply with dimensionality, making it 
computationally expensive compared to fixed-point iteration solvers.

The framework also encounters a statistical performance ceiling related to the curse of dimensionality. As the dimensionality expands, 
the non-convex search space of the orthogonal group $O(d)$ becomes increasingly complex. In our empirical evaluations, 
a fixed sample size of $N=10,000$ became insufficient to maintain the heuristic threshold for good source separation (Amari error $E < 0.3$) 
when $d > 40$, demonstrating a statistical resolution limit of the joint distribution.

Finally, while Gaussian dithering smooths step-wise empirical CDFs, highly symmetric and sparse discrete distributions 
(such as a low parameter Binomial distribution) yield optimization landscapes with flat, zero-gradient plateaus. 
While OT-ICA navigates hybrid environments where continuous variables smooth the joint distribution, isolating purely discrete, 
symmetrically clustered sources remains a limitation for first-order gradient solvers.

\section{Future Work}
Future research directions include the integration of entropic regularization (Sinkhorn distances) \cite{cuturi2013sinkhorn, peyre2019computational} 
to replace the Gaussian dithering step. Sinkhorn divergence provides a naturally differentiable approximation of the Wasserstein metric 
that processes discrete distributions without mathematically altering the underlying data.

Improving computational scaling represents another critical objective. Investigating stochastic second-order methods, or utilizing neural-based 
transport map approximations (such as Normalizing Flows), could reduce the execution time required to search the Stiefel manifold. 
Leveraging Amortized Optimal Transport to learn the transport map offline would bypass the $\mathcal{O}(N \log N)$ sorting requirement entirely during runtime.

Integrating the OT-ICA framework into the domain of causal discovery offers another promising avenue. Causal structures are frequently modeled using 
Linear Non-Gaussian Acyclic Models (LiNGAM) \cite{shimizu2006linear}. Because LiNGAM relies directly on ICA to identify the independent error terms 
that dictate causal directionality, applying a globally geometrically aware metric like OT-ICA could improve causal estimation in complex, 
heterogeneous datasets where traditional ICA proxies fail. Furthermore, the emerging field of Causal Representation Learning seeks to infer causally 
related latent variables. The recently proposed Causal Component Analysis (CauCA) \cite{liang2023causal} bridges ICA and causal learning by modeling 
the causal dependence among latent components under known graphs. Integrating the robust Wasserstein metric of OT-ICA into the CauCA framework could 
enhance the estimation of unmixing functions and causal mechanisms, particularly when latent variables exhibit complex, heterogeneous non-Gaussianity.

Finally, extending the geometric properties of optimal transport contrast functions to nonlinear Independent Component Analysis represents a compelling frontier. 
Unsupervised learning of latent variable models via nonlinear ICA requires strict constraints on the function class to ensure identifiability, 
as explored by Buchholz, Besserve, and Sch{\"o}lkopf \cite{buchholz2022function}. The optimal transport metric, which inherently evaluates global structural deformations, 
may serve to effectively regularize these constrained function spaces, enabling reliable unmixing of nonlinear representations.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% APPENDICES
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\appendix
\chapter{Mathematical Proofs and Derivations}

\section{Derivation of the FastICA Fixed-Point Failure Condition}
\label{app:fastica_proof}

\noindent\textbf{Proof of Theorem 8.1} (following \cite[Chapter 8]{hyvarinen2001}). Denote by $H(\mathbf{w})$ the function to be minimized or maximized, $E\{G(\mathbf{w}^T \mathbf{z})\}$. 
Make the orthogonal change of coordinates $\mathbf{q} = \mathbf{A}^T \mathbf{V}^T \mathbf{w}$. Then we can calculate the gradient as $\frac{\partial H(\mathbf{q})}{\partial \mathbf{q}} = E\{\mathbf{s}g(\mathbf{q}^T \mathbf{s})\}$ and the Hessian as $\frac{\partial^2 H(\mathbf{q})}{\partial \mathbf{q}^2} = E\{\mathbf{s}\mathbf{s}^T g'(\mathbf{q}^T \mathbf{s})\}$. Without loss of generality, it is enough to analyze the stability of the point $\mathbf{q} = \mathbf{e}_1$, where $\mathbf{e}_1 = (1, 0, 0, 0, \dots)$. Evaluating the gradient and the Hessian at point $\mathbf{q} = \mathbf{e}_1$, we get using the independence of the $s_i$,
\begin{equation}
    \frac{\partial H(\mathbf{e}_1)}{\partial \mathbf{q}} = \mathbf{e}_1 E\{s_1 g(s_1)\}
\end{equation}
and
\begin{equation}
    \frac{\partial^2 H(\mathbf{e}_1)}{\partial \mathbf{q}^2} = \text{diag}(E\{s_1^2 g'(s_1)\}, E\{g'(s_1)\}, E\{g'(s_1)\}, \dots).
\end{equation}
Making a small perturbation $\epsilon = (\epsilon_1, \epsilon_2, \dots)$, we obtain
\begin{equation}
\begin{aligned}
    H(\mathbf{e}_1 + \epsilon) &= H(\mathbf{e}_1) + \epsilon^T \frac{\partial H(\mathbf{e}_1)}{\partial \mathbf{q}} + \frac{1}{2} \epsilon^T \frac{\partial^2 H(\mathbf{e}_1)}{\partial \mathbf{q}^2} \epsilon + o(\|\epsilon\|^2) \\
    &= H(\mathbf{e}_1) + E\{s_1 g(s_1)\} \epsilon_1 + \frac{1}{2} \Big[ E\{s_1^2 g'(s_1)\} \epsilon_1^2 + E\{g'(s_1)\} \sum_{i>1} \epsilon_i^2 \Big] + o(\|\epsilon\|^2)
\end{aligned}
\end{equation}
Due to the constraint $\|\mathbf{w}\| = 1$ we get $\epsilon_1 = \sqrt{1 - \epsilon_2^2 - \epsilon_3^2 - \dots} - 1$. Due to the fact that $\sqrt{1 - \gamma} = 1 - \gamma/2 + o(\gamma)$, the term of order $\epsilon_1^2$ is $o(\|\epsilon\|^2)$, i.e., of higher order, and can be neglected. Using the aforementioned first-order approximation for $\epsilon_1$ we obtain $\epsilon_1 = - \sum_{i>1} \epsilon_i^2 / 2 + o(\|\epsilon\|^2)$, which finally gives
\begin{equation}
    H(\mathbf{e}_1 + \epsilon) = H(\mathbf{e}_1) + \frac{1}{2} \Big[ E\{g'(s_1) - s_1 g(s_1)\} \Big] \sum_{i>1} \epsilon_i^2 + o(\|\epsilon\|^2)
\end{equation}
which clearly proves $\mathbf{q} = \mathbf{e}_1$ is an extremum, and of the type implied by the condition of the theorem.

\vspace{1em}
\noindent\textbf{Proof of convergence of FastICA} The convergence is proven under two assumptions: first, that the data follows the ICA data model, and second, that the expectations are evaluated exactly.

Let $g$ be the nonlinearity used in the algorithm. In the case of the kurtosis-based algorithm, this is the cubic function, so we obtain that algorithm as a special case of the following proof for a general $g$. We must also make the following technical assumption:
\begin{equation}
    E\{s_i g(s_i) - g'(s_i)\} \neq 0, \quad \text{for any } i
\end{equation}
which can be considered a generalization of the condition valid when we use kurtosis, that the kurtosis of the independent components must be nonzero. If this is true for a subset of independent components, we can estimate just those independent components.

To begin with, make the change of variable $\mathbf{q} = \mathbf{A}^T \mathbf{V}^T \mathbf{w}$, as earlier, and assume that $\mathbf{q}$ is in the neighborhood of a solution (say, $q_1 \approx 1$ as before). As shown in the proof of Theorem 8.1, the change in $q_1$ is then of a lower order than the change in the other coordinates, due to the constraint $\|\mathbf{q}\| = 1$. Then we can expand the terms using a Taylor approximation for $g$ and $g'$, first obtaining
\begin{equation}
\begin{aligned}
    g(\mathbf{q}^T \mathbf{s}) &= g(q_1 s_1) + g'(q_1 s_1)\mathbf{q}_{-1}^T \mathbf{s}_{-1} + \frac{1}{2} g''(q_1 s_1)(\mathbf{q}_{-1}^T \mathbf{s}_{-1})^2 \\
    &\quad + \frac{1}{6} g'''(q_1 s_1)(\mathbf{q}_{-1}^T \mathbf{s}_{-1})^3 + O(\|\mathbf{q}_{-1}\|^4)
\end{aligned}
\end{equation}
and then
\begin{equation}
\begin{aligned}
    g'(\mathbf{q}^T \mathbf{s}) &= g'(q_1 s_1) + g''(q_1 s_1)\mathbf{q}_{-1}^T \mathbf{s}_{-1} \\
    &\quad + \frac{1}{2} g'''(q_1 s_1)(\mathbf{q}_{-1}^T \mathbf{s}_{-1})^2 + O(\|\mathbf{q}_{-1}\|^3)
\end{aligned}
\end{equation}
where $\mathbf{q}_{-1}$ and $\mathbf{s}_{-1}$ are the vectors $\mathbf{q}$ and $\mathbf{s}$ without their first components. Denote by $\mathbf{q}^+$ the new value of $\mathbf{q}$ (after one iteration). Thus we obtain, using the independence of the $s_i$ and doing some tedious but straightforward algebraic manipulations,
\begin{equation}
    q_1^+ = E\{s_1 g(q_1 s_1) - g'(q_1 s_1)\} + O(\|\mathbf{q}_{-1}\|^2)
\end{equation}
\begin{equation}
\begin{aligned}
    q_i^+ &= \frac{1}{2} E\{s_i^3\} E\{g''(s_1)\} q_i^2 \\
    &\quad + \frac{1}{6} \text{kurt}(s_i) E\{g'''(s_1)\} q_i^3 + O(\|\mathbf{q}_{-1}\|^4), \quad \text{for } i > 1
\end{aligned}
\end{equation}
We obtain also
\begin{equation}
    \mathbf{q}^* = \mathbf{q}^+ / \|\mathbf{q}^+\|
\end{equation}
This shows clearly that under assumption (A.5), the algorithm converges (locally) to such a vector $\mathbf{q}$ that $q_1 = \pm 1$ and $q_i = 0$ for $i > 1$. This means that $\mathbf{w} = ((\mathbf{V}\mathbf{A})^T)^{-1}\mathbf{q}$ converges, up to the sign, to one of the rows of the inverse of the mixing matrix $\mathbf{V}\mathbf{A}$, which implies that $\mathbf{w}^T\mathbf{z}$ converges to one of the $s_i$. Moreover, if $E\{g''(s_1)\} = 0$, i.e., if the $s_i$ has a symmetric distribution, as is usually the case, the convergence is cubic. In other cases, the convergence is quadratic. If kurtosis is used, however, we always have $E\{g''(s_1)\} = 0$ and thus cubic convergence. In addition, if $G(y) = y^4$, the local approximations are exact, and the convergence is global.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% BACK MATTER
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\backmatter

\bibliographystyle{apacite}
\bibliography{references} 

\end{document}