Accepted by AAAI 2026

M-Loss: Quantifying Model Merging Compatibility with Limited Unlabeled Data

Tiantong Wang¹, Yiyang Duan¹, Haoyu Chen¹,², Tiantong Wu³*, Wei Yang Bryan Lim¹
¹ College of Computing and Data Science, Nanyang Technological University
² School of Computer and Information Technology, Beijing Jiaotong University
³ Alibaba-NTU Global e-Sustainability CorpLab (ANGEL)

Abstract

Training of large-scale models is both computationally intensive and often constrained by the availability of labeled data. Model merging offers a compelling alternative by directly integrating the weights of multiple source models without requiring additional data or extensive training. However, conventional model merging techniques, such as parameter averaging, often suffer from the unintended combination of non-generalizable features, especially when source models exhibit significant weight disparities.

By comparison, model ensembling, which aggregates multiple models by averaging their outputs, generally provides more stable and superior performance. However, it incurs higher inference costs and increased storage requirements. While previous studies experimentally showed the similarities between model merging and ensembling, theoretical evidence and evaluation metrics remain lacking. To address this gap, we introduce Merging-ensembling loss (M-Loss), a novel evaluation metric that quantifies the compatibility of merging source models using very limited unlabeled data. By measuring the discrepancy between parameter averaging and model ensembling at layer and node levels, M-Loss facilitates more effective merging strategies. Specifically, M-Loss serves both as a quantitative criterion for the theoretical feasibility of model merging and as a guide to parameter significance in model pruning. Our theoretical analysis and empirical evaluations demonstrate that incorporating M-Loss into the merging process significantly improves the alignment between merged models and model ensembling, providing a scalable and efficient framework for accurate model consolidation. Our code is available at https://github.com/languangduan/mLoss.

Background

As summarized in the abstract, parameter averaging can combine non-generalizable features when source models differ substantially in their weights, while model ensembling is more stable but incurs higher inference and storage costs. Previous studies showed experimentally that merging and ensembling behave similarly, but theoretical evidence and evaluation metrics remain lacking.

The discrepancy between model merging and ensembling arises mainly from non-linear activations. We examine the flow of intermediate representations around activations and identify linearly correlated model parameters (LCP): parameter subsets that jointly influence a node’s representation in a row-wise manner, motivating layer-/node-level measurements and row-wise operations.
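
The role of the activation can be made concrete with a single node. The notation below is an illustrative assumption, not the paper's own definition: w_i is the row of source model i that feeds the node, x is the node's input, σ is the activation, n is the number of source models, and biases are omitted.

```latex
% Parameter averaging applies the activation once to the averaged pre-activation;
% ensembling averages the activated outputs of the individual models.
\[
  \underbrace{\sigma\!\Big(\tfrac{1}{n}\sum_{i=1}^{n} w_i^{\top} x\Big)}_{\text{parameter averaging}}
  \quad\text{vs.}\quad
  \underbrace{\tfrac{1}{n}\sum_{i=1}^{n} \sigma\big(w_i^{\top} x\big)}_{\text{ensembling}}
\]
% With the identity activation the two sides coincide, so the gap is zero.
% For a convex activation such as ReLU, Jensen's inequality gives
\[
  \sigma\!\Big(\tfrac{1}{n}\sum_{i=1}^{n} w_i^{\top} x\Big)
  \;\le\;
  \tfrac{1}{n}\sum_{i=1}^{n} \sigma\big(w_i^{\top} x\big),
\]
% with equality for ReLU when no two pre-activations w_i^T x have opposite signs.
```

Under this reading, the gap is driven by rows whose pre-activations disagree in sign across source models, which is one way to see why a row-wise, node-level view is natural.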

Visualization of linearly correlated parameters (LCP) in a neural network.

To address this gap, we introduce Merging-ensembling loss (M-Loss), a metric that quantifies the compatibility of merging source models using very limited unlabeled data by measuring the discrepancy between parameter averaging and model ensembling at layer and node levels. M-Loss serves as both a quantitative criterion for the theoretical feasibility of model merging and a guide for parameter significance in pruning.
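
A minimal sketch of a node-level discrepancy of this kind, assuming a single ReLU layer and a mean squared gap over a small unlabeled batch. The squared-error form and the names `node_discrepancy`, `average_weights`, and `matvec` are illustrative assumptions rather than the paper's exact definition.

```python
# Illustrative node-wise merging-vs-ensembling discrepancy for one ReLU layer.
# The squared-error form and all names here are assumptions for illustration.

def relu(z):
    return z if z > 0.0 else 0.0

def matvec(weight, x):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def average_weights(weights):
    """Element-wise mean of several equally shaped weight matrices."""
    n = len(weights)
    rows, cols = len(weights[0]), len(weights[0][0])
    return [[sum(w[r][c] for w in weights) / n for c in range(cols)]
            for r in range(rows)]

def node_discrepancy(weights, batch):
    """Per-node mean squared gap between merged and ensembled activations."""
    merged = average_weights(weights)
    n_nodes = len(merged)
    scores = [0.0] * n_nodes
    for x in batch:
        merged_out = [relu(z) for z in matvec(merged, x)]
        per_model = [[relu(z) for z in matvec(w, x)] for w in weights]
        for j in range(n_nodes):
            ensembled = sum(out[j] for out in per_model) / len(weights)
            scores[j] += (merged_out[j] - ensembled) ** 2
    return [s / len(batch) for s in scores]

# Two toy source models: row 0 agrees across models, row 1 has opposite signs.
model_a = [[1.0, 0.5], [2.0, -1.0]]
model_b = [[1.0, 0.5], [-2.0, 1.0]]
unlabeled_batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

scores = node_discrepancy([model_a, model_b], unlabeled_batch)
print(scores)
```

In this toy case the node whose rows agree across models scores zero, while the node whose rows conflict receives a positive score, so the score map singles out the conflicting row without any labels.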

Problem Statement

In large-scale deep learning, reliance on large labeled datasets and intensive computation limits supervised methods. Model merging fuses the weights of multiple pretrained or fine-tuned models into a single network to reduce data collection and training costs, but simple parameter averaging can combine non-generalizable features and fails when source models exhibit significant weight disparities. Existing research lacks a theoretical tool to assess model merging compatibility without labeled data.

The goal is to evaluate and guide model merging using only limited unlabeled data, meeting two key requirements:

  1. Compatibility Assessment: Quantify the discrepancy between parameter averaging (merged model) and model ensembling (averaged outputs) to assess mergeability without labeled test sets.
  2. Practical Guidance: Provide layer-/node-level signals to guide merging strategies (e.g., pruning schedules and hyperparameter selection) and improve alignment with ensembling.

Because simple parameter averaging often fails under non-linear architectures and weight disparities, a principled evaluation metric and merge-guided mechanism are necessary.

Methodology

The proposed framework, M-Loss for Mergeability and Merge-Guided Pruning, addresses model merging with limited unlabeled data in three main steps:

  1. Merging–Ensembling Discrepancy Measurement (M-Loss)
    • Given a small unlabeled set, compute the layer-/node-level discrepancy between parameter averaging (merged model) and output averaging (ensemble).
    • Produce an M-Loss score map that quantifies mergeability without labels and highlights node-level conflicts.
    • Theoretical analysis under common activations (ReLU, GELU, Leaky ReLU) explains when fine-tuned models from a shared backbone can be effectively merged.
  2. LCP-Informed Perspective for Structure
    • Analyze intermediate representations around non-linear activations to identify linearly correlated model parameters (LCP), i.e., row-wise parameter groups that jointly influence a node’s representation.
    • Use this perspective to align the granularity of measurement (node-level) with the granularity of intervention (row-wise operations).
  3. Merge-Guided Pruning and Integration (M-TIES)
    • Convert node-wise M-Loss scores into dynamic row-wise keep rates to prioritize low-conflict parameters.
    • Plug these keep rates into standard merging backends (e.g., TIES, DARE) to perform pruning/scheduling and parameter fusion.
    • Resulting merged models better align with ensembling performance while reducing inference and storage overhead.
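
Step 3 can be sketched as follows, assuming a linear mapping from node-wise scores to keep rates and a per-row magnitude Top-K. The mapping, the bounds `low` and `high`, and the function names are hypothetical choices for illustration; the scheduler actually used in M-TIES may differ.

```python
# Hypothetical mapping from node-wise conflict scores to row-wise keep rates,
# followed by per-row magnitude pruning of a task vector. The linear mapping
# and the bounds are illustrative assumptions, not the paper's schedule.

def keep_rates(scores, low=0.1, high=0.3):
    """Give low-conflict rows a rate near `high` and high-conflict rows near `low`."""
    s_min, s_max = min(scores), max(scores)
    span = s_max - s_min
    if span == 0.0:
        return [high] * len(scores)
    return [high - (high - low) * (s - s_min) / span for s in scores]

def prune_row(row, rate):
    """Keep the largest-magnitude entries of one row and zero the rest."""
    k = max(1, round(rate * len(row)))
    order = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(row)]

def prune_task_vector(task_vector, scores, low=0.1, high=0.3):
    """Apply a score-dependent keep rate to every row of a task vector."""
    rates = keep_rates(scores, low, high)
    return [prune_row(row, r) for row, r in zip(task_vector, rates)]

# Toy task vector (fine-tuned minus pretrained weights) with two rows.
task_vector = [
    [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.0, 0.3, -0.6, 0.15],
    [0.2, -0.8, 0.1, 0.5, -0.3, 0.05, 0.6, -0.4, 0.0, 0.7],
]
node_scores = [0.0, 0.5]  # row 0 is low-conflict, row 1 is high-conflict

pruned = prune_task_vector(task_vector, node_scores)
for row in pruned:
    print(row)
```

The low-conflict row retains more of its entries than the high-conflict row, which is the intended effect of a dynamic row-wise budget as opposed to one global keep rate.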

Conceptual overview of M-Loss and its use in M-TIES. (a) M-Loss measures the discrepancy between parameter-averaged and ensembled representations on unlabeled data, producing layer-/node-wise scores. (b) The node-wise M-Loss score map drives dynamic row-wise keep rates, which integrate with standard merging backends (e.g., TIES Top-K or DARE) to improve mergeability and efficiency.

Experiments

Experimental Setup

To validate the effectiveness of the M-TIES method, we established the following experimental setup:

  • Models:
    • Vision Transformer (ViT-B/32, ViT-L/14), based on pretrained weights from OpenAI CLIP.
  • Source Models:
    • We fine-tuned the pretrained ViT on 8 different datasets to obtain the source models. These datasets include: RESISC45, Cars, MNIST, DTD, EuroSAT, GTSRB, SUN397, and SVHN.
  • Baselines:
    • We compared M-TIES with four mainstream merging methods:
      1. Simple Average
      2. Task Arithmetic
      3. TIES-Merging
      4. DARE
  • M-Loss Configuration:
    • We used only 128 unlabeled samples to calculate M-Loss, which averages to just 16 samples per source dataset, making it highly practical for real-world scenarios.
    • For non-linear layers (like Attention and MLP), we adopted the standard pruning strategy from TIES, as calculating M-Loss for these layers is more complex and prior research indicates they are less critical for merging.
  • Hardware:
    • All experiments were conducted on a single NVIDIA RTX A6000 GPU.
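
One way to assemble such a calibration set is sketched below, assuming an even split of 16 samples per dataset. The helper `sample_calibration_indices`, the seed, and the placeholder pool sizes are hypothetical and not taken from the released code.

```python
import random

DATASETS = ["RESISC45", "Cars", "MNIST", "DTD", "EuroSAT", "GTSRB", "SUN397", "SVHN"]

def sample_calibration_indices(dataset_sizes, total=128, seed=0):
    """Draw an equal number of unlabeled sample indices from each dataset."""
    per_dataset, remainder = divmod(total, len(dataset_sizes))
    if remainder:
        raise ValueError("total must be divisible by the number of datasets")
    rng = random.Random(seed)
    # Sampling without replacement, so indices within a dataset are unique.
    return {name: rng.sample(range(size), per_dataset)
            for name, size in dataset_sizes.items()}

# Placeholder pool sizes; the real dataset sizes differ.
pool_sizes = {name: 1000 for name in DATASETS}
calibration = sample_calibration_indices(pool_sizes)
print({name: len(idx) for name, idx in calibration.items()})
```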

Experimental Results & Analysis

Main Performance

Our core experimental results, presented in the table below, demonstrate that M-TIES is the top-performing merging method on average for both ViT-B/32 and the larger ViT-L/14 backbones.

Specifically, on ViT-B/32, M-TIES achieves the highest average accuracy (73.23%) among all merging baselines. On the larger ViT-L/14 model, the advantage of M-TIES becomes even more pronounced. Its average accuracy of 85.28% not only surpasses other merging techniques but also comes remarkably close to the computationally expensive Ensemble baseline (85.56%). Notably, for ViT-L/14, M-TIES even outperforms the Ensemble method on 5 out of 8 individual tasks (RESISC45, MNIST, DTD, SUN397, and SVHN), validating our motivation to bridge the gap between model merging and ensembling.
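
The per-task comparison with the Ensemble baseline can be checked directly from the ViT-L/14 rows of Table 1. The snippet below is a small verification aid with the values copied from the table; the variable names are ours.

```python
TASKS = ["RESISC45", "Cars", "MNIST", "DTD", "EuroSAT", "GTSRB", "SUN397", "SVHN"]

# ViT-L/14 accuracies (%) copied from Table 1.
M_TIES_L14 = [88.57, 83.35, 99.06, 66.91, 94.61, 83.80, 76.13, 89.78]
ENSEMBLE_L14 = [87.73, 85.36, 98.78, 66.81, 98.24, 87.92, 74.76, 84.92]

# Tasks on which M-TIES is strictly better than the Ensemble baseline.
wins = [t for t, m, e in zip(TASKS, M_TIES_L14, ENSEMBLE_L14) if m > e]

# Gap between the two average accuracies, in percentage points.
avg_gap = sum(ENSEMBLE_L14) / len(TASKS) - sum(M_TIES_L14) / len(TASKS)

print(wins)
print(f"{avg_gap:.2f}")
```

The Ensemble baseline keeps its lead on Cars, EuroSAT, and GTSRB, which is why its average remains about 0.3 points higher despite losing on five tasks.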

Furthermore, M-TIES is more stable across tasks. On ViT-B/32, the variance of its accuracy across the eight tasks (172.22) is lower than that of TIES (203.27) and DARE (197.61), indicating that our method performs more consistently and does not disproportionately favor high-accuracy tasks.
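
The variance figures can be recomputed from the ViT-B/32 rows of Table 1. They match when the population variance (dividing by the 8 tasks rather than by 7) is used, which is the assumption made in this check.

```python
from statistics import pvariance

# ViT-B/32 per-task accuracies (%) copied from Table 1.
VIT_B32 = {
    "M-TIES": [72.60, 61.07, 97.62, 54.84, 82.02, 72.44, 62.19, 83.06],
    "TIES": [70.67, 58.61, 98.30, 54.20, 80.22, 72.11, 59.01, 86.20],
    "DARE": [69.97, 57.98, 97.95, 53.24, 78.89, 72.00, 59.14, 83.96],
}

# Population variance of accuracy across the eight tasks for each method.
variances = {method: pvariance(acc) for method, acc in VIT_B32.items()}

for method, value in variances.items():
    print(f"{method}: {value:.2f}")
```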

Table 1: Accuracy (%) comparison of merging methods on ViT-B/32 and ViT-L/14 backbones.

| Backbone | Method | RESISC45 | Cars | MNIST | DTD | EuroSAT | GTSRB | SUN397 | SVHN | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| ViT-B/32 | M-TIES | 72.60 | 61.07 | 97.62 | 54.84 | 82.02 | 72.44 | 62.19 | 83.06 | 73.23 |
| ViT-B/32 | TIES | 70.67 | 58.61 | 98.30 | 54.20 | 80.22 | 72.11 | 59.01 | 86.20 | 72.42 |
| ViT-B/32 | Task Arithmetic | 71.27 | 60.70 | 95.32 | 51.76 | 79.74 | 67.32 | 62.06 | 76.68 | 70.61 |
| ViT-B/32 | Simple Avg. | 71.46 | 63.34 | 87.46 | 50.11 | 73.00 | 52.79 | 64.91 | 64.16 | 65.90 |
| ViT-B/32 | DARE | 69.97 | 57.98 | 97.95 | 53.24 | 78.89 | 72.00 | 59.14 | 83.96 | 71.64 |
| ViT-B/32 | Ensemble | 79.87 | 66.60 | 95.80 | 58.30 | 98.30 | 81.11 | 66.35 | 82.15 | 78.56 |
| ViT-L/14 | M-TIES | 88.57 | 83.35 | 99.06 | 66.91 | 94.61 | 83.80 | 76.13 | 89.78 | 85.28 |
| ViT-L/14 | TIES | 88.19 | 82.81 | 99.01 | 66.70 | 94.37 | 83.36 | 75.65 | 89.42 | 84.94 |
| ViT-L/14 | Task Arithmetic | 86.17 | 82.44 | 98.54 | 65.59 | 93.93 | 83.47 | 73.56 | 85.26 | 83.62 |
| ViT-L/14 | Simple Avg. | 82.67 | 81.54 | 97.01 | 62.77 | 91.17 | 70.63 | 71.65 | 78.23 | 79.46 |
| ViT-L/14 | DARE | 88.33 | 83.35 | 98.97 | 66.86 | 94.06 | 84.20 | 75.37 | 89.19 | 85.04 |
| ViT-L/14 | Ensemble | 87.73 | 85.36 | 98.78 | 66.81 | 98.24 | 87.92 | 74.76 | 84.92 | 85.56 |

Ablation Study: Which Layers to Prune?

We investigated the effect of applying the M-Loss-guided pruning strategy to only a subset of the model’s layers. As shown in the tables below, applying M-Loss to only a few late layers (layers 8, 9, and 10 for ViT-B/32; layers 20, 21, and 22 for ViT-L/14) yields an average accuracy within about 0.02 percentage points of applying it to all layers. This is consistent with previous findings that the deeper parts of a model are more critical for merging.

Table 2: Accuracy (%) of M-TIES on ViT-B/32 with different layers pruned by M-Loss

| Layer Index | RESISC45 | Cars | MNIST | DTD | EuroSAT | GTSRB | SUN397 | SVHN | Avg |
|---|---|---|---|---|---|---|---|---|---|
| All | 72.603 | 61.074 | 97.620 | 54.840 | 82.019 | 72.439 | 62.195 | 83.063 | 73.232 |
| 0, 8, 9, 10 | 72.619 | 61.112 | 97.570 | 55.000 | 82.000 | 72.328 | 62.337 | 82.760 | 73.216 |
| 8, 9, 10 | 72.619 | 61.099 | 97.560 | 55.000 | 82.074 | 72.312 | 62.333 | 82.771 | 73.221 |

Table 3: Accuracy (%) of M-TIES on ViT-L/14 with different layers pruned by M-Loss

| Layer Index | RESISC45 | Cars | MNIST | DTD | EuroSAT | GTSRB | SUN397 | SVHN | Avg |
|---|---|---|---|---|---|---|---|---|---|
| All | 88.571 | 83.348 | 99.060 | 66.915 | 94.611 | 83.800 | 76.134 | 89.782 | 85.278 |
| 0, 20, 21, 22 | 88.587 | 83.273 | 99.070 | 66.809 | 94.593 | 83.903 | 76.088 | 89.747 | 85.259 |
| 20, 21, 22 | 88.571 | 83.273 | 99.070 | 66.809 | 94.611 | 83.895 | 76.065 | 89.751 | 85.256 |

Visualizing M-Loss

To make the conflicts between models easier to see, we plotted a heatmap of the M-Loss distribution across layers and node groups in ViT-B/32. As shown below, areas of high M-Loss (representing high conflict) are not uniformly distributed but are concentrated in specific layers and nodes. This finding supports adopting a dynamic, node-level pruning strategy (like M-TIES) over a fixed, global pruning strategy (like TIES).

Layer-wise and node-group M-Loss distribution heatmap for ViT-B/32 models. Each colored block represents the average M-Loss of 50 consecutive nodes.
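
The grouping used for the heatmap can be sketched as a block average over node scores. The helper `block_means`, the layer width of 768, and the synthetic scores are assumptions for illustration; how a final partial block is handled in the original figure is not stated, and here it is simply averaged over the nodes it contains.

```python
def block_means(node_scores, block_size=50):
    """Average consecutive groups of node scores; the last block may be shorter."""
    means = []
    for start in range(0, len(node_scores), block_size):
        block = node_scores[start:start + block_size]
        means.append(sum(block) / len(block))
    return means

# Synthetic scores for one layer with 768 nodes (placeholder values).
layer_scores = [float(i % 7) for i in range(768)]

# One heatmap row: 15 full blocks of 50 nodes plus one block of 18 nodes.
heatmap_row = block_means(layer_scores)
print(len(heatmap_row))
```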

Computational Cost

M-TIES adds only a modest amount of merging time.

  • Theoretically, when calculating M-Loss for layer k, the weights of the previous k-1 layers have already been merged. Therefore, the input only needs to pass through the shared network once, greatly reducing the number of forward passes.
  • In practice, for the ViT-B/32 model, TIES merging takes about 30 seconds, while M-TIES takes about 1 minute and 30 seconds. For the larger ViT-L/14, TIES takes about 1 minute and M-TIES about 3 minutes, so M-TIES takes roughly three times as long as TIES in both cases. The additional time is mainly spent on the forward inference needed to calculate M-Loss and is small compared with the entire evaluation process (5-15 minutes).
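
A rough cost sketch under simplifying assumptions: `prefix_layer_evals` is a hypothetical helper that counts the layer evaluations needed to produce the input of a given layer, either through one shared merged prefix or through a separate prefix per source model, and the timings are the approximate figures reported above.

```python
def prefix_layer_evals(layer_index, num_sources, shared_prefix=True):
    """Layer evaluations needed to produce the input of `layer_index` (1-based)."""
    prefix_depth = layer_index - 1
    # A shared merged prefix is traversed once; separate prefixes once per source.
    return prefix_depth if shared_prefix else prefix_depth * num_sources

# Approximate reported merge times in seconds.
TIMES = {
    "ViT-B/32": {"TIES": 30, "M-TIES": 90},
    "ViT-L/14": {"TIES": 60, "M-TIES": 180},
}

# Extra seconds spent by M-TIES, and its merge time relative to TIES.
overhead = {name: t["M-TIES"] - t["TIES"] for name, t in TIMES.items()}
ratio = {name: t["M-TIES"] / t["TIES"] for name, t in TIMES.items()}

print(prefix_layer_evals(12, 8), prefix_layer_evals(12, 8, shared_prefix=False))
print(overhead, ratio)
```

With 8 source models, the shared prefix cuts the prefix cost for any given layer by a factor of 8 in this simplified count, while the measured extra time is one to two minutes per merge.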

Conclusion and Future Work

Conclusion

This paper introduces M-Loss, a novel metric that quantifies the gap between parameter-averaged and output-averaged models without relying on labeled data. By computing the expected M-Loss for common activation functions, we provide theoretical justification for the conditions under which parameter averaging yields predictions close to model ensembling, thereby establishing when model merging is theoretically feasible. To show that M-Loss works with a concrete merging method, we integrate the M-Loss dynamic budget scheduler into the TIES merging framework, where it guides the selective removal of conflicting parameters. This integration achieves higher average accuracy than the merging baselines we evaluated. Empirical results on ViT models underscore M-Loss’s effectiveness in identifying crucial parameters. Overall, this work advances the theoretical foundations of model merging and contributes practical tools for the efficient merging of multiple models.

Future Work

Building on this work, future research could explore several promising directions:

  • Expanding Architectural Scope: Applying and adapting the M-Loss framework to other prominent architectures, particularly Large Language Models (LLMs), where model merging is of great interest.
  • M-Loss for Non-Linear Layers: Developing efficient methods to compute M-Loss for more complex, non-linear layers (e.g., attention mechanisms) to enable a fully M-Loss-guided merging process.
  • Beyond Pruning: Investigating the use of M-Loss to guide other aspects of the merging process, such as determining optimal model weighting schemes instead of simple uniform averaging.
  • Broader Theoretical Analysis: Extending the theoretical analysis to cover a wider range of activation functions and network non-linearities.

Citation

@inproceedings{wang2025mloss,
  title        = {M-Loss: Quantifying Model Merging Compatibility with Limited Unlabeled Data},
  author       = {Wang, Tiantong and Duan, Yiyang and Chen, Haoyu and Wu, Tiantong and Lim, Wei Yang Bryan},
  booktitle    = {Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
  year         = {2025},
  publisher    = {AAAI Press},
  address      = {Palo Alto, California, USA},
  url          = {https://openreview.net/forum?id=eJz0fKa8xg},
  note         = {To appear},
  keywords     = {model merging, parameter averaging, M-Loss, ViT, multimodel integration}
}