F2HF: Feed-Forward Network as a Noisy Head Filter in Vision Transformer Explanations

ICCIT 2025

I. Hossen, K. S. Alam

Khulna University of Engineering & Technology

vision-transformers interpretability attention-mechanism explainable-ai

Abstract

Vision Transformers have established themselves as the state-of-the-art approach for computer vision tasks involving large datasets. Yet the underlying mechanism behind their superior performance remains insufficiently understood. Several attempts have been made to explain the decision-making process of these models, but despite promising results, these methods frequently suffer from noisy outputs or exhibit limited faithfulness in their explanations. Our analysis reveals that a key limitation of existing interpretability methods lies in their inability to capture the Vision Transformer architecture completely. Recent research has revealed that the feed-forward network (FFN) acts as a token filter. Based on this finding, we design an interpretability method that incorporates the feed-forward network as an attention filter for each attention map. We name this method F2HF (Feed-Forward Head Filtering Saliency Generator). The baseline we build on generates a single attention map by accumulating each block's attention map and multiplying it with the classifier gradient heatmap. Adding feed-forward filtering to the encoder attention maps yields up to a 16.4% improvement in IoU, with similar improvements in precision and F1 score. In a subjective test, three human subjects blindly selected the less noisy heatmap for 500 images, choosing the saliency map generated by F2HF in 84% of the cases. The experiment and the method can be reproduced on a consumer-grade GPU (NVIDIA Tesla T4).

Overview

Overview of F2HF
Figure 1: Saliency Generation Flowchart using F2HF. The method leverages the feed-forward network's weights to filter attention heads.

Vision Transformers (ViTs) have become a key component of modern computer vision, yet producing faithful and holistic explanations for their decisions remains an ongoing challenge. Many existing interpretability methods are either model-agnostic or, when designed specifically for transformers, focus primarily on the attention mechanism. However, attention maps are inherently not class-specific and often introduce noise, checkerboard artifacts, and irrelevant background attributions.

Key idea

Recent work has shown that the Feed-Forward Network (FFN) in each transformer block acts as a factual memory and a token filter. Existing methods often fail to account fully for the FFN's role, which represents a significant gap in faithfully interpreting Transformer models. To address this, we introduce F2HF (Feed-Forward Head Filtering Saliency Generator).

In our approach, the generation of the attention map leverages the previous block's final linear layer to filter out unnecessary attention heads. By clipping negative values and focusing only on positive weights, we ensure that the most important class-specific features are retained while noisy, background-focused heads are suppressed.
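
To make the filtering step concrete, below is a minimal PyTorch sketch of this clip-and-weight pattern. The function name, the per-head scoring (mean positive weight mass per head slice), and the weight shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch

def ffn_filtered_attention(attn, ffn_out_weight, num_heads):
    """Fuse a block's attention heads, weighting each head by a score
    derived from the previous block's final FFN linear layer.

    attn:           [num_heads, tokens, tokens] attention maps
    ffn_out_weight: [d_model, d_ff] weight of the FFN output projection
                    from the previous block (illustrative shape)
    """
    head_dim = ffn_out_weight.shape[0] // num_heads
    w = ffn_out_weight.clamp(min=0)  # clip negatives, keep only positive weights
    # Average the positive weight mass feeding each head's slice of the
    # residual stream to get one importance score per head.
    head_scores = w.view(num_heads, head_dim, -1).mean(dim=(1, 2))
    head_scores = head_scores / (head_scores.sum() + 1e-8)
    # Suppress low-scoring (noisy) heads via weighted fusion of the maps.
    return (head_scores[:, None, None] * attn).sum(dim=0)
```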

Results

Objective and Subjective Evaluation

The addition of feed-forward filtering to the encoder attention maps provides significant improvements over the baseline, which accumulates block attention maps and multiplies them with the classifier gradient heatmap (sketched after the list below):

  • Intersection over Union (IoU): up to a 16.4% improvement over existing methods.
  • Precision and F1 score: consistent, notable gains across the objective segmentation and perturbation tests.
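
As a point of reference, a baseline of this kind can be sketched as attention rollout combined with a classifier gradient heatmap. The sketch below captures the general pattern under our assumptions; the exact baseline may differ in normalisation and gradient handling.

```python
import torch

def baseline_saliency(attn_maps, grad_heatmap):
    """Rollout-style baseline: accumulate each block's head-averaged
    attention map, then multiply by the classifier gradient heatmap.

    attn_maps:    list of [heads, tokens, tokens] tensors, one per block
    grad_heatmap: [tokens] gradient of the target logit w.r.t. the tokens
    """
    tokens = attn_maps[0].shape[-1]
    joint = torch.eye(tokens)
    for attn in attn_maps:
        a = attn.mean(dim=0) + torch.eye(tokens)  # average heads, add skip connection
        a = a / a.sum(dim=-1, keepdim=True)       # renormalise rows
        joint = a @ joint                         # accumulate across blocks
    cls_to_patches = joint[0, 1:]                 # CLS-token row over patch tokens
    return cls_to_patches * grad_heatmap[1:]      # make the map class-specific
```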

Furthermore, in a subjective test where three human subjects blindly selected the less noisy heatmap across 500 images, the saliency map generated by F2HF was chosen in 84% of the cases. When the target object occupies only a small portion of the image, F2HF almost always outperforms existing methods, generating a faithful, class-specific, and less noisy heatmap.

Qualitative Comparison

Qualitative comparison of generated heatmaps
Figure 2: Comparison between Proposed method (F2HF) and baseline interpretability methods. The proposed method successfully mitigates noise and isolates the target object.

Segmentation Test Comparison

Architecture  Method            Pix. Acc (%)  mIoU  mAP   mF1   mIoU (FG)
ViT Base      Proposed (F2HF)   81.36         0.65  0.88  0.45  0.53
              Beyond Int.-T     77.47         0.60  0.87  0.44  0.49
              Beyond Int.-H     78.04         0.61  0.86  0.44  0.49
              Rollout           70.33         0.52  0.82  0.42  0.41
              Raw Attention     65.91         0.41  0.72  0.19  0.20
              Int. Gradient     67.31         0.49  0.78  0.38  0.38
              GradCAM           65.91         0.41  0.72  0.19  0.20
ViT Large     Proposed (F2HF)   77.35         0.61  0.85  0.42  0.50
              Beyond Int.-T     71.46         0.52  0.82  0.39  0.40
              Beyond Int.-H     77.14         0.59  0.85  0.43  0.47
              Rollout           66.00         0.47  0.79  0.39  0.36
              Raw Attention     67.41         0.40  0.70  0.11  0.11
              Int. Gradient     65.89         0.43  0.72  0.30  0.20
              GradCAM           63.49         0.40  0.70  0.10  0.10

Positive Perturbation Test Results (Lower is better)

Perturbation  Top-K  Proposed (F2HF)  B. Int. (T)  B. Int. (H)  Rollout  IG
10%           Top-1  62.14%           65.37%       63.60%       63.55%   72.10%
              Top-3  77.26%           80.33%       78.88%       78.60%   85.68%
              Top-5  81.94%           84.56%       83.43%       82.95%   89.10%
20%           Top-1  48.37%           53.51%       63.59%       50.76%   65.94%
              Top-3  63.80%           69.03%       78.87%       66.27%   80.39%
              Top-5  69.82%           74.58%       83.43%       71.84%   84.49%
30%           Top-1  37.50%           43.59%       41.08%       40.96%   59.25%
              Top-3  52.58%           58.74%       56.15%       55.82%   74.31%
              Top-5  59.10%           64.70%       62.23%       61.76%   79.26%
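
The positive perturbation test removes the most salient regions first and tracks how quickly the model's accuracy collapses; a faithful map causes a faster drop, so lower numbers are better. A minimal sketch, assuming per-pixel saliency and zero-masking (the exact masking strategy used in the paper is not specified here):

```python
import torch

@torch.no_grad()
def survives_masking(model, image, saliency, label, frac, topk=1):
    """Zero out the `frac` most-salient pixels, then check whether the
    true label is still among the model's top-k predictions. Averaged
    over a dataset, this yields the table's percentages.

    image:    [C, H, W] input tensor
    saliency: [H, W] saliency map aligned with the image
    """
    flat = saliency.flatten()
    n = int(frac * flat.numel())
    drop = flat.topk(n).indices                     # most-salient pixel indices
    masked = image.clone().view(image.shape[0], -1)
    masked[:, drop] = 0.0                           # remove the salient evidence
    logits = model(masked.view_as(image).unsqueeze(0))
    return label in logits.topk(topk, dim=-1).indices[0]
```
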
Key takeaway: By integrating feed-forward filtering into the attention mechanism, F2HF provides a significantly more faithful and less noisy interpretability method for Vision Transformers, while remaining computationally efficient enough to run on a consumer-grade GPU (e.g., NVIDIA Tesla T4).

Citation

@inproceedings{hossen2025f2hf,
  author = {Hossen, Imran and Alam, Kazi Saeed},
  title = {{F2HF: Feed-Forward Network as a Noisy Head Filter in Vision Transformer Explanations}},
  booktitle = {{2025 28th International Conference on Computer and Information Technology (ICCIT)}},
  year = {2025},
  organization = {IEEE}
}