MAXIM: Multi-Axis MLP for Image Processing

Zhengzhong Tu1,2   Hossein Talebi1   Han Zhang1   Feng Yang1
Peyman Milanfar1   Alan Bovik2   Yinxiao Li1
1 Google Research   2 University of Texas at Austin

Work done during an internship at Google.
Abstract

Recent progress on Transformers and multi-layer perceptron (MLP) models provides new network architectural designs for computer vision tasks. Although these models have proved effective in many vision tasks such as image recognition, there remain challenges in adapting them for low-level vision. The inflexibility to support high-resolution images and limitations of local attention are perhaps the main bottlenecks. In this work, we present a multi-axis MLP based architecture called MAXIM that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature conditioning. Both modules are exclusively based on MLPs, but also benefit from being both global and ‘fully-convolutional’, two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement, while requiring fewer or comparable numbers of parameters and FLOPs than competitive models. The source code and trained models will be available at https://github.com/google-research/maxim.

1 Introduction

Image processing tasks, such as restoration and enhancement, are important computer vision problems that aim to produce a desired output from a degraded input. Different types of degradation may require different image enhancement treatments, such as denoising, deblurring, super-resolution, dehazing, and low-light enhancement. Given the increased availability of curated large-scale training datasets, recent high-performing approaches [110, 111, 125, 50, 20, 15, 60, 61, 17, 52, 22] based on carefully designed convolutional neural networks (CNNs) have demonstrated state-of-the-art (SOTA) performance on many tasks.

Figure 1: Our proposed MAXIM model significantly advances state-of-the-art performance on five image processing tasks in terms of PSNR: 1) Denoising (+0.24 dB on SIDD [2]), 2) Deblurring (+0.15 dB on GoPro [62]), 3) Deraining (+0.86 dB on Rain100L [105]), 4) Dehazing (+0.94 dB on RESIDE [46]), and 5) Retouching (Enhancement) (+1.15 dB on FiveK [8]).

Improving the architectural design of the underlying model is one of the keys to improving the performance of most computer vision tasks, including image restoration. Numerous researchers have invented or borrowed individual modules or building blocks and implemented them into low-level vision tasks, including residual learning [120, 43, 95], dense connections [121, 95], hierarchical structures [42, 41, 37], multi-stage frameworks [111, 16, 34, 113], and attention mechanisms [110, 111, 66, 91].

Recent research explorations on Vision Transformers (ViT) [24, 11, 57] have exemplified their great potential as alternatives to the go-to CNN models. The elegance of ViT [24] has also motivated similar model designs with simpler global operators such as MLP-Mixer [87], gMLP [54], GFNet [76], and FNet [44], to name a few. Despite successful applications to many high-level tasks [24, 57, 89, 4, 104, 102, 85], the efficacy of these global models on low-level enhancement and restoration problems has not been studied extensively. The pioneering works on Transformers for low-level vision [15, 10] directly applied full self-attention, which only accepts relatively small patches of fixed sizes (e.g., 48×48). Such a strategy will inevitably cause patch boundary artifacts when applied on larger images using cropping [15]. Local-attention based Transformers [52, 97] ameliorate this issue, but they are also constrained to have limited sizes of receptive field, or to lose non-locality [93, 24], which is a compelling property of Transformers and MLP models relative to hierarchical CNNs.

To overcome these issues, we propose a generic image processing network, dubbed MAXIM, for low-level vision tasks. A key design element of MAXIM is the use of a multi-axis approach (Sec. 3.2) that captures both local and global interactions in parallel. By mixing information on a single axis for each branch, this MLP-based operator becomes ‘fully-convolutional’ and scales linearly with respect to image size, which significantly increases its flexibility for dense image processing tasks. We also define and build a pure MLP-based cross-gating module, which adaptively gates the skip-connections in the neck of MAXIM using the same multi-axis approach, and which further boosts performance. Inspired by recent restoration models, we develop a simple but effective multi-stage, multi-scale architecture consisting of a stack of MAXIM backbones. MAXIM achieves strong performance on a range of image processing tasks, while requiring relatively few parameters and FLOPs. Our contributions are:

  • A novel and generic architecture for image processing, dubbed MAXIM, using a stack of encoder-decoder backbones, supervised by a multi-scale, multi-stage loss.

  • A multi-axis gated MLP module tailored for low-level vision tasks, which always enjoys a global receptive field, with linear complexity relative to image size.

  • A cross gating block that cross-conditions two separate features, which is also global and fully-convolutional.

  • Extensive experiments show that MAXIM achieves SOTA results on more than 10 datasets including denoising, deblurring, deraining, dehazing, and enhancement.

2 Related Work

Restoration models. Driven by recent enormous efforts on building vision benchmarks, learning-based models, especially CNN models, have been developed that attain state-of-the-art performance on a wide variety of image enhancement tasks [111, 16, 37, 50, 81, 17, 15, 52]. These performance gains can be mainly attributed to novel architecture designs and/or task-specific modules and units. For instance, UNet [80] has incubated many successful encoder-decoder designs [111, 37, 20] for image restoration that improve on earlier single-scale feature processing models [120, 45]. Advanced components developed for high-level vision tasks have been brought into low-level vision tasks as well. Residual and dense connections [120, 43, 95, 121], multi-scale feature learning [41, 97, 20], attention mechanisms [110, 111, 66, 91, 121], and non-local networks [93, 53, 121] are good examples. Recently, multi-stage networks [111, 16, 34, 113] have attained promising results relative to the aforementioned single-stage models on the challenging deblurring and deraining tasks [23, 34, 111]. These multi-stage frameworks are generally inspired by their success on higher-level problems such as pose estimation [18, 48], action segmentation [25, 47], and image generation [116, 117].

Low-level vision Transformers. Transformers were originally proposed for NLP tasks [90], where multi-head self-attention and feed-forward MLP layers are stacked to capture non-local interactions between words. Dosovitskiy et al. coined the term Vision Transformer (ViT) [24], and demonstrated the first pure Transformer model for image recognition. Several recent studies explored Transformers for low-level vision problems, e.g., the pioneering pre-trained image processing Transformer (IPT) [15]. Similar to ViT, IPT directly applies vanilla Transformers to image patches. The authors of [10] presented a spatial-temporal convolutional self-attention network that exploits local information for video super-resolution. More recently, Swin-IR [52] and UFormer [97] apply efficient window-based local attention models on a range of image restoration tasks.

MLP vision models. More recently, several authors have argued that when using a patch-based architecture as in ViT, the necessity of complex self-attention mechanisms becomes questionable. For instance, MLP-Mixer [87] adopts a simple token-mixing MLP to replace self-attention in ViT, resulting in an all-MLP architecture. The authors of [54] proposed the gMLP, which applies a spatial gating unit on visual tokens. ResMLP [88] adopts an Affine transformation as a substitute to Layer Normalization for acceleration. Very recent techniques such as FNet [44] and GFNet [76] demonstrate the simple Fourier Transform can be used as a competitive alternative to either self-attention or MLPs.

Figure 2: MAXIM architecture. We take (a) an encoder-decoder backbone with each (b) encoder, decoder, and bottleneck containing a multi-axis gated MLP block (Fig. 3) as well as a residual channel attention block. The model is further boosted by (c) a cross gating block which allows global contextual features to gate the skip-connections. A more detailed description can be found in Sec. A.2.

3 Our Approach: MAXIM

We present, to the best of our knowledge, the first effective general-purpose MLP architecture for low-level vision, which we call Multi-AXIs MLP for image processing (MAXIM). Unlike previous low-level Transformers [15, 10, 52, 97], MAXIM has several desirable properties that make it attractive for image processing tasks. First, MAXIM expresses global receptive fields on arbitrarily large images with linear complexity. Second, it directly supports arbitrary input resolutions, i.e., it is fully-convolutional. Lastly, it provides a balanced design of local (Conv) and global (MLP) blocks, outperforming SOTA methods without the necessity for large-scale pre-training [15].

3.1 Main Backbone

The MAXIM backbone (Fig. 2a) follows the encoder-decoder design principles that originated with UNet [80]. We have observed that operators having small footprints such as Conv3x3 are essential to the performance of UNet-like networks. Thus, we rely on a hybrid model design for each block (Fig. 2b) – Conv for local, and MLP for long-range interactions – to make the most of them.

To allow long-range spatial mixing at different scales, we insert the multi-axis gated MLP block (MAB) into each encoder, decoder, and bottleneck (Fig. 2b), with a residual channel attention block (RCAB) [100, 111] (LayerNorm-Conv-LeakyReLU-Conv-SE [31]) stacked subsequently. Inspired by the gated filtering of skip connections [67, 71], we extend the gated MLP (gMLP) to build a cross gating block (CGB, Fig. 2c), which is an efficient 2nd-order alternative to cross-attention (3rd-order correlations), to interact, or condition two distinct features. We leverage the global features from Bottleneck (Fig. 2a) to gate the skip connections, while propagating the refined global features upwards to the next CGB. Multi-scale feature fusion [84, 110, 20] (red and blue lines) is utilized to aggregate multi-level information in the Encoder→CGB and CGB→Decoder dataflow.

3.2 Multi-Axis Gated MLP

Figure 3: Multi-axis gated MLP block (best viewed in color). The input is first projected to a [6,4,C] feature, then split into two heads. In the local branch, the half head is blocked into 3×2 non-overlapping [2,2,C/2] patches, while we grid the other half using a 2×2 grid in the global branch. We only apply the gMLP block [54] (illustrated in the right gMLP Block) on a single axis of each branch - the 2nd axis for the local branch and the 1st axis for the global branch, while shared along the other spatial dimensions. The gMLP operators, which run in parallel, correspond to local and global (dilated) attended regions, as illustrated with different colors (i.e., the same color are spatially mixed using the gMLP operator). Our proposed block expresses both global and local receptive fields on arbitrary input resolutions.

Our work is inspired by the multi-axis blocked self-attention proposed in [123], which performs attention on more than a single axis. The attentions performed on two axes on blocked images correspond to two forms of sparse self-attention, namely regional and dilated attention. Despite capturing local and global information in parallel, this module cannot accommodate image restoration or enhancement tasks where the test images are often of arbitrary sizes.

We improve the ‘multi-axis’ concept for image processing tasks by building a (split-head) multi-axis gated MLP block (MAB), as shown in Fig. 3. Instead of applying multi-axis attention in a single layer [123], we first split the heads in half, with each half being partitioned independently. In the local branch, the half head of a feature of size (H, W, C/2) is blocked into a tensor of shape (H/b × W/b, b×b, C/2), representing partitioning into non-overlapping windows each of size (b×b); in the global branch, the other half head is gridded into the shape (d×d, H/d × W/d, C/2) using a fixed (d×d) grid, with each window having size (H/d × W/d). For visualization, we set b=2, d=2 in Fig. 3. To make it fully-convolutional, we only apply the gated MLP (gMLP) block [54] on a single axis of each branch – the 2nd axis for the local branch and the 1st axis for the global branch – while sharing parameters on the other spatial axes. Intuitively, applying multi-axis gMLPs in parallel corresponds to local and global (dilated) mixing of spatial information, respectively. Finally, the processed heads are concatenated and projected to reduce the number of channels, and the result is combined with the input through a long skip-connection. It is worth noting that this approach provides an advantage for our model over methods that process fixed-size image patches [15] by avoiding patch boundary artifacts.
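The blocking and gridding described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the helper names (block_partition, grid_partition, mix_axis) and the random weights are hypothetical, and the full gMLP gating unit is replaced by a single linear spatial projection to keep the example short. The point it shows is that the mixing weights have sizes b²×b² and d²×d², independent of H and W, so the same parameters apply to any resolution divisible by b and d.

```python
import numpy as np

def block_partition(x, b):
    """(H, W, C) -> (H/b * W/b, b*b, C): non-overlapping b x b windows."""
    H, W, C = x.shape
    x = x.reshape(H // b, b, W // b, b, C).transpose(0, 2, 1, 3, 4)
    return x.reshape((H // b) * (W // b), b * b, C)

def grid_partition(x, d):
    """(H, W, C) -> (d*d, H/d * W/d, C): a fixed d x d grid of strided samples."""
    H, W, C = x.shape
    x = x.reshape(d, H // d, d, W // d, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(d * d, (H // d) * (W // d), C)

def mix_axis(x, w, axis):
    """Apply one spatial projection w along `axis`, shared over all other axes."""
    x = np.moveaxis(x, axis, -1)
    return np.moveaxis(x @ w, -1, axis)

rng = np.random.default_rng(0)
b, d, C = 2, 2, 4                                 # b = d = 2 as in the Fig. 3 illustration
w_local = rng.standard_normal((b * b, b * b))     # hypothetical local mixing weights
w_global = rng.standard_normal((d * d, d * d))    # hypothetical global mixing weights

for H, W in [(6, 4), (12, 8)]:                    # the same weights serve both resolutions
    x = rng.standard_normal((H, W, C))
    local_head, global_head = x[..., : C // 2], x[..., C // 2 :]
    # Local branch: mix the 2nd axis (pixels inside each b x b window).
    local_out = mix_axis(block_partition(local_head, b), w_local, axis=1)
    # Global branch: mix the 1st axis (the d x d grid of dilated positions).
    global_out = mix_axis(grid_partition(global_head, d), w_global, axis=0)
    print((H, W), local_out.shape, global_out.shape)
```

For the 6×4 input this yields 3×2 = 6 local windows of 4 pixels each and a 4-element grid axis over 6 positions, matching the layout drawn in Fig. 3.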

Complexity analysis. The computational complexity of our proposed Multi-Axis gMLP block (MAB) is:

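$$\Omega(\mathrm{MAB}) = \underbrace{d^{2} HWC}_{\text{Global gMLP}} + \underbrace{b^{2} HWC}_{\text{Local gMLP}} + \underbrace{10\, HWC^{2}}_{\text{Dense layers}}, \tag{1}$$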

which is linear with respect to image size HW, while other global models like ViT, Mixer, and gMLP are quadratic.
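To make the scaling concrete, the following sketch evaluates Eq. 1 at two resolutions. The function names and the values b = d = 16 and C = 32 are illustrative assumptions, and the quadratic comparison term (HW)²C is only a simplified stand-in for a dense token-mixing layer over all pixels, not a cost model taken from any specific paper.

```python
def mab_complexity(H, W, C, b, d):
    """Operation count of one multi-axis gated MLP block, following Eq. (1)."""
    global_gmlp = d**2 * H * W * C
    local_gmlp = b**2 * H * W * C
    dense_layers = 10 * H * W * C**2
    return global_gmlp + local_gmlp + dense_layers

def dense_token_mixing_complexity(H, W, C):
    """Assumed reference: one dense mixing layer across all H*W positions."""
    return (H * W) ** 2 * C

# Hypothetical settings chosen only for illustration.
C, b, d = 32, 16, 16
small = mab_complexity(256, 256, C, b, d)
large = mab_complexity(512, 512, C, b, d)
print("MAB cost ratio for 4x more pixels:", large / small)
print(
    "Dense mixing cost ratio for 4x more pixels:",
    dense_token_mixing_complexity(512, 512, C) / dense_token_mixing_complexity(256, 256, C),
)
```

Because every term of Eq. 1 contains HW exactly once, quadrupling the pixel count quadruples the cost, whereas the dense reference grows sixteen-fold.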

Universality of the multi-axis approach. Our proposed parallel multi-axis module (Fig. 3) presents a principled way to apply 1D operators on 2D images in a scalable manner. It also allows for significant flexibility and universality. For example, a straightforward replacement of a gMLP with a spatial MLP [87], self-attention [24], or even Fourier Transform [76, 44] leads to a family of MAXIM variants (see Sec. 4.3D), all sharing globality and fully-convolutionality. It is also easily extensible to any future 1D operator that may be defined on, e.g., Language models.

3.3 Cross Gating MLP Block

A common improvement over UNet is to leverage contextual features to selectively gate feature propagation in skip-connections [67, 71], which is often achieved by using cross-attention [90, 13]. Here we build an effective alternative, namely cross-gating block (CGB, Fig. 2c), as an extension of MAB (Sec. 3.2) which can only process a single feature. CGB can be regarded as a more general conditioning layer that interacts with multiple features [90, 70, 13]. We follow similar design patterns as those used in MAB.

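To be more specific, let $\mathbf{X}, \mathbf{Y}$ be two input features, and $\mathbf{X}_1, \mathbf{Y}_1 \in \mathbb{R}^{H \times W \times C}$ be the features projected after the first Dense layers in Fig. 2c. Input projections are then applied:

$$\mathbf{X}_2 = \sigma(\mathbf{W}_1 \mathsf{LN}(\mathbf{X}_1)), \quad \mathbf{Y}_2 = \sigma(\mathbf{W}_2 \mathsf{LN}(\mathbf{Y}_1)), \tag{2}$$

where $\sigma$ is the $\mathsf{GELU}$ activation [30], $\mathsf{LN}$ is Layer Normalization [5], and $\mathbf{W}_1, \mathbf{W}_2$ are MLP projection matrices. The multi-axis blocked gating weights are computed from $\mathbf{X}_2, \mathbf{Y}_2$, respectively, but applied reciprocally:

$$\hat{\mathbf{X}} = \mathbf{X}_2 \odot G(\mathbf{Y}_2), \quad \hat{\mathbf{Y}} = \mathbf{Y}_2 \odot G(\mathbf{X}_2), \tag{3}$$

where $\odot$ represents element-wise multiplication, and the function $G(\cdot)$ extracts multi-axis cross gating weights from the input using our proposed multi-axis approach (Sec. 3.2):

$$G(\mathbf{x}) = \mathbf{W}_5\left(\left[\mathbf{W}_3 \mathsf{Block}_b(\mathbf{z}_1),\ \mathbf{W}_4 \mathsf{Grid}_d(\mathbf{z}_2)\right]\right), \tag{4}$$

where $[\cdot,\cdot]$ denotes concatenation. Here $(\mathbf{z}_1, \mathbf{z}_2)$ are two independent heads split from $\mathbf{z}$ along the channel dimension, where $\mathbf{z}$ represents the projected features $\mathbf{x}$ after activation:

$$[\mathbf{z}_1, \mathbf{z}_2] = \mathbf{z} = \sigma(\mathbf{W}_6 \mathsf{LN}(\mathbf{x})), \tag{5}$$

and $\mathbf{W}_3, \mathbf{W}_4$ are spatial projection matrices applied on the 2nd and 1st axis of the blocked/gridded features having fixed window size $b \times b$ ($\mathsf{Block}_b$), and fixed grid size of $d \times d$ ($\mathsf{Grid}_d$), respectively. Finally, we adopt residual connection from the inputs, following an output channel-projection that maintains the same channel dimensions as the inputs $(\mathbf{X}_1, \mathbf{Y}_1)$, using projection matrices $\mathbf{W}_7$, $\mathbf{W}_8$, denoted by

$$\mathbf{X}_3 = \mathbf{X}_1 + \mathbf{W}_7 \hat{\mathbf{X}}, \quad \mathbf{Y}_3 = \mathbf{Y}_1 + \mathbf{W}_8 \hat{\mathbf{Y}}. \tag{6}$$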

The complexity of CGB is also tightly-bounded by Eq. 1.
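Eqs. 2 to 6 can be traced end to end with a small NumPy sketch. Several simplifications are assumptions made here for brevity rather than details from the text: all channel projections are C×C, the two gating directions share one set of weights for G, GELU uses its tanh approximation, layer normalization has no learned scale or bias, and all weights are random. Channels are stored last, so a projection W applied to x is written x @ W.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def block(x, b):
    H, W, C = x.shape
    return x.reshape(H // b, b, W // b, b, C).transpose(0, 2, 1, 3, 4).reshape(-1, b * b, C)

def unblock(x, b, H, W):
    C = x.shape[-1]
    return x.reshape(H // b, W // b, b, b, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def grid(x, d):
    H, W, C = x.shape
    return x.reshape(d, H // d, d, W // d, C).transpose(0, 2, 1, 3, 4).reshape(d * d, -1, C)

def ungrid(x, d, H, W):
    C = x.shape[-1]
    return x.reshape(d, d, H // d, W // d, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def gating_weights(x, p, b, d):
    """G(x) of Eqs. (4)-(5)."""
    H, W, _ = x.shape
    z = gelu(layer_norm(x) @ p["W6"])                              # Eq. (5)
    z1, z2 = np.split(z, 2, axis=-1)                               # two heads
    local = np.einsum("ij,njc->nic", p["W3"], block(z1, b))        # mix 2nd axis
    glob = np.einsum("ij,jmc->imc", p["W4"], grid(z2, d))          # mix 1st axis
    merged = np.concatenate([unblock(local, b, H, W), ungrid(glob, d, H, W)], axis=-1)
    return merged @ p["W5"]                                        # Eq. (4)

def cross_gating(X1, Y1, p, b=2, d=2):
    X2 = gelu(layer_norm(X1) @ p["W1"])                            # Eq. (2)
    Y2 = gelu(layer_norm(Y1) @ p["W2"])
    X_hat = X2 * gating_weights(Y2, p, b, d)                       # Eq. (3)
    Y_hat = Y2 * gating_weights(X2, p, b, d)
    return X1 + X_hat @ p["W7"], Y1 + Y_hat @ p["W8"]              # Eq. (6)

rng = np.random.default_rng(0)
H, W, C, b, d = 8, 8, 4, 2, 2
p = {k: 0.1 * rng.standard_normal((C, C)) for k in ["W1", "W2", "W5", "W6", "W7", "W8"]}
p["W3"] = 0.1 * rng.standard_normal((b * b, b * b))
p["W4"] = 0.1 * rng.standard_normal((d * d, d * d))
X1, Y1 = rng.standard_normal((H, W, C)), rng.standard_normal((H, W, C))
X3, Y3 = cross_gating(X1, Y1, p, b, d)
print(X3.shape, Y3.shape)
```

Each output keeps the shape of its own input while being gated by weights derived from the other feature, which is the reciprocal conditioning that Eq. 3 expresses.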

3.4 Multi-Stage Multi-Scale Framework

We further adopt a multi-stage framework because we find it more effective, as compared to scaling up the model width or height (see ablation Sec. 4.3A). We deem full resolution processing [69, 77, 16] a better approach than a multi-patch hierarchy [83, 111, 113], since the latter would potentially induce boundary effects across patches. To impose stronger supervision, we apply a multi-scale approach [48, 18, 20] at each stage to help the network learn. We leverage the supervised attention module [111] to propagate attentive features progressively along the stages. We leverage the cross-gating block (Sec. 3.3) for cross-stage feature fusion. We refer the reader to Fig. 9 for details.

                    SIDD [2]        DND [72]        Average
Method              PSNR   SSIM     PSNR   SSIM     PSNR   SSIM
DnCNN [120]         23.66  0.583    32.43  0.790    28.04  0.686
MLP [7]             24.71  0.641    34.23  0.833    29.47  0.737
BM3D [21]           35.65  0.685    34.51  0.851    35.08  0.768
CBDNet* [29]        30.78  0.801    38.06  0.942    34.42  0.872
RIDNet* [3]         38.71  0.951    39.26  0.953    38.99  0.952
AINDNet* [38]       38.95  0.952    39.37  0.951    39.16  0.952
VDN [107]           39.28  0.956    39.38  0.952    39.33  0.954
SADNet* [12]        39.46  0.957    39.59  0.952    39.53  0.955
CycleISP* [109]     39.52  0.957    39.56  0.956    39.54  0.957
MIRNet [110]        39.72  0.959    39.88  0.956    39.80  0.958
MPRNet [111]        39.71  0.958    39.80  0.954    39.76  0.956
MAXIM-3S            39.96  0.960    39.84  0.954    39.90  0.957
Table 1: Denoising results. Our model is only trained on SIDD [2] and evaluated on SIDD [2] and DND [72], where * denotes methods using additional training data.

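Formally, given an input image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$, we first extract its multi-scale variants by downscaling: $\mathbf{I}_n, n = 1, \ldots, N$. MAXIM predicts multi-scale restored outputs at each stage $s$ of $S$ stages, yielding a total of $S \times N$ outputs: $\mathbf{R}_{s,n}$. Despite being multi-stage, MAXIM is trained end-to-end with losses accumulating across stages and scales:

$$\mathcal{L} = \sum_{s=1}^{S} \sum_{n=1}^{N} \left[ \mathcal{L}_{char}(\mathbf{R}_{s,n}, \mathbf{T}_n) + \lambda \mathcal{L}_{freq}(\mathbf{R}_{s,n}, \mathbf{T}_n) \right], \tag{7}$$

where $\mathbf{T}_n$ denotes (bilinearly-rescaled) multi-scale target images, and $\mathcal{L}_{char}$ is the Charbonnier loss [111]:

$$\mathcal{L}_{char}(\mathbf{R}, \mathbf{T}) = \sqrt{\lVert \mathbf{R} - \mathbf{T} \rVert^{2} + \epsilon^{2}}, \tag{8}$$

where we set $\epsilon = 10^{-3}$. $\mathcal{L}_{freq}$ is the frequency reconstruction loss that enforces high-frequency details [35, 20]:

$$\mathcal{L}_{freq}(\mathbf{R}, \mathbf{T}) = \lVert \mathcal{F}(\mathbf{R}) - \mathcal{F}(\mathbf{T}) \rVert_{1}, \tag{9}$$

where $\mathcal{F}(\cdot)$ represents the 2D Fast Fourier Transform. We used $\lambda = 0.1$ as the weighting factor in all experiments.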
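A literal NumPy transcription of the loss in Eqs. 7 to 9 might look as follows. This is a sketch under stated assumptions: norms are summed rather than averaged, the L1 norm of a complex spectrum is taken as the sum of magnitudes, and the nested-list layout outputs[s][n] is a hypothetical convention. Practical training code often normalizes by the number of elements, which only rescales the terms.

```python
import numpy as np

def charbonnier_loss(r, t, eps=1e-3):
    """Eq. (8): sqrt(||R - T||^2 + eps^2)."""
    return np.sqrt(np.sum((r - t) ** 2) + eps**2)

def frequency_loss(r, t):
    """Eq. (9): L1 distance between the 2D FFTs taken over the spatial axes."""
    fr = np.fft.fft2(r, axes=(0, 1))
    ft = np.fft.fft2(t, axes=(0, 1))
    return np.sum(np.abs(fr - ft))

def total_loss(outputs, targets, lam=0.1):
    """Eq. (7): outputs[s][n] is the stage-s prediction at scale n; targets[n] is T_n."""
    return sum(
        charbonnier_loss(r, targets[n]) + lam * frequency_loss(r, targets[n])
        for stage in outputs
        for n, r in enumerate(stage)
    )

# Toy example with S = 2 stages and N = 3 scales of random data.
rng = np.random.default_rng(0)
targets = [rng.random((size, size, 3)) for size in (32, 16, 8)]
outputs = [[t + 0.01 * rng.standard_normal(t.shape) for t in targets] for _ in range(2)]
print(total_loss(outputs, targets))
```

When every prediction equals its target, each Charbonnier term reduces to ε and each frequency term to zero, so the total is S·N·ε.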

4 Experiments

We aim to build a generic backbone for a broad spectrum of image processing tasks. Thus, we evaluated MAXIM on five different tasks: (1) denoising, (2) deblurring, (3) deraining, (4) dehazing, and (5) enhancement (retouching) on 17 different datasets (summarized in Tab. 8). More comprehensive results and visualizations can be found in Sec. A.6.

4.1 Experimental Setup

Datasets and metrics. We measured PSNR and SSIM [96] between ground-truth and predicted images to make quantitative comparisons. We used SIDD [2] and DND [72] for denoising; GoPro [62], HIDE [81], and RealBlur [79] for deblurring; and the combined dataset Rain13k used in [111] for deraining. RESIDE [46] is used for dehazing, while FiveK [8] and LOL [98] are used for enhancement.
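For reference, PSNR follows directly from the mean squared error between a prediction and its ground truth. The sketch below uses the standard definition; the function name and the assumption that images are scaled to [0, 1] (max_val = 1.0) are illustrative choices and not details of the evaluation code used for the reported numbers.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)

target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)   # constant error of 0.1 gives an MSE of 0.01
print(psnr(pred, target))
```

On this logarithmic scale, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.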

Input Target VDN [107] DANet [108] MIRNet [110] CycleISP [109] MPRNet [111] MAXIM-3S
Figure 4: Denoising comparisons. The example from SIDD [2] shows that our method produces cleaner denoising results.
                      GoPro [62]      HIDE [81]       Average
Method                PSNR   SSIM     PSNR   SSIM     PSNR   SSIM
DeblurGAN [40]        28.70  0.858    24.51  0.871    26.61  0.865
Nah et al. [62]       29.08  0.914    25.73  0.874    27.41  0.894
Zhang et al. [118]    29.19  0.931    -      -        -      -
DeblurGAN-v2 [41]     29.55  0.934    26.61  0.875    28.08  0.905
SRN [86]              30.26  0.934    28.36  0.915    29.31  0.925
Shen et al. [81]      -      -        28.89  0.930    -      -
Gao et al. [28]       30.90  0.935    29.11  0.913    30.01  0.924
DBGAN [119]           31.10  0.942    28.94  0.915    30.02  0.929
MT-RNN [69]           31.15  0.945    29.15  0.918    30.15  0.932
DMPHN [113]           31.20  0.940    29.09  0.924    30.15  0.932
Suin et al. [83]      31.85  0.948    29.98  0.930    30.92  0.939
MPRNet [111]          32.66  0.959    30.96  0.939    31.81  0.949
Pretrained-IPT [15]   32.58  -        -      -        -      -
MIMO-UNet+ [20]       32.45  0.957    29.99  0.930    31.22  0.944
HINet [16]            32.71  0.959    30.32  0.932    31.52  0.946
MAXIM-3S              32.86  0.961    32.83  0.956    32.85  0.959
Table 2: Deblurring results. Our model is trained on GoPro [62] and evaluated on the GoPro and the HIDE dataset [81].
                      RealBlur-R [79] RealBlur-J [79] Average
Method                PSNR   SSIM     PSNR   SSIM     PSNR   SSIM
Hu et al. [33]        33.67  0.916    26.41  0.803    30.04  0.860
Nah et al. [62]       32.51  0.841    27.87  0.827    30.19  0.834
DeblurGAN [40]        33.79  0.903    27.97  0.834    30.88  0.869
Pan et al. [68]       34.01  0.916    27.22  0.790    30.62  0.853
Xu et al. [103]       34.46  0.937    27.14  0.830    30.80  0.884
DeblurGAN-v2 [41]     35.26  0.944    28.70  0.866    31.98  0.905
Zhang et al. [118]    35.48  0.947    27.80  0.847    31.64  0.897
SRN [86]              35.66  0.947    28.56  0.867    32.11  0.907
DMPHN [113]           35.70  0.948    28.42  0.860    32.06  0.904
MPRNet [111]          35.99  0.952    28.70  0.873    32.35  0.913
MAXIM-3S              35.78  0.947    28.83  0.875    32.31  0.911
‡DeblurGAN-v2 [41]    36.44  0.935    29.69  0.870    33.07  0.903
‡SRN [86]             38.65  0.965    31.38  0.909    35.02  0.937
‡MPRNet [111]         39.31  0.972    31.76  0.922    35.54  0.947
‡MIMO-UNet+ [20]      -      -        32.05  0.921    -      -
‡MAXIM-3S             39.45  0.962    32.84  0.935    36.15  0.949
Table 3: Deblurring results on RealBlur [79]. ‡ denotes methods that are trained on RealBlur (the lower group of rows), while rows without ‡ indicate methods trained only on GoPro.
Input Target DMPHN [113] Suin et al. [83] MPRNet [111] HINet [16] MIMO-UNet [20] MAXIM-3S
Figure 5: Deblurring comparisons. The top row shows an example from GoPro [62] while the second row shows one from HIDE [81].

Training details. Our proposed MAXIM model is end-to-end trainable and requires neither large-scale pretraining nor progressive training. The network is trained on 256×256 randomly cropped patches. We train for a different number of iterations on each task. We used random horizontal and vertical flips, 90° rotation, and MixUp [112] with probability 0.5 for data augmentation. We used the Adam optimizer [39] with an initial learning rate of 2×10⁻⁴, which is steadily decreased to 10⁻⁷ with cosine annealing decay [59]. When testing, we padded the input images so that each side is a multiple of 64, using symmetric padding on both sides. After inference, we cropped the padded image back to its original size. More training details on each task can be found in Sec. A.1.
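The test-time padding and the learning-rate decay described above can be sketched as follows. The function names are hypothetical, NumPy's "symmetric" padding mode is assumed to match the padding meant in the text, the pad is split as evenly as possible between the two sides, and the schedule omits any warm-up since none is mentioned here.

```python
import numpy as np

def pad_to_multiple(img, multiple=64):
    """Pad an (H, W, C) image on both sides so H and W become multiples of `multiple`."""
    H, W = img.shape[:2]
    pad_h, pad_w = (-H) % multiple, (-W) % multiple
    top, left = pad_h // 2, pad_w // 2
    pads = ((top, pad_h - top), (left, pad_w - left), (0, 0))
    return np.pad(img, pads, mode="symmetric"), (top, left)

def crop_back(img, offsets, H, W):
    """Undo pad_to_multiple after inference."""
    top, left = offsets
    return img[top : top + H, left : left + W]

def cosine_lr(step, total_steps, lr_init=2e-4, lr_final=1e-7):
    """Cosine annealing from lr_init at step 0 to lr_final at total_steps."""
    cos = 0.5 * (1.0 + np.cos(np.pi * step / total_steps))
    return lr_final + (lr_init - lr_final) * cos

img = np.random.default_rng(0).random((100, 70, 3))
padded, offsets = pad_to_multiple(img)
restored = crop_back(padded, offsets, *img.shape[:2])   # model inference would go in between
print(padded.shape, restored.shape)
```

Padding to a multiple of 64 keeps every downsampled feature map divisible by the block and grid sizes, so the multi-axis partitioning never encounters a ragged edge.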

Architectural configuration. We designed two MAXIM variants: a two-stage model called MAXIM-2S, and a three-stage model, MAXIM-3S, for different tasks. We start with 32 initial channels for feature extraction, with 3 downsampling layers, where the features contract from 256²×32, 128²×64, 64²×128, to 32²×256, processed by two Bottlenecks (Fig. 2a), then symmetrically expanded back to full resolution. The number of parameters and required FLOPs of MAXIM-2S and MAXIM-3S, when applied on a 256×256 image, are shown in the last two rows of Tab. 7A.
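The contraction path above follows a simple rule: each downsampling layer halves the spatial size and doubles the channel count. A small helper (a hypothetical name, written only to reproduce the shapes listed in the text) makes this explicit:

```python
def encoder_shapes(size=256, channels=32, num_down=3):
    """Feature shapes (H, W, C) along the encoder, from full resolution to the bottleneck."""
    shapes = [(size, size, channels)]
    for _ in range(num_down):
        size, channels = size // 2, channels * 2
        shapes.append((size, size, channels))
    return shapes

for shape in encoder_shapes():
    print(shape)
```

The decoder mirrors this list in reverse order when expanding back to full resolution.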

4.2 Main Results

Denoising. We report in Tab. 1 numerical comparisons on the SIDD [2] and DND [72] datasets. As may be seen, our method outperformed previous SOTA techniques, e.g., MIRNet [110], by 0.24 dB of PSNR on SIDD while obtaining competitive PSNR (39.84 dB) on DND. Fig. 4 shows visual results on SIDD. Our method clearly removes real noise while maintaining fine details, yielding visually more pleasant results than the other methods.

Deblurring. Tab. 2 shows the quantitative comparison of MAXIM-3S against SOTA deblurring methods on two synthetic blur datasets: GoPro [62] and HIDE [81]. On GoPro, our method achieves a 0.15 dB gain in PSNR over the previous best model, HINet [16]. Notably, the GoPro-trained MAXIM-3S model generalizes extremely well to the HIDE dataset, setting a new SOTA PSNR of 32.83 dB. We also evaluated on real-world blurry images from RealBlur [79] under two settings: (1) directly applying the GoPro-trained model on RealBlur, and (2) fine-tuning the model on RealBlur. Under setting (1), MAXIM-3S ranked first on the RealBlur-J subset and within the top two on RealBlur-R. Fig. 5 shows visual comparisons of the evaluated models on GoPro [62], HIDE [81] and RealBlur [79], respectively. It may be observed that our model recovers text extremely well, which may be attributed to the use of the multi-axis MLP module within each block, which globally aggregates repeated patterns across various scales.

Input Target RESCAN [49] PreNet [77] MSPFN [34] MPRNet [111] HINet [16] MAXIM-2S
Figure 6: Deraining comparisons. The top and bottom rows present examples from Rain100L [105] and Test100 [115], respectively, demonstrating the ability of MAXIM to remove rain streaks while recovering more details, hence yielding more visually pleasant results.
Input Target GCANet [14] GridDehaze [55] DuRN [56] MSBDN [23] FFA-Net [74] MAXIM-2S
Figure 7: Dehazing comparisons. The top and bottom rows exemplify visual results from the SOTS indoor and outdoor sets [46].
                Rain100L [105]  Rain100H [105]  Test100 [115]   Test1200 [114]  Test2800 [27]   Average
Method          PSNR   SSIM     PSNR   SSIM     PSNR   SSIM     PSNR   SSIM     PSNR   SSIM     PSNR   SSIM
DerainNet [26]  27.03  0.884    14.92  0.592    22.77  0.810    23.38  0.835    24.31  0.861    22.48  0.796
SEMI [99]       25.03  0.842    16.56  0.486    22.35  0.788    26.05  0.822    24.43  0.782    22.88  0.744
DIDMDN [114]    25.23  0.741    17.35  0.524    22.56  0.818    29.65  0.901    28.13  0.867    24.58  0.770
UMRL [106]      29.18  0.923    26.01  0.832    24.41  0.829    30.55  0.910    29.97  0.905    28.02  0.880
RESCAN [49]     29.80  0.881    26.36  0.786    25.00  0.835    30.51  0.882    31.29  0.904    28.59  0.857
PreNet [77]     32.44  0.950    26.77  0.858    24.81  0.851    31.36  0.911    31.75  0.916    29.42  0.897
MSPFN [34]      32.40  0.933    28.66  0.860    27.50  0.876    32.39  0.916    32.82  0.930    30.75  0.903
MPRNet [111]    36.40  0.965    30.41  0.890    30.27  0.897    32.91  0.916    33.64  0.938    32.73  0.921
HINet [16]      37.20  0.969    30.63  0.893    30.26  0.905    33.01  0.918    33.87  0.940    33.00  0.925
MAXIM-2S        38.06  0.977    30.81  0.903    31.17  0.922    32.37  0.922    33.80  0.943    33.24  0.933
Table 4: Deraining comparisons on Rain100L [105], Rain100H [105], Test100 [115], Test1200 [114], and Test2800 [27]. Our method yields the best SSIM on all tested datasets, the best PSNR on Rain100L, Rain100H, and Test100, and the best average PSNR and SSIM.
                  SOTS-Indoor     SOTS-Outdoor
Method            PSNR   SSIM     PSNR   SSIM
DehazeNet [9]     21.14  0.847    22.46  0.851
GFN [78]          22.30  0.880    21.55  0.844
GCANet [14]       30.23  0.959    19.98  0.704
GridDehaze [55]   32.14  0.983    30.86  0.981
GMAN [58]         27.93  0.896    28.47  0.944
MSBDN [23]        33.79  0.984    23.36  0.875
DuRN [56]         32.12  0.980    24.47  0.839
FFA-Net [74]      36.39  0.989    33.57  0.984
AECR-Net [101]    37.17  0.990    -      -
MAXIM-2S          38.11  0.991    34.19  0.985
Table 5: Dehazing comparisons. Our model achieved the best results on both indoor and outdoor scenes.
Input Target CycleGAN [124] Exposure [32] DPE [19] EnlightenGAN UEGAN [65] MAXIM-2S
Input Target Retinex [98] GLAD [92] KinD [122] EnlightenGAN MIRNet [110] MAXIM-2S
Figure 8: Retouching and low-light enhancement comparisons. The top row shows an example from the MIT-Adobe FiveK dataset [8], while the bottom row exemplifies a comparison from LOL [98]. Our model generated variegated and more naturalistic colors (top) for retouching, while achieving clearer and brighter visual enhancements in the bottom example.

Deraining. Following previous work [34, 111], we computed the performance metrics using the Y channel (in YCbCr color space). Tab. 4 shows quantitative comparisons with previous methods. As may be seen, our model improved over the SOTA in SSIM on all datasets and in PSNR on three of the five. The average PSNR gain of our model over the previous best model HINet [16] is 0.24 dB. We show some challenging examples in Fig. 6, which demonstrate that our method consistently delivered faithfully recovered images without introducing any noticeable visual artifacts.
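Evaluating on the Y channel means converting both images to luma before computing the metric. The sketch below uses the ITU-R BT.601 studio-range conversion (Y in [16, 235] for RGB in [0, 1]), which is a common convention in deraining benchmarks; the text above does not state the exact conversion or peak value used, so both the coefficients and peak = 255 are assumptions here, as are the function names.

```python
import numpy as np

def rgb_to_y(img):
    """Luma of YCbCr from RGB in [0, 1] (BT.601 studio range, assumed convention)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b

def psnr_y(pred, target, peak=255.0):
    """PSNR in dB computed on the Y channel only."""
    mse = np.mean((rgb_to_y(pred) - rgb_to_y(target)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak**2 / mse)

rng = np.random.default_rng(0)
clean = rng.random((16, 16, 3))
noisy = np.clip(clean + 0.05 * rng.standard_normal(clean.shape), 0.0, 1.0)
print(psnr_y(noisy, clean))
```

Because the green coefficient dominates, errors in the green channel weigh most heavily in a Y-channel score.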

Dehazing. We report our comparisons against SOTA models in Tab. 5. Our model surpassed the previous best model by 0.94 dB and 0.62 dB of PSNR on the SOTS [46] indoor and outdoor sets. Fig. 7 shows that our model recovered images of better quality on both flat regions as well as textures, while achieving a harmonious global tone.

Enhancement / Retouching. As Tab. 6 illustrates, our model achieved the best PSNR and SSIM values on FiveK [8] and the best SSIM on LOL [98]. As the top row of Fig. 8 suggests, MAXIM recovered more diverse, naturalistic colors than the other techniques. Regarding the bottom example, while MIRNet [110] obtained a higher PSNR, we consistently observed that our model attains visually better quality with sharper details and less noise. Moreover, the far more perceptually relevant SSIM index indicates a significant advantage of MAXIM-2S relative to MIRNet.

Other benchmarks. Due to space limitations, we detail the outcomes of our experiments on the REDS deblurring [63] and the Raindrop removal task [73] in Sec. A.5.

4.3 Ablation

We conducted extensive ablation studies to validate the proposed multi-axis gated MLP block, cross-gating block, and multi-stage multi-scale architecture. The evaluations were performed on the GoPro dataset [62], with models trained on image patches of size 256×256 for 10⁶ iterations. We used the MAXIM-2S model as the test-bed for Ablation-A and -B.

A. Individual components. We conducted an ablation by progressively adding (1) inter-stage cross-gating blocks (CGB_IS), (2) a supervised attention module (SAM), (3) cross-stage cross-gating blocks (CGB_CS), and (4) the multi-scale supervision (MS-Sp). Tab. 7A indicates a PSNR gain of 0.25, 0.63, 0.36, and 0.26 dB for each respective component.

FiveK [8]                         LOL [98]
Method           PSNR   SSIM      Method           PSNR   SSIM
CycleGAN [124]   18.23  0.835     Retinex [98]     16.77  0.559
Exposure [32]    22.35  0.861     GLAD [92]        19.71  0.703
EnlightenGAN     17.74  0.828     EnlightenGAN     17.48  0.657
DPE [19]         24.08  0.922     KinD [122]       20.37  0.804
UEGAN [65]       25.00  0.929     MIRNet [110]     24.14  0.830
MAXIM-2S         26.15  0.945     MAXIM-2S         23.43  0.863
Table 6: Enhancement results on FiveK [8] and LOL [98].

B. Effects of multi-axis approach. We further examined the necessity of our proposed multi-axis approach, as shown in Tab. 7B. We conducted experiments starting from (1) a baseline UNet and then adding (2) the local branch of MAB (MAB_l), (3) the global branch of MAB (MAB_g), (4) the local branch of CGB (CGB_l), and (5) the global branch of CGB (CGB_g). Note that the large jump in PSNR (+1.04 dB) from adding MAB_l can be largely attributed to the addition of input and output channel projection layers, because we also observe a high performance of 31.42 dB PSNR if only MAB_g is added. Overall, we observed a major improvement when including MAB, and a relatively minor gain when adding CGB.

C. Why multi-stage? Towards understanding this, we scaled up MAXIM in terms of width (channels), depth (downscaling steps), and the number of stages. Tab. 7C suggests that packing the backbone into multi-stages yields the best performance vs. complexity tradeoff (32.44 dB, 22.2 M, 339.2 G), compared to making it wider or deeper.

D. Beyond gMLP: the MAXIM families. As described in Sec. 3.2, our proposed multi-axis approach (Fig. 3) offers a scalable way of applying any 1D operator on (high-resolution) images, with linear complexity relative to image size while remaining fully-convolutional. We conducted a pilot study using MAXIM-1S and -2S on SIDD [2] to explore the MAXIM families: MAXIM-FFT, -MLP, -gMLP (modeled in this paper), and -SA, where we use the Fourier Transform filter [76, 44], spatial MLP [87], gMLP [54], and self-attention [24], respectively, on spatial axes using the same multi-axis approach (Fig. 3). As Tab. 7D shows, the gMLP and self-attention variants achieved the best performance, while the FFT and MLP families were more computationally efficient. We leave deeper explorations to future work.

CGBIS SAM CGBCS MS-Sp PSNR
30.73
30.98
31.61
31.97
32.23
A. Individual components.
MAB_l MAB_g CGB_l CGB_g PSNR
30.48
31.52
31.68
31.84
31.91
B. Effects of multi-axis approach.
Setting S W D PSNR Params FLOPs
Base 1 32 3 31.08 6.1M 93.6G
Wider 1 64 3 32.09 19.4M 309.9G
Wider 1 96 3 32.31 41.7M 648.9G
Deeper 1 32 4 31.17 19.8M 121.6G
Deeper 1 32 5 31.43 75.0M 153.4G
More stages 2 32 3 31.82 14.1M 216.4G
More stages 3 32 3 32.44 22.2M 339.2G
C. Why multi-stage?
Variant PSNR Params FLOPs
M1-FFT 39.67 4.1M 71G
M1-MLP 39.75 5.4M 83G
M1-gMLP 39.80 6.1M 93G
M1-SA 39.79 5.3M 111G
M2-FFT 39.74 10.1M 172G
M2-MLP 39.70 12.7M 195G
M2-gMLP 39.83 14.1M 216G
M2-SA 39.85 12.5M 250G
D. Beyond gMLP.
Table 7: Ablation studies. Components in subtable A and B are defined in Sec. 4.3. S, W, and D denote the number of stages, width, and depth, respectively. M1 and M2 in subtable D denote MAXIM-1S and MAXIM-2S models, respectively.

5 Conclusion

We have presented a generic network for restoration and enhancement tasks, dubbed MAXIM, inspired by recently popular MLP-based global models. Our work suggests an effective and efficient approach for applying gMLP to low-level vision tasks to gain global attention, a missing attribute of basic CNNs. Our gMLP instantiation of the MAXIM family significantly advances the state of the art in several image enhancement and restoration tasks with moderate complexity. We demonstrated a few applications, but many more possibilities beyond the scope of this work could benefit significantly from MAXIM. Our future work includes exploring more efficient models for extremely high-resolution image processing, as well as training large models that can adapt to multiple tasks.

Broader impacts. The proposed model can be used as an effective tool to enhance and retouch daily photos. However, enhancement techniques such as denoising and deblurring can be misused in ways that raise privacy concerns. Models trained on specific data may also exhibit bias. Researchers should address these issues responsibly.

6 Acknowledgment

We thank Junjie Ke, Mauricio Delbracio, Sungjoon Choi, Irene Zhu, Innfarn Yoo, Huiwen Chang, and Ce Liu for valuable discussions and feedback.

Appendix A

A.1 Datasets and Training Details

All the datasets used in the paper are summarized in Tab. 8. We describe the training details for each dataset below. Note that we used the ℓ2 loss for the dehazing task and the loss defined in the main paper for all the other tasks.

Image Denoising. We trained our model on 320 high-resolution images provided in SIDD [2] and evaluated on 1,280 (256×256) and 1,000 (512×512) images provided by authors of SIDD [2] and DND [72], respectively. The results on DND were obtained via the online server [1]. We cropped the training images into 512×512 patches with a stride of 256 to prepare the training patches. We trained the MAXIM-3S model for 600k steps with a batch size of 256.
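The patch preparation described above (512×512 crops taken with a stride of 256) can be sketched as follows. This is a minimal illustration rather than the authors' data pipeline: the function name `crop_patches` is hypothetical, and dropping border regions that do not fit a full patch is an assumption, since the text does not say how borders are handled.

```python
import jax.numpy as jnp


def crop_patches(image, patch_size=512, stride=256):
    """Crop overlapping square patches from an (H, W, C) image.

    Border regions that do not fit a full patch are dropped (an assumption;
    the paper does not specify border handling).
    """
    h, w, _ = image.shape
    patches = []
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return jnp.stack(patches)


# Toy example: a 1024x1280 image gives 3 row offsets and 4 column offsets.
image = jnp.zeros((1024, 1280, 3))
patches = crop_patches(image)
```

With a stride of half the patch size, neighbouring patches overlap by 50%, so each training image contributes more patches than a non-overlapping tiling would.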

Image Deblurring. We trained our model on 2,103 image pairs from GoPro [62]. To demonstrate generalization ability, we evaluated our GoPro-trained model on the 1,111 pairs of the GoPro evaluation set, the 2,025 images in the HIDE dataset [81], and the RealBlur dataset [79], whose two test subsets, built from camera JPEG outputs (RealBlur-J) and RAW images (RealBlur-R), each contain 980 image pairs. We cropped training images from GoPro into 512×512 patches with a stride of 128 to generate training patches. We trained our MAXIM-3S model for 600k steps with a batch size of 256. For evaluation on RealBlur setting (2) (see main paper), we loaded the GoPro pre-trained checkpoint and fine-tuned it for 70k and 15k iterations on RealBlur-J and RealBlur-R, respectively. Additionally, we trained our model on 24,000 images from the REDS dataset of the NTIRE 2021 Image Deblurring Challenge Track 2 JPEG artifacts [63]. For evaluation, we followed the settings in the NTIRE 2021 Challenge on Image Deblurring [64], i.e., we used 300 images from the validation set of REDS. We trained from scratch for 10k epochs on REDS [63].

Task Dataset #Train #Test Test Dubname
Denoising SIDD [2] 320 40 SIDD
Denoising DND [72] 0 50 DND
Deblurring GoPro [62] 2103 1111 GoPro
Deblurring HIDE [81] 0 2025 HIDE
Deblurring RealBlur-J [79] 3758 980 RealBlur-J
Deblurring RealBlur-R [79] 3758 980 RealBlur-R
Deblurring REDS [63] 24000 300 REDS
Deraining Rain14000 [27] 11200 2800 Test2800
Deraining Rain1800 [105] 1800 0 -
Deraining Rain800 [115] 700 98 Test100
Deraining Rain100H [105] 0 100 Rain100H
Deraining Rain100L [105] 0 100 Rain100L
Deraining Rain1200 [114] 0 1200 Test1200
Deraining Rain12 [51] 12 0 -
Deraining Raindrop [73] 861 58 Raindrop-A
Deraining Raindrop [73] 0 239 Raindrop-B
Dehazing RESIDE-ITS [46] 13990 500 SOTS-Indoor
Dehazing RESIDE-OTS [46] 313950 500 SOTS-Outdoor
Enhancement (Retouching) MIT-Adobe FiveK [8] 4500 500 FiveK
Enhancement (Retouching) LOL [98] 485 15 LOL
Table 8: Dataset summary on five image processing tasks.
Figure 9: We adopt a general multi-stage framework to improve the performance of MAXIM for challenging restoration tasks. Inspired by [111, 16], we employ the supervised attention module (SAM) and cross-stage feature fusion to help later stages learn. Unlike previous approaches, our MAXIM backbone attains global perception at each layer in each stage due to the proposed multi-axis MLP approaches, making it more powerful in learning global interactions in both low-level and high-level features.

Image Deraining. Following [111, 34], we used a composite training set containing 13,712 clean-rain image pairs collected from multiple datasets [27, 105, 115, 105, 114, 51]. Evaluation was performed on five test sets: Rain100H [105], Rain100L [105], Test100 [115], Test1200 [114], and Test2800 [27]. We trained our MAXIM-2S model for 500k steps with a batch size of 512. For the raindrop removal task, we trained MAXIM-2S on the 861 training image pairs of the Raindrop dataset [73] for 80k steps with a batch size of 512, and evaluated on testset A (58 images) and testset B (239 images).

Image Dehazing. The RESIDE dataset [46] contains two subsets: the Indoor Training Set (ITS), which contains 13,990 hazy images generated from 1,399 clean ones, and the Outdoor Training Set (OTS), which consists of 313,950 hazy images synthesized from 8,970 haze-free outdoor scenes. We evaluated our model on the Synthetic Objective Testing Set (SOTS) [46]: 500 indoor images for the ITS-trained model and 500 outdoor images for the OTS-trained model. We trained for 10k and 500 epochs on RESIDE-ITS and RESIDE-OTS, respectively, using the ℓ2 loss.

Image Enhancement. We used the MIT-Adobe FiveK [8] dataset provided by [65] for the retouching evaluation: the first 4,500 images for training and the remaining 500 for testing. We cropped training images into 512×512 patches with a stride of 256. We also used the LOL dataset [98], which includes 500 image pairs, for low-light enhancement. We trained our model on 485 training images and evaluated on 15 test images. We trained for 14k and 180k steps on FiveK and LOL, respectively.

A.2 Architecture Details

Our proposed general multi-stage and multi-scale framework is illustrated in Fig. 9, where each stage uses a single-stage MAXIM backbone, which is illustrated in the main paper. We leveraged the multi-scale input-output approach [20] to deeply supervise each stage. Specifically, given an input image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$, we used the nearest neighbour downscaling method [20] to generate multi-scale input variants $\mathbf{I}_n$, $n = 1, 2, 3$, while we adopted a bilinear downscaler to produce the ground truth variants $\mathbf{T}_n$, $n = 1, 2, 3$. For each stage, we extracted shallow features from the inputs at each scale using Conv3x3. Except for the first stage, we fused the shallow features with attention features coming from the previous supervised attention module (SAM) [111] using a cross gating block (CGB). We also employed cross-stage feature fusion [111, 16] to help later stages, where the intermediate Encoder and Decoder features from the previous stage are fused with features encoded at the current stage using a CGB (blue lines in Fig. 9).
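The multi-scale inputs and targets described above can be sketched with `jax.image.resize`. This is an illustrative sketch, not the released training code: the helper name `multiscale_pyramid` is hypothetical, and the factor-of-two spacing between scales (full, half, and quarter resolution) is an assumption, as the text only states that three scales are used.

```python
import jax
import jax.numpy as jnp


def multiscale_pyramid(image, num_scales=3, method="nearest"):
    """Return [full, 1/2, 1/4, ...] resolution copies of an (H, W, C) image.

    The factor-of-two spacing between scales is an assumption.
    """
    h, w, c = image.shape
    return [
        jax.image.resize(image, (h // 2**n, w // 2**n, c), method=method)
        for n in range(num_scales)
    ]


key_in, key_gt = jax.random.split(jax.random.PRNGKey(0))
degraded = jax.random.uniform(key_in, (256, 256, 3))  # stand-in input image
target = jax.random.uniform(key_gt, (256, 256, 3))    # stand-in ground truth

inputs = multiscale_pyramid(degraded, method="nearest")   # I_1, I_2, I_3
targets = multiscale_pyramid(target, method="bilinear")   # T_1, T_2, T_3
```

Each stage can then be supervised at every scale by comparing its output at scale n with the corresponding target.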

A.2.1 Configurations

The detailed specifications of the Encoder part for a single-stage MAXIM are shown in Tab. 9. We also provide the input and output shapes of each block and layer. Here Conv3x3_s1_w32 means a Conv layer with 3x3 kernels, stride 1, and 32 channels. MAB and RCAB are the two major components in Encoder / Decoder / Bottleneck. Note that in Bottleneck blocks, we use (Conv1x1) layers to replace Conv3x3 in RCAB.

Depth Input shape Output Shape Layers
1 256²×3 256²×32 Conv3x3_s1_w32
1 256²×32 256²×32 CGB* (b=d=16)
1 256²×32 256²×32 Conv1x1_s1_w32
1 256²×32 256²×32 {MAB(b=d=16), RCAB(3×3, r=4)} ×2
1 256²×32 128²×32 Conv3x3_s2_w32
2 128²×32 128²×64 Conv3x3_s1_w64
2 128²×64 128²×64 CGB* (b=d=16)
2 128²×64 128²×64 Conv1x1_s1_w64
2 128²×64 128²×64 {MAB(b=d=16), RCAB(3×3, r=4)} ×2
2 128²×64 64²×64 Conv3x3_s2_w64
3 64²×64 64²×128 Conv3x3_s1_w128
3 64²×128 64²×128 CGB* (b=d=8)
3 64²×128 64²×128 Conv1x1_s1_w128
3 64²×128 64²×128 {MAB(b=d=8), RCAB(3×3, r=4)} ×2
3 64²×128 32²×128 Conv3x3_s2_w128
4 32²×128 32²×256 Conv1x1_s1_w256
4 32²×256 32²×256 {MAB(b=d=8), RCAB(1×1, r=4)} ×2
4 32²×256 32²×256 Conv1x1_s1_w256
4 32²×256 32²×256 {MAB(b=d=8), RCAB(1×1, r=4)} ×2
Table 9: Detailed architectural specifications of the Encoder part of a single-stage MAXIM backbone. Depths 1-3 denote Encoder blocks, while depth 4 corresponds to Bottleneck blocks. Note that in Bottlenecks, we use Conv1x1 in RCAB. * indicates layers that are not employed in the first stage.

The Decoder part of MAXIM mirrors Tab. 9 and uses the same configuration. For the CGB necks, we used b=d=16 for depths 1 and 2, and b=d=8 for depth 3. In general, we set the block and grid sizes to 16 for high-resolution stages (i.e., feature size ≥128) and to 8 for low-resolution stages (i.e., feature size <128). Consequently, both dimensions of an input image must be divisible by 64, so images need to be padded to a multiple of 64 during inference.
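One way to read the divisibility constraint from Tab. 9 is that three stride-2 convolutions reduce the Bottleneck feature map to 1/8 of the input size, and a block size of 8 at that resolution then requires 8 × 8 = 64. A minimal sketch of inference-time padding under that constraint is given below. The helper names are hypothetical, and reflect padding on the bottom and right edges is an assumption, since the text does not specify the padding scheme.

```python
import jax.numpy as jnp


def pad_to_multiple(image, multiple=64, mode="reflect"):
    """Pad an (H, W, C) image on the bottom/right so H and W divide evenly.

    The padding mode and side are assumptions, not taken from the paper.
    """
    h, w, _ = image.shape
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    padded = jnp.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode=mode)
    return padded, (h, w)


def unpad(output, original_hw):
    """Crop a model output back to the original spatial size."""
    h, w = original_hw
    return output[:h, :w]


image = jnp.ones((300, 500, 3))
padded, original_hw = pad_to_multiple(image)  # 300 -> 320, 500 -> 512
restored = unpad(padded, original_hw)
```

In practice the padded image would be passed through the network and the output cropped with `unpad`.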

A.2.2 Comparison with Other MLPs

In Fig. 10, we show a visual comparison of the approximated effective receptive fields among recent MLP models: MLP-Mixer [87], gMLP [54], Swin-Mixer [57], and our proposed MAXIM. Our approach achieves sparse interactions to obtain both local (red in Fig. 10c) and global dilated (green) spatial communications. Moreover, as shown in Tab. 10, unlike previous MLP models, MAXIM obtains both global and fully-convolutional properties with a linear complexity with respect to the number of pixels N.
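To make the complexity comparison concrete, the following back-of-the-envelope sketch counts per-channel multiplications for the spatial mixing step only. It is a simplified cost model, not a FLOPs measurement: channel projections, biases, and gating are ignored, and the function names are hypothetical. A global spatial MLP mixes all N pixels with an N×N weight matrix, whereas the block branch mixes b² pixels inside each of the N/b² windows and the grid branch mixes d² pixels for each of the N/d² grid offsets.

```python
def global_mixing_cost(n_pixels):
    """Multiplications for one dense N x N spatial projection (per channel)."""
    return n_pixels ** 2


def multi_axis_mixing_cost(n_pixels, b=16, d=16):
    """Multiplications for block (b x b) plus grid (d x d) spatial projections."""
    block_cost = (n_pixels // b**2) * (b**2) ** 2  # equals n_pixels * b^2
    grid_cost = (n_pixels // d**2) * (d**2) ** 2   # equals n_pixels * d^2
    return block_cost + grid_cost


for side in (256, 512, 1024):
    n = side * side
    print(side, global_mixing_cost(n), multi_axis_mixing_cost(n))
```

Under this model, doubling the image side length multiplies the global cost by 16 but the multi-axis cost by only 4, which is the linear scaling in N when b and d are fixed.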

A.3 JAX Implementations

Here we provide a JAX [6] implementation of the key component of MAXIM, namely the multi-axis gated MLP block (MAB), in Algorithm 1.

A.4 Performance vs. Complexity

We demonstrate the performance vs. complexity trade-off in Tab. 11, compared with other competing methods for all the tasks. As can be seen, our model obtains state-of-the-art performance at a very moderate complexity. On denoising, for example, MAXIM-3S has only 21% of the FLOPs and 70% of the parameters of MIRNet [110]; on deblurring, our MAXIM-3S model requires only 25% of the number of parameters of the previous best model HINet [16], and merely 19% of the number of parameters of the Transformer model IPT [15]. It is also worth noting that, unlike IPT, our model requires no large-scale pre-training to obtain leading performance, making it attractive for low-level tasks where datasets are often of limited scale.
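The ratios quoted above follow directly from the entries of Tab. 11. The short check below recomputes them. The dictionary simply transcribes the table values, and the variable and function names are illustrative.

```python
# Parameters (millions) and FLOPs (G) transcribed from Tab. 11.
table11 = {
    "MAXIM-3S": {"params_m": 22.2, "flops_g": 339},
    "MIRNet": {"params_m": 31.7, "flops_g": 1572},
    "HINet": {"params_m": 88.7, "flops_g": 341},
    "IPT": {"params_m": 114.0, "flops_g": 1188},
}


def ratio(metric, other, ours="MAXIM-3S"):
    """Ratio of our model's metric to that of a competing model."""
    return table11[ours][metric] / table11[other][metric]


flops_vs_mirnet = ratio("flops_g", "MIRNet")    # roughly 0.21-0.22
params_vs_mirnet = ratio("params_m", "MIRNet")  # roughly 0.70
params_vs_hinet = ratio("params_m", "HINet")    # roughly 0.25
params_vs_ipt = ratio("params_m", "IPT")        # roughly 0.19
```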

Model Complexity Fully-conv Global
MLP-Mixer [87] 𝒪(N²)
gMLP [54] 𝒪(N²)
Swin-Mixer [57] 𝒪(N)
MAXIM (ours) 𝒪(N)
Table 10: Comparisons of MAXIM with other MLP models. Our model is both fully-convolutional and global, having a linear complexity with respect to the number of pixels N.
Figure 10: Visualizations of effective receptive fields (shaded area) of the blue pixel for (a) Mixer/gMLP, (b) Swin-Mixer, and (c) our MAXIM. MAXIM attains both local (red) and (dilated) global (green) perception. Yellow pixels are achievable by both local and global branches.
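The local and dilated-global neighbourhoods of Fig. 10c can be reproduced numerically from the blocking scheme used in Algorithm 1. The sketch below is illustrative only: `receptive_masks` is a hypothetical helper, and the toy sizes (a 16×16 feature map with 4×4 blocks and a 4×4 grid) are chosen for readability rather than taken from the paper.

```python
import jax.numpy as jnp


def receptive_masks(h, w, pixel, block_size, grid_size):
    """Boolean (h, w) masks of the pixels that one query pixel is mixed with."""
    rows = jnp.arange(h)[:, None]
    cols = jnp.arange(w)[None, :]
    pi, pj = pixel
    # Block (local) branch: pixels in the same non-overlapping window.
    bh, bw = block_size
    local = (rows // bh == pi // bh) & (cols // bw == pj // bw)
    # Grid (global) branch: pixels at the same offset inside every grid cell,
    # i.e. a dilated gh x gw lattice with strides (h // gh, w // gw).
    gh, gw = grid_size
    sh, sw = h // gh, w // gw
    dilated = (rows % sh == pi % sh) & (cols % sw == pj % sw)
    return local, dilated


local, dilated = receptive_masks(16, 16, pixel=(5, 6),
                                 block_size=(4, 4), grid_size=(4, 4))
both = local & dilated  # pixels reachable through both branches
```

In this toy setting each branch mixes 16 pixels, but the grid branch spreads them uniformly across the whole feature map, which is what gives the block its global reach at a fixed cost.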

A.5 Additional Experiments

Due to limited space in the main paper, we report additional experimental results on deblurring and raindrop removal here.

Deblurring on REDS [63]. Tab. 12 shows quantitative comparisons of MAXIM-3S against the winning solution, HINet [16], and a leading model, MPRNet [111] on the REDS dataset of NTIRE 2021 Image Deblurring Challenge Track 2 JPEG artifacts [63]. The metrics are computed and averaged on 300 validation images. Our MAXIM-3S model surpasses HINet by 0.1 dB of PSNR.

Task Dataset Model PSNR Params FLOPs
Denoise SIDD [2] MPRNet [111] 39.71 15.7M 1176G
MIRNet [110] 39.72 31.7M 1572G
MAXIM-3S 39.96 22.2M 339G
Deblur GoPro [62] MPRNet [111] 32.66 20.1M 1554G
HINet [16] 32.71 88.7M 341G
IPT [15] 32.58 114M 1188G
MAXIM-3S 32.86 22.2M 339G
Derain Rain13k (Average) MSPFN [34] 30.75 21.7M -
MPRNet [111] 32.73 3.64M 297G
MAXIM-2S 33.24 14.1M 216G
Dehaze Indoor [46] MSBDN [23] 33.79 31.3M 83G
FFA-Net [74] 36.36 4.5M 576G
MAXIM-2S 39.72 14.1M 216G
Enhance LOL [98] MIRNet [110] 24.14 31.7M 1572G
MAXIM-2S 23.43 14.1M 216G
Table 11: Model performance vs. complexity comparison of our model with other competing methods for all the tasks. FLOPs are calculated on an input image of size 256×256.
REDS [63]
Method PSNR SSIM
MPRNet [111] 28.79 0.911
HINet [16] 28.83 0.862
MAXIM-3S 28.93 0.865
Table 12: Deblurring comparisons on REDS. Our method outperforms previous winning solution (HINet) on the REDS dataset of NTIRE 2021 Image Deblurring Challenge Track 2 JPEG artifacts. The scores are evaluated on 300 images from the validation set. Results are gathered from the authors of [16].
Raindrop-A [73] Raindrop-B [73]
Method PSNR SSIM PSNR SSIM
AGAN [73] 31.62 0.921 25.05 0.811
DuRN [56] 31.24 0.926 25.32 0.817
Quan [75] 31.36 0.928 - -
MAXIM-2S 31.87 0.935 25.74 0.827
Table 13: Deraining comparisons on Raindrop removal dataset [73]. Our MAXIM-2S model attains state-of-the-art performance on both Raindrop testset A and B.

Raindrop removal [73]. Apart from the rain streak removal task reported in the main paper, we also evaluated our MAXIM model on the raindrop removal task. As can be seen in Tab. 13, our model achieved the best performance: 31.87 dB and 25.74 dB PSNR on Raindrop testset A and B.

A.6 More Visual Comparisons

Denoising. Fig. 12 shows denoising results of our model compared with SOTA models on SIDD [2]. Our model recovers more details, yielding visually pleasant outputs.

Deblurring. The visual results on GoPro [62], HIDE [81], RealBlur-J [79], and REDS [63] are shown in Fig. 13, Fig. 14, Fig. 15, and Fig. 16, respectively. Our model outperformed other competing methods on both synthetic and real-world deblurring benchmarks.

Deraining. Qualitative comparisons of our model against SOTA methods on deraining are shown in Fig. 17, Fig. 18, Fig. 19, and Fig. 20.

Raindrop removal. We provide visual comparisons of the raindrop removal task on the Raindrop testset A and B [73] in Fig. 21 and Fig. 22.

Dehazing. We provide dehazing comparisons on the SOTS [46] indoor and outdoor sets in Fig. 23 and Fig. 24.

Retouching. Fig. 25 shows additional comparisons of our model with competing methods on the Five-K dataset [8] provided by [65] for retouching results.

Low-light enhancement. Fig. 26 demonstrates the evaluations on the LOL [98] test set for low-light enhancement.

A.7 Weight Visualizations

Fig. 11 visualizes the spatial projection matrices of the block gMLP and the grid gMLP layers of each stage of MAXIM-3S trained on GoPro [62]. Similar to [54], we also observed that the weights after learning exhibit locality and spatial invariance. Surprisingly, the global grid gMLP layer also learns to perform ‘local’ operations (but on the uniform dilated grid). The spatial weights of block gMLP and grid gMLP in the same layer often demonstrate similar or coupled shapes, which may be attributed to the parallel-branch design in the multi-axis gMLP block. However, we have not observed a clear trend on how these filters at different stages vary.

A.8 Limitations and Discussions

One potential limitation of our model, which it shares with existing SOTA methods, is its relatively inadequate generalization to real-world examples. This can perhaps be attributed to the training examples provided by the existing synthesized image restoration benchmarks. Creating more realistic, large-scale datasets through data-generation schemes [82, 94] could mitigate this shortcoming. We also observed that our model tends to slightly overfit certain benchmarks, because we did not apply strong regularization (e.g., dropout) during training. Although we found that regularization may slightly reduce the performance of our models on the benchmarks we evaluated, it is worth exploring in the future as a way to improve the generalization of our restoration models.

It is worth mentioning that our model is able to generate high quality sharp images, which are visually comparable to the state-of-the-art generative models [123, 36]. Notably, our model produces more conservative results without hallucinating many nonexistent details, delivering more reliable results than generative models.

Figure 11: Spatial projection weights in block gMLP and grid gMLP layers of the MAXIM-3S model trained on GoPro [62]. Each row shows the filters (reshaped into 2D) for a reduced set of consecutive channels. The filter sizes for Encoder depth 1 and 2 are 16×16, while for Encoder depth 3 and Bottleneck1 are 8×8 (resized to the same shape for better visualization). It is worth noting that the weights of block gMLP layers (left) are directly applied on pixels within local windows and shared at each non-overlapping window of the feature maps (similar to strided convolution), while the weights of grid gMLP layers (right) correspond to a global, dilated aggregation overlaid on the entire image.
Figure 12: Visual examples for image denoising on SIDD [2] among VDN [107], DANet [108], MIRNet [110], CycleISP [109], MPRNet [111], and the proposed MAXIM-3S. Our model clearly removed real noise while recovering more details.
Figure 13: Visual examples for image deblurring on GoPro [62] among DMPHN [113], Suin et al. [83], MPRNet [111], HINet [16], MIMO-UNet [20], and our MAXIM-3S.
Figure 14: Visual comparisons for image deblurring on HIDE [81] among DMPHN [113], Suin et al. [83], MPRNet [111], HINet [16], MIMO-UNet [20], and our MAXIM-3S.
Figure 15: Visual comparisons for image deblurring on RealBlur-J [79] between previous best model MPRNet [111] and MAXIM-3S.
Figure 16: Visual comparisons for image deblurring on REDS [63] between our model and the winning solution, HINet [16], for the REDS dataset of the NTIRE 2021 Image Deblurring Challenge Track 2 JPEG artifacts [63].
Figure 17: Visual examples for image deraining on Rain100L [105] among RESCAN [49], PreNet [77], MSPFN [34], MPRNet [111], HINet [16], and our MAXIM-2S model.
Figure 18: Visual examples for image deraining on Rain100H [105]. Under extremely heavy rain, our model recovers more details and textures compared to previous competitive methods.
Figure 19: Visual examples for image deraining on Test100 [115]. Our model removes both rain streaks and visible JPEG artifacts.
Figure 20: Visual examples for image deraining on Test1200 [114].
Figure 21: Visual comparisons for raindrop removal on Raindrop-A [73] among AGAN [73], DuRN [56], Quan [75], and MAXIM-2S.
Figure 22: Visual comparisons for raindrop removal on Raindrop testset B [73].
Figure 23: Visual comparisons for image dehazing on SOTS indoor testset [46] among GCANet [14], GridDehaze [55], DuRN [56], MSBDN [23], FFA-Net [74], and our MAXIM-2S.
Figure 24: Visual comparisons for image dehazing on SOTS outdoor testset [46] of MAXIM-2S against other approaches.
Figure 25: Visual comparisons for image retouching on MIT-Adobe FiveK [8] provided by the authors of [65] among CycleGAN [124], Exposure [32], DPE [19], EnlightenGAN [37], UEGAN [65] and MAXIM-2S.
Figure 26: Visual examples for low-light image enhancement on the LOL dataset [98] among Retinex [98], GLAD [92], KinD [122], EnlightenGAN [37], MIRNet [110], and MAXIM-2S. Our model effectively enhances lighting while largely reducing noise, producing higher-quality images compared to other approaches.
Algorithm 1 JAX code implementing the Multi-Axis Gated MLP Block (MAB).
from typing import Sequence

import einops
import flax.linen as nn
import jax.numpy as jnp


def block_images(x, patch_size):
  """Splits (n, h, w, c) images into (n, num_patches, pixels_per_patch, c)."""
  n, h, w, channels = x.shape
  grid_height, grid_width = h // patch_size[0], w // patch_size[1]
  x = einops.rearrange(
      x, "n (gh fh) (gw fw) c -> n (gh gw) (fh fw) c",
      gh=grid_height, gw=grid_width, fh=patch_size[0], fw=patch_size[1])
  return x


def unblock_images(x, grid_size, patch_size):
  """Inverse of block_images: restores the (n, h, w, c) layout."""
  x = einops.rearrange(
      x, "n (gh gw) (fh fw) c -> n (gh fh) (gw fw) c",
      gh=grid_size[0], gw=grid_size[1], fh=patch_size[0], fw=patch_size[1])
  return x


class SpatialGatingUnit(nn.Module):
  """Gated MLP applied on a specified axis: -3 for grid and -2 for block."""

  @nn.compact
  def __call__(self, x, axis=-3):
    u, v = jnp.split(x, 2, axis=-1)
    v = nn.LayerNorm()(v)
    n = x.shape[axis]  # get spatial dim at the 'grid' or 'block' axis
    # Move the spatial axis last so that Dense mixes along it, then move back.
    v = jnp.swapaxes(v, -1, axis)
    v = nn.Dense(n)(v)
    v = jnp.swapaxes(v, -1, axis)
    return u * (v + 1.)


class SpatialGmlpLayer(nn.Module):
  """gMLP layer applied on a specified axis: -3 for grid and -2 for block."""
  grid_size: Sequence[int]
  block_size: Sequence[int]

  @nn.compact
  def __call__(self, x, axis=-3):
    n, h, w, num_channels = x.shape
    if axis == -3:  # for grid gMLP layer
      gh, gw = self.grid_size
      fh, fw = h // gh, w // gw
    elif axis == -2:  # for block gMLP layer
      fh, fw = self.block_size
      gh, gw = h // fh, w // fw
    else:
      raise ValueError("axis must be -3 (grid) or -2 (block)")
    x = block_images(x, patch_size=(fh, fw))
    y = nn.LayerNorm()(x)
    y = nn.Dense(num_channels * 2)(y)
    y = nn.gelu(y)
    y = SpatialGatingUnit()(y, axis=axis)
    y = nn.Dense(num_channels)(y)
    x = x + y
    x = unblock_images(x, grid_size=(gh, gw), patch_size=(fh, fw))
    return x


class MultiAxisGmlpBlock(nn.Module):
  """Multi-axis gated MLP block with parallel grid and block branches."""
  block_size: Sequence[int]
  grid_size: Sequence[int]

  @nn.compact
  def __call__(self, x):
    shortcut = x
    n, h, w, num_channels = x.shape
    x = nn.LayerNorm()(x)
    x = nn.Dense(num_channels * 2)(x)
    x = nn.gelu(x)
    # Split into two heads, then apply grid gMLP and block gMLP respectively.
    u, v = jnp.split(x, 2, axis=-1)
    # Both size fields are required by SpatialGmlpLayer, so pass both;
    # each branch only reads the one matching its axis.
    u = SpatialGmlpLayer(
        grid_size=self.grid_size, block_size=self.block_size)(u, axis=-3)
    v = SpatialGmlpLayer(
        grid_size=self.grid_size, block_size=self.block_size)(v, axis=-2)
    # Concat and output projection
    x = jnp.concatenate([u, v], axis=-1)
    x = nn.Dense(num_channels)(x)
    x = x + shortcut
    return x

References

  • [1] Darmstadt noise dataset. https://noise.visinf.tu-darmstadt.de/benchmark, 2017. Accessed: 2021-10-30.
  • [2] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In CVPR, pages 1692–1700, 2018.
  • [3] Saeed Anwar and Nick Barnes. Real image denoising with feature attention. In CVPR, pages 3155–3164, 2019.
  • [4] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691, 2021.
  • [5] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [6] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.
  • [7] Harold C Burger, Christian J Schuler, and Stefan Harmeling. Image denoising: Can plain neural networks compete with bm3d? In CVPR, pages 2392–2399. IEEE, 2012.
  • [8] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In CVPR, pages 97–104. IEEE, 2011.
  • [9] Bolun Cai, Xiangmin Xu, Kui Jia, Chunmei Qing, and Dacheng Tao. Dehazenet: An end-to-end system for single image haze removal. IEEE TIP, 25(11):5187–5198, 2016.
  • [10] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
  • [11] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
  • [12] Meng Chang, Qi Li, Huajun Feng, and Zhihai Xu. Spatial-adaptive network for single image denoising. In ECCV, pages 171–187. Springer, 2020.
  • [13] Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. arXiv preprint arXiv:2103.14899, 2021.
  • [14] Dongdong Chen, Mingming He, Qingnan Fan, Jing Liao, Liheng Zhang, Dongdong Hou, Lu Yuan, and Gang Hua. Gated context aggregation network for image dehazing and deraining. In WACV, pages 1375–1383. IEEE, 2019.
  • [15] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In CVPR, pages 12299–12310, 2021.
  • [16] Liangyu Chen, Xin Lu, Jie Zhang, Xiaojie Chu, and Chengpeng Chen. Hinet: Half instance normalization network for image restoration. In CVPRW, pages 182–192, 2021.
  • [17] Li-Heng Chen, Christos G Bampis, Zhi Li, Andrey Norkin, and Alan C Bovik. Proxiqa: A proxy approach to perceptual optimization of learned image compression. IEEE TIP, 30:360–373, 2020.
  • [18] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. In CVPR, pages 7103–7112, 2018.
  • [19] Yu-Sheng Chen, Yu-Ching Wang, Man-Hsin Kao, and Yung-Yu Chuang. Deep photo enhancer: Unpaired learning for image enhancement from photographs with gans. In CVPR, pages 6306–6314, 2018.
  • [20] Sung-Jin Cho, Seo-Won Ji, Jun-Pyo Hong, Seung-Won Jung, and Sung-Jea Ko. Rethinking coarse-to-fine approach in single image deblurring. In ICCV, pages 4641–4650, 2021.
  • [21] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE TIP, 16(8):2080–2095, 2007.
  • [22] Mauricio Delbracio, Hossein Talebi, and Peyman Milanfar. Projected distribution loss for image enhancement. ICCP, 2021.
  • [23] Hang Dong, Jinshan Pan, Lei Xiang, Zhe Hu, Xinyi Zhang, Fei Wang, and Ming-Hsuan Yang. Multi-scale boosted dehazing network with dense feature fusion. In CVPR, pages 2157–2167, 2020.
  • [24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [25] Yazan Abu Farha and Jurgen Gall. Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In CVPR, pages 3575–3584, 2019.
  • [26] Xueyang Fu, Jiabin Huang, Xinghao Ding, Yinghao Liao, and John Paisley. Clearing the skies: A deep network architecture for single-image rain removal. IEEE TIP, 26(6):2944–2956, 2017.
  • [27] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In CVPR, pages 3855–3863, 2017.
  • [28] Hongyun Gao, Xin Tao, Xiaoyong Shen, and Jiaya Jia. Dynamic scene deblurring with parameter selective sharing and nested skip connections. In CVPR, pages 3848–3856, 2019.
  • [29] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. In CVPR, pages 1712–1722, 2019.
  • [30] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • [31] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
  • [32] Yuanming Hu, Hao He, Chenxi Xu, Baoyuan Wang, and Stephen Lin. Exposure: A white-box photo post-processing framework. ACM TOG, 37(2):1–17, 2018.
  • [33] Zhe Hu, Sunghyun Cho, Jue Wang, and Ming-Hsuan Yang. Deblurring low-light images with light streaks. In CVPR, pages 3382–3389, 2014.
  • [34] Kui Jiang, Zhongyuan Wang, Peng Yi, Chen Chen, Baojin Huang, Yimin Luo, Jiayi Ma, and Junjun Jiang. Multi-scale progressive fusion network for single image deraining. In CVPR, pages 8346–8355, 2020.
  • [35] Liming Jiang, Bo Dai, Wayne Wu, and Chen Change Loy. Focal frequency loss for image reconstruction and synthesis. In ICCV, pages 13919–13929, 2021.
  • [36] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Two transformers can make one strong gan. arXiv preprint arXiv:2102.07074, 1(3), 2021.
  • [37] Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. Enlightengan: Deep light enhancement without paired supervision. IEEE TIP, 30:2340–2349, 2021.
  • [38] Yoonsik Kim, Jae Woong Soh, Gu Yong Park, and Nam Ik Cho. Transfer learning from synthetic to real-noise denoising with adaptive instance normalization. In CVPR, pages 3482–3492, 2020.
  • [39] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [40] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks. In CVPR, pages 8183–8192, 2018.
  • [41] Orest Kupyn, Tetiana Martyniuk, Junru Wu, and Zhangyang Wang. Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better. In ICCV, pages 8878–8887, 2019.
  • [42] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In CVPR, pages 624–632, 2017.
  • [43] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
  • [44] James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, and Santiago Ontanon. Fnet: Mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824, 2021.
  • [45] Boyi Li, Xiulian Peng, Zhangyang Wang, Jizheng Xu, and Dan Feng. Aod-net: All-in-one dehazing network. In ICCV, pages 4770–4778, 2017.
  • [46] Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single-image dehazing and beyond. IEEE TIP, 28(1):492–505, 2019.
  • [47] Shi-Jie Li, Yazan AbuFarha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. Ms-tcn++: Multi-stage temporal convolutional network for action segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2020.
  • [48] Wenbo Li, Zhicheng Wang, Binyi Yin, Qixiang Peng, Yuming Du, Tianzi Xiao, Gang Yu, Hongtao Lu, Yichen Wei, and Jian Sun. Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148, 2019.
  • [49] Xia Li, Jianlong Wu, Zhouchen Lin, Hong Liu, and Hongbin Zha. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In ECCV, pages 254–269, 2018.
  • [50] Yinxiao Li, Pengchong Jin, Feng Yang, Ce Liu, Ming-Hsuan Yang, and Peyman Milanfar. Comisr: Compression-informed video super-resolution. In ICCV, 2021.
  • [51] Yu Li, Robby T Tan, Xiaojie Guo, Jiangbo Lu, and Michael S Brown. Rain streak removal using layer priors. In CVPR, pages 2736–2744, 2016.
  • [52] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In IEEE Int. Conf. Comput. Vis. Worksh., 2021.
  • [53] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. arXiv preprint arXiv:1806.02919, 2018.
  • [54] Hanxiao Liu, Zihang Dai, David R So, and Quoc V Le. Pay attention to MLPs. arXiv preprint arXiv:2105.08050, 2021.
  • [55] Xiaohong Liu, Yongrui Ma, Zhihao Shi, and Jun Chen. GridDehazeNet: Attention-based multi-scale network for image dehazing. In ICCV, pages 7314–7323, 2019.
  • [56] Xing Liu, Masanori Suganuma, Zhun Sun, and Takayuki Okatani. Dual residual networks leveraging the potential of paired operations for image restoration. In CVPR, pages 7007–7016, 2019.
  • [57] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  • [58] Zheng Liu, Botao Xiao, Muhammad Alrabeiah, Keyan Wang, and Jun Chen. Single image dehazing with a generic model-agnostic convolutional neural network. IEEE Signal Processing Letters, 26(6):833–837, 2019.
  • [59] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [60] Erik Matlin and Peyman Milanfar. Removal of haze and noise from a single image. In Computational Imaging X, volume 8296, page 82960T. International Society for Optics and Photonics, 2012.
  • [61] Zibo Meng, Runsheng Xu, and Chiu Man Ho. GIA-Net: Global information aware network for low-light imaging. In ECCV, pages 327–342. Springer, 2020.
  • [62] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, pages 3883–3891, 2017.
  • [63] Seungjun Nah, Sanghyun Son, Suyoung Lee, Radu Timofte, and Kyoung Mu Lee. NTIRE 2021 challenge on image deblurring. In CVPR Workshops, pages 149–165, 2021.
  • [64] Seungjun Nah, Sanghyun Son, Suyoung Lee, Radu Timofte, and Kyoung Mu Lee. NTIRE 2021 challenge on image deblurring. In CVPR Workshops, pages 149–165, 2021.
  • [65] Zhangkai Ni, Wenhan Yang, Shiqi Wang, Lin Ma, and Sam Kwong. Towards unsupervised deep image enhancement with generative adversarial network. IEEE TIP, 29:9140–9151, 2020.
  • [66] Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. Single image super-resolution via a holistic attention network. In ECCV, pages 191–207. Springer, 2020.
  • [67] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
  • [68] Jinshan Pan, Deqing Sun, Hanspeter Pfister, and Ming-Hsuan Yang. Blind image deblurring using dark channel prior. In CVPR, pages 1628–1636, 2016.
  • [69] Dongwon Park, Dong Un Kang, Jisoo Kim, and Se Young Chun. Multi-temporal recurrent neural networks for progressive non-uniform single image deblurring with incremental temporal training. In ECCV, pages 327–343. Springer, 2020.
  • [70] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, volume 32, 2018.
  • [71] Olivier Petit, Nicolas Thome, Clement Rambour, Loic Themyr, Toby Collins, and Luc Soler. U-Net transformer: Self and cross attention for medical image segmentation. In International Workshop on Machine Learning in Medical Imaging, pages 267–276. Springer, 2021.
  • [72] Tobias Plotz and Stefan Roth. Benchmarking denoising algorithms with real photographs. In CVPR, pages 1586–1595, 2017.
  • [73] Rui Qian, Robby T Tan, Wenhan Yang, Jiajun Su, and Jiaying Liu. Attentive generative adversarial network for raindrop removal from a single image. In CVPR, pages 2482–2491, 2018.
  • [74] Xu Qin, Zhilin Wang, Yuanchao Bai, Xiaodong Xie, and Huizhu Jia. FFA-Net: Feature fusion attention network for single image dehazing. In AAAI, volume 34, pages 11908–11915, 2020.
  • [75] Yuhui Quan, Shijie Deng, Yixin Chen, and Hui Ji. Deep learning for seeing through window with raindrops. In ICCV, pages 2463–2471, 2019.
  • [76] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. arXiv preprint arXiv:2107.00645, 2021.
  • [77] Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, and Deyu Meng. Progressive image deraining networks: A better and simpler baseline. In CVPR, pages 3937–3946, 2019.
  • [78] Wenqi Ren, Lin Ma, Jiawei Zhang, Jinshan Pan, Xiaochun Cao, Wei Liu, and Ming-Hsuan Yang. Gated fusion network for single image dehazing. In CVPR, 2018.
  • [79] Jaesung Rim, Haeyun Lee, Jucheol Won, and Sunghyun Cho. Real-world blur dataset for learning and benchmarking deblurring algorithms. In ECCV, pages 184–201. Springer, 2020.
  • [80] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  • [81] Ziyi Shen, Wenguan Wang, Xiankai Lu, Jianbing Shen, Haibin Ling, Tingfa Xu, and Ling Shao. Human-aware motion deblurring. In ICCV, pages 5572–5581, 2019.
  • [82] Sanghyun Son, Jaeha Kim, Wei-Sheng Lai, Ming-Hsuan Yang, and Kyoung Mu Lee. Toward real-world super-resolution via adaptive downsampling models. IEEE TPAMI, 2021.
  • [83] Maitreya Suin, Kuldeep Purohit, and AN Rajagopalan. Spatially-attentive patch-hierarchical network for adaptive motion deblurring. In CVPR, pages 3606–3615, 2020.
  • [84] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, pages 5693–5703, 2019.
  • [85] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. TransTrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
  • [86] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In CVPR, pages 8174–8182, 2018.
  • [87] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. MLP-Mixer: An all-MLP architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
  • [88] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. ResMLP: Feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404, 2021.
  • [89] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. In CVPR, pages 12894–12904, 2021.
  • [90] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [91] Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang Zhang, and Rynson WH Lau. Spatial attentive single-image deraining with a high quality real rain dataset. In CVPR, pages 12270–12279, 2019.
  • [92] Wenjing Wang, Chen Wei, Wenhan Yang, and Jiaying Liu. GLADNet: Low-light enhancement network with global awareness. In IEEE International Conference on Automatic Face & Gesture Recognition (FG), pages 751–755, 2018.
  • [93] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.
  • [94] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In ICCV Workshops, pages 1905–1914, 2021.
  • [95] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In ECCV Workshops, 2018.
  • [96] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE TIP, 13:600–612, 2004.
  • [97] Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. Uformer: A general U-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106, 2021.
  • [98] Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. Deep Retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018.
  • [99] Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, and Ying Wu. Semi-supervised transfer learning for image rain removal. In CVPR, pages 3877–3886, 2019.
  • [100] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. CBAM: Convolutional block attention module. In ECCV, pages 3–19, 2018.
  • [101] Haiyan Wu, Yanyun Qu, Shaohui Lin, Jian Zhou, Ruizhi Qiao, Zhizhong Zhang, Yuan Xie, and Lizhuang Ma. Contrastive learning for compact single image dehazing. In CVPR, pages 10551–10560, 2021.
  • [102] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
  • [103] Li Xu, Shicheng Zheng, and Jiaya Jia. Unnatural L0 sparse representation for natural image deblurring. In CVPR, pages 1107–1114, 2013.
  • [104] Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming-Hsuan Yang, and Jiaqi Ma. V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer. arXiv preprint arXiv:2203.10638, 2022.
  • [105] Wenhan Yang, Robby T Tan, Jiashi Feng, Jiaying Liu, Zongming Guo, and Shuicheng Yan. Deep joint rain detection and removal from a single image. In CVPR, pages 1357–1366, 2017.
  • [106] Rajeev Yasarla and Vishal M Patel. Uncertainty guided multi-scale residual learning-using a cycle spinning CNN for single image de-raining. In CVPR, pages 8405–8414, 2019.
  • [107] Zongsheng Yue, Hongwei Yong, Qian Zhao, Lei Zhang, and Deyu Meng. Variational denoising network: Toward blind noise modeling and removal. arXiv preprint arXiv:1908.11314, 2019.
  • [108] Zongsheng Yue, Qian Zhao, Lei Zhang, and Deyu Meng. Dual adversarial network: Toward real-world noise removal and noise generation. In ECCV, pages 41–58. Springer, 2020.
  • [109] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. CycleISP: Real image restoration via improved data synthesis. In CVPR, pages 2696–2705, 2020.
  • [110] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Learning enriched features for real image restoration and enhancement. In ECCV, pages 492–511. Springer, 2020.
  • [111] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In CVPR, pages 14821–14831, 2021.
  • [112] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • [113] Hongguang Zhang, Yuchao Dai, Hongdong Li, and Piotr Koniusz. Deep stacked hierarchical multi-patch network for image deblurring. In CVPR, pages 5978–5986, 2019.
  • [114] He Zhang and Vishal M Patel. Density-aware single image de-raining using a multi-stream dense network. In CVPR, pages 695–704, 2018.
  • [115] He Zhang, Vishwanath Sindagi, and Vishal M Patel. Image de-raining using a conditional generative adversarial network. IEEE TCSVT, 30(11):3943–3956, 2019.
  • [116] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, pages 5907–5915, 2017.
  • [117] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE TPAMI, 41(8):1947–1962, 2018.
  • [118] Jiawei Zhang, Jinshan Pan, Jimmy Ren, Yibing Song, Linchao Bao, Rynson WH Lau, and Ming-Hsuan Yang. Dynamic scene deblurring using spatially variant recurrent neural networks. In CVPR, pages 2521–2529, 2018.
  • [119] Kaihao Zhang, Wenhan Luo, Yiran Zhong, Lin Ma, Bjorn Stenger, Wei Liu, and Hongdong Li. Deblurring by realistic blurring. In CVPR, pages 2737–2746, 2020.
  • [120] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE TIP, 26(7):3142–3155, 2017.
  • [121] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, pages 2472–2481, 2018.
  • [122] Yonghua Zhang, Jiawan Zhang, and Xiaojie Guo. Kindling the darkness: A practical low-light image enhancer. In ACM MM, pages 1632–1640, 2019.
  • [123] Long Zhao, Zizhao Zhang, Ting Chen, Dimitris N Metaxas, and Han Zhang. Improved transformer for high-resolution GANs. arXiv preprint arXiv:2106.07631, 2021.
  • [124] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, pages 2223–2232, 2017.
  • [125] Xiang Zhu, Filip Šroubek, and Peyman Milanfar. Deconvolving PSFs for a better motion deblurring using multiple images. In ECCV, pages 636–647. Springer, 2012.