[NeurIPS 2019]
Code in PyTorch: https://github.com/nibuiro/CondConv-pytorch
We propose conditionally parameterized convolutions (CondConv), which challenge the paradigm of static convolutional kernels by computing convolutional kernels as a function of the input.
Conditional computation.
In conditional computation models, capacity is increased without a proportional increase in computation cost by activating only a portion of the entire network for each example [3, 8, 5, 2].
BlockDrop [47] and SkipNet [45] use reinforcement learning to learn the subset of blocks needed to process a given input.
Multi-branch convolutional networks. Multi-branch architectures like Inception [40] and ResNeXt [48] have shown success on a variety of computer vision tasks. In these architectures, a layer consists of multiple convolutional branches, which are aggregated to compute the final output. A CondConv layer is mathematically equivalent to a multi-branch convolutional layer where each branch is a single convolution and outputs are aggregated by a weighted sum, but only requires the computation of one convolution.
Specifically, we parameterize the convolutional kernels in CondConv by:
$\mathrm{Output}(x) = \sigma\big((\alpha_1 \cdot W_1 + \cdots + \alpha_n \cdot W_n) * x\big)$
where each $\alpha_i = r_i(x)$ is an example-dependent scalar weight computed by a learned routing function, $n$ is the number of experts, and $\sigma$ is an activation function.
When we adapt a convolutional layer to use CondConv, each kernel $W_i$ has the same dimensions as the kernel in the original convolution.
A CondConv layer is mathematically equivalent to a more expensive linear mixture of experts formulation, where each expert corresponds to a static convolution (Fig 1b):
$\sigma\big((\alpha_1 \cdot W_1 + \cdots + \alpha_n \cdot W_n) * x\big) = \sigma\big(\alpha_1 \cdot (W_1 * x) + \cdots + \alpha_n \cdot (W_n * x)\big)$
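This equivalence is easy to sanity-check numerically. The sketch below (shapes and variable names are illustrative, not from the paper) compares combine-then-convolve against convolve-then-combine before the activation:

```python
import torch
import torch.nn.functional as F

# Arbitrary example shapes: n = 4 expert kernels, each shaped like an
# ordinary 8 -> 16 channel 3x3 convolution kernel.
experts = torch.randn(4, 16, 8, 3, 3)
alpha = torch.rand(4)                 # example-dependent routing weights
x = torch.randn(1, 8, 32, 32)         # a single input example

# Left-hand side: combine the experts into one kernel, run a single convolution.
mixed_kernel = (alpha.view(-1, 1, 1, 1, 1) * experts).sum(dim=0)
lhs = F.conv2d(x, mixed_kernel, padding=1)

# Right-hand side: run each expert convolution, then take the weighted sum.
rhs = sum(a * F.conv2d(x, w, padding=1) for a, w in zip(alpha, experts))

print(torch.allclose(lhs, rhs, atol=1e-4))  # True, up to floating-point error
```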
We compute the example-dependent routing weights $\alpha_i = r_i(x)$ from the layer input in three steps: global average pooling, a fully-connected layer, and a Sigmoid activation.
$r(x) = \mathrm{Sigmoid}(\mathrm{GlobalAveragePool}(x)\, R)$
where R is a matrix of learned routing weights mapping the pooled inputs to n expert weights. A normal convolution operates only over local receptive fields, so our routing function allows adaptation of local operations using global context.
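Putting the routing function together with the kernel combination, a minimal CondConv layer might look like the sketch below. The class name, argument names, and the naive per-example loop are our own simplifications, not the paper's or the linked repository's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Sketch of a CondConv layer: per-example weighted sum of n expert kernels."""
    def __init__(self, in_channels, out_channels, kernel_size,
                 num_experts=4, padding=0):
        super().__init__()
        self.padding = padding
        # n expert kernels, each with the dimensions of an ordinary conv kernel.
        self.experts = nn.Parameter(
            0.01 * torch.randn(num_experts, out_channels, in_channels,
                               kernel_size, kernel_size))
        # R in r(x) = Sigmoid(GlobalAveragePool(x) R): pooled input -> n weights.
        self.routing = nn.Linear(in_channels, num_experts, bias=False)

    def forward(self, x):                              # x: (B, in_channels, H, W)
        pooled = x.mean(dim=(2, 3))                    # GlobalAveragePool -> (B, in_channels)
        alphas = torch.sigmoid(self.routing(pooled))   # (B, num_experts)
        outputs = []
        for i in range(x.size(0)):
            # Combine the expert kernels with this example's routing weights,
            # then apply a single convolution.
            w = (alphas[i].view(-1, 1, 1, 1, 1) * self.experts).sum(dim=0)
            outputs.append(F.conv2d(x[i:i + 1], w, padding=self.padding))
        return torch.cat(outputs, dim=0)
```

Batched implementations typically avoid this per-example loop by folding the batch into the channel dimension and using a grouped convolution; the loop is kept here only to mirror the equations directly.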
The same approach can easily be extended to other linear functions like those in depth-wise convolutions and fully-connected layers.
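As an illustration of the fully-connected case, the same per-example mixture can be applied to n weight matrices (again a hypothetical sketch with our own names; here the routing is computed directly from the input vector):

```python
import torch
import torch.nn as nn

class CondLinear(nn.Module):
    """Sketch: per-example weighted sum of n expert weight matrices."""
    def __init__(self, in_features, out_features, num_experts=4):
        super().__init__()
        self.experts = nn.Parameter(
            0.01 * torch.randn(num_experts, out_features, in_features))
        self.routing = nn.Linear(in_features, num_experts, bias=False)

    def forward(self, x):                              # x: (B, in_features)
        alphas = torch.sigmoid(self.routing(x))        # (B, num_experts)
        # Per-example mixed weight matrix: (B, out_features, in_features)
        w = torch.einsum('be,eoi->boi', alphas, self.experts)
        return torch.einsum('boi,bi->bo', w, x)        # per-example matrix-vector product
```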