BERT

朱琪 · December 24, 2020

BERT

Approaches to pre-training general language representations:

  1. Unsupervised Feature-based Approaches

    word embedding, sentence embedding, paragraph embedding

  2. Unsupervised Fine-tuning Approaches

    auto-encoder

  3. Transfer Learning from Supervised Data

We note that in the literature the bidirectional Transformer is often referred to as a “Transformer encoder” while the left-context-only version is referred to as a “Transformer decoder” since it can be used for text generation.

We refer to this procedure as a “masked LM” (MLM), although it is often referred to as a Cloze task in the literature.

mask 15% of the tokens at random

If the i-th token is chosen, we replace the i-th token with

(1) the [MASK] token 80% of the time

(2) a random token 10% of the time

(3) the unchanged i-th token 10% of the time
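A minimal sketch of this 80/10/10 corruption scheme (plain Python over integer token IDs; `mask_token_id`, `vocab_size`, and the `-100` ignore-label convention are my assumptions, not from the paper):

```python
import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Apply the BERT MLM corruption: select ~15% of positions, then
    80% [MASK], 10% random token, 10% left unchanged."""
    inputs = list(token_ids)
    labels = [-100] * len(inputs)            # -100: position is not predicted
    for i in range(len(inputs)):
        if random.random() < mask_prob:      # this position will be predicted
            labels[i] = inputs[i]            # the model must recover the original token
            r = random.random()
            if r < 0.8:                      # 80% of the time: [MASK]
                inputs[i] = mask_token_id
            elif r < 0.9:                    # 10% of the time: a random token
                inputs[i] = random.randrange(vocab_size)
            # remaining 10%: keep the unchanged i-th token
    return inputs, labels
```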

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Improving Language Understanding by Generative Pre-Training

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

VisualBERT: A Simple and Performant Baseline for Vision and Language

(1) part of the text is masked and the model learns to predict the masked words based on the remaining text and visual context;

(2) the model is trained to determine whether the provided text matches the image.
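A compact sketch of the second objective (image-text matching as binary classification over a pooled joint representation; the module and tensor names are illustrative, not the paper's code):

```python
import torch.nn as nn
import torch.nn.functional as F

class ImageTextMatchingHead(nn.Module):
    """Binary classifier: does the provided caption match the image?"""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, pooled_output, is_matched):
        # pooled_output: (B, hidden) joint text+image representation
        # is_matched:    (B,) 1 for the true caption, 0 for a mismatched one
        return F.cross_entropy(self.classifier(pooled_output), is_matched)
```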

These approaches often consist of a text encoder, an image feature extractor, a multi-modal fusion module (typically with attention), and an answer classifier.
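A schematic of that pipeline in PyTorch (the GRU text encoder, the single attention hop over detector regions, and all dimensions are assumptions for illustration):

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Text encoder + image region features + attention-based fusion + answer classifier."""
    def __init__(self, vocab_size, emb_dim=300, hidden=512, img_dim=2048, num_answers=3000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.text_enc = nn.GRU(emb_dim, hidden, batch_first=True)  # text encoder
        self.att = nn.Linear(hidden + img_dim, 1)                  # attention over image regions
        self.fuse = nn.Linear(hidden + img_dim, hidden)            # multi-modal fusion
        self.classifier = nn.Linear(hidden, num_answers)           # answer classifier

    def forward(self, question_ids, region_feats):
        # question_ids: (B, T) token IDs; region_feats: (B, R, img_dim) from an off-the-shelf detector
        _, q = self.text_enc(self.embed(question_ids))
        q = q.squeeze(0)                                           # (B, hidden) question summary
        q_tiled = q.unsqueeze(1).expand(-1, region_feats.size(1), -1)
        scores = self.att(torch.cat([q_tiled, region_feats], dim=-1))
        att = torch.softmax(scores, dim=1)                         # (B, R, 1) region weights
        v = (att * region_feats).sum(dim=1)                        # attended image feature
        joint = torch.relu(self.fuse(torch.cat([q, v], dim=-1)))
        return self.classifier(joint)                              # answer logits
```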

In VisualBERT, the self-attention mechanism allows the model to capture the implicit relations between objects.

VideoBERT (Sun et al., 2019) transforms a video into spoken words paired with a series of images and applies a Transformer to learn joint representations. Their model architecture is similar to ours. However, VideoBERT is evaluated on captioning for cooking videos, while we conduct comprehensive analysis on a variety of vision-and-language tasks. Concurrently with our work, ViLBERT (Lu et al., 2019) proposes to learn joint representations of images and text using a BERT-like architecture but has separate Transformers for vision and language that can only attend to each other (resulting in twice the parameters).

Different visual representations and pre-training resources are used.

VisualBERT input embeddings (sum of three components):

  1. feature embedding
  2. segment embedding
  3. position embedding
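Each visual token's input to the Transformer is the sum of these three embeddings; a minimal sketch (the hidden size, projection layer, and ID layouts are assumptions):

```python
import torch.nn as nn

class VisualTokenEmbedding(nn.Module):
    """Sum of feature, segment, and position embeddings for image regions."""
    def __init__(self, img_dim=2048, hidden=768, max_positions=512, num_segments=2):
        super().__init__()
        self.feature_proj = nn.Linear(img_dim, hidden)            # 1. feature embedding (projected detector feature)
        self.segment_emb = nn.Embedding(num_segments, hidden)     # 2. segment embedding (text vs. image)
        self.position_emb = nn.Embedding(max_positions, hidden)   # 3. position embedding

    def forward(self, region_feats, segment_ids, position_ids):
        # region_feats: (B, R, img_dim); segment_ids, position_ids: (B, R)
        return (self.feature_proj(region_feats)
                + self.segment_emb(segment_ids)
                + self.position_emb(position_ids))
```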

Training VisualBERT:

  1. Task-Agnostic Pre-Training

  2. Task-Specific Pre-Training

  3. Fine-Tuning

For future work, we are curious about whether we could extend VisualBERT to image-only tasks, such as scene graph parsing and situation recognition. Pre-training VisualBERT on larger caption datasets such as Visual Genome and Conceptual Captions is also a valid direction.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

self-supervised learning, which has captured rich semantic and structural information from large, unlabelled data sources by training models to perform so-called ‘proxy’ tasks

While work within the vision community has shown increasing promise [21–23], the greatest impact of self-supervised learning so far is through language models like ELMo [13], BERT [12], and GPT [14] which have set new high-water marks on many NLP tasks.

Conceptual Captions [24] dataset

Two pre-training proxy tasks on Conceptual Captions:

  • predicting the semantics of masked words and image regions given the unmasked inputs

  • predicting whether an image and text segment correspond
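For the masked-region half of the first task, ViLBERT predicts a distribution over the detector's object classes and minimizes the KL divergence to the detector's own output distribution; a rough sketch (the hidden size and class count are placeholders):

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskedRegionHead(nn.Module):
    """Predict a class distribution for each masked image region and match it
    to the soft labels produced by the object detector (KL divergence)."""
    def __init__(self, hidden=1024, num_classes=1601):  # placeholder sizes
        super().__init__()
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, masked_region_hidden, detector_probs):
        # masked_region_hidden: (N, hidden) final states of the masked regions
        # detector_probs:       (N, num_classes) soft target distribution from the detector
        log_pred = F.log_softmax(self.classifier(masked_region_hidden), dim=-1)
        return F.kl_div(log_pred, detector_probs, reduction="batchmean")
```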

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

The alt attribute is the HTML attribute used in HTML and XHTML documents to specify alternative text (alt text) that is to be rendered when the element to which it is applied cannot be rendered.

[Figures: pipeline, derivation, model]

https://github.com/google-research/bert

Pre-trained representations can also either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single “word embedding” representation for each word in the vocabulary, so bank would have the same representation in bank deposit and river bank. Contextual models instead generate a representation of each word that is based on the other words in the sentence.
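The contrast is easy to check with the Hugging Face transformers library (a quick sketch, not from the BERT repo itself): the contextual vector for bank differs between the two sentences.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    idx = inputs["input_ids"][0].tolist().index(bank_id)
    return hidden[idx]

v1 = bank_vector("he made a bank deposit")
v2 = bank_vector("she sat on the river bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # noticeably below 1.0: same word, different vectors
```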

Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers. Cased means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).
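A small illustrative check of the two tokenizers (outputs shown as expectations, not copied from the repo):

```python
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-cased")

print(uncased.tokenize("John Smith"))  # lowercased, e.g. ['john', 'smith']
print(cased.tokenize("John Smith"))    # true case preserved, e.g. ['John', 'Smith']
```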

manually download models:

https://github.com/huggingface/transformers/issues/856
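If the automatic download is not possible, the approach in the issue above boils down to fetching the files by hand (config, vocab, and weights) and pointing from_pretrained at a local directory; the path below is only a placeholder:

```python
from transformers import BertModel, BertTokenizer

# Directory containing the manually downloaded config.json, vocab.txt,
# and model weights (e.g. pytorch_model.bin) -- placeholder path.
local_dir = "./models/bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(local_dir)
model = BertModel.from_pretrained(local_dir)
```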

Bilinear

Beyond Bilinear: Generalized Multimodal Factorized High-order Pooling for Visual Question Answering


Cited as the reference for the language self-attention used in LoRRA.

Multi-modal feature fusion: the simplest schemes are linear (e.g., concatenation or element-wise addition); bilinear methods capture richer pairwise interactions:

  1. bilinear pooling
  2. MCB (Multi-modal Compact Bilinear), MLB (Multi-modal Low-rank Bilinear)
  3. MFB (Multi-modal Factorized Bilinear pooling, an extension of MLB), MFH (Multi-modal Factorized High-order pooling, cascading multiple MFB blocks)

Important tricks: the normalization techniques are extremely important in bilinear pooling models.
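A compact sketch of one MFB block, including the power (signed square-root) and L2 normalization steps stressed above (the factor k and output size o are hyperparameters; names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Multi-modal Factorized Bilinear pooling: project both modalities to k*o dims,
    multiply element-wise, sum-pool over the factor k, then power- and L2-normalize."""
    def __init__(self, img_dim, txt_dim, k=5, o=1000):
        super().__init__()
        self.k, self.o = k, o
        self.proj_img = nn.Linear(img_dim, k * o)
        self.proj_txt = nn.Linear(txt_dim, k * o)

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, img_dim), txt_feat: (B, txt_dim)
        joint = self.proj_img(img_feat) * self.proj_txt(txt_feat)    # (B, k*o) factorized bilinear interaction
        joint = joint.view(-1, self.o, self.k).sum(dim=2)            # sum-pool over the factor dimension
        joint = torch.sign(joint) * torch.sqrt(joint.abs() + 1e-12)  # power (signed sqrt) normalization
        return F.normalize(joint)                                    # L2 normalization, (B, o)
```

MFH then cascades several such blocks, as noted in the list above.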
