2019/8/12
- Run QANet
- Types of knowledge base for VCR
- co-occurrence
- relative position
- …
- actions
- comparatives
- VCR dataset and model
- ConceptNet has a lot of noise
DBpedia is precise, but not very commonsense-oriented
- Visual Genome
scene graph
representing an image as a graph (the holy grail of image understanding)
- relation-graph form / triple form
triples (good extensibility)
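A tiny sketch of the triple idea: relations stored as (subject, predicate, object) tuples are easy to extend, and the same data can be rebuilt into a graph when needed. The example relations below are made up for illustration.

```python
# Scene-graph relations as (subject, predicate, object) triples.
from collections import defaultdict

triples = [
    ("person", "rides", "horse"),
    ("person", "wears", "hat"),
    ("horse", "on", "grass"),
]

# Extensibility: adding a new relation is just appending a tuple.
triples.append(("hat", "on", "person"))

# The same data viewed as a graph: adjacency list keyed by subject.
graph = defaultdict(list)
for subj, pred, obj in triples:
    graph[subj].append((pred, obj))

print(graph["person"])  # [('rides', 'horse'), ('wears', 'hat')]
```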
2019/8/29
CV tasks:
- Stage 1: Image → Image (image processing)
  - Deblur
  - Denoise
  - Dehaze
  - Super Resolution
- Stage 2: Image → Labels + Coordinates
  - segmentation
  - detection
  - tracking
- Stage 3: Image → Semantics/Language
- scene graph generation
- vision-language interaction
- Download ConceptNet and understand it
Methods for introducing external knowledge:
- 1. Representation based
- 2. Model based
  - 2.1 knowledge distillation
  - 2.2 pretraining & fine-tuning
- 3. Memory based
- introduce external knowledge
Representation based
Distributed representations: statistics-based knowledge, bag of words
e.g., represent "politics" by a probability distribution over words such as "country" and "election"
- Word2Vec GloVe
Paragraph2Vec Doc2Vec
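A minimal sketch of using such distributed representations: load pretrained GloVe vectors from their plain-text release and compare words by cosine similarity. The file path below is a placeholder.

```python
# Load GloVe vectors (word followed by space-separated floats per line)
# and compare words by cosine similarity.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = load_glove("glove.6B.300d.txt")  # hypothetical local path
print(cosine(glove["politics"], glove["election"]))  # related words: high similarity
print(cosine(glove["politics"], glove["banana"]))    # unrelated words: low similarity
```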
- Multimodal Compact Bilinear: worth a look, it uses GloVe
- Visual Relationship Detection with Language Priors, Lu et al., ECCV, 2016
Bayes' rule, prior and posterior probabilities; a typical application of Word2Vec
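Rough sketch of the idea in Lu et al.: the relationship score combines visual evidence with a language prior P(predicate | subject, object) computed from word embeddings. The layer shapes here are illustrative, not the paper's exact model.

```python
# Relationship scoring: visual score modulated by a language prior over predicates.
import torch
import torch.nn as nn

class RelationshipScorer(nn.Module):
    def __init__(self, visual_dim, word_dim, num_predicates):
        super().__init__()
        self.visual_head = nn.Linear(visual_dim, num_predicates)
        # Language prior: map concatenated (subject, object) word vectors
        # to a distribution over predicates.
        self.language_head = nn.Linear(2 * word_dim, num_predicates)

    def forward(self, union_feature, subj_vec, obj_vec):
        visual_score = self.visual_head(union_feature)
        prior_logits = self.language_head(torch.cat([subj_vec, obj_vec], dim=-1))
        prior = torch.softmax(prior_logits, dim=-1)  # P(predicate | subject, object)
        # Final score: visual evidence weighted by the linguistic prior.
        return visual_score * prior
```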
- Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation, Yu et al., ICCV, 2017
Prior information: enforce the student's output distribution to stay close to the teacher's
What it is really doing is learning commonsense; teacher & student are trained separately
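A minimal sketch of that distillation objective (generic, not the paper's exact loss): a KL term pulls the student's distribution toward the teacher's soft targets, alongside the usual cross-entropy on ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the (frozen) teacher, softened by temperature T.
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```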
- Learning to Detect Human-Object Interactions with Knowledge, Xu et al., CVPR, 2019
Pay attention to the papers it cites and the papers that cite it; top conference
Teacher & student are trained jointly
In the figure: green is the internal-knowledge target, red is external knowledge, yellow is shared knowledge; graph convolution connects the internal and external knowledge
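Since the note above mentions graph convolution between internal and external knowledge, here is the generic GCN propagation rule (Kipf & Welling style) as a reference point; it is not the paper's exact layer.

```python
# One graph-convolution step: normalized adjacency times node features times weights.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim), adj: (num_nodes, num_nodes) adjacency matrix.
        adj = adj + torch.eye(adj.size(0))                        # add self-loops
        deg = adj.sum(dim=1)
        norm = deg.pow(-0.5)
        adj_norm = norm.unsqueeze(1) * adj * norm.unsqueeze(0)    # D^-1/2 (A+I) D^-1/2
        return torch.relu(adj_norm @ self.weight(x))
```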
- VL-BERT: PRE-TRAINING OF GENERIC VISUAL-LINGUISTIC REPRESENTATIONS, Su et al., ArXiv, 2019
Both the upstream and downstream targets must be cross-modal tasks, which is a fatal drawback: such datasets are hard to collect
Conceptual Captions (Sharma et al., 2018): take a look
- CycleGAN unpaired data
Synthetic data: plenty of sharp images & blurry images, but unpaired
Semi-supervision: paired & unpaired data
- Dynamic Memory Networks for Visual and Textual Question Answering, Xiong et al., ICML, 2016
The input is split into units stored in memory
External knowledge can be plugged into this framework
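Simplified sketch of the memory idea: encode the input (or external knowledge) as memory slots and let the question attend over them. The full DMN episodic module adds gating and multiple passes; this only shows the attention step.

```python
import torch

def attend_memory(question, memory):
    # question: (dim,), memory: (num_slots, dim)
    scores = memory @ question                 # similarity of each slot to the question
    weights = torch.softmax(scores, dim=0)     # attention over memory slots
    return weights @ memory                    # weighted summary used to answer

memory = torch.randn(6, 128)    # e.g., 6 input sentences / facts encoded as vectors
question = torch.randn(128)
context = attend_memory(question, memory)
```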
- FVQA, with follow-up work from UIUC
At inference time, the knowledge base is queried (Memory Network)
From Recognition to Cognition: Visual Commonsense Reasoning(2018)
Overview
Four main contributions:
- Defines the task: VCR
- Introduces the dataset: VCR
- A method for constructing the dataset: Adversarial Matching
- A new reasoning model: the R2C network
Data collection
Object tags are detected with Mask R-CNN
Adversarial Matching
a new method that allows any 'language generation' dataset to be turned into a multiple-choice test
Here, we employ state-of-the-art models for Natural Language Inference: BERT [15] and ESIM+ELMo [10, 57], respectively.
Recognition to Cognition Networks
- Grounding layer: ground the objects referred to in the question and answers (align them with the image)
  - Learns a joint image-language representation; the core is a bidirectional LSTM
  - A CNN learns the image features (RoI-Aligned)
  - BERT learns the text features
- Contextualization layer: the question, answers, and image are combined into a shared context
  - Attention is used to obtain the weights (see the sketch after this list)
- Reasoning layer: reason over the relations among image regions, the question, and the answers
  - The response, attended query, and objects are fed to an LSTM; at each timestep the output is concatenated with the question and answer and passed to an MLP
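Generic sketch of the attention used in the contextualization step: each answer token attends over the question tokens to produce an attended query. This mirrors the mechanism described above but is not R2C's actual code.

```python
import torch

def attend(answer, question):
    # answer: (len_a, dim), question: (len_q, dim)
    scores = answer @ question.t()              # (len_a, len_q) similarity matrix
    weights = torch.softmax(scores, dim=-1)     # attention weights per answer token
    attended_query = weights @ question         # (len_a, dim)
    return attended_query, weights

answer = torch.randn(7, 512)
question = torch.randn(12, 512)
attended_query, weights = attend(answer, question)
```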
RoI Align is a region feature aggregation method proposed in the Mask R-CNN paper. It resolves the region misalignment caused by the two quantization steps in RoI Pooling. Experiments show that replacing RoI Pooling with RoI Align improves detection accuracy.
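Usage sketch of RoI Align via torchvision (torchvision.ops.roi_align); the feature-map size, boxes, and spatial_scale below are illustrative.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)          # (batch, channels, H, W) CNN features
# Boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
boxes = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],
                      [0, 30.0, 40.0, 120.0, 220.0]])
# spatial_scale maps image coordinates to feature-map coordinates
# (e.g., 50/800 if the input image was 800x800).
region_features = roi_align(feature_map, boxes, output_size=(7, 7),
                            spatial_scale=50 / 800, sampling_ratio=2, aligned=True)
print(region_features.shape)  # torch.Size([2, 256, 7, 7])
```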
Results and ablation studies
The model suffers most when using GloVe representations instead of BERT: a loss of 24%. This suggests that strong textual representations are crucial to VCR performance.
FVQA: Fact-based Visual Question Answering(2017)(TPAMI2018)
Main contributions:
- Dataset: FVQA
- A model for solving the task
The CVPR 2019 paper (Visual Question Answering as Reading Comprehension) mentions this work:
Wang et al. [29] introduced the “Fact-based VQA (FVQA)” problem and proposed a semantic parsing based method for supporting facts retrieval.
This method is vulnerable to misconceptions caused by synonyms and homographs.
Visual Question Answering as Reading Comprehension(2018)(CVPR2019)
1)represent image content explicitly by natural language and solve VQA as a reading comprehension problem
2)Two types of VQA models are proposed to address the open-ended VQA and the multiple-choice VQA respectively
3)address knowledge based VQA
Introduction
- VQA: visual question answering
- TQA: textual question answering / machine reading comprehension
- CNNs to represent images and RNNs to represent sentences or phrases
- The extracted visual and textual feature vectors are then jointly embedded by concatenation, element-wise sum or product to infer the answer (see the fusion sketch after this list)
- Multimodal Compact Bilinear pooling method (MCB) for VQA
- embed knowledge in memory slots and incorporate external knowledge with image, question and answer features via Dynamic Memory Networks (DMN)
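Sketch of those simple joint-embedding fusions, with illustrative dimensions:

```python
import torch

visual = torch.randn(32, 1024)   # image features from a CNN
textual = torch.randn(32, 1024)  # question features from an RNN

fused_concat = torch.cat([visual, textual], dim=-1)  # concatenation, (32, 2048)
fused_sum = visual + textual                         # element-wise sum, (32, 1024)
fused_prod = visual * textual                        # element-wise product, (32, 1024)
```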
Related Work
2.1. Joint embedding
2.2. Knowledge-based VQA
Existing work falls into two approaches: semantic parsing and information retrieval
2.3. Textual Question Answering
end-to-end neural network models and attention mechanism, such as DMN [17], r-net [31], DrQA [6], QANet [36], and most recently BERT [7]
VQA Models
3.1. QANet
It consists of embedding block, embedding encoder, context-query attention block, model encoder and output layer.
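A simplified version of the context-query attention block: build a similarity matrix between context and query tokens, then compute context-to-query and query-to-context attention. QANet uses a trilinear similarity function; a plain dot product is used here to keep the sketch short.

```python
import torch

def context_query_attention(context, query):
    # context: (len_c, dim), query: (len_q, dim)
    S = context @ query.t()                             # (len_c, len_q) similarity
    A = torch.softmax(S, dim=-1) @ query                # context-to-query attention, (len_c, dim)
    b = torch.softmax(S.max(dim=-1).values, dim=0)      # query-to-context weights, (len_c,)
    B = (b.unsqueeze(0) @ context).expand_as(context)   # query-to-context attention, (len_c, dim)
    # Standard output form: [c; A; c*A; c*B] per context token.
    return torch.cat([context, A, context * A, context * B], dim=-1)

out = context_query_attention(torch.randn(100, 128), torch.randn(20, 128))
print(out.shape)  # torch.Size([100, 512])
```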
3.2. Open-ended VQA model
Wang et al. [29] proposed to query the knowledge bases according to the estimated query types and visual concepts detected from the image.
Heuristic keyword matching: not used here
Rather than apply the heuristic matching approach which is vulnerable to homographs and synonyms, here we make use of all the retrieved candidate supporting facts as context
3.3. Multiple-choice VQA model
- Experiments
4.1. Datasets
4.2. Implementation Details
4.2.1 Results Analysis on FVQA
4.2.2 Results Analysis on Visual Genome QA
4.2.3 Results Analysis on Visual7W
- Conclusion
Never ending language learning
Previous approach: supervised function approximation
Related work:
e.g., lifelong learning (Thrun and Mitchell 1995), learning to learn (Thrun and Pratt 1998), and work on multitask transfer learning (Caruana 1997)
Generative Adversarial Nets (NIPS 2014)
GAN as Structured Learning (the output is composed of components with dependencies)
Structured Learning/Prediction: output a sequence, a matrix, a graph, a tree ……
Structured Learning as One-shot/Zero-shot Learning
Generator (bottom-up): learns to generate the object at the component level
Discriminator (top-down): evaluates the whole object and finds the best one
Discriminator: Evaluation function, Potential Function, Energy Function …
Using the discriminator to generate: graphical models
structured learning → graphical model → Bayesian Network (directed graph) / Markov Random Field (undirected graph)
The graph with its potential functions plays the role of the discriminator; training is iterative: gather positive + negative examples, let the model sample the negative examples, then update the model
Energy-based Model: http://www.cs.nyu.edu/~yann/research/ebm/ Although we do not know the distributions of P_G and P_data, we can sample from them.
Use samples to replace the expectation
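Sketch of "use samples to replace the expectation": both objectives are estimated by averaging over minibatches of real and generated samples. D is assumed to output probabilities in (0, 1).

```python
import torch

def discriminator_loss(D, G, real_batch, z_batch):
    fake_batch = G(z_batch).detach()
    # Monte-Carlo estimates of E_{x~P_data}[log D(x)] and E_{x~P_G}[log(1 - D(x))].
    return -(torch.log(D(real_batch)).mean() + torch.log(1 - D(fake_batch)).mean())

def generator_loss(D, G, z_batch):
    # Non-saturating form: maximize log D(G(z)) instead of minimizing log(1 - D(G(z))).
    return -torch.log(D(G(z_batch))).mean()
```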
Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation (ICCV, 2017)
Prior information: enforce the student's output distribution to stay close to the teacher's
What it is really doing is learning commonsense; teacher & student are trained separately
Datasets: Visual Relationship Detection (VRD, from Cewu Lu's group) and Visual Genome
Given a (subject, object) pair, compute the conditional probability distribution over predicates (see the sketch below)
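Sketch of that conditional distribution estimated by simply counting annotated triples (the internal-knowledge part only; the paper additionally distills external linguistic knowledge). The triples below are made up.

```python
# Estimate P(predicate | subject, object) from annotated relation triples.
from collections import Counter, defaultdict

annotations = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "feed", "horse"),
    ("person", "wear", "hat"),
]

counts = defaultdict(Counter)
for subj, pred, obj in annotations:
    counts[(subj, obj)][pred] += 1

def predicate_prior(subj, obj):
    c = counts[(subj, obj)]
    total = sum(c.values())
    return {pred: n / total for pred, n in c.items()}

print(predicate_prior("person", "horse"))  # {'ride': 0.666..., 'feed': 0.333...}
```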
NETWORK COMPRESSION
• Network Pruning
After pruning, the accuracy will drop (hopefully not too much)
Fine-tune on the training data to recover accuracy (see the pruning sketch after this list)
• Knowledge Distillation
• Parameter Quantization
• Architecture Design
• Dynamic Computation
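A pruning sketch matching the Network Pruning item above, using torch.nn.utils.prune for magnitude-based (L1) pruning followed by fine-tuning; the model here is a placeholder.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Remove 50% of the weights (smallest absolute values) in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# ... fine-tune on the training data here to recover accuracy ...

# Make the pruning permanent (folds the mask into the weight tensor).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```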
Troubleshooting
When installing the layers branch of torchvision, the following error occurred:
unable to execute '/usr/local/cuda:/bin/nvcc': No such file or directory
error: command '/usr/local/cuda:/bin/nvcc' failed with exit status 1
Fix: edit the environment variables in /etc/profile
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda:$CUDA_HOME
Change the third line to
export CUDA_HOME=/usr/local/cuda
spacy>=2.0,<2.1
en_core_web_sm version 2.0.0 is compatible with this spacy version
export PYTHONPATH=/home/B/gaoye/code/r2c
ln -s /home/B/gaoye/data/vcr/vcr1image
/dataloaders/vcr.py:
self.h5fn = os.path.join(VCR_ANNOTS_DIR, f'{self.embs_to_load}{self.mode}{self.split}.h5')
print("Loading embeddings from {}".format(self.h5fn), flush=True)
https://github.com/rowanz/r2c/issues/9
python train.py -params models/multiatt/default.json -folder models/saves/flagship_answer
scp -P 10035 -r gaoye@10.69.21.155:/home/B/gaoye/tool ~/tool
Unpaired Image Captioning by Language Pivoting (ECCV 2018)
Alibaba AI Labs
pivot核心,枢轴语言,作为第三语言沟通两种语言之间翻译的桥梁
Unpaired Image Captioning via Scene Graph Alignments (ICCV 2019)
Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with recurrent neural networks. In: ICLR (2016)
Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data (ECCV 2018)
CUHK
centerpiece: an ornament placed at the center of a table (vocabulary note)
Dual tasks:
Captioning Module
Self-retrieval Module
Image captioning methods can be divided into three categories [49].
- Template based methods [20, 29, 48] generate captions based on language templates.
- Search-based methods [11,13] search for the most semantically similar captions from a sentence pool.
- Recent works mainly focus on language-based methods with an encoder-decoder framework [7,14–17,28,41,43,46,47], where a convolutional neural network (CNN) encodes images into visual features, and an Long Short Term Memory network (LSTM) decodes features into sentences [41].
It has been shown that attention mechanisms [5,26,31,47] and high-level attributes and concepts [14,16,49,50] can help with image captioning.
The word at each time step t is chosen based on the probability distribution of each word by greedy decoding or beam search.
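Sketch of greedy decoding for such an encoder-decoder captioner; the decoder interface here is an assumption. Beam search would instead keep the top-k partial captions at each step.

```python
import torch

def greedy_decode(decoder, image_feature, start_id, end_id, max_len=20):
    # Assumed interface: decoder(prev_token_ids, image_feature, hidden) -> (logits, hidden)
    tokens = [start_id]
    hidden = None
    for _ in range(max_len):
        logits, hidden = decoder(torch.tensor([tokens[-1]]), image_feature, hidden)
        next_id = int(logits.argmax(dim=-1))   # greedy: pick the most probable word
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]
```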
rattan: a kind of climbing palm (vocabulary note)
posterior: situated behind (vocabulary note)
Unsupervised Image Captioning (CVPR 2019)
Tencent AI Lab
University of Rochester
warrant: to make necessary, to justify (vocabulary note)