2019/8/12
- Run QANet
- Types of knowledge base for VCR
- co-occurrence
- relative position
- …
- actions
- comparatives
- VCR dataset and model
- ConceptNet has a lot of noise
DBpedia is precise, but not very commonsense-oriented
- Visual Genome
scene graph
representing an image as a graph (the holy grail of image understanding)
- relation-graph form / triple form
triples (good extensibility)
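A tiny sketch of the triple idea: relations stored as (subject, predicate, object) tuples are easy to extend, and the same data can be rebuilt into a graph when needed. The example relations below are made up for illustration.

```python
# Scene-graph relations as (subject, predicate, object) triples.
from collections import defaultdict

triples = [
    ("person", "rides", "horse"),
    ("person", "wears", "hat"),
    ("horse", "on", "grass"),
]

# Extensibility: adding a new relation is just appending a tuple.
triples.append(("hat", "on", "person"))

# The same data viewed as a graph: adjacency list keyed by subject.
graph = defaultdict(list)
for subj, pred, obj in triples:
    graph[subj].append((pred, obj))

print(graph["person"])  # [('rides', 'horse'), ('wears', 'hat')]
```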
2019/8/29
CV tasks:
- Stage 1: Image → Image (image processing)
  - Deblur
  - Denoise
  - Dehaze
  - Super Resolution
- Stage 2: Image → Labels + Coordinates
  - segmentation
  - detection
  - tracking
- Stage 3: Image → Semantics/Language
- scene graph generation
- vision-language interaction
- Download ConceptNet and understand it
Methods for introducing external knowledge:
- 1. Representation based
- 2. Model based
  - 2.1 knowledge distillation
  - 2.2 pretraining & fine-tuning
- 3. Memory based
- introduce external knowledge
Representation based
Distributed representations: statistics-based knowledge, bag of words
e.g., represent "politics" by a probability distribution over words such as "country" and "election"
- Word2Vec GloVe
Paragraph2Vec Doc2Vec
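A minimal sketch of using such distributed representations: load pretrained GloVe vectors from their plain-text release and compare words by cosine similarity. The file path below is a placeholder.

```python
# Load GloVe vectors (word followed by space-separated floats per line)
# and compare words by cosine similarity.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

glove = load_glove("glove.6B.300d.txt")  # hypothetical local path
print(cosine(glove["politics"], glove["election"]))  # related words: high similarity
print(cosine(glove["politics"], glove["banana"]))    # unrelated words: low similarity
```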
- Multimodal Compact Bilinear: worth a look, it uses GloVe
- Visual Relationship Detection with Language Priors, Lu et al., ECCV, 2016
Bayes' rule, prior and posterior probabilities; a typical application of Word2Vec
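Rough sketch of the idea in Lu et al.: the relationship score combines visual evidence with a language prior P(predicate | subject, object) computed from word embeddings. The layer shapes here are illustrative, not the paper's exact model.

```python
# Relationship scoring: visual score modulated by a language prior over predicates.
import torch
import torch.nn as nn

class RelationshipScorer(nn.Module):
    def __init__(self, visual_dim, word_dim, num_predicates):
        super().__init__()
        self.visual_head = nn.Linear(visual_dim, num_predicates)
        # Language prior: map concatenated (subject, object) word vectors
        # to a distribution over predicates.
        self.language_head = nn.Linear(2 * word_dim, num_predicates)

    def forward(self, union_feature, subj_vec, obj_vec):
        visual_score = self.visual_head(union_feature)
        prior_logits = self.language_head(torch.cat([subj_vec, obj_vec], dim=-1))
        prior = torch.softmax(prior_logits, dim=-1)  # P(predicate | subject, object)
        # Final score: visual evidence weighted by the linguistic prior.
        return visual_score * prior
```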
- Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation, Yu et al., ICCV, 2017
Prior information: enforce the student's output distribution to stay close to the teacher's
What it is really doing is learning commonsense; teacher & student are trained separately
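A minimal sketch of that distillation objective (generic, not the paper's exact loss): a KL term pulls the student's distribution toward the teacher's soft targets, alongside the usual cross-entropy on ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets from the (frozen) teacher, softened by temperature T.
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```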
- Learning to Detect Human-Object Interactions with Knowledge, Xu et al., CVPR, 2019
Pay attention to the papers it cites and the papers that cite it; top conference
Teacher & student are trained jointly
In the figure: green is the internal-knowledge target, red is external knowledge, yellow is shared knowledge; graph convolution connects the internal and external knowledge
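Since the note above mentions graph convolution between internal and external knowledge, here is the generic GCN propagation rule (Kipf & Welling style) as a reference point; it is not the paper's exact layer.

```python
# One graph-convolution step: normalized adjacency times node features times weights.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim), adj: (num_nodes, num_nodes) adjacency matrix.
        adj = adj + torch.eye(adj.size(0))                        # add self-loops
        deg = adj.sum(dim=1)
        norm = deg.pow(-0.5)
        adj_norm = norm.unsqueeze(1) * adj * norm.unsqueeze(0)    # D^-1/2 (A+I) D^-1/2
        return torch.relu(adj_norm @ self.weight(x))
```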
- VL-BERT: PRE-TRAINING OF GENERIC VISUAL-LINGUISTIC REPRESENTATIONS, Su et al., ArXiv, 2019
Both the upstream and downstream targets must be cross-modal tasks, which is a fatal drawback: such datasets are hard to collect
Conceptual Captions (Sharma et al., 2018): take a look
- CycleGAN unpaired data
Synthetic data: plenty of sharp images & blurry images, but unpaired
Semi-supervision: paired & unpaired data
- Dynamic Memory Networks for Visual and Textual Question Answering, Xiong et al., ICML, 2016
The input is split into units stored in memory
External knowledge can be plugged into this framework
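Simplified sketch of the memory idea: encode the input (or external knowledge) as memory slots and let the question attend over them. The full DMN episodic module adds gating and multiple passes; this only shows the attention step.

```python
import torch

def attend_memory(question, memory):
    # question: (dim,), memory: (num_slots, dim)
    scores = memory @ question                 # similarity of each slot to the question
    weights = torch.softmax(scores, dim=0)     # attention over memory slots
    return weights @ memory                    # weighted summary used to answer

memory = torch.randn(6, 128)    # e.g., 6 input sentences / facts encoded as vectors
question = torch.randn(128)
context = attend_memory(question, memory)
```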
- FVQA, with follow-up work from UIUC
At inference time, the knowledge base is queried (Memory Network)
From Recognition to Cognition: Visual Commonsense Reasoning(2018)
Overview
Four main contributions:
- Defines the task: VCR
- Introduces the dataset: VCR
- A method for constructing the dataset: Adversarial Matching
- A new reasoning model: the R2C network
Data collection
Object tags are detected with Mask R-CNN
Adversarial Matching
a new method that allows any 'language generation' dataset to be turned into a multiple-choice test
Here, we employ state-of-the-art models for Natural Language Inference: BERT [15] and ESIM+ELMo [10, 57], respectively.
Recognition to Cognition Networks
- Grounding layer: ground the objects referred to in the question and answers (align them with the image)
  - Learns a joint image-language representation; the core is a bidirectional LSTM
  - A CNN learns the image features (RoI-Aligned)
  - BERT learns the text features
- Contextualization layer: the question, answers, and image are combined into a shared context
  - Attention is used to obtain the weights (see the sketch after this list)
- Reasoning layer: reason over the relations among image regions, the question, and the answers
  - The response, attended query, and objects are fed to an LSTM; at each timestep the output is concatenated with the question and answer and passed to an MLP
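Generic sketch of the attention used in the contextualization step: each answer token attends over the question tokens to produce an attended query. This mirrors the mechanism described above but is not R2C's actual code.

```python
import torch

def attend(answer, question):
    # answer: (len_a, dim), question: (len_q, dim)
    scores = answer @ question.t()              # (len_a, len_q) similarity matrix
    weights = torch.softmax(scores, dim=-1)     # attention weights per answer token
    attended_query = weights @ question         # (len_a, dim)
    return attended_query, weights

answer = torch.randn(7, 512)
question = torch.randn(12, 512)
attended_query, weights = attend(answer, question)
```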
RoI Align is a region feature aggregation method proposed in the Mask R-CNN paper. It resolves the region misalignment caused by the two quantization steps in RoI Pooling. Experiments show that replacing RoI Pooling with RoI Align improves detection accuracy.
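Usage sketch of RoI Align via torchvision (torchvision.ops.roi_align); the feature-map size, boxes, and spatial_scale below are illustrative.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)          # (batch, channels, H, W) CNN features
# Boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
boxes = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],
                      [0, 30.0, 40.0, 120.0, 220.0]])
# spatial_scale maps image coordinates to feature-map coordinates
# (e.g., 50/800 if the input image was 800x800).
region_features = roi_align(feature_map, boxes, output_size=(7, 7),
                            spatial_scale=50 / 800, sampling_ratio=2, aligned=True)
print(region_features.shape)  # torch.Size([2, 256, 7, 7])
```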
Results and ablation studies
The model suffers most when using GloVe representations instead of BERT: a loss of 24%. This suggests that strong textual representations are crucial to VCR performance.
FVQA: Fact-based Visual Question Answering(2017)(TPAMI2018)
Main contributions:
- Dataset: FVQA
- A model for solving the task
The CVPR 2019 paper (Visual Question Answering as Reading Comprehension) mentions this work:
Wang et al. [29] introduced the “Fact-based VQA (FVQA)” problem and proposed a semantic parsing based method for supporting facts retrieval.
This method is vulnerable to misconceptions caused by synonyms and homographs.
Visual Question Answering as Reading Comprehension(2018)(CVPR2019)
1)represent image content explicitly by natural language and solve VQA as a reading comprehension problem
2)Two types of VQA models are proposed to address the open-ended VQA and the multiple-choice VQA respectively
3)address knowledge based VQA
Introduction
- VQA: visual question answering
- TQA: textual question answering / machine reading comprehension
- CNNs to represent images and RNNs to represent sentences or phrases
- The extracted visual and textual feature vectors are then jointly embedded by concatenation, element-wise sum or product to infer the answer (see the fusion sketch after this list)
- Multimodal Compact Bilinear pooling method (MCB) for VQA
- embed knowledge in memory slots and incorporate external knowledge with image, question and answer features via Dynamic Memory Networks (DMN)
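Sketch of those simple joint-embedding fusions, with illustrative dimensions:

```python
import torch

visual = torch.randn(32, 1024)   # image features from a CNN
textual = torch.randn(32, 1024)  # question features from an RNN

fused_concat = torch.cat([visual, textual], dim=-1)  # concatenation, (32, 2048)
fused_sum = visual + textual                         # element-wise sum, (32, 1024)
fused_prod = visual * textual                        # element-wise product, (32, 1024)
```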
Related Work
2.1. Joint embedding
2.2. Knowledge-based VQA
Existing work falls into two approaches: semantic parsing and information retrieval
2.3. Textual Question Answering
end-to-end neural network models and attention mechanism, such as DMN [17], r-net [31], DrQA [6], QANet [36], and most recently BERT [7]
VQA Models
3.1. QANet
It consists of embedding block, embedding encoder, context-query attention block, model encoder and output layer.
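A simplified version of the context-query attention block: build a similarity matrix between context and query tokens, then compute context-to-query and query-to-context attention. QANet uses a trilinear similarity function; a plain dot product is used here to keep the sketch short.

```python
import torch

def context_query_attention(context, query):
    # context: (len_c, dim), query: (len_q, dim)
    S = context @ query.t()                             # (len_c, len_q) similarity
    A = torch.softmax(S, dim=-1) @ query                # context-to-query attention, (len_c, dim)
    b = torch.softmax(S.max(dim=-1).values, dim=0)      # query-to-context weights, (len_c,)
    B = (b.unsqueeze(0) @ context).expand_as(context)   # query-to-context attention, (len_c, dim)
    # Standard output form: [c; A; c*A; c*B] per context token.
    return torch.cat([context, A, context * A, context * B], dim=-1)

out = context_query_attention(torch.randn(100, 128), torch.randn(20, 128))
print(out.shape)  # torch.Size([100, 512])
```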
3.2. Open-ended VQA model
Wang et al. [29] proposed to query the knowledge bases according to the estimated query types and visual concepts detected from the image.
Heuristic keyword matching: not used here
Rather than apply the heuristic matching approach which is vulnerable to homographs and synonyms, here we make use of all the retrieved candidate supporting facts as context
3.3. Multiple-choice VQA model
- Experiments
4.1. Datasets
4.2. Implementation Details
4.2.1 Results Analysis on FVQA
4.2.2 Results Analysis on Visual Genome QA
4.2.3 Results Analysis on Visual7W
- Conclusion
Never ending language learning
Previous approach: supervised function approximation
Related work:
e.g., lifelong learning (Thrun and Mitchell 1995), learning to learn (Thrun and Pratt 1998), and work on multitask transfer learning (Caruana 1997)
Generative Adversarial Nets (NIPS 2014)
GAN as Structured Learning (the output is composed of components with dependencies)
Structured Learning/Prediction: output a sequence, a matrix, a graph, a tree ……
Structured Learning as One-shot/Zero-shot Learning
Generator (bottom-up): learns to generate the object at the component level
Discriminator (top-down): evaluates the whole object and finds the best one
Discriminator: Evaluation function, Potential Function, Energy Function …
Using the discriminator to generate: graphical models
structured learning → graphical model → Bayesian Network (directed graph) / Markov Random Field (undirected graph)
The graph with its potential functions plays the role of the discriminator; training is iterative: gather positive + negative examples, let the model sample the negative examples, then update the model
Energy-based Model: http://www.cs.nyu.edu/~yann/research/ebm/ Although we do not know the distributions of P_G and P_data, we can sample from them.
Use samples to replace the expectation
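Sketch of "use samples to replace the expectation": both objectives are estimated by averaging over minibatches of real and generated samples. D is assumed to output probabilities in (0, 1).

```python
import torch

def discriminator_loss(D, G, real_batch, z_batch):
    fake_batch = G(z_batch).detach()
    # Monte-Carlo estimates of E_{x~P_data}[log D(x)] and E_{x~P_G}[log(1 - D(x))].
    return -(torch.log(D(real_batch)).mean() + torch.log(1 - D(fake_batch)).mean())

def generator_loss(D, G, z_batch):
    # Non-saturating form: maximize log D(G(z)) instead of minimizing log(1 - D(G(z))).
    return -torch.log(D(G(z_batch))).mean()
```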
Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation (ICCV, 2017)
Prior information: enforce the student's output distribution to stay close to the teacher's
What it is really doing is learning commonsense; teacher & student are trained separately
Datasets: Visual Relationship Detection (VRD, from Cewu Lu's group) and Visual Genome
Given a (subject, object) pair, compute the conditional probability distribution over predicates (see the sketch below)
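Sketch of that conditional distribution estimated by simply counting annotated triples (the internal-knowledge part only; the paper additionally distills external linguistic knowledge). The triples below are made up.

```python
# Estimate P(predicate | subject, object) from annotated relation triples.
from collections import Counter, defaultdict

annotations = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "feed", "horse"),
    ("person", "wear", "hat"),
]

counts = defaultdict(Counter)
for subj, pred, obj in annotations:
    counts[(subj, obj)][pred] += 1

def predicate_prior(subj, obj):
    c = counts[(subj, obj)]
    total = sum(c.values())
    return {pred: n / total for pred, n in c.items()}

print(predicate_prior("person", "horse"))  # {'ride': 0.666..., 'feed': 0.333...}
```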
NETWORK COMPRESSION
• Network Pruning
After pruning, the accuracy will drop (hopefully not too much)
Fine-tune on the training data to recover accuracy (see the pruning sketch after this list)
• Knowledge Distillation
• Parameter Quantization
• Architecture Design
• Dynamic Computation
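A pruning sketch matching the Network Pruning item above, using torch.nn.utils.prune for magnitude-based (L1) pruning followed by fine-tuning; the model here is a placeholder.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Remove 50% of the weights (smallest absolute values) in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# ... fine-tune on the training data here to recover accuracy ...

# Make the pruning permanent (folds the mask into the weight tensor).
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```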
Troubleshooting
When installing the layers branch of torchvision, the following error occurred:
unable to execute '/usr/local/cuda:/bin/nvcc': No such file or directory
error: command '/usr/local/cuda:/bin/nvcc' failed with exit status 1
Fix: edit the environment variables in /etc/profile
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_HOME=/usr/local/cuda:$CUDA_HOME
Change the third line to
export CUDA_HOME=/usr/local/cuda
spacy>=2.0,<2.1
en_core_web_sm version 2.0.0 is compatible with this spacy version
export PYTHONPATH=/home/B/gaoye/code/r2c
ln -s /home/B/gaoye/data/vcr/vcr1image
/dataloaders/vcr.py:
self.h5fn = os.path.join(VCR_ANNOTS_DIR, f'{self.embs_to_load}{self.mode}{self.split}.h5')
print("Loading embeddings from {}".format(self.h5fn), flush=True)
https://github.com/rowanz/r2c/issues/9
python train.py -params models/multiatt/default.json -folder models/saves/flagship_answer
scp -P 10035 -r gaoye@10.69.21.155:/home/B/gaoye/tool ~/tool
Unpaired Image Captioning by Language Pivoting (ECCV 2018)
Alibaba AI Labs
pivot核心,枢轴语言,作为第三语言沟通两种语言之间翻译的桥梁
Unpaired Image Captioning via Scene Graph Alignments (ICCV 2019)
Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with recurrent neural networks. In: ICLR (2016)
Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data (ECCV 2018)
CUHK
centerpiece: an ornament placed at the center of a table (vocabulary note)
Dual tasks:
Captioning Module
Self-retrieval Module
Image captioning methods can be divided into three categories [49].
- Template based methods [20, 29, 48] generate captions based on language templates.
- Search-based methods [11,13] search for the most semantically similar captions from a sentence pool.
- Recent works mainly focus on language-based methods with an encoder-decoder framework [7,14–17,28,41,43,46,47], where a convolutional neural network (CNN) encodes images into visual features, and an Long Short Term Memory network (LSTM) decodes features into sentences [41].
It has been shown that attention mechanisms [5,26,31,47] and high-level attributes and concepts [14,16,49,50] can help with image captioning.
The word at each time step t is chosen based on the probability distribution of each word by greedy decoding or beam search.
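Sketch of greedy decoding for such an encoder-decoder captioner; the decoder interface here is an assumption. Beam search would instead keep the top-k partial captions at each step.

```python
import torch

def greedy_decode(decoder, image_feature, start_id, end_id, max_len=20):
    # Assumed interface: decoder(prev_token_ids, image_feature, hidden) -> (logits, hidden)
    tokens = [start_id]
    hidden = None
    for _ in range(max_len):
        logits, hidden = decoder(torch.tensor([tokens[-1]]), image_feature, hidden)
        next_id = int(logits.argmax(dim=-1))   # greedy: pick the most probable word
        if next_id == end_id:
            break
        tokens.append(next_id)
    return tokens[1:]
```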
rattan: a kind of climbing palm (vocabulary note)
posterior: situated behind (vocabulary note)
Unsupervised Image Captioning (CVPR 2019)
Tencent AI Lab
University of Rochester
warrant: to make necessary, to justify (vocabulary note)