# Related Work
###### tags: `論文`
**Keywords**
1. Temporal Action Detection
2. Online Action Detection
3. Early Action Detection
4. Online Detection of Action Start (ODAS)
---
Well-organized reference notes by others (supplementary):
https://hackmd.io/@allen108108
#### TV-L1 OpticalFlow
https://zhuanlan.zhihu.com/p/42537928
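For reference, a minimal sketch of computing TV-L1 optical flow with OpenCV (this assumes `opencv-contrib-python`; the factory function name differs across OpenCV versions, and the frame filenames are hypothetical):
```python
import cv2
import numpy as np

prev = cv2.imread('frame_000.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical frame files
curr = cv2.imread('frame_001.jpg', cv2.IMREAD_GRAYSCALE)

# OpenCV 4.x contrib; on OpenCV 3.x the factory is cv2.createOptFlow_DualTVL1()
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
flow = tvl1.calc(prev, curr, None)            # H x W x 2 float32 (dx, dy)

# common convention: clip to [-20, 20] and rescale to 0-255 before saving as images
flow = np.clip(flow, -20, 20)
flow_img = ((flow + 20) / 40 * 255).astype(np.uint8)
cv2.imwrite('flow_x.jpg', flow_img[..., 0])
cv2.imwrite('flow_y.jpg', flow_img[..., 1])
```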
#### THUMOS unzip password
https://zhuanlan.zhihu.com/p/80333569
```python
# unzip password for the test set
THUMOS14_REGISTERED
```
---
:::spoiler
### Temporal Action Detection
The entire video must be seen before a decision can be made.
S-CNN: generates proposals, classifies them, and regresses their boundaries.
R-C3D [37]: reduces computational cost.
[8] proposes TCN, which incorporates context and ranks proposals to produce classifications.
[4] proposes an efficient model that avoids overlap.
[28] proposes CDC, which uses a 3D CNN.
### Online Action Detection
No looking into the future; only past frames are used, and a decision is made as soon as each frame is seen.
[9] introduces the problem.
[12] a reinforced decoder network built from temporal recurrent networks [38].
**Referenced for use in ClsNet**
### Early Action Detection
One SVM per action.
### Online Detection of Action Start (ODAS)
[github](https://github.com/junting/odas) (empty repository)
Frames the action-start problem as a regression-style problem.
To achieve this goal, they force the learned representation of an action-start window to be similar to that of the following action window and different from that of the preceding background.
:::
---
[Temporal Action Detection with Structured Segment Networks github](https://github.com/yjxiong/action-detection)
[Hao Zhang's notes on recent progress in video understanding (Zhihu)](https://www.itread01.com/content/1547008204.html)
[Temporal Action Detection summary](https://zhuanlan.zhihu.com/p/52524590)
[Temporal Action Detection survey](https://www.geek-share.com/detail/2755691303.html)
#### (TAG) [A Pursuit of Temporal Accuracy in General Activity Detection](https://arxiv.org/pdf/1703.02716.pdf)
[Code github](https://github.com/yjxiong/action-detection)
#### (TAP) [TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals](https://arxiv.org/pdf/1703.06189.pdf)
#### (SSN) [Structured Segment Networks](https://blog.csdn.net/zhang_can/article/details/79782387)
[paper](https://arxiv.org/pdf/1704.06228.pdf)
[code by pytorch](http://yjxiong.me/others/ssn/)
#### (CBR) [Cascaded Boundary Regression for Temporal Action Detection](https://pdfs.semanticscholar.org/3ca2/0fcaf2f129c363f6b0daa74b32b2467116c6.pdf)
[Code github by TensorFlow](https://github.com/jiyanggao/CBR)
#### (R-C3D) [R-C3D: Region Convolutional 3D Network for Temporal Activity Detection](http://openaccess.thecvf.com/content_ICCV_2017/papers/Xu_R-C3D_Region_Convolutional_ICCV_2017_paper.pdf)
Source: ICCV 2017
[Code github use caffe](https://github.com/VisionLearningGroup/R-C3D)
[code ported to PyTorch](https://github.com/sunnyxiaohu/R-C3D.pytorch)
[C3D pytorch](https://github.com/JJBOY/C3D-pytorch)
---
### THUMOS 14

---
### [R-C3D: Region Convolutional 3D Network for Temporal Activity Detection](http://openaccess.thecvf.com/content_ICCV_2017/papers/Xu_R-C3D_Region_Convolutional_ICCV_2017_paper.pdf)
Source: ICCV 2017 ==cited 241 times==
**R-C3D contributions**
1. End-to-end detection on videos of arbitrary length and actions of arbitrary length, combining proposal generation and classification
2. Speeds up the network by sharing the C3D parameters between the proposal generation and classification networks
3. Evaluated on three benchmarks with good results
==In short==
1. It applies the Faster R-CNN Region Proposal Network to the temporal domain
2. It applies Faster R-CNN ROI pooling to the temporal domain
**R-C3D architecture**

*Expanded view*

[C3D paper](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Tran_Learning_Spatiotemporal_Features_ICCV_2015_paper.pdf)
The original paper feeds non-overlapping 16-frame clips as input.
Here the input length depends on GPU memory; Section 4.1 of the paper states that the GPU memory buffer is 768 frames.
The videos are 25 fps, so that is roughly 30 seconds.
The paper's learning rate is 0.0001 (THUMOS'14).
Anchor scales: [2, 4, 5, 6, 8, 9, 10, 12, 14, 16],
covering between 0.64 and 5.12 seconds:
2 × 8/25 = 0.64 and 16 × 8/25 = 5.12 (each anchor unit spans 8 input frames; see the check below).
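A quick check of those numbers (each anchor unit corresponds to 8 input frames after C3D's temporal pooling, and the THUMOS'14 videos are 25 fps):
```python
fps = 25
frames_per_unit = 8                      # temporal stride at the proposal stage
anchors = [2, 4, 5, 6, 8, 9, 10, 12, 14, 16]
durations = [a * frames_per_unit / fps for a in anchors]
print(durations)   # [0.64, 1.28, 1.6, 1.92, 2.56, 2.88, 3.2, 3.84, 4.48, 5.12]
```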
**R-C3D pipeline**
1. Feature extraction network: extracts features from an input video of arbitrary length
The backbone is C3D, taking a sequence of RGB frames as input
2. Proposal subnet: extracts temporal segments that may contain actions
To detect proposals of all lengths, the proposal subnet adopts the RPN from Faster R-CNN,
as in the figure above
3. Activity classification subnet: classifies the detected temporal segments
4. Loss functions (see the sketch below)
smooth L1 loss (boundary regression)
Softmax (classification)
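A minimal PyTorch sketch of those two loss terms (the function name, the offset parameterization, and the loss weight are illustrative assumptions, not the paper's exact code):
```python
import torch
import torch.nn.functional as F

def rc3d_style_loss(cls_logits, cls_targets, reg_pred, reg_targets, lambda_reg=1.0):
    """Classification (softmax cross-entropy) + boundary regression (smooth L1).

    cls_logits : (N, num_classes)  class scores per proposal
    cls_targets: (N,)              ground-truth class indices (0 = background)
    reg_pred   : (N, 2)            predicted (center, length) offsets
    reg_targets: (N, 2)            target offsets (only meaningful for positives)
    """
    cls_loss = F.cross_entropy(cls_logits, cls_targets)
    pos = cls_targets > 0                         # regress only non-background proposals
    if pos.any():
        reg_loss = F.smooth_l1_loss(reg_pred[pos], reg_targets[pos])
    else:
        reg_loss = reg_pred.sum() * 0.0           # keep a zero-valued, differentiable term
    return cls_loss + lambda_reg * reg_loss
```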
**Results comparison**

---
==Papers that cite R-C3D==
---
**The better ones**
### [BSN: Boundary Sensitive Network for Temporal Action Proposal Generation](http://openaccess.thecvf.com/content_ECCV_2018/papers/Tianwei_Lin_BSN_Boundary_Sensitive_ECCV_2018_paper.pdf)
Source: ECCV 2018 ==cited 88 times==
*Implementation: TensorFlow, two-stream convolution*

**THUMOS'14: 28.9 >>> 36.9**
**BSN architecture**

**BSN pipeline**
1. Video feature extraction
A two-stream network is used to extract the features
2. Boundary-Sensitive Proposal generation *(temporal action proposals, BSP)*
BSN consists of three modules in sequence: a **temporal evaluation module**, a **proposal generation module**, and a **proposal evaluation module**
3. Non-maximum suppression (Soft-NMS) post-processing
Finally, non-maximum suppression is applied to remove overlapping results. Specifically, the Soft-NMS algorithm suppresses overlapping proposals by decaying their scores rather than discarding them outright; the surviving proposals are BSN's final temporal action proposals (see the sketch below).
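A minimal Gaussian Soft-NMS sketch for 1-D temporal proposals (the `sigma` and score threshold values are assumptions, not BSN's exact settings):
```python
import numpy as np

def temporal_iou(seg, segs):
    """IoU between one [start, end] segment and an array of segments (N x 2)."""
    inter = np.maximum(0.0, np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]))
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def soft_nms(proposals, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping proposals instead of removing them."""
    proposals, scores = np.asarray(proposals, float), np.asarray(scores, float).copy()
    keep = []
    while len(scores) > 0 and scores.max() > score_thresh:
        i = scores.argmax()
        keep.append((proposals[i].tolist(), float(scores[i])))
        ious = temporal_iou(proposals[i], proposals)
        scores = scores * np.exp(-(ious ** 2) / sigma)   # soft decay by overlap
        scores[i] = 0.0                                  # never pick the same proposal twice
    return keep
```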
---
### [AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos](http://openaccess.thecvf.com/content_ECCV_2018/papers/Zheng_Shou_AutoLoc_Weakly-supervised_Temporal_ECCV_2018_paper.pdf)
Source: ECCV 2018 ==cited 58 times==
*Implementation: Caffe*

**THUMOS'14 ==28.9 >>> 21.2==, but it is weakly supervised**
1. A weakly supervised TAL (Temporal Action Localization) framework
2. A new loss: the Outer-Inner-Contrastive (OIC) loss
**AutoLoc architecture**

**AutoLoc pipeline**
1. Input: the video is split into non-overlapping 15-frame snippets
2. A D-dimensional feature is extracted for each snippet, giving T snippets in total
3. Two branches sit on top of the classifier
a. classification (action)
b. localization (time)
4. Loss: Outer-Inner-Contrastive (OIC)
**Boundary prediction process**

1. Anchor generation to obtain boundary hypotheses
2. Boundary regression to obtain the predicted action boundaries (the inner boundaries)
3. Boundary inflation to obtain the outer boundaries
a. the current boundary prediction (orange)
b. for each anchor, t_x shifts the center point and t_w scales the anchor length
4. OIC: the OIC loss measures how likely each segment is to contain an action, and segments unlikely to contain an action are removed (see the sketch below)
*The paper is heavy on equations.*
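A rough sketch of the OIC idea on a 1-D class activation sequence (the inflation ratio is an assumed value; the paper defines the loss per class with additional details):
```python
import numpy as np

def oic_loss(cas, inner_start, inner_end, inflation=0.25):
    """Outer-Inner-Contrastive loss on a 1-D class activation sequence `cas`.

    The outer boundary inflates the inner segment by `inflation` on each side;
    the loss is (mean activation in the outer ring) - (mean activation inside),
    so segments with high inner activation and low surrounding activation score low.
    """
    length = inner_end - inner_start
    outer_start = max(0, int(inner_start - inflation * length))
    outer_end = min(len(cas), int(inner_end + inflation * length))
    inner = cas[inner_start:inner_end]
    ring = np.concatenate([cas[outer_start:inner_start], cas[inner_end:outer_end]])
    ring_mean = ring.mean() if len(ring) > 0 else 0.0
    return ring_mean - inner.mean()
```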
---
### [Rethinking the Faster R-CNN Architecture for Temporal Action Localization](http://openaccess.thecvf.com/content_cvpr_2018/papers/Chao_Rethinking_the_Faster_CVPR_2018_paper.pdf)
Source: CVPR 2018 ==cited 136 times==

**THUMOS'14 ==28.9 >>> 42.8==**
**TAL-Net contributions**
1. The receptive field should be aligned with the proposal length
Can proposals of different lengths (different temporal spans) share one and the same feature (the same receptive field)? (lower right)
The authors argue this is unreasonable; it should look like the lower left

2. Enlarge the receptive field to exploit context information

3. Multi-stream features (RGB and optical flow) fused late (see the sketch after this list)

**TAL-Net architecture**
Left: Faster R-CNN; right: TAL-Net

---
### [Two-Stream Region Convolutional 3D Network for Temporal Activity Detection](https://arxiv.org/pdf/1906.02182v1.pdf)
Source: IEEE TPAMI 2019 ==cited 2 times==
Authors: the same authors as R-C3D

**THUMOS'14: 28.9 >>> 36.1**
*The original single-stream version is shown in the figure above.*
**Two-stream R-C3D architecture**

Essentially the same as before, but with optical flow added.
---
**The weaker ones (pitfalls)**
### [ConvNet Architecture Search for Spatiotemporal Feature Learning](https://arxiv.org/pdf/1708.05038.pdf)
Source: CVPR 2017 ==cited 157 times==
Author: the person who proposed C3D
**Contributions**
1. A simplified network that trains faster
2. Uses Res3D
3. The best input frame sampling interval is **2–4**
4. A depth of 18 layers is optimal, balancing accuracy, computational complexity, and network capacity

**THUMOS'14: 28.9 >>> 22.5**
---
### **Not among the citing papers, but also worth a look**
Shou et al., CDC: [CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos](https://arxiv.org/pdf/1703.01515.pdf)
Source: CVPR 2017 ==cited 232 times==
[Weakly Supervised Action Localization by Sparse Temporal Pooling Network](http://openaccess.thecvf.com/content_cvpr_2018/papers/Nguyen_Weakly_Supervised_Action_CVPR_2018_paper.pdf)
Source: CVPR 2018 ==cited 72 times==
## New architecture
Step 1: video > frames > GoogLeNet (Inception v3) > output; a frame-extraction sketch follows.
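A minimal sketch of the "video > frame" step with OpenCV (the paths and sampling step are hypothetical):
```python
import os
import cv2

def video_to_frames(video_path, out_dir, step=1):
    """Dump every `step`-th frame of a video to numbered JPEG files."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f'{saved:06d}.jpg'), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```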
``` python
#coding=utf-8
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as Data
from torch.autograd import Variable
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np
from torchvision.datasets import ImageFolder
from torchvision.transforms import ToTensor
from torchvision import transforms, datasets
from sklearn.metrics import confusion_matrix
import pandas as pd
```
``` python
EPOCH = 100
BATCH_SIZE = 16
LR = 0.005
TRAIN_DATA_PATH='D:\\論文data\\train_data'
TEST_DATA_PATH='D:\\論文data\\test_data'
saveidx = 0.005  # loss threshold below which checkpoints are saved
```
``` python
use_cuda = torch.cuda.is_available()
if use_cuda:
    print("Great, you have a GPU!")
else:
    print("Life is short -- consider a GPU!")
```
``` python
data_transform = transforms.Compose([
    # transforms.RandomResizedCrop(299),  # Inception v3 input is 299x299
    transforms.Resize((299, 299)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
```
``` python
train_data = ImageFolder(root=TRAIN_DATA_PATH, transform=data_transform)
test_data = ImageFolder(root=TEST_DATA_PATH, transform=data_transform)
print(train_data.classes)
```
``` python
train_loader = Data.DataLoader(train_data, batch_size=BATCH_SIZE,shuffle=True)
test_loader = Data.DataLoader(test_data, batch_size=BATCH_SIZE,shuffle=False)
```
``` python
# save / restore
# net = torch.load('my_thumos_5.pth')
# net = torchvision.models.Inception3()  # the net used at first (no pretrained weights)
net = torchvision.models.inception_v3(pretrained=True)
# Inception v3 has two output heads: the main output (final linear layer) and the
# auxiliary output (AuxLogits), so both must be replaced
net.AuxLogits.fc = nn.Linear(768, 3)
num_ftrs = net.fc.in_features
net.fc = nn.Linear(num_ftrs, 3)  # the final output has only 3 classes
```
[Fine-tuning torchvision models tutorial](https://pytorch.apachecn.org/docs/1.2/beginner/finetuning_torchvision_models_tutorial.html)
``` python
# Balance the classes: weight each sample inversely to its class frequency
def make_weights_for_balanced_classes(images, nclasses):
    count = [0] * nclasses
    for item in images:
        count[item[1]] += 1                      # item = (path, class_index)
    weight_per_class = [0.] * nclasses
    N = float(sum(count))
    for i in range(nclasses):
        weight_per_class[i] = N / float(count[i])
    weight = [0] * len(images)
    for idx, val in enumerate(images):
        weight[idx] = weight_per_class[val[1]]
    return weight
```
```python
weights = make_weights_for_balanced_classes(train_data.imgs, len(train_data.classes))
weights = torch.DoubleTensor(weights)
sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, len(weights))
train_loader = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=False,
                                           sampler=sampler, num_workers=1, pin_memory=True)
```
```python
net.cuda()
optimizer = optim.Adam(net.parameters(), lr=LR)
loss_func = nn.CrossEntropyLoss().cuda()
```
```python
for epoch in range(EPOCH):
    for step, (x, y) in enumerate(train_loader):
        b_x = x.cuda()   # batch x (images)
        b_y = y.cuda()   # batch y (targets)
        # zero the parameter gradients
        optimizer.zero_grad()
        # in training mode inception_v3 returns (logits, aux_logits); only the main logits are used here
        output = net(b_x)[0]
        # compute the loss
        loss = loss_func(output, b_y)
        # backpropagate
        loss.backward()
        # update the parameters
        optimizer.step()
        if step % 10 == 0:
            print('Epoch: ', epoch, '| train loss: %.4f' % loss.data.cpu().numpy())
            if saveidx >= loss.data.cpu().numpy():
                savemodelname = 'my_thumos_{}.pth'.format(epoch)
                torch.save(net, savemodelname)
                print('best.pth save success')
                saveidx = loss.data.cpu().numpy()
print('train over')
# ways to save the net
torch.save(net, 'my_thumos.pth')  # save the entire net
print('save success')
# print('best train loss is : %.4f' % saveidx)
```
:::spoiler Test section (possibly buggy)
```python
correct = 0
total = 0
labels_list = []
pred_list = []
# note: the model is still in training mode here; calling net.eval() first would be more correct
for data in test_loader:
    images, labels = data
    outputs, _ = net(images.cuda())          # inception_v3 returns (logits, aux_logits) in train mode
    labels = labels.cuda()
    total += labels.size(0)
    pred = torch.max(outputs, 1)[1].cuda()
    pred_list.extend(pred.cpu().numpy())     # for the confusion matrix
    labels_list.extend(labels.cpu().numpy()) # for the confusion matrix
    correct += (pred == labels).sum()
print('accuracy : %d %%' % (100 * correct / total))
```
```python
#coding=utf-8
test_data.class_to_idx
# print(confusion_matrix(labels_list, pred_list))  # confusion matrix via sklearn
y_actu = pd.Series(labels_list, name='Actual')
y_pred = pd.Series(pred_list, name='Predicted')
df_confusion = pd.crosstab(y_actu, y_pred)
df_confusion
```
:::
```python
# single-image test
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import glob, os
import sys
import getopt
import torch
import torchvision.transforms as transforms
from torch.autograd import Variable

use_cuda = torch.cuda.is_available()
if use_cuda:
    print("Great, you have a GPU!")
else:
    print("Life is short -- consider a GPU!")

new_model = torch.load('my_thumos.pth')
new_model.cuda()
new_model.eval()   # switch the trained model to evaluation mode for testing

def predict_image(image_path):
    # print("Prediction in progress")
    image = Image.open(image_path)
    # Transformations for the image (Inception v3 expects 299x299 input)
    transformation = transforms.Compose([
        # transforms.RandomResizedCrop(299),  # Inception v3 input is 299x299
        transforms.Resize((299, 299)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    # preprocess the image
    image_tensor = transformation(image).float()
    # add a batch dimension, since PyTorch treats all inputs as batches
    image_tensor = image_tensor.unsqueeze_(0)
    if torch.cuda.is_available():
        image_tensor = image_tensor.cuda()   # (the original call without the assignment was a no-op)
    # wrap the input as a Variable
    input = Variable(image_tensor).cuda()
    # predict the class of the image
    output = new_model(input)
    index = output.data.cpu().numpy().argmax()
    return index
```
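Hypothetical usage of `predict_image`: run it over a folder of extracted frames and collect the per-frame class indices for the voting step below (the folder layout is an assumption):
```python
import glob

frame_paths = sorted(glob.glob('D:\\論文data\\some_video_frames\\*.jpg'))  # hypothetical folder
time_list = [predict_image(p) for p in frame_paths]
```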
```python
# --- Data inspection ---
# Look at the distribution of per-frame predicted classes
def count_list_info(time_list):
    print('len : ', len(time_list))
    print('0 : ', time_list.count(0))
    print('1 : ', time_list.count(1))
    print('2 : ', time_list.count(2))
    print('3 : ', time_list.count(3))
    print('4 : ', time_list.count(4))
    print('5 : ', time_list.count(5))

# Anchor voting: for each frame, over all anchor lengths, count how many of the
# covered frames are non-background (class 4 is treated as background here)
def run_anchor(time_list):
    anchor = [2, 4, 5, 6, 8, 9, 10, 12, 14, 16]
    test_list = []
    for i in range(len(time_list) - 16):
        scr = 0
        for idx in anchor:
            # scr = 0
            for j in range(0, idx):
                a = time_list[i + j]
                if a != 4:
                    scr += 1
        test_list.append(scr)
    return test_list

# Turn the voting scores into the final decision time points
def look_ans(ans):
    check_max = []
    for i, a in enumerate(ans, start=1):
        # to dump every score:
        # print(f"{a}", end=' ')
        # if i % 30 == 0:
        #     print('')
        if a >= 38:                                # voting-score threshold
            check_max.append((i, a))
            # print(f"frame:{i},{a}")
    index = check_max[0][0]
    ans_time_list = []
    pp = []
    for i in check_max:
        diff = i[0] - index
        if diff >= 60 or diff == 0:                # a gap of 60+ frames starts a new segment
            index = i[0]
            if len(ans_time_list) > 0:
                pp.append((min(ans_time_list), max(ans_time_list)))
                ans_time_list = []
        if diff < 30 or diff == 0:
            ans_time_list.append(i[0])
            index = i[0]
    print(pp)
```
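Hypothetical end-to-end usage of the three helpers above, assuming `time_list` is the per-frame prediction list collected earlier:
```python
count_list_info(time_list)      # inspect the class distribution
ans = run_anchor(time_list)     # anchor voting scores per frame
look_ans(ans)                   # print the detected (start, end) frame ranges
```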
---
## My experimental results
:::success
The experimental results below are for:
Online Action Detection Based on Two Dimensional Convolutional Network and Temporal Sampling Architecture
:::
**my_paper_0_wei2.pth**

**my_paper_6_wei2.pth**

**my_paper_19_wei2.pth**

**my_paper_30_wei2.pth**

**my_paper_32_wei2.pth**

**my_paper_40_wei2.pth**

**my_paper_43_wei2.pth**

**my_paper_46_wei2.pth**

---
**Dataset: THUMOS'14**
| IoU threshold | C3D (mAP) | ResNet-18 (mAP) |
| :----: | :----: | :----: |
| 0.1 | 0.58 | 0.50 |
| 0.2 | 0.57 | 0.49 |
| 0.3 | 0.54 | 0.46 |
| 0.4 | 0.47 | 0.41 |
| **0.5** | **0.39** | **0.34** |
| 0.6 | 0.29 | 0.25 |
| 0.7 | 0.18 | 0.16 |
### ref
```tiddlywiki
[1] S. Ji, W. Xu, M. Yang, and K. Yu, “3D Convolutional Neural Networks for Human Action Recognition,” IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), Vol. 35, No. 1, pp. 221-231, 2013.
[2] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning Realistic Human Actions from Movies,” Computer Vision and Pattern Recognition, 2008.
[3] J.Y.H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, ”Beyond Short Snippets: Deep Networks for Video Classification,” CVPR, 2015.
[4] K. Simonyan and A. Zisserman, “Two-stream Convolutional Networks for Action Recognition in Videos,” Neural Information Processing Systems, pages 568–576, 2014.
[5] H. Wang and C. Schmid, “Action Recognition with Improved Trajectories,” International Conference on Computer Vision (ICCV), 2013.
[6] Z. Jingjing, Z. Jiang, and R. Chellappa, “Cross-View Action Recognition via Transferable Dictionary Learning,” IEEE Transactions on Image Processing, 25(6):2542–2556, 2016.
[7] J. Donahue, L.A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” IEEE Trans. on PAMI, 2017.
[8] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CVPR, pp. 4724-4733, 2017.
[9] Y.H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: deep networks for video classification,” CVPR, 2015.
[10] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3d residual networks,” ICCV, pp. 5534-5542, 2017.
[11] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal LSTM with trust gates for 3D human action recognition,” ECCV, pp. 816-833, 2016.
[12] A. Diba, M. Fayyaz, V. Sharma, A.H. Karami, M.M. Arzani, R. Yousefzadeh, and L. Van Gool, “Temporal 3D convNets: new architecture and transfer learning for video classification,” arXiv:1711.08200, 2017.
[13] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” CVPR, 2016.
[14] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L.V. Gool, “Temporal segment networks: towards good practices for deep action recognition,” ECCV, pp. 20-36, 2016.
[15] H. Wang and C. Schmid, ”Action recognition with improved trajectories,” ICCV,2013.
[16] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” CVPR, pp. 1-8, 2008.
[17] J. Zheng, Z. Jiang, and R. Chellappa, “Cross-view action recognition via transferable dictionary learning,” IEEE Trans. on IP, Vol. 25, No. 6, pp. 2542-2556, 2016.
[18] P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “Learning to track for spatio-temporal action localization,” ICCV, 2015.
[19] G. Yu and J. Yuan, “Fast action proposals for human action detection and search,” CVPR, 2015.
[20] A. Gaidon, Z. Harchaoui, and C. Schmid, “Temporal localization of actions with actoms,” IEEE Trans. on PAMI, Vol. 35, No. 11, pp. 2782-2795, 2013.
[21] Z. Shu, K. Yun, and D. Samaras, “Action detection with improved dense trajectories and sliding window,” ECCV, pp. 541-551, 2014.
[22] S. Karaman, L. Seidenari, and A.D. Bimbo, “Fast saliency based pooling of Fisher encoded dense trajectories,” ECCV THUMOS Workshop, 2014.
[23] D. Oneata, J. Verbeek, and C. Schmid, “The LEAR submission at Thumos 2014,” ECCV THUMOS Workshop, 2014.
[24] L. Wang, Y. Qiao, and X. Tang, “Action recognition and detection by combining motion and appearance features,” ECCV THUMOS Workshop, vol. 1, 2014.
[25] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR, 2014.
[26] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” NIPS, pp. 91-99, 2015.
[27] R. Hou, C. Chen, and M. Shah, “Tube convolutional neural network (T-CNN)for action detection in videos,” ICCV, 2017.
[28] V. Escorcia, F. Heilbron, J. Niebles, and B. Ghanem, “DAPs: deep action proposals for action understanding,” ECCV, pp. 768-784, 2016.
[29] A. Montes, A. Salvador, and X. Nieto, “Temporal activity detection in untrimmed videos with recurrent neural networks,” arXiv:1608.08128, 2016.
[30] B. Singh, T. Marks, M. Jones, O. Tuzel, and M. Shao, “A multi-stream bi-directional recurrent neural network for fine-grained action detection,” CVPR, 2016.
[31] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, “End-to-end learning of action detection from frame glimpses in videos,” CVPR, pp. 2678-2687, 2016.
[32] S. Ma, L. Sigal, and S. Sclaroff, “Learning activity progression in LSTMs for activity detection and early detection,” CVPR, pp. 1942-1950, 2016.
[33] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and F. Chang, “CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos,” CVPR, 2017.
[34] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin, “Temporal action detection with structured segment networks,” ICCV, 2017.
[35] F. Heilbron, V. Escorcia, B. Ghanem, and J. Niebles, “ActivityNet: A large-scale video benchmark for human activity understanding,” CVPR, 2015.
[36] He, Kaiming, et al. "Deep residual learning for image recognition," CVPR, 2016.
[37] Xu, Huijuan, Abir Das, and Kate Saenko. "R-c3d: Region convolutional 3d network for temporal activity detection," ICCV, 2017.
[38] Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement," arXiv preprint arXiv:1804.02767 (2018).
[39] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision," CVPR, 2016.
[40] D.P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[41] PyTorch website, https://pytorch.org/
[42] YouTube, https://www.youtube.com/
```
There is often a great variety of information in a single video; it may be a talk show or an exciting sports event, and one glance is enough for us to tell what kind of video it is. In machine learning, detecting where the actions are in a video is usually done with 3D Convolutional Networks (C3D) or a Long Short-Term Memory network (LSTM) for feature extraction, but C3D lacks good pretrained models and LSTMs often take a long time to train. Typical two-dimensional convolutional neural networks such as AlexNet, VGG, and ResNet, on the other hand, now come with mature, highly accurate pretrained models; using them for transfer learning builds on that prior work and saves a great deal of time. This thesis aims to identify the start and end times of actions in a video using only two-dimensional convolutional neural networks.
The method reads the video frame by frame. Person detection is applied to each frame first, and the regions containing people are marked; each such region is treated as a proposal, and the proposals are then classified into action classes. This thesis takes common sports actions as examples, focusing on swinging, throwing, and jumping. Finally, an anchor concept similar to that of Faster R-CNN is used to compute when each action starts and ends. Running on a GeForce GTX 1080 Ti and an Intel i7-7700 under Windows 10, the result is available as soon as the video has finished playing.
```
CNNs Convolutional Neural Networks
ConvNets Convolutional Neural Networks
I3D Inflated 3D ConvNet
ResNet Deep Residual Convolutional Neural Networks
LSTM Long Short-Term Memory
RNNs Recurrent Neural Networks
HOG Histogram of Oriented Gradients
HOF Histogram of Optical Flow
iDT improved Dense Trajectory
FV Fisher Vector
VLAD Vector of Locally Aggregated Descriptors
FCNs Fully Convolutional Networks
GMMs Gaussian Mixture Models
HOOF Histogram of Oriented Optical Flow
AP Average Precision
mAP mean Average Precision
IoU Intersection-over-Union
RGB Red Green Blue
```
##### P1
Good afternoon, committee members. I am 王韋力, and today I will be presenting "Online Action Detection Based on a Two-Dimensional Convolutional Network and a Temporal Sampling Architecture".
##### P2
Here is the outline of the slides: first an introduction to what this research is about, then some related papers and methods, then the main ideas and methods of this work, followed by the experimental results, and finally the conclusions.
##### P3
Action Recognition: recognizing what action a video contains. The input is usually a pre-trimmed clip that contains only that specific action.
Temporal Action Detection: finding what actions occur in a video. It is similar to action recognition, but the task also requires knowing *when* in the video each action appears.
Take the pitching video on the right: we only care about the pitching action, but most of the time is taken up by other activity; the target action usually occupies only a small part of the video.
##### P4
Action Recognition, also called Action Classification, aims to identify which action a trimmed clip belongs to.
The commonly used datasets are already trimmed:
each clip contains one clear action, is short (a few seconds), and has a single definite label.
So it can be treated as a multi-class classification problem whose input is a video and whose output is an action label.
Since the main goal is to determine the class of a person's behavior in a clip, it is also called Human Action Recognition.
The difficulties of action recognition are:
(1) Intra-class and inter-class variation: the same action can look very different when performed by different people.
(2) Environmental variation: occlusion, multiple viewpoints, lighting, low resolution, dynamic backgrounds.
(3) Temporal variation: people perform actions at very different speeds, and the starting point of an action is hard to pin down, which affects feature extraction the most.
(4) Lack of large, well-annotated datasets.
##### P5
Temporal Action Detection, also called Temporal Action Localization,
aims not only to determine whether an action occurs in a video,
but also when it occurs (including the start and end times).
It must handle long, untrimmed videos that usually contain a lot of irrelevant content; the target actions typically occupy only a small portion of the video.
Commonly used datasets include THUMOS 2014/2015 and ActivityNet.
It amounts to detecting specified behaviors in a video.
##### P6
The relationship between Action Recognition and Temporal Action Detection is like the relationship between
Image Classification and Object Detection.
Image Classification classifies a whole image, like the dog in the upper picture.
Object Detection finds objects in an image and identifies them, like the red box containing a dog in the lower picture.
Action Recognition, correspondingly, classifies a whole clip,
while Temporal Action Detection finds where in a full video the actions are and what they are.
They are the corresponding concepts.
##### P7
Some behaviors in video need to be recognized immediately, but existing architectures usually only produce results after the fact and cannot catch problems in real time.
In-car behavior monitoring, for example, usually aims to detect dangerous driving; if dangerous driving is only identified afterwards, the system is pointless. The same goes for a store robbery that cannot be detected the moment it happens:
the crash may already have occurred, or the store may already have been emptied, long before anyone finds out. These are situations that demand an immediate response.
##### P8
Related work.
Let us review the methods commonly used for Action Recognition.
For anything involving time, the first thing that comes to mind is the LSTM, a sequential network,
then 3D CNNs, whose third dimension is time,
and then two-stream convolutional networks, which add temporal information through optical flow.
Training usually uses UCF101, a dataset with 101 action classes,
where every class consists of trimmed clips, each with a single label.
##### P9
The network structure is easy to understand:
each video frame is fed into a CNN,
the CNN output becomes the input of an LSTM,
and the LSTM output is the final output of the network.
The CNN and LSTM parameters are shared across time.
##### P10
3D CNN: a very intuitive approach that feeds the whole clip through 3D convolutions at once.
The highlight of this paper is the Temporal Transition Layer (TTL),
which consists of convolution kernels of several sizes and temporal depths plus 3D pooling layers,
with the goal of modeling short, medium, and long temporal ranges.
##### P11
(1) Proposes a two-stream CNN composed of a spatial network and a temporal network.
(2) Uses multi-frame dense optical flow as training input to capture motion information.
##### P12
Temporal Action Detection.
The main approaches are roughly as follows:
Sliding Window Methods,
Proposal-Classifier Methods,
and Frame/Snippet-Level Methods.
Training usually uses THUMOS'14, a dataset of untrimmed everyday videos annotated with temporal information.
##### P13
Sliding Window Methods
Here, Temporal Action Detection is performed with sliding windows combined with 3D convolutions.
##### P14
Proposal-Classifier Methods
Cut the video into segments, run Action Recognition on each segment,
and then stitch the results back together.
##### P15
Frame/Snippet-Level Methods
Run a CNN on the video frames and train an LSTM on top.
##### P16
Core idea of the thesis
##### P17
Architecture.
The input can be a video file or a live stream.
The video is decomposed into individual frames.
Person detection is run on each frame first,
then each cropped person region is classified with a 2D CNN, which tells us what action is happening at that moment.
The results are appended to a sequence,
and finally a voting convolution outputs the action class and its start and end times.
With this architecture we can know whether a behavior has occurred while the video is still playing,
or review the video afterwards. (A sketch follows below.)
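A high-level sketch of this pipeline (the person detector, classifier, and voting function are placeholders standing in for the components described above, not a specific API):
```python
def online_detect(frames, person_detector, action_classifier, voter):
    """frames: an iterable of decoded frames (from a file or a live stream)."""
    history = []                                      # per-frame action predictions
    for frame in frames:
        for x1, y1, x2, y2 in person_detector(frame): # person detection
            crop = frame[y1:y2, x1:x2]                # proposal = the person region
            history.append(action_classifier(crop))   # 2D CNN action classification
        for event in voter(history):                  # voting over the stored sequence;
            yield event                               # voter is assumed to return only newly
                                                      # finalized (class, start, end) events
```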
##### P18
Dataset
##### P19
Why I built the dataset myself
##### P20
Person extraction
##### P21~P25
Action definitions
##### P26~28
Data distribution
##### P29
The classification network uses GoogLeNet (Inception v3)
##### P30
Training labels
##### P31
Temporal encoding
##### P32
Introduction to the anchors
##### P33
The algorithm
##### P34
Distribution
##### P35
Test results: environment setup
##### P36~37
Results
##### P38
What IoU is
##### P39~44
Experimental results
##### P45
Demo video
##### P46
Conclusion
##### P47
END
### Network architectures
``` python
C3D(
(features): Sequential(
(0): Conv3d(3, 64, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
(1): ReLU(inplace)
(2): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
(3): Conv3d(64, 128, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
(4): ReLU(inplace)
(5): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=0, dilation=1, ceil_mode=False)
(6): Conv3d(128, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
(7): ReLU(inplace)
(8): Conv3d(256, 256, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
(9): ReLU(inplace)
(10): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=0, dilation=1, ceil_mode=False)
(11): Conv3d(256, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
(12): ReLU(inplace)
(13): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
(14): ReLU(inplace)
(15): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=0, dilation=1, ceil_mode=False)
(16): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
(17): ReLU(inplace)
(18): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
(19): ReLU(inplace)
(20): MaxPool3d(kernel_size=(2, 2, 2), stride=(2, 2, 2), padding=(0, 1, 1), dilation=1, ceil_mode=False)
)
(classifier): Sequential(
(0): Linear(in_features=8192, out_features=4096, bias=True)
(1): ReLU(inplace)
(2): Dropout(p=0.5)
(3): Linear(in_features=4096, out_features=4096, bias=True)
(4): ReLU(inplace)
(5): Dropout(p=0.5)
(6): Linear(in_features=4096, out_features=487, bias=True)
)
)
```
``` python
resnet_tdcnn(
(RCNN_rpn): _RPN(
(RPN_Conv1): Conv3d(256, 512, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1))
(RPN_Conv2): Conv3d(512, 512, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=(1, 1, 1))
(RPN_output_pool): MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), padding=0, dilation=1, ceil_mode=False)
(RPN_cls_score): Conv3d(512, 4, kernel_size=(1, 1, 1), stride=(1, 1, 1))
(RPN_twin_pred): Conv3d(512, 4, kernel_size=(1, 1, 1), stride=(1, 1, 1))
(RPN_proposal): _ProposalLayer()
(RPN_anchor_target): _AnchorTargetLayer()
)
(RCNN_proposal_target): _ProposalTargetLayer()
(RCNN_roi_temporal_pool): _RoITemporalPooling()
)
```
``` python
# the original ResNet-18
ResNet(
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer2): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer3): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(layer4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(avgpool): AvgPool2d(kernel_size=7, stride=7, padding=0)
(fc): Linear(in_features=512, out_features=1000, bias=True)
)
```
:::spoiler Test details / code SOP
```
Test procedure
1. Run script_test.sh and write the output to a text file in output/   # edit test_net.py first
   ./script_test.sh c3d thumos14
2. Run log_analysis under evaluation/ to reduce the text file to pure results
   XXX_log_analysis.py xxx.txt --framerate 25   # xxx.txt is the output text file; 25 is the fps
3. Finally run eval_thumos14.py to compute AP / mAP (adjust the paths inside it first)
```
:::
---
:::warning
The following is not needed for now :zap:
:::
---
Not needed for now
### MMALab
[openMMLab](https://open-mmlab.github.io/)
[MMAction introduction](https://www.chainnews.com/articles/526112877676.htm)
[MMAction github](https://github.com/open-mmlab/mmaction)
https://ronaldzzz.blogspot.com/2017/08/raspberry-pipi-opencv-c.html
https://blog.csdn.net/baobei0112/article/details/79851434
[mmdetection usage and source-code walkthrough](https://nicehuster.github.io/2019/04/08/mmdetection/)
cmake -DOpenCV_DIR=../../opencv-4.1.0/build -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.0 ..
cmake needs to be updated first:
```
sudo apt-get install cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev
```
dense flow test
```
# create the tmp folder first
./extract_gpu -f=test.avi -x=tmp/flow_x -y=tmp/flow_y -i=tmp/image -b=20 -t=1 -d=0 -s=1 -o=dir
./extract_warp_gpu -f=test.avi -x=tmp/flow_x -y=tmp/flow_y -i=tmp/image -b=20 -t=1 -d=0 -s=1 -o=dir
```
/home/wei/mmaction/third_party/dense_flow/build/extract_gpu -f=/home/wei/mmaction/data/ucf101/videos/PlayingFlute/v_PlayingFlute_g12_c04.avi -x=/home/wei/mmaction/data/ucf101/rawframes/PlayingFlute/v_PlayingFlute_g12_c04/flow_x -y=/home/wei/mmaction/data/ucf101/rawframes/PlayingFlute/v_PlayingFlute_g12_c04/flow_y -i=/home/wei/mmaction/data/ucf101/rawframes/PlayingFlute/v_PlayingFlute_g12_c04/img -b=20 -t=1 -d=4 -s=1 -o=dir -w=0 -h=0