# CNN Note

## Overview

### Jiwon Jeong's article series

- Deep Dive into the Computer Vision World: Part 1 - https://towardsdatascience.com/deep-dive-into-the-computer-vision-world-f35cd7349e16
- Deep Dive into the Computer Vision World: Part 2 - https://towardsdatascience.com/deep-dive-into-the-computer-vision-world-part-2-7a24efdb1a14
- Deep Dive into the Computer Vision World: Part 3 - https://towardsdatascience.com/deep-dive-into-the-computer-vision-world-part-3-abd7fd2c64ef

### An Overview of Early Vision in InceptionV1

- https://drafts.distill.pub/circuits-early-overview/?fbclid=IwAR0sF2wJ-nj2PrBjlDd4IauKLdZ0z4UIf_cUJBy6Ttp-yAraiA5N9cfU1-A#mixed3a_discussion_BW

## Concepts

### Output Size of Convolution

- https://towardsdatascience.com/understanding-convolution-neural-networks-the-eli5-way-785330cd1fb7

$$ W_O = \frac{W_I - F + 2P}{S} + 1 $$

$$ H_O = \frac{H_I - F + 2P}{S} + 1 $$

$$ \#\text{output channels} = \#\text{filters} $$

where
- $W_I$ means input width
- $W_O$ means output width
- $H_I$ means input height
- $H_O$ means output height
- $F$ means filter (kernel) size
- $P$ means padding size
- $S$ means the stride, i.e. the interval at which the filter is applied

### Type of Convolution Explained

- https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215
- https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
- https://github.com/vdumoulin/conv_arithmetic
- https://towardsdatascience.com/types-of-convolution-kernels-simplified-f040cb307c37

#### (Standard) Convolutions

#### Dilated Convolutions

#### Transposed Convolutions

#### Separable Convolutions

- Refers to breaking the convolution kernel down into lower-dimensional kernels.
- Two kinds: *spatially separable* and *depthwise separable*.

#### Spatially Separable Convolutions

![image alt](https://miro.medium.com/max/896/1*JFJtXP5ukd9T2JnZCcIrcQ.png)
<center>A standard 2D convolution kernel</center>

![image alt](https://miro.medium.com/max/1484/1*30plpGXgSUmlI6IIzc5UIA.png)
<center>Spatially separable 2D convolution</center>

- Not commonly used in deep learning.

#### Depthwise Separable Convolutions

![image alt](https://miro.medium.com/max/2494/1*QZYNiE9NETKrh-AiIIKP1g.png)
<center>A standard 2D convolution with 3 input channels and 128 filters</center>

![image alt](https://miro.medium.com/max/4055/1*Ln60_L0pg1gBB5TosrReWg.png)
<center>Depthwise separable 2D convolution which first processes each channel separately and then applies inter-channel convolutions</center>

- Widely used and provides really good performance.
- **Efficiency**: vastly decreases the number of parameters required (see the sketch below).
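
To make the parameter savings concrete, here is a minimal PyTorch sketch (the framework choice and the channel sizes are illustrative assumptions, not taken from the articles above): a 3×3 depthwise convolution (one filter per input channel, `groups=in_ch`) followed by a 1×1 pointwise convolution, compared against a standard 3×3 convolution producing the same output shape.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 3, 128  # channel sizes chosen only for illustration

# Standard 2D convolution: every filter sees all input channels.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Depthwise separable convolution:
#   1) depthwise 3x3 conv processes each input channel independently (groups=in_ch),
#   2) pointwise 1x1 conv mixes information across channels.
depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, in_ch, 32, 32)
assert standard(x).shape == depthwise_separable(x).shape
# 3*3*3*128 = 3456 weights vs. 3*3*3 + 3*128 = 411 weights
print(num_params(standard), num_params(depthwise_separable))
```
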
#### Deformable Convolutions

![image alt](https://miro.medium.com/max/763/1*cIllLT7udYoBGikjq2cZow.png)

- Standard convolutions are very rigid in terms of the shape of feature extraction.
- ***Idea: the shape of the convolution itself is learnable.***
- Implementation is straightforward:
    - Every kernel is represented with two different matrices.
    - One branch learns to predict the 'offsets' from the kernel's origin.
    - The convolution branch then simply takes the values at these offsets as its input.

![image alt](https://miro.medium.com/max/1384/1*viNZMBna5JAJQWE2K1z6hg.png)

#### 1x1 convolution related

- https://zhuanlan.zhihu.com/p/40050371
- https://www.zhihu.com/question/56024942
- https://iamaaditya.github.io/2016/03/one-by-one-convolution/

##### outline

- Dimensionality reduction.
- 1x1 convolution with higher strides leads to even more reduction in data by decreasing resolution, while losing very little non-spatially correlated information.
- Replace fully connected layers with 1x1 convolutions, as Yann LeCun believes they are the same.
    - https://iamaaditya.github.io/2016/03/one-by-one-convolution/

#### Atrous Convolution

- https://towardsdatascience.com/review-deeplabv3-atrous-convolution-semantic-segmentation-6d818bfd1d74

![Atrous Convolution with Different Rates r](https://miro.medium.com/max/706/1*-r7CL0AkeO72MIDpjRxfog.png)
<center>Atrous Convolution with Different Rates r</center>

![Atrous Convolution](https://miro.medium.com/max/380/1*Gm3S_I_8A4QWIgqmNSDjQQ.png)
<center>Atrous Convolution</center>

- For each location $i$ on the output $y$ and a filter $w$, atrous convolution is applied over the input feature map $x$, where the atrous rate $r$ corresponds to the stride with which we sample the input signal.
- This is equivalent to convolving the input $x$ with upsampled filters produced by inserting $r-1$ zeros between two consecutive filter values along each spatial dimension ("trous" means holes in French).
- When $r = 1$, it is *standard convolution*.
- **By adjusting $r$, we can adaptively modify the filter's field-of-view.**
- It is also called **dilated convolution** ([DilatedNet](https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5)) or the **Hole Algorithm**.

![image alt](https://miro.medium.com/proxy/1*O5B0IRWewitfivGklGDJjA.png)
<center>Standard Convolution (Top) Atrous Convolution (Bottom)</center>

- **Top**: Standard convolution.
- **Bottom**: Atrous convolution.
    - We can see that with $r=2$, the input signal is sampled every other input.
    - First, pad = 2 means we pad 2 zeros at both the left and right sides.
    - Then, with rate = 2, we sample the input signal every 2 inputs for convolution.
- Atrous convolution **allows us to enlarge the field-of-view of filters to incorporate larger context**.
- It therefore offers an efficient mechanism to **control the field-of-view** and **find the best trade-off between accurate localization (small field-of-view) and context assimilation (large field-of-view)**.
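
As a quick illustration of how the rate $r$ enlarges the field-of-view, the sketch below uses PyTorch's `dilation` argument (PyTorch and the sizes are my own choices for illustration); with a 3×3 kernel the effective kernel size grows to $F + (F-1)(r-1)$ while the number of weights stays at 9.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

for r in (1, 2, 4):  # atrous / dilation rates, chosen for illustration
    # Effective kernel size grows to F + (F - 1) * (r - 1); for F = 3 this is 3, 5, 9.
    conv = nn.Conv2d(1, 1, kernel_size=3, dilation=r, padding=r, bias=False)
    effective = 3 + (3 - 1) * (r - 1)
    print(r, effective, conv(x).shape)  # padding=r keeps the 32x32 spatial size
```
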
#### involution

- https://arxiv.org/abs/2103.06255
- https://github.com/d-li14/involution

### NIN

- https://towardsdatascience.com/review-nin-network-in-network-image-classification-69e271e499ee

### Loss Functions

- Understanding Loss Functions in Computer Vision!
    - https://medium.com/ml-cheat-sheet/winning-at-loss-functions-2-important-loss-functions-in-computer-vision-b2b9d293e15a

## Techniques

### im2col

- im2col (w/ fancy indexing)
    - https://fdsmlhn.github.io/2017/11/02/Understanding%20im2col%20implementation%20in%20Python(numpy%20fancy%20indexing)/
    - https://github.com/huyouare/CS231n/blob/master/assignment2/cs231n/im2col.py
- https://hackmd.io/@bouteille/B1Cmns09I
- https://sahnimanas.github.io/post/anatomy-of-a-high-performance-convolution/

### Fast Convolution

- http://people.ece.umn.edu/users/parhi/SLIDES/chap8.pdf

## Arch

### Inception

Deep Learning: Understanding The Inception Module
- https://towardsdatascience.com/deep-learning-understand-the-inception-module-56146866e652

![naive inception module](https://miro.medium.com/max/1000/1*27SNg1gTB0HOVzTCAhp_Eg.png)
<center>naive inception module</center>

![A single component of naive inception module](https://miro.medium.com/max/1000/1*R_-PxyvOlRuyb0r7uDJ-gw.png)
<center>A single component of the naive inception module. Costs 120M operations</center>

![Inception Module Component](https://miro.medium.com/max/1000/1*4RXOMwX5AP2WZTE3liZuQA.png)
<center>Inception Module Component. Costs only 12M operations</center>

![Complete Inception module](https://miro.medium.com/max/1000/1*U04APgUo2xnLmyU2N_Bgig.png)

Benefits of the Inception Module

* High performance gain on convolutional neural networks.
* Efficient utilisation of computing resources, with minimal increase in computation load for the high-performance output of an Inception network.
* Ability to extract features from input data at varying scales through the use of varying convolutional filter sizes.
* 1x1 conv filters learn cross-channel patterns, which contributes to the overall feature extraction capabilities of the network.
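
A minimal sketch of such a module in PyTorch, assuming illustrative channel sizes rather than the exact GoogLeNet configuration: four parallel branches (1x1, 1x1→3x3, 1x1→5x5, and 3x3 max-pool→1x1) whose outputs are concatenated along the channel dimension, with the 1x1 convolutions acting as the cheap channel reductions discussed above.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Sketch of an Inception module with 1x1 reductions (channel sizes are
    illustrative, not the GoogLeNet values)."""

    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(              # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, 32, kernel_size=1),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(              # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(          # 3x3 max pool, then 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate all branches along the channel dimension.
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)],
            dim=1,
        )

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)  # torch.Size([1, 192, 28, 28]), channels 64+64+32+32 = 192
```
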
### Non-Local Blocks

#### Motivations

- https://sh-tsang.medium.com/review-non-local-neural-networks-video-classification-object-detection-segmentation-pose-ac42fe57d0e4
- In videos, **long-range interactions** occur between distant pixels in space as well as time.
- A single non-local block, which is the basic unit, can directly capture these spacetime dependencies in a feedforward fashion.

#### Overview

- https://sh-tsang.medium.com/review-non-local-neural-networks-video-classification-object-detection-segmentation-pose-ac42fe57d0e4
- With a few non-local blocks, the resulting architectures, called non-local neural networks, are more accurate for video classification than 2D and 3D convolutional networks.
- In addition, non-local neural networks are more computationally economical than their 3D convolutional counterparts.
- Self-attention is a special case of non-local operations in the embedded Gaussian version.
- The paper gives this insight by relating the recent self-attention model to the classic computer vision method of non-local means.

## Application

### Classification (or Basic Structure)

![image alt](https://miro.medium.com/max/3840/0*U3D3gF6ikmb4_La-.png)
<center>Adapted from <a href="https://medium.com/yodayoda/segmentation-for-creating-maps-92b8d926cf7e">here</a></center>

#### ResNet

- https://towardsdatascience.com/hitchhikers-guide-to-residual-networks-resnet-in-keras-385ec01ec8ff

#### EffNet

- https://towardsdatascience.com/3-small-but-powerful-convolutional-networks-27ef86faa42d
- Arxiv link: https://arxiv.org/abs/1801.06434
- Uses **spatially separable convolutions**.

![image alt](https://miro.medium.com/max/848/1*kgQt2D0U_Uuw69tE6o3-ig.png)

- The separable depthwise convolution is the rectangle colored in blue in the EffNet block.
- Absence of a "normal convolution" at the beginning:

![image alt](https://miro.medium.com/max/1108/1*wJHTHDO8-r-RMUJgB0gsSA.png)

### Object Detection

#### YOLO

- https://allen108108.github.io/blog/2019/11/24/[%E8%AB%96%E6%96%87]%20You%20Only%20Look%20Once%20_%20Unified,%20Real-Time%20Object%20Detection/
- https://towardsdatascience.com/yolo-made-simple-interpreting-the-you-only-look-once-paper-55f72886ab73
- https://towardsdatascience.com/object-detection-with-voice-feedback-yolo-v3-gtts-6ec732dca91

##### source code

- https://medium.com/mlearning-ai/object-detection-explained-yolo-v1-fb4bcd3d87a1

#### YOLOv4

- https://towardsdatascience.com/yolo-v4-optimal-speed-accuracy-for-object-detection-79896ed47b50

#### Faster R-CNN

- https://docs.google.com/presentation/d/1SC33WlIT264Ry8cJmqeFyqNZDGMJeQCUAA21a9JOkGM/edit?fbclid=IwAR01L9Nyz1NDhVVQeL89UqSMrxKDIIMwi33YSru3gnB0IHVMBbZJ9XMVQyA#slide=id.p1

#### SSD — Single Shot Detector

- https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11
- It only needs to take **one single shot** to detect multiple objects within the image.

##### Overview

Three main points:

1. It can detect at *multiple scales*.
    - With the layers coming after the base network, it yields several feature maps with different resolutions, which enables the network to work at multiple scales.
2. These predictions are made in a convolutional way.
    - A set of convolutional filters make the predictions for the location of the objects and the class scores.
    - This is where the name "Single Shot" comes from. Instead of having additional classifiers and regressors, the detections are made in a single shot!
3. There is a fixed number of default boxes associated with these feature maps.
    - Similar to the anchors of Faster R-CNN, the default boxes are applied at each feature map cell.

##### MultiBox Detector

![SSD: Multiple Bounding Boxes for Localization (loc) and Confidence (conf)](https://miro.medium.com/max/1469/1*St98vVQEqLndeV_-SeUc9Q.png)
<center>SSD: Multiple Bounding Boxes for Localization (loc) and Confidence (conf)</center>

- After going through a number of convolutions for feature extraction, we get **a feature layer of size $m\times n$ (number of locations) with $p$ channels**, such as the $8\times 8$ and $4\times 4$ maps above.
- A $3\times 3$ conv is applied on this $m\times n\times p$ feature layer.
- **For each location, we get $k$ bounding boxes.**
    - These $k$ bounding boxes have different sizes and aspect ratios.
    - The idea is that a vertical rectangle may be a better fit for a human, while a horizontal rectangle may be a better fit for a car.
- **For each of the bounding boxes, we compute $c$ class scores and 4 offsets relative to the original default bounding box shape.**
- Thus, we get **$(c+4)kmn$ outputs** (see the sketch below).
- This is why the paper is called "SSD: Single Shot **Multibox** Detector".
- But the above is just one part of SSD.
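
A minimal sketch of one such prediction head, assuming PyTorch and using for concreteness the Conv4_3 numbers discussed in the next subsection ($38\times 38\times 512$ features, $k=4$ default boxes, $c=21$ classes):

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 38x38 feature map with p = 512 channels,
# k = 4 default boxes per location, c = 21 classes.
m, n, p, k, c = 38, 38, 512, 4, 21

feature_map = torch.randn(1, p, m, n)

# One 3x3 convolution predicts, for every location, k * (c + 4) values:
# c class scores and 4 box offsets for each of the k default boxes.
head = nn.Conv2d(p, k * (c + 4), kernel_size=3, padding=1)

out = head(feature_map)
print(out.shape)    # torch.Size([1, 100, 38, 38])
print(out.numel())  # (c + 4) * k * m * n = 144400 outputs
```
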
##### SSD Network Architecture

![SSD (Top) vs YOLO (Bottom)](https://miro.medium.com/max/1568/1*t2oxAKMUMpdKEqI9b1JSLw.png)
<center>SSD (Top) vs YOLO (Bottom)</center>

- In order to have more accurate detection, different layers of feature maps also go through a small $3\times 3$ convolution for object detection, as shown above.
- For example, **at Conv4_3 the feature map is of size $38\times 38\times 512$ (a $3\times 3$ conv is applied)**.
    - There are **4 bounding boxes** per location and **each bounding box has (classes + 4) outputs**.
    - Thus, at Conv4_3, the output is $38\times 38\times 4\times (c+4)$.
    - Supposing there are 20 object classes plus one background class, the output will be $38\times 38\times 4\times (21+4)=144{,}400$ values.
    - **In terms of the number of bounding boxes, there are $38\times 38\times 4 = 5776$ bounding boxes.**
- Similarly for the other conv layers:
    - Conv7: 19×19×6 = 2166 boxes (6 boxes for each location)
    - Conv8_2: 10×10×6 = 600 boxes (6 boxes for each location)
    - Conv9_2: 5×5×6 = 150 boxes (6 boxes for each location)
    - Conv10_2: 3×3×4 = 36 boxes (4 boxes for each location)
    - Conv11_2: 1×1×4 = 4 boxes (4 boxes for each location)
- Summing them up, we get 5776 + 2166 + 600 + 150 + 36 + 4 = **8732 boxes in total**.
- YOLO only gets 7×7×2 = 98 boxes.
    - There are 7×7 locations at the end, with 2 bounding boxes for each location.
- **SSD has 8732 bounding boxes, which is far more than YOLO.**
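
The per-layer counts above can be re-derived with a few lines of Python:

```python
# Recomputing SSD's 8732 default boxes from the per-layer counts above.
layers = {
    "Conv4_3":  (38, 38, 4),
    "Conv7":    (19, 19, 6),
    "Conv8_2":  (10, 10, 6),
    "Conv9_2":  (5, 5, 6),
    "Conv10_2": (3, 3, 4),
    "Conv11_2": (1, 1, 4),
}

total = sum(h * w * k for h, w, k in layers.values())
print(total)  # 8732, versus YOLO's 7 * 7 * 2 = 98 boxes
```
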
##### Loss Function

![loss function](https://miro.medium.com/max/563/1*vfpARz-ozzpb0jHM8yJ0zA.png)

- The loss function consists of two terms, $L_{conf}$ and $L_{loc}$, where $N$ is the number of matched default boxes.

![Localization Loss](https://miro.medium.com/max/670/1*yOU4TCmYwLgS06kvcfSWxA.png)
<center>Localization Loss</center>

- $L_{loc}$ is the localization loss, which is the smooth L1 loss between the predicted box ($l$) and the ground-truth box ($g$) parameters.
    - These parameters include the offsets for the center point $(c_x, c_y)$, width ($w$) and height ($h$) of the bounding box.
    - This loss is similar to the one in Faster R-CNN.

![Confidence Loss](https://miro.medium.com/max/916/1*flNOlxfJj0Pyt1lNZmQaLA.png)
<center>Confidence Loss</center>

- $L_{conf}$ is the confidence loss, which is the softmax loss over the multiple class confidences $c$ ($\alpha$ is set to 1 by cross validation).
- $x_{ij}^p = \{1,0\}$ is an indicator for matching the $i$-th default box to the $j$-th ground truth box of category $p$.

##### Scales of Default Boxes

![Scale of Default Boxes](https://miro.medium.com/max/543/1*bX2dGa3mgbO5Fdwv8biQUg.png)
<center>Scale of Default Boxes</center>

- Suppose we have $m$ feature maps for prediction; we can calculate $s_k$ for the $k$-th feature map.
- $S_{min}$ is 0.2, $S_{max}$ is 0.9.
    - That means the scale at the lowest layer is 0.2 and the scale at the highest layer is 0.9.
- For each scale $s_k$ we have 5 non-square aspect ratios:

![5 Non-Square Bounding Boxes](https://miro.medium.com/max/634/1*vI4qQt5MRNTtOA1ODyKbEA.png)
<center>5 Non-Square Bounding Boxes</center>

- For an aspect ratio of 1:1 we get $s_{k}^{'}$:

![1 Square Bounding Box](https://miro.medium.com/max/180/1*C6MiW3aHuZwMjht2svfAnQ.png)
<center>1 Square Bounding Box</center>

- Hence, we can have at most 6 bounding boxes in total with different aspect ratios.
- For layers with only 4 bounding boxes, $a_r = 1/3$ and $3$ are omitted.

##### Some Details of Training

1. Hard Negative Mining
    - Instead of using all the negative examples, we sort them by the highest confidence loss for each default box and pick the top ones, so that the ratio between negatives and positives is at most 3:1.
    - This leads to faster optimization and more stable training.
2. Data Augmentation
    - Each training image is randomly sampled by one of:
        - the entire original input image;
        - sampling a patch so that the overlap with objects is 0.1, 0.3, 0.5, 0.7, or 0.9;
        - randomly sampling a patch.
    - The **size of each sampled patch is [0.1, 1]** of the original image size, with **aspect ratio from 1/2 to 2**.
    - After the above steps, each sampled patch is **resized to a fixed size** and possibly **horizontally flipped with probability 0.5**, in addition to some **photo-metric distortions**.
3. Atrous Convolution (Hole Algorithm / Dilated Convolution)
    - The base network is VGG16, pre-trained on the ILSVRC classification dataset.
    - FC6 and FC7 are **converted into convolution layers as Conv6 and Conv7**.
    - Moreover, **FC6 and FC7 use Atrous Convolution**.
    - And **pool5 is changed from 2×2-s2 to 3×3-s1.**

![Atrous convolution / Hole algorithm / Dilated convolution](https://miro.medium.com/max/494/0*7LCdr8W3gSQdnlSC.gif)
<center>Atrous convolution / Hole algorithm / Dilated convolution</center>

- Since the feature maps are large at Conv6 and Conv7, using Atrous Convolution as shown above can **increase the receptive field while keeping the number of parameters relatively small** compared to conventional convolution.

##### Results

- See here: https://towardsdatascience.com/review-ssd-single-shot-detector-object-detection-851a94607d11

#### FPN (Feature Pyramid Network)

##### Overview

![image alt](https://miro.medium.com/max/915/1*dlJltaGZlyh57xT1Oa2SSw.png)

What's the point of having this kind of architecture? The information for the precise location of an object is found in the lower (shallower) levels of the network; on the contrary, the semantics get stronger in the deeper levels. When a network just goes deeper and deeper, the high-resolution feature maps we would need to classify objects at a pixel level remain semantically weak. Therefore, by implementing a feature pyramid network, we can produce multi-scale feature maps from all levels, and the features from all levels are semantically strong.

![image alt](https://miro.medium.com/max/1325/1*4v71S9xe4p-mV8kEM-SIZg.png)
<center>The architecture of FPN</center>

So the architecture of FPN is shown above. There are three directions: bottom-up, top-down and lateral:

* The bottom-up pathway is the feedforward computation of the backbone ConvNet (ResNet is used in the original paper).
    * There are 5 stages, and the feature map size at each stage has a scaling step of 2.
    * They are differentiated by their dimensions, and we call them **C1, C2, C3, C4 and C5**.
    * The higher the layer goes, the smaller the feature map gets along the bottom-up pathway.
* At level **C5**, we move to the top-down pathway.
    * Upsampling is applied to the output map at each level, and the corresponding map with the same spatial size from the bottom-up pathway is merged in via a lateral connection.
    * A 1x1 convolution is applied before the addition to match the depth size.
    * Then, we apply a 3x3 convolution on the merged map to reduce the aliasing effect of upsampling (see the sketch after this section).
* By doing so, we combine the low-resolution but semantically strong features (from the top-down pathway) with the high-resolution but semantically weak features (from the bottom-up pathway).
    * This implies we can have rich semantics at all scales.

As shown in the picture, the same process is iterated and produces **P2, P3, P4, and P5**. The last part of the network is the actual classification. FPN isn't an object detector in itself; a detector isn't *built in*, so we need to plug one in at this stage. RPN and Fast R-CNN are used in the original paper, and only the FPN-with-RPN case is depicted in the picture above. In the RPN, there are two sibling layers for an object classifier and a bounding box regressor. So in FPN, this part is attached to the tail of each level, **P2, P3, P4 and P5**.
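
A minimal PyTorch sketch of one top-down merge step, assuming nearest-neighbour upsampling and illustrative ResNet-like channel widths (the exact backbone widths are my assumption, not from the note): upsample the coarser map, add the 1x1-projected lateral feature, then smooth with a 3x3 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fpn_merge(top_down, bottom_up, lateral_conv, smooth_conv):
    """One top-down step of FPN: upsample the coarser map, add the 1x1-projected
    lateral feature, then smooth with a 3x3 convolution."""
    upsampled = F.interpolate(top_down, scale_factor=2, mode="nearest")
    merged = upsampled + lateral_conv(bottom_up)  # 1x1 conv matches the depth (256 here)
    return smooth_conv(merged)                    # 3x3 conv reduces upsampling aliasing

# Illustrative channel sizes: C4 has 1024 channels, C5 has 2048, pyramid depth is 256.
lateral = nn.Conv2d(1024, 256, kernel_size=1)
smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)
reduce_c5 = nn.Conv2d(2048, 256, kernel_size=1)

c5 = torch.randn(1, 2048, 7, 7)    # coarsest bottom-up map
c4 = torch.randn(1, 1024, 14, 14)  # one level below, twice the resolution

p5 = reduce_c5(c5)
p4 = fpn_merge(p5, c4, lateral, smooth)
print(p4.shape)  # torch.Size([1, 256, 14, 14])
```
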
#### RetinaNet

##### Overview

> RetinaNet is **more about proposing a new loss function to handle class imbalance** than about publishing a novel new network.

Let's bring class imbalance back again. One-stage detectors such as SSD and YOLO are clearly faster, but they still fall behind in accuracy compared to two-stage detectors, and the class imbalance issue is one of the reasons for this drawback.

![image alt](https://miro.medium.com/max/1361/1*shphqtA1fvNIpB3rNXN69A.png)
<center>Focal Loss</center>

So the authors proposed a new loss function called **focal loss**, which down-weights easy examples. The mathematical expression is shown above: it is the cross-entropy loss multiplied by a modulating factor, and the modulating factor reduces the impact of easy examples on the loss. For example, consider the loss when Pₜ = 0.9 and γ = 2. If we call the cross-entropy loss CE, the focal loss becomes 0.01·CE, i.e. 100 times lower. And if Pₜ gets bigger, say 0.968, with the same γ, the focal loss becomes roughly 0.001·CE (as (1 - 0.968)² = (0.032)² ≈ 0.001). So it suppresses the easy examples even harder, correcting the imbalance in return. And as γ increases, as you can see in the graph on the right, the loss for a fixed Pₜ gets smaller because the modulating factor shrinks.
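
A minimal sketch of this modulating idea in PyTorch, operating directly on the probability of the true class (the full RetinaNet loss additionally uses an α-balancing weight, which is omitted here):

```python
import torch

def focal_loss(p_t, gamma=2.0):
    """Focal loss for a batch of predicted probabilities of the true class.

    Cross entropy is -log(p_t); the modulating factor (1 - p_t)**gamma shrinks
    the contribution of easy examples (p_t close to 1).
    """
    ce = -torch.log(p_t.clamp(min=1e-7))
    return ((1.0 - p_t) ** gamma * ce).mean()

p_t = torch.tensor([0.9, 0.968, 0.3])  # two easy examples and one hard one
print(focal_loss(p_t, gamma=2.0))
# The 0.9 example is weighted by (1 - 0.9)**2 = 0.01 and the 0.968 example by
# roughly 0.001, so easy examples contribute far less than under plain CE.
```
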
![image alt](https://miro.medium.com/max/1728/0*MN3eMmUZniOsLSux)
<center>The architecture of RetinaNet</center>

The architecture of RetinaNet is shown above. As we already know ResNet, the RPN of Faster R-CNN and FPN, there is nothing new here. The network can be divided into two parts: a backbone, (a) and (b), and two subnets for classification and box regression. The backbone is composed of ResNet and FPN, and the pyramid has levels **P3 to P7**. As with RPN, there are prior anchor boxes as well. The size of the anchors changes according to the level: at level **P3** the anchor *area* is 32\*32, and at **P7** it's 512\*512. The anchors have three different ratios and three different sizes, so the number of anchors **A** for each level is 9. Therefore, the output dimension of the box regression subnet is **(N x N) x 4A**, and when the number of classes in the dataset is **K**, the output of the classification subnet becomes **(N x N) x KA**.

The result was quite eye-catching: RetinaNet outperformed all the previous networks in both accuracy and speed. RetinaNet-101-500 indicates the network with ResNet-101 and a 500-pixel image scale. Using larger scales gives higher accuracy than all two-stage approaches, while still being fast enough.

![image alt](https://miro.medium.com/max/840/1*8WVVQrl29Rpnsa0U2i0giA.png)
<center>Speed vs Accuracy on COCO dataset</center>

##### Training and Inference

- https://github.com/ramarlina/retinanet-player-ball-detection
- https://towardsdatascience.com/detecting-soccer-palyers-and-ball-retinantet-2ab5f997ab2

#### MobileNet

- https://towardsdatascience.com/3-small-but-powerful-convolutional-networks-27ef86faa42d
- Uses **depthwise separable convolutions**, introduced earlier in [Xception](https://arxiv.org/abs/1610.02357).

#### MobileNetV2

- https://towardsdatascience.com/mobilenetv2-inverted-residuals-and-linear-bottlenecks-8a4362f4ffd5

##### Inverted Residuals

- The original residual block (ResNet) follows a wide -> narrow -> wide approach.

![image alt](https://miro.medium.com/max/740/1*5Jdh_PDTXp0uhF8c79TEsQ.png)
<center>A residual block connects wide layers with a skip connection while layers in between are narrow</center>

- MobileNetV2, on the other hand, follows a narrow -> wide -> narrow approach.

![image alt](https://miro.medium.com/max/765/1*BaxdP8RS5x_EVMNJSd1Urg.png)
<center>An inverted residual block connects narrow layers with a skip connection while layers in between are wide</center>

- The authors describe this idea as an **inverted residual block** because the skip connections exist between the narrow parts of the network, which is the opposite of how an original residual connection works.
- The inverted block has **far fewer** parameters.
- With the inverted residual block we do the opposite and squeeze the layers that the skip connections link.
- This hurts the performance of the network, which motivates the linear bottleneck below.

##### Linear Bottlenecks

- The authors introduce the idea of a linear bottleneck, where the last convolution of a residual block has a linear output before it is added to the initial activations.
- Putting this into code is super simple: we **simply discard the last activation function** of the convolution block.

##### ReLU6

- omitted as straightforward

#### MobileNetV3

##### Review

- https://towardsdatascience.com/everything-you-need-to-know-about-mobilenetv3-and-its-comparison-with-previous-versions-a5d5e5a6eeaa

#### ShuffleNet

- https://towardsdatascience.com/3-small-but-powerful-convolutional-networks-27ef86faa42d
- Introduces three variants of the Shuffle unit.
- Composed of **group convolutions** and **channel shuffles**.

![image alt](https://miro.medium.com/max/1312/1*SHyel9_UZuy3C5799vL86g.png)

##### Group Convolution

- Simply several convolutions, each taking a portion of the input channels.
- In the following picture you can see a group convolution with 3 groups, each taking one of the 3 input channels.

![image alt](https://miro.medium.com/max/542/1*Du6kevcjT2Gj1w55xYFymQ.png)

- First introduced by AlexNet to split a network across two GPUs.

##### Channel Shuffle

- Mixes the output channels of the group convolutions so that information flows between groups.
- For the shuffling trick, see [here](https://github.com/arthurdouillard/keras-shufflenet/blob/master/shufflenet.py#L37-L48) and the sketch below.
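
The shuffle itself is usually implemented as a reshape, transpose, and flatten, as in the Keras code linked above; a minimal PyTorch version (my own illustrative port, not the linked implementation):

```python
import torch

def channel_shuffle(x, groups):
    """Shuffle channels between groups via reshape -> transpose -> flatten,
    so the next group convolution sees channels from every group."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and per-group axes
    return x.view(n, c, h, w)                 # flatten back

x = torch.arange(6).float().view(1, 6, 1, 1)  # channels 0..5, i.e. 3 groups of 2
print(channel_shuffle(x, groups=3).view(-1))  # tensor([0., 2., 4., 1., 3., 5.])
```
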
#### Metrics

- https://github.com/rafaelpadilla/Object-Detection-Metrics

#### Interview Questions

- https://medium.com/@akulahemanth/interview-questions-object-detection-9430d7dee763

### Semantic Segmentation

#### Hypercolumn

- https://towardsdatascience.com/review-hypercolumn-instance-segmentation-367180495979

#### Image segmentation in 2020

- https://towardsdatascience.com/image-segmentation-in-2020-756b77fa88fc

### Biomedical Image Segmentation

#### U-Net

- https://towardsdatascience.com/retinal-vasculature-segmentation-with-a-u-net-architecture-d927674cf57b
- https://medium.com/srm-mic/a-gentle-introduction-to-u-net-fc4afd12a893

#### U-Net++

- https://towardsdatascience.com/biomedical-image-segmentation-unet-991d075a3a4b

#### Attention U-Net

- https://towardsdatascience.com/biomedical-image-segmentation-attention-u-net-29b6f0827405

### Instance Segmentation (panoptic segmentation)

#### Mask-RCNN

##### Training on Custom Dataset

- https://towardsdatascience.com/train-custom-dataset-mask-rcnn-6407846598db
- https://github.com/miki998/Custom_Train_MaskRCNN

##### Examples

- https://towardsdatascience.com/webcam-object-detection-with-mask-r-cnn-on-google-colab-b3b012053ed1

## MISC

### An overview of lightweight convolutional neural networks: SqueezeNet, MobileNet, ShuffleNet, Xception

- https://www.jiqizhixin.com/articles/2018-01-08-6?fbclid=IwAR1udQ2Bizk074fTaoxRDnavpIDXXjF_k5XYZH9SvUjbfblLw8jLovS6Rfo

### An overview of deep-learning based object-detection algorithms

- https://medium.com/@fractaldle/brief-overview-on-object-detection-algorithms-ec516929be93

### CNN vs fully-connected network for image processing

- A CNN with $k_x = 1$ and $K(1, 1)=1$ can match the performance of a **fully-connected network**.
- The representation power of the filtered-activated image is lowest for $k_x = n_x$ and $K(a, b)=1$ for all $a, b$.
- Therefore, by tuning the hyperparameter $k_x$ we can control the amount of information retained in the filtered-activated image.
- Also, by tuning $K$ to have values different from 1, we can focus on different sections of the image.
- By doing both (tuning the hyperparameter $k_x$ and learning the parameter $K$), a CNN is guaranteed to have ***better bias-variance*** characteristics, with a lower bound on performance equal to that of a **fully-connected network**.
- This can be improved further by having multiple channels.
- Extending the above discussion, it can be argued that a CNN will outperform a fully-connected network if they have the same number of hidden layers with the same or similar structure.
- Nevertheless, this comparison is like comparing apples to oranges.

### ResNets, DenseNets & UNets

- https://medium.com/swlh/resnets-densenets-unets-6bbdbcfdf010
