
# park

## Abstract

We propose an approach based on a lightweight and general network architecture for efficient image-based detection of parking spaces. In previous work on multi-person pose estimation, part affinity fields (PAFs) have been proposed for matching keypoints; however, these methods are not efficient under the assumptions of this work: (1) that the model is lightweight and requires low computational cost, and (2) that the model uses a general network architecture. Applying them to parking space estimation raises further issues. Therefore, we propose localized PAFs (LPAFs), an extension of PAFs that focuses on local estimation by applying redundancy to the two-dimensional vector fields. In experiments with models that can run on low-cost edge devices, our method shows a significant improvement over the previous approach.

## Introduction

Parking space detection is an important factor in realizing next-generation autonomous driving systems. Parking space information is essential for sophisticated autonomous parking functions, such as route estimation and vehicle control toward a target space. At the same time, because the system must be mounted on a vehicle, it has to run in real time on low-cost edge devices, so a lightweight, low-computational-cost network is desirable.
Additionally, parking space detection in an autonomous driving system is expected to run not as a single task but alongside multiple other tasks, such as semantic segmentation, object detection, and depth estimation. To perform multiple tasks efficiently, learning methods have been proposed in which an encoder is shared between tasks in fully convolutional networks (FCNs) [10,21,11]. Considering all these aspects together, we estimate parking spaces under the assumption that a lightweight, low-computational-cost, general-purpose network architecture is used.

Parking space estimation shares several problems with multi-person pose estimation: the position and size of each space cannot be specified in advance, and the number of instances is unknown. It also has problems of its own. First, the impact of occlusion by adjacent vehicles is extremely high. Following the approach reported by Cao et al. [4,9], we estimate keypoints for all sides of a parking space, but often only the two points in the forward direction of the vehicle are visible. Spaces can easily be overlooked because most of their area is hidden. Second, the white lines that delimit a space can be difficult to recognize because of their size and wear. Even when a parking space cannot be seen clearly, humans can recognize where parking is possible by considering the positions of other parked vehicles. Additionally, reference lines such as driving lanes and road markings must be distinguished from parking lines, which requires a broad understanding of the image as a whole.

Two types of approaches can broadly be taken to estimate parking spaces.

1. Top-down: After detecting individual spaces with an object detector, estimate the detailed keypoint positions, assuming that exactly one parking space exists in each detection result. The top-down approach can estimate with high accuracy when the target suits a rectangular approximation by a bounding box, as humans and animals do. However, as mentioned earlier, most of the area of a parking space is often invisible. Rectangular approximation is therefore difficult, which makes this approach unsuitable for parking space estimation.
2. Bottom-up: Estimate the keypoints of all parking spaces in the image and group them into individual instances. The bottom-up approach can almost always estimate in a fixed amount of time regardless of the number of parking spaces in the image; it is therefore effective when real-time processing is required. However, these methods rely on specialized architectures within the network, and the overall configuration tends to become complex.

This work presents an effective method for estimating multiple parking spaces with a bottom-up approach using a lightweight, low-computational-cost, general-purpose network architecture. The keypoints that constitute a parking space and the relations between those keypoints are estimated simultaneously by the same decoder. We propose localized part affinity fields (LPAFs), which suppress the decrease in keypoint detection accuracy and express the relations in a form that allows efficient joint learning. We construct a parking lot dataset captured with a fisheye lens camera and evaluate our method on it. We verify that high-accuracy parking space estimation can be achieved at minimal computational cost using a lightweight, low-computational-cost, general-purpose network architecture.
## 2 Related Work

Datasets that focus on parking lot scenarios include the PKLot dataset [6], the CNRPark dataset [2], and the Barry Street dataset []. In these datasets, the images are assumed to be captured by a fixed surveillance camera, and the parking space positions are already known.
Therefore, from the viewpoint of the problem setting, the task on these datasets is close to classification and differs from the parking space estimation targeted here. The scenes are also captured from above the vehicles, so a rectangular approximation of each parking space is simple and there is comparatively little variation in shape (Fig. 1). We take the idea for the proposed parking space estimation from multi-person pose estimation, which is a closer task in terms of problem setting. Multi-person pose estimation methods can be broadly classified into top-down (Section 2.1) and bottom-up (Section 2.2) approaches.

### 2.1 Top-down: space detection and single-space keypoint detection

In the top-down approach, all people are detected with an object detector, and a pose is estimated for each individual detection result. With this method, the execution time increases in proportion to the number of people in the image because pose estimation is performed separately for each detection; however, highly accurate estimation is possible because each detection result can be presumed to contain exactly one person. He et al. [] proposed a method for instance segmentation that uses a mask branch for segmentation and a box branch for bounding box detection, and subsequently applied it to pose estimation.
In this method, in contrast to estimating Gaussian confidence maps (heatmaps), the estimation of each human body keypoint is treated as a segmentation problem in which the foreground region has size 1. In the method reported by Huang et al. [8], detection is split between a detector for keypoints that can be judged clearly and estimated with high accuracy and a detector for keypoints that are difficult to recognize; the estimation results are then merged. In addition to estimating confidence maps for keypoint detection, Papandreou et al. [19] used offset maps to derive keypoint positions more precisely and achieved highly accurate estimation. Sun et al. [26] focused on the joint estimation problem after keypoint detection. Several existing methods are based on confidence map estimation and therefore cannot be trained end to end because of non-differentiable threshold processing. The authors introduced a concept called integral regression and proposed a unified learning method that includes joint detection. Chen et al. [5] proposed a two-stage method that uses GlobalNet to search for keypoints at each hierarchy level (resolution) and RefineNet to integrate the estimation results of the GlobalNet levels. With RefineNet, they achieved estimation that is robust against occlusion by integrating the results of several layers. This method is similar to the stacked hourglass proposed by Newell et al. [17]; however, it differs in that it uses the features of all the layers generated by GlobalNet. Sun et al. [25] achieved a more advanced feature representation by running pipelines of different hierarchy levels in parallel and interconnecting them.

These methods assume that a bounding box is detected by an object detector. Unlike objects that suit a rectangular approximation, such as human forms, most of the area of a parking space is often invisible, which makes high-accuracy object detection difficult and the top-down approach unsuited to this task.

### 2.2 Bottom-up: keypoint detection and grouping

In the bottom-up approach, keypoints are detected over the whole image and the detection results are then separated by instance. The execution-time problems of earlier methods [20,9], which stemmed from the difficulty of separating instances efficiently, have largely been overcome by more recent research. Cao et al. [1,3] achieved efficient instance separation by estimating, from the input image, the keypoint positions together with two-dimensional (2D) vector fields, i.e., part affinity fields (PAFs), that express the relations between parts. Kocabas et al. [11] proposed a method that shares an encoder between keypoint detection and object detection, thereby realizing high-speed, high-accuracy estimation. Sekii et al. [24] replaced keypoint detection based on confidence map estimation, which had been mainstream until then, with grid estimation using an object detector and achieved a massive improvement in execution speed. This method is not suited to situations in which several small targets are densely located, because performing a single estimation for each grid cell imposes feature-based and spatial restrictions.

As a separate approach, methods using embeddings [16,18] have also been proposed. One advantage claimed for these methods is that learning can be carried out consistently up to the point of instance parsing. However, in parking space estimation, separating adjacent parking spaces by embedding is difficult because the instances look very similar to one another.

We propose a method for efficiently estimating parking spaces based on the approach of Cao et al. [4,3]. In their approach, learning and estimation are performed in separate pipelines for confidence maps and PAFs. We assume that a lightweight, low-computational-cost, general-purpose network architecture is used for parking space estimation, and several problems emerge if confidence maps and PAFs are estimated simultaneously in the same pipeline.

The first problem is the difference in output form. The ground-truth PAF $L^*_{c,k}$ at image coordinate $p$, generated to express relation $c$ of the $k$-th instance, is defined as follows:

EQ

Here, $x_{j,k}$ is the coordinate of keypoint $j$ of the $k$-th instance, $l_{c,k} = \lVert x_{j_2,k} - x_{j_1,k} \rVert_2$ is the distance between the two keypoints, and $v = (x_{j_2,k} - x_{j_1,k}) / \lVert x_{j_2,k} - x_{j_1,k} \rVert_2$ is the unit vector pointing from one keypoint to the other.
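As a rough illustration of how such a ground-truth field could be generated for one relation of one instance, the sketch below follows the commonly used PAF construction of Cao et al.: pixels inside a narrow band around the segment from $x_{j_1,k}$ to $x_{j_2,k}$ receive the unit vector $v$, and all other pixels receive zero. The function name `paf_ground_truth` and the band half-width `sigma` are assumptions made for this example; neither is given in the text above.

```python
import numpy as np


def paf_ground_truth(x1, x2, height, width, sigma=1.0):
    """Ground-truth PAF for one relation of one instance (illustrative sketch).

    x1, x2 : (x, y) coordinates of the two keypoints.
    sigma  : half-width of the band in pixels (hypothetical parameter).
    Returns an array of shape (height, width, 2).
    """
    x1 = np.asarray(x1, dtype=np.float64)
    x2 = np.asarray(x2, dtype=np.float64)
    field = np.zeros((height, width, 2))

    length = np.linalg.norm(x2 - x1)        # l_{c,k}
    if length == 0.0:
        return field                        # direction undefined, leave zeros

    v = (x2 - x1) / length                  # unit vector v
    v_perp = np.array([-v[1], v[0]])        # perpendicular to v

    ys, xs = np.mgrid[0:height, 0:width]
    d = np.stack([xs - x1[0], ys - x1[1]], axis=-1)   # p - x_{j1,k}

    along = d @ v                           # position along the segment
    across = np.abs(d @ v_perp)             # distance from the segment
    on_segment = (along >= 0) & (along <= length) & (across <= sigma)

    field[on_segment] = v                   # v inside the band, 0 elsewhere
    return field
```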
For the PAF of the image as a whole, the PAFs of all instances are averaged: $L^*_c(p) = \frac{1}{n_c(p)} \sum_k L^*_{c,k}(p)$, where $n_c(p)$ is the number of nonzero vectors at $p$. Confidence maps are local because keypoint detection is a local estimation. PAFs, in contrast, are defined along the whole line segment linking two keypoints; hence, the output form differs greatly.

The second problem is the difference in output scale. While the output scale for confidence maps is $[0, 1]$, the output for PAFs is restricted to $[-1, 1]$. Because the ground truth of a PAF consists of 2D unit vectors, a component of the ground truth is very rarely exactly 1. Additionally, in the experimental setting used here (Section 3), PAF estimation is performed even when the keypoint area is invisible; however, predicting an invisible position is difficult owing to the characteristics of the fisheye lens camera.

These problems are insignificant when the network is large, but they cause a major drop in accuracy for lightweight, low-computational-cost networks. We propose LPAFs as a more efficient expression of the relations between keypoints. Fig. 2 shows the differences between PAFs and LPAFs. LPAFs may suppress the decrease in keypoint detection performance because the output shape and scale are not restricted.
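The averaging over instances and the difference in output scale discussed in this section can be illustrated with a small NumPy sketch. `average_pafs` divides the summed per-instance fields by the number of nonzero vectors at each pixel, and `confidence_map` builds a Gaussian heatmap around one keypoint. Both function names, the array layout `(K, H, W, 2)`, and the exact Gaussian form with its parameter `sigma` are assumptions for illustration and are not taken from the text.

```python
import numpy as np


def average_pafs(per_instance):
    """Average per-instance ground-truth PAFs for one relation type.

    per_instance : array of shape (K, H, W, 2), one field per instance.
    Returns an array of shape (H, W, 2).
    """
    per_instance = np.asarray(per_instance, dtype=np.float64)
    nonzero = np.any(per_instance != 0, axis=-1)   # (K, H, W)
    n = nonzero.sum(axis=0)                        # n_c(p): nonzero vectors per pixel
    total = per_instance.sum(axis=0)               # sum over instances k
    # Pixels with no contributing instance keep a zero vector.
    return total / np.maximum(n, 1)[..., None]


def confidence_map(center, height, width, sigma=1.0):
    """Gaussian confidence map around one keypoint (assumed form).

    center : (x, y) keypoint coordinate; values lie in [0, 1].
    """
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


# Two instances overlapping at pixel (row 0, col 0) with different directions.
fields = np.zeros((2, 1, 3, 2))
fields[0, 0, 0] = [1.0, 0.0]
fields[1, 0, 0] = [0.0, 1.0]
fields[0, 0, 1] = [1.0, 0.0]
averaged = average_pafs(fields)      # components stay within [-1, 1]
heatmap = confidence_map((1, 1), 3, 3)   # values stay within [0, 1]
```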