Link to the HackMD note

Scalable video coding

Scalability refers to the capacity to recover physically meaningful image or video information by decoding only part of the compressed bitstream.

Quality scalability:
- coarser to finer quantization
Spatial scalability:
- different spatial resolutions (Laplacian pyramid, …)
Temporal scalability:
- frames can be skipped, and the missing ones added progressively
Frequency scalability:
- lower frequencies to higher frequencies
Combination of basic schemes
Granularity: coarse-grained vs fine-grained

Object-based scalability: different resolutions for different objects
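Quality scalability can be illustrated with a two-layer quantizer: a base layer quantizes the signal coarsely, and an enhancement layer quantizes the remaining residual with a finer step. This is a minimal sketch assuming uniform scalar quantizers with hypothetical step sizes (16 and 4) and hypothetical function names; real codecs quantize transform coefficients, not raw pixels.

```python
import numpy as np

def quantize(x, step):
    """Uniform quantizer: map each value to the nearest multiple of `step`."""
    return step * np.round(x / step)

def two_layer_quality(x, coarse_step=16.0, fine_step=4.0):
    """Return (base-only reconstruction, base + enhancement reconstruction)."""
    base = quantize(x, coarse_step)              # base layer: coarse quantization
    enhancement = quantize(x - base, fine_step)  # enhancement layer: refine the residual
    return base, base + enhancement

rng = np.random.default_rng(0)
pixels = rng.uniform(0, 255, size=1000)
base_only, refined = two_layer_quality(pixels)
print(np.abs(pixels - base_only).max(), np.abs(pixels - refined).max())
```

A decoder that receives only the base layer gets an error bounded by half the coarse step (8 here); adding the enhancement layer bounds it by half the fine step (2).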

2D motion vs optical flow


Consider a (uniform) sphere rotating with no change in illumination: the video stream shows no difference, even though the object moves.

Now consider a sphere whose light source moves: the visual information changes, even if the sphere does not move.

The observed or apparent 2D motion is called optical flow. As these two examples show, it can differ from the true 2D motion.

Optical flow equation and ambiguity in motion estimation

Consider a video sequence $ψ (x, y, t)$. Suppose that a point at $(x, y)$ at time $t$ has moved to $(x + d_{x}, y + d_{y})$ at time $t + d_{t}$.

Under the constant intensity assumption, the images of the same object point at different times have the same luminance value

ψ (x + d_{x}, y + d_{y}, t + d_{t}) = ψ (x, y, t)

A first-order Taylor expansion gives:

ψ (x + d_{x}, y + d_{y}, t + d_{t}) \approx ψ (x, y, t) + \frac{\partial ψ}{\partial x} d_{x} + \frac{\partial ψ}{\partial y} d_{y} + \frac{\partial ψ}{\partial t} d_{t}

Combining the two equations, we obtain:

\frac{\partial ψ}{\partial x} d_{x} + \frac{\partial ψ}{\partial y} d_{y} + \frac{\partial ψ}{\partial t} d_{t} = 0

Dividing by $d_{t}$ and defining the velocity components

v_{x} = \frac{d_{x}}{d_{t}}

v_{y} = \frac{d_{y}}{d_{t}}

(the temporal term becomes $\frac{d_{t}}{d_{t}} = 1$), we get:

\frac{\partial ψ}{\partial x} v_{x} + \frac{\partial ψ}{\partial y} v_{y} + \frac{\partial ψ}{\partial t} = 0

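which can be written as:

(\nabla ψ)^{T} v + \frac{\partial ψ}{\partial t} = 0

where $v = [v_{x}, v_{y}]^{T}$ and $\nabla ψ = [\frac{\partial ψ}{\partial x}, \frac{\partial ψ}{\partial y}]^{T}$ is the spatial gradient.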


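The flow vector $v$ at any point $x$ can be decomposed into 2 orthogonal components, along the normal direction $e_{n}$ (the direction of the spatial gradient) and the tangential direction $e_{t}$ (along the edge):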

v = v_{n} e_{n} + v_{t} e_{t}

When a straight edge moves in the plane, we can only detect the normal component $v_{n}$ of its motion vector: a displacement along the edge (the tangential component $v_{t}$) does not change the image.

Because $\nabla ψ = ‖ \nabla ψ ‖ e_{n}$ and $e_{n}^{T} e_{t} = 0$, we have $(\nabla ψ)^{T} v = ‖ \nabla ψ ‖ v_{n}$, so the optical flow equation can be rewritten as:

v_{n} ‖ \nabla ψ ‖ + \frac{\partial ψ}{\partial t} = 0

where $‖ \nabla ψ ‖$ is the magnitude of the gradient vector.

Consequently, at each pixel $x$ we can only compute the normal component:

v_{n} = - \frac{\frac{\partial ψ}{\partial t}}{‖ \nabla ψ ‖}
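The formula can be checked numerically. This sketch assumes NumPy, a unit time step between the two frames (so $\frac{\partial ψ}{\partial t}$ is approximated by $ψ_{2} - ψ_{1}$) and central differences for the spatial gradient; `normal_flow` and the test pattern are hypothetical. On a pattern whose intensity varies only along $x$ (a straight edge), the recovered $v_{n}$ equals the $x$ displacement whatever the $y$ displacement is.

```python
import numpy as np

def normal_flow(psi1, psi2, eps=1e-6):
    """Normal flow v_n = -(dpsi/dt) / ||grad psi|| at every pixel.

    Assumes d_t = 1, so dpsi/dt ~ psi2 - psi1; spatial derivatives are
    central differences on psi1. Pixels with a (near-)zero gradient get 0.
    """
    psi1 = psi1.astype(float)
    gy, gx = np.gradient(psi1)          # axis 0 = rows (y), axis 1 = columns (x)
    gt = psi2.astype(float) - psi1
    mag = np.hypot(gx, gy)
    return np.where(mag > eps, -gt / np.maximum(mag, eps), 0.0)

# A straight "edge" pattern: intensity varies along x only.
yy, xx = np.mgrid[0:32, 0:32].astype(float)
dx, dy = 1.0, 2.0                       # true motion (hypothetical values)
psi1 = 3.0 * xx
psi2 = 3.0 * (xx - dx)                  # psi2(x) = psi1(x - d); dy has no visible effect
vn = normal_flow(psi1, psi2)
print(vn[16, 16])                       # only the normal component dx is recovered
```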


This ambiguity in estimating the motion vector is known as the aperture problem.
The motion can be estimated uniquely only if the aperture contains at least 2 different gradient directions (e.g. a corner).
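The need for 2 gradient directions shows up directly if the optical flow equation is stacked over all pixels of a small window (the aperture) and solved by least squares. This is the idea behind the Lucas-Kanade method, which these notes do not cover and which is used here only as an illustration. The sketch assumes NumPy, a constant motion inside the window and a unit time step; `aperture_flow` is a hypothetical name. The $2 \times 2$ matrix $A^{T} A$ is singular when all gradients in the window are parallel.

```python
import numpy as np

def aperture_flow(psi1, psi2, window):
    """Least-squares flow over one aperture (Lucas-Kanade-style sketch).

    Stacks gx*vx + gy*vy + gt = 0 for every pixel of `window` (a pair of
    slices). Returns (v, eigs), eigs being the eigenvalues of A^T A;
    v is None when the window holds a single gradient direction.
    """
    psi1 = psi1.astype(float)
    gy, gx = np.gradient(psi1)
    gt = psi2.astype(float) - psi1
    A = np.stack([gx[window].ravel(), gy[window].ravel()], axis=1)
    b = -gt[window].ravel()
    eigs = np.linalg.eigvalsh(A.T @ A)           # ascending order
    if eigs[0] <= 1e-9 * max(eigs[1], 1e-12):
        return None, eigs                        # aperture problem: rank-deficient
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v, eigs

def bowl(x, y):
    """Paraboloid: its gradient points in every direction around the origin."""
    return x ** 2 + y ** 2

yy, xx = np.mgrid[-10:11, -10:11].astype(float)
window = (slice(5, 16), slice(5, 16))            # 11 x 11 aperture centred on (0, 0)
d = np.array([0.1, 0.2])                         # true (dx, dy), hypothetical

v_edge, _ = aperture_flow(3.0 * xx, 3.0 * (xx - d[0]), window)     # one direction
v_bowl, _ = aperture_flow(bowl(xx, yy), bowl(xx - d[0], yy - d[1]), window)
print(v_edge, v_bowl)
```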

General methodologies

We consider motion estimation (ME) between 2 given frames, $ψ (x, y, t_{1})$ and $ψ (x, y, t_{2})$.


When $t_{1} < t_{2}$, the problem is referred to as forward motion estimation.

Notation

How should the motion vectors be encoded?
They vary across the image, so they must be represented parametrically.

Mapping function (new position):

w (x, a) = x + d (x, a)

where the parameter vector $a$ encodes the motion and $d (x, a)$ is the displacement of pixel $x$, so that $w (x, a)$ gives its new position:

a = [a_{1}, a_{2}, \dots, a_{n}]^{T}

Motion representation


Different motion representations:

Image (b): pixel-based

One vector for each pixel of the image.

Image (c): block-based (we will implement it in the lab sessions)

The image is partitioned into blocks, with one motion vector per block.

How do we parametrize the vector field?

Translations
Polynomial motions
Rotations
…
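As an example of a parametric model $w (x, a) = x + d (x, a)$, this sketch assumes an affine displacement (a first-order polynomial motion) with 6 parameters in a hypothetical order: $d_{x} = a_{1} + a_{2} x + a_{3} y$ and $d_{y} = a_{4} + a_{5} x + a_{6} y$. Translations and rotations are special cases; the function names are hypothetical.

```python
import numpy as np

def affine_mapping(x, a):
    """w(x, a) = x + d(x, a) with an affine displacement.

    x: (N, 2) array of (x, y) coordinates; a: 6 parameters [a1..a6].
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    dx = a[0] + a[1] * x[:, 0] + a[2] * x[:, 1]
    dy = a[3] + a[4] * x[:, 0] + a[5] * x[:, 1]
    return x + np.stack([dx, dy], axis=1)

def rotation_params(theta, center=(0.0, 0.0)):
    """Affine parameters of a rotation by theta about `center`
    (counter-clockwise in a y-up frame): d = (R - I) x + (I - R) c."""
    c, s = np.cos(theta), np.sin(theta)
    cx, cy = center
    return [cx - (c * cx - s * cy), c - 1, -s, cy - (s * cx + c * cy), s, c - 1]

pts = np.array([[1.0, 0.0], [0.0, 1.0]])
moved = affine_mapping(pts, [2, 0, 0, -1, 0, 0])          # translation by (2, -1)
turned = affine_mapping(pts, rotation_params(np.pi / 2))   # quarter turn about (0, 0)
print(moved, turned)
```

Only the parameter vector $a$ (6 numbers here) has to be estimated and transmitted, instead of one vector per pixel.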

If we treat the image as a set of pixels and estimate the motion pixel-wise,

this gives about 2 million unknowns to find (e.g. one vector per pixel of a 1920 × 1080 frame).

We therefore add regularity (smoothness) constraints.

In general, the image is partitioned into regions.

Do we estimate the motion first, or the regions?

Block-based approach


The image is decomposed into blocks (e.g. a $100 \times 100$ image into $33 \times 33$ blocks).

Some blocks will overlap because the motion is not uniform.

And that's fine!

Some block corners have also moved.

Gradient descent is needed.

The simplest version one can imagine uses a pure translation per block.
Blocks are a good compromise between accuracy and complexity.


It can induce warping effects

Motion estimation criteria

Displaced Frame Difference (DFD):

E_{D F D} (a) = \sum_{x \in Λ} | ψ_{2} (w (x, a)) - ψ_{1} (x) |^{p}

where $Λ$ is the domain of all pixels in $ψ_{1}$, and $p$ is a positive number.

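When $p = 1$, the above error is called the mean absolute difference (MAD), and when $p = 2$, the mean squared error (MSE) (strictly, both are means only after dividing by the number of pixels in $Λ$).
The error image $e (x, a) = ψ_{2} (w (x, a)) - ψ_{1} (x)$ is usually called the displaced frame difference (DFD) image.
When $a$ is optimal (with $p = 2$), the gradient of the error vanishes: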

\frac{\partial E_{D F D}}{\partial a} = 2 \sum_{x \in Λ} (ψ_{2} (w (x, a)) - ψ_{1} (x)) \frac{d (w (x, a))}{d a} \nabla ψ_{2} (w (x, a)) = 0

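Consider a simpler case: for small displacements, we can use the optical flow equation, approximating the temporal derivative by the frame difference:

\frac{\partial ψ}{\partial t} d_{t} \approx ψ_{2} (x) - ψ_{1} (x)

Minimizing $E_{D F D}$ is then approximately equivalent to minimizing: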

E_{f l o w} = \sum_{x \in Λ} | \nabla ψ_{1} (x)^{T} d (x, a) + ψ_{2} (x) - ψ_{1} (x) |^{p}

When $p = 2$, the solution satisfies:

\frac{\partial E_{f l o w}}{\partial a} = 2 \sum_{x \in Λ} (\nabla ψ_{1} (x)^{T} d (x, a) + ψ_{2} (x) - ψ_{1} (x)) \frac{\partial d (x, a)}{\partial a} \nabla ψ_{1} (x) = 0

We can add a penalty term to enforce the smoothness of the motion field (i.e. neighbouring vectors must vary smoothly), where $N_{x}$ is the set of neighbours of pixel $x$:

E_{s} = \sum_{x \in Λ} \sum_{y \in N_{x}} ‖ d (x, a) - d (y, a) ‖^{2}

We want to minimize:

E_{t o t a l} = E_{D F D} + w_{s} E_{s}

where $w_{s}$ is the weighting coefficient.

We have to regularize, but not too much (to avoid over-blurring the motion field).
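These criteria can be evaluated directly for a candidate motion field. This sketch assumes NumPy, a dense field of integer displacements stored as 2 arrays `dx`, `dy`, a 4-neighbourhood for $N_{x}$, and that pixels whose displaced position leaves $ψ_{2}$ are excluded from $Λ$; all function names are hypothetical.

```python
import numpy as np

def dfd_energy(psi1, psi2, dx, dy, p=2):
    """E_DFD = sum |psi2(x + d(x)) - psi1(x)|^p for an integer field (dx, dy).

    Pixels whose displaced position falls outside psi2 are left out of Lambda.
    """
    H, W = psi1.shape
    yy, xx = np.mgrid[0:H, 0:W]
    x2, y2 = xx + dx, yy + dy
    valid = (x2 >= 0) & (x2 < W) & (y2 >= 0) & (y2 < H)
    diff = psi2[y2[valid], x2[valid]].astype(float) - psi1[valid].astype(float)
    return np.sum(np.abs(diff) ** p)

def smoothness_energy(dx, dy):
    """E_s with a 4-neighbourhood N_x.

    np.diff visits each horizontal/vertical neighbour pair once; the double
    sum over x and y in N_x visits every pair twice, hence the factor 2.
    """
    e = 0.0
    for f in (dx.astype(float), dy.astype(float)):
        e += np.sum(np.diff(f, axis=0) ** 2) + np.sum(np.diff(f, axis=1) ** 2)
    return 2.0 * e

def total_energy(psi1, psi2, dx, dy, w_s=1.0, p=2):
    """E_total = E_DFD + w_s * E_s."""
    return dfd_energy(psi1, psi2, dx, dy, p) + w_s * smoothness_energy(dx, dy)

rng = np.random.default_rng(1)
psi1 = rng.integers(0, 256, size=(24, 24))
psi2 = np.roll(psi1, shift=(1, 2), axis=(0, 1))   # psi2(x + (2, 1)) = psi1(x)
true_dx, true_dy = np.full((24, 24), 2), np.full((24, 24), 1)
zero = np.zeros((24, 24), dtype=int)
print(total_energy(psi1, psi2, true_dx, true_dy, w_s=0.5))   # exact match, smooth field
print(total_energy(psi1, psi2, zero, zero, w_s=0.5))         # no motion compensation
```

Setting `p=1` gives the absolute-difference version; raising `w_s` penalizes any variation of the field more strongly.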

Minimization methods

We will mainly look at the exhaustive search method.

The gradient method
The Newton-Raphson method

With gradient descent and the high dimensionality of the problem, we often end up in local minima instead of the global minimum.

One important search strategy is to use a multi-resolution representation of the motion field and conduct the search in a hierarchical manner
The basic idea is to first search the motion parameters in a coarse resolution, propagate this solution into a finer resolution, and then refine the solution in the finer resolution
It also mitigates the slowness of exhaustive search methods, since only a small search range is needed at each resolution.
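A minimal coarse-to-fine sketch, assuming a single global integer translation, a pyramid built by $2 \times 2$ averaging (image sizes divisible by $2^{levels - 1}$), a mean squared error over the overlap of the 2 frames, and a small search radius at every level; all names and parameter values are hypothetical. The estimate found at one level is doubled and refined at the next finer level, so only $(2r + 1)^{2}$ candidates are tested per level.

```python
import numpy as np

def mse_shift(a, b, dx, dy):
    """Mean of (b(x + d) - a(x))^2 over the pixels where both frames overlap."""
    H, W = a.shape
    ys = slice(max(0, -dy), min(H, H - dy))
    xs = slice(max(0, -dx), min(W, W - dx))
    b_part = b[ys.start + dy:ys.stop + dy, xs.start + dx:xs.stop + dx]
    return np.mean((b_part - a[ys, xs]) ** 2)

def local_search(a, b, center, radius):
    """Exhaustive search in a (2*radius + 1)^2 window around `center`."""
    best, best_d = np.inf, center
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            if abs(dx) >= a.shape[1] or abs(dy) >= a.shape[0]:
                continue  # no overlap left
            err = mse_shift(a, b, dx, dy)
            if err < best:
                best, best_d = err, (dx, dy)
    return best_d

def downsample(f):
    """Halve the resolution by averaging 2x2 blocks."""
    return (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2]) / 4.0

def hierarchical_translation(psi1, psi2, levels=3, radius=2):
    pyr1, pyr2 = [psi1.astype(float)], [psi2.astype(float)]
    for _ in range(levels - 1):
        pyr1.append(downsample(pyr1[-1]))
        pyr2.append(downsample(pyr2[-1]))
    d = (0, 0)
    for lvl in range(levels - 1, -1, -1):          # coarsest level first
        d = local_search(pyr1[lvl], pyr2[lvl], d, radius)
        if lvl > 0:
            d = (2 * d[0], 2 * d[1])               # propagate to the finer level
    return d

yy, xx = np.mgrid[0:64, 0:64].astype(float)
psi1 = (np.exp(-((xx - 30) ** 2 + (yy - 34) ** 2) / 128.0)
        + 0.5 * np.exp(-((xx - 45) ** 2 + (yy - 20) ** 2) / 50.0))
psi2 = np.roll(psi1, shift=(-4, 6), axis=(0, 1))  # content moves by (dx, dy) = (6, -4)
print(hierarchical_translation(psi1, psi2))
```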

Regularization

E = \sum_{x \in Λ} \left[ (\frac{\partial ψ}{\partial x} v_{x} + \frac{\partial ψ}{\partial y} v_{y} + \frac{\partial ψ}{\partial t})^{2} + w_{s} (‖ \nabla v_{x} ‖^{2} + ‖ \nabla v_{y} ‖^{2}) \right]
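This energy is the one used by the Horn-Schunck method, and one common way to minimize it is the Horn-Schunck fixed-point iteration sketched below, where `alpha2` plays the role of $w_{s}$ (up to a constant factor that depends on how $‖ \nabla v ‖^{2}$ is discretized). Assumptions: NumPy, unit time step, spatial gradients averaged over the 2 frames, 4-neighbour averages with replicated borders, and hypothetical parameter values; this is an illustration, not a reference implementation.

```python
import numpy as np

def neighbour_mean(f):
    """Average of the 4 neighbours of each pixel (borders replicated)."""
    p = np.pad(f, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0

def horn_schunck(psi1, psi2, alpha2=0.01, n_iter=1000):
    psi1, psi2 = psi1.astype(float), psi2.astype(float)
    gy1, gx1 = np.gradient(psi1)
    gy2, gx2 = np.gradient(psi2)
    gx, gy = (gx1 + gx2) / 2.0, (gy1 + gy2) / 2.0   # spatial gradient between the frames
    gt = psi2 - psi1                                 # temporal derivative with d_t = 1
    vx, vy = np.zeros_like(psi1), np.zeros_like(psi1)
    denom = alpha2 + gx ** 2 + gy ** 2
    for _ in range(n_iter):
        mx, my = neighbour_mean(vx), neighbour_mean(vy)
        # Residual of the optical flow equation at the smoothed estimate
        r = (gx * mx + gy * my + gt) / denom
        vx, vy = mx - gx * r, my - gy * r
    return vx, vy

def pattern(x, y):
    return np.sin(0.5 * x) * np.sin(0.4 * y)

yy, xx = np.mgrid[0:48, 0:48].astype(float)
d = (0.3, 0.2)                                       # true (dx, dy), hypothetical
vx, vy = horn_schunck(pattern(xx, yy), pattern(xx - d[0], yy - d[1]))
print(vx[8:-8, 8:-8].mean(), vy[8:-8, 8:-8].mean())
```

A larger `alpha2` gives a smoother field; at pixels where the gradient vanishes, the estimate comes entirely from the neighbours.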

Block matching algorithm (BMA)

Blocks can be polygonal
- In practice, squares are used
We assume a translational motion

The Exhaustive Search Block Matching Algorithm (EBMA)

Under the block-wise translation model:

w (x, a) = x + d_{m}, \quad x \in B_{m}

Then the error can be written:

E (d_{m}, \forall m \in M) = \sum_{m \in M} \sum_{x \in B_{m}} | ψ_{2} (x + d_{m}) - ψ_{1} (x) |^{p}

Since each term of the outer sum depends only on its own $d_{m}$, the MV of each block can be estimated individually:

E_{m} (d_{m}) = \sum_{x \in B_{m}} | ψ_{2} (x + d_{m}) - ψ_{1} (x) |^{p}
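A direct, deliberately unoptimized sketch of EBMA, assuming NumPy, square blocks of size `block`, integer displacements within $\pm$`radius`, and the error $E_{m} (d_{m})$ above; candidates whose displaced block falls outside $ψ_{2}$ are skipped. Function and parameter names are hypothetical.

```python
import numpy as np

def ebma(psi1, psi2, block=8, radius=4, p=1):
    """Exhaustive-search block matching (integer-pel).

    For each block B_m of psi1, tries every d = (dx, dy) with |dx|, |dy| <= radius
    and keeps the one minimizing sum |psi2(x + d) - psi1(x)|^p.
    Returns mv of shape (H // block, W // block, 2) holding (dx, dy).
    """
    psi1, psi2 = psi1.astype(float), psi2.astype(float)
    H, W = psi1.shape
    mv = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(H // block):
        for bx in range(W // block):
            y0, x0 = by * block, bx * block
            ref = psi1[y0:y0 + block, x0:x0 + block]
            best = np.inf
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if y1 < 0 or x1 < 0 or y1 + block > H or x1 + block > W:
                        continue  # displaced block falls outside psi2
                    err = np.sum(np.abs(psi2[y1:y1 + block, x1:x1 + block] - ref) ** p)
                    if err < best:
                        best = err
                        mv[by, bx] = (dx, dy)
    return mv

rng = np.random.default_rng(0)
psi1 = rng.integers(0, 256, size=(32, 32))
psi2 = np.roll(psi1, shift=(-1, 2), axis=(0, 1))  # content moves by (dx, dy) = (2, -1)
mv = ebma(psi1, psi2, block=8, radius=3)
print(mv[1:3, 1:3].reshape(-1, 2))                # interior blocks
```

The cost grows with $(2 \cdot radius + 1)^{2}$ per block, which is what hierarchical search tries to reduce.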

Deformable block matching algorithm

d_{m} (x) = \sum_{k = 1}^{K} Φ_{m, k} (x) d_{m, k}, \quad x \in B_{m}

The displacement at pixel $x$ of block $m$ is a weighted sum of the displacements at the block's 4 corners (here $K = 4$).
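A sketch of $Φ_{m, k}$, assuming bilinear interpolation functions (a common choice, but an assumption here) and corner MVs attached to the 4 corner pixels of a square block; `bilinear_block_displacement` is a hypothetical name. The weights at each pixel are non-negative and sum to 1, so a block whose 4 corners share the same MV reduces to the pure translation of EBMA.

```python
import numpy as np

def bilinear_block_displacement(corner_mvs, block=8):
    """Displacement d_m(x) inside one block as a weighted sum of 4 corner MVs.

    corner_mvs: (4, 2) array of (dx, dy) at the top-left, top-right,
    bottom-left and bottom-right corner pixels. Returns (block, block, 2).
    """
    c = np.asarray(corner_mvs, dtype=float)
    t = np.linspace(0.0, 1.0, block)          # normalized position inside the block
    v, u = np.meshgrid(t, t, indexing="ij")   # v: vertical (rows), u: horizontal (cols)
    # Bilinear weights Phi_{m,k}(x), one per corner, summing to 1 at every pixel
    phi = np.stack([(1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v], axis=-1)
    return phi @ c                            # (block, block, 4) @ (4, 2)

corners = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
d = bilinear_block_displacement(corners, block=5)
print(d[..., 0])   # horizontal component grows from the left to the right corners
```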

Node-based motion representation

Nodal MVs vs polynomial coefficients
- Nodal
  - Stability

Motion estimation using node-based model

a = [d_{k}; k \in K]

E (a) = \sum_{x \in B} (ψ_{2} (w (x, a)) - ψ_{1} (x))^{2}

where:

w (x, a) = x + \sum_{k \in K} ϕ_{k} (x) d_{k}

Mesh-based motion estimation

Block case: blocks are estimated independently and deformed.
Mesh: a mesh is laid over the image, and all its nodes are moved jointly.
- Everything is correlated

Constraint to keep in mind: we do not want 2 adjacent squares (mesh elements) to fold over each other (invert).
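One simple way to check this constraint, assuming quadrilateral elements whose nodes are listed in an order that gives a positive orientation before motion, is to require every corner triangle of each displaced element to keep a positive signed area. This is a conservative sketch with hypothetical names: it also rejects elements that become non-convex without actually folding.

```python
import numpy as np

def orient(p0, p1, p2):
    """Twice the signed area of triangle (p0, p1, p2)."""
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - (p1[1] - p0[1]) * (p2[0] - p0[0])

def element_is_valid(nodes):
    """True if the quadrilateral keeps a positive orientation at every corner."""
    p = [np.asarray(q, dtype=float) for q in nodes]
    return all(orient(p[i - 1], p[i], p[(i + 1) % 4]) > 0 for i in range(4))

def displace(nodes, d):
    """Move each node by its own displacement vector."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(nodes, d)]

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
small = displace(square, [(0, 0), (0, 0), (0.1, -0.1), (0, 0)])
folded = displace(square, [(0, 0), (0, 0), (-1.5, -1.5), (0, 0)])  # node crosses the diagonal
print(element_is_valid(square), element_is_valid(small), element_is_valid(folded))
```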

Discontinuities often occur at edges
The more nodes, the more accurate the estimation
- But the computational cost explodes

Global motion estimation

Several methods exist

Are we in the case where only the camera moves?

In football and tennis, a large part of the scene is static

Region-based motion estimation

Do we segment into regions first, or estimate the motion first?

3 approaches are possible

Multi-resolution motion estimation

Various ME approaches can be reduced to solving an error minimization problem
Major difficulties
- Many local minima in the gradient-descent case
- Not easy to reach the global minimum
- High computational cost

Laplacian pyramid: the image is decomposed into frequency bands
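A simplified Laplacian pyramid, assuming a $2 \times 2$ box filter for downsampling and pixel replication for upsampling (the classical Burt-Adelson pyramid uses smoother, Gaussian-like filters instead) and image sizes divisible by $2^{levels - 1}$; function names are hypothetical. Each level keeps the detail lost by the next coarser level, and the pyramid reconstructs the image exactly; transmitting only the coarse levels yields a lower-resolution image, which is the basis of spatial scalability.

```python
import numpy as np

def reduce2(f):
    """Halve the resolution by averaging 2x2 blocks."""
    return (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2]) / 4.0

def expand2(f):
    """Double the resolution by pixel replication."""
    return np.kron(f, np.ones((2, 2)))

def laplacian_pyramid(img, levels=3):
    g = img.astype(float)
    bands = []
    for _ in range(levels - 1):
        coarse = reduce2(g)
        bands.append(g - expand2(coarse))   # detail (higher frequencies) at this scale
        g = coarse
    bands.append(g)                         # coarsest low-pass image
    return bands

def reconstruct(bands):
    g = bands[-1]
    for detail in reversed(bands[:-1]):
        g = detail + expand2(g)
    return g

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32))
bands = laplacian_pyramid(img, levels=3)
print([b.shape for b in bands])
```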