[Paper Review] Temporal Cycle-Consistency Learning

#paper-review #deep-learning #computer-vision #self-supervised-learning #video #temporal-alignment #cycle-consistency

Paper link: Temporal Cycle-Consistency Learning

Paper Information

Item	Details
Venue	CVPR
Year	2019
Authors	Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman
Affiliation	Google Brain, DeepMind

Core Idea

Temporal Cycle-Consistency Learning (TCC) is a self-supervised learning method that learns frame-level representations by exploiting temporal alignment across multiple videos.

The core supervision signal is the following cycle.

S_i \xrightarrow{\operatorname{NN}} T_j \xrightarrow{\operatorname{NN}} S_k

In a good embedding space, the cycle should return to its starting position.

k \approx i

In other words, the TCC loss trains the encoder $\phi_\theta$ to satisfy the following condition.

\boxed{ S_i \rightarrow T_j \rightarrow S_k \quad\Rightarrow\quad k=i }

Notation

Let the two video sequences be as follows.

S=\{s_1,s_2,\dots,s_N\}, \qquad T=\{t_1,t_2,\dots,t_M\}

Each frame is passed through the encoder $\phi_\theta$ and mapped to a $d$-dimensional embedding.

u_i=\phi_\theta(s_i)\in\mathbb{R}^d, \qquad v_j=\phi_\theta(t_j)\in\mathbb{R}^d

The embedding sequences are as follows.

U=\{u_1,\dots,u_N\}, \qquad V=\{v_1,\dots,v_M\}

Or, written as matrices,

U\in\mathbb{R}^{N\times d}, \qquad V\in\mathbb{R}^{M\times d}

What TCC learns is not frame labels but the embedding function $\phi_\theta$. The goal is for frames that belong to the same action phase to lie close together in the embedding space.

Distance Matrix

Compute the distance between every pair of frames across the two sequences.

D^{S\rightarrow T}_{ij} = \|u_i-v_j\|_2^2

The distance matrices therefore have the following shapes.

Matrix	Shape	Meaning
$D^{S\rightarrow T}$	$N\times M$	Distance from each frame of $S$ to each frame of $T$
$D^{T\rightarrow S}$	$M\times N$	Distance from each frame of $T$ to each frame of $S$

The hard nearest neighbor is defined as follows.

\operatorname{NN}_{T}(i) = \arg\min_{j\in\{1,\dots,M\}} D^{S\rightarrow T}_{ij}

The index reached on the way back is then

\operatorname{NN}_{S}(\operatorname{NN}_{T}(i)) = \arg\min_{k\in\{1,\dots,N\}} \|v_{\operatorname{NN}_{T}(i)}-u_k\|_2^2

The hard cycle-consistency condition is

\operatorname{NN}_{S}(\operatorname{NN}_{T}(i))=i

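However, $\arg\min$ is not differentiable: its output is piecewise constant in the embeddings, so it passes no useful gradient. The paper therefore approximates the nearest neighbor softly.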
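The hard cycle above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the helper names `pairwise_sq_dist` and `hard_cycle_indices` and the toy 1-D embeddings are assumptions made for this example, and indices are 0-based rather than 1-based as in the equations.

```python
import numpy as np

def pairwise_sq_dist(X, Y):
    """Squared Euclidean distances: out[i, j] = ||X[i] - Y[j]||^2."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def hard_cycle_indices(U, V):
    """For each i, go to the nearest v_j, then back to the nearest u_k; return k."""
    nn_in_T = np.argmin(pairwise_sq_dist(U, V), axis=1)       # j = NN_T(i)
    return np.argmin(pairwise_sq_dist(V[nn_in_T], U), axis=1)  # k = NN_S(j)

# Toy 1-D embeddings (illustrative values only).
U = np.array([[0.0], [1.0], [2.0], [3.0]])
V = np.array([[0.1], [1.2], [2.9]])

k = hard_cycle_indices(U, V)
is_consistent = k == np.arange(len(U))
print(k, is_consistent)
```

In this toy case frame 2 of `U` maps to `V[1]`, which maps back to frame 1, so that frame is not cycle-consistent. Because `np.argmin` returns integer indices, no gradient can flow through this computation, which is what motivates the soft version.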

Soft Forward Matching

Define a soft matching distribution from frame $u_i$ to the sequence $V$.

\alpha_{ij} = \frac{ \exp(-D^{S\rightarrow T}_{ij}) }{ \sum_{\ell=1}^{M} \exp(-D^{S\rightarrow T}_{i\ell}) }

Making the temperature $\tau$ explicit, this can also be written as follows (the previous expression corresponds to $\tau=1$).

\alpha_{ij} = \operatorname{softmax}_j \left( -\frac{\|u_i-v_j\|_2^2}{\tau} \right)

The smaller $\tau$ is, the closer the distribution gets to the hard nearest neighbor; the larger it is, the broader the distribution becomes.

The soft nearest-neighbor embedding is computed as a weighted average.

\tilde{v}_i = \sum_{j=1}^{M} \alpha_{ij}v_j

In matrix form:

A^{S\rightarrow T} = \operatorname{softmax}_{\text{row}} (-D^{S\rightarrow T}) \in\mathbb{R}^{N\times M}

\tilde{V} = A^{S\rightarrow T}V \in\mathbb{R}^{N\times d}

Here $\tilde{v}_i$ is the soft point in sequence $T$ that $u_i$ is matched to.
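A minimal NumPy sketch of the forward soft matching follows. The function names `softmax_rows` and `soft_nearest_neighbors`, the random embeddings, and the two temperature values are assumptions chosen only to illustrate the effect of $\tau$.

```python
import numpy as np

def softmax_rows(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def soft_nearest_neighbors(U, V, tau=1.0):
    """Return (A, V_tilde) with A[i, j] = alpha_ij and V_tilde = A @ V."""
    D = np.sum((U[:, None, :] - V[None, :, :]) ** 2, axis=-1)  # (N, M)
    A = softmax_rows(-D / tau)                                  # each row sums to 1
    return A, A @ V

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))   # N = 5 frames, d = 3 (toy values)
V = rng.normal(size=(7, 3))   # M = 7 frames

A, V_tilde = soft_nearest_neighbors(U, V, tau=1.0)
A_sharp, _ = soft_nearest_neighbors(U, V, tau=0.01)
print(A.max(axis=1))        # broader distribution
print(A_sharp.max(axis=1))  # close to one-hot on the hard nearest neighbor
```

With the smaller temperature each row of `A_sharp` concentrates on the hard nearest neighbor, while `A` spreads mass over several frames of `V`.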

Soft Cycle-Back Matching

Now cycle back from $\tilde{v}_i$ to the sequence $U$. Call the returning distribution $\beta_{ik}$.

\beta_{ik} = \frac{ \exp(-\|\tilde{v}_i-u_k\|_2^2) }{ \sum_{\ell=1}^{N} \exp(-\|\tilde{v}_i-u_\ell\|_2^2) }

Or, with a temperature,

\beta_{ik} = \operatorname{softmax}_k \left( -\frac{\|\tilde{v}_i-u_k\|_2^2}{\tau} \right)

For each $i$, $\beta_i$ is a probability distribution over the time indices of the original sequence $S$.

\beta_i = (\beta_{i1},\dots,\beta_{iN})

The goal of TCC is to make $\beta_i$ sharply peaked at index $i$.

\beta_i \approx \delta_i

Here $\delta_i$ is the one-hot distribution that is 1 only at position $i$.
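The two soft steps can be chained to obtain the cycle-back distribution. The sketch below assumes a hypothetical helper `cycle_back_distribution` and a toy pair in which `V` is a slightly shifted copy of `U`, so that the expected result $\beta_i \approx \delta_i$ is easy to see.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sq_dist(X, Y):
    return np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)

def cycle_back_distribution(U, V, tau=1.0):
    """Return B with B[i, k] = beta_ik for the cycle U -> V -> U."""
    A = softmax_rows(-sq_dist(U, V) / tau)         # forward soft matching
    V_tilde = A @ V                                 # soft nearest neighbors in V
    return softmax_rows(-sq_dist(V_tilde, U) / tau)

# Toy pair: V shows the "same action" as U with a small offset.
U = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [6.0, 0.0]])
V = U + 0.1

B = cycle_back_distribution(U, V, tau=0.1)
print(np.round(B, 3))
```

Each row of `B` is a distribution over the frames of `U`, and for this well-separated toy pair every row peaks at its own index.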

Cycle-Back Classification Loss

The first loss treats cycle-back as a classification problem. If the cycle starts from $u_i$, the correct class is $i$.

y_{ik} = \mathbb{1}[k=i]

The prediction is the cycle-back distribution $\beta_{ik}$. The classification loss for a single frame $i$ is therefore

\mathcal{L}_{cbc}^{(i)} = - \sum_{k=1}^{N} y_{ik}\log \beta_{ik}

Substituting the one-hot label simplifies this.

\mathcal{L}_{cbc}^{(i)} = -\log \beta_{ii}

Averaging over all frames of sequence $S$,

\mathcal{L}_{cbc}^{S\rightarrow T\rightarrow S} = \frac{1}{N} \sum_{i=1}^{N} \left( -\log \beta_{ii} \right)

This loss pushes $\beta_{ii}$ up; that is, it trains the model so that $S_i\rightarrow T\rightarrow S_i$ holds.
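As a small numerical check of the formula, the sketch below evaluates the classification loss on two hand-made cycle-back matrices. The function name and the `eps` clipping constant are assumptions added for numerical safety; they are not part of the loss definition.

```python
import numpy as np

def cycle_back_classification_loss(B, eps=1e-12):
    """Mean of -log B[i, i], where B is the (N, N) cycle-back distribution."""
    diag = np.clip(np.diag(B), eps, 1.0)  # eps guards against log(0)
    return float(-np.mean(np.log(diag)))

N = 4
B_uniform = np.full((N, N), 1.0 / N)                    # no idea where to return
B_sharp = 0.97 * np.eye(N) + 0.01 * (1.0 - np.eye(N))   # almost always returns to i

loss_uniform = cycle_back_classification_loss(B_uniform)  # equals log(N)
loss_sharp = cycle_back_classification_loss(B_sharp)      # equals -log(0.97)
print(loss_uniform, loss_sharp)
```

A uniform cycle-back distribution gives a loss of $\log N$, the value expected from an uninformative classifier over $N$ classes, while a nearly one-hot distribution gives a loss close to zero.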

Cycle-Back Regression Loss

The classification loss only checks whether $k=i$. As a result, starting from $i=50$, returning to $k=51$ and returning to $k=100$ are both treated as equally wrong.

Cycle-back regression treats $\beta_i$ as a distribution over time indices and computes its mean and variance.

\mu_i = \sum_{k=1}^{N} \beta_{ik}k

\sigma_i^2 = \sum_{k=1}^{N} \beta_{ik}(k-\mu_i)^2

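The original index $i$ and the expected cycle-back index $\mu_i$ should be close, so the loss reduces the variance-normalized squared error

\frac{(i-\mu_i)^2}{\sigma_i^2}

Because this term alone could be made small simply by inflating $\sigma_i$, a variance regularization term is added so that the distribution does not spread out too much.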

\mathcal{L}_{cbr}^{(i)} = \frac{(i-\mu_i)^2}{\sigma_i^2} +\lambda\log\sigma_i

Over the whole sequence, the average is used.

\mathcal{L}_{cbr}^{S\rightarrow T\rightarrow S} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{(i-\mu_i)^2}{\sigma_i^2} +\lambda\log\sigma_i \right]

This loss demands two things at once.

Condition	Mathematical meaning
Accurate cycle-back	$\mu_i \approx i$
Sharp correspondence	$\sigma_i$ becomes small
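The regression loss can be checked numerically in the same way. In the sketch below the function name, the default `lam=0.001`, and the small `eps` added to the variance are illustrative assumptions; indices are 0-based, which leaves $(i-\mu_i)$ unchanged because both terms shift by the same amount.

```python
import numpy as np

def cycle_back_regression_loss(B, lam=0.001, eps=1e-8):
    """Mean over i of (i - mu_i)^2 / sigma_i^2 + lam * log(sigma_i).

    Returns (loss, mu, var). eps is a hypothetical guard against sigma -> 0.
    """
    idx = np.arange(B.shape[0], dtype=float)
    mu = B @ idx                                                   # expected return index
    var = np.sum(B * (idx[None, :] - mu[:, None]) ** 2, axis=1) + eps
    per_frame = (idx - mu) ** 2 / var + lam * 0.5 * np.log(var)    # log(sigma) = 0.5*log(var)
    return float(per_frame.mean()), mu, var

# Hand-made cycle-back matrices (rows sum to 1).
B_good = np.array([[0.8, 0.2, 0.0],
                   [0.1, 0.8, 0.1],
                   [0.0, 0.2, 0.8]])
B_bad = np.array([[0.1, 0.1, 0.8],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1]])

loss_good, mu_good, var_good = cycle_back_regression_loss(B_good)
loss_bad, mu_bad, _ = cycle_back_regression_loss(B_bad)
print(mu_good, var_good, loss_good, loss_bad)
```

Unlike the classification loss, the penalty here grows with how far $\mu_i$ lands from $i$, so a near miss costs less than a distant one.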

Symmetric Objective

The expressions above use only the $S\rightarrow T\rightarrow S$ direction. The opposite direction can be defined in the same way.

T_j \rightarrow \tilde{u}_j \rightarrow \gamma_j(T)

\mathcal{L}^{T\rightarrow S\rightarrow T}

The pairwise TCC loss can therefore be averaged over both directions.

\mathcal{L}_{TCC}(S,T) = \frac{1}{2} \left( \mathcal{L}^{S\rightarrow T\rightarrow S} + \mathcal{L}^{T\rightarrow S\rightarrow T} \right)

Here $\mathcal{L}$ can be either the classification version $\mathcal{L}_{cbc}$ or the regression version $\mathcal{L}_{cbr}$.

When the dataset contains many video sequences, the overall objective averages over video pairs.

\min_\theta \mathbb{E}_{(S,T)\sim\mathcal{D}} \left[ \mathcal{L}_{TCC}(S,T;\theta) \right]

In practice, rather than using every pair and every frame, training approximates the objective by sampling video pairs and frame indices.
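One way to picture the symmetric objective with frame sampling is the sketch below. It uses the classification version in both directions; the sampling scheme in `sample_frames` (a sorted random subset of frames) is a hypothetical choice for illustration and is not claimed to match the paper's implementation.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sq_dist(X, Y):
    return np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)

def cbc_one_direction(U, V, tau=1.0):
    """Cycle-back classification loss for the cycle U -> V -> U."""
    V_tilde = softmax_rows(-sq_dist(U, V) / tau) @ V
    B = softmax_rows(-sq_dist(V_tilde, U) / tau)
    return -np.mean(np.log(np.diag(B) + 1e-12))

def symmetric_tcc_loss(U, V, tau=1.0):
    """Average of the S->T->S and T->S->T losses."""
    return 0.5 * (cbc_one_direction(U, V, tau) + cbc_one_direction(V, U, tau))

def sample_frames(X, num_frames, rng):
    """Pick a sorted random subset of frames (hypothetical sampling scheme)."""
    n = min(num_frames, len(X))
    idx = np.sort(rng.choice(len(X), size=n, replace=False))
    return X[idx]

rng = np.random.default_rng(0)
U = rng.normal(size=(40, 8))   # embeddings of a 40-frame video (toy values)
V = rng.normal(size=(55, 8))   # embeddings of a 55-frame video

U_s, V_s = sample_frames(U, 16, rng), sample_frames(V, 16, rng)
loss = symmetric_tcc_loss(U_s, V_s)
print(loss)
```

Sampling keeps both distance matrices at 16 by 16 regardless of the original video lengths, which is the practical reason for approximating the objective this way.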

Matrix View

Forward soft matching can be viewed as a row-stochastic matrix.

A = \operatorname{softmax}_{\text{row}} \left( -D^{S\rightarrow T} \right) \in\mathbb{R}^{N\times M}

The soft-matched points are

\tilde{V}=AV

The cycle-back distribution comes from the following distance matrix.

C_{ik} = \|\tilde{v}_i-u_k\|_2^2

B = \operatorname{softmax}_{\text{row}}(-C) \in\mathbb{R}^{N\times N}

Here $B_{ik}$ is the soft probability of $S_i\rightarrow T\rightarrow S_k$. Cycle consistency means that $B$ approaches the identity matrix.

B \approx I_N

The classification loss can be viewed as follows.

\mathcal{L}_{cbc}^{S\rightarrow T\rightarrow S} = - \frac{1}{N} \sum_{i=1}^{N} \log B_{ii}

That is, it is an objective that increases the diagonal probabilities.

Difference from DTW

From the temporal alignment perspective, TCC can be compared with DTW (Dynamic Time Warping).

Method	What is optimized
DTW	Finds an alignment path $P$ over fixed features.
TCC	Learns an embedding $\phi_\theta$ under which alignment becomes cycle-consistent.

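DTW typically solves the following problem, where $P$ ranges over monotonic alignment paths from the first frame pair to the last.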

P^\star = \arg\min_{P} \sum_{(i,j)\in P} \|u_i-v_j\|_2^2

TCC does not compute the path $P$ directly. Instead, it learns a representation under which nearest-neighbor cycles hold.

\theta^\star = \arg\min_{\theta} \mathcal{L}_{TCC}(S,T;\theta)
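For contrast, here is a minimal sketch of classic DTW by dynamic programming on fixed embeddings. The function name `dtw` and the toy sequences are assumptions for this example, and the step set (match, advance in one sequence, advance in the other) is the standard textbook one. Nothing here is learned: the embeddings are inputs and only the path is optimized.

```python
import numpy as np

def dtw(U, V):
    """Classic DTW on fixed embeddings: returns (total cost, alignment path)."""
    N, M = len(U), len(V)
    D = np.sum((U[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    acc = np.full((N + 1, M + 1), np.inf)   # accumulated cost with a padded border
    acc[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = D[i - 1, j - 1] + best_prev
    # Backtrack from the last cell to recover the path (0-based frame indices).
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(acc[N, M]), path[::-1]

# V repeats its first phase once, so DTW should match U[0] to both V[0] and V[1].
U = np.array([[0.0], [1.0], [2.0]])
V = np.array([[0.0], [0.0], [1.0], [2.0]])

cost, path = dtw(U, V)
print(cost, path)
```

The recovered path absorbs the repeated first phase of `V` by matching it twice to `U[0]`, which shows how DTW handles differing phase durations given features that are already good.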

Training Procedure

Written as equations, the training procedure is as follows.

Video pair sampling

(S,T)\sim\mathcal{D}

Embedding

U=\phi_\theta(S), \qquad V=\phi_\theta(T)

Forward matching

A=\operatorname{softmax}_{\text{row}}(-D^{S\rightarrow T}), \qquad \tilde{V}=AV

Cycle-back distribution

B=\operatorname{softmax}_{\text{row}}(-C), \qquad C_{ik}=\|\tilde{v}_i-u_k\|_2^2

Loss

\mathcal{L}_{cbc} = - \frac{1}{N} \sum_{i=1}^{N} \log B_{ii}

or

\mathcal{L}_{cbr} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{(i-\mu_i)^2}{\sigma_i^2} +\lambda\log\sigma_i \right]

Parameter update

\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}_{TCC}
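The steps above can be sketched end to end on toy data. Everything in this example is an assumption made for illustration: the encoder is a single linear map `W` standing in for $\phi_\theta$, the gradient is approximated with central finite differences instead of automatic differentiation, and the frames are random vectors with no real temporal structure, so the run only demonstrates the mechanics of the update.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sq_dist(X, Y):
    return np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)

def cbc_loss(W, S, T):
    """Cycle-back classification loss with a linear encoder phi(x) = x @ W."""
    U, V = S @ W, T @ W                                   # embedding
    V_tilde = softmax_rows(-sq_dist(U, V)) @ V            # forward matching
    B = softmax_rows(-sq_dist(V_tilde, U))                # cycle-back distribution
    return -np.mean(np.log(np.diag(B) + 1e-12))           # loss

def numerical_grad(f, W, h=1e-5):
    """Central finite differences; a stand-in for automatic differentiation."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += h
        Wm[idx] -= h
        g[idx] = (f(Wp) - f(Wm)) / (2 * h)
    return g

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 4))          # 6 frames with 4 raw features (toy data)
T = rng.normal(size=(8, 4))          # 8 frames
W = 0.5 * rng.normal(size=(4, 2))    # encoder parameters theta

eta = 0.1                            # learning rate (arbitrary choice)
losses = [cbc_loss(W, S, T)]
for _ in range(20):
    grad = numerical_grad(lambda w: cbc_loss(w, S, T), W)
    W = W - eta * grad               # parameter update
    losses.append(cbc_loss(W, S, T))
print(losses[0], losses[-1])
```

A real implementation would use a deep frame encoder and an autodiff framework, and would sample a new video pair at every step rather than reusing one pair.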

Why Is It Self-Supervised?

What plays the role of the label in TCC is not a frame annotation but an identity target.

S_i \rightarrow T \rightarrow S_i

In other words, the target one-hot vector is not a human-provided phase label; it is generated automatically from the starting index $i$.

y_i = e_i

So the objective resembles supervised cross entropy, but the target comes from the temporal cycle structure.

Experiments and Applications

The paper evaluates the TCC representation on several tasks.

Task	Description
Action phase classification	Classifies the action phase of each frame using few labels.
Progress prediction	Predicts how far the action has progressed through the whole process.
Video alignment	Aligns frames of the same phase across different videos.
Synchronized playback	Plays multiple videos back in sync according to phase.
Metadata transfer	Transfers synchronized metadata such as labels or sound from one video to another.
Anomaly detection	Finds segments that deviate from the normal temporal progression.

The key result is that TCC substantially improves action phase classification when labels are scarce. The paper also reports that it can be used in a complementary way with other self-supervised losses such as Shuffle and Learn and Time-Contrastive Networks.

Intuitive Interpretation of the TCC Loss

In terms of the equations, TCC is a loss that drives the cycle-back matrix $B$ toward the identity matrix.

B_{ik} = p(S_k\mid S_i\rightarrow T)

The goal is

B_{ik} \approx \mathbb{1}[i=k]

The classification version is therefore diagonal likelihood maximization.

\max_\theta \prod_{i=1}^{N} B_{ii}

Taking the negative log gives the earlier loss, up to the constant factor $1/N$.

\min_\theta - \sum_{i=1}^{N} \log B_{ii}

The regression version controls the mean and variance so that the mass of $B_i$ concentrates around $i$.

\mu_i\rightarrow i, \qquad \sigma_i^2\rightarrow 0

The pressure exerted on the embedding space can be summarized as follows.

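Condition	Effect
$\|u_i-v_j\|^2$ decreases for frames in the same phase	Frames in the same phase move closer together.
$B_{ii}$ increases	Cycle-back returns to the original position.
$\mu_i\approx i$	Temporally nearby positions are preferred.
$\sigma_i$ decreases	The correspondence distribution becomes sharper.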

Limitations

TCC also rests on several assumptions.

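A repeatable temporal structure must exist within the same action category. For completely irregular videos, cycle consistency is unlikely to provide meaningful supervision.
If the videos in a pair are too different, wrong correspondences can be learned. Even within the same category, ambiguity grows when the order of actions differs substantially or an intermediate phase is skipped.
Frames cannot always be assumed to correspond one-to-one. A given phase may last long in one video and be very brief in another.
Nearest-neighbor-based alignment depends strongly on embedding quality. If the initial embedding is too poor, the soft correspondences can be noisy.
For long videos, computing distances for all frame pairs can become expensive. Because an $N \times M$ distance matrix is required, sampling or batching strategies matter.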

Summary

The temporal cycle-consistency loss is a loss for learning temporal correspondence from unlabeled videos. Its core is the following cycle.

\boxed{ S_i \rightarrow T_j \rightarrow S_k \quad\Rightarrow\quad k \approx i }

To make this differentiable, soft nearest neighbors are used instead of hard nearest neighbors. A classification or regression loss is then defined so that the returned position is close to the original frame index.

The essence of TCC can be summarized as follows.

A method that uses the repeated temporal structure of videos as supervision to learn an embedding space in which action phases are aligned.

So TCC is best understood not as simply "finding similar image frames" but as a representation learning objective that finds the same stage of action progress across multiple videos.