Seminar 8: Word Embedding

seminar

이번 포스트에서는 단어(word)를 벡터화하는 방법인 워드 임베딩(Word Embedding)에 대해 다룬다. 이전의 문자 인코딩과 달리 워드 임베딩은 NLP 분야에서 범용적으로 사용되기 때문에 이 부분을 잘 이해해야 이후에 다룰 NLP 모델을 이해하는데 쉽게 이해할 수 있다.

Word Embedding이란?
- 희소 표현과 밀집 표현
Word2Vec
- 분산 표현
- Continuous Bag of Words
WordRNN
- nn.Embedding

Word Embedding이란?

문자 인코딩에서는 코퍼스(corpus)에서 vocab set을 구한 후 vocab size $V$의 크기로 one-hot encodding을 수행했다. character-level에서야 알파벳은 26개뿐이니 one-hot vector 크기가 그렇게 부담되지는 않았다. 그러나 단어(word) 수준에서는 가능한 조합이 너무 많기 때문에 vocab size $V$가 무지무지 크다.

# US National Anthem
text = f'''O say can you see, by the dawn’s early light,
What so proudly we hail’d at the twilight’s last gleaming,
Whose broad stripes and bright stars through the perilous fight
O’er the ramparts we watch’d were so gallantly streaming?
And the rocket’s red glare, the bombs bursting in air,
Gave proof through the night that our flag was still there,
O say does that star-spangled banner yet wave
O’er the land of the free and the home of the brave?'''

character vocab size: 많아봐야 26+
word vocab size: 텍스트가 길어지면 무한하게 증가... 

그래서 단어 수준에서는 더이상 one-hot encoding은 사용할 수 없게 된다. 또, one-hot encoding으로는 단어의 유사도(similarity)를 측정할 수 없다는 단점도 있다. (‘개’와 ‘고양이’라는 단어는 동물이라는 관점에서 ‘토마토’보다는 유사한 단어이다. 그러나 one-hot encoding으로는 이 유사도를 전혀 계산할 수 없다.)

이런 one-hot encoding의 희소 표현(sparse representation)에 반대되는 개념으로 밀집 표현(dense representation)이 있다. one-hot과 같은 희소 표현이 vocab size $V$를 기준으로 encoding vector의 크기를 결정했다면, 밀집 표현은 고정된 크기의 (예를 들어 5) 벡터로 단어를 encoding한다.

dog = [0, 0, 0, ..., 1, 0, 0, ... , 0] # 10,000 차원의 one-hot
vs.
dog = [1.1, 3.2, -0.5, 8.0, -3.3] # size 5의 밀집 표현!

이렇게 단어를 밀집 표현으로 encoding 하면 공간상의 이득도 얻고, 이후에 두 단어의 유사도를 구하는 작업도 수행할 수 있다!

위와 같이 단어를 밀집 표현으로 encoding 하는 것을 워드 임베딩(Word Embedding)이라고 한다. 그리고 이때의 encoding vector를 임베딩 벡터(Embedding Vector)라고 한다.

Word2Vec

Word2Vec은 워드 임베딩 방법 중 하나이다. Word2Vec은 단어를 임베딩 하기 위해 분산 표현(distributed representation)이라는 기법을 사용한다.

분산 표현은 “비슷한 문맥에서 등장하는 단어들은 비슷한 의미를 가진다”라는 분포 가설(distributional hypothesis)을 바탕으로 한다.

이하 자세한 내용은 “딥러닝을 이용한 자언어 처리 입문”의 내용을 바탕으로 진행하겠습니다.

👉 딥러닝을 이용한 자언어 처리 입문: 분산 표현과 Continuous Bag of Words (w/ 하이라이팅)

(간단 요약)

Word2Vec은 Dense Represetation이면서 Distributed Represenation이다
- “비슷한 문맥에 있는 단어들은 비슷한 의미를 가진다.”
CBOW은 중심 단어를 예측하는 얕은 신경망을 학습해 Word2Vec 하는 방식이다.
입력층과 출력층이 원핫 인코딩으로 되어 있고, 학습하는 가중치를 Embedding vector로 사용한다.

`nn.Embedding()`과 WordRNN

PyTorch에서는 nn.Embeding()을 사용해 Word Embedding을 수행할 수 있다. 이때, 입력 시퀀스의 각 단어들은 모두 정수 인코딩 되어 있어야 한다! (CBoW에서는 입출력 값이 one-hot encoding 되어 있어야 했다!)

nn.Embedding()은 CBoW와는 전혀 다른 모델이다. nn.Embedding()은 단순히 룩업 테이블(lookup table) 하나만 있는 구조로 정수가 입력되면 해당 정수에 해당하는 룩업 테이블의 row를 반환할 뿐이다. 그래서 초기의 nn.Embedding()의 출력값은 단순히 정수를 벡터로 매핑한 것이 불과하면 실질적인 word embedding이 아니다. 다만, 전체 모델이 학습되는 과정에서 룩업 테이블의 값이 조정되는데, 학습이 끝난 후의 룩업테이블의 값이 본래 원했던 실질적인 word embedding이다.

Q. 그럼담… nn.Embedding()은 단순히 ‘룩업테이블’이라는 이름을 가진 텐서일 뿐인건가?

A. 놀랍게도 그렇다.

PyTorch의 nn.Embeding()을 활용해 word embedding을 수행하고 이를 통해 다음 단어를 예측하는 WordRNN을 구현해보자.

예시 문장은 “Fly me to the moon”이라는 곡의 가사이다.

text = f"""
Fly me to the moon
Let me play among the stars
And let me see what spring is like
On a-Jupiter and Mars
In other words, hold my hand
In other words, baby, kiss me

Fill my heart with song
And let me sing forevermore
You are all I long for
All I worship and adore
In other words, please be true
In other words, I love you
"""

CharRNN 때와 마찬가지로 몇가지 pre-processing을 거쳐 모델에 입력으로 사용할 데이터셋을 만든다. 그러나 이전의 CharRNN 때와는 달리 원-핫 인코딩이 아니라 정수 인코딩을 한다.

# 코드는 생략하고 결과만 제시
0 ['Fly', 'me', 'to', 'the', 'moon'] -> ['me', 'to', 'the', 'moon', 'Let']
10 ['stars', 'And', 'let', 'me', 'see'] -> ['And', 'let', 'me', 'see', 'what']
20 ['a-Jupiter', 'and', 'Mars', 'In', 'other'] -> ['and', 'Mars', 'In', 'other', 'words,']
30 ['other', 'words,', 'baby,', 'kiss', 'me'] -> ['words,', 'baby,', 'kiss', 'me', 'Fill']
40 ['And', 'let', 'me', 'sing', 'forevermore'] -> ['let', 'me', 'sing', 'forevermore', 'You']
50 ['for', 'All', 'I', 'worship', 'and'] -> ['All', 'I', 'worship', 'and', 'adore']
60 ['be', 'true', 'In', 'other', 'words,'] -> ['true', 'In', 'other', 'words,', 'I']
[4, 2, 44, 19, 18]
[2, 44, 19, 18, 12]
X torch.Size([63, 5])
Y torch.Size([63, 5])

PyTorch의 nn.Embedding()은 ‘전체 단어 갯수’와 ‘임베딩 벡터의 크기’를 num_embeddings와 embeeding_dim으로 입력 받는다.

embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)
print(embedding_layer)
---
Embedding(48, 3)

앞에서 만들어둔 X 텐서로 출력을 확인하면…

out = embedding_layer(X)
print('X', X.shape)
print('out', out.shape)
---
X torch.Size([63, 5])
out torch.Size([63, 5, 3])

자! 이제 이 nn.Embedding()을 활용해 WordRNN 모델을 구현해보자.

class WordRNN(nn.Module):
  def __init__(self, vocab_size, embedding_dim=3, hidden_dim=20, num_layers=2):
    super(WordRNN, self).__init__()
    self.embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
    self.rnn = nn.RNN(embedding_dim, embedding_dim, num_layers=num_layers, batch_first=True)
    self.hidden = nn.Linear(embedding_dim, hidden_dim, bias=True)
    self.fc = nn.Linear(hidden_dim, vocab_size, bias=True)
  
  def forward(self, x):
    embedding = self.embedding_layer(x)
    x, _status = self.rnn(embedding)
    x = self.hidden(x)
    x = self.fc(x)
    return x

정수값으로 입력되는 문자를 nn.Embedding()으로 벡터화 한 후, nn.RNN()에 입력으로 넣어 다음 단어(embedding vector 형태로 인코딩)를 예측한다. 이후 hidden과 fc 레이어를 거쳐 전체 단어수 만큼의 크기를 갖는 텐서로 변환한다. 이는 일종의 word embedding을 다시 decoding 하는 과정이라고 볼 수 있다.

그 다음으로는 CharRNN처럼 분류 문제를 풀도록 모델을 학습시키면 된다 🙌

# hyper-parameter
learning_rate = 0.1
dict_size = len(word_set)

model = WordRNN(len(word_set), embedding_dim=7, hidden_dim=20)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), learning_rate)

def textimize_result(results):
  predict_str = text.split()[0] + ' ' # start from first word

  for j, result in enumerate(results):
    if j == 0: # At first time, bring all results
      predict_str += ' '.join([index2word[int(t)] for t in result])
    else: # append only last word
      predict_str += (' ' + index2word[int(result[-1])])
  print(predict_str)

epoch = 51

for i in range(1, epoch):
  optimizer.zero_grad()
  
  outputs = model(X) # (batch, timesteps, dict_size)
  loss = criterion(outputs.reshape(-1, dict_size), Y.view(-1))
  
  loss.backward()
  optimizer.step()

  results = outputs.argmax(dim = 2) 

  if i % 5 == 0:
    print("------ %d ------" % i)
    textimize_result(results)

맺음말

이번 주제였던 word2vec은 NLP 모델의 최신 트렌드를 살펴보는 첫 걸음이였습니다. 새로운 모델을 배우는 것도 중요하지만, 더 중요한 것은 배운 것은 본인 만의 프로젝트로 적용해보는 거라고 생각합니다. NLP 분야가 특히 타겟으로 삼은 태스크나 언어(language)에 따라 사용되는 테크닉이 다양해서 하나의 세미나로는 모든 테크닉을 커버하는게 쉽지 않은 것 같습니다. NLP 분야에 관심이 있다면 Standor의 CS224N: Natural Language Processing with Deep Learning 강좌도 병행해서 본 세미나를 듣는 걸 추천합니다.

다음 세미나에서는 seq2seq와 Attention 메커니즘에 대해 살펴보도록 하겠습니다 🙏

References

딥러닝을 이용한 자언어 처리 입문
- 워드 임베딩
- Word2Vec
PyTorch로 시작하는 딥러닝 입문
PyTorch - nn.Embedding()

Word Embedding이란?

Word2Vec

nn.Embedding()과 WordRNN

맺음말

References

`nn.Embedding()`과 WordRNN