# 5/27 PM
## Current Situation
### paper reading
[Neural Architecture Search With Reinforcement Learning](https://arxiv.org/abs/1611.01578)
* Not yet read.
### Programming
[English-to-Chinese Translation Implementation](https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html)
* Training data: the news commentary dataset from the WMT-19 translation task.
The data is split into 20% for training and 1% for validation; to speed up training, the remaining 79% is discarded.
```python=
# Take 20% of the data for training and the next 1% for validation;
# the remaining 79% is discarded to speed up training.
examples = builder.as_dataset(split=["train[:20%]", "train[20%:21%]"], as_supervised=True)
train_examples, val_examples = examples
print(train_examples)
print(val_examples)
print("-" * 10)
for en, zh in train_examples.take(2):
    print(en)
    print(zh)
    print("-" * 10)

sample_examples = []
num_samples = 5
for en_t, zh_t in train_examples.take(num_samples):
    en = en_t.numpy().decode("utf-8")
    zh = zh_t.numpy().decode("utf-8")
    print(en)
    print(zh)
    print("-" * 10)
    sample_examples.append((en, zh))
```
Printing the data shows that each training example is a pair of English and Chinese sentences. The raw tensors hold UTF-8-encoded byte strings, which are decoded into readable Unicode text with `.decode("utf-8")`.
```
<PrefetchDataset shapes: ((), ()), types: (tf.string, tf.string)>
<PrefetchDataset shapes: ((), ()), types: (tf.string, tf.string)>
----------
tf.Tensor(b'The fear is real and visceral, and politicians ignore it at their peril.', shape=(), dtype=string)
tf.Tensor(b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x81\x90\xe6\x83\xa7\xe6\x98\xaf\xe7\x9c\x9f\xe5\xae\x9e\xe8\x80\x8c\xe5\x86\x85\xe5\x9c\xa8\xe7\x9a\x84\xe3\x80\x82 \xe5\xbf\xbd\xe8\xa7\x86\xe5\xae\x83\xe7\x9a\x84\xe6\x94\xbf\xe6\xb2\xbb\xe5\xae\xb6\xe4\xbb\xac\xe5\x89\x8d\xe9\x80\x94\xe5\xa0\xaa\xe5\xbf\xa7\xe3\x80\x82', shape=(), dtype=string)
----------
tf.Tensor(b'In fact, the German political landscape needs nothing more than a truly liberal party, in the US sense of the word \xe2\x80\x9cliberal\xe2\x80\x9d \xe2\x80\x93 a champion of the cause of individual freedom.', shape=(), dtype=string)
tf.Tensor(b'\xe4\xba\x8b\xe5\xae\x9e\xe4\xb8\x8a\xef\xbc\x8c\xe5\xbe\xb7\xe5\x9b\xbd\xe6\x94\xbf\xe6\xb2\xbb\xe5\xb1\x80\xe5\x8a\xbf\xe9\x9c\x80\xe8\xa6\x81\xe7\x9a\x84\xe4\xb8\x8d\xe8\xbf\x87\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe7\xac\xa6\xe5\x90\x88\xe7\xbe\x8e\xe5\x9b\xbd\xe6\x89\x80\xe8\xb0\x93\xe2\x80\x9c\xe8\x87\xaa\xe7\x94\xb1\xe2\x80\x9d\xe5\xae\x9a\xe4\xb9\x89\xe7\x9a\x84\xe7\x9c\x9f\xe6\xad\xa3\xe7\x9a\x84\xe8\x87\xaa\xe7\x94\xb1\xe5\x85\x9a\xe6\xb4\xbe\xef\xbc\x8c\xe4\xb9\x9f\xe5\xb0\xb1\xe6\x98\xaf\xe4\xb8\xaa\xe4\xba\xba\xe8\x87\xaa\xe7\x94\xb1\xe4\xba\x8b\xe4\xb8\x9a\xe7\x9a\x84\xe5\x80\xa1\xe5\xaf\xbc\xe8\x80\x85\xe3\x80\x82', shape=(), dtype=string)
----------
---UTF-8 decoded---
The fear is real and visceral, and politicians ignore it at their peril.
这种恐惧是真实而内在的。 忽视它的政治家们前途堪忧。
----------
In fact, the German political landscape needs nothing more than a truly liberal party, in the US sense of the word “liberal” – a champion of the cause of individual freedom.
事实上,德国政治局势需要的不过是一个符合美国所谓“自由”定义的真正的自由党派,也就是个人自由事业的倡导者。
----------
```
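The byte-to-string conversion above can be reproduced in plain Python; the bytes below are the first few bytes of the Chinese tensor shown earlier, used here as a short stand-in for the full dataset:

```python
# Raw dataset tensors hold UTF-8-encoded bytes; decoding yields readable text.
raw_zh = b"\xe8\xbf\x99\xe7\xa7\x8d\xe6\x81\x90\xe6\x83\xa7"  # first four characters of the sample sentence
decoded = raw_zh.decode("utf-8")
print(decoded)  # → 这种恐惧
```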
Next, each word (or subword) needs to be assigned an index, i.e., a vocabulary has to be built.
```python=
%%time
try:
    # Load a previously built vocabulary if one exists
    subword_encoder_en = tfds.features.text.SubwordTextEncoder.load_from_file(en_vocab_file)
    print(f"Loaded existing vocabulary: {en_vocab_file}")
except:
    print("No existing vocabulary found; building one from scratch.")
    subword_encoder_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
        (en.numpy() for en, _ in train_examples),
        target_vocab_size=2**13)  # adjust the vocabulary size if needed
    # Save the vocabulary to disk for a warm start next time
    subword_encoder_en.save_to_file(en_vocab_file)

print(f"Vocabulary size: {subword_encoder_en.vocab_size}")
print(f"First 10 subwords: {subword_encoder_en.subwords[:10]}")
print()
```
I ran into a problem here: searching around shows that the API used in the tutorial has been removed from tensorflow_datasets, and I have not yet found a replacement.
```
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<timed exec> in <module>
AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
<timed exec> in <module>
AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'
```
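One lead worth checking: newer tensorflow_datasets releases appear to keep this class under `tfds.deprecated.text.SubwordTextEncoder` (unverified here). Conceptually, what the encoder provides is a token-to-index mapping; as a rough illustration, independent of tensorflow_datasets and word-level rather than subword-level, a minimal vocabulary can be built like this:

```python
# Minimal word-level vocabulary sketch (illustration only; the tutorial's
# SubwordTextEncoder additionally splits rare words into subword units).
def build_vocab(corpus):
    """Assign each unique whitespace-separated token a unique index."""
    vocab = {}
    for sentence in corpus:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

corpus = ["the fear is real", "the fear is visceral"]
vocab = build_vocab(corpus)
print(vocab["the"], vocab["visceral"])  # → 0 4
```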
### Others (e.g., projects)
Read cryptography papers.
---
## Next Step
### Paper reading
[Neural Architecture Search With Reinforcement Learning](https://arxiv.org/abs/1611.01578)
### Programming
English-to-Chinese translation implementation
### Others
Read cryptography papers.