# 5/27 PM

## Current Situation

### Paper reading

[Neural Architecture Search With Reinforcement Learning](https://arxiv.org/abs/1611.01578)

* Not yet read

### Programming

[English-to-Chinese translation implementation](https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html)

* Training data: the News Commentary dataset from the WMT-19 translation task

The first 20% of the train split is used for training and the next 1% for validation. The remaining 79% is discarded to speed up training.

```python=
# `builder` is assumed to be the WMT-19 dataset builder created earlier in the tutorial.
# First 20% of the train split for training, the next 1% for validation.
examples = builder.as_dataset(split=["train[:20%]", "train[20%:21%]"], as_supervised=True)
train_examples, val_examples = examples
print(train_examples)
print(val_examples)
print("-" * 10)

# Print two raw (English, Chinese) tensor pairs.
for en, zh in train_examples.take(2):
    print(en)
    print(zh)
    print("-" * 10)

# Decode a few pairs from UTF-8 bytes into readable strings.
sample_examples = []
num_samples = 5
for en_t, zh_t in train_examples.take(num_samples):
    en = en_t.numpy().decode("utf-8")
    zh = zh_t.numpy().decode("utf-8")
    print(en)
    print(zh)
    print("-" * 10)
    sample_examples.append((en, zh))
```

Printing the data shows that each training example is a pair of English and Chinese sentences. Each tensor holds a UTF-8-encoded byte string, so it has to be decoded with `.numpy().decode("utf-8")` before the Chinese text is readable.

```
<PrefetchDataset shapes: ((), ()), types: (tf.string, tf.string)>
<PrefetchDataset shapes: ((), ()), types: (tf.string, tf.string)>
----------
tf.Tensor(b'The fear is real and visceral, and politicians ignore it at their peril.', shape=(), dtype=string)
tf.Tensor(b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x81\x90\xe6\x83\xa7\xe6\x98\xaf\xe7\x9c\x9f\xe5\xae\x9e\xe8\x80\x8c\xe5\x86\x85\xe5\x9c\xa8\xe7\x9a\x84\xe3\x80\x82 \xe5\xbf\xbd\xe8\xa7\x86\xe5\xae\x83\xe7\x9a\x84\xe6\x94\xbf\xe6\xb2\xbb\xe5\xae\xb6\xe4\xbb\xac\xe5\x89\x8d\xe9\x80\x94\xe5\xa0\xaa\xe5\xbf\xa7\xe3\x80\x82', shape=(), dtype=string)
----------
tf.Tensor(b'In fact, the German political landscape needs nothing more than a truly liberal party, in the US sense of the word \xe2\x80\x9cliberal\xe2\x80\x9d \xe2\x80\x93 a champion of the cause of individual freedom.', shape=(), dtype=string)
tf.Tensor(b'\xe4\xba\x8b\xe5\xae\x9e\xe4\xb8\x8a\xef\xbc\x8c\xe5\xbe\xb7\xe5\x9b\xbd\xe6\x94\xbf\xe6\xb2\xbb\xe5\xb1\x80\xe5\x8a\xbf\xe9\x9c\x80\xe8\xa6\x81\xe7\x9a\x84\xe4\xb8\x8d\xe8\xbf\x87\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe7\xac\xa6\xe5\x90\x88\xe7\xbe\x8e\xe5\x9b\xbd\xe6\x89\x80\xe8\xb0\x93\xe2\x80\x9c\xe8\x87\xaa\xe7\x94\xb1\xe2\x80\x9d\xe5\xae\x9a\xe4\xb9\x89\xe7\x9a\x84\xe7\x9c\x9f\xe6\xad\xa3\xe7\x9a\x84\xe8\x87\xaa\xe7\x94\xb1\xe5\x85\x9a\xe6\xb4\xbe\xef\xbc\x8c\xe4\xb9\x9f\xe5\xb0\xb1\xe6\x98\xaf\xe4\xb8\xaa\xe4\xba\xba\xe8\x87\xaa\xe7\x94\xb1\xe4\xba\x8b\xe4\xb8\x9a\xe7\x9a\x84\xe5\x80\xa1\xe5\xaf\xbc\xe8\x80\x85\xe3\x80\x82', shape=(), dtype=string)
----------
---UTF-8 decoded---
The fear is real and visceral, and politicians ignore it at their peril.
这种恐惧是真实而内在的。 忽视它的政治家们前途堪忧。
----------
In fact, the German political landscape needs nothing more than a truly liberal party, in the US sense of the word “liberal” – a champion of the cause of individual freedom.
事实上,德国政治局势需要的不过是一个符合美国所谓“自由”定义的真正的自由党派,也就是个人自由事业的倡导者。
----------
```

The next step is to assign an index to every subword, that is, to build a vocabulary.

```python=
%%time
# `tfds` (tensorflow_datasets) and `en_vocab_file` are assumed to be defined earlier in the notebook.
try:
    subword_encoder_en = tfds.features.text.SubwordTextEncoder.load_from_file(en_vocab_file)
    print(f"Loaded existing vocabulary: {en_vocab_file}")
except:
    print("No existing vocabulary found; building one from scratch.")
    subword_encoder_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
        (en.numpy() for en, _ in train_examples),
        target_vocab_size=2**13)  # adjust the vocabulary size if needed
    # Save the vocabulary file so the next run can warm-start from it
    subword_encoder_en.save_to_file(en_vocab_file)

print(f"Vocabulary size: {subword_encoder_en.vocab_size}")
print(f"First 10 subwords: {subword_encoder_en.subwords[:10]}")
print()
```

This step currently fails. After some searching, it appears that the API used in the tutorial (`tfds.features.text`) has been removed from the installed version of TensorFlow Datasets. Because the same call is also made inside the `except` branch, the error is raised twice. I have not found a replacement yet.

```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
<timed exec> in <module>

AttributeError: module 'tensorflow_datasets.core.features' has no attribute 'text'
```

### Others (e.g., projects)

Read cryptography papers

---

## Next Step

### Paper reading

[Neural Architecture Search With Reinforcement Learning](https://arxiv.org/abs/1611.01578)

### Programming

English-to-Chinese translation implementation

### Others

Read cryptography papers
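
A possible way to approach the `AttributeError` reported under Programming, as a sketch only and not a confirmed fix: look up `SubwordTextEncoder` in more than one location instead of hard-coding `tfds.features.text`. The second path below (`deprecated.text`) is an assumption about where newer TensorFlow Datasets releases keep the encoder and should be checked against the installed version. The names `first_available`, `ENCODER_PATHS`, and `get_subword_encoder_class` are hypothetical helpers introduced here for illustration. The demo at the bottom uses stand-in objects, so it runs without TensorFlow installed.

```python
from functools import reduce
from types import SimpleNamespace


def first_available(root, dotted_paths):
    """Return the first attribute path under `root` that resolves."""
    for path in dotted_paths:
        try:
            return reduce(getattr, path.split("."), root)
        except AttributeError:
            continue  # this location does not exist in the installed version
    raise AttributeError(f"none of {list(dotted_paths)} found on {root!r}")


# Candidate locations, oldest first. The second entry is an assumption about
# newer tensorflow_datasets releases; verify it before relying on it.
ENCODER_PATHS = (
    "features.text.SubwordTextEncoder",
    "deprecated.text.SubwordTextEncoder",
)


def get_subword_encoder_class():
    """Import tensorflow_datasets lazily and locate SubwordTextEncoder."""
    import tensorflow_datasets as tfds  # imported here so the demo below needs no TensorFlow
    return first_available(tfds, ENCODER_PATHS)


# Stand-in objects that mimic the two module layouts.
old_layout = SimpleNamespace(
    features=SimpleNamespace(text=SimpleNamespace(SubwordTextEncoder="old"))
)
new_layout = SimpleNamespace(
    features=SimpleNamespace(),
    deprecated=SimpleNamespace(text=SimpleNamespace(SubwordTextEncoder="new")),
)
print(first_available(old_layout, ENCODER_PATHS))
print(first_available(new_layout, ENCODER_PATHS))
```

In the notebook, `SubwordTextEncoder = get_subword_encoder_class()` could then replace both `tfds.features.text.SubwordTextEncoder` references. Catching only the file-not-found case in the `try` block, rather than using a bare `except:`, would also keep an import problem like this one from being reported as a missing vocabulary file.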