# Malicious URL classification
# URL representation learning
### unsupervised learning of sequence representations
* Hsu, Wei-Ning, Yu Zhang, and James Glass. "Unsupervised learning of disentangled and interpretable representations from sequential data." Advances in neural information processing systems. 2017.
* Pei, Wenjie, and David MJ Tax. "Unsupervised Learning of Sequence Representations by Autoencoders." arXiv preprint arXiv:1804.00946 (2018).
* Misra, Ishan, C. Lawrence Zitnick, and Martial Hebert. "Shuffle and learn: unsupervised learning using temporal order verification." European Conference on Computer Vision. Springer, Cham, 2016.
* Denton, Emily L. "Unsupervised learning of disentangled representations from video." Advances in neural information processing systems. 2017.
* Chung, Yu-An, et al. "Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder." arXiv preprint arXiv:1603.00982 (2016).
* Lee, Hsin-Ying, et al. "Unsupervised representation learning by sorting sequences." Proceedings of the IEEE International Conference on Computer Vision. 2017.
### unsupervised learning of sentence representations
* Pagliardini, Matteo, Prakhar Gupta, and Martin Jaggi. "Unsupervised learning of sentence embeddings using compositional n-gram features." arXiv preprint arXiv:1703.02507 (2017).
* Logeswaran, Lajanugen, and Honglak Lee. "An efficient framework for learning sentence representations." arXiv preprint arXiv:1803.02893 (2018).
* Hill, Felix, Kyunghyun Cho, and Anna Korhonen. "Learning distributed representations of sentences from unlabelled data." arXiv preprint arXiv:1602.03483 (2016).
# classification
### Unsupervised
* anomaly detection
    * Tang, Adrian, Simha Sethumadhavan, and Salvatore J. Stolfo. "Unsupervised anomaly-based malware detection using hardware features." International Workshop on Recent Advances in Intrusion Detection. Springer, Cham, 2014.
    * Zhang, Jiong, and Mohammad Zulkernine. "Anomaly based network intrusion detection with unsupervised outlier detection." 2006 IEEE International Conference on Communications. Vol. 5. IEEE, 2006.
* one-class classification
    * Amer, Mennatallah, Markus Goldstein, and Slim Abdennadher. "Enhancing one-class support vector machines for unsupervised anomaly detection." Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description. ACM, 2013.
* clustering
    * Leung, Kingsly, and Christopher Leckie. "Unsupervised anomaly detection in network intrusion detection using clusters." Proceedings of the Twenty-eighth Australasian conference on Computer Science-Volume 38. Australian Computer Society, Inc., 2005.
* unsupervised sequence classification
    * Tomović, Andrija, Predrag Janičić, and Vlado Kešelj. "n-Gram-based classification and unsupervised hierarchical clustering of genome sequences." Computer methods and programs in biomedicine 81.2 (2006): 137-153.
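
The n-gram and anomaly-detection ideas in the list above can be combined in a small sketch. This is only an illustration under stated assumptions, not the method of any cited paper: it builds a character 3-gram profile from a reference set assumed to be benign and scores a new URL by the fraction of its n-grams that never occur in that profile. The function names and the sample URLs are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(urls, n=3):
    """Count character n-grams over a reference set of (assumed benign) URLs."""
    profile = Counter()
    for url in urls:
        profile.update(char_ngrams(url.lower(), n))
    return profile


def anomaly_score(url, profile, n=3):
    """Fraction of the URL's n-grams never seen in the reference profile.

    0.0 means every n-gram was seen before; 1.0 means none were.
    """
    grams = char_ngrams(url.lower(), n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in profile)
    return unseen / len(grams)


# Hypothetical reference URLs, for illustration only.
benign = [
    "https://www.example.com/index.html",
    "https://www.example.org/about.html",
    "https://docs.example.com/guide/index.html",
]
profile = build_profile(benign)

seen_score = anomaly_score("https://www.example.com/about.html", profile)
odd_score = anomaly_score("http://198.51.100.7/x9q~zz/l0gin.php?id=%41%42", profile)
```

With a realistic reference set, a threshold on this score would have to be tuned on held-out data; the one-class SVM and clustering approaches cited above are more principled alternatives to a fixed cutoff.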
### Semi-supervised
* Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." Advances in neural information processing systems. 2015.
### supervised
text classification task
* Le, Hung, et al. "URLNet: learning a URL representation with deep learning for malicious URL detection." arXiv preprint arXiv:1802.03162 (2018).
    * [github](https://github.com/Antimalweb/URLNet)
* Cer, Daniel, et al. "Universal sentence encoder." arXiv preprint arXiv:1803.11175 (2018).
    * [github](https://github.com/tensorflow/tfjs-models/tree/master/universal-sentence-encoder)
* Yu, Adams Wei, et al. "QANet: Combining local convolution with global self-attention for reading comprehension." arXiv preprint arXiv:1804.09541 (2018).
    * [github](https://github.com/BangLiu/QANet-PyTorch)
* Transformer (BERT, XLNet)
    * [github](https://github.com/huggingface/pytorch-transformers)
* Graph Convolutional Networks for Text Classification
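
Treating URL detection as a text classification task means the raw string first has to become a fixed-size numeric input. The sketch below shows one common way to do that at the character level. It is an assumption-laden illustration, not the preprocessing used by URLNet or any other model listed above: the alphabet, the reserved ids, the `max_len` default, and the function names are all hypothetical choices.

```python
import string

# Assumed alphabet: ASCII letters, digits, and punctuation.
# Id 0 is reserved for padding, id 1 for any character outside the alphabet.
ALPHABET = string.ascii_letters + string.digits + string.punctuation
PAD, UNK = 0, 1
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_len=200):
    """Map a URL to a fixed-length list of integer character ids.

    URLs longer than `max_len` are truncated; shorter ones are right-padded.
    """
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def encode_batch(urls, max_len=200):
    """Encode several URLs into a list of equal-length id sequences."""
    return [encode_url(u, max_len) for u in urls]


batch = encode_batch(["http://example.com/a", "https://example.org/login?id=1"], max_len=32)
```

The resulting integer sequences can be fed to an embedding layer in front of a CNN, RNN, or Transformer encoder. Word-level or subword tokenization is an alternative design choice with a different vocabulary-size trade-off.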
# information gathering
* lexical
* WHOIS
* HTML view
* web page content
* host-based
* other
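
Of the feature sources listed above, lexical features are the cheapest to gather because they need only the URL string. The following is a minimal sketch using the Python standard library; the particular features and their names are assumptions chosen for illustration, not a feature set taken from any of the cited papers or datasets.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lexical_features(url):
    """Extract simple lexical features from a URL string (no network access)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_query_params": len(parts.query.split("&")) if parts.query else 0,
        "host_is_ipv4": bool(IPV4_RE.match(host)),
        "uses_https": parts.scheme == "https",
        "url_entropy": shannon_entropy(url),
    }


# Hypothetical URL on a documentation-only IP range.
features = lexical_features("http://192.0.2.1/login.php?a=1&b=2")
```

WHOIS, host-based, and page-content features require network lookups and are therefore slower and less reproducible, which is one reason lexical features are often used as a first-pass filter.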
# public dataset
### [ISCX-URL-2016](https://www.unb.ca/cic/datasets/url-2016.html)
### kaggle
1. https://www.kaggle.com/antonyj453/urldataset
2. https://www.kaggle.com/aktank/url-detection
3. https://www.kaggle.com/deepak730/finding-malicious-url-through-url-features
### Phishing URLs
1. PhishTank - https://www.phishtank.com/developer_info.php
2. OpenPhish - https://openphish.com/
### Spam URLs
1. JWSPAMSPY - http://www.joewein.de/sw/blacklist.htm
### Malware URLs
1. DNS-BH - http://www.malwaredomains.com/wordpress/?page_id=66
2. Malware Patrol - https://www.malwarepatrol.net/my-account/
3. Malware Domain List - http://www.malwaredomainlist.com/
### Benign URLs
1. Majestic - https://majestic.com/reports/majestic-million
### Other sources
1. https://zeltser.com/malicious-ip-blocklists/