Professionals in cybersecurity are overwhelmed by the growing amount of publicly available data on cyber threats, which complicates threat analysis. The dissertation first explores clustering techniques to manage this data by grouping it into broad categories. However, these methods fall short of the detailed analysis needed for precise threat identification and mitigation, which highlights the need for more advanced approaches. Supervised machine learning offers a potential solution for predicting data relevance, but the dynamic nature of cyber threats limits the effectiveness of static classifiers, and training a new classifier for each incident is too labor- and data-intensive. Markus Bayer's dissertation proposes a comprehensive solution that applies low-data regime methods at different stages of the machine learning pipeline to enable practical training with minimal supervised data. Key strategies include active learning, data augmentation, multi-level transfer learning, and adversarial training to enhance model resilience. Together, these methods allow deep learning models to be trained for new cybersecurity incidents with minimal labeled data. Empirical evaluations show that using these methods in BERT-like models significantly improves performance over existing low-data regime techniques.

On September 25, 2024, Markus Bayer successfully defended his doctoral thesis, earning the title of Dr. rer. nat. at the Department of Computer Science, Technical University of Darmstadt.

The entire PEASEC team extends its heartfelt congratulations to our new *Dr. rer. nat.* Markus Bayer!

His dissertation was supervised by Prof. Dr. Dr. Christian Reuter, with Prof. Dr. Lucie Flek (University of Bonn) serving as the second referee. The examination committee also included Prof. Dr. Sebastian Faust and Prof. Dr. Kristian Kersting.

Deep Learning in Textual Low-Data Regimes for Cybersecurity

In the field of cybersecurity, professionals face a growing challenge in dealing with the sheer volume of information on potential cyber threats. While deep learning has revolutionized many areas of data analysis, it typically relies on large amounts of labeled data. However, in cybersecurity, obtaining labeled data can be difficult due to the sensitive and specialized nature of the information. This thesis addresses this problem by developing methods to effectively apply deep learning techniques in textual low-data regimes, particularly within the cybersecurity domain.

The research presented in this thesis focuses on improving the usability and accuracy of machine learning models under conditions of limited data. It explores four main approaches: active learning, data augmentation, transfer learning, and adversarial training. Each of these techniques is adapted to the challenges of low-data regimes, ensuring that machine learning models can be both effective and resilient even when data is scarce.

One of the key contributions of this thesis is the development of a specialized language model, CySecBERT, which is fine-tuned for the cybersecurity domain. This model, along with the application of advanced deep learning techniques, significantly improves the performance of cybersecurity tools, particularly in the context of Computer Emergency Response Teams (CERTs). The findings of this research have broad implications, not only for improving cybersecurity practices but also for enhancing machine learning applications in other fields where data is limited.

Overall, this thesis provides a comprehensive framework for advancing deep learning in cybersecurity, offering practical solutions for managing and analyzing the vast amounts of data necessary to protect against cyber threats.

Selected Publications within the PhD thesis:

Markus Bayer, Marc-Andre Kaufhold, Christian Reuter (2021)
Information Overload in Crisis Management: Bilingual Evaluation of Embedding Models for Clustering Social Media Posts in Emergencies
Proceedings of the 29th European Conference on Information Systems (ECIS 2021). URL: https://aisel.aisnet.org/ecis2021_rp/64

Markus Bayer, Christian Reuter (2024)
ActiveLLM: Large Language Model-based Active Learning for Textual Few-Shot Scenarios
CoRR, arXiv. doi:10.48550/ARXIV.2405.10808

Markus Bayer, Marc-Andre Kaufhold, Christian Reuter (2023)
A Survey on Data Augmentation for Text Classification
ACM Computing Surveys (CSUR), 2023. doi:10.1145/3544558

Markus Bayer, Marc-Andre Kaufhold, Björn Buchhold, Marcel Keller, Jörg Dallmeyer, Christian Reuter (2023)
Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers
International Journal of Machine Learning and Cybernetics. doi:10.1007/s13042-022-01553-3

Marc-Andre Kaufhold, Markus Bayer, Daniel Hartung, Christian Reuter (2021)
Design and Evaluation of Deep Learning Models for Real-Time Credibility Assessment in Twitter
Artificial Neural Networks and Machine Learning – ICANN 2021 – 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14-17, 2021, Proceedings, Part V. Lecture Notes in Computer Science. Springer, 2021, pp. 396–408. doi:10.1007/978-3-030-86383-8_32

Markus Bayer, Philipp Kühn, Ramin Shanehsaz, Christian Reuter (2024)
CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain
ACM Transactions on Privacy and Security, 2024. doi:10.1145/3652594

Markus Bayer, Tobias Frey, Christian Reuter (2023)
Multi-level fine-tuning, data augmentation, and few-shot learning for specialized cyber threat intelligence
Computers & Security. doi:10.1016/J.COSE.2023.103430

Markus Bayer, Markus Neiczer, Maximilian Samsinger, Björn Buchhold, Christian Reuter (2024)
XAI-Attack: Utilizing Explainable AI to Find Incorrectly Learned Patterns for Black-Box Adversarial Example Creation
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. URL: https://aclanthology.org/2024.lrec-main.1542
