Markus Bayer, M.Sc.
bayer(at)peasec.tu-darmstadt.de
Research Associate / Doctoral Researcher
Technische Universität Darmstadt, Department of Computer Science,
Science and Technology for Peace and Security (PEASEC)
Pankratiusstraße 2, 64289 Darmstadt
Markus Bayer, M.Sc. is a research associate and doctoral researcher at the Chair of Science and Technology for Peace and Security (PEASEC) in the Department of Computer Science at Technische Universität Darmstadt.
During his computer science studies (B.Sc. and M.Sc.) at Technische Universität Darmstadt, he focused on machine learning in combination with peace and security research. At PEASEC, he applies this expertise in the CYWARN project, using deep learning to address challenges faced by security event teams. His overarching goal is to tackle highly relevant practical problems, such as explainable AI and deep learning in low-data regimes, through targeted and theoretically grounded research.
Publications
[BibTeX] [Abstract] [Download PDF]
Publicly available information contains valuable information for Cyber Threat Intelligence (CTI). This can be used to prevent attacks that have already taken place on other systems. Ideally, only the initial attack succeeds and all subsequent ones are detected and stopped. But while there are different standards to exchange this information, a lot of it is shared in articles or blog posts in non-standardized ways. Manually scanning through multiple online portals and news pages to discover new threats and extracting them is a time-consuming task. To automate parts of this scanning process, multiple papers propose extractors that use Natural Language Processing (NLP) to extract Indicators of Compromise (IOCs) from documents. However, while this already solves the problem of extracting the information out of documents, the search for these documents is rarely considered. In this paper, a new focused crawler called ThreatCrawl is proposed, which uses Bidirectional Encoder Representations from Transformers (BERT)-based models to classify documents and adapt its crawling path dynamically. While ThreatCrawl has difficulties classifying the specific type of Open Source Intelligence (OSINT) named in texts, e.g., IOC content, it can successfully find relevant documents and modify its path accordingly. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art.
@techreport{kuehn_threatcrawl_2023,
title = {{ThreatCrawl}: {A} {BERT}-based {Focused} {Crawler} for the {Cybersecurity} {Domain}},
shorttitle = {{ThreatCrawl}},
url = {http://arxiv.org/abs/2304.11960},
abstract = {Publicly available information contains valuable information for Cyber Threat Intelligence (CTI). This can be used to prevent attacks that have already taken place on other systems. Ideally, only the initial attack succeeds and all subsequent ones are detected and stopped. But while there are different standards to exchange this information, a lot of it is shared in articles or blog posts in non-standardized ways. Manually scanning through multiple online portals and news pages to discover new threats and extracting them is a time-consuming task. To automize parts of this scanning process, multiple papers propose extractors that use Natural Language Processing (NLP) to extract Indicators of Compromise (IOCs) from documents. However, while this already solves the problem of extracting the information out of documents, the search for these documents is rarely considered. In this paper, a new focused crawler is proposed called ThreatCrawl, which uses Bidirectional Encoder Representations from Transformers (BERT)-based models to classify documents and adapt its crawling path dynamically. While ThreatCrawl has difficulties to classify the specific type of Open Source Intelligence (OSINT) named in texts, e.g., IOC content, it can successfully find relevant documents and modify its path accordingly. It yields harvest rates of up to 52\%, which are, to the best of our knowledge, better than the current state of the art.},
number = {arXiv:2304.11960},
urldate = {2023-04-27},
institution = {arXiv},
author = {Kuehn, Philipp and Schmidt, Mike and Bayer, Markus and Reuter, Christian},
month = apr,
year = {2023},
note = {arXiv:2304.11960 [cs]},
keywords = {Student, Security, Projekt-ATHENE-SecUrban, Projekt-CYWARN},
}
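The abstract above describes a crawler whose frontier is steered by a BERT-based relevance classifier. The following is a minimal, hypothetical sketch of that idea, not the ThreatCrawl implementation: the model path, the assumption that class index 1 means "relevant", and the best-first frontier ordered by the parent page's relevance score are illustrative choices.

```python
# Hypothetical sketch of a BERT-guided focused crawler (not the ThreatCrawl code).
# Assumption: MODEL_DIR points to a fine-tuned binary relevance classifier.
import heapq
from urllib.parse import urljoin

import requests
import torch
from bs4 import BeautifulSoup
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "models/threat-relevance-bert"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)

def relevance(text):
    """Probability that a page is relevant, according to the fine-tuned model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

def crawl(seed_urls, max_pages=100, threshold=0.5):
    frontier = [(-1.0, url) for url in seed_urls]  # max-heap via negated scores
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        score = relevance(soup.get_text(" ", strip=True))
        if score >= threshold:
            harvested.append((url, score))
        # outgoing links inherit the parent's score, so relevant pages are expanded first
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in visited:
                heapq.heappush(frontier, (-score, link))
    return harvested
```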
[BibTeX] [Abstract] [Download PDF]
Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing a model’s generalization capabilities, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount of data used to protect privacy. Based on a precise description of the goals and applications of data augmentation and a taxonomy for existing works, this survey is concerned with data augmentation methods for textual classification and aims to provide a concise and comprehensive overview for researchers and practitioners. Derived from the taxonomy, we divide more than 100 methods into 12 different groupings and give state-of-the-art references expounding which methods are highly promising by relating them to each other. Finally, research perspectives that may constitute a building block for future work are provided.
@article{bayer_survey_2023,
title = {Survey on {Data} {Augmentation} for {Text} {Classification}},
volume = {55},
url = {https://dl.acm.org/doi/pdf/10.1145/3544558},
doi = {10.1145/3544558},
abstract = {Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing a model's generalization capabilities, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount data used to protect privacy. Based on a precise description of the goals and applications of data augmentation and a taxonomy for existing works, this survey is concerned with data augmentation methods for textual classification and aims to provide a concise and comprehensive overview for researchers and practitioners. Derived from the taxonomy, we divide more than 100 methods into 12 different groupings and give state-of-the-art references expounding which methods are highly promising by relating them to each other. Finally, research perspectives that may constitute a building block for future work are provided.},
number = {7},
journal = {ACM Computing Surveys (CSUR)},
author = {Bayer, Markus and Kaufhold, Marc-André and Reuter, Christian},
year = {2023},
keywords = {AuswahlCrisis, Selected, A-Paper, Ranking-CORE-A*, Ranking-ImpactFactor, Projekt-ATHENE-SecUrban, Projekt-CYWARN, Projekt-emergenCITY, AuswahlKaufhold},
pages = {1--39},
}
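As a small illustration of the kind of method the survey groups (here, a word-level transformation), the toy function below performs a random token swap, one of the simplest label-preserving perturbations. It is an example of the technique class, not code from the survey.

```python
# Toy word-level augmentation (random swap), illustrating one method grouping.
import random

def random_swap(text, n_swaps=1, seed=None):
    """Swap n_swaps random token pairs; the class label is assumed to be preserved."""
    rng = random.Random(seed)
    tokens = text.split()
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)

# Example: random_swap("attackers exploited the unpatched server", seed=0)
```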
[BibTeX] [Abstract] [Download PDF]
A Design Science Artefact for Cyber Threat Detection and Actor Specific Communication
@article{bayer_multi-level_2023,
title = {Multi-{Level} {Fine}-{Tuning}, {Data} {Augmentation}, and {Few}-{Shot} {Learning} for {Specialized} {Cyber} {Threat} {Intelligence}},
issn = {0167-4048},
url = {https://www.sciencedirect.com/science/article/pii/S0167404823003401},
doi = {10.1016/j.cose.2023.103430},
abstract = {A Design Science Artefact for Cyber Threat Detection and Actor Specific Communication},
journal = {Computers \& Security},
author = {Bayer, Markus and Frey, Tobias and Reuter, Christian},
year = {2023},
keywords = {Student, Security, A-Paper, Ranking-ImpactFactor, Ranking-CORE-B, Projekt-CROSSING, Projekt-CYWARN, Projekt-ATHENE},
}
[BibTeX] [Abstract] [Download PDF]
Despite the merits of public and social media in private and professional spaces, citizens and professionals are increasingly exposed to cyberabuse, such as cyberbullying and hate speech. Thus, Law Enforcement Agencies (LEAs) are deployed in many countries and organisations to enhance the preventive and reactive capabilities against cyberabuse. However, their tasks are becoming more complex due to the increasing amount and varying quality of information disseminated into public channels. Adopting the perspectives of Crisis Informatics and safety-critical Human-Computer Interaction (HCI) and based on both a narrative literature review and group discussions, this paper first outlines the research agenda of the CYLENCE project, which seeks to design strategies and tools for cross-media reporting, detection, and treatment of cyberbullying and hate speech in investigative and law enforcement agencies. Second, it identifies and elaborates seven research challenges with regard to the monitoring, analysis and communication of cyberabuse in LEAs, which serve as a starting point for in-depth research within the project.
@inproceedings{kaufhold_cylence_2023,
address = {Rapperswil, Switzerland},
title = {{CYLENCE}: {Strategies} and {Tools} for {Cross}-{Media} {Reporting}, {Detection}, and {Treatment} of {Cyberbullying} and {Hatespeech} in {Law} {Enforcement} {Agencies}},
url = {https://dl.gi.de/items/0e0efe8f-64bf-400c-85f7-02b65f83189d},
doi = {10.18420/muc2023-mci-ws01-211},
abstract = {Despite the merits of public and social media in private and professional spaces, citizens and professionals are increasingly exposed to cyberabuse, such as cyberbullying and hate speech. Thus, Law Enforcement Agencies (LEA) are deployed in many countries and organisations to enhance the preventive and reactive capabilities against cyberabuse. However, their tasks are getting more complex by the increasing amount and varying quality of information disseminated into public channels. Adopting the perspectives of Crisis Informatics and safety-critical Human-Computer Interaction (HCI) and based on both a narrative literature review and group discussions, this paper first outlines the research agenda of the CYLENCE project, which seeks to design strategies and tools for cross-media reporting, detection, and treatment of cyberbullying and hatespeech in investigative and law enforcement agencies. Second, it identifies and elaborates seven research challenges with regard to the monitoring, analysis and communication of cyberabuse in LEAs, which serve as a starting point for in-depth research within the project.},
language = {de},
booktitle = {Mensch und {Computer} 2023 - {Workshopband}},
publisher = {Gesellschaft für Informatik e.V.},
author = {Kaufhold, Marc-André and Bayer, Markus and Bäumler, Julian and Reuter, Christian and Stieglitz, Stefan and Basyurt, Ali Sercan and Mirabaie, Milad and Fuchß, Christoph and Eyilmez, Kaan},
year = {2023},
keywords = {Projekt-CYLENCE},
}
[BibTeX] [Abstract] [Download PDF]
The field of cybersecurity is evolving fast. Experts need to be informed about past, current and – in the best case – upcoming threats, because attacks are becoming more advanced, targets bigger and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the textual domain, pre-trained language models like BERT have been shown to be helpful by providing a good baseline for further fine-tuning. However, due to the domain knowledge and the many technical terms in cybersecurity, general language models might miss the gist of textual information, hence doing more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model is best in specific application scenarios, in contrast to the others. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The used dataset and trained model are made publicly available.
@techreport{bayer_cysecbert_2022,
title = {{CySecBERT}: {A} {Domain}-{Adapted} {Language} {Model} for the {Cybersecurity} {Domain}},
copyright = {arXiv.org perpetual, non-exclusive license},
url = {https://arxiv.org/abs/2212.02974},
abstract = {The field of cybersecurity is evolving fast. Experts need to be informed about past, current and - in the best case - upcoming threats, because attacks are becoming more advanced, targets bigger and systems more complex. As this cannot be addressed manually, cybersecurity experts need to rely on machine learning techniques. In the texutual domain, pre-trained language models like BERT have shown to be helpful, by providing a good baseline for further fine-tuning. However, due to the domain-knowledge and many technical terms in cybersecurity general language models might miss the gist of textual information, hence doing more harm than good. For this reason, we create a high-quality dataset and present a language model specifically tailored to the cybersecurity domain, which can serve as a basic building block for cybersecurity systems that deal with natural language. The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the one hand, the results of the intrinsic tasks show that our model improves the internal representation space of words compared to the other models. On the other hand, the extrinsic, domain-dependent tasks, consisting of sequence tagging and classification, show that the model is best in specific application scenarios, in contrast to the others. Furthermore, we show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge. The used dataset and trained model are made publicly available},
institution = {arXiv},
author = {Bayer, Markus and Kuehn, Philipp and Shanehsaz, Ramin and Reuter, Christian},
year = {2022},
doi = {10.48550/ARXIV.2212.02974},
keywords = {Projekt-ATHENE-SecUrban, Projekt-CYWARN},
}
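Domain adaptation of the kind described above is commonly done by continuing masked-language-model pretraining on in-domain text. The sketch below shows that generic recipe with Hugging Face Transformers; the corpus file name, base model, and hyperparameters are illustrative assumptions, not the values used for CySecBERT.

```python
# Minimal sketch of domain-adaptive masked-language-model pretraining
# (illustrative; not the CySecBERT training script or its hyperparameters).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# corpus.txt (assumed): one cybersecurity document (blog post, advisory, tweet) per line
corpus = load_dataset("text", data_files={"train": "corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cysec-bert",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           learning_rate=5e-5),
    train_dataset=corpus,
    # 15% of tokens are masked and predicted, the standard BERT pretraining objective
    data_collator=DataCollatorForLanguageModeling(tokenizer=tok, mlm_probability=0.15),
)
trainer.train()
trainer.save_model("cysec-bert")
```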
[BibTeX] [Abstract] [Download PDF]
A forward-looking and globally oriented German arms control policy holds enormous security-policy potential. An active German arms control policy can help to mitigate the dangers of worldwide rearmament and of regional armament and escalation dynamics. Arms control instruments must underpin any agreement on ending the war against Ukraine. Agreements on the non-proliferation of weapons of mass destruction establish rules that can prevent the military misuse of dual-use technologies. Disarmament and arms control are already reducing human suffering in other regions of the world. Disarmament can help to overcome the prevailing deterrence paradigm, which has become increasingly unpredictable in the war against Ukraine. A committed arms control policy fits into the Federal Government's feminist foreign policy if it is designed to be participatory and restrictive and reduces the negative consequences of rearmament and war, especially for women and marginalised groups. To realise this security-policy potential, the National Security Strategy should outline the cornerstones of an independent German arms control policy. Three principles can guide such a policy.
@techreport{meier_fur_2022,
title = {Für eine umfassende, globale und aktive {Abrüstungs}- und {Rüstungskontrollpolitik}},
url = {https://fourninesecurity.de/2022/11/10/fuer-eine-umfassende-globale-und-aktive-abruestungs-und-ruestungskontrollpolitik},
abstract = {Eine vorausschauende und global ausgerichtete deutsche Rüstungskontrollpolitik hat enorme sicherheitspolitische Potenziale. Denn: Eine aktive Rüstungskontrollpolitik Deutschlands kann helfen, die Gefahren der weltweiten Aufrüstung und regionaler Rüstungs- und Eskalationsdynamiken zu mindern. Rüstungskontrollpolitische Instrumente müssen jede Vereinbarung über das Ende des Kriegs gegen die Ukraine stützen. Vereinbarungen über die Nichtverbreitung von Massenvernichtungswaffen bestimmen Regeln, die den militärischen Missbrauch von dual use-Technologien verhindern können. Abrüstung und Rüstungskontrolle mindern schon jetzt menschliches Leid in anderen Weltregionen. Abrüstung kann dazu beitragen, das vorherrschende und im Krieg gegen die Ukraine zunehmend unberechenbare Abschreckungsparadigma zu überwinden. Eine engagierte Rüstungskontrollpolitik fügt sich dann in die feministische Außenpolitik der Bundesregierung, wenn sie partizipativ und restriktiv angelegt ist und negative Folgen von Aufrüstung und Krieg besonders für Frauen und marginalisierte Gruppen reduziert.
Um diese sicherheitspolitischen Potenziale auszuschöpfen, sollte die Nationale Sicherheitsstrategie Eckpunkte einer eigenständigen deutschen Rüstungskontrollpolitik beschreiben. Drei Prinzipien können eine solche Politik anleiten.},
language = {de},
author = {Meier, Oliver and Brzoska, Michael and Ferl, Anna-Katharina and Hach, Sascha and Bayer, Markus (2) and Mutschler, Max and Prem, Berenike and Reinhold, Thomas and Schmid, Stefka and Schwarz, Matthias},
year = {2022},
}
[BibTeX] [Abstract] [Download PDF]
In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real-world low data tasks (up to +4.84 F1 score). Since we are evaluating the method from many perspectives (11 datasets in total), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.
@article{bayer_data_2022,
title = {Data {Augmentation} in {Natural} {Language} {Processing}: {A} {Novel} {Text} {Generation} {Approach} for {Long} and {Short} {Text} {Classifiers}},
url = {https://link.springer.com/article/10.1007/s13042-022-01553-3},
doi = {10.1007/s13042-022-01553-3},
abstract = {In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53\% and 3.56\% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we are evaluating the method from many perspectives (in total 11 datasets), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.},
journal = {International Journal of Machine Learning and Cybernetics (IJMLC)},
author = {Bayer, Markus and Kaufhold, Marc-André and Buchhold, Björn and Keller, Marcel and Dallmeyer, Jörg and Reuter, Christian},
year = {2022},
keywords = {Student, Security, A-Paper, Ranking-ImpactFactor, Projekt-CYWARN, Projekt-emergenCITY},
}
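The generation idea in the abstract can be approximated with an off-the-shelf causal language model: prompt it with a few examples of a class and treat the continuations as candidate training texts. This is a hedged sketch of that general approach; the prompt format, the gpt2 model, and the filtering step are assumptions, not the paper's setup.

```python
# Sketch of generation-based augmentation: continuations of a class-conditioned
# prompt become additional (weakly labeled) training texts. Illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def augment(class_examples, n_new=5):
    # condition the model on a few examples of the target class
    prompt = "\n".join(class_examples[-3:]) + "\n"
    outputs = generator(prompt, max_new_tokens=40, num_return_sequences=n_new,
                        do_sample=True, top_p=0.95)
    new_texts = [o["generated_text"][len(prompt):].strip() for o in outputs]
    # keep only non-empty continuations; stricter filtering (e.g., by a classifier) is advisable
    return [t for t in new_texts if t]
```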
[BibTeX] [Abstract] [Download PDF]
Vulnerability databases are one of the main information sources for IT security experts. Hence, the quality of their information is of utmost importance for anyone working in this area. Previous work has shown that machine readable information is either missing, incorrect, or inconsistent with other data sources. In this paper, we introduce a system called Overt Vulnerability source ANAlysis (OVANA), utilizing state-of-the-art machine learning (ML) and natural-language processing (NLP) techniques, which analyzes the information quality (IQ) of vulnerability databases, searches the free-form description for relevant information missing from structured fields, and updates it accordingly. Our paper shows that OVANA is able to improve the IQ of the National Vulnerability Database by 51.23% based on the indicators of accuracy, completeness, and uniqueness. Moreover, we present information which should be incorporated into the structured fields to increase the uniqueness of vulnerability entries and improve the discriminability of different vulnerability entries. The identified information from OVANA enables a more targeted vulnerability search and provides guidance for IT security experts in finding relevant information in vulnerability descriptions for severity assessment.
@inproceedings{kuehn_ovana_2021,
title = {{OVANA}: {An} {Approach} to {Analyze} and {Improve} the {Information} {Quality} of {Vulnerability} {Databases}},
isbn = {978-1-4503-9051-4},
url = {https://peasec.de/paper/2021/2021_KuehnBayerWendelbornReuter_OVANAQualityVulnerabilityDatabases_ARES.pdf},
doi = {10.1145/3465481.3465744},
abstract = {Vulnerability databases are one of the main information sources for IT security experts. Hence, the quality of their information is of utmost importance for anyone working in this area. Previous work has shown that machine readable information is either missing, incorrect, or inconsistent with other data sources. In this paper, we introduce a system called Overt Vulnerability source ANAlysis (OVANA), utilizing state-of-the-art machine learning (ML) and natural-language processing (NLP) techniques, which analyzes the information quality (IQ) of vulnerability databases, searches the free-form description for relevant information missing from structured fields, and updates it accordingly. Our paper shows that OVANA is able to improve the IQ of the National Vulnerability Database by 51.23\% based on the indicators of accuracy, completeness, and uniqueness. Moreover, we present information which should be incorporated into the structured fields to increase the uniqueness of vulnerability entries and improve the discriminability of different vulnerability entries. The identified information from OVANA enables a more targeted vulnerability search and provides guidance for IT security experts in finding relevant information in vulnerability descriptions for severity assessment.},
booktitle = {Proceedings of the 16th {International} {Conference} on {Availability}, {Reliability} and {Security}},
publisher = {ACM},
author = {Kuehn, Philipp and Bayer, Markus and Wendelborn, Marc and Reuter, Christian},
year = {2021},
keywords = {Security, Peace, Ranking-CORE-B, AuswahlPeace, Projekt-ATHENE-SecUrban, Projekt-CYWARN},
pages = {1--11},
}
[BibTeX] [Abstract] [Download PDF]
Past studies in the domains of information systems have analysed the potentials and barriers of social media in emergencies. While information disseminated in social media can lead to valuable insights, emergency services and researchers face the challenge of information overload as data quickly exceeds the manageable amount. We propose an embedding-based clustering approach and a method for the automated labelling of clusters. Given that the clustering quality is highly dependent on embeddings, we evaluate 19 embedding models with respect to time, internal cluster quality, and language invariance. The results show that it may be sensible to use embedding models that were already trained on other crisis datasets. However, one must ensure that the training data generalizes enough, so that the clustering can adapt to new situations. Confirming this, we found out that some embeddings were not able to perform as well on a German dataset as on an English dataset.
@inproceedings{bayer_information_2021,
title = {Information {Overload} in {Crisis} {Management}: {Bilingual} {Evaluation} of {Embedding} {Models} for {Clustering} {Social} {Media} {Posts} in {Emergencies}},
url = {https://peasec.de/paper/2021/2021_BayerKaufholdReuter_InformationOverloadInCrisisManagementBilingualEvaluation_ECIS.pdf},
abstract = {Past studies in the domains of information systems have analysed the potentials and barriers of social media in emergencies. While information disseminated in social media can lead to valuable insights, emergency services and researchers face the challenge of information overload as data quickly exceeds the manageable amount. We propose an embedding-based clustering approach and a method for the automated labelling of clusters. Given that the clustering quality is highly dependent on embeddings, we evaluate 19 embedding models with respect to time, internal cluster quality, and language invariance. The results show that it may be sensible to use embedding models that were already trained on other crisis datasets. However, one must ensure that the training data generalizes enough, so that the clustering can adapt to new situations. Confirming this, we found out that some embeddings were not able to perform as well on a German dataset as on an English dataset.},
booktitle = {Proceedings of the {European} {Conference} on {Information} {Systems} ({ECIS})},
author = {Bayer, Markus and Kaufhold, Marc-André and Reuter, Christian},
year = {2021},
keywords = {Crisis, SocialMedia, A-Paper, Ranking-CORE-A, Projekt-ATHENE-SecUrban, Projekt-CYWARN},
pages = {1--18},
}
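A minimal version of the embedding-plus-clustering pipeline evaluated above could look as follows. The multilingual sentence-embedding model, KMeans, and the TF-IDF-based cluster labelling are stand-ins chosen for illustration; they are not the nineteen embedding models or the labelling method compared in the paper.

```python
# Illustrative sketch: embed social media posts, cluster the embeddings, and
# label each cluster with its highest-weighted TF-IDF terms (assumed components).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_posts(posts, n_clusters=10, top_terms=3):
    embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = embedder.encode(posts, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

    # automated labelling: top TF-IDF terms of the posts assigned to each cluster
    vec = TfidfVectorizer(max_features=5000)
    tfidf = vec.fit_transform(posts)
    terms = np.array(vec.get_feature_names_out())
    cluster_names = {}
    for c in range(n_clusters):
        mask = labels == c
        if mask.any():
            weights = np.asarray(tfidf[mask].mean(axis=0)).ravel()
            cluster_names[c] = ", ".join(terms[weights.argsort()[::-1][:top_terms]])
    return labels, cluster_names
```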
[BibTeX] [Abstract] [Download PDF]
Social media have an enormous impact on modern life but are prone to the dissemination of false information. In several domains, such as crisis management or political communication, it is of utmost importance to detect false and to promote credible information. Although educational measures might help individuals to detect false information, the sheer volume of social big data, which sometimes need to be analysed under time-critical constraints, calls for automated and (near) real-time assessment methods. Hence, this paper reviews existing approaches before designing and evaluating three deep learning models (MLP, RNN, BERT) for real-time credibility assessment using the example of Twitter posts. While our BERT implementation achieved best results with an accuracy of up to 87.07% and an F1 score of 0.8764 when using metadata, text, and user features, MLP and RNN showed lower classification quality but better performance for real-time application. Furthermore, the paper contributes with a novel dataset for credibility assessment.
@inproceedings{kaufhold_design_2021,
address = {Bratislava},
title = {Design and {Evaluation} of {Deep} {Learning} {Models} for {Real}-{Time} {Credibility} {Assessment} in {Twitter}},
url = {https://peasec.de/paper/2021/2021_KaufholdBayerHartungReuter_DeepLearningCredibilityAssessmentTwitter_ICANN.pdf},
doi = {10.1007/978-3-030-86383-8_32},
abstract = {Social media have an enormous impact on modern life but are prone to the dissemination of false information. In several domains, such as crisis management or political communication, it is of utmost importance to detect false and to promote credible information. Although educational measures might help individuals to detect false information, the sheer volume of social big data, which sometimes need to be analysed under time-critical constraints, calls for automated and (near) real-time assessment methods. Hence, this paper reviews existing approaches before designing and evaluating three deep learning models (MLP, RNN, BERT) for real-time credibility assessment using the example of Twitter posts. While our BERT implementation achieved best results with an accuracy of up to 87.07\% and an F1 score of 0.8764 when using metadata, text, and user features, MLP and RNN showed lower classification quality but better performance for real-time application. Furthermore, the paper contributes with a novel dataset for credibility assessment.},
booktitle = {30th {International} {Conference} on {Artificial} {Neural} {Networks} ({ICANN2021})},
author = {Kaufhold, Marc-André and Bayer, Markus and Hartung, Daniel and Reuter, Christian},
year = {2021},
keywords = {Student, Security, Ranking-CORE-B, Projekt-ATHENE-SecUrban, Projekt-CYWARN},
pages = {1--13},
}
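The combination of text, metadata, and user features described above can be illustrated with a small stand-in model: TF-IDF text features are concatenated with numeric metadata and fed to an MLP. This is an assumption-laden sketch, not the paper's BERT/RNN implementations or its actual feature set.

```python
# Sketch of combining text and metadata features for credibility classification
# (illustrative stand-in; the paper evaluates MLP, RNN, and BERT variants).
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

def train_credibility_classifier(texts, meta_features, labels):
    """meta_features: (n_samples, n_meta) array, e.g. follower count, account age (assumed)."""
    vectorizer = TfidfVectorizer(max_features=3000)
    X_text = vectorizer.fit_transform(texts)
    X = hstack([X_text, csr_matrix(np.asarray(meta_features, dtype=float))])
    clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300, random_state=0)
    clf.fit(X, labels)
    return vectorizer, clf

def predict_credible(vectorizer, clf, text, meta):
    X = hstack([vectorizer.transform([text]), csr_matrix(np.asarray([meta], dtype=float))])
    return bool(clf.predict(X)[0])
```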
[BibTeX] [Abstract] [Download PDF]
Despite the merits of digitisation in private and professional spaces, critical infrastructures and societies are increasingly exposed to cyberattacks. Thus, Computer Emergency Response Teams (CERTs) are deployed in many countries and organisations to enhance the preventive and reactive capabilities against cyberattacks. However, their tasks are becoming more complex due to the increasing amount and varying quality of information disseminated into public channels. Adopting the perspectives of Crisis Informatics and safety-critical Human-Computer Interaction (HCI) and based on both a narrative literature review and group discussions, this paper first outlines the research agenda of the CYWARN project, which seeks to design strategies and technologies for cross-platform cyber situational awareness and actor-specific cyber threat communication. Second, it identifies and elaborates eight research challenges with regard to the monitoring, analysis and communication of cyber threats in CERTs, which serve as a starting point for in-depth research within the project.
@inproceedings{kaufhold_cywarn_2021,
address = {Bonn},
series = {Mensch und {Computer} 2021 - {Workshopband}},
title = {{CYWARN}: {Strategy} and {Technology} {Development} for {Cross}-{Platform} {Cyber} {Situational} {Awareness} and {Actor}-{Specific} {Cyber} {Threat} {Communication}},
url = {https://dl.gi.de/server/api/core/bitstreams/8f470f6b-5050-4fb9-b923-d08cf84c17b7/content},
doi = {10.18420/muc2021-mci-ws08-263},
abstract = {Despite the merits of digitisation in private and professional spaces, critical infrastructures and societies are increasingly exposed to cyberattacks. Thus, Computer Emergency Response Teams (CERTs) are deployed in many countries and organisations to enhance the preventive and reactive capabilities against cyberattacks. However, their tasks are getting more complex by the increasing amount and varying quality of information disseminated into public channels. Adopting the perspectives of Crisis Informatics and safety-critical Human-Computer Interaction (HCI) and based on both a narrative literature review and group discussions, this paper first outlines the research agenda of the CYWARN project, which seeks to design strategies and technologies for cross-platform cyber situational awareness and actor-specific cyber threat communication. Second, it identifies and elaborates eight research challenges with regard to the monitoring, analysis and communication of cyber threats in CERTs, which serve as a starting point for in-depth research within the project.},
booktitle = {Workshop-{Proceedings} {Mensch} und {Computer}},
publisher = {Gesellschaft für Informatik},
author = {Kaufhold, Marc-André and Fromm, Jennifer and Riebe, Thea and Mirbabaie, Milad and Kuehn, Philipp and Basyurt, Ali Sercan and Bayer, Markus and Stöttinger, Marc and Eyilmez, Kaan and Möller, Reinhard and Fuchß, Christoph and Stieglitz, Stefan and Reuter, Christian},
year = {2021},
keywords = {Security, Projekt-CYWARN},
}
[BibTeX] [Abstract] [Download PDF]
Receiving relevant information on possible cyber threats, attacks, and data breaches in a timely manner is crucial for early response. The social media platform Twitter hosts an active cyber security community. Their activities are often monitored manually by security experts, such as Computer Emergency Response Teams (CERTs). We thus propose a Twitter-based alert generation system that issues alerts to a system operator as soon as new relevant cyber security related topics emerge. Thereby, our system allows us to monitor user accounts with significantly less workload. Our system applies a supervised classifier, based on active learning, that detects tweets containing relevant information. The results indicate that uncertainty sampling can reduce the amount of manual relevance classification effort and enhance the classifier performance substantially compared to random sampling. Our approach reduces the number of accounts and tweets that are needed for the classifier training, thus making the tool easily and rapidly adaptable to the specific context while also supporting data minimization for Open Source Intelligence (OSINT). Relevant tweets are clustered by a greedy stream clustering algorithm in order to identify significant events. The proposed system is able to work near real-time within the required 15-minute time frame and detects up to 93.8% of relevant events with a false alert rate of 14.81%.
@inproceedings{riebe_cysecalert_2021,
title = {{CySecAlert}: {An} {Alert} {Generation} {System} for {Cyber} {Security} {Events} {Using} {Open} {Source} {Intelligence} {Data}},
url = {https://peasec.de/paper/2021/2021_RiebeWirthBayerKuehnKaufholdKnautheGutheReuter_CySecAlertOpenSourceIntelligence_ICICS.pdf},
doi = {10.1007/978-3-030-86890-1_24},
abstract = {Receiving relevant information on possible cyber threats, attacks, and data breaches in a timely manner is crucial for early response. The social media platform Twitter hosts an active cyber security community. Their activities are often monitored manually by security experts, such as Computer Emergency Response Teams (CERTs). We thus propose a Twitter-based alert generation system that issues alerts to a system operator as soon as new relevant cyber security related topics emerge. Thereby, our system allows us to monitor user accounts with significantly less workload. Our system applies a supervised classifier, based on active learning, that detects tweets containing relevant information. The results indicate that uncertainty sampling can reduce the amount of manual relevance classification effort and enhance the classifier performance substantially compared to random sampling. Our approach reduces the number of accounts and tweets that are needed for the classifier training, thus making the tool easily and rapidly adaptable to the specific context while also supporting data minimization for Open Source Intelligence (OSINT). Relevant tweets are clustered by a greedy stream clustering algorithm in order to identify significant events. The proposed system is able to work near real-time within the required 15-minutes time frame and detects up to 93.8\% of relevant events with a false alert rate of 14.81\%.},
booktitle = {Information and {Communications} {Security} ({ICICS})},
author = {Riebe, Thea and Wirth, Tristan and Bayer, Markus and Kuehn, Philipp and Kaufhold, Marc-André and Knauthe, Volker and Guthe, Stefan and Reuter, Christian},
year = {2021},
keywords = {Student, UsableSec, Security, Ranking-CORE-B, Projekt-ATHENE-SecUrban, Projekt-CYWARN},
pages = {429--446},
}
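The uncertainty-sampling loop at the heart of the labelling process described above can be sketched as follows; the TF-IDF plus logistic-regression classifier and the oracle callback are simplifying assumptions, not CySecAlert's actual components or thresholds.

```python
# Sketch of pool-based active learning with uncertainty sampling
# (stand-in classifier; not the CySecAlert implementation).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(labeled_texts, labels, unlabeled_texts, ask_oracle, rounds=20):
    """ask_oracle(text) -> 0/1 stands in for the human annotator; the input lists are mutated."""
    vectorizer = TfidfVectorizer()
    clf = None
    for _ in range(rounds):
        if not unlabeled_texts:
            break
        X = vectorizer.fit_transform(labeled_texts + unlabeled_texts)
        n_labeled = len(labeled_texts)
        clf = LogisticRegression(max_iter=1000).fit(X[:n_labeled], labels)
        # query the tweet the classifier is least certain about (probability closest to 0.5)
        probs = clf.predict_proba(X[n_labeled:])[:, 1]
        idx = int(np.argmin(np.abs(probs - 0.5)))
        text = unlabeled_texts.pop(idx)
        labeled_texts.append(text)
        labels.append(ask_oracle(text))
    return vectorizer, clf
```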
[BibTeX] [Abstract] [Download PDF]
The research field of crisis informatics examines, amongst others, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos). However, the vast amount of data generated during large-scale incidents can lead to the issue of information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filtering out irrelevant messages, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis and relevance classification, (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies, (3) the evaluation of a well-performing Random Forest algorithm for relevance classification incorporating metadata from social media into a batch learning approach (e.g., 91.28%/89.19% accuracy, 98.3%/89.6% precision and 80.4%/87.5% recall with a fast training time with feature subset selection on the European floods/BASF SE incident datasets), as well as (4) an approach and preliminary evaluation for relevance classification including active, incremental and online learning to reduce the amount of required labeled data and to correct misclassifications of the algorithm by feedback classification. Using the latter approach, we achieved a well-performing classifier based on the European floods dataset while requiring only a quarter of the labeled data compared to the traditional batch learning approach. Despite a lesser effect on the BASF SE incident dataset, a substantial improvement could still be determined.
@article{kaufhold_rapid_2020,
title = {Rapid relevance classification of social media posts in disasters and emergencies: {A} system and evaluation featuring active, incremental and online learning},
volume = {57},
url = {https://peasec.de/paper/2020/2020_KaufholdBayerReuter_RapidRelevanceClassification_IPM.pdf},
abstract = {The research field of crisis informatics examines, amongst others, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos) from social media. However, the vast amount of data generated during large-scale incidents can lead to issue of information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filter out irrelevant messages, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis and relevance classification, (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies, (3) the evaluation of a well-performing Random Forest algorithm for relevance classification incorporating metadata from social media into a batch learning approach (e.g., 91.28\%/89.19\% accuracy, 98.3\%/89.6\% precision and 80.4\%/87.5\% recall with a fast training time with feature subset selection on the European floods/BASF SE incident datasets), as well as (4) an approach and preliminary evaluation for relevance classification including active, incremental and online learning to reduce the amount of required labeled data and to correct misclassifications of the algorithm by feedback classification. Using the latter approach, we achieved a well-performing classifier based on the European floods dataset by only requiring a quarter of labeled data compared to the traditional batch learning approach. Despite a lesser effect on the BASF SE incident dataset, still a substantial improvement could be determined.},
number = {1},
journal = {Information Processing \& Management (IPM)},
author = {Kaufhold, Marc-André and Bayer, Markus and Reuter, Christian},
year = {2020},
keywords = {Crisis, SocialMedia, A-Paper, Ranking-ImpactFactor, Ranking-CORE-A, Ranking-WKWI-B, Projekt-ATHENE-SecUrban, Projekt-emergenCITY, AuswahlKaufhold},
pages = {1--32},
}
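The active, incremental and online learning idea in point (4) of the abstract can be illustrated with scikit-learn's partial_fit interface, where analyst feedback is folded into the model without full retraining and feature hashing keeps the feature space stable across updates. This is a generic sketch under those assumptions; the paper's batch-learning evaluation uses a Random Forest with social media metadata features rather than the linear model shown here.

```python
# Sketch of incremental/online relevance classification with feedback updates
# (illustrative stand-in for the approach described in the abstract).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(loss="log_loss", random_state=0)

def initial_fit(texts, labels):
    # first call must declare all classes so later partial_fit calls can be incremental
    clf.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])

def feedback_update(text, corrected_label):
    # called whenever an analyst corrects a misclassified post
    clf.partial_fit(vectorizer.transform([text]), [corrected_label])

def is_relevant(text):
    return bool(clf.predict(vectorizer.transform([text]))[0])
```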
[BibTeX] [Abstract] [Download PDF]
This paper addresses the increasing digitalisation of the critical infrastructure of the food industry, focusing in particular on the resulting information-technology consequences for attack resistance and failure safety in agriculture and the sectors that depend on it. In this context, the modernisation of agricultural machinery and its networking as well as cloud computing in agriculture are analysed, and measures to be taken towards a resilient structure are explained. In many areas, it is shown that the risk of production failure is neglected in favour of benefits such as increased yield and quality.
@inproceedings{reuter_resiliente_2018,
address = {Dresden, Germany},
title = {Resiliente {Digitalisierung} der kritischen {Infrastruktur} {Landwirtschaft} - mobil, dezentral, ausfallsicher},
url = {https://dl.gi.de/bitstream/handle/20.500.12116/16930/Beitrag_330_final__a.pdf},
abstract = {Diese Arbeit befasst sich mit der zunehmenden Digitalisierung der kritischen Infrastruktur Ernährungswirtschaft und setzt den Fokus insbesondere auf die dadurch resultierenden informationstechnologischen Folgen bezüglich der Angriffs- und Ausfallsicherheit in der Landwirtschaft und von ihr abhängigen Sektoren. In diesem Kontext wird die Modernisierungen der Landmaschinen und deren Vernetzung sowie das Cloud-Computing in der Landwirtschaft analysiert und zu treffende Maßnahmen bezüglich einer resilienten Struktur erläutert. In vielen Bereichen wird dabei aufgezeigt, dass das Ausfallrisiko der Produktion zugunsten von Vorteilen wie Ertrags- und Qualitätssteigerung vernachlässigt wird.},
booktitle = {Mensch und {Computer} 2018: {Workshopband}},
publisher = {Gesellschaft für Informatik e.V.},
author = {Reuter, Christian and Schneider, Wolfgang and Eberz, Daniel and Bayer, Markus and Hartung, Daniel and Kaygusuz, Cemal},
editor = {Dachselt, Raimund and Weber, Gerhard},
year = {2018},
keywords = {Crisis, Student, Projekt-KontiKat, Infrastructure, RSF, Projekt-MAKI, Projekt-GeoBox, Projekt-HyServ},
pages = {623--632},
}