Markus Bayer, M. Sc.

Research Associate / Doctoral Candidate

Technische Universität Darmstadt, Department of Computer Science,
Science and Technology for Peace and Security (PEASEC)
Pankratiusstraße 2, 64289 Darmstadt

Markus Bayer, M.Sc. is a research associate and doctoral candidate at the Chair of Science and Technology for Peace and Security (PEASEC) in the Department of Computer Science at Technische Universität Darmstadt.

During his computer science studies (B.Sc. and M.Sc.) at Technische Universität Darmstadt, he focused on machine learning in combination with peace and security research. At PEASEC, he applies this expertise in the CYWARN project, using deep learning to address the challenges faced by security incident response teams. His overarching goal is to tackle highly relevant practical problems, such as explainable AI and deep learning in low-data regimes, through targeted and theoretically grounded research.

Publications

  • Philipp Kuehn, Markus Bayer, Marc Wendelborn, Christian Reuter (2021)
    OVANA: An Approach to Analyze and Improve the Information Quality of Vulnerability Databases
    Proceedings of the 16th International Conference on Availability, Reliability and Security. doi:10.1145/3465481.3465744
    [BibTeX] [Abstract] [Download PDF]

    Vulnerability databases are one of the main information sources for IT security experts. Hence, the quality of their information is of utmost importance for anyone working in this area. Previous work has shown that machine readable information is either missing, incorrect, or inconsistent with other data sources. In this paper, we introduce a system called Overt Vulnerability source ANAlysis (OVANA), utilizing state-of-the-art machine learning (ML) and natural-language processing (NLP) techniques, which analyzes the information quality (IQ) of vulnerability databases, searches the free-form description for relevant information missing from structured fields, and updates it accordingly. Our paper shows that OVANA is able to improve the IQ of the National Vulnerability Database by 51.23% based on the indicators of accuracy, completeness, and uniqueness. Moreover, we present information which should be incorporated into the structured fields to increase the uniqueness of vulnerability entries and improve the discriminability of different vulnerability entries. The identified information from OVANA enables a more targeted vulnerability search and provides guidance for IT security experts in finding relevant information in vulnerability descriptions for severity assessment.

    @inproceedings{kuehn_ovana_2021,
    title = {{OVANA}: {An} {Approach} to {Analyze} and {Improve} the {Information} {Quality} of {Vulnerability} {Databases}},
    isbn = {978-1-4503-9051-4},
    url = {https://doi.org/10.1145/3465481.3465744},
    doi = {10.1145/3465481.3465744},
    abstract = {Vulnerability databases are one of the main information sources for IT security experts. Hence, the quality of their information is of utmost importance for anyone working in this area. Previous work has shown that machine readable information is either missing, incorrect, or inconsistent with other data sources. In this paper, we introduce a system called Overt Vulnerability source ANAlysis (OVANA), utilizing state-of-the-art machine learning (ML) and natural-language processing (NLP) techniques, which analyzes the information quality (IQ) of vulnerability databases, searches the free-form description for relevant information missing from structured fields, and updates it accordingly. Our paper shows that OVANA is able to improve the IQ of the National Vulnerability Database by 51.23\% based on the indicators of accuracy, completeness, and uniqueness. Moreover, we present information which should be incorporated into the structured fields to increase the uniqueness of vulnerability entries and improve the discriminability of different vulnerability entries. The identified information from OVANA enables a more targeted vulnerability search and provides guidance for IT security experts in finding relevant information in vulnerability descriptions for severity assessment.},
    booktitle = {Proceedings of the 16th {International} {Conference} on {Availability}, {Reliability} and {Security}},
    publisher = {ACM},
    author = {Kuehn, Philipp and Bayer, Markus and Wendelborn, Marc and Reuter, Christian},
    year = {2021},
    keywords = {Projekt-ATHENE-SecUrban, Projekt-CYWARN, Security, Ranking-CORE-B},
    pages = {11},
    }
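To make the core idea behind OVANA concrete, here is a minimal sketch of pulling structured information out of a free-form vulnerability description. The description, field names, and regex patterns are hypothetical toys for illustration only; OVANA itself relies on state-of-the-art ML/NLP models rather than hand-written patterns.

```python
import re

# Toy NVD-style free-form description (hypothetical example).
description = (
    "A buffer overflow in ExampleApp 2.3.1 before 2.4.0 allows remote "
    "attackers to execute arbitrary code via a crafted HTTP request."
)

# Minimal patterns for fields that are often missing from structured data.
patterns = {
    "vulnerability_type": r"(buffer overflow|sql injection|cross-site scripting)",
    "affected_versions": r"\b(\d+\.\d+(?:\.\d+)?)\s+before\s+(\d+\.\d+(?:\.\d+)?)",
    "attack_vector": r"via a crafted (\w+ \w+)",
}

def extract_fields(text):
    """Search the free-form text for information missing from structured fields."""
    found = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            found[field] = match.group(0)
    return found

print(extract_fields(description))
```

A real pipeline would feed such extracted candidates back into the structured database fields, which is where the paper measures the information-quality improvement.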

  • Markus Bayer, Marc-André Kaufhold, Björn Buchhold, Marcel Keller, Jörg Dallmeyer, Christian Reuter (2021)
    Data Augmentation in Natural Language Processing: A Novel Text Generation Approach for Long and Short Text Classifiers
    2021.
    [BibTeX] [Download PDF]

    @techreport{bayer_data_2021,
    title = {Data {Augmentation} in {Natural} {Language} {Processing}: {A} {Novel} {Text} {Generation} {Approach} for {Long} and {Short} {Text} {Classifiers}},
    url = {https://arxiv.org/abs/2103.14453},
    author = {Bayer, Markus and Kaufhold, Marc-André and Buchhold, Björn and Keller, Marcel and Dallmeyer, Jörg and Reuter, Christian},
    year = {2021},
    keywords = {Projekt-CYWARN},
    }

  • Markus Bayer, Marc-André Kaufhold, Christian Reuter (2021)
    Information Overload in Crisis Management: Bilingual Evaluation of Embedding Models for Clustering Social Media Posts in Emergencies
    Proceedings of the European Conference on Information Systems (ECIS).
    [BibTeX] [Abstract] [Download PDF]

    Past studies in the domains of information systems have analysed the potentials and barriers of social media in emergencies. While information disseminated in social media can lead to valuable insights, emergency services and researchers face the challenge of information overload as data quickly exceeds the manageable amount. We propose an embedding-based clustering approach and a method for the automated labelling of clusters. Given that the clustering quality is highly dependent on embeddings, we evaluate 19 embedding models with respect to time, internal cluster quality, and language invariance. The results show that it may be sensible to use embedding models that were already trained on other crisis datasets. However, one must ensure that the training data generalizes enough, so that the clustering can adapt to new situations. Confirming this, we found out that some embeddings were not able to perform as well on a German dataset as on an English dataset.

    @inproceedings{bayer_information_2021,
    title = {Information {Overload} in {Crisis} {Management}: {Bilingual} {Evaluation} of {Embedding} {Models} for {Clustering} {Social} {Media} {Posts} in {Emergencies}},
    url = {https://aisel.aisnet.org/ecis2021_rp/64/},
    abstract = {Past studies in the domains of information systems have analysed the potentials and barriers of social media in emergencies. While information disseminated in social media can lead to valuable insights, emergency services and researchers face the challenge of information overload as data quickly exceeds the manageable amount. We propose an embedding-based clustering approach and a method for the automated labelling of clusters. Given that the clustering quality is highly dependent on embeddings, we evaluate 19 embedding models with respect to time, internal cluster quality, and language invariance. The results show that it may be sensible to use embedding models that were already trained on other crisis datasets. However, one must ensure that the training data generalizes enough, so that the clustering can adapt to new situations. Confirming this, we found out that some embeddings were not able to perform as well on a German dataset as on an English dataset.},
    booktitle = {Proceedings of the {European} {Conference} on {Information} {Systems} ({ECIS})},
    author = {Bayer, Markus and Kaufhold, Marc-André and Reuter, Christian},
    year = {2021},
    keywords = {Crisis, Projekt-ATHENE-SecUrban, Projekt-CYWARN, SocialMedia, A-Paper, Ranking-CORE-A},
    }

  • Marc-André Kaufhold, Markus Bayer, Daniel Hartung, Christian Reuter (2021)
    Design and Evaluation of Deep Learning Models for Real-Time Credibility Assessment in Twitter
    30th International Conference on Artificial Neural Networks (ICANN2021), Bratislava.
    [BibTeX]

    @inproceedings{kaufhold_design_2021,
    address = {Bratislava},
    title = {Design and {Evaluation} of {Deep} {Learning} {Models} for {Real}-{Time} {Credibility} {Assessment} in {Twitter}},
    booktitle = {30th {International} {Conference} on {Artificial} {Neural} {Networks} ({ICANN2021})},
    author = {Kaufhold, Marc-André and Bayer, Markus and Hartung, Daniel and Reuter, Christian},
    year = {2021},
    keywords = {Projekt-ATHENE-SecUrban, Projekt-CYWARN, Security, Ranking-CORE-B},
    }

  • Marc-André Kaufhold, Jennifer Fromm, Thea Riebe, Milad Mirbabaie, Philipp Kuehn, Ali Sercan Basyurt, Markus Bayer, Marc Stöttinger, Kaan Eyilmez, Reinhard Möller, Christoph Fuchß, Stefan Stieglitz, Christian Reuter (2021)
    CYWARN: Strategy and Technology Development for Cross-Platform Cyber Situational Awareness and Actor-Specific Cyber Threat Communication
    Workshop-Proceedings Mensch und Computer.
    [BibTeX]

    @inproceedings{kaufhold_cywarn_2021,
    title = {{CYWARN}: {Strategy} and {Technology} {Development} for {Cross}-{Platform} {Cyber} {Situational} {Awareness} and {Actor}-{Specific} {Cyber} {Threat} {Communication}},
    booktitle = {Workshop-{Proceedings} {Mensch} und {Computer}},
    author = {Kaufhold, Marc-André and Fromm, Jennifer and Riebe, Thea and Mirbabaie, Milad and Kuehn, Philipp and Basyurt, Ali Sercan and Bayer, Markus and Stöttinger, Marc and Eyilmez, Kaan and Möller, Reinhard and Fuchß, Christoph and Stieglitz, Stefan and Reuter, Christian},
    year = {2021},
    keywords = {Projekt-CYWARN, Security},
    }

  • Thea Riebe, Tristan Wirth, Markus Bayer, Philipp Kuehn, Marc-André Kaufhold, Volker Knauthe, Stefan Guthe, Christian Reuter (2021)
    CySecAlert: An Alert Generation System for Cyber Security Events Using Open Source Intelligence Data
    International Conference on Information and Communications Security (ICICS).
    [BibTeX]

    @inproceedings{riebe_cysecalert_2021,
    title = {{CySecAlert}: {An} {Alert} {Generation} {System} for {Cyber} {Security} {Events} {Using} {Open} {Source} {Intelligence} {Data}},
    booktitle = {International {Conference} on {Information} and {Communications} {Security} ({ICICS})},
    author = {Riebe, Thea and Wirth, Tristan and Bayer, Markus and Kuehn, Philipp and Kaufhold, Marc-André and Knauthe, Volker and Guthe, Stefan and Reuter, Christian},
    year = {2021},
    keywords = {Projekt-ATHENE-SecUrban, Projekt-CYWARN, Security, UsableSec, Ranking-CORE-B},
    }

  • Markus Bayer, Marc-André Kaufhold, Christian Reuter (2021)
    Survey on Data Augmentation for Text Classification
    2021.
    [BibTeX] [Abstract] [Download PDF]

    Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data over regularizing the objective to limiting the amount of data used to protect privacy. Based on a precise description of the goals and applications of data augmentation (C1) and a taxonomy for existing works (C2), this survey is concerned with data augmentation methods for textual classification and aims to achieve a concise and comprehensive overview for researchers and practitioners (C3). Derived from the taxonomy, we divided more than 100 methods into 12 different groupings and provide state-of-the-art references expounding which methods are highly promising (C4). Finally, research perspectives that may constitute a building block for future work are given (C5).

    @techreport{bayer_survey_2021,
    title = {Survey on {Data} {Augmentation} for {Text} {Classification}},
    url = {http://arxiv.org/abs/2107.03158},
    abstract = {Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data over regularizing the objective to limiting the amount data used to protect privacy. Based on a precise description of the goals and applications of data augmentation (C1) and a taxonomy for existing works (C2), this survey is concerned with data augmentation methods for textual classification and aims to achieve a concise and comprehensive overview for researchers and practitioners (C3). Derived from the taxonomy, we divided more than 100 methods into 12 different groupings and provide state-of-the-art references expounding which methods are highly promising (C4). Finally, research perspectives that may constitute a building block for future work are given (C5).},
    author = {Bayer, Markus and Kaufhold, Marc-André and Reuter, Christian},
    year = {2021},
    }
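One of the simplest augmentation families covered by such surveys, word-level synonym replacement, can be sketched as follows. The synonym table and the example sentence are hypothetical; the survey groups more than 100 methods, many of which (e.g., generative approaches) are far more sophisticated than this.

```python
import random

# A tiny hand-made synonym table (hypothetical).
SYNONYMS = {
    "flood": ["inundation", "deluge"],
    "help": ["assistance", "aid"],
}

def synonym_replace(sentence, rng):
    """Replace each word that has a synonym entry with a random synonym."""
    out = []
    for word in sentence.split():
        out.append(rng.choice(SYNONYMS[word]) if word in SYNONYMS else word)
    return " ".join(out)

rng = random.Random(0)
# Generate three label-preserving variants of one training sentence.
augmented = [synonym_replace("we need help after the flood", rng) for _ in range(3)]
print(augmented)
```

The augmented variants keep the original label, so a classifier sees more lexical variety without additional annotation effort, which is exactly the low-data scenario such methods target.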

  • Marc-André Kaufhold, Markus Bayer, Christian Reuter (2020)
    Rapid relevance classification of social media posts in disasters and emergencies: A system and evaluation featuring active, incremental and online learning
    Information Processing & Management; 57(1):1–32.
    [BibTeX] [Abstract] [Download PDF]

    The research field of crisis informatics examines, amongst others, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos). However, the vast amount of data generated during large-scale incidents can lead to the issue of information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filtering out irrelevant messages, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis and relevance classification, (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies, (3) the evaluation of a well-performing Random Forest algorithm for relevance classification incorporating metadata from social media into a batch learning approach (e.g., 91.28%/89.19% accuracy, 98.3%/89.6% precision and 80.4%/87.5% recall with a fast training time with feature subset selection on the European floods/BASF SE incident datasets), as well as (4) an approach and preliminary evaluation for relevance classification including active, incremental and online learning to reduce the amount of required labeled data and to correct misclassifications of the algorithm by feedback classification. Using the latter approach, we achieved a well-performing classifier based on the European floods dataset by only requiring a quarter of labeled data compared to the traditional batch learning approach. Despite a lesser effect on the BASF SE incident dataset, still a substantial improvement could be determined.

    @article{kaufhold_rapid_2020,
    title = {Rapid relevance classification of social media posts in disasters and emergencies: {A} system and evaluation featuring active, incremental and online learning},
    volume = {57},
    url = {http://www.peasec.de/paper/2020/2020_KaufholdBayerReuter_RapidRelevanceClassification_IPM.pdf},
    abstract = {The research field of crisis informatics examines, amongst others, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos) from social media. However, the vast amount of data generated during large-scale incidents can lead to issue of information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filter out irrelevant messages, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis and relevance classification, (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies, (3) the evaluation of a well-performing Random Forest algorithm for relevance classification incorporating metadata from social media into a batch learning approach (e.g., 91.28\%/89.19\% accuracy, 98.3\%/89.6\% precision and 80.4\%/87.5\% recall with a fast training time with feature subset selection on the European floods/BASF SE incident datasets), as well as (4) an approach and preliminary evaluation for relevance classification including active, incremental and online learning to reduce the amount of required labeled data and to correct misclassifications of the algorithm by feedback classification. Using the latter approach, we achieved a well-performing classifier based on the European floods dataset by only requiring a quarter of labeled data compared to the traditional batch learning approach. Despite a lesser effect on the BASF SE incident dataset, still a substantial improvement could be determined.},
    number = {1},
    journal = {Information Processing \& Management},
    author = {Kaufhold, Marc-André and Bayer, Markus and Reuter, Christian},
    year = {2020},
    keywords = {Crisis, Projekt-ATHENE-SecUrban, SocialMedia, A-Paper, Ranking-ImpactFactor, Ranking-CORE-A, Ranking-WKWI-B, Projekt-emergenCITY},
    pages = {1--32},
    }

  • Christian Reuter, Wolfgang Schneider, Daniel Eberz, Markus Bayer, Daniel Hartung, Cemal Kaygusuz (2018)
    Resiliente Digitalisierung der kritischen Infrastruktur Landwirtschaft – mobil, dezentral, ausfallsicher
    Mensch und Computer 2018: Workshopband, Dresden, Germany.
    [BibTeX] [Abstract] [Download PDF]

    This paper addresses the increasing digitalization of the critical infrastructure of the food industry, focusing in particular on the resulting information technology consequences for attack and failure resilience in agriculture and the sectors that depend on it. In this context, the modernization of agricultural machinery and its networking, as well as cloud computing in agriculture, are analyzed, and measures to be taken towards a resilient structure are explained. In many areas, it is shown that the risk of production failure is neglected in favor of advantages such as increased yield and quality.

    @inproceedings{reuter_resiliente_2018,
    address = {Dresden, Germany},
    title = {Resiliente {Digitalisierung} der kritischen {Infrastruktur} {Landwirtschaft} - mobil, dezentral, ausfallsicher},
    url = {https://dl.gi.de/bitstream/handle/20.500.12116/16930/Beitrag_330_final__a.pdf},
    abstract = {Diese Arbeit befasst sich mit der zunehmenden Digitalisierung der kritischen Infrastruktur Ernährungswirtschaft und setzt den Fokus insbesondere auf die dadurch resultierenden informationstechnologischen Folgen bezüglich der Angriffs- und Ausfallsicherheit in der Landwirtschaft und von ihr abhängigen Sektoren. In diesem Kontext wird die Modernisierungen der Landmaschinen und deren Vernetzung sowie das Cloud-Computing in der Landwirtschaft analysiert und zu treffende Maßnahmen bezüglich einer resilienten Struktur erläutert. In vielen Bereichen wird dabei aufgezeigt, dass das Ausfallrisiko der Produktion zugunsten von Vorteilen wie Ertrags- und Qualitätssteigerung vernachlässigt wird.},
    booktitle = {Mensch und {Computer} 2018: {Workshopband}},
    publisher = {Gesellschaft für Informatik e.V.},
    author = {Reuter, Christian and Schneider, Wolfgang and Eberz, Daniel and Bayer, Markus and Hartung, Daniel and Kaygusuz, Cemal},
    editor = {Dachselt, Raimund and Weber, Gerhard},
    year = {2018},
    keywords = {Crisis, Projekt-KontiKat, Student, Infrastructure, RSF, Projekt-HyServ, Projekt-MAKI, Projekt-GeoBox},
    pages = {623--632},
    }