Publications
Addressing Data Gaps in Sustainability Reporting: A Benchmark Dataset for Greenhouse Gas Emission Extraction (2025)
Scientific Data (Nature)
With Jacob Beck, Anna Steinberg, Andreas Dimmelmeier, Laia Domenech Burin, Maurice Fehr, and Malte Schierholz.
Abstract:
Reliable company-level greenhouse gas (GHG) emissions data are essential for stakeholders addressing the climate crisis. However, existing datasets are often fragmented, inconsistent, and lack transparent methodologies, making it difficult to obtain reliable emissions data. To address this challenge, we present a gold standard dataset containing emission metrics extracted from 139 sustainability reports collected from company websites. This dataset acts as an intermediate step to validate and fine-tune models for large-scale extraction of emissions data from thousands of reports. We employ a Large Language Model (LLM)-powered extraction pipeline to automatically extract emissions metrics. These values are then independently assessed by two non-expert annotators. Reports with full agreement are directly considered gold standard, while discrepancies undergo expert review in two stages, with remaining disagreements resolved through in-person discussions. This structured process ensures high data quality while reducing reliance on experts. Our dataset serves as a benchmark for human and automated annotation, with significant reuse potential for information extraction tasks in sustainable finance as well as other downstream tasks such as greenwashing analysis.
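As a rough illustration of the agreement step described in the abstract, the routing of reports into "gold standard" and "expert review" could be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `route_reports`, the per-metric labels, and the toy report IDs are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed data shape: report id -> {metric name -> annotator's assessment}
Assessments = Dict[str, Dict[str, str]]


def route_reports(annotator_a: Assessments,
                  annotator_b: Assessments) -> Tuple[List[str], List[str]]:
    """Split reports into full-agreement (gold) and discrepancy (expert review)."""
    gold, review = [], []
    # Only reports assessed by both annotators can be compared.
    for report_id in sorted(annotator_a.keys() & annotator_b.keys()):
        if annotator_a[report_id] == annotator_b[report_id]:
            gold.append(report_id)      # every metric matches: accept directly
        else:
            review.append(report_id)    # any mismatch: escalate to experts
    return gold, review


# Toy input (hypothetical labels, not data from the paper)
annotator_a = {
    "report_1": {"scope1": "correct", "scope2": "correct"},
    "report_2": {"scope1": "correct", "scope2": "incorrect"},
}
annotator_b = {
    "report_1": {"scope1": "correct", "scope2": "correct"},
    "report_2": {"scope1": "correct", "scope2": "correct"},
}
gold, review = route_reports(annotator_a, annotator_b)
print("gold:", gold, "| expert review:", review)
```

The later stages mentioned in the abstract (two-stage expert review and in-person discussion) are not modelled here.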
Informing climate risk analysis using textual information - A research agenda (2024)
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024) @ ACL 2024, available via this link.
With Malte Schierholz, Bolei Ma, Jacob Beck, Andreas Dimmelmeier, Hendrik Christian Doll, Maurice Fehr, Frauke Kreuter, and Alex Fraser.
Abstract:
We present a research agenda focused on efficiently extracting, assuring quality, and consolidating textual company sustainability information to address urgent climate change decision-making needs. Starting from the goal to create integrated FAIR (Findable, Accessible, Interoperable, Reusable) climate-related data, we identify research needs pertaining to the technical aspects of information extraction as well as to the design of the integrated sustainability datasets that we seek to compile. Regarding extraction, we leverage technological advancements, particularly in large language models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines, to unlock the underutilized potential of unstructured textual information contained in corporate sustainability reports. In applying these techniques, we review key challenges, which include the retrieval and extraction of CO2 emission values from PDF documents, especially from unstructured tables and graphs therein, and the validation of automatically extracted data through comparisons with human-annotated values. We also review how existing use cases and practices in climate risk analytics relate to choices of what textual information should be extracted and how it could be linked to existing structured data.
Houston, we have a problem: Can satellite information bridge the climate-related data gap? (2025)
Latin American Journal of Central Banking.
With Andrés Alonso Robisco, José Manuel Carbó Martínez, and Elena Triebskorn.
Presentations:
3rd Annual Workshop of the ESCB Research Cluster Climate Change
(Frankfurt, Germany)
2024 CEMLA Conference
(XXIX Meeting of the Central Bank Researchers Network, Center for Latin American Monetary Studies, Mexico City, México)
2024 YISF Annual Symposium
(Yale Initiative on Sustainable Finance Annual Symposium, Yale School of Management, New Haven, CT)
BIS-IFC Workshop on "Addressing climate change data needs: the global debate and central banks' contribution"
(Central Bank of the Republic of Türkiye, Izmir, 2024)
Abstract:
Central banks and international supervisors have identified the difficulty of obtaining climate information as one of the key obstacles to the development of green financial products and markets. To bridge this data gap, the use of satellite information from Earth Observation (EO) systems may be necessary. To better understand this process, we analyse the potential of applying satellite data to green finance. First, we summarise the policy debate from a central banking perspective. We then briefly describe the main challenges for economists in dealing with the EO data format and quantitative methodologies for measuring its economic materiality. Finally, using topic modelling, we perform a systematic literature review of recent academic studies to identify the research areas in which satellite data are currently being used in green finance. We find the following topics: physical risk materialisation (including both acute and chronic risk), deforestation, energy and emissions, agricultural risk, and land use and land cover. We conclude with a comprehensive analysis of the financial materiality of this alternative data source, a mapping of these application domains to new green financial instruments and markets under development, such as thematic bonds or carbon credits, and some key considerations for policy discussion.
Building a retrieval-augmented generation pipeline to trace administrative data use in academic papers (2025)
In: Foundations and Advances of Machine Learning in Official Statistics (pp. 347-373). Cham: Springer Nature Switzerland.
With Sebastian Seltmann and Hendrik Christian Doll.
Presentations:
Conference on Foundations and Advances of Machine Learning in Official Statistics
(DESTATIS - Federal Statistical Office of Germany, Wiesbaden, 2024)
3rd IFC Workshop on Data Science in Central Banking 2023
(Bank of Italy, Rome)
Abstract:
In national statistical offices and central banks, research data centers (RDCs) are tasked with providing access to granular administrative data and have proliferated in recent years. RDCs face challenges in tracing usage of their data in research papers: Tracing usage in papers to date relies on human readers and therefore remains time-consuming and error-prone. To address this, we explore the potential of using large language models (LLMs), specifically GPT-3.5, to automate the identification of data sources. Based on a comprehensive sample of research papers, we create a human-labeled validation dataset and analyze the accuracy of GPT-3.5 in detecting and summarizing data sources in economics and finance papers. Furthermore, we evaluate the detection and prediction accuracy and address the issue of false answers provided by the model. We find that LLMs can advance the status quo considerably: Our results indicate that LLMs can accurately identify dataset mentions verbatim in up to 62% of cases. When pairing the evaluation of the model with domain knowledge of the sought entities, i.e., allowing for the identification of synonymous dataset names, the model identifies 100% of these entities. Thus, we show that using LLMs to find unstructured dataset mentions can provide a valuable pathway for RDCs and data-providing institutions to gauge the impact of their data provision efforts. Our paper provides a detailed description of the pipeline to implement our solution at data-providing institutions, enabling the understanding of data impact and therefore efficient data provision services.
Working papers
Do Investors Use Sustainable Assets as Carbon Offsets? (2025)
SAFE Working Paper No. 431, available at SSRN.
With Jakob Famulok and Daniel Worring.
Presentations:
ASSA 2025
(IBEFA, International Banking, Economics, and Finance Association, part of the AEA/ASSA Annual Meeting 2025, San Francisco, CA)
Society for Experimental Finance (SEF)
(Stavanger 2024 and Sofia 2023)
NYU Stern / NY Fed Climate Finance Conference 2024
ZEW Conference on Ageing and Sustainable Finance 2024
(ZEW Mannheim)
EFA 2023
(Amsterdam, Annual Meeting of the European Finance Association, poster session)
CEPR European Conference on Household Finance 2023
(Torino)
DGF 2023
(29th Annual Meeting of the German Finance Association, Hohenheim)
16th International Conference on Computational and Financial Econometrics
(CFE 2022, King's College, London)
Tri-City Day-Ahead Workshop on the Future of Financial Intermediation 2022
(Frankfurt School of Finance & Management, Bayes Business School, University of Zurich, and Leibniz Institute SAFE in coordination with Regulating Financial Markets)
Prizes:
2023 Vernon L. Smith Young Talent Award in Experimental Finance (visit this link for more information)
Abstract:
We present novel evidence that retail investors attempt to offset their carbon footprints by investing sustainably. Using highly granular transaction data from bank clients, we find that higher footprints are linked to greener portfolios. In an experiment with clients from the same bank, we show that an exogenous shock to the participants’ salience of their emissions causally shifts sustainable asset allocations upward. Finally, we identify a substitution effect between offsetting through donations and sustainable assets. Our findings add to an understanding of the behavioral drivers of sustainable investing, which is crucial to design effective policies aligning financial markets with environmental goals.
Do Gamblers Invest in Lottery Stocks? (2023)
With Tobin Hanspal and Andreas Hackethal.
SAFE Working Paper No. 373, available at SSRN.
Reject & Resubmit at Management Science.
Presented at:
6th Household Finance Workshop 2022 (Leibniz Institute SAFE)
Media:
Frankfurter Allgemeine Zeitung (F.A.Z.), on February 19, 2023
Abstract:
Previous studies document a relationship between gambling activity at the aggregate level and investments in securities with lottery-like features. We combine data on individual gambling consumption with portfolio holdings and trading records to examine whether gambling and trading act as substitutes or complements. We find that gamblers are more likely than the average investor to hold lottery stocks, but significantly less likely than active traders who do not gamble. Our results suggest that gambling behavior across domains is less relevant compared to other portfolio characteristics that predict investing in high-risk and high-skew securities, and that gambling on and off the stock market act as substitutes to satisfy the same need, e.g., sensation seeking.
The "President Reacts to News" Channel of Government Communication (2023)
With Farshid Abdi, Loriana Pelizzon, Mila Getmansky Sherman, and Zorka Simon.
SAFE Working Paper No. 314, available at SSRN.
Presented at:
28th Annual Meeting of the German Finance Association 2022 (DGF)
Annual Meeting of the Swiss Society for Financial Market Research (SGF Conference) 2022
15th International Conference on Computational and Financial Econometrics (CFE) 2021
Annual Financial Market and Liquidity Conference (AFML) Budapest 2021
Abstract:
Studying about 1,200 economy-related tweets of President Trump, we establish the "President reacts to news" channel of stock returns. Using high-frequency identification of market movements and machine learning to classify the topics and textual sentiment of tweets, we address the observed heterogeneity in the aggregate stock market response to these messages. After controlling for market trends preceding tweets, we find that 80% of tweets are reactive and predictable rather than novel and informative. The exceptions are trade war tweets, where the President has direct policy authority, and his tweets can reveal investable private information or information about his policy function.
Policy papers and gray literature
The climate data iceberg – A depth of information to integrate (2024)
With Hendrik Christian Doll, Susanne Walter, and Gabriela Alves Werb.
Bank for International Settlements (BIS) Bulletin Vol. 66, Irving-Fisher Committee on Central Bank Statistics.
Available via this link.
Presentations:
Conference of European Statistics Stakeholders (CESS) 2024
(Paris, France)
12th biennial IFC Conference on "Statistics and beyond: new data for decision making in central banks"
(BIS Basel, Switzerland)
Abstract:
Central banks need climate-related data to align evidence-based climate change considerations with their core tasks. While structured data from administrative and proprietary sources are limited and contain considerable gaps, a wealth of climate-related information is dispersed and lies below the surface in unstructured form, such as sustainability reports or satellite images. To characterise this situation, we introduce the image of the climate data iceberg. Information from unstructured sources can bridge current data gaps and enhance the usability of existing data by improving its accuracy, extending its scope, and reducing data sharing barriers. In this paper, we discuss the challenges and opportunities central banks and supervisors face in leveraging this unstructured information for climate analysis and research. We further investigate how innovative efforts between central banks and other institutions can help generate actionable and usable climate-related data, exemplified by our own experiences and early-stage learnings from such collaborations.
Selected work in progress
Extraction of CO2 emissions from corporate sustainability reports (2024)
With Malte Schierholz, Anna Steinberg, Jacob Beck, Laia Domenech Burin, and Lisa Reichenbach.
Accepted for presentation at the 65th ISI World Statistics Congress 2025, The Hague, NL.
This project is part of the larger research agenda GIST - Greenhouse Gas Insights and Sustainability Tracking, a research collaboration between Deutsche Bundesbank and LMU Munich to generate high-quality, granular firm-level emission and sustainability data. More information can be found here.
Abstract:
Financial regulators and central banks are increasingly integrating sustainability aspects into their operations, but significant data gaps remain. The Corporate Sustainability Reporting Directive (CSRD) requires all large European enterprises to annually publish their greenhouse gas emissions (CO2-equivalents) in their management report, annual report, or sustainability report. The amount of information available is immense, covering the value and unit for each scope (Scope 1: direct emissions; Scope 2: indirect energy-related emissions; Scope 3: other indirect emissions), but the data are spread over thousands of PDF documents published on company websites, historically often without abiding by official standards or guidelines. Until now, private companies have extracted carbon emissions and other indicators from these PDF documents and sold them in a structured, tabular format to the Bundesbank and other public authorities. However, although extracting values from PDF documents appears to pose little difficulty, the agreement between values extracted by different companies is rather low. Given this unsatisfactory situation, we leverage Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to build several fully automated data extraction pipelines, which are then compared with data bought from private providers and evaluated using a specially curated gold standard dataset of our own. Open-source software is shared with the community, enabling everyone to extract CO2-related indicators from company sustainability reports.
ClimXtract: An open-source data extraction pipeline for company-level greenhouse gas emissions (2024)
With Anna Steinberg, Laia Domenech Burin, Ailin Liu, Malte Schierholz, Andreas Dimmelmeier, Lisa Reichenbach, and Maurice Fehr.
Accepted for presentation at DagStat 2025 (7th Joint Statistical Meeting of the Deutsche Arbeitsgemeinschaft Statistik, Berlin, Germany).
This project is part of the larger research agenda GIST - Greenhouse Gas Insights and Sustainability Tracking, a research collaboration between Deutsche Bundesbank and LMU Munich to generate high-quality, granular firm-level emission and sustainability data. More information can be found here.
Abstract:
Developing methods to supervise and regulate companies’ contributions to the climate crisis through emissions requires access to consistent and reliable data on company-level greenhouse gas (GHG) emissions. Currently, companies publish their sustainability reports (PDF format), which document GHG releases in a non-standardized and unstructured manner, and upload them to their websites instead of a central repository. Commercial data providers collect GHG emission data from the PDF files and other sources through non-transparent methods, raising doubts concerning the validity of the traded data. As an alternative, we present an open-source data extraction pipeline, which extracts the emissions for a given corporate sustainability report using a Large Language Model (LLM). The pipeline (1) identifies relevant pages in a sustainability report, (2) prompts the LLM for the emission value and unit of each scope (direct, indirect energy-related or other indirect emissions) and (3) parses the output to save the emission data in a database. Since emission values are often captured in tables inside company reports, we augment our pipeline with a table-specific extraction routine. Evaluation of our pipeline is achieved using a curated gold standard data set validated in a multi-stage annotation process. Our pipeline can be easily scaled up and is set up modularly for fast adaptability.
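The three pipeline steps named in the abstract can be illustrated with a toy sketch. This is not the ClimXtract code: the keyword-based page filter, the prompt wording, the regular expression, the table schema, and the stand-in `fake_llm` function are all assumptions made for illustration, and the table-specific routine is omitted.

```python
import re
import sqlite3
from typing import Callable, List, Optional, Tuple

SCOPE_KEYWORDS = ("scope 1", "scope 2", "scope 3")  # assumed relevance heuristic


def find_relevant_pages(pages: List[str]) -> List[int]:
    """Step 1: indices of pages that mention any emission scope."""
    return [i for i, text in enumerate(pages)
            if any(k in text.lower() for k in SCOPE_KEYWORDS)]


def parse_answer(answer: str) -> Optional[Tuple[float, str]]:
    """Step 3 (parsing): read a 'value unit' pair such as '1,200 tCO2e'."""
    match = re.search(r"(\d[\d,]*\.?\d*)\s*([A-Za-z]\w*)", answer)
    if match is None:
        return None  # model reported no usable number
    return float(match.group(1).replace(",", "")), match.group(2)


def extract_emissions(pages: List[str],
                      ask_llm: Callable[[str], str],
                      conn: sqlite3.Connection) -> None:
    """Run steps 1 to 3 for one report and store the results."""
    conn.execute("CREATE TABLE IF NOT EXISTS emissions "
                 "(scope INTEGER, value REAL, unit TEXT)")
    context = "\n".join(pages[i] for i in find_relevant_pages(pages))
    for scope in (1, 2, 3):
        # Step 2: prompt the model for value and unit of each scope.
        prompt = f"Report the Scope {scope} emission value and unit.\n\n{context}"
        parsed = parse_answer(ask_llm(prompt))
        if parsed is not None:
            conn.execute("INSERT INTO emissions VALUES (?, ?, ?)",
                         (scope, parsed[0], parsed[1]))
    conn.commit()


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call, with canned answers."""
    scope = re.search(r"Scope (\d)", prompt).group(1)
    return {"1": "1,200 tCO2e", "2": "850.5 tCO2e"}.get(scope, "not reported")


pages = ["About us", "Scope 1: 1,200 tCO2e; Scope 2: 850.5 tCO2e"]
conn = sqlite3.connect(":memory:")
extract_emissions(pages, fake_llm, conn)
rows = conn.execute("SELECT scope, value, unit FROM emissions ORDER BY scope").fetchall()
print(rows)
```

Passing the model call in as a function keeps the sketch modular: a real LLM client could replace `fake_llm` without touching the page filter or the parser.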
GeoCSR: Leveraging Geospatial Data from Corporate Reports for Sustainable Finance Insights (2025)
With Felicitas Sommer, Andreas Dimmelmeier, and Christophe Christiaen.
This project is part of the larger research agenda GIST - Greenhouse Gas Insights and Sustainability Tracking, a research collaboration between Deutsche Bundesbank and LMU Munich to generate high-quality, granular firm-level emission and sustainability data. More information can be found here.
Includes upcoming presentations. Unpublished papers available upon request.