Publications

You can also find my articles on my Google Scholar profile.

Conference Papers


Something’s Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks

Published in Table Representation Learning Workshop @ ACL, 2025

Recent table representation learning and data discovery methods tackle table union search (TUS) within data lakes, which involves identifying tables that can be unioned with a given query table to enrich its content. These methods are commonly evaluated using benchmarks that aim to assess semantic understanding in real-world TUS tasks. However, our analysis of prominent TUS benchmarks reveals several limitations that allow simple baselines to perform surprisingly well, often outperforming more sophisticated approaches. This suggests that current benchmark scores are heavily influenced by dataset-specific characteristics and fail to effectively isolate the gains from semantic understanding. To address this, we propose essential criteria for future benchmarks to enable a more realistic and reliable evaluation of progress in semantic table union search.

Citation: Boutaleb, A., Amann, B., Naacke, H., & Angarita, R. (2025). Something's Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks. Proceedings of Table Representation Learning Workshop @ ACL 2025.
Download Paper

HEARTS: Hypergraph-based Related Table Search

Published in RLGMSD - ELLIS workshop on Representation Learning and Generative Models for Structured Data, 2025

Recent related table search methods leverage tabular representation learning and language models to encode tables into vector representations for efficient semantic search. The main challenge for these models is to retain the essential structural properties of tabular data. Graph neural networks have been shown to be effective in addressing certain challenges, such as row/column permutation sensitivity and multi-table representation. In this context, we present HEARTS, a related-table search system powered by HyTrel, a hypergraph-enhanced Tabular Language Model (TaLM). By representing tables as hypergraphs with cells as nodes and rows, columns, and tables as hyperedges, HyTrel preserves relational properties such as row and column order invariance, making it a robust solution for related table search tasks.

Citation: Boutaleb, A., Almutawa, A., Amann, B., Angarita, R., & Naacke, H. (2025). HEARTS: Hypergraph-based Related Table Search. ELLIS Workshop on Representation Learning and Generative Models for Structured Data.
Download Paper

BERTrend: Neural Topic Modeling for Emerging Trends Detection

Published in FuturED @ EMNLP - Workshop on the Future of Event Detection, 2024

Detecting and tracking emerging trends and weak signals in large, evolving text corpora is vital for applications such as monitoring scientific literature, managing brand reputation, surveilling critical infrastructure, and, more generally, any kind of text-based event detection. Existing solutions often fail to capture the nuanced context or dynamically track evolving patterns over time. BERTrend, a novel method, addresses these limitations using neural topic modeling in an online setting. It introduces a new metric to quantify topic popularity over time by considering both the number of documents and update frequency. This metric classifies topics as noise, weak, or strong signals, flagging emerging, rapidly growing topics for further investigation. Experimentation on two large real-world datasets demonstrates BERTrend’s ability to accurately detect and track meaningful weak signals while filtering out noise, offering a comprehensive solution for monitoring emerging trends in large-scale, evolving text corpora. The method can also be used for retrospective analysis of past events. In addition, combining Large Language Models with BERTrend offers an efficient means of interpreting event trends.

Citation: Boutaleb, A., Picault, J., & Grosjean, G. (2024). BERTrend: Neural Topic Modeling for Emerging Trends Detection. EMNLP 2024 - Proceedings of the Workshop on the Future of Event Detection (FuturED), 1–17. Association for Computational Linguistics.
Download Paper