About the Spoken Academic Belgian Dutch Corpus (SABeD)

About the corpus application

The corpus application is developed by the Dutch Language Institute (INT). Its backend is BlackLab, a Lucene-based search engine for corpora with token-based annotation (https://blacklab.ivdnt.org/). The web-based frontend is a further development of the corpus-frontend application, which INT developed in CLARIN and CLARIAH projects (https://github.com/instituutnederlandsetaal/blacklab-frontend). Its design is inspired by the first version of the OpenSoNaR user interface by Tilburg University and Radboud University (https://github.com/Taalmonsters/WhiteLab2.0).

About SABeD

The Spoken Academic Belgian Dutch Corpus consists of 200 lectures given at higher education institutions in Flanders. The first 25 and the last 5 minutes of each lecture were transcribed with an ASR system tuned to Belgian Dutch. The utterances were then segmented manually, and the automatic transcription was corrected manually. More information about the project can be found at: https://www.arts.kuleuven.be/ling/language-education-society/projects/sabed.

Linguistic Annotation

The resulting text has been processed with the Ucto tokenizer (https://languagemachines.github.io/ucto/) and the Frog natural language processing (NLP) modules developed for Dutch (https://languagemachines.github.io/frog/). Part-of-speech tagging uses the D-Coi tagset, which is an extension of the CGN tagset.

Documentation for the CGN tagset and the extended version used for written Dutch can be found in:

Frank Van Eynde (2004). "Part of Speech Tagging en Lemmatisering van het Corpus Gesproken Nederlands" [English version]
Frank Van Eynde (2005). "Part of Speech Tagging en Lemmatisering van het D-Coi Corpus" (slightly extended version of the CGN tagset).
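
Tags in the CGN and D-Coi tagsets consist of a main part-of-speech label followed by a parenthesised, comma-separated list of features, for example N(soort,ev,basis,zijd,stan) for a common noun or LET() for punctuation. The sketch below shows one possible way to split such a tag into its label and features when post-processing exported results. The names ParsedTag and parse_tag are hypothetical helpers introduced for illustration; they are not part of Frog, BlackLab, or the corpus application.

```python
import re
from typing import NamedTuple, Tuple


class ParsedTag(NamedTuple):
    """A CGN/D-Coi style tag split into its parts (hypothetical helper type)."""
    pos: str                   # main part-of-speech label, e.g. "N" or "WW"
    features: Tuple[str, ...]  # feature values in the order they appear


# Uppercase label, then a parenthesised (possibly empty) feature list.
_TAG_RE = re.compile(r"^([A-Z]+)\((.*)\)$")


def parse_tag(tag: str) -> ParsedTag:
    """Split a tag such as 'N(soort,ev,basis,zijd,stan)' into label and features."""
    match = _TAG_RE.match(tag.strip())
    if match is None:
        raise ValueError(f"not a CGN-style tag: {tag!r}")
    pos, feature_str = match.groups()
    # An empty feature list, as in 'LET()', yields an empty tuple.
    features = tuple(f for f in feature_str.split(",") if f)
    return ParsedTag(pos, features)


if __name__ == "__main__":
    print(parse_tag("N(soort,ev,basis,zijd,stan)"))
    print(parse_tag("LET()"))
```

Keeping the features as an ordered tuple preserves the tagset's positional conventions; the meaning of each position is described in the Van Eynde documents listed above.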

References

Publication: Jolien Mathysen, Vincent Vandeghinste, Elke Peters and Patrick Wambacq (2024). Constructing SABeD: A Spoken Academic Belgian Dutch Corpus. Selected papers from the CLARIN Annual Conference 2023, pp. 153-163. https://doi.org/10.3384/ecp210001

Blog post: https://clariahvl.hypotheses.org/2310

Credits

When referring to the SABeD corpus, please use the following reference:

Mathysen, Jolien, Vincent Vandeghinste and Elke Peters (2024). Spoken Academic Belgian Dutch Corpus - SABeD (Version 1.1) (2025) [Online]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a3-c7

When referring to the SABeD data, please use the following reference:

Mathysen, Jolien, Vincent Vandeghinste and Elke Peters (2024). Spoken Academic Belgian Dutch Corpus - SABeD (Version 1.1) (2025) [Data set]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a3-a9

For BlackLab:

Software available at: https://github.com/instituutnederlandsetaal/BlackLab

For the corpus frontend:

Software available at: https://github.com/instituutnederlandsetaal/blacklab-frontend