The corpus application is developed by the Dutch Language Institute (INT). Its backend is BlackLab, a Lucene-based search engine developed for corpora with token-based annotation (https://blacklab.ivdnt.org/). The web-based frontend is a further development of the corpus-frontend application developed by INT (https://github.com/instituutnederlandsetaal/blacklab-frontend) in CLARIN and CLARIAH projects. Its design is inspired by the first version of the OpenSoNaR user interface by Tilburg University and Radboud University (https://github.com/Taalmonsters/WhiteLab2.0).
The Spoken Academic Belgian Dutch Corpus (SABeD) consists of 200 lectures given in higher education institutions in Flanders. The first 25 and the last 5 minutes of each lecture were transcribed using an automatic speech recognition (ASR) system tuned to Belgian Dutch. The output was then manually segmented into utterances, and the automatic transcription was manually corrected. More information about the project can be found at: https://www.arts.kuleuven.be/ling/language-education-society/projects/sabed.
The resulting text has been processed with the Ucto tokenizer (https://languagemachines.github.io/ucto/) and the Frog (https://languagemachines.github.io/frog/) natural language processing (NLP) modules developed for Dutch. Part-of-speech tagging uses the D-COI tagset, which is an extension of the CGN tagset.
Documentation for the CGN tagset and the extended version used for written Dutch can be found in:
Publication: Jolien Mathysen, Vincent Vandeghinste, Elke Peters and Patrick Wambacq (2024). Constructing SABeD: A Spoken Academic Belgian Dutch Corpus. Selected papers from the CLARIN Annual Conference 2023. pp. 153-163. https://doi.org/10.3384/ecp210001
Blog post: https://clariahvl.hypotheses.org/2310
When referring to the SABeD corpus, please use the following reference:
Mathysen, Jolien, Vincent Vandeghinste and Elke Peters (2024). Spoken Academic Belgian Dutch Corpus - SABeD (Version 1.1) (2025) [Online]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a3-c7
When referring to the SABeD Data, please use the following reference:
Mathysen, Jolien, Vincent Vandeghinste and Elke Peters (2024). Spoken Academic Belgian Dutch Corpus - SABeD (Version 1.1) (2025) [Data set]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a3-a9
For BlackLab:
Software available at: https://github.com/instituutnederlandsetaal/BlackLab
For the corpus frontend:
Software available at: https://github.com/instituutnederlandsetaal/blacklab-frontend