Prispevki za novejšo zgodovino

Treebanking Spoken Slovenian

New Data, Models, and Lessons Learned

Author(s): Kaja Dobrovoljc
Co-author(s): Jure Gašparič (editor-in-chief), Mojca Šorn (editor), Andreja Jezernik (language editor), Cody J. Inglis (language editor), Studio S.U.R (language editor, translator)
Year: 2025
Publisher(s): Inštitut za novejšo zgodovino, Ljubljana
Language(s): Slovenian, English
Type(s) of material: text
Identifier: https://doi.org/10.51663/pnz.65.3.01
Rights: This work by Kaja Dobrovoljc is licensed under Creative Commons Attribution-ShareAlike 4.0 International.

Files (1)
Name: PNZ_03_2025.pdf
Size: 12.31 MB
Description

This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, part-of-speech tags, morphological features, and syntactic dependencies, recently expanded with over 3,000 newly annotated utterances. After a brief overview of the data sampling, annotation, and consolidation processes (presented in detail in previous work), we evaluate the significance of this new language resource for both linguistic research and natural language processing by first highlighting its distinctive lexical and morphosyntactic features in comparison to writing, and then assessing their impact on the performance of tools for automatic grammatical annotation. Finally, we reflect on the methodological insights gained during treebank creation, discuss the potential of SST for advancing spoken language research, and argue for the necessity of such resources in supporting linguistic diversity in language technology.


Metadata (13)
  • identifier: https://hdl.handle.net/11686/71600
    • title
      • Treebanking Spoken Slovenian
      • New Data, Models, and Lessons Learned
      • Drevesnica govorjene slovenščine
      • Novi podatki, modeli in ključni nauki
    • creator
      • Kaja Dobrovoljc
    • contributor
      • Jure Gašparič (editor-in-chief)
      • Mojca Šorn (editor)
      • Andreja Jezernik (language editor)
      • Cody J. Inglis (language editor)
      • Studio S.U.R (language editor, translator)
    • subject
      • označevanje korpusov
      • odvisnostna drevesnica
      • spontani govor
      • razčlenjevanje
      • Universal Dependencies
      • corpus annotation
      • dependency treebank
      • spontaneous speech
      • parsing
    • description
      • This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, part-of-speech tags, morphological features, and syntactic dependencies, recently expanded with over 3,000 newly annotated utterances. After a brief overview of the data sampling, annotation, and consolidation processes (presented in detail in previous work), we evaluate the significance of this new language resource for both linguistic research and natural language processing by first highlighting its distinctive lexical and morphosyntactic features in comparison to writing, and then assessing their impact on the performance of tools for automatic grammatical annotation. Finally, we reflect on the methodological insights gained during treebank creation, discuss the potential of SST for advancing spoken language research, and argue for the necessity of such resources in supporting linguistic diversity in language technology.
      • This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, parts of speech, morphological features, and syntactic dependencies, which was recently expanded with more than 3,000 newly parsed utterances. After a brief overview of the procedures for sampling, annotating, and harmonizing the corpus data (presented in more detail in an earlier paper), we illustrate the importance of this language resource for research in linguistics and natural language processing. By comparing the spoken and written treebanks, we first highlight the lexical and morphosyntactic particularities of speech in comparison with written language, and then present their impact on the performance of tools for automatic grammatical parsing of speech transcriptions. Finally, we present the methodological experience gained in developing the treebank, discuss its potential for further research on spoken language, and emphasize the importance of such resources for addressing linguistic diversity in the development of language technologies.
    • publisher
      • Inštitut za novejšo zgodovino
    • date
      • 2025
    • type
      • text
    • identifier
      • identifier: https://doi.org/10.51663/pnz.65.3.01
    • language
      • Slovenian
      • English
    • isPartOf
    • rights
      • license: ccBySa