Prispevki za novejšo zgodovino

Treebanking Spoken Slovenian

New Data, Models, and Lessons Learned

Co-author(s): Jure Gašparič (editor-in-chief), Mojca Šorn (editor), Andreja Jezernik (copy editor), Cody J. Inglis (copy editor), Studio S.U.R (copy editing, translation)

Year: 2025

Publisher(s): Inštitut za novejšo zgodovino, Ljubljana

Source(s): Prispevki za novejšo zgodovino, 2025, no. 3

Language(s): Slovenian, English

Type(s) of material: text

Keywords: corpus annotation, dependency treebank, spontaneous speech, parsing, Universal Dependencies

Identifier: https://doi.org/10.51663/pnz.65.3.01

Rights:

This work by Kaja Dobrovoljc is licensed under Creative Commons Attribution-ShareAlike 4.0 International

Files (1)

Name: PNZ_03_2025.pdf

Size: 12.31 MB

Format:

Permanent link: https://hdl.handle.net/11686/file61994

Description

This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, part-of-speech tags, morphological features, and syntactic dependencies, recently expanded with over 3,000 newly annotated utterances. After a brief overview of the data sampling, annotation, and consolidation processes (presented in detail in previous work), we evaluate the significance of this new language resource for both linguistic research and natural language processing by first highlighting its distinctive lexical and morphosyntactic features in comparison to written language, and then assessing their impact on the performance of tools for automatic grammatical annotation. Finally, we reflect on the methodological insights gained during treebank creation, discuss the potential of SST for advancing spoken language research, and argue for the necessity of such resources in supporting linguistic diversity in language technology.

Metadata (13)

identifier
- https://hdl.handle.net/11686/71600
title
- Treebanking Spoken Slovenian
- New Data, Models, and Lessons Learned
- Drevesnica govorjene slovenščine
- Novi podatki, modeli in ključni nauki
creator
- Kaja Dobrovoljc
contributor
- Jure Gašparič (editor-in-chief)
- Mojca Šorn (editor)
- Andreja Jezernik (copy editor)
- Cody J. Inglis (copy editor)
- Studio S.U.R (copy editing, translation)
subject
- Universal Dependencies
- corpus annotation
- dependency treebank
- spontaneous speech
- parsing
description
- This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, part-of-speech tags, morphological features, and syntactic dependencies, recently expanded with over 3,000 newly annotated utterances. After a brief overview of the data sampling, annotation, and consolidation processes (presented in detail in previous work), we evaluate the significance of this new language resource for both linguistic research and natural language processing by first highlighting its distinctive lexical and morphosyntactic features in comparison to written language, and then assessing their impact on the performance of tools for automatic grammatical annotation. Finally, we reflect on the methodological insights gained during treebank creation, discuss the potential of SST for advancing spoken language research, and argue for the necessity of such resources in supporting linguistic diversity in language technology.
- This paper presents a new version of the Spoken Slovenian Treebank (SST), a balanced and representative collection of transcribed spontaneous speech with manually annotated lemmas, parts of speech, morphological features, and syntactic dependencies, which was recently expanded with more than 3,000 newly parsed utterances. After a brief overview of the sampling, annotation, and consolidation of the corpus data (presented in more detail in an earlier paper), we illustrate the significance of this language resource for research in linguistics and natural language processing. By comparing the spoken and written treebanks, we first highlight the lexical and morphosyntactic characteristics of speech in comparison with written language, and then present their impact on the performance of tools for automatic grammatical parsing of speech transcriptions. Finally, we present the methodological experience gained during the development of the treebank, discuss its potential for further research on spoken language, and emphasize the importance of such resources for addressing linguistic diversity in the development of language technologies.
publisher
- Inštitut za novejšo zgodovino
date
- 2025
type
- text
identifier
- identifier: https://doi.org/10.51663/pnz.65.3.01
language
- Slovenian
- English
isPartOf
- https://hdl.handle.net/11686/71598
rights
- license: ccBySa
