Prispevki za novejšo zgodovino

Leveraging a Morphological Lexicon for a Semi-Automatic Approach to Correcting Lemmas and Morphosyntactic Tags

Contributor(s): Jure Gašparič (editor-in-chief), Mojca Šorn (editor), Andreja Jezernik (language editor), Cody J. Inglis (language editor), Studio S.U.R (language editing, translation)

Year: 2025

Publisher(s): Inštitut za novejšo zgodovino, Ljubljana

Source(s): Prispevki za novejšo zgodovino, 2025, no. 3

Language(s): Slovene, English

Type(s) of material: text

Keywords: lemmatization, morphosyntactic tagging, training corpora, morphological lexicon, corpus annotation

Identifier: DOI: https://doi.org/10.51663/pnz.65.3.06

Rights:

This work by Jaka Čibej, Tina Munda is licensed under Creative Commons Attribution-ShareAlike 4.0 International

Files (1)

Name: PNZ_03_2025.pdf

Size: 12.31 MB

Permanent link: https://hdl.handle.net/11686/file61994

Description

In this paper, we present a new semi-automatic approach to correcting lemmas and morphosyntactic tags. Unlike previous manual annotation approaches for Slovene corpora, the new method adds a step in which tokens and their automatically assigned lemmas and morphosyntactic tags are cross-referenced with the set of forms included in the Sloleks Morphological Lexicon of Slovene. Based on the comparison, each token is classified into one of several annotation scenarios. The new approach noticeably reduced the time and resources invested in annotation by eliminating a large number of redundant tasks. A further advantage is that annotation tasks can be divided into groups of similar annotation problems (e.g. disambiguation of grammatical homographs). With adequate data preparation, the method also drastically reduces the need for annotators to be familiar with the extensive MULTEXT-East morphosyntactic tag set for Slovene, a requirement that created a bottleneck in similar annotation campaigns. The method was tested during the annotation of the ROG Training Corpus of Spoken Slovene. We also tested the scenario classification algorithm on the SUK Training Corpus of Written Slovene, which was annotated using the traditional sentence-by-sentence, token-by-token approach. We present the results and argue that the method should be used in future annotation campaigns to save resources and improve overall annotation consistency, and we discuss some of the caveats and disadvantages of the proposed approach.
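
The cross-referencing step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four scenario labels, the function names `build_lexicon` and `classify`, and the toy entries are all assumptions made for illustration, and the toy entries are not drawn from Sloleks. The sketch only shows the general idea of looking up a token's form in a lexicon and comparing the automatically assigned lemma and tag with the analyses listed there.

```python
from collections import defaultdict

# Hypothetical scenario labels; the paper's actual scenarios may differ.
CONFIRMED = "confirmed"        # form has one analysis and it matches
HOMOGRAPH = "homograph"        # assigned analysis is one of several for the form
MISMATCH = "mismatch"          # form is known, assigned analysis is not listed
UNKNOWN_FORM = "unknown_form"  # form is not in the lexicon at all


def build_lexicon(entries):
    """Index (form, lemma, tag) triples by lowercased form."""
    lexicon = defaultdict(set)
    for form, lemma, tag in entries:
        lexicon[form.lower()].add((lemma, tag))
    return lexicon


def classify(token, lemma, tag, lexicon):
    """Assign one token to a (hypothetical) annotation scenario."""
    analyses = lexicon.get(token.lower())
    if not analyses:
        return UNKNOWN_FORM
    if (lemma, tag) not in analyses:
        return MISMATCH
    if len(analyses) > 1:
        return HOMOGRAPH
    return CONFIRMED


# Toy entries for illustration only (not taken from Sloleks).
TOY_ENTRIES = [
    ("miza", "miza", "Ncfsn"),
    ("mize", "miza", "Ncfsg"),
    ("mize", "miza", "Ncfpn"),
    ("mize", "miza", "Ncfpa"),
]

if __name__ == "__main__":
    lex = build_lexicon(TOY_ENTRIES)
    for tok, lem, tag in [("miza", "miza", "Ncfsn"), ("mize", "miza", "Ncfsg")]:
        print(tok, lem, tag, classify(tok, lem, tag, lex))
```

Under these assumptions, tokens in the confirmed group could be skipped by annotators, while homograph cases could be batched together as one kind of disambiguation task.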

Metadata

identifier
- https://hdl.handle.net/11686/71605
title
- Leveraging a Morphological Lexicon for a Semi-Automatic Approach to Correcting Lemmas and Morphosyntactic Tags
- Uporaba oblikoslovnega leksikona pri polavtomatskem pristopu k popravljanju lem in oblikoskladenjskih oznak
creator
- Jaka Čibej
- Tina Munda
contributor
- Jure Gašparič (editor-in-chief)
- Mojca Šorn (editor)
- Andreja Jezernik (language editor)
- Cody J. Inglis (language editor)
- Studio S.U.R (language editing, translation)
subject
- lemmatization
- morphosyntactic tagging
- training corpora
- morphological lexicon
- corpus annotation
publisher
- Inštitut za novejšo zgodovino
date
- 2025
type
- text
identifier
- DOI: https://doi.org/10.51663/pnz.65.3.06
language
- Slovene
- English
isPartOf
- https://hdl.handle.net/11686/71598
rights
- license: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
