Improving text-based search of inscriptions

Posted on behalf of Michelangelo Ceci1, Gianvito Pio1, Anita Rocco2

 

The Epigraphic Database Bari (EDB) stores inscriptions by the Christians of Rome dating from the 3rd to the 8th century. It provides a web-based system for searching almost all the Greek and Latin inscriptions published in the corpus Inscriptiones Christianae Vrbis Romae, nova series [ICVR]. For each epigraphic document, a set of data and metadata is stored, covering both the artifact/support (context, conservation, support, etc.) and the inscribed text (language, graphical and onomastic notes, etc.).

EDB provides an advanced text-based search system whose predefined query syntax lets users obtain different kinds of results. Users can also choose whether the search should take into account diacritical marks, Greek accents and breathings, and capital letters. The text-based search can be combined with other metadata as well, such as bibliographic data, context, conservation, support, dating, etc.
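To make the optional handling of diacritics and capital letters concrete, here is a minimal Python sketch. It is not EDB's actual implementation: the function names `normalize` and `matches` and the two option flags are hypothetical, and the approach (Unicode decomposition followed by removal of combining marks) is only one plausible way to obtain this behaviour.

```python
import unicodedata


def normalize(text, ignore_diacritics=True, ignore_case=True):
    """Return a comparison key for text (hypothetical helper, not EDB code)."""
    if ignore_diacritics:
        # Decompose each character into base letter + combining marks,
        # then drop the marks (accents, breathings, iota subscript, ...).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", text)
    if ignore_case:
        # casefold() is a more aggressive lower() intended for caseless matching.
        text = text.casefold()
    return text


def matches(query, inscription, **options):
    """True if the query occurs in the inscription under the chosen options."""
    return normalize(query, **options) in normalize(inscription, **options)


# Illustrative texts only, not taken from the database.
greek = "ἐκοιμήθη ἐν εἰρήνῃ"
print(matches("εν ειρηνη", greek))                           # marks ignored
print(matches("εν ειρηνη", greek, ignore_diacritics=False))  # marks significant
print(matches("DEPOSITUS", "depositus in pace"))             # case ignored
```

Applying the same normalization to both the query and the stored text means that each option can be switched on or off per search without touching the stored transcription.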

This wide range of possibilities allows users to retrieve the desired inscriptions according to different needs. For example, an occasional user looking for a specific inscription can type one or more words to find matching inscriptions. Scholars, on the other hand, can use the system to retrieve details about the inscriptions they are studying and, by using phrase matching, can identify all the epitaphs containing the so-called “formulas”, i.e. recurrent expressions that are useful, for example, for dating purposes.
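As an illustration of phrase matching for formulas, the sketch below looks for a sequence of words that must appear consecutively and in the same order. The function names, the identifiers `demo-1` to `demo-3` and the sample texts are all invented for the example; the real EDB query syntax is not reproduced here.

```python
import re


def tokenize(text):
    """Split a text into lower-cased word tokens."""
    return re.findall(r"\w+", text.casefold())


def contains_phrase(text, phrase):
    """True if the words of phrase occur consecutively, in order, in text."""
    tokens, wanted = tokenize(text), tokenize(phrase)
    n = len(wanted)
    return n > 0 and any(
        tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)
    )


def find_formula(phrase, inscriptions):
    """Return the ids of all inscriptions that contain the formula."""
    return [key for key, text in inscriptions.items()
            if contains_phrase(text, phrase)]


# Invented sample texts, used only to show the behaviour.
inscriptions = {
    "demo-1": "Felix depositus in pace",
    "demo-2": "In pace depositus Leo",
    "demo-3": "Hic requiescit in pace Maria",
}

print(find_formula("depositus in pace", inscriptions))
print(find_formula("in pace", inscriptions))
```

Matching on whole tokens instead of raw substrings keeps word order significant and avoids accidental hits inside longer words.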

EDB is a source of noteworthy importance for the study of the history of the Greek and Latin languages in Late Antiquity. In this period, language underwent a gradual transformation and was enriched with forms and expressions of common use. At the same time, the possibility that something initially appearing to be an important linguistic phenomenon is actually just a spelling mistake must not be ignored. For this reason, in EDB the so-called aberrant forms are not normalized to the classical model when they are grapho-phonetic outcomes of linguistic change. However, a standard query system cannot match a query against inscriptions that contain different spellings of a word. To address this issue, we store each inscription both in its original form and in a lemmatized form, where each term is replaced with its corresponding lemma. The user’s query is lemmatized as well, and the match is then performed between the lemmatized form of the transcription and the lemmatized form of the query.
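The idea of matching lemmatized queries against lemmatized transcriptions can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the tiny `LEMMAS` table, the function names and the sample texts are hypothetical, and a real system would rely on a proper Greek and Latin lemmatizer instead of a hand-written dictionary.

```python
import re

# Toy lemma table (hypothetical): classical and aberrant spellings of a
# word map to the same lemma, e.g. "bixit" and "vixit" both map to "vivo".
LEMMAS = {
    "vixit": "vivo", "bixit": "vivo",
    "pace": "pax", "pake": "pax",
    "annos": "annus", "annis": "annus",
}


def lemmatize(text):
    """Replace each token with its lemma; unknown tokens are kept as they are."""
    tokens = re.findall(r"\w+", text.casefold())
    return [LEMMAS.get(token, token) for token in tokens]


def build_record(text):
    """Store the original transcription next to its lemmatized form."""
    return {"original": text, "lemmatized": lemmatize(text)}


def search(query, records):
    """Match the lemmatized query against the lemmatized transcriptions."""
    wanted = lemmatize(query)
    n = len(wanted)
    hits = []
    for record in records:
        lemmas = record["lemmatized"]
        if n and any(lemmas[i:i + n] == wanted
                     for i in range(len(lemmas) - n + 1)):
            hits.append(record["original"])  # always show the original text
    return hits


# Invented sample texts, used only to show the behaviour.
records = [build_record(t) for t in (
    "Felix bixit annos XX",
    "Maria vixit annis XXX in pake",
    "Leo depositus in pace",
)]

print(search("vixit annos", records))
print(search("in pace", records))
```

Because the original form is stored alongside the lemmatized one, the search can find variant spellings while the results still display the text as it was transcribed. A side effect worth noting is that lemmatization also collapses inflected forms (here `annos` and `annis`), which may or may not be desirable depending on the query.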

In future work, we plan to use the lemmatized terms to automatically identify possible misspellings and currently unknown aberrant forms.

 

 

1Dip. di Informatica, Università degli Studi di Bari “Aldo Moro”

2Dip. di Scienze dell’Antichità e del Tardoantico, Università degli Studi di Bari “Aldo Moro”

E-mail: michelangelo.ceci@uniba.it, gianvito.pio@uniba.it, anita.rocco@uniba.it

