The aim of the project is to apply techniques that allow textual searches on massive collections of 15th-16th century manuscripts that contain key information for identifying the sunken remains of thousands of shipwrecks that occurred during that period.
The project will focus on 150,000 images from collections of interest to underwater archaeology that belong to the General Archive of the Indies and the Provincial Historical Archive of Cádiz. These are manuscripts related to travel and Spanish naval trade during the 15th to 19th centuries. For manuscripts of this type, OCR techniques, which are designed for printed text, are unusable. More modern techniques, developed specifically for handwritten material, generally produce transcriptions with too many errors to be useful when applied to most of the historical texts of interest.
However, Carabela's research team has developed machine learning methodologies that allow probabilistic indexing of handwritten text images. This indexing allows approximate (but effective) textual searches over massive collections of untranscribed historical manuscripts.
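As a rough illustration of how a search over a probabilistic index can work, the sketch below stores, for each word hypothesis, the probability that it appears in a given image, and answers a query by returning the images whose probability reaches a confidence threshold. This is a minimal sketch and not the project's actual system: the entries, image identifiers, probabilities, and the names `build_index` and `search` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (image id, word hypothesis, relevance probability).
# A real probabilistic index would be produced by a handwriting recognition
# model; these values are invented for illustration only.
TOY_ENTRIES = [
    ("img_001", "naufragio", 0.92),
    ("img_001", "navio", 0.40),
    ("img_002", "naufragio", 0.15),
    ("img_002", "navegacion", 0.55),
    ("img_003", "naufragio", 0.61),
]


def build_index(entries):
    """Map each word to {image_id: probability}, keeping the highest value."""
    index = defaultdict(dict)
    for image_id, word, prob in entries:
        previous = index[word].get(image_id, 0.0)
        index[word][image_id] = max(prob, previous)
    return index


def search(index, query, threshold=0.5):
    """Return (image_id, probability) pairs at or above the threshold,
    sorted from most to least confident."""
    candidates = index.get(query.lower(), {})
    hits = [(img, p) for img, p in candidates.items() if p >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


index = build_index(TOY_ENTRIES)
results = search(index, "naufragio", threshold=0.5)
```

Lowering the threshold returns more images at the cost of more false positives, which is the trade-off that makes the search approximate yet still usable on untranscribed material.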
These techniques will be adapted to the specific difficulties of Carabela's manuscripts, enabling manual searches for information in them. These manuscripts make up a large collection of documents on shipwrecks, the contents of which constitute an archaeological heritage of enormous historical and cultural importance.
The project goes one step further than indexing for manual searches. It will also develop new information retrieval techniques that allow valuable information to be extracted effectively from untranscribed text images. The final objective is to automatically classify handwritten text images according to the "level of risk" posed by their public exposure, in order to control access to them and prevent, as far as possible, the plundering of Spanish underwater heritage.
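One simple way to picture a risk classification built on word probabilities is sketched below: each image receives a score equal to the sum of the probabilities of a set of sensitive terms, and the score is mapped to a level. This is only a hypothetical sketch; the source does not describe the classifier the project will use, and the term list, thresholds, and the names `risk_score` and `risk_level` are assumptions.

```python
# Hypothetical list of sensitive terms; the real criteria are not given
# in the project description.
SENSITIVE_TERMS = {"naufragio", "oro", "plata"}


def risk_score(word_probs):
    """Sum the probabilities of sensitive terms for one image.

    word_probs maps each word hypothesis to its relevance probability,
    so the sum approximates the expected number of distinct sensitive
    terms present in the image.
    """
    return sum(p for word, p in word_probs.items() if word in SENSITIVE_TERMS)


def risk_level(score, medium=0.5, high=1.0):
    """Map a score to a level using assumed, illustrative thresholds."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


# Invented example images with invented probabilities.
level_a = risk_level(risk_score({"naufragio": 0.9, "oro": 0.7, "carta": 0.8}))
level_b = risk_level(risk_score({"naufragio": 0.6}))
level_c = risk_level(risk_score({"carta": 0.9}))
```

Working from probabilities rather than from a single best transcription means an uncertain but plausible mention of a sensitive term still contributes to the score.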