CRITERIA: Evaluation Criteria for Quality Corpora in Artificial Intelligence
In recent years, the development of artificial intelligence (AI) systems based on Large Language Models (LLMs) has transformed how we interact with textual data, as well as the possibilities for generating, understanding, and analyzing human language on a large scale. In the fields of Digital Humanities and Natural Language Processing (NLP), these models have demonstrated unprecedented performance in tasks such as text generation, machine translation, summarization, semantic classification, and automatic question answering. However, this success has been accompanied by a fundamental challenge that still lacks a standardized and systematic solution: the quality of the data used to train these systems.
Traditionally, the predominant approach to training language models has focused on scale. Access to large volumes of data has been considered a competitive advantage for years, and it remains a key element in LLM design. However, recent studies have shown that the performance of these models depends not only on the quantity of data but also on its quality, diversity, and contextual relevance (Zhou et al., 2023). In this sense, corpus quality emerges as a critical factor, from both a technical and an ethical point of view, that can directly influence the behavior, robustness, and bias of the models.
Widespread access to digital linguistic data has largely eliminated barriers to the availability of textual content. Open sources like CommonCrawl offer trillions of words extracted from the web, enabling the training of models on an unprecedented volume of data. However, these web-based corpora are often riddled with redundant, irrelevant, biased, linguistically poor, or even toxic content. As Austermühl (2001) warned, obtaining data online is relatively easy, but identifying accurate and relevant information remains a complex process. This observation, already valid two decades ago, is more relevant than ever in the context of LLMs and NLP.
To address this problem, the research community has begun developing refined corpora built with strict cleaning filters, such as C4 (Raffel et al., 2020), RedPajama (Together Computer, 2023), SlimPajama (Soboleva et al., 2024), and DCLM-baseline (Li et al., 2024). These filters remove duplicate content, non-textual noise, spam, offensive text, low-quality linguistic data, and irrelevant documents, transforming raw data into clean, useful, and ethically responsible corpora for model training. More recently, datasets such as RefinedWeb (Penedo et al., 2023), FineWeb (Penedo et al., 2024a), and FineWeb-2 (Penedo et al., 2024b) have marked new milestones in the construction of optimized corpora, using modular pipelines that aim to guarantee the quality of the final corpus. However, there is no clear consensus or standard methodology that systematically defines and measures the quality of a corpus, which makes it difficult to evaluate these resources objectively and reproducibly.
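The kind of heuristic filtering described above can be sketched in a few lines. The sketch below is illustrative only: the thresholds, filter order, and function name are assumptions for exposition, not the actual rules used by C4, RefinedWeb, or FineWeb, whose pipelines are considerably more elaborate (model-based quality classifiers, fuzzy deduplication, URL blocklists, etc.).

```python
import hashlib

def clean_corpus(documents, min_words=20, max_symbol_ratio=0.3):
    """Apply simple heuristic filters of the kind used in corpus-cleaning
    pipelines. Thresholds are illustrative, not taken from any real system."""
    seen_hashes = set()
    cleaned = []
    for doc in documents:
        text = doc.strip()
        # 1. Exact deduplication via content hashing.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # 2. Discard very short documents (little linguistic value).
        words = text.split()
        if len(words) < min_words:
            continue
        # 3. Discard documents dominated by non-alphanumeric noise
        #    (markup remnants, spam, ASCII art, etc.).
        symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
        if symbols / max(len(text), 1) > max_symbol_ratio:
            continue
        cleaned.append(text)
    return cleaned
```

Even this toy version shows why such filters are lossy and hard to standardize: each threshold encodes an implicit, language-dependent judgment about what "quality" means.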
Furthermore, it is important to emphasize that most of these developments are designed almost exclusively for English, both in their linguistic coverage and in the quality criteria they employ. Consequently, the ecosystem of training resources for other languages, including Spanish, is considerably more limited, less systematized, and, in many cases, lacking in automated quality assessment tools. This resource inequality has significant implications for the training of multilingual models, as well as for the development of language applications in Spanish-speaking contexts. The scarcity of curated Spanish corpora (that is, corpora that have undergone cleaning, standardization, and even error correction), both in general and in specialized domains, limits model performance, increases bias, and reduces generalizability.
Given this scenario, it is essential to develop a robust, transparent, and reproducible methodological framework that allows for the computational evaluation of the quality of linguistic corpora. This framework will include clear criteria and specific metrics to facilitate the analysis of the impact of corpus quality on the behavior and performance of trained models. To this end, a mixed-methods approach will be adopted, combining the theoretical formulation of quality criteria with their empirical validation through experiments with models trained on corpora of varying quality levels.
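To make the idea of measurable quality criteria concrete, the sketch below computes a handful of simple corpus-level indicators. These indicators (duplicate ratio, average document length, type-token ratio) are common examples of quantifiable properties, chosen here for illustration; they are not the framework's definitive metric set, and the function name is an assumption.

```python
def corpus_quality_report(documents):
    """Compute simple, illustrative quality indicators for a corpus.
    These are examples of measurable criteria, not a proposed standard."""
    tokens = [tok.lower() for doc in documents for tok in doc.split()]
    n_tokens = len(tokens)
    n_types = len(set(tokens))          # distinct word forms
    unique_docs = len(set(documents))   # exact-match deduplication
    return {
        "documents": len(documents),
        # Share of documents that are exact duplicates of another.
        "duplicate_ratio": 1 - unique_docs / max(len(documents), 1),
        # Mean length in whitespace-separated tokens.
        "avg_doc_length": n_tokens / max(len(documents), 1),
        # Lexical diversity: distinct forms over total tokens.
        "type_token_ratio": n_types / max(n_tokens, 1),
    }
```

Reporting such indicators alongside a corpus is one low-cost step toward the objective, reproducible evaluation that the proposed framework aims to systematize.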
