There is a shortage of data for the full deployment of Language Resources (LRs) and Technologies. On the one hand, rule-based methods, on which most commercial MT systems are based, for instance, have not been able to cover all languages and all domains. On the other hand, statistical or ML-based methods, which need LR data from which to induce information, also face a shortage of ready-to-use material for all languages and domains.
In addition to the problems in achieving full coverage, the use of existing data is hindered by several factors:
1. Little understanding of the need for standards for the representation of data. This makes it difficult to use several sources together and, crucially, to evaluate the quality and particular value of a resource for a given application. Although several standards have been proposed for the representation of LR data, they are considered too research-oriented, undocumented, too abstract and too cumbersome to implement, as the return on investment is not obvious. Industry players have preferred to implement standards that are driven too narrowly by specific needs and lack a long-term vision.
2. Insufficient documentation of existing resources, which, in addition, are not maintained on a regular basis (updates, bug reporting and coverage enlargement) because they are normally the results of projects of finite duration. A common and, again, standard way of representing and documenting linguistic information should be devised.
3. The LR market for most written resources is rather small, and the legal framework is too complex.
These facts suggest the need for changes in the behavioral patterns and culture of LR-data consumers and providers. Given the breadth of the current LR landscape, are changes needed along the following lines? We should start discussing this.
- The LR-data market has to be rethought and remodelled by introducing collaborative strategies that overcome the current model, which is based mostly on purely competitive terms (leading to non-adherence to standards, duplication of work and effort, etc.). The supply of language resources is conditioned by traditional business behavioral patterns and a culture that still overprotects products from competition.
- Not only academia but also the LR industry has to undergo a cultural change and recognize the high added value of participating in the creation of common pools of LR-data. Such a change requires all stakeholders to move in unison, the creation of rules and guidelines for new forms of cooperation, and the strengthening of a culture of mutual respect and fairness. Actions to foster such a change are a first priority for the field.
- The nature and magnitude of the coverage problem are such that strategies approaching or envisaging the full automation of LR-data production have to be promoted, alongside campaigns fostering evaluation in real-life applications. In this way, research can progressively approach the characteristics of the materials needed by industry, in both size and granularity of the information contained.
