Warming up
Categories: Language Resources, Machine Translation, Natural Language Processing
Data for a full deployment of Language Resources and Technologies (LRs) are in short supply. On the one hand, rule-based methods, on which most commercial MT systems, for instance, are based, have not been able to cover all languages and all domains. On the other hand, statistical or ML-based methods, which need LR data from which to induce information, also face a shortage of ready-to-use material for all languages and domains.
In addition to the problems in achieving full coverage, the use of existing data is hindered by several factors:
1. Little understanding of the need for standards for the representation of data. This makes it difficult to use several sources together and, crucially, to evaluate the quality and particular value of a resource for a given application. Although several standards have been proposed for the representation of LR data, they are considered too research-oriented, undocumented, too abstract and too cumbersome to implement, as the return on investment is not obvious. Industry players have preferred to implement standards that are driven only by specific needs and lack a long-term vision.
2. Insufficient documentation of existing resources, which, in addition, are not maintained on a regular basis (updates, bug reporting and coverage enlargement) because they are normally the results of finite projects. A common, and again standard, way of representing and documenting linguistic information should be devised.
3. The LR market for most written resources is rather small and the legal framework is too complex.
These facts seem to point to the need for changes in the behavioural patterns and culture of LR-data consumers and providers. Given the breadth of the current landscape of LRs, are the changes needed along the following lines? We should start discussing this.
- The market of LR-data has to be rethought/remodelled by introducing collaborative strategies that overcome the current model based mostly on purely competitive terms (leading to the non-adherence to standards, to the repetition of work and efforts, etc). The supply of language resources is conditioned by traditional business behavioural patterns and culture that still overprotect products from competition.
- Not only academia but also the LR industry has to undergo a cultural change and recognize the high added value of participating in the creation of common pools of LR-data. Such a change requires all the stakeholders to move in unison, the creation of rules and guidelines for new forms of cooperation, and the fostering of a culture of mutual respect and fairness. Actions towards fostering such a change are a first priority for the field.
- The coverage problem is of such a nature and magnitude that strategies approaching or envisaging the full automation of LR-data production have to be promoted, alongside campaigns fostering evaluation in real-life applications. In this way, research can progressively approach the characteristics of the materials needed by industry, in size and in the granularity of the information contained.

I just wanted to comment briefly on the adoption and use of one specific LR standard by the industry. In my opinion this is a kind of vicious circle: on the one hand, no commercial company would ever invest the huge effort involved in exporting all its LRs (leaving aside purely commercial factors) unless it is dead sure that this standard is going to be THE standard used by everybody. On the other hand, experience and evidence show that potential standards (and even “unexpected” standards) only become actual “universal” standards by the sheer fact that companies, organisations, universities, etc. DO USE them. So, unless it is THE standard we won’t use it, and unless we DO USE it, it won’t be THE standard.
The question (well, one of them…) is how we can break out of this circle.
Yes, this is a circle unless a clear benefit motivates the change. So, which of these two possible benefits would convince you more to devote time/resources to mapping your resources to one standard?
A. The possibility of using services that save time/resources (e.g. cleaning a corpus of duplicates, or deriving a list of out-of-vocabulary or missing words) would motivate converting a corpus into UTF-8 or into an XCES format.
B. The possibility of joining a pool of resources (of the TAUS type) where you contribute your own resources and in return have the right to use the resources of others, which cover other languages and/or a variety of domains.
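As a rough illustration of the kind of service meant in option A, here is a minimal Python sketch, assuming the corpus is already available as a list of UTF-8 decoded lines and the lexicon as a plain set of word forms. The function names (`normalise`, `deduplicate`, `out_of_vocabulary`), the whitespace tokenisation and the sample data are all assumptions made for the example, not part of any existing service or standard.

```python
import unicodedata

# Hypothetical helpers: names and behaviour are assumptions for illustration.

def normalise(line):
    """NFC-normalise and collapse whitespace so trivial variants compare equal."""
    return " ".join(unicodedata.normalize("NFC", line).split())

def deduplicate(lines):
    """Drop empty lines and exact duplicates (after normalisation), keeping order."""
    seen = set()
    unique = []
    for line in lines:
        key = normalise(line)
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def out_of_vocabulary(lines, vocabulary):
    """Return the sorted, lower-cased tokens that are missing from the vocabulary."""
    known = {word.lower() for word in vocabulary}
    tokens = {
        token.strip(".,;:!?\"'()").lower()
        for line in lines
        for token in line.split()  # naive whitespace tokenisation
    }
    return sorted(t for t in tokens if t and t not in known)

# Toy data, invented for the example.
corpus = ["The cat sat.", "The  cat sat.", "A dog barked.", ""]
lexicon = {"the", "cat", "sat", "a", "dog"}

clean = deduplicate(corpus)
missing = out_of_vocabulary(clean, lexicon)
print(clean)
print(missing)
```

Such a service only works uniformly across contributors if everyone delivers text in the same encoding and layout, which is the motivation for conversion suggested above.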
Just a short comment relating to standards and the “warm up” sentence:
“The market of LR-data has to be rethought/remodelled by introducing collaborative strategies that overcome the current model based mostly on purely competitive terms (leading to the non-adherence to standards, to the repetition of work and efforts, etc).”
I have the impression that competition is still a crucial issue and that, in a sense, the lack of real competition is hindering the rise of standards. There is a tendency to regard a standard a little like ISO-9000 certification: if a big customer, or several customers, ask for it, it might become worthwhile (but only if my direct competitors are already qualified); otherwise, why should I? In my view competition (real and healthy competition) would enforce standards, provided the following guidelines are kept:
1) Availability of tools/utilities based on standards. This is what Nuria hinted at, and UTF-8 is a good example: I would be surprised if anyone in the language industry were now considering a different format. But can we go a little further than character encoding?
2) Availability of industry-oriented benchmarks or evaluation suites: these would make it possible to assess the quality of an industrial resource objectively and would, at the same time, foster competitiveness and push towards the adoption of standards. Unfortunately, such benchmarks would need to be more industry-oriented than the current ones. To give just one example: you cannot assume that, in order to evaluate dependency parsing, you start from fully disambiguated input (as happens in most cases), as this is not a real-world situation, although it is a much more interesting one for a scientific evaluation. But of course this is again a chicken-and-egg problem: the benchmarks should be produced mostly by academia in order to stay vendor-neutral, but from an academic point of view they are less interesting than more in-lab evaluations.
3) I think that customer maturity is crucial in the enforcement of standards. The fact is that in many cases NLP applications are more or less standalone, or obscurely integrated into solutions. So, on average, customers do not ask the vendor to comply with specific standards, simply because they do not see an advantage in doing so. In this respect it would be interesting to work on compliance from the point of view of consumer applications. To give a short example: Lucene is one of the most widely used search engines, and SOLR is becoming a standard for faceted search (aka “semantic search”). They (i.e. the Apache group) would be in a position to enforce standards, but they need to be convinced.