PANACEA will participate in META-Forum 2010

PANACEA will soon be participating in META-Forum 2010. This event, which will take place in Brussels, on November 17-18, 2010, is the first edition of the annual conference series organized by META-NET (http://www.meta-net.eu).

Posted on by Núria Bel
Filed under: Conferences, Language Resources, Machine Translation, Natural Language Processing

WP8 defines requirements for PANACEA platform

Some requirements for the PANACEA platform are listed below; for a more complete list, consult the first deliverable from WP8: Evaluation in industrial environment.

If you think any requirements are missing, please let us know by e-mailing the details to: info@panacea-lr.eu

1.1. Functional Requirements
Req-FCT-001: Inspect available services
Req-FCT-002: Run a service
Req-FCT-004: Inspect input/output data
Req-FCT-008: Configure services into workflows
Req-FCT-009: Run workflows
1.2. Registry Requirements
Req-FCT-123: Announce a web service
Req-FCT-124: List web services
Req-FCT-125: Search web services
Req-FCT-126: Documentation and annotation of web services
1.3. Operational Requirements
Req-FCT-302: Speed / Waiting times
Req-FCT-303: Scalability
Req-FCT-305: Error Handling
Req-FCT-306: Validity Checks
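
The registry requirements above (announce, list, search) can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class names, fields, and matching rule are hypothetical and are not taken from the PANACEA platform or the WP8 deliverable.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """Hypothetical registry record; field names are illustrative only."""
    name: str
    description: str
    tags: list[str] = field(default_factory=list)


class Registry:
    """In-memory stand-in for a web service registry (not the real platform)."""

    def __init__(self) -> None:
        self._entries: dict[str, ServiceEntry] = {}

    def announce(self, entry: ServiceEntry) -> None:
        # Req-FCT-123: make a service known to the registry.
        self._entries[entry.name] = entry

    def list_services(self) -> list[str]:
        # Req-FCT-124: list registered services by name.
        return sorted(self._entries)

    def search(self, keyword: str) -> list[str]:
        # Req-FCT-125: case-insensitive match on description text or exact tag.
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.description.lower() or any(kw == t.lower() for t in e.tags)
        )


registry = Registry()
registry.announce(ServiceEntry("tokenizer", "Splits sentences into tokens", ["corpus"]))
registry.announce(ServiceEntry("aligner", "Aligns parallel texts at sentence level", ["corpus", "mt"]))
```

A real registry would also need persistence, documentation and annotation fields (Req-FCT-126), and validity checks on announced entries.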

Posted on by Núria Bel
Filed under: Evaluation, Language Resources, Machine Translation, Natural Language Processing

PANACEA defines its typical use and user

WP8 defines the typical user, who will carry out typical use cases on the PANACEA web service platform.

Typical use cases and operations that PANACEA web services will cover include the following:

Corpus Tasks

• Build a corpus by web crawling
• Process a corpus with different services: sentence-segment it, tokenize / lemmatize / tag it
• Align two parallel texts: on document level, on paragraph level, on sentence level
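
As a rough illustration of the "process a corpus with different services" use case, the sketch below chains two local functions the way a workflow might chain web services. The function names and the naive regular expressions are assumptions made for this example; they do not represent the actual PANACEA services.

```python
import re
from typing import List


def segment_sentences(text: str) -> List[str]:
    # Naive segmentation: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    # Naive tokenization: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def run_workflow(text: str) -> List[List[str]]:
    # Chain the two steps: each sentence is passed on to the tokenizer.
    return [tokenize(sentence) for sentence in segment_sentences(text)]


result = run_workflow("PANACEA builds corpora. It also aligns texts!")
```

In a service-based setting, each function would be replaced by a call to a remote service, with the output of one step serving as the input of the next.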

Dictionary tasks

• Input a corpus for dictionary extraction (general purpose or domain specific)
• Submit a corpus for dictionary gap identification
• Acquire corpora for new / unknown words
• Enlarge a dictionary by merging corpus-extracted information (on entry level), on transfer level and annotation level (additional translations)
• Trace word occurrences over time (‘word of the day’)

Extraction tasks

• Send a corpus to extract information items (named entities, or just key terms)
• Build an “Alerting System” (do texts match the alerting profile?) by inserting a service that detects dictionary gaps
• Construct a workflow for “Topic Assignment” by using services for keyword extraction and training a classifier with pre-annotated data.

Translation Tasks

• Use a crawling system to collect / add corpus data for SMT creation
• Send a corpus to create a Language Model for a specific language and / or a specific domain
• Send a parallel or aligned corpus to create your Translation Model (new language direction, new specific domain)
• Create / Adapt an (R)MT dictionary [with translations, with linguistic annotations (monolingual, transfer)]

Posted on by Núria Bel
Filed under: Language Resources, Machine Translation, Natural Language Processing, Use Cases, Users

Warming up

There is a shortage of data for a full deployment of Language Resources and Technologies (LRs). On the one hand, rule-based methods, on which most commercial MT systems are based, for instance, have not been able to cover all languages and all domains. On the other hand, statistical or ML-based methods, which need LR data to induce information, also face a shortage of ready-to-use material for all languages and domains.
In addition to the problems in achieving a full coverage, the use of existing data is hindered by several factors:

1. Little understanding of the need for standards for the representation of data. This makes it difficult to use several sources and, crucially, to evaluate the quality and particular value of a resource for a given application. Although several standards have been proposed for the representation of LR data, they are considered too research-oriented, poorly documented, too abstract and too cumbersome to implement, as the return on investment is not obvious. Industry has preferred to implement standards that are driven only by specific needs and lack a long-term vision.

2. Insufficient documentation of existing resources, which, in addition, are not regularly maintained (updates, bug reporting and coverage enlargement) because they are normally the results of finite projects. A common, again standard, way of representing and documenting linguistic information should be devised.

3. The LR market for most written resources is rather small, and the legal framework is too complex.

These facts point to a need for changes in the behavioral patterns and culture of LR-data consumers and providers. Given the breadth of the current LR landscape, are the changes needed along the following lines? We should start discussing this.

- The LR-data market has to be rethought/remodelled by introducing collaborative strategies that overcome the current model, which is based mostly on purely competitive terms (leading to non-adherence to standards, repetition of work and effort, etc.). The supply of language resources is conditioned by traditional business behavioural patterns and a culture that still overprotects products from competition.

- Not only academia but also the LR industry has to undergo a cultural change and recognize the high added value of participating in the creation of common pools of LR-data. Such a change requires all stakeholders to move in unison, rules and guidelines for new forms of cooperation, and a stronger culture of mutual respect and fairness. Actions to foster such a change are a first priority for the field.

- The coverage problem is of such a nature and magnitude that strategies approaching or envisaging the full automation of LR-data production have to be promoted, alongside campaigns that foster evaluation in real-life applications. In this way, research can progressively approach the characteristics of the materials needed by industry, in both size and granularity of the information contained.

Posted on by Núria Bel
Filed under: Language Resources, Machine Translation, Natural Language Processing