java - How to convert custom annotations to UIMA CAS structures and serialize them to XMI

Reposted. Author: 行者123. Updated: 2023-12-01 19:46:51

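I am having trouble converting custom annotated documents to UIMA CASes and then serializing them to XMI so that the annotations can be viewed in the UIMA Annotation Viewer GUI.

I am using uimaFIT to build my components because it makes them easier to control, test, and debug. The pipeline consists of 3 components:

  • A CollectionReader component that reads files containing raw text.
  • An Annotator component that converts the annotations from the custom documents to UIMA annotations.
  • A CasConsumer component that serializes the CASes to XMI.

My pipeline runs and outputs XMI files at the end, but without the annotations. I do not clearly understand how the CAS objects are passed between components. The annotator logic makes RESTful calls to certain endpoints, and I am trying to convert the annotation models returned by the client SDK that the service provides. The conversion logic of the Annotator component looks like this: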

    public class CustomDocumentToUimaCasConverter implements UimaCasConverter {
        private TypeSystemDescription tsd;

        private AnnotatedDocument startDocument;

        private ArrayFS annotationFeatureStructures;

        private int featureStructureArrayCapacity;

        public AnnotatedDocument getStartDocument() {
            return startDocument;
        }

        public CustomDocumentToUimaCasConverter(AnnotatedDocument startDocument) {
            try {
                this.tsd = TypeSystemDescriptionFactory.createTypeSystemDescription();
            } catch (ResourceInitializationException e) {
                LOG.error("Error when creating default type system", e);
            }
            this.startDocument = startDocument;
        }

        public TypeSystemDescription getTypeSystemDescription() {
            return this.tsd;
        }

        @Override
        public void convertAnnotations(CAS cas) {
            Map<String, List<Annotation>> entities = this.startDocument.entities;
            int featureStructureArrayIndex = 0;

            inferCasTypeSystem(entities.keySet());
            try {
                /*
                 * This is a hack allowing the CAS object to have an updated type system.
                 * We are creating a new CAS by passing the new TypeSystemDescription which actually
                 * should have been updated by an internal call of typeSystemInit(cas.getTypeSystem())
                 * originally part of the CasInitializer interface that is now deprecated and the CollectionReader
                 * is calling it internally in its implementation. The problem consists in the fact that now
                 * the typeSystemInit method of the CasInitializer_ImplBase has an empty implementation and
                 * nothing changes!
                 */
                LOG.info("Creating new CAS with updated typesystem...");
                cas = CasCreationUtils.createCas(tsd, null, null);
            } catch (ResourceInitializationException e) {
                LOG.info("Error creating new CAS!", e);
            }

            TypeSystem typeSystem = cas.getTypeSystem();
            this.featureStructureArrayCapacity = entities.size();
            this.annotationFeatureStructures = cas.createArrayFS(featureStructureArrayCapacity);

            for (Map.Entry<String, List<Annotation>> entityEntry : entities.entrySet()) {
                String annotationName = entityEntry.getKey();
                annotationName = UIMA_ANNOTATION_TYPES_PACKAGE + removeDashes(annotationName);
                Type type = typeSystem.getType(annotationName);

                List<Annotation> annotations = entityEntry.getValue();
                LOG.info("Get Type -> " + type);
                for (Annotation ann : annotations) {
                    AnnotationFS afs = cas.createAnnotation(type, (int) ann.startOffset, (int) ann.endOffset);
                    cas.addFsToIndexes(afs);
                    if (featureStructureArrayIndex + 1 == featureStructureArrayCapacity) {
                        resizeArrayFS(featureStructureArrayCapacity * 2, annotationFeatureStructures, cas);
                    }
                    annotationFeatureStructures.set(featureStructureArrayIndex++, afs);
                }
            }
            cas.removeFsFromIndexes(annotationFeatureStructures);
            cas.addFsToIndexes(annotationFeatureStructures);
        }

        @Override
        public void inferCasTypeSystem(Iterable<String> originalTypes) {
            for (String typeName : originalTypes) {
                // UIMA Annotations are not allowed to contain dashes
                typeName = removeDashes(typeName);
                tsd.addType(UIMA_ANNOTATION_TYPES_PACKAGE + typeName,
                        "Automatically generated type for " + typeName, "uima.tcas.Annotation");
                LOG.info("Inserted new type -> " + typeName);
            }
        }

        /**
         * Removes dashes from UIMA Annotations because they are not allowed to contain dashes.
         *
         * @param typeName the annotation name of the current annotation of the source document
         * @return the transformed annotation name suited for the UIMA typesystem
         */
        private String removeDashes(String typeName) {
            if (typeName.contains("-")) {
                typeName = typeName.replaceAll("-", "_");
            }
            return typeName;
        }

        @Override
        public void setSourceDocumentText(CAS cas) {
            cas.setSofaDataString(startDocument.text, "text/plain");
        }

        private void resizeArrayFS(int newCapacity, ArrayFS originalArray, CAS cas) {
            ArrayFS biggerArrayFS = cas.createArrayFS(newCapacity);
            biggerArrayFS.copyFromArray(originalArray.toArray(), 0, 0, originalArray.size());
            this.annotationFeatureStructures = biggerArrayFS;
            this.featureStructureArrayCapacity = annotationFeatureStructures.size();
        }
    }

If anyone has dealt with converting annotations to UIMA types, I would appreciate any help.

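Best answer

I think your understanding of CASes and annotations may be off:

From

    * This is a hack allowing the CAS object to have an updated type system.

    LOG.info("Creating new CAS with updated typesystem...");
    cas = CasCreationUtils.createCas(tsd, null, null);

I gather that you are trying to create a new CAS inside your annotator's process() method (I assume the code you posted is executed there). Unless you are implementing a CAS multiplier, this is not the way to do it. Normally, the collection reader ingests the raw data and populates a CAS for you in its getNext() method. That CAS is passed down the whole UIMA pipeline, and all you need to do is add UIMA annotations to it.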
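
One plausible reading of why the XMI comes out empty, sketched in plain Java with no UIMA dependency: Java passes object references by value, so assigning a new object to the `cas` parameter only rebinds the local variable. Annotations added afterwards land in the new object, while the object the pipeline holds stays untouched. `FakeCas` and both method names below are hypothetical stand-ins, not UIMA API.

```java
import java.util.ArrayList;
import java.util.List;

final class ReassignDemo {
    // Hypothetical stand-in for a CAS: just a list of annotation labels.
    static final class FakeCas {
        final List<String> annotations = new ArrayList<>();
    }

    // Mirrors the question: the parameter is rebound to a fresh object,
    // so the caller's object never receives the annotation.
    static void annotateIntoNewCas(FakeCas cas) {
        cas = new FakeCas();
        cas.annotations.add("Person");
    }

    // Mirrors the advice above: mutate the object that was passed in.
    static void annotateInPlace(FakeCas cas) {
        cas.annotations.add("Person");
    }

    public static void main(String[] args) {
        FakeCas pipelineCas = new FakeCas();

        annotateIntoNewCas(pipelineCas);
        System.out.println(pipelineCas.annotations.size()); // prints 0

        annotateInPlace(pipelineCas);
        System.out.println(pipelineCas.annotations.size()); // prints 1
    }
}
```

By the same reasoning, adding annotations to the CAS that was handed to the annotator is what lets a downstream consumer see them, provided the annotation types are already part of that CAS's type system.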

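For every annotation you want to add, UIMA needs to know its type from the type system. If you use JCasGen and the code it generates, this should not be a problem. Make sure your types can be detected automatically, as described here: http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e531

This lets you instantiate annotations as Java objects instead of using low-level FS calls. The following snippet adds an annotation over the whole document text. It should be straightforward to add logic that iterates over the tokens in the text and the (non-UIMA) annotations ingested from your web service.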

    @Override
    public void process(JCas aJCas) throws AnalysisEngineProcessException {
        String text = aJCas.getDocumentText();
        SomeAnnotation a = new SomeAnnotation(aJCas);
        // set the annotation properties
        // for each property, JCasGen should have
        // generated a setter
        a.setSomePropertyValue(someValue);
        // add your annotation to the indexes
        a.setBegin(0);
        a.setEnd(text.length());
        a.addToIndexes(aJCas);
    }

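To avoid getting the begin and end string indexes wrong, I suggest you use some Token annotation (from DKPro Core, for instance: https://dkpro.github.io/dkpro-core/ ), which you can use as an anchor for your custom annotations.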
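
To illustrate the anchoring idea without pulling in DKPro Core, here is a minimal plain-Java sketch. It assumes a whitespace tokenizer as a stand-in for a real Token annotator, and `tokenize` and `snapToTokens` are hypothetical helper names. It widens possibly imprecise character offsets from an external service to the boundaries of the tokens they overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenAnchor {
    // Whitespace tokenizer standing in for a real Token annotator.
    // Each token is a {begin, end} pair with an exclusive end offset.
    static List<int[]> tokenize(String text) {
        List<int[]> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            tokens.add(new int[] {m.start(), m.end()});
        }
        return tokens;
    }

    // Widens [begin, end) to the boundaries of every token it overlaps.
    // Returns null when the span overlaps no token at all.
    static int[] snapToTokens(List<int[]> tokens, int begin, int end) {
        int snappedBegin = -1;
        int snappedEnd = -1;
        for (int[] t : tokens) {
            boolean overlaps = t[0] < end && begin < t[1];
            if (overlaps) {
                if (snappedBegin < 0) {
                    snappedBegin = t[0];
                }
                snappedEnd = t[1];
            }
        }
        return snappedBegin < 0 ? null : new int[] {snappedBegin, snappedEnd};
    }

    public static void main(String[] args) {
        String text = "Barack Obama visited Paris";
        List<int[]> tokens = tokenize(text);
        // Offsets 2..10 cut through both name tokens.
        int[] span = snapToTokens(tokens, 2, 10);
        System.out.println(text.substring(span[0], span[1])); // prints Barack Obama
    }
}
```

In a UIMA annotator, the snapped offsets would then be passed to setBegin() and setEnd() of the custom annotation, with the token spans read from the Token annotations already in the CAS.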

For "java - How to convert custom annotations to UIMA CAS structures and serialize them to XMI", the original question is on Stack Overflow: https://stackoverflow.com/questions/28888962/
