
matlab - Import from a text file and create a cell array in MATLAB

Reposted. Author: 行者123. Updated: 2023-12-04 01:42:29

I have a text file containing gene information, such as the is-a and part-of relations between genes.

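The text file contains one paragraph per GO term (a GO term is a node identified by a specific code number, e.g. GO:0030436). Each paragraph has: the GO term ID (the first line of the paragraph), its isa GO terms, if any (between the lines "isa:" and "end of isa"), and its partof GO terms, if any (between the lines "partof:" and "end of partof"). A small sample of the text file: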

GO:0030436
isa:
GO:0034297
GO:0043936
GO:0048315
end of isa
partof:
GO:0042243
end of partof
genes:
end of genes
GO:0034297
isa:
end of isa
partof:
end of partof
genes:
end of genes
GO:0043936
isa:
GO:0001410
GO:0034300
GO:0034301
GO:0034302
GO:0034303
GO:0034304
end of isa
partof:
end of partof
genes:
end of genes

I need to read this text file, extract these three pieces of data, and build a cell matrix with 3 columns, like this:

map=

ID GoTerms is_a partof
GO:0030436 GO:0034297 GO:0042243
GO:0030436 GO:0043936 0
GO:0030436 GO:0048315 0
GO:0034297 0 0
GO:0043936 GO:0001410 0
GO:0043936 GO:0034300 0
GO:0043936 GO:0034301 0
GO:0043936 GO:0034302 0
GO:0043936 GO:0034303 0
GO:0043936 GO:0034304 0

Note that if a GO term has more than one isa or partof term, the GO term ID should be repeated so that the cell matrix stays rectangular and well organized.

Any idea how to write this code?

I tried to write the code myself, but it does not work because I don't know how to collect more than one isa or partof term:

s={};
fid = fopen('Opt.pad'); % read from the certain text file
tline = fgetl(fid);
while ischar(tline)
s=[s;tline];
tline = fgetl(fid);
end
fclose(fid);
% find start and end positions of every [Term] marker in s
terms = [find(~cellfun('isempty', regexp(s, '\GO:\w*'))); numel(s)+1];
% for every [Term] section, run the previously implemented regexps
% and save the results into a map - a cell array with 3 columns
map = cell(0,3);
for term=1:numel(terms)-1
    % extract single [Term] data
    s_term = s(terms(term):terms(term+1)-1);
    % match regexps
    % To generate the GO_Terms vector from the text file
    tok = regexp(s_term, '^(GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    GO_Terms=cellfun(@(x)x{1}, (tok(idx)));
    % To generate the is_a relations vector from the text file
    tok = regexp(s_term, '^isa: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    is_a_relations =cellfun(@(x)x{1}, (tok(idx)));
    % To generate the part_of relations vector from the text file
    tok = regexp(s_term, '^partof: (GO:\w*)', 'tokens');
    idx = ~cellfun('isempty', tok);
    part_of_relations =cellfun(@(x)x{1}, (tok(idx)));
    % map. note the end+1 - here we create a new map row. Only once!
    map{end+1,1} = GO_Terms;
    map{end, 2} = is_a_relations;
    map{end, 3} = part_of_relations;
end
map( cellfun(@isempty, map) ) = {0};

Best answer

A short solution (though probably not the fastest):

% # Parse text file
C = textread('Opt.pad', '%s', 'delimiter', '');

% # Obtain indices for isa elements
idx = reshape(find(~cellfun(@isempty, strfind(C, 'isa')))', 2, []);
isa = arrayfun(@(x, y)x + 1:y - 1, idx(1, :), idx(2, :), 'Uniform', false);

% # Obtain indices for partof elements
idx = reshape(find(~cellfun(@isempty, strfind(C, 'partof')))', 2, []);
partof = arrayfun(@(x, y)x + 1:y - 1, idx(1, :), idx(2, :), 'Uniform', false);

% # Obtain indices of GO term elements and IDs
go = find(cellfun(@(s)any(strfind(s, 'GO:')), C));
id = go(~ismember(go, [isa{:}, partof{:}]));

% # Construct a new cell array
N = cellfun(@(x, y)max([numel(x), numel(y), 1]), isa, partof);
k = cumsum([1, N(1:end - 1)]);
X = cell(sum(N), 3); % # Preallocate memory!
repcell = @(x, n)arrayfun(@(y)x, 1:n, 'Uniform', false);
for ii = 1:numel(id)
    idx = k(ii):k(ii) + N(ii) - 1;
    X(idx, 1) = repcell(C{id(ii)}, N(ii));
    X(idx, 2) = [C{isa{ii}}, repcell('0', N(ii) - numel(isa{ii}))];
    X(idx, 3) = [C{partof{ii}}, repcell('0', N(ii) - numel(partof{ii}))];
end

This should produce the following output:

X = 

'GO:0030436' 'GO:0034297' 'GO:0042243'
'GO:0030436' 'GO:0043936' '0'
'GO:0030436' 'GO:0048315' '0'
'GO:0034297' '0' '0'
'GO:0043936' 'GO:0001410' '0'
'GO:0043936' 'GO:0034300' '0'
'GO:0043936' 'GO:0034301' '0'
'GO:0043936' 'GO:0034302' '0'
'GO:0043936' 'GO:0034303' '0'
'GO:0043936' 'GO:0034304' '0'
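
For readers who prefer an explicit loop over index arithmetic, the same table can also be built with a small line-by-line state machine. This is only a sketch under stated assumptions, not part of the original answer: the function name goRelationTable and the helper expandBlock are hypothetical, section markers are assumed to appear exactly as in the sample ("isa:", "end of isa", and so on), and any line starting with "GO:" outside a section is assumed to be a term ID. The variable is named isaList rather than isa so that MATLAB's built-in isa function is not shadowed.

```matlab
function X = goRelationTable(lines)
% GOREALTIONTABLE sketch (hypothetical name): build an n-by-3 cell array
% {ID, is_a, partof} from a cell array of text lines, one line per cell.
X = cell(0, 3);
id = '';            % ID of the block currently being read
isaList = {};       % isa terms collected for the current block
partList = {};      % partof terms collected for the current block
state = 'top';      % 'top', 'isa', 'partof' or 'genes'
for i = 1:numel(lines)
    t = strtrim(lines{i});
    switch t
        case 'isa:'
            state = 'isa';
        case 'partof:'
            state = 'partof';
        case 'genes:'
            state = 'genes';
        case {'end of isa', 'end of partof', 'end of genes'}
            state = 'top';
        otherwise
            if strncmp(t, 'GO:', 3)
                switch state
                    case 'top'
                        % A new block starts: emit the rows of the previous one.
                        X = [X; expandBlock(id, isaList, partList)];
                        id = t;
                        isaList = {};
                        partList = {};
                    case 'isa'
                        isaList{end+1} = t;
                    case 'partof'
                        partList{end+1} = t;
                end
            end
    end
end
% Emit the rows of the last block in the file.
X = [X; expandBlock(id, isaList, partList)];
end

function rows = expandBlock(id, isaList, partList)
% Repeat the ID once per relation and pad the shorter column with '0'.
if isempty(id)
    rows = cell(0, 3);
    return
end
n = max([numel(isaList), numel(partList), 1]);
rows = repmat({'0'}, n, 3);
rows(:, 1) = {id};
for k = 1:numel(isaList)
    rows{k, 2} = isaList{k};
end
for k = 1:numel(partList)
    rows{k, 3} = partList{k};
end
end
```

Assuming the file from the question, one way to call it is X = goRelationTable(regexp(fileread('Opt.pad'), '\r?\n', 'split')); the padding uses the string '0', matching the output shown above.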

Regarding "matlab - Import from a text file and create a cell array in MATLAB", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/13902447/
