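I'm working with single-cell RNA-sequencing data, which lately means sparse values for 10k-100k samples (cell) x 20k features (gene), plus a lot of metadata, such as the tissue of origin ("brain" vs. "liver"). The metadata is roughly 10-100 columns, which I store as a pandas.DataFrame. Right now I'm building xarray.Datasets by dict-ifying the metadata and adding the columns as coordinates. It feels clunky and error-prone, because I keep copying the code snippet between notebooks. Is there an easier way?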
cell_metadata_dict = cell_metadata.to_dict(orient='list')
coords = {k: ('cell', v) for k, v in cell_metadata_dict.items()}
coords.update(dict(gene=counts.columns, cell=counts.index))
ds = xr.Dataset(
    {'counts': (['cell', 'gene'], counts)},
    coords=coords)
Edit:
To show some example data, here is cell_metadata.head().to_csv():
cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
and counts.iloc[:5, :20].to_csv():
cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65
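Re: pandas.DataFrame.to_xarray() - this is very slow, and encoding that much numeric and categorical data as a 100-level MultiIndex seems odd to me. On top of that, every time I try to use a MultiIndex, I end up saying "oh, this is why I don't use MultiIndex" and go back to keeping separate metadata and counts dataframes.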
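Best answer
Xarray uses the pandas index/column labels as default metadata. When all variables share the same dimensions, you can convert in a single function call, but if different variables have different dimensions, you need to convert them from pandas separately and then put them together on the xarray side. For example: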
import pandas as pd
import io
import xarray
# read your data
cell_metadata = pd.read_csv(io.StringIO(u"""\
cell,Uniquely mapped reads number,Number of input reads,EXP_ID,TAXON,WELL_MAPPING,Lysis Plate Batch,dNTP.batch,oligodT.order.no,plate.type,preparation.site,date.prepared,date.sorted,tissue,subtissue,mouse.id,FACS.selection,nozzle.size,FACS.instument,Experiment ID ,Columns sorted,Double check,Plate,Location ,Comments,mouse.age,mouse.number,mouse.sex
A1-MAA100140-3_57_F-1-1,428699,502312,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A10-MAA100140-3_57_F-1-1,324428,360285,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A11-MAA100140-3_57_F-1-1,381310,431800,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A12-MAA100140-3_57_F-1-1,393498,446705,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F
A2-MAA100140-3_57_F-1-1,717,918,170928_A00111_0068_AH3YKKDMXX,mus,MAA100140,,,,Biorad 96well,Stanford,,170720,Liver,Hepatocytes,3_57_F,,,,,,,,,,3,57,F"""))
counts = pd.read_csv(io.StringIO(u"""\
cell,0610005C13Rik,0610007C21Rik,0610007L01Rik,0610007N19Rik,0610007P08Rik,0610007P14Rik,0610007P22Rik,0610008F07Rik,0610009B14Rik,0610009B22Rik,0610009D07Rik,0610009L18Rik,0610009O20Rik,0610010B08Rik,0610010F05Rik,0610010K14Rik,0610010O12Rik,0610011F06Rik,0610011L14Rik,0610012G03Rik
A1-MAA100140-3_57_F-1-1,308,289,81,0,4,88,52,0,0,104,65,0,1,0,9,8,12,283,12,37
A10-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A11-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A12-MAA100140-3_57_F-1-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
A2-MAA100140-3_57_F-1-1,375,325,70,0,2,72,36,13,0,60,105,0,13,0,0,29,15,264,0,65"""))
# build the output
xarray_counts = xarray.DataArray(counts.set_index('cell'), dims=['cell', 'gene'])
xarray_counts.coords.update(cell_metadata.set_index('cell').to_xarray())
print(xarray_counts)
This produces a nice, tidy xarray.DataArray for the counts:
<xarray.DataArray (cell: 5, gene: 20)>
array([[308, 289, 81, 0, 4, 88, 52, 0, 0, 104, 65, 0, 1, 0,
9, 8, 12, 283, 12, 37],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0],
[375, 325, 70, 0, 2, 72, 36, 13, 0, 60, 105, 0, 13, 0,
0, 29, 15, 264, 0, 65]])
Coordinates:
* cell (cell) object 'A1-MAA100140-3_57_F-1-1' ...
* gene (gene) object '0610005C13Rik' ...
Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717
Number of input reads (cell) int64 502312 360285 431800 446705 918
EXP_ID (cell) object '170928_A00111_0068_AH3YKKDMXX' ...
TAXON (cell) object 'mus' 'mus' 'mus' 'mus' 'mus'
WELL_MAPPING (cell) object 'MAA100140' 'MAA100140' ...
Lysis Plate Batch (cell) float64 nan nan nan nan nan
dNTP.batch (cell) float64 nan nan nan nan nan
oligodT.order.no (cell) float64 nan nan nan nan nan
plate.type (cell) object 'Biorad 96well' ...
preparation.site (cell) object 'Stanford' 'Stanford' ...
date.prepared (cell) float64 nan nan nan nan nan
date.sorted (cell) int64 170720 170720 170720 170720 ...
tissue (cell) object 'Liver' 'Liver' 'Liver' ...
subtissue (cell) object 'Hepatocytes' 'Hepatocytes' ...
mouse.id (cell) object '3_57_F' '3_57_F' '3_57_F' ...
FACS.selection (cell) float64 nan nan nan nan nan
nozzle.size (cell) float64 nan nan nan nan nan
FACS.instument (cell) float64 nan nan nan nan nan
Experiment ID (cell) float64 nan nan nan nan nan
Columns sorted (cell) float64 nan nan nan nan nan
Double check (cell) float64 nan nan nan nan nan
Plate (cell) float64 nan nan nan nan nan
Location (cell) float64 nan nan nan nan nan
Comments (cell) float64 nan nan nan nan nan
mouse.age (cell) int64 3 3 3 3 3
mouse.number (cell) int64 57 57 57 57 57
mouse.sex (cell) object 'F' 'F' 'F' 'F' 'F'
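The single-call case mentioned at the start of the answer (all variables sharing the same dimension) can be sketched on its own. The toy frame below, including the names meta, n_reads and the cell IDs c1..c3, is made up for illustration and is not part of the original data. Because the frame has a plain single-level index named "cell", the result has one "cell" dimension and no MultiIndex is involved.

```python
import pandas as pd
import xarray  # needed so that DataFrame.to_xarray() is available

# Hypothetical toy metadata: both columns vary along the same "cell" index.
meta = pd.DataFrame(
    {"tissue": ["Liver", "Liver", "Brain"], "n_reads": [428699, 324428, 717]},
    index=pd.Index(["c1", "c2", "c3"], name="cell"),
)

# One call: the index becomes the "cell" dimension,
# and every column becomes a variable along that dimension.
meta_ds = meta.to_xarray()
print(meta_ds)
```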
If you want a Dataset instead, pass the DataArray object to the Dataset constructor, e.g.,
# shouldn't really need to use .data_vars here, that might be an xarray bug
>>> xarray.Dataset({'counts': xarray.DataArray(counts.set_index('cell'),
...                                            dims=['cell', 'gene'])},
...                coords=cell_metadata.set_index('cell').to_xarray().data_vars)
<xarray.Dataset>
Dimensions: (cell: 5, gene: 20)
Coordinates:
* cell (cell) object 'A1-MAA100140-3_57_F-1-1' ...
* gene (gene) object '0610005C13Rik' ...
Uniquely mapped reads number (cell) int64 428699 324428 381310 393498 717
Number of input reads (cell) int64 502312 360285 431800 446705 918
EXP_ID (cell) object '170928_A00111_0068_AH3YKKDMXX' ...
TAXON (cell) object 'mus' 'mus' 'mus' 'mus' 'mus'
WELL_MAPPING (cell) object 'MAA100140' 'MAA100140' ...
Lysis Plate Batch (cell) float64 nan nan nan nan nan
dNTP.batch (cell) float64 nan nan nan nan nan
oligodT.order.no (cell) float64 nan nan nan nan nan
plate.type (cell) object 'Biorad 96well' ...
preparation.site (cell) object 'Stanford' 'Stanford' ...
date.prepared (cell) float64 nan nan nan nan nan
date.sorted (cell) int64 170720 170720 170720 170720 ...
tissue (cell) object 'Liver' 'Liver' 'Liver' ...
subtissue (cell) object 'Hepatocytes' 'Hepatocytes' ...
mouse.id (cell) object '3_57_F' '3_57_F' '3_57_F' ...
FACS.selection (cell) float64 nan nan nan nan nan
nozzle.size (cell) float64 nan nan nan nan nan
FACS.instument (cell) float64 nan nan nan nan nan
Experiment ID (cell) float64 nan nan nan nan nan
Columns sorted (cell) float64 nan nan nan nan nan
Double check (cell) float64 nan nan nan nan nan
Plate (cell) float64 nan nan nan nan nan
Location (cell) float64 nan nan nan nan nan
Comments (cell) float64 nan nan nan nan nan
mouse.age (cell) int64 3 3 3 3 3
mouse.number (cell) int64 57 57 57 57 57
mouse.sex (cell) object 'F' 'F' 'F' 'F' 'F'
Data variables:
counts (cell, gene) int64 308 289 81 0 4 88 52 0 ...
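Since the question was about copying snippets between notebooks, the same idea could be wrapped in a small helper. This is only a sketch under stated assumptions: the function name counts_with_metadata is hypothetical, both frames are assumed to be indexed by cell ID already, and it uses Dataset.merge followed by set_coords rather than the constructor call shown above.

```python
import pandas as pd
import xarray as xr


def counts_with_metadata(counts, cell_metadata, name="counts"):
    """Hypothetical helper: combine a (cell x gene) counts frame with
    per-cell metadata. Both frames are assumed to be indexed by cell ID."""
    # The DataFrame's index and columns become the "cell" and "gene" coordinates.
    data = xr.DataArray(counts, dims=["cell", "gene"])
    # Each metadata column becomes a variable along "cell".
    meta = cell_metadata.rename_axis("cell").to_xarray()
    # join="left" keeps exactly the cells (and their order) found in counts.
    ds = data.to_dataset(name=name).merge(meta, join="left")
    # Promote the metadata columns from data variables to coordinates.
    return ds.set_coords(list(meta.data_vars))


# Made-up toy inputs, only to show the call.
toy_counts = pd.DataFrame(
    [[1, 0], [3, 4]],
    index=pd.Index(["c1", "c2"], name="cell"),
    columns=["g1", "g2"],
)
toy_meta = pd.DataFrame(
    {"tissue": ["Liver", "Brain"]},
    index=pd.Index(["c1", "c2"], name="cell"),
)
ds = counts_with_metadata(toy_counts, toy_meta)
print(ds)
```

With the metadata stored as coordinates, the helper can live in one module and be imported from each notebook instead of being pasted in.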
Regarding "python - Easy way to create an xarray Dataset from metadata + values?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46800559/