Supplementary notes

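1. First, cancer types and their exact names. I won't go over the common ones such as LUAD, COAD, and KICH. cBioPortal, however, has a much finer classification, with as many as 885 types. They are stored in cBioportal.type_of_cancer.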

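One problem, though: I still have not found any per-patient subtype data.

The patient table has the patient identifier (stable_id) and cancer_study_id.

So the next thing to look up is cancer_study_id. The cancer_study table has cancer_study_id and type_of_cancer_id, and type_of_cancer_id in turn matches the type_of_cancer_id column of the type_of_cancer table.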

Overall, the logic for fetching all samples for a given cancer_type is:

##table:type_of_cancer,cancer_study,patient.

omicset.synchronize(cancer_type)-----|------>type_of_cancer(cancer_type==type_of_cancer_id)---->cancer_study(type_of_cancer_id==type_of_cancer_id,cancer_study_id)-------->patient(cancer_study_id==cancer_study_id,stable_id)

cancer_type------……------>stable_id==case_id

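So in practice only cancer_study and patient are needed, because cancer_type can be matched against cancer_study.type_of_cancer_id directly. I can now fetch the IDs without trouble.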
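The two-table lookup described above can be sketched as follows. This is a minimal illustration, not the actual omicset code: it uses an in-memory SQLite database as a stand-in for the real cBioPortal database, models only the columns mentioned in these notes, and the rows and the function name `case_ids_for_cancer_type` are made up for the example.

```python
import sqlite3

# Toy in-memory stand-in for the cBioPortal database (assumption).
# Only the columns named in the notes are modelled; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cancer_study (cancer_study_id INTEGER PRIMARY KEY,
                               type_of_cancer_id TEXT);
    CREATE TABLE patient (stable_id TEXT, cancer_study_id INTEGER);
    INSERT INTO cancer_study VALUES (1, 'luad'), (2, 'coad');
    INSERT INTO patient VALUES ('P-0001', 1), ('P-0002', 1), ('P-0003', 2);
""")


def case_ids_for_cancer_type(conn, cancer_type):
    """Return patient stable_ids (case_ids) for one type_of_cancer_id."""
    rows = conn.execute(
        """
        SELECT p.stable_id
        FROM cancer_study AS cs
        JOIN patient AS p ON p.cancer_study_id = cs.cancer_study_id
        WHERE cs.type_of_cancer_id = ?
        ORDER BY p.stable_id
        """,
        (cancer_type,),
    ).fetchall()
    return [row[0] for row in rows]


luad_cases = case_ids_for_cancer_type(conn, "luad")
print(luad_cases)
```

The type_of_cancer table is skipped on purpose: it would only be needed to translate a human-readable cancer name into a type_of_cancer_id.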

2. With case_id (stable_id) in hand, specifying an omics type is enough to fill in data. Starting with expression data:

The expression matrix read_count is in the structural_variant table, which has sample_id and site1/2_ENTREZ_GENE_ID, so it can be mapped to genes. Through the sample table, sample_id maps to the sample type and stable_id. site1/2_ENTREZ_GENE_ID maps to HUGO_GENE_SYMBOL through the gene table. So the logic for fetching expression data from case_id is:

##table:gene,sample,structural_variant,

omicser.synchronize(){case_id--------->==stable_id----->sample(stable_id->sample_id)------>structural_variant(sample_id->read_count,

The expression data seems to have some problems, so let's use the CNV and mutation data first.

##table:mutation,sample,gene,

case_id==stable_id---->{sample}----->sample_id-------->{mutation}------------>(stable_id, entrez_gene_id,tumor/norm_alt/ref_count)-------->{gene}-------->(stable_id,ensembl_gene,*count)
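The sample -> mutation -> gene chain above could look roughly like this. Again a sketch under assumptions: SQLite stands in for the real database, the schema is reduced to the columns the notes mention, only the tumor alt/ref counts are included, genes are labelled by hugo_gene_symbol, and `mutation_counts` is a hypothetical helper name.

```python
import sqlite3

# Simplified, invented schema and rows (assumption); only the join keys
# and count columns mentioned in the notes are modelled.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, stable_id TEXT);
    CREATE TABLE gene (entrez_gene_id INTEGER PRIMARY KEY,
                       hugo_gene_symbol TEXT);
    CREATE TABLE mutation (sample_id INTEGER, entrez_gene_id INTEGER,
                           tumor_alt_count INTEGER, tumor_ref_count INTEGER);
    INSERT INTO sample VALUES (10, 'S-0001'), (11, 'S-0002');
    INSERT INTO gene VALUES (7157, 'TP53'), (3845, 'KRAS');
    INSERT INTO mutation VALUES (10, 7157, 12, 30), (10, 3845, 5, 41),
                                (11, 7157, 8, 22);
""")


def mutation_counts(conn, stable_ids):
    """Return (stable_id, gene symbol, alt count, ref count) rows."""
    stable_ids = list(stable_ids)
    if not stable_ids:
        return []
    # One "?" placeholder per id keeps the query parameterised.
    placeholders = ",".join("?" for _ in stable_ids)
    query = f"""
        SELECT s.stable_id, g.hugo_gene_symbol,
               m.tumor_alt_count, m.tumor_ref_count
        FROM sample AS s
        JOIN mutation AS m ON m.sample_id = s.sample_id
        JOIN gene AS g ON g.entrez_gene_id = m.entrez_gene_id
        WHERE s.stable_id IN ({placeholders})
        ORDER BY s.stable_id, g.hugo_gene_symbol
    """
    return conn.execute(query, stable_ids).fetchall()


rows = mutation_counts(conn, ["S-0001"])
for row in rows:
    print(row)
```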

##In the cleaning stage, take the Cartesian product and transpose.
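One possible reading of this cleaning step, sketched in plain Python: expand the long-format (sample, gene, count) rows over the full sample x gene Cartesian product so that every cell exists, then transpose the result to gene x sample. The records, the fill value of 0 for missing pairs, and the function name `to_matrix` are all assumptions made for illustration.

```python
from itertools import product

# Long-format records (stable_id, gene, count); values are invented.
records = [("S-0001", "TP53", 12), ("S-0001", "KRAS", 5), ("S-0002", "TP53", 8)]


def to_matrix(records, fill=0):
    """Build a gene x sample matrix from long-format records."""
    samples = sorted({r[0] for r in records})
    genes = sorted({r[1] for r in records})
    lookup = {(s, g): v for s, g, v in records}
    # Cartesian product of samples and genes: every pair gets a cell,
    # and pairs with no record receive the fill value.
    grid = {(s, g): lookup.get((s, g), fill) for s, g in product(samples, genes)}
    sample_by_gene = [[grid[(s, g)] for g in genes] for s in samples]
    # Transpose so that rows are genes and columns are samples.
    gene_by_sample = [list(col) for col in zip(*sample_by_gene)]
    return samples, genes, gene_by_sample


samples, genes, matrix = to_matrix(records)
print(samples, genes, matrix)
```

Whether 0 is the right fill value depends on the data type; for mutation counts it reads as "no mutation recorded", which may not be appropriate for other omics layers.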

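I think I have it sorted out now: a cancer_study_id maps to a list_id, and a list_id maps to a set of sample_ids.

A cancer_study_id also maps to patient_id, and patient_id maps to stable_id.

So it is better to go from cancer_type to cancer_study_id ... and all the way down to sample_id.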

sample_id matches the sample_id column in both mutation and structural_variant directly, which gives count*2.

alteration can also be obtained from cnv_event by way of sample_cnv_event.
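The CNV lookup can be sketched the same way. The notes only name the two tables and the alteration column, so the join key cnv_event_id, the entrez_gene_id column, the sample rows, and the helper `cnv_alterations` are assumptions in this example, with SQLite again standing in for the real database.

```python
import sqlite3

# Invented schema and rows (assumption): the join key cnv_event_id and the
# entrez_gene_id column are not stated in the notes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cnv_event (cnv_event_id INTEGER PRIMARY KEY,
                            entrez_gene_id INTEGER, alteration INTEGER);
    CREATE TABLE sample_cnv_event (cnv_event_id INTEGER, sample_id INTEGER);
    INSERT INTO cnv_event VALUES (1, 7157, -2), (2, 3845, 2);
    INSERT INTO sample_cnv_event VALUES (1, 10), (2, 10), (1, 11);
""")


def cnv_alterations(conn, sample_id):
    """Return {entrez_gene_id: alteration} for one internal sample_id."""
    rows = conn.execute(
        """
        SELECT ce.entrez_gene_id, ce.alteration
        FROM sample_cnv_event AS sce
        JOIN cnv_event AS ce ON ce.cnv_event_id = sce.cnv_event_id
        WHERE sce.sample_id = ?
        ORDER BY ce.entrez_gene_id
        """,
        (sample_id,),
    ).fetchall()
    return dict(rows)


alterations = cnv_alterations(conn, 10)
print(alterations)
```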

I increasingly feel it would be better to use the API, because there really is no expression data here.