Title: [Text processing] [Solved] How can a batch script extract specific content from one text file and check whether it exists in another text file?
Author: sweet惜缘 Time: 2015-5-8 12:01
This post was last edited by pcl_test on 2016-9-14 13:33
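The files in data look like the one below; each record describes one drug. Two parts need to be extracted (they were marked in red in the original post): the ENTRY part (essentially the drug's name information) and the PATHWAY part (records with no pathway information get ~ instead). Each pathway name is then compared with the names in the pathway file; where a name matches exactly, the code in front of it is written to another txt file.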
ENTRY D07670 Drug
NAME Chlormadinone (INN)
FORMULA C21H27ClO3
EXACT_MASS 362.1649
MOL_WEIGHT 362.8903
ACTIVITY Progestin
REMARK ATC code: G03DB06
Drug group: DG00469
TARGET progesterone receptor agonist [HSA:5241] [KO:K08556]
PATHWAY hsa04114(5241) Oocyte meiosis
hsa04914(5241) Progesterone-mediated oocyte maturation
INTERACTION
STR_MAP map07225 Glucocorticoid and meneralocorticoid receptor agonists/antagonists
map07226 Progesterone, androgen and estrogen receptor agonists/antagonists
BRITE Anatomical Therapeutic Chemical (ATC) classification [BR:br08303]
G GENITO URINARY SYSTEM AND SEX HORMONES
G03 SEX HORMONES AND MODULATORS OF THE GENITAL SYSTEM
G03D PROGESTOGENS
G03DB Pregnadien derivatives
G03DB06 Chlormadinone
D07670 Chlormadinone (INN)
Target-based classification of drugs [BR:br08310]
Nuclear receptors
Estrogen like receptors
3-Ketosteroid receptor
progesterone receptor
Chlormadinone
D07670 Chlormadinone (INN)
DBLINKS CAS: 1961-77-9
PubChem: 51091972
LigandBox: D07670
NIKKAJI: J13.853C
ATOM 25
1 C1z C 22.2255 -14.2313
2 C1z C 21.0340 -14.8622
3 C1x C 23.3470 -14.8622
4 O1a O 23.3470 -13.5304
5 C5a C 22.2255 -12.8295
6 C1y C 21.0340 -16.2640
7 C1x C 19.8424 -14.1612
8 C1a C 21.0340 -13.5304
9 C1x C 23.3470 -16.2640
10 C1a C 21.0339 -12.1285
11 O5a O 23.4170 -12.1285
12 C1y C 19.8424 -16.9650
13 C1x C 18.6508 -14.8622
14 C1y C 18.6508 -16.2640
15 C2x C 19.8424 -18.2968
16 C1z C 17.4592 -16.8949
17 C2y C 18.6508 -18.9977
18 C2y C 17.4592 -18.2968
19 C1x C 16.2676 -16.1939
20 C1a C 17.4592 -15.5631
21 X Cl 18.6507 -20.3295
22 C2x C 16.2676 -18.9977
23 C1x C 15.0760 -16.8949
24 C5x C 15.0760 -18.2968
25 O5x O 13.8844 -18.9977
BOND 28
1 1 2 1
2 1 3 1
3 1 4 1 #Down
4 1 5 1 #Up
5 2 6 1
6 2 7 1
7 2 8 1 #Up
8 3 9 1
9 5 10 1
10 5 11 2
11 6 12 1
12 7 13 1
13 12 14 1
14 12 15 1
15 14 16 1
16 15 17 2
17 16 18 1
18 16 19 1
19 16 20 1 #Up
20 17 21 1
21 18 22 2
22 19 23 1
23 22 24 1
24 24 25 2
25 6 9 1
26 13 14 1
27 17 18 1
28 23 24 1
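For readers following along, here is a minimal Python sketch of the extraction step described above (the thread itself is later answered with gawk, not Python). It assumes each record ends with a `///` line, which is what the gawk answer in this thread also relies on even though the excerpt above does not show it, and that a field keyword is an all-uppercase word at the start of a line. `parse_drugs`, `KEYWORD` and `PATHWAY_ITEM` are hypothetical names introduced for illustration.

```python
import re

# A field keyword such as ENTRY, PATHWAY or STR_MAP at the start of a line.
KEYWORD = re.compile(r"^[A-Z_]+(\s|$)")
# A pathway item such as "hsa04114(5241) Oocyte meiosis"; group 1 is the name.
PATHWAY_ITEM = re.compile(r"^[a-z]+\d+(?:\([^)]*\))?\s+(.+)$")


def parse_drugs(lines):
    """Yield (entry_id, [pathway names]) for each drug record."""
    entry, names, in_pathway = None, [], False
    for raw in lines:
        line = raw.strip()
        if line == "///":  # assumed record terminator
            if entry is not None:
                yield entry, names
            entry, names, in_pathway = None, [], False
            continue
        if KEYWORD.match(line):
            parts = line.split(None, 1)
            keyword = parts[0]
            line = parts[1] if len(parts) > 1 else ""
            # Only lines inside the PATHWAY field are treated as pathway items.
            in_pathway = keyword == "PATHWAY"
            if keyword == "ENTRY" and line:
                entry = line.split()[0]
        if in_pathway:
            item = PATHWAY_ITEM.match(line)
            if item:
                names.append(item.group(1))
    if entry is not None:  # last record without a trailing "///"
        yield entry, names


record = [
    "ENTRY D07670 Drug",
    "PATHWAY hsa04114(5241) Oocyte meiosis",
    "hsa04914(5241) Progesterone-mediated oocyte maturation",
    "INTERACTION",
]
print(list(parse_drugs(record)))
# [('D07670', ['Oocyte meiosis', 'Progesterone-mediated oocyte maturation'])]
```

Tracking the current field keyword keeps lines such as `map07226 ...` under STR_MAP from being mistaken for pathway items, even though they look similar.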
The pathway file looks roughly like this:
1. Metabolism
1.0.1 Global and overview maps
1.0.2 Metabolic pathways
1.0.3 Biosynthesis of secondary metabolites
1.0.4 Microbial metabolism in diverse environments
1.0.5 Biosynthesis of antibiotics
1.0.6 Carbon metabolism
1.0.7 2-Oxocarboxylic acid metabolism
1.0.8 Fatty acid metabolism
1.0.9 Biosynthesis of amino acids
1.0.10 Degradation of aromatic compounds
1.1 Carbohydrate metabolism
1.1.1 Glycolysis / Gluconeogenesis
1.1.2 Citrate cycle (TCA cycle)
1.1.3 Pentose phosphate pathway
1.1.4 Pentose and glucuronate interconversions
1.1.5 Fructose and mannose metabolism
1.1.6 Galactose metabolism
1.1.7 Ascorbate and aldarate metabolism
1.1.8 Starch and sucrose metabolism
1.1.9 Amino sugar and nucleotide sugar metabolism
1.1.10 Pyruvate metabolism
1.1.11 Glyoxylate and dicarboxylate metabolism
1.1.12 Propanoate metabolism
1.1.13 Butanoate metabolism
1.1.14 C5-Branched dibasic acid metabolism
1.1.15 Inositol phosphate metabolism
1.2 Energy metabolism
1.2.1 Oxidative phosphorylation
1.2.2 Photosynthesis
1.2.3 Photosynthesis - antenna proteins
1.2.4 Carbon fixation in photosynthetic organisms
1.2.5 Carbon fixation pathways in prokaryotes
1.2.6 Methane metabolism
1.2.7 Nitrogen metabolism
1.2.8 Sulfur metabolism
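Since the lookup goes from pathway name to the code in front of it, one way to prepare this file is a name-to-code dictionary. The Python sketch below assumes each line is a code, then whitespace, then the name, as in the listing above; `load_pathway_codes` is a hypothetical helper name, and if a name occurred twice the later line would win.

```python
def load_pathway_codes(lines):
    """Map each pathway name to the code printed in front of it."""
    codes = {}
    for raw in lines:
        # Split only once so names containing spaces stay intact.
        parts = raw.strip().split(None, 1)
        if len(parts) == 2:
            code, name = parts
            codes[name] = code
    return codes


listing = [
    "1.1.1 Glycolysis / Gluconeogenesis",
    "1.2.7 Nitrogen metabolism",
    "1.2.8 Sulfur metabolism",
]
codes = load_pathway_codes(listing)
print(codes["Sulfur metabolism"])  # 1.2.8
```

A dictionary lookup only succeeds on an exact, character-for-character name match, which is the comparison the question asks for.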
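The resulting txt file should be in this format:
ENTRY PATHWAY
D00011 1.2.8
D00019 ~ (no pathway information)
D00090 1.1.9,1.2.7 (this drug has two pathway entries, as above, so the two codes are separated by a comma)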
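As a sketch of how one output line in this format could be assembled, the Python snippet below joins the matched codes with commas and falls back to `~` when nothing matches. `format_row` is a hypothetical name, and the pairing of D00090 with these two pathway names is made up for illustration; only the codes come from the pathway listing above.

```python
def format_row(entry, pathway_names, codes):
    """Return 'ENTRY code1,code2', or 'ENTRY ~' when no name matches exactly."""
    matched = [codes[name] for name in pathway_names if name in codes]
    return f"{entry} {','.join(matched) if matched else '~'}"


# Illustrative name-to-code table taken from the pathway listing.
codes = {
    "Amino sugar and nucleotide sugar metabolism": "1.1.9",
    "Nitrogen metabolism": "1.2.7",
}
print(format_row("D00090",
                 ["Amino sugar and nucleotide sugar metabolism",
                  "Nitrogen metabolism"],
                 codes))                      # D00090 1.1.9,1.2.7
print(format_row("D00019", [], codes))        # D00019 ~
```

Names that are absent from the table are skipped, so a drug whose pathways all fail to match also ends up with `~`.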
Author: pcl_test Time: 2015-5-8 12:26
Please zip and upload the complete data and pathway files; if they are too large, post a network-drive link instead.
Author: sweet惜缘 Time: 2015-5-8 15:35
Reply to 2# pcl_test
Uploaded. Thanks!
Author: sweet惜缘 Time: 2015-5-8 15:42
This post was last edited by sweet惜缘 on 2015-5-11 09:31
Reply to 2# pcl_test
The data file contains an incomplete dataset.
Thank you very much.
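Author: sweet惜缘 Time: 2015-5-9 13:52
Could any of the experts here help? If anything is unclear, I can explain.
Author: sweet惜缘 Time: 2015-5-9 22:41
....Have all the experts gone off to celebrate Mother's Day?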
Author: bailong360 Time: 2015-5-10 00:56
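# 2>nul&@gawk -f %0 drug>$New.txt&exit
# Batch/gawk hybrid: cmd runs the line above (gawk reads this same file), gawk sees it as a comment.
BEGIN {
    printf("ENTRY PATHWAY\n")
    # Map each pathway name to the code in front of it, e.g. "Sulfur metabolism" -> "1.2.8".
    while ((getline < "pathway.txt") > 0) {
        split($0, tmp, " ")
        pathway[substr($0, match($0, "[ ]") + 1)] = tmp[1]
    }
    P = "~"    # "~" means no matching pathway for the current drug
}
{
    if ($1 ~ "///") {                  # end of a drug record: print it and reset
        printf("%-15s %s\n", E, P); P = "~"
    } else if ($1 ~ "ENTRY") {
        E = $2
    } else if (GoOn == 0 && $0 ~ /[a-z]+[0-9]+/ && $0 !~ "COMMENT") {
        # Line inside a PATHWAY block; the pathway name starts 3 characters after ")".
        temp = pathway[substr($0, index($0, ")") + 3)]
        if (temp != 0) P = (P != "~") ? P "," temp : temp
    } else if ($0 !~ /[a-z]+[0-9]+/) {
        GoOn = 1                       # no id-like token on this line: stop collecting
    } else if ($0 ~ "PATHWAY") {
        temp = pathway[substr($0, index($0, ")") + 3)]
        # Codes are joined with "," as the requested output format specifies.
        if (temp != 0) P = (P != "~") ? P "," temp : temp
        GoOn = 0                       # PATHWAY block starts: collect the following lines
    }
}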
Author: sweet惜缘 Time: 2015-5-11 08:09
Reply to 7# bailong360
You again~~~~ Thank you!! Happy Mother's Day
Author: sweet惜缘 Time: 2015-5-11 09:20
Reply to 7# bailong360
Tested and it works~~ Ten thousand thumbs up~~~
Author: bailong360 Time: 2015-5-11 17:44
Reply to 8# sweet惜缘
I'm a guy......
Once the problem is solved, please edit the first post and add [已解决] ([Solved]) in front of the title
http://www.bathome.net/thread-3473-1-1.html
Welcome to 批处理之家 (http://bathome.net./)