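Title: [Text Processing] [Solved] How to batch-extract specified content from a text file?
Author: sweet惜缘 Time: 2015-12-27 17:24 Title: [Solved] How to batch-extract specified content from a text file?
The contents of drugbank.txt look roughly like the sample below.
(The attachment was too large, so I only uploaded the details of a single drug here; the complete file is on Baidu Netdisk at the link below.) The content I want to extract was marked in bold red: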
http://pan.baidu.com/s/1numNLLf
#BEGIN_DRUGCARD DB00001
# AHFS_Codes:
20:12.04.12
# ATC_Codes:
B01AE02
# Absorption:
Bioavailability is 100% following injection.
# Drug_Interactions:
Ginkgo biloba Additive anticoagulant/antiplatelet effects may increase bleed risk. Concomitant therapy should be avoided.
Treprostinil The prostacyclin analogue, Treprostinil, increases the risk of bleeding when combined with the anticoagulant, Lepirudin. Monitor for increased bleeding during concomitant thearpy.
# Indication:
For the treatment of heparin-induced thrombocytopenia
# KEGG_Drug_ID:
D06880
# Pathways:
Lepirudin Pathway SMP00278
#END_DRUGCARD DB00001
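The card layout above looks regular enough to read with a small line-by-line state machine. The following is only a minimal sketch in plain JavaScript (Node.js), not a tested tool for the full file. It assumes that every field starts with a `# Name:` header line and that its values are the non-empty lines up to the next header; `parseCards` and the `sample` text are hypothetical names introduced for illustration.

```javascript
// Sketch only: assumes "# Name:" header lines followed by value lines,
// wrapped in a #BEGIN_DRUGCARD / #END_DRUGCARD pair. Blank lines are skipped.
function parseCards(text) {
  const cards = [];
  let card = null;   // card currently being read, or null between cards
  let field = null;  // name of the field whose values are being collected
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    let m;
    if ((m = /^#\s?BEGIN_DRUGCARD\s+(\S+)/.exec(line))) {
      card = { id: m[1], fields: {} };
      field = null;
    } else if (/^#\s?END_DRUGCARD/.test(line)) {
      if (card) cards.push(card);
      card = null;
      field = null;
    } else if ((m = /^#\s+(\w+):$/.exec(line))) {
      field = m[1];
      if (card) card.fields[field] = [];
    } else if (card && field && line !== '') {
      card.fields[field].push(line);
    }
  }
  return cards;
}

// A shortened copy of the sample card from this post.
const sample = [
  '#BEGIN_DRUGCARD DB00001',
  '# ATC_Codes:',
  'B01AE02',
  '# KEGG_Drug_ID:',
  'D06880',
  '# Pathways:',
  'Lepirudin Pathway SMP00278',
  '#END_DRUGCARD DB00001',
].join('\n');

const cards = parseCards(sample);
console.log(cards[0].id, cards[0].fields.ATC_Codes, cards[0].fields.Pathways);
```

Each card becomes an object with an `id` and a `fields` map from field name to an array of value lines, so picking out a handful of fields afterwards is a simple lookup.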
After processing, I would like out.txt to look like this:
BEGIN_DRUGCARD ATC_Codes Drug_Interactions Indication KEGG_Drug_ID Pathways
DB00001 B01AE02 Ginkgo biloba,Treprostinil heparin-induced thrombocytopenia D06880 SMP00278
DB00002 ... .... ... ... ...
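Below is my program, which produces no result. I hope an expert can provide a working program (there is no need to fix mine); anything that gives the result I want is fine. Many thanks!!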
# 2>nul&@Gawk -f %0 drugbank.txt&Exit
BEGIN{printf("ENTRY ATC code Indication Drug_Interactions PATHWAY\n")>>"$Data.txt";A[2]=I[2]=D[2]=P[2]="~"}
END{printf("\n拥有ATC code的药物有%d种\n拥有Drug group的药物有%d种\n拥有Therapeutic category的药物有%d种\n拥有PATHWAY的药物有%d种\n",_A,_I,_D,_P)>>"$Data.txt"}
$1~"///"{
A[2]!="~"?_A++:0;I[2]!="~"?_I++:0;D[2]!="~"?_D++:0!="~"?_P++:0
printf("%-16s %-15s %-16s %-31s %s\n",E,A[2],I[2],D[2],P[2])>>"$Data.txt"
A[2]=I[2]=D[2]=P[2]="~"
}
$1~"ENTRY"{{split($0,B,"BEGIN_DRUGCARD ");gsub(" ",E[2])}
$0~"ATC code"{split($0,A,"ATC_Codes: ");gsub(" ",",",A[2])}
$0~"Indication:"{split($0,I,"Indication: ");gsub(" ",",",I[2])}
$0~"Drug_Interactions:"&&$0!~"of"{split($0,D,Drug_Interactions:: ");gsub(" ",",",D[2])}
$0~"PATHWAY"{split($0,P,"Pathways: ");gsub(" ",",",P[2])}
Author: DAIC Time: 2015-12-27 17:55
Could you compress files before uploading them to the forum?
Author: pcl_test Time: 2015-12-27 22:10
This post was last edited by pcl_test on 2015-12-27 22:14
1>1/* :
@echo off
cscript -nologo -e:jscript "%~f0" "drugbank.txt">结果.txt
echo;完成
pause & exit/b
*/

var txt = '';
var fso = new ActiveXObject('Scripting.FileSystemObject');
var f = fso.OpenTextFile(WScript.Arguments(0));
while (!f.AtEndOfStream) {
    var str = f.ReadLine();
    // A blank line ends the current field, so clear every field flag.
    if (/^\s*$/.test(str)) var ATC_Codes = Drug_Interactions = Indication = KEGG_Drug_ID = Pathways = 0;
    // While a flag is set, the current line is a value of that field.
    if (ATC_Codes == 1) {
        tmp += str + ',';
    } else if (Drug_Interactions == 1) {
        // Keep only the leading drug name: drop everything from the first
        // whitespace that is followed by a capital letter.
        if (!/Not Available/.test(str)) {
            tmp += str.replace(/\s[A-Z].+$/, '') + ',';
        } else tmp += str + ',';
    } else if (Indication == 1) {
        tmp += str + ',';
    } else if (KEGG_Drug_ID == 1) {
        tmp += str + ',';
    } else if (Pathways == 1) {
        // Keep only the last whitespace-separated token of the line.
        if (!/Not Available/.test(str)) {
            tmp += str.replace(/^.+\s/, '');
        } else tmp += str;
    }
    // Header lines: start a new row, or switch on the flag of a wanted field.
    if (/^#\s?BEGIN_DRUGCARD/.test(str)) { var tmp = ''; tmp += str.replace(/^.+\s/, '') + '\t\t'; }
    if (/^# ATC_Codes/.test(str)) { var ATC_Codes = 1; tmp += '\t'; }
    if (/^# Drug_Interactions/.test(str)) { var Drug_Interactions = 1; tmp += '\t'; }
    if (/^# Indication/.test(str)) { var Indication = 1; tmp += '\t'; }
    if (/^# KEGG_Drug_ID/.test(str)) { var KEGG_Drug_ID = 1; tmp += '\t'; }
    if (/^# Pathways/.test(str)) { var Pathways = 1; tmp += '\t'; }
    // End of card: turn each trailing comma into a column gap and store the row.
    if (/^#\s?END_DRUGCARD/.test(str)) txt += tmp.replace(/,\t/g, '\t\t') + '\r\n';
}
var caption = 'BEGIN_DRUGCARD\t\tATC_Codes\t\tDrug_Interactions\t\tIndication\t\tKEGG_Drug_ID\t\tPathways\r\n';
WSH.Echo(caption + txt);
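The two `replace` calls in the script above do the actual trimming, so it may help to see them in isolation. This is a small sketch in plain JavaScript (Node.js) rather than JScript; the helper names `interactionName` and `pathwayId` are hypothetical, and the input lines are taken from the sample card in the first post.

```javascript
// Cut a Drug_Interactions line down to the leading drug name: everything from
// the first "whitespace + capital letter" onward is dropped. "Not Available"
// is passed through unchanged, otherwise it would be shortened to "Not".
function interactionName(line) {
  return /Not Available/.test(line) ? line : line.replace(/\s[A-Z].+$/, '');
}

// Keep only the last whitespace-separated token of a Pathways line.
function pathwayId(line) {
  return /Not Available/.test(line) ? line : line.replace(/^.+\s/, '');
}

console.log(interactionName('Ginkgo biloba Additive anticoagulant/antiplatelet effects may increase bleed risk.'));
// → Ginkgo biloba
console.log(interactionName('Treprostinil The prostacyclin analogue, Treprostinil, increases the risk of bleeding.'));
// → Treprostinil
console.log(pathwayId('Lepirudin Pathway SMP00278'));
// → SMP00278

// Limitation of the heuristic: a name whose second word is capitalised
// is cut too early.
console.log(interactionName('Ginkgo Biloba Additive effects'));
// → Ginkgo
```

The name rule is a heuristic: it relies on the drug name containing no capitalised word after the first, so results for multi-word names are worth spot-checking.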
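Author: sweet惜缘 Time: 2015-12-28 11:13
This post was last edited by sweet惜缘 on 2015-12-28 14:55
Reply to 3# pcl_test
The program works! Thank you. I have one more question: is it possible to extract the Drug_Target_ID information for each drug? (Some drugs have more drug_target entries than the 3 below, possibly dozens.)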
# Drug_Target_1_ID:
3819
# Drug_Target_1_Locus:
11p11-q12
# Drug_Target_1_Molecular_Weight:
70037
# Drug_Target_1_Name:
Prothrombin
# Drug_Target_1_Number_of_Residues:
622
# Drug_Target_1_PDB_ID:
1HAG
# Drug_Target_2_ID:
54
In the end I would like out_target.txt to look like this (leave the cell empty where a drug has no target information):
BEGIN_DRUGCARD Drug_Target_1_ID Drug_Target_2_ID Drug_Target_3_ID .....
DB00001 3819 54 ...
DB00002 ... ... ...
...
...
Many thanks!!!
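One possible way to approach the target request is sketched below in plain JavaScript (Node.js). It has not been run against the full drugbank.txt. It assumes that each target ID sits on the first non-empty line after a `# Drug_Target_<n>_ID:` header, and it sizes the header row to the highest target number found in any card. `targetTable` and `demo` are hypothetical names; the second card in `demo` is deliberately left without targets to show the empty cells.

```javascript
// Sketch only: collects Drug_Target_<n>_ID values per card and prints a
// tab-separated table with one column per target number.
function targetTable(text) {
  const rows = [];     // one { id, targets: { n: value } } per card
  let row = null;      // card currently being read
  let pending = null;  // target number whose value line is expected next
  let maxN = 0;        // highest target number seen in any card
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    let m;
    if ((m = /^#\s?BEGIN_DRUGCARD\s+(\S+)/.exec(line))) {
      row = { id: m[1], targets: {} };
      pending = null;
    } else if (/^#\s?END_DRUGCARD/.test(line)) {
      if (row) rows.push(row);
      row = null;
      pending = null;
    } else if ((m = /^#\s+Drug_Target_(\d+)_ID:$/.exec(line))) {
      pending = Number(m[1]);
    } else if (line.startsWith('#')) {
      pending = null;  // any other header (e.g. ..._PDB_ID) ends the wait
    } else if (row && pending !== null && line !== '') {
      row.targets[pending] = line;
      maxN = Math.max(maxN, pending);
      pending = null;  // only the first value line is taken
    }
  }
  // Header is sized to the largest target number; missing targets stay empty.
  const header = ['BEGIN_DRUGCARD'];
  for (let n = 1; n <= maxN; n++) header.push('Drug_Target_' + n + '_ID');
  const out = [header.join('\t')];
  for (const r of rows) {
    const cells = [r.id];
    for (let n = 1; n <= maxN; n++) cells.push(r.targets[n] || '');
    out.push(cells.join('\t'));
  }
  return out.join('\n');
}

const demo = [
  '#BEGIN_DRUGCARD DB00001',
  '# Drug_Target_1_ID:', '3819',
  '# Drug_Target_1_Name:', 'Prothrombin',
  '# Drug_Target_1_PDB_ID:', '1HAG',
  '# Drug_Target_2_ID:', '54',
  '#END_DRUGCARD DB00001',
  '#BEGIN_DRUGCARD DB00002',  // no targets: its cells stay empty
  '#END_DRUGCARD DB00002',
].join('\n');

console.log(targetTable(demo));
```

To try it on the real file under Node.js, the text could be loaded with `require('fs').readFileSync('drugbank.txt', 'utf8')` and the result written to out_target.txt. The file is read twice over in effect (once to collect, once to print) because the column count is only known after every card has been seen.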
Welcome to 批处理之家 (http://bathome.net./)