批处理之家 (Batch Home) - Powered by Discuz! Board

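Title: [Text Processing] [Solved] How to batch-extract specified content from a text file?

Author: sweet惜缘 Time: 2015-12-27 17:24 Title: [Solved] How to batch-extract specified content from a text file?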

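The contents of drugbank.txt look roughly like the sample below.
(The attachment was too large, so I uploaded the details of only one drug here; the full file is on Baidu Netdisk at the link below.) The parts I want to extract are the ones shown in bold red:
http://pan.baidu.com/s/1numNLLf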

#BEGIN_DRUGCARD DB00001

# AHFS_Codes:
20:12.04.12

# ATC_Codes:
B01AE02

# Absorption:
Bioavailability is 100% following injection.

# Drug_Interactions:
Ginkgo biloba Additive anticoagulant/antiplatelet effects may increase bleed risk. Concomitant therapy should be avoided.
Treprostinil The prostacyclin analogue, Treprostinil, increases the risk of bleeding when combined with the anticoagulant, Lepirudin. Monitor for increased bleeding during concomitant thearpy.

# Indication:
For the treatment of heparin-induced thrombocytopenia

# KEGG_Drug_ID:
D06880

# Pathways:
Lepirudin Pathway SMP00278

#END_DRUGCARD DB00001

The out.txt I hope to get after processing looks like this:
BEGIN_DRUGCARD  ATC_Codes  Drug_Interactions           Indication                        KEGG_Drug_ID  Pathways
DB00001         B01AE02    Ginkgo biloba,Treprostinil  heparin-induced thrombocytopenia  D06880        SMP00278
DB00002         ...        ...                         ...                               ...           ...
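
The card layout above (a `# Field:` header line, one or more value lines, then a blank line) can be read into one field map per drug before any columns are picked. The following is only a minimal sketch in plain JavaScript, assuming every card is wrapped in `#BEGIN_DRUGCARD` / `#END_DRUGCARD` lines as in the sample; `parseCards` is a hypothetical helper name, not something from the thread.

```javascript
// Parse DrugBank-style "drug card" text into one object per card.
// parseCards is a hypothetical helper, not part of any library.
function parseCards(text) {
    var cards = [];
    var card = null;   // card currently being read
    var field = null;  // field whose value lines are being collected
    var lines = text.split(/\r?\n/);
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        var m;
        if ((m = /^#\s?BEGIN_DRUGCARD\s+(\S+)/.exec(line))) {
            card = { id: m[1], fields: {} };
            field = null;
        } else if (/^#\s?END_DRUGCARD/.test(line)) {
            if (card) cards.push(card);
            card = null;
            field = null;
        } else if (card && (m = /^#\s+(\w+):\s*$/.exec(line))) {
            field = m[1];              // e.g. "ATC_Codes"
            card.fields[field] = [];
        } else if (/^\s*$/.test(line)) {
            field = null;              // a blank line ends the field
        } else if (card && field) {
            card.fields[field].push(line);
        }
    }
    return cards;
}

// Shortened copy of the sample card from the post.
var sample = [
    '#BEGIN_DRUGCARD DB00001',
    '',
    '# ATC_Codes:',
    'B01AE02',
    '',
    '# Pathways:',
    'Lepirudin Pathway SMP00278',
    '',
    '#END_DRUGCARD DB00001'
].join('\n');

var cards = parseCards(sample);
console.log(cards[0].id, cards[0].fields.ATC_Codes.join(','));
```

Each entry in `fields` keeps its raw value lines, so trimming a line down to a drug name or a pathway id remains a separate step.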

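Below is my program, which produces no result. I hope an expert can provide a working program (there is no need to fix mine); anything that gives the result I want is fine. Many thanks!!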

# 2>nul&@Gawk -f %0 drugbank.txt&Exit
BEGIN{printf("ENTRY          ATC code       Indication    Drug_Interactions          PATHWAY\n")>>"$Data.txt";A[2]=I[2]=D[2]=P[2]="~"}
END{printf("\nDrugs with an ATC code: %d\nDrugs with a Drug group: %d\nDrugs with a Therapeutic category: %d\nDrugs with a PATHWAY: %d\n",_A,_I,_D,_P)>>"$Data.txt"}
$1~"///"{
A[2]!="~"?_A++:0;I[2]!="~"?_I++:0;D[2]!="~"?_D++:0;P[2]!="~"?_P++:0
printf("%-16s %-15s %-16s %-31s %s\n",E,A[2],I[2],D[2],P[2])>>"$Data.txt"
A[2]=I[2]=D[2]=P[2]="~"
}
$1~"ENTRY"{{split($0,B,"BEGIN_DRUGCARD ");gsub(" ",E[2])}
$0~"ATC code"{split($0,A,"ATC_Codes: ");gsub(" ",",",A[2])}
$0~"Indication:"{split($0,I,"Indication: ");gsub(" ",",",I[2])}
$0~"Drug_Interactions:"&&$0!~"of"{split($0,D,Drug_Interactions:: ");gsub(" ",",",D[2])}
$0~"PATHWAY"{split($0,P,"Pathways: ");gsub(" ",",",P[2])}

Author: DAIC Time: 2015-12-27 17:55

Could you compress the file before uploading it to the forum?

Author: pcl_test Time: 2015-12-27 22:10

Last edited by pcl_test on 2015-12-27 22:14

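1>1/* :
@echo off
cscript -nologo -e:jscript "%~f0" "drugbank.txt">result.txt
echo;Done
pause & exit/b
*/

// Hybrid batch/JScript: the batch lines above re-run this file through cscript.
var txt = '';  // finished output rows
var tmp = '';  // row being built for the current drug card
// Field flags: 1 while the lines being read belong to that field.
var ATC_Codes = 0, Drug_Interactions = 0, Indication = 0, KEGG_Drug_ID = 0, Pathways = 0;
var fso = new ActiveXObject('Scripting.FileSystemObject');
var f = fso.OpenTextFile(WScript.Arguments(0));
while (!f.AtEndOfStream) {
    var str = f.ReadLine();
    // A blank line ends the current field.
    if (/^\s*$/.test(str)) ATC_Codes = Drug_Interactions = Indication = KEGG_Drug_ID = Pathways = 0;
    if (ATC_Codes == 1) {
        tmp += str + ',';
    } else if (Drug_Interactions == 1) {
        // Keep only the drug name: cut at the first whitespace followed by a capital letter.
        tmp += (/Not Available/.test(str) ? str : str.replace(/\s[A-Z].+$/, '')) + ',';
    } else if (Indication == 1) {
        tmp += str + ',';
    } else if (KEGG_Drug_ID == 1) {
        tmp += str + ',';
    } else if (Pathways == 1) {
        // Keep only the last whitespace-separated token (the SMP id).
        // Several pathway lines are joined without a separator.
        tmp += /Not Available/.test(str) ? str : str.replace(/^.+\s/, '');
    }
    // Header lines: start a new row or switch on the matching field flag.
    if (/^#\s?BEGIN_DRUGCARD/.test(str)) tmp = str.replace(/^.+\s/, '') + '\t\t';
    if (/^# ATC_Codes/.test(str)) { ATC_Codes = 1; tmp += '\t'; }
    if (/^# Drug_Interactions/.test(str)) { Drug_Interactions = 1; tmp += '\t'; }
    if (/^# Indication/.test(str)) { Indication = 1; tmp += '\t'; }
    if (/^# KEGG_Drug_ID/.test(str)) { KEGG_Drug_ID = 1; tmp += '\t'; }
    if (/^# Pathways/.test(str)) { Pathways = 1; tmp += '\t'; }
    // End of card: turn each ',' that precedes a tab into a tab, then store the row.
    if (/^#\s?END_DRUGCARD/.test(str)) txt += tmp.replace(/,\t/g, '\t\t') + '\r\n';
}
var caption = 'BEGIN_DRUGCARD\t\tATC_Codes\t\tDrug_Interactions\t\tIndication\t\tKEGG_Drug_ID\t\tPathways\r\n';
WSH.Echo(caption + txt);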

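Author: sweet惜缘 Time: 2015-12-28 11:13

Last edited by sweet惜缘 on 2015-12-28 14:55

Reply to 3# pcl_test
The program works! Thank you. I have one more question: can the Drug_Target_ID information of each drug be extracted as well? (Some drugs have more targets than shown below, possibly several dozen.)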

# Drug_Target_1_ID:
3819

# Drug_Target_1_Locus:
11p11-q12

# Drug_Target_1_Molecular_Weight:
70037

# Drug_Target_1_Name:
Prothrombin

# Drug_Target_1_Number_of_Residues:
622

# Drug_Target_1_PDB_ID:
1HAG

# Drug_Target_2_ID:
54

The out_target.txt I hope to end up with looks like this (leave the cell empty when a drug has no target information):
BEGIN_DRUGCARD  Drug_Target_1_ID  Drug_Target_2_ID  Drug_Target_3_ID  .....
DB00001         3819              54                ...
DB00002         ...               ...               ...
...
...

Many thanks!!!
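
One possible way to approach the Drug_Target_N_ID request is sketched below in plain JavaScript. It is not a reply from the thread. It assumes each `# Drug_Target_N_ID:` header is followed directly by a single value line, as in the excerpt above, and it sizes the table by the largest N found in the input. `targetTable` is a hypothetical helper name, and the second card in the sample input is made up to show the empty-cell case.

```javascript
// Collect Drug_Target_N_ID values per drug and lay them out as a tab-separated table.
// targetTable is a hypothetical helper name.
function targetTable(text) {
    var rows = [];   // one { id, targets } object per drug card
    var row = null;  // card currently being read
    var slot = 0;    // N of the Drug_Target_N_ID field being read, 0 if none
    var maxN = 0;    // largest N seen; decides the number of columns
    var lines = text.split(/\r?\n/);
    var m, n;
    for (var i = 0; i < lines.length; i++) {
        var line = lines[i];
        if ((m = /^#\s?BEGIN_DRUGCARD\s+(\S+)/.exec(line))) {
            row = { id: m[1], targets: {} };
            slot = 0;
        } else if (/^#\s?END_DRUGCARD/.test(line)) {
            if (row) rows.push(row);
            row = null;
            slot = 0;
        } else if (row && (m = /^#\s+Drug_Target_(\d+)_ID:\s*$/.exec(line))) {
            slot = parseInt(m[1], 10);
            if (slot > maxN) maxN = slot;
        } else if (/^\s*$/.test(line) || /^#/.test(line)) {
            slot = 0;  // a blank line or any other header ends the field
        } else if (row && slot) {
            row.targets[slot] = line.replace(/^\s+|\s+$/g, '');
            slot = 0;  // only the first value line is kept
        }
    }
    var header = ['BEGIN_DRUGCARD'];
    for (n = 1; n <= maxN; n++) header.push('Drug_Target_' + n + '_ID');
    var out = [header.join('\t')];
    for (var r = 0; r < rows.length; r++) {
        var cells = [rows[r].id];
        // Missing targets become empty cells.
        for (n = 1; n <= maxN; n++) cells.push(rows[r].targets[n] || '');
        out.push(cells.join('\t'));
    }
    return out.join('\n');
}

var sample = [
    '#BEGIN_DRUGCARD DB00001',
    '',
    '# Drug_Target_1_ID:',
    '3819',
    '',
    '# Drug_Target_1_Name:',
    'Prothrombin',
    '',
    '# Drug_Target_2_ID:',
    '54',
    '',
    '#END_DRUGCARD DB00001',
    '',
    '#BEGIN_DRUGCARD DB00002',  // made-up card with no target fields
    '',
    '#END_DRUGCARD DB00002'
].join('\n');

console.log(targetTable(sample));
```

Because the header pattern requires `_ID:` right after the target number, fields such as `Drug_Target_1_PDB_ID` are not picked up. To run this under cscript like the reply above, the lines would have to be read with `FileSystemObject` instead of splitting a string.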
