
Title: [Text Processing] Batch script: split a TXT file by keyword

Author: qd2024    Time: 2023-2-2 08:11     Title: Batch script: split a TXT file by keyword

Last edited by qd2024 on 2023-2-2 14:32

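In the TXT file, every split point is a line that begins with a keyword, for example ★★★★★.

Using "★★★★★" as the marker, split one TXT file into several TXT files.
Each generated TXT file starts with the line that carried "★★★★★", and the text of that line is used as the file name.
The first "★★★★★" marks the start of the first file; the second "★★★★★" is both the start of the second file and the end of the first; the last section runs to the end of the file.
Neither the first line of a generated file nor its file name should contain "★★★★★".

The sample text below should end up as 4 txt files: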
Module1 unit1 What a delicious smell
Module1 unit2 You sound just like me!
Module2 unit1 What are you doing
Module2 unit2 They have been to many interesting places.

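If the punctuation includes characters that are illegal in file names, delete them; they should not be written into the file name.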
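
For readers who want to check the expected result outside of batch, the requirement above can be sketched in Python. This is only an illustration and not code from the thread: the names `split_by_marker` and `write_sections`, the UTF-8 default, and the exact set of characters treated as illegal (the ones Windows rejects in file names) are all assumptions.

```python
import re
from pathlib import Path

MARKER = "★★★★★"
# Assumption: "illegal" means the characters Windows rejects in file names.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def split_by_marker(text, marker=MARKER):
    """Map each output file name to its content.

    A line starting with `marker` opens a new section. The marker is
    dropped from both the first line and the file name; illegal
    characters are dropped from the file name only.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith(marker):
            title = line[len(marker):].strip()
            current = ILLEGAL_CHARS.sub("", title).strip() + ".txt"
            sections[current] = [title]
        elif current is not None:
            sections[current].append(line)
        # Lines before the first marker are ignored.
    return {name: "\n".join(body) + "\n" for name, body in sections.items()}


def write_sections(sections, out_dir=".", encoding="utf-8"):
    """Write every section to its own file under `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, body in sections.items():
        (out / name).write_text(body, encoding=encoding)
```

With the sample text, `split_by_marker` should return four entries, and the file name of the third one loses its `?` while the first line inside the file keeps it.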


Thanks



Sample source text:
★★★★★Module1 unit1 What a delicious smell
Tony: Mnn…What a delicious smell! Your pizza looks so nice.
Betty: Thanks! Would you like to try some?
Tony: Yes, please. It looks lovely, it smells delicious and mm, it tastes good.
Daming: What’s that on top?
Betty: Oh, that’s cheese. Do you want to try a piece?
Daming: Ugh! No, thanks. I’m afraid I don’t like cheese. It doesn’t smell fresh. It smells too strong and it tastes a bit sour.
Betty: Well, my chocolate cookies are done now. Have a try!
Daming: Thanks! They taste really sweet and they feel soft in the middle.
Tony: Are you cooking lots of different things? You look very busy!
Betty: Yes, I am! There’s some pizza and some cookies, and now I’m making an apple pie and a cake.
Daming: Apple pie sounds nice. I have a sweet tooth, you know. Shall I get the sugar?
Betty: Yes, please. Oh, are you sure that’s sugar? Taste it first. It might be salt!
Daming: No, it’s OK. It tastes sweet. It’s sugar.
Tony: What’s this? It tastes sweet too.
Betty: That’s strawberry jam, for the cake.
Daming: Good, everything tastes so sweet! It’s my lucky day!
★★★★★Module1 unit2 You sound just like me!
Hi Lingling,
Thanks for your last message. It was great to hear from you, and I can't wait to meet you.
I hope you will know me from my photo when I arrive at the airport. I'm quite tall, with short fair hair, and I wear glasses. I'll wear jeans and a T-shirt for the journey, but I'll also carry my warm coat. I've got your photo - you look very pretty. I'm sure we'll find each other!
Thanks for telling me about your hobbies. You sound just like me! I spend a lot of time playing classical music with my friends at school, but I also like dance music - I love dancing! I enjoy sports as well, especially tennis. My brother is in the school tennis team - I'm very proud of him! He's good at everything, but I'm not. Sometimes I get bad marks at school, and I feel sad. I should work harder.
You asked me, "How do yo feel about coming to China? "Well, I often feel a bit sad at first when I leave my mum and dad for a few days, and I'm quite shy when I'm with strangers. I feel nervous when I speak Chinese, but I'll be fine in a few days. I'm always sorry when I don't know how to do things in the right way, so please help me when I'm with you in China! Oh, I'm afraid of flying too. But I can't tell you how excited I am about going to China!
See you next week!
★★★★★Module2 unit1 What are you doing?
Tony: Hi, Lingling. What are you doing?
Lingling: I'm entering a competition.
Tony: What kind of competition?
Lingling: A speaking competition.
Tony: "Great. "It'll help you improve your speaking. And maybe you will win a prize.
Lingling: The first prize is "My dream holiday".
Tony: Have you ever won any prizes before?
Lingling: No, I haven't. I've always wanted to go on a dream holiday. But I can't afford it. The plane tickets are too expensive.
Tony: Well, good luck! I've also entered lots of speaking competitions, but haven't won any prizes. I've stopped trying now.
Lingling: That's a pity. Have you ever thought about other kinds of competitions?
Tony: What do you mean?
Lingling: look! Here's a writing competition: Around the world in 80 Days. To win it, you need to write a short story about a place you've visited.
Tony: That sounds wonderful, but I haven't travelled much. How can I write about it?
Lingling: Don't worry. It doesn’t need to be true! You can make it up.
Tony: You're right. I'll try. I hope I will win, then I will invite you to come with me.
Lingling: Sorry! The first prize is only the book called Around the World in 80 Day!
★★★★★Module2 unit2 They have been to many interesting places.
Mike Robinson is a fifteen-year-old American boy and his sister Clare is fourteen. At the moment, Mike and Clare are in Cairo in Egypt, one of the biggest and busiest cities in Africa.
They moved here with their parents two years ago. Their father, Peter works for a very big company. "The company has offices in many countries, and it has sent Peter to work in Germany, France and China before. "Peter usually stays in a country for about two years. Then the company moves him again. His family always goes with him.
The Robinsons love seeing the world. They have been to many interesting places. For example, in Egypt, they seen the Pyramids, travelled on a boat on the Nile River, and visited the palaces and towers of ancient kings and queens.
Mike and Clare have also begun to learn the language of the country, Arabic. This language is different from English in many ways, and they find it hard to spell and pronounce the words. However, they still enjoy learning it. So far they have learnt to speak German, French, Chinese and Arabic. Sometimes they mix the languages. "It's really fun, "said Clare.
The Robinsons are moving again. The company has asked Peter to work back in the US. Mike and Clare are happy about this. They have friends all over the word, but they also miss their friends in the US. They are counting down the days.
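Author: qixiaobin0715    Time: 2023-2-2 14:13

The OP seems to have forgotten that some characters cannot appear in file names. For example, the name of the third generated file would contain the character "?", which is illegal.
Author: qd2024    Time: 2023-2-2 14:31

Reply to 2# qixiaobin0715

Got it, thanks.
If there are illegal characters, just delete them.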
Author: qixiaobin0715    Time: 2023-2-2 14:50

Save the batch file with ANSI encoding:

    @echo off
    setlocal enabledelayedexpansion
    rem Remember the line number of every marker line as a variable _N
    findstr /n /rb "★★★★★" 1.txt>1.log
    for /f "delims=:" %%a in (1.log) do set _%%a=true
    del 1.log
    rem Walk through the file, the leading marker is removed by delims
    rem On a marker line the first two words become the file name
    for /f "tokens=* delims=★" %%i in (1.txt) do (
        set /a n+=1
        if defined _!n! (
            for /f "tokens=1,2" %%a in ("%%i") do set filename=%%a %%b.txt
        )
        echo,%%i>>!filename!
    )
    pause


Author: qixiaobin0715    Time: 2023-2-2 14:56

For the sample text, this can actually be done without adding any markers.
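
A rough Python sketch of that idea, assuming every article starts with a heading shaped like `Module<N> unit<N>` as in the sample. The `HEADING` pattern and the `split_by_heading` name are illustrative and do not come from the thread.

```python
import re

# Assumed heading shape, taken from the sample text: "Module1 unit1 ..."
HEADING = re.compile(r"^Module\d+ unit\d+")


def split_by_heading(lines, heading=HEADING):
    """Group lines into (title, body_lines) pairs, one per heading line."""
    sections = []
    for line in lines:
        if heading.match(line):
            # A heading line opens a new section.
            sections.append((line.rstrip("\n"), []))
        elif sections:
            # Any other line belongs to the most recent section.
            sections[-1][1].append(line.rstrip("\n"))
    return sections
```

The heading line itself then serves as the split point, so no ★★★★★ marker has to be typed in by hand.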
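Author: qd2024    Time: 2023-2-2 15:14

Reply to 5# qixiaobin0715

If no markers are added, how can it be done? My goal is to put English texts for my child into separate TXT files, one text per file.
What would be the simpler way?

Thanks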
Author: qd2024    Time: 2023-2-2 15:21

Reply to 4# qixiaobin0715

It reports:
The system cannot find the path specified.

I can't get it to work.
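Author: qixiaobin0715    Time: 2023-2-2 15:28

Last edited by qixiaobin0715 on 2023-2-2 15:29

Upload your source file to a cloud drive and I will test it for you.
Copying your sample text works without problems here.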
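Author: 77七    Time: 2023-2-2 15:40

Reply to 8# qixiaobin0715

I found a problem: in a line like the one below (the 11th line from the end), everything from "!The first" onward is treated as a variable and dropped.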
Lingling: Sorry! The first prize is only the book called Around the World in 80 Day!
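
The loss can be reproduced with a deliberately simplified Python model of delayed expansion. It assumes only two rules: text between a pair of `!` is looked up as a variable name (undefined names expand to nothing), and an unpaired `!` is dropped. The real cmd.exe parser has more rules (caret escapes, for example), so `expand_delayed` is a hypothetical illustration and not a faithful emulation.

```python
import re


def expand_delayed(line, variables=None):
    """Simplified model of !var! expansion with delayed expansion enabled."""
    variables = variables or {}
    # Replace each !name! pair with the variable's value, or nothing if undefined.
    expanded = re.sub(r"!([^!]*)!", lambda m: variables.get(m.group(1), ""), line)
    # A leftover unpaired "!" is removed as well.
    return expanded.replace("!", "")


line = "Lingling: Sorry! The first prize is only the book called Around the World in 80 Day!"
print(expand_delayed(line))
```

Under this model the text between the two exclamation marks is taken as one long, undefined variable name, which is why only the part before the first `!` survives.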

Author: qd2024    Time: 2023-2-2 15:42

Reply to 8# qixiaobin0715

Link: https://pan.baidu.com/s/1HW-cqi8FXobeeYjs-3wkQw?pwd=28h1
Extraction code: 28h1
Author: 77七    Time: 2023-2-2 16:05

    @echo off
    rem findstr /n numbers every line so that for /f does not skip empty lines.
    rem Each line is stored in a variable before delayed expansion is enabled,
    rem which keeps exclamation marks in the text intact.
    for /f "delims=" %%a in ('type "文本.txt" ^| findstr /n .*') do (
        set "line=%%a"
        setlocal enabledelayedexpansion
        set "line=!line:*:=!"
        set "line2=!line:★★★★★=!"
        if not "!line2!" equ "!line!" (
            if not "!line2!" equ "★★★★★=" (
                set "line2=!line2:?=!"
                >xxx.temp echo !line2!
            )
        )
        set /p filename=<xxx.temp
        if "!line2!" equ "★★★★★=" (
            (echo,!line!)>>"!filename!.txt"
        )
        if "!line2!" equ "!line!" (
            (echo,!line!)>>"!filename!.txt"
        )
        endlocal
    )
    del xxx.temp
    pause

I thought about it for a long time and still ended up using a temporary file... it is not very general...
Author: terse    Time: 2023-2-2 16:49

Let's handle the special characters as well:

    @echo off
    set/p w=<%~fs0 >nul
    set "s=★★★★★"
    setlocal enabledelayedexpansion
    rem A marker line sets a new file name, other lines go to the current file
    for /f "tokens=*" %%i in (1.txt) do (
        set "str=%%~i"
        if "!str:~,5!" == "!s!" (
           set "file=!str:*%s%=!.txt"
           set "filename="
           call :loop "!file!"
        ) else if defined filename (>>"!filename!" echo;!str!)
    )
    pause & exit
    rem Build the file name piece by piece, skipping characters not allowed in file names
    :loop
    for /f tokens^=1*delims^=:\/*?^<^>^" %%a in ("%~1") do (
         set filename=!filename!%%a
         call :loop "%%b"
    )
    exit /b


Author: qixiaobin0715    Time: 2023-2-2 16:54

Reply to 10# qd2024
This can be handled as follows:
1. Open the Word file to be processed;
2. In Save As, choose plain text as the file type; in the dialog, under text encoding, choose Other encoding and then GB2312, and confirm;
3. In the saved text, delete every line before the "Module1 unit1" line, and delete the full-width spaces at the start of each line;
4. Save the code below with ANSI encoding and run the batch file.

    @echo off
    findstr /n /rb "Module[0-9]*.unit[0-9]" 1.txt>1.log
    for /f "delims=:" %%a in (1.log) do set _%%a=true
    del 1.log
    for /f "tokens=1* delims=:" %%i in ('findstr /n .* 1.txt') do (
        if defined _%%i set "filename=%%j.txt"
        set "str=%%j"
        setlocal enabledelayedexpansion
        echo,!str!>>!filename!
        endlocal
    )
    pause


Author: hfxiang    Time: 2023-2-2 18:14

Reply to 1# qd2024

Download gawk ( http://bcn.bathome.net/tool/4.1.3/gawk.exe ) and make sure both the text and the script are saved with ANSI encoding; running the command below then produces the desired result.

    gawk -F"^★★★★★" "/^★★★★★/{F_n=gensub(/[!&<>/\|:*?\"]+/,\"\",\"g\",$2)}F_n{print $0^>F_n}" 文本.txt


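Author: qd2024    Time: 2023-2-3 07:58

Reply to 13# qixiaobin0715

It worked, many thanks.
Author: qd2024    Time: 2023-2-3 07:58

Reply to 14# hfxiang
Author: qd2024    Time: 2023-2-3 07:59

Reply to 11# 77七

Thanks
Author: qd2024    Time: 2023-2-3 07:59

Reply to 14# hfxiang

Thanks
Author: qd2024    Time: 2023-2-3 20:01

Reply to 13# qixiaobin0715
After splitting, the files contain garbled characters in unpredictable places. Can this be fixed?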

    Module4 unit1
Doctor: How can I help you?
Daming: I fell ill. I鈥檝e got a stomach ache and my head hurts.
Doctor: How long have you been like this?
Daming: Since Friday. I've been ill for about three days!
Doctor: I see. Have you caught a cold?
Daming: I don't think so.
Doctor: Let me take your temperature鈥mm, there's no fever. What kind of food do you eat?
Daming: Usually fast food.
Doctor: Do you have breakfast?
Daming: No, not usually.
Doctor: 鈥淭hat's the problem! Fast food and no breakfast.鈥?That's why you've got a stomach ache.
Daming: What about the headache?
Doctor: Do you do any exercise?
Daming: Not really. I haven't done much exercise since I got my computer last year.
Doctor: 鈥淵ou spend too much time in front of the computer.鈥?It can be very harmful to your health.
Daming: OK, so what should I do?
Doctor: Well, don't worry. It's not serious. First, stop eating fast food and have breakfast every day. Second, get some exercise, such as running. And I'll give you some medicine. Take it three times a day.
Daming: Thank you, doctor.

====

Module3 unit2
鈥淪cientists think that there has been life on the earth for hundreds of millions of years.鈥?However, we have not found life on any other planets yet.
The earth is a planet and it goes around the sun. Seven other planets also go around the sun. None of them has an environment like that of the earth, so scientists do not think they will find life on them. The sun and its planets are called the solar system, and our solar system is a small part of a much larger group of stars and planets, called the Galaxy or the Milky Way. There are billions of stars in the Galaxy, and our sun is only one of them.
Scientists have also discovered many other galaxies in the universe. They are very far away and their light has to travel for many years to reach us. So how large is the universe? It is impossible to imagine.
Scientists have sent spaceships to the planet Mars to take photos. They have even sent spaceships to travel outside the solar system. However, no spaceship has travelled far enough to reach other stars in our Galaxy.
Scientists have always asked the questions: with so many stars in the universe, are we alone, or is there life out there in space? Have there been visitors to the earth from other planets? Why has no one communicated with us? We do not know the answers... yet.



=====

Module3 unit1
Daming: Hi, Tony. What are you up to?
Tony: Hi Daming. I've just made a model spaceship for our school project.
Daming: I haven't started yet because I'm not sure how to make it. Can you help me?
Tony: Sure, no problem. Have you heard the latest news? Scientists have sent a spaceship to Mars. The journey has taken several months.
Daming: Has it arrived yet?
Tony: Yes, it has arrived already. That's why it's on the news.
Daming: So have they discovered life on the Mars?
Tony: No, they haven't yet.
Daming: Are there any astronauts in the spaceship?
Tony: No, there aren't.
Daming: 鈥淲hy not? Astronauts have already been to the moon.鈥?
Tony: Yes, but no one has been to Mars yet, because Mars is very far away, much farther than the moon. Lots of scientists are working hard in order to send astronauts to Mars one day.
Daming: That's interesting! How can I get information on space travel?
Tony: You can go online to search for information.
Daming: I will. Thank you, Tony!
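Author: qd2024    Time: 2023-2-3 22:06

Reply to 13# qixiaobin0715

The split TXT files go through further processing afterwards. After a lot of trial and error, I found that only UTF-8 works without garbled characters. Thanks, please take a look.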
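
The garbling quoted above looks like UTF-8 text being read as GBK (the "ANSI" code page on Simplified Chinese Windows). That diagnosis is an assumption based on the visible pattern, not something confirmed in the thread. A small Python check with one of the affected phrases:

```python
original = "I’ve got a stomach ache"  # contains the curly apostrophe U+2019

# Encode as UTF-8, then (wrongly) decode the same bytes as GBK.
garbled = original.encode("utf-8").decode("gbk")
print(garbled)

# Reversing the two steps recovers the text as long as no bytes were lost.
restored = garbled.encode("gbk").decode("utf-8")
print(restored == original)
```

If the printed text shows the same kind of sequences as the quoted output, the curly quotes and apostrophes in the source are the affected characters, and reading and writing every file with one consistent encoding should avoid the problem.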
Author: aloha20200628    Time: 2023-2-3 23:33

1. The OP can first use Notepad to save the source file with ANSI encoding.
2. Save the batch script below with ANSI encoding as well.

    @set @v=1 /*
    @echo off
    set "tF=" &set/p "tF=Source file: "
    if not defined tF exit/b
    (cscript.exe -e:jscript "%~f0" %tF%)
    exit/b
    */
    // JScript part: read the whole file and start a new output file at each marker line
    var v=WScript.arguments;
    var fso=new ActiveXObject('scripting.filesystemobject');
    var fr=fso.opentextfile(v(0));
    var alllines=fr.readall().split('\r\n'); fr.close();
    var n, nL=alllines.length, outF='', fw;
    for (n=0; n<nL; ++n)
        if (alllines[n].indexOf('★') != -1) {
            if (outF != '') fw.close();
            // the file name is the marker line without the marker and without "?"
            outF=alllines[n].replace(/[★\?]/g, '');
            outF+='.txt';
            fw=fso.opentextfile(outF, 2, true);
        }
        // lines before the first marker have no output file yet and are skipped
        else if (outF != '') fw.write(alllines[n]+'\r\n');
    if (outF != '') fw.close();
    WSH.quit(0);


Author: qd2024    Time: 2023-2-4 06:07

Reply to 21# aloha20200628

The split TXT files go through further processing afterwards. After a lot of testing,
only UTF-8 encoding works.

Thanks
Author: qixiaobin0715    Time: 2023-2-4 09:28

Use UTF-8 encoding for both the source file and the batch file:

    @echo off &@cls&chcp>nul 65001
    findstr /n /rb "Module[0-9]*.unit[0-9]" 1.txt>1.log
    for /f "delims=:" %%a in (1.log) do set _%%a=true
    del 1.log
    for /f "tokens=1* delims=:" %%i in ('findstr /n .* 1.txt') do (
        if defined _%%i set "filename=%%j.txt"
        set "str=%%j"
        setlocal enabledelayedexpansion
        echo,!str!>>!filename!
        endlocal
    )
    pause


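Author: aloha20200628    Time: 2023-2-4 11:26

Let me share my testing procedure with the OP:
1. The system is Windows 8.1, Simplified Chinese edition.
2. Copy the OP's source text into Notepad and save it as a.txt with ANSI encoding.
3. Save my batch script with Notepad as a.cmd, also with ANSI encoding.
4. a.txt and a.cmd are in the same directory.
5. Run a.cmd and drag in or type a.txt.
6. The result is 4 *.txt files generated in the directory of a.txt, fully matching what the OP asked for (the !...! passages in the source text are not lost).
How does the OP's testing procedure differ from the above?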
Author: qixiaobin0715    Time: 2023-2-4 14:50

Reply to 24# aloha20200628
The OP's requirement is that the split files be encoded as UTF-8.
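
If the splitting itself is done in ANSI (GBK), one way to meet that requirement is to re-encode the resulting files afterwards. A minimal Python sketch, assuming the split files are GBK encoded; `reencode` is a hypothetical helper name, not something used in the thread.

```python
from pathlib import Path


def reencode(src, dst, src_encoding="gbk", dst_encoding="utf-8"):
    """Rewrite the file `src` as `dst` in another encoding.

    Bytes are read and written directly so that line endings stay as they are.
    """
    text = Path(src).read_bytes().decode(src_encoding)
    Path(dst).write_bytes(text.encode(dst_encoding))
```

Calling it once per generated file, for example in a loop over `Path(".").glob("*.txt")`, would convert the whole batch; decoding raises `UnicodeDecodeError` if a file is not actually in the assumed source encoding.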
Author: qixiaobin0715    Time: 2023-2-4 14:51

Reply to 22# qd2024
What further processing is needed afterwards?
Author: hfxiang    Time: 2023-2-5 10:12

Last edited by hfxiang on 2023-2-5 12:30

Reply to 10# qd2024

Save the Word document with GB2312 encoding as "最新八年级外研版英语下册课文.txt". After repeated testing on Windows 10, the following gawk ( http://bcn.bathome.net/tool/4.1.3/gawk.exe ) script does the job (no garbled characters):

    gawk -vRS="Module[0-9]+ unit[0-9]+" "F_n{print F_n\"\n\"$0>F_n\".txt\"}{F_n=RT}" 最新八年级外研版英语下册课文.txt


Author: terse    Time: 2023-2-5 18:22

PowerShell can export the txt files directly from the Word document. Here the file is named a.docx.

    <# : batch portion (begins PowerShell multi-line comment block)
    @echo off & setlocal
    powershell -noprofile -NoLogo "iex (${%~f0} | out-string)"
    pause
    exit
    #>
    $word = New-Object -ComObject Word.Application
    $file = (ls a.docx).FullName
    $doc = $word.Documents.Open($file)
    $text = $doc.Content.Text
    $pattern =[regex] '(?i)(Module\d+\s+unit\d+)[\r\n]*(.+?)(?=Module\d+\s+unit\d+|$)'
    $paragraphs = [regex]::matches($text,$pattern)
    $doc.Close()
    $word.Quit()
    $paragraphs.ForEach({[IO.File]::WriteAllText( $_.Groups[1].Value+ '.txt',$_.Groups[2].Value,[Text.Encoding]::Default)})


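Author: qd2024    Time: 2023-2-6 16:08

Reply to 26# qixiaobin0715

I use other code to add Chinese glosses to the words, and that code requires UTF-8.

It works now.
Author: qd2024    Time: 2023-2-6 16:09

Reply to 28# terse

Thanks, I will test it. Much appreciated.
Author: qd2024    Time: 2023-2-6 16:09

Reply to 27# hfxiang

OK, thanks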




Welcome to 批处理之家 (http://bathome.net./) Powered by Discuz! 7.2