A batch script: converting PDF files from multiple directories into txt files
2013-07-13 14:02
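(1) The folders are organized as follows:
root
 |---- src          #- holds the PDF files to convert; file names containing spaces are handled
 |
 |---- result       #- holds the converted text files, each with the same name as its PDF file
 |
 |---- PDFMiner     #- contains the Python scripts that convert the PDFs; it must be installed
 |                  #  on your computer, with: python setup.py install
 |
 |---- pdf2txt.bat  #- the batch file; it walks through all PDF files under .\src
 |                  #  and converts them to text files stored in result
 |
 |---- tempdir.txt  #- the PDF files found in src (with full path), one per line, each line
                    #  ending with ';' {created and deleted automatically}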
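To make the tempdir.txt format concrete, here is a minimal Python sketch of writing and reading such a list. The helper names write_list and read_list are hypothetical and not part of the batch script; the sketch only assumes what is described above, namely one full path per line terminated by ';'. Using ';' as the separator matters because the batch `for /f` command splits on spaces and tabs by default, which would cut a file name at its first space.

```python
def write_list(paths, list_file):
    """Write one path per line, each terminated by ';' (the tempdir.txt format)."""
    with open(list_file, "w", encoding="utf-8") as fh:
        for path in paths:
            fh.write(f"{path};\n")


def read_list(list_file):
    """Read the list back, keeping only the text before the first ';' on each line."""
    names = []
    with open(list_file, encoding="utf-8") as fh:
        for line in fh:
            name = line.split(";", 1)[0].strip()
            if name:  # skip blank lines
                names.append(name)
    return names
```

Spaces inside a path survive the round trip, since only ';' is treated as a delimiter.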
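(2) Usage:
Step 1: [optional] First make sure Python 2.7 or later is installed, then install PDFMiner and note the installation path.
Step 2: [optional] In pdf2txt.bat, edit the line:
set tool.path=C:\PDFMiner\tools
so that pdf2txt.py can be found under that path.
Step 3: [required] Copy the PDF files to convert (they may be in sub-folders) into root\src.
Step 4: [required] Run pdf2txt.bat.
Step 5: [required] Sit down with a cup of tea, .... an hour later .... check result.
(Careful, the computer may get very hot!!!)
PS: The Python tool used for the PDF-to-txt conversion is:
C:\PDFMiner\tools\pdf2txt.py
pdfminer, a free tool, must be installed beforehand.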
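Step 2 depends on pdf2txt.py really being under the configured tool path. As a quick check before starting a long run, a small Python sketch along these lines could be used. The function name find_pdf2txt is hypothetical, and C:\PDFMiner\tools is simply the default from the batch file, so adjust it to your own installation.

```python
import os


def find_pdf2txt(tool_path):
    """Return the full path to pdf2txt.py inside tool_path, or None if it is missing."""
    candidate = os.path.join(tool_path, "pdf2txt.py")
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    # Assumption: the same default location that pdf2txt.bat uses.
    script = find_pdf2txt(r"C:\PDFMiner\tools")
    print(script or "pdf2txt.py not found; adjust tool.path in pdf2txt.bat")
```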
The script is as follows:
@echo off
rem ********************************************************************************
rem * step 1: set the PDF path          --- pdf.path
rem *         set the destination path  --- dst.path   // where the text files are saved
rem *         set the tool path         --- tool.path  // folder containing pdf2txt.py
rem *         set the file extension    --- extend     // extension to search for
rem *         delete the old 'tempdir.txt', which lists all files found (*.extend)
rem *
rem * step 2: search for all files whose extension is %extend% in %pdf.path%
rem *         and its sub-folders using the 'for' command with option '/r',
rem *         then save each file name with its full path to 'tempdir.txt'
rem *
rem * step 3: call the Python program to convert the PDF files to text files
rem *         and save them in the folder (%dst.path%) under the same file name.
rem *
rem * Created by Juking at 2013-7-13
rem ********************************************************************************

rem Step 1: set the global variables.
set pdf.path=.\src
set dst.path=.\result
set tool.path=C:\PDFMiner\tools
set extend=pdf
if exist tempdir.txt del tempdir.txt
if not exist "%dst.path%" mkdir "%dst.path%"

rem Step 2: search for the files and save their names in tempdir.txt;
rem         each line ends with ';', which step 3 uses as delims=;
for /r "%pdf.path%" %%a in (*.%extend%) do (
    echo %%a; >> tempdir.txt
)

rem Step 3: PDF to text, saved to %dst.path%
for /f "delims=;" %%i in (.\tempdir.txt) do (
    rem step 3.1: %%~ni is the file name without path or extension (i.e. without '.pdf')
    echo %%i
    rem step 3.2: call the Python script to convert the PDF to txt
    python "%tool.path%\pdf2txt.py" "%%i" > "%dst.path%\%%~ni.txt"
)
PDFMiner must be installed.
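For readers who would rather avoid batch syntax, the same three steps can be sketched in Python. This is an illustration under stated assumptions, not the author's script: the function names find_files, output_path and convert_all are hypothetical, the folder names and tool path are whatever you pass in, and it assumes, as the batch file does, that pdf2txt.py writes the extracted text to standard output. It needs Python 3 to run, while the interpreter used for pdf2txt.py is a separate parameter.

```python
import os
import subprocess
import sys


def find_files(src_dir, extension="pdf"):
    """Return all files under src_dir (recursively) with the given extension, sorted."""
    suffix = "." + extension.lower()
    found = []
    for folder, _subdirs, names in os.walk(src_dir):
        for name in names:
            if name.lower().endswith(suffix):
                found.append(os.path.join(folder, name))
    return sorted(found)


def output_path(pdf_path, dst_dir):
    """Map a PDF path to dst_dir/<same base name>.txt, like %%~ni.txt in the batch file."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(dst_dir, stem + ".txt")


def convert_all(src_dir, dst_dir, tool_path, python_exe="python"):
    """Run pdf2txt.py on every PDF, redirecting its standard output to a text file."""
    os.makedirs(dst_dir, exist_ok=True)
    script = os.path.join(tool_path, "pdf2txt.py")
    for pdf in find_files(src_dir):
        print(pdf)
        with open(output_path(pdf, dst_dir), "wb") as out:
            # Arguments are passed as a list, so spaces in file names need no quoting.
            subprocess.call([python_exe, script, pdf], stdout=out)


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: python convert.py src result C:\PDFMiner\tools
    convert_all(sys.argv[1], sys.argv[2], sys.argv[3])
```

No intermediate tempdir.txt is needed here, because the file list stays in memory. As in the batch file, two PDFs with the same base name in different sub-folders map to the same output file, so the later one overwrites the earlier.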