
A batch script: converting PDF files from multiple directories to txt files

2013-07-13 14:02
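(1) The folders are organized as follows:
root------- src      #- holds the PDF files to convert; file names containing spaces are handled
|
|---- result         #- holds the converted text files, each with the same name as its PDF
|
|---- PDFMiner       #- keeps the Python scripts that convert the PDFs; this must be installed
|                    #  on your computer. To install: python setup.py install
|
|---- pdf2txt.bat    #- the batch file; it walks every PDF file under .\src and
|                    #  converts it to a text file stored in result
|
|---- tempdir.txt    #- the PDF files found in src (with paths), one per line, each line ending with ';' {generated/deleted automatically}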
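
Because every text file lands directly in result under the name of its PDF, two PDFs with the same name in different subfolders of src would write to the same output file. The following is a minimal sketch, in Python rather than batch (Python is already needed for PDFMiner), of a check for such clashes before a long run. The function name `find_name_collisions` is hypothetical and not part of the original script.

```python
import os
from collections import defaultdict


def find_name_collisions(src_dir, extension=".pdf"):
    """Group PDFs under src_dir by base name; return only names used more than once."""
    by_stem = defaultdict(list)
    for folder, _subdirs, files in os.walk(src_dir):
        for name in files:
            stem, ext = os.path.splitext(name)
            if ext.lower() == extension:
                # Lower-case the key because Windows file names are case-insensitive.
                by_stem[stem.lower()].append(os.path.join(folder, name))
    # Keep only the base names that would map to one and the same .txt file.
    return {stem: sorted(paths) for stem, paths in by_stem.items() if len(paths) > 1}
```

Calling `find_name_collisions("src")` from the root folder returns an empty dictionary when every PDF name is unique; otherwise each key lists the files that would overwrite one another's output.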
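(2) Usage:
Step 1: [optional] Make sure Python 2.7 or later is installed, then install PDFMiner and note the installation path.
Step 2: [optional] In pdf2txt.bat, edit the line: set tool.path=C:\PDFMiner\tools
so that pdf2txt.py can be found under that path.
Step 3: [required] Copy the PDF files to convert (subfolders are fine) into root\src.
Step 4: [required] Run pdf2txt.bat.
Step 5: [required] Sit down and have a cup of tea ... an hour later ... check result. (Careful, the computer may get hot!)
PS: The conversion uses the Python tool C:\PDFMiner\tools\pdf2txt.py.
pdfminer, a free tool, must be installed beforehand.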
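
Step 2 asks you to confirm that pdf2txt.py sits under the tool path. A small sketch of that check, written in Python rather than batch, might look like the following. The helper name `locate_pdf2txt` is made up for illustration, and C:\PDFMiner\tools is simply the default path given above.

```python
import os


def locate_pdf2txt(tool_path):
    """Return the full path to pdf2txt.py under tool_path, or None if it is missing."""
    candidate = os.path.join(tool_path, "pdf2txt.py")
    # isfile() is False for a missing path and for a folder that happens to have this name.
    return candidate if os.path.isfile(candidate) else None
```

For example, `locate_pdf2txt(r"C:\PDFMiner\tools")` returning None would mean tool.path in pdf2txt.bat still needs to be corrected before Step 4.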
The script:
@echo off
rem ********************************************************************************
rem * step 1: set pdf path         --- pdf.path                                    *
rem *         set destination path --- dst.path   // save the text files           *
rem *         set tool path        --- tool.path  // tools path                    *
rem *         set file extension   --- extend     // files to search for           *
rem *         delete the old 'tempdir.txt' (list of all files found)               *
rem *                                                                              *
rem * step 2: search all files whose extension is %extend% in %pdf.path%           *
rem *         and its sub-folders by using command 'for' with option               *
rem *         '/r', then save the file names to text file 'tempdir.txt'            *
rem *         with full path                                                       *
rem *                                                                              *
rem * step 3: call python program to convert pdf files to text files               *
rem *         and save them to the folder (%dst.path%) with same file              *
rem *         name.                                                                *
rem * Created by Juking at 2013-7-13,                                              *
rem ********************************************************************************
rem Step 1: set the global variables.
set pdf.path=.\src
set dst.path=.\result
set tool.path=C:\PDFMiner\tools
set extend=pdf
if exist tempdir.txt del tempdir.txt
rem Step 2: search files, save the file names in tempdir.txt,
rem         end each line with ';' to match delims=; in step 3.
for /r "%pdf.path%" %%a in (*.%extend%) do (
    echo %%a; >> tempdir.txt
)

rem Step 3: pdf to text, save to %dst.path%
for /f "delims=;" %%i in (.\tempdir.txt) do (
    rem step 3.1: show the file being converted; %%~ni below is its name without path or extension.
    echo %%i

    rem step 3.2: call the python script to convert pdf to txt; the quotes keep paths with spaces intact.
    python "%tool.path%\pdf2txt.py" "%%i" > "%dst.path%\%%~ni.txt"
)
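
For comparison, the same three steps can be sketched in Python instead of batch. This is an illustration under stated assumptions, not the author's method: the names `plan_conversions` and `convert_all` and the `runner` parameter are hypothetical, and the sketch assumes pdf2txt.py writes the extracted text to standard output, as the redirection in the batch file implies. It also assumes PDFMiner is installed for the interpreter running the sketch, since it launches pdf2txt.py with `sys.executable`.

```python
import os
import subprocess
import sys


def plan_conversions(src_dir, dst_dir, extension=".pdf"):
    """Return (pdf_path, txt_path) pairs for every PDF under src_dir, subfolders included."""
    pairs = []
    for folder, _subdirs, files in os.walk(src_dir):
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            if ext.lower() == extension:
                pairs.append((os.path.join(folder, name),
                              os.path.join(dst_dir, stem + ".txt")))
    return pairs


def convert_all(src_dir, dst_dir, tool_path, runner=subprocess.call):
    """Run pdf2txt.py once per PDF, sending its standard output to the .txt file."""
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    script = os.path.join(tool_path, "pdf2txt.py")
    failures = []
    for pdf_path, txt_path in plan_conversions(src_dir, dst_dir):
        with open(txt_path, "wb") as out:
            # Passing the arguments as a list keeps paths with spaces intact.
            status = runner([sys.executable, script, pdf_path], stdout=out)
        if status != 0:
            failures.append(pdf_path)
    return failures
```

Run from the root folder, `convert_all("src", "result", r"C:\PDFMiner\tools")` would use the src and result folders from the layout in (1). The returned list names the PDFs whose conversion exited with a non-zero status. No intermediate tempdir.txt is needed here, because the file list is held in memory.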
Note: PDFMiner must be installed first.
