Converting Word .DOC and .PDF to text files

[1/13] from: jjmmes::yahoo::es at: 25-Sep-2002 11:20

Has anybody written any code to parse the native .DOC or .PDF formats and extract just the text? On Linux the wv tools do this well, but I am looking for a cross-platform REBOL solution. Thanks.

[2/13] from: greggirwin:mindspring at: 25-Sep-2002 10:45

Hi Jose,

<< Has anybody written any code to parse .DOC or .PDF native formats and extract just the text ? >>

Not that I know of.

--Gregg

[3/13] from: anton:lexicon at: 26-Sep-2002 19:54

Gabriele wrote a PDF generator, but he said parsing PDF files is really hard. Perhaps extracting just the text would be easier.

Anton.

[4/13] from: jjmmes:y:ahoo:es at: 26-Sep-2002 15:49

I've used pdf-maker and it's a great tool. I just want to do the opposite: strip the tags and get the plain text. wvware.com has open source code for these utilities, but I'd rather do it in REBOL directly, if it is not too complicated. I'll check with Gabriele. Thanks.

[5/13] from: anton:lexicon at: 27-Sep-2002 0:59

Maybe you could grab just the relevant parts of the source. One of us here might be able to convert it without too much trouble. I can understand C.

Anton.

[6/13] from: g:santilli:tiscalinet:it at: 26-Sep-2002 15:09

Hi Anton,

On Thursday, September 26, 2002, 11:54:15 AM, you wrote:

A> Gabriele wrote a pdf generator, but he said parsing
A> pdf files is really hard, but perhaps just the text
A> might be easier.

It depends. If you are lucky, getting the text out is really easy: you can easily get the text out of a PDF generated by PDF Maker, for example. However, if the file is compressed and/or encrypted, you need to do a fair amount of work to get to the text, and PDF files are usually at least compressed.

Regards,
Gabriele.

-- Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
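The lucky case described above, an uncompressed and unencrypted page content stream, can be sketched in a few lines. The example is in Python rather than REBOL, purely as an illustration. It assumes the text is drawn with literal strings followed by the PDF `Tj` (show text) operator; the function name `extract_text_uncompressed` and the sample stream are hypothetical. It ignores `TJ` arrays, hex strings, octal escapes, and font encodings, so it is a starting point and not a general extractor.

```python
import re

# A PDF literal string "( ... )" followed by the Tj (show text) operator.
# Backslash escapes such as \( and \) are allowed inside the string.
TJ_PATTERN = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def extract_text_uncompressed(content: str) -> list:
    """Return the strings shown by Tj operators in an uncompressed content stream."""
    pieces = []
    for match in TJ_PATTERN.finditer(content):
        raw = match.group(1)
        # Undo the backslash escapes for parentheses and the backslash itself.
        pieces.append(re.sub(r"\\([()\\])", r"\1", raw))
    return pieces


# Hypothetical content stream, similar in shape to what a simple generator emits.
sample = "BT /F1 12 Tf 72 720 Td (Hello, world) Tj 0 -14 Td (PDF \\(plain\\) text) Tj ET"
print(extract_text_uncompressed(sample))  # ['Hello, world', 'PDF (plain) text']
```

For a compressed or encrypted file the content stream has to be decoded first, which is where most of the extra work lies.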

[7/13] from: jjmmes:yah:oo:es at: 26-Sep-2002 18:21

Gabriele,

I want to get the text of any arbitrary PDF file. Is there a spec I can look at? There is an open source toolset, wvware.com, that does it, but I'd rather not look at their code. Thanks.

[8/13] from: g:santilli:tiscalinet:it at: 26-Sep-2002 19:36

Hi jose,

On Thursday, September 26, 2002, 6:21:10 PM, you wrote:

j> I want to get the text of any arbitrary PDF file. Is
j> there a spec I can look at ?

On the Adobe web site you'll find the full specifications for the PDF format. I can send them to you if you don't want to search for them. However, as I said, parsing a PDF file is harder than creating one, because you have to deal with all the possibilities (compression, encryption, linearized format...). Of course, this does not mean it is impossible.

Regards,
Gabriele.
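Of the complications listed above, compression is the one met most often. A minimal sketch in Python (not REBOL), assuming the streams use the `/FlateDecode` filter, whose data is ordinary zlib/deflate: the function name `inflate_streams` and the in-memory sample object are hypothetical. It does not handle encryption, other filters, predictors, or object streams, and a real parser would read the stream's `/Length` entry instead of searching for the `endstream` keyword.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_PATTERN = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the decompressed bytes of every stream that zlib can inflate."""
    results = []
    for match in STREAM_PATTERN.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream".
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (another filter, or encrypted); skip this stream.
            continue
    return results


# Build a tiny stand-in for one compressed PDF object.
content = b"BT (Hello) Tj ET"
compressed = zlib.compress(content)
fake_pdf = (
    b"4 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(compressed)
    + compressed
    + b"\nendstream\nendobj\n"
)
print(inflate_streams(fake_pdf))  # [b'BT (Hello) Tj ET']
```

Once a stream is inflated, its text operators can be scanned in the same way as in an uncompressed file.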

[9/13] from: jjmmes:yah:oo:es at: 26-Sep-2002 19:59

Thanks. This is the advice I needed. I'll probably be better off using the wvware library. I hope Word is easier than PDF!

[10/13] from: g:santilli:tiscalinet:it at: 26-Sep-2002 21:07

Hi jose,

On Thursday, September 26, 2002, 7:59:53 PM, you wrote:

j> I hope word is easier than PDF !

I don't think so. Actually, MS does its best to keep other people from reading its formats. However, maybe just reading the text is not difficult.

Regards,
Gabriele.

[11/13] from: brett:codeconscious at: 27-Sep-2002 22:18

Jose,

Do you have a choice between .DOC and .PDF, or must you be able to extract from both? If you have a choice, can you use another format instead?

Brett.

[12/13] from: jjmmes::yahoo:es at: 27-Sep-2002 18:50

I want to create a document management solution, and I believe most documents will be in Word format, so that should be the top choice. I could also store the documents in a neutral format alongside the native formats. Any ideas?

Jose

[13/13] from: brett:codeconscious at: 29-Sep-2002 12:26

> I want to create a document management solution and I
> believe most documents will be in word format so that
> should be the top choice.
>
> I could also store the documents in a neutral format
> apart from the native formats. Any ideas ?

Your requirements dictate the solution, I guess. Obviously the solution must work now, but give some thought to the future. Ask why that particular format is needed, or whether it is just the assumed choice. User training and experience with Word is a powerful deciding factor, but if the information must be organised and further processed, then maybe you have a case for a more meaningful (as in computer-processable) format.

Brett.