How do I convert LaTeX to plain-text (ASCII)?
Problem Overview
Scenario:
I have a document I created using LaTeX (my resume in this case). It compiles correctly with pdflatex and the output is exactly what I'd like.
Now I need the same document converted to plain old ASCII.
Example:
I have seen this done (at least once) here, where the author has a PDF version and an ASCII version that matches the PDF version in almost every way, including margins, spacing and bullet points.
I realize this type of conversion cannot be exact due to limitations in the ASCII format, but a very close approximation does seem possible based on what I have found so far. What is the process for doing this?
Latex Solutions
Solution 1 - Latex
Opendetex is available both for Windows and Linux (compiles fine on a Mac as well). It can be downloaded from https://github.com/pkubowicz/opendetex
Usage:
> detex project
>
> opens project.tex, reads all files included using \include or
> \includeonly commands, outputs resulting text to standard output.
>
> detex -n project > out.txt
>
> opens project.tex, does not follow \include or \includeonly commands,
> outputs resulting text to out.txt
>
> detex --help
>
> shows full help
Extract it to any directory of your choice. Say you extracted it to your Downloads directory.
Create another directory inside it (this is optional but recommended). Let's say the directory name is “my_paper”. Put your paper in the “my_paper” directory. Assume your paper is named project.tex.
Navigate to the path
cd ~/Downloads/opendetex
Run the command
detex my_paper/project.tex > out.txt
The general form, here with -n so that \include files are not followed, is:
detex -n full_path_to_tex_file.tex > output_text_file.txt
Solution 2 - Latex
CatDVI can convert DVI to text and attempts to preserve the formatting.
Solution 3 - Latex
You can try some of the programs proposed here:
Solution 4 - Latex
You can also try Pandoc, which can convert LaTeX to many other formats. I suggest reading its documentation, since some tricky cases need extra arguments to handle.
Solution 5 - Latex
pdftotext can preserve layout
If you are using pdflatex, you probably don't want to mess around with your package options to switch to latex to generate a DVI.
Instead, take your PDF file and convert that. This worked for my CV/resume made with the Curve package:
pdftotext -layout MyResume.pdf
Note the -layout flag.
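If the result has to be strictly ASCII, the call above can be wrapped as in the sketch below. The function name pdf_to_ascii and the PDFTOTEXT override variable are made up for illustration. The -enc ASCII7 option is, as far as I know, one of pdftotext's built-in encoding names (pdftotext -listenc shows what your build supports), so treat it as an assumption to verify.

```shell
# Hypothetical helper: convert a PDF to layout-preserving text.
# Usage: pdf_to_ascii MyResume.pdf [MyResume.txt]
pdf_to_ascii() {
    pdf=$1
    # Default output name: same as the input, with .pdf replaced by .txt
    txt=${2:-"${pdf%.pdf}.txt"}
    # PDFTOTEXT is an illustrative override for the binary name or path.
    # -enc ASCII7 is an assumption: check `pdftotext -listenc` on your system.
    "${PDFTOTEXT:-pdftotext}" -layout -enc ASCII7 "$pdf" "$txt"
}
```

Calling pdf_to_ascii MyResume.pdf would write MyResume.txt next to the PDF. Drop -enc ASCII7 if your pdftotext rejects it.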
Solution 6 - Latex
Another option is to use htlatex to create a web page from the LaTeX sources, then use links to convert to plain text. I used the command line
links -dump -no-numbering -no-references input.html > output.txt
in the past, which gave a rather nice result. Of course, the output matches the rendered HTML rather than the original PDF, so it may not be exactly what you want.
Solution 7 - Latex
My usual strategy is to use hyperlatex (http://hyperlatex.sourceforge.net/) to turn it into a web page, and then copy and paste from a web browser. I find that this gives the best formatting.
I usually then have to go through and manually fix some line-wrapping...
Solution 8 - Latex
Try the steps here: http://zanedp.livejournal.com/201222.html
Here is a sequence that converts my LaTeX file to plain text:
$ latex file.tex
$ catdvi -e 1 -U file.dvi | sed -re "s/\[U\+2022\]/*/g" | sed -re "s/([^^[:space:]])\s+/\1 /g" > file.txt
The -e 1 option to catdvi tells it to output ASCII. If you use 0 instead of 1, it will output Unicode. Unicode will include all the special characters like bullets, emdashes, and Greek letters. It also includes ligatures for some letter combinations like "fi" and "fl." You may not like that. So, use -e 1 instead. Use the -U option to tell it to print out the Unicode value for unknown characters so that you can easily find and replace them.
The second part of the command finds the string [U+2022] which is used to designate bullet characters (•) and replaces them with an asterisk (*).
The third part eats up all the extra whitespace catdvi threw in to make the text full-justified while preserving spaces at the start of lines (indentation).
After running these commands, you would be wise to search the .txt file for the string [U+ to make sure no Unicode characters that can't be mapped to ASCII were left behind and fix them.
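That final check can be done with grep rather than by eye. A minimal sketch, assuming the [U+XXXX] marker format described above; the function name list_unmapped is hypothetical:

```shell
# Hypothetical helper: print each distinct [U+XXXX] marker left in a file.
# Prints nothing if every character was mapped to ASCII.
list_unmapped() {
    # -o prints only the matching part of each line; sort -u removes repeats
    grep -o '\[U+[0-9A-Fa-f]*]' "$1" | sort -u
}
```

Running list_unmapped file.txt gives a short list of code points that still need a manual replacement, which can then be handled with additional sed expressions like the bullet one above.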
Solution 9 - Latex
When I needed to get the plain text from my TeX file for indexing and searching, I found LaTeX2RTF to be a good solution. It has an installer and GUI for Windows, and it produced an RTF file of my 50-page thesis that I could open in Word.
Solution 10 - Latex
The solution that works best for me is the following. Assuming you have the LaTeX document name (without extension) stored in ${BASENAME}, you apply these 3 steps:
htlatex ${BASENAME}.tex
iconv -f iso-8859-1 -t utf-8 ${BASENAME}.html > ${BASENAME}-utf8.html
html2markdown ${BASENAME}-utf8.html > ${BASENAME}.txt
Apparently, you need to have tex4ht and python-html2text installed.
Solution 11 - Latex
I've tried LyX and it works pretty well. The only catch is that if your TeX file includes other TeX files, you will need to export them all separately, unless I'm missing something.
Solution 12 - Latex
Pandoc lets you convert files from one format to another. Use the following pandoc command:
pandoc -s /path/to/foobar.tex -o foobar.txt
If you want your lines to break at a certain column, use the --columns flag. Use --columns 10000 for effectively non-breaking lines.
You can change -o foobar.txt to target a number of other formats, such as Markdown (.md). If you don't specify -o foobar.txt, pandoc prints HTML to standard output, which you can render in any online tool.
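One caveat with the command above: as far as I know, pandoc treats a .txt output extension as Markdown, so the result may still contain Markdown markup. The sketch below requests the plain-text writer explicitly with -t plain and turns off wrapping with --wrap=none (present in reasonably recent pandoc versions). The function name tex_to_plain and the PANDOC override variable are hypothetical.

```shell
# Hypothetical helper: convert a LaTeX file to plain text with pandoc.
# Usage: tex_to_plain foobar.tex [foobar.txt]
tex_to_plain() {
    src=$1
    # Default output name: same as the input, with .tex replaced by .txt
    out=${2:-"${src%.tex}.txt"}
    # PANDOC is an illustrative override for the binary name or path.
    # -t plain selects the plain-text writer; --wrap=none keeps each
    # paragraph on a single line instead of breaking at a column.
    "${PANDOC:-pandoc}" -s "$src" -f latex -t plain --wrap=none -o "$out"
}
```

To wrap at a fixed width instead, replace --wrap=none with --columns followed by the width you want.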
To install pandoc, follow the official documentation.
Solution 13 - Latex
You can import the file into LyX and use LyX's export-to-text feature.
This is kind of silly if you don't use LyX, but if you already have it, it is a very quick and easy solution. It gave a good result for me, although to be fair my files are pretty simple. I'm not sure how more elaborate files get converted.
Solution 14 - Latex
Emacs has the commands iso-iso2tex and iso-tex2iso, which work very well, except that they don't convert single commands like \OE to Œ.