How do I preserve line breaks when using jsoup to convert html to plain text?

Java, Jsoup

Java Problem Overview


I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     // Strips all tags and returns the text of the document
     public String noTags(String str) {
         return Jsoup.parse(str).text();
     }

     public static void main(String[] args) {
         String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
                 "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println(text.noTags(strings));
     }
 }

And I have the result:

hello world yo googlez

But I want to break the line:

hello world
yo googlez

I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.

If there's a <br> in the markup I parse, how can I get a line break in my resulting output?

Java Solutions


Solution 1 - Java

The real solution that preserves linebreaks should be like this:

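import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // newer jsoup releases provide this class as Safelist

public static String br2nl(String html) {
    if (html == null)
        return html;
    Document document = Jsoup.parse(html);
    // disable pretty-printing so that html() preserves line breaks and spacing
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));
    document.select("br").append("\\n");     // mark each <br> with a literal backslash-n placeholder
    document.select("p").prepend("\\n\\n");  // mark the start of each <p> with two placeholders
    String s = document.html().replaceAll("\\\\n", "\n"); // turn the placeholders into real newlines
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}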

It satisfies the following requirements:

  1. if the original HTML contains a newline (\n), it is preserved
  2. if the original HTML contains br or p tags, they get translated to newlines (\n).
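
The double escaping in br2nl is easy to misread, so here is a minimal sketch of just the placeholder step, using only the JDK (no jsoup). The class name PlaceholderEscaping and the sample strings are made up for illustration. In Java source, "\\n" is the two characters backslash and n, and the regex "\\\\n" matches exactly that pair so it can be replaced with a real line feed:

```java
final class PlaceholderEscaping {
    /** Replaces every literal backslash-n pair with a real line feed. */
    static String expandPlaceholders(String s) {
        // The regex \\n (written "\\\\n" in Java source) matches a backslash followed by 'n'.
        return s.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        // Contains a backslash and an 'n', not a line feed.
        String marked = "hello world\\nyo googlez";
        System.out.println(expandPlaceholders(marked));
    }
}
```

A string that already contains real line feeds passes through unchanged, which is why the original newlines in the HTML are preserved.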

Solution 2 - Java

With

Jsoup.parse("A\nB").text();

you get the output

"A B"

and not

A

B

For this I'm using:

descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
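
To see what the <br> pattern in this answer actually matches, the pre-processing step can be tried on its own with the JDK regex classes. This is only a sketch: the class and method names (BrPlaceholder, markBreaks) are hypothetical, and the jsoup parse and the final replaceAll are left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BrPlaceholder {
    // Same pattern as in the answer: case-insensitive, with any attributes or a trailing slash.
    private static final Pattern BR = Pattern.compile("(?i)<br[^>]*>");

    /** Replaces every br tag in html with the given placeholder text. */
    static String markBreaks(String html, String placeholder) {
        // quoteReplacement makes sure '$' and '\' in the placeholder are taken literally.
        return BR.matcher(html).replaceAll(Matcher.quoteReplacement(placeholder));
    }

    public static void main(String[] args) {
        String html = "one<br>two<BR/>three<br class=\"x\" />four";
        System.out.println(markBreaks(html, "br2n"));
    }
}
```

All three spellings of the tag in the sample are replaced, so the placeholder ends up wherever a line break should appear after jsoup has stripped the remaining markup.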

Solution 3 - Java

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

Solution 4 - Java

As of jsoup v1.11.2, we can use Element.wholeText().

String cleanString = Jsoup.parse(htmlString).wholeText();

user121196's answer still works, but wholeText() also preserves the newlines and spacing of the original text.

Solution 5 - Java

Try this by using jsoup:

public static String cleanPreserveLineBreaks(String bodyHtml) {

    // get pretty printed html with preserved br and p tags
    String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
    // get plain text with preserved line breaks by disabled prettyPrint
    return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}

Solution 6 - Java

For more complex HTML none of the above solutions worked quite right; I was able to successfully do the conversion while preserving line breaks with:

Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);

(version 1.10.3)

Solution 7 - Java

You can traverse a given element with a NodeVisitor:

public String convertNodeToText(Element element)
{
	final StringBuilder buffer = new StringBuilder();
	
	new NodeTraversor(new NodeVisitor() {
		boolean isNewline = true;
		
		@Override
		public void head(Node node, int depth) {
			if (node instanceof TextNode) {
				TextNode textNode = (TextNode) node;
				String text = textNode.text().replace('\u00A0', ' ').trim();                    
				if(!text.isEmpty())
				{                        
					buffer.append(text);
					isNewline = false;
				}
			} else if (node instanceof Element) {
				Element element = (Element) node;
				if (!isNewline)
				{
					if((element.isBlock() || element.tagName().equals("br")))
					{
						buffer.append("\n");
						isNewline = true;
					}
				}
			}                
		}
		
		@Override
		public void tail(Node node, int depth) {                
		}                        
	}).traverse(element);        
	
	return buffer.toString();               
}

And for your code:

String result = convertNodeToText(Jsoup.parse(html));
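
One detail of the visitor is worth isolating: it replaces '\u00A0' before calling trim(). String.trim() only strips characters up to U+0020, so a non-breaking space (U+00A0, which is what &nbsp; decodes to) would otherwise survive at the edges of a text node. A small JDK-only sketch, with a made-up class name NbspTrim:

```java
final class NbspTrim {
    /** Mirrors the visitor's normalisation: non-breaking spaces become spaces, then trim. */
    static String normalise(String text) {
        return text.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        String raw = "\u00A0hello\u00A0";
        // trim() alone leaves both non-breaking spaces in place
        System.out.println(raw.trim().length());
        // after the replacement only "hello" remains
        System.out.println(normalise(raw).length());
    }
}
```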

Solution 8 - Java

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = text.replaceAll("br2n", "\n");

works only if the HTML itself doesn't contain "br2n".

So,

text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();

is more reliable and simpler.

Solution 9 - Java

Based on the other answers and the comments on this question it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain text representation of an HTML document. I know I was.

Fortunately jsoup already provides a pretty comprehensive example of how to achieve this: [HtmlToPlainText.java][1]

The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.

To avoid link rot, here is Jonathan Hedley's solution in full:

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

import java.io.IOException;

/**
 * HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
 * plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
 * scrape.
 * <p>
 * Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
 * </p>
 * <p>
 * To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
 * <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
 * where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
 * 
 * @author Jonathan Hedley
 */
public class HtmlToPlainText {
    private static final String userAgent = "Mozilla/5.0 (jsoup)";
    private static final int timeout = 5 * 1000;

    public static void main(String... args) throws IOException {
        Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
        final String url = args[0];
        final String selector = args.length == 2 ? args[1] : null;

        // fetch the specified URL and parse to a HTML DOM
        Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();

        HtmlToPlainText formatter = new HtmlToPlainText();

        if (selector != null) {
            Elements elements = doc.select(selector); // get each element that matches the CSS selector
            for (Element element : elements) {
                String plainText = formatter.getPlainText(element); // format that element to plain text
                System.out.println(plainText);
            }
        } else { // format the whole doc
            String plainText = formatter.getPlainText(doc);
            System.out.println(plainText);
        }
    }

    /**
     * Format an Element to plain-text
     * @param element the root element to format
     * @return formatted text
     */
    public String getPlainText(Element element) {
        FormattingVisitor formatter = new FormattingVisitor();
        NodeTraversor traversor = new NodeTraversor(formatter);
        traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node

        return formatter.toString();
    }

    // the formatting rules, implemented in a breadth-first DOM traverse
    private class FormattingVisitor implements NodeVisitor {
        private static final int maxWidth = 80;
        private int width = 0;
        private StringBuilder accum = new StringBuilder(); // holds the accumulated text

        // hit when the node is first seen
        public void head(Node node, int depth) {
            String name = node.nodeName();
            if (node instanceof TextNode)
                append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
            else if (name.equals("li"))
                append("\n * ");
            else if (name.equals("dt"))
                append("  ");
            else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
                append("\n");
        }

        // hit when all of the node's children (if any) have been visited
        public void tail(Node node, int depth) {
            String name = node.nodeName();
            if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
                append("\n");
            else if (name.equals("a"))
                append(String.format(" <%s>", node.absUrl("href")));
        }

        // appends text to the string builder with a simple word wrap method
        private void append(String text) {
            if (text.startsWith("\n"))
                width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
            if (text.equals(" ") &&
                    (accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
                return; // don't accumulate long runs of empty spaces

            if (text.length() + width > maxWidth) { // won't fit, needs to wrap
                String words[] = text.split("\\s+");
                for (int i = 0; i < words.length; i++) {
                    String word = words[i];
                    boolean last = i == words.length - 1;
                    if (!last) // insert a space if not the last word
                        word = word + " ";
                    if (word.length() + width > maxWidth) { // wrap and reset counter
                        accum.append("\n").append(word);
                        width = word.length();
                    } else {
                        accum.append(word);
                        width += word.length();
                    }
                }
            } else { // fits as is, without need to wrap text
                accum.append(text);
                width += text.length();
            }
        }

        @Override
        public String toString() {
            return accum.toString();
        }
    }
}

[1]: https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java "HtmlToPlainText.java"

Solution 10 - Java

Try this:

public String noTags(String str){
    Document d = Jsoup.parse(str);
    TextNode tn = new TextNode(d.body().html(), "");
    return tn.getWholeText();
}

Solution 11 - Java

Use textNodes() to get a list of the text nodes, then concatenate them with a separator that ends in \n. Here's some Scala code I use for this; a Java port should be easy:

val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
    .asScala.mkString("<br />\n")

Solution 12 - Java

This is my version of translating HTML to text (actually a modified version of user121196's answer).

It doesn't just preserve line breaks: it also tidies the text by removing excessive line breaks and unescaping HTML entities, which gives a much better result (in my case the HTML comes from mail).

It's originally written in Scala, but you can port it to Java easily:

def html2text(rawHtml: String): String = {
  val htmlDoc = Jsoup.parseBodyFragment(rawHtml, "/")
  htmlDoc.select("br").append("\\nl")
  htmlDoc.select("div").prepend("\\nl").append("\\nl")
  htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")

  org.jsoup.parser.Parser.unescapeEntities(
    Jsoup.clean(
      htmlDoc.html(),
      "",
      Whitelist.none(),
      new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
    ),
    false
  ).replaceAll("\\\\nl", "\n")
    .replaceAll("\r", "")
    .replaceAll("\n\\s+\n", "\n")
    .replaceAll("\n\n+", "\n\n")
    .trim()
}
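
As a starting point for the Java port, the clean-up chain at the end of html2text needs no jsoup at all and can be carried over as is. The sketch below covers only that part; the class and method names (TextCleanup, tidy) are invented for the example, and it assumes the placeholders have already been turned into line feeds.

```java
final class TextCleanup {
    /** Java port of the replaceAll chain that ends html2text. */
    static String tidy(String text) {
        return text
                .replaceAll("\r", "")          // drop carriage returns
                .replaceAll("\n\\s+\n", "\n")  // line feed + whitespace run + line feed becomes one line feed
                .replaceAll("\n\n+", "\n\n")   // cap runs of consecutive line feeds at two
                .trim();                       // strip leading and trailing whitespace
    }

    public static void main(String[] args) {
        System.out.println(tidy("Hello\r\n\r\nWorld\n   \nBye  "));
    }
}
```

One thing to be aware of when adapting it: because \s also matches line feeds, the second pattern matches any run of three or more consecutive line feeds and collapses it to a single one, while exactly two line feeds are left alone as a paragraph break.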

Solution 13 - Java

Try this by using jsoup:

doc.outputSettings(new OutputSettings().prettyPrint(false));

// insert a literal \n placeholder after every <br> tag
doc.select("br").after("\\n");

// insert a literal \n placeholder before every <p> tag
doc.select("p").before("\\n");

// get the HTML from the document, turning the placeholders into real newlines
String str = doc.html().replaceAll("\\\\n", "\n");

Solution 14 - Java

/**
 * Recursive method to replace HTML br tags with Java \n. The recursion ensures that the
 * placeholder can never already exist in the text being replaced.
 * @param html the HTML to convert
 * @param linebreakerString temporary placeholder used for line breaks
 * @return the text of the HTML with Java newlines instead of br tags
 */
public static String replaceBrWithNewLine(String html, String linebreakerString) {
    if (html.contains(linebreakerString)) {
        // the placeholder collides with the content, so try a longer one
        return replaceBrWithNewLine(html, linebreakerString + "1");
    }
    // quoteReplacement keeps '$' and '\' in the placeholder from being read as group references
    String marked = html.replaceAll("(?i)<br[^>]*>",
            java.util.regex.Matcher.quoteReplacement(linebreakerString));
    // String.replace is literal, so regex metacharacters in the placeholder are harmless
    return Jsoup.parse(marked).text().replace(linebreakerString, "\n");
}

Call it with the HTML in question (containing the br tags) and whatever string you wish to use as the temporary newline placeholder. For example:

replaceBrWithNewLine(element.html(), "br2n")

The recursion ensures that the string you use as the line-break placeholder is never actually present in the source HTML, as it keeps appending a "1" until the placeholder is no longer found in the HTML. It won't have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.
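
The placeholder search can also be written as a loop, which may be easier to follow than the recursion. This is a sketch of that one step only, with hypothetical names (UniquePlaceholder, pick); the result would then be used for the <br> replacement exactly as in the method above.

```java
final class UniquePlaceholder {
    /** Appends "1" to base until the candidate no longer occurs in html. */
    static String pick(String html, String base) {
        String candidate = base;
        while (html.contains(candidate)) {
            candidate += "1";
        }
        return candidate;
    }

    public static void main(String[] args) {
        // "br2n" and "br2n1" both occur in the input, so a longer candidate is chosen
        System.out.println(pick("text with br2n and br2n1 in it", "br2n"));
    }
}
```

The loop always terminates, because a candidate longer than the HTML itself cannot be contained in it.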

Solution 15 - Java

Based on user121196's and Green Beret's answers with the selects and <pre>s, the only solution that works for me is:

org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Billy | View Question on Stack Overflow
Solution 1 - Java | user121196 | View Answer on Stack Overflow
Solution 2 - Java | Mirco Attocchi | View Answer on Stack Overflow
Solution 3 - Java | Paulius Z | View Answer on Stack Overflow
Solution 4 - Java | zeenosaur | View Answer on Stack Overflow
Solution 5 - Java | mkowa | View Answer on Stack Overflow
Solution 6 - Java | Andy Res | View Answer on Stack Overflow
Solution 7 - Java | popcorny | View Answer on Stack Overflow
Solution 8 - Java | Green Beret | View Answer on Stack Overflow
Solution 9 - Java | Malcolm Smith | View Answer on Stack Overflow
Solution 10 - Java | manji | View Answer on Stack Overflow
Solution 11 - Java | Michael Bar-Sinai | View Answer on Stack Overflow
Solution 12 - Java | abdolence | View Answer on Stack Overflow
Solution 13 - Java | Abhay Gupta | View Answer on Stack Overflow
Solution 14 - Java | Chris6647 | View Answer on Stack Overflow
Solution 15 - Java | Bevor | View Answer on Stack Overflow