Tuesday, November 15, 2011

Truncating HTML in Java

I'm currently working on a web application written in Java. Part of this application generates summaries of long HTML texts by truncating them to a fixed number of words. In Python, I usually use truncate_html_words from Django's template engine, so I looked for a similarly easy method in Java.

I did a couple of searches on Google but couldn't find anything quick and easy. So I went back and looked at Django's code, and fortunately it was very straightforward. The adapted code is below:
/**
 * Copyright (c) Django Software Foundation and individual contributors.
 * All rights reserved.
 * 
 * Copyright (c) 2011 Masood Behabadi <masood@dentcat.com>
 *
 * Redistribution and use in source and binary forms, with or without modification,
 * are permitted provided that the following conditions are met:
 *
 *    1. Redistributions of source code must retain the above copyright notice, 
 *       this list of conditions and the following disclaimer.
 *    
 *    2. Redistributions in binary form must reproduce the above copyright 
 *       notice, this list of conditions and the following disclaimer in the
 *       documentation and/or other materials provided with the distribution.
 *
 *    3. Neither the name of Django nor the names of its contributors may be used
 *       to endorse or promote products derived from this software without
 *       specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
 * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
 * ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */ 

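import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Place this method inside a utility class of your choice.
public static String truncateHtmlWords(String html, int length){
 if (length <= 0)
  return "";
 
 // HTML 4 elements that never take a closing tag
 List<String> html4Singlets = Arrays.asList(
  "br", "col", "link", "base", "img",
  "param", "area", "hr", "input");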
 // Set up regular expressions
 Pattern pWords = Pattern.compile("&.*?;|<.*?>|(\\w[\\w-]*)");
 Pattern pTag = Pattern.compile("<(/)?([^ ]+?)(?: (/)| .*?)?>");
 Matcher mWords = pWords.matcher(html);
 // Count non-HTML words and keep note of open tags
 int endTextPos = 0;
 int words = 0;
 List<String> openTags = new ArrayList<String>();
 while (words <= length) {
  if (!mWords.find())
   break;
  if (mWords.group(1) != null) {
   // It's an actual non-HTML word
   words += 1;
   if (words == length)
    endTextPos = mWords.end();     
   continue;
  }
  // Check for tag
  Matcher tag = pTag.matcher(mWords.group());
  // lookingAt() anchors at the start, like re.match in the Django original
  if (!tag.lookingAt() || endTextPos != 0)
   // Don't worry about non tags or tags after our
   // truncate point
   continue;
  String closingTag  = tag.group(1);
  // Element names are always case-insensitive
  String tagName     = tag.group(2).toLowerCase();
  String selfClosing = tag.group(3);
  if (closingTag != null) {
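   // An end tag closes everything opened since its matching start tag
   int i = openTags.indexOf(tagName);
   if (i != -1)
    // Copy the remainder instead of keeping a view of the old list
    openTags = new ArrayList<String>(openTags.subList(i + 1, openTags.size()));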
  }
  else if (selfClosing == null && !html4Singlets.contains(tagName))
   openTags.add(0, tagName);
 }
 
 if (words <= length)
  return html;
 StringBuilder out = new StringBuilder(html.substring(0, endTextPos));
 // Close any tags still open, innermost first
 for (String tag: openTags)
  out.append("</").append(tag).append(">");
 
 return out.toString();
}
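
To see how the word-matching pattern tells text from markup, here is a small, self-contained sketch. It reuses the same regular expression as pWords above; the class WordPatternDemo and the helper words() are hypothetical names made up for this illustration and are not part of Django or of the method above. Entities and tags match the pattern without filling the capture group, so only real words end up in group 1 and get counted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordPatternDemo {
    // Same pattern as pWords: entities (&...;) and tags (<...>) match
    // without a capture, while plain words land in group 1.
    static final Pattern WORDS = Pattern.compile("&.*?;|<.*?>|(\\w[\\w-]*)");

    // Hypothetical helper: returns only the tokens that count as words.
    static List<String> words(String html) {
        List<String> result = new ArrayList<String>();
        Matcher m = WORDS.matcher(html);
        while (m.find()) {
            // group(1) is null when the match was a tag or an entity
            if (m.group(1) != null) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [Hello, big, world, co-op]
        System.out.println(words("<p>Hello <b>big</b> world &amp; co-op</p>"));
    }
}
```

Note that a hyphenated word such as "co-op" counts as a single word, and that \w only matches ASCII letters, digits and underscore in Java unless the pattern is compiled with a Unicode flag.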

Feel free to use the code under the Modified BSD License, but keep in mind that, unlike Django's version, it has not been thoroughly tested and may not function correctly in all cases.

Links:
Django Project Website
Original Source Code
