Tuesday, November 15, 2011

Truncating HTML in Java

I'm currently working on a web application written in Java. Part of this application generates summaries of long HTML texts by truncating them to a fixed number of words. In Python, I usually use truncate_html_words from Django's template engine, so I looked for a similarly easy method in Java.

I did a couple of searches on Google but couldn't find anything quick and easy. So I went back and looked at Django's code, and fortunately it was very straightforward. The adapted code is below:
/**
 * Copyright (c) Django Software Foundation and individual contributors.
 * All rights reserved.
 * 
 * Copyright (c) 2011 Masood Behabadi <masood@dentcat.com>
 *
 * Redistribution and use in source and binary forms, with or without modification,
 * are permitted provided that the following conditions are met:
 *
 *    1. Redistributions of source code must retain the above copyright notice, 
 *       this list of conditions and the following disclaimer.
 *    
 *    2. Redistributions in binary form must reproduce the above copyright 
 *       notice, this list of conditions and the following disclaimer in the
 *       documentation and/or other materials provided with the distribution.
 *
 *    3. Neither the name of Django nor the names of its contributors may be used
 *       to endorse or promote products derived from this software without
 *       specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
 * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
 * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
 * ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */ 

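// Imports needed by the method (place them at the top of the enclosing class file)
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String truncateHtmlWords(String html, int length){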
 if (length <= 0)
  return "";
 
 List<String> html4Singlets = Arrays.asList(
  "br", "col", "link", "base", "img",
  "param", "area", "hr", "input");
 // Set up regular expressions
 Pattern pWords = Pattern.compile("&.*?;|<.*?>|(\\w[\\w-]*)");
 Pattern pTag = Pattern.compile("<(/)?([^ ]+?)(?: (/)| .*?)?>");
 Matcher mWords = pWords.matcher(html);
 // Count non-HTML words and keep note of open tags
 int endTextPos = 0;
 int words = 0;
 List<String> openTags = new ArrayList<String>();
 while (words <= length) {
  if (!mWords.find())
   break;
  if (mWords.group(1) != null) {
   // It's an actual non-HTML word
   words += 1;
   if (words == length)
    endTextPos = mWords.end();     
   continue;
  }
  // Check for tag
  Matcher tag = pTag.matcher(mWords.group());
  if (!tag.find() || endTextPos != 0)
   // Don't worry about non tags or tags after our
   // truncate point
   continue;
  String closingTag  = tag.group(1);
  // Element names are always case-insensitive
  String tagName     = tag.group(2).toLowerCase();
  String selfClosing = tag.group(3);
  if (closingTag != null) {
   int i = openTags.indexOf(tagName);
   if (i != -1)
    // Copy the remainder so we don't build up a chain of subList views
    openTags = new ArrayList<String>(openTags.subList(i + 1, openTags.size()));
  }
  else if (selfClosing == null && !html4Singlets.contains(tagName))
   openTags.add(0, tagName);
 }
 
 if (words <= length)
  return html;
 StringBuilder out = new StringBuilder(html.substring(0, endTextPos));
 for (String tag: openTags)
  // Close any tags still open
  out.append("</").append(tag).append(">");
 
 return out.toString();
}
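
To see how the pWords pattern separates real words from markup, it can help to run it on its own. The following is a minimal sketch and not part of the original port; the class name WordPatternDemo and the helper countWords are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPatternDemo {
    // Same pattern as pWords above: entities and tags match without a
    // capture group, plain words are captured in group 1.
    static final Pattern WORDS = Pattern.compile("&.*?;|<.*?>|(\\w[\\w-]*)");

    // Hypothetical helper: counts only the matches that are real words.
    static int countWords(String html) {
        Matcher m = WORDS.matcher(html);
        int words = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                words++;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Tags and the &amp; entity are skipped; "good-bye" counts once.
        System.out.println(countWords("<p>Hello &amp; <b>good-bye</b> world</p>"));
    }
}
```

Running it prints 3: the tags and the entity are matched but ignored, and the hyphenated "good-bye" is treated as a single word.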

Feel free to use the code under the Modified BSD License, but keep in mind that, unlike Django, it has not been thoroughly tested and may not function correctly in all cases.
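
One case worth checking before relying on the method is how the pTag pattern reads different tag spellings. The sketch below is only an illustration under my own assumptions; TagPatternDemo and describe are hypothetical names, and the sample tags are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPatternDemo {
    // Same pattern as pTag above.
    static final Pattern TAG = Pattern.compile("<(/)?([^ ]+?)(?: (/)| .*?)?>");

    // Hypothetical helper: summarises the three capture groups for one tag.
    static String describe(String tagText) {
        Matcher m = TAG.matcher(tagText);
        if (!m.find()) {
            return "no match";
        }
        return "closing=" + (m.group(1) != null)
             + " name=" + m.group(2).toLowerCase()
             + " selfClosing=" + (m.group(3) != null);
    }

    public static void main(String[] args) {
        String[] samples = {"<a href=\"x\">", "</P>", "<br />", "<br/>"};
        for (String s : samples) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```

The first three samples come out as expected (an opening a, a closing p, a self-closing br). The last one, written without a space before the slash, reports the name as br/ with selfClosing=false. That name is not in the singlet list, so the method would record it as an open tag.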

Links:
Django Project Website
Original Source Code
