Les Hazlewood

Where Les is More…

Java Email Address class

Filed under: Software, Java, General — Les at 12:04 pm on Monday, November 6, 2006

UPDATE: This Java class was updated on February 1st, 2008 to account for domain literals and quoted strings such as “John Smith” <john.smith@somewhere.com>. It is now effectively the only complete and semantically correct email validator for Java.

PETTY REQUEST: The update required considerably more effort than the original, as it now accounts for all valid RFC parsing conditions. Because of this, and because this page is easily my most visited, I’d appreciate it if you could show your appreciation by hooking a brother up and clicking on some ads. It helps pay for my hosting. Thanks!

I’m writing a new open source CMS (Community Management System) based entirely on the Spring Framework and Hibernate.

This CMS will take many cues from Drupal and other PHP-based CMSs, but it will be far more flexible, have fantastic OO and design-pattern architecture, and benefit the Java Enterprise software community. I’ve decided to abandon JSR-168 support - I want a much cleaner, easier to implement, OO-based and typesafe “plugin” support framework - which I’m working on now. Suffice it to say I’m not a fan of JSR-168, but that’s a whole ‘nuther post.

Anyway, in this CMS I’m using time-honored OO classes that I’ve used on many, many projects. One of them is the EmailAddress class that I’ve referenced in earlier posts on this blog for email address validation. I’ve gotten some good feedback on this class, so I thought I’d just post the whole thing in case anyone wants to benefit from it (instead of just using the code chunks I’ve posted before).

Here it is:

/*
 * Copyright 2008 Les Hazlewood
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
import java.io.Serializable;
import java.util.regex.Pattern;

/**
 * An email address represents the textual string of an
 * <a href="http://www.ietf.org/rfc/rfc2822.txt">RFC 2822</a> email address and other corresponding
 * information of interest.
 *
 * <p>If you use this code, please keep the author information intact and reference
 * my site at <a href="http://www.leshazlewood.com">leshazlewood.com</a>. Thanks!
 *
 * @author Les Hazlewood
 */
public class EmailAddress implements Serializable {

    /**
     * This constant states that domain literals are allowed in the email address, e.g.:
     *
     * <p><tt>someone@[192.168.1.100]</tt> or <br/>
     * <tt>john.doe@[23:33:A2:22:16:1F]</tt> or <br/>
     * <tt>me@[my computer]</tt></p>
     *
     * <p>The RFC says these are valid email addresses, but most people don't like allowing them.
     * If you don't want to allow them, and only want to allow valid domain names
     * (<a href="http://www.ietf.org/rfc/rfc1035.txt">RFC 1035</a>, x.y.z.com, etc),
     * change this constant to <tt>false</tt>.
     *
     * <p>Its default value is <tt>true</tt> to remain RFC 2822 compliant, but
     * you should set it depending on what you need for your application.
     */
    private static final boolean ALLOW_DOMAIN_LITERALS = true;

    /**
     * This constant states that quoted identifiers
     * (using quotes and angle brackets around the raw address) are allowed, e.g.:
     *
     * <p><tt>"John Smith" &lt;john.smith@somewhere.com&gt;</tt>
     *
     * <p>The RFC says this is a valid mailbox. If you don't want to
     * allow this, because for example, you only want users to enter in
     * a raw address (<tt>john.smith@somewhere.com</tt> - no quotes or angle
     * brackets), then change this constant to <tt>false</tt>.
     *
     * <p>Its default value is <tt>true</tt> to remain RFC 2822 compliant, but
     * you should set it depending on what you need for your application.
     */
    private static final boolean ALLOW_QUOTED_IDENTIFIERS = true;

    // RFC 2822 2.2.2 Structured Header Field Bodies
    private static final String wsp = "[ \\t]"; //space or tab
    private static final String fwsp = wsp + "*";

    //RFC 2822 3.2.1 Primitive tokens
    private static final String dquote = "\\\"";
    //ASCII Control characters excluding white space:
    private static final String noWsCtl = "\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F";
    //all ASCII characters except CR and LF:
    private static final String asciiText = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";

    // RFC 2822 3.2.2 Quoted characters:
    //single backslash followed by a text char
    private static final String quotedPair = "(\\\\" + asciiText + ")";

    //RFC 2822 3.2.4 Atom:
    private static final String atext = "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    private static final String atom = fwsp + atext + "+" + fwsp;
    private static final String dotAtomText = atext + "+" + "(" + "\\." + atext + "+)*";
    private static final String dotAtom = fwsp + "(" + dotAtomText + ")" + fwsp;

    //RFC 2822 3.2.5 Quoted strings:
    //noWsCtl and the rest of ASCII except the doublequote and backslash characters:
    private static final String qtext = "[" + noWsCtl + "\\x21\\x23-\\x5B\\x5D-\\x7E]";
    private static final String qcontent = "(" + qtext + "|" + quotedPair + ")";
    private static final String quotedString = dquote + "(" + fwsp + qcontent + ")*" + fwsp + dquote;

    //RFC 2822 3.2.6 Miscellaneous tokens
    private static final String word = "((" + atom + ")|(" + quotedString + "))";
    private static final String phrase = word + "+"; //one or more words.

    //RFC 1035 tokens for domain names:
    private static final String letter = "[a-zA-Z]";
    private static final String letDig = "[a-zA-Z0-9]";
    private static final String letDigHyp = "[a-zA-Z0-9-]";
    private static final String rfcLabel = letDig + "(" + letDigHyp + "{0,61}" + letDig + ")?";
    private static final String rfc1035DomainName = rfcLabel + "(\\." + rfcLabel + ")*\\." + letter + "{2,6}";

    //RFC 2822 3.4 Address specification
    //domain text - non white space controls and the rest of ASCII chars not including [, ], or \:
    private static final String dtext = "[" + noWsCtl + "\\x21-\\x5A\\x5E-\\x7E]";
    private static final String dcontent = dtext + "|" + quotedPair;
    private static final String domainLiteral = "\\[" + "(" + fwsp + dcontent + "+)*" + fwsp + "\\]";
    private static final String rfc2822Domain = "(" + dotAtom + "|" + domainLiteral + ")";

    private static final String domain = ALLOW_DOMAIN_LITERALS ? rfc2822Domain : rfc1035DomainName;

    private static final String localPart = "((" + dotAtom + ")|(" + quotedString + "))";
    private static final String addrSpec = localPart + "@" + domain;
    private static final String angleAddr = "<" + addrSpec + ">";
    private static final String nameAddr = "(" + phrase + ")?" + fwsp + angleAddr;
    private static final String mailbox = nameAddr + "|" + addrSpec;

    //now compile a pattern for efficient re-use:
    //if we're allowing quoted identifiers or not:
    private static final String patternString = ALLOW_QUOTED_IDENTIFIERS ? mailbox : addrSpec;
    public static final Pattern VALID_PATTERN = Pattern.compile(patternString);

    //class attributes
    private String text;
    private boolean bouncing = false; //matches the default documented on isBouncing()
    private boolean verified = false;
    private String label;

    public EmailAddress() {
        super();
    }

    public EmailAddress(String text) {
        super();
        setText(text);
    }

    /**
     * Returns the actual email address string, e.g. <tt>someone@somewhere.com</tt>
     *
     * @return the actual email address string.
     */
    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    /**
     * Returns whether or not any emails sent to this email address come back as bounced
     * (undeliverable).
     *
     * <p>Default is <tt>false</tt> for convenience's sake - if a bounced message is ever received for this
     * address, this value should be set to <tt>true</tt> until verification can be made.
     *
     * @return whether or not any emails sent to this email address come back as bounced
     * (undeliverable).
     */
    public boolean isBouncing() {
        return bouncing;
    }

    public void setBouncing(boolean bouncing) {
        this.bouncing = bouncing;
    }

    /**
     * Returns whether or not the party associated with this email has verified that it is
     * their email address.
     *
     * <p>Verification is usually done by sending an email to this
     * address and waiting for the party to respond or click a specific link in the email.
     *
     * <p>Default is <tt>false</tt>.
     *
     * @return whether or not the party associated with this email has verified that it is
     * their email address.
     */
    public boolean isVerified() {
        return verified;
    }

    public void setVerified(boolean verified) {
        this.verified = verified;
    }

    /**
     * Party label associated with this address, for example, 'Home', 'Work', etc.
     *
     * @return a label associated with this address, for example 'Home', 'Work', etc.
     */
    public String getLabel() {
        return label;
    }

    public void setLabel(String label) {
        this.label = label;
    }

    /**
     * Returns whether or not the text represented by this object instance is valid
     * according to the <tt>RFC 2822</tt> rules.
     *
     * @return true if the text represented by this instance is valid according
     * to RFC 2822, false otherwise.
     */
    public boolean isValid() {
        return isValidText(getText());
    }

    /**
     * Utility method that checks to see if the specified string is a valid
     * email address according to the RFC 2822 specification.
     *
     * @param email the email address string to test for validity.
     * @return true if the given text is valid according to RFC 2822, false otherwise.
     */
    public static boolean isValidText(String email) {
        return (email != null) && VALID_PATTERN.matcher(email).matches();
    }

    public boolean equals(Object o) {
        if (o instanceof EmailAddress) {
            EmailAddress ea = (EmailAddress) o;
            //null-safe: the no-arg constructor leaves text unset
            return text == null ? ea.getText() == null : text.equals(ea.getText());
        }
        return false;
    }

    public int hashCode() {
        return text == null ? 0 : text.hashCode();
    }

    public String toString() {
        return getText();
    }

    public static void main(String[] args) {
        String addy = "\"John Smith\" <john.smith@u.washington.edu>";
        if (isValidText(addy)) {
            System.out.println("Valid email address.");
        } else {
            System.out.println("Invalid email address!");
        }
    }
}
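To make the token-composition idea easier to follow, here is a much smaller sketch built the same way: small regex fragments concatenated into a final pattern. The class name `MiniAddrSpec` and its method are hypothetical. It covers only an unquoted local part and a dot-atom domain, deliberately leaving out quoted strings, domain literals and display names, so treat it as an illustration rather than a replacement for the class above.

```java
import java.util.regex.Pattern;

// Hypothetical, reduced sketch: unquoted local part + dot-atom domain only.
class MiniAddrSpec {
    // Same character set as the atext token in the full class.
    private static final String ATEXT = "[a-zA-Z0-9!#$%&'*+\\-/=?^_`{|}~]";
    // One or more atext runs separated by single dots (no leading or trailing dot).
    private static final String DOT_ATOM_TEXT = ATEXT + "+(\\." + ATEXT + "+)*";
    private static final Pattern ADDR_SPEC =
            Pattern.compile(DOT_ATOM_TEXT + "@" + DOT_ATOM_TEXT);

    static boolean isValid(String s) {
        // matches() requires the whole string to fit the pattern.
        return s != null && ADDR_SPEC.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("john.smith@somewhere.com")); // true
        System.out.println(isValid("host.@domain.com"));         // false: dot must be followed by atext
    }
}
```

Because each fragment is an ordinary string, a token can be tightened or loosened in one place and every pattern built from it picks up the change.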

14 Comments »

Pingback by Les Hazlewood » Java Email Address Validation using Regular Expressions (the Right Way)

November 15, 2006 @ 6:40 pm

[…] For example, I don’t save or reference an email address as a String: strings as objects don’t tell me anything about the email address itself, like if it’s valid, if it’s bouncing, if it has been verified by the user with which it is associated, etc, etc. As such, I have created an EmailAddress class to represent this information. Doing this is a small example of the beauty of OO over functional programming. […]

Comment by Sebastian

November 23, 2006 @ 8:47 am

Hi Les,

I am doing a project for one of my courses at University, and I was thinking of using parts of your email class. While looking through your specification of the RFC2822 regular expression, I noticed a few parts missing. What happened to the “quoted-string” and “domain-literal” identifiers? Also, could you quickly explain what the “^” and “$” at the start and end of your final regular expression do?

Thanks for your help,

Sebastian

Comment by Les

November 24, 2006 @ 11:51 pm

Hi Sebastian,

To be honest, I only incorporated RFC 2822 for standard email text addresses (without quoted text). The domain-literals are represented by the RFC 1035 domain tokens in the class. Since you pointed this out, I’m now working on including the quoted-string tokens in the email address class’s final regular expression. Thanks for pointing that out!

Also, the ^ character, when not inside a character class, means “the stuff after me must start the string”. So in the final regexp in the class, it means “Match all strings where the beginning of the string is the localpart”. Similarly, the $ character means “the stuff before me must finish the string”. So for the final regexp, it means “match all strings where the end of the string is the domain.” Putting the two together in the same regexp means “the string must start with a localpart and must end with a domain”, with, of course, those two being separated by the @ character.

Cheers,

Les
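The anchoring behaviour Les describes can be sketched with a toy pattern. The class `AnchorDemo` and its deliberately simplistic address regex are hypothetical and only there to show the mechanics. Note that the class as currently posted calls `matcher(email).matches()`, which already requires the whole string to match, so it gets the same effect without writing ^ and $ explicitly.

```java
import java.util.regex.Pattern;

// Hypothetical demo of ^/$ anchors versus Matcher.matches().
class AnchorDemo {
    // Toy pattern, not an RFC-grade address pattern.
    private static final Pattern UNANCHORED = Pattern.compile("[a-z]+@[a-z]+\\.com");
    private static final Pattern ANCHORED = Pattern.compile("^[a-z]+@[a-z]+\\.com$");

    // find() succeeds if the pattern occurs anywhere in the input.
    static boolean findsUnanchored(String s) { return UNANCHORED.matcher(s).find(); }

    // With ^ and $, find() only succeeds when the pattern spans the input.
    static boolean findsAnchored(String s) { return ANCHORED.matcher(s).find(); }

    // matches() demands a full-string match even without anchors.
    static boolean matchesWhole(String s) { return UNANCHORED.matcher(s).matches(); }

    public static void main(String[] args) {
        String noisy = "!!bob@example.com!!";
        System.out.println(findsUnanchored(noisy)); // true
        System.out.println(findsAnchored(noisy));   // false
        System.out.println(matchesWhole(noisy));    // false
    }
}
```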

Comment by Sebastian

December 23, 2006 @ 7:52 am

Hi Les,

I noticed a bug in your class. Testing the email host.@domain.com will return as being valid. The reason for this I believe is due to your use of the raw symbols in the “sp” token definition. If you change the symbols to the ascii hex characters, it works better!

Regards,

Sebastian

Comment by Les

January 4, 2007 @ 11:04 am

@Sebastian

Thanks! I thought something was funny. Instead, I just escaped each of the characters in the ’sp’ constant in the file (using double backslashes). The blog entry has been updated with the change for future reference.

Cheers,

Les
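The original 'sp' constant isn't shown in this thread, so the following is only a guess at the kind of bug Sebastian ran into. Inside a character class, an unescaped hyphen between two characters forms a range. Assuming the symbols were listed in the same order as in today's atext (`+`, `-`, `/`), the raw form `+-/` would cover everything from `+` to `/`, which includes the dot. The class name `HyphenRangeDemo` is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo: an unescaped '-' inside [...] creates a range.
class HyphenRangeDemo {
    // Raw: '+' (0x2B) through '/' (0x2F), which also covers ',', '-' and '.'.
    private static final Pattern RAW = Pattern.compile("[+-/]");
    // Escaped: exactly the three characters '+', '-' and '/'.
    private static final Pattern ESCAPED = Pattern.compile("[+\\-/]");

    static boolean rawAccepts(char c) {
        return RAW.matcher(String.valueOf(c)).matches();
    }

    static boolean escapedAccepts(char c) {
        return ESCAPED.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        System.out.println(rawAccepts('.'));     // true: the dot sneaks in via the range
        System.out.println(escapedAccepts('.')); // false
    }
}
```

If a dot counts as an ordinary atext character, then `host.` is a valid run of atext and `host.@domain.com` passes, which would be consistent with what Sebastian reported.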

Comment by Mikec

March 1, 2007 @ 7:24 pm

Thanks for this!

Hooray for the lazy-web!

Comment by DanB

March 3, 2007 @ 5:25 pm

That’s a great email validator, thanks!

Comment by mc

January 2, 2008 @ 10:16 am

where can I find source for:
org.sakuracms.util.Entity ?

Regards

Comment by Les

January 2, 2008 @ 2:23 pm

@MC

I just posted it in a new blog post.

Regards,

Les

Comment by Matt

January 31, 2008 @ 3:29 pm

Hi Les,

According to the spec, shouldn’t the email address:
“blah”@blah.com
be allowed?

Thanks,

Matt

Comment by Les

February 1, 2008 @ 5:23 pm

Hi Matt,

The latest update to this blog adds support for quoted strings and domain literals, properly validating “blah”@blah.com as valid.

Cheers,

Les

Comment by Jake

March 13, 2008 @ 6:18 pm

Very nice code, cheers.

One thing I noticed though is that using the code supplied, the sample string keeps coming back invalid for me:

String addy = "\"John Smith\" <john.smith@u.washington.edu>";

this is after setting ALLOW_DOMAIN_LITERALS and ALLOW_QUOTED_IDENTIFIERS both to true.

Not sure why this is, and it is not relevant for my app (since this format is not allowed) but either there is a bug or I messed something up in the copy/paste :)

Comment by Casey

March 25, 2008 @ 7:23 pm

Hi - thanks, great work, very glad to see it.

Might be worth mentioning in your post above that the parser does not include the “obsolete” parts of the address syntax which are a part of 2822, and, according to sec 4, “MUST be accepted and parsed by a conformant receiver”.

Unless I’m misunderstanding something.

-c

Comment by Casey

March 25, 2008 @ 11:06 pm

Hi, some further notes.

I noticed that CFWS is not included in your parser. We needed that, since we’re doing checking of addresses out of emails, so I went ahead and (partially) implemented it. I’m new to this stuff, so if you have the time to review the code I’d greatly appreciate it. I may well have made some mistakes!! Of course you are welcome to include it in your own code, if you wish. I tested it on ~1700 real-world addresses and there weren’t any false-negatives. Didn’t check for false-positives yet.

I say “partially” because under 2822, comments in CFWS are allowed to nest, but the structure of strings inside of strings doesn’t allow this. So only one-level comments are possible. E.g. the valid address:

“Bob Smith” (Bob Smith)

works, but the valid address:

“Bob Smith” (Bob (the man) Smith)

won’t. Not a deal breaker for us. :-)
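The nesting limitation comes from the regex itself: a pattern assembled from plain string concatenation cannot refer to itself, so it can only describe a fixed depth of parentheses. A rough sketch of the difference follows. The class `CommentDepthDemo` is hypothetical, and its comment text is simplified to "anything except parentheses and backslash" rather than the exact ctext ranges. The hand-written depth counter is just one possible way around the limit, not something from Casey's code.

```java
import java.util.regex.Pattern;

// Hypothetical demo: one-level comment regex versus a depth-counting check.
class CommentDepthDemo {
    // '(' then any mix of non-paren/non-backslash chars or backslash pairs, then ')'.
    private static final Pattern ONE_LEVEL =
            Pattern.compile("\\((?:[^()\\\\]|\\\\.)*\\)");

    static boolean isOneLevelComment(String s) {
        return s != null && ONE_LEVEL.matcher(s).matches();
    }

    // Accepts arbitrarily nested comments by tracking parenthesis depth.
    static boolean isNestedComment(String s) {
        if (s == null || s.isEmpty() || s.charAt(0) != '(') {
            return false;
        }
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character of a quoted pair
                if (i >= s.length()) {
                    return false; // dangling backslash
                }
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i == s.length() - 1; // outer comment must end the string
                }
            }
        }
        return false; // outer comment never closed
    }

    public static void main(String[] args) {
        System.out.println(isOneLevelComment("(Bob (the man) Smith)")); // false
        System.out.println(isNestedComment("(Bob (the man) Smith)"));   // true
    }
}
```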

I also added a flag to permit a “.” in unquoted text, e.g., allowed:
Superstore.com

I also added a flag to permit “[” and “]” in the same place but I turned it off because it seemed to cause an extremely long delay in the parsing.

What I did (maybe I should have made CFWS handling a switch, but I didn’t):

Added:

/**
* This constant allows “.” to appear in atext.
*
* The address:
* Kayaks.org
* …is not valid. It should be:
* “Kayaks.org”
*
* If this boolean is set to false, the parser will act per 2822 and will require
* the quotes; if set to true, it will allow this.
*/
private static final boolean ALLOW_DOT_IN_ATEXT = true;

/**
* This constant allows “[” and “]” to appear in atext.
*
* The address:
* [Kayaks]
* …is not valid. It should be:
* “[Kayaks]”
*
* If this boolean is set to false, the parser will act per 2822 and will require
* the quotes; if set to true, it will allow this.
*
* WARNING: This may be a bug, but it seems like this can cause the parser to hang
* for a while before completing (apparently accurately), e.g. on the corrupted address string:
*
* Bob Smith [mailto:bob@gmail.com]=20
*/
private static final boolean ALLOW_SQUARE_BRACKETS_IN_ATEXT = false;

[note: this section is added just after section 3.2.2]

// RFC 2822 3.2.3 CFWS specification
// note: nesting should be permitted but is not by these rules given code limitations:
private static final String ctext = "[" + noWsCtl + "\\x21-\\x27\\x2A-\\x5B\\x5D-\\x7E]";
private static final String ccontent = ctext + "|" + quotedPair; // + "|" + comment;
private static final String comment = "\\((" + fwsp + ccontent + "+)*" + fwsp + "\\)";
private static final String cfws = "(" + fwsp + comment + "+)*((" + fwsp + comment +
"+)|" + fwsp + "+)+";

[The following lines already existed, but they were modified. Shown here in order, but there is lots of intervening code in some cases:]

private static final String atext = "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~" + (ALLOW_DOT_IN_ATEXT ? "\\." : "") + (ALLOW_SQUARE_BRACKETS_IN_ATEXT ? "\\[\\]" : "") + "]";

private static final String atom = cfws + atext + "+" + cfws;

private static final String dotAtom = cfws + "(" + dotAtomText + ")" + cfws;

private static final String quotedString = cfws + dquote + "(" + fwsp + qcontent + ")*" + fwsp + dquote + cfws;

private static final String domainLiteral = cfws + "\\[" + "(" + fwsp + dcontent + "+)*" + fwsp + "\\]" + cfws;

private static final String angleAddr = cfws + "<" + addrSpec + ">" + cfws;

Comment by Casey

March 25, 2008 @ 11:32 pm

About that code I submitted: it implements strict 2822, so the addresses:
bob @example.com
and
bobjones(comment)@example.com
are both valid, even though the spec says you “SHOULD NOT” do that, because CFWS is allowed after the dot-atom on the left side of the @…
-c

Comment by Casey

May 12, 2008 @ 8:55 pm

[ooops, this was also posted at http://leshazlewood.com/?p=5 — Les, perhaps you could erase my previous comments on this page, since this supersedes them. Thanks.]

Hi there!

I wanted to let you know that I have taken your code and added a number of features to it. I post the link here in case it’s useful to you or anyone reading this. Essentially it adds a number of functions for extracting addresses (and parts of addresses), as well as verifying whole headers (including group tokens, etc.)

You can find it (along with documentation, etc) at:

http://boxbe.com/freebox.html

Modified/added: removed some functions, added support for CFWS token, corrected FWSP token, added some boolean flags, added getInternetAddress and extractHeaderAddresses and other methods, did some optimization of the regex.

Where Mr. Hazlewood’s version was more for ensuring certain forms that were passed in during registrations, etc, this handles more types of verifying as well as a few forms of extracting the data in predictable, cleaned-up chunks.

(I see that you removed my other rambling comments, which I was going to ask you to do anyway. :-) )

Thanks again,
-Casey

Comment by Kenneth P. Turvey

May 24, 2008 @ 9:06 am

Should bademail@squeak?.com be considered a valid email address?

I’m not looking at the pertinent rfc right now, but it doesn’t look like a valid email address to me.
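Reading the constants as posted, the answer seems to depend on the ALLOW_DOMAIN_LITERALS flag. With the default of true, the domain is an RFC 2822 dot-atom, and `?` is one of the atext characters, so `squeak?.com` passes. With the flag set to false, the RFC 1035 pattern is used and it does not. A small sketch, copying just the relevant fragments into a hypothetical `DomainStrictnessDemo` class (surrounding optional whitespace omitted for brevity):

```java
import java.util.regex.Pattern;

// Hypothetical demo: the same domain text under the two domain rules.
class DomainStrictnessDemo {
    private static final String ATEXT = "[a-zA-Z0-9!#$%&'*+\\-/=?^_`{|}~]";
    private static final String DOT_ATOM_TEXT = ATEXT + "+(\\." + ATEXT + "+)*";

    private static final String LET_DIG = "[a-zA-Z0-9]";
    private static final String LET_DIG_HYP = "[a-zA-Z0-9-]";
    private static final String RFC_LABEL =
            LET_DIG + "(" + LET_DIG_HYP + "{0,61}" + LET_DIG + ")?";
    private static final String RFC1035_DOMAIN =
            RFC_LABEL + "(\\." + RFC_LABEL + ")*\\.[a-zA-Z]{2,6}";

    private static final Pattern DOT_ATOM = Pattern.compile(DOT_ATOM_TEXT);
    private static final Pattern RFC1035 = Pattern.compile(RFC1035_DOMAIN);

    static boolean isDotAtom(String s) { return DOT_ATOM.matcher(s).matches(); }

    static boolean isRfc1035Domain(String s) { return RFC1035.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isDotAtom("squeak?.com"));       // true: '?' is atext
        System.out.println(isRfc1035Domain("squeak?.com")); // false: '?' is not a letter, digit or hyphen
    }
}
```

So anyone who wants `bademail@squeak?.com` rejected would probably want ALLOW_DOMAIN_LITERALS set to false.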

Comment by Nanne

June 16, 2008 @ 12:56 am

The domainname specifies that it follows RFC1035, however this RFC states the following:

The labels must follow the rules for ARPANET host names. They must
start with a letter, end with a letter or digit, and have as interior
characters only letters, digits, and hyphen. There are also some
restrictions on the length. Labels must be 63 characters or less.

So according to this www.3com.com is not a valid domainname, however it is according to the pattern described above.

Comment by Les

June 16, 2008 @ 8:07 am

@Nanne

Thanks for the pointer. I think it is OK to leave the definition as it is now: it lets a label start with a letter _or_ a number, and if I changed it to require a leading letter, then obviously 3com.com wouldn’t match.

Clearly this is a domain name resolvable by DNS and has email addresses associated with it, so it’s probably not a good idea to be a 100% reflection of RFC 1035. 99% is good for email, as your example demonstrates ;)

Cheers,

Les
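The trade-off discussed above comes down to the first character of the label pattern. A short sketch comparing the two variants; the class `LabelRuleDemo` and the constant names are hypothetical, with `RELAXED_LABEL` mirroring the rfcLabel token in the class and `STRICT_LABEL` being what a literal reading of the quoted RFC 1035 text would give:

```java
import java.util.regex.Pattern;

// Hypothetical demo: strict RFC 1035 label versus the relaxed one in the class.
class LabelRuleDemo {
    // Literal reading of RFC 1035: first character must be a letter.
    private static final Pattern STRICT_LABEL =
            Pattern.compile("[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?");
    // Same shape as rfcLabel in the class: first character may be a letter or digit.
    private static final Pattern RELAXED_LABEL =
            Pattern.compile("[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?");

    static boolean isStrictLabel(String s) { return STRICT_LABEL.matcher(s).matches(); }

    static boolean isRelaxedLabel(String s) { return RELAXED_LABEL.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isStrictLabel("3com"));  // false
        System.out.println(isRelaxedLabel("3com")); // true
    }
}
```

Both variants still reject a leading or trailing hyphen and cap a label at 63 characters; only the leading digit differs.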
