I've taken the liberty of reconstructing your WHOLE regex. Here it is...

(([a-zA-Z0-9_\\-]+)(\\s*=\\s*(\"(.*?)\"|'(.*?)'|([^'\">\\s]+)))?)

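One way to see how the groups in this expression behave is to run it over a few attribute strings and print every group. The sketch below does that with java.util.regex rather than ORO. That swap is an assumption made so the example runs without extra jars, and ORO's Perl5 engine may differ in details. The class name GroupDump and the sample inputs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDump {
    // The expression quoted above, written as a Java string literal.
    // Compiled with java.util.regex, NOT with ORO's Perl5Compiler (assumption).
    static final Pattern NVPS = Pattern.compile(
        "(([a-zA-Z0-9_\\-]+)(\\s*=\\s*(\"(.*?)\"|'(.*?)'|([^'\">\\s]+)))?)");

    // Returns the text of group 0..N for the first match in the input.
    // A null entry marks a group that did not take part in the match.
    // Returns null if the pattern does not match at all.
    static String[] groups(String input) {
        Matcher m = NVPS.matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample attributes: double-quoted, single-quoted, plain, and bare.
        String[] samples = {"name=\"x\"", "type='hidden'", "value=42", "checked"};
        for (String s : samples) {
            String[] g = groups(s);
            System.out.println("input: " + s);
            for (int i = 0; i < g.length; i++) {
                System.out.println("  group " + i + " = " + g[i]);
            }
        }
    }
}
```

For the bare attribute checked, groups 3 through 7 come back as null because the optional "= value" part never matched. In java.util.regex the group numbers are fixed by the position of the opening parenthesis, so the indices do not shift, but any code that assumes a given group is always set will fail on such input.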
As predicted, one of your main groups is optional. (I can't recall ALL the rules for Oro's numbering of nested parentheses. Like Perl, it's dynamic, and depends on the existence of which groups are recognized. I avoid such complexities. You should, too. See below.)

I suggest you insert this in Perl and run your input through it. Print out everything that the whole thing recognizes. Also print and label each group. It will show that different numbers of groups are recognized and that YOUR expectations of which group is which are NOT what IT expects!!!

Your expression is too complicated for you (or me) to debug. There are up to 7 capturing groups here!!!

Suggestion. Use a NONCAPTURING GROUP, i.e., (?:pattern), when you don't need to capture or when the group is optional. And make each group you intend to capture (and number) REQUIRED. If there are two overall patterns you need to test, test them independently.

2nd suggestion. Use an open source HTML parser instead. They've already solved this problem.

Final conclusion: There ain't a bug in Oro. The bug is in your logic.

Enjoy.
Kevin

-----Original Message-----
From: Balaji [mailto:[email protected]]
Sent: Thu 4/2/2009 4:14 AM
To: Kevin Markey; [email protected]
Subject: RE: Is this a bug with oro?

Hi Kevin,

Apologies for missing the diTag pattern.
Here it is,

private static String start = "<" ;
private static String tagNames = "(form|input\\s+|head|/?select\\s+|option\\s+|textarea\\s+"
        + "|checkboxgroup\\s+|radiogroup|/?optintrue){1}" ;
private static String anything = "([^>]*)" ;
private static String end = "[/]*>" ;
private static Pattern diTag;

private static String attribute = "[a-zA-Z0-9_\\-]+" ;
private static String optWS = "\\s*" ;
private static String dquoted = "\"(.*?)\"" ;
private static String squoted = "'(.*?)'" ;
private static String plain = "([^'\">\\s]+)" ;
private static Pattern nvps ;

private static PatternMatcher primaryMatcher = new Perl5Matcher() ;
private static PatternCompiler compiler = new Perl5Compiler() ;

diTag = compiler.compile( start + tagNames + anything + end , Perl5Compiler.CASE_INSENSITIVE_MASK ) ;
nvps = compiler.compile( "((" + attribute + ")" + "(" + optWS + "=" + optWS
        + "(" + dquoted + "|" + squoted + "|" + plain + "))?)" );

The different failure scenarios that you have mentioned should fail consistently (for the same input), correct? In this case, for the same input, the NPE occurs only occasionally. Here the input is an HTML file read over http. Do you think the NPE can occur when the HTML is not available for some reason (network issue, etc.)?

Thanks,
Balaji Prabhakaran

_____

From: Kevin Markey [mailto:[email protected]]
Sent: Tuesday, March 31, 2009 11:31 PM
To: ORO Users List; [email protected]; [email protected]
Cc: Kevin Markey
Subject: RE: Is this a bug with oro?

One more thing to do for your diagnostics. Do these so you can identify where in __setLastMatchResult() you fail.

- Get the source, recompile the jar with debugging information so you get the line number.
- Turn off any obfuscation.

Also provide the diTag pattern that is used when this fails. (I don't see it defined in your snippet.) That is key.

Still, I have a hunch... The regex apparently has 2 groups. I predict your pattern allows a match **without** matching the groups.
As a result, __originalInput is reset to null at the conclusion of __setLastMatchResult() after matching the 1st group, setting off the NPE on the next iteration of your WHILE loop, or the __beginGroupOffset or __endGroupOffset or __endMatchOffsets arrays might be null. I'm not totally familiar with the source code, but I've used it for several years, and these are the things that typically fail. B.t.w., 2.0.6 and 2.0.8 are not substantially different in these regards.

So, make sure that BOTH groups are required in your regex.

Kevin

-----Original Message-----
From: Balaji [mailto:[email protected]]
Sent: Tue 3/31/2009 9:09 AM
To: [email protected]
Subject: RE: Is this a bug with oro?

Hi Kevin,

Thanks a lot for your reply. I highly appreciate your help. Here are the required details.

The version is 2.0.8.

The context is this: I am trying to read an HTML file over http and parse the values of some hidden attributes in the HTML form.

Here is the code. The exception occurs at the line marked below. It occurs randomly and is not reproducible at will. The string passed to contains() is never null, and the result is always checked for true before calling getMatch(). Please check if I am missing something.

******************class that contains the code that throws the exception************

public class Parser {
    private static Pattern diTag;
    private static PatternMatcher primaryMatcher = new Perl5Matcher() ;
    private static PatternCompiler compiler = new Perl5Compiler() ;

    public static void initialize(){
        . . .
    }

    public Parser( StringBuffer input) {
        this.input = input ;
    }

    public Vector parse() {
        Vector returnValue=null;
        PatternMatcherInput patternMatcherInput = new PatternMatcherInput(input.toString());
        int previous = 0 ;
        while(primaryMatcher.contains(patternMatcherInput,diTag)) {
            MatchResult result = primaryMatcher.getMatch(); //exception is thrown here....
            String dataString = input.substring(previous,patternMatcherInput.getMatchBeginOffset());
            String tag = result.group(1);
            String inputS = result.group(2);
            try {
                returnValue=processDITag( tag.toUpperCase(),inputS ) ;
                previous = patternMatcherInput.getCurrentOffset() ;
            } catch(NotHandledException nh) {
                previous = patternMatcherInput.getMatchBeginOffset() ;
            }
        }
        return returnValue;
    }

    public Vector processDITag( String tag, String inputString ) throws NotHandledException {
        . . .
    }
}

******************code that calls the method in the above class*******************************

diHTML = readInputFile(queryParametersBean.getSurveyName()); //reads the data from a html file over http
if(diHTML.length()==0) {
    LogWriter.info(CLASS_NAME,"loadPageEvent(HttpServletRequest req)","The file name is not available" + sHtmlPath);
    sFileName=ConfigBean.getProperty(sSerPathFileName); // replace with exact file name
    sFileName=sFilePath + sFileName;
    queryParametersBean.setSurveyName(sFileName);
    diHTML = readInputFile(queryParametersBean.getSurveyName());
    LogWriter.info(CLASS_NAME,"loadPageEvent(HttpServletRequest req)","The file name from config file" + sFileName);
}
if(diHTML.length()==0) {
    LogWriter.info(CLASS_NAME,"loadPageEvent(HttpServletRequest req)","The file name is not in akamai server");
} else {
    if(!( queryParametersBean.getEmail() != null
            && queryParametersBean.getEmail().length() != 0
            && (ProcessorSupport.validateEmailAddress(queryParametersBean.getEmail())==false)
            && diHTML.length() !=0)) {
        LogWriter.info(CLASS_NAME,"loadPageEvent(HttpServletRequest req)","queryParametersBean track page load " + queryParametersBean.getEmail());
        System.out.println("inside load event");
        Parser myParser = new Parser(diHTML, queryParameters) ;
        Vector resultString=myParser.parse();
        Iterator itrelements=resultString.iterator();
        . . .
    }
}

*************************************************************************************************************

Thanks,
Balaji Prabhakaran

_____

From: Kevin Markey [mailto:[email protected]]
Sent: Tuesday, March 31, 2009 6:48 PM
To: ORO Users List; [email protected]; [email protected]
Subject: RE: Is this a bug with oro?

Some context and code in which this fails and data with which this fails would help. Also the version you are using would help.

However, inspecting 2.0.6 code (which is the most handy on the machine I'm on -- I suspect other code is similar), there is only one place in __setLastMatchResult() where you can get a NPE. __lastMatchResult is non-null. OpCode is non-null. However, __originalInput MIGHT be null. Hence you can get a NPE where the __originalInput.length is tested.

Check your code for whether the string passed to contains() is null, and always check that the result is true. E.g.,

private PatternCompiler m_compiler = new Perl5Compiler();
private PatternMatcher m_matcher = new Perl5Matcher();
private Pattern m_commentRegex = m_compiler.compile ( "#" );

/** Extract comment from string. */
public String findComment ( String s ) {
    if ( s == null ) return null;
    if ( m_matcher.contains ( s, m_commentRegex ) ) {
        MatchResult result = m_matcher.getMatch();
        String comment = s.substring ( result.endOffset(0) );
        return comment;
    }
    return null;
}

Enjoy.
Kevin Markey

-----Original Message-----
From: Balaji [mailto:[email protected]]
Sent: Tue 3/31/2009 6:22 AM
To: [email protected]
Subject: Is this a bug with oro?

Hello,

I occasionally get the exception below. The call to getMatch is causing a NullPointerException.

Caused by: java.lang.NullPointerException
    at org.apache.oro.text.regex.Perl5Matcher.__setLastMatchResult(Unknown Source)
    at org.apache.oro.text.regex.Perl5Matcher.getMatch(Unknown Source)

Here is what the API documentation says:

A MatchResult instance containing the pattern match found by the last call to any one of the matches() or contains() methods.
If no match was found by the last call, returns null.

I believe this is a bug. Can you guys please confirm? If so, is there a fix or a workaround for this bug?

Any help will be greatly appreciated.

Thanks,
Balaji Prabhakaran
