[PATCH] D12906: [RFC] Bug identification("issue_hash") change for CmpRuns.py

Honggyu Kim via cfe-commits Wed, 16 Sep 2015 07:57:18 -0700

honggyu.kim created this revision.
honggyu.kim added reviewers: jordan_rose, krememek, zaks.anna, danielmarjamaki, 
babati, dcoughlin.
honggyu.kim added subscribers: cfe-commits, phillip.power, seaneveson, 
j.trofimovich, hk.kang, eszasip, dkrupp, o.gyorgy, xazax.hun, premalatha_mvs.


This patch brings the bug identification method from D10305 into the existing 
infrastructure.
With this patch applied, two different bug reports can be compared using the 
existing CmpRuns.py.

Currently, "issue_hash" in the plist file is just the line offset from the 
beginning of the function.
As a result, it cannot distinguish even simple cases like the following two, 
which are completely different bugs.

BUG 1. garbage return value
```
1 int main()
2 {
3   int a;
4   return a;
5 }

test.c:4:3: warning: Undefined or garbage value returned to caller
  return a;
  ^~~~~~~~
```
BUG 2. garbage assignment
```
1 int main()
2 {
3   int a;
4   int b = a;
5   return b;
6 }

test.c:4:3: warning: Assigned value is garbage or undefined
  int b = a;
  ^~~~~   ~
```

Moreover, the following case is regarded as a different bug when it is compared 
with BUG 1.

BUG 3. a single comment line is added to the BUG 1 code.
```
1 int main()
2 {
3   // main function
4   int a;
5   return a;
6 }

test.c:5:3: warning: Undefined or garbage value returned to caller
  return a;
  ^~~~~~~~
```
The comparison result is as follows:
```
REMOVED: 'test.c:4:3, Logic error: Undefined or garbage value returned to caller'
ADDED: 'test.c:5:3, Logic error: Undefined or garbage value returned to caller'
TOTAL REPORTS: 1
TOTAL DIFFERENCES: 2
```
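
The three cases above can be reproduced with a minimal sketch of the current 
scheme. It assumes the offset is measured from the line of the function body's 
opening brace (line 2 in every example); the function name `old_issue_hash` is 
made up for illustration and is not part of clang or CmpRuns.py.

```python
def old_issue_hash(report_line, body_start_line):
    """Current scheme: line offset of the report from the start of the function body."""
    return report_line - body_start_line

# In all three examples the opening brace of main() is on line 2.
bug1 = old_issue_hash(4, 2)  # BUG 1: garbage return value, reported at test.c:4:3
bug2 = old_issue_hash(4, 2)  # BUG 2: garbage assignment, reported at test.c:4:3
bug3 = old_issue_hash(5, 2)  # BUG 3: BUG 1 plus one comment line, reported at test.c:5:3

print(bug1 == bug2)  # different bugs, same hash
print(bug1 == bug3)  # same bug, different hash
```

BUG 1 and BUG 2 collide because only the line offset is hashed, while BUG 3 
gets a new value because the added comment shifts the offset by one.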

This patch brings the bug identification method and code from D10305, and it 
generates the "issue_hash" from the following information:
1. column number
2. source line string after removing whitespace
3. bug type (bug message)

This patch is not the final solution, but it lets "issue_hash" distinguish 
cases like the ones above by generating a stronger hash value.
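
As a rough sketch of how the three inputs listed above combine, the following 
mirrors the `column + "$" + line + "$" + bug type` key that the patch feeds to 
MD5. It makes two assumptions: stripping all whitespace stands in for the 
patch's lexer-based `NormalizeLine`, and the bug type strings are placeholders 
rather than the exact values returned by `getBugType()`. The name 
`new_issue_hash` is hypothetical.

```python
import hashlib

def new_issue_hash(column, source_line, bug_type):
    """Sketch of the proposed hash: MD5 over column, normalized line, and bug type."""
    # Assumption: dropping all whitespace approximates the patch's NormalizeLine,
    # which actually re-lexes the line with clang's raw lexer.
    normalized = "".join(source_line.split())
    key = f"{column}${normalized}${bug_type}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Placeholder bug type strings, chosen only to differ between the two checkers.
RETURN_BUG = "Garbage return value"
ASSIGN_BUG = "Assigned value is garbage or undefined"

bug1 = new_issue_hash(3, "  return a;", RETURN_BUG)   # test.c:4:3
bug2 = new_issue_hash(3, "  int b = a;", ASSIGN_BUG)  # test.c:4:3
bug3 = new_issue_hash(3, "  return a;", RETURN_BUG)   # test.c:5:3, after the added comment

print(bug1 == bug2)  # different bugs now get different hashes
print(bug1 == bug3)  # the shifted line no longer changes the hash
```

Because the line number is not part of the key, inserting or removing 
unrelated lines above the report leaves the hash unchanged, while a different 
statement or a different bug type produces a different one.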

http://reviews.llvm.org/D12906

Files:
  lib/StaticAnalyzer/Core/PlistDiagnostics.cpp

Index: lib/StaticAnalyzer/Core/PlistDiagnostics.cpp
===================================================================
--- lib/StaticAnalyzer/Core/PlistDiagnostics.cpp
+++ lib/StaticAnalyzer/Core/PlistDiagnostics.cpp
@@ -22,6 +22,11 @@
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/Support/Casting.h"
+#include "llvm/Support/LineIterator.h"
+#include "clang/AST/ASTContext.h"
+#include "llvm/Support/MD5.h"
+#include <sstream>
+
 using namespace clang;
 using namespace ento;
 using namespace markup;
@@ -285,6 +290,57 @@
   }
 }
 
+static std::string GetNthLineOfFile(llvm::MemoryBuffer *Buffer, int Line) {
+  if (!Buffer)
+    return "";
+
+  llvm::line_iterator LI(*Buffer, false);
+  for (; !LI.is_at_eof() && LI.line_number() != Line; ++LI)
+    ;
+
+  return LI->str();
+}
+
+static std::string NormalizeLine(const SourceManager *SM, FullSourceLoc &L,
+                                 const Decl *D) {
+  static const std::string whitespaces = " \t\n";
+
+  const LangOptions &Opts = D->getASTContext().getLangOpts();
+  std::string str = GetNthLineOfFile(SM->getBuffer(L.getFileID(), L), L.getExpansionLineNumber());
+  unsigned col = str.find_first_not_of(whitespaces);
+
+  SourceLocation StartOfLine = SM->translateLineCol(SM->getFileID(L), L.getExpansionLineNumber(), col);
+  llvm::MemoryBuffer *Buffer = SM->getBuffer(SM->getFileID(StartOfLine), StartOfLine);
+  if (!Buffer) return {};
+
+  const char *BufferPos = SM->getCharacterData(StartOfLine);
+
+  Token Token;
+  Lexer Lexer(SM->getLocForStartOfFile(SM->getFileID(StartOfLine)), Opts,
+              Buffer->getBufferStart(), BufferPos, Buffer->getBufferEnd());
+
+  size_t nextStart = 0;
+  std::ostringstream lineBuff;
+  while (!Lexer.LexFromRawLexer(Token) && nextStart < 2) {
+    if (Token.isAtStartOfLine() && nextStart++ > 0) continue;
+    lineBuff << std::string(SM->getCharacterData(Token.getLocation()), Token.getLength());
+  }
+
+  return lineBuff.str();
+}
+
+static llvm::SmallString<32> GetHashOfContent(StringRef Content) {
+  llvm::MD5 Hash;
+  llvm::MD5::MD5Result MD5Res;
+  llvm::SmallString<32> Res;
+
+  Hash.update(Content);
+  Hash.final(MD5Res);
+  llvm::MD5::stringifyResult(MD5Res, Res);
+
+  return Res;
+}
+
 void PlistDiagnostics::FlushDiagnosticsImpl(
                                     std::vector<const PathDiagnostic *> &Diags,
                                     FilesMade *filesMade) {
@@ -420,9 +476,12 @@
           EmitString(o, declName) << '\n';
         }
 
-        // Output the bug hash for issue unique-ing. Currently, it's just an
-        // offset from the beginning of the function.
-        if (const Stmt *Body = DeclWithIssue->getBody()) {
+        // Output the bug hash for issue unique-ing.
+        // Currently, it contains the following information:
+        //   1. column number
+        //   2. source line string after removing whitespace
+        //   3. bug type
+        if (DeclWithIssue->getBody()) {
 
           // If the bug uniqueing location exists, use it for the hash.
           // For example, this ensures that two leaks reported on the same line
@@ -433,19 +492,22 @@
           if (UPDLoc.isValid()) {
             FullSourceLoc UL(SM->getExpansionLoc(UPDLoc.asLocation()),
                              *SM);
-            FullSourceLoc UFunL(SM->getExpansionLoc(
-              D->getUniqueingDecl()->getBody()->getLocStart()), *SM);
             o << "  <key>issue_hash</key><string>"
-              << UL.getExpansionLineNumber() - UFunL.getExpansionLineNumber()
+              << GetHashOfContent(
+                 std::to_string(UL.getExpansionColumnNumber()) + "$" +
+                 ::NormalizeLine(SM, UL, DeclWithIssue) + "$" +
+                 D->getBugType().str())
               << "</string>\n";
 
           // Otherwise, use the location on which the bug is reported.
           } else {
             FullSourceLoc L(SM->getExpansionLoc(D->getLocation().asLocation()),
                             *SM);
-            FullSourceLoc FunL(SM->getExpansionLoc(Body->getLocStart()), *SM);
             o << "  <key>issue_hash</key><string>"
-              << L.getExpansionLineNumber() - FunL.getExpansionLineNumber()
+              << GetHashOfContent(
+                 std::to_string(L.getExpansionColumnNumber()) + "$" +
+                 ::NormalizeLine(SM, L, DeclWithIssue) + "$" +
+                 D->getBugType().str())
               << "</string>\n";
           }

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
