I'll have a look at it, thanks everyone.

--
Gian Maria Ricci
Mobile: +39 320 0136949
    


-----Original Message-----
From: Steve Rowe [mailto:sar...@gmail.com] 
Sent: Thursday, June 13, 2013 9:03 PM
To: solr-user@lucene.apache.org
Subject: Re: analyzer for Code

Hi Gian Maria,

OpenGrok <http://opengrok.github.io/OpenGrok/> has a bunch of JFlex-based
computer language tokenizers for Lucene:
<https://github.com/OpenGrok/OpenGrok/tree/master/src/org/opensolaris/opengrok/analysis>.
Not sure how much work it would be to use them in another project, though.

There's a bunch of JFlex grammars listed here, though most (almost all?) are
not integrated with Lucene:

<http://sourceforge.net/apps/mediawiki/jflex/index.php?title=ExternalJFlexGrammars>

Looks like at least the Jsyntaxpane and RSyntaxTextArea projects have
multiple programming language lexers.

Steve

On Jun 13, 2013, at 1:40 PM, Gian Maria Ricci <alkamp...@nablasoft.com>
wrote:

> Thanks for the suggestions, I'll try the WordDelimiterFilterFactory.
> My aim is not a perfect analysis, just a way to quickly search for
> words in the whole history of a codebase. :)
>  
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
>    
>  
>  
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, June 13, 2013 1:24 PM
> To: solr-user@lucene.apache.org; Gian Maria Ricci
> Subject: Re: analyzer for Code
>  
> Well, WordDelimiterFilterFactory would split on the punctuation, so 
> you could add it to the analyzer chain along with StandardAnalyzer.
>  
> You could use one of the regex filters to break up tokens that make it 
> through the analyzer as you see fit.
>  
> But in general, this will be a bunch of compromises since programming 
> languages are, shall we say, not standard <G>
>  
> Best
> Erick
>  
> 
> On Thu, Jun 13, 2013 at 4:19 AM, Gian Maria Ricci
> <alkamp...@nablasoft.com> wrote:
> I did a little searching around and did not find anything interesting.
> Does anyone know if any analyzers exist to better index source code
> (e.g. C#, C++, Java, etc.)?
>  
> Standard analyzer is quite good, but I would like to know if there are
> more specific analyzers that index code better. E.g., I did a quick test
> with C#, and the full class name was indexed without splitting on dots,
> so MyLib.Helpers.Myclass became one token and a search for MyClass
> found no matches.
>  
> Thanks in advance.
>  
> --
> Gian Maria Ricci
> Mobile: +39 320 0136949
>    
>  
>  
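For illustration, the kind of splitting discussed in this thread (breaking a dotted, mixed-case identifier into searchable parts while keeping the original) can be sketched in a few lines of Python. This is not Lucene or Solr code and does not reproduce WordDelimiterFilterFactory's actual rules; the function name split_code_token and its preserve_original parameter are hypothetical, chosen only for this sketch.

```python
import re

# One or more characters that are not ASCII letters or digits
# (dots, "::", "->", underscores, and so on).
_DELIMITERS = re.compile(r"[^A-Za-z0-9]+")
# Zero-width position between a lowercase letter or digit and an uppercase letter.
_CASE_CHANGE = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def split_code_token(token, preserve_original=True):
    """Return lowercase index terms for one source-code token (hypothetical helper)."""
    terms = []
    if preserve_original:
        # Keep the whole identifier so exact searches still match.
        terms.append(token.lower())
    for part in _DELIMITERS.split(token):
        if not part:
            continue
        terms.append(part.lower())
        # Mark each case change with a space, then split on whitespace.
        subparts = _CASE_CHANGE.sub(" ", part).split()
        if len(subparts) > 1:
            terms.extend(s.lower() for s in subparts)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))


if __name__ == "__main__":
    print(split_code_token("MyLib.Helpers.Myclass"))
    # ['mylib.helpers.myclass', 'mylib', 'my', 'lib', 'helpers', 'myclass']
```

Lowercasing every term means a query for MyClass, once lowercased the same way, matches the indexed term myclass from the example in the original question.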

