Re: Entity extraction?

2008-10-24 Thread Rafael Rossini
Solr can do simple faceted search like FAST, but entity extraction
requires other technologies. I do not know how FAST does it, but at the company
I'm working at (www.cortex-intelligence.com), we use a mix of statistical
and language-specific tasks to recognize and categorize entities in the
text. LingPipe is another tool (free) that does that too. In case you would
like to see a simple demo: http://www.cortex-intelligence.com/tech/

Rossini
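
For readers new to the idea, here is a minimal Python sketch of entity extraction feeding facets. It assumes a toy dictionary lookup: the GAZETTEER table and both function names are hypothetical, and this is not how FAST, Cortex or LingPipe work (those rely on statistical and language-specific methods).

```python
from collections import Counter

# Toy gazetteer: a hypothetical lookup table used only for illustration.
GAZETTEER = {
    "brasil": "place",
    "chicago": "place",
    "xerox": "organization",
}


def extract_entities(text):
    """Return {entity_type: sorted entity strings} found in the text."""
    found = {}
    for token in text.lower().split():
        token = token.strip(".,;:!?()\"'")  # drop surrounding punctuation
        kind = GAZETTEER.get(token)
        if kind:
            found.setdefault(kind, set()).add(token)
    return {kind: sorted(names) for kind, names in found.items()}


def facet_counts(docs, kind):
    """Count how many documents mention each entity of the given type."""
    counts = Counter()
    for doc in docs:
        counts.update(extract_entities(doc).get(kind, []))
    return dict(counts)
```

At index time the output of extract_entities would be stored in per-type fields (for example one field for places, one for organizations), and ordinary faceting on those fields then gives the per-entity counts that facet_counts imitates.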


On Fri, Oct 24, 2008 at 6:18 PM, Charlie Jackson <[EMAIL PROTECTED]> wrote:

> During a recent sales pitch to my company by FAST, they mentioned entity
> extraction. I'd never heard of it before, but they described it as
> basically recognizing people/places/things in documents being indexed
> and then being able to do faceting on this data at query time. Does
> anything like this already exist in SOLR? If not, I'm not opposed to
> developing it myself, but I could use some pointers on where to start.
>
>
>
> Thanks,
>
> - Charlie
>
>


Re: Entity extraction?

2008-10-27 Thread Rafael Rossini
Well... IMHO that depends. One of the services we provide is an "automatic
clipping" service, in which our client chooses 20~30 texts from the media
about topics he would like to follow. With classification algorithms we then
keep him aware of every new text of interest to him. We gained about 10% in
precision just by adding EE information to the algorithm.

Rossini

On Mon, Oct 27, 2008 at 2:17 PM, Walter Underwood <[EMAIL PROTECTED]> wrote:

> The vendor mentioned entity extraction, but that doesn't mean you need it.
> Entity extraction is a pretty specific technology, and it has been a
> money-losing product at many companies for many years, going back to
> Xerox ThingFinder well over ten years ago.
>
> My guess is that very few people really need entity extraction.
>
> Using EE for automatic taxonomy generation is even harder to get right.
> At best, that is a way to get a starter set of categories that you can
> edit. You will not get a production quality taxonomy automatically.
>
> wunder
>
> On 10/27/08 8:31 AM, "Charlie Jackson" <[EMAIL PROTECTED]> wrote:
>
> > True, though I may be able to convince the powers that be that it's worth the
> > investment.
> >
> > There are a number of open source or free tools listed on the Wikipedia entry
> > for entity extraction
> > (http://en.wikipedia.org/wiki/Named_entity_recognition#Open_source_or_free) --
> > does anyone have any experience with any of these?
> >
> > 
> > Charlie Jackson
> > 312-873-6537
> > [EMAIL PROTECTED]
> >
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, October 27, 2008 10:23 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Entity extraction?
> >
> > For the record, LingPipe is not free.  It's good, but it's not free.
> >
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > ----- Original Message ----
> >> From: Rafael Rossini <[EMAIL PROTECTED]>
> >> To: solr-user@lucene.apache.org
> >> Sent: Friday, October 24, 2008 6:08:14 PM
> >> Subject: Re: Entity extraction?
> >>
> >> Solr can do simple faceted search like FAST, but entity extraction
> >> requires other technologies. I do not know how FAST does it, but at the company
> >> I'm working at (www.cortex-intelligence.com), we use a mix of statistical
> >> and language-specific tasks to recognize and categorize entities in the
> >> text. LingPipe is another tool (free) that does that too. In case you would
> >> like to see a simple demo: http://www.cortex-intelligence.com/tech/
> >>
> >> Rossini
> >>
> >>
> >> On Fri, Oct 24, 2008 at 6:18 PM, Charlie Jackson wrote:
> >>
> >>> During a recent sales pitch to my company by FAST, they mentioned entity
> >>> extraction. I'd never heard of it before, but they described it as
> >>> basically recognizing people/places/things in documents being indexed
> >>> and then being able to do faceting on this data at query time. Does
> >>> anything like this already exist in SOLR? If not, I'm not opposed to
> >>> developing it myself, but I could use some pointers on where to start.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> - Charlie
> >>>
> >>>
> >
> >
> >
>
>


ArrayIndexOutOfBoundsException on TermScorer

2007-07-23 Thread Rafael Rossini

Hello all,

With one simple query on my index,
"http://localhost:8983/solr/select/?q=brasil", I get this:

1226511

java.lang.ArrayIndexOutOfBoundsException: 1226511
at org.apache.lucene.search.TermScorer.score(TermScorer.java:74)
at org.apache.lucene.search.TermScorer.score(TermScorer.java:61)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.Searcher.search(Searcher.java:118)
at org.apache.lucene.search.Searcher.search(Searcher.java:97)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:888)
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:805)
at org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:698)
at com.cortex.solr.handler.StandardRequestHandler.handleRequestBody(StandardRequestHandler.java:151)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:659)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:193)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:161)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:368)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

Does anyone have a clue what the problem is? In Lucene's TermScorer class,
the exception I'm getting is thrown on this line:


score *= normDecoder[norms[doc] & 0xFF]; // normalize for field



Thanks for any help
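
One way to read the quoted line: `& 0xFF` keeps the inner index between 0 and 255, so, assuming the decoder table has one entry per byte value (256), that lookup cannot fail with index 1226511. The number in the exception message would then be the document id used to index `norms`, which suggests the norms array is shorter than the doc id being scored. A minimal Python sketch of that reasoning follows; the names mirror the Java fields but the function and the array sizes are hypothetical, not Lucene's code.

```python
# Hypothetical stand-in for the decoder table: one entry per possible byte value.
NORM_DECODER = [i / 255.0 for i in range(256)]


def failing_lookup(norms: bytes, doc: int) -> str:
    """Report which array access in normDecoder[norms[doc] & 0xFF]
    would go out of bounds for the given doc id."""
    if not 0 <= doc < len(norms):
        return "norms[doc]"        # doc id is past the end of the norms array
    index = norms[doc] & 0xFF      # always in 0..255 after masking
    assert index < len(NORM_DECODER)
    return "none"
```

If that reading is right, the thing to check is why the norms array for the field is smaller than the largest document id the searcher is iterating over.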


olap with solr (math operations on facets)

2007-09-21 Thread Rafael Rossini
Hi all,

I'm considering building something like a "light-weight OLAP" server with
Lucene/Solr. To achieve that I'd have to do some math operations on facets.
Is that possible?
For example, each of my documents would be a purchase row, like (id,
value, id_department, id_store, id_region ...). If I did a facet query on
id_department the server would return something like: department1: 500,
department2: 400... Is it possible to get the sum, or avg, or any other math
operation on the field value? Then the server would return: department1:
100 (the sum of each value). Is it clear?


[]s
Rossini


Re: olap with solr (math operations on facets)

2007-09-21 Thread Rafael Rossini
Thanks for the reply, Mike. Are there any plans to do something like this? Or
is there some direction anyone could give?

[]s
Rossini


On 9/21/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> On 21-Sep-07, at 8:27 AM, Rafael Rossini wrote:
>
> > Hi all,
> >
> > I'm considering building something like a "light-weight OLAP" server with
> > Lucene/Solr. To achieve that I'd have to do some math operations on facets.
> > Is that possible?
> > For example, each of my documents would be a purchase row, like (id,
> > value, id_department, id_store, id_region ...). If I did a facet query on
> > id_department the server would return something like: department1: 500,
> > department2: 400... Is it possible to get the sum, or avg, or any other math
> > operation on the field value? Then the server would return: department1:
> > 100 (the sum of each value). Is it clear?
>
> Currently this is not possible out of the box with Solr.
>
> -Mike


Re: olap with solr (math operations on facets)

2007-09-22 Thread Rafael Rossini
Thanks for the tip, I'll look at it.

[]s
   Rossini


On 9/21/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> On 21-Sep-07, at 2:42 PM, Rafael Rossini wrote:
>
> > Thanks for the reply, Mike. Are there any plans to do something like this? Or
> > is there some direction anyone could give?
>
> Probably the easiest thing to do is write a custom request handler
> that iterates over the field cache and computes the statistics you
> want (loading the docs would probably be too slow).
>
> Check out SimpleFacets.java to see how it uses the FieldCache.
>
> -Mike
>
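
A rough Python sketch of the approach suggested above, under the assumption that a field cache can be pictured as one array per field, indexed by internal document id. The function name and the plain lists are hypothetical illustrations, not Solr or Lucene APIs; a real request handler would be written in Java against the FieldCache.

```python
def stats_from_field_cache(matching_docs, facet_values, numeric_values):
    """Aggregate count, sum and avg of a numeric field per facet value.

    facet_values[doc] and numeric_values[doc] are per-document arrays
    indexed by doc id; matching_docs holds the doc ids that matched the query.
    """
    stats = {}
    for doc in matching_docs:
        entry = stats.setdefault(facet_values[doc], {"count": 0, "sum": 0.0})
        entry["count"] += 1
        entry["sum"] += numeric_values[doc]
    for entry in stats.values():
        entry["avg"] = entry["sum"] / entry["count"]
    return stats
```

The point of working from per-field arrays is that no stored documents are loaded: each matching doc id costs two array reads and a dictionary update.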


Re: solr+hadoop = next solr

2007-06-07 Thread Rafael Rossini

Hi, Jeff and Mike.

  Would you mind telling us a little bit about the architecture of your
solutions? Mike, you said that you implemented a highly-distributed search
engine using Solr as indexing nodes. What does that mean? Did you implement
a master/multi-slave solution for replication? Or is the whole index
sharded for high availability and failover?


On 6/7/07, Jeff Rodenburg <[EMAIL PROTECTED]> wrote:


Mike - thanks for the comments.  Some responses added below.

On 6/7/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
>
> I've implemented a highly-distributed search engine using Solr (200m
> docs and growing, 60+ servers).   It is not a Solr-based solution in
> the vein of FederatedSearch--it is a higher-level architecture that
> uses Solr as indexing nodes.  I'll note that it is a lot of work and
> would be even more work to develop in the generic extensible
> philosophy that Solr espouses.


Yeah, we've done the same thing in the .Net world, and it's a tough slog.
We're in the same situation -- making our solution generically extensible
is pretty much a non-starter.

> > In terms of the FederatedSearch wiki entry (updated last year), has there
> > been any progress made this year on this topic, at least something worthy of
> > being added or updated to the wiki page?  Not to splinter efforts here, but
> > maybe a working group that was focused on that topic could help to move
> > things forward a bit.
>
> I don't believe that absence of organization has been the cause of
> lack of forward progress on this issue, but simply that there has
> been no-one sufficiently interested and committed to prioritizing
> this huge task to work on it.  There is no need to form a working
> group (not when there are only a handful of active committers to
> begin with)--all interested people could just use solr-dev@ for
> discussion.


That makes sense, just didn't want to bombard the list with the subject if
it was a detractor from the core project, i.e. keep lucene messages on
lucene, solr messages on solr, etc.  The good-community-participant
approach, if you will.

> Solr is an open-source project, so huge features will get implemented
> when there is a person or group of people devoted to leading the
> charge on the issue.  If you're interested in being that person,
> that's great!
>
>
Glad to jump in, not sure I qualify as such for that, but certainly a big
cheerleader nonetheless.



Re: multiple indices

2007-06-27 Thread Rafael Rossini

I have 3 different instances of Solr on Jetty 6.1.13, but you need Jetty
Plus.
My etc/jetty.xml looks like this:

   
 

   
/webapps/solr1
/solr1
/etc/webdefault.xml
solr/home
override this value

/webapps/solr2
/solr2
/etc/webdefault.xml
solr/home
override this value


Then, in webapps/solr1/WEB-INF, you need a jetty-env.xml like this:


http://jetty.mortbay.org/configure.dtd
solr/home
/solr1






Hope it helps



On 6/26/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:


Hm, that JNDI again... this makes it sound like SOLR-215 is completely
superfluous?
I have not configured Jetty this way yet, but I do see some docs on
http://wiki.apache.org/solr/SolrJetty .  Interestingly, the configs look a
lot different than what's described on
http://docs.codehaus.org/display/JETTY/JNDI .  I also remember Jetty Plus
from a while back, but now I cannot find any information about Jetty Plus
6.*, only 5 - http://jetty.mortbay.org/jetty5/plus/index.html .

Otis



----- Original Message ----
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, June 26, 2007 8:10:46 PM
Subject: Re: multiple indices


:   I have multiple applications (blogs/forums/video/etc) - each of these
: is independent (no need to perform queries on multiple indices).

:   Would it be best to use multiple instances of SOLR/JVM - one for each
: index or use a solution where only one JVM instance is running (maybe
: solr-215?)?


you don't actually need multiple JVM instances to run multiple Solr
instances ... you can configure your ServletContainer to run the solr.war
in multiple contexts, each of which has a different solrconfig.xml and
schema.xml (using JNDI) ... that way you get most of the benefits of
isolated instances but can also take advantage of a single large heap
and common connection management.




-Hoss