Re: Tokenize Sentence and Set Attribute

2013-05-08 Thread Edward Garrett
i find UpdateRequestProcessors (
http://wiki.apache.org/solr/UpdateRequestProcessor) a handy way to add and
remove NLP-related fields to a document as it is processed by Solr. this is
also how UIMA integrates with Solr (http://wiki.apache.org/solr/SolrUIMA).
you might want to take a look at UIMA as well.
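
for orientation, here is a rough python sketch of the tag-then-filter flow described in the quoted question below. it is not Solr or UIMA code: `toy_tagger` is a hypothetical stand-in for a real part-of-speech tagger, and the sentence and word splitters are deliberately naive.

```python
import re

def toy_tagger(words):
    # hypothetical tagger: takes the word list of one sentence and
    # returns one tag per word
    nouns = {"buku", "rumah"}
    return ["NN" if w.lower() in nouns else "OTHER" for w in words]

def split_sentences(text):
    # naive sentence splitter: break after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # naive word tokenizer: runs of word characters
    return re.findall(r"\w+", sentence)

def keep_tokens(text, tagger, keep_tags):
    # tag each sentence as a whole, then keep only tokens with a wanted tag
    kept = []
    for sentence in split_sentences(text):
        words = split_words(sentence)
        tags = tagger(words)
        kept.extend(w for w, t in zip(words, tags) if t in keep_tags)
    return kept
```

in an UpdateRequestProcessor the same steps would run once per document at index time, and the kept tokens could be written to an extra field.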


On Mon, May 6, 2013 at 6:22 PM, Jack Krupansky wrote:

> Sounds like a very ambitious project. I'm sure you COULD do it in Solr,
> but not in very short order.
>
> Check out some discussion of simply searching within sentences:
> http://markmail.org/message/aoiq62a4mlo25zzk?q=apache#query:apache+page:1+mid:aoiq62a4mlo25zzk+state:results
>
> First, how do you expect to use/query the corpus?  In other words, what
> are your user requirements? They will determine what structure the Solr
> index, analysis chains, and custom search components will need.
>
> Also, check out the Solr OpenNLP wiki:
> http://wiki.apache.org/solr/OpenNLP
>
> And see "LUCENE-2899: Add OpenNLP Analysis capabilities as a module":
> https://issues.apache.org/jira/browse/LUCENE-2899
>
> -- Jack Krupansky
>
> -Original Message- From: Rendy Bambang Junior
> Sent: Monday, May 06, 2013 11:41 AM
> To: solr-user@lucene.apache.org
> Subject: Tokenize Sentence and Set Attribute
>
>
> Hello,
>
> I am trying to use a part-of-speech tagger for Bahasa Indonesia to filter
> tokens in Solr.
> The tagger receives the word list of a sentence as input and returns an
> array of tags.
>
> I think the process should be like this:
> - tokenize sentence
> - tokenize word
> - pass it into the tagger
> - set attribute using tagger output
> - pass it into a FilteringTokenFilter implementation
>
> Is it possible to do this in Solr/Lucene? If it is, how?
>
> I've read about a similar solution for Japanese, but since I don't
> understand Japanese, it didn't help much.
>
> --
> Regards,
> Rendy Bambang Junior
> Informatics Engineering '09
> Bandung Institute of Technology
>



-- 
edge


Re: Calculate a sum.

2013-01-14 Thread Edward Garrett
i've had perfectly fine performance with StatsComponent, but have only
tested with 50,000 documents. for example i have field syllables and
numeric field syllables_count. then i sum the syllable count for any
search query. how many documents are you working with?
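
a minimal sketch of what such a request could look like from a client, in python. the base url and the field name `syllables_count` are assumptions for illustration; `stats`, `stats.field`, `rows` and `wt` are standard request parameters.

```python
from urllib.parse import urlencode

def stats_sum_url(base_url, query, field):
    # rows=0 skips the document list, so only the statistics come back
    params = {
        "q": query,
        "rows": 0,
        "stats": "true",
        "stats.field": field,
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)

def extract_sum(response, field):
    # with wt=json the sum sits under stats -> stats_fields -> <field> -> sum
    return response["stats"]["stats_fields"][field]["sum"]
```

for example, `stats_sum_url("http://localhost:8983/solr/select", "*:*", "syllables_count")` builds the request, and `extract_sum` pulls the sum out of the decoded json, so nothing has to be paged through on the client.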

On Mon, Jan 14, 2013 at 10:54 AM, Mikhail Khludnev wrote:
> Stored fields are famous for their slowness, and they require two IO
> operations per doc. You can spend some heap for uninverting the index and
> utilize wiki.apache.org/solr/StatsComponent
> Let us know whether it works for you.
> On 14.01.2013 13:14, "stockii" wrote:
>
>> hello.
>>
>> My problem is that i need to calculate a sum of amounts. this amount is in
>> my index (stored="true"). my php script gets all values with paging. but if
>> a request takes too long, jetty kills the process and i get a "broken
>> pipe".
>>
>> Which is the best/fastest way to get the values of many fields from the index?
>> is there a ResponseHandler for exports? Or which is the fastest?
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Calculate-a-sum-tp4033091.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>



-- 
edge


Re: indexing Text file in solr

2013-01-29 Thread Edward Garrett
i don't have experience with this but it looks like you could use, from DIH:

http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
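
if DIH turns out to be awkward, a different route (not LineEntityProcessor) is to build one document per line on the client and send the batch to solr's update handler. a small python sketch of the document-building step; the field names `id` and `text` and the `tweet-` prefix are assumptions and would have to match the schema.

```python
def lines_to_docs(lines, id_prefix="tweet-"):
    # one solr document per non-empty line; the line number makes the id unique
    docs = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        docs.append({"id": f"{id_prefix}{number}", "text": text})
    return docs
```

usage would be along the lines of `lines_to_docs(open("tweets.txt", encoding="utf-8"))`, with the resulting list serialized as json and posted in batches.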


On Sun, Jan 27, 2013 at 10:23 AM, hadyelsahar wrote:
> i have a large Arabic text file that contains tweets, one tweet per line.
> i want to index it in solr such that each line of the file is indexed as
> a separate solr document.
>
> what i tried so far:
>
> - i know how to index SQL database records in solr
> - i know how to change the solr schema to fit the data, and how to work
>   with the Data Import Handler
> - i know how the queries used to index data in solr work
>
> what i want is:
>
> - to know how to index a text file in solr so that each line is treated
>   as a solr document
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/indexing-Text-file-in-solr-tp4036496.html
> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
edge


get a list of terms sorted by total term frequency

2012-11-07 Thread Edward Garrett
hi,

is there a simple way to get a list of all terms that occur in a field
sorted by their total term frequency within that field?

TermsComponent (http://wiki.apache.org/solr/TermsComponent) "provides
fast field faceting over the whole index", but as counts it gives the
number of documents that each term occurs in (given a field or set of
fields). in place of document counts, i want total term frequency
counts. the ttf function
(http://wiki.apache.org/solr/FunctionQuery#totaltermfreq) provides
this, but only if you know what term to pass to the function.
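
to make the distinction concrete, here is a toy python sketch (plain whitespace tokenization, no Solr involved) that computes both numbers for a handful of documents and sorts by total term frequency:

```python
from collections import Counter

def doc_and_total_freqs(docs):
    # doc_freq counts documents containing a term (what TermsComponent reports);
    # total_freq counts every occurrence of the term across all documents
    doc_freq = Counter()
    total_freq = Counter()
    for doc in docs:
        terms = doc.lower().split()
        total_freq.update(terms)
        doc_freq.update(set(terms))
    return doc_freq, total_freq

def terms_by_total_freq(docs):
    # highest total term frequency first, ties broken alphabetically
    _, total_freq = doc_and_total_freqs(docs)
    return sorted(total_freq.items(), key=lambda item: (-item[1], item[0]))
```

for the documents `"a a a b"`, `"b c"` and `"b"`, the term `a` has a document count of 1 but a total term frequency of 3, which is the number i am after.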

edward


Re: get a list of terms sorted by total term frequency

2012-11-07 Thread Edward Garrett
i see... using the -t flag

it would be cool if TermsComponent had an option to sort by total term
frequency, something like

terms.sort={count|index|ttf}

surely that's a common enough use case


On Wed, Nov 7, 2012 at 6:17 PM, Michael McCandless wrote:
> Lucene's misc module has HighFreqTerms tool.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Nov 7, 2012 at 1:15 PM, Edward Garrett wrote:
>> hi,
>>
>> is there a simple way to get a list of all terms that occur in a field
>> sorted by their total term frequency within that field?
>>
>> TermsComponent (http://wiki.apache.org/solr/TermsComponent) "provides
>> fast field faceting over the whole index", but as counts it gives the
>> number of documents that each term occurs in (given a field or set of
>> fields). in place of document counts, i want total term frequency
>> counts. the ttf function
>> (http://wiki.apache.org/solr/FunctionQuery#totaltermfreq) provides
>> this, but only if you know what term to pass to the function.
>>
>> edward



-- 
edge


highlighting phrasal hits

2006-12-11 Thread Edward Garrett

hello,

i'm doing phrasal searches, and am not happy with how highlighting is done
by default.

if i search for something, like "w1 w2 w3", then correctly, only fields that
match perfectly will be found. however, when i specify highlighting with
hl=true&hl.fl=myfield, then two things don't work according to (my)
expectations:

1) "w1 w2 w3" is not highlighted as a whole, but rather the pieces are
highlighted. e.g. <em>w1</em> <em>w2</em> <em>w3</em>. really, the whole
thing should be contained within a single <em> element.

2) relatedly, and presumably for the same reason, all instances of "w1",
"w2" and "w3" in myfield are highlighted, even when they don't occur
together.
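
to spell out the difference, a small python sketch using regular expressions (this is not how the solr or lucene highlighter is implemented, just an illustration of the two behaviours):

```python
import re

def highlight_terms(text, phrase, pre="<em>", post="</em>"):
    # term-by-term: every occurrence of any query word gets wrapped
    words = phrase.split()
    pattern = r"\b(" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.sub(pattern, lambda m: pre + m.group(1) + post, text)

def highlight_phrase(text, phrase, pre="<em>", post="</em>"):
    # phrase-aware: only the words occurring together, in order, get wrapped
    words = phrase.split()
    pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
    return re.sub(pattern, lambda m: pre + m.group(0) + post, text)
```

on the text `w1 w2 w3 and later w2 alone`, the first function marks four separate pieces (including the stray `w2`), while the second marks only `w1 w2 w3` as one unit, which is what i would expect for a phrasal search.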

i can't see any possible reason for things working this way, but perhaps
SOLR is just following lucene here.

any thoughts appreciated,
edward

p.s. haven't actually tested the above against indexed english data, so it's
possible that it's an artifact of the data and analysis procedures i am
using.
--
Edward Garrett

Visiting Fellow (2006-07)
Endangered Languages Academic Programme
School of Oriental and African Studies
London, UK
0207 898 4536

Assistant Professor, Linguistics Program
Eastern Michigan University
612 Pray-Harrold Building
Ypsilanti, MI, USA


Re: How to tell the highlighter not to escape?

2007-01-02 Thread Edward Garrett

thorsten,

see the following for discussion. your case is indeed an annoyance--the
thread below discusses motivations for it and ways of working around it. (i
too confess that i wish it were not so.)

http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

-edward

On 1/2/07, Mike Klaas <[EMAIL PROTECTED]> wrote:


Hi Thorsten,

The highlighter does not escape anything itself: you are seeing the
results of solr's automatic escaping of xml data within its xml
response.  This should be transparent (your xml decoder should
un-escape the values on the way out).  I'm not really familiar with
xslt so I'm unsure why that isn't so (perhaps it is automatically
html-escaping the values after un-xml-escaping them?)
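
As a small illustration of that round trip, here is a Python sketch (standard library only; the `<str>` wrapper element is just an example) showing that a value escaped on the way into XML comes back unchanged from an XML parser:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def on_the_wire(value):
    # what an XML writer emits for a text value: &, < and > are escaped
    return "<str>" + escape(value) + "</str>"

def decode(wire):
    # an XML parser undoes the escaping when it hands back the text
    return ET.fromstring(wire).text
```

The decoded value is still a plain string containing the characters `<em>`, not an element, which is why a stylesheet sees text rather than markup.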

Be careful of documents containing html fragments natively.

cheers,
-MIke

On 1/2/07, Thorsten Scherler <[EMAIL PROTECTED]>
wrote:
> Hi all,
>
> I am playing around with the highlighter and found that all highlight
> terms get escaped.
>
> I mean solr will return
>  &lt;em&gt;TERM&lt;/em&gt; and not
>  <em>TERM</em>
>
> I am not sure where this escaping is happening but I would need the
> highlighting to NOT escape the hl.simple.pre and hl.simple.post tag
> since it is a horror to work with cdata sections in xsl.
>
> I had a look in the lucene highlighter and it seems that it does not
> escape the tags.
>
> Can somebody point me to code which is responsible for escaping and
> maybe give me a tip how I can patch to make it configurable.
>
> TIA
>
> salu2
>
>





--
Edward Garrett



Re: How to tell the highlighter not to escape?

2007-01-03 Thread Edward Garrett

for what it's worth, i wrote a recursive template in xsl that replaces the
escaped characters with actual elements. here, the variable $val would be
the tag, e.g. "em". this has been working okay for me so far.
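
the xsl template itself did not survive in this archive copy. as a rough analogue of the idea (not the original template, and iterative rather than recursive), here is a python sketch that turns the pseudo-markup for one tag back into real elements. the wrapper element name `snippet` is an assumption for illustration.

```python
import re
import xml.etree.ElementTree as ET

def snippet_to_element(snippet, tag="em", wrapper="snippet"):
    # snippet is the decoded text value, e.g. 'a <em>TERM</em> b';
    # only the given tag is treated as markup, everything else stays text
    root = ET.Element(wrapper)
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.S)
    last = None  # most recently created child element, if any
    pos = 0
    for match in pattern.finditer(snippet):
        before = snippet[pos:match.start()]
        if last is None:
            root.text = (root.text or "") + before
        else:
            last.tail = (last.tail or "") + before
        last = ET.SubElement(root, tag)
        last.text = match.group(1)
        pos = match.end()
    rest = snippet[pos:]
    if last is None:
        root.text = (root.text or "") + rest
    else:
        last.tail = (last.tail or "") + rest
    return root
```

serializing the result with `ET.tostring(..., encoding="unicode")` gives real `em` elements, while any other special characters in the text are escaped again as usual.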




On 1/3/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:


On Wed, 2007-01-03 at 02:16 +, Edward Garrett wrote:
> thorsten,
>
> see the following for discussion. your case is indeed an annoyance--the
> thread below discusses motivations for it and ways of working around it. (i
> too confess that i wish it were not so.)
>
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

Thanks Edward, the problem with the suggestion in the above thread is
that:
"just create an XSL that
generates XML and unescapes the fields you know will contain wellformed
XML data -- then apply your second transform client side"

is not possible with xsl. See e.g.
http://www.biglist.com/lists/xsl-list/archives/200109/msg00318.html
"> How can I match the Cdata Section?!?
>
You can't, the XPath data model regards CDATA as merely an input shortcut,
not as an information-bearing part of the XML content. In other words,
"<![CDATA[x]]>" and "x" look exactly the same to the XSLT processor.

Mike Kay"

Michael Kay is the xsl guru and I can say as well from my own experience
one would need to write a custom parser since <![CDATA[<em>TERM</em>]]>
is equal to &lt;em&gt;TERM&lt;/em&gt; and this in xsl is a string (XPath
would match text()).

IMO the highlighter should really return pure xml and not escape it.
I will have a look at the XmlResponseWriter; maybe I will find a way to
change this.

salu2


>
> -edward
>
> On 1/2/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> >
> > Hi Thorsten,
> >
> > The highlighter does not escape anything itself: you are seeing the
> > results of solr's automatic escaping of xml data within its xml
> > response.  This should be transparent (your xml decoder should
> > un-escape the values on the way out).  I'm not really familiar with
> > xslt so I'm unsure why that isn't so (perhaps it is automatically
> > html-escaping the values after un-xml-escaping them?)
> >
> > Be careful of documents containing html fragments natively.
> >
> > cheers,
> > -MIke
> >
> > On 1/2/07, Thorsten Scherler <[EMAIL PROTECTED]>
> > wrote:
> > > Hi all,
> > >
> > > I am playing around with the highlighter and found that all highlight
> > > terms get escaped.
> > >
> > > I mean solr will return
> > >  &lt;em&gt;TERM&lt;/em&gt; and not
> > >  <em>TERM</em>
> > >
> > > I am not sure where this escaping is happening but I would need the
> > > highlighting to NOT escape the hl.simple.pre and hl.simple.post tag
> > > since it is a horror to work with cdata sections in xsl.
> > >
> > > I had a look in the lucene highlighter and it seems that it does not
> > > escape the tags.
> > >
> > > Can somebody point me to code which is responsible for escaping and
> > > maybe give me a tip how I can patch to make it configurable.
> > >
> > > TIA
> > >
> > > salu2
> > >
> > >
> >
>
>
>
--
thorsten

"Together we stand, divided we fall!"
Hey you (Pink Floyd)






--
Edward Garrett



Re: How to tell the highlighter not to escape?

2007-01-04 Thread Edward Garrett

just to add a note on this, the whole idea of inserting "pseudo-markup" into
XML text elements seems to be pretty much in disrepute, and certainly caused
many complaints about RSS 1.0, see e.g.

http://www.biglist.com/lists/xsl-list/archives/200505/msg00316.html

in xsl, you **can** use disable-output-escaping="yes" to convert
pseudo-markup to markup, but xslt processors are not required to support
this, and so some do not.

it sure seems to me that if SOLR is returning XML, it might as well return
XML with real markup through and through instead of exploiting
pseudo-markup. if there is concern about introducing validation errors, then
perhaps you could use namespaces in the XML and put the highlighting markup
in a non-SOLR namespace???
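
a quick python sketch of what i mean, built with the standard library. the namespace uri and the `hl` prefix are made up for illustration, and the `str` element with a `name` attribute just mimics the shape of a field in the response.

```python
import xml.etree.ElementTree as ET

HL_NS = "http://example.org/highlight"  # hypothetical namespace uri

def highlighted_field(name, before, hit, after):
    # the hit becomes a real element in its own namespace instead of
    # escaped pseudo-markup inside the text
    ET.register_namespace("hl", HL_NS)
    field = ET.Element("str", {"name": name})
    field.text = before
    mark = ET.SubElement(field, "{%s}em" % HL_NS)
    mark.text = hit
    mark.tail = after
    return ET.tostring(field, encoding="unicode")
```

a stylesheet could then match the `hl:em` elements directly, and a validator that only knows the response vocabulary could be told to ignore anything in the foreign namespace.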