Push ExternalFileField to Solr

2015-05-19 Thread Floyd Wu
Hi, I have two physical servers: one runs my application and the other runs
Solr. I use an external file field to do some search-result ranking.

According to the wiki page, the external file field data needs to reside in
the {solr}\data directory. Since the EFF data is generated by my application,
how can I push this file to Solr? Is there an API, a Solr web service, or any
other mechanism that helps with this?

floyd
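Since there is no built-in upload API for this, the usual workaround is to generate the file in the key=value format ExternalFileField expects and copy it into {solr}/data yourself (e.g. over a shared mount or scp), then issue a commit so Solr re-reads it (as of Solr 4.1, with the ExternalFileFieldReloader listener configured in solrconfig.xml). A minimal sketch in Python; the paths and field name are hypothetical:

```python
import shutil
import urllib.request

def write_eff_file(scores, path):
    """Write scores in ExternalFileField format: one 'key=value' line per doc."""
    with open(path, "w") as f:
        for doc_id, score in sorted(scores.items()):
            f.write(f"{doc_id}={score}\n")

def push_to_solr(local_path, solr_data_dir, field_name, solr_url):
    """Copy the EFF file into Solr's data dir (assumes the dir is reachable,
    e.g. via a shared mount), then commit so Solr reloads external files
    (requires the reload-on-commit listener in solrconfig.xml)."""
    # EFF files must be named external_<fieldName> inside the data directory.
    shutil.copy(local_path, f"{solr_data_dir}/external_{field_name}")
    urllib.request.urlopen(f"{solr_url}/update?commit=true")

# e.g. write_eff_file({"doc1": 1.5, "doc2": 0.3}, "external_rank")
```

If the servers share no filesystem, the copy step would instead be an scp/rsync call; the write step stays the same.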


API support for uploading External File Field files

2015-06-17 Thread Floyd Wu
Is there any API for uploading an ExternalFileField data file to the /data/
directory, or any good practice for this?

My application and the Solr server are physically separated in two places.
The application calculates a score and generates a file for the
ExternalFileField.

Thanks for any input.


how to index 20 MB plain-text xml

2014-03-30 Thread Floyd Wu
I have many plain-text XML files that I transform into Solr's XML update
format, but every time I send them to Solr, I hit an OOM exception.
How can I configure Solr to handle these big XML files?
Please point me in the right direction. Thanks

floyd


Re: how to index 20 MB plain-text xml

2014-03-30 Thread Floyd Wu
Hi Alex,

Thanks for your response. Personally, I don't want to feed these big XML
files to Solr, but the users want it.
I'll try your suggestions later.

Many thanks.

Floyd



2014-03-31 13:44 GMT+08:00 Alexandre Rafalovitch :

> Without digging too deep into why exactly this is happening, here are
> the general options:
>
> 0. Are you actually committing? Check the messages in the logs and see
> if the records show up when you expect them too.
> 1. Are you actually trying to feed 20Mb file to Solr? Maybe it's HTTP
> buffer that's blowing up? Try using stream.file instead (notice
> security warning though): http://wiki.apache.org/solr/ContentStream
> 2. Split file into smaller ones and and commit each separately
> 3. Set hard auto-commit in solrconfig.xml based on number of documents
> to flush in-memory structures to disk
> 4. Switch to using DataImportHandler to pull from XML instead of pushing
> 5. Increase amount of memory to Solr (-X command line flags)
>
> Regards,
>Alex.
>
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
> On Mon, Mar 31, 2014 at 12:00 PM, Floyd Wu  wrote:
> > I have many plain text xml that I transfer to form of solr xml format.
> > But every time I send them to solr, I hit OOM exception.
> > How to configure solr to "eat" these big xml?
> > Please guide me a way. Thanks
> >
> > floyd
>
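Option 2 above (splitting the file into smaller ones) can be sketched roughly like this in Python, assuming the files are standard `<add><doc>…</doc></add>` update messages; the function name and batch size are illustrative:

```python
import xml.etree.ElementTree as ET

def split_update_xml(xml_text, batch_size=100):
    """Split one big <add> update message into several smaller ones,
    each holding at most batch_size <doc> elements."""
    root = ET.fromstring(xml_text)
    docs = root.findall("doc")
    batches = []
    for i in range(0, len(docs), batch_size):
        # Build a fresh <add> wrapper around each slice of docs.
        add = ET.Element("add")
        add.extend(docs[i:i + batch_size])
        batches.append(ET.tostring(add, encoding="unicode"))
    return batches
```

Each returned batch can then be POSTed to /update separately, with a single commit at the end.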


Re: how to index 20 MB plain-text xml

2014-03-31 Thread Floyd Wu
Hi Upayavira,
Users don't hit Solr directly; they search documents through my application.
The application is an entrance for users to upload documents, which are then
indexed by Solr.
The situation is that they upload plain text, something like a dictionary.
You know, a dictionary is quite big.
I'm trying to figure out a good technique before I resort to splitting these
XML files into smaller ones and streaming them to Solr.

Floyd
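One way to split without ever holding the whole 20 MB file in memory is a streaming parse, so memory use stays flat no matter how big the upload is. A hedged sketch (function name illustrative):

```python
import xml.etree.ElementTree as ET

def iter_docs(xml_file):
    """Stream <doc> elements out of a big Solr update file one at a time,
    clearing each element after use so memory stays bounded."""
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "doc":
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()  # free the parsed subtree before moving on
```

The yielded docs can be buffered into small `<add>` batches and posted as they are produced.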



2014-04-01 2:55 GMT+08:00 Upayavira :

> Tell the user they can't have!
>
> Or, write a small app that reads in their XML in one go, and pushes it
> in parts to Solr. Generally, I'd say letting a user hit Solr directly is
> a bad thing - especially a user who doesn't know the details of how Solr
> works.
>
> Upayavira
>
> On Mon, Mar 31, 2014, at 07:17 AM, Floyd Wu wrote:
> > Hi Alex,
> >
> > Thanks for your responding. Personally I don't want to feed these big xml
> > to solr. But users wants.
> > I'll try your suggestions later.
> >
> > Many thanks.
> >
> > Floyd
> >
> >
> >
> > 2014-03-31 13:44 GMT+08:00 Alexandre Rafalovitch :
> >
> > > Without digging too deep into why exactly this is happening, here are
> > > the general options:
> > >
> > > 0. Are you actually committing? Check the messages in the logs and see
> > > if the records show up when you expect them too.
> > > 1. Are you actually trying to feed 20Mb file to Solr? Maybe it's HTTP
> > > buffer that's blowing up? Try using stream.file instead (notice
> > > security warning though): http://wiki.apache.org/solr/ContentStream
> > > 2. Split file into smaller ones and and commit each separately
> > > 3. Set hard auto-commit in solrconfig.xml based on number of documents
> > > to flush in-memory structures to disk
> > > 4. Switch to using DataImportHandler to pull from XML instead of
> pushing
> > > 5. Increase amount of memory to Solr (-X command line flags)
> > >
> > > Regards,
> > >Alex.
> > >
> > > Personal website: http://www.outerthoughts.com/
> > > Current project: http://www.solr-start.com/ - Accelerating your Solr
> > > proficiency
> > >
> > > On Mon, Mar 31, 2014 at 12:00 PM, Floyd Wu  wrote:
> > > > I have many plain text xml that I transfer to form of solr xml
> format.
> > > > But every time I send them to solr, I hit OOM exception.
> > > > How to configure solr to "eat" these big xml?
> > > > Please guide me a way. Thanks
> > > >
> > > > floyd
> > >
>


Re: ranking retrieval measure

2014-03-31 Thread Floyd Wu
Usually, an IR system is measured using precision and recall.
But it depends on what kind of system you are developing and for what scenario.

Take a look
http://en.wikipedia.org/wiki/Precision_and_recall
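As a concrete illustration, precision and recall for a single query can be computed from the sets of retrieved and relevant documents; a minimal sketch (hypothetical helper):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    recall    = |retrieved ∩ relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

To compare two engines, run the same judged query set through both and compare the averaged scores.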



2014-04-01 10:23 GMT+08:00 azhar2007 :

> Hi people. I've developed a search engine and want to improve it, using
> another search engine as a test case. Now I want to compare and test
> results
> from both to determine which is better. I am unaware of how to do this, so
> could someone please point me in the right direction?
>
> Regards
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/ranking-retrieval-measure-tp4128324.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


What is the best approach to send lots of XML Messages to Solr to build index?

2014-06-15 Thread Floyd Wu
Hi,
I have many XML message files formatted like this:
https://wiki.apache.org/solr/UpdateXmlMessages

These files are generated by my index builder daily.
Currently I am sending these files to Solr through HTTP POST, but sometimes
I hit an OOM exception or end up with too many pending tlogs.

Do you have a better way to "import" these files into Solr to build the index?

Thanks for any suggestions.

Floyd


Re: What is the best approach to send lots of XML Messages to Solr to build index?

2014-06-15 Thread Floyd Wu
Thank you Alex.
I'm doing a commit every 100 files.
Maybe there is a better way to do this job, something like DIH (possible?).
Sometimes I have a much bigger XML file (2 MB), and posting it to Solr
(running on Jetty) may be slow or exceed the request size limit.

Floyd



2014-06-15 16:48 GMT+08:00 Alexandre Rafalovitch :

> When are you doing commit? You can issue one manually, have one with
> timeout parameter (commitWithin), or you can configure it to happen
> automatically (in solrconfig.xml).
>
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Sun, Jun 15, 2014 at 3:44 PM, Floyd Wu  wrote:
> > Hi,
> > I have many XML Message file formatted like this
> > https://wiki.apache.org/solr/UpdateXmlMessages
> >
> > These files are generated by my index builder daily.
> > Currently I am sending these file through http post to Solr but
> sometimes I
> > hit OOM exception or pending too many tlog.
> >
> > Do you have better way to "import" these files to Solr to build index?
> >
> > Thanks for the suggestion
> >
> > Floyd
>


Re: What is the best approach to send lots of XML Messages to Solr to build index?

2014-06-15 Thread Floyd Wu
Hi Mikhail,
What are the pros of disabling the tlog?
Each of my XML files contains two docs: one is the main content and the
other is the ACL.
Currently I'm not using SolrCloud, due to my poor understanding of its
architecture and its pros/cons.
The main system is developed in .NET C#, so using SolrJ won't be a
solution.

Floyd



2014-06-15 18:14 GMT+08:00 Mikhail Khludnev :

> Hello Floyd,
>
> Did you consider to disable tlog?
> Does a file consist of many docs?
> Do you have SolrCloud? Do you use just sh/curl or have a java program?
> DIH is not really performant so far. Submitting roughly ten huge files in
> parallel is a way to perform good. Once again, nuke tlog.
>
>
> On Sun, Jun 15, 2014 at 12:44 PM, Floyd Wu  wrote:
>
> > Hi,
> > I have many XML Message file formatted like this
> > https://wiki.apache.org/solr/UpdateXmlMessages
> >
> > These files are generated by my index builder daily.
> > Currently I am sending these file through http post to Solr but
> sometimes I
> > hit OOM exception or pending too many tlog.
> >
> > Do you have better way to "import" these files to Solr to build index?
> >
> > Thanks for the suggestion
> >
> > Floyd
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  
>


Re: What is the best approach to send lots of XML Messages to Solr to build index?

2014-06-15 Thread Floyd Wu
Hi Erick, thanks for your advice. autoCommit is configured at 30 seconds in
my environment.
I'm using C# to develop the main system, with Solr as a service, so using
SolrJ is considered impossible (for now).
I'm seeking a better way to directly import the offline-generated XML to
build the index.
Currently I'm using my own C# code to send these XML files one by one
over HTTP, but performance is poor (posting in parallel hits OOM or
generates lots of tlog files).

Actually, the main question is: what is the best (or a better) way to
rebuild the whole index from scratch?

Floyd
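Mikhail's earlier suggestion of submitting roughly ten files in parallel can be sketched like this; `post_fn` stands in for whatever actually sends one file (an HTTP POST to /update, say) and is a placeholder, not a Solr API:

```python
from concurrent.futures import ThreadPoolExecutor

def post_files_parallel(paths, post_fn, workers=10):
    """Post update files to Solr from several threads at once.
    post_fn is any callable that sends one file and returns a status;
    results come back in the same order as paths."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(post_fn, paths))
```

The same pattern translates directly to C# with `Parallel.ForEach` or `Task.WhenAll`; the point is keeping roughly `workers` uploads in flight rather than one.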





2014-06-15 23:59 GMT+08:00 Erick Erickson :

> A couple of things:
>
> > Consider indexing them with SolrJ, here's a place to get started:
> http://searchhub.org/2012/02/14/indexing-with-solrj/. Especially if you
> use a SAX-based parser you have more control over memory consumption, it's
> on the client after all. And, you can rack together as many clients all
> going to Solr as you need.
>
> > Here's a bunch of information about tlogs and commits that might be
> useful background.
>
> http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> .
> Consider setting your autoCommit interval quite short (15 seconds)
> with openSearcher set to false. That'll truncate your tlog, although
> how that relates to your error is something of a mystery to me...
>
> Best,
> Erick
>
> On Sun, Jun 15, 2014 at 3:14 AM, Mikhail Khludnev
>  wrote:
> > Hello Floyd,
> >
> > Did you consider to disable tlog?
> > Does a file consist of many docs?
> > Do you have SolrCloud? Do you use just sh/curl or have a java program?
> > DIH is not really performant so far. Submitting roughly ten huge files in
> > parallel is a way to perform good. Once again, nuke tlog.
> >
> >
> > On Sun, Jun 15, 2014 at 12:44 PM, Floyd Wu  wrote:
> >
> >> Hi,
> >> I have many XML Message file formatted like this
> >> https://wiki.apache.org/solr/UpdateXmlMessages
> >>
> >> These files are generated by my index builder daily.
> >> Currently I am sending these file through http post to Solr but
> sometimes I
> >> hit OOM exception or pending too many tlog.
> >>
> >> Do you have better way to "import" these files to Solr to build index?
> >>
> >> Thanks for the suggestion
> >>
> >> Floyd
> >>
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> > <http://www.griddynamics.com>
> >  
>
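Erick's suggestion (a short hard autoCommit with openSearcher=false) would look roughly like this in solrconfig.xml; the values are illustrative, not a recommendation for every setup:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>            <!-- hard commit every 15 seconds -->
    <openSearcher>false</openSearcher>  <!-- truncate the tlog without reopening searchers -->
  </autoCommit>
</updateHandler>
```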


Re: What is the best approach to send lots of XML Messages to Solr to build index?

2014-06-15 Thread Floyd Wu
Hi Shawn,
I've tried setting a 4 GB heap for Solr; the OOM exceptions really are
reduced, and performance improved as well.

Floyd



2014-06-16 0:00 GMT+08:00 Shawn Heisey :

> On 6/15/2014 2:54 AM, Floyd Wu wrote:
> > Thank you Alex.
> > I'm doing commit every 100 fiels.
> > Maybe there is a better way to do this job, something like DIH(possible?)
> > Sometimes i have much bigger xml file (2MB) and post to SOLR(jetty
> enabled)
> > may encounter slow or exceed limitation.
>
> If you are getting OOM exceptions on your Solr server, then you need to
> increase your Java heap size.  I have never seen the "pending too many"
> error that you mentioned at the beginning of the thread, and Google
> didn't turn up anything useful, so I don't know what needs to be done
> for that.  If you can post the entire exception stacktrace for this
> error, perhaps we can figure it out.  We would also need the exact Solr
> version.
>
> Solr has a default 2MB limit on POST requests.  This can be increased
> with the formdataUploadLimitInKB parameter on the requestDispatcher tag
> in solrconfig.xml -- assuming that you're running 4.1 or later.
> Previous versions required changing the request size in the servlet
> container config, but there was a bug in the example Jetty included in
> 4.0.0 that made it impossible to change the size.
>
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
> https://issues.apache.org/jira/browse/SOLR-4223
>
> Regarding the advice to disable the updateLog: I never like to do this,
> but if you are sending a very large number of updates in a single
> request, it might be advisable until indexing is complete, so that Solr
> restart times are not excessive.
>
> Thanks,
> Shawn
>
>
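The request-size limit Shawn describes is raised on the requestDispatcher in solrconfig.xml (Solr 4.1+); a sketch with illustrative values:

```xml
<requestDispatcher handleSelect="false">
  <!-- raise the upload limits from the 2 MB default to ~10 MB -->
  <requestParsers enableRemoteStreaming="false"
                  multipartUploadLimitInKB="10240"
                  formdataUploadLimitInKB="10240"/>
</requestDispatcher>
```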


Re: What is the best approach to send lots of XML Messages to Solr to build index?

2014-06-16 Thread Floyd Wu
Hi Mikhail,
Thanks for your suggestions.
Floyd


2014-06-16 17:28 GMT+08:00 Mikhail Khludnev :

> On Mon, Jun 16, 2014 at 6:57 AM, Floyd Wu  wrote:
>
> > Hi Mikhail,
> > What is the pros. to disable tlog?
> >
> It consumes a lot of heap while providing benefits (real-time get, recovery
> of uncommitted docs on failure) that are not necessary in an old-school bulk
> index scenario.
>
>
> > Each of my xml file contained to doc, one is main content and the other
> is
> > acl.
> >
> How can I guess how many of them you have? Once again, submitting a few
> (let's say ten) huge files in parallel lets you utilize the indexing JVM
> fully, and yields the best performance.
>
>
> > Currently I'm not using SolrCloud due to my poor understanding of this
> > architecture and pros/cons.
> > The main system is developed using .Net C# so using SolrJ won't be a
> > solution.
> >
> anyway, if you submit small requests by C# code via REST, make sure that
> http keep-alive is enabled, and you don't waste time for establishing TCP
> connection. I might be wrong but I've thought Lucid guys provide some C#
> client or just its' scratch for Solr. Don't they?
>
>
> >
> > Floyd
> >
> >
> >
> > 2014-06-15 18:14 GMT+08:00 Mikhail Khludnev  >:
> >
> > > Hello Floyd,
> > >
> > > Did you consider to disable tlog?
> > > Does a file consist of many docs?
> > > Do you have SolrCloud? Do you use just sh/curl or have a java program?
> > > DIH is not really performant so far. Submitting roughly ten huge files
> in
> > > parallel is a way to perform good. Once again, nuke tlog.
> > >
> > >
> > > On Sun, Jun 15, 2014 at 12:44 PM, Floyd Wu  wrote:
> > >
> > > > Hi,
> > > > I have many XML Message file formatted like this
> > > > https://wiki.apache.org/solr/UpdateXmlMessages
> > > >
> > > > These files are generated by my index builder daily.
> > > > Currently I am sending these file through http post to Solr but
> > > sometimes I
> > > > hit OOM exception or pending too many tlog.
> > > >
> > > > Do you have better way to "import" these files to Solr to build
> index?
> > > >
> > > > Thanks for the suggestion
> > > >
> > > > Floyd
> > > >
> > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > Principal Engineer,
> > > Grid Dynamics
> > >
> > > <http://www.griddynamics.com>
> > >  
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  
>


Re: [ANN] Heliosearch 0.06 released, native code faceting

2014-06-20 Thread Floyd Wu
Will these awesome features be implemented in Solr soon?
On 2014/6/20 at 10:43 PM, "Yonik Seeley" wrote:

> On Fri, Jun 20, 2014 at 10:15 AM, Yago Riveiro 
> wrote:
> > Yonik,
> >
> > This native code uses in any way the docValues?
>
> Nope... not yet.  It is something I think we should look into in the
> future though.
>
> > In the past I was forced to index a big portion of my data with
> docValues enabled. OOM problems with large term dictionaries and GC were my
> main problem.
> >
> > Other good optimization can be do facet aggregations offsite the heap to
> minimize the GC,
>
> Yeah, the single-valued string faceting in Heliosearch currently does
> this (the "counts" array is also off-heap).
>
> > To ensure that facet aggregations has enough ram we need a large heap,
> in machines with a lot of ram maybe if this aggregation was made offsite
> this allow us reduce the heap size.
>
> Yeah, it's nice not having to worry so much about the correct heap size
> too.
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions,
> sub-facets, off-heap data
>


Re: [ANN] Heliosearch 0.06 released, native code faceting

2014-06-20 Thread Floyd Wu
Hi Yonik, I don't understand the relationship between Solr and Heliosearch,
since you were a committer on Solr.

I'm just curious.
On 2014/6/21 at 12:07 AM, "Yonik Seeley" wrote:

> On Fri, Jun 20, 2014 at 11:16 AM, Floyd Wu  wrote:
> > Will these awesome features being implemented in Solr soon
> >  On 2014/6/20 at 10:43 PM, "Yonik Seeley" wrote:
>
> Given the current makeup of the joint Lucene/Solr PMC, it's unclear.
> I'm not worrying about that for now, and just pushing Heliosearch as
> far and as fast as I can.
> Come join us if you'd like to help!
>
> -Yonik
> http://heliosearch.org - native code faceting, facet functions,
> sub-facets, off-heap data
>


Lots of tlog files remained, why?

2013-11-03 Thread Floyd Wu
After re-indexing 2 XML files and doing commits and optimizes many times, I
still have many tlog files in the data/tlog directory.

Why?

How do I remove those files (delete them directly, or just ignore them)?

What difference does it make whether the tlog files exist or not?

Please kindly guide me.

Thanks

Floyd


Re: Lots of tlog files remained, why?

2013-11-05 Thread Floyd Wu
Hi Erick,

Sorry for the late reply.
The tlog files have stayed there for one week and don't seem to decrease.
Most of them are 3-5 MB, about 40 MB in total.

I've read the article you pointed to many times, but it isn't working for me.
Every time I re-index files, Solr generates many tlogs, and no matter how
many hard commits I do, the tlogs are still there.

I'm using Solr 4.3.2 on a Windows Server 2003 32-bit environment.
Please let me know what other detailed information I should provide.

PS: Should I upgrade Solr to 4.5.1?

Floyd



2013/11/4 Erick Erickson 

> What is your commit strategy? A hard commit
> (openSearcher=true or false doesn't matter)
> should close the current tlog file, open
> a new one and delete old ones. That said, there
> will be enough tlog files kept around to hold at
> least 100 documents. So if you're committing
> too often (say after every document or something),
> you can expect to have a bunch around. The
> real question is whether they stay around forever
> or not. If you index more documents, do old ones
> disappear?
>
> Here's a writerup:
>
> http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> If that doesn't help, what version of Solr? How
> big are you tlog files? Details matter.
>
> Best,
> Erick
>
>
> On Sun, Nov 3, 2013 at 10:03 AM, Floyd Wu  wrote:
>
> > After re-index 2 xml files and done commit, optimization many times,
> I
> > still have many tlog files in data/tlof directory.
> >
> > Why?
> >
> > How to remove those files(delete them directly or just ignored them?)
> >
> > What is the difference if tlog files exist or not?
> >
> > Please kindly guide me.
> >
> > Thanks
> >
> > Floyd
> >
>


DocValues usage and scenarios?

2013-11-20 Thread Floyd Wu
Hi there,

I don't fully understand what kinds of use cases DocValues is meant for.

When I set docValues=true on a field, do I need to change anything in the
XML that I send to Solr for indexing?

Please point me in the right direction.

Thanks

Floyd

PS: I've googled and read lots of DocValues discussions, but I'm still confused.


Re: DocValues usage and scenarios?

2013-11-20 Thread Floyd Wu
Hi Yago,

Thanks for your reply. I once thought that DocValues was a feature for
storing some extra values.

May I summarize that DocValues is a feature that "speeds up" sorting and
faceting?

Floyd



2013/11/20 Yago Riveiro 

> Hi Floyd,
>
> DocValues are useful for sorting and faceting, for example.
>
> You don't need to change anything in your XMLs; the only thing that you
> need to do is set docValues=true in your field definition in the schema.
>
> If you don't want to use the default implementation (everything loaded in
> the heap), you need to add the tag <codecFactory
> class="solr.SchemaCodecFactory"/> in solrconfig.xml and set a
> docValuesFormat on the fieldType definition.
>
> --
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Wednesday, November 20, 2013 at 9:38 AM, Floyd Wu wrote:
>
> > Hi there,
> >
> > I'm not fully understand what kind of usage example that DocValues can be
> > used?
> >
> > When I set field docValues=true, do i need to change anyhting in xml
> that I
> > sent to solr for indexing?
> >
> > Please point me.
> >
> > Thanks
> >
> > Floyd
> >
> > PS: I've googled and read lots of DocValues discussion but confused.
>
>
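In schema.xml terms, Yago's point is that enabling DocValues is only a field-definition change; an illustrative snippet (field name hypothetical):

```xml
<!-- schema.xml: the update XML you send to Solr stays exactly the same -->
<field name="category" type="string" indexed="true" stored="true"
       docValues="true"/>
```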


Re: DocValues usage and scenarios?

2013-11-20 Thread Floyd Wu
Thanks Yago,

I've read this article
http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
But I don't understand it well.
I'll try to figure out the missing parts. Thanks for helping.

Floyd




2013/11/20 Yago Riveiro 

> You should understand DocValues as a feature that allows you to do
> sorting and faceting without blowing up the heap.
>
> They are not necessarily faster than the traditional method; they are more
> memory-efficient, and in huge indexes memory is the main limitation.
>
> This post summarizes the DocValues feature and its main goals:
> http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>
> --
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Wednesday, November 20, 2013 at 10:15 AM, Floyd Wu wrote:
>
> > Hi Yago
> >
> > Thanks for you reply. I once thought that DocValues feature is one for me
> > to store some extra values.
> >
> > May I summarized that DocValues is a feature that "speed up" sorting and
> > faceting?
> >
> > Floyd
> >
> >
> >
> > 2013/11/20 Yago Riveiro  yago.rive...@gmail.com)>
> >
> > > Hi Floyd,
> > >
> > > DocValues are useful for sorting and faceting per example.
> > >
> > > You don't need to change nothing in your xml's, the only thing that you
> > > need to do is set the docValues=true in your field definition in the
> schema.
> > >
> > > If you don't want use the default implementation (all loaded in the
> heap),
> > > you need to add the tag  class="solr.SchemaCodecFactory"/> in
> > > the solrconfig.xml and the docValuesFormat=true on the fieldType
> definition.
> > >
> > > --
> > > Yago Riveiro
> > > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > >
> > >
> > > On Wednesday, November 20, 2013 at 9:38 AM, Floyd Wu wrote:
> > >
> > > > Hi there,
> > > >
> > > > I'm not fully understand what kind of usage example that DocValues
> can be
> > > > used?
> > > >
> > > > When I set field docValues=true, do i need to change anyhting in xml
> > > that I
> > > > sent to solr for indexing?
> > > >
> > > > Please point me.
> > > >
> > > > Thanks
> > > >
> > > > Floyd
> > > >
> > > > PS: I've googled and read lots of DocValues discussion but confused.
>
>


Switch to new leader transparently?

2013-07-10 Thread Floyd Wu
Hi there,

I've built a SolrCloud cluster from the example, but I have a question.
When I send a query to one leader (say
http://xxx.xxx.xxx.xxx:8983/solr/collection1), everything works fine.

When I shut down that leader, the other replica (
http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard becomes the
new leader. The problem is:

The application doesn't know the new leader's location and still sends
requests to http://xxx.xxx.xxx.xxx:8983/solr/collection1, and of course gets
no response.

How can my application learn who the new leader is?
Is there any mechanism so that the application can send requests to one
fixed endpoint, no matter who the leader is?

For example, application just send to
http://xxx.xxx.xxx.xxx:8983/solr/collection1
even the real leader run on http://xxx.xxx.xxx.xxx:9983/solr/collection1

Please help with this, or give me some key information to google.

Many thanks.

Floyd


Re: Switch to new leader transparently?

2013-07-10 Thread Floyd Wu
Hi Anshum,
Thanks for your response.
My application is developed in C#, so I can't use CloudSolrServer from SolrJ.
SolrJ.

My problem is there is a setting in my application

SolrUrl = http://xxx.xxx.xxx.xxx:8983/solr/collection1

When this Solr instance shuts down or crashes, I have to change this setting.
I read the source code of CloudSolrServer.java in SolrJ just a few minutes ago.

It seems that CloudSolrServer first reads the cluster state from ZK (or some
live node) and then uses this info to decide which node to send the
request to.

Maybe I have to modify my application to mimic the CloudSolrServer
implementation.

Any ideas?

Floyd
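The cluster state CloudSolrServer reads is plain JSON, so a C# (or any other) client can fetch /clusterstate.json from a live node (e.g. via Solr's /solr/zookeeper admin endpoint in 4.x) and pick out the leaders itself. A hedged sketch of the parsing step, in Python for brevity; the JSON layout shown is the Solr 4.x one:

```python
import json

def find_leaders(clusterstate_json, collection):
    """Extract the base_url of each shard leader from a clusterstate.json
    document (Solr 4.x layout: collection -> shards -> replicas)."""
    state = json.loads(clusterstate_json)
    leaders = []
    for shard in state[collection]["shards"].values():
        for replica in shard["replicas"].values():
            # Leader replicas carry a "leader": "true" property.
            if replica.get("leader") == "true":
                leaders.append(replica["base_url"])
    return leaders
```

The application can cache the resulting leader list and re-fetch the cluster state whenever a request fails.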




2013/7/10 Anshum Gupta 

> You don't really need to direct any query specifically to a leader. It will
> automatically be routed to the right leader.
> You may put a load balancer on top to just fix the problem with querying a
> node that has gone away.
>
> Also, ZK aware SolrJ Java client that load-balances across all nodes in
> cluster.
>
>
> On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu  wrote:
>
> > Hi there,
> >
> > I've built a SolrCloud cluster from example, but I have some question.
> > When I send query to one leader (say
> > http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem everything
> > will be fine.
> >
> > When I shutdown that leader, the other replica(
> > http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the some shard will be
> > new
> > leader. The problem is:
> >
> > The application doesn't know new leader's location and still send request
> > to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
> response.
> >
> > How can I know new leader in my application?
> > Are there any mechanism that application can send request to one fixed
> > endpoint no matter who is leader?
> >
> > For example, application just send to
> > http://xxx.xxx.xxx.xxx:8983/solr/collection1
> > even the real leader run on http://xxx.xxx.xxx.xxx:9983/solr/collection1
> >
> > Please help on this or give me some key infomation to google it.
> >
> > Many thanks.
> >
> > Floyd
> >
>
>
>
> --
>
> Anshum Gupta
> http://www.anshumgupta.net
>


Re: Switch to new leader transparently?

2013-07-10 Thread Floyd Wu
Hi Furkan,
I'm using C#, so SolrJ won't help with this, but its implementation is a good
reference for me. Thanks for your help.

By the way, how can I fetch the cluster state from ZK directly over plain
HTTP or a TCP socket?
In my SolrCloud cluster, I'm using a standalone ZK to coordinate.

Floyd




2013/7/10 Furkan KAMACI 

> You can define a CloudSolrServer as like that:
>
> *private static CloudSolrServer solrServer;*
>
> and then define the address of your ZooKeeper host:
>
> *private static String zkHost = "localhost:9983";*
>
> initialize your variable:
>
> *solrServer = new CloudSolrServer(zkHost);*
>
> You can get leader list as like:
>
> *ClusterState clusterState =
> cloudSolrServer.getZkStateReader().getClusterState();
> List<Replica> leaderList = new ArrayList<>();
>   for (Slice slice : clusterState.getSlices(collectionName)) {
>   leaderList.add(slice.getLeader());
> }*
>
>
> For querying you can try that:
> *
> *
> *SolrQuery solrQuery = new SolrQuery();*
> *//fill your **solrQuery variable here**
> *
> *QueryRequest queryRequest = new QueryRequest(solrQuery,
> SolrRequest.METHOD.POST);
> queryRequest.process(**solrServer**);*
>
> CloudSolrServer uses LBHttpSolrServer by default. Its definition is like
> this: *LBHttpSolrServer or "Load Balanced HttpSolrServer" is just a wrapper
> to CommonsHttpSolrServer. This is useful when you have multiple SolrServers
> and query requests need to be Load Balanced among them. It offers automatic
> failover when a server goes down and it detects when the server comes back
> up.*
> *
> *
> *
> *
>
> 2013/7/10 Anshum Gupta 
>
> > You don't really need to direct any query specifically to a leader. It
> will
> > automatically be routed to the right leader.
> > You may put a load balancer on top to just fix the problem with querying
> a
> > node that has gone away.
> >
> > Also, ZK aware SolrJ Java client that load-balances across all nodes in
> > cluster.
> >
> >
> > On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu  wrote:
> >
> > > Hi there,
> > >
> > > I've built a SolrCloud cluster from example, but I have some question.
> > > When I send query to one leader (say
> > > http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem
> everything
> > > will be fine.
> > >
> > > When I shutdown that leader, the other replica(
> > > http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the some shard will
> be
> > > new
> > > leader. The problem is:
> > >
> > > The application doesn't know new leader's location and still send
> request
> > > to http://xxx.xxx.xxx.xxx:8983/solr/collection1 and of course no
> > response.
> > >
> > > How can I know new leader in my application?
> > > Are there any mechanism that application can send request to one fixed
> > > endpoint no matter who is leader?
> > >
> > > For example, application just send to
> > > http://xxx.xxx.xxx.xxx:8983/solr/collection1
> > > even the real leader run on
> http://xxx.xxx.xxx.xxx:9983/solr/collection1
> > >
> > > Please help on this or give me some key infomation to google it.
> > >
> > > Many thanks.
> > >
> > > Floyd
> > >
> >
> >
> >
> > --
> >
> > Anshum Gupta
> > http://www.anshumgupta.net
> >
>


Re: Switch to new leader transparently?

2013-07-10 Thread Floyd Wu
Thanks Aloke, I will do some research.
2013/7/10 下午9:45 於 "Aloke Ghoshal"  寫道:

> Hi Floyd,
>
> We use SolrNet to connect to Solr from a C# application. Since SolrNet is
> not aware of SolrCloud or ZK, we use an HTTP load balancer in front of
> the Solr nodes & query via the load balancer url. You could use something
> like HAProxy or Apache reverse proxy for load balancing.
>
> On the other hand in order to write a ZK aware client in C# you could start
> here: https://github.com/ewhauser/zookeeper/tree/trunk/src/dotnet
>
> Regards,
> Aloke
>
>
> On Wed, Jul 10, 2013 at 4:11 PM, Furkan KAMACI  >wrote:
>
> > By the way, this is not related to your question, but this may help you
> > with connecting to Solr via C#: http://solrsharp.codeplex.com/
> >
> > 2013/7/10 Floyd Wu 
> >
> > > Hi Furkan
> > > I'm using C#,  SolrJ won't help on this, but its impl is a good
> reference
> > > for me. Thanks for your help.
> > >
> > > by the way, how to fetch/get cluster state from zk directly in plain
> http
> > > or tcp socket?
> > > In my SolrCloud cluster, I'm using standalone zk to coordinate.
> > >
> > > Floyd
> > >
> > >
> > >
> > >
> > > 2013/7/10 Furkan KAMACI 
> > >
> > > > You can define a CloudSolrServer as like that:
> > > >
> > > > *private static CloudSolrServer solrServer;*
> > > >
> > > > and then define the addres of your zookeeper host:
> > > >
> > > > *private static String zkHost = "localhost:9983";*
> > > >
> > > > initialize your variable:
> > > >
> > > > *solrServer = new CloudSolrServer(zkHost);*
> > > >
> > > > You can get leader list as like:
> > > >
> > > > *ClusterState clusterState =
> > > > cloudSolrServer.getZkStateReader().getClusterState();
> > > > List leaderList = new ArrayList<>();
> > > >   for (Slice slice : clusterState.getSlices(collectionName)) {
> > > >   leaderList.add(slice.getLeader()); /
> > > > }*
> > > >
> > > >
> > > > For querying you can try that:
> > > > *
> > > > *
> > > > *SolrQuery solrQuery = new SolrQuery();*
> > > > *//fill your **solrQuery variable here**
> > > > *
> > > > *QueryRequest queryRequest = new QueryRequest(solrQuery,
> > > > SolrRequest.METHOD.POST);
> > > > queryRequest.process(**solrServer**);*
> > > >
> > > > CloudSolrServer uses LBHttpSolrServer by default. Its definition is like
> > > > that: *LBHttpSolrServer or "Load Balanced HttpSolrServer" is just a
> > > > wrapper to CommonsHttpSolrServer. This is useful when you have multiple
> > > > SolrServers and query requests need to be Load Balanced among them. It
> > > > offers automatic failover when a server goes down and it detects when
> > > > the server comes back up.*
> > > > *
> > > > *
> > > > *
> > > > *
> > > >
> > > > 2013/7/10 Anshum Gupta 
> > > >
> > > > > You don't really need to direct any query specifically to a leader. It
> > > > > will automatically be routed to the right leader.
> > > > > You may put a load balancer on top to just fix the problem with
> > > > > querying a node that has gone away.
> > > > >
> > > > > Also, there is a ZK-aware SolrJ Java client that load-balances across
> > > > > all nodes in the cluster.
> > > > >
> > > > >
> > > > > On Wed, Jul 10, 2013 at 2:52 PM, Floyd Wu 
> > wrote:
> > > > >
> > > > > > Hi there,
> > > > > >
> > > > > > I've built a SolrCloud cluster from example, but I have some
> > > question.
> > > > > > When I send query to one leader (say
> > > > > > http://xxx.xxx.xxx.xxx:8983/solr/collection1) and no problem
> > > > everything
> > > > > > will be fine.
> > > > > >
> > > > > > When I shutdown that leader, the other replica
> > > > > > (http://xxx.xxx.xxx.xxx:9983/solr/collection1) in the same shard
> > > > > > will be the new leader. The problem is:
> > > > > >
> > > > > > The application doesn't know the new leader's location and still
> > > > > > sends requests to http://xxx.xxx.xxx.xxx:8983/solr/collection1, and
> > > > > > of course gets no response.
> > > > > >
> > > > > > How can I know new leader in my application?
> > > > > > Is there any mechanism so that the application can send requests to
> > > > > > one fixed endpoint, no matter who the leader is?
> > > > > >
> > > > > > For example, application just send to
> > > > > > http://xxx.xxx.xxx.xxx:8983/solr/collection1
> > > > > > even if the real leader runs on
> > > > > > http://xxx.xxx.xxx.xxx:9983/solr/collection1
> > > > > >
> > > > > > Please help on this or give me some key infomation to google it.
> > > > > >
> > > > > > Many thanks.
> > > > > >
> > > > > > Floyd
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Anshum Gupta
> > > > > http://www.anshumgupta.net
> > > > >
> > > >
> > >
> >
>
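A minimal sketch of the load-balancer approach Aloke describes above: put a plain HTTP balancer in front of the Solr nodes so a non-ZooKeeper-aware client (such as SolrNet) always talks to one fixed endpoint, even after a leader change. The hostnames, port, and ping path below are assumptions, not taken from this thread:

```
frontend solr_front
    bind *:8080
    default_backend solr_nodes

backend solr_nodes
    balance roundrobin
    # Health-check each node so a stopped leader drops out of rotation
    option httpchk GET /solr/collection1/admin/ping
    server solr1 xxx.xxx.xxx.xxx:8983 check
    server solr2 xxx.xxx.xxx.xxx:9983 check
```

The application then sends every request to the balancer's address and never needs to know which node is the current leader.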


How to avoid underscore sign indexing problem?

2013-08-21 Thread Floyd Wu
When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get

ST
text: pacific_rim  raw_bytes: [70 61 63 69 66 69 63 5f 72 69 6d]  start: 0  end: 11  type: <ALPHANUM>  position: 1

How can this string be tokenized into the two tokens "Pacific" and
"Rim"? By setting _ as a stopword?
Please kindly help on this.
Many thanks.

Floyd


Re: How to avoid underscore sign indexing problem?

2013-08-21 Thread Floyd Wu
Thank you all.
By the way, Jack, I'm gonna buy your book. Where can I buy it?
Floyd


2013/8/22 Jack Krupansky 

> "I thought that the StandardTokenizer always split on punctuation, "
>
> Proving that you haven't read my book! The section on the standard
> tokenizer details the rules that the tokenizer uses (in addition to
> extensive examples.) That's what I mean by "deep dive."
>
> -- Jack Krupansky
>
> -Original Message- From: Shawn Heisey
> Sent: Wednesday, August 21, 2013 10:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to avoid underscore sign indexing problem?
>
>
> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>
>> When using StandardAnalyzer to tokenize string "Pacific_Rim" will get
>>
>> ST
>> textraw_**bytesstartendtypeposition
>> pacific_rim[70 61 63 69 66 69 63 5f 72 69 6d]0111
>>
>> How to make this string to be tokenized to these two tokens "Pacific",
>> "Rim"?
>> Set _ as stopword?
>> Please kindly help on this.
>> Many thanks.
>>
>
> Interesting.  I thought that the StandardTokenizer always split on
> punctuation, but apparently that's not the case for the underscore
> character.
>
> You can always use the WordDelimeterFilter after the StandardTokenizer.
>
> http://wiki.apache.org/solr/**AnalyzersTokenizersTokenFilter**s#solr.**
> WordDelimiterFilterFactory<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>
>
> Thanks,
> Shawn
>


Re: How to avoid underscore sign indexing problem?

2013-08-21 Thread Floyd Wu
After trying some search cases and different parameter combinations of
WordDelimiterFilter, I wonder what the best strategy is to index the string
"2DA012_ISO MARK 2" so that it can be searched by the term "2DA012".

What if I just want _ to be removed at both query and index time? What and
how do I configure that?

Floyd



2013/8/22 Floyd Wu 

> Thank you all.
> By the way, Jack I gonna by your book. Where to buy?
> Floyd
>
>
> 2013/8/22 Jack Krupansky 
>
>> "I thought that the StandardTokenizer always split on punctuation, "
>>
>> Proving that you haven't read my book! The section on the standard
>> tokenizer details the rules that the tokenizer uses (in addition to
>> extensive examples.) That's what I mean by "deep dive."
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Shawn Heisey
>> Sent: Wednesday, August 21, 2013 10:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to avoid underscore sign indexing problem?
>>
>>
>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>>
>>> When using StandardAnalyzer to tokenize string "Pacific_Rim" will get
>>>
>>> ST
>>> textraw_**bytesstartendtypeposition
>>> pacific_rim[70 61 63 69 66 69 63 5f 72 69 6d]0111
>>>
>>> How to make this string to be tokenized to these two tokens "Pacific",
>>> "Rim"?
>>> Set _ as stopword?
>>> Please kindly help on this.
>>> Many thanks.
>>>
>>
>> Interesting.  I thought that the StandardTokenizer always split on
>> punctuation, but apparently that's not the case for the underscore
>> character.
>>
>> You can always use the WordDelimeterFilter after the StandardTokenizer.
>>
>> http://wiki.apache.org/solr/**AnalyzersTokenizersTokenFilter**s#solr.**
>> WordDelimiterFilterFactory<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>
>>
>> Thanks,
>> Shawn
>>
>
>


Re: How to avoid underscore sign indexing problem?

2013-08-22 Thread Floyd Wu
Alright, thanks for all your help. I finally fixed this problem using
PatternReplaceFilterFactory + WordDelimiterFilterFactory.

I first replace _ (underscore) using PatternReplaceFilterFactory and then
use WordDelimiterFilterFactory to generate the word and number parts, to
increase the users' search hit rate. Although this decreases search precision
a little, our users need a higher recall rate than precision.

Thank you all.

Floyd
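The chain described above can be sketched as a field type roughly like this (the type name and the exact WordDelimiter flags are assumptions; tune them to your own recall/precision needs):

```xml
<fieldType name="text_undsplit" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- turn "2DA012_ISO" into "2DA012 ISO" so it can be split below -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="_" replacement=" "/>
    <!-- split on the injected space but keep "2DA012" whole:
         don't split on numerics or case changes -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="0" splitOnCaseChange="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Using the same analyzer at both index and query time keeps the underscore handling consistent on both sides.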





2013/8/22 Floyd Wu 

> After trying some search case and different params combination of
> WordDelimeter. I wonder what is the best strategy to index string
> "2DA012_ISO MARK 2" and can be search by term "2DA012"?
>
> What if I just want _ to be removed both query/index time, what and how to
> configure?
>
> Floyd
>
>
>
> 2013/8/22 Floyd Wu 
>
>> Thank you all.
>> By the way, Jack I gonna by your book. Where to buy?
>> Floyd
>>
>>
>> 2013/8/22 Jack Krupansky 
>>
>>> "I thought that the StandardTokenizer always split on punctuation, "
>>>
>>> Proving that you haven't read my book! The section on the standard
>>> tokenizer details the rules that the tokenizer uses (in addition to
>>> extensive examples.) That's what I mean by "deep dive."
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Shawn Heisey
>>> Sent: Wednesday, August 21, 2013 10:41 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: How to avoid underscore sign indexing problem?
>>>
>>>
>>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>>>
>>>> When using StandardAnalyzer to tokenize string "Pacific_Rim" will get
>>>>
>>>> ST
>>>> textraw_**bytesstartendtypeposition
>>>> pacific_rim[70 61 63 69 66 69 63 5f 72 69 6d]0111
>>>>
>>>> How to make this string to be tokenized to these two tokens "Pacific",
>>>> "Rim"?
>>>> Set _ as stopword?
>>>> Please kindly help on this.
>>>> Many thanks.
>>>>
>>>
>>> Interesting.  I thought that the StandardTokenizer always split on
>>> punctuation, but apparently that's not the case for the underscore
>>> character.
>>>
>>> You can always use the WordDelimeterFilter after the StandardTokenizer.
>>>
>>> http://wiki.apache.org/solr/**AnalyzersTokenizersTokenFilter**s#solr.**
>>> WordDelimiterFilterFactory<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>
>>>
>>> Thanks,
>>> Shawn
>>>
>>
>>
>


Re: Very slow query when boosting involves an ExternalFileField

2013-03-21 Thread Floyd Wu
Anybody can point me a direction?
Many thanks.



2013/3/20 Floyd Wu 

> Hi everyone,
>
> I have a problem and have no luck to figure out.
>
> When I issue a query to
> Query 1
>
> http://localhost:8983/solr/select?q={!boost+b=recip(ms(NOW/HOUR,last_modified_datetime),3.16e-11,1,1)}all<http://localhost:8983/solr/select?q=%7B!boost+b=recip(ms(NOW/HOUR,last_modified_datetime),3.16e-11,1,1)%7Dall>
> :"java"&start=0&rows=10&fl=score,author&sort=score+desc
>
> Query 2
>
> http://localhost:8983/solr/select?q={!boost+b=sum(ranking,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)}all<http://localhost:8983/solr/select?q=%7B!boost+b=sum(ranking,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)%7Dall>
> :"java"&start=0&rows=10&fl=score,author&sort=score+desc
>
> The difference between the two queries is the "boost" function.
> The boost function of Query 2 using a field named ranking and this field
> is ExternalFileField.
> External file is key=value pair about 1 lines.
>
> Execution time
> Query 1-->100ms
> Query 2-->2300ms
>
> I tried to issue Query 3 and change ranking to a constant "1"
>
> http://localhost:8983/solr/select?q={!boost+b=sum(1,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)}all<http://localhost:8983/solr/select?q=%7B!boost+b=sum(1,recip(ms(NOW/HOUR,last_modified_datetime)),3.16e-11,1,1)%7Dall>
> :"java"&start=0&rows=10&fl=score,author&sort=score+desc
>
> Execution time
> Query 3-->110ms
>
> One thing I am sure of: involving the ExternalFileField slows down the
> query execution time significantly. But I have no idea how to solve this
> problem, as my boost function must use the value of the ranking field.
>
> Please help on this.
>
> PS: I'm using SOLR-4.1
>
> Floyd
>
>
>
>
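For context, an ExternalFileField is declared roughly like this (the names here are illustrative). The external file is parsed into an in-memory cache before it can be used; in Solr 4.1+ a searcher-event listener can reload that cache ahead of user queries, which may hide part of the cost seen above:

```xml
<!-- schema.xml -->
<fieldType name="rankingFile" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="ranking" type="rankingFile" indexed="false" stored="false"/>

<!-- solrconfig.xml: reload the file cache when a searcher opens -->
<listener event="newSearcher"
          class="org.apache.solr.schema.ExternalFileFieldReloader"/>
<listener event="firstSearcher"
          class="org.apache.solr.schema.ExternalFileFieldReloader"/>
```

With the reloader in place the parse happens at commit/warm time instead of on the first query that references the field.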


Re: Slow Highlighter Performance Even Using FastVectorHighlighter

2013-06-17 Thread Floyd Wu
Hi Michael, how do I configure the PostingsHighlighter with my Solr 4.2 box?
Please kindly point me to it. Many thanks.
On 2013/6/15 at 10:48 PM, "Michael McCandless" wrote:

> You could also try the new[ish] PostingsHighlighter:
>
> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sat, Jun 15, 2013 at 8:50 AM, Michael Sokolov
>  wrote:
> > If you have very large documents (many MB) that can lead to slow
> > highlighting, even with FVH.
> >
> > See https://issues.apache.org/jira/browse/LUCENE-3234
> >
> > and try setting phraseLimit=1 (or some bigger number, but not infinite,
> > which is the default)
> >
> > -Mike
> >
> >
> >
> > On 6/14/13 4:52 PM, Andy Brown wrote:
> >>
> >> Bryan,
> >>
> >> For specifics, I'll refer you back to my original email where I
> >> specified all the fields/field types/handlers I use. Here's a general
> >> overview.
> >>   I really only have 3 fields that I index and search against: "name",
> >> "description", and "content". All of which are just general text
> >> (string) fields. I have a catch-all field called "text" that is only
> >> used for querying. It's indexed but not stored. The "name",
> >> "description", and "content" fields are copied into the "text" field.
> >>   For partial word matching, I have 4 more fields: "name_par",
> >> "description_par", "content_par", and "text_par". The "text_par" field
> >> has the same relationship to the "*_par" fields as "text" does to the
> >> others (only used for querying). Those partial word matching fields are
> >> of type "text_general_partial" which I created. That field type is
> >> analyzed different than the regular text field in that it goes through
> >> an EdgeNGramFilterFactory with the minGramSize="2" and maxGramSize="7"
> >> at index time.
> >> I query against both "text" and "text_par" fields using edismax deftype
> >> with my qf set to "text^2 text_par^1" to give full word matches a higher
> >> score. This part returns back very fast as previously stated. It's when
> >> I turn on highlighting that I take the huge performance hit.
> >>   Again, I'm using the FastVectorHighlighting. The hl.fl is set to "name
> >> name_par description description_par content content_par" so that it
> >> returns highlights for full and partial word matches. All of those
> >> fields have indexed, stored, termPositions, termVectors, and termOffsets
> >> set to "true".
> >>   It all seems redundant just to allow for partial word
> >> matching/highlighting but I didn't know of a better way. Does anything
> >> stand out to you that could be the culprit? Let me know if you need any
> >> more clarification.
> >>   Thanks!
> >>   - Andy
> >>
> >> -Original Message-
> >> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> >> Sent: Wednesday, May 29, 2013 5:44 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: RE: Slow Highlighter Performance Even Using
> >> FastVectorHighlighter
> >>
> >> Andy,
> >>
> >>> I don't understand why it's taking 7 secs to return highlights. The size
> >>> of the index is only 20.93 MB. The JVM heap Xms and Xmx are both set to
> >>> 1024 for this verification purpose and that should be more than enough.
> >>> The processor is plenty powerful enough as well.
> >>>
> >>> Running VisualVM shows all my CPU time being taken by mainly these 3
> >>> methods:
> >>>
> >>>
> >> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseI
> >>>
> >>> nfo.getStartOffset()
> >>>
> >> org.apache.lucene.search.vectorhighlight.FieldPhraseList$WeightedPhraseI
> >>>
> >>> nfo.getStartOffset()
> >>>
> >> org.apache.lucene.search.vectorhighlight.FieldPhraseList.addIfNoOverlap(
> >>>
> >>> )
> >>
> >> That is a strange and interesting set of things to be spending most of
> >> your CPU time on. The implication, I think, is that the number of term
> >> matches in the document for terms in your query (or, at least, terms
> >> matching exact words or the beginning of phrases in your query) is
> >> extremely high. Perhaps that's coming from this "partial word match" you
> >> mention -- how does that work?
> >>
> >> -- Bryan
> >>
> >>> My guess is that this has something to do with how I'm handling partial
> >>> word matches/highlighting. I have setup another request handler that
> >>> only searches the whole word fields and it returns in 850 ms with
> >>> highlighting.
> >>>
> >>> Any ideas?
> >>>
> >>> - Andy
> >>>
> >>>
> >>> -Original Message-
> >>> From: Bryan Loofbourrow [mailto:bloofbour...@knowledgemosaic.com]
> >>> Sent: Monday, May 20, 2013 1:39 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: RE: Slow Highlighter Performance Even Using
> >>> FastVectorHighlighter
> >>>
> >>> My guess is that the problem is those 200M documents.
> >>> FastVectorHighlighter is fast at deciding whether a match, especially a
> >>> phrase, appear
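Mike's phraseLimit suggestion above would look roughly like this as request-handler defaults (the handler name and values are illustrative; hl.phraseLimit caps how many candidate phrases the FastVectorHighlighter extracts per field):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.useFastVectorHighlighter">true</str>
    <!-- stop scanning after this many candidate phrases per field -->
    <int name="hl.phraseLimit">1</int>
  </lst>
</requestHandler>
```

The same parameter can also be sent per request as &hl.phraseLimit=1.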

PostingsSolrHighlighter not working on Multivalue field

2013-06-18 Thread Floyd Wu
In my test case, it seems this new highlighter is not working.

When a field is set multiValued=true, the stored text in that field cannot be
highlighted.

Am I missing something? Or is this a current limitation? I have had no luck
finding any documentation mentioning this.

Floyd
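For anyone following along, wiring up the PostingsSolrHighlighter in Solr 4.2 takes roughly this (a sketch following the stock configuration; field names are illustrative). Note that the offsets come from the postings, not from term vectors:

```xml
<!-- solrconfig.xml: swap the highlighting implementation -->
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>

<!-- schema.xml: highlighted fields must index offsets into the postings -->
<field name="summary" type="text_general" indexed="true" stored="true"
       storeOffsetsWithPositions="true"/>
```

Fields indexed before storeOffsetsWithPositions="true" was added need to be reindexed for this highlighter to see the offsets.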


Re: PostingsSolrHighlighter not working on Multivalue field

2013-06-19 Thread Floyd Wu
Hi Erick,

"multivalue" is my typo, thanks for your reminding.

There is no log show anything wrong or exception occurred.

The field definition as following





The PostingSolrHighlighter only do highlight on summary field.

When I send an XML file to Solr like this

<add>
  <doc>
    <field name="summary">facebook yahoo plurk twitter social nextworing</field>
    <field name="body_0">facebook yahoo plurk twitter social nextworing</field>
  </doc>
</add>

As you can see the body_0 will be treated using dynamicField definition.

Part of the debug response return of Solr like this


  

  Facebook... Facebook





I'm sure hl.fl contains both summary and body_0.
This behavior differs between the PostingsSolrHighlighter and the
FastVectorHighlighter.

Please kindly help on this.
Many thanks.

Floyd



2013/6/19 Erick Erickson 

> Well, _how_ does it fail? unless it's a type it should be
> multiValued (not capital 'V'). This probably isn't the
> problem, but just in case.
>
> Anything in the logs? What is the field definition?
> Did you re-index after changing to multiValued?
>
> Best
> Erick
>
> On Tue, Jun 18, 2013 at 11:01 PM, Floyd Wu  wrote:
> > In my test case, it seems this new highlighter not working.
> >
> > When field set multivalue=true, the stored text in this field can not be
> > highlighted.
> >
> > Am I miss something? Or this is current limitation? I have no luck to
> find
> > any documentations mentioned this.
> >
> > Floyd
>


Re: PostingsSolrHighlighter not working on Multivalue field

2013-06-23 Thread Floyd Wu
Does anyone have an idea that can help on this?


2013/6/22 Erick Erickson 

> Unfortunately, from here I need to leave it to people who know
> the highlighting code
>
> Erick
>
> On Wed, Jun 19, 2013 at 8:40 PM, Floyd Wu  wrote:
> > Hi Erick,
> >
> > "multivalue" is my typo, thanks for your reminding.
> >
> > There is no log show anything wrong or exception occurred.
> >
> > The field definition as following
> >
> >  > omitNorms="false" termVectors="true" termPositions="true"
> > termOffsets="true" storeOffsetsWithPositions="true"/>
> >
> >  > multiValued="true" termVectors="true" termPositions="true"
> > termOffsets="true" omitNorms="false" storeOffsetsWithPositions="true"/>
> >
> > The PostingSolrHighlighter only do highlight on summary field.
> >
> > When I send a xml file to solr like this
> >
> > 
> > 
> >   
> > 
> >   facebook yahoo plurk twitter social
> > nextworing
> >   facebook yahoo plurk twitter social
> > nextworing
> > 
> >   
> > 
> >
> > As you can see the body_0 will be treated using dynamicField definition.
> >
> > Part of the debug response return of Solr like this
> >
> > 
> >   
> > 
> >   Facebook... Facebook
> > 
> > 
> > 
> > 
> >
> > I'm sure hl.fl contains both summary and body_0.
> > This behavior is different between PostingSolrHighlighter and
> > FastVectorhighlighter.
> >
> > Please kindly help on this.
> > Many thanks.
> >
> > Floyd
> >
> >
> >
> > 2013/6/19 Erick Erickson 
> >
> >> Well, _how_ does it fail? unless it's a type it should be
> >> multiValued (not capital 'V'). This probably isn't the
> >> problem, but just in case.
> >>
> >> Anything in the logs? What is the field definition?
> >> Did you re-index after changing to multiValued?
> >>
> >> Best
> >> Erick
> >>
> >> On Tue, Jun 18, 2013 at 11:01 PM, Floyd Wu  wrote:
> >> > In my test case, it seems this new highlighter not working.
> >> >
> >> > When field set multivalue=true, the stored text in this field can not
> be
> >> > highlighted.
> >> >
> >> > Am I miss something? Or this is current limitation? I have no luck to
> >> find
> >> > any documentations mentioned this.
> >> >
> >> > Floyd
> >>
>


Does anybody has experience in Chinese soundex(sounds like) of SOLR?

2011-10-20 Thread Floyd Wu
Hi  there,

There are many English soundex implementations that can be referenced, but I
wonder how to do a Chinese soundex (sounds-like) filter.

any idea?

Floyd


Re: Does anybody has experience in Chinese soundex(sounds like) of SOLR?

2011-10-20 Thread Floyd Wu
Hi Ken,

Indeed, I want to support functionality like phonetic (pinyin or zhuyin)
search, not soundex (sorry, and thanks for correcting me).

any further idea?

Floyd


2011/10/20 Ken Krugler :
>> Wow, interesting question.  Can soundex even be applied to a language like 
>> Chinese, which is tonal and doesn't have individual letters, but whole 
>> characters?  I'm no expert, but intuitively speaking it sounds hard or maybe 
>> even impossible...
>
> The only two cases I can think of are:
>
>  - Cases where you have two (or more) characters that are variant forms. 
> Unicode tried to unify all of these, but some still exist. And in GB 18030 
> there are tons.
>
>  - If you wanted to support phonetic (pinyin or zhuyin) search, then you 
> might want to collapse syllables that are commonly confused. But then of 
> course you'd have to be storing the phonetic forms for all of the words.
>
> -- Ken
>
>
>>> From: Floyd Wu 
>>> To: solr-user@lucene.apache.org
>>> Sent: Thursday, October 20, 2011 5:43 AM
>>> Subject: Does anybody has experience in Chinese soundex(sounds like) of 
>>> SOLR?
>>>
>>> Hi  there,
>>>
>>> There are many English soundex implementation can be referenced, but I
>>> wonder how to do Chinese soundex(sounds like) filter (maybe).
>>>
>>> any idea?
>>>
>>> Floyd
>>>
>>>
>>>
>
> --
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>


Want to support "did you mean xxx" but is Chinese

2011-10-20 Thread Floyd Wu
Does anybody know how to implement this idea in SOLR? Please kindly
point me in a direction.

For example, a user enters a keyword in Chinese, "貝多芬" (this is
Beethoven in Chinese),
but keys in a wrong combination of characters, "背多分" (this has the same
pronunciation as the previous keyword "貝多芬").

The token "貝多芬" actually exists in the Solr index. How can documents
where "貝多芬" exists be hit when "背多分" is entered?

This is a basic function of commercial search engines, especially in
Chinese processing. I wonder how to implement it in SOLR and where the
starting point is.

Floyd


Re: Want to support "did you mean xxx" but is Chinese

2011-10-23 Thread Floyd Wu
Hi Li Li,

Thanks for your detailed explanation. Basically I have a similar
implementation to yours. I just want to know if there is a better,
more complete solution. I'll keep trying and will share any improvements
with you and the community.

Any ideas or advice are welcome.

Floyd



2011/10/21 Li Li :
>we have implemented one supporting "did you mean" and preffix suggestion
> for Chinese. But we base our working on solr 1.4 and we did many
> modifications so it will cost time to integrate it to current solr/lucene.
>
> Here are our solution. glad to see any advices.
>
> 1. offline words and phrases discovery.
>   we discovery new words and new phrases by mining query logs
>
> 2. online matching algorithm
>   for each word, e.g., 贝多芬
>   we convert it to pinyin bei duo fen, then we indexing it using
> n-gram, which means gram3:bei gram3:eid ...
>   to get "did you mean" result, we convert query 背朵分 into n-gram,
> it's a boolean or query, so there are many results( the words' pinyin
> similar to query will be ranked top)
>  Then we reranks top 500 results by fine-grained algorithm
>  we use edit distance to align query and result, we also take
> character into consideration. e.g query 十度,matches are 十渡 and 是度,their
> pinyins are exactly the same the 十渡 is better than 是度 because 十 occured in
> both query and match
>  also you need consider the hotness(popular degree) of different
> words/phrases. which can be known from query logs
>
>  Another question is to convert Chinese into pinyin. because some
> character has more than one pinyin.
> e.g. 长沙 长大 长's pinyin is chang in 长沙,you should segment query and
> words/phrases first. word segmentation is a basic problem is Chinese IR
>
>
> 2011/10/21 Floyd Wu 
>
>> Does anybody know how to implement this idea in SOLR. Please kindly
>> point me a direction.
>>
> >> For example, when user enter a keyword in Chinese "貝多芬" (this is
> >> Beethoven in Chinese)
> >> but key in a wrong combination of characters  "背多分" (this is
> >> pronouncation the same with previous keyword "貝多芬").
> >>
> >> There in solr index exist token "貝多芬" actually. How to hit documents
> >> where "貝多芬" exist when "背多分" is enter.
>>
>> This is basic function of commercial search engine especially in
>> Chinese processing. I wonder how to implements in SOLR and where is
>> the start point.
>>
>> Floyd
>>
>
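Li Li's two-stage approach above (coarse n-gram retrieval over pinyin, then fine-grained edit-distance reranking) can be sketched in plain Java. Everything here is a simplified illustration: the class and method names are hypothetical, and a real system would add tone/character features and popularity from query logs as described in the thread.

```java
import java.util.*;

public class PinyinSuggest {

    // Split a pinyin string into overlapping character 3-grams:
    // "beiduofen" -> [bei, eid, idu, duo, uof, ofe, fen]
    static List<String> grams3(String pinyin) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= pinyin.length(); i++) {
            grams.add(pinyin.substring(i, i + 3));
        }
        return grams;
    }

    // Classic Levenshtein edit distance, used for the fine-grained rerank.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Rank dictionary entries: most shared 3-grams first (the coarse
    // boolean-OR retrieval step), ties broken by lower edit distance.
    static String bestMatch(String queryPinyin, List<String> dictPinyin) {
        Set<String> queryGrams = new HashSet<>(grams3(queryPinyin));
        String best = null;
        int bestShared = -1, bestDist = Integer.MAX_VALUE;
        for (String cand : dictPinyin) {
            Set<String> shared = new HashSet<>(grams3(cand));
            shared.retainAll(queryGrams);
            int s = shared.size();
            int dist = editDistance(queryPinyin, cand);
            if (s > bestShared || (s == bestShared && dist < bestDist)) {
                best = cand; bestShared = s; bestDist = dist;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // "背多分" and "貝多芬" share the pinyin "beiduofen"
        List<String> dict = Arrays.asList("beiduofen", "beijing", "duofen");
        System.out.println(bestMatch("beiduofen", dict));  // prints "beiduofen"
    }
}
```

In the real system the n-gram step is what the index does (each gram3 token indexed in Lucene), while the rerank runs over the top few hundred hits.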


Re: Replicating Large Indexes

2011-10-31 Thread Floyd Wu
Hi Jason,

I'm very curious: how do you build (and rebuild) such a big index efficiently?
Sorry that hijack this topic.

Floyd

2011/11/1 Jason Biggin :
> Wondering if anyone has experience with replicating large indexes.  We have a 
> Solr deployment with 1 master, 1 master/slave and 5 slaves.  Our index 
> contains 15+ million articles and is ~55GB in size.
>
> Performance is great on all systems.
>
> Debian Linux
> Apache-Tomcat
> 100GB disk
> 6GB RAM
> 2 proc
>
> on VMWare ESXi 4.0
>
>
> We notice however that whenever the master is optimized, the complete index 
> is replicated to the slaves.  This causes a 100%+ bloat in disk requirements.
>
> Is this normal?  Is there a way around this?
>
> Currently our optimize is configured as such:
>
>        curl 
> 'http://localhost:8080/solr/update?optimize=true&maxSegments=1&waitFlush=true&expungeDeletes=true'
>
> Willing to share our experiences with Solr.
>
> Thanks,
> Jason
>
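One note on the optimize-triggers-full-copy behavior: an optimize rewrites every segment, so afterwards no file on the slave matches the master and replication has to transfer the whole index. That part is expected. An alternative is to skip the explicit optimize and let the merge policy keep the segment count bounded, so most replication cycles only transfer changed segments. A Solr 4.x-style sketch (values illustrative):

```xml
<!-- solrconfig.xml -->
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexConfig>
```

Whether the search-speed gain from a single optimized segment is worth the replication bloat depends on the query load; many deployments stop optimizing entirely.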


Separate ACL and document index

2011-11-22 Thread Floyd Wu
Hi there,

Is it possible to separate the ACL index and the document index and still
search by user role in SOLR?

Currently my implementation indexes the ACL with the document, but the
document's ACL changes frequently. I have to rebuild the index
every time the ACL changes. That is heavy for the whole system because
there are so many documents and their content is huge.

Do you guys have any solution to this problem? I've been reading the
mailing list for a while, and there seems to be no suitable solution for me.

I want a user's search results limited to what his role allows, but I
don't want to re-index a document every time the document's ACL changes.

To my knowledge, is it possible to perform a join, like a database, to
achieve this? How, and is it possible?

Thanks

Floyd
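On the join question at the end: Solr 4.0+ has a query-time join that lets the ACLs live in their own small documents, so a permission change only reindexes a tiny ACL document instead of the full content. A sketch with hypothetical field names:

```
# ACL docs:     doc_id_ref = "doc-42", acl_user = "bob"
# Content docs: id = "doc-42", content = "..."
q=content:java&fq={!join from=doc_id_ref to=id}acl_user:bob
```

The join has its own costs (no scoring from the "from" side, and it can be slow on high-cardinality keys), but it decouples ACL updates from content reindexing.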


Re: Separate ACL and document index

2011-11-23 Thread Floyd Wu
I've been reading a lot about Document Level Security:
https://issues.apache.org/jira/browse/SOLR-1895
https://issues.apache.org/jira/browse/SOLR-1872
https://issues.apache.org/jira/browse/SOLR-1834

But I'm not fully sure that these patches solve my problem.
It seems that changing the original document's ACL still requires
re-building the index "with the document content".

It makes no sense to rebuild when I only change the ACL.

Any ideas? Or am I just misunderstanding these patches?

Floyd



2011/11/23 Floyd Wu :
> Hi there,
>
> Is it possible to separate ACL index and document index and achieve to
> search by user role in SOLR?
>
> Currently my implementation is to index ACL with document, but the
> document itself change frequently. I have to perform rebuild index
> every time when ACL change. It's heavy for whole system due to
> document are so many and content are huge.
>
> Do you guys have any solution to solve this problem. I've been read
> mailing list for a while. Seem there is not suitable solution for me.
>
> I want user searches result only for him according to his role but I
> don't want to re-index document every time when document's ACL change.
>
> To my knowledge, is this possible to perform a join like database to
> achieve this? How and possible?
>
> Thanks
>
> Floyd
>


Re: Separate ACL and document index

2011-11-23 Thread Floyd Wu
Thank you for your sharing. My current solution is similar to 2).
But my problem is that the ACL is early-binding (which means I build the index
with the ACL embedded in the document index), and I don't want to rebuild the
full index (a lucene/solr Document with the PDF content and ACL) when the
front end changes only the permission settings.

Solution 2) seems to have the same problem.

Floyd


2011/11/24 Robert Stewart :
> I have used two different ways:
>
> 1) Store mapping from users to documents in some external database
> such as MySQL.  At search time, lookup mapping for user to some unique
> doc ID or some group ID, and then build query or doc set which you can
> cache in SOLR process for some period.  Then use that as a filter in
> your search.  This is more involved approach but better if you have
> lots of ACLs per user, but it is non-trivial to implement it well.  I
> used this in a system with over 100 million docs, and approx. 20,000
> ACLs per user.  The ACL mapped user to a set of group IDs, and each
> group could have 10,000+ documents.
>
> 2) Generate a query filter that you pass to SOLR as part of the
> search.  Potentially it could be a pretty large query if user has
> granular ACL over may documents or groups.  I've seen it work ok with
> up to 1000 or so ACLs per user query.  So you build that filter query
> from the client using some external database to lookup user ACLs
> before sending request to SOLR.
>
> Bob
>
>
> On Tue, Nov 22, 2011 at 10:48 PM, Floyd Wu  wrote:
>> Hi there,
>>
>> Is it possible to separate ACL index and document index and achieve to
>> search by user role in SOLR?
>>
>> Currently my implementation is to index ACL with document, but the
>> document itself change frequently. I have to perform rebuild index
>> every time when ACL change. It's heavy for whole system due to
>> document are so many and content are huge.
>>
>> Do you guys have any solution to solve this problem. I've been read
>> mailing list for a while. Seem there is not suitable solution for me.
>>
>> I want user searches result only for him according to his role but I
>> don't want to re-index document every time when document's ACL change.
>>
>> To my knowledge, is this possible to perform a join like database to
>> achieve this? How and possible?
>>
>> Thanks
>>
>> Floyd
>>
>
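Robert's option 2) can be sketched in plain Java: fetch the user's group IDs from the external store, then build one filter-query string from them. The field name acl_group and the quoting scheme are assumptions for illustration:

```java
import java.util.*;

public class AclFilter {

    // Build e.g. acl_group:("sales" OR "admin") from a list of group IDs.
    static String buildFq(String field, List<String> groupIds) {
        StringJoiner or = new StringJoiner(" OR ", field + ":(", ")");
        for (String g : groupIds) {
            // Quote each value so IDs with spaces or reserved characters
            // are treated as literals by the query parser.
            or.add("\"" + g.replace("\"", "\\\"") + "\"");
        }
        return or.toString();
    }

    public static void main(String[] args) {
        String fq = buildFq("acl_group", Arrays.asList("sales", "admin"));
        System.out.println(fq);  // prints acl_group:("sales" OR "admin")
    }
}
```

The resulting string is sent as &fq=... on each search request; because the ACL lives in its own field, only that field's values change when permissions change.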


Solr with example Jetty and score problem

2010-09-28 Thread Floyd Wu
Hi there

I have a problem. The situation is: when I issue a query to a single instance,
Solr responds with XML like the following;
as you can see, the score is normal:
===
 

0
23

_l_title,score
0
_l_unique_key:12
*
true
999




1.9808292
GTest





12





===

But when I issue the query with shards (two instances), the response XML is
like the following;
as you can see, the score has been transferred to a child element of the doc.
===
 

0
64

localhost:8983/solr/core0,172.16.6.35:8983/solr
_l_title,score
0
_l_unique_key:12
*
true
999




Gtest

1.9808292






12





===
My Schema.xml like following

   
   
   
   
   

   
 
 _l_unique_key
 _l_body

I don't really know what happened. Is it a problem with my schema, or is this
the behavior of Solr?
Please help on this.


Re: Solr with example Jetty and score problem

2010-09-29 Thread Floyd Wu
Can anybody help on this?
Many thanks.

2010/9/29 Floyd Wu 

> Hi there
>
> I have a problem, the situation is when I issue a query to single instance,
> Solr response XML like following
> as you can see, the score is normal()
> ===
>  
> 
> 0
> 23
> 
> _l_title,score
> 0
> _l_unique_key:12
> *
> true
> 999
> 
> 
> 
> 
> 1.9808292
> GTest
> 
> 
> 
> 
> 
> 12
> 
> 
> 
> 
>
> ===
>
> But when I issue the query with shard(two instances), the response XML will
> be like following.
> as you can see, that score has bee tranfer to a element  of 
> ===
>  
> 
> 0
> 64
> 
> localhost:8983/solr/core0,172.16.6.35:8983/solr
> _l_title,score
> 0
> _l_unique_key:12
> *
> true
> 999
> 
> 
> 
> 
> Gtest
> 
> 1.9808292
> 
> 
> 
> 
> 
> 
> 12
> 
> 
> 
> 
>
> ===
> My Schema.xml like following
> 
> required="true" omitNorms="true"/>
> stored="true" omitNorms="true" multiValued="true"/>
> omitNorms="false" termVectors="true" termPositions="true"
> termOffsets="true"/>
> omitNorms="false" termVectors="true" termPositions="true"
> termOffsets="true"/>
> multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" omitNorms="false"/>
>
> multiValued="true" termVectors="true"
> termPositions="true"
> termOffsets="true" omitNorms="false"/>
>  
>  _l_unique_key
>  _l_body
> 
> I don't really know what happended. Is my schema problem or is the behavior
> of Solr?
> please help on this.
>


Re: Solr with example Jetty and score problem

2010-10-04 Thread Floyd Wu
Hi Chris

Thanks. But do you have any suggest or work-around to deal with it?

Floyd



2010/10/2 Chris Hostetter 

>
> : But when I issue the query with shard(two instances), the response XML
> will
> : be like following.
> : as you can see, that score has bee tranfer to a element  of 
>...
> : 
> : 1.9808292
> : 
>
> The root cause of these seems to be your catchall dynamic field
> declaration...
>
> : : multiValued="true" termVectors="true"
> : termPositions="true"
> : termOffsets="true" omitNorms="false"/>
>
> ...that line (specificly the fact that it's multiValued="true") seems to
> be confusing the results aggregation code.  my guess is that it's
> looping over all the fields, and looking them up in the schema to see if
> they are single/multi valued but not recognizing that "score" is
> special.
>
> https://issues.apache.org/jira/browse/SOLR-2140
>
>
> -Hoss
>
> --
> http://lucenerevolution.org/  ...  October 7-8, Boston
> http://bit.ly/stump-hoss  ...  Stump The Chump!
>
>


Different between Lucid dist. & Apache dist. ?

2010-10-04 Thread Floyd Wu
Hi there,

What is the difference between Lucid distribution of Solr and Apache
distribution?

And can I use Lucid distribution for free in my commercial project?


Tuning Solr

2010-10-04 Thread Floyd Wu
Hi there,

If I don't need MoreLikeThis, spellcheck, or highlighting,
can I remove those configuration sections in solrconfig.xml?
In other words, does Solr load and use these SearchComponents at startup and
during runtime?

Will removing this configuration speed up queries or not?

Thanks
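Components declared in solrconfig.xml are instantiated at startup, but most of them only do work when a request actually invokes them (e.g. hl=true for highlighting, spellcheck=true for spellcheck), so removing them mainly saves a little startup time and memory rather than per-query time. If they are truly unused, commenting them out is a safe sketch of the change:

```xml
<!-- In solrconfig.xml: unused components can simply be commented out.
     (Illustrative only; the exact declarations depend on your config.) -->
<!--
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  ... spellchecker configuration ...
</searchComponent>
-->
```

Any requestHandler that still lists a removed component in its `components`/`last-components` section must be updated as well, or it will fail to load.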


Re: Solr with example Jetty and score problem

2010-10-20 Thread Floyd Wu
I tried this work-around, but it doesn't seem to work for me.
I still get an array of scores in the response.

I have two physical server A and B

localhost --> A
test -->B

I issue query to A like this

http://localhost:8983/solr/core0/select?shards=test:8983/solr,localhost:8983/solr/core0&indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard
Hi Hoss,

But when I change query to

http://localhost:8983/solr/core0/select?shards=test:8983/solr&indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard

The score will be normal (that's just like issuing the query directly to test:8983).

any idea?



2010/10/16 Chris Hostetter 

>
> : Thanks. But do you have any suggest or work-around to deal with it?
>
> Posted in SOLR-2140
>
>   
>
> ..this key is to make sure solr knows "score" is not multiValued
>
>
> -Hoss
>
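The schema snippet in Hoss's message above was stripped by the archive. Judging from the surrounding sentence and the SOLR-2140 discussion, the workaround is an explicit field declaration along these lines (a sketch; the exact attributes in the JIRA comment may differ):

```xml
<!-- Declare "score" explicitly so the distributed-search response merger
     does not treat it as a multiValued catch-all dynamic field
     (SOLR-2140 workaround sketch) -->
<field name="score" type="float" indexed="true" stored="true" multiValued="false"/>
```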


Re: Solr with example Jetty and score problem

2010-10-20 Thread Floyd Wu
OK, I did a little test after my previous email. The work-around that Hoss
provided does not work when you issue the query "*:*".

I tried issuing a query like "key:aaa", and the work-around works no matter
whether there are one, two, or more shard nodes.

Thanks, Hoss. Maybe you could try this and help me confirm that this situation
is not a coincidence.




2010/10/20 Floyd Wu 

> I tried this work-around, but seems not work for me.
> I still get array of score in the response.
>
> I have two physical server A and B
>
> localhost --> A
> test -->B
>
> I issue query to A like this
>
>
> http://localhost:8983/solr/core0/select?shards=test:8983/solr,localhost:8983/solr/core0&indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard
> Hi Hoss,
>
> But when I change query to
>
> http://localhost:8983/solr/core0/select?shards=test:8983/solr&indent=on&version=2.2&q=*%3A*&fq=&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard
>
> The score will be noraml. (that's just like issue query to test:8983)
>
> any idea?
>
>
>
> 2010/10/16 Chris Hostetter 
>
>
>> : Thanks. But do you have any suggest or work-around to deal with it?
>>
>> Posted in SOLR-2140
>>
>>   
>>
>> ..this key is to make sure solr knows "score" is not multiValued
>>
>>
>> -Hoss
>>
>
>


Ranking by sorting score and rankingField better or by product(score, rankingField)?

2012-11-19 Thread Floyd Wu
Hi  there,

I have a field (an externalFileField called rankingField) whose value
(type=float) is calculated by a client app.

In Solr's original scoring model, changing a document's boost value results in
a different ranking, so I think product(score, rankingField) may be equivalent
to Solr's scoring model.

What I'm curious about is which will be better in practice, and what the
different meanings of these three solutions are:

1. sort=score+desc,ranking+desc
2. sort=ranking+desc,score+desc
3. sort=product(score,ranking) -->is this possible?

I'd like to hear your thoughts.

Many thanks

Floyd


Re: Ranking by sorting score and rankingField better or by product(score, rankingField)?

2012-11-19 Thread Floyd Wu
Thanks Otis,

But sort=product(score, rankingField) is not working in my test. What could
be wrong?

Floyd


2012/11/20 Otis Gospodnetic 

> Hi,
>
> 3. yes, you can sort by function -
> http://search-lucene.com/?q=solr+sort+by+function
> 2. this will sort by score only when there is a tie in ranking (two docs
> have the same rank value)
> 1. the reverse of 2.
>
> Otis
> --
> Performance Monitoring - http://sematext.com/spm/index.html
> Search Analytics - http://sematext.com/search-analytics/index.html
>
>
>
>
> On Mon, Nov 19, 2012 at 9:40 PM, Floyd Wu  wrote:
>
> > Hi  there,
> >
> > I have a field(which is externalFileField, called rankingField) and that
> > value(type=float) is calculated by client app.
> >
> > For the solr original scoring model, affect boost value will result
> > different ranking. So I think product(score,rankingField) may equivalent
> to
> > solr scoring model.
> >
> > What I curious is which will be better in practice and the different
> > meanings on these three solutions?
> >
> > 1. sort=score+desc,ranking+desc
> > 2. sort=ranking+desc,score+desc
> > 3. sort=product(score,ranking) -->is this possible?
> >
> > I'd like to hear your thoughts.
> >
> > Many thanks
> >
> > Floyd
> >
>


Re: Custom ranking solutions?

2012-11-19 Thread Floyd Wu
HI Otis,
The debug information is as follows; it seems there is no product() step.


_l_all:"測試"
_l_all:"測試"
PhraseQuery(_l_all:"測 試")
_l_all:"測 試"


41.11747 = (MATCH) weight(_l_all:"測 試" in 0) [DefaultSimilarity], result
of: 41.11747 = fieldWeight in 0, product of: 4.1231055 = tf(freq=17.0),
with freq of: 17.0 = phraseFreq=17.0 1.4246359 = idf(), sum of: 0.71231794
= idf(docFreq=3, maxDocs=3) 0.71231794 = idf(docFreq=3, maxDocs=3) 7.0 =
fieldNorm(doc=0)


14.246359 = (MATCH) weight(_l_all:"測 試" in 0) [DefaultSimilarity], result
of: 14.246359 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq
of: 1.0 = phraseFreq=1.0 1.4246359 = idf(), sum of: 0.71231794 =
idf(docFreq=3, maxDocs=3) 0.71231794 = idf(docFreq=3, maxDocs=3) 10.0 =
fieldNorm(doc=0)


10.073696 = (MATCH) weight(_l_all:"測 試" in 0) [DefaultSimilarity], result
of: 10.073696 = fieldWeight in 0, product of: 1.4142135 = tf(freq=2.0),
with freq of: 2.0 = phraseFreq=2.0 1.4246359 = idf(), sum of: 0.71231794 =
idf(docFreq=3, maxDocs=3) 0.71231794 = idf(docFreq=3, maxDocs=3) 5.0 =
fieldNorm(doc=0)


LuceneQParser

6.0

0.0

0.0


0.0


0.0


0.0


0.0


0.0



6.0

3.0


0.0


0.0


0.0


0.0


3.0






2012/11/20 Otis Gospodnetic 

> Hi Floyd,
>
> Use &debugQuery=true and let's see it.:)
>
> Otis
> --
> Performance Monitoring - http://sematext.com/spm/index.html
> Search Analytics - http://sematext.com/search-analytics/index.html
>
>
>
>
> On Mon, Nov 19, 2012 at 9:29 PM, Floyd Wu  wrote:
>
> > Hi there,
> >
> > Before ExternalFielField introduced, change document boost value to
> achieve
> > custom ranking. My client app will update each boost value for documents
> > daily and seem to worked fine.
> > Actual ranking could be predicted based on boost value. (value is
> > calculated based on click, recency, and rating ).
> >
> > I'm now try to use ExternalFileField to do some ranking, after some
> test, I
> > did not get my expectation.
> >
> > I'm doing a sort like this
> >
> > sort=product(score,abs(rankingField))+desc
> > But the query result ranking won't change anyway.
> >
> > The external file as following
> > doc1=3
> > doc2=5
> > doc3=9
> >
> > The original score get from Solr result as fllowing
> > doc1=41.042
> > doc2=10.1256
> > doc3=8.2135
> >
> > Expected ranking
> > doc1
> > doc3
> > doc2
> >
> > What wrong in my test, please kindly help on this.
> >
> > Floyd
> >
>


Re: Ranking by sorting score and rankingField better or by product(score, rankingField)?

2012-11-19 Thread Floyd Wu
Hi Otis,

There is no error in console nor in log file. I'm using Solr-4.0.
The External file name is external_rankingField.txt and exist is directory
 "C:\solr-4.0.0\example\solr\collection1\data\external_rankingField.txt"

The external file itself should be working, because when I issue a query with
"sort=sqrt(rankingField)+desc" or "sort=sqrt(rankingField)+asc",
the result order changes accordingly.

By the way, I first tried the external file field according to the document here:
http://lucidworks.lucidimagination.com/display/solr/Working+with+External+Files+and+Processes

"Format of the External File

The file itself is located in Solr's index directory, which by default is
$SOLR_HOME/data/index. The name of the file should be external_*fieldname*
or external_*fieldname*.*. For the example above, then, the file could be
named external_entryRankFile or external_entryRankFile.txt."

But actually the external file should be put in
$SOLR_HOME/data/

Floyd




2012/11/20 Otis Gospodnetic 

> Hi,
>
> Do you see any errors?
> Which version of Solr?
> What does debugQuery=true say?
> Are you sure your file with ranks is being used? (remove it, put some junk
> in it, see if that gives an error)
>
> Otis
> --
> Performance Monitoring - http://sematext.com/spm/index.html
> Search Analytics - http://sematext.com/search-analytics/index.html
>
>
>
>
> On Mon, Nov 19, 2012 at 10:16 PM, Floyd Wu  wrote:
>
> > Thanks Otis,
> >
> > But the sort=product(score, rankingField) is not working in my test. What
> > probably wrong?
> >
> > Floyd
> >
> >
> > 2012/11/20 Otis Gospodnetic 
> >
> > > Hi,
> > >
> > > 3. yes, you can sort by function -
> > > http://search-lucene.com/?q=solr+sort+by+function
> > > 2. this will sort by score only when there is a tie in ranking (two
> docs
> > > have the same rank value)
> > > 1. the reverse of 2.
> > >
> > > Otis
> > > --
> > > Performance Monitoring - http://sematext.com/spm/index.html
> > > Search Analytics - http://sematext.com/search-analytics/index.html
> > >
> > >
> > >
> > >
> > > On Mon, Nov 19, 2012 at 9:40 PM, Floyd Wu  wrote:
> > >
> > > > Hi  there,
> > > >
> > > > I have a field(which is externalFileField, called rankingField) and
> > that
> > > > value(type=float) is calculated by client app.
> > > >
> > > > For the solr original scoring model, affect boost value will result
> > > > different ranking. So I think product(score,rankingField) may
> > equivalent
> > > to
> > > > solr scoring model.
> > > >
> > > > What I curious is which will be better in practice and the different
> > > > meanings on these three solutions?
> > > >
> > > > 1. sort=score+desc,ranking+desc
> > > > 2. sort=ranking+desc,score+desc
> > > > 3. sort=product(score,ranking) -->is this possible?
> > > >
> > > > I'd like to hear your thoughts.
> > > >
> > > > Many thanks
> > > >
> > > > Floyd
> > > >
> > >
> >
>


Re: Custom ranking solutions?

2012-11-19 Thread Floyd Wu
Hi Otis,

I'm doing some test like this,

http://localhost:8983/solr/select/?fl=score,_l_unique_key&defType=func&q=product(abs(rankingField),abs(score))

and I get following response,


can not use FieldCache on unindexed field: score
400


If I change score to rankingField like this:

http://localhost:8983/solr/select/?fl=score,_l_unique_key&defType=func&q=product(abs(rankingField),abs(rankingField))



211
2500.0


223
4.0


222
0.01001



It seems the score cannot be used inside a function query?

Floyd




2012/11/20 Otis Gospodnetic 

> Hi Floyd,
>
> Use &debugQuery=true and let's see it.:)
>
> Otis
> --
> Performance Monitoring - http://sematext.com/spm/index.html
> Search Analytics - http://sematext.com/search-analytics/index.html
>
>
>
>
> On Mon, Nov 19, 2012 at 9:29 PM, Floyd Wu  wrote:
>
> > Hi there,
> >
> > Before ExternalFielField introduced, change document boost value to
> achieve
> > custom ranking. My client app will update each boost value for documents
> > daily and seem to worked fine.
> > Actual ranking could be predicted based on boost value. (value is
> > calculated based on click, recency, and rating ).
> >
> > I'm now try to use ExternalFileField to do some ranking, after some
> test, I
> > did not get my expectation.
> >
> > I'm doing a sort like this
> >
> > sort=product(score,abs(rankingField))+desc
> > But the query result ranking won't change anyway.
> >
> > The external file as following
> > doc1=3
> > doc2=5
> > doc3=9
> >
> > The original score get from Solr result as fllowing
> > doc1=41.042
> > doc2=10.1256
> > doc3=8.2135
> >
> > Expected ranking
> > doc1
> > doc3
> > doc2
> >
> > What wrong in my test, please kindly help on this.
> >
> > Floyd
> >
>


Re: Ranking by sorting score and rankingField better or by product(score, rankingField)?

2012-11-20 Thread Floyd Wu
Hi Chris,

Thanks! Before your great suggestion arrived, I had given up on using a
function query to calculate the product of score and rankingField, and used
exactly the boost query solution you describe. Of course it works fine. The
next step will be to design a suitable function that outputs a ranking value
which also considers popularity, recency, relevance, and document ratings.

Many thanks to the community.

Floyd



2012/11/21 Chris Hostetter 

>
> : But the sort=product(score, rankingField) is not working in my test. What
> : probably wrong?
>
> the problem is "score" is not a field or a function -- Solr doesn't know
> exactly what "score" you want it to use there (scores from which query?)
>
> You either need to refrence the query in the function (using the
> "query(...)" function) or you need to incorporate your function directly
> into the score (using something like the "boost" QParser).
>
> Unless you need the "score" of the docs, from your orriginal query, to be
> returned in the fl, or used in some other clause of your sort, i would
> suggest using the boost parser -- that way your final scores will match
> the scores you computed with the function...
>
>qq=your original query
> q={!boost b=rankingField v=$qq}
>
>
> https://lucene.apache.org/solr/4_0_0/solr-core/org/apache/solr/search/BoostQParserPlugin.html
> https://people.apache.org/~hossman/ac2012eu/
>
>
> -Hoss
>
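Hoss's suggestion as a concrete request might look like this (a sketch; the `_l_all` field and the query string are examples taken from earlier messages in this archive, not from his reply):

```text
qq=_l_all:"test"                     # the original relevance query
q={!boost b=rankingField v=$qq}      # final score = relevance score * rankingField
fl=*,score
```

Because the boost is folded into the main query, the `score` returned in `fl` already reflects the external ranking value, so no separate sort function is needed.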


Re: Custom ranking solutions?

2012-11-20 Thread Floyd Wu
Hi Dan,

Thanks! I'm using boost query to solve this problem.

Floyd




2012/11/21 Daniel Rosher 

> Hi
>
> The product function query needs a valuesource, not the pseudo score field.
>
> You probably need something like (with Solr 4.0):
>
> q={!lucene}*:*&sort=product(query($q),2) desc,score
> desc&fl=score,_score_:product(query($q),2),[explain]
>
> Cheers,
> Dan
>
> On Tue, Nov 20, 2012 at 2:29 AM, Floyd Wu  wrote:
>
> > Hi there,
> >
> > Before ExternalFielField introduced, change document boost value to
> achieve
> > custom ranking. My client app will update each boost value for documents
> > daily and seem to worked fine.
> > Actual ranking could be predicted based on boost value. (value is
> > calculated based on click, recency, and rating ).
> >
> > I'm now try to use ExternalFileField to do some ranking, after some
> test, I
> > did not get my expectation.
> >
> > I'm doing a sort like this
> >
> > sort=product(score,abs(rankingField))+desc
> > But the query result ranking won't change anyway.
> >
> > The external file as following
> > doc1=3
> > doc2=5
> > doc3=9
> >
> > The original score get from Solr result as fllowing
> > doc1=41.042
> > doc2=10.1256
> > doc3=8.2135
> >
> > Expected ranking
> > doc1
> > doc3
> > doc2
> >
> > What wrong in my test, please kindly help on this.
> >
> > Floyd
> >
>


Re: Dynamic ranking based on search term

2012-11-28 Thread Floyd Wu
Hi Upayavira

Let me explain what I need in the other words.

The list is the result of analyzing our logs.
The key-value pair list actually means that when the search term is java,
these documents (doc1, doc2, doc5) should be boosted, for example:
  java, doc1,doc2,doc5

Any ideas?

Thanks.




2012/11/28 Upayavira 

> Isn't this what Solr/Lucene are designed to do??
>
> On indexing a document, Lucene creates an inverted index, mapping terms
> back to their containing documents. The data you have is already
> inverted.
>
> I'd suggest uninverting it and then hand it to Solr in that format,
> thus:
>
> doc1: java
> doc2: java
> doc4: book
> doc5: java
> doc9: book
> doc77: book
>
> With that structure, you'll have in your index exactly what Solr
> expects, and will be able to take advantage of the inbuilt ranking
> capabilities of Lucene and Solr.
>
> Upayavira
>
> On Wed, Nov 28, 2012, at 10:15 AM, Floyd Wu wrote:
> > Hi there,
> >
> > If I have a list that is key-value pair in text filed or database table.
> > How do I achieve dynamic ranking based on search term? That say when user
> > search term "java" and doc1,doc2, doc5 will get higher ranking.
> >
> > for example( key is search term, value is related index document unique
> > key):
> > ==
> > key, value
> > ==
> > java, doc1,doc2,doc5
> > book, doc9, doc4,doc77
> > ==
> >
> > I've finished implementation using externalFileField to do ranking, but
> > in
> > this way, ranking is static.
> >
> > Please kindly point me a way to do this.
> >
> > PS: SearchComponent maybe?
>
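One built-in mechanism not mentioned in the thread that takes exactly this key-value shape as configuration is Solr's QueryElevationComponent. A sketch of its elevate.xml, assuming doc1, doc2, etc. are the documents' uniqueKey values (the component itself must also be registered in solrconfig.xml with config-file="elevate.xml"):

```xml
<elevate>
  <!-- when the query is "java", pin these documents to the top -->
  <query text="java">
    <doc id="doc1"/>
    <doc id="doc2"/>
    <doc id="doc5"/>
  </query>
  <query text="book">
    <doc id="doc9"/>
    <doc id="doc4"/>
    <doc id="doc77"/>
  </query>
</elevate>
```

Note this pins documents rather than softly boosting them; for graded per-query boosts, a boost query built at request time from the log-derived list would be closer to what is asked.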


Difference between 'bf' and 'boost' when using eDismax

2012-12-03 Thread Floyd Wu
Hi there,

I'm not sure if I understand this clearly.

Does 'bf' mean the final score will have some value returned by bf added to it?
For example: score + bf = final score

Does 'boost' mean the score will be multiplied by the value returned by boost?
For example: score * boost = final score

When using both 'bf' and 'boost':
score * boost + bf = final score

If I would like to make recently created documents rank higher, will 'bf' or
'boost' be the better solution (assuming bf and boost use the same function,
recip(ms(NOW,datefield),3.16e-11,1,1))?


Please help on this.


Re: Difference between 'bf' and 'boost' when using eDismax

2012-12-03 Thread Floyd Wu
Thanks Jack!
It helps a lots.

Floyd



2012/12/4 Jack Krupansky 

> "bf" is processed first, then "boost".
>
> All the bf's will be added, then the resulting scores will be boosted by
> the product of all the "boost" function queries.
>
> -- Jack Krupansky
>
> -Original Message- From: Floyd Wu
> Sent: Monday, December 03, 2012 11:00 PM
> To: solr-user@lucene.apache.org
> Subject: Difference between 'bf' and 'boost' when using eDismax
>
>
> Hi there,
>
> I'm not sure if I understand this clearly.
>
> 'bf' is that final score will be add some value return by bf?
> for example->  score + bf = final score
>
> 'boost' is that score will be multiply with value that return by boost?
> for example-> score * boost = final score
>
> When using both( 'bf' and 'boost')
> score * boost + bf = final score
>
> If I would like to make recent created document ranking higher, using 'bf'
> or 'boost' will be better solution(Assume bf and boost will use the same
> function recip(ms(NOW,datefield),3.16e-**11,1,1))?
>
>
> Please help on this.
>
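In request form, the two options look like this (a sketch; `datefield` is the field name from the question, the query term is made up):

```text
defType=edismax
q=solr

# additive: final score = relevance score + bf value
bf=recip(ms(NOW,datefield),3.16e-11,1,1)

# multiplicative: final score = relevance score * boost value
boost=recip(ms(NOW,datefield),3.16e-11,1,1)
```

For recency, the multiplicative 'boost' is generally the safer choice: an additive 'bf' contribution is on an absolute scale and can be drowned out (or dominate) when relevance scores are much larger (or smaller), while a multiplicative factor scales with the score.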


Re: difference these two queries

2012-12-10 Thread Floyd Wu
Thanks Otis.

When talking about query performance (ignoring scoring), is using fq better?
Floyd


2012/12/11 Otis Gospodnetic 

> Hi,
>
> The fq one is a FilterQuery that only does matching, but not scoring. Its
> results are stored in the filter cache, while the q uses the query cache.
>
> Otis
> --
> SOLR Performance Monitoring - http://sematext.com/spm/index.html
>
>
>
>
>
> On Mon, Dec 10, 2012 at 10:11 PM, Floyd Wu  wrote:
>
> > Hi There,
> > Sorry for spamming if this question has already been asked.
> >
> > What's the main difference between
> >
> > q=fieldA:value AND fieldB:value
> >
> > q=fieldA:value&fq=fieldB:value
> >
> > both query will give me the same result, I wonder what's the main
> > difference and in practice what the better way?
> >
> > Thanks in advance
> >
> > Floyd
> >
>


Remove underscore char when indexing and query problem

2012-03-02 Thread Floyd Wu
Hi there,

I have a document and its title is "20111213_solr_apache conference report".

When I use the analysis web interface to see exactly which tokens Solr
produces, the following is the result:

term text: 20111213_solr | apache | conference | report

Why is 20111213_solr kept as a single token, and why isn't the "_" char
removed? (I've added "_" as a stop word in stopwords.txt.)

I did another test with "20111213_solr_apache conference_report".
As you can see, the difference is that I added an underscore char between
conference and report. Analyzing this string gives:

term text: 20111213_solr | apache | conference | report

This time the underscore char between conference and report is removed!

Why? How do I make Solr remove the underscore char and behave consistently?
Please help on this.

Thanks in advance.

Floyd
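A stop word of "_" will not help here, because stop words remove whole tokens, and "_" never appears as a token on its own; whether it survives depends on the tokenizer and word-delimiter rules. Two common ways to make underscore handling consistent are sketched below (untested against this exact schema; either would also split 20111213_solr, which is what consistency requires):

```xml
<!-- Option 1: replace underscores with spaces before tokenization -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="_" replacement=" "/>

<!-- Option 2: split on underscores after tokenization -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" splitOnNumerics="0"/>
```

Whichever is chosen should be added to both the index and query analyzer chains of the field type, then the index rebuilt.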


how to get lots fields this way?

2011-04-13 Thread Floyd Wu
Hi,

As I know, using fl=*,score means we get all fields plus the score in the
returned search result, and if a field is stored, all of its text is returned
as part of the result.

Now I have 2x fields; some of the field names have no prefix or fixed naming
rule, and the names cannot be predicted.
I want to get all of them except one.

How do I list these field names in fl=?

for example, I have two documents, doc A fields are (a,b, df, gh, t,p)
doc B fields are (a,b, xc, zw, t,p)

What I want is all of them except p.

Please help on this.
Many thanks


Floyd


Re: how to get lots fields this way?

2011-04-13 Thread Floyd Wu
Can Solr exclude a field in fl=... like this: fl=!fieldName,score?

Floyd


2011/4/14 Otis Gospodnetic 

> Floyd,
>
> You need to explicitly list all fields in &fl=...
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message 
> > From: Floyd Wu 
> > To: solr-user@lucene.apache.org
> > Sent: Wed, April 13, 2011 2:34:49 PM
> > Subject: how to get lots fields this way?
> >
> > Hi,
> >
> > As I know when using fl=*, score means we need to get all field and
>  score as
> > returned search result. And if field is stored, all text will be
>  returned as
> > part of result.
> >
> > Now I have 2x fields, some of fields name  have no prefix or fixed naming
> > rule and cannot be predicted what name will  be.
> > I want to get all of them except one of them.
> >
> > How to list these  field name in fl=
> >
> > for example, I have two documents, doc A fields are  (a,b, df, gh, t,p)
> > doc B fields are (a,b, xc, zw, t,p)
> >
> > What I want is  all of them except p.
> >
> > Please help on this.
> > Many  thanks
> >
> >
> > Floyd
> >
>


Re: Fuzzy Query Param

2011-06-30 Thread Floyd Wu
If this is an edit-distance implementation, what is the result when applied
to a CJK query? For example, "您好"~3

Floyd


2011/6/30 entdeveloper 

> I'm using Solr trunk.
>
> If it's levenstein/edit distance, that's great, that's what I want. It just
> didn't seem to be officially documented anywhere so I wanted to find out
> for
> sure. Thanks for confirming.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Fuzzy-Query-Param-tp3120235p3122418.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
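On the CJK question: in Lucene query syntax, a ~ after a quoted phrase ("您好"~3) is phrase slop, not fuzziness; fuzzy matching applies to a single term (您好~2). When fuzziness does apply, the edit distance is computed over the term's characters, so a CJK term behaves like any other string of code points. A quick illustration of the underlying metric (a plain Levenshtein sketch, not Lucene's actual automaton-based implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic single-row DP edit distance over Unicode code points."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # distances for the empty prefix of a
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                     # deletion
                        dp[j - 1] + 1,                 # insertion
                        prev + (a[i - 1] != b[j - 1])) # substitution/match
            prev = cur
    return dp[n]

# Each CJK character is one code point, so 您好 vs 你好 differ by one edit.
print(levenshtein("您好", "你好"))  # prints 1
```

So whether a fuzzy CJK query is useful depends mostly on the analyzer: with unigram tokenization each term is a single character and edit distance 1-2 matches almost anything, which is rarely helpful.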


How to make a valid date facet query?

2011-07-25 Thread Floyd Wu
Hi all,

I need to make a date-faceted query; I tried to use facet.range but can't
get the result I need.

I want to make 4 facets like the following:

1 month, 3 months, 6 months, more than 1 year

The onlinedate field in schema.xml like this



I hit the solr by this url

http://localhost:8983/solr/select/?q=*%3A*
&start=0
&rows=10
&indent=on
&facet=true
&facet.range=onlinedate
&f.onlinedate.facet.range.start=NOW-1YEARS
&f.onlinedate.facet.range.end=NOW%2B1YEARS
&f.onlinedate.facet.range.gap=NOW-1MONTHS, NOW-3MONTHS,
NOW-6MONTHS,NOW-1YEAR

But the solr complained Exception during facet.range of onlinedate
org.apache.solr.common.SolrException: Can't add gap NOW-1MONTHS,
NOW-3MONTHS, NOW-6MONTHS,NOW-1YEAR to value Mon Jul 26 11:56:40 CST 2010 for


What is the correct way to realize this requirement? Please help with this.
Floyd


Re: How to make a valid date facet query?

2011-07-26 Thread Floyd Wu
Hi Tomás

Does facet.query support queries like the following?

facet.query=onlinedate:[NOW/YEAR-3YEARS TO NOW/YEAR+5YEARS]

I tried this, but the returned result was not correct.

Am I missing something?

Floyd

2011/7/26 Tomás Fernández Löbbe 

> Hi Floyd, I don't think the feature that allows to use multiple gaps for a
> range facet is committed. See
> https://issues.apache.org/jira/browse/SOLR-2366
> You can achieve a similar functionality by using facet.query. see:
>
> http://wiki.apache.org/solr/SimpleFacetParameters#Facet_Fields_and_Facet_Queries
>
> Regards,
>
> Tomás
> On Tue, Jul 26, 2011 at 1:23 AM, Floyd Wu  wrote:
>
> > Hi all,
> >
> > I need to make date faceted query and I tried to use facet.range but
> can't
> > get result I need.
> >
> > I want to make 4 facet like following.
> >
> > 1 Months,3 Months, 6Months, more than 1 Year
> >
> > The onlinedate field in schema.xml like this
> >
> > 
> >
> > I hit the solr by this url
> >
> > http://localhost:8983/solr/select/?q=*%3A*
> > &start=0
> > &rows=10
> > &indent=on
> > &facet=true
> > &facet.range=onlinedate
> > &f.onlinedate.facet.range.start=NOW-1YEARS
> > &f.onlinedate.facet.range.end=NOW%2B1YEARS
> > &f.onlinedate.facet.range.gap=NOW-1MONTHS, NOW-3MONTHS,
> > NOW-6MONTHS,NOW-1YEAR
> >
> > But the solr complained Exception during facet.range of onlinedate
> > org.apache.solr.common.SolrException: Can't add gap NOW-1MONTHS,
> > NOW-3MONTHS, NOW-6MONTHS,NOW-1YEAR to value Mon Jul 26 11:56:40 CST 2010
> > for
> > 
> >
> > What is correct way to make this requirement to realized? Please help on
> > this.
> > Floyd
> >
>
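The multi-gap facet.range from the first message can be rewritten as one facet.query per bucket, for example (a sketch; the bucket boundaries are one interpretation of "1/3/6 months, more than 1 year"):

```text
&facet=true
&facet.query=onlinedate:[NOW/DAY-1MONTH TO NOW]
&facet.query=onlinedate:[NOW/DAY-3MONTHS TO NOW]
&facet.query=onlinedate:[NOW/DAY-6MONTHS TO NOW]
&facet.query=onlinedate:[* TO NOW/DAY-1YEAR]
```

Rounding with NOW/DAY (rather than raw NOW) keeps the query text stable within a day, so the filter cache can be reused across requests.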


Re: Limit the SolR acces from the web for one user-agent?

2012-11-08 Thread Floyd Wu
Hi Alex, I'd like to know how to "use Client and Server Certificates to
protect the connection and embed those certificates into clients".

Please kindly share your experience.

Floyd


2012/11/8 Alexandre Rafalovitch 

> It is very easy to do this on Apache, but you need to be aware that
> User-Agent is extremely easy to both sniff and spoof.
>
> Have you thought of perhaps using Client and Server Certificates to protect
> the connection and embedding those certificates into clients?
>
> Regards,
>Alex.
>
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>
>
> On Thu, Nov 8, 2012 at 9:39 AM, Bruno Mannina  wrote:
>
> > Dear All,
> >
> > I'm using an external program (my own client) to access to my Apache-SolR
> > database.
> > I would like to restrict the SOLR access to a specific User-Agent
> (defined
> > in my program).
> >
> > I would like to know if it's possible to do that directly in SolR config
> > or I must
> > process that in the Apache server?
> >
> > My program do only requests like this (i.e.):
> > http://xxx.xxx.xxx.xxx:pp/**solr/select/?q=ap%3Afuelcell&**
> > version=2.2&start=0&rows=10&**indent=on
> >
> > I can add on my HTTP component properties an User-Agent, Log, Pass,
> etc...
> > like a standard Http connection.
> >
> > To complete: my soft is distribued to several users and I would like to
> > limit the SOLR access to these users and with my program.
> > FireFox, Chrome, I.E. will be unauthorized.
> >
> > thanks for your comment or help,
> > Bruno
> >
> > Ubuntu 12.04LTS
> > SolR 3.6
> >
>
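For reference, an Apache 2.2-era fragment (matching the Ubuntu 12.04 / Solr 3.6 setup described) that blocks everything except a given User-Agent might look like this; MyClient/1.0 is a placeholder, and as Alex notes the header is trivially spoofable, so this is obscurity rather than security:

```text
# In the Apache vhost fronting Solr:
# mark requests whose User-Agent matches our client...
SetEnvIf User-Agent "^MyClient/1\.0$" allowed_client

# ...and deny everyone else access to the Solr path
<Location /solr>
    Order Deny,Allow
    Deny from all
    Allow from env=allowed_client
</Location>
```

Combining this with HTTP Basic auth or client certificates over HTTPS would give actual protection rather than just filtering browsers out.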


BM25 model for solr 4?

2012-11-14 Thread Floyd Wu
Hi there,
Can anybody kindly tell me how to set up Solr to use BM25?
By the way, are there any experiments or research showing how BM25 and the
classical VSM model compare in recall/precision?

Thanks in advance.
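In Solr 4 the similarity is switched in schema.xml; a minimal sketch (the k1 and b values shown are the usual defaults, and should be tuned for your corpus and document lengths):

```xml
<!-- In schema.xml: replace the default TF-IDF similarity with BM25 -->
<similarity class="solr.BM25SimilarityFactory">
  <float name="k1">1.2</float>
  <float name="b">0.75</float>
</similarity>
```

A full reindex is not strictly required to switch, but norms are encoded at index time, so reindexing after changing similarity settings gives the most consistent scores.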


Re: BM25 model for solr 4?

2012-11-15 Thread Floyd Wu
Thanks everyone, and especially Tom for the detailed explanation of this topic.
Of course, in academia we need to interpret results carefully. What I care
about, from the end-user's point of view, is: will using BM25 produce better
ranking than Lucene's original VSM + Boolean model? How significant will the
difference be?
I'd like to see some sharing from the community.

Floyd


2012/11/16 Tom Burton-West 

> Hello Floyd,
>
> There is a ton of research literature out there comparing BM25 to vector
> space.  But you have to be careful interpreting it.
>
> BM25 originally beat the SMART vector space model in the early  TRECs
>  because it did better tf and length normalization.  Pivoted Document
> Length normalization was invented to get the vector space model to catch up
> to BM25.   (Just Google for Singhal length normalization.  Amith Singhal,
> now chief of Google Search did his doctoral thesis on this and it is
> available.  Similarly Stephan Robertson, now at Microsoft Research
> published a ton of studies of BM25)
>
> The default Solr/Lucene similarity class doesn't provide the length or tf
> normalization tuning params that BM25 does.  There is the sweetspot
> simliarity, but that doesn't quite work the same way that the BM25
> normalizations do.
>
> Document length normalization needs and parameter tuning all depends on
> your data.  So if you are reading a comparison, you need to determine:
> 1) When comparing recall/precision etc. between vector space and Bm25, did
> the experimenter tune both the vector space and the BM25 parameters
> 2) Are the documents (and queries) they are using in the test, similar in
>  length characteristics to your documents and
> queries.
>
> We are planning to do some testing in the next few months for our use case,
> which is 10 million books where we index the entire book.  These are
> extremely long documents compared to most IR research.
> I'd love to hear about actual (non-research) production implementations
> that have tested the new ranking models available in Solr.
>
> Tom
>
>
>
> On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu  wrote:
>
> > Hi there,
> > Does anybody can kindly tell me how to setup solr to use BM25?
> > By the way, are there any experiment or research shows BM25 and classical
> > VSM model comparison in recall/precision rate?
> >
> > Thanks in advanced.
> >
>