Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-05 Thread Dan Davis
Kevin & Noble,

I'll take it on to test this.   I've built from source before, and I've
wanted this authorization capability for a while.

On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee  wrote:

> Noble,
>
> Does SOLR-8000 need to be re-opened?  Has anyone else been able to test
> the restart fix?
>
> At startup, these are the log messages that say there is no security
> configuration and the plugins aren’t being used even though security.json
> is in Zookeeper:
> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer Security
> conf doesn't exist. Skipping setup for authorization module.
> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer No
> authentication plugin used.
>
> Thanks,
> Kevin
>
> > On Sep 4, 2015, at 5:47 AM, Noble Paul  wrote:
> >
> > There are no download links for 5.3.x branch  till we do a bug fix
> release
> >
> > If you wish to download the trunk nightly (which is not same as 5.3.0)
> > check here
> https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
> >
> > If you wish to get the binaries for 5.3 branch you will have to make it
> > (you will need to install svn and ant)
> >
> > Here are the steps
> >
> > svn checkout
> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
> > cd lucene_solr_5_3/solr
> > ant server
> >
> >
> >
> > On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian
> >  wrote:
> >> Hi Kevin/Noble,
> >>
> >> What is the download link to take the latest? What are the steps to
> compile
> >> it, test and use?
> >> We also have a use case to have this feature in solr too. Therefore,
> wanted
> >> to test and above info would help a lot to get started.
> >>
> >> Thanks.
> >>
> >>
> >> On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee 
> wrote:
> >>
> >>> Thanks, I downloaded the source and compiled it and replaced the jar
> file
> >>> in the dist and solr-webapp’s WEB-INF/lib directory.  It does seem to
> be
> >>> protecting the Collections API reload command now as long as I upload
> the
> >>> security.json after startup of the Solr instances.  If I shutdown and
> bring
> >>> the instances back up, the security is no longer in place and I have to
> >>> upload the security.json again for it to take effect.
> >>>
> >>> - Kevin
> >>>
>  On Sep 3, 2015, at 10:29 PM, Noble Paul  wrote:
> 
>  Both these are committed. If you could test with the latest 5.3 branch
>  it would be helpful
> 
>  On Wed, Sep 2, 2015 at 5:11 PM, Noble Paul 
> wrote:
> > I opened a ticket for the same
> > https://issues.apache.org/jira/browse/SOLR-8004
> >
> > On Wed, Sep 2, 2015 at 1:36 PM, Kevin Lee  >
> >>> wrote:
> >> I’ve found that completely exiting Chrome or Firefox and opening it
> >>> back up re-prompts for credentials when they are required.  It was
> >>> re-prompting with the /browse path where authentication was working
> each
> >>> time I completely exited and started the browser again, however it
> won’t
> >>> re-prompt unless you exit completely and close all running instances
> so I
> >>> closed all instances each time to test.
> >>
> >> However, to make sure I ran it via the command line via curl as
> >>> suggested and it still does not give any authentication error when
> trying
> >>> to issue the command via curl.  I get a success response from all the
> Solr
> >>> instances that the reload was successful.
> >>
> >> Not sure why the pre-canned permissions aren’t working, but the one
> to
> >>> the request handler at the /browse path is.
> >>
> >>
> >>> On Sep 1, 2015, at 11:03 PM, Noble Paul 
> wrote:
> >>>
> >>> " However, after uploading the new security.json and restarting the
> >>> web browser,"
> >>>
> >>> The browser remembers your login , So it is unlikely to prompt for
> the
> >>> credentials again.
> >>>
> >>> Why don't you try the RELOAD operation using command line (curl) ?
> >>>
> >>> On Tue, Sep 1, 2015 at 10:31 PM, Kevin Lee
> 
> >>> wrote:
>  The restart issues aside, I’m trying to lockdown usage of the
> >>> Collections API, but that also does not seem to be working either.
> 
>  Here is my security.json.  I’m using the “collection-admin-edit”
> >>> permission and assigning it to the “adminRole”.  However, after
> uploading
> >>> the new security.json and restarting the web browser, it doesn’t seem
> to be
> >>> requiring credentials when calling the RELOAD action on the Collections
> >>> API.  The only thing that seems to work is the custom permission
> “browse”
> >>> which is requiring authentication before allowing me to pull up the
> page.
> >>> Am I using the permissions correctly for the
> RuleBasedAuthorizationPlugin?
> 
>  {
>   "authentication":{
>  "class":"solr.BasicAuthPlugin",
>  "credentials": {
>   "admin”:” ",
>   "user": ” "
>  

Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-10 Thread Dan Davis
Kevin & Noble,

I've manually verified the fix for SOLR-8000, but not yet for SOLR-8004.

I reproduced the initial problem with reloading security.json after
restarting both Solr and ZooKeeper.   I verified using zkcli.sh that
ZooKeeper does retain the changes to the file after using
/solr/admin/authorization, and that therefore the problem was Solr.
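
For anyone who wants to repeat that check, it was roughly this (a sketch
only - the zkcli.sh path and the ZooKeeper address are from my environment):

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd getfile /security.json /tmp/security.json
cat /tmp/security.json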

After building solr-5.3.1-SNAPSHOT.tgz with ant package (because I don't
know how to give parameters to ant server), I expanded it, copied in the
core data, and then started it.   I was prompted for a password, and it let
me in once the password was given.
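
For the record, the build itself was just (paths per Noble's checkout steps
above; the tarball landed under solr/package/ for me):

cd lucene_solr_5_3/solr
ant package
tar xzf package/solr-5.3.1-SNAPSHOT.tgz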

I'll probably get to SOLR-8004 shortly, since I have both environments
built and working.

It also occurs to me that it might be better to forbid all permissions and
grant specific permissions to specific roles.   Is there a comprehensive
list of the permissions available?
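
To make that concrete, the sort of authorization section I have in mind
looks roughly like this (a sketch only - the role and user names are made
up, and the predefined permission names should be checked against the
RuleBasedAuthorizationPlugin documentation):

"authorization": {
  "class": "solr.RuleBasedAuthorizationPlugin",
  "permissions": [
    {"name": "collection-admin-edit", "role": "adminRole"},
    {"name": "browse", "collection": "mycollection",
     "path": "/browse", "role": "browseRole"}
  ],
  "user-role": {"admin": ["adminRole", "browseRole"], "user": ["browseRole"]}
}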


On Tue, Sep 8, 2015 at 1:07 PM, Kevin Lee  wrote:

> Thanks Dan!  Please let us know what you find.  I’m interested to know if
> this is an issue with anyone else’s setup or if I have an issue in my local
> configuration that is still preventing it to work on start/restart.
>
> - Kevin
>
> > On Sep 5, 2015, at 8:45 AM, Dan Davis  wrote:
> >
> > Kevin & Noble,
> >
> > I'll take it on to test this.   I've built from source before, and I've
> > wanted this authorization capability for awhile.
> >
> > On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee 
> wrote:
> >
> >> Noble,
> >>
> >> Does SOLR-8000 need to be re-opened?  Has anyone else been able to test
> >> the restart fix?
> >>
> >> At startup, these are the log messages that say there is no security
> >> configuration and the plugins aren’t being used even though
> security.json
> >> is in Zookeeper:
> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer
> Security
> >> conf doesn't exist. Skipping setup for authorization module.
> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer No
> >> authentication plugin used.
> >>
> >> Thanks,
> >> Kevin
> >>
> >>> On Sep 4, 2015, at 5:47 AM, Noble Paul  wrote:
> >>>
> >>> There are no download links for 5.3.x branch  till we do a bug fix
> >> release
> >>>
> >>> If you wish to download the trunk nightly (which is not same as 5.3.0)
> >>> check here
> >>
> https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
> >>>
> >>> If you wish to get the binaries for 5.3 branch you will have to make it
> >>> (you will need to install svn and ant)
> >>>
> >>> Here are the steps
> >>>
> >>> svn checkout
> >> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
> >>> cd lucene_solr_5_3/solr
> >>> ant server
> >>>
> >>>
> >>>
> >>> On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian
> >>>  wrote:
> >>>> Hi Kevin/Noble,
> >>>>
> >>>> What is the download link to take the latest? What are the steps to
> >> compile
> >>>> it, test and use?
> >>>> We also have a use case to have this feature in solr too. Therefore,
> >> wanted
> >>>> to test and above info would help a lot to get started.
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>> On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee 
> >> wrote:
> >>>>
> >>>>> Thanks, I downloaded the source and compiled it and replaced the jar
> >> file
> >>>>> in the dist and solr-webapp’s WEB-INF/lib directory.  It does seem to
> >> be
> >>>>> protecting the Collections API reload command now as long as I upload
> >> the
> >>>>> security.json after startup of the Solr instances.  If I shutdown and
> >> bring
> >>>>> the instances back up, the security is no longer in place and I have
> to
> >>>>> upload the security.json again for it to take effect.
> >>>>>
> >>>>> - Kevin
> >>>>>
> >>>>>> On Sep 3, 2015, at 10:29 PM, Noble Paul 
> wrote:
> >>>>>>
> >>>>>> Both these are committed. If you could test with the latest 5.3
> branch
> >>>>>> it would be helpful
> >>>>>>
> >>>>>> On Wed, Sep 2, 2015 

Re: Issue Using Solr 5.3 Authentication and Authorization Plugins

2015-09-10 Thread Dan Davis
SOLR-8004 also appears to work for me.   I manually edited security.json and
did putfile.   I didn't bother with the browse permission, because it was
Kevin's workaround.   solr-5.3.1-SNAPSHOT did challenge me for credentials
when I hit
http://localhost:8983/solr/admin/collections?action=CREATE with curl, and so on...
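
For reference, the check was along these lines (collection name and
credentials are placeholders):

curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"
  (returns 401 Unauthorized without credentials)
curl -u solradmin:password "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"
  (succeeds when the user carries the collection-admin-edit role)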

On Thu, Sep 10, 2015 at 11:10 PM, Dan Davis  wrote:

> Kevin & Noble,
>
> I've manually verified the fix for SOLR-8000, but not yet for SOLR-8004.
>
> I reproduced the initial problem with reloading security.json after
> restarting both Solr and ZooKeeper.   I verified using zkcli.sh that
> ZooKeeper does retain the changes to the file after using
> /solr/admin/authorization, and that therefore the problem was Solr.
>
> After building solr-5.3.1-SNAPSHOT.tgz with ant package (because I don't
> know how to give parameters to ant server), I expanded it, copied in the
> core data, and then started it.   I was prompted for a password, and it let
> me in once the password was given.
>
> I'll probably get to SOLR-8004 shortly, since I have both environments
> built and working.
>
> It also occurs to me that it might be better to forbid all permissions and
> grant specific permissions to specific roles.   Is there a comprehensive
> list of the permissions available?
>
>
> On Tue, Sep 8, 2015 at 1:07 PM, Kevin Lee 
> wrote:
>
>> Thanks Dan!  Please let us know what you find.  I’m interested to know if
>> this is an issue with anyone else’s setup or if I have an issue in my local
>> configuration that is still preventing it to work on start/restart.
>>
>> - Kevin
>>
>> > On Sep 5, 2015, at 8:45 AM, Dan Davis  wrote:
>> >
>> > Kevin & Noble,
>> >
>> > I'll take it on to test this.   I've built from source before, and I've
>> > wanted this authorization capability for awhile.
>> >
>> > On Fri, Sep 4, 2015 at 9:59 AM, Kevin Lee 
>> wrote:
>> >
>> >> Noble,
>> >>
>> >> Does SOLR-8000 need to be re-opened?  Has anyone else been able to test
>> >> the restart fix?
>> >>
>> >> At startup, these are the log messages that say there is no security
>> >> configuration and the plugins aren’t being used even though
>> security.json
>> >> is in Zookeeper:
>> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer
>> Security
>> >> conf doesn't exist. Skipping setup for authorization module.
>> >> 2015-09-04 08:06:21.205 INFO  (main) [   ] o.a.s.c.CoreContainer No
>> >> authentication plugin used.
>> >>
>> >> Thanks,
>> >> Kevin
>> >>
>> >>> On Sep 4, 2015, at 5:47 AM, Noble Paul  wrote:
>> >>>
>> >>> There are no download links for 5.3.x branch  till we do a bug fix
>> >> release
>> >>>
>> >>> If you wish to download the trunk nightly (which is not same as 5.3.0)
>> >>> check here
>> >>
>> https://builds.apache.org/job/Solr-Artifacts-trunk/lastSuccessfulBuild/artifact/solr/package/
>> >>>
>> >>> If you wish to get the binaries for 5.3 branch you will have to make
>> it
>> >>> (you will need to install svn and ant)
>> >>>
>> >>> Here are the steps
>> >>>
>> >>> svn checkout
>> >> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3/
>> >>> cd lucene_solr_5_3/solr
>> >>> ant server
>> >>>
>> >>>
>> >>>
>> >>> On Fri, Sep 4, 2015 at 4:11 PM, davidphilip cherian
>> >>>  wrote:
>> >>>> Hi Kevin/Noble,
>> >>>>
>> >>>> What is the download link to take the latest? What are the steps to
>> >> compile
>> >>>> it, test and use?
>> >>>> We also have a use case to have this feature in solr too. Therefore,
>> >> wanted
>> >>>> to test and above info would help a lot to get started.
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>>
>> >>>> On Fri, Sep 4, 2015 at 1:45 PM, Kevin Lee > >
>> >> wrote:
>> >>>>
>> >>>>> Thanks, I downloaded the source and compiled it and replaced the jar
>> >> file
>> >>>>> in the dist and solr-webapp’s WEB-INF/lib directory.  It does seem
>> to
>> >> be
>> >>>

Re: Solr authentication - Error 401 Unauthorized

2015-09-12 Thread Dan Davis
It seems that you have secured Solr so thoroughly that you cannot now run
bin/solr status!

bin/solr has no arguments as yet for providing a username/password - as
someone who is mostly a user, like you, I'm not sure of the roadmap.

I think you should relax those restrictions a bit and try again.
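
As a stop-gap, you can hit the same endpoint bin/solr status uses and pass
the credentials yourself, e.g. (user and password are placeholders):

curl -u solradmin:password "http://localhost:8983/solr/admin/info/system?wt=json"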

On Fri, Sep 11, 2015 at 5:06 AM, Merlin Morgenstern <
merlin.morgenst...@gmail.com> wrote:

> I have secured solr cloud via basic authentication.
>
> Now I am having difficulties creating cores and getting status information.
> Solr keeps telling me that the request is unauthorized. However, I have
> access to the admin UI after login.
>
> How do I configure solr to use the basic authentication credentials?
>
> This is the error message:
>
> /opt/solr-5.3.0/bin/solr status
>
> Found 1 Solr nodes:
>
> Solr process 31114 running on port 8983
>
> ERROR: Failed to get system information from http://localhost:8983/solr
> due
> to: org.apache.http.client.ClientProtocolException: Expected JSON response
> from server but received: 
>
> 
>
> 
>
> Error 401 Unauthorized
>
> 
>
> HTTP ERROR 401
>
> Problem accessing /solr/admin/info/system. Reason:
>
> UnauthorizedPowered by
> Jetty://
>
>
> 
>
> 
>


Re: Solr authentication - Error 401 Unauthorized

2015-09-12 Thread Dan Davis
Noble,

You should also look at this if it is intended to be more than an internal
API.   Using the minor protections I added to test SOLR-8000, I was able to
reproduce a problem very like this:

bin/solr healthcheck -z localhost:2181 -c mycollection

Since Solr /select is protected, the healthcheck fails the same way.
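
For context, the "minor protection" in my security.json is roughly this (a
sketch - the names are made up), which is what the healthcheck's query runs
into:

{"name": "select", "collection": "mycollection", "path": "/select", "role": "searchRole"}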

On Sat, Sep 12, 2015 at 9:40 AM, Dan Davis  wrote:

> It seems that you have secured Solr so thoroughly that you cannot now run
> bin/solr status!
>
> bin/solr has no arguments as yet for providing a username/password - as a
> mostly user like you I'm not sure of the roadmap.
>
> I think you should relax those restrictions a bit and try again.
>
> On Fri, Sep 11, 2015 at 5:06 AM, Merlin Morgenstern <
> merlin.morgenst...@gmail.com> wrote:
>
>> I have secured solr cloud via basic authentication.
>>
>> Now I am having difficulties creating cores and getting status
>> information.
>> Solr keeps telling me that the request is unauthorized. However, I have
>> access to the admin UI after login.
>>
>> How do I configure solr to use the basic authentication credentials?
>>
>> This is the error message:
>>
>> /opt/solr-5.3.0/bin/solr status
>>
>> Found 1 Solr nodes:
>>
>> Solr process 31114 running on port 8983
>>
>> ERROR: Failed to get system information from http://localhost:8983/solr
>> due
>> to: org.apache.http.client.ClientProtocolException: Expected JSON response
>> from server but received: 
>>
>> 
>>
>> 
>>
>> Error 401 Unauthorized
>>
>> 
>>
>> HTTP ERROR 401
>>
>> Problem accessing /solr/admin/info/system. Reason:
>>
>> UnauthorizedPowered by
>> Jetty://
>>
>>
>> 
>>
>> 
>>
>
>


Re: Cloud Deployment Strategy... In the Cloud

2015-09-24 Thread Dan Davis
ant is very good at this sort of thing, and easier for Java devs to learn
than Make.   Python has a module called Fabric that is also very fine, but
for my dev ops folks it is another thing to learn.
I tend to divide things into three categories:

 - Things that have to do with system setup, and need to be run as root.
For this I write a bash script (I should learn puppet, but...)
 - Things that have to do with one-time installation as a Solr admin user
with /bin/bash, including upconfig.   For this I use an ant build (see the
sketch after this list).
 - Normal operational procedures.   For this, I typically use Solr admin or
scripts, but I wish I had time to create a good webapp (or money to
purchase Fusion).
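
For the second category, the upconfig piece of the ant build is roughly
this (a sketch - the zkcli.sh location, config directory and ZooKeeper
address are from my setup):

<target name="upconfig">
  <exec executable="/opt/solr/server/scripts/cloud-scripts/zkcli.sh"
        failonerror="true">
    <arg line="-zkhost localhost:2181 -cmd upconfig -confdir conf -confname mycollection"/>
  </exec>
</target>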


On Thu, Sep 24, 2015 at 12:39 AM, Erick Erickson 
wrote:

> bq: What tools do you use for the "auto setup"? How do you get your config
> automatically uploaded to zk?
>
> Both uploading the config to ZK and creating collections are one-time
> operations, usually done manually. Currently uploading the config set is
> accomplished with zkCli (yes, it's a little clumsy). There's a JIRA to put
> this into solr/bin as a command though. They'd be easy enough to script in
> any given situation though with a shell script or wizard
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:33 PM, Steve Davids  wrote:
>
> > What tools do you use for the "auto setup"? How do you get your config
> > automatically uploaded to zk?
> >
> > On Tue, Sep 22, 2015 at 2:35 PM, Gili Nachum 
> wrote:
> >
> > > Our auto setup sequence is:
> > > 1.deploy 3 zk nodes
> > > 2. Deploy solr nodes and start them connecting to zk.
> > > 3. Upload collection config to zk.
> > > 4. Call create collection rest api.
> > > 5. Done. SolrCloud ready to work.
> > >
> > > Don't yet have automation for replacing or adding a node.
> > > On Sep 22, 2015 18:27, "Steve Davids"  wrote:
> > >
> > > > Hi,
> > > >
> > > > I am trying to come up with a repeatable process for deploying a Solr
> > > Cloud
> > > > cluster from scratch along with the appropriate security groups, auto
> > > > scaling groups, and custom Solr plugin code. I saw that LucidWorks
> > > created
> > > > a Solr Scale Toolkit but that seems to be more of a one-shot deal
> than
> > > > really setting up your environment for the long-haul. Here is were we
> > are
> > > > at right now:
> > > >
> > > >1. ZooKeeper ensemble is easily brought up via a Cloud Formation
> > > Script
> > > >2. We have an RPM built to lay down the Solr distribution + Custom
> > > >plugins + Configuration
> > > >3. Solr machines come up and connect to ZK
> > > >
> > > > Now, we are using Puppet which could easily create the
> core.properties
> > > file
> > > > for the corresponding core and have ZK get bootstrapped but that
> seems
> > to
> > > > be a no-no these days... So, can anyone think of a way to get ZK
> > > > bootstrapped automatically with pre-configured Collection
> > configurations?
> > > > Also, is there a recommendation on how to deal with machines that are
> > > > coming/going? As I see it machines will be getting spun up and
> > terminated
> > > > from time to time and we need to have a process of dealing with that,
> > the
> > > > first idea was to just use a common node name so if a machine was
> > > > terminated a new one can come up and replace that particular node but
> > on
> > > > second thought it would seem to require an auto scaling group *per*
> > node
> > > > (so it knows what node name it is). For a large cluster this seems
> > crazy
> > > > from a maintenance perspective, especially if you want to be elastic
> > with
> > > > regard to the number of live replicas for peak times. So, then the
> next
> > > > idea was to have some outside observer listen to when new ec2
> instances
> > > are
> > > > created or terminated (via CloudWatch SQS) and make the appropriate
> API
> > > > calls to either add the replica or delete it, this seems doable but
> > > perhaps
> > > > not the simplest solution that could work.
> > > >
> > > > I was hoping others have already gone through this and have valuable
> > > advice
> > > > to give, we are trying to setup Solr Cloud the "right way" so we
> don't
> > > get
> > > > nickel-and-dimed to death from an O&M perspective.
> > > >
> > > > Thanks,
> > > >
> > > > -Steve
> > > >
> > >
> >
>


Re: Customzing Solr Dedupe

2015-04-01 Thread Dan Davis
But you can potentially still use Solr dedupe if you do the upfront work
(in RDBMS or NoSQL pre-index processing) to assign some sort of "Group ID".
  See OCLC's FRBR Work-Set Algorithm,
http://www.oclc.org/content/dam/research/activities/frbralgorithm/2009-08.pdf?urlm=161376
, for some details on one such algorithm.

If the job is too big for RDBMS, and/or you don't want to use/have a
suitable NoSQL, you can have two Solr indexes (collection/core/whatever) -
one for classification with only id, field1, field2, field3, and another
for production query.   Then, you put stuff into the classification index,
use queries and your own algorithm to do classification, assigning a
groupId, and then put the document with the groupId assigned into the
production index.
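
As a rough illustration of the classification step (field names follow the
example above, everything else is made up):

curl 'http://localhost:8983/solr/classification/select?q=field1:"..."+AND+field2:"..."&fl=id,field3&wt=json'

Your own code then compares field3 of the hits against the incoming record,
decides whether it joins an existing group or starts a new one, and indexes
the record with that groupId into the production collection.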

A key question is whether you want to preserve the groupId.   In some
cases, you do, and in some cases, it is just an internal signature.   In
both cases, a non-deterministic up-front algorithm can work, but if the
groupId needs to be preserved, you need to work harder to make sure it all
hangs together.

Hope this helps,

-Dan

On Wed, Apr 1, 2015 at 7:05 AM, Jack Krupansky 
wrote:

> Solr dedupe is based on the concept of a signature - some fields and rules
> that reduce a document into a discrete signature, and then checking if that
> signature exists as a document key that can be looked up quickly in the
> index. That's the conceptual basis. It is not based on any kind of field by
> field comparison to all existing documents.
>
> -- Jack Krupansky
>
> On Wed, Apr 1, 2015 at 6:35 AM, thakkar.aayush 
> wrote:
>
> > I'm facing a challenges using de-dupliation of Solr documents.
> >
> > De-duplicate is done using TextProfileSignature with following
> parameters:
> > field1, field2, field3
> > 0.5
> > 3
> >
> > Here Field3 is normal text with few lines of data.
> > Field1 and Field2 can contain upto 5 or 6 words of data.
> >
> > I want to de-duplicate when data in field1 and field2 are exactly the
> same
> > and 90% of the lines in field3 is matched to that in another document.
> >
> > Is there anyway to achieve this?
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Customzing-Solr-Dedupe-tp4196879.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: What is the best way of Indexing different formats of documents?

2015-04-07 Thread Dan Davis
Sangeetha,

You can also run Tika directly from data import handler, and Data Import
Handler can be made to run several threads if you can partition the input
documents by directory or database id.   I've done 4 "threads" by having a
base configuration that does an Oracle query like this:

  SELECT * FROM (SELECT id, url, ..., MOD(rowNum, 4) AS threadid FROM ...
WHERE ...) WHERE threadid = %d

A bash/sed script writes several data import handler XML files.
I can then index several threads at a time.
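
The generated files differ only in the threadid constant, e.g. (a sketch -
connection details and column names are placeholders):

<dataConfig>
  <dataSource type="JdbcDataSource" driver="oracle.jdbc.OracleDriver"
              url="jdbc:oracle:thin:@..." user="..." password="..."/>
  <document>
    <entity name="doc"
            query="SELECT * FROM (SELECT id, url, ..., MOD(rowNum, 4) AS threadid
                   FROM ... WHERE ...) WHERE threadid = 0">
      ...
    </entity>
  </document>
</dataConfig>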

Each of these threads can then use all the transformers, e.g.
templateTransformer, etc.
XML can be transformed via XSLT.

The Data Import Handler has other entities that go out to the web and then
index the document via Tika.

If you are indexing generic HTML, you may want to figure out an approach to
SOLR-3808 and SOLR-2250 - this can be resolved by recompiling Solr and Tika
locally, because Boilerpipe has a bug that has been fixed, but not pushed
to Maven Central.   Without that, the ASF cannot include the fix, but
distributions such as LucidWorks Solr Enterprise can.

I can drop some configs into github.com if I clean them up to obfuscate
host names, passwords, and such.


On Tue, Apr 7, 2015 at 9:14 AM, Yavar Husain  wrote:

> Well have indexed heterogeneous sources including a variety of NoSQL's,
> RDBMs and Rich Documents (PDF Word etc.) using SolrJ. The only prerequisite
> of using SolrJ is that you should have an API to fetch data from your data
> source (Say JDBC for RDBMS, Tika for extracting text content from rich
> documents etc.) than SolrJ is so damn great and simple. Its as simple as
> downloading the jar and few lines of code to send data to your solr server
> after pre-processing your data. More details here:
>
> http://lucidworks.com/blog/indexing-with-solrj/
>
> https://wiki.apache.org/solr/Solrj
>
> http://www.solrtutorial.com/solrj-tutorial.html
>
> Cheers,
> Yavar
>
>
>
> On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com <
> sangeetha.subraman...@gtnexus.com> wrote:
>
> > Hi,
> >
> > I am a newbie to SOLR and basically from database background. We have a
> > requirement of indexing files of different formats (x12,edifact,
> csv,xml).
> > The files which are inputted can be of any format and we need to do a
> > content based search on it.
> >
> > From the web I understand we can use TIKA processor to extract the
> content
> > and store it in SOLR. What I want to know is, is there any better
> approach
> > for indexing files in SOLR ? Can we index the document through streaming
> > directly from the Application ? If so what is the disadvantage of using
> it
> > (against DIH which fetches from the database)? Could someone share me
> some
> > insight on this ? ls there any web links which I can refer to get some
> idea
> > on it ? Please do help.
> >
> > Thanks
> > Sangeetha
> >
> >
>


Re: Securing solr index

2015-04-13 Thread Dan Davis
Where you want true Role-Based Access Control (RBAC) on each index (core or
collection), one solution is to buy Solr Enterprise from LucidWorks.

My personal practice is mostly dictated by financial decisions:

   - Each core/index has its configuration directory in a Git
   repository/branch where the Git repository software provides RBAC.
   - This relies on developers to keep a separate Solr for development, and
   then to check-in their configuration directory changes when they are
   satisfied with the changes.   This is probably a best practice anyway :)
   - "Continuous Integration" pushes the Git configuration appropriately
   when a particular branch changes.
   - The main URL "/solr" has security provided by Apache httpd on port 80
   (a reverse proxy to http://localhost:8983/solr/)
   - That port is also open, secured by IP address, to other Solr nodes in
   the cluster.
   - The /select request Handler for each core/collection is reverse
   proxied to "/search/".
   - The Solr Admin UI uses an authentication/authorization handler such that
   only the "Search Administrators" group has access to it.

The security here relies on search developers not enabling "handleSelect"
in their solrconfig.xml.   The security can also be extended by adding
security on reverse proxied URLs such as "/search/" and
"/update/" so that the client application needs to know some key,
or have access to an SSL private key file.

The downside is that only "Search Administrators" group has access to the
QA or production Solr Admin UI.
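
The httpd side of this is small; roughly (a sketch with placeholder names -
mod_proxy must be enabled, and the real config adds the authentication
directives and the per-URL keys mentioned above):

ProxyPass        /search/mycore http://localhost:8983/solr/mycore/select
ProxyPassReverse /search/mycore http://localhost:8983/solr/mycore/select

<Location /solr/>
    # AuthType/AuthName/AuthUserFile or an SSO module goes here
    Require group search-admins
</Location>
ProxyPass        /solr/ http://localhost:8983/solr/
ProxyPassReverse /solr/ http://localhost:8983/solr/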


On Mon, Apr 13, 2015 at 6:13 AM, Suresh Vanasekaran <
suresh_vanaseka...@infosys.com> wrote:

> Hi,
>
> We are having the solr index maintained in a central server and multiple
> users might be able to access the index data.
>
> May I know what are best practice for securing the solr index folder where
> ideally only application user should be able to access. Even an admin user
> should not be able to copy the data and use it in another schema.
>
> Thanks
>
>
>
>  CAUTION - Disclaimer *
> This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended
> solely
> for the use of the addressee(s). If you are not the intended recipient,
> please
> notify the sender by e-mail and delete the original message. Further, you
> are not
> to copy, disclose, or distribute this e-mail or its contents to any other
> person and
> any such actions are unlawful. This e-mail may contain viruses. Infosys
> has taken
> every reasonable precaution to minimize this risk, but is not liable for
> any damage
> you may sustain as a result of any virus in this e-mail. You should carry
> out your
> own virus checks before opening the e-mail or attachment. Infosys reserves
> the
> right to monitor and review the content of all messages sent to or from
> this e-mail
> address. Messages sent to or from this e-mail address may be stored on the
> Infosys e-mail system.
> ***INFOSYS End of Disclaimer INFOSYS***
>


Re: Odp.: solr issue with pdf forms

2015-04-22 Thread Dan Davis
+1 - I like Erick's answer.  Let me know if that turns out to be the
problem - I'm interested in this problem and would be happy to help.

On Wed, Apr 22, 2015 at 11:11 AM, Erick Erickson 
wrote:

> Are they not _indexed_ correctly or not being displayed correctly?
> Take a look at admin UI>>schema browser>> your field and press the
> "load terms" button. That'll show you what is _in_ the index as
> opposed to what the raw data looked like.
>
> When you return the field in a Solr search, you get a verbatim,
> un-analyzed copy of your original input. My guess is that your browser
> isn't using the compatible character encoding for display.
>
> Best,
> Erick
>
> On Wed, Apr 22, 2015 at 7:08 AM,   wrote:
> > Thanks for your answer. Maybe my English is not good enough, what are
> you trying to say? Sorry I didn't get the point.
> > :-(
> >
> >
> > -Ursprüngliche Nachricht-
> > Von: LAFK [mailto:tomasz.bo...@gmail.com]
> > Gesendet: Mittwoch, 22. April 2015 14:01
> > An: solr-user@lucene.apache.org; solr-user@lucene.apache.org
> > Betreff: Odp.: solr issue with pdf forms
> >
> > Out of my head I'd follow how are writable PDFs created and encoded.
> >
> > @LAFK_PL
> >   Oryginalna wiadomość
> > Od: steve.sch...@t-systems.com
> > Wysłano: środa, 22 kwietnia 2015 12:41
> > Do: solr-user@lucene.apache.org
> > Odpowiedz: solr-user@lucene.apache.org
> > Temat: solr issue with pdf forms
> >
> > Hi guys,
> >
> > hopefully you can help me with my issue. We are using a solr setup and
> have the following issue:
> > - usual pdf files are indexed just fine
> > - pdf files with writable form-fields look like this:
> >
> Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�vollständig�sind
> >
> > Somehow the blank space character is not indexed correctly.
> >
> > Is this a know issue? Does anybody have an idea?
> >
> > Thanks a lot
> > Best
> > Steve
>


Re: solr issue with pdf forms

2015-04-22 Thread Dan Davis
Steve,

Are you using ExtractingRequestHandler / DataImportHandler or extracting
the text content from the PDF outside of Solr?

On Wed, Apr 22, 2015 at 6:40 AM,  wrote:

> Hi guys,
>
> hopefully you can help me with my issue. We are using a solr setup and
> have the following issue:
> - usual pdf files are indexed just fine
> - pdf files with writable form-fields look like this:
>
> Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�vollständig�sind
>
> Somehow the blank space character is not indexed correctly.
>
> Is this a know issue? Does anybody have an idea?
>
> Thanks a lot
> Best
> Steve
>


Re: Odp.: solr issue with pdf forms

2015-04-23 Thread Dan Davis
Steve,

You gave as an example:

Ich�bestätige�mit�meiner�Unterschrift,�dass�alle�Angaben�korrekt�und�
vollständig�sind

This sentence is probably from the PDF form label content, rather than form
values.   Sometimes in PDF, the form's value fields are kept in a separate
file.   I'm 99% sure Tika won't be able to handle that, because it handles
one file at a time.   If the form's value fields are in the PDF, Tika
should be able to handle it, but may be making some small errors that could
be addressed.

When you look at the form in Acrobat Reader, can you see whether the
indexed words contain any words from the form fields' values?

If you have a form where the data is not sensitive, I can investigate.   If
you are interested in this contact me offline - to dansm...@gmail.com or
d...@danizen.net.

Thanks,

Dan

On Thu, Apr 23, 2015 at 11:59 AM, Erick Erickson 
wrote:

> When you say "they're not indexed correctly", what's your evidence?
> You cannot rely
> on the display in the browser, that's the raw input just as it was
> sent to Solr, _not_
> the actual tokens in the index. What do you see when you go to the admin
> schema browser pate and load the actual tokens.
>
> Or use the TermsComponent
> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> to see the actual terms in the index as opposed to the stored data you
> see in the browser
> when you look at search results.
>
> If the actual terms don't seem right _in the index_ we need to see
> your analysis chain,
> i.e. your fieldType definition.
>
> I'm, 90% sure you're seeing the stored data and your terms are indexed
> just fine, but
> I've certainly been wrong before, more times than I want to remember.
>
> Best,
> Erick
>
> On Thu, Apr 23, 2015 at 1:18 AM,   wrote:
> > Hey Erick,
> >
> > thanks for your answer. They are not indexed correctly. Also throught
> the solr admin interface I see these typical questionmarks within a rhombus
> where a blank space should be.
> > I now figured out the following (not sure if it is relevant at all):
> > - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
> indexed correctly, no issues
> > - PDF documents (with editable form fields) created with "Adobe InDesign
> CS5 (7.0.1)"  are indexed with the blank space issue
> >
> > Best
> > Steve
> >
> > -Ursprüngliche Nachricht-
> > Von: Erick Erickson [mailto:erickerick...@gmail.com]
> > Gesendet: Mittwoch, 22. April 2015 17:11
> > An: solr-user@lucene.apache.org
> > Betreff: Re: Odp.: solr issue with pdf forms
> >
> > Are they not _indexed_ correctly or not being displayed correctly?
> > Take a look at admin UI>>schema browser>> your field and press the "load
> terms" button. That'll show you what is _in_ the index as opposed to what
> the raw data looked like.
> >
> > When you return the field in a Solr search, you get a verbatim,
> un-analyzed copy of your original input. My guess is that your browser
> isn't using the compatible character encoding for display.
> >
> > Best,
> > Erick
> >
> > On Wed, Apr 22, 2015 at 7:08 AM,   wrote:
> >> Thanks for your answer. Maybe my English is not good enough, what are
> you trying to say? Sorry I didn't get the point.
> >> :-(
> >>
> >>
> >> -Ursprüngliche Nachricht-
> >> Von: LAFK [mailto:tomasz.bo...@gmail.com]
> >> Gesendet: Mittwoch, 22. April 2015 14:01
> >> An: solr-user@lucene.apache.org; solr-user@lucene.apache.org
> >> Betreff: Odp.: solr issue with pdf forms
> >>
> >> Out of my head I'd follow how are writable PDFs created and encoded.
> >>
> >> @LAFK_PL
> >>   Oryginalna wiadomość
> >> Od: steve.sch...@t-systems.com
> >> Wysłano: środa, 22 kwietnia 2015 12:41
> >> Do: solr-user@lucene.apache.org
> >> Odpowiedz: solr-user@lucene.apache.org
> >> Temat: solr issue with pdf forms
> >>
> >> Hi guys,
> >>
> >> hopefully you can help me with my issue. We are using a solr setup and
> have the following issue:
> >> - usual pdf files are indexed just fine
> >> - pdf files with writable form-fields look like this:
> >> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und v
> >> ollständig sind
> >>
> >> Somehow the blank space character is not indexed correctly.
> >>
> >> Is this a know issue? Does anybody have an idea?
> >>
> >> Thanks a lot
> >> Best
> >> Steve
>


Re: analyzer, indexAnalyzer and queryAnalyzer

2015-04-30 Thread Dan Davis
Hi Doug, nice write-up and 2 questions:

- You write your own QParser plugins - can one keep the features of edismax
for field boosting/phrase-match boosting by subclassing edismax?   Assuming
yes...

- What do pf2 and pf3 do in the edismax query parser?

hon-lucene-synonyms plugin links corrections:

http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
https://github.com/healthonnet/hon-lucene-synonyms


On Wed, Apr 29, 2015 at 9:24 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> So Solr has the idea of a query parser. The query parser is a convenient
> way of passing a search string to Solr and having Solr parse it into
> underlying Lucene queries: You can see a list of query parsers here
> http://wiki.apache.org/solr/QueryParser
>
> What this means is that the query parser does work to pull terms into
> individual clauses *before* analysis is run. It's a parsing layer that sits
> outside the analysis chain. This creates problems like the "sea biscuit"
> problem, whereby we declare "sea biscuit" as a query time synonym of
> "seabiscuit". As you may know synonyms are checked during analysis.
> However, if the query parser splits up "sea" from "biscuit" before running
> analysis, the query time analyzer will fail. The string "sea" is brought by
> itself to the query time analyzer and of course won't match "sea biscuit".
> Same with the string "biscuit" in isolation. If the full string "sea
> biscuit" was brought to the analyzer, it would see [sea] next to [biscuit]
> and declare it a synonym of seabiscuit. Thanks to the query parser, the
> analyzer has lost the association between the terms, and both terms aren't
> brought together to the analyzer.
>
> My colleague John Berryman wrote a pretty good blog post on this
>
> http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/
>
> There's several solutions out there that attempt to address this problem.
> One from Ted Sullivan at Lucidworks
>
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>
> Another popular one is the hon-lucene-synonyms plugin:
>
> http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html
>
> Yet another work-around is to use the field query parser:
>
> http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/search/FieldQParserPlugin.html
>
> I also tend to write my own query parsers, so on the one hand its annoying
> that query parsers have the problems above, on the flipside Solr makes it
> very easy to implement whatever parsing you think is appropriatte with a
> small bit of Java/Lucene knowledge.
>
> Hopefully that explanation wasn't too deep, but its an important thing to
> know about Solr. Are you asking out of curiosity, or do you have a specific
> problem?
>
> Thanks
> -Doug
>
> On Wed, Apr 29, 2015 at 6:32 PM, Steven White 
> wrote:
>
> > Hi Doug,
> >
> > I don't understand what you mean by the following:
> >
> > > For example, if a user searches for q=hot dogs&defType=edismax&qf=title
> > > body the *query parser* *not* the *analyzer* first turns the query
> into:
> >
> > If I have indexAnalyzer and queryAnalyzer in a fieldType that are 100%
> > identical, the example you provided, does it stand?  If so, why?  Or do
> you
> > mean something totally different by "query parser"?
> >
> > Thanks
> >
> > Steve
> >
> >
> > On Wed, Apr 29, 2015 at 4:18 PM, Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> >
> > > *> 1) If the content of indexAnalyzer and queryAnalyzer are exactly the
> > > same,that's the same as if I have an analyzer only, right?*
> > > 1) Yes
> > >
> > > *>  2) Under the hood, all three are the same thing when it comes to
> what
> > > kind*
> > > *of data and configuration attributes can take, right?*
> > > 2) Yes. Both take in text and output a token stream.
> > >
> > > *>What I'm trying to figure out is this: beside being able to configure
> > a*
> > >
> > > *fieldType to have different analyzer setting at index and query time,
> > > thereis nothing else that's unique about each.*
> > >
> > > The only thing to look out for in Solr land is the query parser. Most
> > Solr
> > > query parsers treat whitespace as meaningful.
> > >
> > > For example, if a user searches for q=hot dogs&defType=edismax&qf=title
> > > body the *query parser* *not* the *analyzer* first turns the query
> into:
> > >
> > > (title:hot title:dog) | (body:hot body:dog)
> > >
> > > each word which *then *gets analyzed. This is because the query parser
> > > tries to be smart and turn "hot dog" into hot OR dog, or more
> > specifically
> > > making them two must clauses.
> > >
> > > This trips quite a few folks up, you can use the field query parser
> which
> > > uses the field as a phrase query. Hope that helps
> > >
> > >
> > > --
> > > *Doug Turnbull **| *Search Relevance Consultant | OpenSource
> Connections,
> > > LLC | 240.476.9983 | http://www.opensou

Tika Integration problem with DIH and JDBC

2014-10-10 Thread Dan Davis
What I want to do is to pull a URL out of an Oracle database, and then use
TikaEntityProcessor and BinURLDataSource to go fetch and process that
URL.   I'm having a problem with this that seems general to JDBC with Tika
- I get an exception as follows:

Exception in entity :
extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query:
http://www.cdc.gov/healthypets/pets/wildlife.html Processing Document
# 14
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
...

Steps to reproduce the problem should be:


   - Try it with the XML and verify you get two documents and they contain
   text (schema browser with the text field)
   - Try it with a JDBC sqlite3 dataSource and verify that you get an
   exception, and advise me what may be the problem in my configuration ...

Now, I've tried this 3 ways:


   - My Oracle database - fails as above
   - An SQLite3 database to see if it is Oracle specific - fails with
   "Unable to execute query", but doesn't have the URL as part of the message.
   - An XML file listing two URLs - succeeds without error.

For the SQL attempts, setting onError="skip" leads the data from the
database to be indexed, but the exception is logged for each root entity.
I can tell that nothing is indexed from the text extraction by browsing the
"text" field from the schema browser and seeing how few terms there are.
The exceptions also sort of give it away, but it is good to be careful :)

This is using:

   - Tomcat 7.0.55
   - Solr 4.10.1
   - and JDBC drivers
  - ojdbc7.jar
  - sqlite-jdbc-3.7.2.jar

Excerpt of solrconfig.xml:

  
  

  dih-healthtopics.xml

  

  
  

  dih-smallxml.xml

  


  

  dih-smallsqlite.xml

  


The data import handlers and a copy-paste from Solr logging are attached.
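
(The XML markup in the solrconfig.xml excerpt above was stripped in
transit; each of the three handlers has the usual DataImportHandler shape,
roughly as below - the handler name here is made up, and the other two
differ only in the config file name:)

<requestHandler name="/dataimport-healthtopics"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">dih-healthtopics.xml</str>
  </lst>
</requestHandler>
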
Exception in entity : 
extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable 
to execute query:  Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:283)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240)
at 
org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44)
at 
org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188)
at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:502)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
at 
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:189)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at 
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
at 
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.p

Re: Tika Integration problem with DIH and JDBC

2014-10-10 Thread Dan Davis
Thanks, Alexandre.   My role is to kick the tires on this.   We're trying
it a couple of different ways.   So, I'm going to assume this could be
resolved and move on to trying ManifoldCF to see whether it can do similar
things for me, e.g. what it adds for free to our bag of tricks.

On Fri, Oct 10, 2014 at 3:16 PM, Alexandre Rafalovitch 
wrote:

> I would concentrate on the stack traces and try reading them. They
> often provide a lot of clues. For example, you original stack trace
> had
>
>
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:283)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240)
> 2) at
> org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44)
> at
> org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188)
> 1) at
> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112)
> at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>
> I added 1) and 2) to show the lines of importance. You can see in 1)
> that your TikaEntityProcessor is calling 2) JdbcDataSource, which was
> not what you wanted as you specified BinDataSource. So, you focus on
> that until it gets resolved.
>
> Sometimes these happens when the XML file says 'datasource' instead of
> 'dataSource' (DIH is case-sensitive), but it does not seem to be the
> case in your situation.
>
> Regards,
> Alex.
> P.s. If you still haven't figure it out, mention the Solr version on
> the next email. Sometimes it makes difference, though DIH has been
> largely unchanged for a while.
>
> -- Forwarded message --
> From: Dan Davis 
> Date: 10 October 2014 15:00
> Subject: Re: Tika Integration problem with DIH and JDBC
> To: Alexandre Rafalovitch 
>
>
> The definition of dataSource name="bin" type="BinURLDataSource" is in
> each of the dih-*.xml files.
> But only the xml version has the definition at the top, above the document.
>
> Moving the dataSource definition to the top does change the behavior,
> now I get the following error for that entity:
>
> Exception in entity :
> extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
> JDBC URL or JNDI name has to be specified Processing Document # 30
>
> When I changed it to specify url="", it then reverted to form:
>
> Exception in entity :
> extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
> Unable to execute query: http://www.cdc.gov/flu/swineflu/ Processing
> Document # 1
> at
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
>
> It does seem to be a problem resolving the dataSource in some way.   I
> did double check another part of solrconfig.xml therefore.   Since the
> XML example still works, I guess I know it has to be there.
>
>regex="solr-dataimporthandler-.*\.jar" />
>
>   
>   
>
>   
>   
>
>   
>   
>
>   
>   
>
>
> On Fri, Oct 10, 2014 at 2:37 PM, Alexandre Rafalovitch
>  wrote:
> >
> > You say "dataSource='bin'" but I don't see you defining that datasource.
> E.g.:
> >
> > 
> >
> > So, there might be some weird default fallback that's just causes
> > strange problems.
> >
> > Regards,
> > Alex.
> >
> > Personal: http://www.outerthoughts.com/ and @arafalov
> > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
> >
> >
> > On 10 October 2014 14:17, Dan Davis  wrote:
> > >
> > > What I want to do is to pull an URL out of an Oracle database, and
> then use
> > > TikaEntityProcessor and BinURLDataSource to go fetch and process that
> URL.
> > > I'm having a problem with this that seems general to JDBC with Tika -
> I get
> > > an exception as follows:
> > >
> > > Exception in entity :
> > > extract:org.apache.solr.handler.dataimport.DataImportHandlerException:
> > > Unable to execute query:
> http://www.cdc.gov/healthypets/pets/wildlife.html
> > > Processing Document # 14
> > >   at
> > >
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
> > > ...
> > >
> > > S

Re: Problem with DIH

2014-10-16 Thread Dan Davis
This seems a little abstract.   What I'd do is double check that the SQL is
working correctly by running the stored procedure outside of Solr and see
what you get.   You should also be able to look at the corresponding
.properties file and see the inputs used for the delta import.  If the data
import XML is called "dih-example.xml", then the properties file should be
called "dih-example.properties" and be in the same conf directory (for the
collection).   Example contents are:

#Fri Oct 10 14:53:44 EDT 2014
last_index_time=2014-10-10 14\:53\:44
healthtopic.last_index_time=2014-10-10 14\:53\:44
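
Those values feed DIH's ${dataimporter.last_index_time} variable, so the
delta side of the entity usually references it along these lines (a sketch -
the table and column names are made up, and in your case the stored
procedure would take it as a parameter):

deltaQuery="SELECT id FROM documents
            WHERE last_modified > '${dataimporter.last_index_time}'"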

Again, I'm suggesting you double check that the SQL is working correctly.
If that isn't the problem, provide more details on your data import
handler, e.g. the XML with some modifications (no passwords).

On Thu, Oct 16, 2014 at 2:11 AM, Jay Potharaju 
wrote:

> Hi
> I 'm using DIH for updating my core. I 'm using store procedure for doing a
> full/ delta imports. In order to avoid running delta imports for a long
> time, i limit the rows returned to a max of 100,000 rows at a given time.
> On an average the delta import runs for less than 1 minute.
>
> For the last couple of days I have been noticing that my delta imports has
> been running for couple of hours and tries to update all the records in the
> core. I 'm not sure why that has been happening. I cant reproduce this
> event all the time, it happens randomly.
>
> Has anyone noticed this kind of behavior. And secondly are there any solr
> logs that will tell me what is getting updated or what exactly is happening
> at the DIH ?
> Any suggestion appreciated.
>
> Document size: 20 million
> Solr 4.9
> 3 Nodes in the solr cloud.
>
>
> Thanks
> J
>


Re: import solr source to eclipse

2014-10-16 Thread Dan Davis
I had a problem with the "ant eclipse" answer - it was unable to resolve
"javax.activation" for the Javadoc.  Updating
solr/contrib/dataimporthandler-extras/ivy.xml
as follows did the trick for me:

-  
+  

What I'm trying to do is to construct a failing unit test for something
that I think is a bug.   But the first thing is to be able to run tests,
probably in Eclipse, though the command line might be good enough, if not
ideal.
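
For the command line, running a single test class is enough for now (the
class and method names below are placeholders):

cd lucene-solr-trunk/solr
ant test -Dtestcase=YourFailingTest -Dtests.method=yourTestMethod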


On Tue, Oct 14, 2014 at 10:38 AM, Erick Erickson 
wrote:

> I do exactly what Anurag mentioned, but _only_ when what
> I want to debug is, for some reason, not accessible via unit
> tests. It's very easy to do.
>
> It's usually much faster though to use unit tests, which you
> should be able to run from eclipse without starting a server
> at all. In IntelliJ, you just ctrl-click on the file and the menu
> gives you a choice of running or debugging the unit test, I'm
> sure Eclipse does something similar.
>
> There are zillions of units to choose from, and for new development
> it's a Good Thing to write the unit test first...
>
> Good luck!
> Erick
>
> On Tue, Oct 14, 2014 at 1:37 AM, Anurag Sharma  wrote:
> > Another alternative is launch the jetty server from outside and attach it
> > remotely from eclipse.
> >
> > java -Xdebug
> -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=7666
> > -jar start.jar
> > The above command waits until the application attach succeed.
> >
> >
> > On Tue, Oct 14, 2014 at 12:56 PM, Rajani Maski 
> > wrote:
> >
> >> Configure eclipse with Jetty plugin. Create a Solr folder under your
> >> Solr-Java-Project and Run the project [Run as] on Jetty Server.
> >>
> >> This blog[1] may help you to configure Solr within eclipse.
> >>
> >>
> >> [1]
> >>
> http://hokiesuns.blogspot.in/2010/01/setting-up-apache-solr-in-eclipse.html
> >>
> >> On Tue, Oct 14, 2014 at 12:06 PM, Ali Nazemian 
> >> wrote:
> >>
> >> > Thank you very much for your guides but how can I run solr server
> inside
> >> > eclipse?
> >> > Best regards.
> >> >
> >> > On Mon, Oct 13, 2014 at 8:02 PM, Rajani Maski 
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > The best tutorial for setting up Solr[solr 4.7] in
> eclipse/intellij  is
> >> > > documented in Solr In Action book, Apendix A, *Working with the Solr
> >> > > codebase*
> >> > >
> >> > >
> >> > > On Mon, Oct 13, 2014 at 6:45 AM, Tomás Fernández Löbbe <
> >> > > tomasflo...@gmail.com> wrote:
> >> > >
> >> > > > The way I do this:
> >> > > > From a terminal:
> >> > > > svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk/
> >> > > > lucene-solr-trunk
> >> > > > cd lucene-solr-trunk
> >> > > > ant eclipse
> >> > > >
> >> > > > ... And then, from your Eclipse "import existing java project",
> and
> >> > > select
> >> > > > the directory where you placed lucene-solr-trunk
> >> > > >
> >> > > > On Sun, Oct 12, 2014 at 7:09 AM, Ali Nazemian <
> alinazem...@gmail.com
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > Hi,
> >> > > > > I am going to import solr source code to eclipse for some
> >> development
> >> > > > > purpose. Unfortunately every tutorial that I found for this
> purpose
> >> > is
> >> > > > > outdated and did not work. So would you please give me some hint
> >> > about
> >> > > > how
> >> > > > > can I import solr source code to eclipse?
> >> > > > > Thank you very much.
> >> > > > >
> >> > > > > --
> >> > > > > A.Nazemian
> >> > > > >
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > A.Nazemian
> >> >
> >>
>


Re: javascript form data save to XML in server side

2014-10-22 Thread Dan Davis
I always, always have a web application running that accepts the JavaScript
AJAX call and then forwards it on to the Apache Solr request handler.   Even
if you don't control the web application, and can only add JavaScript, you
can put up an API-oriented webapp somewhere that protects Solr and exposes
only a couple of POST endpoints.   Then, you can use CORS or JSONP to
facilitate interaction between the main web application and the ancillary
webapp providing APIs for Solr integration.

Of course, this only applies if you don't control the primary
application.   If you can use a Drupal or TYPO3 to front-end Solr, then
this is a great way to solve the problem.

On Mon, Oct 20, 2014 at 11:02 PM, LongY  wrote:

> thank you very much. Alex. You reply is very informative and I really
> appreciate it. I hope I would be able to help others in this forum like you
> are in the future.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/javascript-form-data-save-to-XML-in-server-side-tp4165025p4165066.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Tika Integration problem with DIH and JDBC

2014-11-04 Thread Dan Davis
All,

The problem here was that I gave driver="BinURLDataSource" rather than
type="BinURLDataSource".   Of course, specifying driver="BinURLDataSource"
meant that DIH could not resolve the data source.
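
For the record, the working wiring looks roughly like this (a sketch - the
parent entity and field details are trimmed):

<dataSource name="bin" type="BinURLDataSource"/>
...
<entity name="extract" processor="TikaEntityProcessor" dataSource="bin"
        url="${healthtopic.url}" format="text">
  <field column="text" name="text"/>
</entity>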


Best Practices for open source pipeline/connectors

2014-11-04 Thread Dan Davis
I'm trying to do research for my organization on the best practices for
open source pipeline/connectors.   Since we need web crawls, file system
crawls, and database crawls, it seems to me that ManifoldCF might be the
best fit.

Has anyone combined ManifoldCF with Solr UpdateRequestProcessors or
DataImportHandler?   It would be nice to decide in ManifoldCF which
resultHandler should receive a document or id; barring that, you could post
some fields including a URL and have Data Import Handler handle it - it
already supports scripts, whereas ManifoldCF may not at this time.

Suggestions and ideas?

Thanks,

Dan


Re: Best Practices for open source pipeline/connectors

2014-11-04 Thread Dan Davis
We are looking at LucidWorks, but also want to see what we can do on our
own so we can evaluate the value-add of LucidWorks among other products.

On Tue, Nov 4, 2014 at 4:13 PM, Alexandre Rafalovitch 
wrote:

> And, just to get the stupid question out of the way, you prefer to pay
> in developer integration time rather than in purchase/maintenance
> fees?
>
> Because, otherwise, I would look at LucidWorks commercial offering
> first, even to just have a comparison.
>
> Regards,
>Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 4 November 2014 16:01, Dan Davis  wrote:
> > I'm trying to do research for my organization on the best practices for
> > open source pipeline/connectors.   Since we need Web Crawls, File System
> > crawls, and Databases, it seems to me that Manifold CF might be the best
> > case.
> >
> > Has anyone combined ManifestCF with Solr UpdateRequestProcessors or
> > DataImportHandler?   It would be nice to decide in ManifestCF which
> > resultHandler should receive a document or id, barring that, you can post
> > some fields including an URL and have Data Import Handler handle it - it
> > already supports scripts whereas ManifestCF may not at this time.
> >
> > Suggestions and ideas?
> >
> > Thanks,
> >
> > Dan
>


Fwd: Best Practices for open source pipeline/connectors

2014-11-10 Thread Dan Davis
The volume and influx rate in my scenario are very modest.  Our largest
collection under the existing indexing software is about 20 million objects,
the second largest is about 5 million, and more typical collections are in
the tens of thousands.   Aside from the 20-million-object corpus, we re-index
and replicate nightly.

Note that I am not responsible for any specific operation, only for
advising my organization on how to go.   My organization wants to
understand how much "programming" will be involved in using Solr rather than
higher-level tools.   I have to acknowledge that our current solution
involves less "programming", even as I urge them to think of programming as
not a bad thing ;)   From my perspective, 'programming', that is,
configuration files in a git repository (with internal comments and commit
comments), is much, much more productive than using form-based configuration
software.  So, my organization's needs and mine may be different...

-- Forwarded message --
From: "Jürgen Wagner (DVT)" 
Date: Tue, Nov 4, 2014 at 4:48 PM
Subject: Re: Best Practices for open source pipeline/connectors
To: solr-user@lucene.apache.org


Hello Dan,

ManifoldCF is a connector framework, not a processing framework.
Therefore, you may try your own lightweight connectors (which usually are
not really rocket science and may take less time to write than time to
configure a super-generic connector of some sort), any connector out there
(including Nutch and others), or even commercial offerings from some
companies. That, however, won't make you very happy all by itself - my
guess. Key to really creating value out of data dragged into a search
platform is the processing pipeline. Depending on the scale of data and the
amount of processing you need to do, you may have a simplistic approach
with just some more or less configurable Java components massaging your
data until it can be sent to Solr (without using Tika or any other
processing in Solr), or you can employ frameworks like Apache Spark to
really heavily transform and enrich data before feeding them into Solr.

I prefer to have a clear separation between connectors, processing,
indexing/querying and front-end visualization/interaction. Only the
indexing/querying task I grant to Solr (or naked Lucene or Elasticsearch).
Each of the different task types has entirely different scaling
requirements and computing/networking properties, so you definitely don't
want them to depend on each other too much. Addressing the needs of several
customers, one may even need to swap one or the other component in favour of
what a customer prefers or needs.

So, my answer is YES. But we've also tried Nutch, our own specialized
crawlers and a number of elaborate connectors for special customer
applications. In any case, the result of that connector won't go into Solr.
It will go into processing. From there it will go into Solr. I suspect that
connectors won't be the challenge in your project. Solr requires a bit of
tuning and tweaking, but you'll be fine eventually. Document processing
will be the fun part. As you come to scaling the zoo of components, this
will become evident :-)

What is the volume and influx rate in your scenario?

Best regards,
--Jürgen



On 04.11.2014 22:01, Dan Davis wrote:

I'm trying to do research for my organization on the best practices for
open source pipeline/connectors.   Since we need Web Crawls, File System
crawls, and Databases, it seems to me that Manifold CF might be the best
case.

Has anyone combined ManifestCF with Solr UpdateRequestProcessors or
DataImportHandler?   It would be nice to decide in ManifestCF which
resultHandler should receive a document or id, barring that, you can post
some fields including an URL and have Data Import Handler handle it - it
already supports scripts whereas ManifestCF may not at this time.

Suggestions and ideas?

Thanks,

Dan




-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
--
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071


Logging in Solr's DataImportHandler

2014-12-05 Thread Dan Davis
I have a script transformer and a log transformer, and I'm not seeing the
log messages, at least not where I expect.
Is there any way I can simply log a custom message from within my script?
Can the script easily interact with its container's logger?


Re: Tika HTTP 400 Errors with DIH

2014-12-08 Thread Dan Davis
I would say that you could determine a row that gives a bad URL, and then
run it in the DIH admin interface (or from the command line) with "debug"
enabled.  The url parameter going into Tika should be present in its
transformed form before the next entity gets going.   This works in a
similar scenario for me.
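
As a sketch (the core name and row count are placeholders), the same debug
run can be driven from the command line:

  curl "http://localhost:8983/solr/collection1/dataimport?command=full-import&rows=1&debug=true&verbose=true&clean=false&commit=false"

That returns the transformed values for the first row, including the URL
handed to the Tika entity, without cleaning or committing anything.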

On Tue, Dec 2, 2014 at 1:19 PM, Teague James 
wrote:

> Hi all,
>
> I am using Solr 4.9.0 to index a DB with DIH. In the DB there is a URL
> field. In the DIH Tika uses that field to fetch and parse the documents.
> The
> URL from the field is valid and will download the document in the browser
> just fine. But Tika is getting HTTP response code 400. Any ideas why?
>
> ERROR
> BinURLDataSource
> java.io.IOException: Server returned HTTP response code: 400 for URL:
>
> EntityProcessorWrapper
> Exception in entity :
> tika_content:org.apache.solr.handler.dataimport.DataImportHandlerException:
> Exception in invoking url
>
> DIH
> 
>name="ds-1"
>   driver="net.sourceforge.jtds.jdbc.Driver"
>
> url="jdbc:jtds:sqlserver://
> 1.2.3.4/database;instance=INSTANCE;user=USER;pass
> word=PASSWORD" />
>
> 
>
> 
>  transformer="ClobTransformer, RegexTransformer"
> query="SELECT ContentID,
> DownloadURL
> FROM DATABASE.VIEW
> 
>  name="DownloadURL" />
>
>  processor="TikaEntityProcessor" url="${db_content.DownloadURL}"
> onError="continue" dataSource="ds-2">
> 
> 
>
> 
> 
> 
>
> SCHEMA - Fields
> 
>  stored="true" multiValued="true"/>
>
>
>
>


DIH XPathEntityProcessor question

2014-12-08 Thread Dan Davis
When I have a forEach attribute like the following:


forEach="/medical-topics/medical-topic/health-topic[@language='English']"

And then need to match an attribute of that, is there any alternative to
spelling it all out:

 

I suppose I could do "//health-topic/@url" since the document should then
have a single health-topic (as long as I know they don't nest).


Re: DIH XPathEntityProcessor question

2014-12-08 Thread Dan Davis
In experimentation with a much simpler and smaller XML file, it doesn't
look like '//health-topic/@url' will work, nor will '//@url', etc.    So
far, only spelling it all out will work.
With child elements, such as <title>, an xpath of "//title" works fine, but
it is beginning to seem dangerous.

Is there any short-hand for the current node or the match?

On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis  wrote:

> When I have a forEach attribute like the following:
>
>
> forEach="/medical-topics/medical-topic/health-topic[@language='English']"
>
> And then need to match an attribute of that, is there any alternative to
> spelling it all out:
>
>   xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>
>
> I suppose I could do "//health-topic/@url" since the document should then
> have a single health-topic (as long as I know they don't nest).
>
>


Re: DIH XPathEntityProcessor question

2014-12-08 Thread Dan Davis
The problem is that XPathEntityProcessor implements XPath on its own, and
supports only a subset of XPath.  So, if the input document is small enough,
it makes no sense to fight it.   One possibility is to apply an XSLT to the
file before processing it.

This blog post
<http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx>
shows a worked example.   The XSL transform takes place before the forEach
or field specifications, which is the principal question I had about it
from the documentation.  This is also illustrated in the initQuery()
private method of XPathEntityProcessor.You can see the transformation
being applied before the forEach.  This will not scale to extremely large
XML documents including millions of rows - that is why they have the
stream="true" argument there, so that you don't preprocess the document.
In my case, the entire XML file is 29M, so I think I can do the XSL
transformation and then run the forEach over the result.
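
As a sketch, the relevant DIH entity would look something like the following
(file and stylesheet names are placeholders, and the XSL is assumed to emit
the standard <add><doc> update format):

  <entity name="topics"
          processor="XPathEntityProcessor"
          url="health-topics.xml"
          xsl="xslt/topics-to-solr-add.xsl"
          useSolrAddSchema="true" />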

This potentially shortens my time frame for moving to Apache Solr
substantially, because the common case with our previous indexer is to run
XSLT to transform to the document format desired by the indexer.

On Mon, Dec 8, 2014 at 5:10 PM, Alexandre Rafalovitch 
wrote:

> I don't believe there are any alternatives. At least I could not get
> anything but the full path to work.
>
> Regards,
>Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 8 December 2014 at 17:01, Dan Davis  wrote:
> > In experimentation with a much simpler and smaller XML file, it doesn't
> > look like '//health-topic/@url" will not work, nor will '//@url' etc.
> So
> > far, only spelling it all out will work.
> > With child elements, such as , an xpath of "//title" works fine,
> but
> > it  is beginning to same dangerous.
> >
> > Is there any short-hand for the current node or the match?
> >
> > On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis  wrote:
> >
> >> When I have a forEach attribute like the following:
> >>
> >>
> >>
> forEach="/medical-topics/medical-topic/health-topic[@language='English']"
> >>
> >> And then need to match an attribute of that, is there any alternative to
> >> spelling it all out:
> >>
> >>   >>
> xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>
> >>
> >> I suppose I could do "//health-topic/@url" since the document should
> then
> >> have a single health-topic (as long as I know they don't nest).
> >>
> >>
>


Re: DIH XPathEntityProcessor question

2014-12-08 Thread Dan Davis
Yes, that worked quite well.   I still need the "//tagname" but that is the
only DIH incantation I need.   This will substantially accelerate things.

On Mon, Dec 8, 2014 at 5:37 PM, Dan Davis  wrote:

> The problem is that XPathEntityProcessor implements Xpath on its own, and
> implements a subset of XPath.  So, if the input document is small enough,
> it makes no sense to fight it.   One possibility is to apply an XSLT to the
> file before processing ite
>
> This blog post
> <http://www.andornot.com/blog/post/Sample-Solr-DataImportHandler-for-XML-Files.aspx>
> shows a worked example.   The XSL transform takes place before the forEach
> or field specifications, which is the principal question I had about it
> from the documentation.  This is also illustrated in the initQuery()
> private method of XPathEntityProcessor.You can see the transformation
> being applied before the forEach.  This will not scale to extremely large
> XML documents including millions of rows - that is why they have the
> stream="true" argument there, so that you don't preprocess the document.
> In my case, the entire XML file is 29M, and so I think I could do the XSL
> transformation and then do for each document.
>
> This potentially shortens my time frame of moving to Apache Solr
> substantially, because the common case with our previous indexer is to run
> XSLT to trasform to the document format desired by the indexer.
>
> On Mon, Dec 8, 2014 at 5:10 PM, Alexandre Rafalovitch 
> wrote:
>
>> I don't believe there are any alternatives. At least I could not get
>> anything but the full path to work.
>>
>> Regards,
>>Alex.
>> Personal: http://www.outerthoughts.com/ and @arafalov
>> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>>
>>
>> On 8 December 2014 at 17:01, Dan Davis  wrote:
>> > In experimentation with a much simpler and smaller XML file, it doesn't
>> > look like '//health-topic/@url" will not work, nor will '//@url' etc.
>>   So
>> > far, only spelling it all out will work.
>> > With child elements, such as , an xpath of "//title" works fine,
>> but
>> > it  is beginning to same dangerous.
>> >
>> > Is there any short-hand for the current node or the match?
>> >
>> > On Mon, Dec 8, 2014 at 4:42 PM, Dan Davis  wrote:
>> >
>> >> When I have a forEach attribute like the following:
>> >>
>> >>
>> >>
>> forEach="/medical-topics/medical-topic/health-topic[@language='English']"
>> >>
>> >> And then need to match an attribute of that, is there any alternative
>> to
>> >> spelling it all out:
>> >>
>> >>  > >>
>> xpath="/medical-topics/medical-topic/health-topic[@language='English']/@url"/>
>> >>
>> >> I suppose I could do "//health-topic/@url" since the document should
>> then
>> >> have a single health-topic (as long as I know they don't nest).
>> >>
>> >>
>>
>
>


Re: Spellchecker delivers far too few suggestions

2014-12-17 Thread Dan Davis
What about the frequency comparison - I haven't used the spellchecker
heavily, but it seems that if "bnak" is in the database, but "bank" is much
more frequent, then "bank" should be a suggestion anyway...

On Wed, Dec 17, 2014 at 10:41 AM, Erick Erickson 
wrote:
>
> First, I'd look in your corpus for "bnak". The problem with index-based
> suggestions is that if your index contains garbage, they're "correctly
> spelled" since they're in the index. TermsComponent is very useful for
> this.
>
> You can also loosen up the match criteria, and as I remember the collations
> parameter does some permutations of the word (but my memory of how that
> works is shaky).
>
> Best,
> Erick
>
> On Wed, Dec 17, 2014 at 9:13 AM, Martin Dietze  wrote:
> > I recently upgraded to SOLR 4.10.1 and after that set up the spell
> > checker which I use for returning suggestions after searches with few
> > or no results.
> > When the spellchecker is active, this request handler is used (most of
> > which is taken from examples I found in the net):
> >
> >> default="false">
> >  
> >explicit
> >true
> >false
> >10
> >false
> >*:*
> >explicit
> >50
> >*,score
> >  
> >  
> >spellcheck
> >  
> >   
> >
> > The search component is configured as follows (again most of it copied
> > from examples in the net):
> >
> >   <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
> >     <str name="queryAnalyzerFieldType">text</str>
> >     <lst name="spellchecker">
> >       <str name="name">default</str>
> >       <str name="field">text</str>
> >       <str name="classname">solr.DirectSolrSpellChecker</str>
> >       <str name="distanceMeasure">internal</str>
> >       <float name="accuracy">0.3</float>
> >       <int name="maxEdits">2</int>
> >       <int name="minPrefix">1</int>
> >       <int name="maxInspections">5</int>
> >       <int name="minQueryLength">4</int>
> >       <float name="maxQueryFrequency">0.01</float>
> >       <float name="thresholdTokenFrequency">.01</float>
> >     </lst>
> >   </searchComponent>
> >
> > With this setup I can get suggestions for misspelled words. The
> > results on my developer machine were mostly fine, but on the test
> > system (much larger database, much larger search index) I found it
> > very hard to get suggestions at all. If for instance I misspell “bank”
> > as “bnak” I’d expect to get a suggestion for “bank” (since that word
> > can be found in the index very often).
> >
> > I’ve played around with maxQueryFrequency and maxQueryFrequency with
> > no success.
> >
> > Does anyone see any obvious misconfiguration? Anything that I could try?
> >
> > Any way I can debug this? (problem is that my application uses the
> > core API which makes trying out requests through the web interface
> > does not work)
> >
> > Any help would be greatly appreciated!
> >
> > Cheers,
> >
> > Martin
> >
> >
> > --
> > -- mdie...@gmail.com --/-- mar...@the-little-red-haired-girl.org
> 
> > - / http://herbert.the-little-red-haired-girl.org /
> -
>


Best way to implement Spotlight of certain results

2015-01-09 Thread Dan Davis
I have a requirement to spotlight certain results if the query text exactly
matches the title or a see reference (indexed by me as alttitle_t).
What that means is that these matching results are shown above the
top-10/20 list with different CSS and fields.   It's like "I'm Feeling Lucky"
on Google :)

I have considered three ways of implementing this:

   1. Assume that edismax qf/pf will boost these results to be first when
   there is an exact match on these important fields.   The downside then is
   that my relevancy is constrained and I must maintain my configuration with
   title and alttitle_t as top search fields (see XML snippet below).I may
   have to overweight them to achieve the "always first" criteria.   Another
   less major downside is that I must always return the spotlight summary
   field (for display) and the image to display on each search.   These could
   be got from a database by the id, however, it is convenient to get them
   from Solr.
   2. Issue two searches for every user search, and use a second set of
   parameters (change the search type and fields to search only by exact
   matching a specific string field spottitle_s).   The search for the
   spotlight can then have its own configuration.   The downside here is that
   I am using Django and pysolr for the front-end, and pysolr is both
   synchronous and tied to the requestHandler named "select".   Convention.
   Of course, running in parallel is not a fix-all - running a search takes
   some time, even if run in parallel.
   3. Automate the population of elevate.xml so that all these 959 queries
   are there (see the sketch after this list).   This is probably best, but
   forces me to restart/reload when there are changes to this component.
   The elevation can be done through a query.
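
As a sketch of option 3, elevate.xml entries would look roughly like this
(the document id here is only a placeholder):

  <elevate>
    <query text="childhood cancer">
      <doc id="topic-12345" />
    </query>
    <!-- ... one <query> block per spotlighted title or see reference ... -->
  </elevate>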

What I'd love to do is to configure the "select" requestHandler to run both
searches and return me both sets of results.   Is there any way to do that -
apply the same q= parameter to two configured ways of running a search?
Something like subqueries?

I suspect that approach 1 will get me through my demo and a brief
evaluation period, but that either approach 2 or 3 will be the winner.

Here's a snippet from my current qf/pf configuration:
  
  <str name="qf">
    title^100
    alttitle_t^100
    ...
    text
  </str>
  <str name="pf">
    title^1000
    alttitle_t^1000
    ...
    text^10
  </str>

Thanks,

Dan Davis


Suggester questions

2015-01-13 Thread Dan Davis
I am having some trouble getting the suggester to work.   The spell
requestHandler is working, but I didn't like the results I was getting from
the word breaking dictionary and turned them off.
So some basic questions:

   - How can I check on the status of a dictionary?
   - How can I see what is in that dictionary?
   - How do I actually manually rebuild the dictionary - all attempts to
   set spellcheck.build=on or suggest.build=on have led to nearly instant
   results (0 suggestions for the latter), indicating something is wrong.


Thanks,

Daniel Davis


Improved suggester question

2015-01-13 Thread Dan Davis
The suggester is not working for me with Solr 4.10.2

Can anyone shed light on why I might be getting the exception below when
I build the dictionary?



500
26


len must be <= 32767; got 35680

java.lang.IllegalArgumentException: len must be <= 32767; got 35680 at
org.apache.lucene.util.OfflineSorter$ByteSequencesWriter.write(OfflineSorter.java:479)
at
org.apache.lucene.search.suggest.analyzing.AnalyzingSuggester.build(AnalyzingSuggester.java:493)
at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:190) at
org.apache.solr.spelling.suggest.SolrSuggester.build(SolrSuggester.java:160)
at
org.apache.solr.handler.component.SuggestComponent.prepare(SuggestComponent.java:165)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:197)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:200) at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

500



Thank you.

I've configured my suggester as follows:


  
mySuggester
FuzzyLookupFactory
DocumentDictionaryFactory
text
medsite_id
text_general
true
0.1
  



  
on
mySuggester
10
  
  
suggest
  



Re: Logging in Solr's DataImportHandler

2015-01-13 Thread Dan Davis
Mikhail,

Thanks - it works now.  The script transformer was really not needed, a
template transformer is clearer, and the log transformer is now working.
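
A minimal sketch of the kind of entity that ended up working (entity and
field names are placeholders; the rest of the entity definition is
unchanged):

  <entity name="item"
          transformer="TemplateTransformer,LogTransformer"
          logTemplate="imported id ${item.id}"
          logLevel="info"
          ... >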

On Mon, Dec 8, 2014 at 1:56 AM, Mikhail Khludnev  wrote:

> Hello Dan,
>
> Usually it works well. Can you describe how you run it particularly, eg
> what you download exactly and what's the command line ?
>
> On Fri, Dec 5, 2014 at 11:37 PM, Dan Davis  wrote:
>
>> I have a script transformer and a log transformer, and I'm not seeing the
>> log messages, at least not where I expect.
>> Is there anyway I can simply log a custom message from within my script?
>> Can the script easily interact with its containers logger?
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> 
>


Re: Best way to implement Spotlight of certain results

2015-01-13 Thread Dan Davis
Maybe I can use grouping, but my understanding of the feature is not up to
figuring that out :)

I tried something like

http://localhost:8983/solr/collection/select?q=childhood+cancer&group=on&group.query=childhood+cancer
Because group.limit is 1 (the default), I get a single result, and no other
results.  If I add group.field=title, then I get each result in a group of 1
member...

Erick's re-ranking I do understand - I can re-rank the top-N to make sure
the spotlighted result is always first, avoiding the potential problem of
having to overweight the title field.    In practice, I may not ever need
to use the re-ranking, but it's there if I need it.    This is enough,
because it gives me talking points.
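
For completeness, a hedged sketch of what that re-rank might look like (the
field name, numbers, and query text are only placeholders):

  q=childhood+cancer
  &rq={!rerank reRankQuery=$rqq reRankDocs=200 reRankWeight=5}
  &rqq=spottitle_s:"childhood cancer"

The {!rerank} parser rescores only the top reRankDocs of the main query, so
an exact-title clause can push a spotlight match to the top without
inflating the title boost for every search.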


On Fri, Jan 9, 2015 at 3:05 PM, Michał B. .  wrote:

> Maybe I understand you badly but I think that you could use grouping to
> achieve such effect. If you could prepare two group queries one with exact
> match and other, let's say, default than you will be able to extract
> matches from grouping results. i.e (using default solr example collection)
>
>
> http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.query=manu%3A%22Ap+Computer+Inc.%22&group.query=name:Apple%2060%20GB%20iPod%20with%20Video%20Playback%20Black&group.limit=10
>
> this query will return two groups, one with the exact match and a second with the rest of the
> standard results.
>
> Regards,
> Michal
>
>
> 2015-01-09 20:44 GMT+01:00 Erick Erickson :
>
> > Hmm, I wonder if the RerankingQueryParser might help here?
> > See: https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking
> >
> > Best,
> > Erick
> >
> > On Fri, Jan 9, 2015 at 10:35 AM, Dan Davis  wrote:
> > > I have a requirement to spotlight certain results if the query text
> > exactly
> > > matches the title or see reference (indexed by me as alttitle_t).
> > > What that means is that these matching results are shown above the
> > > top-10/20 list with different CSS and fields.   Its like feeling lucky
> on
> > > google :)
> > >
> > > I have considered three ways of implementing this:
> > >
> > >1. Assume that edismax qf/pf will boost these results to be first
> when
> > >there is an exact match on these important fields.   The downside
> > then is
> > >that my relevancy is constrained and I must maintain my
> configuration
> > with
> > >title and alttitle_t as top search fields (see XML snippet below).
> > I may
> > >have to overweight them to achieve the "always first" criteria.
> >  Another
> > >less major downside is that I must always return the spotlight
> summary
> > >field (for display) and the image to display on each search.   These
> > could
> > >be got from a database by the id, however, it is convenient to get
> > them
> > >from Solr.
> > >2. Issue two searches for every user search, and use a second set of
> > >parameters (change the search type and fields to search only by
> exact
> > >matching a specific string field spottitle_s).   The search for the
> > >spotlight can then have its own configuration.   The downside here
> is
> > that
> > >I am using Django and pysolr for the front-end, and pysolr is both
> > >synchronous and tied to the requestHandler named "select".
> >  Convention.
> > >Of course, running in parallel is not a fix-all - running a search
> > takes
> > >some time, even if run in parallel.
> > >3. Automate the population of elevate.xml so that all these 959
> > queries
> > >are here.   This is probably best, but forces me to restart/reload
> > when
> > >there are changes to this components.   The elevation can be done
> > through a
> > >query.
> > >
> > > What I'd love to do is to configure the "select" requestHandler to run
> > both
> > > searches and return me both sets of results.   Is there anyway to do
> > that -
> > > apply the same q= parameter to two configured way to run a search?
> > > Something like sub queries?
> > >
> > > I suspect that approach 1 will get me through my demo and a brief
> > > evaluation period, but that either approach 2 or 3 will be the winner.
> > >
> > > Here's a snippet from my current qf/pf configuration:
> > >   
> > > title^100
> > > alttitle_t^100
> > > ...
> > > text
> > >   
> > >   
> > > title^1000
> > > alttitle_t^1000
> > > ...
> > > text^10
> > >  
> > >
> > > Thanks,
> > >
> > > Dan Davis
> >
>
>
>
> --
> Michał Bieńkowski
>


Re: Occasionally getting error in solr suggester component.

2015-01-13 Thread Dan Davis
Related question -

I see mention of needing to rebuild the spellcheck/suggest dictionary after
a Solr core reload.   I see spellcheckIndexDir in both the old wiki entry and
the Solr Reference Guide.  If this parameter is provided, it sounds like the
index is stored on the filesystem and need not be rebuilt each time the core
is reloaded.

Is this a correct understanding?
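
For reference, the kind of configuration in question looks roughly like this
(the directory is a placeholder), using the index-based spellchecker:

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">false</str>
  </lst>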


On Tue, Jan 13, 2015 at 2:17 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> I think you are probably getting bitten by one of the issues addressed in
> LUCENE-5889
>
> I would recommend against using buildOnCommit=true - with a large index
> this can be a performance-killer.  Instead, build the index yourself using
> the Solr spellchecker support (spellcheck.build=true)
>
> -Mike
>
>
> On 01/13/2015 10:41 AM, Dhanesh Radhakrishnan wrote:
>
>> Hi all,
>>
>> I am experiencing a problem in Solr SuggestComponent
>> Occasionally solr suggester component throws an  error like
>>
>> Solr failed:
>> {"responseHeader":{"status":500,"QTime":1},"error":{"msg":"suggester was
>> not built","trace":"java.lang.IllegalStateException: suggester was not
>> built\n\tat
>> org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.
>> lookup(AnalyzingInfixSuggester.java:368)\n\tat
>> org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester.
>> lookup(AnalyzingInfixSuggester.java:342)\n\tat
>> org.apache.lucene.search.suggest.Lookup.lookup(Lookup.java:240)\n\tat
>> org.apache.solr.spelling.suggest.SolrSuggester.
>> getSuggestions(SolrSuggester.java:199)\n\tat
>> org.apache.solr.handler.component.SuggestComponent.
>> process(SuggestComponent.java:234)\n\tat
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(
>> SearchHandler.java:218)\n\tat
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(
>> RequestHandlerBase.java:135)\n\tat
>> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.
>> handleRequest(RequestHandlers.java:246)\n\tat
>> org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)\n\tat
>> org.apache.solr.servlet.SolrDispatchFilter.execute(
>> SolrDispatchFilter.java:777)\n\tat
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(
>> SolrDispatchFilter.java:418)\n\tat
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(
>> SolrDispatchFilter.java:207)\n\tat
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
>> ApplicationFilterChain.java:243)\n\tat
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(
>> ApplicationFilterChain.java:210)\n\tat
>> org.apache.catalina.core.StandardWrapperValve.invoke(
>> StandardWrapperValve.java:225)\n\tat
>> org.apache.catalina.core.StandardContextValve.invoke(
>> StandardContextValve.java:123)\n\tat
>> org.apache.catalina.core.StandardHostValve.invoke(
>> StandardHostValve.java:168)\n\tat
>> org.apache.catalina.valves.ErrorReportValve.invoke(
>> ErrorReportValve.java:98)\n\tat
>> org.apache.catalina.valves.AccessLogValve.invoke(
>> AccessLogValve.java:927)\n\tat
>> org.apache.catalina.valves.RemoteIpValve.invoke(
>> RemoteIpValve.java:680)\n\tat
>> org.apache.catalina.core.StandardEngineValve.invoke(
>> StandardEngineValve.java:118)\n\tat
>> org.apache.catalina.connector.CoyoteAdapter.service(
>> CoyoteAdapter.java:407)\n\tat
>> org.apache.coyote.http11.AbstractHttp11Processor.process(
>> AbstractHttp11Processor.java:1002)\n\tat
>> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.
>> process(AbstractProtocol.java:579)\n\tat
>> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.
>> run(JIoEndpoint.java:312)\n\tat
>> java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1145)\n\tat
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:615)\n\tat
>> java.lang.Thread.run(Thread.java:745)\n","code":500}}
>>
>> This is not freequently happening, but idexing and suggestor component
>> working togethere  this error will occur.
>>
>>
>>
>>
>> In solr config
>>
>> 
>>  
>>haSuggester
>>AnalyzingInfixLookupFactory  
>>textSpell
>>DocumentDictionaryFactory
>>  
>>name
>>packageWeight
>>true
>>  
>>
>>
>>> startup="lazy">
>>  
>>true
>>10
>>  
>>  
>>suggest
>>  
>>
>>
>> Can any one suggest where to look to figure out this error and why these
>> errors are occurring?
>>
>>
>>
>> Thanks,
>> dhanesh s.r
>>
>>
>>
>>
>> --
>>
>>
>


Re: OutOfMemoryError for PDF document upload into Solr

2015-01-15 Thread Dan Davis
Why re-write all the document conversion in Java ;)  Tika is very slow.   5
GB PDF is very big.

If you have a lot of PDF like that try pdftotext in HTML and UTF-8 output
mode.   The HTML mode captures some meta-data that would otherwise be lost.
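
As a sketch (file names are placeholders), the poppler-utils invocation
would be something like:

  pdftotext -htmlmeta -enc UTF-8 input.pdf output.html

-htmlmeta wraps the extracted text in simple HTML and preserves the document
metadata in <meta> tags, which is handy when feeding the result to an
indexer.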


If you need to go faster still, you can also write some code linked
directly against the poppler library.

Before you jump down my throat about Tika being slow - I wrote a PDF
indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
setjmp/longjmp.   But fast...



On Thu, Jan 15, 2015 at 1:54 PM,  wrote:

> Siegfried and Michael Thank you for your replies and help.
>
> -Original Message-
> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
> Sent: Thursday, January 15, 2015 3:45 AM
> To: solr-user@lucene.apache.org
> Subject: Re: OutOfMemoryError for PDF document upload into Solr
>
> Hi Ganesh,
>
> you can increase the heap size but parsing a 4 GB PDF document will very
> likely consume A LOT OF memory - I think you need to check if that large
> PDF can be parsed at all :-)
>
> Cheers,
>
> Siegfried Goeschl
>
> On 14.01.15 18:04, Michael Della Bitta wrote:
> > Yep, you'll have to increase the heap size for your Tomcat container.
> >
> > http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial
> > -heap-size-correctly
> >
> > Michael Della Bitta
> >
> > Senior Software Engineer
> >
> > o: +1 646 532 3062
> >
> > appinions inc.
> >
> > “The Science of Influence Marketing”
> >
> > 18 East 41st Street
> >
> > New York, NY 10017
> >
> > t: @appinions  | g+:
> > plus.google.com/appinions
> >  > 3336/posts>
> > w: appinions.com 
> >
> > On Wed, Jan 14, 2015 at 12:00 PM,  wrote:
> >
> >> Hello,
> >>
> >> Can someone pass on the hints to get around following error? Is there
> >> any Heap Size parameter I can set in Tomcat or in Solr webApp that
> >> gets deployed in Solr?
> >>
> >> I am running Solr webapp inside Tomcat on my local machine which has
> >> RAM of 12 GB. I have PDF document which is 4 GB max in size that
> >> needs to be loaded into Solr
> >>
> >>
> >>
> >>
> >> Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError:
> >> Java heap space
> >>  at java.util.AbstractCollection.toArray(Unknown Source)
> >>  at java.util.ArrayList.<init>(Unknown Source)
> >>  at
> >> org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
> >>  at
> org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
> >>  at
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
> >>  at
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
> >>  at
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
> >>  at
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
> >>  at
> >> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >>  at
> >> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> >>  at
> >> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> >>  at
> >>
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
> >>  at
> >>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>  at
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>  at
> >>
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
> >>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
> >>  at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
> >>  at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
> >>  at
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
> >>  at
> >>
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
> >>  at
> >>
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
> >>  at
> >>
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
> >>  at
> >>
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
> >>  at
> >>
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
> >>  at
> >>
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
> >>  at
> >>
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
> >>  at
> >>
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
> >>  at
> >>
> org.apache.catalina.connector.CoyoteAdapter

solr replication vs. rsync

2015-01-24 Thread Dan Davis
When I polled the various projects already using Solr at my organization, I
was greatly surprised that none of them were using Solr replication,
because they had talked about "replicating" the data.

But we are not Pinterest, and do not expect to be taking in changes one
post at a time (at least the engineers don't - just wait until it's used for
a CRUD app that wants full-text search on a description field!).   Still,
rsync can be very, very fast with the right options (-W for gigabit
ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s over
GigE previously.
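
For concreteness, the kind of invocation meant here (paths and host name are
placeholders):

  rsync -avW --delete /var/solr/data/index/ searchslave:/var/solr/data/index/

-W skips rsync's delta-transfer algorithm, which is usually a win when the
network is faster than the disks.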

Does anyone have any numbers for how fast Solr replication goes, and what
to do to tune it?

I'm not enthusiastic to give up recently tested cluster stability for a
home-grown mess, but I am interested in numbers that are out there.


Re: solr replication vs. rsync

2015-01-25 Thread Dan Davis
Thanks!

On Sunday, January 25, 2015, Erick Erickson  wrote:

> @Shawn: Cool table, thanks!
>
> @Dan:
> Just to throw a different spin on it, if you migrate to SolrCloud, then
> this question becomes moot as the raw documents are sent to each of the
> replicas so you very rarely have to copy the full index. Kind of a tradeoff
> between constant load because you're sending the raw documents around
> whenever you index and peak usage when the index replicates.
>
> There are a bunch of other reasons to go to SolrCloud, but you know your
> problem space best.
>
> FWIW,
> Erick
>
> On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey  > wrote:
>
> > On 1/24/2015 10:56 PM, Dan Davis wrote:
> > > When I polled the various projects already using Solr at my
> > organization, I
> > > was greatly surprised that none of them were using Solr replication,
> > > because they had talked about "replicating" the data.
> > >
> > > But we are not Pinterest, and do not expect to be taking in changes one
> > > post at a time (at least the engineers don't - just wait until its used
> > for
> > > a Crud app that wants full-text search on a description field!).
> > Still,
> > > rsync can be very, very fast with the right options (-W for gigabit
> > > ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s
> > over
> > > GigE previously.
> > >
> > > Does anyone have any numbers for how fast Solr replication goes, and
> what
> > > to do to tune it?
> > >
> > > I'm not enthusiastic to give-up recently tested cluster stability for a
> > > home grown mess, but I am interested in numbers that are out there.
> >
> > Numbers are included on the Solr replication wiki page, both in graph
> > and numeric form.  Gathering these numbers must have been pretty easy --
> > before the HTTP replication made it into Solr, Solr used to contain an
> > rsync-based implementation.
> >
> > http://wiki.apache.org/solr/SolrReplication#Performance_numbers
> >
> > Other data on that wiki page discusses the replication config.  There's
> > not a lot to tune.
> >
> > I run a redundant non-SolrCloud index myself through a different method
> > -- my indexing program indexes each index copy completely independently.
> >  There is no replication.  This separation allows me to upgrade any
> > component, or change any part of solrconfig or schema, on either copy of
> > the index without affecting the other copy at all.  With replication, if
> > something is changed on the master or the slave, you might find that the
> > slave no longer works, because it will be handling an index created by
> > different software or a different config.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: solr replication vs. rsync

2015-01-25 Thread Dan Davis
@Erick,

The problem space is not constant indexing.   I thought SolrCloud replicas
were kept in sync via replication, and you imply parallel indexing.  Good to
know.

On Sunday, January 25, 2015, Erick Erickson  wrote:

> @Shawn: Cool table, thanks!
>
> @Dan:
> Just to throw a different spin on it, if you migrate to SolrCloud, then
> this question becomes moot as the raw documents are sent to each of the
> replicas so you very rarely have to copy the full index. Kind of a tradeoff
> between constant load because you're sending the raw documents around
> whenever you index and peak usage when the index replicates.
>
> There are a bunch of other reasons to go to SolrCloud, but you know your
> problem space best.
>
> FWIW,
> Erick
>
> On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey  > wrote:
>
> > On 1/24/2015 10:56 PM, Dan Davis wrote:
> > > When I polled the various projects already using Solr at my
> > organization, I
> > > was greatly surprised that none of them were using Solr replication,
> > > because they had talked about "replicating" the data.
> > >
> > > But we are not Pinterest, and do not expect to be taking in changes one
> > > post at a time (at least the engineers don't - just wait until its used
> > for
> > > a Crud app that wants full-text search on a description field!).
> > Still,
> > > rsync can be very, very fast with the right options (-W for gigabit
> > > ethernet, and maybe -S for sparse files).   I've clocked it at 48 MB/s
> > over
> > > GigE previously.
> > >
> > > Does anyone have any numbers for how fast Solr replication goes, and
> what
> > > to do to tune it?
> > >
> > > I'm not enthusiastic to give-up recently tested cluster stability for a
> > > home grown mess, but I am interested in numbers that are out there.
> >
> > Numbers are included on the Solr replication wiki page, both in graph
> > and numeric form.  Gathering these numbers must have been pretty easy --
> > before the HTTP replication made it into Solr, Solr used to contain an
> > rsync-based implementation.
> >
> > http://wiki.apache.org/solr/SolrReplication#Performance_numbers
> >
> > Other data on that wiki page discusses the replication config.  There's
> > not a lot to tune.
> >
> > I run a redundant non-SolrCloud index myself through a different method
> > -- my indexing program indexes each index copy completely independently.
> >  There is no replication.  This separation allows me to upgrade any
> > component, or change any part of solrconfig or schema, on either copy of
> > the index without affecting the other copy at all.  With replication, if
> > something is changed on the master or the slave, you might find that the
> > slave no longer works, because it will be handling an index created by
> > different software or a different config.
> >
> > Thanks,
> > Shawn
> >
> >
>


Weighting of prominent text in HTML

2015-01-25 Thread Dan Davis
By examining solr.log, I can see that Nutch is using the /update request
handler rather than /update/extract.   So, this may be a more appropriate
question for the Nutch mailing list.   OTOH, y'all know the answer off the
top of your head.

Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
normal paragraph?Can this weighting be tuned without writing a plugin?
   Is writing a plugin often needed because of the flexibility that is
needed in practice?

I wanted to call this post *Anatomy of a small scale search engine*, but
lacked the nerve ;)

Thanks, all and many,

Dan Davis, Systems/Applications Architect
National Library of Medicine


Re: [MASSMAIL]Weighting of prominent text in HTML

2015-01-26 Thread Dan Davis
Helps lots.   Thanks, Jorge Luis.   Good point about different fields -
I'll just put the h1 and h2 (however deep I want to go) into fields, and we
can sort out weighting and whether we want it later with edismax.   The
blogs on adding plugins for that sort of thing look straightforward.
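
As a rough sketch of where that ends up (field names and boosts are
placeholders, assuming the headings land in h1_t/h2_t fields):

  <str name="defType">edismax</str>
  <str name="qf">title^5 h1_t^3 h2_t^2 content</str>

The heading fields then simply carry more weight than the body text, and the
numbers can be tuned later without touching the crawler.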

On Mon, Jan 26, 2015 at 12:47 AM, Jorge Luis Betancourt González <
jlbetanco...@uci.cu> wrote:

> Hi Dan:
>
> Agreed, this question is more Nutch related than Solr ;)
>
> Nutch doesn't send any data into /update/extract request handler, all the
> text and metadata extraction happens in Nutch side rather than relying in
> the ExtractRequestHandler provided by Solr. Underneath Nutch use Tika the
> same technology as the ExtractRequestHandler provided by Solr so shouldn't
> be any greater difference.
>
> By default Nutch doesn't boost anything, as it is Solr's job to boost the
> different content in the different fields, which is what happens when you
> do a query against Solr. Nutch calculates the LinkRank which is a variation
> of the famous PageRank (or the OPIC score, which is another scoring
> algorithm implemented in Nutch, which I believe is the default in Nutch
> 2.x). What you can do is use the headings and map the heading tags into
> different fields and then apply different boosts to each field.
>
> The general idea with Nutch is to "make pieces of the web page" and store
> each piece in a different field in Solr; then you can tweak your relevance
> function using the values you see fit, so you don't need to write any plugin
> to accomplish this (at least for the h1, h2, etc. example you provided, if
> you want to extract other parts of the webpage you'll need to write your
> own plugin to do so).
>
> Nutch is highly customizable, you can write a plugin for almost any piece
> of logic, from parsers to indexers, passing from URL filters, scoring
> algorithms, protocols and a long, long list. Usually the plugins are not so
> difficult to write, but the problem is knowing which extension point you
> need to use; this comes with experience and taking a good dive in the
> source code.
>
> Hope this helps,
>
> - Original Message -
> From: "Dan Davis" 
> To: "solr-user" 
> Sent: Monday, January 26, 2015 12:08:13 AM
> Subject: [MASSMAIL]Weighting of prominent text in HTML
>
> By examining solr.log, I can see that Nutch is using the /update request
> handler rather than /update/extract.   So, this may be a more appropriate
> question for the nutch mailing list.   OTOH, y'all know the anwser off the
> top of your head.
>
> Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
> normal paragraph?Can this weighting be tuned without writing a plugin?
>Is writing a plugin often needed because of the flexibility that is
> needed in practice?
>
> I wanted to call this post *Anatomy of a small scale search engine*, but
> lacked the nerve ;)
>
> Thanks, all and many,
>
> Dan Davis, Systems/Applications Architect
> National Library of Medicine
>
>
> ---
> XII Aniversario de la creación de la Universidad de las Ciencias
> Informáticas. 12 años de historia junto a Fidel. 12 de diciembre de 2014.
>
>


Re: Need Help with custom ZIPURLDataSource class

2015-01-26 Thread Dan Davis
I have seen such errors by looking under Logging in the Solr Admin UI.
There is also the LogTransformer for Data Import Handler.

However, it is a design choice in Data Import Handler to skip fields not in
the schema.   I would suggest you always use Debug and Verbose to do the
first couple of documents through the GUI, and then look at the debugging
output with a fine toothed comb.

I'm not sure whether there's an option for it, but it would be nice if the
Data Import Handler could collect skipped fields into the status response.
  That would highlight your problem without forcing you to look in other
areas.


On Fri, Jan 23, 2015 at 9:51 PM, Carl Roberts  wrote:

> NVM - I have this working.
>
> The problem was this:  pk="link" in rss-data-config.xml, but the unique key
> in schema.xml is not link - it is id.
>
> From rss-data-config.xml:
>
>  *pk="link"*
> url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.
> xml.zip"
> processor="XPathEntityProcessor"
> forEach="/nvd/entry">
> 
>  commonField="true" />
>  commonField="true" />
> 
> 
>
> From schema.xml:
>
> id
>
> What really bothers me is that there were no errors output by Solr to
> indicate this type of misconfiguration error and all the messages that Solr
> gave indicated the import was successful.  This lack of appropriate error
> reporting is a pain, especially for someone learning Solr.
>
> Switching pk="link" to pk="id" solved the problem and I was then able to
> import the data.
>
> On 1/23/15, 6:34 PM, Carl Roberts wrote:
>
>>
>> Hi,
>>
>> I created a custom ZIPURLDataSource class to unzip the content from an
>> http URL for an XML ZIP file and it seems to be working (at least I have
>> no errors), but no data is imported.
>>
>> Here is my configuration in rss-data-config.xml:
>>
>> 
>> > readTimeout="3"/>
>> 
>> > pk="link"
>> url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip";
>> processor="XPathEntityProcessor"
>> forEach="/nvd/entry"
>> transformer="DateFormatTransformer">
>> 
>> 
>> 
>> > xpath="/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name"
>> commonField="false" />
>> > xpath="/nvd/entry/vulnerable-software-list/product" commonField="false"
>> />
>> > commonField="false" />
>> > commonField="false" />
>> 
>> 
>> 
>> 
>>
>>
>> Attached is the ZIPURLDataSource.java file.
>>
>> It actually unzips and saves the raw XML to disk, which I have verified
>> to be a valid XML file.  The file has one or more entries (here is an
>> example):
>>
>> http://scap.nist.gov/schema/scap-core/0.1";
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance";
>> xmlns:patch="http://scap.nist.gov/schema/patch/0.1";
>> xmlns:vuln="http://scap.nist.gov/schema/vulnerability/0.4";
>> xmlns:cvss="http://scap.nist.gov/schema/cvss-v2/0.2";
>> xmlns:cpe-lang="http://cpe.mitre.org/language/2.0";
>> xmlns="http://scap.nist.gov/schema/feed/vulnerability/2.0";
>> pub_date="2015-01-10T05:37:05"
>> xsi:schemaLocation="http://scap.nist.gov/schema/patch/0.1
>> http://nvd.nist.gov/schema/patch_0.1.xsd
>> http://scap.nist.gov/schema/scap-core/0.1
>> http://nvd.nist.gov/schema/scap-core_0.1.xsd
>> http://scap.nist.gov/schema/feed/vulnerability/2.0
>> http://nvd.nist.gov/schema/nvd-cve-feed_2.0.xsd"; nvd_xml_version="2.0">
>> 
>> http://nvd.nist.gov/";>
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> cpe:/o:freebsd:freebsd:2.2.8
>> cpe:/o:freebsd:freebsd:1.1.5.1
>> cpe:/o:freebsd:freebsd:2.2.3
>> cpe:/o:freebsd:freebsd:2.2.2
>> cpe:/o:freebsd:freebsd:2.2.5
>> cpe:/o:freebsd:freebsd:2.2.4
>> cpe:/o:freebsd:freebsd:2.0.5
>> cpe:/o:freebsd:freebsd:2.2.6
>> cpe:/o:freebsd:freebsd:2.1.6.1
>> cpe:/o:freebsd:freebsd:2.0.1
>> cpe:/o:freebsd:freebsd:2.2
>> cpe:/o:freebsd:freebsd:2.0
>> cpe:/o:openbsd:openbsd:2.3
>> cpe:/o:freebsd:freebsd:3.0
>> cpe:/o:freebsd:freebsd:1.1
>> cpe:/o:freebsd:freebsd:2.1.6
>> cpe:/o:openbsd:openbsd:2.4
>> cpe:/o:bsdi:bsd_os:3.1
>> cpe:/o:freebsd:freebsd:1.0
>> cpe:/o:freebsd:freebsd:2.1.7
>> cpe:/o:freebsd:freebsd:1.2
>> cpe:/o:freebsd:freebsd:2.1.5
>> cpe:/o:freebsd:freebsd:2.1.7.1
>> 
>> CVE-1999-0001
>> 1999-12-30T00:00:00.000-05:00
>>
>> 2010-12-16T00:00:00.000-05:00
>>
>> 
>> 
>> 5.0
>> NETWORK
>> LOW
>> NONE
>> NONE
>> NONE
>> PARTIAL
>> http://nvd.nist.gov
>> 2004-01-01T00:00:00.000-05:00
>>
>> 
>> 
>> 
>> 
>> OSVDB
>> http://www.osvdb.org/5707";
>> xml:lang="en">5707
>> 
>> 
>> CONFIRM
>> http://www.openbsd.org/errata23.html#tcpfix";
>> xml:lang="en">http://www.openbsd.org/errata23.html#tcpfix
>>
>> 
>> ip_input.c in BSD-derived TCP/IP implementations allows
>> remote attackers to cause a denial of service (crash or hang) via
>> crafted packets.
>> 
>>
>>
>> Here is the curl command:
>>
>> curl http://127.0.0.1:8983/solr/nvd-rss/dataimport?command=full-import
>>
>> And here is the output from the console for Jetty:
>>
>> main{StandardDirectoryReader(segm

Re: Need help importing data

2015-01-26 Thread Dan Davis
Glad it worked out.

On Fri, Jan 23, 2015 at 9:50 PM, Carl Roberts  wrote:

> NVM
>
> I figured this out.  The problem was this:  pk="link" in
> rss-data-config.xml, but the unique key in schema.xml is not link - it is id.
>
> From rss-data-config.xml:
>
>  *pk="link"*
> url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip";
> processor="XPathEntityProcessor"
> forEach="/nvd/entry">
> 
>  commonField="true" />
>  commonField="true" />
> 
> 
>
> From schema.xml:
>
> id
>
> What really bothers me is that there were no errors output by Solr to
> indicate this type of misconfiguration error and all the messages that Solr
> gave indicated the import was successful.  This lack of appropriate error
> reporting is a pain, especially for someone learning Solr.
>
> Switching pk="link" to pk="id" solved the problem and I was then able to
> import the data.
>
>
>
> On 1/23/15, 9:39 PM, Carl Roberts wrote:
>
>> Hi,
>>
>> I have set log4j logging to level DEBUG and I have also modified the code
>> to see what is being imported and I can see the nextRow() records, and the
>> import is successful, however I have no data. Can someone please help me
>> figure this out?
>>
>> Here is the logging output:
>>
>> ow:  r1={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264,
>> $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r3={{id=CVE-2002-2353, cve=CVE-2002-2353, cwe=CWE-264, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> URL={url}
>> 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r1={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r3={{id=CVE-2002-2354, cve=CVE-2002-2354, cwe=CWE-20, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> URL={url}
>> 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r1={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,606- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r3={{id=CVE-2002-2355, cve=CVE-2002-2355, cwe=CWE-255, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> URL={url}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r1={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r3={{id=CVE-2002-2356, cve=CVE-2002-2356, cwe=CWE-264, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> URL={url}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r1={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r3={{id=CVE-2002-2357, cve=CVE-2002-2357, cwe=CWE-119, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> URL={url}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r1={{id=CVE-2002-2358, cve=CVE-2002-2358, cwe=CWE-79, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:251]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> r3={{id=CVE-2002-2358, cve=CVE-2002-2358, cwe=CWE-79, $forEach=/nvd/entry}}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:221]
>> -org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow:
>> URL={url}
>> 2015-01-23 21:28:04,607- INFO-[Thread-15]-[XPathEntityProcessor.java:227]
>> -org.apache

Re: Indexed epoch time in Solr

2015-01-26 Thread Dan Davis
I think copying to a new Solr date field is your best bet, because then you
have the flexibility to do date range facets in the future.

If you can re-index, and are using Data Import Handler, Jim Musil's
suggestion is just right.

If you can re-index, and are not using Data Import Handler:

   - This seems a job for an UpdateRequestProcessor, but I don't see one
   for this.
   - This seems to be a good candidate for a standard, core
   UpdateRequestProcessor, but I haven't checked Jira for a bug report.

If the scale is too large to re-index, then there is surely still a way,
but I'm not sure I can advise you on the best one.  I'm not a Solr expert
yet... just someone on the list with an IR background.
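
If it helps, a processor like that might look roughly like the sketch below
- the field names ("epoch_time" holding unix seconds, "epoch_dt" as the
target date field) are made up, and it would still need to be registered in
an updateRequestProcessorChain in solrconfig.xml:

import java.io.IOException;
import java.util.Date;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class EpochToDateProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object epoch = doc.getFieldValue("epoch_time");   // unix seconds
        if (epoch != null) {
          long seconds = Long.parseLong(epoch.toString());
          // A Solr date field accepts java.util.Date and is returned as ISO-8601.
          doc.setField("epoch_dt", new Date(seconds * 1000L));
        }
        super.processAdd(cmd);
      }
    };
  }
}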

On Mon, Jan 26, 2015 at 12:35 AM, Ahmed Adel  wrote:

> Hi All,
>
> Is there a way to convert a unix time field that is already indexed to
> ISO-8601 format in the query response? If this is not possible at the query
> level, what is the best way to copy this field to a new Solr standard date
> field?
>
> Thanks,
>
> --
> *Ahmed Adel*
> 
>


Re: How to implement Auto complete, suggestion client side

2015-01-26 Thread Dan Davis
Cannot get any easier than jquery-ui's autocomplete widget -
http://jqueryui.com/autocomplete/

Basically, you set some classes and implement a bit of JavaScript that calls
the server to get the autocomplete data.   I would never expose Solr to
browsers, so I would have the AJAX call go to a PHP script (or
function/method if you are using a web framework such as CakePHP or
Symfony).

Then, on the server, you make a request to Solr /suggest or /spell with
wt=json, and then you reformulate this into a simple JSON response that is
a simple array of options.
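
Roughly, that server-side piece looks like the sketch below - shown in Java
with SolrJ for concreteness (in PHP you would do the same with curl and
json_decode); it assumes a /suggest handler built on the spellcheck-based
suggester, and the handler name and core URL are assumptions:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SuggestProxy {
  // Returns a flat list of options for the user's prefix; serialize it as
  // the JSON array that the autocomplete widget expects.
  public static List<String> suggest(String prefix) throws SolrServerException {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery(prefix);
    q.setRequestHandler("/suggest");
    QueryResponse rsp = solr.query(q);
    List<String> options = new ArrayList<String>();
    SpellCheckResponse scr = rsp.getSpellCheckResponse();
    if (scr != null) {
      for (SpellCheckResponse.Suggestion s : scr.getSuggestions()) {
        options.addAll(s.getAlternatives());
      }
    }
    return options;
  }
}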

You can do this in stages:

   - Constant suggestions - you change your HTML and implement JavaScript
   that shows constant suggestions after, for instance, 2 seconds.
   - Constant suggestions from the server - you change your JavaScript to
   call the server, and have the server return a constant list.
   - Dynamic suggestions from the server - you implement the server-side to
   query Solr and turn the return from /suggest or /spell into a JSON array.
   - Tuning, tuning, tuning - you work hard on tuning it so that you get
   high quality suggestions for a wide variety of inputs.

Note that the autocomplete I've described for you is basically the simplest
thing possible, as you suggest you are new to it.   It is not based on data
mining of query and click-through logs, which is a very common pattern
these days.   There is no bolding of the portion of the words that are new.
  It is just a basic autocomplete widget with a delay.

On Mon, Jan 26, 2015 at 5:11 PM, Olivier Austina 
wrote:

> Hi All,
>
> I would say I am new to web technology.
>
> I would like to implement auto complete/suggestion in the user search box
> as the user type in the search box (like Google for example). I am using
> Solr as database. Basically I am  familiar with Solr and I can formulate
> suggestion queries.
>
> But now I don't know how to implement suggestion in the User Interface.
> Which technologies should I need. The website is in PHP. Any suggestions,
> examples, basic tutorial is welcome. Thank you.
>
>
>
> Regards
> Olivier
>


Re: Solr admin Url issues

2015-01-26 Thread Dan Davis
Is Jetty actually running on port 80?   Do you have an Apache2 reverse proxy
in front?

On Mon, Jan 26, 2015 at 11:02 PM, Summer Shire 
wrote:

> Hi All,
>
> Running solr (4.7.2) locally and hitting the admin page like this works
> just fine http://localhost:8983/solr/#
>
> But on my deployment server my path is
> http://example.org/jetty/MyApp/1/solr/#
> Or http://example.org/jetty/MyApp/1/solr/admin/cores or
> http://example.org/jetty/MyApp/1/solr/main/admin/
>
> the above request in a browser loads the admin page half way and then
> spawns another request at
> http://example.org/solr/admin/cores ….
>
> how can I maintain my other params such as jetty/MyApp/1/
>
> btw http://example.org/jetty/MyApp/1/solr/main/select?q=*:* or any other
> requesthandlers work just fine.
>
> What is going on here ? any idea ?
>
> thanks,
> Summer


Re: Cannot reindex to add a new field

2015-01-29 Thread Dan Davis
For this I prefer TemplateTransformer to RegexTransformer - it's not a
regex, just a pattern, and so it should be more efficient to use
TemplateTransformer.   A script will also work, of course.

On Tue, Jan 27, 2015 at 5:54 PM, Alexandre Rafalovitch 
wrote:

> On 27 January 2015 at 17:47, Carl Roberts 
> wrote:
> >  > commonField="false" regex=":" replaceWith=" "/>
>
> Yes, that works because the transformer copies it, not the
> EntityProcessor. So, no conflict on xpath.
>
> Regards,
>Alex.
>
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>


Re: Calling custom request handler with data import

2015-01-30 Thread Dan Davis
The Data Import Handler isn't pushing data into the /update request
handler.   However, Data Import Handler can be extended with transformers.
  Two such transformers are the TemplateTransformer and the
ScriptTransformer.   It may be possible to get a script function to load
your custom Java code.   You could also just write a
StandfordNerTransformer.

Hope this helps,

Dan

On Fri, Jan 30, 2015 at 9:07 AM, vineet yadav 
wrote:

> Hi,
> I am using data import handler to import data from mysql, and I want to
> identify name entities from it. So I am using following example(
> http://www.searchbox.com/named-entity-recognition-ner-in-solr/). where I
> am
> using stanford ner to identify name entities. I am using following
> requesthandler
>
>  class="org.apache.solr.handler.dataimport.DataImportHandler">
> 
>  data-import.xml
>  
> 
>
> for importing data from mysql and
>
> 
>   
>
>  
>content
>  
>
>
>
>  
>  
>
>  mychain
>
>   
>
> for identifying name entities.NER request handler identifies name entities
> from content field, but store extracted entities in solr fields.
>
> NER request handler was working when I am using nutch with solr. But When I
> am importing data from mysql, ner request handler is not invoked. So
> entities are not stored in solr for imported documents. Can anybody tell me
> how to call custom request handler in data import handler.
>
> Otherwise if I can invoke ner request handler externally, so that it can
> index person, organization and location in solr for imported document. It
> is also fine. Any suggestion are welcome.
>
> Thanks
> Vineet Yadav
>


Re: Calling custom request handler with data import

2015-01-30 Thread Dan Davis
You know, another thing you can do is just write some Java/perl/whatever to
pull data out of your database and push it to Solr.   Not as convenient
for development perhaps, but it has more legs in the long run.   Data
Import Handler does not easily multi-thread.
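
Something like the sketch below is all it takes with JDBC and SolrJ - the
JDBC URL, table, and column names are made up, and ConcurrentUpdateSolrServer
batches and sends the adds on several threads, which is exactly what DIH
won't do for you:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DbToSolr {
  public static void main(String[] args) throws Exception {
    // Queue size 1000, 4 sender threads.
    ConcurrentUpdateSolrServer solr =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 1000, 4);
    Connection db = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
    Statement st = db.createStatement();
    ResultSet rs = st.executeQuery("SELECT id, content FROM documents");
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", rs.getString("id"));
      doc.addField("content", rs.getString("content"));
      solr.add(doc);   // queued and sent in the background
    }
    solr.commit();
    solr.shutdown();
    db.close();
  }
}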

On Sat, Jan 31, 2015 at 12:34 AM, Dan Davis  wrote:

> The Data Import Handler isn't pushing data into the /update request
> handler.   However, Data Import Handler can be extended with transformers.
>   Two such transformers are the TemplateTransformer and the
> ScriptTransformer.   It may be possible to get a script function to load
> your custom Java code.   You could also just write a
> StanfordNerTransformer.
>
> Hope this helps,
>
> Dan
>
> On Fri, Jan 30, 2015 at 9:07 AM, vineet yadav  > wrote:
>
>> Hi,
>> I am using data import handler to import data from mysql, and I want to
>> identify name entities from it. So I am using following example(
>> http://www.searchbox.com/named-entity-recognition-ner-in-solr/). where I
>> am
>> using stanford ner to identify name entities. I am using following
>> requesthandler
>>
>> > class="org.apache.solr.handler.dataimport.DataImportHandler">
>> 
>>  data-import.xml
>>  
>> 
>>
>> for importing data from mysql and
>>
>> 
>>   
>>
>>  
>>content
>>  
>>
>>
>>
>>  
>>  
>>
>>  mychain
>>
>>   
>>
>> for identifying name entities.NER request handler identifies name entities
>> from content field, but store extracted entities in solr fields.
>>
>> NER request handler was working when I am using nutch with solr. But When
>> I
>> am importing data from mysql, ner request handler is not invoked. So
>> entities are not stored in solr for imported documents. Can anybody tell
>> me
>> how to call custom request handler in data import handler.
>>
>> Otherwise if I can invoke ner request handler externally, so that it can
>> index person, organization and location in solr for imported document. It
>> is also fine. Any suggestion are welcome.
>>
>> Thanks
>> Vineet Yadav
>>
>
>


role of the wiki and cwiki

2015-01-30 Thread Dan Davis
I've been thinking of https://wiki.apache.org/solr/ as the "Old Wiki" and
https://cwiki.apache.org/confluence/display/solr as the "New Wiki".

I guess that's the wrong way to think about it - Confluence is being used
for the "Solr Reference Guide", and MoinMoin is being used as a wiki.

Is this the correct understanding?


Re: role of the wiki and cwiki

2015-02-02 Thread Dan Davis
Hoss et al.,

I'm not intending to contribute documentation in any immediate sense (the
disclaimer), but I thank you all for the clarification.

It makes some sense to require a committer to review each suggested piece
of official documentation, but I wonder abstractly how a non-committer then
should contribute to the documentation.  I just did an evaluation of
several WCM systems, and it sounds almost like you need something more like
a WCM that supports some moderation workflow, rather than a wiki.

With current technology, possibilities include:

 * Make a comment within Confluence suggesting content or making a
clarification,
 * Create a blog post or MoinMoin edit with whatever content seems to be
needed,
 * Paste text and/or content into a JIRA ticket, or upload an attachment to
the JIRA ticket.

I think the JIRA ticket is the strongest, honestly, because it is true
moderation - nothing shows up until evaluated by a committer.

I also want to say that I value the very technical nature of the Solr
documentation, even as I welcome better organization.   Many products'
documentation is far too abstract, because it is written by a
technical writer not deeply familiar with either the technology or with
what users specifically want to do.   This is addressed by surfacing what
the users want to do, and then "How-to" specific documentation is written
that is still too vague on the technical details.   Sometimes a worked
example is very useful. I see a little, though not too much, of this
transition in the Data Import Handler documentation -
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
is more abstract, and moves too fast, relative to
http://wiki.apache.org/solr/DataImportHandler.   The ability to nest SQL-based
entities is very key to understanding, and it is not covered in the former.
One needs to see that an entity is not always a root entity.

So, I agree with the direction, but I hope the Solr Reference Guide can go
into more depth in some places, even as it continues to be better organized
if you are reading from scratch rather than starting with Solr In Action or
something like that.

Thanks again,

Dan


On Mon, Feb 2, 2015 at 11:57 AM, Chris Hostetter 
wrote:

>
> : Because they have different potential authors, the two systems now serve
> : different purposes.
> :
> : There are still some pages on the MoinMoin wiki that contain
> : documentation that should be in the reference guide, but isn't.
> :
> : The MoinMoin wiki is still useful, as a place where users can collect
> : information that is useful to others, but doesn't qualify as official
> : documentation, or perhaps simply hasn't been verified.  I believe this
> : means that a lot of information which has been migrated into the
> : reference guide will eventually be removed from MoinMoin.
>
> +1 ... it's just a matter of time/energy to clean things up...
>
>
> https://cwiki.apache.org/confluence/display/solr/Internal+-+Maintaining+Documentation#Internal-MaintainingDocumentation-WhatShouldandShouldNotbeIncludedinThisDocumentation
>
>
> FWIW: "Emmanuel Stalling" has started doing an audit of the wiki content
> vs the ref guide ... once more folks have a chance to review & dive
> in with edits should be really helpful to cleaning all this up...
>
> https://wiki.apache.org/solr/WikiManualComparison
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dan Davis
Suresh and Meena,

I have solved this problem by taking a row count on a query, and adding its
modulo as another field called threadid. The base query is wrapped in a
query that selects a subset of the results for indexing.   The modulo on
the row number was intentional - you cannot rely on id columns to be well
distributed and you cannot rely on the number of rows to stay constant over
time.

To make it more concrete, I have a base DataImportHandler configuration
that looks something like what's below - your SQL may differ as we use
Oracle.

  <entity name="..." rootEntity="true"
          query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
                 medplus.public_topic_sites_us_v t) WHERE threadid = %%d%%"
          transformer="TemplateTransformer">
    ...
  </entity>

To get it to be multi-threaded, I then copy it to 4 different configuration
files as follows:

echo "Medical Sites Configuration - "
${MEDSITES_CONF:=medical-sites-conf.xml}
echo "Medical Sites Prototype - "
${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}
for tid in `seq 0 3`; do
   MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE| sed -e "s/%%d%%/$tid/"`
   sed -e "s/%%d%%/$tid/" $MEDSITES_CONF > $MEDSITES_OUT
done


Then, I have 4 requestHandlers in solrconfig.xml that point to each of
these files.   They are "/import/medical-sites-0" through
"/import/medical-sites-3".   Note that this wouldn't work with a single
Data Import Handler that was parameterized - a particular Data Import
Handler is either idle or busy, and it can no longer be run with multiple
threads.   How this would work if the first entity weren't the root entity
is another question - you can usually structure it with the first SQL query
being the root entity if you are using SQL.   XML is another story, however.

I did it this way because I wanted to stay with Solr "out-of-the-box"
because it was an evaluation of what Data Import Handler could do.   If I
were doing this without some business requirement to evaluate whether Solr
"out-of-the-box" could do multithreaded database improt, I'd probably write
a multi-threaded front-end that did the queries and transformations I
needed to do.   In this case, I was considering the best way to do "all"
our data imports from RDBMS, and Data Import Handler is the only good
solution that involves writing configuration, not code.   The distinction
is slight, I think.

Hope this helps,

Dan Davis

On Wed, Feb 4, 2015 at 3:02 AM, Mikhail Khludnev  wrote:

> Suresh,
>
> There are a few common workaround for such problem. But, I think that
> submitting more than "maxIndexingThreads" is not really productive. Also, I
> think that out-of-memory problem is caused not by indexing, but by opening
> searcher. Do you really need to open it? I don't think it's a good idea to
> search on the instance which cooks many T index at the same time. Are you
> sure you don't issue superfluous commit, and you've disabled auto-commit?
>
> let's nail down oom problem first, and then deal with indexing speedup. I
> like huge indices!
>
> On Wed, Feb 4, 2015 at 1:10 AM, Arumugam, Suresh 
> wrote:
>
> > We are also facing the same problem in loading 14 Billion documents into
> > Solr 4.8.10.
> >
> > Dataimport is working in Single threaded, which is taking more than 3
> > weeks. This is working fine without any issues but it takes months to
> > complete the load.
> >
> > When we tried SolrJ with the below configuration in Multithreaded load,
> > the Solr is taking more memory & at one point we will end up in out of
> > memory as well.
> >
> > Batch Doc count  :  10 docs
> > No of Threads  : 16/32
> >
> > Solr Memory Allocated : 200 GB
> >
> > The reason can be as below.
> >
> > Solr is taking the snapshot, whenever we open a SearchIndexer.
> > Due to this more memory is getting consumed & solr is extremely
> > slow while running 16 or more threads for loading.
> >
> > If anyone have already done the multithreaded data load into Solr in a
> > quicker way, Can you please share the code or logic in using the SolrJ
> API?
> >
> > Thanks in advance.
> >
> > Regards,
> > Suresh.A
> >
> > -Original Message-
> > From: Dyer, James [mailto:james.d...@ingramcontent.com]
> > Sent: Tuesday, February 03, 2015 1:58 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Solr 4.9 Calling DIH concurrently
> >
> > DIH is single-threaded.  There was once a threaded option, but it was
> > buggy and subsequently was removed.
> >
> > What I do is partition my data and run multiple dih request handlers at
> > the same time.  It means redundant sections in solrconfig.xml and its not
> > very elegant but it works.
> >
> > For instance, for a sql query, I add something like 

Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dan Davis
"Data Import Handler is the only good solution that involves writing
configuration, not code."  - I also had a requirement not to look at
product-oriented enhancements to Solr, and there are many products I didn't
look at, or rejected, like django-haystack.   Perl, ruby, and python have
good handling of both databases and Solr, as does Java with JDBC and SolrJ.
  Pushing to Solr probably has more legs than Data Import Handler going
forward.

On Wed, Feb 4, 2015 at 11:13 AM, Dan Davis  wrote:

> Suresh and Meena,
>
> I have solved this problem by taking a row count on a query, and adding
> its modulo as another field called threadid. The base query is wrapped
> in a query that selects a subset of the results for indexing.   The modulo
> on the row number was intentional - you cannot rely on id columns to be
> well distributed and you cannot rely on the number of rows to stay constant
> over time.
>
> To make it more concrete, I have a base DataImportHandler configuration
> that looks something like what's below - your SQL may differ as we use
> Oracle.
>
> <entity name="..." rootEntity="true"
> query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid FROM
> medplus.public_topic_sites_us_v t) WHERE threadid = %%d%%"
> transformer="TemplateTransformer">
> ...
>
> </entity>
>
>
> To get it to be multi-threaded, I then copy it to 4 different
> configuration files as follows:
>
> echo "Medical Sites Configuration - "
> ${MEDSITES_CONF:=medical-sites-conf.xml}
> echo "Medical Sites Prototype - "
> ${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}
> for tid in `seq 0 3`; do
>MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE| sed -e "s/%%d%%/$tid/"`
>sed -e "s/%%d%%/$tid/" $MEDSITES_CONF > $MEDSITES_OUT
> done
>
>
> Then, I have 4 requestHandlers in solrconfig.xml that point to each of
> these files.They are "/import/medical-sites-0" through
> "/import/medical-sites-3".   Note that this wouldn't work with a single
> Data Import Handler that was parameterized - a particular data Import
> Handler is either idle or busy, and no longer should be run in multiple
> threads.   How this would work if the first entity weren't the root entity
> is another question - you can usually structure it with the first SQL query
> being the root entity if you are using SQL.   XML is another story, however.
>
> I did it this way because I wanted to stay with Solr "out-of-the-box"
> because it was an evaluation of what Data Import Handler could do.   If I
> were doing this without some business requirement to evaluate whether Solr
> "out-of-the-box" could do multithreaded database improt, I'd probably write
> a multi-threaded front-end that did the queries and transformations I
> needed to do.   In this case, I was considering the best way to do "all"
> our data imports from RDBMS, and Data Import Handler is the only good
> solution that involves writing configuration, not code.   The distinction
> is slight, I think.
>
> Hope this helps,
>
> Dan Davis
>
> On Wed, Feb 4, 2015 at 3:02 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
>> Suresh,
>>
>> There are a few common workaround for such problem. But, I think that
>> submitting more than "maxIndexingThreads" is not really productive. Also,
>> I
>> think that out-of-memory problem is caused not by indexing, but by opening
>> searcher. Do you really need to open it? I don't think it's a good idea to
>> search on the instance which cooks many T index at the same time. Are you
>> sure you don't issue superfluous commit, and you've disabled auto-commit?
>>
>> let's nail down oom problem first, and then deal with indexing speedup. I
>> like huge indices!
>>
>> On Wed, Feb 4, 2015 at 1:10 AM, Arumugam, Suresh > >
>> wrote:
>>
>> > We are also facing the same problem in loading 14 Billion documents into
>> > Solr 4.8.10.
>> >
>> > Dataimport is working in Single threaded, which is taking more than 3
>> > weeks. This is working fine without any issues but it takes months to
>> > complete the load.
>> >
>> > When we tried SolrJ with the below configuration in Multithreaded load,
>> > the Solr is taking more memory & at one point we will end up in out of
>> > memory as well.
>> >
>> > Batch Doc count  :  10 docs
>> > No of Threads  : 16/32
>> >
>> > Solr Memory Allocated : 200 GB
>> >
&

Re: clarification regarding shard splitting and composite IDs

2015-02-04 Thread Dan Davis
Doesn't relevancy for that assume that the IDF and TF for user1 and user2
are not too different?   SolrCloud still doesn't use a distributed IDF,
correct?

On Wed, Feb 4, 2015 at 7:05 PM, Gili Nachum  wrote:

> Alright. So shard splitting and composite routing plays nicely together.
> Thank you Anshum.
>
> On Wed, Feb 4, 2015 at 11:24 AM, Anshum Gupta 
> wrote:
>
> > In one line, shard splitting doesn't cater to depend on the routing
> > mechanism but just the hash range so you could have documents for the
> same
> > prefix split up.
> >
> > Here's an overview of routing in SolrCloud:
> > * Happens based on a hash value
> > * The hash is calculated using the multiple parts of the routing key. In
> > case of A!B, 16 bits are obtained from murmurhash(A) and the LSB 16 bits
> of
> > the routing key are obtained from murmurhash(B). This sends the docs to
> the
> > right shard.
> > * When querying using A!, all shards that contain hashes from the range
> 16
> > bits from murmurhash(A)- to murmurhash(A)- are used.
> >
> > When you split a shard, for say range  -  , it is split
> > from the middle (by default) and over multiple split, docs for the same
> A!
> > prefix might end up on different shards, but the request routing should
> > take care of that.
> >
> > You can read more about routing here:
> > https://lucidworks.com/blog/solr-cloud-document-routing/
> > http://lucidworks.com/blog/multi-level-composite-id-routing-solrcloud/
> >
> > and shard splitting here:
> > http://lucidworks.com/blog/shard-splitting-in-solrcloud/
> >
> >
> > On Wed, Feb 4, 2015 at 12:59 AM, Gili Nachum 
> wrote:
> >
> > > Hi, I'm also interested. When using composite the ID, the _route_
> > > information is not kept on the document itself, so to me it looks like
> > it's
> > > not possible as the split API
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
> > > >
> > > doesn't have a relevant parameter to split correctly.
> > > Could report back once I try it in practice.
> > >
> > > On Mon, Nov 10, 2014 at 7:27 PM, Ian Rose 
> wrote:
> > >
> > > > Howdy -
> > > >
> > > > We are using composite IDs of the form !.  This ensures
> > that
> > > > all events for a user are stored in the same shard.
> > > >
> > > > I'm assuming from the description of how composite ID routing works,
> > that
> > > > if you split a shard the "split point" of the hash range for that
> shard
> > > is
> > > > chosen to maintain the invariant that all documents that share a
> > routing
> > > > prefix (before the "!") will still map to the same (new) shard.  Is
> > that
> > > > accurate?
> > > >
> > > > A naive shard-split implementation (e.g. that chose the hash range
> > split
> > > > point arbitrarily) could end up with "child" shards that split a
> > routing
> > > > prefix.
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > >
> >
> >
> >
> > --
> > Anshum Gupta
> > http://about.me/anshumgupta
> >
>


Re: clarification regarding shard splitting and composite IDs

2015-02-05 Thread Dan Davis
Thanks, Anshum - I should never have posted so late.   It is true that
different users will have different word frequencies, but an application
exploiting that for better relevancy would be going to great lengths for the
relevancy of individual users' results.

On Thu, Feb 5, 2015 at 12:41 AM, Anshum Gupta 
wrote:

> Solr 5.0 has support for distributed IDF. Also, users having the same IDF
> is orthogonal to the original question.
>
> In general, the Doc Freq. is only per-shard. If for some reason, a single
> user has documents split across shards, the IDF used would be different for
> docs on different shards.
>
> On Wed, Feb 4, 2015 at 9:06 PM, Dan Davis  wrote:
>
>> Doesn't relevancy for that assume that the IDF and TF for user1 and user2
>> are not too different?SolrCloud still doesn't use a distributed IDF,
>> correct?
>>
>> On Wed, Feb 4, 2015 at 7:05 PM, Gili Nachum  wrote:
>>
>> > Alright. So shard splitting and composite routing plays nicely together.
>> > Thank you Anshum.
>> >
>> > On Wed, Feb 4, 2015 at 11:24 AM, Anshum Gupta 
>> > wrote:
>> >
>> > > In one line, shard splitting doesn't cater to depend on the routing
>> > > mechanism but just the hash range so you could have documents for the
>> > same
>> > > prefix split up.
>> > >
>> > > Here's an overview of routing in SolrCloud:
>> > > * Happens based on a hash value
>> > > * The hash is calculated using the multiple parts of the routing key.
>> In
>> > > case of A!B, 16 bits are obtained from murmurhash(A) and the LSB 16
>> bits
>> > of
>> > > the routing key are obtained from murmurhash(B). This sends the docs
>> to
>> > the
>> > > right shard.
>> > > * When querying using A!, all shards that contain hashes from the
>> range
>> > 16
>> > > bits from murmurhash(A)- to murmurhash(A)- are used.
>> > >
>> > > When you split a shard, for say range  -  , it is
>> split
>> > > from the middle (by default) and over multiple split, docs for the
>> same
>> > A!
>> > > prefix might end up on different shards, but the request routing
>> should
>> > > take care of that.
>> > >
>> > > You can read more about routing here:
>> > > https://lucidworks.com/blog/solr-cloud-document-routing/
>> > >
>> http://lucidworks.com/blog/multi-level-composite-id-routing-solrcloud/
>> > >
>> > > and shard splitting here:
>> > > http://lucidworks.com/blog/shard-splitting-in-solrcloud/
>> > >
>> > >
>> > > On Wed, Feb 4, 2015 at 12:59 AM, Gili Nachum 
>> > wrote:
>> > >
>> > > > Hi, I'm also interested. When using composite the ID, the _route_
>> > > > information is not kept on the document itself, so to me it looks
>> like
>> > > it's
>> > > > not possible as the split API
>> > > > <
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
>> > > > >
>> > > > doesn't have a relevant parameter to split correctly.
>> > > > Could report back once I try it in practice.
>> > > >
>> > > > On Mon, Nov 10, 2014 at 7:27 PM, Ian Rose 
>> > wrote:
>> > > >
>> > > > > Howdy -
>> > > > >
>> > > > > We are using composite IDs of the form !.  This
>> ensures
>> > > that
>> > > > > all events for a user are stored in the same shard.
>> > > > >
>> > > > > I'm assuming from the description of how composite ID routing
>> works,
>> > > that
>> > > > > if you split a shard the "split point" of the hash range for that
>> > shard
>> > > > is
>> > > > > chosen to maintain the invariant that all documents that share a
>> > > routing
>> > > > > prefix (before the "!") will still map to the same (new) shard.
>> Is
>> > > that
>> > > > > accurate?
>> > > > >
>> > > > > A naive shard-split implementation (e.g. that chose the hash range
>> > > split
>> > > > > point arbitrarily) could end up with "child" shards that split a
>> > > routing
>> > > > > prefix.
>> > > > >
>> > > > > Thanks,
>> > > > > Ian
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Anshum Gupta
>> > > http://about.me/anshumgupta
>> > >
>> >
>>
>
>
>
> --
> Anshum Gupta
> http://about.me/anshumgupta
>


Re: Delta import query not working

2015-02-05 Thread Dan Davis
It looks like you are returning the transformed ID, along with some other
fields, in the deltaQuery command.   deltaQuery should only return the ID,
without the "stk_" prefix, and then deltaImportQuery should retrieve the
transformed ID.   I'd suggest:

  deltaQuery="SELECT id FROM stock_items WHERE updated_at > '${dih.delta.last_index_time}'"
  deltaImportQuery="SELECT CONCAT('stk_',id) AS id, part_no, name,
      description FROM stock_items WHERE id='${dih.delta.id}'">

I'm not sure which RDBMS you are using, but you probably don't need to work
around the column names at all.


On Thu, Feb 5, 2015 at 5:18 PM, willbrindle  wrote:

> Hi,
>
> I am very new to Solr but I have been playing around with it a bit and my
> imports are all working fine. However, now I wish to perform a delta import
> on my query and I'm just getting nothing.
>
> I have the entity:
>
> query="SELECT CONCAT('stk_',id) AS id,part_no,name,description
> FROM
> stock_items"
>   deltaQuery="SELECT CONCAT('stk_',id) AS
> id,part_no,name,description,updated_at FROM stock_items WHERE updated_at >
> '${dih.delta.last_index_time}'"
>   deltaImportQuery="SELECT CONCAT('stk_',id) AS id,id AS
> id2,part_no,name,description FROM stock_items WHERE id2='${dih.delta.id
> }'">
>
>
> I am not too sure if ${dih.delta.id} is supposed to be id or id2 but I
> have
> tried both and neither work. My output is something along the lines of:
>
> {
>   "responseHeader": {
> "status": 0,
> "QTime": 0
>   },
>   "initArgs": [
> "defaults",
> [
>   "config",
>   "data-config.xml"
> ]
>   ],
>   "command": "status",
>   "status": "idle",
>   "importResponse": "",
>   "statusMessages": {
> "Time Elapsed": "0:0:16.778",
> "Total Requests made to DataSource": "2",
> "Total Rows Fetched": "0",
> "Total Documents Skipped": "0",
> "Delta Dump started": "2015-02-05 16:17:54",
> "Identifying Delta": "2015-02-05 16:17:54",
> "Deltas Obtained": "2015-02-05 16:17:54",
> "Building documents": "2015-02-05 16:17:54",
> "Total Changed Documents": "0",
> "Delta Import Failed": "2015-02-05 16:17:54"
>   },
>   "WARNING": "This response format is experimental.  It is likely to change
> in the future."
> }
>
> My full import query is working fine.
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Delta-import-query-not-working-tp4184280.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Delta import query not working

2015-02-05 Thread Dan Davis
It also should be ${dataimporter.last_index_time}

Also, that's two queries - an outer query to get the IDs that are modified,
and another query (done repeatedly) to get the data.   You can go faster
using a parameterized data import as described in the wiki:

http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

Hope this helps,

Dan

On Thu, Feb 5, 2015 at 9:30 PM, Dan Davis  wrote:

> It looks like you are returning the transformed ID, along with some other
> fields, in the deltaQuery command.deltaQuery should only return the ID,
> without the "stk_" prefix, and then deltaImportQuery should retrieve the
> transformed ID.   I'd suggest:
>
>   deltaQuery="SELECT id FROM stock_items WHERE updated_at > '${dih.delta.last_index_time}'"
>  deltaImportQuery="SELECT CONCAT('stk_',id) AS id, part_no, name,
> description FROM stock_items WHERE id='${dih.delta.id}'">
>
> I'm not sure which RDBMS you are using, but you probably don't need to
> work around the column names at all.
>
>
> On Thu, Feb 5, 2015 at 5:18 PM, willbrindle  wrote:
>
>> Hi,
>>
>> I am very new to Solr but I have been playing around with it a bit and my
>> imports are all working fine. However, now I wish to perform a delta
>> import
>> on my query and I'm just getting nothing.
>>
>> I have the entity:
>>
>>  >   query="SELECT CONCAT('stk_',id) AS id,part_no,name,description
>> FROM
>> stock_items"
>>   deltaQuery="SELECT CONCAT('stk_',id) AS
>> id,part_no,name,description,updated_at FROM stock_items WHERE updated_at >
>> '${dih.delta.last_index_time}'"
>>   deltaImportQuery="SELECT CONCAT('stk_',id) AS id,id AS
>> id2,part_no,name,description FROM stock_items WHERE id2='${dih.delta.id
>> }'">
>>
>>
>> I am not too sure if ${dih.delta.id} is supposed to be id or id2 but I
>> have
>> tried both and neither work. My output is something along the lines of:
>>
>> {
>>   "responseHeader": {
>> "status": 0,
>> "QTime": 0
>>   },
>>   "initArgs": [
>> "defaults",
>> [
>>   "config",
>>   "data-config.xml"
>> ]
>>   ],
>>   "command": "status",
>>   "status": "idle",
>>   "importResponse": "",
>>   "statusMessages": {
>> "Time Elapsed": "0:0:16.778",
>> "Total Requests made to DataSource": "2",
>> "Total Rows Fetched": "0",
>> "Total Documents Skipped": "0",
>> "Delta Dump started": "2015-02-05 16:17:54",
>> "Identifying Delta": "2015-02-05 16:17:54",
>> "Deltas Obtained": "2015-02-05 16:17:54",
>> "Building documents": "2015-02-05 16:17:54",
>> "Total Changed Documents": "0",
>> "Delta Import Failed": "2015-02-05 16:17:54"
>>   },
>>   "WARNING": "This response format is experimental.  It is likely to
>> change
>> in the future."
>> }
>>
>> My full import query is working fine.
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Delta-import-query-not-working-tp4184280.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>


Re: Solr on Tomcat

2015-02-10 Thread Dan Davis
As an application developer, I have to agree with this direction.   I ran
ManifoldCF and Solr together in the same Tomcat, and the slf4j
configurations of the two conflicted with strange results.   From a systems
administrator/operations perspective, a separate install allows better
packaging, e.g. Debian and RPM packages are then possible, although they may
not be preferred as many enterprises will want to use Oracle Java rather than
OpenJDK.

On Tue, Feb 10, 2015 at 1:12 PM, Matt Kuiper  wrote:

> Thanks for all the responses.  I am planning a new project, and
> considering deployment options at this time.  It's helpful to see where
> Solr is headed.
>
> Thanks,
>
> Matt Kuiper
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Tuesday, February 10, 2015 10:05 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr on Tomcat
>
> On 2/10/2015 9:48 AM, Matt Kuiper wrote:
> > I am starting to look in to Solr 5.0.  I have been running Solr 4.* on
> Tomcat.   I was surprised to find the following notice on
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+Tomcat
>  (Marked as Unreleased)
> >
> >  Beginning with Solr 5.0, Support for deploying Solr as a WAR in
> servlet containers like Tomcat is no longer supported.
> >
> > I want to verify that it is true that Solr 5.0 will not be able to run
> on Tomcat, and confirm that the recommended way to deploy Solr 5.0 is as a
> Linux service.
>
> Solr will eventually (hopefully soon) be entirely its own application.
> The documentation you have seen in the reference guide is there to prepare
> users for this eventuality.
>
> Right now we are in a transition period.  We have built scripts for
> controlling the start and stop of the example server installation.
> Under the covers, Solr is still a web application contained in a war and
> the example server still runs an unmodified copy of jetty.  Down the road,
> when Solr will becomes a completely standalone application, we will merely
> have to modify the script wrapper to use it, and the user may not even
> notice the change.
>
> With 5.0, if you want to run in tomcat, you will be able to find the war
> in the download's server/webapps directory and use it just like you do now
> ... but we will be encouraging people to NOT do this, because eventually it
> will be completely unsupported.
>
> Thanks,
> Shawn
>
>


Meta-search by subclassing SearchHandler

2013-08-15 Thread Dan Davis
I am considering enabling a true "Federated Search", or meta-search, using
the following basic configuration (this configuration is only for
development and evaluation):

Three Solr cores:

   - One to search data I have indexed locally
   - One with a custom SearchHandler that is a facade, e.g. it performs a
   meta-search (aka Federated Search)
   - One that queries and merges the above cores as "shards"

Lest I seem completely like Sauron, I read
http://2011.berlinbuzzwords.de/sites/2011.berlinbuzzwords.de/files/AndrzejBialecki-Buzzwords-2011_0.pdf
and am familiar with evaluating "precision at 10", etc. although I am no
doubt less familiar with IR than many.

I think that it is much, much better for performance and relevancy to index
it all on a level playing field.  But my employer cannot do that, because
we do not have a license to all the data we may wish to search in the
future.

My questions are simple - has anybody implemented such a SearchHandler that
is a facade for another search engine?   How would I get started with that?

I have made a similar post on the blacklight developers google group.


More on topic of Meta-search/Federated Search with Solr

2013-08-16 Thread Dan Davis
I've thought about it, and I have no time to really do a meta-search during
evaluation.  What I need to do is to create a single core that contains
both of my data sets, and then describe the architecture that would be
required to do blended results, with liberal estimates.

From the perspective of evaluation, I need to understand whether any of the
solutions to better ranking in the absence of global IDF have been
explored?I suspect that one could retrieve a much larger than N set of
results from a set of shards, re-score in some way that doesn't require
IDF, e.g. storing both results in the same priority queue and *re-scoring*
before *re-ranking*.

The other way to do this would be to have a custom SearchHandler that works
differently - it performs the query, retrieves all results deemed relevant by
another engine, adds them to the Lucene index, and then performs the query
again in the standard way.   This would be quite slow, but perhaps useful
as a way to evaluate my method.

I still welcome any suggestions on how such a SearchHandler could be
implemented.


Re: Prevent Some Keywords at Analyzer Step

2013-08-19 Thread Dan Davis
This is an interesting topic - my employer is a medical library and there
are many keywords that may need to be aliased in various ways, and 2 or 3
word phrases that perhaps should be treated specially.   Jack, can you give
me an example of how to do that sort of thing?   Perhaps I need to buy
your almost released Deep Dive book...
Sorry to be too tangential - it is my strange way.


On Mon, Aug 19, 2013 at 12:32 PM, Jack Krupansky wrote:

> Okay, but what is it that you are trying to "prevent"??
>
> And, "diet follower" is a phrase, not a keyword or term.
>
> So, I'm still baffled as to what you are really trying to do. Trying
> explaining it in plain English.
>
> And given this same input, how would it be queried?
>
>
> -- Jack Krupansky
>
> -Original Message- From: Furkan KAMACI
> Sent: Monday, August 19, 2013 11:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Prevent Some Keywords at Analyzer Step
>
>
> Let's assume that my sentence is that:
>
> *Alice is a diet follower*
>
> My special keyword => *diet follower*
>
> Tokens will be:
>
> Token 1) Alice
> Token 2) is
> Token 3) a
> Token 4) diet
> Token 5) follower
> Token 6) *diet follower*
>
>
> 2013/8/19 Jack Krupansky 
>
>  Your example doesn't "prevent" any keywords.
>>
>> You need to elaborate the specific requirements with more detail.
>>
>> Given a long stream of text, what tokenization do you expect in the index?
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Furkan KAMACI Sent: Monday, August 19,
>> 2013 8:07 AM To: solr-user@lucene.apache.org Subject: Prevent Some
>> Keywords at Analyzer Step
>> Hi;
>>
>> I want to write an analyzer that will prevent some special words. For
>> example sentence to be indexed is:
>>
>> diet follower
>>
>> it will tokenize it as like that
>>
>> token 1) diet
>> token 2) follower
>> token 3) diet follower
>>
>> How can I do that with Solr?
>>
>>
>


Re: Flushing cache without restarting everything?

2013-08-22 Thread Dan Davis
be careful with drop_caches - make sure you sync first


On Thu, Aug 22, 2013 at 1:28 PM, Jean-Sebastien Vachon <
jean-sebastien.vac...@wantedanalytics.com> wrote:

> I was afraid someone would tell me that... thanks for your input
>
> > -Original Message-
> > From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
> > Sent: August-22-13 9:56 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Flushing cache without restarting everything?
> >
> > On Tue, 2013-08-20 at 20:04 +0200, Jean-Sebastien Vachon wrote:
> > > Is there a way to flush the cache of all nodes in a Solr Cloud (by
> > > reloading all the cores, through the collection API, ...) without
> > > having to restart all nodes?
> >
> > As MMapDirectory shares data with the OS disk cache, flushing of
> > Solr-related caches on a machine should involve
> >
> > 1) Shut down all Solr instances on the machine
> > 2) Clear the OS read cache ('sudo echo 1 > /proc/sys/vm/drop_caches' on
> > a Linux box)
> > 3) Start the Solr instances
> >
> > I do not know of any Solr-supported way to do step 2. For our
> > performance tests we use custom scripts to perform the steps.
> >
> > - Toke Eskildsen, State and University Library, Denmark
> >
> >
> > -
> > Aucun virus trouvé dans ce message.
> > Analyse effectuée par AVG - www.avg.fr
> > Version: 2013.0.3392 / Base de données virale: 3209/6563 - Date:
> 09/08/2013
> > La Base de données des virus a expiré.
>


Removing duplicates during a query

2013-08-22 Thread Dan Davis
Suppose I have two documents with different id, and there is another field,
for instance "content-hash" which is something like a 16-byte hash of the
content.

Can Solr be configured to return just one copy, and drop the other if both
are relevant?

If Solr does drop one result, do you get any indication in the document
that was kept that there was another copy?


Re: How to avoid underscore sign indexing problem?

2013-08-22 Thread Dan Davis
Ah, but what is the definition of punctuation in Solr?


On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky wrote:

> "I thought that the StandardTokenizer always split on punctuation, "
>
> Proving that you haven't read my book! The section on the standard
> tokenizer details the rules that the tokenizer uses (in addition to
> extensive examples.) That's what I mean by "deep dive."
>
> -- Jack Krupansky
>
> -Original Message- From: Shawn Heisey
> Sent: Wednesday, August 21, 2013 10:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to avoid underscore sign indexing problem?
>
>
> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>
>> When using StandardAnalyzer to tokenize string "Pacific_Rim" will get
>>
>> ST
>> text: pacific_rim   raw_bytes: [70 61 63 69 66 69 63 5f 72 69 6d]
>> start: 0   end: 11   position: 1
>>
>> How to make this string to be tokenized to these two tokens "Pacific",
>> "Rim"?
>> Set _ as stopword?
>> Please kindly help on this.
>> Many thanks.
>>
>
> Interesting.  I thought that the StandardTokenizer always split on
> punctuation, but apparently that's not the case for the underscore
> character.
>
> You can always use the WordDelimeterFilter after the StandardTokenizer.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>
> Thanks,
> Shawn
>


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-22 Thread Dan Davis
You are right, but here's my null hypothesis for studying the impact on
relevance.   Hash the query to deterministically seed a random number
generator.   Pick one from column A or column B randomly.

This is of course wrong - a query might find two non-relevant results in
corpus A and lots of relevant results in corpus B, leading to poor
precision because the two non-relevant documents are likely to show up on
the first page.   You can weight on the size of the corpus, but weighting
is probably wrong then on any specific query.

It was an interesting thought experiment though.

Erik,

Since LucidWorks was dinged in the 2013 Magic Quadrant on Enterprise Search
due to a lack of "Federated Search", the for-profit Enterprise Search
companies must be doing it some way.Maybe relevance suffers (a lot),
but you can do it if you want to.

I have read very little of the IR literature - enough to sound like I know
a little, but it is a very little.  If there is literature on this, it
would be an interesting read.


On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson wrote:

> The lack of global TF/IDF has been answered in the past,
> in the sharded case, by "usually you have similar enough
> stats that it doesn't matter". This pre-supposes a fairly
> evenly distributed set of documents.
>
> But if you're talking about federated search across different
> types of documents, then what would you "rescore" with?
> How would you even consider scoring docs that are somewhat/
> totally different? Think magazine articles an meta-data associated
> with pictures.
>
> What I've usually found is that one can use grouping to show
> the top N of a variety of results. Or show tabs with different
> types. Or have the app intelligently combine the different types
> of documents in a way that "makes sense". But I don't know
> how you'd just get "the right thing" to happen with some kind
> of scoring magic.
>
> Best
> Erick
>
>
> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis  wrote:
>
>> I've thought about it, and I have no time to really do a meta-search
>> during
>> evaluation.  What I need to do is to create a single core that contains
>> both of my data sets, and then describe the architecture that would be
>> required to do blended results, with liberal estimates.
>>
>> From the perspective of evaluation, I need to understand whether any of
>> the
>> solutions to better ranking in the absence of global IDF have been
>> explored?I suspect that one could retrieve a much larger than N set of
>> results from a set of shards, re-score in some way that doesn't require
>> IDF, e.g. storing both results in the same priority queue and *re-scoring*
>> before *re-ranking*.
>>
>> The other way to do this would be to have a custom SearchHandler that
>> works
>> differently - it performs the query, retries all results deemed relevant
>> by
>> another engine, adds them to the Lucene index, and then performs the query
>> again in the standard way.   This would be quite slow, but perhaps useful
>> as a way to evaluate my method.
>>
>> I still welcome any suggestions on how such a SearchHandler could be
>> implemented.
>>
>
>


Re: Removing duplicates during a query

2013-08-22 Thread Dan Davis
OK - I see that this can be done with Field Collapsing/Grouping.  I also
see the mentions in the Wiki for avoiding duplicates using a 16-byte hash.

So, question withdrawn...
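
For anyone who finds this thread later, the grouping query is roughly the
sketch below via SolrJ - "content-hash" is the field from my question, and
the SignatureUpdateProcessorFactory described on the Deduplication wiki page
is one way to populate such a field at index time:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class DedupQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("some query");
    q.set("group", true);
    q.set("group.field", "content-hash");
    // group.main=true flattens the groups: only the top document of each
    // group (one per content-hash value) comes back in the normal result list.
    q.set("group.main", true);
    QueryResponse rsp = solr.query(q);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("id"));
    }
  }
}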


On Thu, Aug 22, 2013 at 10:21 PM, Dan Davis  wrote:

> Suppose I have two documents with different id, and there is another
> field, for instance "content-hash" which is something like a 16-byte hash
> of the content.
>
> Can Solr be configured to return just one copy, and drop the other if both
> are relevant?
>
> If Solr does drop one result, do you get any indication in the document
> that was kept that there was another copy?
>
>


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-26 Thread Dan Davis
I have now come to the task of estimating man-days to add "Blended Search
Results" to Apache Solr.   The argument has been made that this is not
desirable (see Jonathan Rochkind's blog entries on Bento search with
blacklight).   But the estimate remains.   No estimate is worth much
without a design.   So, I have come to the difficulty of estimating this
without having an in-depth knowledge of the Apache Solr core.   Here is my
design, likely imperfect, as it stands.

   - Configure a core specific to each search source (local or remote)
   - On cores that index remote content, implement a periodic delete query
   that deletes documents whose timestamp is too old
   - Implement a custom requestHandler for the "remote" cores that goes out
   and queries the remote source.   For each result in the top N
   (configurable), it computes an id that is stable (e.g. it is based on the
   remote resource URL, doi, or hash of data returned).   It uses that id to
   look-up the document in the lucene database.   If the data is not there, it
   updates the lucene core and sets a flag that commit is required.   Once it
   is done, it commits if needed.
   - Configure a core that uses a custom SearchComponent to call the
   requestHandler that goes and gets new documents and commits them.   Since
   the cores for remote content are different cores, they can restart their
   searcher at this point if any commit is needed.   The custom
   SearchComponent will wait for commit and reload to be completed.   Then,
   search continues using the other cores as "shards".
   - Auto-warming on this will assure that the most recently requested data
   is present.

It will, of course, be very slow a good part of the time.
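
To make the estimate a bit more concrete, the custom SearchComponent above
might start as the skeleton below - the class name is a placeholder, and the
remote fetch and wait-for-commit logic are only sketched in comments:

import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class RemoteFetchComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    String q = rb.req.getParams().get("q");
    // 1. Relay q to each remote source's facade requestHandler.
    // 2. Those handlers upsert the top-N remote hits into their own cores
    //    and commit.
    // 3. Block here until the commits (and searcher reopens) complete, so
    //    the distributed query below sees the fresh documents.
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // Nothing extra to do; the standard query components run against the
    // local core plus the remote-content cores listed as shards.
  }

  @Override
  public String getDescription() {
    return "Fetches remote results into side cores before the main query";
  }

  @Override
  public String getSource() {
    return null;
  }
}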

Erik and others, I need to know whether this design has legs and what other
alternatives I might consider.



On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson wrote:

> The lack of global TF/IDF has been answered in the past,
> in the sharded case, by "usually you have similar enough
> stats that it doesn't matter". This pre-supposes a fairly
> evenly distributed set of documents.
>
> But if you're talking about federated search across different
> types of documents, then what would you "rescore" with?
> How would you even consider scoring docs that are somewhat/
> totally different? Think magazine articles an meta-data associated
> with pictures.
>
> What I've usually found is that one can use grouping to show
> the top N of a variety of results. Or show tabs with different
> types. Or have the app intelligently combine the different types
> of documents in a way that "makes sense". But I don't know
> how you'd just get "the right thing" to happen with some kind
> of scoring magic.
>
> Best
> Erick
>
>
> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis  wrote:
>
>> I've thought about it, and I have no time to really do a meta-search
>> during
>> evaluation.  What I need to do is to create a single core that contains
>> both of my data sets, and then describe the architecture that would be
>> required to do blended results, with liberal estimates.
>>
>> From the perspective of evaluation, I need to understand whether any of
>> the
>> solutions to better ranking in the absence of global IDF have been
>> explored?I suspect that one could retrieve a much larger than N set of
>> results from a set of shards, re-score in some way that doesn't require
>> IDF, e.g. storing both results in the same priority queue and *re-scoring*
>> before *re-ranking*.
>>
>> The other way to do this would be to have a custom SearchHandler that
>> works
>> differently - it performs the query, retries all results deemed relevant
>> by
>> another engine, adds them to the Lucene index, and then performs the query
>> again in the standard way.   This would be quite slow, but perhaps useful
>> as a way to evaluate my method.
>>
>> I still welcome any suggestions on how such a SearchHandler could be
>> implemented.
>>
>
>


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-26 Thread Dan Davis
First answer:

My employer is a library and does not have the license to harvest everything
indexed by a "web-scale discovery service" such as PRIMO or Summon.   If
our design automatically relays searches entered by users, and then
periodically purges results, I think it is reasonable from a licensing
perspective.

Second answer:

What if you wanted your Apache Solr powered search to include all results
from Google Scholar for any query?   Do you think you could easily or
cheaply configure a Zookeeper cluster large enough to harvest and index all
of Google Scholar?   Would that violate robot rules?   Is it even possible
to do this from an API perspective?   Wouldn't Google notice?

Third answer:

On Gartner's 2013 Enterprise Search Magic Quadrant, LucidWorks and the
other Enterprise Search firm based on Apache Solr were dinged on the lack
of Federated Search.  I do not have the hubris to think I can fix that, and
it is not really my role to try, but something that works without
harvesting and local indexing is obviously desirable to Enterprise Search
users.



On Mon, Aug 26, 2013 at 4:46 PM, Paul Libbrecht  wrote:

>
> Why not simply create a meta search engine that indexes everything of each
> of the nodes.?
> (I think one calls this harvesting)
>
> I believe that this the way to avoid all sorts of performance bottleneck.
> As far as I could analyze, the performance of a federated search is the
> performance of the least speedy node; which can turn to be quite bad if you
> do not exercise guarantees of remote sources.
>
> Or are the "remote cores" below actually things that you manage on your
> side? If yes guarantees are easy to manage..
>
> Paul
>
>
> On 26 Aug 2013, at 22:38, Dan Davis wrote:
>
> > I have now come to the task of estimating man-days to add "Blended Search
> > Results" to Apache Solr.   The argument has been made that this is not
> > desirable (see Jonathan Rochkind's blog entries on Bento search with
> > blacklight).   But the estimate remains.No estimate is worth much
> > without a design.   So, I am come to the difficult of estimating this
> > without having an in-depth knowledge of the Apache core.   Here is my
> > design, likely imperfect, as it stands.
> >
> >   - Configure a core specific to each search source (local or remote)
> >   - On cores that index remote content, implement a periodic delete query
> >   that deletes documents whose timestamp is too old
> >   - Implement a custom requestHandler for the "remote" cores that goes
> out
> >   and queries the remote source.   For each result in the top N
> >   (configurable), it computes an id that is stable (e.g. it is based on
> the
> >   remote resource URL, doi, or hash of data returned).   It uses that id
> to
> >   look-up the document in the lucene database.   If the data is not
> there, it
> >   updates the lucene core and sets a flag that commit is required.
> Once it
> >   is done, it commits if needed.
> >   - Configure a core that uses a custom SearchComponent to call the
> >   requestHandler that goes and gets new documents and commits them.
> Since
> >   the cores for remote content are different cores, they can restart
> their
> >   searcher at this point if any commit is needed.   The custom
> >   SearchComponent will wait for commit and reload to be completed.
> Then,
> >   search continues uses the other cores as "shards".
> >   - Auto-warming on this will assure that the most recently requested
> data
> >   is present.
> >
> > It will, of course, be very slow a good part of the time.
> >
> > Erik and others, I need to know whether this design has legs and what
> other
> > alternatives I might consider.
> >
> >
> >
> > On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson wrote:
> >
> >> The lack of global TF/IDF has been answered in the past,
> >> in the sharded case, by "usually you have similar enough
> >> stats that it doesn't matter". This pre-supposes a fairly
> >> evenly distributed set of documents.
> >>
> >> But if you're talking about federated search across different
> >> types of documents, then what would you "rescore" with?
> >> How would you even consider scoring docs that are somewhat/
> >> totally different? Think magazine articles and meta-data associated
> >> with pictures.
> >>
> >> What I've usually found is that one can use grouping to show
> >> the top N of a variety of results. Or show tabs with different
> >> types. Or have the app intelligently combine the different types
> >> of documents in a way that "makes sense". But I don't know
> >> how you'd just get "the right thing" to happen with some kind
> >> of scoring magic.

Re: More on topic of Meta-search/Federated Search with Solr

2013-08-26 Thread Dan Davis
One more question here - is this topic more appropriate to a different list?


On Mon, Aug 26, 2013 at 4:38 PM, Dan Davis  wrote:

> I have now come to the task of estimating man-days to add "Blended Search
> Results" to Apache Solr.   The argument has been made that this is not
> desirable (see Jonathan Rochkind's blog entries on Bento search with
> blacklight).   But the estimate remains.   No estimate is worth much
> without a design.   So I have come to the difficulty of estimating this
> without having an in-depth knowledge of the Solr core.   Here is my
> design, likely imperfect, as it stands.
>
>- Configure a core specific to each search source (local or remote)
>- On cores that index remote content, implement a periodic delete
>query that deletes documents whose timestamp is too old
>- Implement a custom requestHandler for the "remote" cores that goes
>out and queries the remote source.   For each result in the top N
>(configurable), it computes an id that is stable (e.g. it is based on the
>remote resource URL, doi, or hash of data returned).   It uses that id to
>look-up the document in the lucene database.   If the data is not there, it
>updates the lucene core and sets a flag that commit is required.   Once it
>is done, it commits if needed.
>- Configure a core that uses a custom SearchComponent to call the
>requestHandler that goes and gets new documents and commits them.   Since
>the cores for remote content are different cores, they can restart their
>searcher at this point if any commit is needed.   The custom
>SearchComponent will wait for commit and reload to be completed.   Then,
>search continues using the other cores as "shards".
>- Auto-warming on this will assure that the most recently requested
>data is present.
>
> It will, of course, be very slow a good part of the time.
>
> Erik and others, I need to know whether this design has legs and what
> other alternatives I might consider.
>
>
>
> On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson 
> wrote:
>
>> The lack of global TF/IDF has been answered in the past,
>> in the sharded case, by "usually you have similar enough
>> stats that it doesn't matter". This pre-supposes a fairly
>> evenly distributed set of documents.
>>
>> But if you're talking about federated search across different
>> types of documents, then what would you "rescore" with?
>> How would you even consider scoring docs that are somewhat/
>> totally different? Think magazine articles and meta-data associated
>> with pictures.
>>
>> What I've usually found is that one can use grouping to show
>> the top N of a variety of results. Or show tabs with different
>> types. Or have the app intelligently combine the different types
>> of documents in a way that "makes sense". But I don't know
>> how you'd just get "the right thing" to happen with some kind
>> of scoring magic.
>>
>> Best
>> Erick
>>
>>
>> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis  wrote:
>>
>>> I've thought about it, and I have no time to really do a meta-search
>>> during
>>> evaluation.  What I need to do is to create a single core that contains
>>> both of my data sets, and then describe the architecture that would be
>>> required to do blended results, with liberal estimates.
>>>
>>> From the perspective of evaluation, I need to understand whether any of
>>> the
>>> solutions to better ranking in the absence of global IDF have been
>>> explored?   I suspect that one could retrieve a much larger than N set
>>> of
>>> results from a set of shards, re-score in some way that doesn't require
>>> IDF, e.g. storing both results in the same priority queue and
>>> *re-scoring*
>>> before *re-ranking*.
>>>
>>> The other way to do this would be to have a custom SearchHandler that
>>> works
>>> differently - it performs the query, retrieves all results deemed relevant
>>> by
>>> another engine, adds them to the Lucene index, and then performs the
>>> query
>>> again in the standard way.   This would be quite slow, but perhaps useful
>>> as a way to evaluate my method.
>>>
>>> I still welcome any suggestions on how such a SearchHandler could be
>>> implemented.
>>>
>>
>>
>
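
For the periodic purge step in the design quoted above, Solr's delete-by-query can run directly against the remote-content core.  A minimal sketch, assuming a core named remote_core and a harvest timestamp field named harvest_date (both names are illustrative):

    curl 'http://localhost:8983/solr/remote_core/update?commit=true' \
      -H 'Content-Type: text/xml' \
      --data-binary '<delete><query>harvest_date:[* TO NOW-30DAYS]</query></delete>'

Run from cron, this keeps the harvested documents from growing without bound while leaving recently fetched results searchable.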


Re: Storing query results

2013-08-28 Thread Dan Davis
You could copy the existing core to a new core every once in a while, and
then do your delta indexing into the new core once the copy is complete.  If
a Persistent URL for the search results included the name of the original
core, the results you would get from a bookmark would be stable.  However,
if you went to the site and did a new search, you would be searching the
newest core.

This I think applies whether the site is "Intranet" or not.

Older cores could be aged out gracefully, and the search handler for an old
core could be replaced by a search on the new core via sharding.
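
A minimal sketch of the mechanics with the CoreAdmin API (the core names and instanceDir are made up for illustration, and the index copy is assumed to already exist on disk):

    # bring the snapshot copy online as its own core
    curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=catalog_20130828&instanceDir=catalog_20130828'

    # retire a core that has aged out; deleteIndex removes its data directory
    curl 'http://localhost:8983/solr/admin/cores?action=UNLOAD&core=catalog_20130701&deleteIndex=true'

Before unloading an old core, its search handler could be given a shards default pointing at a newer core so that bookmarked URLs keep resolving.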


On Fri, Aug 23, 2013 at 11:57 AM, jfeist  wrote:

> I completely agree.  I would prefer to just rerun the search each time.
> However, we are going to be replacing our rdb based search with something
> like Solr, and the application currently behaves this way.  Our users
> understand that the search is essentially a snapshot (and I would guess
> many
> prefer this over changing results) and we don't want to change existing
> behavior and confuse anyone.  Also, my boss told me it unequivocally has to
> be this way :p
>
> Thanks for your input though, looks like I'm going to have to do something
> like you've suggested within our application.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Storing-query-results-tp4086182p4086349.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to Manage RAM Usage at Heavy Indexing

2013-08-28 Thread Dan Davis
This could be an operating systems problem rather than a Solr problem.
CentOS 6.4 (Linux kernel 2.6.32) may have some issues with page flushing,
and I would read up on that.
The VM parameters can be tuned in /etc/sysctl.conf
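
For example, a first pass at tuning writeback behavior might look like the sketch below; the values are illustrative starting points, not recommendations, and should be tested under your own indexing load:

    # /etc/sysctl.conf -- page-cache writeback tuning (illustrative values)
    vm.dirty_background_ratio = 5      # start background flushing sooner
    vm.dirty_ratio = 15                # cap dirty pages before writers block
    vm.dirty_expire_centisecs = 1500   # pages dirty for more than 15s become flushable

    # apply without a reboot
    sysctl -p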


On Sun, Aug 25, 2013 at 4:23 PM, Furkan KAMACI wrote:

> Hi Erick;
>
> I wanted to get a quick answer that's why I asked my question as that way.
>
> Error is as follows:
>
> INFO  - 2013-08-21 22:01:30.978;
> org.apache.solr.update.processor.LogUpdateProcessor; [collection1]
> webapp=/solr path=/update params={wt=javabin&version=2}
> {add=[com.deviantart.reachmeh
> ere:http/gallery/, com.deviantart.reachstereo:http/,
> com.deviantart.reachstereo:http/art/SE-mods-313298903,
> com.deviantart.reachtheclouds:http/, com.deviantart.reachthegoddess:http/,
> co
> m.deviantart.reachthegoddess:http/art/retouched-160219962,
> com.deviantart.reachthegoddess:http/badges/,
> com.deviantart.reachthegoddess:http/favourites/,
> com.deviantart.reachthetop:http/
> art/Blue-Jean-Baby-82204657 (1444006227844530177),
> com.deviantart.reachurdreams:http/, ... (163 adds)]} 0 38790
> ERROR - 2013-08-21 22:01:30.979; org.apache.solr.common.SolrException;
> java.lang.RuntimeException: [was class org.eclipse.jetty.io.EofException]
> early EOF
> at
>
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> at
>
> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
> at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> at
>
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> at
>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at
>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1812)
> at
>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
> at
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
> at
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
> at
>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> at
>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
> at
>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at
>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
> at
>
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at
>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
> at
>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at
>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at
>
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at
>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:365)
> at
>
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
> at
>
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at
>
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
> at
>
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:948)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at
>
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at
>
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at
>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at
>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:722)
> Caused by: org.eclipse.jetty.io.EofException: early EOF
> at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65)
> at java.io.InputStream.read(InputStream.java:101)
> at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
> at com.ctc.wstx.io.MergedReader.read(MergedRea

Re: More on topic of Meta-search/Federated Search with Solr

2013-08-28 Thread Dan Davis
On Tue, Aug 27, 2013 at 2:03 AM, Paul Libbrecht  wrote:

> Dan,
>
> if you're bound to federated search then I would say that you need to work
> on the service guarantees of each of the nodes and, maybe, create
> strategies to cope with bad nodes.
>
> paul
>

+1

I'll think on that.


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-28 Thread Dan Davis
On Tue, Aug 27, 2013 at 3:33 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> Years ago when "Federated Search" was a buzzword we did some development
> and
> testing with Lucene, FAST Search, Google and several other Search Engines
> regarding Federated Search in a Library context.
> The results can be found here
> http://pub.uni-bielefeld.de/download/2516631/2516644
> Some minor parts are in German most is written in English.
> It also gives you an idea of what to keep an eye on, where the pitfalls are,
> and so on.
> We also had a tool called "unity" (written in Python) which did Federated
> Search on any Search Engine and
> Database, like Google, Gigablast, FAST, Lucene, ...
> The trick with Federated Search is to combine the results.
> We offered three options on the user's search surface:
> - RoundRobin
> - Relevancy
> - PseudoRandom
>


Thanks much - Andrzej B. suggested I read "Comparing top-k lists" in
addition to his Berlin Buzzwords presentation.

I will know soon whether we are committed to this direction; right now I'm
still trying to think about how hard it will be.


Re: More on topic of Meta-search/Federated Search with Solr

2013-08-28 Thread Dan Davis
On Mon, Aug 26, 2013 at 9:06 PM, Amit Jha  wrote:

> Would you like to create something like
> http://knimbus.com
>

I work at the National Library of Medicine.   We are moving our library
catalog to a newer platform, and we will probably include articles.   The
articles' content and meta-data are available from a number of web-scale
discovery services such as PRIMO, Summon, EBSCO's EDS, and EBSCO's "traditional
API".   Most libraries use open source solutions to avoid the cost of
purchasing an expensive enterprise search platform.   We are big; we
already have a closed-source enterprise search engine (and our own home
grown Entrez search used for PubMed).   Since we can already do Federated
Search with the above, I am evaluating the effort of adding such to Apache
Solr.   Because NLM data is used in the Open Relevance Project, we actually
have the relevancy judgments to decide whether we have done a good job of
it.

I obviously think it would be "Fun" to add Federated Search to Apache Solr.

*Standard disclosure* - my opinions do not represent the opinions of NIH
or NLM.   "Fun" is no reason to spend tax-payer money.   Enhancing Apache
Solr would reduce the risk of putting all our eggs in one basket, and
there may be some other relevant benefits.

We do use Apache Solr here for more than one other project... so keep up
the good work even if my working group decides to go with the closed-source
solution.


Excluding a facet's constraint to exclude a facet

2013-09-24 Thread Dan Davis
Summary - when constraining a search using filter query, how can I exclude
the constraint for a particular facet?

Detail - Suppose I have the following facet results for a query "q=mainquery":





<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="foo">
      <int name="A">491</int>
      <int name="...">111</int>
      <int name="...">103</int>
      ...
    </lst>
    ...
  </lst>
</lst>

I understand from
http://people.apache.org/~hossman/apachecon2010/facets/ and the Wiki
documentation that I can limit results to category "A" as follows:

fq={!raw f=foo}A

But I cannot seem to (Solr 3.6.1) exclude that way:

fq={!raw f=foo}-A

And the simpler test (with edismax) doesn't work either:

fq=foo:A    # works
fq=foo:-A   # doesn't work

Do I need to be using facet.method=enum to get this to work?   What else
could be the problem here?


Re: Nagle's Algorithm

2013-09-29 Thread Dan Davis
I don't keep up with this list well enough to know whether anyone else
answered.  I don't know how to do it in jetty.xml, but you can certainly
tweak the code.   java.net.Socket has a method setTcpNoDelay() that
corresponds to the standard Unix setsockopt() call.
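
In plain Java it is a one-liner per socket - here is a minimal sketch (the host and port are placeholders; where Jetty creates and accepts its sockets is the part I don't know):

    import java.net.Socket;

    public class NoDelayExample {
        public static void main(String[] args) throws Exception {
            // Disable Nagle's algorithm (set TCP_NODELAY) on a client socket;
            // the same call applies to sockets accepted on the server side.
            try (Socket sock = new Socket("localhost", 8983)) {
                sock.setTcpNoDelay(true);
                System.out.println("TCP_NODELAY = " + sock.getTcpNoDelay());
            }
        }
    }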

Some time back, suggesting this made Apache Axis 2.0 250 ms faster per
call (1).   Now I want to know whether Apache Solr sets it.

One common way to test the overhead portion of latency is to project the
latency for a zero-size request based on larger requests.   What you do is
to "warm" requests (all in memory) for progressively fewer and fewer
rows.   You can make requests for 100, 90, 80, 70 ... 10 rows, each more
than once so that everything is warmed.   If you plot this, it should look
like a linear function latency(rows) = m*rows + b, since everything is cached
in memory.   You have to control what else is going on on the server to get
the linear plot, of course - it can be quite hard to get this to work right
on modern Linux.   But once you have it, you can simply calculate
latency(0) = b, and you have the latency for a theoretical zero-size request.
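
As a toy illustration of the extrapolation step, assuming you have already collected warmed (rows, latency) measurements (the numbers below are made up):

    // Least-squares fit of latency(rows) = m*rows + b; the intercept b
    // estimates the fixed per-request overhead, i.e. the zero-row latency.
    public class ZeroRowLatency {
        public static void main(String[] args) {
            double[] rows    = {100, 90, 80, 70, 60, 50, 40, 30, 20, 10};
            double[] latency = { 52, 48, 45, 41, 37, 34, 30, 27, 23, 20}; // ms
            double n = rows.length, sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < rows.length; i++) {
                sx += rows[i]; sy += latency[i];
                sxx += rows[i] * rows[i]; sxy += rows[i] * latency[i];
            }
            double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            double b = (sy - m * sx) / n;
            System.out.printf("m = %.3f ms/row, b (zero-row latency) = %.1f ms%n", m, b);
        }
    }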

This is a tangential answer at best - I wish I just knew a setting to give
you.

(1) Latency Performance of SOAP
Implementations


On Sun, Sep 29, 2013 at 9:22 PM, William Bell  wrote:

> How do I set TCP_NODELAY on the http sockets for Jetty in SOLR 4?
>
> Is there an option in jetty.xml ?
>
> /* Create new stream socket */
>
> sock = socket( AF_INET, SOCK_STREAM, 0 );
>
>
>
> /* Disable the Nagle (TCP No Delay) algorithm */
>
> flag = 1;
>
> ret = setsockopt( sock, IPPROTO_TCP, TCP_NODELAY, (char *)&flag,
> sizeof(flag) );
>
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>