Solr 5.1 ignores SOLR_JAVA_MEM setting

2015-04-15 Thread Ere Maijala
Folks, just a quick heads-up that Solr 5.1 apparently introduced a 
change in bin/solr that overrides the SOLR_JAVA_MEM setting from solr.in.sh 
or the environment. I just filed 
https://issues.apache.org/jira/browse/SOLR-7392. The problem can be 
worked around by using the SOLR_HEAP setting, e.g. SOLR_HEAP="32G", but it's 
not mentioned in solr.in.sh by default.
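
A minimal sketch of the workaround in solr.in.sh (the heap size is just the
example value from above; size it for your own installation):

# solr.in.sh -- SOLR_HEAP sets both -Xms and -Xmx and, unlike
# SOLR_JAVA_MEM, is still honored by the 5.1 bin/solr script.
SOLR_HEAP="32G"

# The setting that bin/solr currently overrides, for comparison:
# SOLR_JAVA_MEM="-Xms32G -Xmx32G"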


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Securing solr index

2015-04-15 Thread Per Steffensen
That said, it might be nice to have a wiki page (or something) explaining 
how it can be done, including maybe concrete cases about exactly how it 
has been done on different installations around the world using Solr.


On 14/04/15 14:03, Per Steffensen wrote:

Hi

I might misunderstand you, but if you are talking about securing the 
actual files/folders of the index, I do not think this is a 
Solr/Lucene concern. Use the standard mechanisms of your OS. E.g. on 
linux/unix use chown, chgrp, chmod, sudo, AppArmor etc. - e.g. allow 
only root to write the folders/files and sudo the user running 
Solr/Lucene to operate as root in this area. Even admins should not 
(normally) operate as root - that way they cannot write the files 
either. No one knows the root password - except maybe the 
super-super-admin, or you split the root password in two and two 
admins know a part each, so that they both have to agree in order to 
operate as root. Be creative yourself.
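
A minimal sketch of that OS-level approach (the path and the user/group
names are illustrative, not from any particular installation):

# Assume Solr runs as the dedicated user 'solr' and the index lives here:
INDEX_DIR=/var/solr/data/collection1/index

# Only the solr user may read, write or traverse the index directory;
# other non-root users, including regular admins, get no access at all.
chown -R solr:solr "$INDEX_DIR"
chmod -R u=rwX,go= "$INDEX_DIR"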


Regards, Per Steffensen

On 13/04/15 12:13, Suresh Vanasekaran wrote:

Hi,

We have the Solr index maintained on a central server, and 
multiple users might be able to access the index data.


May I know the best practices for securing the Solr index folder, 
where ideally only the application user should be able to access it? Even an 
admin user should not be able to copy the data and use it in another 
schema.


Thanks



RE: sort by a copy field error

2015-04-15 Thread Pedro Figueiredo
Hello,

Yes, I restarted Solr and re-indexed after the change.
The request is:
http://localhost:8983/solr/patientsCollection/select?q=*%3A*&sort=name_sort+asc&wt=json&indent=true&_=1429082874881

I am using the Solr admin console, and in the Query pane I just set the
sort field to "name_sort asc".

thanks!

Pedro Figueiredo
Senior Engineer

pjlfigueir...@criticalsoftware.com
M. 934058150
 

Rua Engº Frederico Ulrich, nº 2650 4470-605 Moreira da Maia, Portugal
T. +351 229 446 927 | F. +351 229 446 929
www.criticalsoftware.com

PORTUGAL | UK | GERMANY | USA | BRAZIL | MOZAMBIQUE | ANGOLA
A CMMI® LEVEL 5 RATED COMPANY. CMMI® is registered in the USPTO by CMU.
 


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: 14 April 2015 19:44
To: solr-user@lucene.apache.org
Subject: Re: sort by a copy field error

On 4/14/2015 11:32 AM, Pedro Figueiredo wrote:
> And when I try to sort by "name_sort" the following error is raised: 
>
> "error": {
>
> "msg": "sort param field can't be found: name_sort",
>
> "code": 400
>
>   }

What was the exact sort parameter you sent to Solr?

Did you reload the core or restart Solr and then reindex after you changed your 
schema?  A reindex will be required.

http://wiki.apache.org/solr/HowToReindex

Thanks,
Shawn



RE: sort by a copy field error

2015-04-15 Thread Pedro Figueiredo
Hello,

http://localhost:8983/solr/patientsCollection/select?q=*%3A*&sort=name_sort+asc&wt=json&indent=true&_=1429082874881

I am using the Solr admin console, and in the Query pane I just set the
sort field to "name_sort asc".

Pedro Figueiredo
Senior Engineer



-Original Message-
From: Andrea Gazzarini [mailto:a.gazzar...@gmail.com] 
Sent: 14 April 2015 19:47
To: solr-user@lucene.apache.org
Subject: Re: sort by a copy field error

Hi Pedro
Please post the request that produces that error

Andrea
On 14 Apr 2015 19:33, "Pedro Figueiredo" 
wrote:

> Hello,
>
>
>
> I have a pretty basic question:  how can I sort by a copyfield?
>
>
>
> My schema conf is:
>
>
>
> <field name="name" ... stored="true" omitNorms="true" termVectors="true"/>
>
> <field name="name_sort" ... />
>
> <copyField source="name" dest="name_sort"/>
>
>
>
> And when I try to sort by "name_sort" the following error is raised:
>
> "error": {
>
> "msg": "sort param field can't be found: name_sort",
>
> "code": 400
>
>   }
>
>
>
> Thanks in advance,
>
>
>
> Pedro Figueiredo
>
>
>
>



Re: sort by a copy field error

2015-04-15 Thread Andrea Gazzarini
Really strange to me: the cause should be what Shawn already pointed 
out, because that error is raised when:


SchemaField sf = req.getSchema().getFieldOrNull(field);



is null:

if (null == sf) {
    ...
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "sort param field can't be found: " + field);
}

So it seems Solr doesn't find the "name_sort" field in the schema, as 
you changed that (the schema) without reloading / restarting.
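
For reference, a minimal sketch of a sortable copy-field setup (field and
type names here are illustrative, not taken from Pedro's actual schema):

<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="name_sort" type="string" indexed="true" stored="false"/>
<copyField source="name" dest="name_sort"/>

After a change like that, the core has to be reloaded (or Solr restarted)
and the documents reindexed before sorting on name_sort will work.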


Andrea

On 04/15/2015 09:30 AM, Pedro Figueiredo wrote:

Hello,

http://localhost:8983/solr/patientsCollection/select?q=*%3A*&sort=name_sort+asc&wt=json&indent=true&_=1429082874881

I am using the Solr admin console, and in the Query pane I just set the
sort field to "name_sort asc".

Pedro Figueiredo
Senior Engineer




-Original Message-
From: Andrea Gazzarini [mailto:a.gazzar...@gmail.com]
Sent: 14 April 2015 19:47
To: solr-user@lucene.apache.org
Subject: Re: sort by a copy field error

Hi Pedro
Please post the request that produces that error

Andrea
On 14 Apr 2015 19:33, "Pedro Figueiredo" 
wrote:


Hello,



I have a pretty basic question:  how can I sort by a copyfield?



My schema conf is:

<field name="name" ... stored="true" omitNorms="true" termVectors="true"/>

<field name="name_sort" ... />

<copyField source="name" dest="name_sort"/>
And when I try to sort by "name_sort" the following error is raised:

"error": {

 "msg": "sort param field can't be found: name_sort",

 "code": 400

   }



Thanks in advance,



Pedro Figueiredo








Validate document against schema

2015-04-15 Thread Artem Karpenko

Hi,

I am looking for a way to validate a document against the schema before it 
is inserted, i.e. to check whether adding the document would fail or not 
without actually performing the insert. Is there a way to do that? I'm doing 
the update from inside a Solr plugin, so there is access to the API if that 
matters.


Thanks in advance,
Artem.


RE: sort by a copy field error

2015-04-15 Thread Pedro Figueiredo
Ok... my bad...

My Solr installation is in cloud mode... so a basic Solr stop and start does 
not update the configuration, right?

I started Solr using: 
solr -c -Dbootstrap_confdir=C:\solr-5.0.0\server\solr\patientsCollection\conf 
-Dcollection.configName=myconf

and the error was solved.

Please advise whether this is the correct way to update the Solr configuration 
in cloud mode.

Thanks,

Pedro Figueiredo
Senior Engineer

pjlfigueir...@criticalsoftware.com
M. 934058150
 

Rua Engº Frederico Ulrich, nº 2650 4470-605 Moreira da Maia, Portugal
T. +351 229 446 927 | F. +351 229 446 929
www.criticalsoftware.com

PORTUGAL | UK | GERMANY | USA | BRAZIL | MOZAMBIQUE | ANGOLA
A CMMI® LEVEL 5 RATED COMPANY CMMI® is registered in the USPTO by CMU"
 


-Original Message-
From: Andrea Gazzarini [mailto:a.gazzar...@gmail.com] 
Sent: 15 April 2015 08:41
To: solr-user@lucene.apache.org
Subject: Re: sort by a copy field error

Really strange to me: the cause should be what Shawn already pointed out, 
because that error is raised when:

SchemaField sf = req.getSchema().getFieldOrNull(field);



is null:

if (null == sf) {
    ...
    throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
        "sort param field can't be found: " + field);
}

So it seems Solr doesn't find the "name_sort" field in the schema, as you 
changed that (the schema) without reloading / restarting.

Andrea

On 04/15/2015 09:30 AM, Pedro Figueiredo wrote:
> Hello,
>
> http://localhost:8983/solr/patientsCollection/select?q=*%3A*&sort=name_sort+asc&wt=json&indent=true&_=1429082874881
>
> I am using the Solr admin console, and in the Query pane I just set the
> sort field to "name_sort asc".
>
> Pedro Figueiredo
> Senior Engineer
>
>
>
> -Original Message-
> From: Andrea Gazzarini [mailto:a.gazzar...@gmail.com]
> Sent: 14 April 2015 19:47
> To: solr-user@lucene.apache.org
> Subject: Re: sort by a copy field error
>
> Hi Pedro
> Please post the request that produces that error
>
> Andrea
> On 14 Apr 2015 19:33, "Pedro Figueiredo" 
> 
> wrote:
>
>> Hello,
>>
>>
>>
>> I have a pretty basic question:  how can I sort by a copyfield?
>>
>>
>>
>> My schema conf is:
>>
>>
>>
>> <field name="name" ... stored="true" omitNorms="true" termVectors="true"/>
>>
>> <field name="name_sort" ... />
>>
>> <copyField source="name" dest="name_sort"/>
>>
>>
>>
>> And when I try to sort by "name_sort" the following error is raised:
>>
>> "error": {
>>
>>  "msg": "sort param field can't be found: name_sort",
>>
>>  "code": 400
>>
>>}
>>
>>
>>
>> Thanks in advance,
>>
>>
>>
>> Pedro Figueiredo
>>
>>
>>
>>



How to get the query content in DefaultSimilarity class?

2015-04-15 Thread Xi Shen
Hi,

I want to implement a custom TF-IDF similarity scoring function. I read the
code for org.apache.lucene.search.similarities.DefaultSimilarity, but I could
not find a way to get the "query" that the user provided.

In my case, I want to allow the user to upload some binary content to
my search server. The binary content can be transformed into a "text"-like
document for the inverted index, but I still need the whole "query document" to
compute the similarity score.

Any suggestions?

Thanks,

--
Xi Shen
about.me/davidshen


Re: Indexing PDF and MS Office files

2015-04-15 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's AutoParser
and the PDF functionality is working fine. However, the error with some MS
Office Word documents still persists.

The error message is "java.lang.IllegalArgumentException: This paragraph is
not the first one in the table", which eventually results in "Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser".

Upon some reading, it looks like it's a bug in Tika 1.5 that seems to have
been fixed in Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ).
I am new to Solr / Tika and hence wondering whether I can upgrade the Tika
library alone to v1.6 without impacting any of the other libraries within Solr
4.10.2? Please let me know how to get around this issue.

Many thanks in advance.

Thanks & Regards
Vijay


On 15 April 2015 at 05:14, Shyam R  wrote:

> Vijay,
>
> You could try different Excel files with different formats to rule out
> whether the issue is with the Tika version being used.
>
> Thanks
> Murthy
>
> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes 
> wrote:
>
> > Perhaps the PDF is protected and the content can not be extracted?
> >
> > I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may
> > not support some/all Office 2013 document formats.
> >
> >
> >
> >
> >
> > On 4/14/2015 8:18 PM, Jack Krupansky wrote:
> >
> >> Try doing a manual extraction request directly to Solr (not via SolrJ)
> and
> >> use the extractOnly option to see if the content is actually extracted.
> >>
> >> See:
> >> https://cwiki.apache.org/confluence/display/solr/
> >> Uploading+Data+with+Solr+Cell+using+Apache+Tika
> >>
> >> Also, some PDF files actually have the content as a bitmap image, so no
> >> text is extracted.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
> >> vijaya.bhoomire...@whishworks.com> wrote:
> >>
> >>  Hi,
> >>>
> >>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> >>> .pptx, .xls, and .xlsx) into Solr. I am facing the following
> >>> issues.
> >>> Request to please let me know what is going wrong with the indexing
> >>> process.
> >>>
> >>> I am using solr 4.10.2 and using the default example server
> configuration
> >>> that comes with Solr distribution.
> >>>
> >>> PDF Files - Indexing as such works fine, but when I query using *.* in
> >>> the
> >>> Solr Query console, metadata information is displayed properly.
> However,
> >>> the PDF content field is empty. This is happening for all PDF files I
> >>> have
> >>> tried. I have tried with some proprietary files, PDF eBooks etc.
> Whatever
> >>> be the PDF file, content is not being displayed.
> >>>
> >>> MS Office files -  For some office files, everything works perfect and
> >>> the
> >>> extracted content is visible in the query console. However, for
> others, I
> >>> see the below error message during the indexing process.
> >>>
> >>> *Exception in thread "main"
> >>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
> >>> from
> >>> org.apache.tika.parser.microsoft.OfficeParser*
> >>>
> >>>
> >>> I am using SolrJ to index the documents and below is the code snippet
> >>> related to indexing. Please let me know where the issue is occurring.
> >>>
> >>> static String solrServerURL = "http://localhost:8983/solr";
> >>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> >>> static ContentStreamUpdateRequest indexingReq =
> >>>     new ContentStreamUpdateRequest("/update/extract");
> >>>
> >>> indexingReq.addFile(file, fileType);
> >>> indexingReq.setParam("literal.id", literalId);
> >>> indexingReq.setParam("uprefix", "attr_");
> >>> indexingReq.setParam("fmap.content", "content");
> >>> indexingReq.setParam("literal.fileurl", fileURL);
> >>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> >>> solrServer.request(indexingReq);
> >>>
> >>> Thanks & Regards
> >>> Vijay
> >>>
> >>>
> >>>
> >
>
>
> --
> Ph: 9845704792
>


Change scoring method

2015-04-15 Thread מאיה גלעד
Hello,

I'm using Solr and indexing vectors into fields.
In some cases, when I search for documents, I want to do it based on the
fields containing the vector.
I have an algorithm which assigns a score to each document.
The document's score is based on vector multiplication with an input
vector given in the URL.

What is the best way to implement this search?

Thank you,
Maya


MoreLikeThis (mlt) in sharded SolrCloud

2015-04-15 Thread Ere Maijala

Hi,

I'm trying to gather information on how mlt works, or is supposed to work, 
with SolrCloud and a sharded collection. I've read issues SOLR-6248, 
SOLR-5480 and SOLR-4414, and the docs, but I'm still struggling 
with multiple issues. I've been testing with Solr 5.1 and the "Getting 
Started" sample cloud. So, with a freshly extracted Solr, these are the 
steps I've done:


bin/solr start -e cloud -noprompt
bin/post -c gettingstarted docs/
bin/post -c gettingstarted example/exampledocs/books.json

After this I've tried different variations of queries, with limited success:

- one variation causes java.lang.NullPointerException at
org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:80)

- another causes java.lang.NullPointerException at
org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:84)

- a third causes java.lang.NullPointerException at
org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)

- one variation actually gives results

- yet another again causes java.lang.NullPointerException at
org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)
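
The variations were all of the shape the new mlt query parser expects (a
sketch, not the exact URLs; substitute the uniqueKey of an indexed document
and real field names):

http://localhost:8983/solr/gettingstarted/select?q={!mlt qf=title mintf=1 mindf=1}<doc-id>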



I guess the actual question is, how am I supposed to use the handler to 
replicate behavior of non-distributed mlt that was formerly used with 
qt=morelikethis and the following configuration in solrconfig.xml:


  

  name="mlt.fl">title,title_short,callnumber-label,topic,language,author,publishDate

  
title^75
title_short^100
callnumber-label^400
topic^300
language^30
author^75
publishDate
  
  1
  1
  true
  5
  5

  

Real-life full schema and config can be found in the VuFind project.


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: sort by a copy field error

2015-04-15 Thread Shawn Heisey
On 4/15/2015 2:02 AM, Pedro Figueiredo wrote:
> My solr installation is in cloud mode... so the basic solr stop and start 
> does not update the configuration right?
> 
> I started solr using: 
> solr -c -Dbootstrap_confdir=C:\solr-5.0.0\server\solr\patientsCollection\conf 
> -Dcollection.configName=myconf
> 
> and the error was solved.

I really dislike the bootstrap options.  They are designed to be run
exactly once -- when you are first converting a non-cloud config to
cloud.  If they are used beyond that, they will likely cause confusion.

When you're running SolrCloud, editing the schema on the disk does
nothing.  You must upload the changed config to zookeeper, where
SolrCloud actually looks for config data.  That's what the startup
options you added do: they upload that specific config to zookeeper
under that specific name, and they do it every time you start Solr
with those options.

The zkcli script, specifically its "upconfig" command, should be used to
do that instead.  Then you can use the Collections API to reload the
collection, and follow that up with a reindex.

https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities
https://cwiki.apache.org/confluence/display/solr/Collections+API
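
A sketch of that workflow (the zkhost, paths and names are illustrative;
adjust them to your installation, and use zkcli.bat on Windows):

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
  -cmd upconfig -confdir /path/to/patientsCollection/conf -confname myconf

curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=patientsCollection"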

Thanks,
Shawn



SolrCloud 4.8 - solrconfig.xml hot changes

2015-04-15 Thread Vincenzo D'Amore
Hi all,

can I change solrconfig.xml configuration when solrcloud is up and running?

Best regards,
Vincenzo


-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


SolrCluoud servers sync/replica

2015-04-15 Thread Vincenzo D'Amore
Hi all,

I have a SolrCloud cluster with 3 servers. Is it possible to have a new
standalone Solr server always in sync with the cluster? I would like
something like a replica or a slave.
As far as I have seen, it is not possible to do this with SolrCloud, so I
have written a batch program that syncs the new standalone server every
minute.

Best regards,
Vincenzo

-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


ContentTypes supported by Solr to index

2015-04-15 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi,

I am trying to index various binary file types into Solr. However, some
file types seem to be ignored and not indexed, even though the metadata
is extracted successfully for all of them.

Specifically, zip files and jpg files are not getting indexed, whereas
pdf and MS Office documents are. Hence I am wondering whether there
is a defined list of indexable file types.

Moreover, I am wondering why Solr could not index the jpg and zip
documents when it was able to extract the metadata from those files.

The code snippet is as below:

contentStreamUpdateReq.addFile(file, fileType);
contentStreamUpdateReq.setParam("literal.id", literalId);
contentStreamUpdateReq.setParam("uprefix", "attr_");
contentStreamUpdateReq.setParam("fmap.content", "content");
contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true,
true);
solrServer.request(contentStreamUpdateReq);

Thanks & Regards
Vijay



Re: Indexing PDF and MS Office files

2015-04-15 Thread Erick Erickson
There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137

But I personally am not a huge fan of pushing all the work onto Solr: in a
production environment the Solr server is then responsible for indexing, parsing the
docs through Tika, perhaps searching, etc. This doesn't scale all that well.

So an alternative is to use SolrJ with Tika, which is totally independent of
what version of Tika is on the Solr server. Here's an example.

http://lucidworks.com/blog/indexing-with-solrj/
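
A minimal sketch of that approach (client-side Tika, plain documents sent to
Solr; the field names and the use of the file path as id are illustrative):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    AutoDetectParser parser = new AutoDetectParser(); // Tika runs client-side
    for (String path : args) {
      BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
      Metadata metadata = new Metadata();
      try (InputStream in = new FileInputStream(path)) {
        parser.parse(in, handler, metadata); // extraction happens here, not in Solr
      }
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", path);                    // illustrative id choice
      doc.addField("content", handler.toString()); // extracted text
      server.add(doc);
    }
    server.commit();
  }
}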

Best,
Erick

On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
 wrote:
> Thanks everyone for the responses. Now I am able to index PDF documents
> successfully. I have implemented manual extraction using Tika's AutoParser
> and the PDF functionality is working fine. However, the error with some MS
> Office Word documents still persists.
>
> The error message is "java.lang.IllegalArgumentException: This paragraph is
> not the first one in the table", which eventually results in "Unexpected
> RuntimeException from org.apache.tika.parser.microsoft.OfficeParser".
>
> Upon some reading, it looks like it's a bug in Tika 1.5 that seems to have
> been fixed in Tika 1.6 ( https://issues.apache.org/jira/browse/TIKA-1251 ).
> I am new to Solr / Tika and hence wondering whether I can upgrade the Tika
> library alone to v1.6 without impacting any of the other libraries within Solr
> 4.10.2? Please let me know how to get around this issue.
>
> Many thanks in advance.
>
> Thanks & Regards
> Vijay
>
>
> On 15 April 2015 at 05:14, Shyam R  wrote:
>
>> Vijay,
>>
>> You could try different Excel files with different formats to rule out
>> whether the issue is with the Tika version being used.
>>
>> Thanks
>> Murthy
>>
>> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes 
>> wrote:
>>
>> > Perhaps the PDF is protected and the content can not be extracted?
>> >
>> > I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may
>> > not support some/all Office 2013 document formats.
>> >
>> >
>> >
>> >
>> >
>> > On 4/14/2015 8:18 PM, Jack Krupansky wrote:
>> >
>> >> Try doing a manual extraction request directly to Solr (not via SolrJ)
>> and
>> >> use the extractOnly option to see if the content is actually extracted.
>> >>
>> >> See:
>> >> https://cwiki.apache.org/confluence/display/solr/
>> >> Uploading+Data+with+Solr+Cell+using+Apache+Tika
>> >>
>> >> Also, some PDF files actually have the content as a bitmap image, so no
>> >> text is extracted.
>> >>
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
>> >> vijaya.bhoomire...@whishworks.com> wrote:
>> >>
>> >>  Hi,
>> >>>
>> >>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>> >>> .pptx, .xls, and .xlsx) into Solr. I am facing the following
>> >>> issues.
>> >>> Request to please let me know what is going wrong with the indexing
>> >>> process.
>> >>>
>> >>> I am using solr 4.10.2 and using the default example server
>> configuration
>> >>> that comes with Solr distribution.
>> >>>
>> >>> PDF Files - Indexing as such works fine, but when I query using *.* in
>> >>> the
>> >>> Solr Query console, metadata information is displayed properly.
>> However,
>> >>> the PDF content field is empty. This is happening for all PDF files I
>> >>> have
>> >>> tried. I have tried with some proprietary files, PDF eBooks etc.
>> Whatever
>> >>> be the PDF file, content is not being displayed.
>> >>>
>> >>> MS Office files -  For some office files, everything works perfect and
>> >>> the
>> >>> extracted content is visible in the query console. However, for
>> others, I
>> >>> see the below error message during the indexing process.
>> >>>
>> >>> *Exception in thread "main"
>> >>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
>> >>> from
>> >>> org.apache.tika.parser.microsoft.OfficeParser*
>> >>>
>> >>>
>> >>> I am using SolrJ to index the documents and below is the code snippet
>> >>> related to indexing. Please let me know where the issue is occurring.
>> >>>
>> >>> static String solrServerURL = "http://localhost:8983/solr";
>> >>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>> >>> static ContentStreamUpdateRequest indexingReq =
>> >>>     new ContentStreamUpdateRequest("/update/extract");
>> >>>
>> >>> indexingReq.addFile(file, fileType);
>> >>> indexingReq.setParam("literal.id", literalId);
>> >>> indexingReq.setParam("uprefix", "attr_");
>> >>> indexingReq.setParam("fmap.content", "content");
>> >>> indexingReq.setParam("literal.fileurl", fileURL);
>> >>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>> >>> solrServer.request(indexingReq);
>> >>>
>> >>> Thanks & Regards
>> >>> Vijay
>> >>>

Re: ContentTypes supported by Solr to index

2015-04-15 Thread Andrea Gazzarini

Hi Vijay,
here you can find all supported formats by Tika, which is internally 
used by SolrCell:


 * https://tika.apache.org/1.4/formats.html
 * https://tika.apache.org/1.5/formats.html
 * https://tika.apache.org/1.6/formats.html
 * https://tika.apache.org/1.7/formats.html

Best,
Andrea



On 04/15/2015 04:20 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Hi,

I am trying to index various binary file types into Solr. However, some
file types seem to be ignored and not indexed, even though the metadata
is extracted successfully for all of them.

Specifically, zip files and jpg files are not getting indexed, whereas
pdf and MS Office documents are. Hence I am wondering whether there
is a defined list of indexable file types.

Moreover, I am wondering why Solr could not index the jpg and zip
documents when it was able to extract the metadata from those files.

The code snippet is as below:

contentStreamUpdateReq.addFile(file, fileType);
contentStreamUpdateReq.setParam("literal.id", literalId);
contentStreamUpdateReq.setParam("uprefix", "attr_");
contentStreamUpdateReq.setParam("fmap.content", "content");
contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true,
true);
solrServer.request(contentStreamUpdateReq);

Thanks & Regards
Vijay





file index format

2015-04-15 Thread Shlomit Afgin
Hi,

I just installed Solr and am trying it out.
The indexer ignores text files with extensions like php and py. Is there any way
to add these types so Solr will index them?

Thanks.


Re: SolrCloud 4.8 - solrconfig.xml hot changes

2015-04-15 Thread Erick Erickson
Yes, but you must then push the changes up to Zookeeper (usually via
zkcli -cmd upconfig ) then reload the collection to get the
changes to take effect on all the replicas.

Best,
Erick

On Wed, Apr 15, 2015 at 6:12 AM, Vincenzo D'Amore  wrote:
> Hi all,
>
> can I change solrconfig.xml configuration when solrcloud is up and running?
>
> Best regards,
> Vincenzo
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251


Re: file index format

2015-04-15 Thread Erick Erickson
Solr uses Tika to try to process semi-structured documents. You can
see all the supported document types here:

https://tika.apache.org/1.4/formats.html

I assume you're using the Extracting Request Handler to do this?

Best,
Erick

On Wed, Apr 15, 2015 at 7:31 AM, Shlomit Afgin
 wrote:
> Hi,
>
> I just installed Solr and am trying it out.
> The indexer ignores text files with extensions like php and py. Is there any way
> to add these types so Solr will index them?
>
> Thanks.


Re: Change scoring method

2015-04-15 Thread Doug Turnbull
When customizing scoring beyond what's available in the Query API, there
are a couple of layers you can work in:

1. Create a Solr query parser -- not too hard, just requires very light
Java/Lucene skills. This involves taking a query string and query params
from Solr and digesting them into Lucene queries. You need to work with the
existing library of Lucene queries.
2. Custom Score queries -- you want to use a built-in Lucene query to
decide what documents should *match*, but you want to customize the *score*
generated (see the sketch after this list)
3. Custom similarity -- you want to control how the search engine computes
factors like norms, term frequency, etc that get used in the normal scoring
process
4. Custom query -- I call this the "nuclear option" -- you want to
customize matching and scoring: how the search engine decides which
documents to return AND assigns them scores
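
A minimal sketch of option 2 against the Lucene 4.x API (the class name and
the vector-lookup comment are illustrative placeholders, not working scoring
code):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

// Wraps any Lucene query: matching is unchanged, scoring is replaced.
public class VectorScoreQuery extends CustomScoreQuery {

  public VectorScoreQuery(Query subQuery) {
    super(subQuery);
  }

  @Override
  protected CustomScoreProvider getCustomScoreProvider(final AtomicReaderContext context) {
    return new CustomScoreProvider(context) {
      @Override
      public float customScore(int doc, float subQueryScore, float valSrcScore)
          throws IOException {
        // Look up this document's stored vector via context.reader() and
        // combine it with the input vector (e.g. a dot product) here.
        return subQueryScore; // placeholder: keeps the wrapped query's score
      }
    };
  }
}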

This has come up time-to-time in my work. So I've blogged/spoken about it

http://opensourceconnections.com/blog/2014/01/20/build-your-own-custom-lucene-query-and-scorer/
https://www.youtube.com/watch?v=UotgfwNpqrs

Hope that helps,
-- 
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Taming Search  from Manning
Publications

On Wed, Apr 15, 2015 at 7:49 AM, מאיה גלעד  wrote:

> Hello,
>
> I'm using solr and indexing vectors into fields.
> In some cases when I search for documents I want to do it based on the
> fields containing the vector.
> I have an algorithm which defines a score to the document.
> The document's score is based on vector multiplication and on an input
> vector given in the url.
>
> What is the best way in order to implement the search?
>
> Thank you,
> Maya
>


Using synonyms API

2015-04-15 Thread Mike Thomsen
We recently upgraded from 4.5.0 to 4.10.4. I tried getting a list of our
synonyms like this:

http://localhost/solr/default-collection/schema/analysis/synonyms/english

I got a not found error. I found this page on new features in 4.8

http://yonik.com/solr-4-8-features/

Do we have to do something like this with our schema to even get the
synonyms API working?

<fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english"/>
    <filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
  </analyzer>
</fieldType>

I wanted to ask before changing our schema.
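
For reference, once the managed resource exists, entries can also be added
over HTTP -- a sketch, assuming the collection and synonym-set names from the
URL above plus the default 8983 port:

curl -X PUT -H 'Content-type:application/json' \
  --data-binary '{"mad":["angry","upset"]}' \
  'http://localhost:8983/solr/default-collection/schema/analysis/synonyms/english'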

Thanks,

Mike


Re: ContentTypes supported by Solr to index

2015-04-15 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks Andrea. I can see that Tika 1.5 supports both compressed (ZIP) and
image (JPG) formats. If that's the case, why could SolrCell not index the
.zip and .jpg documents? Am I missing something here? No error is
thrown in the overall process and the Java program completes successfully.
But when I query the Solr UI, only 8 files are indexed.

Attached is a simple screenshot of the file types I am trying to index.

Thanks & Regards
Vijay

On 15 April 2015 at 15:27, Andrea Gazzarini  wrote:

> Hi Vijay,
> here you can find all supported formats by Tika, which is internally used
> by SolrCell:
>
>  * https://tika.apache.org/1.4/formats.html
>  * https://tika.apache.org/1.5/formats.html
>  * https://tika.apache.org/1.6/formats.html
>  * https://tika.apache.org/1.7/formats.html
>
> Best,
> Andrea
>
>
>
>
> On 04/15/2015 04:20 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
>
>> Hi,
>>
>> I am trying to index various binary file types into Solr. However, some
>> file types seems to be ignored and not getting indexed, though the
>> metadata
>> is being extracted successfully for all the types.
>>
>> Specifically, zip files and jpg files are not getting indexed, where as
>> pdf, MS office documents are getting indexed. Hence wondering whether
>> there
>> is a defined list of indexable file types.
>>
>> Moreover, I am just wondering why Solr could not index the jpg and zip
>> documents when it was able to extract the metadata from those files?
>>
>> The code snippet is as below:
>>
>> contentStreamUpdateReq.addFile(file, fileType);
>> contentStreamUpdateReq.setParam("literal.id", literalId);
>> contentStreamUpdateReq.setParam("uprefix", "attr_");
>> contentStreamUpdateReq.setParam("fmap.content", "content");
>> contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>> solrServer.request(contentStreamUpdateReq);
>>
>> Thanks & Regards
>> Vijay
>>
>>
>



Re: ContentTypes supported by Solr to index

2015-04-15 Thread Andrea Gazzarini

Sorry, attachments are not supported here :(

Anyway, I believe the misunderstanding resides in what you take "image 
indexing" to mean: actually, AFAIK, Tika extracts only a) the 
textual content of a given resource and b) its metadata.

So:

- for a JPG file (or in general, an image) you will get only its metadata
- for a compressed archive, the Commons Compress API will decompress the 
archive and, once that is done, each file within the archive will be 
associated with a proper parser. So here it actually depends on the file 
types you have in your archive.

Is that close to what you were thinking?

Best,
Andrea

On 04/15/2015 05:16 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
Thanks Andrea. I can see that Tika 1.5 supports both compressed (ZIP) 
and image (JPG) formats. If that's the case, why could SolrCell not 
index the .zip and .jpg documents? Am I missing something here? No 
error is thrown in the overall process and the Java program completes 
successfully. But when I query the Solr UI, only 8 files are indexed.


Attached is a simple screenshot of the files types I am trying to index.

Thanks & Regards
Vijay

On 15 April 2015 at 15:27, Andrea Gazzarini wrote:


Hi Vijay,
here you can find all supported formats by Tika, which is
internally used by SolrCell:

 * https://tika.apache.org/1.4/formats.html
 * https://tika.apache.org/1.5/formats.html
 * https://tika.apache.org/1.6/formats.html
 * https://tika.apache.org/1.7/formats.html

Best,
Andrea




On 04/15/2015 04:20 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Hi,

I am trying to index various binary file types into Solr.
However, some
file types seems to be ignored and not getting indexed, though
the metadata
is being extracted successfuly for all the types.

Specifically, zip files and jpg files are not getting indexed,
where as
pdf, MS office documents are getting indexed. Hence wondering
whether there
is a defined list of indexable file types.

Moreover, I am just wondering why Solr could not index the jpg
and zip
documents when it was able to extract the metadata from those
files?

The code snippet is as below:

contentStreamUpdateReq.addFile(file, fileType);
contentStreamUpdateReq.setParam("literal.id", literalId);
contentStreamUpdateReq.setParam("uprefix", "attr_");
contentStreamUpdateReq.setParam("fmap.content", "content");
contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(contentStreamUpdateReq);

Thanks & Regards
Vijay








Re: ContentTypes supported by Solr to index

2015-04-15 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks Andrea. For image files and zip files, even the metadata is not
available. Just to explain further: I indexed a total of 10 files, among
which were a .jpg file and a .zip file.

After the indexing process completed, no information about either of
these files is present in the Solr query UI when I give *:* as the query.
Not even metadata is displayed. In fact, in the response, *numFound*
shows only 8 documents, which are the ones other than the zip and jpg
files.

Thanks & Regards
Vijay


On 15 April 2015 at 16:29, Andrea Gazzarini  wrote:

> Sorry, attachments are not supported here :(
>
> Anyway, I believe the misunderstanding resides in what you think you
> should mean "image indexing": actually, AFAIK, Tika indexes only a) the
> textual content of a given resource b) its metadata.
> So
>
> - for a JPG file (or in general, an image) you will get only its metadata
> - for a compressed archive, the Commons Compress API will decompress the
> archive and, once that is done, each file within the archive will be associated
> with a proper parser. So here it actually depends on the file types you
> have in your archive.
>
> Best,
> Andrea
>
>
>
> Is that close to what you were thinking?
>
> On 04/15/2015 05:16 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
>
>> Thanks Andrea. I can see that Tika1.5 supports both compressed (ZIP) and
>> image (JPG) formats. If thats the case, why SolrCell could not index the
>> documents of .zip and .jpg? Am I missing something here?  No error is
>> thrown in the overall process and the java program completes successfully.
>> But when I query the Solr UI, only 8 files are indexed.
>>
>> Attached is a simple screenshot of the files types I am trying to index.
>>
>> Thanks & Regards
>> Vijay
>>
>> On 15 April 2015 at 15:27, Andrea Gazzarini wrote:
>>
>> Hi Vijay,
>> here you can find all supported formats by Tika, which is
>> internally used by SolrCell:
>>
>>  * https://tika.apache.org/1.4/formats.html
>>  * https://tika.apache.org/1.5/formats.html
>>  * https://tika.apache.org/1.6/formats.html
>>  * https://tika.apache.org/1.7/formats.html
>>
>> Best,
>> Andrea
>>
>>
>>
>>
>> On 04/15/2015 04:20 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
>>
>> Hi,
>>
>> I am trying to index various binary file types into Solr.
>> However, some
>> file types seems to be ignored and not getting indexed, though
>> the metadata
>> is being extracted successfully for all the types.
>>
>> Specifically, zip files and jpg files are not getting indexed,
>> where as
>> pdf, MS office documents are getting indexed. Hence wondering
>> whether there
>> is a defined list of indexable file types.
>>
>> Moreover, I am just wondering why Solr could not index the jpg
>> and zip
>> documents when it was able to extract the metadata from those
>> files?
>>
>> The code snippet is as below:
>>
>> contentStreamUpdateReq.addFile(file, fileType);
>> contentStreamUpdateReq.setParam("literal.id", literalId);
>> contentStreamUpdateReq.setParam("uprefix", "attr_");
>> contentStreamUpdateReq.setParam("fmap.content", "content");
>> contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>> solrServer.request(contentStreamUpdateReq);
>>
>> Thanks & Regards
>> Vijay
>>
>>
>>
>>
>>
>
>



Re: ContentTypes supported by Solr to index

2015-04-15 Thread Jack Krupansky
Check to see if there are any errors in the Solr log for the jpg and zip files.
Solr should do something with them - if not, file a Jira to suggest that it
should, as an improvement. Zip should give a list of the enclosed files.
Images should at least give the metadata.
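
A quick way to check what Tika extracts without indexing anything (a sketch;
adjust the core name and file path):

curl 'http://localhost:8983/solr/collection1/update/extract?extractOnly=true&wt=json' \
  -F 'myfile=@test.zip'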

-- Jack Krupansky

On Wed, Apr 15, 2015 at 11:45 AM, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Thanks Andrea. For image files and zip files, even the metadata is not
> available. Just to explain further: I indexed a total of 10 files, among
> which were a .jpg file and a .zip file.
>
> After the indexing process completed, no information about either of
> these files is present in the Solr query UI when I give *:* as the query.
> Not even metadata is displayed. In fact, in the response, *numFound*
> shows only 8 documents, which are the ones other than the zip and jpg
> files.
>
> Thanks & Regards
> Vijay
>
>
> On 15 April 2015 at 16:29, Andrea Gazzarini  wrote:
>
> > Sorry, attachments are not supported here :(
> >
> > Anyway, I believe the misunderstanding resides in what you take "image
> > indexing" to mean: actually, AFAIK, Tika extracts only a) the
> > textual content of a given resource and b) its metadata.
> > So:
> >
> > - for a JPG file (or in general, an image) you will get only its metadata
> > - for a compressed archive, the Commons Compress API will decompress the
> > archive and, once that is done, each file within the archive will be
> > associated with a proper parser. So here it actually depends on the file
> > types you have in your archive.
> >
> > Best,
> > Andrea
> >
> >
> >
> > Is that close to what you were thinking?
> >
> > On 04/15/2015 05:16 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
> >
> >> Thanks Andrea. I can see that Tika1.5 supports both compressed (ZIP) and
> >> image (JPG) formats. If thats the case, why SolrCell could not index the
> >> documents of .zip and .jpg? Am I missing something here?  No error is
> >> thrown in the overall process and the java program completes
> successfully.
> >> But when I query the Solr UI, only 8 files are indexed.
> >>
> >> Attached is a simple screenshot of the files types I am trying to index.
> >>
> >> Thanks & Regards
> >> Vijay
> >>
> >> On 15 April 2015 at 15:27, Andrea Gazzarini wrote:
> >>
> >> Hi Vijay,
> >> here you can find all supported formats by Tika, which is
> >> internally used by SolrCell:
> >>
> >>  * https://tika.apache.org/1.4/formats.html
> >>  * https://tika.apache.org/1.5/formats.html
> >>  * https://tika.apache.org/1.6/formats.html
> >>  * https://tika.apache.org/1.7/formats.html
> >>
> >> Best,
> >> Andrea
> >>
> >>
> >>
> >>
> >> On 04/15/2015 04:20 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
> >>
> >> Hi,
> >>
> >> I am trying to index various binary file types into Solr.
> >> However, some
> >> file types seems to be ignored and not getting indexed, though
> >> the metadata
> >> is being extracted successfully for all the types.
> >>
> >> Specifically, zip files and jpg files are not getting indexed,
> >> where as
> >> pdf, MS office documents are getting indexed. Hence wondering
> >> whether there
> >> is a defined list of indexable file types.
> >>
> >> Moreover, I am just wondering why Solr could not index the jpg
> >> and zip
> >> documents when it was able to extract the metadata from those
> >> files?
> >>
> >> The code snippet is as below:
> >>
> >> contentStreamUpdateReq.addFile(file, fileType);
> >> contentStreamUpdateReq.setParam("literal.id", literalId);
> >> contentStreamUpdateReq.setParam("uprefix", "attr_");
> >> contentStreamUpdateReq.setParam("fmap.content", "content");
> >> contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> >> solrServer.request(contentStreamUpdateReq);
> >>
> >> Thanks & Regards
> >> Vijay
> >>
> >>
> >>
> >>
> >>
> >
> >
>

Re: SolrCloud 4.8 - solrconfig.xml hot changes

2015-04-15 Thread Vincenzo D'Amore
Thanks, it works :)

On Wed, Apr 15, 2015 at 4:38 PM, Erick Erickson 
wrote:

> Yes, but you must then push the changes up to Zookeeper (usually via
> zkcli -cmd upconfig ) then reload the collection to get the
> changes to take effect on all the replicas.
>
> Best,
> Erick
>
> On Wed, Apr 15, 2015 at 6:12 AM, Vincenzo D'Amore 
> wrote:
> > Hi all,
> >
> > can I change solrconfig.xml configuration when solrcloud is up and
> running?
> >
> > Best regards,
> > Vincenzo
> >
> >
> > --
> > Vincenzo D'Amore
> > email: v.dam...@gmail.com
> > skype: free.dev
> > mobile: +39 349 8513251
>



-- 
Vincenzo D'Amore
email: v.dam...@gmail.com
skype: free.dev
mobile: +39 349 8513251


custom search component on solrcloud

2015-04-15 Thread Peyman Faratin
Hi

I am trying to port my non-SolrCloud custom search handler to a SolrCloud one. 
I have read the WritingDistributedSearchComponents wiki page and looked at the 
TermsComponent and QueryComponent code, but the control flow of execution is 
still fuzzy (even given the "distributed algorithm" description).

Concretely, I have a non-SolrCloud algorithm that, given a sequence of tokens T, 
would

1- split T into single tokens
2- for each token t_i, get the DocList for t_i by executing 
rb.req.getSearcher().getDocList in the process() method of the custom search 
component
3- do some magic on the collection of doclists

My question is how can I

1) do the splitting (step 1 above) in a single shard, and
2) distribute the getDocList call for each token t_i to all shards,
3) wait till I have all the doclists from all shards, then
4) do something with the results in the original calling shard (from step 1 above).

Thank you for your help

How do I tell Tika to not complement a field's value defined in my Solr schema when indexing a binary document?

2015-04-15 Thread Patrick Savelberg
I use Solr to index different kinds of database tables. I have a Solr index 
containing a field named category. I make sure that the category field in Solr 
gets populated with the right value depending on the table. This I can use to 
build facet queries, which works fine.

The problem I have is with tables that contain records which represent binary 
documents like PDFs. I use the extract handler (Tika) to index the contents of 
the binary document along with the data from the database record. Tika 
sometimes finds metadata in the document which has the same name as one of my 
index fields in schema.xml, like category. I end up with the category field 
being a multi-valued field containing the category data from my database 
record AND the additional data from the category (meta)field extracted by Tika 
from the actual binary document. It seems that the extract handler adds every 
field it finds to my index if there is a corresponding field in my schema.

How can I prevent this from happening? All I need is the textual representation 
of the binary document added as content and not the extra (meta?) fields. I 
don't want the extra data TIKA may find to be added to any field in my index. 
However I do want to keep the data in the category field which comes from my 
database record. So adding a fmap.category="ignored_" won't help me because 
then the data of my database record will be ignored as well.

Another reason for wanting to prevent this is that I cannot know in advance 
which other fields Tika might come up with when a document is extracted. In 
other words, choosing more elaborate names (like namespace-style prefixes) for 
my index fields will never prevent field-name collisions 100%.

So, how can I prevent the data the extraction comes up with from being added 
to my index fields - or am I missing a point here?



Re: Lucene updateDocument does not affect index until restarting solr

2015-04-15 Thread Chris Hostetter

the short answer is that you need something to re-open the searcher -- but 
i'm not going to go into specifics on how to do that because...

You are dealing with a VERY low level layer of the lucene/solr code stack 
-- w/o more details on why you've written this particular bit of code (and 
where in the solr stack this code lives) it's hard to give you general 
advice on the best way to proceed, and i don't want to encourage you along 
a dangerous path when there are likely much 
easier/better/safer/more-supported ways to do what you are trying to do -- 
you just need to explain to us what that is.

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341




: Date: Thu, 9 Apr 2015 01:02:16 +0430
: From: Ali Nazemian 
: Reply-To: solr-user@lucene.apache.org
: To: "solr-user@lucene.apache.org" 
: Subject: Lucene updateDocument does not affect index until restarting solr
: 
: Dear all,
: Hi,
: As a part of my code I have to update Lucene document. For this purpose I
: used writer.updateDocument() method. My problem is the update process is
: not affect index until restarting Solr. Would you please tell me what part
: of my code is wrong? Or what should I add in order to apply the changes?
: 
: RefCounted<IndexWriter> iw = solrCoreState.getIndexWriter(core);
:   try {
:     IndexWriter writer = iw.get();
:     FieldType type = new FieldType(StringField.TYPE_STORED);
:     for (int i = 0; i < hits.length; i++) {
:       Document document = searcher.doc(hits[i].doc);
:       List<String> keywords = keyword.getKeywords(hits[i].doc);
:       if (keywords.size() > 0) document.removeFields(keywordField);
:       for (String word : keywords) {
:         document.add(new Field(keywordField, word, type));
:       }
:       String uniqueKey =
:           searcher.getSchema().getUniqueKeyField().getName();
:       writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)),
:           document);
:     }
:     writer.commit();
:   } finally {
:     iw.decref();
:   }
: 
: 
: Best regards.
: 
: -- 
: A.Nazemian
: 

-Hoss
http://www.lucidworks.com/


How do you manage / update schema.xml file

2015-04-15 Thread Steven White
Hi folks,

What is the best practice to manage and update Solr's schema.xml?

I need to deploy Solr dynamically based on customer configuration: they
will pick which fields are indexed or not, they will want to customize the
analyzer (WordDelimiterFilterFactory, etc.), and they will specify the language to use.

Is the task of setting up a proper schema.xml outside the scope of Solr
admin, one that I have to manage by writing my own application or is there
some tool that comes with Solr to help me do this?

I was thinking maybe SolrJ will do this for me but I couldn't find anything
about it to do this.

I also have to do customization to solrconfig.xml, thus the same question
applies here too.

Thanks in advance.

Steve


Re: Using synonyms API

2015-04-15 Thread Yonik Seeley
I just tried this quickly on trunk and it still works.

/opt/code/lusolr_trunk$ curl
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english

{
  "responseHeader":{
"status":0,
"QTime":234},
  "synonymMappings":{
"initArgs":{
  "ignoreCase":true,
  "format":"solr"},
"initializedOn":"2015-04-14T19:39:55.157Z",
"managedMap":{
  "GB":["GiB",
"Gigabyte"],
  "TV":["Television"],
  "happy":["glad",
"joyful"]}}}


Verify that your URL has the correct port number (your example below
doesn't), and that "default-collection" is actually the name of your
default collection (and not "collection1" which is the default for the
4x series).

-Yonik


On Wed, Apr 15, 2015 at 11:11 AM, Mike Thomsen  wrote:
> We recently upgraded from 4.5.0 to 4.10.4. I tried getting a list of our
> synonyms like this:
>
> http://localhost/solr/default-collection/schema/analysis/synonyms/english
>
> I got a not found error. I found this page on new features in 4.8
>
> http://yonik.com/solr-4-8-features/
>
> Do we have to do something like this with our schema to even get the
> synonyms API working?
>
> <fieldType name="managed_en" class="solr.TextField"
> positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ManagedStopFilterFactory" managed="english"/>
>     <filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
>   </analyzer>
> </fieldType>
>
> I wanted to ask before changing our schema.
>
> Thanks,
>
> Mike


Re: Using synonyms API

2015-04-15 Thread Mike Thomsen
Thanks. It turned out to be caused by me not using the
ManagedSynonymFilterFactory.

I added the dummy managed_en field:

<fieldType name="managed_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english"/>
    <filter class="solr.ManagedSynonymFilterFactory" managed="english"/>
  </analyzer>
</fieldType>

and defined a field that uses it in the schema block like so:

<field name="dummy_text" type="managed_en" indexed="true" stored="true" multiValued="true"/>
<copyField source="dummy_stuff" dest="dummy_text"/>
Here is the output of the managed synonym listing:

{
  "responseHeader":{
"status":0,
"QTime":343},
  "synonymMappings":{
"initArgs":{"ignoreCase":false},
"initializedOn":"2015-04-15T19:13:15.072Z",
"managedMap":{"Crota":["Crouton"]}}}



I posted this document successfully and can find it when I search for it
with this: *dummy_text: Crota*

{
"id": "stupidtestmessage",
"label": "Crota, Son of Oryx, lives!",
"dummy_stuff": [
"Crota, Son of Oryx, and pretty important dude in the Hive was
discovered alive and well in the hellmouth today!"
]
}

When I use *dummy_text: Crouton*, nothing comes back. I am pretty confident
that I am missing something here. Any ideas?

Thanks,

Mike

On Wed, Apr 15, 2015 at 3:04 PM, Yonik Seeley  wrote:

> I just tried this quickly on trunk and it still works.
>
> /opt/code/lusolr_trunk$ curl
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":234},
>   "synonymMappings":{
> "initArgs":{
>   "ignoreCase":true,
>   "format":"solr"},
> "initializedOn":"2015-04-14T19:39:55.157Z",
> "managedMap":{
>   "GB":["GiB",
> "Gigabyte"],
>   "TV":["Television"],
>   "happy":["glad",
> "joyful"]}}}
>
>
> Verify that your URL has the correct port number (your example below
> doesn't), and that "default-collection" is actually the name of your
> default collection (and not "collection1" which is the default for the
> 4x series).
>
> -Yonik
>
>
> On Wed, Apr 15, 2015 at 11:11 AM, Mike Thomsen 
> wrote:
> > We recently upgraded from 4.5.0 to 4.10.4. I tried getting a list of our
> > synonyms like this:
> >
> >
> http://localhost/solr/default-collection/schema/analysis/synonyms/english
> >
> > I got a not found error. I found this page on new features in 4.8
> >
> > http://yonik.com/solr-4-8-features/
> >
> > Do we have to do something like this with our schema to even get the
> > synonyms API working?
> >
> > <fieldType name="..." class="solr.TextField"
> > positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="..."/>
> >     <filter class="..."/>
> >     <filter class="..."/>
> >   </analyzer>
> > </fieldType>
> >
> > I wanted to ask before changing our schema.
> >
> > Thanks,
> >
> > Mike
>


(possible)SimplePostTool problem --(Windows, Bitnami distribution)

2015-04-15 Thread kenadian
Hello all,
my Bitnami/*Solr-5.0.0* installation is not able to index any type of
file (found in the provided example folders or anywhere else) except HTML.

Tested on the files in the "exampledocs" folder
(books.csv, books.json, ..., utf8-example.xml, vidcard.xml) I get:
for *.csv* files I get the response "Unexpected character 'i' " (depending on
what the 1st character in the file is),
for *.xml* files I get the response "ERROR: unknown field 'id' "
for *.pdf* files I get the response "Invalid UTF-8 middle byte 0xe5"
and so forth.
Even *.TXT* files are not handled:
I get the response "Unexpected character 'T' " (depending on what the 1st
character in the file is -- "This is a test of TXT extraction in Solr, it is
only a test. Do not panic.")


The only type that works is *HTML* :

C:\Bitnami\solr-5.0.0-0\apache-solr\solr\exampledocs>java -Dc=tika -jar
post.jar  *.html

SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/tika/update using
content-type application/xml...
POSTing file sample.html to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/tika/update...
Time spent: 0:00:00.313

I use Windows 8.1, java version "1.8.0_40".

Any ideas on how to fix this? Many thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/possible-SimplePostTool-problem-Windows-Bitnami-distribution-tp4199980.html
Sent from the Solr - User mailing list archive at Nabble.com.


_version_ returned from /update?

2015-04-15 Thread Reitzel, Charles
Hi All,

In the interests of minimizing round-trips to the database, is there any way to 
get the added/changed _version_ values returned from /update?   Or do you 
always have to do a fresh get?

Yes, I am using optimistic concurrency.  No, I am not using atomic updates 
(yet).

Has anyone tried this (or something like it)?

Thanks,
Charlie

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*


Re: Problem related to filter on Zero value for DateField

2015-04-15 Thread Ali Nazemian
Dear Jack,
Hi,
The q parameter is *:* since I just wanted to filter the documents.
Regards.

On Tue, Apr 14, 2015 at 8:07 PM, Jack Krupansky 
wrote:

> What does your main query look like? Normally we don't speak of "searching"
> with the fq parameter - it filters the results, but the actual searching is
> done via the main query with the q parameter.
>
> -- Jack Krupansky
>
> On Tue, Apr 14, 2015 at 4:17 AM, Ali Nazemian 
> wrote:
>
> > Dears,
> > Hi,
> > I have strange problem with Solr 4.10.x. My problem is when I do
> searching
> > on solr Zero date which is "0002-11-30T00:00:00Z" if more than one filter
> > be considered, the results became invalid. For example consider this
> > scenario:
> > When I search for a document with fq=p_date:"0002-11-30T00:00:00Z" Solr
> > returns three different documents which is right for my Collection. All
> of
> > these three documents have same value of "7" for document status. Now If
> I
> > search for fq=document_status:7 the same three documents returns which is
> > also a correct response. But When I do the searching on
> > fq=focument_status:7&fq=p_date:"0002-11-30T00:00:00Z", Solr returns
> > nothing! (0 document) While I have not such problem with other date
> values
> > beside Solr Zero ("0002-11-30T00:00:00Z"). Please let me know it is a bug
> > related to Solr or I did something wrong?
> > Best regards.
> >
> > --
> > A.Nazemian
> >
>



-- 
A.Nazemian


Re: Lucene updateDocument does not affect index until restarting solr

2015-04-15 Thread Ali Nazemian
Dear Chris,
Hi,
Thank you for your response. Actually I implemented a small piece of code
for the purpose of extracting article keywords out of the Lucene index on
commit, on optimize, or when a specific query is called. I implemented it
as a search component. I know that a searchComponent is not meant for
updating the index, but it was suggested on the Solr mailing list in the
first place, and it seems to be the most feasible solution given Solr's
extension points. Anyway, for more information about why I chose a
searchComponent in the first place, please take a look at this link.

Best regards.


On Wed, Apr 15, 2015 at 10:00 PM, Chris Hostetter 
wrote:

>
> the short answer is that you need something to re-open the searcher -- but
> i'm not going to go into specifics on how to do that because...
>
> You are dealing with a VERY low level layer of the lucene/solr code stack
> -- w/o more details on why you've written this particular bit of code (and
> where in the solr stack this code lives) it's hard to give you general
> advice on the best way to proceed and I don't want to encourage you along
> a dangerous path when there are likely much
> easier/better/safer/more-supported ways to do what you are trying to do --
> you just need to explain to us what that is.
>
> https://people.apache.org/~hossman/#xyproblem
> XY Problem
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
>
>
>
> : Date: Thu, 9 Apr 2015 01:02:16 +0430
> : From: Ali Nazemian 
> : Reply-To: solr-user@lucene.apache.org
> : To: "solr-user@lucene.apache.org" 
> : Subject: Lucene updateDocument does not affect index until restarting
> solr
> :
> : Dear all,
> : Hi,
> : As part of my code I have to update a Lucene document. For this purpose I
> : used the writer.updateDocument() method. My problem is that the update
> : does not affect the index until Solr is restarted. Would you please tell
> : me what part of my code is wrong? Or what should I add in order to apply
> : the changes?
> :
> : RefCounted<IndexWriter> iw = solrCoreState.getIndexWriter(core);
> :   try {
> : IndexWriter writer = iw.get();
> : FieldType type = new FieldType(StringField.TYPE_STORED);
> : for (int i = 0; i < hits.length; i++) {
> :   Document document = searcher.doc(hits[i].doc);
> :   List<String> keywords = keyword.getKeywords(hits[i].doc);
> :   if (keywords.size() > 0) document.removeFields(keywordField);
> :   for (String word : keywords) {
> : document.add(new Field(keywordField, word, type));
> :   }
> :   String uniqueKey =
> : searcher.getSchema().getUniqueKeyField().getName();
> :   writer.updateDocument(new Term(uniqueKey,
> : document.get(uniqueKey)),
> :   document);
> : }
> : writer.commit();
> :   } finally {
> : iw.decref();
> :   }
> :
> :
> : Best regards.
> :
> : --
> : A.Nazemian
> :
>
> -Hoss
> http://www.lucidworks.com/
>



-- 
A.Nazemian


Re: Using synonyms API

2015-04-15 Thread Mike Thomsen
I also tried the 4.10.4 default example and set up the synonym list like
this:

{
  "responseHeader":{
"status":0,
"QTime":2},
  "synonymMappings":{
"initArgs":{
  "ignoreCase":true,
  "format":"solr"},
"initializedOn":"2015-04-15T20:26:02.072Z",
"managedMap":{
  "Battery":["Deadweight"],
  "GB":["GiB",
"Gigabyte"],
  "TV":["Television"],
  "happy":["glad",
"joyful"]}}}

I added a dynamicField called my_syntext with a type of
managed_english per the example.

Then I indexed an example from the ipod data set with my_syntext set
to "Full Battery for you" as the text.

Finally, I did a search on my_syntext for "Deadweight" and nothing came
back. I reloaded the core and even restarted Solr. Nothing seemed to
work.



On Wed, Apr 15, 2015 at 3:04 PM, Yonik Seeley  wrote:

> I just tried this quickly on trunk and it still works.
>
> /opt/code/lusolr_trunk$ curl
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/english
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":234},
>   "synonymMappings":{
> "initArgs":{
>   "ignoreCase":true,
>   "format":"solr"},
> "initializedOn":"2015-04-14T19:39:55.157Z",
> "managedMap":{
>   "GB":["GiB",
> "Gigabyte"],
>   "TV":["Television"],
>   "happy":["glad",
> "joyful"]}}}
>
>
> Verify that your URL has the correct port number (your example below
> doesn't), and that "default-collection" is actually the name of your
> default collection (and not "collection1" which is the default for the
> 4x series).
>
> -Yonik
>
>
> On Wed, Apr 15, 2015 at 11:11 AM, Mike Thomsen 
> wrote:
> > We recently upgraded from 4.5.0 to 4.10.4. I tried getting a list of our
> > synonyms like this:
> >
> >
> http://localhost/solr/default-collection/schema/analysis/synonyms/english
> >
> > I got a not found error. I found this page on new features in 4.8
> >
> > http://yonik.com/solr-4-8-features/
> >
> > Do we have to do something like this with our schema to even get the
> > synonyms API working?
> >
> > <fieldType name="..." class="solr.TextField"
> > positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="..."/>
> >     <filter class="..."/>
> >     <filter class="..."/>
> >   </analyzer>
> > </fieldType>
> >
> > I wanted to ask before changing our schema.
> >
> > Thanks,
> >
> > Mike
>


solr index design for this use case?

2015-04-15 Thread vsriram30
Hi All,

Consider this scenario: I have around 100K content items and I want to
launch 5 sites with that content. For example, around 50K items for site1,
40K for site2, 30K for site3, 20K for site4, and 10K for site5.

As seen from this example, these sites have some overlapping content as
well as non-overlapping content. Say a content page present in site1,
site2, and site3 has 50 fields, of which 30 remain common between site1
and site2, 25 are common between site1 and site3, and 20 between site2
and site3. My aim is to prevent duplication as much as possible without
too much reduction in QPS. Hence I am considering the following options:

Option 1: Just maintain an individual copy of the duplicated content for
each site and overwrite site-specific information while indexing for those
sites.
Pros:
Better QPS as no query time joins are involved.
Cons:
Duplication of common fields for common content across sites.

Option 2: Maintain just a single copy of the common fields per content item
across all overlapping sites, keep the site-specific information separate,
and merge them at serving time using joins.
For the joins in this approach I looked at the block join provided by Solr,
and it looks like it may not be a good fit for my case: if one site's
specific info changes, I don't want to re-index the entire block containing
the other sites as well.
Is there a better way to tackle this that avoids occupying so much space
while at the same time not reducing the QPS too much?

Thanks,
Sriram



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-index-design-for-this-use-case-tp420.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: _version_ returned from /update?

2015-04-15 Thread Chris Hostetter

: In the interests of minimizing round-trips to the database, is there any 
: way to get the added/changed _version_ values returned from /update?  
: Or do you always have to do a fresh get?

there is a "versions=true" param you can specify on updates to get the 
version# back for each doc added -- but apparently we've never documented 
it before? ... I'll try to make sure that's remedied in the 5.1 ref guide


(FWIW: there also appears to be some problem with using this option with 
the data_driven_schema_configs ... I just opened SOLR-7404 to look into 
this)


curl -X POST -H "Content-Type: application/csv" --data-binary @books.csv 
"http://localhost:8983/solr/techproducts/update?commit=true&versions=true";


(XML response tags stripped by the archive; it contained status 0, QTime 1513,
and the new _version_ value for each document added: 1498553454936719360,
1498553454987051008, 1498553454990196736, 1498553454993342464,
1498553454995439616)
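
From SolrJ, a sketch of the same request (it assumes the versions come back
in the response under an "adds" entry of id/version pairs; the document is
illustrative):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

public class VersionsOnUpdate {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/techproducts");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "book-1");        // illustrative doc
    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    req.setParam("versions", "true");    // ask for the assigned _version_ back
    req.setAction(UpdateRequest.ACTION.COMMIT, true, true);
    UpdateResponse rsp = req.process(server);
    // the versions should come back under "adds" as id -> version pairs
    System.out.println(rsp.getResponse().get("adds"));
  }
}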








-Hoss
http://www.lucidworks.com/


Re: How do I tell Tika to not complement a field's value defined in my Solr schema when indexing a binary document?

2015-04-15 Thread Erick Erickson
My standard answer when you want to really customize how stuff like
this works is to do the Tika processing in SolrJ. That lets you
ignore/modify/whatever anything you want. It also moves the parsing
load off of the Solr node which scales much better. Here's an example:
http://lucidworks.com/blog/indexing-with-solrj/
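
The shape of that pattern is roughly this (a sketch only; the core URL and
the "content"/"category" field names are illustrative, and it assumes the
Tika and SolrJ jars are on the classpath):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaSolrJIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler text = new BodyContentHandler(-1); // no write limit
    Metadata metadata = new Metadata();   // Tika metadata lands here...
    try (InputStream in = new FileInputStream(args[0])) {
      parser.parse(in, text, metadata);
    }
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", args[0]);
    doc.addField("content", text.toString()); // ...but only what you copy is indexed
    doc.addField("category", "value-from-your-database-row");
    server.add(doc);
    server.commit();
  }
}

Because you build the SolrInputDocument yourself, whatever Tika collects in
Metadata never reaches the index unless you explicitly copy it over.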

IOW, I don't know how to do what you're asking for from within the
Extracting Request Handler. Not quite sure whether "literals" would
work for you, see:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Best,
Erick

On Wed, Apr 15, 2015 at 10:26 AM, Patrick Savelberg
 wrote:
> I use Solr to index different kinds of database tables. I have a Solr index 
> containing a field named category. I make sure that the category field in 
> Solr gets populated with the right value depending on the table. This I can 
> use to build facet queries, which works fine.
>
> The problem I have is with tables that contain records which represent binary 
> documents like PDF's. I use the extract query (TIKA) to index the contents of 
> the binary document along with the data from the database record. Tika 
> sometimes finds metadata in the document which has the same name as one of my 
> index fields I have in my schema.xml, like category. I end up with the 
> category field being a multi-value field containing the category data from my 
> database record AND the additional data from the category (meta)field 
> extracted by TIKA from the actual binary document. It seems that the 
> extract handler adds every field it may find to my index if there is a 
> corresponding field in my index.
>
> How can I prevent this from happening? All I need is the textual 
> representation of the binary document added as content and not the extra 
> (meta?) fields. I don't want the extra data TIKA may find to be added to any 
> field in my index. However I do want to keep the data in the category field 
> which comes from my database record. So adding a fmap.category="ignored_" 
> won't help me because then the data of my database record will be ignored as 
> well.
>
> Another reason for wanting to prevent this is that I cannot know in advance 
> which other fields TIKA might come up with when the document is extracted. In 
> other words, choosing more elaborate names (like a namespace-like prefix) for 
> my index fields will never prevent field name collisions 100%.
>
> So, how can I prevent the data the extract comes up with from being added to 
> my index field, or am I missing a point here?
>


Re: How do you manage / update schema.xml file

2015-04-15 Thread Erick Erickson
Have you looked at the "managed schema" stuff?
see: 
https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig
There's also some work being done to update at least parts of
solrconfig.xml, see:
https://issues.apache.org/jira/browse/SOLR-6533
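
To give a feel for it: once the schema is managed and mutable, adding a
field is just an HTTP call. A rough sketch (the core and field names are
illustrative, and it assumes the 4.10-era Schema API with
ManagedIndexSchemaFactory and mutable="true" in solrconfig.xml):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class AddFieldViaSchemaAPI {
  public static void main(String[] args) throws Exception {
    // PUT a single field definition; "customer_notes" is illustrative
    URL url = new URL(
        "http://localhost:8983/solr/collection1/schema/fields/customer_notes");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("PUT");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write("{\"type\":\"text_general\",\"stored\":true}".getBytes("UTF-8"));
    }
    System.out.println("HTTP " + conn.getResponseCode());
  }
}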

Best,
Erick

On Wed, Apr 15, 2015 at 11:46 AM, Steven White  wrote:
> Hi folks,
>
> What is the best practice to manage and update Solr's schema.xml?
>
> I need to deploy Solr dynamically based on customer configuration (they
> will pick fields to be indexed or not, they will want to customize the
> analyzer (WordDelimiterFilterFactory, etc.), and specify the language to use).
>
> Is the task of setting up a proper schema.xml outside the scope of Solr
> admin, one that I have to manage by writing my own application or is there
> some tool that comes with Solr to help me do this?
>
> I was thinking maybe SolrJ will do this for me but I couldn't find anything
> about it to do this.
>
> I also have to do customization to solrconfig.xml, thus the same question
> applies here too.
>
> Thanks in advance.
>
> Steve


Differentiating user search term in Solr

2015-04-15 Thread Steven White
Hi folks,

If a user types in the search box (without quotes): "{!q.op=AND df=text
solr sys" and I take that text and build the URL like so:

http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true

This will fail with "Expected identifier" because it is not valid Solr
query syntax.

My question is this: is there a flag I can send to Solr with the URL
telling it to treat what's in "q" as raw text vs. having it processed
as Solr syntax?  If not, then it means I have to escape all Solr reserved
characters and words.  If so, where can I find the complete list?  Also,
what happens when a new reserved character or word is added to Solr down
the road?  It means I have to upgrade my application too, which is
something I would like to avoid.

Thanks

Steve


Re: How do you manage / update schema.xml file

2015-04-15 Thread Steven White
Thanks, this is exactly what I was looking for!!

Steve

On Wed, Apr 15, 2015 at 5:48 PM, Erick Erickson 
wrote:

> Have you looked at the "managed schema" stuff?
> see:
> https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig
> There's also some work being done to update at least parts of
> solrconfig.xml, see:
> https://issues.apache.org/jira/browse/SOLR-6533
>
> Best,
> Erick
>
> On Wed, Apr 15, 2015 at 11:46 AM, Steven White 
> wrote:
> > Hi folks,
> >
> > What is the best practice to manage and update Solr's schema.xml?
> >
> > I need to deploy Solr dynamically based on customer configuration (they
> > will pick fields to be indexed or not, they will want to customize the
> > analyzer (WordDelimiterFilterFactory, etc.), and specify the language to
> > use).
> >
> > Is the task of setting up a proper schema.xml outside the scope of Solr
> > admin, one that I have to manage by writing my own application or is
> there
> > some tool that comes with Solr to help me do this?
> >
> > I was thinking maybe SolrJ will do this for me but I couldn't find
> anything
> > about it to do this.
> >
> > I also have to do customization to solrconfig.xml, thus the same question
> > applies here too.
> >
> > Thanks in advance.
> >
> > Steve
>


rq breaks wildcard search?

2015-04-15 Thread Ryan Josal
Using edismax, supplying an rq= param like {!rerank ...} causes an
UnsupportedOperationException because the Query doesn't implement
createWeight.  This is for WildcardQuery in particular.  From some
preliminary debugging it looks like without rq, the qf Queries somehow
turn into ConstantScore instead of WildcardQuery.  I don't think this
is related to the RankQuery implementation, as my own subclass has the
same issue.  Anyway, the effect is that all q's containing ? or * return
HTTP 500 because I always have rq on.  Can anyone confirm if this is a
bug?  I will log it in Jira if so.

Also, does anyone know how I can work around it?  Specifically, can I
disable edismax from making WildcardQueries?

Ryan


RE: _version_ returned from /update?

2015-04-15 Thread Reitzel, Charles
Hey, that's great!  I'll give it a try.  

File under, "never hurts to ask" ... :-)

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Wednesday, April 15, 2015 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: _version_ returned from /update?


: In the interests of minimizing round-trips to the database, is there any
: way to get the added/changed _version_ values returned from /update?  
: Or do you always have to do a fresh get?

there is a "versions=true" param you can specify on updates to get the version# 
back for each doc added -- but apparently we've never documented it before? ... 
I'll try to make sure that's remedied in the 5.1 ref guide

(FWIW: there also appears to be some problem with using this option with the 
data_driven_schema_configs ... I just opened SOLR-7404 to look into this)

curl -X POST -H "Content-Type: application/csv" --data-binary @books.csv 
"http://localhost:8983/solr/techproducts/update?commit=true&versions=true";


(XML response tags stripped by the archive; it contained status 0, QTime 1513,
and the new _version_ value for each document added: 1498553454936719360,
1498553454987051008, 1498553454990196736, 1498553454993342464,
1498553454995439616)








-Hoss
http://www.lucidworks.com/

*
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*



Re: Differentiating user search term in Solr

2015-04-15 Thread Shawn Heisey
On 4/15/2015 3:54 PM, Steven White wrote:
> Hi folks,
>
> If a user types in the search box (without quotes): "{!q.op=AND df=text
> solr sys" and I take that text and build the URL like so:
>
> http://localhost:8983/solr/db/select?q={!q.op=AND%20df=text%20solr%20sys&fl=id%2Cscore%2Ctitle&wt=xml&indent=true
>
> This will fail with "Expected identifier" because it is not valid Solr
> query syntax.

That isn't valid syntax for the lucene query parser ... the localparams
are not closed (it would require a } character), and after the
localparams there would need to be some additional text.

> My question is this: is there a flag I can send to Solr with the URL
> telling it to treat what's in "q" as raw text vs. having it processed
> as Solr syntax?  If not, then it means I have to escape all Solr reserved
> characters and words.  If so, where can I find the complete list?  Also,
> what happens when a new reserved character or word is added to Solr down
> the road?  It means I have to upgrade my application too, which is
> something I would like to avoid.

One way to treat the entire input as literal text is to use the terms
query parser ... but that requires the localparams syntax, and I do not
know exactly what is going to happen if you use a query string that
itself is localparams syntax -- {! other params} ... so escaping is
probably safer.

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermQueryParser

The other way to handle it is to escape every special character with a
backslash.  The escapeQueryChars method in SolrJ is always kept up to
date, and can escape every special character.

http://lucene.apache.org/solr/4_10_3/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html#escapeQueryChars%28java.lang.String%29

The javadoc for that method points to the queryparser syntax for more
info on characters that need escaping.  Scroll to the very end of this page:

http://lucene.apache.org/core/4_10_3/queryparser/org/apache/lucene/queryparser/classic/package-summary.html?is-external=true

That page lists || and && rather than just the single characters | and &
... the escapeQueryChars method in SolrJ will escape both characters, as
it only works at the character level, not the string level.

If you want the *spaces* in your query to be treated literally also, you
must escape them too.  The escapeQueryChars method I've mentioned will
NOT escape spaces.
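
A small sketch of that SolrJ route (core and field names borrowed from the
example URL above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;

public class EscapeUserInput {
  public static void main(String[] args) throws Exception {
    String raw = "{!q.op=AND df=text solr sys";
    // backslash-escape the query-syntax characters so they match literally
    String escaped = ClientUtils.escapeQueryChars(raw);
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/db");
    SolrQuery q = new SolrQuery(escaped);
    q.setFields("id", "score", "title");
    System.out.println(server.query(q).getResults().getNumFound());
  }
}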

Note that this does not cover URL escaping -- the & character must be
sent as %26 or the servlet container will treat it as a special
character, before it even gets to Solr.

Thanks,
Shawn



Re: Problem related to filter on Zero value for DateField

2015-04-15 Thread Chris Hostetter

You're going to have to provide a lot more details (solr version, sample 
data, full queries, details about configs, etc...) in order for anyone to 
offer you meaningful assistance...

https://wiki.apache.org/solr/UsingMailingLists

I attempted to reproduce the steps you describe using Solr 5.1 and the 
techproducts example and could not even remotely reproduce what you 
describe.


$ bin/solr -e techproducts
...
$ curl -X POST -H 'Content-Type: application/json' 
'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary 
'[{ "id" : "aaa", "bar_i" : 7, "foo_dt" : "0002-11-30T00:00:00Z" }, { "id" 
: "bbb", "bar_i" : 7, "foo_dt" : "0002-11-30T00:00:00Z" }, { "id" : 
"nomatch_date", "bar_i" : 7, "foo_dt" : "1976-11-30T00:00:00Z" }]'
{"responseHeader":{"status":0,"QTime":399}}
$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fq=bar_i:7'
{
  "responseHeader":{
"status":0,
"QTime":8,
"params":{
  "q":"*:*",
  "fq":"bar_i:7"}},
  "response":{"numFound":3,"start":0,"docs":[
  {
"id":"aaa",
"bar_i":7,
"foo_dt":"0002-11-30T00:00:00Z",
"_version_":1498568386111602688},
  {
"id":"bbb",
"bar_i":7,
"foo_dt":"0002-11-30T00:00:00Z",
"_version_":1498568386113699840},
  {
"id":"nomatch_date",
"bar_i":7,
"foo_dt":"1976-11-30T00:00:00Z",
"_version_":1498568386114748416}]
  }}
$ curl 
'http://localhost:8983/solr/techproducts/query?q=*:*&fq=foo_dt:"0002-11-30T00:00:00Z"'
{
  "responseHeader":{
"status":0,
"QTime":1,
"params":{
  "q":"*:*",
  "fq":"foo_dt:\"0002-11-30T00:00:00Z\""}},
  "response":{"numFound":2,"start":0,"docs":[
  {
"id":"aaa",
"bar_i":7,
"foo_dt":"0002-11-30T00:00:00Z",
"_version_":1498568386111602688},
  {
"id":"bbb",
"bar_i":7,
"foo_dt":"0002-11-30T00:00:00Z",
"_version_":1498568386113699840}]
  }}
$ curl 
'http://localhost:8983/solr/techproducts/query?q=*:*&fq=bar_i:7&fq=foo_dt:"0002-11-30T00:00:00Z"'
{
  "responseHeader":{
"status":0,
"QTime":1,
"params":{
  "q":"*:*",
  "fq":["bar_i:7",
"foo_dt:\"0002-11-30T00:00:00Z\""]}},
  "response":{"numFound":2,"start":0,"docs":[
  {
"id":"aaa",
"bar_i":7,
"foo_dt":"0002-11-30T00:00:00Z",
"_version_":1498568386111602688},
  {
"id":"bbb",
"bar_i":7,
"foo_dt":"0002-11-30T00:00:00Z",
"_version_":1498568386113699840}]
  }}




: Date: Tue, 14 Apr 2015 12:47:31 +0430
: From: Ali Nazemian 
: Reply-To: solr-user@lucene.apache.org
: To: "solr-user@lucene.apache.org" 
: Subject: Problem related to filter on Zero value for DateField
: 
: Dears,
: Hi,
: I have a strange problem with Solr 4.10.x: when I do a search on the Solr
: zero date, "0002-11-30T00:00:00Z", and more than one filter is applied, the
: results become invalid. For example, consider this scenario:
: When I search for a document with fq=p_date:"0002-11-30T00:00:00Z", Solr
: returns three different documents, which is right for my collection. All of
: these three documents have the same value of "7" for document status. Now if
: I search for fq=document_status:7, the same three documents are returned,
: which is also a correct response. But when I do the search on
: fq=document_status:7&fq=p_date:"0002-11-30T00:00:00Z", Solr returns
: nothing! (0 documents), while I have no such problem with other date values
: beside Solr Zero ("0002-11-30T00:00:00Z"). Please let me know: is this a bug
: related to Solr, or did I do something wrong?
: Best regards.
: 
: -- 
: A.Nazemian
: 

-Hoss
http://www.lucidworks.com/


Re: solr index design for this use case?

2015-04-15 Thread Erick Erickson
At this data size, don't worry at _all_ about duplicating content. A
single Solr node easily holds 20M docs. 50M is common and 250M is not
unheard of.

My bold claim is: you can freely duplicate the data to your heart's
content and you'll never notice it.

In fact, you can put it all in a single collection with some kind of
"site" field to distinguish which is which
and when you want to restrict results to a specific site, just use an fq clause.
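
Something like this, for instance (a sketch; the collection name and the
"site" field are whatever you set up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class SiteFilterQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/content");
    SolrQuery q = new SolrQuery("title:shoes");   // whatever the user query is
    q.addFilterQuery("site:site1");  // cached filter; restricts hits to one site
    System.out.println(server.query(q).getResults().getNumFound());
  }
}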

HTH,
Erick

On Wed, Apr 15, 2015 at 1:40 PM, vsriram30  wrote:
> Hi All,
>
> Consider this scenario: I have around 100K content items and I want to
> launch 5 sites with that content. For example, around 50K items for site1,
> 40K for site2, 30K for site3, 20K for site4, and 10K for site5.
>
> As seen from this example, these sites have some overlapping content as
> well as non-overlapping content. Say a content page present in site1,
> site2, and site3 has 50 fields, of which 30 remain common between site1
> and site2, 25 are common between site1 and site3, and 20 between site2
> and site3. My aim is to prevent duplication as much as possible without
> too much reduction in QPS. Hence I am considering the following options:
>
> Option 1: Just maintain an individual copy of the duplicated content for
> each site and overwrite site-specific information while indexing for those
> sites.
> Pros:
> Better QPS as no query time joins are involved.
> Cons:
> Duplication of common fields for common content across sites.
>
> Option 2: Maintain just a single copy of the common fields per content item
> across all overlapping sites, keep the site-specific information separate,
> and merge them at serving time using joins.
> For the joins in this approach I looked at the block join provided by Solr,
> and it looks like it may not be a good fit for my case: if one site's
> specific info changes, I don't want to re-index the entire block containing
> the other sites as well.
> Is there a better way to tackle this that avoids occupying so much space
> while at the same time not reducing the QPS too much?
>
> Thanks,
> Sriram
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-index-design-for-this-use-case-tp420.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr index design for this use case?

2015-04-15 Thread vsriram30
Hi Erick,

Thanks for your response. I was planning to do the same: store the data
in a single collection with a site parameter differentiating duplicated
content for different sites. But my use case is that in future the content
could run into millions of items, and potentially a large number of sites
as well. Hence I am trying to arrive at a solution that is the best of
both worlds. Is there an efficient way to store the content that prevents
duplication, merges the site-specific and common content, and still allows
editing a record for an individual site without re-indexing the entire
block?

Thanks,
Sriram



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-index-design-for-this-use-case-tp420p4200065.html
Sent from the Solr - User mailing list archive at Nabble.com.