SOLR query parameters

2015-03-05 Thread phiroc
Hello,

could someone please explain what these SOLR query parameter keywords stand for:

- ppcdb

- srbycb

- as

For instance,

http://searcharchives.iht.com:8983/solr/inytapdf0/browse?ppdcb=&srbycb=&as=&q=kaiser&sort=

I could not find them in the SOLR documentation.

Many thanks.

Philippe






Re: SOLR query parameters

2015-03-05 Thread phiroc
Please ignore my question.

These are form field names which I created a couple of months ago, not SOLR 
query parameters.

Philippe








Re: Missing doc fields

2015-03-11 Thread phiroc
When I run the following query,

http://myserver:8990/solr/archives0/select?q=*:*&rows=3&wt=json&ft=id,ymd

The response is 

{"responseHeader":{"status":0,"QTime":1,"params":{"q":"*:*","rows":"3","wt":"json","ft":"id,ymd"}},"response":{"numFound":160238,"start":0,"docs":[{"id":"10","_version_":1495262519674011648},{"id":"1","_version_":1495262517261238272},{"id":"2","_version_":1495262517637677056}]}}


The ymd field does not appear in the list of document fields, although it is 
defined in my schema.xml.

Is there a way to tell SOLR to return that field in responses?


Philippe



- Original message -
From: phi...@free.fr
To: solr-user@lucene.apache.org
Sent: Wednesday 11 March 2015 11:06:29
Subject: Missing doc fields



Hello,

when I display one of my core's schema, lots of fields appear:

"fields":[
  {
    "name":"_root_",
    "type":"string",
    "indexed":true,
    "stored":false},
  {
    "name":"_version_",
    "type":"long",
    "indexed":true,
    "stored":true},
  {
    "name":"id",
    "type":"string",
    "multiValued":false,
    "indexed":true,
    "required":true,
    "stored":true},
  {
    "name":"ymd",
    "type":"tdate",
    "indexed":true,
    "stored":true}],
   


Yet, when I display $results in the richtext_doc.vm Velocity template, 
documents only contain three fields (id, _version_, score):

SolrDocument{id=3, _version_=1495262517955395584, score=1.0}, 


How can I increase the number of doc fields?

Many thanks.

Philippe


Re: Missing doc fields

2015-03-11 Thread phiroc
I meant 'fl'.

--

http://myserver:8990/solr/archives0/select?q=*:*&rows=3&wt=json&fl=*

--

{"responseHeader":{"status":0,"QTime":3,"params":{"q":"*:*","fl":"*","rows":"3","wt":"json"}},"response":{"numFound":160238,"start":0,"docs":[{"id":"10","_version_":1495262519674011648},{"id":"1","_version_":1495262517261238272},{"id":"2","_version_":1495262517637677056}]}}


-- schema.xml 



 
   
   
   

  



   
   


---





- Original message -
From: "Dmitry Kan" 
To: solr-user@lucene.apache.org
Sent: Wednesday 11 March 2015 11:38:26
Subject: Re: Missing doc fields

What is the ft parameter that you are sending?


In order to see all stored fields use the parameter fl=*

Or list the field names you need: fl=id,ymd




-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Missing doc fields

2015-03-11 Thread phiroc
Hello,

I found the reason: the query to store ymds in SOLR was invalid ("json" and 
"literal" are concatenated below).

curl -Ss -X POST 
'http://myserver:8990/solr/archives0/update/extract?extractFormat=text&wt=jsonliteral.ymd=1944-12-31T00:00:00A&literal.id=159168


Philippe
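
For what it's worth, one way to avoid this kind of concatenation slip is to build the query string from a dictionary rather than by hand. The following is only a minimal sketch: the helper name extract_url is hypothetical, the host, core and literal values are taken from the command above, and the date is written with a trailing "Z" on the assumption that the ymd field expects Solr's usual UTC date format.

```python
from urllib.parse import urlencode


def extract_url(base, doc_id, ymd):
    """Build an /update/extract URL (hypothetical helper, not part of Solr)."""
    params = {
        "extractFormat": "text",
        "wt": "json",
        "literal.ymd": ymd,
        "literal.id": doc_id,
    }
    # urlencode joins every key=value pair with "&", so "wt=json" and
    # "literal.ymd=..." cannot run together as they did in the curl command.
    return base + "/update/extract?" + urlencode(params, safe=":")


url = extract_url(
    "http://myserver:8990/solr/archives0",
    159168,
    "1944-12-31T00:00:00Z",  # assumed UTC form; the original message shows "...00A"
)
print(url)
```

The file to extract would still have to be sent with the request; that part of the original command is not shown in the message, so it is left out here.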





Creating a directory resource in solr-jetty

2015-03-11 Thread phiroc
Hello,

does anyone know if it is possible to create a directory resource in the solr-jetty 
configuration files?

In Tomcat 8, you can do the following:




many thanks.

Philippe


Missing doc fields

2015-03-12 Thread phiroc


Hello,

when I display one of my core's schema, lots of fields appear:

"fields":[
  {
    "name":"_root_",
    "type":"string",
    "indexed":true,
    "stored":false},
  {
    "name":"_version_",
    "type":"long",
    "indexed":true,
    "stored":true},
  {
    "name":"id",
    "type":"string",
    "multiValued":false,
    "indexed":true,
    "required":true,
    "stored":true},
  {
    "name":"ymd",
    "type":"tdate",
    "indexed":true,
    "stored":true}],
   


Yet, when I display $results in the richtext_doc.vm Velocity template, 
documents only contain three fields (id, _version_, score):

SolrDocument{id=3, _version_=1495262517955395584, score=1.0}, 


How can I increase the number of doc fields?

Many thanks.

Philippe


DocumentAnalysisRequestHandler

2015-03-12 Thread phiroc
Hello,

my solr logs say:

INFO  - 2015-03-12 08:49:34.900; org.apache.solr.core.RequestHandlers; created 
/analysis/document: solr.DocumentAnalysisRequestHandler
WARN  - 2015-03-12 08:49:34.919; org.apache.solr.core.SolrResourceLoader; Solr 
loaded a deprecated plugin/analysis class [solr.admin.AdminHandlers]. Please 
consult documentation how to replace it accordingly.


Is /analysis/document deprecated in SOLR 5?




What is the modern equivalent of Luke?

Many thanks.

Philippe


Re: Creating a directory resource in solr-jetty

2015-03-12 Thread phiroc


Hi Shawn,

here is the Jetty Mailing List's reply concerning my question.

Unfortunately, this solution won't work with SOLR Jetty, because its version is 
< 9.

Philippe



--

Just ensure you don't have a /WEB-INF/ directory, and you can use this on Jetty 
9.2.9+


http://www.eclipse.org/jetty/configure_9_0.dtd";>


  /example
  /mnt/iiiparnex01_pdf/PDF/III/









- Original message -
From: "Shawn Heisey" 
To: solr-user@lucene.apache.org
Sent: Thursday 12 March 2015 13:59:49
Subject: Re: Creating a directory resource in solr-jetty

On 3/11/2015 7:38 AM, phi...@free.fr wrote:
> does anyone if it is possible to create a directory resource in the 
> solr-jetty configuration files?
> 
> In Tomcat 8, you can do the following:
> 
> 
>  
> className="org.apache.catalina.webresources.DirResourceSet"
> base="/mnt/archive_pdf/PDF/IHT"
> webAppMount="/arcpdf0"
> />

This is a question that you'd need to ask in a Jetty support venue.  I
don't know the answer, and from the lack of response, I would guess that
nobody else who has seen your question knows the answer either.  This
container config has nothing to do with Solr at all ... most people here
are only familiar with those pieces of container config that affect Solr.

http://eclipse.org/jetty/mailinglists.php

I hate to turn you away without giving you an answer ... if I knew, I
would ignore the fact that this is off topic, and give you the answer.

Thanks,
Shawn



Word frequency

2015-03-13 Thread phiroc
Hello,

is it possible to create dynamic facets with SOLR 5.0.0?

For instance, I would like to display the most-frequently occurring words in 
the left-hand side of my Velocity SOLR GUI (facet_fields.vm).

Facet_fields.vm currently looks like this:


---
#**
 *  Display facets based on field values
 *  e.g.: fields specified by &facet.field=
 *#

#if($response.facetFields)
  
##Field Facets
Results
  
  #foreach($field in $response.facetFields)
## Hide facets without value
#if($field.values.size() > 0)
  $field.name
  
#foreach($facet in $field.values)
  
$facet.name ($facet.count)
  
#end
  
#end  ## end if > 0
  #end## end for each facet field
#end  ## end if response has facet fields

--

Many thanks.

Philippe


response.results

2015-03-13 Thread phiroc
Hello,

could someone please explain how the current Velocity template examples 
provided with the 5.0.0 distribution retrieve documents from SOLR?

result_list.vm contains the following line

#foreach($doc in $response.results)

but I can't figure out where $response.results is generated.

Many thanks.

Philippe



Re: Word frequency

2015-03-13 Thread phiroc
Yes.

Except that I don't want to facet the entire text field (as it can contain 
thousands of words).

I would like to:

- loop through the documents in my core
- extract the most-frequently-appearing words in each document's text field
- generate a .vm which displays those words ranked by number of occurrences, or, 
ideally, automatically generate that .vm whenever users use SOLR.






- Original message -
From: "Erik Hatcher" 
To: solr-user@lucene.apache.org
Sent: Friday 13 March 2015 15:05:21
Subject: Re: Word frequency

Do you mean like faceting on one of your full text fields?   Something like 
/browse?facet.field=_text or one of your other fields?


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 







Re: Word frequency

2015-03-13 Thread phiroc

If you are asking whether users have access to /browse, then the answer is yes.

Currently, they can type keywords in the q input field to do searches.

I plan to turn q into a hidden field and add a 'keywords' input field whose 
contents will be transferred to q when users press Search, using JavaScript.

I will also add date selects so that users don't have to type date queries.

How do you secure the rest of SOLR (e.g., admin)?

Would you recommend creating an alternative Search GUI with, say, Wicket, 
which queries SOLR using AJAX?

Sounds hard, but I will try. Velocity is so much simpler.

Cheers,

Philippe







- Original message -
From: "Alexandre Rafalovitch" 
To: "solr-user" 
Sent: Friday 13 March 2015 15:41:45
Subject: Re: Word frequency

On 13 March 2015 at 10:25,   wrote:
> I would like to:
>
> - loop throught the documents in my core
> - extract the most-frequently-appearing words in each document's text field
> - generate a .vm  which displays those words ranked number of occurrences, 
> or, ideally, automatically generate that .vm whenever users use SOLR.

That's what faceting does. You can fine-tune it further by telling it
how many of the top hits you want to get back. Have a look at those
parameters and play with them first in the Web Admin UI before trying to
apply them to the browse handler.

Regards,
   Alex.
P.S. You are not planning to expose the /browse handler directly to users,
are you? Because unless you REALLY know how to secure the rest of Solr,
you are asking for big trouble.
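
As a rough illustration of the parameters mentioned above, here is a small sketch that builds a faceting request with a cap on the number of terms returned. The helper name facet_query is hypothetical, the host and core are borrowed from an earlier message, and the field name "text" is an assumption; facet, facet.field, facet.limit and facet.mincount are standard Solr request parameters.

```python
from urllib.parse import urlencode


def facet_query(base, field, limit=20, mincount=1):
    """Build a /select URL that asks only for the top facet terms of one field."""
    params = {
        "q": "*:*",
        "rows": 0,                   # no documents needed, only facet counts
        "wt": "json",
        "facet": "true",
        "facet.field": field,        # assumed field name, adjust to the schema
        "facet.limit": limit,        # how many top terms to return
        "facet.mincount": mincount,  # skip terms that occur fewer times than this
    }
    return base + "/select?" + urlencode(params, safe="*:")


url = facet_query("http://myserver:8990/solr/archives0", "text", limit=20)
print(url)
```

The same parameters can be tried by hand in the Admin UI query form before wiring them into the /browse handler defaults.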


Re: Word frequency

2015-03-13 Thread phiroc
Point taken, Shawn. Thanks for your input.


- Original message -
From: "Shawn Heisey" 
To: solr-user@lucene.apache.org
Sent: Friday 13 March 2015 16:12:46
Subject: Re: Word frequency


Anything that requires an end user to have direct access to Solr (which
includes both the /browse handler and AJAX) is a potential security
issue.  If that access is unfiltered, a user can completely erase your
index, or cause other problems.  Switching to a different input field
other than "q" won't be any kind of protection ... they will just have
to run their browser in debug mode and they'll be able to see the Solr
queries sent, and then they can completely bypass any javascript
protections you create.

The intent with Solr is that it will be completely firewalled from user
access and only queried by server-side programs (PHP, Java, Ruby, etc).

Securing a Solr server exposed to the public requires an intelligent
proxy server with a specific config tailored to only allowing certain
requests to work.  I had this discussion with someone else on a
javascript client for Solr, and they said they're using this for a
proxy, and that this code will protect the Solr server from malicious
activity:

https://github.com/adsabs/solr-service

I haven't looked deeper, so I don't know if that claim is valid.

Note that even with a proxy server, it is still usually possible to send
denial-of-service queries designed to keep the server too busy to handle
legitimate requests.  If the code that accesses Solr is server-side,
then you may be able to detect malicious queries created from user input
and stop them from being sent to Solr.

Thanks,
Shawn
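
To make the idea of server-side filtering a little more concrete, here is a minimal sketch of a parameter whitelist that a server-side program could apply before forwarding a user's request to Solr. Everything in it is an assumption for illustration (the allowed parameter names, the row cap, the function name sanitize); it is not a complete defence and does not replace firewalling Solr, since an expensive query can still be passed through q.

```python
from urllib.parse import parse_qsl, urlencode

ALLOWED = {"q", "start", "rows", "sort"}  # assumed whitelist, adjust as needed
MAX_ROWS = 50                             # assumed upper bound on page size


def sanitize(query_string):
    """Keep only whitelisted parameters and clamp rows before calling Solr."""
    clean = {}
    for key, value in parse_qsl(query_string):
        if key in ALLOWED:
            clean[key] = value  # a repeated parameter keeps its last value
    try:
        rows = int(clean.get("rows", 10))
    except ValueError:
        rows = 10
    clean["rows"] = str(max(0, min(rows, MAX_ROWS)))
    clean["wt"] = "json"  # the response format is fixed by the server, not the user
    return urlencode(clean)


safe = sanitize("q=kaiser&rows=100000&qt=/update&stream.body=<delete>")
print(safe)
```

Anything not on the list (handler overrides, stream parameters and so on) is simply dropped, so the browser never gets to choose which request handler is called.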



Connection pool shutdown error

2015-03-19 Thread phiroc
Hello,

I am trying to use the 4.9.1 SOLR Core API and the 1.3.2.RELEASE version of the 
Spring Data SOLR API, to connect to a SOLR server, but to no avail.

When I run the Java application, I get the following errors:

---

Exception in thread "main" 
org.springframework.data.solr.UncategorizedSolrException: Error executing 
query; nested exception is org.apache.solr.client.solrj.SolrServerException: 
Error executing query
...
Caused by: java.lang.IllegalStateException: Connection pool shut down

-

I have tried changing Core API version (4.3.0, 4.4.0, ...) but to no avail.

Any help would be much appreciated.

Cheers,

Philippe




Here's my Solr Context:



package com.myco.archives.SolrGuiMain;



@Configuration
@EnableSolrRepositories(basePackages = { "com.myco.archives" }, multicoreSupport = false)
@ComponentScan
public class SolrContext {

    private final String HTTP_SEARCHARCHIVES = "http://mysolr.com:8990/solr/collection3";

    @Bean
    public SolrServer solrServer() {
        SolrServer server = new HttpSolrServer(HTTP_SEARCHARCHIVES);
        return server;
    }

    @Bean
    public SolrOperations solrTemplate() {
        return new SolrTemplate(solrServer());
    }

}

-

Here's my Repository Class:

import org.springframework.data.repository.CrudRepository;

public interface ArchiveDocumentRepository extends CrudRepository {

List findByText(String text);

List findByYmd(Date ymd);

}




And here's my App:

import org.springframework.context.annotation.AnnotationConfigApplicationContext;

public class App
{

    private ArchiveDocumentRepository archiveDocumentRepository;

    public App() {
        setContext();
        processDocs();
    }

    public static void main(String[] args) {
        new App();
    }

    public void setContext() throws RuntimeException {
        AnnotationConfigApplicationContext context = new AnnotationConfigApplicationContext(SolrContext.class);

        if (context != null) {
            setArchiveDocumentRepository(context.getBean(ArchiveDocumentRepository.class));
        }
        context.close();
    }

    public final ArchiveDocumentRepository getArchiveDocumentRepository() {
        return archiveDocumentRepository;
    }

    public final void setArchiveDocumentRepository(ArchiveDocumentRepository archiveDocumentRepository) {
        this.archiveDocumentRepository = archiveDocumentRepository;
    }

    public void processDocs() {
        Iterable<Document> docs = getArchiveDocumentRepository().findAll();

        for (Document doc : docs) {
            System.out.println("doc count = " + doc.getYmd());
        }
    }

}


---










Creating facets based on the content field

2015-03-23 Thread phiroc
Hello,

let's say that you have indexed hundreds of PDFs using the following curl 
command:

curl -Ss -X POST 
'http://mysolr:8990/solr/core0/update/extract?extractFormat=text&wt=json&literal.url=/path/to/the/pdf.pdf'

The PDF's contents are now stored in core0's "content" field.

I wonder how you create facets based on the field's contents, if you don't know 
in advance what it contains (unless you have compiled a list of 
frequently-occurring words in the PDFs, after reading them.)

Many thanks.

Philippe




Re: Creating facets based on the content field

2015-03-23 Thread phiroc
Let's say that one pdf has the following contents:

"[thousands of characters] blablabla Churchill blablabla [thousands of text 
characters]"

... and another PDF contains:

"[thousands of characters] blablabla Gandhi [thousands of characters] Churchill 
blablabla [thousands of text characters]"

As you can see, these two PDFs contain keywords that are potential candidates 
for facets (e.g. Churchill, Gandhi, ...), but I have no
way of knowing that when adding facets to the solrconfig.xml file, unless I 
read all the PDFs (which will take me years) and compile a list of 
often-occurring words and names.

The fallback solution is therefore to guess the keywords, which are likely to 
appear in the PDFs; e.g.:

Aircraft
Armistice
Austria
Bolshevik
Britain
British
Charlie Chaplin
Clemenceau
Einstein
...


However, how can I be sure that these facets will be useful to the other 'core' 
users? For instance, let's say that one
user is more interested in Gandhi than Einstein: the "Einstein" facet is 
therefore useless to him and a "Gandhi" facet is missing from solrconfig.xml.

Is there a way to dynamically generate a list of facets based on words 
contained in the content field?

Cheers,

Philippe





- Original message -
From: "Erik Hatcher" 
To: solr-user@lucene.apache.org
Sent: Monday 23 March 2015 16:30:49
Subject: Re: Creating facets based on the content field

Philippe - can you provide a concrete example of what you mean by creating 
facets on field’s content?   Or maybe rather, what’s missing from doing 
&facet.field=content currently?

Erik







Re: Creating facets based on the content field

2015-03-23 Thread phiroc
I reindexed the PDFs without specifying facets and they "magically" appeared in 
facets.vm!

Many thanks!



- Original message -
From: "Alexandre Rafalovitch" 
To: "solr-user" 
Sent: Monday 23 March 2015 17:23:40
Subject: Re: Creating facets based on the content field

I think you are over-complicating this before actually trying it. If
you index your texts and tokenize them to have individual words then
"facet.field=content" will actually give you the list of words sorted
by their occurrence count. That's what facet will do.
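
To show what comes back, here is a small sketch that turns the facet section of a JSON response into (word, count) pairs. The sample response is made up for illustration, and the function name top_terms is hypothetical; the flat [term, count, term, count, ...] layout is how Solr writes facet fields in JSON by default.

```python
import json


def top_terms(response_text, field):
    """Return [(term, count), ...] for one facet field of a Solr JSON response."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    # Default JSON layout is a flat list: [term1, count1, term2, count2, ...]
    return list(zip(flat[0::2], flat[1::2]))


# Made-up sample, shaped like the relevant part of a real response.
sample = '{"facet_counts":{"facet_fields":{"content":["churchill",2,"gandhi",1]}}}'
pairs = top_terms(sample, "content")
print(pairs)
```

With facet.field=content the terms arrive already sorted by count, so no extra sorting is needed on the client side.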

A bigger problem is - from your example - that I still don't see how
exactly that will be good for your users. But perhaps seeing the
actual results will help with that too.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/




Re: Creating facets based on the content field

2015-03-23 Thread phiroc
I just want a list of recurring words (for now.)

I removed the manually-created facets from solrconfig.xml and SOLR 
"automagically" created a facet list for me.

But thanks for your suggestions.



- Original message -
From: "Charlie Hull" 
To: solr-user@lucene.apache.org
Sent: Monday 23 March 2015 17:26:18
Subject: Re: Creating facets based on the content field

On 23/03/2015 16:08, phi...@free.fr wrote:
> Let's say that one pdf has the following contents:

Aren't you thinking of Named Entity Recognition? We've used Stanford NLP 
for this in the past and it's quite good at People, Places and 
Organisations out of the box (needs tuning for other classes of 
entities). You can then add these entities as metadata to your document 
objects and index them so you can facet on them appropriately.

Cheers

Charlie


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


_text

2015-03-24 Thread phiroc

Hello,

my SOLR 5 Admin Panel displays the following error:

23/03/2015 15:05:05 ERROR   SolrCore
org.apache.solr.common.SolrException: undefined field: "_text"

How should _text be defined in schema.xml?

Many thanks.

Philippe


Re: _text

2015-03-24 Thread phiroc
Hi Zheng,

I copied the SOLR 5 schema.xml file on Github (?), which contains the following 
line:







- Original message -
From: "Zheng Lin Edwin Yeo" 
To: solr-user@lucene.apache.org
Sent: Tuesday 24 March 2015 10:59:49
Subject: Re: _text

Hi Philippe,

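Are you using the default schemaFactory, in which case your setting in
solrconfig.xml is <schemaFactory class="ManagedIndexSchemaFactory">, or
have you defined your own schema.xml, in which case your setting in
solrconfig.xml should be <schemaFactory class="ClassicIndexSchemaFactory"/>?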


Regards,
Edwin




Tweaking SOLR memory and cull facet words

2015-03-27 Thread phiroc
Hi,

my SOLR 5 solrconfig.xml file contains the following lines:


   on
text
 100


where the 'text' field contains thousands of words.

When I start SOLR, the search engine takes several minutes to index the words 
in the 'text' field (although loading the browse template later only takes a 
few seconds because the 'text' field has already been indexed).

Here are my questions:

- should I increase SOLR's JVM memory to make initial indexing faster?

e.g., SOLR_JAVA_MEM="-Xms1024m -Xmx204800m" in solr.in.sh

- how can I cull facet words according to certain criteria (length, case, 
etc.)? For instance, my facets are the following:

application (22427)
inytapdf0 (22427)
pdf (22427)
the (22334)
new (22131)
herald (21983)
york (21975)
paris (21780)
a (21692)
and (21298)
of (21288)
i (21247)
in (21062)
to (20918)
on (20899)
m (20857)
by (20733)
de (20664)
for (20580)
at (20417)
with (20371) 
...

Obviously, words such as "the", "i", "to","m", etc. should not be indexed. 
Furthermore, I don't care about "nouns". I am only interested in people and 
location names.


Many thanks.

Philippe







Re: Tweaking SOLR memory and cull facet words

2015-03-27 Thread phiroc
Hi Shawn,

> You must send indexing requests to Solr,

Are you referring to posting  queries to SOLR, or to something 
else?

> If you can set up multiple threads or processes...

How do you do that?

> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory

Can you update the stopwords.txt file, and then re-index the documents?

How?

Many thanks.

Philippe






- Original message -
From: "Shawn Heisey" 
To: solr-user@lucene.apache.org
Sent: Friday 27 March 2015 14:38:20
Subject: Re: Tweaking SOLR memory and cull facet words

Starting Solr does not index anything, unless you are talking about one
of the sidecar indexes for spelling correction or suggestions.  You must
send indexing requests to Solr, and if you are experiencing slow
indexing, chances are that it's because of slowness in obtaining data
from the source, not Solr ... or that you are indexing with a single
thread.  If you can set up multiple threads or processes that are
indexing in parallel, it should go faster.
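
A minimal sketch of indexing from several threads, assuming documents are posted as a JSON array to a core's /update handler. The URL, core name, batch size and helper names are assumptions for illustration. The sender is passed in as a parameter so the batching logic can be tried without a running server.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed update URL; host, port and core name are placeholders.
UPDATE_URL = "http://localhost:8983/solr/mycore/update"

def batches(docs, size):
    """Split a list of documents into consecutive batches of at most `size`."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def post_batch(batch, url=UPDATE_URL):
    """POST one batch of documents as a JSON array; returns the HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def index_in_parallel(docs, send=post_batch, workers=4, batch_size=100):
    """Send batches concurrently from `workers` threads; one result per batch."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches(docs, batch_size)))
```

Calling `index_in_parallel(docs)` would post the batches with four threads. A commit still has to be issued afterwards for the documents to become searchable.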

Thousands of terms are not hard for Solr to handle at all.  When the
number of terms gets into the millions or billions, then it starts
becoming a hard problem.

If you use the stopword filter on the index analysis chain for the field
that you are using for facets, then all the stopwords will be removed
from the facets.  That would change how searches work on the field, so
you will probably want to use copyField to create a new field that you
use for faceting.  There are other filters that can do things you have
mentioned, like LengthFilterFactory:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
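
To illustrate the effect such filters have on facet values, here is a plain-Python imitation of a lowercase step, a stopword filter and a length filter applied before counting terms. It is not Solr code: the stopword list, the length bounds and the function names are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical stopword list; Solr would read its own from stopwords.txt.
STOPWORDS = {"the", "a", "and", "of", "i", "in", "to", "on",
             "by", "for", "at", "with"}

def analyze(text, min_len=3, max_len=30):
    """Lowercase, split on whitespace, drop stopwords and out-of-range lengths."""
    tokens = text.lower().split()
    return [t for t in tokens
            if t not in STOPWORDS and min_len <= len(t) <= max_len]

def facet_counts(docs):
    """Count, for each surviving term, how many documents contain it."""
    counts = Counter()
    for text in docs:
        counts.update(set(analyze(text)))
    return counts

docs = [
    "The New York Herald in Paris",
    "A letter to the Herald by M de Gaulle",
]
print(sorted(facet_counts(docs).items()))
```

With these settings "the", "a", "to" and "by" are removed as stopwords, while "m" and "de" are removed for being shorter than three characters. Telling person and place names apart from ordinary nouns is a different problem that neither filter addresses.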

As far as java heap sizing, trial and error is about the only way to
find the right size.

http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
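
For a sense of scale, a small sketch that converts the JVM heap flags from the SOLR_JAVA_MEM example in the question into GiB. The parsing helper is hypothetical and only handles the k/m/g suffixes.

```python
import re

# Bytes per unit suffix accepted by -Xms / -Xmx.
UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def heap_flags_gib(java_mem):
    """Map each -Xms/-Xmx flag in the string to its size in GiB."""
    sizes = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kmgKMG])", java_mem):
        sizes[flag] = int(number) * UNITS[unit.lower()] / 1024 ** 3
    return sizes

print(heap_flags_gib("-Xms1024m -Xmx204800m"))  # → {'Xms': 1.0, 'Xmx': 200.0}
```

The example value therefore asks for a 200 GiB maximum heap, which is worth checking against the machine's physical memory before trying it.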

Thanks,
Shawn



Copying a collection from one version of SOLR to another

2014-08-25 Thread phiroc

Hello,

is it possible to copy a collection created with SOLR 4.6.0 to a SOLR 4.9.0 
server?

I have just copied a collection called 'collection3', located in 
solr4.6.0/example/solr,  to solr4.9.0/example/solr, but to no avail, because my 
SOLR 4.9.0 Server's admin does not list it among the available cores.

What am I doing wrong?

Many thanks.

Philippe
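
One thing worth checking, assuming the 4.9.0 server uses core discovery: a core directory is only picked up if it contains a core.properties file. The sketch below lists which immediate subdirectories of a Solr home have one. The function name is hypothetical, the path is the one from the message, and nested directories are not examined.

```python
import os

def discoverable_cores(solr_home):
    """Split immediate subdirectories of solr_home by presence of core.properties."""
    found, missing = [], []
    for name in sorted(os.listdir(solr_home)):
        path = os.path.join(solr_home, name)
        if not os.path.isdir(path):
            continue  # skip plain files such as solr.xml
        if os.path.isfile(os.path.join(path, "core.properties")):
            found.append(name)
        else:
            missing.append(name)
    return found, missing

if __name__ == "__main__":
    home = "solr4.9.0/example/solr"  # path taken from the message
    if os.path.isdir(home):
        found, missing = discoverable_cores(home)
        print("discoverable:", found)
        print("no core.properties:", missing)
```

If 'collection3' shows up in the second list, that would be consistent with the admin page not listing it.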



Problem deploying solr-4.10.0.war in Tomcat

2014-09-17 Thread phiroc


Hello,

I've dropped solr-4.10.0.war in Tomcat 7's webapp directory.

When I start the Java web server, the following message appears in catalina.out:

---

INFO: Starting Servlet Engine: Apache Tomcat/7.0.55
Sep 17, 2014 11:35:59 AM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive 
/archives/apache-tomcat-7.0.55_solr_8983/webapps/solr-4.10.0.war
Sep 17, 2014 11:35:59 AM org.apache.catalina.core.StandardContext startInternal
SEVERE: Error filterStart
Sep 17, 2014 11:35:59 AM org.apache.catalina.core.StandardContext startInternal
SEVERE: Context [/solr-4.10.0] startup failed due to previous errors

--

Any help would be much appreciated.

Cheers,

Philippe





Updating an index

2014-11-06 Thread phiroc
Hello,

I have [mistakenly] created a SOLR index in which the document IDs contain URIs 
such as file:///Z:/1933/01/1933_01.png .

In a single SOLR update command, how can I:

- copy the contents of each document's id field to a new field called 'url', 
after replacing 'Z:' by 'Y:'

- make SOLR generate a new random Id for each document

Many thanks.

Philippe
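
A minimal sketch of the per-document transformation, done on the client side before re-indexing. It assumes the documents can be read back from the index (all needed fields stored) and posted again. The function name is hypothetical, and whether Solr can do the same in a single update command is not addressed here.

```python
import uuid

def rewrite_doc(doc, old_drive="Z:", new_drive="Y:"):
    """Return a copy of doc with the old id moved to 'url' and a fresh random id."""
    new_doc = dict(doc)
    # Drop the old version so it is not applied to the new document.
    new_doc.pop("_version_", None)
    # Copy the old id into 'url', replacing the first drive letter only.
    new_doc["url"] = doc["id"].replace(old_drive, new_drive, 1)
    # Random (version 4) UUID as the new document id.
    new_doc["id"] = str(uuid.uuid4())
    return new_doc

old = {"id": "file:///Z:/1933/01/1933_01.png", "_version_": 1495262519674011648}
new = rewrite_doc(old)
print(new["url"])  # → file:///Y:/1933/01/1933_01.png
```

Because the id changes, re-indexing creates new documents rather than replacing the old ones, so the originals would still have to be deleted by their old ids.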




Core deletion

2015-01-14 Thread phiroc


Hello,

I am running SOLR 4.10.0 on Tomcat 8.

The solr.xml file in .../apache-tomcat-8.0.15_solr_8983/conf/Catalina/localhost 
looks like this:







My SOLR instance contains four cores, including one whose instanceDir and 
dataDir have the following values:


instanceDir:/archives/solr/example/solr/indexapdf0/
dataDir:/archives/indexpdf0/data/

Strangely enough, every time I restart Tomcat, this core's data, [and only this 
core's data,] get deleted, which is pretty annoying.

How can I prevent it?

Many thanks.

Philippe














Re: Core deletion

2015-01-15 Thread phiroc
I duplicated an existing core, deleted the data directory and core.properties, 
updated solrconfig.xml and schema.xml and loaded the new core in SOLR's Admin 
Panel.

The logs contain a few 'index locked' errors:


solr.log:INFO  - 2015-01-15 14:43:09.492; 
org.apache.solr.core.CorePropertiesLocator; Found core inytapdf0 in 
/archives/solr/example/solr/inytapdf0/
solr.log:INFO  - 2015-01-15 14:49:17.685; 
org.apache.solr.core.CorePropertiesLocator; Found core inytapdf0 in 
/archives/solr/example/solr/inytapdf0/
solr.log.1:INFO  - 2015-01-05 18:08:13.253; 
org.apache.solr.core.CorePropertiesLocator; Found core inytapdf0 in 
/archives/solr/example/solr/inytapdf0/
solr.log.1:ERROR - 2015-01-05 18:08:17.467; org.apache.solr.core.CoreContainer; 
Error creating core [inytapdf0]: Index locked for write for core inytapdf0
solr.log.1:org.apache.solr.common.SolrException: Index locked for write for 
core inytapdf0
solr.log.1:Caused by: org.apache.lucene.store.LockObtainFailedException: Index 
locked for write for core inytapdf0
solr.log.1:INFO  - 2015-01-06 09:19:32.125; 
org.apache.solr.core.CorePropertiesLocator; Found core inytapdf0 in 
/archives/solr/example/solr/inytapdf0/
solr.log.1:ERROR - 2015-01-06 09:19:35.305; org.apache.solr.core.CoreContainer; 
Error creating core [inytapdf0]: Index locked for write for core inytapdf0
solr.log.1:org.apache.solr.common.SolrException: Index locked for write for 
core inytapdf0
solr.log.1:Caused by: org.apache.lucene.store.LockObtainFailedException: Index 
locked for write for core inytapdf0


Philippe
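
The lock errors above can be pulled out of the logs with a short script. A sketch, assuming the line layout shown in this message (level, timestamp, logger, message) once wrapped lines have been re-joined. The regular expression and function name are assumptions.

```python
import re

# Matches lines such as:
# ERROR - 2015-01-05 18:08:17.467; ...CoreContainer; Error creating core [x]: Index locked for write ...
LOCK_ERROR = re.compile(
    r"ERROR\s+-\s+(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+;"
    r".*Error creating core \[(?P<core>[^\]]+)\]: Index locked for write"
)

def lock_errors(lines):
    """Return (timestamp, core) pairs for every 'Index locked for write' error."""
    hits = []
    for line in lines:
        m = LOCK_ERROR.search(line)
        if m:
            hits.append((m.group("when"), m.group("core")))
    return hits

sample = [
    "INFO  - 2015-01-15 14:43:09.492; org.apache.solr.core.CorePropertiesLocator; "
    "Found core inytapdf0 in /archives/solr/example/solr/inytapdf0/",
    "ERROR - 2015-01-05 18:08:17.467; org.apache.solr.core.CoreContainer; "
    "Error creating core [inytapdf0]: Index locked for write for core inytapdf0",
]
print(lock_errors(sample))  # → [('2015-01-05 18:08:17', 'inytapdf0')]
```

In the excerpt above, both lock errors are dated 5 and 6 January, before the startup entries of 15 January.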




- Original message -
From: "Dominique Bejean" 
To: solr-user@lucene.apache.org
Sent: Thursday 15 January 2015 11:46:43
Subject: Re: Core deletion

Hi,

Is there anything in the Solr logs at startup that could explain the deletion?

How were the cores created? Using the cores API?

Dominique
http://www.eolya.fr


2015-01-14 17:43 GMT+01:00 :

>
>
> Hello,
>
> I am running SOLR 4.10.0 on Tomcat 8.
>
> The solr.xml file in
> .../apache-tomcat-8.0.15_solr_8983/conf/Catalina/localhost looks like this:
>
>
> 
>  crossContext="true">
>  value="/archives/solr/example/solr" override="true"/>
> 
>
> My SOLR instance contains four cores, including one whose instanceDir and
> dataDir have the following values:
>
>
> instanceDir:/archives/solr/example/solr/indexapdf0/
> dataDir:/archives/indexpdf0/data/
>
> Strangely enough, every time I restart Tomcat, this core's data, [and only
> this core's data,] get deleted, which is pretty annoying.
>
> How can I prevent it?
>
> Many thanks.
>
> Philippe