Re: how to debug dataimporthandler

2011-03-02 Thread Stefan Matheis
Hey,

Normally, if I have problems with DIH:

* I start by having a look at the MySQL query log, to check which queries
are executed.
* I re-run the query myself and verify the returned data.
* I activate http://wiki.apache.org/solr/DataImportHandler#LogTransformer
to log the important data, and check the console output.
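
For example, a minimal LogTransformer hook might look like this (a hedged
sketch; the entity, query, and field names are hypothetical, while
logTemplate and logLevel are the attributes described on that wiki page):

<entity name="item" query="select id, last_modified from item"
        transformer="LogTransformer"
        logTemplate="importing id=${item.id} ts=${item.last_modified}"
        logLevel="info">
    ...
</entity>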

So far, this has been enough debugging for all my problems :)

Regards
Stefan

On Wed, Mar 2, 2011 at 7:28 AM, cyang2010  wrote:
> I wonder how to run DataImportHandler in debug mode.  Currently I can't get
> data correctly into the index through DataImportHandler, especially a timestamp
> column into a Solr date field.  I want to debug the process.
>
> According to this wiki page:
>
> Commands
> The handler exposes all its API as HTTP requests. The following are the
> possible operations:
> •full-import : a full import operation can be started by hitting the URL
> http://<host>:<port>/solr/dataimport?command=full-import
> ...
> ■clean : (default 'true'). Tells whether to clean up the index before the
> indexing is started
> ■commit: (default 'true'). Tells whether to commit after the operation
> ■optimize: (default 'true'). Tells whether to optimize after the operation
> ■debug : (default false). Runs in debug mode. It is used by the interactive
> development mode (see here)
> ■Please note that in debug mode, documents are never committed
> automatically. If you want to run debug mode and commit the results too, add
> 'commit=true' as a request parameter.
>
>
> Therefore, I run
>
> http://<host>:<port>/solr/dataimport?command=full-import&debug=true
>
> Not only did I not see any log output at the "DEBUG" level, it also crashed my
> machine a few times.   I was surprised it could even do that ...
>
> Has anyone tried to debug the process before?  What is your experience
> with it?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-to-debug-dataimporthandler-tp2611506p2611506.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: indexing mysql dateTime/timestamp into solr date field

2011-03-02 Thread cyang2010
It turns out you don't need to use DateFormatTransformer at all.  The reason
the timestamp MySQL column failed to be inserted into Solr is that in
schema.xml I mistakenly set "indexed=false, stored=false".  Of course that
keeps it out of the index entirely.  No wonder the schema browser always
showed no terms for that field.

DataImportHandler just takes care of using JDBC to read the timestamp/datetime
column, and formats it into the Solr format out of the box.  There is no need
to use any transformer on top of it.  I am guessing that DateFormatTransformer
is only needed for a string value (some value from an XML source, rather than a
database column value), to convert it to the Solr date type.
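
For reference, a sketch of the kind of schema.xml declaration that works here
(the field name is hypothetical; "date" is the date field type from the
example schema):

<field name="created_at" type="date" indexed="true" stored="true"/>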

--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-mysql-dateTime-timestamp-into-solr-date-field-tp2608327p2611865.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Rosa (Anuncios)

Nice job!

It would be good to be able to extract specific data from a given page 
via XPATH though.


Regards,


On 02/03/2011 01:25, Dominique Bejean wrote:

Hi,

I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web 
Crawler. It includes :


   * a crawler
   * a document processing pipeline
   * a solr indexer

The crawler has a web administration interface for managing the web sites 
to be crawled. Each web site crawl is configured with a lot of possible 
parameters (not all mandatory):


   * number of simultaneous items crawled by site
   * recrawl period rules based on item type (html, PDF, …)
   * item type inclusion / exclusion rules
   * item path inclusion / exclusion / strategy rules
   * max depth
   * web site authentication
   * language
   * country
   * tags
   * collections
   * ...

The pipeline includes various ready-to-use stages (text extraction, 
language detection, a Solr-ready XML writer, ...).


All is very configurable and extensible, either by scripting or Java 
coding.


With scripting technology, you can help the crawler handle JavaScript 
links or help the pipeline extract relevant titles and clean up the 
HTML pages (removing menus, headers, footers, ...).


With Java coding, you can develop your own pipeline stages.

The Crawl Anywhere web site provides good explanations and 
screenshots. Everything is documented in a wiki.


The current version is 1.1.4. You can download and try it out from 
here : www.crawl-anywhere.com



Regards

Dominique






Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Lukáš Vlček
Hi,

is there any plan to open source it?

Regards,
Lukas

[OT] I tried HuriSearch, typed "Java" into the search field, and it returned a lot
of references to ColdFusion error pages. Maybe a recrawl would help?

On Wed, Mar 2, 2011 at 1:25 AM, Dominique Bejean
wrote:

> Hi,
>
> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web
> Crawler. It includes :
>
>   * a crawler
>   * a document processing pipeline
>   * a solr indexer
>
> The crawler has a web administration interface for managing the web sites to
> be crawled. Each web site crawl is configured with a lot of possible
> parameters (not all mandatory):
>
>   * number of simultaneous items crawled by site
>   * recrawl period rules based on item type (html, PDF, …)
>   * item type inclusion / exclusion rules
>   * item path inclusion / exclusion / strategy rules
>   * max depth
>   * web site authentication
>   * language
>   * country
>   * tags
>   * collections
>   * ...
>
> The pipeline includes various ready-to-use stages (text extraction,
> language detection, a Solr-ready XML writer, ...).
>
> All is very configurable and extensible, either by scripting or Java coding.
>
> With scripting technology, you can help the crawler handle JavaScript
> links or help the pipeline extract relevant titles and clean up the HTML
> pages (removing menus, headers, footers, ...).
>
> With Java coding, you can develop your own pipeline stages.
>
> The Crawl Anywhere web site provides good explanations and screenshots.
> Everything is documented in a wiki.
>
> The current version is 1.1.4. You can download and try it out from here :
> www.crawl-anywhere.com
>
>
> Regards
>
> Dominique
>
>


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread findbestopensource
Hello Dominique Bejean,

Good job.

We identified almost 8 open source web crawlers
(http://www.findbestopensource.com/tagged/webcrawler); I don't know how
yours differs from the rest.

Your license states that it is not open source but that it is free for
personal use.

Regards
Aditya
www.findbestopensource.com


On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
wrote:

> Hi,
>
> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web
> Crawler. It includes :
>
>   * a crawler
>   * a document processing pipeline
>   * a solr indexer
>
> The crawler has a web administration interface for managing the web sites to
> be crawled. Each web site crawl is configured with a lot of possible
> parameters (not all mandatory):
>
>   * number of simultaneous items crawled by site
>   * recrawl period rules based on item type (html, PDF, …)
>   * item type inclusion / exclusion rules
>   * item path inclusion / exclusion / strategy rules
>   * max depth
>   * web site authentication
>   * language
>   * country
>   * tags
>   * collections
>   * ...
>
> The pipeline includes various ready-to-use stages (text extraction,
> language detection, a Solr-ready XML writer, ...).
>
> All is very configurable and extensible, either by scripting or Java coding.
>
> With scripting technology, you can help the crawler handle JavaScript
> links or help the pipeline extract relevant titles and clean up the HTML
> pages (removing menus, headers, footers, ...).
>
> With Java coding, you can develop your own pipeline stages.
>
> The Crawl Anywhere web site provides good explanations and screenshots.
> Everything is documented in a wiki.
>
> The current version is 1.1.4. You can download and try it out from here :
> www.crawl-anywhere.com
>
>
> Regards
>
> Dominique
>
>


design help

2011-03-02 Thread Ken Foskey


I have read the Solr book, and the other book is on its way for me to read. 
I need some help in the meantime.


a)  Using the example Solr system, how do I send a Word document into the 
system using curl?  I want to have the ID be the full path of the document. 
I have tried various commands but I get stream errors; the documents are 
on one server and Solr is on a second server.  I would like to experiment at 
home indexing all the documents on the server, giving me experience outside 
work.
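
A sketch of one way to do (a) with the example Solr's ExtractingRequestHandler
(Solr Cell), assuming it is enabled at /update/extract as in the example
solrconfig.xml; the host and file path below are placeholders, and literal.id
supplies the full path as the document ID:

curl "http://solrserver:8983/solr/update/extract?literal.id=/data/docs/report.doc&commit=true" \
     -F "myfile=@/data/docs/report.doc"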


b)  I am trying to grasp a few issues with my design.  I have read the books 
but I am still struggling with some of the ideas.


-  I am using a .NET connector; which one is recommended?

-  I have an aspx webpage that connects to Solr, adds some extra 
information to the Solr query (e.g. the client id, see later) and returns the 
data.  Is this a normal design?


I am using a shopping trolley design.  Orders are created and then comments 
are added as the order is processed.  I want to search all the order 
information.  (This is one of many indexing requirements; later there will 
be a 'site search' facility.)


I want to index many database tables.  I am proposing to use one Solr 
instance that indexes all the data.  I will use the example text field and 
copyField all the extra fields I have for the tables.  The Solr book gave 
short shrift to the database indexing tool; is there a good tutorial on 
using Solr with SQL Server?  I have many tables and was also thinking I 
might have to join a lot; it might be easier to create an XML output and 
send that.  What are the pros and cons?


I am thinking that I will search client = 578 AND (text = keywords). 
This way I am restricting my searches to the client who is signed on.  Is this 
a good idea?  (It is secured behind my application, so I am adding the extras.)


I was also thinking that I would index the order notes separately from the 
order.  The other alternative is to multi-value a note field in the order 
and keep deleting it.  Which is best?  One way I have to keep replacing the 
whole order; the other I have to somehow 'join' up the notes in the 
search.  (A more-like-this feature, I suppose.)


When you come out of the Solr index, is it normal to map the ID to a URL, 
for example id="product:1234" => href="product.aspx?productId=1234", 
in code, using the returned data from Solr?


Hope you can help me,

Ta
Ken





Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

David,

The UI was not the only reason that made me choose to write a totally new 
crawler. After eliminating candidate crawlers for various reasons 
(inactive project, ...), Nutch and Heritrix were the 2 crawlers on my 
short list of possible candidates.


In my mind, the crawler and the pipeline have to be totally 
disconnected from the target repository (Solr, ...). This ruled out 
Nutch.
In the end, I found Heritrix too far from the solution's architecture I 
had imagined.


Dominique


On 02/03/11 05:41, David Smiley (@MITRE.org) wrote:

Dominique,
The obvious number one question is of course why you re-invented this wheel
when there are several existing crawlers to choose from.  Your website says
the reason is that the UIs on existing crawlers (e.g. Nutch, Heritrix, ...)
weren't sufficiently user-friendly or lacked the site-specific configuration
you wanted.  Well, if that is the case, why didn't you add/enhance such
capabilities for an existing crawler?

~ David Smiley

-
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Rosa,

In the pipeline, there is a stage that extracts the text from the 
original document (PDF, HTML, ...).
It is possible to plug in scripts (Java 6 compliant) in order to keep only 
the relevant parts of the document.
See 
http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage


Dominique

On 02/03/11 09:36, Rosa (Anuncios) wrote:

Nice job!

It would be good to be able to extract specific data from a given page 
via XPATH though.


Regards,


On 02/03/2011 01:25, Dominique Bejean wrote:

Hi,

I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web 
Crawler. It includes :


   * a crawler
   * a document processing pipeline
   * a solr indexer

The crawler has a web administration interface for managing the web sites 
to be crawled. Each web site crawl is configured with a lot of possible 
parameters (not all mandatory):


   * number of simultaneous items crawled by site
   * recrawl period rules based on item type (html, PDF, …)
   * item type inclusion / exclusion rules
   * item path inclusion / exclusion / strategy rules
   * max depth
   * web site authentication
   * language
   * country
   * tags
   * collections
   * ...

The pipeline includes various ready-to-use stages (text extraction, 
language detection, a Solr-ready XML writer, ...).


All is very configurable and extensible, either by scripting or Java 
coding.


With scripting technology, you can help the crawler handle JavaScript 
links or help the pipeline extract relevant titles and clean up the 
HTML pages (removing menus, headers, footers, ...).


With Java coding, you can develop your own pipeline stages.

The Crawl Anywhere web site provides good explanations and 
screenshots. Everything is documented in a wiki.


The current version is 1.1.4. You can download and try it out from 
here : www.crawl-anywhere.com



Regards

Dominique







Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Geert-Jan Brits
Hi Dominique,

This looks nice.
In the past, I've been interested in (semi-)automatically inducing a
scheme/wrapper from a set of example webpages (often called 'wrapper
induction' in the scientific field).
This would allow for fast scheme creation, which could be used as a basis for
extraction.

Lately I've been looking for crawlers that incorporate this technology but
without success.
Any plans on incorporating this?

Cheers,
Geert-Jan

2011/3/2 Dominique Bejean 

> Rosa,
>
> In the pipeline, there is a stage that extracts the text from the original
> document (PDF, HTML, ...).
> It is possible to plug in scripts (Java 6 compliant) in order to keep only
> relevant parts of the document.
> See
> http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
>
> Dominique
>
> On 02/03/11 09:36, Rosa (Anuncios) wrote:
>
>  Nice job!
>>
>> It would be good to be able to extract specific data from a given page via
>> XPATH though.
>>
>> Regards,
>>
>>
>> On 02/03/2011 01:25, Dominique Bejean wrote:
>>
>>> Hi,
>>>
>>> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web
>>> Crawler. It includes :
>>>
>>>   * a crawler
>>>   * a document processing pipeline
>>>   * a solr indexer
>>>
>>> The crawler has a web administration interface for managing the web sites to
>>> be crawled. Each web site crawl is configured with a lot of possible
>>> parameters (not all mandatory):
>>>
>>>   * number of simultaneous items crawled by site
>>>   * recrawl period rules based on item type (html, PDF, …)
>>>   * item type inclusion / exclusion rules
>>>   * item path inclusion / exclusion / strategy rules
>>>   * max depth
>>>   * web site authentication
>>>   * language
>>>   * country
>>>   * tags
>>>   * collections
>>>   * ...
>>>
>>> The pipeline includes various ready-to-use stages (text extraction,
>>> language detection, a Solr-ready XML writer, ...).
>>>
>>> All is very configurable and extensible, either by scripting or Java
>>> coding.
>>>
>>> With scripting technology, you can help the crawler handle JavaScript
>>> links or help the pipeline extract relevant titles and clean up the HTML
>>> pages (removing menus, headers, footers, ...).
>>>
>>> With Java coding, you can develop your own pipeline stages.
>>>
>>> The Crawl Anywhere web site provides good explanations and screenshots.
>>> Everything is documented in a wiki.
>>>
>>> The current version is 1.1.4. You can download and try it out from here :
>>> www.crawl-anywhere.com
>>>
>>>
>>> Regards
>>>
>>> Dominique
>>>
>>>
>>>
>>
>>


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Lukas,

I am thinking about it but no decision yet.

Anyway, in the next release, I will provide the source code of pipeline stages 
and connectors as samples.


Dominique

On 02/03/11 10:01, Lukáš Vlček wrote:

Hi,

is there any plan to open source it?

Regards,
Lukas

[OT] I tried HuriSearch, typed "Java" into the search field, and it returned a 
lot of references to ColdFusion error pages. Maybe a recrawl would help?


On Wed, Mar 2, 2011 at 1:25 AM, Dominique Bejean 
mailto:dominique.bej...@eolya.fr>> wrote:


Hi,

I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
Web Crawler. It includes :

  * a crawler
  * a document processing pipeline
  * a solr indexer

The crawler has a web administration interface for managing the web
sites to be crawled. Each web site crawl is configured with a lot of
possible parameters (not all mandatory):

  * number of simultaneous items crawled by site
  * recrawl period rules based on item type (html, PDF, …)
  * item type inclusion / exclusion rules
  * item path inclusion / exclusion / strategy rules
  * max depth
  * web site authentication
  * language
  * country
  * tags
  * collections
  * ...

The pipeline includes various ready-to-use stages (text
extraction, language detection, a Solr-ready XML writer, ...).

All is very configurable and extensible, either by scripting or
Java coding.

With scripting technology, you can help the crawler handle
JavaScript links or help the pipeline to extract relevant titles
and clean up the HTML pages (removing menus, headers, footers, ...).

With Java coding, you can develop your own pipeline stages.

The Crawl Anywhere web site provides good explanations and
screenshots. Everything is documented in a wiki.

The current version is 1.1.4. You can download and try it out from
here : www.crawl-anywhere.com 


Regards

Dominique




Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Aditya,

The crawler is not open source and won't be in the near future. Anyway, 
I have to change the license text, because the crawler can in fact be used 
for any personal or commercial project.


Sincerely,

Dominique

On 02/03/11 10:02, findbestopensource wrote:

Hello Dominique Bejean,

Good job.

We identified almost 8 open source web crawlers 
(http://www.findbestopensource.com/tagged/webcrawler); I don't know how 
yours differs from the rest.


Your license states that it is not open source but that it is free for 
personal use.


Regards
Aditya
www.findbestopensource.com 


On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean 
mailto:dominique.bej...@eolya.fr>> wrote:


Hi,

I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
Web Crawler. It includes :

  * a crawler
  * a document processing pipeline
  * a solr indexer

The crawler has a web administration interface for managing the web
sites to be crawled. Each web site crawl is configured with a lot of
possible parameters (not all mandatory):

  * number of simultaneous items crawled by site
  * recrawl period rules based on item type (html, PDF, …)
  * item type inclusion / exclusion rules
  * item path inclusion / exclusion / strategy rules
  * max depth
  * web site authentication
  * language
  * country
  * tags
  * collections
  * ...

The pipeline includes various ready-to-use stages (text
extraction, language detection, a Solr-ready XML writer, ...).

All is very configurable and extensible, either by scripting or
Java coding.

With scripting technology, you can help the crawler handle
JavaScript links or help the pipeline to extract relevant titles
and clean up the HTML pages (removing menus, headers, footers, ...).

With Java coding, you can develop your own pipeline stages.

The Crawl Anywhere web site provides good explanations and
screenshots. Everything is documented in a wiki.

The current version is 1.1.4. You can download and try it out from
here : www.crawl-anywhere.com 


Regards

Dominique




Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Hi,

The crawler comes with an extensible document processing pipeline. If you 
know Java libraries or web services for 'wrapper induction' processing, 
it is possible to implement a dedicated stage in the pipeline.


Dominique

On 02/03/11 12:20, Geert-Jan Brits wrote:

Hi Dominique,

This looks nice.
In the past, I've been interested in (semi-)automatically inducing a 
scheme/wrapper from a set of example webpages (often called 'wrapper 
induction' in the scientific field).
This would allow for fast scheme creation, which could be used as a 
basis for extraction.


Lately I've been looking for crawlers that incorporate this technology 
but without success.

Any plans on incorporating this?

Cheers,
Geert-Jan

2011/3/2 Dominique Bejean >


Rosa,

In the pipeline, there is a stage that extracts the text from the
original document (PDF, HTML, ...).
It is possible to plug in scripts (Java 6 compliant) in order to keep
only the relevant parts of the document.
See
http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage

Dominique

On 02/03/11 09:36, Rosa (Anuncios) wrote:

Nice job!

It would be good to be able to extract specific data from a
given page via XPATH though.

Regards,


On 02/03/2011 01:25, Dominique Bejean wrote:

Hi,

I would like to announce Crawl Anywhere. Crawl-Anywhere is
a Java Web Crawler. It includes :

  * a crawler
  * a document processing pipeline
  * a solr indexer

The crawler has a web administration interface for managing
the web sites to be crawled. Each web site crawl is configured
with a lot of possible parameters (not all mandatory):

  * number of simultaneous items crawled by site
  * recrawl period rules based on item type (html, PDF, …)
  * item type inclusion / exclusion rules
  * item path inclusion / exclusion / strategy rules
  * max depth
  * web site authentication
  * language
  * country
  * tags
  * collections
  * ...

The pipeline includes various ready-to-use stages (text
extraction, language detection, a Solr-ready XML
writer, ...).

All is very configurable and extensible, either by
scripting or Java coding.

With scripting technology, you can help the crawler
handle JavaScript links or help the pipeline to extract
relevant titles and clean up the HTML pages (removing
menus, headers, footers, ...).

With Java coding, you can develop your own pipeline
stages.

The Crawl Anywhere web site provides good explanations and
screenshots. Everything is documented in a wiki.

The current version is 1.1.4. You can download and try it
out from here : www.crawl-anywhere.com



Regards

Dominique







Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Paul Libbrecht
Viewing the indexing result, which is a part of what you are describing I 
think, is a nice job for such an indexing framework.

Do you guys know whether such a feature is already out there?

paul


On 2 March 2011 at 12:20, Geert-Jan Brits wrote:

> Hi Dominique,
> 
> This looks nice.
> In the past, I've been interested in (semi-)automatically inducing a
> scheme/wrapper from a set of example webpages (often called 'wrapper
> induction' in the scientific field).
> This would allow for fast scheme creation, which could be used as a basis for
> extraction.
> 
> Lately I've been looking for crawlers that incorporate this technology but
> without success.
> Any plans on incorporating this?
> 
> Cheers,
> Geert-Jan
> 
> 2011/3/2 Dominique Bejean 
> 
>> Rosa,
>> 
>> In the pipeline, there is a stage that extracts the text from the original
>> document (PDF, HTML, ...).
>> It is possible to plug in scripts (Java 6 compliant) in order to keep only
>> relevant parts of the document.
>> See
>> http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
>> 
>> Dominique
>> 
>> On 02/03/11 09:36, Rosa (Anuncios) wrote:
>> 
>> Nice job!
>>> 
>>> It would be good to be able to extract specific data from a given page via
>>> XPATH though.
>>> 
>>> Regards,
>>> 
>>> 
>>> On 02/03/2011 01:25, Dominique Bejean wrote:
>>> 
 Hi,
 
 I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java Web
 Crawler. It includes :
 
  * a crawler
  * a document processing pipeline
  * a solr indexer
 
 The crawler has a web administration interface for managing the web sites to
 be crawled. Each web site crawl is configured with a lot of possible
 parameters (not all mandatory):
 
  * number of simultaneous items crawled by site
  * recrawl period rules based on item type (html, PDF, …)
  * item type inclusion / exclusion rules
  * item path inclusion / exclusion / strategy rules
  * max depth
  * web site authentication
  * language
  * country
  * tags
  * collections
  * ...
 
 The pipeline includes various ready-to-use stages (text extraction,
 language detection, a Solr-ready XML writer, ...).
 
 All is very configurable and extensible, either by scripting or Java
 coding.
 
 With scripting technology, you can help the crawler handle JavaScript
 links or help the pipeline extract relevant titles and clean up the HTML
 pages (removing menus, headers, footers, ...).
 
 With Java coding, you can develop your own pipeline stages.
 
 The Crawl Anywhere web site provides good explanations and screenshots.
 Everything is documented in a wiki.
 
 The current version is 1.1.4. You can download and try it out from here :
 www.crawl-anywhere.com
 
 
 Regards
 
 Dominique
 
 
 
>>> 
>>> 



Grouped results

2011-03-02 Thread Rok Rejc
I have an index with a number of documents. For example (this example is
representative and contains many other fields):

Id    Type1    Type2    Title
1     a        b        xfg
2     a        c        abd
3     a        d        thm
4     b        a        efd
5     b        b        ikj
6     b        c        azd
...

I want to query the index on a number of fields (not a problem), but I want
to get the results ordered in "groups", and then (inside each group) I
want to order the results alphabetically by Title.

"Group" is not fixed but it is created in runtime. For example:
- Group 1: documents with Type1=b and Type2=b.
- Group 2: documents with Type1=a and Type2=b.
- Group 3: documents with Type1=b and Type2=a.
- Group 4: documents with Type1=b and Type2=c.
...

So I want to retrieve results ordered by group (1,2,3,4) and after that
alphabetically by Title.

I think that I should create a query where each group is separated with the OR
operator, and boost each group with an appropriate factor. After that I should
order the results by this factor and then by title.

Is this possible? Any suggestions are appreciated.
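
A sketch of that approach in standard Lucene query syntax (the boost weights
are arbitrary, and it orders groups correctly only to the extent that the
boosts dominate the score):

q=(+Type1:b +Type2:b)^8 (+Type1:a +Type2:b)^4 (+Type1:b +Type2:a)^2 (+Type1:b +Type2:c)
&sort=score desc, Title asc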

Many thanks,

Rok


Split analysis

2011-03-02 Thread dan sutton
Hi All,

I have a requirement to analyze a field with a series of filters,
calculate a 'signature' then concatenate with the original input

e.g.

input => 'this is the input'

tokenized and filtered,  input becomes say 'this input' =>
12ef5e (signature)

so the final output indexed is:

12ef5ethis is the input

I can calculate the signature easily, but how can I get access to the
original input (which by that point has been tokenized and filtered)?

Many thanks in advance,
Dan


Re: Split analysis

2011-03-02 Thread Markus Jelsma
There is an updateRequestProcessorChain you can use to execute some 
processors. Check the page on deduplication; it already has methods for 
creating signatures, but you can easily create your own if you have to.

Use copyField to copy the value to a non-analyzed field (string) and obtain 
the original token input from there.

http://wiki.apache.org/solr/Deduplication
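
A sketch of such a chain, following that wiki page (the signature field name
and input fields are placeholders; swap in your own Signature subclass for a
custom algorithm):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>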


On Wednesday 02 March 2011 13:21:58 dan sutton wrote:
> Hi All,
> 
> I have a requirement to analyze a field with a series of filters,
> calculate a 'signature' then concatenate with the original input
> 
> e.g.
> 
> input => 'this is the input'
> 
> tokenized and filtered,  input becomes say 'this input' =>
> 12ef5e (signature)
> 
> so the final output indexed is:
> 
> 12ef5ethis is the input
> 
> I can calculate the signature easily, but how can I get access to the
> original (now tokenized and filtered) input
> 
> Many thanks in advance,
> Dan

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Problem - Help me with DataImport

2011-03-02 Thread Matias Alonso
Good Morning,


First, sorry for my poor English.


I am trying to index "blogs" (RSS) into my Solr, so I'm using a
DataImportHandler for this.

I can't index the date, and I don't know how to index static (constant)
values in a field.

When I run a "full-import" it doesn't index the docs; if I delete the date
line, it works.

When I debug with verbose it shows me the right information.


Below you can see my dataImportHandler config:

<dataConfig>
    <dataSource type="HttpDataSource" />
    <document>
        <entity pk="link"
                url="http://locademiaz.wordpress.com/feed/"
                processor="XPathEntityProcessor"
                transformer="DateFormatTransformer"
                forEach="/rss/channel/item">

            ...

        </entity>
    </document>
</dataConfig>







I appreciate your help.

Thank you very much.



Matias.


Re: MLT with boost

2011-03-02 Thread Koji Sekiguchi

(11/03/02 0:23), Mark wrote:

Is it possible to add function queries/boosts to the results that are returned by MLT? If not out of the box,
If not out of the box
how would one go about achieving this functionality?

Thanks


Besides the point: why do you need such a function?
If you give us more information/background on your needs, it might help 
responders.

regards,

Koji
--
http://www.rondhuit.com/en/


Re: solr different sizes on master and slave

2011-03-02 Thread Mike Franon
Right now I have the slave polling every 10 seconds, because we want
to make sure they stay in sync.  I have users who will post
directly from a web application.  But I do notice it syncs very quickly,
because usually the update is only one or two records at a time.

I am thinking maybe 10 seconds is too fast?

On Tue, Mar 1, 2011 at 4:40 PM, Jonathan Rochkind  wrote:
> The slave should not keep multiple copies _permanently_, but it might
> temporarily, after it's fetched the new files from master but before it's
> committed them and fully warmed the new index searchers in the slave.  Could
> that be what's going on? Is your slave just still working on committing and
> warming the new version(s) of the index?
>
> [If you do a 'commit' to the slave (and a replication pull counts as a 'commit')
> so quickly that you get overlapping commits before the slave was able to warm
> a new index... it's going to be trouble all around.]
>
> On 3/1/2011 4:27 PM, Mike Franon wrote:
>>
>> ok doing some more research I noticed, on the slave it has multiple
>> folders where it keeps them for example
>>
>> index
>> index.20110204010900
>> index.20110204013355
>> index.20110218125400
>>
>> and then there is an index.properties that shows which index it is using.
>>
>> I am just curious why does it keep multiple copies?  Is there a
>> setting somewhere I can change to only keep one copy so not to lose
>> space?
>>
>> Thanks
>>
>> On Tue, Mar 1, 2011 at 3:26 PM, Mike Franon  wrote:
>>>
>>> No pending commits, what it looks like is there are almost two copies
>>> of the index on the master, not sure how that happened.
>>>
>>>
>>>
>>> On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma
>>>   wrote:

 Are there pending commits on the master?

> I was curious why would the size be dramatically different even though
> the index versions are the same?
>
> One is 1.2 Gb, and on the slave it is 512 MB
>
> I would think they should both be the same size no?
>
> Thanks
>


Re: solr different sizes on master and slave

2011-03-02 Thread Mike Franon
Is it ok if I just delete the old copies manually?  or maybe run a
script that does it?

On Tue, Mar 1, 2011 at 7:47 PM, Markus Jelsma
 wrote:
> Indeed, the slave should not have useless copies, but it does, at least in
> 1.4.0. I haven't seen it in 3.x, but that was just a small test that did not
> exactly match my other production installs.
>
> In 1.4.0 Solr does not remove old copies at startup and it does not cleanly
> abort running replications at shutdown. Between shutdown and startup there
> might be a higher index version; it will then proceed as expected, download
> the new version, and continue. Old copies will appear.
>
> There is an earlier thread I started, but without a patch. You can, however, work
> around the problem by letting Solr delete a running replication by: 1) disabling
> polling and then 2) aborting the replication. You can also write a script that will
> compare the current and available replication directories before startup and act
> accordingly.
>
>
>> The slave should not keep multiple copies _permanently_, but it might
>> temporarily, after it's fetched the new files from master but before
>> it's committed them and fully warmed the new index searchers in the
>> slave.  Could that be what's going on? Is your slave just still working
>> on committing and warming the new version(s) of the index?
>>
>> [If you do a 'commit' to the slave (and a replication pull counts as a
>> 'commit') so quickly that you get overlapping commits before the slave was
>> able to warm a new index... it's going to be trouble all around.]
>>
>> On 3/1/2011 4:27 PM, Mike Franon wrote:
>> > ok doing some more research I noticed, on the slave it has multiple
>> > folders where it keeps them for example
>> >
>> > index
>> > index.20110204010900
>> > index.20110204013355
>> > index.20110218125400
>> >
>> > and then there is an index.properties that shows which index it is using.
>> >
>> > I am just curious why does it keep multiple copies?  Is there a
>> > setting somewhere I can change to only keep one copy so not to lose
>> > space?
>> >
>> > Thanks
>> >
>> > On Tue, Mar 1, 2011 at 3:26 PM, Mike Franon  wrote:
>> >> No pending commits, what it looks like is there are almost two copies
>> >> of the index on the master, not sure how that happened.
>> >>
>> >>
>> >>
>> >> On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma
>> >>
>> >>   wrote:
>> >>> Are there pending commits on the master?
>> >>>
>>  I was curious why would the size be dramatically different even though
>>  the index versions are the same?
>> 
>>  One is 1.2 Gb, and on the slave it is 512 MB
>> 
>>  I would think they should both be the same size no?
>> 
>>  Thanks
>


RE: [ANNOUNCE] Web Crawler

2011-03-02 Thread Thumuluri, Sai
Dominique, does your crawler support NTLM2 authentication? We have content 
under SiteMinder which uses NTLM2, and that is posing challenges with Nutch.

-Original Message-
From: Dominique Bejean [mailto:dominique.bej...@eolya.fr] 
Sent: Wednesday, March 02, 2011 6:22 AM
To: solr-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Aditya,

The crawler is not open source and won't be in the near future. Anyway, 
I have to change the license text, because the crawler can in fact be used 
for any personal or commercial project.

Sincerely,

Dominique

On 02/03/11 10:02, findbestopensource wrote:
> Hello Dominique Bejean,
>
> Good job.
>
> We identified almost 8 open source web crawlers
> (http://www.findbestopensource.com/tagged/webcrawler); I don't know how
> yours differs from the rest.
>
> Your license states that it is not open source but that it is free for
> personal use.
>
> Regards
> Aditya
> www.findbestopensource.com 
>
>
> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean 
> mailto:dominique.bej...@eolya.fr>> wrote:
>
> Hi,
>
> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
> Web Crawler. It includes :
>
>   * a crawler
>   * a document processing pipeline
>   * a solr indexer
>
> The crawler has a web administration interface for managing the web sites
> to be crawled. Each web site crawl is configured with a lot of
> possible parameters (not all mandatory):
>
>   * number of simultaneous items crawled by site
>   * recrawl period rules based on item type (html, PDF, ...)
>   * item type inclusion / exclusion rules
>   * item path inclusion / exclusion / strategy rules
>   * max depth
>   * web site authentication
>   * language
>   * country
>   * tags
>   * collections
>   * ...
>
> The pipeline includes various ready-to-use stages (text
> extraction, language detection, a Solr-ready XML writer, ...).
>
> All is very configurable and extensible, either by scripting or
> Java coding.
>
> With scripting technology, you can help the crawler handle
> JavaScript links or help the pipeline to extract relevant titles
> and clean up the HTML pages (removing menus, headers, footers, ...).
>
> With Java coding, you can develop your own pipeline stages.
>
> The Crawl Anywhere web site provides good explanations and
> screenshots. Everything is documented in a wiki.
>
> The current version is 1.1.4. You can download and try it out from
> here : www.crawl-anywhere.com 
>
>
> Regards
>
> Dominique
>
>


Re: multi-core solr, specifying the data directory

2011-03-02 Thread Nagendra Nagarajayya

Hi Jonathan:

Did you try:

<core name="some_core" instanceDir="some_core" dataDir="some_core/data" />

This should create the indexes under some_core/data, or you can make 
dataDir relative to the some_core dir.


Regards,

- NN
http://solr-ra.tgels.com
http://rankingalgorithm.tgels.com



On 3/1/2011 7:21 AM, Jonathan Rochkind wrote:
I did try that, yes. I tried that first in fact!  It seems to fall 
back to a ./data directory relative to the _main_ solr directory (the 
one above all the cores), not the core instancedir.  Which is not what 
I expected either.


I wonder if this should be considered a bug? I wonder if anyone has 
considered this and thought of changing/fixing it?


On 3/1/2011 4:23 AM, Jan Høydahl wrote:
Have you tried removing the <dataDir> tag from solrconfig.xml? Then 
it should fall back to the default ./data relative to the core instanceDir.


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 1. mars 2011, at 00.00, Jonathan Rochkind wrote:

Unless I'm doing something wrong, in my experience in multi-core 
Solr in 1.4.1, you NEED to explicitly provide an absolute path to 
the 'data' dir.


I set up multi-core like this:

<cores adminPath="/admin/cores">
  <core name="some_core" instanceDir="some_core" />
</cores>

Now, setting instanceDir like that works for Solr to look for the 
'conf' directory in the default location you'd expect, 
./some_core/conf.


You'd expect it to look for the 'data' dir for an index in 
./some_core/data too, by default.  But it does not seem to. It's 
still looking for the 'data' directory in the _main_ solr.home/data, 
not under the relevant core directory.


The only way I can manage to get it to look for the /data directory 
where I expect is to spell it out with a full absolute path:

<core name="some_core" instanceDir="some_core">
  <property name="dataDir" value="/absolute/path/to/some_core/data" />
</core>

And then in the solrconfig.xml do a <dataDir>${dataDir}</dataDir>

Is this what everyone else does too? Or am I missing a better way of 
doing this?  I would have thought it would "just work", with Solr by 
default looking for a ./data subdir of the specified instanceDir.  
But it definitely doesn't seem to do that.


Should it? Anyone know if Solr in trunk past 1.4.1 has been changed 
to do what I expect? Or am I wrong to expect it? Or does everyone 
else do multi-core in some different way than me where this doesn't 
come up?


Jonathan










Solr under Tomcat

2011-03-02 Thread Thumuluri, Sai
Good Morning, 
We have deployed Solr 1.4.1 under Tomcat and it works great; however, I
cannot find where the index (directory) is created. I set solr home in
web.xml under /webapps/solr/WEB-INF/, but I am not sure where the data
directory is. I need to completely re-index the site,
and it would help if I could stop Solr, delete the index directory, and
restart Solr prior to re-indexing the content.

Thanks,
Sai Thumuluri




Re: Problem - Help me with DataImport

2011-03-02 Thread Stefan Matheis
Matias,

for indexing constant/static values .. try
http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer
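
For example, a sketch of the two field declarations (the column names and the
date pattern are guesses for a WordPress RSS feed; the entity's transformer
attribute must list both DateFormatTransformer and TemplateTransformer):

<field column="pubdate" xpath="/rss/channel/item/pubDate"
       dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" />
<field column="source"  template="locademiaz" />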

Regards
Stefan

On Wed, Mar 2, 2011 at 2:46 PM, Matias Alonso  wrote:
> Good Morning,
>
>
> First, sorry for my poor English.
>
>
> I am trying to index "blogs" (RSS) into my Solr, so I'm using a
> DataImportHandler for this.
>
> I can't index the date, and I don't know how to index static (constant)
> values in a field.
>
> When I run a "full-import" it doesn't index the docs; if I delete the date
> line, it works.
>
> When I debug with verbose it shows me the right information.
>
>
> Below you can see my dataImportHandler config:
>
> <dataConfig>
>     <dataSource type="HttpDataSource" />
>     <document>
>         <entity pk="link"
>                 url="http://locademiaz.wordpress.com/feed/"
>                 processor="XPathEntityProcessor"
>                 transformer="DateFormatTransformer"
>                 forEach="/rss/channel/item">
>
>             ...
>
>         </entity>
>     </document>
> </dataConfig>
>
>
>
>
>
> I appreciate your help.
>
> Thank you very much.
>
>
>
> Matias.
>


Re: solr different sizes on master and slave

2011-03-02 Thread Markus Jelsma
Yes. But keep in mind that Solr may actually be using an index.<timestamp> 
directory for its live search. See either the replication.properties file or 
consult the replication wiki page to see which index directory it uses.

If it uses an index.<timestamp> directory you can safely move it to index and 
remove or modify replication.properties.
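
A minimal shell sketch of such a cleanup (assuming the data directory layout
shown earlier in this thread, where index.properties has an index= line naming
the live directory; test it before pointing it at production):

#!/bin/sh
# Remove every index.* directory except the one named in index.properties.
cd /path/to/solr/data || exit 1
CURRENT=$(sed -n 's/^index=//p' index.properties)
for dir in index.*; do
  case "$dir" in
    "$CURRENT"|index.properties) continue ;;
  esac
  echo "removing stale $dir"
  rm -rf -- "$dir"
done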

On Wednesday 02 March 2011 15:03:54 Mike Franon wrote:
> Is it ok if I just delete the old copies manually?  or maybe run a
> script that does it?
> 
> On Tue, Mar 1, 2011 at 7:47 PM, Markus Jelsma
> 
>  wrote:
> > Indeed, the slave should not have useless copies, but it does, at least in
> > 1.4.0. I haven't seen it in 3.x, but that was just a small test that did
> > not exactly match my other production installs.
> > 
> > In 1.4.0 Solr does not remove old copies at startup and it does not
> > cleanly abort running replications at shutdown. Between shutdown and
> > startup there might be a higher index version; it will then proceed as
> > expected, download the new version, and continue. Old copies will appear.
> > 
> > There is an earlier thread I started, but without a patch. You can, however,
> > work around the problem by letting Solr delete a running replication by:
> > 1) disabling polling and then 2) aborting the replication. You can also write a
> > script that will compare the current and available replication directories
> > before startup and act accordingly.
> > 
> >> The slave should not keep multiple copies _permanently_, but it might
> >> temporarily, after it's fetched the new files from master but before
> >> it's committed them and fully warmed the new index searchers in the
> >> slave.  Could that be what's going on? Is your slave just still working
> >> on committing and warming the new version(s) of the index?
> >> 
> >> [If you do a 'commit' to the slave (and a replication pull counts as a
> >> 'commit') so quickly that you get overlapping commits before the slave was
> >> able to warm a new index... it's going to be trouble all around.]
> >> 
> >> On 3/1/2011 4:27 PM, Mike Franon wrote:
> >> > ok doing some more research I noticed, on the slave it has multiple
> >> > folders where it keeps them for example
> >> > 
> >> > index
> >> > index.20110204010900
> >> > index.20110204013355
> >> > index.20110218125400
> >> > 
> >> > and then there is an index.properties that shows which index it is
> >> > using.
> >> > 
> >> > I am just curious why does it keep multiple copies?  Is there a
> >> > setting somewhere I can change to only keep one copy so not to lose
> >> > space?
> >> > 
> >> > Thanks
> >> > 
> >> > On Tue, Mar 1, 2011 at 3:26 PM, Mike Franon 
 wrote:
> >> >> No pending commits, what it looks like is there are almost two copies
> >> >> of the index on the master, not sure how that happened.
> >> >> 
> >> >> 
> >> >> 
> >> >> On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma
> >> >> 
> >> >>   wrote:
> >> >>> Are there pending commits on the master?
> >> >>> 
> >>  I was curious why would the size be dramatically different even
> >>  though the index versions are the same?
> >>  
> >>  One is 1.2 Gb, and on the slave it is 512 MB
> >>  
> >>  I would think they should both be the same size no?
> >>  
> >>  Thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Dominique Bejean

Hi,

No, it doesn't. It looks like an Apache HttpClient 3.x limitation.
https://issues.apache.org/jira/browse/HTTPCLIENT-579

Dominique

On 02/03/11 15:04, Thumuluri, Sai wrote:

Dominique, does your crawler support NTLM2 authentication? We have content 
under SiteMinder which uses NTLM2, and that is posing challenges with Nutch.

-Original Message-
From: Dominique Bejean [mailto:dominique.bej...@eolya.fr]
Sent: Wednesday, March 02, 2011 6:22 AM
To: solr-user@lucene.apache.org
Subject: Re: [ANNOUNCE] Web Crawler

Aditya,

The crawler is not open source and won't be in the near future. Anyway,
I have to change the license text, because the crawler can in fact be used
for any personal or commercial project.

Sincerely,

Dominique

On 02/03/11 10:02, findbestopensource wrote:

Hello Dominique Bejean,

Good job.

We identified almost 8 open source web crawlers
(http://www.findbestopensource.com/tagged/webcrawler); I don't know how
yours differs from the rest.

Your license states that it is not open source but that it is free for
personal use.

Regards
Aditya
www.findbestopensource.com


On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
mailto:dominique.bej...@eolya.fr>>  wrote:

 Hi,

 I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
 Web Crawler. It includes :

   * a crawler
   * a document processing pipeline
   * a solr indexer

 The crawler has a web administration interface for managing the web sites
 to be crawled. Each web site crawl is configured with a lot of
 possible parameters (not all mandatory):

   * number of simultaneous items crawled by site
   * recrawl period rules based on item type (html, PDF, ...)
   * item type inclusion / exclusion rules
   * item path inclusion / exclusion / strategy rules
   * max depth
   * web site authentication
   * language
   * country
   * tags
   * collections
   * ...

 The pipeline includes various ready-to-use stages (text
 extraction, language detection, a Solr-ready XML writer, ...).

 All is very configurable and extensible, either by scripting or
 Java coding.

 With scripting technology, you can help the crawler handle
 JavaScript links or help the pipeline to extract relevant titles
 and clean up the HTML pages (removing menus, headers, footers, ...).

 With Java coding, you can develop your own pipeline stages.

 The Crawl Anywhere web site provides good explanations and
 screenshots. Everything is documented in a wiki.

 The current version is 1.1.4. You can download and try it out from
 here : www.crawl-anywhere.com


 Regards

 Dominique




Re: Solr under Tomcat

2011-03-02 Thread Savvas-Andreas Moysidis
Hi Sai,

You can find your index files at:
{%TOMCAT_HOME}\solr\data\index

If you want to clear the index just delete the whole index directory.

Regards,
- Savvas

On 2 March 2011 14:09, Thumuluri, Sai wrote:

> Good Morning,
> We have deployed Solr 1.4.1 under Tomcat and it works great, however I
> cannot find where the index (directory) is created. I set solr home in
> web.xml under /webapps/solr/WEB-INF/, but not sure where the data
> directory is. I have a need where I need to completely index the site
> and it would help for me to stop solr, delete index directory and
> restart solr prior to re-indexing the content.
>
> Thanks,
> Sai Thumuluri
>
>
>


Re: solr different sizes on master and slave

2011-03-02 Thread Jayendra Patil
Hi Mike,

There was an issue with the Snappuller wherein it fails to clean up
the old index directories on the slave side.
https://issues.apache.org/jira/browse/SOLR-2156

The patch can be applied to fix the issue.
You can also delete the old index directories, except for the current
one which is mentioned in the index.properties.

Regards,
Jayendra

On Tue, Mar 1, 2011 at 4:27 PM, Mike Franon  wrote:
> ok doing some more research I noticed, on the slave it has multiple
> folders where it keeps them for example
>
> index
> index.20110204010900
> index.20110204013355
> index.20110218125400
>
> and then there is an index.properties that shows which index it is using.
>
> I am just curious why does it keep multiple copies?  Is there a
> setting somewhere I can change to only keep one copy so not to lose
> space?
>
> Thanks
>
> On Tue, Mar 1, 2011 at 3:26 PM, Mike Franon  wrote:
>> No pending commits, what it looks like is there are almost two copies
>> of the index on the master, not sure how that happened.
>>
>>
>>
>> On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma
>>  wrote:
>>> Are there pending commits on the master?
>>>
 I was curious why would the size be dramatically different even though
 the index versions are the same?

 One is 1.2 Gb, and on the slave it is 512 MB

 I would think they should both be the same size no?

 Thanks
>>>
>>
>


Re: Indexed, but cannot search

2011-03-02 Thread Brian Lamb
Here are the relevant parts of schema.xml:


<field name="globalField" type="text" indexed="true" stored="true" multiValued="true"/>
<defaultSearchField>globalField</defaultSearchField>


This is what is returned when I search:


[XML response with the tags stripped by the list archive; the recoverable
parts are:]

  status: 0, QTime: 1
  params: q=Mammal, debugQuery=true
  rawquerystring: Mammal
  querystring: Mammal
  parsedquery: globalField:mammal
  parsedquery_toString: globalField:mammal
  QParser: LuceneQParser
  (no matching documents; the remaining 1.0/0.0 values are timing details)






On Tue, Mar 1, 2011 at 7:57 PM, Markus Jelsma wrote:

> Hmm, please provide the analyzer of text and the output of debugQuery=true.
> Anyway, if the field is of fieldType text and the catchall field text is
> fieldType text as well, and you reindexed, it should work as expected.
>
> > Oh if only it were that easy :-). I have reindexed since making that
> change
> > which is how I was able to get the regular search working. I have not
> > however been able to get the search across all fields to work.
> >
> > On Tue, Mar 1, 2011 at 3:01 PM, Markus Jelsma
> wrote:
> > > Traditionally, people forget to reindex ;)
> > >
> > > > Hi all,
> > > >
> > > > The problem was that my fields were defined as type="string" instead
> of
> > > > type="text". Once I corrected that, it seems to be fixed. The only
> part
> > > > that still is not working though is the search across all fields.
> > > >
> > > > For example:
> > > >
> > > > http://localhost:8983/solr/select/?q=type%3AMammal
> > > >
> > > > Now correctly returns the records matching mammal. But if I try to do
> a
> > > > global search across all fields:
> > > >
> > > > http://localhost:8983/solr/select/?q=Mammal
> > > > http://localhost:8983/solr/select/?q=text%3AMammal
> > > >
> > > > I get no results returned. Here is how the schema is set up:
> > > >
> > > >  > > > multiValued="true"/>
> > > > text
> > > > 
> > > >
> > > > Thanks to everyone for your help so far. I think this is the last
> > > > hurdle
> > >
> > > I
> > >
> > > > have to jump over.
> > > >
> > > > On Tue, Mar 1, 2011 at 12:34 PM, Upayavira  wrote:
> > > > > Next question, do you have your "type" field set to index="true" in
> > >
> > > your
> > >
> > > > > schema?
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Tue, 01 Mar 2011 11:06 -0500, "Brian Lamb"
> > > > >
> > > > >  wrote:
> > > > > > Thank you for your reply but the searching is still not working
> > > > > > out. For example, when I go to:
> > > > > >
> > > > > > http://localhost:8983/solr/select/?q=*%3A*<
> > >
> > >
> http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&in
> > >
> > > > > dent=on
> > > > >
> > > > > > I get the following as a response:
> > > > > >
> > > > > > 
> > > > > >
> > > > > >   
> > > > > >
> > > > > > Mammal
> > > > > > 1
> > > > > > Canis
> > > > > >
> > > > > >   
> > > > > >
> > > > > > 
> > > > > >
> > > > > > (plus some other docs but one is enough for this example)
> > > > > >
> > > > > > But if I go to
> > > > > > http://localhost:8983/solr/select/?q=type%3A<
> > >
> > >
> http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&in
> > >
> > > > > dent=on
> > > > >
> > > > > > Mammal
> > > > > >
> > > > > > I only get:
> > > > > >
> > > > > > 
> > > > > >
> > > > > > But it seems that should return at least the result I have listed
> > > > > > above. What am I doing incorrectly?
> > > > > >
> > > > > > On Mon, Feb 28, 2011 at 6:57 PM, Upayavira 
> wrote:
> > > > > > > q=dog is equivalent to q=text:dog (where the default search
> field
> > >
> > > is
> > >
> > > > > > > defined as text at the bottom of schema.xml).
> > > > > > >
> > > > > > > If you want to specify a different field, well, you need to
> tell
> > > > > > > it
> > > > > > >
> > > > > > > :-)
> > > > > > >
> > > > > > > Is that it?
> > > > > > >
> > > > > > > Upayavira
> > > > > > >
> > > > > > > On Mon, 28 Feb 2011 15:38 -0500, "Brian Lamb"
> > > > > > >
> > > > > > >  wrote:
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I was able to get my installation of Solr indexed using
> > >
> > > dataimport.
> > >
> > > > > > > > However,
> > > > > > > > I cannot seem to get search working. I can verify that the
> data
> > >
> > > is
> > >
> > > > > there
> > > > >
> > > > > > > > by
> > >
> > > > > > > > going to:
> > >
> http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&in
> > >
> > > > > dent=on
> > > > >
> > > > > > > > This gives me the response:  > > > > > > > numFound="234961" start="0">
> > > > > > > >
> > > > > > > > But when I go to
> > >
> > >
> http://localhost:8983/solr/select/?q=dog&version=2.2&start=0&rows=10&inde
> > >
> > > > > nt=on
> > > > >
> > > > > > > > I get the response:  > >
> > > start="0">
> > >
> > > > > > > > I know that dog should return some results because it is the
> > >
> > > first
> > >
> > > > > result
> > > > >
> > > > > > > > when I select all the records. So what am I doing incorrectly
> > >
> > > that
> > >
> > > > > would
> > > > >
> > > > > > > > prevent me from seeing results?
> > > > > > >
> > > > > > > ---
> > > > > > > Enterprise Search Consultant at Sourcesense UK,
> > > > > > > Making Sense of Open Source
> > > > >

Solr Sharding and idf

2011-03-02 Thread Jae Joo
Is there still an issue regarding distributed IDF in a sharded environment in
Solr 1.4 or 4.0?
If yes, any suggestions to resolve it?

Thanks,

Jae


Re: Grouped results

2011-03-02 Thread Jayendra Patil
Hi Rok,

If I understood the use case rightly, grouping of the results is
possible in Solr: http://wiki.apache.org/solr/FieldCollapsing
You can probably create a new field with the combined values for the
groups and use the field collapsing feature to group the results.

Id    Type1    Type2    Title    Group1
1     a        b        xfg      ab
2     a        c        abd      ac
3     a        d        thm      ad
4     b        a        efd      ba
5     b        b        ikj      bb
6     b        c        azd      bc

It also provides sorting and group-sorting features.
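
A hedged sketch of such a query using the trunk-style grouping parameters from
that wiki page (Group1 being the derived field above; sort orders the groups,
group.sort orders documents within each group):

http://localhost:8983/solr/select?q=*:*&group=true&group.field=Group1&sort=Group1+asc&group.sort=Title+asc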

Regards,
Jayendra

On Wed, Mar 2, 2011 at 6:37 AM, Rok Rejc  wrote:
> I have an index with a number of documents. For example (this example is
> representative and contains many other fields):
>
> Id    Type1    Type2    Title
> 1    a    b    xfg
> 2    a    c    abd
> 3    a    d    thm
> 4    b    a    efd
> 5    b    b    ikj
> 6    b    c    azd
> ...
>
> I want to query the index on a number of fields (not a problem), but I want
> to get the results ordered in "groups", and then (inside each group) I
> want to order the results alphabetically by Title.
>
> "Group" is not fixed but it is created in runtime. For example:
> - Group 1: documents with Type1=b and Type2=b.
> - Group 2: documents with Type1=a and Type2=b.
> - Group 3: documents with Type1=b and Type2=a.
> - Group 4: documents with Type1=b and Type2=c.
> ...
>
> So I want to retrieve results ordered by group (1,2,3,4) and after that
> alphabetically by Title.
>
> I think that I should create a query where each group is separated with the OR
> operator, and boost each group with an appropriate factor. After that I should
> order the results by this factor and title.
>
> Is this possible? Any suggestions are appreciated.
>
> Many thanks,
>
> Rok
>


multiple localParams for each query clause

2011-03-02 Thread Roman Chyla
Hi,

Is it possible to set local parameters for each query clause?

example:

{!type=x q.field=z}something AND {!type=database}something


I am pulling together result sets coming from two sources, a Solr index
and a DB engine - however, I realized that local parameters apply only to
the whole query - so I don't know how to write the query to mark the
second clause as db-searchable.
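
One possibility might be the nested-query hook of the Lucene query parser, so
that each clause carries its own local params (a hedged sketch reusing the
parser names from the example above):

q=_query_:"{!type=x q.field=z}something" AND _query_:"{!type=database}something"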

Thanks,

  Roman


Re: Indexed, but cannot search

2011-03-02 Thread Markus Jelsma
Please also provide the analysis part of fieldType text. You can also use Luke 
to inspect the index.

http://localhost:8983/solr/admin/luke?fl=globalField&numTerms=100

On Wednesday 02 March 2011 16:09:33 Brian Lamb wrote:
> Here are the relevant parts of schema.xml:
> 
> <field name="globalField" type="text" indexed="true" stored="true" multiValued="true"/>
> <defaultSearchField>globalField</defaultSearchField>
> 
> 
> This is what is returned when I search:
> 
> 
> [XML response with the tags stripped by the list archive; the recoverable
> parts are:]
>
>   status: 0, QTime: 1
>   params: q=Mammal, debugQuery=true
>   rawquerystring: Mammal
>   querystring: Mammal
>   parsedquery: globalField:mammal
>   parsedquery_toString: globalField:mammal
>   QParser: LuceneQParser
>   (no matching documents; the remaining 1.0/0.0 values are timing details)
> 
> 
> 
> 
> 
> 
> On Tue, Mar 1, 2011 at 7:57 PM, Markus Jelsma 
wrote:
> > Hmm, please provide analyzer of text and output of debugQuery=true.
> > Anyway, if
> > field type is fieldType text and the catchall field text is fieldType
> > text as well
> > and you reindexed, it should work as expected.
> > 
> > > Oh if only it were that easy :-). I have reindexed since making that
> > 
> > change
> > 
> > > which is how I was able to get the regular search working. I have not
> > > however been able to get the search across all fields to work.
> > > 
> > > On Tue, Mar 1, 2011 at 3:01 PM, Markus Jelsma
> > 
> > wrote:
> > > > Traditionally, people forget to reindex ;)
> > > > 
> > > > > Hi all,
> > > > > 
> > > > > The problem was that my fields were defined as type="string"
> > > > > instead
> > 
> > of
> > 
> > > > > type="text". Once I corrected that, it seems to be fixed. The only
> > 
> > part
> > 
> > > > > that still is not working though is the search across all fields.
> > > > > 
> > > > > For example:
> > > > > 
> > > > > http://localhost:8983/solr/select/?q=type%3AMammal
> > > > > 
> > > > > Now correctly returns the records matching mammal. But if I try to
> > > > > do
> > 
> > a
> > 
> > > > > global search across all fields:
> > > > > 
> > > > > http://localhost:8983/solr/select/?q=Mammal
> > > > > http://localhost:8983/solr/select/?q=text%3AMammal
> > > > > 
> > > > > I get no results returned. Here is how the schema is set up:
> > > > > 
> > > > > > <field name="text" ... multiValued="true"/>
> > > > > > <defaultSearchField>text</defaultSearchField>
> > > > > 
> > > > > 
> > > > > Thanks to everyone for your help so far. I think this is the last
> > > > > hurdle
> > > > 
> > > > I
> > > > 
> > > > > have to jump over.
> > > > > 
> > > > > On Tue, Mar 1, 2011 at 12:34 PM, Upayavira  wrote:
> > > > > > Next question, do you have your "type" field set to index="true"
> > > > > > in
> > > > 
> > > > your
> > > > 
> > > > > > schema?
> > > > > > 
> > > > > > Upayavira
> > > > > > 
> > > > > > On Tue, 01 Mar 2011 11:06 -0500, "Brian Lamb"
> > > > > > 
> > > > > >  wrote:
> > > > > > > Thank you for your reply but the searching is still not working
> > > > > > > out. For example, when I go to:
> > > > > > > 
> > > > > > > > http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on
> > > > > > 
> > > > > > > I get the following as a response:
> > > > > > > 
> > > > > > > (response XML with tags lost in the archive; one doc contained
> > > > > > > the values: Mammal / 1 / Canis)
> > > > > > > (plus some other docs but one is enough for this example)
> > > > > > > 
> > > > > > > But if I go to
> > > > > > > http://localhost:8983/solr/select/?q=type%3AMammal&version=2.2&start=0&rows=10&indent=on
> > > > > > >
> > > > > > > I only get an empty response.
> > > > > > >
> > > > > > > But it seems that should return at least the result I have
> > > > > > > listed above. What am I doing incorrectly?
> > > > > > > 
> > > > > > > On Mon, Feb 28, 2011 at 6:57 PM, Upayavira 
> > 
> > wrote:
> > > > > > > > q=dog is equivalent to q=text:dog (where the default search
> > 
> > field
> > 
> > > > is
> > > > 
> > > > > > > > defined as text at the bottom of schema.xml).
> > > > > > > > 
> > > > > > > > If you want to specify a different field, well, you need to
> > 
> > tell
> > 
> > > > > > > > it
> > > > > > > > 
> > > > > > > > :-)
> > > > > > > > 
> > > > > > > > Is that it?
> > > > > > > > 
> > > > > > > > Upayavira
> > > > > > > > 
> > > > > > > > On Mon, 28 Feb 2011 15:38 -0500, "Brian Lamb"
> > > > > > > > 
> > > > > > > >  wrote:
> > > > > > > > > Hi all,
> > > > > > > > > 
> > > > > > > > > I was able to get my installation of Solr indexed using
> > > > 
> > > > dataimport.
> > > > 
> > > > > > > > > However,
> > > > > > > > > I cannot seem to get search working. I can verify that the
> > 
> > data
> > 
> > > > is
> > > > 
> > > > > > there
> > > > > > 
> > > > > > > > > by
> > 
> > > > > > > > > going to:
> > http://localhost:8983/solr/select/?q=*%3A*&version=2.2&st

Re: MLT with boost

2011-03-02 Thread darren

I think what's being asked is obvious: they want to modify the sorted
relevancy of the results of MLT, where, instead of (or in addition
to) sorting by the MLT score, some modified function/subquery can be used
to further sort the results.

One example. You run a MLT query against a document about Shoes. You find
other documents about Shoes, but you want the relevancy modified by other
facets or fields like size, brand, style, etc. This should be done inside
the original MLT query and not after the results are retrieved.

On Wed, 02 Mar 2011 22:51:22 +0900, Koji Sekiguchi 
wrote:
> (11/03/02 0:23), Mark wrote:
>> Is it possible to add function queries/boosts to the results that are
by
>> MLT? If not out of the box
>> how would one go about achieving this functionality?
>>
>> Thanks
> 
> Beside the point, why do you need such function?
> If you give us more information/background of your needs, it might help
> responders.
> 
> regards,
> 
> Koji


Re: [ANNOUNCE] Web Crawler

2011-03-02 Thread Nestor Oviedo
Hi everyone!
I've been following this thread and I realized we've constructed something
similar to "Crawl Anywhere". The main difference is that our project is
oriented to the digital libraries and digital repositories context:
specifically, metadata collection from multiple sources, information
enrichment, and storage in multiple destinations.
So far, I can only share an article about the project, because the code is
still on our development machines and testing servers. If everything goes
well, we plan to make it open source in the near future.
I'd be glad to hear your comments and opinions about it. There is no need to
be polite.
Thanks in advance.

Best regards.
Nestor



On Wed, Mar 2, 2011 at 11:46 AM, Dominique Bejean  wrote:

> Hi,
>
> No, it doesn't. It looks like an Apache HttpClient 3.x limitation.
> https://issues.apache.org/jira/browse/HTTPCLIENT-579
>
> Dominique
>
> Le 02/03/11 15:04, Thumuluri, Sai a écrit :
>
>  Dominique, Does your crawler support NTLM2 authentication? We have content
>> under SiteMinder, which uses NTLM2, and that is posing challenges with Nutch.
>>
>> -Original Message-
>> From: Dominique Bejean [mailto:dominique.bej...@eolya.fr]
>> Sent: Wednesday, March 02, 2011 6:22 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: [ANNOUNCE] Web Crawler
>>
>> Aditya,
>>
>> The crawler is not open source and won't be in the near future. Anyway,
>> I have to change the license because it can be used for any personal or
>> commercial project.
>>
>> Sincerely,
>>
>> Dominique
>>
>> Le 02/03/11 10:02, findbestopensource a écrit :
>>
>>> Hello Dominique Bejean,
>>>
>>> Good job.
>>>
>>> We identified almost 8 open source web crawlers
>>> http://www.findbestopensource.com/tagged/webcrawler   I don't know how
>>> far yours would be different from the rest.
>>>
>>> Your license states that it is not open source but it is free for
>>> personnel use.
>>>
>>> Regards
>>> Aditya
>>> www.findbestopensource.com
>>>
>>>
>>> On Wed, Mar 2, 2011 at 5:55 AM, Dominique Bejean
>>> mailto:dominique.bej...@eolya.fr>>  wrote:
>>>
>>> Hi,
>>>
>>> I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java
>>> Web Crawler. It includes :
>>>
>>>   * a crawler
>>>   * a document processing pipeline
>>>   * a solr indexer
>>>
>>> The crawler has a web administration in order to manage web sites
>>> to be crawled. Each web site crawl is configured with a lot of
>>> possible parameters (not all mandatory):
>>>
>>>   * number of simultaneous items crawled by site
>>>   * recrawl period rules based on item type (html, PDF, ...)
>>>   * item type inclusion / exclusion rules
>>>   * item path inclusion / exclusion / strategy rules
>>>   * max depth
>>>   * web site authentication
>>>   * language
>>>   * country
>>>   * tags
>>>   * collections
>>>   * ...
>>>
>>> The pipeline includes various ready-to-use stages (text
>>> extraction, language detection, a Solr-ready index XML writer, ...).
>>>
>>> All is very configurable and extendible either by scripting or
>>> java coding.
>>>
>>> With scripting technology, you can help the crawler to handle
>>> javascript links or help the pipeline to extract relevant title
>>> and cleanup the html pages (remove menus, header, footers, ..)
>>>
>>> With Java coding, you can develop your own pipeline stage.
>>>
>>> The Crawl Anywhere web site provides good explanations and screen
>>> shots. All is documented in a wiki.
>>>
>>> The current version is 1.1.4. You can download and try it out from
>>> here : www.crawl-anywhere.com
>>>
>>>
>>> Regards
>>>
>>> Dominique
>>>
>>>
>>>


Re: MLT with boost

2011-03-02 Thread Markus Jelsma
There is a mlt.boost parameter.

On Wednesday 02 March 2011 16:28:35 dar...@ontrenet.com wrote:
> I think what's being asked is obvious, in that, they want to modify the
> sorted relevancy of the results of MLT. Where, instead of (or in addition
> to) sorting by the mlt score, some modified function/subquery can be used
> to further sort the results.
> 
> One example. You run a MLT query against a document about Shoes. You find
> other documents about Shoes, but you want the relevancy modified by other
> facets or fields like size, brand, style, etc. This should be done inside
> the original MLT query and not after the results are retrieved.
> 
> On Wed, 02 Mar 2011 22:51:22 +0900, Koji Sekiguchi 
> 
> wrote:
> > (11/03/02 0:23), Mark wrote:
> >> Is it possible to add function queries/boosts to the results that are
> 
> by
> 
> >> MLT? If not out of the box
> >> how would one go about achieving this functionality?
> >> 
> >> Thanks
> > 
> > Beside the point, why do you need such function?
> > If you give us more information/background of your needs, it might help
> > responders.
> > 
> > regards,
> > 
> > Koji

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: solr different sizes on master and slave

2011-03-02 Thread Mike Franon
Thank you very much for this info, that helps a lot!



On Wed, Mar 2, 2011 at 10:05 AM, Jayendra Patil
 wrote:
> Hi Mike,
>
> There was an issue with the Snappuller wherein it fails to clean up
> the old index directories on the slave side.
> https://issues.apache.org/jira/browse/SOLR-2156
>
> The patch can be applied to fix the issue.
> You can also delete the old index directories, except for the current
> one which is mentioned in the index.properties.
>
> Regards,
> Jayendra
>
> On Tue, Mar 1, 2011 at 4:27 PM, Mike Franon  wrote:
>> ok doing some more research I noticed, on the slave it has multiple
>> folders where it keeps them for example
>>
>> index
>> index.20110204010900
>> index.20110204013355
>> index.20110218125400
>>
>> and then there is an index.properties that shows which index it is using.
>>
>> I am just curious why does it keep multiple copies?  Is there a
>> setting somewhere I can change to only keep one copy so not to lose
>> space?
>>
>> Thanks
>>
>> On Tue, Mar 1, 2011 at 3:26 PM, Mike Franon  wrote:
>>> No pending commits, what it looks like is there are almost two copies
>>> of the index on the master, not sure how that happened.
>>>
>>>
>>>
>>> On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma
>>>  wrote:
 Are there pending commits on the master?

> I was curious why would the size be dramatically different even though
> the index versions are the same?
>
> One is 1.2 Gb, and on the slave it is 512 MB
>
> I would think they should both be the same size no?
>
> Thanks

>>>
>>
>


Re: Indexed, but cannot search

2011-03-02 Thread Brian Lamb
So here's something interesting. I did a delta import this morning and it
looks like I can do a global search across those fields.

I'll do another full import and see if that fixes the problem. I had done a
full-import after making this change but it seems like another reindex is in
order.

On Wed, Mar 2, 2011 at 10:31 AM, Markus Jelsma
wrote:

> Please also provide analysis part of fieldType text. You can also use Luke
> to
> inspect the index.
>
> http://localhost:8983/solr/admin/luke?fl=globalField&numTerms=100
>
> On Wednesday 02 March 2011 16:09:33 Brian Lamb wrote:
> > Here are the relevant parts of schema.xml:
> >
> > <field name="globalField" ... multiValued="true"/>
> > <defaultSearchField>globalField</defaultSearchField>
> > 
> >
> > This is what is returned when I search:
> >
> > (debug response as quoted above: q=Mammal, empty result,
> > parsedquery globalField:mammal, QParser LuceneQParser)
> >
> > On Tue, Mar 1, 2011 at 7:57 PM, Markus Jelsma
> wrote:
> > > Hmm, please provide analyzer of text and output of debugQuery=true.
> > > Anyway, if
> > > field type is fieldType text and the catchall field text is fieldType
> > > text as well
> > > and you reindexed, it should work as expected.
> > >
> > > > Oh if only it were that easy :-). I have reindexed since making that
> > >
> > > change
> > >
> > > > which is how I was able to get the regular search working. I have not
> > > > however been able to get the search across all fields to work.
> > > >
> > > > On Tue, Mar 1, 2011 at 3:01 PM, Markus Jelsma
> > >
> > > wrote:
> > > > > Traditionally, people forget to reindex ;)
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > The problem was that my fields were defined as type="string"
> > > > > > instead
> > >
> > > of
> > >
> > > > > > type="text". Once I corrected that, it seems to be fixed. The
> only
> > >
> > > part
> > >
> > > > > > that still is not working though is the search across all fields.
> > > > > >
> > > > > > For example:
> > > > > >
> > > > > > http://localhost:8983/solr/select/?q=type%3AMammal
> > > > > >
> > > > > > Now correctly returns the records matching mammal. But if I try
> to
> > > > > > do
> > >
> > > a
> > >
> > > > > > global search across all fields:
> > > > > >
> > > > > > http://localhost:8983/solr/select/?q=Mammal
> > > > > > http://localhost:8983/solr/select/?q=text%3AMammal
> > > > > >
> > > > > > I get no results returned. Here is how the schema is set up:
> > > > > >
> > > > > > <field name="text" ... multiValued="true"/>
> > > > > > <defaultSearchField>text</defaultSearchField>
> > > > > > 
> > > > > >
> > > > > > Thanks to everyone for your help so far. I think this is the last
> > > > > > hurdle
> > > > >
> > > > > I
> > > > >
> > > > > > have to jump over.
> > > > > >
> > > > > > On Tue, Mar 1, 2011 at 12:34 PM, Upayavira 
> wrote:
> > > > > > > Next question, do you have your "type" field set to
> index="true"
> > > > > > > in
> > > > >
> > > > > your
> > > > >
> > > > > > > schema?
> > > > > > >
> > > > > > > Upayavira
> > > > > > >
> > > > > > > On Tue, 01 Mar 2011 11:06 -0500, "Brian Lamb"
> > > > > > >
> > > > > > >  wrote:
> > > > > > > > Thank you for your reply but the searching is still not
> working
> > > > > > > > out. For example, when I go to:
> > > > > > > >
> > > > > > > > http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on
> > > > > > >
> > > > > > > > I get the following as a response:
> > > > > > > >
> > > > > > > > (response XML with tags lost in the archive; one doc contained
> > > > > > > > the values: Mammal / 1 / Canis)
> > > > > > > > (plus some other docs but one is enough for this example)
> > > > > > > >
> > > > > > > > But if I go to
> > > > > > > > http://localhost:8983/solr/select/?q=type%3AMammal&version=2.2&start=0&rows=10&indent=on
> > > > > > > >
> > > > > > > > I only get an empty response.
> > > > > > > >
> > > > > > > > But it seems that should return at least the result I have
> > > > > > > > listed above. What am I doing incorrectly?
> > > > > > > >
> > > > > > > > On Mon, Feb 28, 2011 at 6:57 PM, Upayavira 
> > >
> > > wrote:
> > > > > > > > > q=dog is equivalent to q=text:dog (where the default search
> > >
> > > field
> > >
> > > > > is
> > > > >
> > > > > > > > > defined as text at the bottom of schema.xml).
> > > > > > > > >
> > > > > > > > > If you want to specify a different field

Boost function problem with disquerymax

2011-03-02 Thread Gastone Penzo
Hi,
for search I use dismax,
and I want to boost a field with the bf parameter, like:

...&bf=boost_has_img^5&...

the boost_has_img field of my document is 3:

<int name="boost_has_img">3</int>

if I look at the results in debug query mode I see:

  0.0 = (MATCH) FunctionQuery(int(boost_has_img)), product of:
    0.0 = int(boost_has_img)=0
    5.0 = boost
    0.06543833 = queryNorm

why is the score 0 if the value is 3 and the boost is 5???


THANX
-- 

Gastone Penzo


Re: multiple localParams for each query clause

2011-03-02 Thread Jonathan Rochkind
Not per clause, no. But you can use the "nested queries" feature to set
local params for each nested query instead, which is in fact one of the
most common use cases for local params.


&q=_query_:"{type=x q.field=z}something" AND 
_query_:"{!type=database}something"


URL encode that whole thing though.

http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/
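
For instance, the encoded form might look like this (a sketch; %7B/%7D are
the braces, %22 the quotes, %3A the colon, %3D the equals sign):

    q=_query_%3A%22%7B!type%3Dx%20q.field%3Dz%7Dsomething%22%20AND%20_query_%3A%22%7B!type%3Ddatabase%7Dsomething%22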

On 3/2/2011 10:24 AM, Roman Chyla wrote:

Hi,

Is it possible to set local arguments for each query clause?

example:

{!type=x q.field=z}something AND {!type=database}something


I am pulling together result sets coming from two sources, Solr index
and DB engine - however I realized that local parameters apply only to
the whole query - so I don't know how to set the query to mark the
second clause as db-searchable.

Thanks,

   Roman


Re: Problem - Help me with DataImport

2011-03-02 Thread Matias Alonso
Stefan,

Thank you very much! It works perfectly...
Any ideas on the other question? Anyone?


Matias.



2011/3/2 Stefan Matheis 

> Matias,
>
> for indexing constant/static values .. try
> http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer
>
> Regards
> Stefan
>
> On Wed, Mar 2, 2011 at 2:46 PM, Matias Alonso 
> wrote:
> > Good Morning,
> >
> >
> > First, sorry for my poor english.
> >
> >
> > I'm trying to index "blogs" (RSS) into my Solr, so I'm using a
> > DataImportHandler for this.
> >
> > I can't index the date and I don't know how to index static values
> > (constants) in a field.
> >
> > When I make a "Full Import" it doesn't index the docs; if I delete the
> > date line, it works.
> >
> > When I debug with verbose it shows me the right information.
> >
> >
> > Below you can see my dataImportHandler:
> >
> >
> >
> > <dataConfig>
> >   <dataSource type="..." />
> >   <document>
> >     <entity name="..."
> >             pk="link"
> >             url="http://locademiaz.wordpress.com/feed/"
> >             processor="XPathEntityProcessor"
> >             transformer="DateFormatTransformer"
> >             forEach="/rss/channel/item">
> >       <field column="..." xpath="..."
> >              dateTimeFormat="EEE, d MMM yyyy HH:mm:ss Z" />
> >       <field ... />
> >       <field ... />
> >       <field ... />
> >       <field ... />
> >     </entity>
> >   </document>
> > </dataConfig>
> > (several element names and attributes were lost in the archive)
> >
> >
> >
> >
> >
> > I appreciate your help.
> >
> > Thank you very much.
> >
> >
> >
> > Matias.
> >
>
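
For reference, a minimal TemplateTransformer sketch for the static-value
question (the column name "source" and the constant are made up):

    <entity name="blog" ... transformer="DateFormatTransformer,TemplateTransformer">
      <field column="source" template="locademiaz" />
    </entity>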


Re: Solr Sharding and idf

2011-03-02 Thread Upayavira
As I understand it there is, and the best you can do is keep the same
number of docs per shard, and keep your documents randomised across
shards. That way you'll minimise the chances of suffering from
distributed IDF issues.

Upayavira

On Wed, 02 Mar 2011 10:10 -0500, "Jae Joo"  wrote:
> Is there still issue regarding distributed idf in sharding environment in
> Solr 1.4 or 4.0?
> If yes, any suggestions to resolve it?
> 
> Thanks,
> 
> Jae
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



Re: Solr Sharding and idf

2011-03-02 Thread Tomás Fernández Löbbe
Hi Jae, this is the Jira created for the problem of IDF on distributed
search:

https://issues.apache.org/jira/browse/SOLR-1632

It's still open

On Wed, Mar 2, 2011 at 1:48 PM, Upayavira  wrote:

> As I understand it there is, and the best you can do is keep the same
> number of docs per shard, and keep your documents randomised across
> shards. That way you'll minimise the chances of suffering from
> distributed IDF issues.
>
> Upayavira
>
> On Wed, 02 Mar 2011 10:10 -0500, "Jae Joo"  wrote:
> > Is there still issue regarding distributed idf in sharding environment in
> > Solr 1.4 or 4.0?
> > If yes, any suggestions to resolve it?
> >
> > Thanks,
> >
> > Jae
> >
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>


Efficient boolean query

2011-03-02 Thread Ofer Fort
Hey all,
I have an index with a lot of documents with the term X and no documents
with the term Y.
If I query for X it takes a few seconds and returns the results.
If I query for Y it takes a millisecond and returns an empty set.
If I query for Y AND X it takes a few seconds and returns an empty set.

I'm guessing that it evaluates both X and Y and only then tries to intersect
them?

Am I wrong? Is there another way to run this query more efficiently?

thanks for any input


Re: Boost function problem with disquerymax

2011-03-02 Thread Yonik Seeley
On Wed, Mar 2, 2011 at 11:34 AM, Gastone Penzo  wrote:
> HI,
> for search i use disquery max
> and a i want to boost a field with bf parameter like:
> ...&bf=boost_has_img^5&
> the boost_has_img field of my document is 3:
> 3
> if i see the results in debug query mode i can see:
>   0.0 = (MATCH) FunctionQuery(int(boost_has_img)), product of:
>     0.0 = int(boost_has_img)=0
>     5.0 = boost
>     0.06543833 = queryNorm
> why the score is 0 if the value is 3 and the boost is 5???

Solr thinks the value of boost_has_img is 0 for that document.
Is boost_has_img an indexed field?
If so, verify that the value is actually 3 for that specific document.
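
One quick way to inspect the indexed values (a sketch) is to facet on the
field, since facets are computed from indexed terms:

    http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=boost_has_img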


-Yonik
http://lucidimagination.com


Re: design help

2011-03-02 Thread Bill Bell
You might want to hire a consultant.

Tika can deal with Word documents. IDs need to be unique. One index might
work; not sure based on your info below.

For the database you need a Java thin-client (JDBC) driver for SQL Server.
Throw the jar in the lib directory and restart, then set up the DIH settings
to get data from the database. There are also web crawling solutions like Nutch.

I would get some data indexed and try searching it first. Lucid has a tutorial.
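
For question (a), a sketch using the ExtractingRequestHandler from the example
config (assuming the extraction contrib is enabled at /update/extract; the file
path is hypothetical, and literal.id sets the unique key to the full path):

    curl "http://localhost:8983/solr/update/extract?literal.id=/data/docs/report.doc&commit=true" \
         -F "myfile=@/data/docs/report.doc"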

Bill Bell
Sent from mobile


On Mar 2, 2011, at 4:39 AM, "Ken Foskey"  wrote:

> 
> I have read the solr book and the other book is on its way for me to read. I 
> need some help in the mean time.
> 
> a)  Using the example Solr system, how do I send a Word document into the
> system using curl? I want to have the ID be the full path of the document. I
> have tried various commands but they give me stream errors. The documents are
> on one server, Solr is on a second server. I would like to experiment at home
> indexing all the documents on the server, giving me experience outside work.
> 
> b)  I am trying to grasp a few issues with my design. I have read the books
> but I am still struggling to grasp the ideas.
> 
> -  I am using a .Net connector; which one is recommended?
> 
> -  I have an aspx webpage that connects to Solr, adds some extra information
> to the Solr query (e.g. client id, see later) and returns the data. Is this a
> normal design?
> 
> I am using a shopping trolley design.  Orders are created and then comments
> are added as the order is processed.  I want to search all the order
> information.  (This is one of many indexing requirements; then there will be
> a 'site search' facility.)
> 
> I want to index many database tables; I am proposing to use one SOLR
> instance that indexes all the data.  I will use the example TEXT field and
> copyField all the extra fields I have for the tables.  The SOLR book gave me
> short shrift on the database indexing tool; is there a good tutorial on using
> SOLR with SQL Server?  I have many tables, and I was also thinking I might
> have to join a lot; it might be easier to create an XML output and then send
> that.  What are the pros and cons?
> 
> I am thinking that I will search client = 578 and (text = keywords). This
> way I am restricting my searches to the client signed on.  Is this a good
> idea?  (It is secured behind my application, so I am adding the extras.)
> 
> I was also thinking that I would index the order notes separately from the
> order.  The other alternative is to multi-value a note field in the order
> and keep deleting it.  Which is best?  One way I have to keep replacing the
> whole order; the other I have to somehow 'join' up the notes in the search.
> (A "more like this" feature, I suppose.)
> 
> When you come out of the SOLR index, is it normal to map the ID to a URL,
> for example id="product:1234" => href="product.aspx?productId=1234",
> in code using the returned data from Solr?
> 
> Hope you can help me,
> 
> Ta
> Ken
> 
> 
> 


Re: Efficient boolean query

2011-03-02 Thread Geert-Jan Brits
If you often query X as part of several other queries (e.g: X  | X AND Y |
 X AND Z)
you might consider putting X in a filter query (
http://wiki.apache.org/solr/CommonQueryParameters#fq)

leading to:
q=*:*&fq=X
q=Y&fq=X
q=Z&fq=X

Filter queries are cached separately, which means that after the first query
involving X, X should be returned quickly.
So your FIRST query will probably still be in the 'few seconds'- range, but
all following queries involving X will return much quicker.

hth,
Geert-Jan

2011/3/2 Ofer Fort 

> Hey all,
> I have an index with a lot of documents with the term X and no documents
> with the term Y.
> If i query for X it take a few seconds and returns the results.
> If I query for Y it takes a millisecond and returns an empty set.
> If i query for Y AND X it takes a few seconds and returns an empty set.
>
> I'm guessing that it evaluate both X and Y and only then tries to intersect
> them?
>
> Am i wrong? is there another way to run this query more efficiently?
>
> thanks for any input
>


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
Thanks,
I tried it in the past and found out that my hit ratio was pretty low, so it
doesn't help most of my queries

ofer

On Wed, Mar 2, 2011 at 7:16 PM, Geert-Jan Brits  wrote:

> If you often query X as part of several other queries (e.g: X  | X AND Y |
>  X AND Z)
> you might consider putting X in a filter query (
> http://wiki.apache.org/solr/CommonQueryParameters#fq)
>
> leading to:
> q=*:*&fq=X
> q=Y&fq=X
> q=Z&fq=X
>
> Filter queries are cached seperately which means that after the first query
> involving X, X should be returned quickly.
> So your FIRST query will probably still be in the 'few seconds'- range, but
> all following queries involving X will return much quicker.
>
> hth,
> Geert-Jan
>
> 2011/3/2 Ofer Fort 
>
> > Hey all,
> > I have an index with a lot of documents with the term X and no documents
> > with the term Y.
> > If i query for X it take a few seconds and returns the results.
> > If I query for Y it takes a millisecond and returns an empty set.
> > If i query for Y AND X it takes a few seconds and returns an empty set.
> >
> > I'm guessing that it evaluate both X and Y and only then tries to
> intersect
> > them?
> >
> > Am i wrong? is there another way to run this query more efficiently?
> >
> > thanks for any input
> >
>


Re: Solr Sharding and idf

2011-03-02 Thread Jae Joo
Yes, I know that the ticket is still open. This is why I am looking for
solutions now.

2011/3/2 Tomás Fernández Löbbe 

> Hi Jae, this is the Jira created for the problem of IDF on distributed
> search:
>
> https://issues.apache.org/jira/browse/SOLR-1632
>
> It's still open
>
> On Wed, Mar 2, 2011 at 1:48 PM, Upayavira  wrote:
>
> > As I understand it there is, and the best you can do is keep the same
> > number of docs per shard, and keep your documents randomised across
> > shards. That way you'll minimise the chances of suffering from
> > distributed IDF issues.
> >
> > Upayavira
> >
> > On Wed, 02 Mar 2011 10:10 -0500, "Jae Joo"  wrote:
> > > Is there still issue regarding distributed idf in sharding environment
> in
> > > Solr 1.4 or 4.0?
> > > If yes, any suggestions to resolve it?
> > >
> > > Thanks,
> > >
> > > Jae
> > >
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
> >
>


Re: Efficient boolean query

2011-03-02 Thread Yonik Seeley
On Wed, Mar 2, 2011 at 12:11 PM, Ofer Fort  wrote:
> Hey all,
> I have an index with a lot of documents with the term X and no documents
> with the term Y.
> If i query for X it take a few seconds and returns the results.
> If I query for Y it takes a millisecond and returns an empty set.
> If i query for Y AND X it takes a few seconds and returns an empty set.

This depends on the specifics of what X is.   Some query types must
generate all hits first internally - an example is a multi-term query
(like numeric range query, etc) that matches many terms.

Can you show the generated query (i.e. add debugQuery=true to the request)?

-Yonik
http://lucidimagination.com


Re: MLT with boost

2011-03-02 Thread Mark
High level overview. We have items and we have sellers. The scoring of 
our documents is such that our boost functions outweigh the pure Lucene
term/query scoring. Our boost functions basically take into account how 
"good" the seller is.


Now for MLT searches we would like to incorporate this same sort of 
behavior.



On 3/2/11 5:51 AM, Koji Sekiguchi wrote:

(11/03/02 0:23), Mark wrote:
Is it possible to add function queries/boosts to the results that are 
by MLT? If not out of the box

how would one go about achieving this functionality?

Thanks


Beside the point, why do you need such function?
If you give us more information/background of your needs, it might 
help responders.


regards,

Koji


Re: MLT with boost

2011-03-02 Thread Mark
mlt.boost - [true/false] set if the query will be boosted by the 
interesting term relevance.


This is not the same as boost functions: 
http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29



On 3/2/11 7:45 AM, Markus Jelsma wrote:

There is a mlt.boost parameter.

On Wednesday 02 March 2011 16:28:35 dar...@ontrenet.com wrote:

I think what's being asked is obvious, in that, they want to modify the
sorted relevancy of the results of MLT. Where, instead of (or in addition
to) sorting by the mlt score, some modified function/subquery can be used
to further sort the results.

One example. You run a MLT query against a document about Shoes. You find
other documents about Shoes, but you want the relevancy modified by other
facets or fields like size, brand, style, etc. This should be done inside
the original MLT query and not after the results are retrieved.

On Wed, 02 Mar 2011 22:51:22 +0900, Koji Sekiguchi

wrote:

(11/03/02 0:23), Mark wrote:

Is it possible to add function queries/boosts to the results that are

by


MLT? If not out of the box
how would one go about achieving this functionality?

Thanks

Beside the point, why do you need such function?
If you give us more information/background of your needs, it might help
responders.

regards,

Koji


dismax query with no/empty/*:* q parameter?

2011-03-02 Thread mrw

For standard query handler fq-only queries, we used q=*:*.  However, with
dismax, that returns 0 results.  Are fq-only queries possible with dismax?  




Thanks!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/dismax-query-with-no-empty-q-parameter-tp2619170p2619170.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
You are correct that my query is a range one; probably should have mentioned
it in the first post.
this is the debug data:

(debug response; the XML tags were lost in the archive)
  status 0, QTime 4173
  params: debugQuery=on, indent=on, start=0, rows=10, version=2.2,
          q=timestamp:[2011-02-01T00:00:00Z TO NOW] AND oferiko
  rawquerystring / querystring: timestamp:[2011-02-01T00:00:00Z TO NOW] AND oferiko
  parsedquery: +timestamp:[1296518400000 TO 1299069584823] +contents:oferiko
  QParser: LuceneQParser
  timing: time 4171.0 (prepare 0.0, process 4171.0, essentially all in the
          query component)

On Wed, Mar 2, 2011 at 7:48 PM, Yonik Seeley wrote:

> On Wed, Mar 2, 2011 at 12:11 PM, Ofer Fort  wrote:
> > Hey all,
> > I have an index with a lot of documents with the term X and no documents
> > with the term Y.
> > If i query for X it take a few seconds and returns the results.
> > If I query for Y it takes a millisecond and returns an empty set.
> > If i query for Y AND X it takes a few seconds and returns an empty set.
>
> This depends on the specifics of what X is.   Some query types must
> generate all hits first internally - an example is a multi-term query
> (like numeric range query, etc) that matches many terms.
>
> Can you show the generated query (i.e. add debugQuery=true to the request)?
>
> -Yonik
> http://lucidimagination.com
>


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
timestamp is of type:

  (the fieldType definition was lost in the archive; per the follow-up it is
  a tdate-style Trie date field)


On Wed, Mar 2, 2011 at 8:11 PM, Ofer Fort  wrote:

> You are correct that my query is a range one; probably should have
> mentioned it in the first post.
> this is the debug data:
>
> (debug response as above: QTime 4173, parsedquery
> +timestamp:[1296518400000 TO 1299069584823] +contents:oferiko)
>
>
>
> On Wed, Mar 2, 2011 at 7:48 PM, Yonik Seeley 
> wrote:
>
>> On Wed, Mar 2, 2011 at 12:11 PM, Ofer Fort  wrote:
>> > Hey all,
>> > I have an index with a lot of documents with the term X and no documents
>> > with the term Y.
>> > If i query for X it take a few seconds and returns the results.
>> > If I query for Y it takes a millisecond and returns an empty set.
>> > If i query for Y AND X it takes a few seconds and returns an empty set.
>>
>> This depends on the specifics of what X is.   Some query types must
>> generate all hits first internally - an example is a multi-term query
>> (like numeric range query, etc) that matches many terms.
>>
>> Can you show the generated query (i.e. add debugQuery=true to the
>> request)?
>>
>> -Yonik
>> http://lucidimagination.com
>>
>
>


'Registering' a query / Percolation

2011-03-02 Thread Baillie, Robert
Hi,

I wondered if anyone knew if there are capabilities in Solr to
'register' a query much like Elasticsearch's 'percolation'
functionality.

I.E. Instruct Solr that you are interested in documents that match a
given query and then have Solr notify you (through whatever callback
mechanism is specified) if and when a document appears that matches the
query.

We are planning on writing some software that will effectively grind
Solr to give us the same behaviour, but if Solr has this registration
built in, it would be very useful and much easier on our resources...

Cheers
Rob Baillie

This email transmission is confidential and intended solely for the 
addressee. If you are not the intended addressee, you must not 
disclose, copy or distribute the contents of this transmission. If you 
have received this transmission in error, please notify the sender 
immediately.

http://www.sthree.com


Re: Efficient boolean query

2011-03-02 Thread Yonik Seeley
On Wed, Mar 2, 2011 at 1:58 PM, Ofer Fort  wrote:
> Thanks,
> But each query tries to see if there is something new since the last result
> that was found, so rounding things will return the same documents over  and
> over again, till we reach to the next rounded point.
>
> Could i use the document id somehow?  or something else that's bigger than
> my last search?
>
> And even it was a simple term query, on the lucene side of things, why would
> it try to fetch ALL the terms if one of the required ones resulted in an
> empty set?

In general, all items are fetched for a big multi-term query because
it's very difficult to answer the question "what's the first document
after x that matches any of the terms" without doing so.

More specifically, Lucene does do some short-circuiting for
non-matches (at least in trunk... not sure about other versions).
If you reorder your query to
oferiko AND timestamp:[2011-02-01T00:00:00Z TO NOW]

Then when there is no match on oferiko, BooleanScorer will not ask for
the scorer for the second clause.

-Yonik
http://lucidimagination.com


Re: Understanding multi-field queries with q and fq

2011-03-02 Thread mrw
Anyone understand how to do boolean logic across multiple fields?  

Dismax is nice for searching multiple fields, but doesn't necessarily
support our syntax requirements. eDismax appears not to be available until
Solr 3.1.

In the meantime, it looks like we need to support applying the user's query
to multiple fields, so if the user enters "led zeppelin merle" we need to be
able to do the logical equivalent of 

&fq=field1:(led zeppelin merle) OR field2:(led zeppelin merle)


Any ideas?  :)



mrw wrote:
> 
> After searching this list, Google, and looking through the Pugh book, I am
> a little confused about the right way to structure a query.
> 
> The Packt book uses the example of the MusicBrainz DB full of song
> metadata.  What if they also had the song lyrics in English and German as
> files on disk, and wanted to index them along with the metadata, so that
> each document would basically have song title, artist, publisher, date,
> ..., All_Metadata (copy field of all metadata fields), Text_English, and
> Text_German fields?  
> 
> There can only be one default field, correct?  So if we want to search for
> all songs containing (zeppelin AND (dog OR merle)) do we 
> 
> repeat the entire query text for all three major fields in the 'q' clause
> (assuming we don't want to use the cache):
> 
> q=(+All_Metadata:zeppelin AND (dog OR merle)+Text_English:zeppelin AND
> (dog OR merle)+Text_German:(zeppelin AND (dog OR merle))
> 
> or repeat the entire query text for all three major fields in the 'fq'
> clause (assuming we want to use the cache):
> 
> q=*:*&fq=(+All_Metadata:zeppelin AND (dog OR merle)+Text_English:zeppelin
> AND (dog OR merle)+Text_German:zeppelin AND (dog OR merle))
> 
> ?
> 
> Thanks!
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Understanding-multi-field-queries-with-q-and-fq-tp2528866p2619700.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Efficient boolean query

2011-03-02 Thread Yonik Seeley
On Wed, Mar 2, 2011 at 2:43 PM, Ofer Fort  wrote:
> I didn't see this behavior, running solr 1.4.1, was that implemented
> after this release?

I think so.
It's implemented now in BooleanWeight.scorer()

    for (Weight w : weights) {
      BooleanClause c = cIter.next();
      Scorer subScorer = w.scorer(context, ScorerContext.def());
      if (subScorer == null) {
        if (c.isRequired()) {
          return null;
        }

And TermWeight returns null from scorer() if there are no matches for
the segment.

-Yonik
http://lucidimagination.com


Re: dismax query with no/empty/*:* q parameter?

2011-03-02 Thread Chris Hostetter

: For standard query handler fq-only queries, we used q=*:*.  However, with
: dismax, that returns 0 results.  Are fq-only queries possible with dismax?  

they are if you use the q.alt param.

http://wiki.apache.org/solr/DisMaxQParserPlugin#q.alt
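
For example (a sketch; the filter field is made up):

    http://localhost:8983/solr/select?defType=dismax&q.alt=*:*&fq=type:Mammal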

-Hoss


Formatting the XML returned

2011-03-02 Thread Brian Lamb
Hi all,

This list has proven itself quite useful since I got started with Solr. I'm
wondering if it is possible to dictate the XML that is returned by a search?
Right now it seems very inefficient in that it is formatted like:

<str name="field1">Val</str>
<str name="field2">Val</str>

Etc.

I would like to change it so that it reads something like:

<field1>Val</field1>
<field2>Val</field2>

Is this possible? If so, how?

Thanks,

Brian Lamb


sort by price puts unknown prices first

2011-03-02 Thread Scott K
When I sort by price ascending, documents with no price are listed
first. I would like them listed last. I tried adding the
sortMissingLast flag, even though it says it is only for strings, but
it did not help. Why doesn't sortMissingLast work on non-strings? This
seems like a very common issue, but I couldn't find any solutions when
I searched this group and on google.

The map function almost works, but if I use this, then prices of 0 are
treated as null, which is not what I want.
sort=map(price,0,0,99)+asc

schema.xml:

   (the price field's schema definition was lost in the archive)

Thanks, Scott


Re: Formatting the XML returned

2011-03-02 Thread Markus Jelsma
If you're confortable with XSL you can create a transformer and use Solr's 
XSLTResponseWriter to do the job.

http://wiki.apache.org/solr/XsltResponseWriter
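
A minimal sketch (assuming single-valued fields; save it as conf/xslt/fields.xsl
and query with &wt=xslt&tr=fields.xsl):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- rename each field like <str name="title">Val</str> to <title>Val</title> -->
  <xsl:template match="/response">
    <docs>
      <xsl:for-each select="result/doc">
        <doc>
          <xsl:for-each select="*">
            <xsl:element name="{@name}"><xsl:value-of select="."/></xsl:element>
          </xsl:for-each>
        </doc>
      </xsl:for-each>
    </docs>
  </xsl:template>
</xsl:stylesheet>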
> Hi all,
> 
> This list has proven itself quite useful since I got started with Solr. I'm
> wondering if it is possible to dictate the XML that is returned by a
> search? Right now it seems very inefficient in that it is formatted like:
> 
> Val
> Val
> 
> Etc.
> 
> I would like to change it so that it reads something like:
> 
> Val
> Val
> 
> Is this possible? If so, how?
> 
> Thanks,
> 
> Brian Lamb


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
Thanks,
But each query tries to see if there is something new since the last result
that was found, so rounding things will return the same documents over and
over again, until we reach the next rounded point.

Could I use the document id somehow? Or something else that's bigger than
my last search?

And even if it was a simple term query, on the Lucene side of things, why would
it try to fetch ALL the terms if one of the required ones resulted in an
empty set?

thanks for your help, specifically on this matter and in general, to the
search community :-)

On Wed, Mar 2, 2011 at 8:35 PM, Yonik Seeley wrote:

> One way to speed things up would be to reduce the resolution on
> timestamps that you index.
> Another way would be to decrease the precisionStep on the tdate field
> type (bigger index, but faster range queries)
> Yet another way is to use "fq" filters that can be reused many times.
>
> One way to increase fq reuse is to round.
> This rounds up to the nearest hour... assumes 2011-02-01T00:00:00Z is
> the same across many queries.
> fq=timestamp:[2011-02-01T00:00:00Z TO NOW/HOUR+1HOUR]
>
> Another way is to split the filter into two parts - a large part that
> doesn't change much + a small part that does.
> Again this assumes that the first endpoint is reused across many queries.
> fq=timestamp:[2011-02-01T00:00:00Z TO
> NOW/HOUR+1HOUR]&fq=timestamp:[NOW/HOUR TO NOW]
>
> If the first endpoint is *not* reused across many queries, then you
> can still use the same strategy as above by adding another small "fq"
> for the lower endpoint.
>
> -Yonik
> http://lucidimagination.com
>
>
>
> On Wed, Mar 2, 2011 at 1:11 PM, Ofer Fort  wrote:
> > you are correct that my query is a range one, probably should have
> mentioned
> > it in the first post.
> > this is the debug data:
> >
> >
> > (debug response as above: QTime 4173, parsedquery
> > +timestamp:[1296518400000 TO 1299069584823] +contents:oferiko)
> >
> >
> > On Wed, Mar 2, 2011 at 7:48 PM, Yonik Seeley  >wrote:
> >
> >> On Wed, Mar 2, 2011 at 12:11 PM, Ofer Fort  wrote:
> >> > Hey all,
> >> > I have an index with a lot of documents with the term X and no
> documents
> >> > with the term Y.
> >> > If i query for X it take a few seconds and returns the results.
> >> > If I query for Y it takes a millisecond and returns an empty set.
> >> > If i query for Y AND X it takes a few seconds and returns an empty
> set.
> >>
> >> This depends on the specifics of what X is.   Some query types must
> >> generate all hits first internally - an example is a multi-term query
> >> (like numeric range query, etc) that matches many terms.
> >>
> >> Can you show the generated query (i.e. add debugQuery=true to the
> request)?
> >>
> >> -Yonik
> >> http://lucidimagination.com
> >>
> >
>


memory leak during undeploying

2011-03-02 Thread Search Learn
Hello,
We currently deploy and undeploy the Solr web application potentially hundreds
of times during a typical day. When Solr is undeployed, its classes are
not getting unloaded, and eventually we run into a permgen error.
There are a couple of JIRAs related to this:
https://issues.apache.org/jira/browse/LUCENE-2237,
https://issues.apache.org/jira/browse/SOLR-1735. Even after applying these
patches, the issue still remains.
Does anybody have any suggestions for this?

Our environment:
 Apache Tomcat/6.0.29
 Solr 1.4.1

Thanks,
rajesh


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
I'm guessing what I was describing is short-circuit evaluation, and I see
that Lucene doesn't have it:
http://lucene.472066.n3.nabble.com/Short-circuit-in-query-td738551.html

Still would love to hear any suggestions for my type of query

ofer

On Wed, Mar 2, 2011 at 8:58 PM, Ofer Fort  wrote:

> Thanks,
> But each query tries to see if there is something new since the last result
> that was found, so rounding things will return the same documents over  and
> over again, till we reach to the next rounded point.
>
> Could i use the document id somehow?  or something else that's bigger than
> my last search?
>
> And even it was a simple term query, on the lucene side of things, why
> would it try to fetch ALL the terms if one of the required ones resulted in
> an empty set?
>
> thanks for your help, specifically on this matter and in general, to the
> search community :-)
>
>
> On Wed, Mar 2, 2011 at 8:35 PM, Yonik Seeley 
> wrote:
>
>> One way to speed things up would be to reduce the resolution on
>> timestamps that you index.
>> Another way would be to decrease the precisionStep on the tdate field
>> type (bigger index, but faster range queries)
>> Yet another way is to use "fq" filters that can be reused many times.
>>
>> One way to increase fq reuse is to round.
>> This rounds up to the nearest hour... assumes 2011-02-01T00:00:00Z is
>> the same across many queries.
>> fq=timestamp:[2011-02-01T00:00:00Z TO NOW/HOUR+1HOUR]
>>
>> Another way is to split the filter into two parts - a large part that
>> doesn't change much + a small part that does.
>> Again this assumes that the first endpoint is reused across many queries.
>> fq=timestamp:[2011-02-01T00:00:00Z TO
>> NOW/HOUR+1HOUR]&fq=timestamp:[NOW/HOUR TO NOW]
>>
>> If the first endpoint is *not* reused across many queries, then you
>> can still use the same strategy as above by adding another small "fq"
>> for the lower endpoint.
>>
>> -Yonik
>> http://lucidimagination.com
>>
>>
>>
>> On Wed, Mar 2, 2011 at 1:11 PM, Ofer Fort  wrote:
>> > you are correct that my query is a range one, probably should have
>> mentioned
>> > it in the first post.
>> > this is the debug data:
>> >
>> > (debug response as above: QTime 4173, parsedquery
>> > +timestamp:[1296518400000 TO 1299069584823] +contents:oferiko)
>> >
>> >
>> > On Wed, Mar 2, 2011 at 7:48 PM, Yonik Seeley <
>> yo...@lucidimagination.com>wrote:
>> >
>> >> On Wed, Mar 2, 2011 at 12:11 PM, Ofer Fort  wrote:
>> >> > Hey all,
>> >> > I have an index with a lot of documents with the term X and no
>> documents
>> >> > with the term Y.
>> >> > If i query for X it take a few seconds and returns the results.
>> >> > If I query for Y it takes a millisecond and returns an empty set.
>> >> > If i query for Y AND X it takes a few seconds and returns an empty
>> set.
>> >>
>> >> This depends on the specifics of what X is.   Some query types must
>> >> generate all hits first internally - an example is a multi-term query
>> >> (like numeric range query, etc) that matches many terms.
>> >>
>> >> Can you show the generated query (i.e. add debugQuery=true to the
>> request)?
>> >>
>> >> -Yonik
>> >> http://lucidimagination.com
>> >>
>> >
>>
>
>


Re: Understanding multi-field queries with q and fq

2011-03-02 Thread Sujit Pal
This could probably be done using a custom QParser plugin?

Define the pattern like this:

String queryTemplate = "title:%Q%^2.0 body:%Q%";

then replace the %Q% with the value of the Q param, send it through
QueryParser.parse() and return the query.

-sujit
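
A rough sketch against the Solr 1.4 plugin API (the class name and template are
made up, and the user input is substituted naively; real code would need to
escape it):

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

public class TemplateQParserPlugin extends QParserPlugin {
  public void init(NamedList args) {}

  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      public Query parse() throws ParseException {
        // expand the user's query string into every field we want searched
        String expanded = "title:(%Q%)^2.0 body:(%Q%)".replace("%Q%", getString());
        // delegate the expanded string to the standard lucene parser
        return subQuery(expanded, "lucene").getQuery();
      }
    };
  }
}

It would be registered in solrconfig.xml with something like
<queryParser name="tmpl" class="TemplateQParserPlugin"/> and selected per
request via defType=tmpl.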

On Wed, 2011-03-02 at 11:28 -0800, mrw wrote:
> Anyone understand how to do boolean logic across multiple fields?  
> 
> Dismax is nice for searching multiple fields, but doesn't necessarily
> support our syntax requirements. eDismax appears to be not available until
> Solr 3.1.   
> 
> In the meantime, it looks like we need to support applying the user's query
> to multiple fields, so if the user enters "led zeppelin merle" we need to be
> able to do the logical equivalent of 
> 
> &fq=field1:led zeppelin merle OR field2:led zeppelin merle
> 
> 
> Any ideas?  :)
> 
> 
> 
> mrw wrote:
> > 
> > After searching this list, Google, and looking through the Pugh book, I am
> > a little confused about the right way to structure a query.
> > 
> > The Packt book uses the example of the MusicBrainz DB full of song
> > metadata.  What if they also had the song lyrics in English and German as
> > files on disk, and wanted to index them along with the metadata, so that
> > each document would basically have song title, artist, publisher, date,
> > ..., All_Metadata (copy field of all metadata fields), Text_English, and
> > Text_German fields?  
> > 
> > There can only be one default field, correct?  So if we want to search for
> > all songs containing (zeppelin AND (dog OR merle)) do we 
> > 
> > repeat the entire query text for all three major fields in the 'q' clause
> > (assuming we don't want to use the cache):
> > 
> > q=(+All_Metadata:zeppelin AND (dog OR merle)+Text_English:zeppelin AND
> > (dog OR merle)+Text_German:(zeppelin AND (dog OR merle))
> > 
> > or repeat the entire query text for all three major fields in the 'fq'
> > clause (assuming we want to use the cache):
> > 
> > q=*:*&fq=(+All_Metadata:zeppelin AND (dog OR merle)+Text_English:zeppelin
> > AND (dog OR merle)+Text_German:zeppelin AND (dog OR merle))
> > 
> > ?
> > 
> > Thanks!
> > 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Understanding-multi-field-queries-with-q-and-fq-tp2528866p2619700.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: multi-core solr, specifying the data directory

2011-03-02 Thread Mike Sokolov
Yes - I commented out the <dataDir> element in solrconfig.xml and then
got the expected behavior: the core used a data subdirectory in the core 
subdirectory.


It seems like the problem arises from using the solrconfig.xml that's 
distributed as example/solr/conf/solrconfig.xml


The solrconfig.xml's in example/multicore/ don't have the <dataDir>
element.


-Mike

On 03/01/2011 08:24 PM, Chris Hostetter wrote:

:
: <dataDir>${solr.data.dir:./solr/data}</dataDir>

that directive says "use the solr.data.dir system property to pick a path;
if it is not set, use ./solr/data" (relative to the CWD)

if you want it to use the default, then you need to eliminate it
completely, or you need to change it to the empty string...

<dataDir>${solr.data.dir:}</dataDir>

or...

<dataDir/>


-Hoss
   


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
I didn't see this behavior, running solr 1.4.1, was that implemented
after this release?

On Wednesday, March 2, 2011, Yonik Seeley  wrote:
> On Wed, Mar 2, 2011 at 1:58 PM, Ofer Fort  wrote:
>> Thanks,
>> But each query tries to see if there is something new since the last result
>> that was found, so rounding things will return the same documents over  and
>> over again, till we reach to the next rounded point.
>>
>> Could i use the document id somehow?  or something else that's bigger than
>> my last search?
>>
>> And even it was a simple term query, on the lucene side of things, why would
>> it try to fetch ALL the terms if one of the required ones resulted in an
>> empty set?
>
> In general, all items are fetched for a big multi-term query because
> it's very difficult to answer the question "what's the first document
> after x that matches any of the terms" without doing so.
>
> More specifically, Lucene does do some short-circuiting for
> non-matches (at least in trunk... not sure about other versions).
> If you reorder your query to
> oferiko AND timestamp:[2011-02-01T00:00:00Z TO NOW]
>
> Then when there is no match on oferiko, BooleanScorer will not ask for
> the scorer for the second clause.
>
> -Yonik
> http://lucidimagination.com
>


Re: sort by price puts unknown prices first

2011-03-02 Thread Chris Hostetter

: When I sort by price ascending, documents with no price are listed
: first. I would like them listed last. I tried adding the
: sortMissingLast flag, even though it says it is only for strings, but

it works for any field type *backed* by a string, including the
SortableIntField (and its brethren)

: it did not help. Why doesn't sortMissingLast work on non-strings? This
: seems like a very common issue, but I couldn't find any solutions when
: I searched this group and on google.

historically it has been because of a fundamental limitation in how the
Lucene FieldCache has worked, where the array-backed FieldCaches
use the default numeric value (ie: 0) when docs have no value (but in the
case of Strings, the default is "null", which is easy to test for)

I am 99.99% certain this has changed on trunk, so all of the
Trie*Fields should support the sortMissing* options in 4.x
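
So on 1.4, a string-backed sketch for the price case might be (using the
sortable types from the example schema; a reindex is required):

    <fieldType name="sfloat" class="solr.SortableFloatField"
               sortMissingLast="true" omitNorms="true"/>
    <field name="price" type="sfloat" indexed="true" stored="true"/>

With sort=price+asc, documents without a price should then sort last.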


-Hoss


Re: memory leak during undeploying

2011-03-02 Thread François Schiettecatte
Hi

I get the same problem on tomcat with other applications, so this does not 
appear to be limited to SOLR. I got the error on tomcat 6 and 7. The only 
solution I found was to kill tomcat and start it again.

François

On Mar 2, 2011, at 2:28 PM, Search Learn wrote:

> Hello,
> We currently deploy and undeploy the Solr web application potentially hundreds
> of times during a typical day. When Solr is undeployed, its classes are
> not getting unloaded, and eventually we run into a permgen error.
> There are a couple of JIRAs related to this:
> https://issues.apache.org/jira/browse/LUCENE-2237,
> https://issues.apache.org/jira/browse/SOLR-1735. Even after applying these
> patches, the issue still remains.
> Does anybody have any suggestions for this?
> 
> Our environment:
> Apache Tomcat/6.0.29
> Solr 1.4.1
> 
> Thanks,
> rajesh


Re: dismax query with no/empty/*:* q parameter?

2011-03-02 Thread mrw

Ah...so I need to be doing

&q.alt=*:*
&fq=field:value

Of course, now that you showed me what to look for, I also see the
explanation in the Packt book.  Sheesh.

Thanks for the tip!


Chris Hostetter-3 wrote:
> 
> : For standard query handler fq-only queries, we used q=*:*.  However,
> with
> : dismax, that returns 0 results.  Are fq-only queries possible with
> dismax?  
> 
> they are if you use the q.alt param.
> 
> http://wiki.apache.org/solr/DisMaxQParserPlugin#q.alt
> 
> -Hoss
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/dismax-query-with-no-empty-q-parameter-tp2619170p2620158.html
Sent from the Solr - User mailing list archive at Nabble.com.
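
Spelled out as a full request, the combination looks something like this (host, core and filter field are placeholders):

  http://localhost:8983/solr/select?defType=dismax&q.alt=*:*&fq=category:books&rows=10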


Re: multi-core solr, specifying the data directory

2011-03-02 Thread Jonathan Rochkind
Meanwhile, I'm having trouble getting the expected behavior at all. I'll 
try to give the right details (without overwhelming with too many), if 
anyone can see what's going on.


Solr 1.4.1. Multi-core. 'Main' solr home with solr.xml at 
/opt/solr/solr_indexer/solr.xml


The solr.xml actually includes only one core, let's start out nice and
simple:

<core name="master_prod" instanceDir="master_prod">
  <property name="enable.master" value="true" />
</core>

[The enable.master thing is a custom property my solrconfig.xml uses in 
places unrelated to dataDir]


1. First try: the solrconfig at
/opt/solr/solr_indexer/master_prod/conf/solrconfig.xml includes NO
dataDir element at all.


WOAH. It just worked. Go figure. I don't know what I tried differently
before, maybe Mike is right that people (including me) get confused by
the <dataDir> element being there, and needing to delete it entirely to
get that default behavior.


So anyway, yeah. Sorry, and thanks - it appears to be working, although
it is possibly confusing for the newbie to set up, for reasons that aren't
entirely clear, since several of us in this thread had trouble getting
it right.


On 3/2/2011 2:42 PM, Mike Sokolov wrote:

Yes - I commented out the <dataDir> element in solrconfig.xml and then
got the expected behavior: the core used a data subdirectory in the core
subdirectory.

It seems like the problem arises from using the solrconfig.xml that's
distributed as example/solr/conf/solrconfig.xml

The solrconfig.xml's in example/multicore/ don't have the <dataDir>
element.

-Mike

On 03/01/2011 08:24 PM, Chris Hostetter wrote:

: <dataDir>${solr.data.dir:./solr/data}</dataDir>

that directive says "use the solr.data.dir system property to pick a path,
if it is not set, use "./solr/data" (relative to the CWD)

if you want it to use the default, then you need to eliminate it
completely, or you need to change it to the empty string...

 <dataDir>${solr.data.dir:}</dataDir>

or...

 <dataDir></dataDir>


-Hoss



Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
That's great, just what I needed - I was debugging and was expecting to
see something like this.
I'll look through the SVN history to see in which version it was added.
Thanks

On Wednesday, March 2, 2011, Yonik Seeley  wrote:
> On Wed, Mar 2, 2011 at 2:43 PM, Ofer Fort  wrote:
>> I didn't see this behavior, running solr 1.4.1, was that implemented
>> after this release?
>
> I think so.
> It's implemented now in BooleanWeight.scorer()
>
>       for (Weight w  : weights) {
>         BooleanClause c =  cIter.next();
>         Scorer subScorer = w.scorer(context, ScorerContext.def());
>         if (subScorer == null) {
>           if (c.isRequired()) {
>             return null;
>           }
>
> And TermWeight returns null from scorer() if there are no matches for
> the segment.
>
> -Yonik
> http://lucidimagination.com
>


Re: multi-core solr, specifying the data directory

2011-03-02 Thread Jonathan Rochkind
I wonder if what doesn't work is trying to set an explicit relative path
there, instead of using the baked-in default "data".  If you set an
explicit relative path, is it relative to the current core's solr.home, or
to the main solr.home?


Let's try it to see... Yep, THAT's what doesn't work, and probably what
I was trying to do before.


In solrconfig.xml for a core, I do <dataDir>data</dataDir>.

I expected that would be interpreted relative to the current core's solr.home,
but judging by the log files, it is instead based on the 'main'
solr.home (above the cores, where the solr.xml is) -- or maybe even on
some other value, the Tomcat base directory or something?


Is _that_ a bug?

On 3/2/2011 3:38 PM, Jonathan Rochkind wrote:

Meanwhile, I'm having trouble getting the expected behavior at all. I'll
try to give the right details (without overwhelming with too many), if
anyone can see what's going on.

Solr 1.4.1. Multi-core. 'Main' solr home with solr.xml at
/opt/solr/solr_indexer/solr.xml

The solr.xml actually includes only one core, let's start out nice and
simple:

<core name="master_prod" instanceDir="master_prod">
  <property name="enable.master" value="true" />
</core>

[The enable.master thing is a custom property my solrconfig.xml uses in
places unrelated to dataDir]

1. First try: the solrconfig at
/opt/solr/solr_indexer/master_prod/conf/solrconfig.xml includes NO
dataDir element at all.

WOAH. It just worked. Go figure. I don't know what I tried differently
before, maybe Mike is right that people (including me) get confused by
the <dataDir> element being there, and needing to delete it entirely to
get that default behavior.

So anyway, yeah. Sorry, and thanks - it appears to be working, although
it is possibly confusing for the newbie to set up, for reasons that aren't
entirely clear, since several of us in this thread had trouble getting
it right.

On 3/2/2011 2:42 PM, Mike Sokolov wrote:

Yes - I commented out the <dataDir> element in solrconfig.xml and then
got the expected behavior: the core used a data subdirectory in the core
subdirectory.

It seems like the problem arises from using the solrconfig.xml that's
distributed as example/solr/conf/solrconfig.xml

The solrconfig.xml's in example/multicore/ don't have the <dataDir>
element.

-Mike

On 03/01/2011 08:24 PM, Chris Hostetter wrote:

: <dataDir>${solr.data.dir:./solr/data}</dataDir>

that directive says "use the solr.data.dir system property to pick a path,
if it is not set, use "./solr/data" (relative to the CWD)

if you want it to use the default, then you need to eliminate it
completely, or you need to change it to the empty string...

  <dataDir>${solr.data.dir:}</dataDir>

or...

  <dataDir></dataDir>


-Hoss



Re: memory leak during undeploying

2011-03-02 Thread Markus Jelsma
Hi, 

I remember reading somewhere that undeploying an application in Tomcat won't 
release memory, thus repeating the cycle will indeed exhaust the permgen. You 
could enable garbage collection of the permgen.

HotSpot can do this for you, but it depends on using CMS, which you might not
want to use at all if you're running a small server with little memory. Anyway,
here are the options for enabling it:

Modern JVM:
-XX:+CMSClassUnloadingEnabled

If you're running an older JVM try:
-XX:+CMSPermGenSweepingEnabled

I just wish I could find the documentation on this one, but it seems to have
left the internet. This is taken from my wiki, so it seems I read it at least
once and made a note.

Please report the results.

Cheers,

> Hello,
> We currently deploy and undeploy the Solr web application potentially hundreds
> of times during a typical day. When Solr is undeployed, its classes are
> not getting unloaded and eventually we are running into a permgen error.
> There are a couple of JIRAs related to this:
> https://issues.apache.org/jira/browse/LUCENE-2237,
> https://issues.apache.org/jira/browse/SOLR-1735. Even after applying these
> patches, the issue still remains.
> Does anybody have any suggestions for this?
> 
> Our environment:
>  Apache Tomcat/6.0.29
>  Solr 1.4.1
> 
> Thanks,
> rajesh
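
Combined into one Tomcat startup setting, the above might look like this (the CMS collector itself has to be enabled for the unloading flag to matter; the permgen size is a placeholder):

  JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256m"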


Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Stefan Matheis

Hi List,

given the fact that my Java knowledge is sort of non-existent .. my
idea was to rework the Solr Admin Interface.


Compared to CouchDB's Futon or the MongoDB Admin-Utils .. not that fancy,
but it was an idea a few weeks ago - and I would like to contribute
something, a thing which has to be non-Java but not useless - hopefully ;)


Actually it's completely work-in-progress .. but I'm interested in what
you guys think. Right direction? Completely wrong, just drop it?


http://files.mathe.is/solr-admin/01_dashboard.png
http://files.mathe.is/solr-admin/02_query.png
http://files.mathe.is/solr-admin/03_schema.png
http://files.mathe.is/solr-admin/04_analysis.png
http://files.mathe.is/solr-admin/05_plugins.png

It's actually using one index.jsp to generate the basic frame, including
cores and their navigation. Everything else is loaded via the existing
SolrAdminHandler.


Any questions, ideas, thoughts out there? Please let me know :)

Regards
Stefan


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Markus Jelsma
Nice! It makes multi-core navigation a lot easier. What license do the icons 
have?

> Hi List,
> 
> given the fact that my Java knowledge is sort of non-existent .. my
> idea was to rework the Solr Admin Interface.
> 
> Compared to CouchDB's Futon or the MongoDB Admin-Utils .. not that fancy,
> but it was an idea a few weeks ago - and I would like to contribute
> something, a thing which has to be non-Java but not useless - hopefully ;)
> 
> Actually it's completely work-in-progress .. but I'm interested in what
> you guys think. Right direction? Completely wrong, just drop it?
> 
> http://files.mathe.is/solr-admin/01_dashboard.png
> http://files.mathe.is/solr-admin/02_query.png
> http://files.mathe.is/solr-admin/03_schema.png
> http://files.mathe.is/solr-admin/04_analysis.png
> http://files.mathe.is/solr-admin/05_plugins.png
> 
> It's actually using one index.jsp to generate the basic frame, including
> cores and their navigation. Everything else is loaded via the existing
> SolrAdminHandler.
> 
> Any questions, ideas, thoughts out there? Please let me know :)
> 
> Regards
> Stefan


Re: multi-core solr, specifying the data directory

2011-03-02 Thread Chris Hostetter

: I wonder if what doesn't work is trying to set an explicit relative path
: there, instead of using the baked-in default "data".  If you set an explicit
: relative path, is it relative to the current core's solr.home, or to the main
: solr.home?

it's relative to the current working dir of the process.

: Is _that_ a bug?

it's more an artifact of evolution -- it's just always worked that way, 
and changing it now would be a pretty nasty break for existing users.

if people are interested in helping to fix this, my vote would be to see a
patch that lets you include an optional 'rel' attribute on the dataDir to
say what you want the path to be relative to...

  <dataDir rel="cwd">foo</dataDir>
  <dataDir rel="solr-home">foo</dataDir>
  <dataDir rel="instance-dir">foo</dataDir>

typically people either use the default, or want a specific path outside
of the solr home altogether (i.e. on a completely different partition) and
just use an absolute path, so it doesn't wind up being a big burden.

The biggest issue was the one that's already been mentioned: for a
while the main example solrconfig.xml was using...

  <dataDir>${solr.data.dir:./data}</dataDir>

...instead of...

  <dataDir>${solr.data.dir:}</dataDir>

...meaning that the dataDir defaulted to "./data" (relative to the CWD)
instead of "" (which tells Solr to use its own default in the
instanceDir) - but this has been fixed for 3.x


-Hoss
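
For the absolute-path case described above, the element is simply (the partition path is made up):

  <dataDir>/mnt/data1/solr/master_prod/data</dataDir>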


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Stefan Matheis

Hey Markus,

actually it's CC BY 3.0 - Yusuke Kamiyamane created the "Fugue Icons" 
(http://p.yusukekamiyamane.com/)


Regards
Stefan

On 02.03.2011 21:46, Markus Jelsma wrote:

Nice! It makes multi-core navigation a lot easier. What license do the icons
have?


Hi List,

given the fact that my Java knowledge is sort of non-existent .. my
idea was to rework the Solr Admin Interface.

Compared to CouchDB's Futon or the MongoDB Admin-Utils .. not that fancy,
but it was an idea a few weeks ago - and I would like to contribute
something, a thing which has to be non-Java but not useless - hopefully ;)

Actually it's completely work-in-progress .. but I'm interested in what
you guys think. Right direction? Completely wrong, just drop it?

http://files.mathe.is/solr-admin/01_dashboard.png
http://files.mathe.is/solr-admin/02_query.png
http://files.mathe.is/solr-admin/03_schema.png
http://files.mathe.is/solr-admin/04_analysis.png
http://files.mathe.is/solr-admin/05_plugins.png

It's actually using one index.jsp to generate the basic frame, including
cores and their navigation. Everything else is loaded via the existing
SolrAdminHandler.

Any questions, ideas, thoughts out there? Please let me know :)

Regards
Stefan


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Robert Muir
On Wed, Mar 2, 2011 at 3:47 PM, Stefan Matheis
 wrote:

> Any questions, ideas, thoughts out there? Please let me know :)
>

My only question would be: would you mind creating a JIRA issue with
your modifications?

I was just yesterday looking at this admin stuff and thinking, man
this could really use a facelift...


Re: sort by price puts unknown prices first

2011-03-02 Thread Scott K
On Wed, Mar 2, 2011 at 12:21, Chris Hostetter  wrote:
> historically it has been because of a fundamental limitation in how the
> Lucene FieldCache has worked, where the array-backed FieldCaches
> use the default numeric value (i.e. 0) when docs have no value (but in the
> case of Strings, the default is "null", which is easy to test for)
>
> I am 99.99% certain this has changed on the trunk, so all of the
> Trie*Fields should support the sortMissing* options in 4.x

I am running a 4.x build and just tried the most recent nightly build,
apache-solr-4.0-2011-03-02_08-06-07.tgz, and am still seeing this
issue.

Other than creating a new indexed field where I manually map no value
to a high number, is there a way to sort on a function query that puts
undefined values to the end? Is there no way to use map to change
undefined values?

Thanks, Scott
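
One possible sketch of the function-query route on 4.x, untested here: missing numerics read back as 0 from the FieldCache, so mapping 0 to a large value pushes those docs to the end - at the cost of also remapping genuinely zero prices:

  sort=map(price,0,0,999999) asc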


Re: memory leak during undeploying

2011-03-02 Thread Search Learn
Thanks for the suggestions. Tomcat does release permgen memory with
appropriate JVM options and configuration settings
(clearReferencesStopTimerThreads, clearReferencesThreadLocals).
When I did heap analysis, the culprit always seems to
be the TimeLimitedCollector thread. Because of this, a considerable number of
classes are not getting unloaded.
More description of problem (and how to fix it) can be found at:
http://blogs.sun.com/fkieviet/entry/classloader_leaks_the_dreaded_java,
http://blogs.sun.com/fkieviet/entry/how_to_fix_the_dreaded

Has anybody successfully unloaded classes from permgen so far?

rajesh


On Wed, Mar 2, 2011 at 1:39 PM, Markus Jelsma wrote:

> Hi,
>
> I remember reading somewhere that undeploying an application in Tomcat
> won't release memory, thus repeating the cycle will indeed exhaust the
> permgen. You could enable garbage collection of the permgen.
>
> HotSpot can do this for you, but it depends on using CMS, which you might not
> want to use at all if you're running a small server with little memory.
> Anyway,
> here are the options for enabling it:
>
> Modern JVM:
> -XX:+CMSClassUnloadingEnabled
>
> If you're running an older JVM try:
> -XX:+CMSPermGenSweepingEnabled
>
> I just wish I could find the documentation on this one, but it seems to have
> left the internet. This is taken from my wiki, so it seems I read it at least
> once and made a note.
>
> Please report the results
>
> Cheers,
>
> > Hello,
> > We currently deploy and undeploy the Solr web application potentially
> > hundreds of times during a typical day. When Solr is undeployed, its
> > classes are not getting unloaded and eventually we are running into a
> > permgen error.
> > There are a couple of JIRAs related to this:
> > https://issues.apache.org/jira/browse/LUCENE-2237,
> > https://issues.apache.org/jira/browse/SOLR-1735. Even after applying
> > these patches, the issue still remains.
> > Does anybody have any suggestions for this?
> >
> > Our environment:
> >  Apache Tomcat/6.0.29
> >  Solr 1.4.1
> >
> > Thanks,
> > rajesh
>


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread mrw

Looks nice.

It might also be worth creating a page with a large query field for pasting
in complete URL-encoded queries that cross cores, etc.  I did that at work
(via ASP.NET) so we could paste in queries from logs and debug them.  We
tend to use that quite a bit.


Cheers


Stefan Matheis wrote:
> 
> Hi List,
> 
> given the fact that my Java knowledge is sort of non-existent .. my
> idea was to rework the Solr Admin Interface.
> 
> Compared to CouchDB's Futon or the MongoDB Admin-Utils .. not that fancy,
> but it was an idea a few weeks ago - and I would like to contribute
> something, a thing which has to be non-Java but not useless - hopefully ;)
> 
> Actually it's completely work-in-progress .. but I'm interested in what
> you guys think. Right direction? Completely wrong, just drop it?
> 
> http://files.mathe.is/solr-admin/01_dashboard.png
> http://files.mathe.is/solr-admin/02_query.png
> http://files.mathe.is/solr-admin/03_schema.png
> http://files.mathe.is/solr-admin/04_analysis.png
> http://files.mathe.is/solr-admin/05_plugins.png
> 
> It's actually using one index.jsp to generate the basic frame, including
> cores and their navigation. Everything else is loaded via the existing
> SolrAdminHandler.
> 
> Any questions, ideas, thoughts out there? Please let me know :)
> 
> Regards
> Stefan
> 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Admin-Interface-reworked-Go-on-Go-away-tp2620365p2620745.html
Sent from the Solr - User mailing list archive at Nabble.com.


Dismax, q, q.alt, and defaultSearchField?

2011-03-02 Thread mrw
We have two banks of Solr nodes with identical schemas.  The data I'm
searching for is in both banks.

One has defaultSearchField set to field1, the other has defaultSearchField
set to field2.

We need to support both user queries and facet queries that have no user
content.  For the latter, it appears I need to use q.alt=*:*, so I am
investigating also using q.alt for user content (e.g., q.alt=banana).

I run the following query:

q.alt=banana
&defType=dismax
&mm=1
&tie=0.1
&qf=field1+field2


On bank one, I get the expected results, but on bank two, I get 0 results.

I noticed (via debugQuery=true) that when I use q.alt, it resolves against
the defaultSearchField (e.g., field1:banana), not the value of the qf param.
Therefore, I get different results.

If I switched to using q for user queries and q.alt for facet queries, I
would still get different results, because q would resolve against the
fields in the qf param, and q.alt would resolve against the default search
field.

Is there a way to override this behavior in order to get consistent results?

Thanks!






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-q-q-alt-and-defaultSearchField-tp2621061p2621061.html
Sent from the Solr - User mailing list archive at Nabble.com.
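
One thing that might be worth trying, though untested here: q.alt accepts local params, so the alternate query could be forced through dismax with an explicit qf instead of falling back to the defaultSearchField:

  q.alt={!dismax qf='field1 field2' mm=1}banana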


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Chris Hostetter

: given the fact that my Java knowledge is sort of non-existent .. my idea was
: to rework the Solr Admin Interface.

Contributions of all kinds are welcome!

: Actually it's completely work-in-progress .. but I'm interested in what you
: guys think. Right direction? Completely wrong, just drop it?

I think it looks awesome.

: It's actually using one index.jsp to generate the basic frame, including cores
: and their navigation. Everything else is loaded via the existing SolrAdminHandler.

This is the exact approach that's been discussed in the past (but no one
has really had a chance to tackle it yet) ... eliminating the use of JSPs,
and relying entirely on HTML and JavaScript (or the
VelocityResponseWriter) to style the output from the existing Admin
RequestHandlers -- that way we can be confident that all info available in
the admin UI (and all functionality it performs) can be achieved by remote
clients using those same request handlers.

By all means -- keep working on this, and (as someone else already
mentioned) please don't hesitate to attach your work-in-progress stuff to
a Jira issue (where others can help provide feedback not only on the
screenshots, but also the implementation)

If you run into any issues where you can't replicate something
in the existing JSPs (or accomplish some new desirable functionality)
because the info is not available from a request handler, don't hesitate
to open feature-request Jira issues to get the functionality added (and the
folks with Java know-how can work on patches)


-Hoss


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Stefan Matheis

Robert,

even in this WIP state? If so .. I'll try one tomorrow evening after work

Regards
Stefan

On 02.03.2011 22:02, Robert Muir wrote:

On Wed, Mar 2, 2011 at 3:47 PM, Stefan Matheis
  wrote:


Any questions, ideas, thoughts out there? Please let me know :)



My only question would be: would you mind creating a JIRA issue with
your modifications?

I was just yesterday looking at this admin stuff and thinking, man
this could really use a facelift...


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Stefan Matheis

mrw,

you mean a field like the one here
(http://files.mathe.is/solr-admin/02_query.png) on the right side,
between the meta-navigation and the plain Solr XML response?


actually it's just there to display the computed URL, but if so .. we could
use a larger field for that, of course :)


Regards
Stefan

On 02.03.2011 22:31, mrw wrote:


Looks nice.

It might also be worth creating a page with a large query field for pasting
in complete URL-encoded queries that cross cores, etc.  I did that at work
(via ASP.NET) so we could paste in queries from logs and debug them.  We
tend to use that quite a bit.


Cheers


Stefan Matheis wrote:


Hi List,

given the fact that my Java knowledge is sort of non-existent .. my
idea was to rework the Solr Admin Interface.

Compared to CouchDB's Futon or the MongoDB Admin-Utils .. not that fancy,
but it was an idea a few weeks ago - and I would like to contribute
something, a thing which has to be non-Java but not useless - hopefully ;)

Actually it's completely work-in-progress .. but I'm interested in what
you guys think. Right direction? Completely wrong, just drop it?

http://files.mathe.is/solr-admin/01_dashboard.png
http://files.mathe.is/solr-admin/02_query.png
http://files.mathe.is/solr-admin/03_schema.png
http://files.mathe.is/solr-admin/04_analysis.png
http://files.mathe.is/solr-admin/05_plugins.png

It's actually using one index.jsp to generate the basic frame, including
cores and their navigation. Everything else is loaded via the existing
SolrAdminHandler.

Any questions, ideas, thoughts out there? Please let me know :)

Regards
Stefan




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Admin-Interface-reworked-Go-on-Go-away-tp2620365p2620745.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Robert Muir
On Wed, Mar 2, 2011 at 5:34 PM, Stefan Matheis
 wrote:
> Robert,
>
> even in this WIP state? If so .. I'll try one tomorrow evening after work
>

It's totally up to you; sometimes it can be useful to upload a partial
or WIP solution to an issue: as Hoss mentioned, it's a good way to get
feedback and additional ideas while you work.


Re: Solr Admin Interface, reworked - Go on? Go away?

2011-03-02 Thread Chris Hostetter

: even in this WIP state? If so .. I'll try one tomorrow evening after work

When in doubt, remember Yonik's Law Of Patches...

http://wiki.apache.org/solr/HowToContribute?highlight=law+of+patches#Contributing_Code_.28Features.2C_Big_Fixes.2C_Tests.2C_etc29

A half-baked patch in Jira, with no documentation, no tests
and no backwards compatibility is better than no patch at all.


-Hoss


Re: sort by price puts unknown prices first

2011-03-02 Thread Yonik Seeley
On Wed, Mar 2, 2011 at 4:19 PM, Scott K  wrote:
> On Wed, Mar 2, 2011 at 12:21, Chris Hostetter  
> wrote:
>> historically it has been because of a fundamental limitation in how the
>> Lucene FieldCache has worked, where the array-backed FieldCaches
>> use the default numeric value (i.e. 0) when docs have no value (but in the
>> case of Strings, the default is "null", which is easy to test for)
>>
>> I am 99.99% certain this has changed on the trunk, so all of the
>> Trie*Fields should support the sortMissing* options in 4.x
>
> I am running a 4.x build and just tried the most recent nightly build,
> apache-solr-4.0-2011-03-02_08-06-07.tgz, and am still seeing this
> issue.

Hmmm, this looks like maybe a bug.
It works if you put sortMissingLast="true" on the fieldType, but not
if it's just on the field.

For now, work around by adding it to the fieldType, and I'll investigate.

-Yonik
http://lucidimagination.com


Re: multiple localParams for each query clause

2011-03-02 Thread Roman Chyla
Thanks Jonathan, this will be useful -- in the meantime, I have
implemented the query rewriting, using the QueryParsing.toString()
utility as an example.

On Wed, Mar 2, 2011 at 5:40 PM, Jonathan Rochkind  wrote:
> Not per clause, no. But you can use the "nested queries" feature to set
> local params for each nested query instead.  Which is in fact one of the
> most common use cases for local params.
>
> &q=_query_:"{!type=x q.field=z}something" AND
> _query_:"{!type=database}something"
>
> URL-encode that whole thing, though.
>
> http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/
>
> On 3/2/2011 10:24 AM, Roman Chyla wrote:
>>
>> Hi,
>>
>> Is it possible to set local arguments for each query clause?
>>
>> example:
>>
>> {!type=x q.field=z}something AND {!type=database}something
>>
>>
>> I am pulling together result sets coming from two sources, a Solr index
>> and a DB engine - however I realized that local parameters apply only to
>> the whole query - so I don't know how to set the query to mark the
>> second clause as db-searchable.
>>
>> Thanks,
>>
>>   Roman
>
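
URL-encoded, the parameter from Jonathan's example would look roughly like this (the 'x' and 'database' parser types are the hypothetical ones from this thread):

  q=_query_%3A%22%7B!type%3Dx+q.field%3Dz%7Dsomething%22+AND+_query_%3A%22%7B!type%3Ddatabase%7Dsomething%22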


Re: MLT with boost

2011-03-02 Thread Koji Sekiguchi

(11/03/03 2:54), Mark wrote:

High-level overview: we have items and we have sellers. The scoring of our
documents is such that
our boost functions outweigh the pure Lucene term/query scoring. Our boost
functions basically take
into account how "good" the seller is.

Now for MLT searches we would like to incorporate this same sort of behavior.



Sounds reasonable. You can open an issue (on Lucene, I think).

Just for your information, MLT debug in 3.x/trunk has richer information
(assembled queries and explanations):

https://issues.apache.org/jira/browse/SOLR-860

It doesn't solve your problem, but you might like to try it...

Koji
--
http://www.rondhuit.com/en/
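
A sketch of pulling that debug output from the MLT handler (host, handler path and field names are placeholders):

  http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=title,description&mlt.interestingTerms=details&debugQuery=true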


RE: Understanding multi-field queries with q and fq

2011-03-02 Thread Bob Sandiford
Have you looked at the 'qf' parameter?

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com 
_
http://www.cosugi.org/ 




> -Original Message-
> From: mrw [mailto:mikerobertsw...@gmail.com]
> Sent: Wednesday, March 02, 2011 2:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Understanding multi-field queries with q and fq
> 
> Anyone understand how to do boolean logic across multiple fields?
> 
> Dismax is nice for searching multiple fields, but doesn't necessarily
> support our syntax requirements. eDismax appears not to be available
> until Solr 3.1.
> 
> In the meantime, it looks like we need to support applying the user's
> query to multiple fields, so if the user enters "led zeppelin merle" we
> need to be able to do the logical equivalent of
> 
> &fq=field1:(led zeppelin merle) OR field2:(led zeppelin merle)
> 
> 
> Any ideas?  :)
> 
> 
> 
> mrw wrote:
> >
> > After searching this list, Google, and looking through the Pugh book,
> > I am a little confused about the right way to structure a query.
> >
> > The Packt book uses the example of the MusicBrainz DB full of song
> > metadata.  What if they also had the song lyrics in English and
> > German as files on disk, and wanted to index them along with the
> > metadata, so that each document would basically have song title,
> > artist, publisher, date, ..., All_Metadata (copy field of all metadata
> > fields), Text_English, and Text_German fields?
> >
> > There can only be one default field, correct?  So if we want to
> > search for all songs containing (zeppelin AND (dog OR merle)) do we
> >
> > repeat the entire query text for all three major fields in the 'q'
> > clause (assuming we don't want to use the cache):
> >
> > q=+All_Metadata:(zeppelin AND (dog OR merle)) +Text_English:(zeppelin
> > AND (dog OR merle)) +Text_German:(zeppelin AND (dog OR merle))
> >
> > or repeat the entire query text for all three major fields in the
> > 'fq' clause (assuming we want to use the cache):
> >
> > q=*:*&fq=+All_Metadata:(zeppelin AND (dog OR merle))
> > +Text_English:(zeppelin AND (dog OR merle)) +Text_German:(zeppelin
> > AND (dog OR merle))
> >
> > ?
> >
> > Thanks!
> >
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Understanding-multi-field-queries-
> with-q-and-fq-tp2528866p2619700.html
> Sent from the Solr - User mailing list archive at Nabble.com.
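
Bob's qf suggestion, spelled out as a request - with mm=1 this behaves roughly like the OR-across-fields filter above (field names as in the thread):

  q=led+zeppelin+merle&defType=dismax&qf=field1+field2&mm=1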




Re: Solr under Tomcat

2011-03-02 Thread rajini maski
Sai,

 The index directory will be in your Solr_home//Conf//data directory..
The path for this directory need to be given where ever you want to
by changing the data-dir path in config XML that is present in the same
//conf folder . You need to stop tomcat service to delete this directory and
then restart tomcat. The tomcat itself generates the data folder at the path
specified in config if this folder is not available. The folder usually has
two sub-folders- index and spell-check

Regards,
Rajani Maski




On Wed, Mar 2, 2011 at 7:39 PM, Thumuluri, Sai <
sai.thumul...@verizonwireless.com> wrote:

> Good Morning,
> We have deployed Solr 1.4.1 under Tomcat and it works great; however, I
> cannot find where the index (directory) is created. I set solr home in
> web.xml under /webapps/solr/WEB-INF/, but am not sure where the data
> directory is. I have a need to completely re-index the site, and it
> would help for me to stop Solr, delete the index directory and restart
> Solr prior to re-indexing the content.
>
> Thanks,
> Sai Thumuluri
>
>
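
A typical cycle for the use case above might look like this (the service name and paths are examples; adjust to your install):

  /etc/init.d/tomcat6 stop
  rm -rf /opt/solr/home/data/index
  /etc/init.d/tomcat6 start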
>


RE: Solr under Tomcat

2011-03-02 Thread Thumuluri, Sai
Thank you - I found it. 

-Original Message-
From: rajini maski [mailto:rajinima...@gmail.com] 
Sent: Thursday, March 03, 2011 12:03 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr under Tomcat

Sai,

 The index directory will be in your Solr_home/data directory by default.
The path for this directory can be set to wherever you want by changing the
dataDir path in the solrconfig.xml that is present in the conf folder. You
need to stop the Tomcat service to delete this directory, and then restart
Tomcat. Tomcat itself regenerates the data folder at the path specified in
the config if the folder is not present. The folder usually has
two sub-folders: index and spellchecker

Regards,
Rajani Maski




On Wed, Mar 2, 2011 at 7:39 PM, Thumuluri, Sai <
sai.thumul...@verizonwireless.com> wrote:

> Good Morning,
> We have deployed Solr 1.4.1 under Tomcat and it works great; however, I
> cannot find where the index (directory) is created. I set solr home in
> web.xml under /webapps/solr/WEB-INF/, but am not sure where the data
> directory is. I have a need to completely re-index the site, and it
> would help for me to stop Solr, delete the index directory and restart
> Solr prior to re-indexing the content.
>
> Thanks,
> Sai Thumuluri
>
>
>


Re: memory leak during undeploying

2011-03-02 Thread Chris Hostetter

: When I did heap analysis, the culprit always seems to
: be the TimeLimitedCollector thread. Because of this, a considerable number of
: classes are not getting unloaded.
...
: > > There are a couple of JIRAs related to this:
: > > https://issues.apache.org/jira/browse/LUCENE-2237,
: > > https://issues.apache.org/jira/browse/SOLR-1735. Even after applying
: > these
: > > patches, the issue still remains.

can you clarify what you mean by this -- are you still seeing that 
TimeLimitedCollector is the culprit in your heap analysis (even with the 
patches) or are you still getting problems with PermGen running out, but 
it's caused by other classes and TimeLimitedCollector is no longer the 
culprit? (and if so: which other classes)

FYI: LUCENE-2822 is related to LUCENE-2237 and has attracted some more
attention/comments (I suspect largely because it was filed as a bug
instead of an improvement)


-Hoss


Re: Looking for help with Solr implementation

2011-03-02 Thread composite
Hi, 
I am a freelancer based in New Delhi, India. I have just completed a project
in Apache Solr for a bioinformatics company. The project involved, among
other things, importing 46 million records from a MySQL database to create
Solr indexes and developing a user interface for doing searches (with
autocomplete widgets) on the indexes. I think I can help you with any
Solr-related issues. In case you are interested, please email me at
compos...@compositesoft.com; I can give you more details about our Solr
skills and the work we have done.

Regards,
Prabha Anantaram,
Composite Software Systems.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Looking-for-help-with-Solr-implementation-tp1886329p2623492.html
Sent from the Solr - User mailing list archive at Nabble.com.