Re: Changing the default Fuzzy minSimilarity?

2010-12-15 Thread Jan Høydahl / Cominvent
>> A fuzzy query foo~ defaults to a similarity of 0.5, i.e. equal to foo~0.5
>> 
> 
> just as an FYI, this isn't true in trunk (4.0) any more.
> 
> the defaults are changed so that it never enumerates the entire
> dictionary (slow) like before, see:
> https://issues.apache.org/jira/browse/LUCENE-2667
> 
> so, the default is now foo~2 (2 edit distances).

Got it. I need this for production on 1.4.1. Any clue on how to patch in a new 
default, without?
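A minimal sketch, assuming the Lucene 2.9 QueryParser that ships with Solr 1.4.1 (the
field name, analyzer and the 0.8 value are illustrative; exposing this inside Solr
itself would still mean patching SolrQueryParser or registering a custom QParserPlugin):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class FuzzyDefaultSketch {
    public static void main(String[] args) throws ParseException {
        // Lucene 2.9 keeps the default fuzzy similarity on the parser instance
        QueryParser parser = new QueryParser(Version.LUCENE_29, "text",
                new StandardAnalyzer(Version.LUCENE_29));
        parser.setFuzzyMinSim(0.8f);   // a bare "foo~" now behaves like "foo~0.8"
        Query q = parser.parse("foo~");
        System.out.println(q);         // prints text:foo~0.8
    }
}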

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Omitting tf but not positions

2010-12-15 Thread Jan Høydahl / Cominvent
Hi,

I have a case where I use DisMax "pf" to boost on phrase match in a field. I 
use omitNorms=true to avoid length normalization to mess with my scores.

However, for some documents the phrase "foo bar" occurs more than once in the 
same field, and I get an unintended TF boost for one of them:

1.4142135 = tf(phraseFreq=2.0)
vs
1.0 = tf(phraseFreq=1.0)

I could use omitTermFreqAndPositions but that would disable phrase search 
ability, wouldn't it?
Any way to disable TF/IDF normalization without also disabling positions?

Solr 1.4.1

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com



Re: [DIH] Example for SQL Server

2010-12-15 Thread Savvas-Andreas Moysidis
Hi Adam,

we are using DIH to index off an SQL Server database (the freebie SQLExpress
one.. ;) ). We have defined the following in our
%TOMCAT_HOME%\solr\conf\data-config.xml:

<dataConfig>
  <dataSource name="mssqlDatasource"
              driver="net.sourceforge.jtds.jdbc.Driver"
              url="jdbc:jtds:sqlserver://{server.name}:{server.port}/{dbInstanceName};instance=SQLEXPRESS"
              convertType="true"
              user="{user.name}"
              password="{user.password}"/>
  <document>
    <entity dataSource="mssqlDatasource"
            query="your query here" />
  </document>
</dataConfig>

We downloaded a JDBC driver from here http://jtds.sourceforge.net/faq.html and
found it to be a quite stable driver.

And the only thing we really had to do was drop that library in
%TOMCAT_HOME%\lib directory (for Tomcat 6+).

Hope that helps.
-- Savvas.

On 14 December 2010 22:46, Erick Erickson  wrote:

> The config isn't really any different for various sql instances, about the
> only difference is the driver. Have you seen the example in the
> distribution somewhere like
> /example/example-DIH/solr/db/conf/db-data-config.xml?
>
> Also, there's a magic URL for debugging DIH at:
> .../solr/admin/dataimport.jsp
>
> If none of that is useful, could you post your attempt and maybe someone
> can
> offer some hints?
>
> Best
> Erick
>
> On Tue, Dec 14, 2010 at 5:32 PM, Adam Estrada <
> estrada.adam.gro...@gmail.com
> > wrote:
>
> > Does anyone have an example config.xml file I can take a look at for SQL
> > Server? I need to index a lot of data from a DB and can't seem to figure
> > out
> > the right syntax so any help would be greatly appreciated. What is the
> > correct /jar file to use and where do I put it in order for it to work?
> >
> > Thanks,
> > Adam
> >
>


Problem using curl in PHP to get Solr results

2010-12-15 Thread Dennis Gearon
I finally figured out how to use curl to GET results, i.e. just turn all spaces
into '%20' in my type of queries. I'm using Solr spatial, and then searching in
both the default text field and a couple of columns. Works fine in the browser.

But if I query for it using curl in PHP, there's an error somewhere in the JSON.
I don't know if it's in the PHP food chain or something else.

Just putting my solution to GETing from curl in PHP and my problem up here, for
others to find.

Of course, if anyone knows the answer, all the better.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: Problem using curl in PHP to get Solr results

2010-12-15 Thread pankaj bhatt
Hi,


On Wed, Dec 15, 2010 at 2:52 PM, Dennis Gearon wrote:

> I finally figured out how to use curl to GET results, i.e. just turn all
> spaces
> into '%20' in my type of queries. I'm using solar spatial, and then
> searching in
> both the default text field and a couple of columns. Works fine on in the
> browser.
>
> But if I query for it using curl in PHP, there's an error somewhere in the
> JSON.
> I don't know if it's in the PHP food chain or something else.
>
>
> Just putting my solution to GETing from curl in PHP and my problem up here,
> for
> others to find.
>
>  Of course, if anyone knows the answer, all the better.
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>


Re: Problem using curl in PHP to get Solr results

2010-12-15 Thread Stephen Weiss
Forgive me if this seems like a dumb question but have you tried the 
Apache_Solr_Service class?

http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html

It's really quite good at handling the nuts and bolts of making the HTTP 
requests and decoding the responses for PHP.  I almost always use it when 
working from PHP.  It's all over Google, so I don't know how someone would miss 
it, but I also don't know why someone would bother curling a GET to Solr 
otherwise.

--
Steve

On Dec 15, 2010, at 4:22 AM, Dennis Gearon wrote:

> I finally figured out how to use curl to GET results, i.e. just turn all 
> spaces 
> into '%20' in my type of queries. I'm using solar spatial, and then searching 
> in 
> both the default text field and a couple of columns. Works fine on in the 
> browser.
> 
> But if I query for it using curl in PHP, there's an error somewhere in the 
> JSON. 
> I don't know if it's in the PHP food chain or something else. 
> 
> 
> Just putting my solution to GETing from curl in PHP and my problem up here, 
> for 
> others to find.
> 
> Of course, if anyone knows the answer, all the better.
> 
> Dennis Gearon
> 
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
> better 
> idea to learn from others’ mistakes, so you do not have to make them 
> yourself. 
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> 
> EARTH has a Right To Life,
> otherwise we all die.
> 



Re: Omitting tf but not positions

2010-12-15 Thread Robert Muir
On Wed, Dec 15, 2010 at 3:09 AM, Jan Høydahl / Cominvent
 wrote:
> Any way to disable TF/IDF normalization without also disabling positions?
>

see Similarity.tf(float) and Similarity.tf(int)

if you want to change this for both terms and phrases just override
Similarity.tf(float), since by default Similarity.tf(int) delegates to
that.
otherwise, override both.

of course the big limitation being you can't customize Similarity per-field yet.
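A minimal sketch of such an override, assuming the DefaultSimilarity bundled with
Solr 1.4 / Lucene 2.9 (the class name is made up; it would be registered through a
<similarity class="..."/> element in schema.xml):

import org.apache.lucene.search.DefaultSimilarity;

public class FlatTfSimilarity extends DefaultSimilarity {
    // Similarity.tf(int) delegates to tf(float) by default, so overriding the
    // float variant flattens both term frequency and phrase frequency.
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}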


French stemming / size of synonyms file

2010-12-15 Thread Emmanuel Bégué
Hello,

According to the wiki http://wiki.apache.org/solr/LanguageAnalysis,
the light stemmers for French (solr.FrenchLightStemFilterFactory and
solr.FrenchMinimalStemFilterFactory) are only available for SOLR 3.1.

Is there a way to make them work with 1.4.1?

- - -

Additionally, there is an "official" list of inflected word forms for
the French language produced by a government agency (this being
France...) It's called "Morphalou":
http://www.cnrtl.fr/lexiques/morphalou/ and it contains over 540 k
inflicted forms.

It's a 162 Mo XML file; it would not be very hard to transform it into
the format for synonyms files for SOLR, but it would result in a
rather huge text file (probably smaller than the original XML, but
still around 100 Mo). How large can a synonyms file be? Is it
dependant on the Java heap size...?

Or is there a better way to use such a list than a synonyms file?

Thanks,
Regards,
EB


Re: Search with facet.pivot

2010-12-15 Thread Erik Hatcher
One oddity is the duplicated sections:

  <arr name="facet.pivot">
    <str>root_category_name,parent_category_name,category</str>
    <str>root_category_id,parent_category_id,category_id</str>
  </arr>

That's in your responseHeader twice.  Perhaps something fishy is caused by that?
Is this hardcoded in your solrconfig.xml request handler mapping (I can't
imagine how you could get that effect from request params)?  Try removing this
duplication and see if that helps.

Erik


On Dec 14, 2010, at 22:17 , Anders Dam wrote:

> I forgot to mention that the query is handled by the Dismax Request Handler
> 
> Grant, from the tag and down you see all the query
> parameters used. The only thing varying from query to query is the actual
> query (q). When searching on, for example, '1000' (q=1000) the facet.pivot
> fields are correctly returned, while when searching on, for example, 'OKI'
> the facet.pivot fields are not returned.
> 
> If it is of any help, what I am searching are products compatible with
> certain printers. The printer models are stored in a relational database,
> where each printer cartridge belongs to several categories; the categories
> are in an often 2-3 levels deep hierarchy which is flattened out at data
> import time, so that there are columns at import time (DataImportHandler)
> called category_name, parent_category_name and root_category_name. These
> fields are copied to the category_search field also mentioned in my first mail.
> 
> Here is the response header listing all the params used for querying, on
> pastebin.com with a bit better formatting, hope that helps:
> http://pastebin.com/9FgijpJ6
> 
> If there is any information I can provide to
> help us solve this problem I will be happy to provide it.
> 
> Thanks in advance,
> 
> Anders
> 
> 
> On Tue, Dec 14, 2010 at 7:47 PM, Grant Ingersoll wrote:
> 
>> The formatting of your message is a bit hard to read.  Could you please
>> clarify which commands worked and which ones didn't?  Since the pivot stuff
>> is relatively new, there could very well be a bug, so if you can give a
>> simple test case that shows what is going on that would also be helpful,
>> albeit not required.
>> 
>> On Dec 12, 2010, at 10:18 PM, Anders Dam wrote:
>> 
>>> Hi,
>>> 
>>> I have a minor problem in getting the pivoting working correctly. The
>> thing
>>> is that two otherwise equal search queries behave differently, namely one
>> is
>>> returning the search result with the facet.pivot fields below and another
>> is
>>> returning the search result with an empty facet.pivot. This is a problem,
>>> since I am particularly interested in displaying the pivots.
>>> 
>>> Perhaps anyone has an idea about what is going wrong in this case, For
>>> clarity I paste the parameters used for searching:
>>> 
>>> 
>>> 
>>> 0
>>> 41
>>> -
>>> 
>>> 
>>>   2<-1 5<-2 6<90%
>>>   
>>> on
>>> 1
>>> 0.01
>>> 
>>>   category_search
>>>
>>> 0
>>> 
>>> 
>>>   *:*
>>>
>>> category
>>> true
>>> dismax
>>> all
>>> 
>>>   *,score
>>>
>>> true
>>> 1
>>> 
>>> true
>>> 
>>>   shop_name:colorbob.dk
>>>
>>> -
>>> 
>>> root_category_name,parent_category_name,category
>>> root_category_id,parent_category_id,category_id
>>> 
>>> 100
>>> -
>>> 
>>> root_category_name,parent_category_name,category
>>> root_category_id,parent_category_id,category_id
>>> 
>>> OKI
>>> 100
>>> 
>>> 
>>> 
>>> I see no pattern in what queries is returning the pivot fields and which
>>> ones are not
>>> 
>>> 
>>> The field searched in is defined as:
>>> 
>>> > stored="false"
>>> required="false" termVectors="on" termPositions="on" termOffsets="on" />
>>> 
>>> And the edgytext type is defined as
>>>   >> positionIncrementGap="100">
>>>
>>>  
>>>   >> stemEnglishPossessive="0" splitOnNumerics="0" preserveOriginal="1"
>>> generateWordParts="1" generateNumberParts="1" catenateWords="1"
>>> catenateNumbers="1" catenateAll="1" />
>>>   
>>>  >> maxGramSize="25" />
>>>
>>>
>>>  
>>>   >> stemEnglishPossessive="0" splitOnNumerics="0" preserveOriginal="1"
>>> generateWordParts="1" generateNumberParts="1" catenateWords="0"
>>> catenateNumbers="0" catenateAll="0" />
>>>   
>>>
>>>   
>>> 
>>> I am using apache-solr-4.0-2010-11-26_08-36-06 release
>>> 
>>> Thanks in advance,
>>> 
>>> Anders Dam
>> 
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> 
>> 



R: limit the search results to one category

2010-12-15 Thread Andrea Gazzarini
Did you try a filter query (fq)?
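For example (handler, field and value names below are only illustrative), keeping
dismax for the user's query and putting the category into a filter query, which is
parsed independently of the dismax q:

  q=user+search+terms&defType=dismax&fq=category:"Books"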

Andrea Gazzarini

-Original Message-
From: sara motahari 
Date: Tue, 14 Dec 2010 17:34:52 
To: 
Reply-To: solr-user@lucene.apache.org
Subject: limit the search results to one category

Hi all,

I am using a dismax request handler with various fields that it searches, but I 
also want to enable the users to select a category from a drop-down list 
and only get the results that belong to that category. It seems I can't use a 
nested query with dismax as the first one and standard as the nested one? Is 
there another way to do this?


  


Re: French stemming / size of synonyms file

2010-12-15 Thread Robert Muir
2010/12/15 Emmanuel Bégué :
> Hello,
>
> According to the wiki http://wiki.apache.org/solr/LanguageAnalysis,
> the light stemmers for French (solr.FrenchLightStemFilterFactory and
> solr.FrenchMinimalStemFilterFactory) are only available for SOLR 3.1.
>
> Is there a way to make them work with 1.4.1?

you could take the source code and backport it to solr 1.4.1... but see below:

>
> - - -
>
> Additionally, there is an "official" list of inflected word forms for
> the French language produced by a government agency (this being
> France...) It's called "Morphalou":
> http://www.cnrtl.fr/lexiques/morphalou/ and it contains over 540 k
> inflicted forms.
>
> Or is there a better way to use such a list than a synonyms file?

In this case I would recommend also considering StemmerOverrideFilter
(again only in 3.1+, sorry)
See 
http://wiki.apache.org/solr/LanguageAnalysis#solr.StemmerOverrideFilterFactory

The StemmerOverrideFilter will "stem" based on a tab-separated
dictionary. But, when it does this it also marks the word with
KeywordAttribute, which tells any future stemmer to ignore it.

So with this approach you can have a StemmerOverrideFilter with your
dictionary, then followed by a stemmer which will only work on words
that aren't in your dictionary.
The words that hit the dictionary will be completely ignored by the stemmer.

This should also be much more RAM-efficient than using SynonymFilter
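A sketch of how such a chain might look in schema.xml (the field type name and
stemdict.txt are placeholders; the dictionary holds one tab-separated
"inflected-form<TAB>lemma" pair per line):

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- entries found in stemdict.txt are rewritten and marked as keywords -->
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>
    <!-- this stemmer skips anything the override filter already handled -->
    <filter class="solr.FrenchLightStemFilterFactory"/>
  </analyzer>
</fieldType>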


Dataimport performance

2010-12-15 Thread Robert Gründler
Hi,

we're looking for some comparison benchmarks for importing large tables from a 
MySQL database (full import).

Currently, a full import of ~8 million rows from a MySQL database takes around 
3 hours, on a quad-core machine with 16 GB of RAM and a RAID 10 storage setup. 
Solr is running on an Apache Tomcat instance, where it is the only app. The 
Tomcat instance has the following memory-related JAVA_OPTS:

-Xms4096M -Xmx5120M


The data-config.xml looks like this (only 1 entity):

  








  


  


We have the feeling that 3 hours for this import is quite long, given the 
performance of the server running Solr/MySQL.

Are we wrong with that assumption, or do people experience similar import times 
with this amount of data to be imported?


thanks!


-robert





Re: [DIH] Example for SQL Server

2010-12-15 Thread Adam Estrada
Thanks All,

Testing here shortly and will report back asap.

w/r,
Adam

On Wed, Dec 15, 2010 at 4:10 AM, Savvas-Andreas Moysidis <
savvas.andreas.moysi...@googlemail.com> wrote:

> Hi Adam,
>
> we are using DIH to index off an SQL Server database(the freeby SQLExpress
> one.. ;) ). We have defined the following in our
> %TOMCAT_HOME%\solr\conf\data-config.xml:
> <dataConfig>
>   <dataSource name="mssqlDatasource"
>               driver="net.sourceforge.jtds.jdbc.Driver"
>               url="jdbc:jtds:sqlserver://{server.name}:{server.port}/{dbInstanceName};instance=SQLEXPRESS"
>               convertType="true"
>               user="{user.name}"
>               password="{user.password}"/>
>   <document>
>     <entity dataSource="mssqlDatasource"
>             query="your query here" />
>   </document>
> </dataConfig>
>
> We downloaded a JDBC driver from here http://jtds.sourceforge.net/faq.html and
> found it to be a quite stable driver.
>
> And the only thing we really had to do was drop that library in
> %TOMCAT_HOME%\lib directory (for Tomcat 6+).
>
> Hope that helps.
> -- Savvas.
>
> On 14 December 2010 22:46, Erick Erickson  wrote:
>
> > The config isn't really any different for various sql instances, about
> the
> > only difference is the driver. Have you seen the example in the
> > distribution somewhere like
> > /example/example-DIH/solr/db/conf/db-data-config.xml?
> >
> > Also, there's a magic URL for debugging DIH at:
> > .../solr/admin/dataimport.jsp
> >
> > If none of that is useful, could you post your attempt and maybe someone
> > can
> > offer some hints?
> >
> > Best
> > Erick
> >
> > On Tue, Dec 14, 2010 at 5:32 PM, Adam Estrada <
> > estrada.adam.gro...@gmail.com
> > > wrote:
> >
> > > Does anyone have an example config.xml file I can take a look at for
> SQL
> > > Server? I need to index a lot of data from a DB and can't seem to
> figure
> > > out
> > > the right syntax so any help would be greatly appreciated. What is the
> > > correct /jar file to use and where do I put it in order for it to work?
> > >
> > > Thanks,
> > > Adam
> > >
> >
>


Problem with multicore

2010-12-15 Thread Jörg Agatz
Hello users,

I have a problem with Solr 1.4.1 on Ubuntu 10.10.

I have downloaded the new version and extracted it!

Then I copied the solr.xml from example/multicore/solr.xml to
/examples/solr/solr.xml:

<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

Then I created the folders example/solr/core0 and example/solr/core1,
and in each folder a conf folder with the original schema.xml,
solrconfig.xml etc.

I start Solr with "sudo java -Dsolr.solr.home=multicore -jar start.jar",

but now I can't index anything with:

sudo java -Ddata=args -Dcommit=yes -Durl=
http://localhost:8983/solr/core1/update -jar post.jar *.xml

I always get:

SimplePostTool: version 1.2

SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8,
other encodings are not currently supported

SimplePostTool: POSTing args to http://localhost:8983/solr/core1/update..

SimplePostTool: FATAL: Solr returned an error:
Unexpected_character_m_code_109_in_prolog_expected___at_rowcol_unknownsource_11

serv...@joa-desktop:~/Desktop/apache-solr-1.4.1/example/exampledocs$


Any ideas what I have done wrong?

King


Re: Dataimport performance

2010-12-15 Thread Adam Estrada
What version of Solr are you using?

Adam

2010/12/15 Robert Gründler 

> Hi,
>
> we're looking for some comparison-benchmarks for importing large tables
> from a mysql database (full import).
>
> Currently, a full-import of ~ 8 Million rows from a MySQL database takes
> around 3 hours, on a QuadCore Machine with 16 GB of
> ram and a Raid 10 storage setup. Solr is running on a apache tomcat
> instance, where it is the only app. The tomcat instance
> has the following memory-related java_opts:
>
> -Xms4096M -Xmx5120M
>
>
> The data-config.xml looks like this (only 1 entity):
>
>  
>
>
>
>
>
> name="sf_unique_id"/>
>
>
>  
>
>
>  
>
>
> We have the feeling that 3 hours for this import is quite long - regarding
> the performance of the server running solr/mysql.
>
> Are we wrong with that assumption, or do people experience similar import
> times with this amount of data to be imported?
>
>
> thanks!
>
>
> -robert
>
>
>
>


Re: Dataimport performance

2010-12-15 Thread Erick Erickson
You're adding on the order of 750 rows (docs)/second, which isn't bad...

have you profiled the machine as this runs? Even just with top (assuming
unix)...
because the very first question is always "what takes the time, getting
the data from MySQL or indexing or I/O?".

If you aren't maxing out your CPU, then you probably want to explore the
other
questions (db query speed, network latency) to get a sense whether you're
going as fast as you can or not...

Best
Erick

2010/12/15 Robert Gründler 

> Hi,
>
> we're looking for some comparison-benchmarks for importing large tables
> from a mysql database (full import).
>
> Currently, a full-import of ~ 8 Million rows from a MySQL database takes
> around 3 hours, on a QuadCore Machine with 16 GB of
> ram and a Raid 10 storage setup. Solr is running on a apache tomcat
> instance, where it is the only app. The tomcat instance
> has the following memory-related java_opts:
>
> -Xms4096M -Xmx5120M
>
>
> The data-config.xml looks like this (only 1 entity):
>
>  
>
>
>
>
>
> name="sf_unique_id"/>
>
>
>  
>
>
>  
>
>
> We have the feeling that 3 hours for this import is quite long - regarding
> the performance of the server running solr/mysql.
>
> Are we wrong with that assumption, or do people experience similar import
> times with this amount of data to be imported?
>
>
> thanks!
>
>
> -robert
>
>
>
>


Re: Dataimport performance

2010-12-15 Thread Robert Gründler
> What version of Solr are you using?


Solr Specification Version: 1.4.1
Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42
Lucene Specification Version: 2.9.3
Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55


-robert



> 
> Adam
> 
> 2010/12/15 Robert Gründler 
> 
>> Hi,
>> 
>> we're looking for some comparison-benchmarks for importing large tables
>> from a mysql database (full import).
>> 
>> Currently, a full-import of ~ 8 Million rows from a MySQL database takes
>> around 3 hours, on a QuadCore Machine with 16 GB of
>> ram and a Raid 10 storage setup. Solr is running on a apache tomcat
>> instance, where it is the only app. The tomcat instance
>> has the following memory-related java_opts:
>> 
>> -Xms4096M -Xmx5120M
>> 
>> 
>> The data-config.xml looks like this (only 1 entity):
>> 
>> 
>>   
>>   
>>   
>>   
>>   
>>   > name="sf_unique_id"/>
>> 
>>   
>> 
>>   
>> 
>> 
>> 
>> 
>> We have the feeling that 3 hours for this import is quite long - regarding
>> the performance of the server running solr/mysql.
>> 
>> Are we wrong with that assumption, or do people experience similar import
>> times with this amount of data to be imported?
>> 
>> 
>> thanks!
>> 
>> 
>> -robert
>> 
>> 
>> 
>> 



Re: Dataimport performance

2010-12-15 Thread Bernd Fehling
We are currently running Solr 4.x from trunk.

-d64 -Xms10240M -Xmx10240M

Total Rows Fetched: 24935988
Total Documents Skipped: 0
Total Documents Processed: 24568997
Time Taken: 5:55:19.104

24.5 million docs as XML from the filesystem in less than 6 hours.

Maybe your MySQL is the bottleneck?

Regards
Bernd


On 15.12.2010 14:40, Robert Gründler wrote:
> Hi,
> 
> we're looking for some comparison-benchmarks for importing large tables from 
> a mysql database (full import).
> 
> Currently, a full-import of ~ 8 Million rows from a MySQL database takes 
> around 3 hours, on a QuadCore Machine with 16 GB of
> ram and a Raid 10 storage setup. Solr is running on a apache tomcat instance, 
> where it is the only app. The tomcat instance
> has the following memory-related java_opts:
> 
> -Xms4096M -Xmx5120M
> 
> 
> The data-config.xml looks like this (only 1 entity):
> 
>   
> 
> 
> 
> 
> 
>  name="sf_unique_id"/>
> 
> 
>   
> 
> 
>   
> 
> 
> We have the feeling that 3 hours for this import is quite long - regarding 
> the performance of the server running solr/mysql. 
> 
> Are we wrong with that assumption, or do people experience similar import 
> times with this amount of data to be imported?
> 
> 
> thanks!
> 
> 
> -robert
> 
> 
> 

-- 
*
Bernd Fehling                    Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)               Universitätsstr. 25
Tel. +49 521 106-4060            Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de   33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


Re: Dataimport performance

2010-12-15 Thread Tim Heckman
2010/12/15 Robert Gründler :
> The data-config.xml looks like this (only 1 entity):
>
>      
>        
>        
>        
>        
>        
>         name="sf_unique_id"/>
>
>        
>          
>        
>
>      

So there's one track entity with an artist sub-entity. My (admittedly
rather limited) experience has been that sub-entities, where you have
to run a separate query for every row in the parent entity, really
slow down data import. For my own purposes, I wrote a custom data
import using SolrJ to improve the performance (from 3 hours to 10
minutes).

Just as a test, how long does it take if you comment out the artists entity?


Re: Problem with multicore

2010-12-15 Thread Tommaso Teofili
Hi Jörg,
I think the first thing you should check is your Ubuntu's encoding, second
one is file permissions (BTW why are you sudoing?).
Did you try using the bash script under example/exampledocs named "post.sh"
(use it like this: 'sh post.sh *.xml')
Cheers,
Tommaso


2010/12/15 Jörg Agatz 

> Hallo Users,
>
> I habve a Problem wit Solr 1.4.1 on Ubuntu 10.10
>
> I have download the new version and extract it!
>
> than i have copy the solr.xml from example/multicore/solr.xml to
> /examples/solr/solr.xml
>
> 
>
> 
>
>
> 
>
> 
>
>
>  
>
>  
>
>
>
>
>
>  
>
> 
>
>
>
> than i create folders example/solr/core0 and example/solr/core1
> and in each folder a conf folder, with the original schema.xml and
> solrconfig.xml ect..
>
> start Solr with "sudo java -Dsolr.solr.home=multicore -jar start.jar"
>
> but nuw i cant index something with:
>
> sudo java -Ddata=args -Dcommit=yes -Durl=
> http://localhost:8983/solr/core1/update -jar post.jar *.xml
>
> i always get:
>
> SimplePostTool: version 1.2
>
> SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8,
> other encodings are not currently supported
>
> SimplePostTool: POSTing args to http://localhost:8983/solr/core1/update..
>
> SimplePostTool: FATAL: Solr returned an error:
>
> Unexpected_character_m_code_109_in_prolog_expected___at_rowcol_unknownsource_11
>
> serv...@joa-desktop:~/Desktop/apache-solr-1.4.1/example/exampledocs$
>
>
> Some ideas what i have make wrong?
>
> King
>


Re: Dataimport performance

2010-12-15 Thread Robert Gründler
i've benchmarked the import already with 500k records, one time without the 
artists subquery, and one time without the join in the main query:


Without subquery: 500k in 3 min 30 sec

Without join and without subquery: 500k in 2 min 30.

With subquery and with left join:   320k in 6 Min 30


so the joins / subqueries are definitely a bottleneck. 

How exactly did you implement the custom data import? 

In our case, we need to de-normalize the relations of the sql data for the 
index, 
so i fear i can't really get rid of the join / subquery.


-robert





On Dec 15, 2010, at 15:43 , Tim Heckman wrote:

> 2010/12/15 Robert Gründler :
>> The data-config.xml looks like this (only 1 entity):
>> 
>>  
>>
>>
>>
>>
>>
>>> name="sf_unique_id"/>
>> 
>>
>>  
>>
>> 
>>  
> 
> So there's one track entity with an artist sub-entity. My (admittedly
> rather limited) experience has been that sub-entities, where you have
> to run a separate query for every row in the parent entity, really
> slow down data import. For my own purposes, I wrote a custom data
> import using SolrJ to improve the performance (from 3 hours to 10
> minutes).
> 
> Just as a test, how long does it take if you comment out the artists entity?



Lower level filtering

2010-12-15 Thread Michael Owen

Hi all,
I'm currently using Solr and I've got a question about filtering at a lower 
level than filter queries.

We want to be able to restrict the documents that can possibly be returned for 
a user's query. From another system we'll get a list of document unique ids for 
the user, which is all the documents that they can possibly see (i.e. a base 
index list, as such). The criteria for which document ids get returned are 
going to be quite flexible. As the number of ids can be up to index size - 1 
(i.e. thousands), a filter query doesn't seem right, since it would be so large.

Can something be done at a lower level - perhaps at a Lucene level? As I 
understand it, Lucene starts from a bitset of possible documents it can return - 
could we AND this with a filter bitset returned from the other system? Would 
this be a good way forward?

And then how would you do this in Solr while still keeping the extra 
functionality Solr brings over Lucene? A new SearchHandler?
Thanks
Mike





  

Re: Lower level filtering

2010-12-15 Thread Stephen Green
On Wed, Dec 15, 2010 at 9:49 AM, Michael Owen  wrote:
> I'm currently using Solr and I've got a question about filtering on a lower 
> level than filter queries.
> We want to be able to restrict the documents that can possibly be returned to 
> a users query. From another system we'll get a list of document unique ids 
> for the user which is all the documents that they can possibly see (i.e. a 
> base index list as such). The criteria for what document ids get returned is 
> going to be quite flexible. As the number of ids can be up to index size - 1 
> (i.e. thousands) using a filter query doesn't seem right for entering a 
> filter query which is so large.
> Can something be done at a lower level - perhaps at a Lucene level - as I 
> understand Lucene starts from a bitset of possible documents it can return - 
> could we AND this with a filter bitset returned from the other system? Would 
> this be a good way forward?
> And then how would you do this in Solr with still keeping Solr's extra 
> functionality it brings over Lucene. A new SearchHandler?

I actually submitted a patch a while ago in Solr-2052 that allows you
to specify a bit filter and a filter query (you could specify either,
but not both.)

Otis pointed out that the patch can't be applied against the current
source, so I need to go back and make it work with the current source
(new job = no time).  I'll see if I can find the time this weekend to
do this.

Steve
-- 
Stephen Green
http://thesearchguy.wordpress.com


Re: Lower level filtering

2010-12-15 Thread Savvas-Andreas Moysidis
It might not be practical in your case, but is it possible to get from that
other system a list of ids the user is *not* allowed to see, and somehow
invert the logic in the filter?

Regards,
-- Savvas.

On 15 December 2010 14:49, Michael Owen  wrote:

>
> Hi all,
> I'm currently using Solr and I've got a question about filtering on a lower
> level than filter queries.
> We want to be able to restrict the documents that can possibly be returned
> to a users query. From another system we'll get a list of document unique
> ids for the user which is all the documents that they can possibly see (i.e.
> a base index list as such). The criteria for what document ids get returned
> is going to be quite flexible. As the number of ids can be up to index size
> - 1 (i.e. thousands) using a filter query doesn't seem right for entering a
> filter query which is so large.
> Can something be done at a lower level - perhaps at a Lucene level - as I
> understand Lucene starts from a bitset of possible documents it can return -
> could we AND this with a filter bitset returned from the other system? Would
> this be a good way forward?
> And then how would you do this in Solr with still keeping Solr's extra
> functionality it brings over Lucene. A new SearchHandler?
> Thanks
> Mike
>
>
>
>
>
>


Re: Dataimport performance

2010-12-15 Thread Tim Heckman
The custom import I wrote is a java application that uses the SolrJ
library. Basically, where I had sub-entities in the DIH config I did
the mappings inside my java code.

1. Identify a subset or "chunk" of the primary id's to work on (so I
don't have to load everything into memory at once) and put those in a
temp table. I used a modulus on the id.
2. Select all of the outer entity from the database (joining on the
id's in the temp table), and load the data from that result set into
new solr input documents. I keep these in a hash map keyed on the
id's.
3. Then select all of the inner entity, joining on the id's from the
temp table. The result set has to include the id's from step 2. I go
through this result set and load the data into the matching solr input
documents from step 2.
4. Push that set of input documents to solr (optionally committing
them), then go back to step 1 using the next subset or chunk.

Not sure if this is the absolute best approach, but it's working well
enough for my specific case.
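A rough SolrJ sketch of that approach (table, column and connection details are
invented, and the modulus is applied directly in the WHERE clause instead of a temp
table), assuming the SolrJ 1.4 CommonsHttpSolrServer client:

import java.sql.*;
import java.util.*;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ChunkedImporter {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Connection db = DriverManager.getConnection("jdbc:mysql://localhost/music", "user", "pass");
        int chunks = 16;                           // step 1: split the ids with a modulus
        for (int chunk = 0; chunk < chunks; chunk++) {
            Map<Long, SolrInputDocument> docs = new HashMap<Long, SolrInputDocument>();
            // step 2: outer entity -> one input document per row
            PreparedStatement outer = db.prepareStatement(
                    "SELECT id, title FROM track WHERE MOD(id, ?) = ?");
            outer.setInt(1, chunks); outer.setInt(2, chunk);
            ResultSet rs = outer.executeQuery();
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getLong("id"));
                doc.addField("title", rs.getString("title"));
                docs.put(rs.getLong("id"), doc);
            }
            rs.close(); outer.close();
            // step 3: inner entity -> add multivalued fields to the matching documents
            PreparedStatement inner = db.prepareStatement(
                    "SELECT t.id AS track_id, a.name FROM track t " +
                    "JOIN artist a ON a.id = t.artist_id WHERE MOD(t.id, ?) = ?");
            inner.setInt(1, chunks); inner.setInt(2, chunk);
            rs = inner.executeQuery();
            while (rs.next()) {
                SolrInputDocument doc = docs.get(rs.getLong("track_id"));
                if (doc != null) doc.addField("artist", rs.getString("name"));
            }
            rs.close(); inner.close();
            // step 4: push this chunk to Solr and move on to the next one
            if (!docs.isEmpty()) solr.add(docs.values());
        }
        solr.commit();
        db.close();
    }
}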

Tim


2010/12/15 Robert Gründler :
> i've benchmarked the import already with 500k records, one time without the 
> artists subquery, and one time without the join in the main query:
>
>
> Without subquery: 500k in 3 min 30 sec
>
> Without join and without subquery: 500k in 2 min 30.
>
> With subquery and with left join:   320k in 6 Min 30
>
>
> so the joins / subqueries are definitely a bottleneck.
>
> How exactly did you implement the custom data import?
>
> In our case, we need to de-normalize the relations of the sql data for the 
> index,
> so i fear i can't really get rid of the join / subquery.
>
>
> -robert
>
>
>
>
>
> On Dec 15, 2010, at 15:43 , Tim Heckman wrote:
>
>> 2010/12/15 Robert Gründler :
>>> The data-config.xml looks like this (only 1 entity):
>>>
>>>      
>>>        
>>>        
>>>        
>>>        
>>>        
>>>        >> name="sf_unique_id"/>
>>>
>>>        
>>>          
>>>        
>>>
>>>      
>>
>> So there's one track entity with an artist sub-entity. My (admittedly
>> rather limited) experience has been that sub-entities, where you have
>> to run a separate query for every row in the parent entity, really
>> slow down data import. For my own purposes, I wrote a custom data
>> import using SolrJ to improve the performance (from 3 hours to 10
>> minutes).
>>
>> Just as a test, how long does it take if you comment out the artists entity?
>
>


Custom scoring for searhing geographic objects

2010-12-15 Thread Pavel Minchenkov
Hi,
Please give me advice on how to create custom scoring. I need the result
documents to be ordered depending on how popular each term in the document is
(popular = how many times it appears in the index) and on the length of the
document (fewer terms - higher in the search results).

For example, index contains following data:

ID| SEARCH_FIELD
--
1 | Russia
2 | Russia, Moscow
3 | Russia, Volgograd
4 | Russia, Ivanovo
5 | Russia, Ivanovo, Altayskaya street 45
6 | Russia, Moscow, Kremlin
7 | Russia, Moscow, Altayskaya street
8 | Russia, Moscow, Altayskaya street 15
9 | Russia, Moscow, Altayskaya street 15/26


And I should get next results:


Query | Document result set
--
Russia| 1,2,4,3,6,7,8,9,5
Moscow  | 2,6,7,8,9
Ivanovo| 4,5
Altayskaya  | 7,8,9,5

In fact, it is a search for geographic objects (cities, streets, houses).
Only part of the address may be given, and the most relevant results
should appear first.

Thanks.
-- 
Pavel Minchenkov


RE: Lower level filtering

2010-12-15 Thread Michael Owen

That was a quick response Steve!
Sounds all great! Much appreciated. Definitely think specifying a bit filter is 
something that many people may find useful.

I'll have a look at Solr-2052 too.
Thanks again,
Mike

> Date: Wed, 15 Dec 2010 09:57:54 -0500
> Subject: Re: Lower level filtering
> From: eelstretch...@gmail.com
> To: solr-user@lucene.apache.org
> 
> On Wed, Dec 15, 2010 at 9:49 AM, Michael Owen  
> wrote:
> > I'm currently using Solr and I've got a question about filtering on a lower 
> > level than filter queries.
> > We want to be able to restrict the documents that can possibly be returned 
> > to a users query. From another system we'll get a list of document unique 
> > ids for the user which is all the documents that they can possibly see 
> > (i.e. a base index list as such). The criteria for what document ids get 
> > returned is going to be quite flexible. As the number of ids can be up to 
> > index size - 1 (i.e. thousands) using a filter query doesn't seem right for 
> > entering a filter query which is so large.
> > Can something be done at a lower level - perhaps at a Lucene level - as I 
> > understand Lucene starts from a bitset of possible documents it can return 
> > - could we AND this with a filter bitset returned from the other system? 
> > Would this be a good way forward?
> > And then how would you do this in Solr with still keeping Solr's extra 
> > functionality it brings over Lucene. A new SearchHandler?
> 
> I actually submitted a patch a while ago in Solr-2052 that allows you
> to specify a bit filter and a filter query (you could specify either,
> but not both.)
> 
> Otis pointed out that the patch can't be applied against the current
> source, so I need to go back and make it work with the current source
> (new job = no time).  I'll see if I can find the time this weekend to
> do this.
> 
> Steve
> -- 
> Stephen Green
> http://thesearchguy.wordpress.com
  

RE: Lower level filtering

2010-12-15 Thread Michael Owen

Good point - though the inverse could be true, where only a few documents are 
allowed and the list is still big. Even in the middle ground, it's still 
going to be a long list of thousands.

Thanks
Mike


> Date: Wed, 15 Dec 2010 14:58:33 +
> Subject: Re: Lower level filtering
> From: savvas.andreas.moysi...@googlemail.com
> To: solr-user@lucene.apache.org
> 
> It might not be practical in your case, but is it possible to get from that
> other system, a list of ids the user is *not* allow to see and somehow
> invert the logic in the filter?
> 
> Regards,
> -- Savvas.
> 
> On 15 December 2010 14:49, Michael Owen  wrote:
> 
> >
> > Hi all,
> > I'm currently using Solr and I've got a question about filtering on a lower
> > level than filter queries.
> > We want to be able to restrict the documents that can possibly be returned
> > to a users query. From another system we'll get a list of document unique
> > ids for the user which is all the documents that they can possibly see (i.e.
> > a base index list as such). The criteria for what document ids get returned
> > is going to be quite flexible. As the number of ids can be up to index size
> > - 1 (i.e. thousands) using a filter query doesn't seem right for entering a
> > filter query which is so large.
> > Can something be done at a lower level - perhaps at a Lucene level - as I
> > understand Lucene starts from a bitset of possible documents it can return -
> > could we AND this with a filter bitset returned from the other system? Would
> > this be a good way forward?
> > And then how would you do this in Solr with still keeping Solr's extra
> > functionality it brings over Lucene. A new SearchHandler?
> > Thanks
> > Mike
> >
> >
> >
> >
> >
> >
  

Re: Lower level filtering

2010-12-15 Thread Erick Erickson
Here's the problem with what you're outlining:
Solr/Lucene doc IDs are NOT invariant, so
the doc IDs you get from "the other system"
will not be directly usable in the filter. But
assuming the other system stores what you've
defined as your uniqueKey, you could walk the
index and get the doc IDs from that (see TermDocs
in the Lucene API documentation).
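A small sketch of that TermDocs walk, assuming the Lucene 2.9 API underneath Solr 1.4
and a uniqueKey field called "id" (both assumptions); the bitset has to be rebuilt
whenever the reader changes, because the internal doc IDs are not stable across
commits:

import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

public class ExternalIdFilterBuilder {
    /** Maps externally supplied unique keys to a bitset of internal Lucene doc IDs. */
    public static OpenBitSet build(IndexReader reader, Collection<String> allowedIds)
            throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs td = reader.termDocs();
        for (String uid : allowedIds) {
            td.seek(new Term("id", uid));      // look up the unique key term
            while (td.next()) {
                bits.set(td.doc());            // mark the matching internal doc ID
            }
        }
        td.close();
        return bits;
    }
}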

There's an extensive discussion of this here:
http://lucene.472066.n3.nabble.com/filter-query-from-external-list-of-Solr-unique-IDs-td1709060.html

But this sounds a lot like this patch (I've only skimmed the
comments, so I may be off base):
https://issues.apache.org/jira/browse/SOLR-2272
note that this is a patch that I don't think has been
committed yet, so you'd have to get the source, apply
the patch and then use it. See:
http://wiki.apache.org/solr/HowToContribute
for instructions here.

Best
Erick

On Wed, Dec 15, 2010 at 9:49 AM, Michael Owen wrote:

>
> Hi all,
> I'm currently using Solr and I've got a question about filtering on a lower
> level than filter queries.
> We want to be able to restrict the documents that can possibly be returned
> to a users query. From another system we'll get a list of document unique
> ids for the user which is all the documents that they can possibly see (i.e.
> a base index list as such). The criteria for what document ids get returned
> is going to be quite flexible. As the number of ids can be up to index size
> - 1 (i.e. thousands) using a filter query doesn't seem right for entering a
> filter query which is so large.
> Can something be done at a lower level - perhaps at a Lucene level - as I
> understand Lucene starts from a bitset of possible documents it can return -
> could we AND this with a filter bitset returned from the other system? Would
> this be a good way forward?
> And then how would you do this in Solr with still keeping Solr's extra
> functionality it brings over Lucene. A new SearchHandler?
> Thanks
> Mike
>
>
>
>
>
>


Copying the index from one solr instance to another

2010-12-15 Thread Robert Gründler
Hi again,

let's say you have 2 solr Instances, which have both exactly the same 
configuration (schema, solrconfig, etc).

Could it cause any troubles if we import an index from a SQL database on solr 
instance A, and copy the whole
index to the datadir of solr instance B (both solr instances run on different 
servers) ?.

As far as i can tell, this should work and solr instance B should have the 
exact same index as solr instance A after the copy-process.

Are we missing something, or is this workflow safe to go with?

-robert

Re: Copying the index from one solr instance to another

2010-12-15 Thread Shawn Heisey

On 12/15/2010 10:05 AM, Robert Gründler wrote:

Hi again,

let's say you have 2 solr Instances, which have both exactly the same 
configuration (schema, solrconfig, etc).

Could it cause any troubles if we import an index from a SQL database on solr 
instance A, and copy the whole
index to the datadir of solr instance B (both solr instances run on different 
servers) ?.

As far as i can tell, this should work and solr instance B should have the 
exact same index as solr instance A after the copy-process.


I believe this should work, but I would take a couple of precautions.  
I'd stop Solr before putting the new index into place.  If you can't 
have it down for the entirety of the copy process, then copy it into an 
adjacent directory, shut down solr, rename the directories, and restart 
Solr.


If the Solr that built the index (specifically, the Lucene that comes 
with it) is newer than the one that you are copying to, it won't work.


If you've checked all that and if you're still having trouble, let us know.

Shawn



Re: Copying the index from one solr instance to another

2010-12-15 Thread Robert Gründler
thanks for your feedback. we can shutdown both solr servers for the time of the 
copy-process, and both 
solr instances run the same version, so we should be ok.

i'll let you know if we encounter any troubles.


-robert



On Dec 15, 2010, at 18:11 , Shawn Heisey wrote:

> On 12/15/2010 10:05 AM, Robert Gründler wrote:
>> Hi again,
>> 
>> let's say you have 2 solr Instances, which have both exactly the same 
>> configuration (schema, solrconfig, etc).
>> 
>> Could it cause any troubles if we import an index from a SQL database on 
>> solr instance A, and copy the whole
>> index to the datadir of solr instance B (both solr instances run on 
>> different servers) ?.
>> 
>> As far as i can tell, this should work and solr instance B should have the 
>> exact same index as solr instance A after the copy-process.
> 
> I believe this should work, but I would take a couple of precautions.  I'd 
> stop Solr before putting the new index into place.  If you can't have it down 
> for the entirety of the copy process, then copy it into an adjacent 
> directory, shut down solr, rename the directories, and restart Solr.
> 
> If the Solr that built the index (specifically, the Lucene that comes with 
> it) is newer than the one that you are copying to, it won't work.
> 
> If you've checked all that and if you're still having trouble, let us know.
> 
> Shawn
> 



Parenthesis in query string

2010-12-15 Thread Tommaso Teofili
Hi all,
I've just noticed a strange behavior (or, at least, I didn't expect it)
when adding useless parentheses to a query.
Using the lucene query parser in Solr I get no results with the query:

* ((( NOT (text:"something"))) AND date <= 2010-12-15) *

while I get the expected results when the query is :

*( NOT (text:"something") AND date <= 2010-12-15) *

Setting the debugQuery=true param I get this for the first query sample:

((( NOT (text:"something"))) AND data <= 2010-12-15)


((( NOT (text:"something"))) AND data <= 2010-12-15)


+(-PhraseQuery(text:"something")) +text:dat PhraseQuery(text:"2010 12 15")


+(-text:"something") +text:dat text:"2010 12 15"


while I get the following in the second (right) query sample:

( NOT (text:"something") AND data <= 2010-12-15)


( NOT (text:"something") AND data <= 2010-12-15)


-PhraseQuery(text:"something") +text:dat PhraseQuery(text:"2010 12 15")


-text:"something" +text:dat text:"2010 12 15"


Is that expected and am I missing something, or is it a bug?
Thanks in advance.
Regards,
Tommaso


Re: Problem using curl in PHP to get Solr results

2010-12-15 Thread Dennis Gearon
I want to just pass the JSON through after qualifying the user's access to the 
site.

Didn't want to spend the horsepower to receive it as PHP array syntax, run the 
risk of someone putting bad stuff in the contents and running 'exec()' on it, 
and then spending the extra horsepower to output it as JSON.

I had that page up in the browser to look at it later. If it doesn't do the 
above, I will be glad to have the Solr access abstracted, thanks :-)


 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Stephen Weiss 
To: solr-user@lucene.apache.org
Sent: Wed, December 15, 2010 1:36:11 AM
Subject: Re: Problem using curl in PHP to get Solr results

Forgive me if this seems like a dumb question but have you tried the 
Apache_Solr_Service class?

http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html

It's really quite good at handling the nuts and bolts of making the HTTP 
requests and decoding the responses for PHP.  I almost always use it when 
working from PHP.  It's all over Google so I don't know how someone would miss 
it but I don't know why else someone would bother curling a GET to SOLR 
otherwise.

--
Steve

On Dec 15, 2010, at 4:22 AM, Dennis Gearon wrote:

> I finally figured out how to use curl to GET results, i.e. just turn all 
> spaces 
>
> into '%20' in my type of queries. I'm using solar spatial, and then searching 
>in 
>
> both the default text field and a couple of columns. Works fine on in the 
> browser.
> 
> But if I query for it using curl in PHP, there's an error somewhere in the 
>JSON. 
>
> I don't know if it's in the PHP food chain or something else. 
> 
> 
> Just putting my solution to GETing from curl in PHP and my problem up here, 
> for 
>
> others to find.
> 
> Of course, if anyone knows the answer, all the better.
> 
> Dennis Gearon
> 
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
>better 
>
> idea to learn from others’ mistakes, so you do not have to make them 
> yourself. 

> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> 
> EARTH has a Right To Life,
> otherwise we all die.
>


Dates BC

2010-12-15 Thread Agethle, Matthias
Hi everyone,

does the solr.TrieDateField support dates BC?
I indexed negative dates and I'm able to query them,
but if I store them, they show up as positive dates.

Thanks
Matthias



search for a number within a range, where range values are mentioned in documents

2010-12-15 Thread Arunkumar Ayyavu
Hi!

I have a typical case wherein an attribute (in a DB record) can
contain different ranges of numeric values. Let us say the range
values in this attribute for "record1" are
(2-4,5000-8000,45000-5,454,231,1000). As you can see this
attribute can also contain isolated numeric values such as 454, 231
and 1000. Now, I want to return "record1" if the user searches for
20001 or 5003 or 231 or 5. Right now, I'm exploding the range
values (within a transformer) and indexing "record1" for each of the
values within a range. But this could result in out-of-memory error if
the range is too large. Could you help me figure out a better way of
addressing this type of queries using Solr.

Thanks a ton.

-- 
Arun


Transparent redundancy in Solr

2010-12-15 Thread Tommaso Teofili
Hi all,
me, Upayavira and other guys at Sourcesense have collected some Solr
architectural views inside the presentation at [1].
For sure one can set up an architecture for failover and resiliency on the
"search side" (search slaves with coordinators and distributed search), but
I'd like to ask how you would achieve transparent redundancy in Solr on the
"index side".
On slide 13 we put 2 slave backup masters, so if one of the main masters
goes down you can switch the slaves' replication to the backup master.
First question: how could this be made automatic?
In a previous thread [2] I talked about a possible solution: writing the
master url of the slaves in a properties file, so when you have to switch you
change that url to the backup master and reload the slave's core, but that is
not automatic :-) Any more advanced ideas?
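The properties-file idea roughly corresponds to a slave replication config along
these lines (property name and URLs are illustrative); switching masters then means
changing the property, e.g. in solrcore.properties, and reloading the core:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- masterUrl comes from a property so it can be repointed at a backup master -->
    <str name="masterUrl">${master.url:http://master1:8983/solr/replication}</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>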
Second question: when the main master comes back up, how can it automatically
be considered the backup master (since the backup master has hopefully
received some indexing requests in the meantime)? Also consider that its
index should be wiped out and replicated from the new master to ensure index
integrity.
Looking forward for your feedback,
Cheers,
Tommaso

[1] : http://www.slideshare.net/sourcesense/sharded-solr-setup-with-master
[2] : http://markmail.org/thread/vjj5jovbg6evpmpp


Re: search for a number within a range, where range values are mentioned in documents

2010-12-15 Thread Jonathan Rochkind
I'm not sure you're right that it will result in an out-of-memory error 
if the range is too large. I don't think it will, I think it'll be fine 
as far as memory goes, because of how Lucene works. Or do you actually 
have reason to believe it was causing you memory issues?  Or do you just 
mean memory issues in your "transformer", not actually in Solr?


Using Trie fields should also make it fine as far as CPU time goes.  
Using a trie int field with a non-zero "precisionStep" should likely be 
helpful in this case.


It _will_ increase the on-disk size of your indexes.

I'm not sure if there's a better approach, i can't think of one, but 
maybe someone else knows one.
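As a sketch, the exploded values could live in a multivalued trie int field along
these lines (the field and type names are made up):

<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<field name="allowed_values" type="tint" indexed="true" stored="false" multiValued="true"/>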


On 12/15/2010 12:56 PM, Arunkumar Ayyavu wrote:

Hi!

I have a typical case where in an attribute (in a DB record) can
contain different ranges of numeric values. Let us say the range
values in this attribute for "record1" are
(2-4,5000-8000,45000-5,454,231,1000). As you can see this
attribute can also contain isolated numeric values such as 454, 231
and 1000. Now, I want to return "record1" if the user searches for
20001 or 5003 or 231 or 5. Right now, I'm exploding the range
values (within a transformer) and indexing "record1" for each of the
values within a range. But this could result in out-of-memory error if
the range is too large. Could you help me figure out a better way of
addressing this type of queries using Solr.

Thanks a ton.



Re: Copying the index from one solr instance to another

2010-12-15 Thread Rob Casson
just making sure that you're aware of the built-in replication:

 http://wiki.apache.org/solr/SolrReplication

can pull the indexes, along with config files.

cheers,
rob

2010/12/15 Robert Gründler :
> Hi again,
>
> let's say you have 2 solr Instances, which have both exactly the same 
> configuration (schema, solrconfig, etc).
>
> Could it cause any troubles if we import an index from a SQL database on solr 
> instance A, and copy the whole
> index to the datadir of solr instance B (both solr instances run on different 
> servers) ?.
>
> As far as i can tell, this should work and solr instance B should have the 
> exact same index as solr instance A after the copy-process.
>
> Do we miss something, or is this workflow safe to go with?
>
> -robert


Re: facet.pivot for date fields

2010-12-15 Thread Adeel Qureshi
Thanks Pankaj - that was useful to know. I haven't used the query stuff
before for facets .. so that was good to know .. but the problem is still
there because I want the hierarchical counts which is exactly what
facet.pivot does ..

so e.g. i want to count for fieldC within fieldB and even fieldB within
fieldA .. that kind of stuff .. for string based fields .. facet.pivot does
exactly that and does it very well .. but it doesn't seem to work for date
ranges .. so in this case I want counts to be broken down by fieldA and
fieldB and then fieldB counts for monthly ranges .. I understand that I
might be able to use facet.query to construct several queries to get these
counts .. e.g. *facet.query=fieldA:someValue AND fieldB:someValue AND
fieldC:[NOW-1YEAR TO NOW]* .. but there could be thousand of possible
combinations for fieldA and fieldB which will require as many facet.queries
which I am assuming is not the way to go ..

it might be confusing what I have explained above so the simple question
still is if there is a way to get date range counts included in facet.pivot

Adeel


On Tue, Dec 14, 2010 at 10:53 PM, pankaj bhatt  wrote:

> Hi Adeel,
>  You can make use of facet.query attribute to make the Faceting work
> across a range of dates. Here i am using the duration, just replace the
> field with a field date and Range values as the DATE in SOLR Format.
> so your query parameter will be like this ( you can pass multiple parameter
> of "facet.query" name)
>
> http//blasdsdfsd/q?=asdfasd&facet.query=itemduration:[0 To
> 49]&facet.query=itemduration:[50 To 99]&facet.query=itemduration:[100 To
> 149]
>
> Hope, it helps.
>
> / Pankaj Bhatt.
>
> On Wed, Dec 15, 2010 at 2:01 AM, Adeel Qureshi  >wrote:
>
> > It doesnt seems like pivot facetting works on dates .. I was just curious
> > if
> > thats how its supposed to be or I am doing something wrong .. if I
> include
> > a
> > datefield in the pivot list .. i simply dont get any facet results back
> for
> > that datefield
> >
> > Thanks
> > Adeel
> >
>


Re: Problem using curl in PHP to get Solr results

2010-12-15 Thread Andrew McCombe
Hi

You could use Solr's PHP serialized object output (wt=phps) and then convert
it to JSON in your PHP (unserialize() the response, then json_encode() it).


Regards
Andrew McCombe

On 15 December 2010 17:49, Dennis Gearon  wrote:

> I want to just pass the JSON through after qualifying the user's access to
> the
> site.
>
>
> Didn't want to spend the horse power to receive it as PHP array syntax, run
> the
> risk of someone putting bad stuff in the contents and running 'exec()' on
> it,
> and then spending the extra horsepower to putput it as json.
>
> I had that page up in the browwser to look at it later. If it deons't do
> the
> above, I will be glad to have the Solr access abstracted, thanks :-)
>
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>
>
> - Original Message 
> From: Stephen Weiss 
> To: solr-user@lucene.apache.org
> Sent: Wed, December 15, 2010 1:36:11 AM
> Subject: Re: Problem using curl in PHP to get Solr results
>
> Forgive me if this seems like a dumb question but have you tried the
> Apache_Solr_Service class?
>
> http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html
>
> It's really quite good at handling the nuts and bolts of making the HTTP
> requests and decoding the responses for PHP.  I almost always use it when
> working from PHP.  It's all over Google so I don't know how someone would
> miss
> it but I don't know why else someone would bother curling a GET to SOLR
> otherwise.
>
> --
> Steve
>
> On Dec 15, 2010, at 4:22 AM, Dennis Gearon wrote:
>
> > I finally figured out how to use curl to GET results, i.e. just turn all
> spaces
> >
> > into '%20' in my type of queries. I'm using solar spatial, and then
> searching
> >in
> >
> > both the default text field and a couple of columns. Works fine on in the
> > browser.
> >
> > But if I query for it using curl in PHP, there's an error somewhere in
> the
> >JSON.
> >
> > I don't know if it's in the PHP food chain or something else.
> >
> >
> > Just putting my solution to GETing from curl in PHP and my problem up
> here, for
> >
> > others to find.
> >
> > Of course, if anyone knows the answer, all the better.
> >
> > Dennis Gearon
> >
> >
> > Signature Warning
> > 
> > It is always a good idea to learn from your own mistakes. It is usually a
> >better
> >
> > idea to learn from others’ mistakes, so you do not have to make them
> yourself.
>
> > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >
> >
> > EARTH has a Right To Life,
> > otherwise we all die.
> >
>


Re: Problem using curl in PHP to get Solr results

2010-12-15 Thread Markus Jelsma
The GeoDistanceComponent triggers the problem. It may be an issue in the 
component but it could very well be a Solr issue. It seems you missed a very 
recent thread on this one.

https://issues.apache.org/jira/browse/SOLR-2278

> I finally figured out how to use curl to GET results, i.e. just turn all
> spaces into '%20' in my type of queries. I'm using solar spatial, and then
> searching in both the default text field and a couple of columns. Works
> fine on in the browser.
> 
> But if I query for it using curl in PHP, there's an error somewhere in the
> JSON. I don't know if it's in the PHP food chain or something else.
> 
> 
> Just putting my solution to GETing from curl in PHP and my problem up here,
> for others to find.
> 
>  Of course, if anyone knows the answer, all the better.
> 
>  Dennis Gearon
> 
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better idea to learn from others’ mistakes, so you do not have to make
> them yourself. from
> 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> 
> EARTH has a Right To Life,
> otherwise we all die.


Re: [DIH] Example for SQL Server

2010-12-15 Thread Adam Estrada
I got it to work! This is an excellent article on importing SQL Server data
into your index.

http://www.chrisumbel.com/article/lucene_solr_sql_server

Adam

On Wed, Dec 15, 2010 at 8:43 AM, Adam Estrada wrote:

> Thanks All,
>
> Testing here shortly and will report back asap.
>
> w/r,
> Adam
>
>
> On Wed, Dec 15, 2010 at 4:10 AM, Savvas-Andreas Moysidis <
> savvas.andreas.moysi...@googlemail.com> wrote:
>
>> Hi Adam,
>>
>> we are using DIH to index off an SQL Server database(the freeby SQLExpress
>> one.. ;) ). We have defined the following in our
>> %TOMCAT_HOME%\solr\conf\data-config.xml:
>> 
>>
>>  >  name="mssqlDatasource"
>>  driver="net.sourceforge.jtds.jdbc.Driver"
>>  url="jdbc:jtds:sqlserver://{server.name
>> }:{server.port}/{dbInstanceName};instance=SQLEXPRESS"
>>  convertType="true"
>>  user="{user.name}"
>>  password="{user.password}"/>
>>  
>>> dataSource="mssqlDatasource"
>>   query="your query here" />
>>  
>> 
>>
>> We downloaded a JDBC driver from here
>> http://jtds.sourceforge.net/faq.html and
>> found it to be a quite stable driver.
>>
>> And the only thing we really had to do was drop that library in
>> %TOMCAT_HOME%\lib directory (for Tomcat 6+).
>>
>> Hope that helps.
>> -- Savvas.
>>
>> On 14 December 2010 22:46, Erick Erickson 
>> wrote:
>>
>> > The config isn't really any different for various sql instances, about
>> the
>> > only difference is the driver. Have you seen the example in the
>> > distribution somewhere like
>> > /example/example-DIH/solr/db/conf/db-data-config.xml?
>> >
>> > Also, there's a magic URL for debugging DIH at:
>> > .../solr/admin/dataimport.jsp
>> >
>> > If none of that is useful, could you post your attempt and maybe someone
>> > can
>> > offer some hints?
>> >
>> > Best
>> > Erick
>> >
>> > On Tue, Dec 14, 2010 at 5:32 PM, Adam Estrada <
>> > estrada.adam.gro...@gmail.com
>> > > wrote:
>> >
>> > > Does anyone have an example config.xml file I can take a look at for
>> SQL
>> > > Server? I need to index a lot of data from a DB and can't seem to
>> figure
>> > > out
>> > > the right syntax so any help would be greatly appreciated. What is the
>> > > correct /jar file to use and where do I put it in order for it to
>> work?
>> > >
>> > > Thanks,
>> > > Adam
>> > >
>> >
>>
>
>


Re: Exceptions in Embedded Solr

2010-12-15 Thread Antoniya Statelova
I experienced this on an EmbeddedSolrServer running behind a tomcat process.
After restarting the tomcat process 2-3 times (which also recreates the
SolrServer each time), the issue went away, but I don't know why it ever
started. It looked like the searcher shutdown was not clean the previous time,
and I believe that could have something to do with it.

Tony

On Sat, Dec 4, 2010 at 11:44 PM, Tharindu Mathew wrote:

> Any help on this?
>
> On Thu, Dec 2, 2010 at 7:51 PM, Tharindu Mathew 
> wrote:
> > Hi everyone,
> >
> > I get the exception below when using Embedded Solr suddenly. If I
> > delete the Solr index it goes back to normal, but it obviously has to
> > start indexing from scratch. Any idea what the cause of this is?
> >
> > java.lang.RuntimeException: java.io.FileNotFoundException:
> >
> /home/evanthika/WSO2/CARBON/GREG/3.6.0/23-11-2010/normal/wso2greg-3.6.0/solr/data/index/segments_2
> > (No such file or directory)
> > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
> > at org.apache.solr.core.SolrCore.(SolrCore.java:579)
> > at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
> > at
> org.wso2.carbon.registry.indexing.solr.SolrClient.(SolrClient.java:103)
> > at
> org.wso2.carbon.registry.indexing.solr.SolrClient.getInstance(SolrClient.java:115)
> > ... 44 more
> > Caused by: java.io.FileNotFoundException:
> >
> /home/evanthika/WSO2/CARBON/GREG/3.6.0/23-11-2010/normal/wso2greg-3.6.0/solr/data/index/segments_2
> > (No such file or directory)
> > at java.io.RandomAccessFile.open(Native Method)
> > at java.io.RandomAccessFile.(RandomAccessFile.java:212)
> > at
> org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.(SimpleFSDirectory.java:78)
> > at
> org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.(SimpleFSDirectory.java:108)
> > at
> org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.(NIOFSDirectory.java:94)
> > at
> org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:70)
> > at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:691)
> > at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:236)
> > at
> org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
> > at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
> > at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
> > at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
> > at org.apache.lucene.index.IndexReader.open(IndexReader.java:403)
> > at
> org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
> > at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057)
> > ... 48 more
> >
> > [2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.SolrCore} -
> > REFCOUNT ERROR: unreferenced org.apache.solr.core.solrc...@58f24b6
> > (null) has a reference count of 1
> > [2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.SolrCore} -
> > REFCOUNT ERROR: unreferenced org.apache.solr.core.solrc...@654dbbf6
> > (null) has a reference count of 1
> > [2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.CoreContainer} -
> > CoreContainer was not shutdown prior to finalize(), indicates a bug --
> > POSSIBLE RESOURCE LEAK!!!
> > [2010-11-23 14:14:46,568] ERROR {org.apache.solr.core.CoreContainer} -
> > CoreContainer was not shutdown prior to finalize(), indicates a bug --
> > POSSIBLE RESOURCE LEAK!!!
> >
> > --
> > Regards,
> >
> > Tharindu
> >
> >
> >
> > --
> > Regards,
> >
> > Tharindu
> >
>
>
>
> --
> Regards,
>
> Tharindu
>


[Adding] Entities when indexing a DB

2010-12-15 Thread Adam Estrada
All,

I have successfully indexed a single entity, but when I try multiple entities
the second one is skipped altogether. Is there something wrong with my
config file?



Re: [Adding] Entities when indexing a DB

2010-12-15 Thread Allistair Crossley
if mission.id and event.id have the same value, the later one will overwrite the indexed 
document. your ids need to be unique across all documents. i usually have a 
field id_original that i map the table id to, and then for each entity's id i 
prefix the value mapped to the schema id field with the entity name
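
as a sketch, that prefixing can be done directly in data-config.xml with the
TemplateTransformer (entity, column and field names below are only
illustrative, loosely adapted from the config in this thread, and assume the
schema has an id_original field):

  <entity name="mission" dataSource="ds1" transformer="TemplateTransformer"
          query="SELECT IdMission AS id_original, StrMissionname AS subject
                 FROM dbo.tblMission">
    <!-- build a globally unique id by prefixing the table id with the entity name -->
    <field column="id" template="mission-${mission.id_original}"/>
    <field column="id_original" name="id_original"/>
    <field column="subject" name="subject"/>
  </entity>

the event entity would do the same thing with an "event-" prefix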

On 15 Dec 2010, at 20:49, Adam Estrada wrote:

> All,
> 
> I have successfully indexed a single entity but when I try multiple entities
> is the second is skipped all together. Is there something wrong with my
> config file?
> 
> 
> 
> driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>   url="jdbc:sqlserver://10.0.2.93;databaseName="50_DEV"
>   user="adam"
>   password="password"/>
>  
>query = "SELECT IdMission AS id,
>CoreGroup AS cat,
>StrMissionname AS subject,
>strDescription AS description,
>DateCreated AS pubdate
>FROM dbo.tblMission">
>  
>  
>  
>  
>  
>
> query = "SELECT strsubject AS subject,
>strsummary as description,
>datecreated as date,
>CoreGroup as cat,
>idevent as id
>FROM dbo.tblEvent">
>  
>  
>  
>  
>  
>
>  
> 



Re: Dates BC

2010-12-15 Thread Chris Hostetter

: does the solr.TrieDateField support dates BC?
: I indexed negative dates and I'm able to query them,
: but if I store them, they show up as postitive dates.

Hmm... definitely seems to be a bug.  

I *think* this is another manifestation of SOLR-1899 (because of how the 
hokey formatting code uses a GregorianCalendar object) and have made a 
note there to explicitly test negative years before resolving - but it may 
actually be an unrelated problem somewhere else in the date handling.


-Hoss


Re: Facet same field with different preifx

2010-12-15 Thread Chris Hostetter

: Can I facet the same field twice with a different prefix as per example
: below?

not at the moment.  it should be possible if/when someone gets around to 
working on SOLR-2251...

https://issues.apache.org/jira/browse/SOLR-2251

-Hoss


Re: Custom scoring for searhing geographic objects

2010-12-15 Thread Grant Ingersoll
Have a look at http://lucene.apache.org/java/3_0_2/scoring.html for how Lucene's 
scoring works.  You can also override the Similarity class in Solr via the 
schema.xml file.  
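
As a rough sketch only (the class and package names are invented, and the
formulas would need tuning for the ranking you described), a custom Similarity
against the Lucene 2.9/3.x API could invert the usual idf preference so popular
terms score higher, and penalize longer documents harder:

  package com.example;

  import org.apache.lucene.search.DefaultSimilarity;

  // Hypothetical example: favors documents that contain frequent (popular)
  // terms and favors shorter documents.
  public class GeoNameSimilarity extends DefaultSimilarity {

    @Override
    public float idf(int docFreq, int numDocs) {
      // Default idf rewards rare terms; here more frequent terms get a higher factor.
      return (float) (1.0 + Math.log(1.0 + (double) docFreq / numDocs));
    }

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
      // Stronger length penalty than the default 1/sqrt(numTerms).
      return 1.0f / numTerms;
    }
  }

It would then be registered in schema.xml with
<similarity class="com.example.GeoNameSimilarity"/> once the class is on the
classpath.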

On Dec 15, 2010, at 10:28 AM, Pavel Minchenkov wrote:

> Hi,
> Please give me advise how to create custom scoring. I need to result that
> documents were in order, depending on how popular each term in the document
> (popular = how many times it appears in the index) and length of the
> document (less terms - higher in search results).
> 
> For example, index contains following data:
> 
> ID| SEARCH_FIELD
> --
> 1 | Russia
> 2 | Russia, Moscow
> 3 | Russia, Volgograd
> 4 | Russia, Ivanovo
> 5 | Russia, Ivanovo, Altayskaya street 45
> 6 | Russia, Moscow, Kremlin
> 7 | Russia, Moscow, Altayskaya street
> 8 | Russia, Moscow, Altayskaya street 15
> 9 | Russia, Moscow, Altayskaya street 15/26
> 
> 
> And I should get next results:
> 
> 
> Query | Document result set
> --
> Russia| 1,2,4,3,6,7,8,9,5
> Moscow  | 2,6,7,8,9
> Ivanovo| 4,5
> Altayskaya  | 7,8,9,5
> 
> In fact --- it is a search for geographic objects (cities, streets, houses).
> At the same time can be given only part of the address, and the results
> should appear the most relevant results.
> 
> Thanks.
> -- 
> Pavel Minchenkov

--
Grant Ingersoll
http://www.lucidimagination.com



[ANN] General Availability of LucidWorks Enterprise

2010-12-15 Thread Grant Ingersoll
Lucid Imagination is pleased to announce the general availability of our Apache 
Solr/Lucene powered LucidWorks Enterprise (LWE).  LWE is designed to make it 
easier for people to get up to speed on search by providing easier management, 
integration with libraries commonly used in building search applications (such 
as crawling), and value-add components developed by Lucid Imagination, all 
packaged on top of Apache Solr while still giving direct access to Solr.

You can get more info in the press release: 
http://www.lucidimagination.com/About/Company-News/Lucid-Imagination-Announces-General-Availability-and-Free-Download-LucidWorks-Ent

Other Details:
Download LucidWorks Enterprise software: www.lucidimagination.com/lwe/download
View free documentation: http://lucidworks.lucidimagination.com
View a demonstration of LucidWorks Enterprise: 
http://www.lucidimagination.com/lwe/demos 
Access LucidWorks Enterprise whitepapers and tutorials: 
www.lucidimagination.com/lwe/whitepapers
Read further commentary on the Lucid Imagination blog

Cheers,
Grant

--
Grant Ingersoll
http://www.lucidimagination.com



Re: Viewing query debug explanation with dismax and multicore

2010-12-15 Thread Chris Hostetter

: I am trying to debug my queries and see how scoring is done. I have 6 cores 
and 
: send the quesy to 6 shards and it's dismax handler (with search on various 
: fields with different boostings). I enable debug, and view source but I'm 
unable 
: to see the explanations. I'm returning ID and score as the "fl" field. Am I 

you'll need to provide us with more details -- what does your query URL 
look like? what does your request handler config look like? what does the 
response look like? (does it even have a debug section)

FWIW: doing a distributed query across the "example" setup from the 3x 
branch and the trunk i was able to see score explanations.
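
For reference, such a request looks roughly like this (hosts, ports and the
query term are placeholders; URL wrapped for readability):

  http://localhost:8983/solr/select?q=ipod
      &shards=localhost:8983/solr,localhost:7574/solr
      &fl=id,score&debugQuery=true

and the score explanations then show up in the "explain" block of the debug
section of the response.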

: supposed to retun something else to be able to see the explanation? or is it 
: because it's multi-core?

FYI: a terminology clarification: "Multi-core" is a term used to describe 
a single solr instance running multiple SolrCores (ie: using solr.xml) ... 
using the shards param is referred to as "distributed search" ... they 
are orthogonal concepts.  you can do a distributed search across 
several solr instances that are not using multi-core, or you can 
query a core in a multi-core instance, or you can do a 
distributed search of several cores, some or all of which may be 
running as part of multi-core solr instances.

-Hoss


Re: limit the search results to one category

2010-12-15 Thread Chris Hostetter

: Subject: limit the search results to one category
: References: <427522.34555...@web52907.mail.re2.yahoo.com>
: <930238.38683...@web51308.mail.re2.yahoo.com>
: In-Reply-To: <930238.38683...@web51308.mail.re2.yahoo.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss


Re: Problem with multicore

2010-12-15 Thread Chris Hostetter

: SimplePostTool: FATAL: Solr returned an error:
: 
Unexpected_character_m_code_109_in_prolog_expected___at_rowcol_unknownsource_11

if you look at your solr log (or the HTTP response body, SimplePostTool 
only gives you the status line) you'll see the more human readable form of 
that error which is probably something like...

   Unexpected character 'm' (code 109) in prolog; expected '<' 
 at [row,col {unknown-source}]: [1,1]


in short: this has nothing to do with the fact that you are running 
multi-core, and everything to do with the fact that one of your xml files 
isn't valid XML and has an "m" in the first character of the first line.

(it is most likely one of the XML files you are trying to post .. but 
there is a remote possibility it is in one of your config files -- i can't 
remember if config parsing errors are saved to use as HTTP errors in this 
way, but since you didn't confirm wehter you could actually load things 
like the admin screen after starting solr, i'm not sure off the top of my 
head)



-Hoss


Re: [ANN] General Availability of LucidWorks Enterprise

2010-12-15 Thread Andy
Congrats!

A couple questions:

1) Which version of Solr is this based on?
2) How is LWE different from standard Solr? How should one choose between the 
two?

Thanks.

--- On Wed, 12/15/10, Grant Ingersoll  wrote:

> From: Grant Ingersoll 
> Subject: [ANN] General Availability of LucidWorks Enterprise
> To: solr-user@lucene.apache.org, java-u...@lucene.apache.org
> Date: Wednesday, December 15, 2010, 4:39 PM
> Lucid Imagination is pleased to
> announce the general availability of our Apache Solr/Lucene
> powered LucidWorks Enterprise (LWE).  LWE is designed
> to make it easier for people to get up to speed on search by
> providing easier management, integration with libraries
> commonly used in building search applications (such as
> crawling) as well as value add components developed by Lucid
> Imagination all packaged on top of Apache Solr while still
> giving access to Solr.
> 
> You can get more info in the press release: 
> http://www.lucidimagination.com/About/Company-News/Lucid-Imagination-Announces-General-Availability-and-Free-Download-LucidWorks-Ent
> 
> Other Details:
> Download LucidWorks Enterprise software:
> www.lucidimagination.com/lwe/download
> View free documentation: http://lucidworks.lucidimagination.com
> View a demonstration of LucidWorks Enterprise: 
> http://www.lucidimagination.com/lwe/demos
> 
> Access LucidWorks Enterprise whitepapers and tutorials:
> www.lucidimagination.com/lwe/whitepapers
> Read further commentary on the Lucid Imagination blog
> 
> Cheers,
> Grant
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 





Re: Parenthesis in query string

2010-12-15 Thread Ahmet Arslan
I think this is related to http://search-lucene.com/m/lM9CXH2Pl7

Also, as explained here http://search-lucene.com/m/g4JmKSGMaI/ 
it is better to use the + and - operators rather than OR AND NOT with parentheses.

http://wiki.apache.org/lucene-java/BooleanQuerySyntax

Just for your information: There is no such syntax 'date <= 2010-12-15' that 
returns documents having date less than or equal to 2010-12-15. 
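
The usual way to express that constraint is a range query, for example
(assuming "date" is a Solr date field):

  date:[* TO 2010-12-15T23:59:59Z]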


--- On Wed, 12/15/10, Tommaso Teofili  wrote:

> From: Tommaso Teofili 
> Subject: Parenthesis in query string
> To: solr-user@lucene.apache.org
> Date: Wednesday, December 15, 2010, 7:24 PM
> Hi all,
> I've just noticed a strange behavior (or, at least, I
> didn't expect that),
> when adding useless parenthesis to a query.
> Using the lucene query parser in Solr I get no results with
> the query:
> 
> * ((( NOT (text:"something"))) AND date <= 2010-12-15)
> *
> 
> while I get the expected results when the query is :
> 
> *( NOT (text:"something") AND date <= 2010-12-15) *
> 
> Setting the debugQuery=true param I get this for the first
> query sample:
> 
> ((( NOT (text:"something"))) AND data <= 2010-12-15)
> 
> 
> ((( NOT (text:"something"))) AND data <= 2010-12-15)
> 
> 
> +(-PhraseQuery(text:"something")) +text:dat
> PhraseQuery(text:"2010 12 15")
> 
> 
> +(-text:"something") +text:dat text:"2010 12 15"
> 
> 
> while I get the following in the second (right) query
> sample:
> 
> ( NOT (text:"something") AND data <= 2010-12-15)
> 
> 
> ( NOT (text:"something") AND data <= 2010-12-15)
> 
> 
> -PhraseQuery(text:"something") +text:dat
> PhraseQuery(text:"2010 12 15")
> 
> 
> -text:"something" +text:dat text:"2010 12 15"
> 
> 
> Is that something expected and I am missing something or
> it's a bug?
> Thanks in advance.
> Regards,
> Tommaso
> 


  


Re: [Adding] Entities when indexing a DB

2010-12-15 Thread Adam Estrada
Ahhh...I found that I did not set a dataSource name and when I did that and
then referred each entity to that dataSource all went according to plan ;-)
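
For reference, a minimal sketch of that fix (query bodies and field mappings
elided; "ds1" is just an arbitrary name, connection details taken from the
config quoted earlier in this thread):

  <dataConfig>
    <dataSource name="ds1"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://10.0.2.93;databaseName=50_DEV"
                user="adam"
                password="password"/>
    <document>
      <entity name="mission" dataSource="ds1" query="...">
        ...
      </entity>
      <entity name="event" dataSource="ds1" query="...">
        ...
      </entity>
    </document>
  </dataConfig>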


Solr Rocks!
Adam



On Wed, Dec 15, 2010 at 3:53 PM, Allistair Crossley wrote:

> mission.id and event.id if the same value will be overwriting the indexed
> document. your ids need to be unique across all documents. i usually have a
> field id_original that i map the table id to, and then for id per entity i
> usually prefix it with the entity name in the value mapped to the schema id
> field
>
> On 15 Dec 2010, at 20:49, Adam Estrada wrote:
>
> > All,
> >
> > I have successfully indexed a single entity but when I try multiple
> entities
> > is the second is skipped all together. Is there something wrong with my
> > config file?
> >
> > 
> > 
> >   >   driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> >   url="jdbc:sqlserver://10.0.2.93;databaseName="50_DEV"
> >   user="adam"
> >   password="password"/>
> >  
> > >query = "SELECT IdMission AS id,
> >CoreGroup AS cat,
> >StrMissionname AS subject,
> >strDescription AS description,
> >DateCreated AS pubdate
> >FROM dbo.tblMission">
> >  
> >  
> >  
> >  
> >  
> >
> > > query = "SELECT strsubject AS subject,
> >strsummary as description,
> >datecreated as date,
> >CoreGroup as cat,
> >idevent as id
> >FROM dbo.tblEvent">
> >  
> >  
> >  
> >  
> >  
> >
> >  
> > 
>
>


Re: nexus of synonyms and stemming, take 2

2010-12-15 Thread Chris Hostetter

: This is a fairly basic synonyms question: how does synonyms handle stemming?

it's all a question of how your analysis chain is configured for the field 
type.

if you have your stemming filter before your synonyms filter, then the 
synonyms.txt file needs to map the *stems* of the synonyms.

if you have stemming after synonyms, then you need to have mappings for 
*all* of the roots.

i encourage you to experiment with different settings and look at the 
analysis.jsp tool to see the effects.
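
as an illustration, a field type with synonyms applied *before* stemming might
look like this (the specific tokenizer and stemmer are just examples):

  <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- synonyms are expanded on the raw, unstemmed tokens,
           so synonyms.txt can list normal word forms -->
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

swap the synonym and stem filters for the other ordering, in which case
synonyms.txt has to contain the stemmed forms.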


-Hoss


Re: can solrj swap cores?

2010-12-15 Thread Chris Hostetter

: One of our developers had initially tried swapping solr cores (e.g. core0
: and core1) using the solrj api, but it failed. (don't have the exact error)
: He susequently replaced the call with straight http (i.e. http client).
: 
: Unfortunately I don't have the exact error in front of me...

off the top of my head i can't think of any reason why it shouldn't work 
... w/o the details of the error it's pretty much impossible for anyone 
to offer you any assistance.

: Finally, can someone comment on the solrj javadoc on CoreAdminRequest:
:  * This class is experimental and subject to change.

it means that at the time the API was written, the contributor wasn't 
certain that it was the "correct" api for long term support, so they put 
that there as a CYA disclaimer in case future changes were needed in order 
to make the API work properly.

in essence: while some APIs in Solr are very stable, and well supported, 
that one may change in future releases w/o a straightforward migration 
strategy if the nature of the underlying functionality (core 
administration) changes in some way that makes the current API unsuitable.

(regrettably, we are not as good about auditing those types of disclaimers 
as we should be before/after releases) 

-Hoss


Re: Problem using curl in PHP to get Solr results

2010-12-15 Thread Dennis Gearon
I will look into the security and processor power implications of that. Good 
idea, thx.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Andrew McCombe 
To: solr-user@lucene.apache.org
Sent: Wed, December 15, 2010 11:07:54 AM
Subject: Re: Problem using curl in PHP to get Solr results

Hi

You could use Solr's php serialized object output (wt=phps) and then convert
it to json in your php:


Regards
Andrew McCombe

On 15 December 2010 17:49, Dennis Gearon  wrote:

> I want to just pass the JSON through after qualifying the user's access to
> the
> site.
>
>
> Didn't want to spend the horse power to receive it as PHP array syntax, run
> the
> risk of someone putting bad stuff in the contents and running 'exec()' on
> it,
> and then spending the extra horsepower to putput it as json.
>
> I had that page up in the browwser to look at it later. If it deons't do
> the
> above, I will be glad to have the Solr access abstracted, thanks :-)
>
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>
>
> - Original Message 
> From: Stephen Weiss 
> To: solr-user@lucene.apache.org
> Sent: Wed, December 15, 2010 1:36:11 AM
> Subject: Re: Problem using curl in PHP to get Solr results
>
> Forgive me if this seems like a dumb question but have you tried the
> Apache_Solr_Service class?
>
> http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html
>
> It's really quite good at handling the nuts and bolts of making the HTTP
> requests and decoding the responses for PHP.  I almost always use it when
> working from PHP.  It's all over Google so I don't know how someone would
> miss
> it but I don't know why else someone would bother curling a GET to SOLR
> otherwise.
>
> --
> Steve
>
> On Dec 15, 2010, at 4:22 AM, Dennis Gearon wrote:
>
> > I finally figured out how to use curl to GET results, i.e. just turn all
> spaces
> >
> > into '%20' in my type of queries. I'm using solar spatial, and then
> searching
> >in
> >
> > both the default text field and a couple of columns. Works fine on in the
> > browser.
> >
> > But if I query for it using curl in PHP, there's an error somewhere in
> the
> >JSON.
> >
> > I don't know if it's in the PHP food chain or something else.
> >
> >
> > Just putting my solution to GETing from curl in PHP and my problem up
> here, for
> >
> > others to find.
> >
> > Of course, if anyone knows the answer, all the better.
> >
> > Dennis Gearon
> >
> >
> > Signature Warning
> > 
> > It is always a good idea to learn from your own mistakes. It is usually a
> >better
> >
> > idea to learn from others’ mistakes, so you do not have to make them
> yourself.
>
> > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >
> >
> > EARTH has a Right To Life,
> > otherwise we all die.
> >
>



Re: can solrj swap cores?

2010-12-15 Thread Tim Heckman
It's been working for me. One thing to look out for might be the url
you're using in SolrUtil.getSolrServer()? The url you use for
reindexing won't be the same as the one you use to swap cores. Make
sure it's using "admin/cores" and not "production/admin/cores" or
"reindex/admin/cores".

Sorry if this is the obvious first thing you looked at already.  :)
It's the first thing that came to my mind.
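
For what it's worth, here is the minimal pattern that works for me (host, port
and core names are placeholders):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.CoreAdminRequest;
  import org.apache.solr.common.params.CoreAdminParams;

  public class SwapCores {
    public static void main(String[] args) throws Exception {
      // Point at the container root, not at an individual core;
      // CoreAdminRequest sends its request to /admin/cores itself.
      SolrServer adminServer =
          new CommonsHttpSolrServer("http://localhost:8983/solr");

      CoreAdminRequest swap = new CoreAdminRequest();
      swap.setCoreName("production");
      swap.setOtherCoreName("reindex");
      swap.setAction(CoreAdminParams.CoreAdminAction.SWAP);
      swap.process(adminServer);
    }
  }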

Tim


On Fri, Dec 3, 2010 at 12:45 PM, Will Milspec  wrote:
> hi all,
>
> Does solrj support "swapping cores"?
>
> One of our developers had initially tried swapping solr cores (e.g. core0
> and core1) using the solrj api, but it failed. (don't have the exact error)
> He susequently replaced the call with straight http (i.e. http client).
>
> Unfortunately I don't have the exact error in front of me...
>
> Solrj code:
>
>               CoreAdminRequest car = new CoreAdminRequest();
>               car.setCoreName("production");
>               car.setOtherCoreName("reindex");
>               car.setAction(CoreAdminParams.CoreAdminAction.SWAP);
>
>              SolrServer solrServer = SolrUtil.getSolrServer();
>              car.process(solrServer);
>              solrServer.commit();
>
> Finally, can someone comment on the solrj javadoc on CoreAdminRequest:
>  * This class is experimental and subject to change.
>
> thanks,
>
> will
>


Memory use during merges (OOM)

2010-12-15 Thread Burton-West, Tom
Hello all,

Are there any general guidelines for determining the main factors in memory use 
during merges?

We recently changed our indexing configuration to speed up indexing, but in the 
process of doing a very large merge we are running out of memory.
Below is a list of the changes and part of the indexWriter log.  The changes 
increased the indexing throughput by almost an order of magnitude
(from about 600 documents per hour to about 6000 documents per hour; our documents 
are about 800K).

We are trying to determine which of the changes to tweak to avoid the OOM, but 
still keep the benefit of the increased indexing throughput.

Is it likely that the changes to ramBufferSizeMB are the culprit or could it be 
the mergeFactor change from 10-20?

 Is there any obvious relationship between ramBufferSizeMB and the memory 
consumed by Solr?
 Are there rules of thumb for the memory needed in terms of the number or size 
of segments?

Our largest segments prior to the failed merge attempt were between 5GB and 
30GB.  The memory allocated to the Solr/tomcat JVM is 10GB.

Tom Burton-West
-

Changes to indexing configuration:
mergeScheduler
  before: serialMergeScheduler
  after:  concurrentMergeScheduler
mergeFactor
  before: 10
  after:  20
ramBufferSizeMB
  before: 32
  after:  320

excerpt from indexWriter.log

Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP: findMerges: 40 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP:   level 7.23609 to 7.98609: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP: 0 to 20: add this merge
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP:   level 5.44878 to 6.19878: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: LMP: 20 to 40: add this merge

...
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: applyDeletes
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; 
http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted 
docIDs and 0 deleted queries on 40 segments.
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
http-8091-Processor70]: hit exception flushing deletes
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; 
http-8091-Processor70]: hit OutOfMemoryError inside updateDocument
tom



Re: Problem using curl in PHP to get Solr results

2010-12-15 Thread Dennis Gearon
well, it was three problems:

1/ I was unknowingly saving the file as a 'complete web page' from firefox.
2/ I had a small troubleshooting message being printed out after the json.
3/ My partner had output all the spatial solr 'tiers' information, and there's a 
binary value in there that stops all the JSON viewer plugins for firefox from 
going past it. At this point, PHP *AND* the firefox plugins consider the json to 
be valid, but the plugin won't display past a character with value 0x007F. Or at 
least, that's what the browser is displaying in a box at that point.

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Dennis Gearon 
To: solr-user@lucene.apache.org
Sent: Wed, December 15, 2010 3:28:40 PM
Subject: Re: Problem using curl in PHP to get Solr results

I will look into the security and processor power implications of that. Good 
idea, thx.

Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 

idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Andrew McCombe 
To: solr-user@lucene.apache.org
Sent: Wed, December 15, 2010 11:07:54 AM
Subject: Re: Problem using curl in PHP to get Solr results

Hi

You could use Solr's php serialized object output (wt=phps) and then convert
it to json in your php:


Regards
Andrew McCombe

On 15 December 2010 17:49, Dennis Gearon  wrote:

> I want to just pass the JSON through after qualifying the user's access to
> the
> site.
>
>
> Didn't want to spend the horse power to receive it as PHP array syntax, run
> the
> risk of someone putting bad stuff in the contents and running 'exec()' on
> it,
> and then spending the extra horsepower to putput it as json.
>
> I had that page up in the browwser to look at it later. If it deons't do
> the
> above, I will be glad to have the Solr access abstracted, thanks :-)
>
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>
>
> - Original Message 
> From: Stephen Weiss 
> To: solr-user@lucene.apache.org
> Sent: Wed, December 15, 2010 1:36:11 AM
> Subject: Re: Problem using curl in PHP to get Solr results
>
> Forgive me if this seems like a dumb question but have you tried the
> Apache_Solr_Service class?
>
> http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html
>
> It's really quite good at handling the nuts and bolts of making the HTTP
> requests and decoding the responses for PHP.  I almost always use it when
> working from PHP.  It's all over Google so I don't know how someone would
> miss
> it but I don't know why else someone would bother curling a GET to SOLR
> otherwise.
>
> --
> Steve
>
> On Dec 15, 2010, at 4:22 AM, Dennis Gearon wrote:
>
> > I finally figured out how to use curl to GET results, i.e. just turn all
> spaces
> >
> > into '%20' in my type of queries. I'm using solar spatial, and then
> searching
> >in
> >
> > both the default text field and a couple of columns. Works fine on in the
> > browser.
> >
> > But if I query for it using curl in PHP, there's an error somewhere in
> the
> >JSON.
> >
> > I don't know if it's in the PHP food chain or something else.
> >
> >
> > Just putting my solution to GETing from curl in PHP and my problem up
> here, for
> >
> > others to find.
> >
> > Of course, if anyone knows the answer, all the better.
> >
> > Dennis Gearon
> >
> >
> > Signature Warning
> > 
> > It is always a good idea to learn from your own mistakes. It is usually a
> >better
> >
> > idea to learn from others’ mistakes, so you do not have to make them
> yourself.
>
> > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >
> >
> > EARTH has a Right To Life,
> > otherwise we all die.
> >
>


Re: Next Word - Any Suggestions?

2010-12-15 Thread Sean O'Connor

Hi Christopher,
One option comes to mind: shingles?

I have not done anything with them yet, but that is on my radar for 
sometime about a month out. Speaking unencumbered by experience or 
substantial understanding, my guess is that shingles would be great for 
you if you can select shingles with something like a terms prefix.


AFAIU: Shingling[1] basically takes a number of terms/words, and 
combines them into a single token. You could set the (max)shingle size 
to 2, and then find some way to use the terms component on the shingled 
field with a prefix, potentially:

http://wiki.apache.org/solr/TermsComponent
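
As a sketch (untested; the field type name, field name and the /terms handler
path below are assumptions based on the stock example config), the analysis
chain and the terms request could look like:

  <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- emit two-word shingles like "fox skipped"; drop the single words -->
      <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
              outputUnigrams="false"/>
    </analyzer>
  </fieldType>

and then ask the TermsComponent for every shingle starting with the target
word plus a trailing space:

  http://localhost:8983/solr/terms?terms=true&terms.fl=body_shingled&terms.prefix=fox%20&terms.limit=20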

I'm interested in what you find out, so please post back if you 
find something outside the mailing list.

Thanks,

Sean


[1] see something like: 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=%28shingle%29, 
but the Solr 1.4 Enterprise Search Server book is well worth the money, 
and I believe there is an ebook version for $10-20.


On 10/26/2010 08:26 AM, Christopher Ball wrote:

Am about to implement a custom query that is sort of mash-up of Facets,
Highlighting, and SpanQuery - but thought I'd see if anyone has done
anything similar.



In simple words, I need facet on the next word given a target word.



For example, if my index only had the following 5 documents (comprised of a
sentence each):



Doc 1 - The quick brown fox jumped over the fence.

Doc 2 - The sly fox skipped over the fence.

Doc 3 - The fat fox skipped his afternoon class.

Doc 4 - A brown duck and red fox, crashed the party.

Doc 5 - Charles Brown! Fox! Crashed my damn car.



The query should give the frequency of the distinct terms after the word
"fox":



skipped - 2

crashed - 2

jumped - 1



Long-term, do the opposite - frequency of the distinct terms before the word
"fox":



brown - 2

sly - 1

fat - 1

red - 1



My guess is that either the FastVectorHighlighter or SpanQuery would be a
reasonable starting point. I was hoping to take advantage of Vectors as I am
storing termVectors, termPositions, and termOffsets for the field in
question.



Grateful for any thoughts . . . reference implementations . . . words of
encouragement . . . free beer - whatever you can offer.



Gracias,



Christopher








Thank you!

2010-12-15 Thread Adam Estrada
I just want to say that this mailing list has been invaluable to a newbie like
me ;-) I posted a question earlier today and literally 10 minutes later I
got an answer that helped me solve my problem. This is proof that there is an
experienced and energetic community behind this FOSS group of projects and I
really appreciate everyone who has put up with my otherwise trivial
questions!  More importantly, thanks to all of the contributors who make the
whole thing possible!  I attended the Lucene Revolution conference in Boston
this year and the information that I was able to take away from the whole
thing has made me and my vocation a lot more valuable. Keep up the
outstanding work in the discovery of useful information from a sea of bleh
;-)

Kindest regards,
Adam


Re: Dataimport performance

2010-12-15 Thread Lance Norskog
Can you do just one join in the top-level query? The DIH does not have
a batching mechanism for these joins, but your database does.
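
For instance (table and column names below are purely illustrative), folding
the artist lookup into the parent entity's query means DIH issues one
statement per batch instead of one sub-query per track:

  SELECT t.id    AS id,
         t.title AS title,
         a.name  AS artist
  FROM   track t
  LEFT JOIN artist a ON a.id = t.artist_id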

On Wed, Dec 15, 2010 at 7:11 AM, Tim Heckman  wrote:
> The custom import I wrote is a java application that uses the SolrJ
> library. Basically, where I had sub-entities in the DIH config I did
> the mappings inside my java code.
>
> 1. Identify a subset or "chunk" of the primary id's to work on (so I
> don't have to load everything into memory at once) and put those in a
> temp table. I used a modulus on the id.
> 2. Select all of the outer entity from the database (joining on the
> id's in the temp table), and load the data from that result set into
> new solr input documents. I keep these in a hash map keyed on the
> id's.
> 3. Then select all of the inner entity, joining on the id's from the
> temp table. The result set has to include the id's from step 2. I go
> through this result set and load the data into the matching solr input
> documents from step 2.
> 4. Push that set of input documents to solr (optionally committing
> them), then go back to step 1 using the next subset or chunk.
>
> Not sure if this is the absolute best approach, but it's working well
> enough for my specific case.
>
> Tim
>
>
> 2010/12/15 Robert Gründler :
>> i've benchmarked the import already with 500k records, one time without the 
>> artists subquery, and one time without the join in the main query:
>>
>>
>> Without subquery: 500k in 3 min 30 sec
>>
>> Without join and without subquery: 500k in 2 min 30.
>>
>> With subquery and with left join:   320k in 6 Min 30
>>
>>
>> so the joins / subqueries are definitely a bottleneck.
>>
>> How exactly did you implement the custom data import?
>>
>> In our case, we need to de-normalize the relations of the sql data for the 
>> index,
>> so i fear i can't really get rid of the join / subquery.
>>
>>
>> -robert
>>
>>
>>
>>
>>
>> On Dec 15, 2010, at 15:43 , Tim Heckman wrote:
>>
>>> 2010/12/15 Robert Gründler :
 The data-config.xml looks like this (only 1 entity):

      
        
        
        
        
        
        >>> name="sf_unique_id"/>

        
          
        

      
>>>
>>> So there's one track entity with an artist sub-entity. My (admittedly
>>> rather limited) experience has been that sub-entities, where you have
>>> to run a separate query for every row in the parent entity, really
>>> slow down data import. For my own purposes, I wrote a custom data
>>> import using SolrJ to improve the performance (from 3 hours to 10
>>> minutes).
>>>
>>> Just as a test, how long does it take if you comment out the artists entity?
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com