System requirements in my case?

2012-05-22 Thread Bruno Mannina

Dear Solr users,

My company would like to use solr to index around 80 000 000 documents 
(xml files with around 5~10ko size each).

My program (robot) will connect to this solr with boolean requests.

Number of users: around 1000
Number of requests by user and by day: 300
Number of users by day: 30

I would like to subscribe to a host provider with this configuration:
- Dedicated Server
- Ubuntu
- Intel Xeon i7 2x 266+ GHz 12 Go 2 * 1500Go
- Unlimited bandwidth
- IP fixe

Do you think this configuration is enough?

Thanks for your info,
Sincerely
Bruno


Re: Question about sampling

2012-05-22 Thread Lance Norskog
My mistake- I did not research whether the data above is stored a
strings. The hashcode has to be stored as strings for this trick to
work.

On Sun, May 20, 2012 at 8:25 PM, Otis Gospodnetic
 wrote:
> I'd be curious about this, too!
> I suspect the answer is: not doable, patches welcome. :)
> But I'd love to be wrong!
>
> Otis
> 
> Performance Monitoring for Solr / ElasticSearch / HBase - 
> http://sematext.com/spm
>
>
>
>>
>> From: Yuval Dotan 
>>To: solr-user 
>>Sent: Wednesday, May 16, 2012 9:43 AM
>>Subject: Question about sampling
>>
>>Hi Guys
>>We have an environment containing billions of documents.
>>Faceting over this large result set could take many seconds, and so we
>>thought we might be able to use statistical sampling of a smaller result
>>set from the facet, and give an approximate result much quicker.
>>Is there any way to facet only a random sample of the results?
>>Thanks
>>Yuval
>>
>>
>>



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing & Searching MySQL table with Hindi and English data

2012-05-22 Thread Gora Mohanty
On 22 May 2012 12:07, KP Sanjailal  wrote:
> Hi,
>
> Thank you so much for replying.
>
> The MySQL database server is running on a Fedora Core 12 Machine with Hindi
> Language Support enabled.  Details of the database are - ENGINE=3DMyISAM and
>  DEFAULT CHARSET=3Dutf8
>
> Data is imported using the Solr DataImportHandler (mysql jdbc driver).
> In the schema.xml file the title field is defined as:
> 

Please show us your schema.xml and the configuration
file for the DataImportHandler (you might wish to obscure
sensitive details like username/password). Have you tried
the SELECT in the DIH configuration outside of Solr. Is
it producing proper UTF-8?

Regards,
Gora


fsv=true not returning sort_values for distributed searches

2012-05-22 Thread XJ
We use fsv=true to help debug sortings which works great for
non-distributed searches. However, its not working (no sort_values in
response) for multi shard queries. Any idea how to get this fixed?

thanks,
XJ


Strategy for maintaining De-normalized indexes

2012-05-22 Thread Sohail Aboobaker
Hi,

I have a very basic question and hopefully there is a simple answer to
this. We are trying to index a simple product catalog which has a master
product and child products. Each master product can have multiple child
products. A master product can be assigned one or more product categories.
Now, we need to be able to show counts of categories based on number of
child products in each category. We have indexed data using a join and
selecting appropriate values for index from each table. This is basically a
De-normalized result set. It works perfectly for our search purposes.
However, maintaining the index and keeping index up to date is an issue.
Whenever a product master is updated with a new category, we will need to
delete all the index entries for child products in index and insert them
again. This seems a lot of activity for a regular on-going operation i.e.
product category updates.

Since, join between schemas is only available in 4.0, what are other
strategies to maintain or to create such queries.

Thanks for your help.

Regards,
Sohail


RE: trunk cloud ui not working

2012-05-22 Thread Phil Hoy
Hi, 

I was using windows 7 but it is fine with chrome on Windows Web Server 2008 R2 
also I asked a colleague with windows 7 and it is fine for him too, so really 
sorry but I think it was a !'works on my machine' thing. 

Of course if I track down the cause I will reply to this email again.

Thanks,
Phil

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: 21 May 2012 18:22
To: solr-user@lucene.apache.org
Subject: Re: trunk cloud ui not working

What OS? I was just trying trunk and looking at that view on Chrome on OSX and 
Linux and did not see an issue.

On May 21, 2012, at 1:15 PM, Phil Hoy wrote:

> After further investigation I have found that it is not a problem on firefox, 
> only chrome and IE. 
> 
> Phil
> 
> -Original Message-
> Sent: 21 May 2012 18:05
> To: solr-user@lucene.apache.org
> Subject: trunk cloud ui not working
> 
> Hi,
> 
> I am running from the trunk and the localhost:8983/solr/#/~cloud page shows 
> nothing but "Fetch Zookeeper Data".
> 
> If I run fiddler I see that:
> http://localhost:8983/solr/zookeeper?wt=json&detail=true&path=%2Fclust
> erstate.json
> and
> http://localhost:8983/solr/zookeeper?wt=json&path=%2Flive_nodes
> are called and return data but no update to the ui.
> 
> Cheers,
> Phil
> 
> 
> __
> This email has been scanned by the brightsolid Email Security System. 
> Powered by MessageLabs 
> __

- Mark Miller
lucidimagination.com












__
This email has been scanned by the brightsolid Email Security System. Powered 
by MessageLabs 
__


Re: How can i search site name

2012-05-22 Thread Jan Høydahl
You need to explain your case in much more detail to get precise help. Please 
read http://wiki.apache.org/solr/UsingMailingLists

If your problem is that you have a URL and want to know the domain for it, e.g. 
www.company.com/foo/bar/index.html and you want only www.company.com you can 
use the UrlClassifyProcessor, see SOLR-2826.

--
Jan Høydahl, search solution architect
Cominvent AS - www.facebook.com/Cominvent
Solr Training - www.solrtraining.com

On 22. mai 2012, at 08:03, Shameema Umer wrote:

> Sorry,
> Please let me know how can I search site name using the solr query syntax.
> My results should show title, url and content.
> Title and content are being searched even though the
> content.
> 
> I need url or site name too. please, help.
> 
> Thanks in advance.
> 
> On Tue, May 22, 2012 at 11:05 AM, ketan kore  wrote:
> 
>> you can go on www.google.com and just type the site which you want to
>> search and google will show you the results as simple as that ...
>> 



Re: System requirements in my case?

2012-05-22 Thread findbestopensource
Dedicated Server may not be required. If you want to cut down cost, then
prefer shared server.

How much the RAM?

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 12:36 PM, Bruno Mannina  wrote:

> Dear Solr users,
>
> My company would like to use solr to index around 80 000 000 documents
> (xml files with around 5~10ko size each).
> My program (robot) will connect to this solr with boolean requests.
>
> Number of users: around 1000
> Number of requests by user and by day: 300
> Number of users by day: 30
>
> I would like to subscribe to a host provider with this configuration:
> - Dedicated Server
> - Ubuntu
> - Intel Xeon i7 2x 266+ GHz 12 Go 2 * 1500Go
> - Unlimited bandwidth
> - IP fixe
>
> Do you think this configuration is enough?
>
> Thanks for your info,
> Sincerely
> Bruno
>


Re: Strategy for maintaining De-normalized indexes

2012-05-22 Thread Tanguy Moal
Hello,

Can't the ID (uniqueKey) of the indexed documents (i.e. denormalized data)
be a combination of the master product id and the child product id ?

Therefor whenever you update your master product db entry, you simply need
to reindex documents depending on the master product entry.

You can even simply reindex your whole DB, updates are made in place (i.e.
old documents are *completely* overwritten by their respective updates).

There's nothing to delete if you build your unique key in a "maintainable"
way.

You can re-index documents whenever you need to do so.

--
Tanguy

2012/5/22 Sohail Aboobaker 

> Hi,
>
> I have a very basic question and hopefully there is a simple answer to
> this. We are trying to index a simple product catalog which has a master
> product and child products. Each master product can have multiple child
> products. A master product can be assigned one or more product categories.
> Now, we need to be able to show counts of categories based on number of
> child products in each category. We have indexed data using a join and
> selecting appropriate values for index from each table. This is basically a
> De-normalized result set. It works perfectly for our search purposes.
> However, maintaining the index and keeping index up to date is an issue.
> Whenever a product master is updated with a new category, we will need to
> delete all the index entries for child products in index and insert them
> again. This seems a lot of activity for a regular on-going operation i.e.
> product category updates.
>
> Since, join between schemas is only available in 4.0, what are other
> strategies to maintain or to create such queries.
>
> Thanks for your help.
>
> Regards,
> Sohail
>


Re: System requirements in my case?

2012-05-22 Thread lboutros
Hi Bruno,

will you use facets and result sorting ? 
What is the update frequency/volume ?

This could impact the amount of memory/server count.

Ludovic.

-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/System-requirements-in-my-case-tp3985309p3985327.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Strategy for maintaining De-normalized indexes

2012-05-22 Thread findbestopensource
Thats how de-normalization works. You need to update all child products.

If you just need the count and you are using facets then maintain a map
between category and main product, main product and child product. Lucene
db has no schema. You could retrieve the data based on its type.

Category record will have Category name, ProductName and a type
(CATEGORY_TYPE)
Child product record will have ProductName, MainProductName ProductDetails,
and type (PRODUCT_TYPE)

Now in this you may need to use two queries. Given the category name, fetch
the main product name and query using it to fetch the child products. Hope
it helps.

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 1:37 PM, Sohail Aboobaker wrote:

> Hi,
>
> I have a very basic question and hopefully there is a simple answer to
> this. We are trying to index a simple product catalog which has a master
> product and child products. Each master product can have multiple child
> products. A master product can be assigned one or more product categories.
> Now, we need to be able to show counts of categories based on number of
> child products in each category. We have indexed data using a join and
> selecting appropriate values for index from each table. This is basically a
> De-normalized result set. It works perfectly for our search purposes.
> However, maintaining the index and keeping index up to date is an issue.
> Whenever a product master is updated with a new category, we will need to
> delete all the index entries for child products in index and insert them
> again. This seems a lot of activity for a regular on-going operation i.e.
> product category updates.
>
> Since, join between schemas is only available in 4.0, what are other
> strategies to maintain or to create such queries.
>
> Thanks for your help.
>
> Regards,
> Sohail
>


Multicore Solr

2012-05-22 Thread Shanu Jha
Hi all,

greetings from my end. This is my first post on this mailing list. I have
few questions on multicore solr. For background we want to create a core
for each user logged in to our application. In that case it may be 50, 100,
1000, N-numbers. Each core will be used to write and search index in real
time.

1. Is this a good idea to go with?
2. What are the pros and cons of this approch?

Awaiting for your response.

Regards
AJ


Re: System requirements in my case?

2012-05-22 Thread Bruno Mannina

My choice: http://www.ovh.com/fr/serveurs_dedies/eg_best_of.xml

24 Go DDR3

Le 22/05/2012 10:26, findbestopensource a écrit :

Dedicated Server may not be required. If you want to cut down cost, then
prefer shared server.

How much the RAM?

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 12:36 PM, Bruno Mannina  wrote:


Dear Solr users,

My company would like to use solr to index around 80 000 000 documents
(xml files with around 5~10ko size each).
My program (robot) will connect to this solr with boolean requests.

Number of users: around 1000
Number of requests by user and by day: 300
Number of users by day: 30

I would like to subscribe to a host provider with this configuration:
- Dedicated Server
- Ubuntu
- Intel Xeon i7 2x 266+ GHz 12 Go 2 * 1500Go
- Unlimited bandwidth
- IP fixe

Do you think this configuration is enough?

Thanks for your info,
Sincerely
Bruno





Re: System requirements in my case?

2012-05-22 Thread Bruno Mannina

Hi,

facets I don't know yet because I don't know exactly what is facets (sorry)

Sorting: yes
Scoring: yes

Concerning update Frequency : every week
Volume: around 1Go data by year


Merci beaucoup :)

Aix En Provence
France

Le 22/05/2012 10:35, lboutros a écrit :

Hi Bruno,

will you use facets and result sorting ?
What is the update frequency/volume ?

This could impact the amount of memory/server count.

Ludovic.

-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/System-requirements-in-my-case-tp3985309p3985327.html
Sent from the Solr - User mailing list archive at Nabble.com.






Re: Multicore Solr

2012-05-22 Thread findbestopensource
Having cores per user is not good idea. The count is too high. Keep
everything in single core. You could filter the data based on user name or
user id.

Regards
Aditya
www.findbestopensource.com



On Tue, May 22, 2012 at 2:29 PM, Shanu Jha  wrote:

> Hi all,
>
> greetings from my end. This is my first post on this mailing list. I have
> few questions on multicore solr. For background we want to create a core
> for each user logged in to our application. In that case it may be 50, 100,
> 1000, N-numbers. Each core will be used to write and search index in real
> time.
>
> 1. Is this a good idea to go with?
> 2. What are the pros and cons of this approch?
>
> Awaiting for your response.
>
> Regards
> AJ
>


Re: System requirements in my case?

2012-05-22 Thread findbestopensource
Seems to be fine. Go head.

Before hosting, Have you tried / tested your application in local setup.
RAM usage is what matters in terms of Solr. Just benchmark your app for 100
000 documents, Log the memory used. Calculate the RAM reqd for 80 000 000
documents.

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 2:36 PM, Bruno Mannina  wrote:

> My choice: 
> http://www.ovh.com/fr/**serveurs_dedies/eg_best_of.xml
>
> 24 Go DDR3
>
> Le 22/05/2012 10:26, findbestopensource a écrit :
>
>  Dedicated Server may not be required. If you want to cut down cost, then
>> prefer shared server.
>>
>> How much the RAM?
>>
>> Regards
>> Aditya
>> www.findbestopensource.com
>>
>>
>> On Tue, May 22, 2012 at 12:36 PM, Bruno Mannina  wrote:
>>
>>  Dear Solr users,
>>>
>>> My company would like to use solr to index around 80 000 000 documents
>>> (xml files with around 5~10ko size each).
>>> My program (robot) will connect to this solr with boolean requests.
>>>
>>> Number of users: around 1000
>>> Number of requests by user and by day: 300
>>> Number of users by day: 30
>>>
>>> I would like to subscribe to a host provider with this configuration:
>>> - Dedicated Server
>>> - Ubuntu
>>> - Intel Xeon i7 2x 266+ GHz 12 Go 2 * 1500Go
>>> - Unlimited bandwidth
>>> - IP fixe
>>>
>>> Do you think this configuration is enough?
>>>
>>> Thanks for your info,
>>> Sincerely
>>> Bruno
>>>
>>>
>


Re: is commit a sequential process in solr indexing

2012-05-22 Thread findbestopensource
Yes. Lucene / Solr supports multi threaded environment. You could do commit
from two different threads to same core or different core.

Regards
Aditya
www.findbestopensource.com

On Tue, May 22, 2012 at 12:35 AM, jame vaalet  wrote:

> hi,
> my use case here is to search all the incoming documents for certain
> comination of words which are pre-determined. So what am doing here is,
> create a batch of x docs according to their creation date, index them,
> commit them and search them for query (pre-determined).
> My question is, if i have to make the entire process multi threaded and two
> threads are trying to commit two different set of batchs, will the commit
> happen in parallel. what if am trying to commit to different solr-cores ?
>
> --
>
> -JAME
>


Re: Strategy for maintaining De-normalized indexes

2012-05-22 Thread Sohail Aboobaker
Thank you for quick replies.

Can't the ID (uniqueKey) of the indexed documents (i.e. denormalized data)
be a combination of the master product id and the child product id ?
  -- We do not need it as each child is already a unique key.

Therefore whenever you update your master product db entry, you simply
need  to reindex documents depending on the master product entry.
  -- This is where the confusion might be. I may have misread it but Apache
Solr3 Enterprise Search, it mentions that "if any part of the document
needs to be updated, the entire document must be replaced. Internally this
is a deletion and an addition". Is re-indexing all detail records a huge
performance hit? Assuming that a master can have upto 10 to 20k of child
records?

Thanks again.

Sohail


Re: How can i search site name

2012-05-22 Thread Jan Høydahl
Hi,

I would probably use (e)DisMax.
Index your url and metadata fields as text without stemming, e.g. text_general
Then query as &q=mycompany&defType=edismax&qf=title^10 content^1 url^5
If you like to give higher weight to the domain/site part of the URL, apply 
UrlClassifyProcessor and search the "domain" field separately with higher 
weight.

--
Jan Høydahl, search solution architect
Cominvent AS - www.facebook.com/Cominvent
Solr Training - www.solrtraining.com

On 22. mai 2012, at 12:23, Shameema Umer wrote:

> Thanks Li Li and Jan.
> 
> Yes, if url is www.company.com/foo/bar/index.html, I should be able to
> search the sub-strings like company, foo or bar etc.
> 
> when I changed the part of my schema file from
> 
>content
> 
> to
> 
>   stext
>   
>   
>   
> 
> server error occurred after restarting solr. Do I need to re-index solr.
> Please help me as i need to search title url and content with privilege to
> title. If DisMaxRequestHandler helps me solve my problems, let me know the
> best tutorial page to study
> it.
> 
> Thanks
> Shameema
> 



Re: Strategy for maintaining De-normalized indexes

2012-05-22 Thread Tanguy Moal
It all depends on the frequency at which you refresh your data, on your
deployment (master/slave setup), ...
Many things need to be taken into account!

Did you face any performance issue while building your index?
If you didn't, rebuilding it shouldn't be more problematic.

--
Tanguy

2012/5/22 Sohail Aboobaker 

> Thank you for quick replies.
>
> Can't the ID (uniqueKey) of the indexed documents (i.e. denormalized data)
> be a combination of the master product id and the child product id ?
>   -- We do not need it as each child is already a unique key.
>
> Therefore whenever you update your master product db entry, you simply
> need  to reindex documents depending on the master product entry.
>   -- This is where the confusion might be. I may have misread it but Apache
> Solr3 Enterprise Search, it mentions that "if any part of the document
> needs to be updated, the entire document must be replaced. Internally this
> is a deletion and an addition". Is re-indexing all detail records a huge
> performance hit? Assuming that a master can have upto 10 to 20k of child
> records?
>
> Thanks again.
>
> Sohail
>


Re: Strategy for maintaining De-normalized indexes

2012-05-22 Thread Sohail Aboobaker
We are still in design phase, so we haven't hit any performance issues. We
do not want to discover performance issues too late during QA :) We would
rather account for any issues during the design phase.

The refresh rate on fields that we are using from master table will be
rare. May be three or four times in a year.

Regards,
Sohail


Re: How can i search site name

2012-05-22 Thread Shameema Umer
Thanks Jan.* It worked perfect*. Thats all i needed.
May the God bless you.

Regards
Shameema

On Tue, May 22, 2012 at 4:57 PM, Jan Høydahl  wrote:

> Hi,
>
> I would probably use (e)DisMax.
> Index your url and metadata fields as text without stemming, e.g.
> text_general
> Then query as &q=mycompany&defType=edismax&qf=title^10 content^1 url^5
> If you like to give higher weight to the domain/site part of the URL,
> apply UrlClassifyProcessor and search the "domain" field separately with
> higher weight.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.facebook.com/Cominvent
> Solr Training - www.solrtraining.com
>
> On 22. mai 2012, at 12:23, Shameema Umer wrote:
>
> > Thanks Li Li and Jan.
> >
> > Yes, if url is www.company.com/foo/bar/index.html, I should be able to
> > search the sub-strings like company, foo or bar etc.
> >
> > when I changed the part of my schema file from
> >
> >content
> >
> > to
> >
> >   stext
> >   
> >   
> >   
> >
> > server error occurred after restarting solr. Do I need to re-index solr.
> > Please help me as i need to search title url and content with privilege
> to
> > title. If DisMaxRequestHandler helps me solve my problems, let me know
> the
> > best tutorial page to study
> > it.<
> http://wiki.apache.org/solr/DisMaxRequestHandler?action=fullsearch&context=180&value=linkto%3A%22DisMaxRequestHandler%22
> >
> >
> > Thanks
> > Shameema
> > <
> http://wiki.apache.org/solr/DisMaxRequestHandler?action=fullsearch&context=180&value=linkto%3A%22DisMaxRequestHandler%22
> >
>
>


Installing Solr on Tomcat using Shell - Code wrong?

2012-05-22 Thread Spadez
Hi,

This is the install process I used in my shell script to try and get Tomcat
running with Solr (debian server):



I swear this used to work, but currently only Tomcat works. The Solr page
just comes up with "The requested resource (/solr/admin) is not available."

Can anyone give me some insight into why this isnt working? Its driving me
nuts.

James

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Installing-Solr-on-Tomcat-using-Shell-Code-wrong-tp3985393.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: System requirements in my case?

2012-05-22 Thread Jan Høydahl
Hi,

It is impossible to guess the required HW size without more knowledge about 
data and usage. 80 mill docs is a fair amount.

Here's how I would approach sizing the setup:
1) Get your schema in shape, removing unnecessary stored/indexed fields
2) To a test index locally of a part of the dataset, e.g. 10 mill docs and 
perform an Optimize
3) Measure the size of the index folder, multiply with 8 to get a clue of total 
index size
4) Do some benchmarking with realistic types of queries to identify performance 
bottlenecks on query side

Depending on your requirements for search performance, you can beef up your RAM 
to hold the whole index or depend on slow disks as a bottleneck. If you find 
that total size of index is 16Gb, you should leave >16Gb free for OS disk 
caching, e.g. allocate 8Gb to Tomcat/Solr and leave the rest for the OS. If I 
should guess, you probably find that one server gets overloaded or too slow 
with your amount of docs, and that you end up with sharding across 2-4 servers.

PS: Do you always need to search all data? A trick may be to partition your 
data such that say 80% of searches go to a "fresh" index with 10% of the 
content, while the remaining searches include everything.

--
Jan Høydahl, search solution architect
Cominvent AS - www.facebook.com/Cominvent
Solr Training - www.solrtraining.com

On 22. mai 2012, at 11:06, Bruno Mannina wrote:

> My choice: http://www.ovh.com/fr/serveurs_dedies/eg_best_of.xml
> 
> 24 Go DDR3
> 
> Le 22/05/2012 10:26, findbestopensource a écrit :
>> Dedicated Server may not be required. If you want to cut down cost, then
>> prefer shared server.
>> 
>> How much the RAM?
>> 
>> Regards
>> Aditya
>> www.findbestopensource.com
>> 
>> 
>> On Tue, May 22, 2012 at 12:36 PM, Bruno Mannina  wrote:
>> 
>>> Dear Solr users,
>>> 
>>> My company would like to use solr to index around 80 000 000 documents
>>> (xml files with around 5~10ko size each).
>>> My program (robot) will connect to this solr with boolean requests.
>>> 
>>> Number of users: around 1000
>>> Number of requests by user and by day: 300
>>> Number of users by day: 30
>>> 
>>> I would like to subscribe to a host provider with this configuration:
>>> - Dedicated Server
>>> - Ubuntu
>>> - Intel Xeon i7 2x 266+ GHz 12 Go 2 * 1500Go
>>> - Unlimited bandwidth
>>> - IP fixe
>>> 
>>> Do you think this configuration is enough?
>>> 
>>> Thanks for your info,
>>> Sincerely
>>> Bruno
>>> 
> 



Re: Multicore Solr

2012-05-22 Thread Shanu Jha
Hi,

Could please tell me what do you mean by filter data by users? I would like
to know is there real problem creating a core for a user. ie. resource
utilization, cpu usage etc.

AJ

On Tue, May 22, 2012 at 4:39 PM, findbestopensource <
findbestopensou...@gmail.com> wrote:

> Having cores per user is not good idea. The count is too high. Keep
> everything in single core. You could filter the data based on user name or
> user id.
>
> Regards
> Aditya
> www.findbestopensource.com
>
>
>
> On Tue, May 22, 2012 at 2:29 PM, Shanu Jha  wrote:
>
> > Hi all,
> >
> > greetings from my end. This is my first post on this mailing list. I have
> > few questions on multicore solr. For background we want to create a core
> > for each user logged in to our application. In that case it may be 50,
> 100,
> > 1000, N-numbers. Each core will be used to write and search index in real
> > time.
> >
> > 1. Is this a good idea to go with?
> > 2. What are the pros and cons of this approch?
> >
> > Awaiting for your response.
> >
> > Regards
> > AJ
> >
>


Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-22 Thread Yandong Yao
Hi Darren,

Thanks very much for your reply.

The reason I want to control core indexing/searching is that I want to
use one core to store one customer's data (all customer share same
config):  such as customer 1 use coreForCustomer1 and customer 2
use coreForCustomer2.

Is there any better way than using different core for different customer?

Another way maybe use different collection for different customer, while
not sure how many collections solr cloud could support. Which way is better
in terms of flexibility/scalability? (suppose there are tens of thousands
customers).

Regards,
Yandong

2012/5/22 Darren Govoni 

> Why do you want to control what gets indexed into a core and then
> knowing what core to search? That's the kind of "knowing" that SolrCloud
> solves. In SolrCloud, it handles the distribution of documents across
> shards and retrieves them regardless of which node is searched from.
> That is the point of "cloud", you don't know the details of where
> exactly documents are being managed (i.e. they are cloudy). It can
> change and re-balance from time to time. SolrCloud performs the
> distributed search for you, therefore when you try to search a node/core
> with no documents, all the results from the "cloud" are retrieved
> regardless. This is considered "A Good Thing".
>
> It requires a change in thinking about indexing and searching
>
> On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
> > Hi Guys,
> >
> > I use following command to start solr cloud according to solr cloud wiki.
> >
> > yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
> > -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
> > yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983
> -jar
> > start.jar
> >
> > Then I have created several cores using CoreAdmin API (
> > http://localhost:8983/solr/admin/cores?action=CREATE&name=
> > &collection=collection1), and clusterstate.json show following
> > topology:
> >
> >
> > collection1:
> > -- shard1:
> >   -- collection1
> >   -- CoreForCustomer1
> >   -- CoreForCustomer3
> >   -- CoreForCustomer5
> > -- shard2:
> >   -- collection1
> >   -- CoreForCustomer2
> >   -- CoreForCustomer4
> >
> >
> > 1) Index:
> >
> > Using following command to index mem.xml file in exampledocs directory.
> >
> > yydzero:exampledocs bjcoe$ java -Durl=
> > http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
> > SimplePostTool: version 1.4
> > SimplePostTool: POSTing files to
> > http://localhost:8983/solr/coreForCustomer3/update..
> > SimplePostTool: POSTing file mem.xml
> > SimplePostTool: COMMITting Solr index changes.
> >
> > And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
> > 'coreForCustomer5' has 3 documents (mem.xml has 3 documents) and other 2
> > core has 0 documents.
> >
> > *Question 1:*  Is this expected behavior? How do I to index documents
> into
> > a specific core?
> >
> > *Question 2*:  If SolrCloud don't support this yet, how could I extend it
> > to support this feature (index document to particular core), where
> should i
> > start, the hashing algorithm?
> >
> > *Question 3*:  Why the documents are also indexed into 'coreForCustomer1'
> > and 'coreForCustomer5'?  The default replica for documents are 1, right?
> >
> > Then I try to index some document to 'coreForCustomer2':
> >
> > $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar
> > post.jar ipod_video.xml
> >
> > While 'coreForCustomer2' still have 0 documents and documents in
> ipod_video
> > are indexed to core for customer 1/3/5.
> >
> > *Question 4*:  Why this happens?
> >
> > 2) Search: I use "
> > http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml"; to
> > search against 'CoreForCustomer2', while it will return all documents in
> > the whole collection even though this core has no documents at all.
> >
> > Then I use "
> >
> http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2
> ",
> > and it will return 0 documents.
> >
> > *Question 5*: So If want to search against a particular core, we need to
> > use 'shards' parameter and use solrCore name as parameter value, right?
> >
> >
> > Thanks very much in advance!
> >
> > Regards,
> > Yandong
>
>
>


Re: System requirements in my case?

2012-05-22 Thread Bruno Mannina
I installed a temp server on my university with 12 000 docs (Ubuntu+solr 
3.6.0)

May be I can preview the size of memory I need?

Q: How can I check the memory used?


Le 22/05/2012 13:14, findbestopensource a écrit :

Seems to be fine. Go head.

Before hosting, Have you tried / tested your application in local setup.
RAM usage is what matters in terms of Solr. Just benchmark your app for 100
000 documents, Log the memory used. Calculate the RAM reqd for 80 000 000
documents.

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 2:36 PM, Bruno Mannina  wrote:


My choice: 
http://www.ovh.com/fr/**serveurs_dedies/eg_best_of.xml

24 Go DDR3

Le 22/05/2012 10:26, findbestopensource a écrit :

  Dedicated Server may not be required. If you want to cut down cost, then

prefer shared server.

How much the RAM?

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 12:36 PM, Bruno Mannina   wrote:

  Dear Solr users,

My company would like to use solr to index around 80 000 000 documents
(xml files with around 5~10ko size each).
My program (robot) will connect to this solr with boolean requests.

Number of users: around 1000
Number of requests by user and by day: 300
Number of users by day: 30

I would like to subscribe to a host provider with this configuration:
- Dedicated Server
- Ubuntu
- Intel Xeon i7 2x 266+ GHz 12 Go 2 * 1500Go
- Unlimited bandwidth
- IP fixe

Do you think this configuration is enough?

Thanks for your info,
Sincerely
Bruno






RE: Wildcard-Search Solr 3.5.0

2012-05-22 Thread spring
> > The text may contain "FooBar".
> > 
> > When I do a wildcard search like this: "Foo*" - no hits.
> > When I do a wildcard search like this: "foo*" - doc is
> > found.
> 
> Please see http://wiki.apache.org/solr/MultitermQueryAnalysis


Well, it works in 3.6. With one exception: If I use german umlauts it does
not work anymore.

Text: Bär

Bä* -> no hits
Bär -> hits

What can I do in this case?

Thank you



Re: System requirements in my case?

2012-05-22 Thread Bruno Mannina

Hi Jan,

Thanks for all these details !

Answers are below.

Sincerely,
Bruno


Le 22/05/2012 13:58, Jan Høydahl a écrit :

Hi,

It is impossible to guess the required HW size without more knowledge about 
data and usage. 80 mill docs is a fair amount.

Here's how I would approach sizing the setup:
1) Get your schema in shape, removing unnecessary stored/indexed fields

Ok good idea !

2) To a test index locally of a part of the dataset, e.g. 10 mill docs and 
perform an Optimize

Concerning test, I have only actually a sample with 12000 docs. no more :'(

3) Measure the size of the index folder, multiply with 8 to get a clue of total 
index size

With 12 000 docs my index folder size is: 33Mo
ps: I use "solr.clustering.enabled=true"


4) Do some benchmarking with realistic types of queries to identify performance 
bottlenecks on query side

yep, this point is for later.


Depending on your requirements for search performance, you can beef up your RAM to 
hold the whole index or depend on slow disks as a bottleneck. If you find that 
total size of index is 16Gb, you should leave>16Gb free for OS disk caching, 
e.g. allocate 8Gb to Tomcat/Solr and leave the rest for the OS. If I should guess, 
you probably find that one server gets overloaded or too slow with your amount of 
docs, and that you end up with sharding across 2-4 servers.
I will take a look to see if I can easely increase RAM on the server 
(actually 24Go)


Another question concerning the execution of solr, have just to run java 
-jar start.jar ?

or you think I must run it with another way ?



PS: Do you always need to search all data? A trick may be to partition your data such 
that say 80% of searches go to a "fresh" index with 10% of the content, while 
the remaining searches include everything.
Yes I need to search to the whole index, even old document must be 
requested.




--
Jan Høydahl, search solution architect
Cominvent AS - www.facebook.com/Cominvent
Solr Training - www.solrtraining.com

On 22. mai 2012, at 11:06, Bruno Mannina wrote:


My choice: http://www.ovh.com/fr/serveurs_dedies/eg_best_of.xml

24 Go DDR3

Le 22/05/2012 10:26, findbestopensource a écrit :

Dedicated Server may not be required. If you want to cut down cost, then
prefer shared server.

How much the RAM?

Regards
Aditya
www.findbestopensource.com


On Tue, May 22, 2012 at 12:36 PM, Bruno Mannina   wrote:


Dear Solr users,

My company would like to use solr to index around 80 000 000 documents
(xml files with around 5~10ko size each).
My program (robot) will connect to this solr with boolean requests.

Number of users: around 1000
Number of requests by user and by day: 300
Number of users by day: 30

I would like to subscribe to a host provider with this configuration:
- Dedicated Server
- Ubuntu
- Intel Xeon i7 2x 266+ GHz 12 Go 2 * 1500Go
- Unlimited bandwidth
- IP fixe

Do you think this configuration is enough?

Thanks for your info,
Sincerely
Bruno








Re: Newbie with Carrot2?

2012-05-22 Thread Stanislaw Osinski
Hi Bruno,

Just to confirm -- are you seeing the clusters array in the result at all
()? To get reasonable clusters, you should request at
least 30-50 documents (rows), but even with smaller values, you should see
an empty clusters array.

Staszek

On Sun, May 20, 2012 at 9:20 PM, Bruno Mannina  wrote:

> Le 20/05/2012 11:43, Stanislaw Osinski a écrit :
>
>  Hi Bruno,
>>
>> Here's the wiki documentation for Solr's clustering component:
>>
>> http://wiki.apache.org/solr/**ClusteringComponent
>>
>> For configuration examples, take a look at the Configuration section:
>> http://wiki.apache.org/solr/**ClusteringComponent#**Configuration
>> .
>>
>> If you hit any problems, let me know.
>>
>> Staszek
>>
>> On Sun, May 20, 2012 at 11:38 AM, Bruno Mannina  wrote:
>>
>>  Dear all,
>>>
>>> I use Solr 3.6.0 and I indexed some documents (around 12000).
>>> Each documents contains a Abstract-en field (and some other fields).
>>>
>>> Is it possible to use Carrot2 to create cluster (classes) with the
>>> Abstract-en field?
>>>
>>> What must I configure in the schema.xml ? or in other files?
>>>
>>> Sorry for my newbie question, but I found only documentation for
>>> Workbench
>>> tool.
>>>
>>> Bruno
>>>
>>>  Thx for this link but I have a problem to configure my solrconfig.xml
> in the section:
> (note I run java -Dsolr.clustering.enabled=**true)
>
> I have a field named abstract-en, and I would like to use only this field.
>
> I would like to know if my requestHandler is good?
> I have a doubt with the content of  : carrot.title, carrot.url
>
> and also the latest field
> abstract-en
> edismax
> 
>  abstract-en^1.0
> 
> *:*
> 10
> *,score
>
> because the result when I do a request is exactly like a search request
> (without more information)
>
>
> My entire requestHandler is:
>
>  enable="${solr.clustering.**enabled:false}" class="solr.SearchHandler">
> 
> true
> **default
> **true
> 
> name
> id
> 
> **abstract-en
> 
> **true
> 
> 
> 
> false
> abstract-en
> edismax
> 
>  abstract-en^1.0
> 
> *:*
> 10
> *,score
> 
> 
> clustering
> 
> 
>
>


Re: Question about sampling

2012-05-22 Thread rita
Hi Lance, 
Could you provide more details about implementing this using
SignatureUpdateProcessor? 
Example can be helpful. 

-
Rita
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Question-about-sampling-tp3984103p3985379.html
Sent from the Solr - User mailing list archive at Nabble.com.


Multicore solr

2012-05-22 Thread Shanu Jha
Hi all,

greetings from my end. This is my first post on this mailing list. I have
few questions on multicore solr. For background we want to create a core
for each user logged in to our application. In that case it may be 50, 100,
1000, N-numbers. Each core will be used to write and search index in real
time.

1. Is this a good idea to go with?
2. What are the pros and cons of this approch?

Awaiting for your response.

Regards
AJ


Re: Newbie with Carrot2?

2012-05-22 Thread Bruno Mannina

Arfff

Clusters are at the end of my XML answer 









..
..



ok all work fine now !


Le 22/05/2012 15:33, Stanislaw Osinski a écrit :

Hi Bruno,

Just to confirm -- are you seeing the clusters array in the result at all
()? To get reasonable clusters, you should request at
least 30-50 documents (rows), but even with smaller values, you should see
an empty clusters array.

Staszek

On Sun, May 20, 2012 at 9:20 PM, Bruno Mannina  wrote:


Le 20/05/2012 11:43, Stanislaw Osinski a écrit :

  Hi Bruno,

Here's the wiki documentation for Solr's clustering component:

http://wiki.apache.org/solr/**ClusteringComponent

For configuration examples, take a look at the Configuration section:
http://wiki.apache.org/solr/**ClusteringComponent#**Configuration
.

If you hit any problems, let me know.

Staszek

On Sun, May 20, 2012 at 11:38 AM, Bruno Mannina   wrote:

  Dear all,

I use Solr 3.6.0 and I indexed some documents (around 12000).
Each documents contains a Abstract-en field (and some other fields).

Is it possible to use Carrot2 to create cluster (classes) with the
Abstract-en field?

What must I configure in the schema.xml ? or in other files?

Sorry for my newbie question, but I found only documentation for
Workbench
tool.

Bruno

  Thx for this link but I have a problem to configure my solrconfig.xml

in the section:
(note I run java -Dsolr.clustering.enabled=**true)

I have a field named abstract-en, and I would like to use only this field.

I would like to know if my requestHandler is good?
I have a doubt with the content of  : carrot.title, carrot.url

and also the latest field
abstract-en
edismax

  abstract-en^1.0

*:*
10
*,score

because the result when I do a request is exactly like a search request
(without more information)


My entire requestHandler is:



true
**default
**true

name
id

**abstract-en

**true



false
abstract-en
edismax

  abstract-en^1.0

*:*
10
*,score


clustering








Uncatchable Exception on solrj3.6.0

2012-05-22 Thread Jamel ESSOUSSI
Hi,

I use solr-solrj 3.6.0 and solr-core 3.6.0:

I have reimplemented the handleError of the ConcurrentUpdateSolrServer
class:


final ConcurrentUpdateSolrServer newSolrServer = new
ConcurrentUpdateSolrServer(url, client, 100, 10){
@Override
public void handleError(Throwable ex) {
// TODO Auto-generated method stub
super.handleError(ex);
}
};

My problem is when an exception is thrown in the solr server side I cannot
catch it in the client side.

Thanks

-- Jamel E

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Uncatchable-Exception-on-solrj3-6-0-tp3985437.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-22 Thread Darren Govoni

I'm curious what the solrcloud experts say, but my suggestion is to try not to 
over-engineering the search architecture  on solrcloud. For example, what is 
the benefit of managing the what cores are indexed and searched? Having to know 
those details, in my mind, works against the automation in solrcore, but maybe 
there's a good reason you want to do it this way.

--- Original Message ---
On 5/22/2012  07:35 AM Yandong Yao wrote:Hi Darren,

Thanks very much for your reply.

The reason I want to control core indexing/searching is that I want to
use one core to store one customer's data (all customer share same
config):  such as customer 1 use coreForCustomer1 and customer 2
use coreForCustomer2.

Is there any better way than using different core for different customer?

Another way maybe use different collection for different customer, while
not sure how many collections solr cloud could support. Which way is better
in terms of flexibility/scalability? (suppose there are tens of thousands
customers).

Regards,
Yandong

2012/5/22 Darren Govoni 

> Why do you want to control what gets indexed into a core and then
> knowing what core to search? That's the kind of "knowing" that SolrCloud
> solves. In SolrCloud, it handles the distribution of documents across
> shards and retrieves them regardless of which node is searched from.
> That is the point of "cloud", you don't know the details of where
> exactly documents are being managed (i.e. they are cloudy). It can
> change and re-balance from time to time. SolrCloud performs the
> distributed search for you, therefore when you try to search a node/core
> with no documents, all the results from the "cloud" are retrieved
> regardless. This is considered "A Good Thing".
>
> It requires a change in thinking about indexing and searching
>
> On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
> > Hi Guys,
> >
> > I use following command to start solr cloud according to solr cloud 
wiki.
> >
> > yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
> > -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
> > yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983
> -jar
> > start.jar
> >
> > Then I have created several cores using CoreAdmin API (
> > http://localhost:8983/solr/admin/cores?action=CREATE&name=
> > &collection=collection1), and clusterstate.json show following
> > topology:
> >
> >
> > collection1:
> > -- shard1:
> >   -- collection1
> >   -- CoreForCustomer1
> >   -- CoreForCustomer3
> >   -- CoreForCustomer5
> > -- shard2:
> >   -- collection1
> >   -- CoreForCustomer2
> >   -- CoreForCustomer4
> >
> >
> > 1) Index:
> >
> > Using following command to index mem.xml file in exampledocs directory.
> >
> > yydzero:exampledocs bjcoe$ java -Durl=
> > http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
> > SimplePostTool: version 1.4
> > SimplePostTool: POSTing files to
> > http://localhost:8983/solr/coreForCustomer3/update..
> > SimplePostTool: POSTing file mem.xml
> > SimplePostTool: COMMITting Solr index changes.
> >
> > And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
> > 'coreForCustomer5' has 3 documents (mem.xml has 3 documents) and other 2
> > core has 0 documents.
> >
> > *Question 1:*  Is this expected behavior? How do I to index documents
> into
> > a specific core?
> >
> > *Question 2*:  If SolrCloud don't support this yet, how could I extend 
it
> > to support this feature (index document to particular core), where
> should i
> > start, the hashing algorithm?
> >
> > *Question 3*:  Why the documents are also indexed into 
'coreForCustomer1'
> > and 'coreForCustomer5'?  The default replica for documents are 1, right?
> >
> > Then I try to index some document to 'coreForCustomer2':
> >
> > $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar
> > post.jar ipod_video.xml
> >
> > While 'coreForCustomer2' still have 0 documents and documents in
> ipod_video
> > are indexed to core for customer 1/3/5.
> >
> > *Question 4*:  Why this happens?
> >
> > 2) Search: I use "
> > http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml"; to
> > search against 'CoreForCustomer2', while it will return all documents in
> > the whole collection even though this core has no documents at all.
> >
> > Then I use "
> >
> 
http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2
> ",
> > and it will return 0 documents.
> >
> > *Question 5*: So If want to search against a particular core, we need to
> > use 'shards' parameter and use solrCore name as parameter value, right?
> >
> >
> > Thanks very much in advance!
> >
> > Regards,
> > Yandong
>
>
>



Re: Installing Solr on Tomcat using Shell - Code wrong?

2012-05-22 Thread Li Li
you should find some clues from tomcat log
在 2012-5-22 晚上7:49,"Spadez" 写道:

> Hi,
>
> This is the install process I used in my shell script to try and get Tomcat
> running with Solr (debian server):
>
>
>
> I swear this used to work, but currently only Tomcat works. The Solr page
> just comes up with "The requested resource (/solr/admin) is not available."
>
> Can anyone give me some insight into why this isnt working? Its driving me
> nuts.
>
> James
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Installing-Solr-on-Tomcat-using-Shell-Code-wrong-tp3985393.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-22 Thread Mark Miller
I think the key is this: you want to think of a SolrCore on a single node Solr 
installation as a collection on a multi node SolrCloud installation.

So if you would use multiple SolrCore's with a std Solr setup, you should be 
using multiple collections in SolrCloud. If you were going to try to do 
everything in one SolrCore, that would be like putting everything in one 
collection in SolrCloud. I don't think it generally makes sense to try and work 
at the SolrCore level when working with SolrCloud. This will be made more clear 
once we add a simple collections api.

So I think your choice should be similar to using a single node - do you want 
to put everything in one 'collection' and use a filter to separate customers 
(with all its caveats and limitations) or do you want to use a collection per 
customer. You can always start up more clusters if you reach any limits.



On May 22, 2012, at 10:08 AM, Darren Govoni wrote:

> I'm curious what the solrcloud experts say, but my suggestion is to try not 
> to over-engineering the search architecture  on solrcloud. For example, what 
> is the benefit of managing the what cores are indexed and searched? Having to 
> know those details, in my mind, works against the automation in solrcore, but 
> maybe there's a good reason you want to do it this way.
> 
> --- Original Message ---
> On 5/22/2012  07:35 AM Yandong Yao wrote:Hi Darren,
> 
> Thanks very much for your reply.
> 
> The reason I want to control core indexing/searching is that I want to
> use one core to store one customer's data (all customer share same
> config):  such as customer 1 use coreForCustomer1 and customer 2
> use coreForCustomer2.
> 
> Is there any better way than using different core for different customer?
> 
> Another way maybe use different collection for different customer, while
> not sure how many collections solr cloud could support. Which way is 
> better
> in terms of flexibility/scalability? (suppose there are tens of thousands
> customers).
> 
> Regards,
> Yandong
> 
> 2012/5/22 Darren Govoni 
> 
> > Why do you want to control what gets indexed into a core and then
> > knowing what core to search? That's the kind of "knowing" that SolrCloud
> > solves. In SolrCloud, it handles the distribution of documents across
> > shards and retrieves them regardless of which node is searched from.
> > That is the point of "cloud", you don't know the details of where
> > exactly documents are being managed (i.e. they are cloudy). It can
> > change and re-balance from time to time. SolrCloud performs the
> > distributed search for you, therefore when you try to search a node/core
> > with no documents, all the results from the "cloud" are retrieved
> > regardless. This is considered "A Good Thing".
> >
> > It requires a change in thinking about indexing and searching
> >
> > On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
> > > Hi Guys,
> > >
> > > I use following command to start solr cloud according to solr cloud 
> wiki.
> > >
> > > yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
> > > -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
> > > yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983
> > -jar
> > > start.jar
> > >
> > > Then I have created several cores using CoreAdmin API (
> > > http://localhost:8983/solr/admin/cores?action=CREATE&name=
> > > &collection=collection1), and clusterstate.json show 
> following
> > > topology:
> > >
> > >
> > > collection1:
> > > -- shard1:
> > >   -- collection1
> > >   -- CoreForCustomer1
> > >   -- CoreForCustomer3
> > >   -- CoreForCustomer5
> > > -- shard2:
> > >   -- collection1
> > >   -- CoreForCustomer2
> > >   -- CoreForCustomer4
> > >
> > >
> > > 1) Index:
> > >
> > > Using following command to index mem.xml file in exampledocs 
> directory.
> > >
> > > yydzero:exampledocs bjcoe$ java -Durl=
> > > http://localhost:8983/solr/coreForCustomer3/update -jar post.jar 
> mem.xml
> > > SimplePostTool: version 1.4
> > > SimplePostTool: POSTing files to
> > > http://localhost:8983/solr/coreForCustomer3/update..
> > > SimplePostTool: POSTing file mem.xml
> > > SimplePostTool: COMMITting Solr index changes.
> > >
> > > And now SolrAdmin UI shows that 'coreForCustomer1', 
> 'coreForCustomer3',
> > > 'coreForCustomer5' has 3 documents (mem.xml has 3 documents) and 
> other 2
> > > core has 0 documents.
> > >
> > > *Question 1:*  Is this expected behavior? How do I to index documents
> > into
> > > a specific core?
> > >
> > > *Question 2*:  If SolrCloud don't support this yet, how could I 
> extend it
> > > to support this feature (index document to particular core), where
> > should i
> > > start, the hashing algorithm?
> > >
> > > *Question 3*:  Why the documents are also indexed into 
> 'coreForCustomer1'
> > > and 'coreForCustomer5'?  The default replica for documents are 1, 
> right?
> > >
> > > Then I try to index som

Re: Multicore solr

2012-05-22 Thread Sohail Aboobaker
It would help if you provide your use case. What are you indexing for each
user and why would you need a separate core for indexing each user? How do
you decide schema for each user? It might be better to describe your use
case and desired results. People on the list will be able to advice on the
best approach.

Sohail


Re: solr tokenizer not splitting unbreakable expressions

2012-05-22 Thread Tanguy Moal
Hello Elisabeth,

Wouldn't it be more simple to have a custom component inside of the
front-end to your search server that would transform a query like <> into <<"hotel de ville" paris>> (I.e. turning each
occurence of the sequence "hotel de ville" into a phrase query ) ?

Concerning protections inside of the tokenizer, I think that is not
possible actually.
The main reason for this could be that the QueryParser will break the query
on each space before passing each query-part through the analysis of every
searched field. Hence all the smart things you would put at indexing time
to wrap a sequence of tokens into a single one is not reproducible at query
time.

Please someone correct me if I'm wrong!

Alternatively, I think you might do so with a custom query parser (in order
to have phrases sent to the analyzers instead of words). But since
tokenizers don't have support for protected words list, you would need an
additional custom token filter that would consume the tokens stream and
annotate those matching an entry in the protection list.
Unfortunately, if your protected list is long, you will have performance
issues. Unless you rely on a dedicated data structure, like Trie-based
structures (Patricia-trie, ...) You can find solid implementations on the
Internet (see https://github.com/rkapsi/patricia-trie).

Then you could make your filter consume a "sliding window" of tokens while
the window matches in your trie.
Once you have a complete match in your trie, the filter can set an
attribute of the type your choice (e.g. MyCustomKeywordAttribute) on the
first matching token, and make the attribute be the complete match (e.g.
"Hotel de ville").
If you don't have a complete match, drop the unmatched tokens leaving them
unmodified.

I Hope this helps...

--
Tanguy


2012/5/22 elisabeth benoit 

> Hello,
>
> Does someone know if there is a way to configure a tokenizer to split on
> white spaces, all words excluding a bunch of expressions listed in a file?
>
> For instance, if a want "hotel de ville" not to be split in words, a
> request like "hotel de ville paris" would be split into two tokens:
>
> "hotel de ville" and "paris" instead of 4 tokens
>
> "hotel"
> "de"
> "ville"
> "paris"
>
> I imagine something like
>
>  protected="protoexpressions.txt"/>
>
> Thanks a lot,
> Elisabeth
>


WFST with autosuggest/geo

2012-05-22 Thread William Bell
Does anyone have the slides or sample code from:

Building Query Auto-Completion Systems with Lucene 4.0
Presented by Sudarshan Gaikaiwari, Software Engineer,Yelp

We want to implement WFST with GEO boosting.


-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Binary updates handler does not propagate failures?

2012-05-22 Thread Jozef Vilcek
Hi all,

I am facing following issue ...
I have an application which is feeding Solr 3.6 index with document
updates via Solrj 3.6. I use a binary request writer, because of the
issue with XML when sending insert and deletes at once (
https://issues.apache.org/jira/browse/SOLR-1752 )

Now, I have noticed that if I sent a malformed document to the index,
I see in logs it got refused by the index, but on the Solrj side,
returned UpdateResponse does not indicate any kind of failure (no
exception thrown, response status code == 0 ). When I switch to XML
requests, I receive exception when sending malformed document.

By looking at Solr's  BinaryUpdateRequestHandler.java
http://svn.apache.org/viewvc/lucene/dev/tags/lucene_solr_3_6_0/solr/core/src/java/org/apache/solr/handler/BinaryUpdateRequestHandler.java?view=markup
at lines 98 - 105, exceptions are not propagated, therefore
RequestHandlerBase can not set it into the response ...

Is this intended behavior?
What am I doing wrong?
Any suggestions?

Many thanks in advance.

Best,
Jozef


How to handle filter query against empty fields

2012-05-22 Thread Jozef Vilcek
Hi all,

I have a field(s) in a schema which I need to be able to specify in a
filter query. The field is not mandatory, therefore it can be empty. I
need to be able to run a query with a filer : " return only docs which
does not have value for the field " ...

What would be the optimal recommended way of doing this with Solr?

Thanks!

Best,
Jozef


Re: How to handle filter query against empty fields

2012-05-22 Thread Ahmet Arslan
> I have a field(s) in a schema which I need to be able to
> specify in a
> filter query. The field is not mandatory, therefore it can
> be empty. I
> need to be able to run a query with a filer : " return only
> docs which
> does not have value for the field " ...
> 
> What would be the optimal recommended way of doing this with
> Solr?

There are two approaches for this. Please read my earlier post :

http://search-lucene.com/m/72Q4YURpgY





Faceted on Similarity ?

2012-05-22 Thread Robby
Hi All,

I'm quite a new user both to Lucene / Solr. I want to ask if faceted search
can be used to do a grouping for multiple field's value based on similarity
? I have look at the faceted index so far, but from my understanding they
only works on exact single and definite range values.

For example, if I have name, address, id number and nationality. And with
rows that had a degree of similarity distance between these fields we will
group them together.

Sample results will be like this :

Group1
Name : Angel, Address: Jakarta, ID : 123, Nationality: Indonesian
Name : Angeline, Address: Jayakarta, ID : 123, Nationality: Indonesian

Group2
Name : Frank, Address: Jl. Tubagus Angke Jakarta, ID : 333,
Nationality: Indonesian
Name : Frans, Address: Jl. T. Angke Jakarta, ID : 332, Nationality:
Indonesian


Hope I make myself clear and asking in proper way. Very sorry if my English
is not good enough...

Thanks,

Robby


Re: System requirements in my case?

2012-05-22 Thread Stanislaw Osinski
>
> 3) Measure the size of the index folder, multiply with 8 to get a clue of
>> total index size
>>
> With 12 000 docs my index folder size is: 33Mo
> ps: I use "solr.clustering.enabled=true"


Clustering is performed at search time, it doesn't affect the size of the
index (but obviously it does affect the search response times).

Staszek


Re: Solr mail dataimporter cannot be found

2012-05-22 Thread Stefan Matheis
Hey Emma,

thanks for reporting this, i opened SOLR-3478 and will commit this soon

Stefan 


On Monday, May 21, 2012 at 10:47 PM, Emma Bo Liu wrote:

> Hi,
> 
> I want to index emails using solr. I put the user name, password, hostname
> in data-config.xml under mail folder. This is a valid email but when I run
> in url http://localhost:8983/solr/mail/dataimport?command=full-import It
> said cannot access mail/dataimporter reason: no found. But when i run
> http://localhost:8983/solr/rss/dataimport?command=full-import
> or 
> http://localhost:8983/solr/db/dataimport?command=full-impor
> They can be found.
> 
> In addition, when I run the command java
> -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar , on the left side of
> solr UI, there are db, rss, tika and solr but no mail. Is it a bug that
> mail indexing? Thank you so much!
> 
> Best,
> 
> Emma 




clickable links as results?

2012-05-22 Thread 12rad
Hi, 

I want to display - a clickable link to the document along if a search
matches along with the no of times the search query matched. 
What should i be looking at? 
I am fairly new to Solr and don't know how I can achieve this. 

Thanks for the help!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/clickable-links-as-results-tp3985505.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Not able to use the highlighting feature! Want to return snippets of text

2012-05-22 Thread 12rad
That worked! 
Thanks!
I did  
 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Not-able-to-use-the-highlighting-feature-Want-to-return-snippets-of-text-Urgent-tp3985012p3985507.html
Sent from the Solr - User mailing list archive at Nabble.com.


index-time boosting using DIH

2012-05-22 Thread geeky2
hello all,

can i use the technique described on the wiki at:

http://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts

if i am populating my core using a DIH?

looking at the posts on this subject and the wiki docs - leads me to believe
that you can only use this when you are using the xml interface for
importing data?

thank you

--
View this message in context: 
http://lucene.472066.n3.nabble.com/index-time-boosting-using-DIH-tp3985508.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index-time boosting using DIH

2012-05-22 Thread Dyer, James
See http://wiki.apache.org/solr/DataImportHandler#Special_Commands and the 
$docBoost pseudo-field name.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: geeky2 [mailto:gee...@hotmail.com] 
Sent: Tuesday, May 22, 2012 2:12 PM
To: solr-user@lucene.apache.org
Subject: index-time boosting using DIH

hello all,

can i use the technique described on the wiki at:

http://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts

if i am populating my core using a DIH?

looking at the posts on this subject and the wiki docs - leads me to believe
that you can only use this when you are using the xml interface for
importing data?

thank you

--
View this message in context: 
http://lucene.472066.n3.nabble.com/index-time-boosting-using-DIH-tp3985508.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Highlight feature

2012-05-22 Thread Chris Hostetter

: That is the default response format. If you would like to change that, 
: you could extend the search handler or post process the XML data. 
: Another option would be to use the javabin (if your app is java based) 
: and build xml the way your app would need.

there is actaully a more straight forward way to do stuff like this in 
trunk now, such that it can work with any response writer, using the 
"DocTransformer" API.

there is already an "ExplainAugmenter" that can inline the explain info 
for a document, we just need someone to help write a corrisponding 
HighlightAugmenter , i've opened a Jira if anone wnats to take a crack at 
a patch...

https://issues.apache.org/jira/browse/SOLR-3479

-Hoss


Re: Solr 3.6 fails when using XSLT

2012-05-22 Thread Chris Hostetter

what does your results.xsl look like? or more sepcificly: can you post a 
very small example XSL that has this problem?

you mentioned you are using xsl:include and that doesn't seem to work ... 
is that a seperate problem, or does removing/adding the xsl:including 
fix/cause this problem?

what does your xsl:include look like? where to the various xsl templates 
live in the filesystem realtive to eachother?


: Date: Fri, 11 May 2012 08:24:45 -0700 (PDT)
: From: "pramila_tha...@ontla.ola.org" 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Solr 3.6 fails when using XSLT
: 
: Hi Everyone,
: 
: I have recently upgraded to *solr 3.6 from solr 1.4.*
: My XSL where working fine in solr 1.4.
: 
: but now with Solr 3.6 I keep getting the following Error 
: 
: /getTransformer fails in getContentType java.lang.RuntimeException:
: getTransformer fails in getContentType /
: 
: But instead of results.xsl If I use example.xsl, it is fine.
: 
: I fine my xsl:include does not seem to work with Solr 3.6
: 
: Can someone please let me know what am I doing wrong?
: 
: Thanks,
: 
: --
: View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-3-6-fails-when-using-XSLT-tp3980240.html
: Sent from the Solr - User mailing list archive at Nabble.com.

-Hoss


RE: Solr 3.6 fails when using XSLT

2012-05-22 Thread pramila_tha...@ontla.ola.org
Hi Everyone,

This is what worked in solr 1.4 and did not work in solr 3.6.

Actually solr 3.6 requires all the xsl to be present in conf/xslt directory
All paths leading to xsl should be relative to conf directory.

But before this was not the case.







-->









Thanks,

--Pramila Thakur

From: Chris Hostetter-3 [via Lucene] 
[mailto:ml-node+s472066n3985522...@n3.nabble.com]
Sent: Tuesday, May 22, 2012 3:37 PM
To: Thakur, Pramila
Subject: Re: Solr 3.6 fails when using XSLT


what does your results.xsl look like? or more sepcificly: can you post a
very small example XSL that has this problem?

you mentioned you are using xsl:include and that doesn't seem to work ...
is that a seperate problem, or does removing/adding the xsl:including
fix/cause this problem?

what does your xsl:include look like? where to the various xsl templates
live in the filesystem realtive to eachother?


: Date: Fri, 11 May 2012 08:24:45 -0700 (PDT)
: From: "[hidden email]" 
<[hidden email]>
: Reply-To: [hidden email]
: To: [hidden email]
: Subject: Solr 3.6 fails when using XSLT
:
: Hi Everyone,
:
: I have recently upgraded to *solr 3.6 from solr 1.4.*
: My XSL where working fine in solr 1.4.
:
: but now with Solr 3.6 I keep getting the following Error
:
: /getTransformer fails in getContentType java.lang.RuntimeException:
: getTransformer fails in getContentType /
:
: But instead of results.xsl If I use example.xsl, it is fine.
:
: I fine my xsl:include does not seem to work with Solr 3.6
:
: Can someone please let me know what am I doing wrong?
:
: Thanks,
:
: --
: View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-3-6-fails-when-using-XSLT-tp3980240.html
: Sent from the Solr - User mailing list archive at Nabble.com.

-Hoss


If you reply to this email, your message will be added to the discussion below:
http://lucene.472066.n3.nabble.com/Solr-3-6-fails-when-using-XSLT-tp3980240p3985522.html
To unsubscribe from Solr 3.6 fails when using XSLT, click 
here.
NAML


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-3-6-fails-when-using-XSLT-tp3980240p3985524.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Jetty rerturning HTTP error code 413

2012-05-22 Thread Sai
Hi Alexandre,

Can you please let me know how did you fix this issue. I am also getting this
error when I pass very large query to Solr.

An reply is highly appreciated.

Thanks,
Sai



RE: index-time boosting using DIH

2012-05-22 Thread geeky2
thanks for the reply,

so to use the $docBoost pseudo-field name, would you do something like below
- and would this technique likely increase my total index time?




  

 
  

 
 
 ... 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/index-time-boosting-using-DIH-tp3985508p3985527.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index-time boosting using DIH

2012-05-22 Thread Dyer, James
You need to add the $docBoost pseudo-field to the document somehow.  A 
transformer is one way to do it.  You could just add it to a SELECT statement, 
which is especially convienent if the boost value somehow is derrived from the 
data:

SELECT case when SELL_MORE_FLAG='Y' then 999 ELSE null END as '$docBoost', 
...other fields... from some_table, etc

Either way I wouldn't expect it to make the indexing be noticably slower.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: geeky2 [mailto:gee...@hotmail.com] 
Sent: Tuesday, May 22, 2012 3:06 PM
To: solr-user@lucene.apache.org
Subject: RE: index-time boosting using DIH

thanks for the reply,

so to use the $docBoost pseudo-field name, would you do something like below
- and would this technique likely increase my total index time?




  

 
  

 
 
 ... 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/index-time-boosting-using-DIH-tp3985508p3985527.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: index-time boosting using DIH

2012-05-22 Thread geeky2
thank you james for the feedback - i appreciate it.

ultimately - i was trying to decide if i was missing the boat by ONLY using
query time boosting, and i should really be using index time boosting.

but after your reply, reading the solr book, and looking at the lucene dox -
it looks like index-time boosting is not what i need.  i can probably do
better by using query-time boosting and the proper sort params.

thanks again

--
View this message in context: 
http://lucene.472066.n3.nabble.com/index-time-boosting-using-DIH-tp3985508p3985539.html
Sent from the Solr - User mailing list archive at Nabble.com.


always getting distinct count of -1 in luke response (solr4 snapshot)

2012-05-22 Thread Mike Hugo
We're testing a snapshot of Solr4 and I'm looking at some of the responses
from the Luke request handler.  Everything looks good so far, with the
exception of the "distinct" attribute which (in Solr3) shows me the
distinct number of terms for a given field.

Given the request below, I'm consistently getting a response back with a
value in the "distinct" field of -1.  Is there something different I need
to do to get back the actual distinct count?

Thanks!

Mike

http://localhost:8080/solr/core1/admin/luke?wt=json&fl=label&numTerms=1

"fields": {
"label": {
"type": "text_general",
"schema": "IT-M--",
"index": "(unstored field)",
"docs": 63887,
*"distinct": -1,*
"topTerms": [


Indexing Polygons

2012-05-22 Thread Young, Cody
Hi All,

I'm trying to figure out how to index polygons in solr (trunk). I'm using LSP 
right now as the solr integration of the new spatial module hasn't completed. I 
have searching for a point using a polygon working, but I'm also looking for 
searching for a polygon using a point.

I've seen some indication that LSP supports this but I haven't been able to 
find an example.

What field type would I need to use? Would it be multivalued?

Please and thank you!
Cody




Re: Faceted on Similarity ?

2012-05-22 Thread Lee Carroll
Take a look at the clustering component

http://wiki.apache.org/solr/ClusteringComponent

Consider clustering off line and indexing the pre calculated group memberships

I might be wrong but I don't think their is any faceting mileage here.
Depending upon the use case
you might get some use out of the mlt handler

http://wiki.apache.org/solr/MoreLikeThis



On 22 May 2012 18:00, Robby  wrote:
> Hi All,
>
> I'm quite a new user both to Lucene / Solr. I want to ask if faceted search
> can be used to do a grouping for multiple field's value based on similarity
> ? I have look at the faceted index so far, but from my understanding they
> only works on exact single and definite range values.
>
> For example, if I have name, address, id number and nationality. And with
> rows that had a degree of similarity distance between these fields we will
> group them together.
>
> Sample results will be like this :
> 
> Group1
>    Name : Angel, Address: Jakarta, ID : 123, Nationality: Indonesian
>    Name : Angeline, Address: Jayakarta, ID : 123, Nationality: Indonesian
>
> Group2
>    Name : Frank, Address: Jl. Tubagus Angke Jakarta, ID : 333,
> Nationality: Indonesian
>    Name : Frans, Address: Jl. T. Angke Jakarta, ID : 332, Nationality:
> Indonesian
> 
>
> Hope I make myself clear and asking in proper way. Very sorry if my English
> is not good enough...
>
> Thanks,
>
> Robby


RE: Solr 3.6 fails when using XSLT

2012-05-22 Thread Chris Hostetter

: This is what worked in solr 1.4 and did not work in solr 3.6.
: 
: Actually solr 3.6 requires all the xsl to be present in conf/xslt directory
: All paths leading to xsl should be relative to conf directory.
: 
: But before this was not the case.

Right ... this was actually a bug (in how all relative paths in xml 
includes, or xsl includes, were resolved) that was fixed in Solr 3.1, as 
noted in CHANGES.txt...

* SOLR-1656: XIncludes and other HREFs in XML files loaded by ResourceLoader
  are fixed to be resolved using the URI standard (RFC 2396). The system
  identifier is no longer a plain filename with path, it gets initialized
  using a custom URI scheme "solrres:". This scheme is resolved using a
  EntityResolver that utilizes ResourceLoader
  (org.apache.solr.common.util.SystemIdResolver). This makes all relative
  pathes in Solr's config files behave like expected. This change
  introduces some backwards breaks in the API: Some config classes
  (Config, SolrConfig, IndexSchema) were changed to take
  org.xml.sax.InputSource instead of InputStream. There may also be some
  backwards breaks in existing config files, it is recommended to check
  your config files / XSLTs and replace all XIncludes/HREFs that were
  hacked to use absolute paths to use relative ones. (uschindler)




-Hoss


Re: Multicore solr

2012-05-22 Thread Amit Jha
Hi,

Thanks for your advice.
It is basically a meta search application. Users can perform a search on N 
number of data sources at a time. We broadcast Parallel search to each  
selected data sources and write data to solr using custom build API(API and 
solr are deployed on separate machine API job is to perform parallel search, 
write data to solr ). API respond to application that some results are 
available then application fires  a search query to display the results(query 
would be q=unique_search_id). And other side API keep writing data to solr and 
user can fire a search to solr to view all results. 

In current scenario we are using single solr server & we performing real time 
index and search. Performing these operations on single solr making process 
slow as index size increases. 

So we are planning to use multi core solr and each user will have its core. All 
core will have the same schema.

Please suggest if this approach has any issues.

Rgds
AJ

On 22-May-2012, at 20:14, Sohail Aboobaker  wrote:

> It would help if you provide your use case. What are you indexing for each
> user and why would you need a separate core for indexing each user? How do
> you decide schema for each user? It might be better to describe your use
> case and desired results. People on the list will be able to advice on the
> best approach.
> 
> Sohail


apache query

2012-05-22 Thread ketan kore
   hellooi have configured solr on tomcat 7 in windows so when i
manually start tomcat server and when i hit the solr it searches very well

 in my browser .


and when i write a java class with main method as follows the results are
fetched and shown on console.

public class Code{
public static void main(String[] args) throws MalformedURLException,
SolrServerException {
SolrServer solr = new CommonsHttpSolrServer("http://192.168.16.221:8080/solr
");
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "subhash");

System.out.println("response = " + params);
QueryResponse response = solr.query(params);
System.out.println("response = " + response);
}
}

But when i use the same code in method and call this method using its
object this error is shown

java.lang.NoClassDefFoundError:
org/apache/solr/client/solrj/SolrServerException

Eagerly waiting for your reply.


Re: Multi-words synonyms matching

2012-05-22 Thread elisabeth benoit
Hello Bernd,

Thanks for your advice.

I have one question: how did you manage to map one word to a multiwords
synonym???

I've tried (in synonyms.txt)

mairie, hotel de ville

mairie, hotel\ de\ ville

mairie => mairie, hotel de ville

mairie => mairie, hotel\ de\ ville

but nothing prevents mairie from matching with "hotel"...

The only way I found is to use
tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms declaration
in schema.xml, but then since "mairie" is not alone in my index field, it
doesn't match.


best regards,
Elisabeth




the only way I found, I schema.xml, is to use



2012/5/15 Bernd Fehling 

> Without reading the whole thread let me say that you should not trust
> the solr admin analysis. It takes the whole multiword search and runs
> it all together at once through each analyzer step (factory).
> But this is not how the real system works. First pitfall, the query parser
> is also splitting at white space (if not a phrase query). Due to this,
> a multiword query is send chunk after chunk through the analyzer and,
> second pitfall, each chunk runs through the whole analyzer by its own.
>
> So if you are dealing with multiword synonyms you have the following
> problems. Either you turn your query into a phrase so that the whole
> phrase is analyzed at once and therefore looked up as multiword synonym
> but phrase queries are not analyzed !!! OR you send your query chunk
> by chunk through the analyzer but then they are not multiwords anymore
> and are not found in your synonyms.txt.
>
> From my experience I can say that it requires some deep work to get it done
> but it is possible. I have connected a thesaurus to solr which is doing
> query time expansion (no need to reindex if the thesaurus changes).
> The thesaurus holds synonyms and "used for terms" in 24 languages. So
> it is also some kind of language translation. And naturally the thesaurus
> translates from single term to multi term synonyms and vice versa.
>
> Regards,
> Bernd
>
>
> Am 14.05.2012 13:54, schrieb elisabeth benoit:
> > Just for the record, I'd like to conclude this thread
> >
> > First, you were right, there was no behaviour difference between fq and q
> > parameters.
> >
> > I realized that:
> >
> > 1) my synonym (hotel de ville) has a stopword in it (de) and since I used
> > tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms
> declaration,
> > there was no stopword removal in the indewed expression, so when
> requesting
> > "hotel de ville", after stopwords removal in query, Solr was comparing
> > "hotel de ville"
> > with "hotel ville"
> >
> > but my queries never even got to that point since
> >
> > 2) I made a mistake using "mairie" alone in the admin interface when
> > testing my schema. The real field was something like "collectivités
> > territoriales mairie",
> > so the synonym "hotel de ville" was not even applied, because of the
> > tokenizerFactory="solr.KeywordTokenizerFactory" in my synonym definition
> > not splitting field into words when parsing
> >
> > So my problem is not solved, and I'm considering solving it outside of
> Solr
> > scope, unless someone else has a clue
> >
> > Thanks again,
> > Elisabeth
> >
> >
> >
> > 2012/4/25 Erick Erickson 
> >
> >> A little farther down the debug info output you'll find something
> >> like this (I specified fq=name:features)
> >>
> >> 
> >> name:features
> >> 
> >>
> >>
> >> so it may well give you some clue. But unless I'm reading things wrong,
> >> your
> >> q is going against a field that has much more information than the
> >> CATEGORY_ANALYZED field, is it possible that the data from your
> >> test cases simply isn't _in_ CATEGORY_ANALYZED?
> >>
> >> Best
> >> Erick
> >>
> >> On Wed, Apr 25, 2012 at 9:39 AM, elisabeth benoit
> >>  wrote:
> >>> I'm not at the office until next Wednesday, and I don't have my Solr
> >> under
> >>> hand, but isn't debugQuery=on giving informations only about q
> parameter
> >>> matching and nothing about fq parameter? Or do you mean
> >>> "parsed_filter_querie"s gives information about fq?
> >>>
> >>> CATEGORY_ANALYZED is being populated by a copyField instruction in
> >>> schema.xml, and has the same field type as my catchall field, the
> search
> >>> field for my searchHandler (the one being used by q parameter).
> >>>
> >>> CATEGORY (a string) is copied in CATEGORY_ANALYZED (field type is text)
> >>>
> >>> CATEGORY (a string) is copied in catchall field (field type is text),
> >> and a
> >>> lot of other fields are copied too in that catchall field.
> >>>
> >>> So as far as I can see, the same analysis should be done in both cases,
> >> but
> >>> obviously I'm missing something, and the only thing I can think of is a
> >>> different behavior between q and fq parameter.
> >>>
> >>> I'll check that parsed_filter_querie first thing in the morning next
> >>> Wednesday.
> >>>
> >>> Thanks a lot for your help.
> >>>
> >>> Elisabeth
> >>>
> >>>
> >>> 2012/4/24 Erick Erickson 
> >>>
>  Elisabeth:
> >

Re: solr tokenizer not splitting unbreakable expressions

2012-05-22 Thread elisabeth benoit
Hello Tanguy,

I guess you're right, maybe this shouldn't be done in Solr but inside of
the front-end.

Thanks a lot for your answer.

Elisabeth

2012/5/22 Tanguy Moal 

> Hello Elisabeth,
>
> Wouldn't it be more simple to have a custom component inside of the
> front-end to your search server that would transform a query like < de ville paris>> into <<"hotel de ville" paris>> (I.e. turning each
> occurence of the sequence "hotel de ville" into a phrase query ) ?
>
> Concerning protections inside of the tokenizer, I think that is not
> possible actually.
> The main reason for this could be that the QueryParser will break the query
> on each space before passing each query-part through the analysis of every
> searched field. Hence all the smart things you would put at indexing time
> to wrap a sequence of tokens into a single one is not reproducible at query
> time.
>
> Please someone correct me if I'm wrong!
>
> Alternatively, I think you might do so with a custom query parser (in order
> to have phrases sent to the analyzers instead of words). But since
> tokenizers don't have support for protected words list, you would need an
> additional custom token filter that would consume the tokens stream and
> annotate those matching an entry in the protection list.
> Unfortunately, if your protected list is long, you will have performance
> issues. Unless you rely on a dedicated data structure, like Trie-based
> structures (Patricia-trie, ...) You can find solid implementations on the
> Internet (see https://github.com/rkapsi/patricia-trie).
>
> Then you could make your filter consume a "sliding window" of tokens while
> the window matches in your trie.
> Once you have a complete match in your trie, the filter can set an
> attribute of the type your choice (e.g. MyCustomKeywordAttribute) on the
> first matching token, and make the attribute be the complete match (e.g.
> "Hotel de ville").
> If you don't have a complete match, drop the unmatched tokens leaving them
> unmodified.
>
> I Hope this helps...
>
> --
> Tanguy
>
>
> 2012/5/22 elisabeth benoit 
>
> > Hello,
> >
> > Does someone know if there is a way to configure a tokenizer to split on
> > white spaces, all words excluding a bunch of expressions listed in a
> file?
> >
> > For instance, if a want "hotel de ville" not to be split in words, a
> > request like "hotel de ville paris" would be split into two tokens:
> >
> > "hotel de ville" and "paris" instead of 4 tokens
> >
> > "hotel"
> > "de"
> > "ville"
> > "paris"
> >
> > I imagine something like
> >
> >  > protected="protoexpressions.txt"/>
> >
> > Thanks a lot,
> > Elisabeth
> >
>