Re: Do docValues influence range faceting speed in solr?

2013-08-10 Thread Mikhail Khludnev
Hello,

I don't think so. I looked at the sources - range and query facets are backed
by SolrIndexSearcher.numDocs(Query, DocSet).
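In other words, each requested range becomes one numDocs()-style set intersection over the base doc set, rather than a single per-document pass over docValues. A hedged sketch in plain Java (illustration only; the class and method names are made up, not Solr internals):

```java
// Illustration only: how range faceting reduces to one set-intersection
// count per bucket, analogous to SolrIndexSearcher.numDocs(Query, DocSet).
public class RangeFacetSketch {

    /** Count matched docs in each bucket [start + i*gap, start + (i+1)*gap). */
    static int[] rangeFacet(int[] fieldValues, boolean[] baseDocSet,
                            int start, int gap, int buckets) {
        int[] counts = new int[buckets];
        for (int b = 0; b < buckets; b++) {
            int lo = start + b * gap;
            int hi = lo + gap;
            // one "numDocs(rangeQuery, docSet)" per bucket:
            for (int doc = 0; doc < fieldValues.length; doc++) {
                if (baseDocSet[doc] && fieldValues[doc] >= lo && fieldValues[doc] < hi) {
                    counts[b]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] bar = {5, 12, 17, 25, 8};                      // trie-int field "bar"
        boolean[] matched = {true, true, false, true, true}; // docs matching the query
        // buckets [0,10), [10,20), [20,30)
        System.out.println(java.util.Arrays.toString(rangeFacet(bar, matched, 0, 10, 3)));
        // prints [2, 1, 1]
    }
}
```

This is why docValues help field faceting (a value lookup per matched doc) but not, at that time, range faceting (an intersection per bucket).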


On Fri, Aug 9, 2013 at 2:05 PM, omu_negru  wrote:

> Hello,
> From my understanding, docValues work like a persistent field cache, which
> is great when you have to sort on a field after matching a certain query
> (assuming docValues are enabled for that field).
>
> That being the case, field faceting is indeed improved by having docValues
> enabled on the field you want to facet on. My question is, does the same
> apply to range faceting? After making a query on field "foo" (a text field)
> and doing range faceting on field "bar" (a trie-int field), will the same
> mechanism apply to the faceting, i.e. will Solr hit the field cache (doc
> values) for all the documents matched by my query and collect the "bar"
> field values for the relevant facet results?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Do-docValues-influence-range-faceting-speed-in-solr-tp4083483.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Problem with UniqueKey When Using ContentStreamUpdateRequest

2013-08-10 Thread adammyatt
Using Solr 4.3.1 with SolrJ...

In my schema.xml I have defined:

<field name="attachmentid" ... />

and farther down

<uniqueKey>attachmentid</uniqueKey>

Then, following the ContentStreamUpdateRequest examples online, I am trying
to index a file using:

File file = new File("c:\\solr\\indexme.pdf");

ContentStreamUpdateRequest csur = new
ContentStreamUpdateRequest("/update/extract");
csur.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
csur.addFile(file, "application/octet-stream");
csur.setParam("attachmentid", "fileid_" + i); // where i is a unique number
solrCore.request(csur);
solrCore.commit();
QueryResponse rsp = solrCore.query(new SolrQuery("*:*"));
System.out.println(rsp);


When I run this it connects fine to the Solr server, but I get this Java
exception:

HttpSolrServer$RemoteSolrException: Document is missing mandatory uniqueKey
field: attachmentid

If I look in the Solr server's log file I can see the request made it there,
as this shows up:

webapp=/solr path=/update/extract params={attachmentid=fileid_0&wt=javabin&version=2} {} 0 817
5687 [qtp533789436-16] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: attachmentid

Based on the log statement above it looks like my uniqueKey "attachmentid"
is getting passed. Since it is defined in my schema.xml, I'm at a loss as to
why this doesn't work.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-UniqueKey-When-Using-ContentStreamUpdateRequest-tp4083684.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problem with UniqueKey When Using ContentStreamUpdateRequest

2013-08-10 Thread Jack Krupansky

There isn't any parameter for /update/extract named "attachmentid".

You probably wanted to use the "literal.*" parameter to set a field to a 
literal value, such as:


   literal.attachmentid=fileid_123
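In SolrJ that would be csur.setParam("literal.attachmentid", "fileid_" + i). A hedged, dependency-free sketch of the difference (the helper below is hypothetical, not part of SolrJ):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractParamsSketch {

    /** Build the query-string params for an /update/extract request (hypothetical helper). */
    static String extractParams(String uniqueKeyField, String uniqueKeyValue) {
        Map<String, String> params = new LinkedHashMap<>();
        // The "literal." prefix tells the extracting handler to set a field to a
        // literal value; a bare "attachmentid" param is simply ignored, which is
        // why the uniqueKey ends up missing from the extracted document.
        params.put("literal." + uniqueKeyField, uniqueKeyValue);
        params.put("wt", "javabin");
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractParams("attachmentid", "fileid_0"));
        // prints literal.attachmentid=fileid_0&wt=javabin
    }
}
```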

-- Jack Krupansky

-Original Message- 
From: adammyatt

Sent: Friday, August 09, 2013 11:05 PM
To: solr-user@lucene.apache.org
Subject: Problem with UniqueKey When Using ContentStreamUpdateRequest

Using Solr 4.3.1 with SolrJ...

In my schema.xml I have defined:

<field name="attachmentid" ... />

and farther down

<uniqueKey>attachmentid</uniqueKey>

Then, following the ContentStreamUpdateRequest examples online, I am trying
to index a file using:

File file = new File("c:\\solr\\indexme.pdf");

ContentStreamUpdateRequest csur = new
ContentStreamUpdateRequest("/update/extract");
csur.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
csur.addFile(file, "application/octet-stream");
csur.setParam("attachmentid", "fileid_" + i); // where i is a unique number
solrCore.request(csur);
solrCore.commit();
QueryResponse rsp = solrCore.query(new SolrQuery("*:*"));
System.out.println(rsp);


When I run this it connects fine to the Solr server, but I get this Java
exception:

HttpSolrServer$RemoteSolrException: Document is missing mandatory uniqueKey
field: attachmentid

If I look in the Solr server's log file I can see the request made it there,
as this shows up:

webapp=/solr path=/update/extract params={attachmentid=fileid_0&wt=javabin&version=2} {} 0 817
5687 [qtp533789436-16] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: attachmentid

Based on the log statement above it looks like my uniqueKey "attachmentid"
is getting passed. Since it is defined in my schema.xml, I'm at a loss as to
why this doesn't work.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-with-UniqueKey-When-Using-ContentStreamUpdateRequest-tp4083684.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: Spelling suggestions.

2013-08-10 Thread Kamaljeet Kaur

Actually I don't have much knowledge about these files or about configuring
Solr. Using apache-solr 3.5.0 and requesting the following URL, I got the
result shown in the attachment:

http://localhost:8983/solr/spell?q=delll+ultrashar&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

On editing solrconfig.xml as given in the above link, restarting Solr gave
this error: http://tny.cz/cccf5c8f

Running http://localhost:8983/solr/ gives this error: http://tny.cz/0c533931

Here is my solrconfig.xml: http://tny.cz/4855ed9f
Where is the problem?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Spelling-suggestions-tp4083519p4083715.html
Sent from the Solr - User mailing list archive at Nabble.com.


Configuring SpellCehckComponent

2013-08-10 Thread Kamaljeet Kaur
I am using apache-solr-3.5.0.
I want to apply spell check, partial search, phonetic search and autocomplete
to my project database with Solr, and started with the first feature. The
instructions say to make some changes in solrconfig.xml, but I am not getting
where to change and what to change to configure it. Pasting it in the config
tag gives an error.

Where should I place the following? Should I paste it as-is, or does it need
some changes?



<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="field">content</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
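The snippet goes inside the <config> element of solrconfig.xml, and the component also needs a request handler wired to it. A hedged sketch based on the stock Solr 3.x example config (the handler name and defaults here are assumptions to adapt):

```xml
<requestHandler name="/spell" class="solr.SearchHandler" lazy="true">
  <lst name="defaults">
    <str name="spellcheck">on</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```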



Please help.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuring-SpellCehckComponent-tp4083731.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Could not load config for solrconfig.xml

2013-08-10 Thread shuargan
Do you remember what your mistake was?
I'm having the same issue.

I have this solr.xml conf under Catalina/localhost



 



and it is throwing the same error...
HTTP Status 500 - {msg=SolrCore 'collection1' is not available due to init
failure: Could not load config for
solrconfig.xml,trace=org.apache.solr.common.SolrException: SolrCore
'collection1' is not available due to init failure: Could not load config
for solrconfig.xml at
org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:860) at 


I have no idea what to do



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-not-load-config-for-solrconfig-xml-tp4052152p4083741.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Percolate feature?

2013-08-10 Thread Mark
> So to reiterate your examples from before, but change the "labels" a 
> bit and add some more converse examples (and ignore the "highlighting" 
> aspect for a moment...
> 
> doc1 = "Sony"
> doc2 = "Samsung Galaxy"
> doc3 = "Sony Playstation"
> 
> queryA = "Sony Experia"   ... matches only doc1
> queryB = "Sony Playstation 3" ... matches doc3 and doc1
> queryC = "Samsung 52inch LC"  ... doesn't match anything
> queryD = "Samsung Galaxy S4"  ... matches doc2
> queryE = "Galaxy Samsung S4"  ... matches doc2
> 
> 
> ...do i still have that correct?

Yes

> 2) if you *do* care about using non-trivial analysis, then you can't use 
> the simple "termfreq()" function, which deals with raw terms -- instead 
> you have to use the "query()" function to ensure that the input is parsed 
> appropriately -- but then you have to wrap that function in something that 
> will normalize the scores - so in place of termfreq('words','Galaxy') 
> you'd want something like...


Yes, we will be using non-trivial analysis. Now here's another twist… what if
we don't care about scoring?


Let's talk about the real use case. We are a marketplace that sells products
that users have listed. For certain popular, high-risk or restricted keywords
we charge the seller an extra fee or ban the listing. We now have sellers
purposely misspelling their listings to circumvent this fee. They will start
adding suffixes to their product listings such as "Sonies", knowing that it
gets indexed down to "Sony" and thus matches a user's query for Sony. Or they
will munge together numbers and products… "2013Sony". The same thing goes for
adding crazy non-ASCII characters to the front of the keyword: "ΒSony". This
is obviously a problem because we aren't charging for these keywords and,
more importantly, it makes our search results look like shit.
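For illustration only, a crude stand-in (not Solr's actual KStemmer/WordDelimiterFilterFactory chain) showing how such variants can collapse to the same token:

```java
public class MungedKeywordSketch {

    /** Crude normalization sketch: strip non-ASCII, split off digits, naive de-pluralize. */
    static String normalize(String raw) {
        // drop non-ASCII characters (e.g. a prepended Greek letter)
        String s = raw.replaceAll("[^\\x00-\\x7F]", "");
        // split digit/letter boundaries the way WordDelimiterFilterFactory can; keep the letters
        s = s.replaceAll("\\d+", " ").trim();
        s = s.toLowerCase();
        // naive stemming stand-in for KStemmer: "sonies" -> "sony"
        if (s.endsWith("ies")) s = s.substring(0, s.length() - 3) + "y";
        else if (s.endsWith("s")) s = s.substring(0, s.length() - 1);
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Sonies"));   // prints sony
        System.out.println(normalize("2013Sony")); // prints sony
        System.out.println(normalize("ΒSony"));    // prints sony
    }
}
```

Because the index-time analysis chain does this kind of collapsing for you, a regex over the raw title can never stay in sync with what actually gets indexed.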

We would like to:

1) Detect when a certain keyword is in a product title at listing time so we
may charge the seller. This was my idea of a "reverse search", although it
sounds like I may have caused too much confusion with that term.
2) Attempt to autocorrect these titles, hence the need for highlighting so we
can try to replace the terms… this is of course done outside of Solr via an
external service.

Since we do some stemming (KStemmer) and filtering (WordDelimiterFilterFactory),
conventional approaches such as regexes are quite troublesome. Regexes are
also quite slow, scale horribly, and always need to be kept in lockstep with
schema changes.

Now knowing this, is there a good way to approach this?

Thanks


On Aug 9, 2013, at 11:56 AM, Chris Hostetter  wrote:

> 
> : I'll look into this. Thanks for the concrete example as I don't even 
> : know which classes to start to look at to implement such a feature.
> 
> Either roman isn't understanding what you are asking for, or i'm not -- 
> but i don't think what roman described will work for you...
> 
> : > so if your query contains no duplicates and all terms must match, you can
> : > be sure that you are collecting docs only when the number of terms matches
> : > number of clauses in the query
> 
> several of the examples you gave did not match what Roman is describing, 
> as i understand it.  Most people on this thread seem to be getting 
> confused by having their perceptions "flipped" about what your "data known 
> in advance is" vs the "data you get at request time".
> 
> You described this...
> 
> : > Product keyword:  "Sony"
> : > Product keyword:  "Samsung Galaxy"
> : > 
> : > We would like to be able to detect given a product title whether or
> : >> not it
> : > matches any known keywords. For a keyword to be matched all of it's
> : >> terms
> : > must be present in the product title given.
> : > 
> : > Product Title: "Sony Experia"
> : > Matches and returns a highlight: "Sony Experia"
> 
> ...suggesting that what you call "product keywords" are the "data you know 
> about in advance" and "product titles" are the data you get at request 
> time.
> 
> So your example of the "request time" input (ie: query) "Sony Experia" 
> matching "data known in advance (ie: indexed document) "Sony" would not 
> work with Roman's example.
> 
> To rephrase (what i think i understand is) your goal...
> 
> * you have many (10^3+) documents known in advance
> * any document D contains a set of words W(D) of varying sizes
> * any request Q contains a set of words W(Q) of varying sizes
> * you want a given request Q to match a document D if and only if:
>   - W(D) is a subset of W(Q)
>   - ie: no item exists in W(D) that does not exist in W(Q)
>   - ie: any number of items may exist in W(Q) that are not in W(D)
> 
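The subset condition quoted above can be sketched directly, checked against the thread's own examples (plain Java, whitespace tokenization only, ignoring analysis):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SubsetMatchSketch {

    /** True iff every word of the stored doc appears in the incoming request: W(D) ⊆ W(Q). */
    static boolean matches(String doc, String query) {
        Set<String> queryWords =
                new HashSet<>(Arrays.asList(query.toLowerCase().split("\\s+")));
        for (String t : doc.toLowerCase().split("\\s+")) {
            if (!queryWords.contains(t)) return false; // a doc word missing from the request
        }
        return true;
    }

    public static void main(String[] args) {
        String doc1 = "Sony", doc2 = "Samsung Galaxy", doc3 = "Sony Playstation";
        System.out.println(matches(doc1, "Sony Experia"));       // true  (queryA -> doc1)
        System.out.println(matches(doc3, "Sony Playstation 3")); // true  (queryB -> doc3)
        System.out.println(matches(doc2, "Samsung 52inch LC"));  // false (queryC -> nothing)
        System.out.println(matches(doc2, "Galaxy Samsung S4"));  // true  (queryE -> doc2)
    }
}
```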
> So to reiterate your examples from before, but change the "labels" a 
> bit and add some more converse examples (and ignore the "highlighting" 
> aspect for a moment...
> 
> doc1 = "Sony"
> doc2 = "Samsung Galaxy"
> doc3 = "Sony Playstation"
> 
> queryA = "Sony Experia"   ... matches only doc1
> queryB = "Sony Playsta

defType

2013-08-10 Thread William Bell
What are the possible options for defType?

lucene
dismax
edismax

Others?

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Percolate feature?

2013-08-10 Thread Jack Krupansky

Now we're getting somewhere!

To (over-simplify), you simply want to know if a given "listing" would match 
a high-value pattern, either in a "clean" manner (obvious keywords) or in an 
"unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams.)


To a large extent this also depends on how rich and powerful your end-user
query support is. So, if the user searches for "sony", "samsung", or "apple",
will it match some oddball listing that fuzzily matches those terms?


So... tell us how rich your query interface is. I mean, do you support
wildcards, fuzzy queries, ngrams (e.g., can they type "son" or "sam" or "app",
or... will "sony" match "sonblah-blah")?


Reverse-search may in fact be what you need in this case since you literally 
do mean "if I index this document, will it match any of these queries" (but 
doesn't score a hit on your direct check for whether it is a clean keyword 
match.)


In your previous examples you only gave clean product titles, not examples 
of circumventions of simple keyword matches.


-- Jack Krupansky

-Original Message- 
From: Mark

Sent: Saturday, August 10, 2013 6:24 PM
To: solr-user@lucene.apache.org
Cc: Chris Hostetter
Subject: Re: Percolate feature?


So to reiterate your examples from before, but change the "labels" a
bit and add some more converse examples (and ignore the "highlighting"
aspect for a moment...

doc1 = "Sony"
doc2 = "Samsung Galaxy"
doc3 = "Sony Playstation"

queryA = "Sony Experia"   ... matches only doc1
queryB = "Sony Playstation 3" ... matches doc3 and doc1
queryC = "Samsung 52inch LC"  ... doesn't match anything
queryD = "Samsung Galaxy S4"  ... matches doc2
queryE = "Galaxy Samsung S4"  ... matches doc2


...do i still have that correct?


Yes


2) if you *do* care about using non-trivial analysis, then you can't use
the simple "termfreq()" function, which deals with raw terms -- instead
you have to use the "query()" function to ensure that the input is parsed
appropriately -- but then you have to wrap that function in something that
will normalize the scores - so in place of termfreq('words','Galaxy')
you'd want something like...



Yes, we will be using non-trivial analysis. Now here's another twist… what if
we don't care about scoring?



Let's talk about the real use case. We are a marketplace that sells products
that users have listed. For certain popular, high-risk or restricted keywords
we charge the seller an extra fee or ban the listing. We now have sellers
purposely misspelling their listings to circumvent this fee. They will start
adding suffixes to their product listings such as "Sonies", knowing that it
gets indexed down to "Sony" and thus matches a user's query for Sony. Or they
will munge together numbers and products… "2013Sony". The same thing goes for
adding crazy non-ASCII characters to the front of the keyword: "ΒSony". This
is obviously a problem because we aren't charging for these keywords and,
more importantly, it makes our search results look like shit.


We would like to:

1) Detect when a certain keyword is in a product title at listing time so we
may charge the seller. This was my idea of a "reverse search", although it
sounds like I may have caused too much confusion with that term.
2) Attempt to autocorrect these titles, hence the need for highlighting so we
can try to replace the terms… this is of course done outside of Solr via an
external service.


Since we do some stemming (KStemmer) and filtering (WordDelimiterFilterFactory),
conventional approaches such as regexes are quite troublesome. Regexes are
also quite slow, scale horribly, and always need to be kept in lockstep with
schema changes.


Now knowing this, is there a good way to approach this?

Thanks


On Aug 9, 2013, at 11:56 AM, Chris Hostetter  
wrote:




: I'll look into this. Thanks for the concrete example as I don't even
: know which classes to start to look at to implement such a feature.

Either roman isn't understanding what you are asking for, or i'm not --
but i don't think what roman described will work for you...


: > so if your query contains no duplicates and all terms must match, you 
can
: > be sure that you are collecting docs only when the number of terms 
matches

: > number of clauses in the query

several of the examples you gave did not match what Roman is describing,
as i understand it.  Most people on this thread seem to be getting
confused by having their perceptions "flipped" about what your "data known
in advance is" vs the "data you get at request time".

You described this...

: > Product keyword:  "Sony"
: > Product keyword:  "Samsung Galaxy"
: >
: > We would like to be able to detect given a product title whether 
or

: >> not it
: > matches any known keywords. For a keyword to be matched all of 
it's

: >> terms
: > must be present in the product title given.
: >
: > Product Title: "Sony Experia"
: > Matches and returns a highlight: "Sony Experia"

...suggesting that what y

Re: defType

2013-08-10 Thread Jack Krupansky

The full list is in my book. What did you need in particular?

(Actually, I forgot to add "maxscore" to my list.)

-- Jack Krupansky

-Original Message- 
From: William Bell 
Sent: Saturday, August 10, 2013 6:30 PM 
To: solr-user@lucene.apache.org 
Subject: defType 


What are the possible options for defType?

lucene
dismax
edismax

Others?

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: defType

2013-08-10 Thread William Bell
Can you list them out?

Thanks.

raw
lucene
dismax
edismax
field




On Sat, Aug 10, 2013 at 4:45 PM, Jack Krupansky wrote:

> The full list is in my book. What did you need in particular?
>
> (Actually, I forgot to add "maxscore" to my list.)
>
> -- Jack Krupansky
>
> -Original Message- From: William Bell Sent: Saturday, August 10,
> 2013 6:30 PM To: solr-user@lucene.apache.org Subject: defType
> What are the possible options for defType?
>
> lucene
> dismax
> edismax
>
> Others?
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: defType

2013-08-10 Thread Erik Hatcher
See http://slideshare.net/erikhatcher/solr-query-parsing slides 4 and 5 

Plus whatever's been added since then. 

   Erik

On Aug 10, 2013, at 19:05, William Bell  wrote:

> Can you list them out?
> 
> Thanks.
> 
> raw
> lucene
> dismax
> edismax
> field
> 
> 
> 
> 
> On Sat, Aug 10, 2013 at 4:45 PM, Jack Krupansky 
> wrote:
> 
>> The full list is in my book. What did you need in particular?
>> 
>> (Actually, I forgot to add "maxscore" to my list.)
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: William Bell Sent: Saturday, August 10,
>> 2013 6:30 PM To: solr-user@lucene.apache.org Subject: defType
>> What are the possible options for defType?
>> 
>> lucene
>> dismax
>> edismax
>> 
>> Others?
>> 
>> --
>> Bill Bell
>> billnb...@gmail.com
>> cell 720-256-8076
> 
> 
> 
> -- 
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076


Re: defType

2013-08-10 Thread Koji Sekiguchi

See lines 33 to 50 at
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/java/org/apache/solr/search/QParserPlugin.java?view=markup

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html

(13/08/11 8:05), William Bell wrote:

Can you list them out?

Thanks.

raw
lucene
dismax
edismax
field




On Sat, Aug 10, 2013 at 4:45 PM, Jack Krupansky wrote:


The full list is in my book. What did you need in particular?

(Actually, I forgot to add "maxscore" to my list.)

-- Jack Krupansky

-Original Message- From: William Bell Sent: Saturday, August 10,
2013 6:30 PM To: solr-user@lucene.apache.org Subject: defType
What are the possible options for defType?

lucene
dismax
edismax

Others?

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076




Re: Percolate feature?

2013-08-10 Thread Mark
Our schema is pretty basic... nothing fancy going on here.


On Aug 10, 2013, at 3:40 PM, "Jack Krupansky"  wrote:

> Now we're getting somewhere!
> 
> To (over-simplify), you simply want to know if a given "listing" would match 
> a high-value pattern, either in a "clean" manner (obvious keywords) or in an 
> "unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams.)
> 
> To a large extent this also depends on how rich and powerful your end-user query 
> support is. So, if the user searches for "sony", "samsung", or "apple", will 
> it match some oddball listing that fuzzily matches those terms.
> 
> So... tell us, how rich your query interface is. I mean, do you support 
> wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", 
> or... will "sony" match "sonblah-blah")?
> 
> Reverse-search may in fact be what you need in this case since you literally 
> do mean "if I index this document, will it match any of these queries" (but 
> doesn't score a hit on your direct check for whether it is a clean keyword 
> match.)
> 
> In your previous examples you only gave clean product titles, not examples of 
> circumventions of simple keyword matches.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Mark
> Sent: Saturday, August 10, 2013 6:24 PM
> To: solr-user@lucene.apache.org
> Cc: Chris Hostetter
> Subject: Re: Percolate feature?
> 
>> So to reiterate your examples from before, but change the "labels" a
>> bit and add some more converse examples (and ignore the "highlighting"
>> aspect for a moment...
>> 
>> doc1 = "Sony"
>> doc2 = "Samsung Galaxy"
>> doc3 = "Sony Playstation"
>> 
>> queryA = "Sony Experia"   ... matches only doc1
>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>> queryC = "Samsung 52inch LC"  ... doesn't match anything
>> queryD = "Samsung Galaxy S4"  ... matches doc2
>> queryE = "Galaxy Samsung S4"  ... matches doc2
>> 
>> 
>> ...do i still have that correct?
> 
> Yes
> 
>> 2) if you *do* care about using non-trivial analysis, then you can't use
>> the simple "termfreq()" function, which deals with raw terms -- instead
>> you have to use the "query()" function to ensure that the input is parsed
>> appropriately -- but then you have to wrap that function in something that
>> will normalize the scores - so in place of termfreq('words','Galaxy')
>> you'd want something like...
> 
> 
> Yes, we will be using non-trivial analysis. Now here's another twist… what if 
> we don't care about scoring?
> 
> 
> Let's talk about the real use case. We are a marketplace that sells products 
> that users have listed. For certain popular, high-risk or restricted keywords 
> we charge the seller an extra fee or ban the listing. We now have sellers 
> purposely misspelling their listings to circumvent this fee. They will start 
> adding suffixes to their product listings such as "Sonies", knowing that it 
> gets indexed down to "Sony" and thus matches a user's query for Sony. Or they 
> will munge together numbers and products… "2013Sony". The same thing goes for 
> adding crazy non-ASCII characters to the front of the keyword: "ΒSony". This 
> is obviously a problem because we aren't charging for these keywords and, 
> more importantly, it makes our search results look like shit.
> 
> We would like to:
> 
> 1) Detect when a certain keyword is in a product title at listing time so we 
> may charge the seller. This was my idea of a "reverse search", although it 
> sounds like I may have caused too much confusion with that term.
> 2) Attempt to autocorrect these titles, hence the need for highlighting so we 
> can try to replace the terms… this is of course done outside of Solr via an 
> external service.
> 
> Since we do some stemming (KStemmer) and filtering 
> (WordDelimiterFilterFactory), conventional approaches such as regexes are 
> quite troublesome. Regexes are also quite slow, scale horribly, and always 
> need to be kept in lockstep with schema changes.
> 
> Now knowing this, is there a good way to approach this?
> 
> Thanks
> 
> 
> On Aug 9, 2013, at 11:56 AM, Chris Hostetter  wrote:
> 
>> 
>> : I'll look into this. Thanks for the concrete example as I don't even
>> : know which classes to start to look at to implement such a feature.
>> 
>> Either roman isn't understanding what you are asking for, or i'm not -- but 
>> i don't think what roman described will work for you...
>> 
>> : > so if your query contains no duplicates and all terms must match, you can
>> : > be sure that you are collecting docs only when the number of terms 
>> matches
>> : > number of clauses in the query
>> 
>> several of the examples you gave did not match what Roman is describing,
>> as i understand it.  Most people on this thread seem to be getting
>> confused by having their perceptions "flipped" about what your "data known
>> in advance is" v