Re: How can I get a correct stemmed query?

2010-10-18 Thread Ahmet Arslan
Are you using KLTQueryAnalyzer outside of Solr (as a pre-process)?
Or have you defined a fieldType in schema.xml that uses KLTQueryAnalyzer?

Can you append &debugQuery=on to your search URL and paste the output?

--- On Mon, 10/18/10, Jerad  wrote:

> From: Jerad 
> Subject: How can I get a correct stemmed query?
> To: solr-user@lucene.apache.org
> Date: Monday, October 18, 2010, 9:15 AM
> 
> Hi~. I'm a beginner who wants to build a search system using Solr 1.4.1
> and Lucene 2.9.2.
> 
> I got a correct Lucene query from my custom Analyzer and filter for a
> given query, but no results are displayed.
> 
> Here is my Analyzer source.
> 
> --
> public class KLTQueryAnalyzer extends Analyzer {
>     public static final Version LUCENE_VERSION = Version.LUCENE_29;
>     public static int QUERY_MIN_LEN_WORD_FILTER = 1;
>     public static int QUERY_MAX_LEN_WORD_FILTER = 40;
> 
>     public int elapsedTime = 0;
> 
>     @Override
>     public TokenStream tokenStream(String paramString, Reader reader) {
>         StandardTokenizer tokenizer = new StandardTokenizer(
>                 du.utas.mcrdr.ir.lucene.WebDocIR.LUCENE_VERSION, reader );
> 
>         TokenStream tokenStream = new LengthFilter( tokenizer,
>                 QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER );
>         tokenStream = new LowerCaseFilter( tokenStream );
> 
>         //My custom stemmer method
>         KLTSingleWordStemmer stemer = new KLTSingleWordStemmer(
>                 QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER );
> 
>         //My custom analyzer filter. This filter returns a sub-merged query.
>         //ex) INPUT : flyaway
>         //    RETURN VALUE : fly +body:away
>         tokenStream = new KLTQueryStemFilter( tokenStream, stemer, this );
> 
>         return tokenStream;
>     }
> }
> --
> 
> 
> example query)  Input user query : +body:flyaway
>                 Expected analyzed query : +body:fly +body:away
>                 INDEXED DATA : body> fly away
> 
> I'm expecting 1 doc to be returned from the index, but no results are
> returned.
> 
> To explain my custom flow:
> 
> 1. User input query : +body:flyaway
> 2. Analyzer returns : fly +body:away
> 3. Solr attaches the search field tag "+body" to the filter-returned
>    query, as I defined in schema.xml (default operator "AND").
> 4. I indexed 1 doc that has a field named "body" containing the phrase
>    "fly away".
> 5. I expect 1 doc to be returned for the query "+body:fly +body:away",
>    but 0 docs are returned.
> 
> What's the problem?? Anybody help me please~ :>
> 
> 
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-can-i-get-collect-stemmed-query-tp1723055p1723055.html
> Sent from the Solr - User mailing list archive at
> Nabble.com.
> 





Re: how can i use solrj binary format for indexing?

2010-10-18 Thread Peter Karich
Hi,

you can try parsing the XML yourself in Java and then pushing the
resulting SolrInputDocuments to Solr via SolrJ.
Setting the format to binary and using the streaming update server should
improve performance, but I am not sure... and performant (and less
memory-hungry!) XML reading in Java is another topic ... ;-)
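
Roughly, a minimal sketch of that combination (assuming SolrJ 1.4's
StreamingUpdateSolrServer and BinaryRequestWriter; the URL, queue size,
thread count and field names are examples only):

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BinaryIndexer {
    public static void main(String[] args) throws Exception {
        // queue size 20, 4 background threads -- tune for your hardware
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);
        // send documents in the javabin format instead of XML
        server.setRequestWriter(new BinaryRequestWriter());

        // build this from your parsed XML
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("body", "fly away");
        server.add(doc);

        server.commit();
    }
}

The javabin writer sends documents in Solr's binary format, so the server
skips XML parsing on its side.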

Regards,
Peter.

> Hi all
> I have a huge amount of XML files for indexing.
> I want to index using the SolrJ binary format to get a performance gain,
> because I heard that indexing with XML files is quite slow.
> But I don't know how to index through the SolrJ binary format and can't
> find examples.
> Please give some help.
> Thanks,
>   


-- 
http://jetwick.com twitter search prototype



Re: query between two date

2010-10-18 Thread Savvas-Andreas Moysidis
You'll have to supply your dates in a format Solr expects (e.g.
2010-10-19T08:29:43Z and not 2010-10-19). If you don't need millisecond
granularity, you can use the DateMath syntax to specify that.
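
For illustration, a couple of hedged examples (the field name is borrowed
from your post below; NOW/DAY rounds down to the start of today):

tdm_avail2:[2010-10-19T00:00:00Z TO 2010-10-21T00:00:00Z]
tdm_avail2:[NOW/DAY TO NOW/DAY+2DAYS]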

Please, also check http://wiki.apache.org/solr/SolrQuerySyntax.

On 17 October 2010 10:54, nedaha  wrote:

>
> Hi there,
>
> At first I have to explain the situation.
> I have 2 indexed fields named tdm_avail1 and tdm_avail2 that are arrays
> of different dates.
>
> This is a sample doc:
>
> <arr name="tdm_avail1">
>   <date>2010-10-21T08:29:43Z</date>
>   <date>2010-10-22T08:29:43Z</date>
>   <date>2010-10-25T08:29:43Z</date>
>   <date>2010-10-26T08:29:43Z</date>
>   <date>2010-10-27T08:29:43Z</date>
> </arr>
>
> <arr name="tdm_avail2">
>   <date>2010-10-19T08:29:43Z</date>
>   <date>2010-10-20T08:29:43Z</date>
>   <date>2010-10-21T08:29:43Z</date>
>   <date>2010-10-22T08:29:43Z</date>
> </arr>
>
> And in my search form I have 2 fields named check-in date and check-out
> date.
> I want Solr to compare the range that the user enters in the search form
> with the values of tdm_avail1 and tdm_avail2, and return the doc if all
> dates between the check-in and check-out dates match the tdm_avail1 or
> tdm_avail2 values.
>
> For example, if the user enters:
> check-in date: 2010-10-19
> check-out date: 2010-10-21
> that matches tdm_avail2, so the doc must be returned.
>
> But if the user enters:
> check-in date: 2010-10-25
> check-out date: 2010-10-29
> the doc must not be returned.
>
> So I want the query that gives me the mentioned result. Could you help
> me, please?
>
> Thanks in advance
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1718566.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How do you programmatically create new cores?

2010-10-18 Thread Bastian
An HTTP GET call is made simply by entering the URL into your browser, as
shown in the example in the wiki:

http://localhost:8983/solr/admin/cores?action=CREATE&name=coreX&instanceDir=path_to_instance_directory&config=config_file_name.xml&schema=schem_file_name.xml&dataDir=data

-----Original Message-----
From: Tharindu Mathew [mailto:mcclou...@gmail.com]
Sent: Sunday, 17 October 2010 18:07
To: solr-user@lucene.apache.org
Cc: solr-user@lucene.apache.org
Subject: Re: How do you programmatically create new cores?

Hi Marc, 

Thanks for the reply. 

So as I understand it, I need to make an HTTP GET call with an action
parameter set to "create" to dynamically create a core? I do not see an
API to do this anywhere.

On Oct 17, 2010, at 3:54 PM, Marc Sturlese  wrote:

> 
> You have to create the core's folder with its conf inside the Solr home.
> Once done you can call the create action of the admin handler:
> http://wiki.apache.org/solr/CoreAdmin#CREATE
> If you need to dynamically create, start and stop lots of cores
> there's this patch, but I don't know about its current state:
> http://wiki.apache.org/solr/LotsOfCores
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-do-you-programatically-create-n
> ew-cores-tp1706487p1718648.html Sent from the Solr - User mailing list 
> archive at Nabble.com.



Re: query between two date

2010-10-18 Thread nedaha

Thanks for your reply.
I know about the Solr date format!! The check-in and check-out dates are in
a user-friendly format that we use in our search form for the system's
users, and I change the format via code before sending them to Solr.
I want to know how I can make a query that compares a range between the
check-in and check-out dates with the separate dates that I have in the
Solr index.
For example:
check-in date is: 2010-10-19T00:00:00Z
and
check-out date is: 2010-10-21T00:00:00Z

When I want to build a query from my application I have a date range, but
in the Solr index I have separate dates.
So how can I compare them to get the appropriate result?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1723752.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How can I get a correct stemmed query?

2010-10-18 Thread Jerad

Oops, I'm sorry! I found some mistakes in the previously posted source.
(The main class name was wrong :<)

This is the correct analyzer source.
---
public class MyCustomQueryAnalyzer extends Analyzer {
    public static final Version LUCENE_VERSION = Version.LUCENE_29;
    public static int QUERY_MIN_LEN_WORD_FILTER = 1;
    public static int QUERY_MAX_LEN_WORD_FILTER = 40;

    public int elapsedTime = 0;

    @Override
    public TokenStream tokenStream(String paramString, Reader reader) {
        StandardTokenizer tokenizer = new StandardTokenizer(
                du.utas.mcrdr.ir.lucene.WebDocIR.LUCENE_VERSION, reader );

        TokenStream tokenStream = new LengthFilter( tokenizer,
                QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER );
        tokenStream = new LowerCaseFilter( tokenStream );

        //My custom stemmer method
        MyCustomSingleWordStemmer stemer = new MyCustomSingleWordStemmer(
                QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER );

        //My custom analyzer filter. This filter returns a sub-merged query.
        //ex) INPUT : flyaway
        //    RETURN VALUE : fly +body:away
        tokenStream = new KLTQueryStemFilter( tokenStream, stemer, this );

        return tokenStream;
    }
}

---

[Additional info]

1. MyCustomQueryAnalyzer was made outside of Solr.
I built this analyzer outside of the Solr package, packaged it as a .jar,
and placed it at

~/Solr/example/work/Jetty_0_0_0_0_8982_solr.war__solr__-2c5peu/webapp/WEB-INF/lib

2. I edited the field type and field name in schema.xml for the field to
be searched.

<field name="body" indexed="true" stored="true" omitNorms="true" ... />

<fieldType ... class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query" class="com.testsolr.ir.customAnalyzer.MyCustomQueryAnalyzer">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

This is my custom schema.xml and custom search field type.

3. I've got this XML result when I append &debugQuery=on to my search URL.

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="debugQuery">on</str>
      <str name="start">0</str>
      <str name="q">+body:flyaway</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">+body:flyaway</str>
    <str name="querystring">+body:flyaway</str>
    <str name="parsedquery">+body:fly +body:away</str>
    <str name="parsedquery_toString">+body:fly +body:away</str>
    <lst name="explain"/>
    <str name="QParser">LuceneQParser</str>
    <lst name="timing">
      ... (prepare and process times of 0.0 for QueryComponent,
      FacetComponent, MoreLikeThisComponent, HighlightComponent,
      StatsComponent and DebugComponent) ...
    </lst>
  </lst>
</response>

I really appreciate your advice~ :)

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-i-get-collect-search-result-from-custom-filtered-query-tp1723055p1723815.html
Sent from the Solr - User mailing list archive at Nabble.com.


Boosting documents based on the vote count

2010-10-18 Thread Alexandru Badiu
Hello all,

I have a field in my schema which holds the number of votes a document
has. How can I boost documents based on that number?

Something like: the one with the maximum number of votes gets a boost of
10, the one with the smallest number gets 0.5, and the values in between
are calculated automatically.

Thanks,
Alexandru Badiu


Re: query between two date

2010-10-18 Thread Savvas-Andreas Moysidis
ok, maybe I don't get this right..

are you trying to match something like check-in date > 2010-10-19T00:00:00Z
AND check-out date < 2010-10-21T00:00:00Z, *or* check-in date =
2010-10-19T00:00:00Z AND check-out date = 2010-10-21T00:00:00Z?

On 18 October 2010 10:05, nedaha  wrote:

>
> Thanks for your reply.
> I know about the Solr date format!! The check-in and check-out dates are
> in a user-friendly format that we use in our search form for the system's
> users, and I change the format via code before sending them to Solr.
> I want to know how I can make a query that compares a range between the
> check-in and check-out dates with the separate dates that I have in the
> Solr index.
> For example:
> check-in date is: 2010-10-19T00:00:00Z
> and
> check-out date is: 2010-10-21T00:00:00Z
>
> When I want to build a query from my application I have a date range, but
> in the Solr index I have separate dates.
> So how can I compare them to get the appropriate result?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1723752.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


solr requirements

2010-10-18 Thread satya swaroop
Hi All,
I am planning to have a separate server for Solr, and regarding
hardware requirements I have a doubt about what configuration is needed.
I know it will be hard to tell, but I just need a minimum requirement for
the particular situation as follows:


1) There are 1000 regular users using Solr, and every day each user indexes
10 files of 1KB each; in total that adds 10MB per day, and it keeps
growing...???

2) How much RAM is used by Solr in general???

Thanks,
satya


Re: query between two date

2010-10-18 Thread nedaha

The exact query that I want is:

check-in date >= 2010-10-19T00:00:00Z
AND check-out date <= 2010-10-21T00:00:00Z

but because of the structure that I have to index, I don't have a specific
start date and end date in my Solr index to compare with the check-in and
check-out date range. I have some dates that are available to reserve!

Could you please help me? :)


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1724062.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: API for using Multi cores with SolrJ

2010-10-18 Thread Peter Karich
I asked this myself ... here could be some pointers:

http://lucene.472066.n3.nabble.com/SolrJ-and-Multi-Core-Set-up-td1411235.html
http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-in-Single-Core-td475238.html
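
For what it's worth, a rough sketch of the SolrJ side (assuming the
CoreAdminRequest helpers; the URLs, core names and field are examples, and
the core's instance directory with its conf must already exist on the
server):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.SolrInputDocument;

public class MultiCoreExample {
    public static void main(String[] args) throws Exception {
        // core admin requests go against the base Solr URL, not a core URL
        SolrServer admin = new CommonsHttpSolrServer("http://localhost:8983/solr");

        CoreAdminRequest.createCore("core01", "core01", admin); // name, instanceDir
        System.out.println(CoreAdminRequest.getStatus("core01", admin));

        // documents are added through a server pointed at the new core
        SolrServer core01 = new CommonsHttpSolrServer("http://localhost:8983/solr/core01");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        core01.add(doc);
        core01.commit();
    }
}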

> Hi everyone,
>
> I'm trying to write some code for creating and using multi cores.
>
> Is there a method available for this purpose or do I have to do a HTTP
> to a URL such as
> http://localhost:8983/solr/admin/cores?action=STATUS&core=core0
>
> Is there an API available for this purpose. For example, if I want to
> create a new core named "core01" and then check for the status and
> then insert a document to that index of core01, how do I do it?
>
> Any help or a document would help greatly.
>
> Thanks in advance.
>
> --
> Regards,
>
> Tharindu
>
>   


-- 
http://jetwick.com twitter search prototype



Re: query between two date

2010-10-18 Thread Savvas-Andreas Moysidis
ok, I see now.. well, the only query that comes to mind is something like:

check-in date:[2010-10-19T00:00:00Z TO *] AND check-out date:[* TO
2010-10-21T00:00:00Z]

Would something like that work?

On 18 October 2010 11:04, nedaha  wrote:

>
> The exact query that I want is:
>
> check-in date >= 2010-10-19T00:00:00Z
> AND check-out date <= 2010-10-21T00:00:00Z
>
> but because of the structure that I have to index, I don't have a specific
> start date and end date in my Solr index to compare with the check-in and
> check-out date range. I have some dates that are available to reserve!
>
> Could you please help me? :)
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1724062.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How can I get a correct stemmed query?

2010-10-18 Thread Ahmet Arslan
rawquerystring = +body:flyaway
parsedquery = +body:fly +body:away

shows that your custom filter is working as you expected.

However, you are using different tokenizers at query time
(StandardTokenizer, hard-coded) and at index time (WhitespaceTokenizer).
That may cause numFound=0.

For example, if your indexed document contains 'fly, away' in its body
field, your query won't return it, because of the comma.

admin/analysis.jsp shows the indexed tokens.

You can issue a *:* query to see if that document really exists:
q=*:*&fl=body

Your query analyzer definition should look like:

<analyzer type="query" class="com.testsolr.ir.customAnalyzer.MyCustomQueryAnalyzer"/>

You cannot have both an analyzer and a tokenizer at the same time.

Once you get this working, in your case it is better to write a custom
filter factory plug-in and define the query analyzer using it (for
performance reasons). And you can load your plug-in more easily:
http://wiki.apache.org/solr/SolrPlugins#How_to_Load_Plugins

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="...KLTQueryStemFilterFactory"/>
</analyzer>
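
For what such a factory could look like, here is a rough sketch (the class
name, package and the null analyzer argument are assumptions; adjust to
KLTQueryStemFilter's real constructor):

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class KLTQueryStemFilterFactory extends BaseTokenFilterFactory {
    @Override
    public TokenStream create(TokenStream input) {
        // 1 and 40 mirror QUERY_MIN/MAX_LEN_WORD_FILTER from the analyzer;
        // the null stands in for the Analyzer argument the filter expects
        return new KLTQueryStemFilter(input,
                new KLTSingleWordStemmer(1, 40), null);
    }
}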

--- On Mon, 10/18/10, Jerad  wrote:

> From: Jerad 
> Subject: Re: How can I get a correct stemmed query?
> To: solr-user@lucene.apache.org
> Date: Monday, October 18, 2010, 12:14 PM
> 
> Oops, I'm sorry! I found some mistakes in the previously posted source.
> (The main class name was wrong :<)
> 
> This is the correct analyzer source.
> ---
> public class MyCustomQueryAnalyzer extends Analyzer {
>     public static final Version LUCENE_VERSION = Version.LUCENE_29;
>     public static int QUERY_MIN_LEN_WORD_FILTER = 1;
>     public static int QUERY_MAX_LEN_WORD_FILTER = 40;
> 
>     public int elapsedTime = 0;
> 
>     @Override
>     public TokenStream tokenStream(String paramString, Reader reader) {
>         StandardTokenizer tokenizer = new StandardTokenizer(
>                 du.utas.mcrdr.ir.lucene.WebDocIR.LUCENE_VERSION, reader );
> 
>         TokenStream tokenStream = new LengthFilter( tokenizer,
>                 QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER );
>         tokenStream = new LowerCaseFilter( tokenStream );
> 
>         //My custom stemmer method
>         MyCustomSingleWordStemmer stemer = new MyCustomSingleWordStemmer(
>                 QUERY_MIN_LEN_WORD_FILTER, QUERY_MAX_LEN_WORD_FILTER );
> 
>         //My custom analyzer filter. This filter returns a sub-merged query.
>         //ex) INPUT : flyaway
>         //    RETURN VALUE : fly +body:away
>         tokenStream = new KLTQueryStemFilter( tokenStream, stemer, this );
> 
>         return tokenStream;
>     }
> }
> 
> ---
> 
> [Additional info]
> 
> 1. MyCustomQueryAnalyzer was made outside of Solr.
>     I built this analyzer outside of the Solr package, packaged it as a
> .jar and placed it at
> 
> ~/Solr/example/work/Jetty_0_0_0_0_8982_solr.war__solr__-2c5peu/webapp/WEB-INF/lib
> 
> 
> 2. I edited the field type and field name in schema.xml for the field to
> be searched.
> 
> <field name="body" indexed="true" stored="true" omitNorms="true" ... />
> 
> <fieldType ... class="solr.TextField">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query" class="com.testsolr.ir.customAnalyzer.MyCustomQueryAnalyzer">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
> 
> This is my custom schema.xml and custom search field type.
> 
> 3. I've got this XML result when I append &debugQuery=on to my search URL.
> 
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">0</int>
>     <lst name="params">
>       <str name="indent">on</str>
>       <str name="debugQuery">on</str>
>       <str name="start">0</str>
>       <str name="q">+body:flyaway</str>
>       <str name="version">2.2</str>
>       <str name="rows">10</str>
>     </lst>
>   </lst>
>   <result name="response" numFound="0" start="0"/>
>   <lst name="debug">
>     <str name="rawquerystring">+body:flyaway</str>
>     <str name="querystring">+body:flyaway</str>
>     <str name="parsedquery">+body:fly +body:away</str>
>     <str name="parsedquery_toString">+body:fly +body:away</str>
>     <lst name="explain"/>
>     <str name="QParser">LuceneQParser</str>
>     <lst name="timing">
>       ... (prepare and process times of 0.0 for QueryComponent,
>       FacetComponent, MoreLikeThisComponent, HighlightComponent,
>       StatsComponent and DebugComponent) ...
>     </lst>
>   </lst>
> </response>

Re: Boosting documents based on the vote count

2010-10-18 Thread Ahmet Arslan
> I have a field in my schema which holds the number of votes a document
> has. How can I boost documents based on that number?

You can do it with http://wiki.apache.org/solr/FunctionQuery

Re: Boosting documents based on the vote count

2010-10-18 Thread Alexandru Badiu
I know but I can't figure out what functions to use. :)

On Mon, Oct 18, 2010 at 1:38 PM, Ahmet Arslan  wrote:
>> I have a field in my schema which holds the number of votes
>> a document
>> has. How can I boost documents based on that number?
>
> you can do it with http://wiki.apache.org/solr/FunctionQuery
>


Implementing Search Suggestion on Solr

2010-10-18 Thread Pablo Recio Quijano

Hi!

I'm trying to implement some kind of search suggestion on a search
engine I have implemented. These search suggestions should not be
automatic like the ones described for the SpellCheckComponent [1].
I'm looking for something like:


"SAS oppositions" => "Public job offers for some-company"

So I will have to define it manually. I was thinking about synonyms [2],
but I don't know if it's the proper way to do it, because semantically
those terms are not synonyms.
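
If the synonym route does turn out to be acceptable, a rough sketch of what
it could look like (the mapping entry and the expand flag are illustrative;
whether a multi-word mapping behaves as intended depends on your tokenizer
chain):

In synonyms.txt:
sas oppositions => public job offers for some-company

In the query analyzer of the field type:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="false"/>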


Any ideas or suggestions?

Regards,

[1] http://wiki.apache.org/solr/SpellCheckComponent
[2] 
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory


Re: Term is duplicated when updating a document

2010-10-18 Thread Thomas Kellerer

Thanks.

Not really the answer I wanted to hear, but at least I know this is not my 
fault ;)

Regards
Thomas

Erick Erickson, 15.10.2010 20:42:

This is actually known behavior. The problem is that when you update
a document, it's deleted and re-added, but the original is only marked as
deleted. The terms aren't touched; both the original and the new
document's terms are counted. It'd be hard, very hard, to remove
the terms from the inverted index efficiently.

But when you optimize, all the deleted documents (and their associated
terms) are physically removed from the files, thus your term counts change.

HTH
Erick

On Fri, Oct 15, 2010 at 10:05 AM, Thomas Kellerer wrote:


Thanks for the answer.

> Which fields are modified when the document is updated/replaced?

Only one field was changed, but it was not the one where the auto-suggest
term is coming from.

> Are there any differences in the content of the fields that you are
> using for the AutoSuggest?

No

> Have you changed your schema.xml file recently? If you have, then there
> may have been changes in the way these fields are analyzed and broken
> down to terms.

No, I did a complete index rebuild to rule out things like that.
Then after startup, I did a search, then updated the document and did a
search again.

Regards
Thomas




This may be a bug if you did not change the field or the schema file but
the terms count is changing.

On Fri, Oct 15, 2010 at 9:14 AM, Thomas Kellerer wrote:

> Hi,
>
> we are updating our documents (that represent products in our shop) when
> a dealer modifies them, by calling
> SolrServer.add(SolrInputDocument) with the updated document.
>
> My understanding is that there is no other way of updating an existing
> document.
>
> However, we also use a term query to autocomplete the search field for
> the user, but each time a document is updated (added) the term count is
> incremented. So after starting with a new index the count is e.g. 1,
> then the document (that contains that term) is updated, and the count
> is 2; the next update will set this to 3 and so on.
>
> Once the index is optimized (by calling SolrServer.optimize()) the count
> is correct again.
>
> Am I missing something or is this a bug in Solr/Lucene?
>
> Thanks in advance
> Thomas


Re: "Virtual field", Statistics

2010-10-18 Thread Tanguy Moal
Hello Lance, thank you for your reply.

I created the following JIRA issue:
https://issues.apache.org/jira/browse/SOLR-2171, as suggested.

Can you tell me how new issues are handled by the development teams,
and whether there's a way I could help/contribute?

--
Tanguy

2010/10/16 Lance Norskog :
> Please add a JIRA issue requesting this. A bunch of things are not
> supported for functions: returning as a field value, for example.
>
> On Thu, Oct 14, 2010 at 8:31 AM, Tanguy Moal  wrote:
>> Dear solr-user folks,
>>
>> I would like to use the stats module to perform very basic statistics
>> (mean, min and max) which is actually working just fine.
>>
>> Nevertheless I found a little limitation that bothers me a tiny bit:
>> how to perform the exact same statistics, but on the result of a
>> function query rather than a field.
>>
>> Example :
>> schema :
>> - string : id
>> - float : width
>> - float : height
>> - float : depth
>> - string : color
>> - float : price
>>
>> What I'd like to do is something like :
>> select?price:[45.5 TO
>> 99.99]&stats=on&stats.facet=color&stats.field={volume=product(product(width,
>> height), depth)}
>> I would expect to obtain :
>>
>> [expected stats output elided: min/max/mean for the computed volume,
>> plus the same statistics for each color facet]
>>
>> Of course computing the volume can be performed before indexing the data,
>> but defining virtual fields on the fly from an arbitrary function is
>> powerful, and I am comfortable with the idea that many others would
>> appreciate it. Especially for BI needs and so on... :-D
>> Is there an easy way to do it that I have not been able to find, or is
>> it actually impossible?
>>
>> Thank you very much in advance for your help.
>>
>> --
>> Tanguy
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: SOLR DateTime and SortableLongField field type problems

2010-10-18 Thread Ken Stanley
Just following up to see if anybody might have some words of wisdom on the
issue?

Thank you,

Ken

It looked like something resembling white marble, which was
probably what it was: something resembling white marble.
-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"


On Fri, Oct 15, 2010 at 6:42 PM, Ken Stanley  wrote:

> Hello all,
>
> I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to
> follow the advice from
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html
> about converting date fields to SortableLong fields for better memory
> efficiency. However, whenever I try to do this using the DateFormatter,
> I get exceptions when indexing for every row that tries to create my
> sortable fields.
>
> In my schema.xml, I have the following definitions for the fieldType and
> dynamicField:
>
> <fieldType name="..." class="solr.SortableLongField" indexed="true"
>     stored="false" sortMissingLast="true" omitNorms="true" />
> <dynamicField name="..." type="..." stored="false" indexed="true" />
>
> In my dih.xml, I have the following definitions:
>
> <dataConfig>
> <document>
> <entity
>     name="xml_stories"
>     rootEntity="false"
>     dataSource="null"
>     processor="FileListEntityProcessor"
>     fileName="legacy_stories.*\.xml$"
>     recursive="false"
>     baseDir="/usr/local/extracts"
>     newerThan="${dataimporter.xml_stories.last_index_time}"
> >
> <entity
>     name="stories"
>     pk="id"
>     dataSource="xml_stories"
>     processor="XPathEntityProcessor"
>     url="${xml_stories.fileAbsolutePath}"
>     forEach="/RECORDS/RECORD"
>     stream="true"
>     transformer="DateFormatTransformer,HTMLStripTransformer,RegexTransformer,TemplateTransformer"
>     onError="continue"
> >
> <field column="_modified_date" xpath="/RECORDS/RECORD/pr...@name='R_ModifiedTime']/PVAL" />
> <field column="modified_date" sourceColName="_modified_date"
>     dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
>
> <field column="_df_date_published" xpath="/RECORDS/RECORD/pr...@name='R_StoryDate']/PVAL" />
> <field column="df_date_published" sourceColName="_df_date_published"
>     dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
>
> <field column="..." sourceColName="modified_date" dateTimeFormat="yyyyMMddhhmmss" />
> <field column="..." sourceColName="df_date_published" dateTimeFormat="yyyyMMddhhmmss" />
> </entity>
> </entity>
> </document>
> </dataConfig>
>
> The fields in question are in the formats:
>
> <RECORDS>
> <RECORD>
> <PROP NAME="R_StoryDate">
> <PVAL>2001-12-04T00:00:00Z</PVAL>
> </PROP>
> <PROP NAME="R_ModifiedTime">
> <PVAL>2001-12-04T19:38:01Z</PVAL>
> </PROP>
> </RECORD>
> </RECORDS>
>
> The exception that I am receiving is:
>
> Oct 15, 2010 6:23:24 PM
> org.apache.solr.handler.dataimport.DateFormatTransformer transformRow
> WARNING: Could not parse a Date field
> java.text.ParseException: Unparseable date: "Wed Nov 28 21:39:05 EST 2007"
> at java.text.DateFormat.parse(DateFormat.java:337)
> at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
> at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
> at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
> at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
> at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
> at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
> at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
>
> I know that it has to be the SortableLong fields, because if I remove just
> those two lines from my dih.xml, everything imports as I expect it to. Am I
> doing something wrong? Mis-using the SortableLong and/or DateTransformer? Is
> this not supported in my version of SOLR? I'm not very experienced with
> Java, so digging into the code would be a lost cause for me right now. I was
> hoping that somebody here might be able to help point me in the
> right/correct direction.
>
> It should be noted that the modified_date and df_date_published fields
> index just fine (so long as I do it as I've defined above).
>
> Thank you,
>
> - Ken
>
> It looked like something resembling white marble, which was
> probably what it was: something resembling white marble.
> -- Douglas Adams, "The Hitchhikers Guide to the Galaxy"
>


Re: indexing mysql database

2010-10-18 Thread Erick Erickson
Also, the little-advertised DIH debug page can help, see:
solr/admin/dataimport.jsp

Best
Erick

On Sun, Oct 17, 2010 at 11:56 AM, William Pierce wrote:

> Two suggestions:  a) Noticed that your dih spec in the solrconfig.xml seems
> to refer to "db-data-config.xml" but you said that your file was
> db-config.xml.   You may want to check this to make sure that your file
> names are correct.  b) what does your log say when you ran the import
> process?
>
> - Bill
>
> -Original Message- From: do3do3
> Sent: Sunday, October 17, 2010 8:29 AM
> To: solr-user@lucene.apache.org
> Subject: indexing mysql database
>
>
>
> I am trying to index a table in a MySQL database.
> 1st I created a db-config.xml file which contains
> <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" encoding="UTF-8"
>     url="jdbc:mysql://localhost:3306/(database name)"
>     user="(user)" password="(password)" batchSize="-1"/>
> followed by
> <document>
> and the definition of the table like
> <entity name="..." query="...">
> <field column="..." name="..."/>
> 2nd I added this field in the schema.xml file,
> and finally declared the db-config.xml file in solrconfig.xml as
> <requestHandler name="..."
>     class="org.apache.solr.handler.dataimport.DataImportHandler">
>   <lst name="defaults">
>     <str name="config">db-data-config.xml</str>
>   </lst>
> </requestHandler>
> I found an index folder which contains only the segment.gen & segment_1
> files, and when I try to search I get no results.
> Can anybody help?
> thanks in advance
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/indexing-mysql-database-tp1719883p1719883.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: SOLR DateTime and SortableLongField field type problems

2010-10-18 Thread Michael Sokolov
I think if you look closely you'll find the date quoted in the Exception
report doesn't match any of the declared formats in the schema.  I would
suggest, as a first step, hunting through your data to see where that date
is coming from.

-Mike

> -Original Message-
> From: Ken Stanley [mailto:doh...@gmail.com] 
> Sent: Monday, October 18, 2010 7:40 AM
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR DateTime and SortableLongField field type problems
> 
> Just following up to see if anybody might have some words of 
> wisdom on the issue?
> 
> Thank you,
> 
> Ken
> 
> It looked like something resembling white marble, which was 
> probably what it was: something resembling white marble.
> -- Douglas Adams, "The Hitchhikers Guide to 
> the Galaxy"
> 
> 
> On Fri, Oct 15, 2010 at 6:42 PM, Ken Stanley  wrote:
> 
> > Hello all,
> >
> > I am using SOLR-1.4.1 with the DataImportHandler, and I am trying to
> > follow the advice from
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg11887.html
> > about converting date fields to SortableLong fields for better memory
> > efficiency. However, whenever I try to do this using the DateFormatter,
> > I get exceptions when indexing for every row that tries to create my
> > sortable fields.
> >
> > In my schema.xml, I have the following definitions for the fieldType
> > and dynamicField:
> >
> > <fieldType name="..." class="solr.SortableLongField" indexed="true"
> >     stored="false" sortMissingLast="true" omitNorms="true" />
> > <dynamicField name="..." type="..." stored="false" indexed="true" />
> >
> > In my dih.xml, I have the following definitions:
> >
> > [dih.xml configuration as reproduced in the original message above]
> >
> > The fields in question are in the formats:
> >
> > [sample RECORDS/RECORD/PROP XML as reproduced in the original message]
> >
> > The exception that I am receiving is:
> >
> > Oct 15, 2010 6:23:24 PM
> > org.apache.solr.handler.dataimport.DateFormatTransformer transformRow
> > WARNING: Could not parse a Date field
> > java.text.ParseException: Unparseable date: "Wed Nov 28 21:39:05 EST 2007"
> > at java.text.DateFormat.parse(DateFormat.java:337)
> > at org.apache.solr.handler.dataimport.DateFormatTransformer.process(DateFormatTransformer.java:89)
> > at org.apache.solr.handler.dataimport.DateFormatTransformer.transformRow(DateFormatTransformer.java:69)
> > at org.apache.solr.handler.dataimport.EntityProcessorWrapper.applyTransformer(EntityProcessorWrapper.java:195)
> > at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:241)
> > at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:357)
> > at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:383)
> > at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
> > at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
> > at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
> > at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
> > at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
> >
> > I know that it has to be the SortableLong fields, because if I remove
> > just those two lines from my dih.xml, everything imports as I expect
> > it to. Am I doing something wrong? Mis-using th

Re: how can i use solrj binary format for indexing?

2010-10-18 Thread Jason, Kim

Hi, Gora
I haven't yet tried indexing a huge amount of XML files through curl or
pure Java (like post.jar).
Is indexing through XML really fast?
How many files did you index? And how did you do it (using curl or pure
Java)?

Thanks, Gora
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1724645.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr requirements

2010-10-18 Thread Erick Erickson
Well, always get the biggest, fastest machine you can ...

On a serious note, you're right, there's not much info to go
on here. And even if there were more info, Solr performance
depends on how you search your data as well as how much
data you have...

About the only way you can really tell is to set your system
up and use the admin>>statistics page to monitor your
system. In particular, monitor your cache evictions etc.

This page may also help:
http://wiki.apache.org/solr/SolrPerformanceFactors

Best
Erick

On Mon, Oct 18, 2010 at 5:59 AM, satya swaroop wrote:

> Hi All,
>    I am planning to have a separate server for Solr, and regarding
> hardware requirements I have a doubt about what configuration is needed.
> I know it will be hard to tell, but I just need a minimum requirement for
> the particular situation as follows:
>
>
> 1) There are 1000 regular users using Solr, and every day each user
> indexes 10 files of 1KB each; in total that adds 10MB per day, and it
> keeps growing...???
>
> 2) How much RAM is used by Solr in general???
>
> Thanks,
> satya
>


Re: "Virtual field", Statistics

2010-10-18 Thread Erick Erickson
The beauty/problem with open source is issues are picked up when
"somebody"  thinks they're important enough and has the time/energy
to work on it. And that person can be you ...

What usually happens is that someone submits a patch, various
people comment on it, look it over, ask for changes or provide
other feedback (e.g. "Have you considered XYZ", or "You
do realize that if we implement this patch, the universe
will end, don't you? "). Then, after a bunch of back-and
forths one of the committers decides that it's ready to be included
in the trunk and/or the branches.

The chances of the particular change you need being included in
trunk go up dramatically if you provide a patch. And
keep pushing (gently) on the issue.

One tip, though. Before investing a lot of time and energy in
creating a patch, figure out how you expect to change the code
and ask some questions (via commenting on the
JIRA issue) about what you're thinking about doing. You'll often
get some really valuable feedback before investing lots of time...

See: http://wiki.apache.org/solr/HowToContribute for the details
of getting the source, compiling, running unit tests, setting
up your IDE, etc.

Best
Erick


On Mon, Oct 18, 2010 at 6:59 AM, Tanguy Moal  wrote:

> Hello Lance, thank you for your reply.
>
> I created the following JIRA issue:
> https://issues.apache.org/jira/browse/SOLR-2171, as suggested.
>
> Can you tell me how new issues are handled by the development teams,
> and whether there's a way I could help/contribute ?
>
> --
> Tanguy
>
> 2010/10/16 Lance Norskog :
> > Please add a JIRA issue requesting this. A bunch of things are not
> > supported for functions: returning as a field value, for example.
> >
> > On Thu, Oct 14, 2010 at 8:31 AM, Tanguy Moal 
> wrote:
> >> Dear solr-user folks,
> >>
> >> I would like to use the stats module to perform very basic statistics
> >> (mean, min and max) which is actually working just fine.
> >>
> >> Nevertheless I found a little limitation that bothers me a tiny bit:
> >> how to perform the exact same statistics, but on the result of a
> >> function query rather than a field.
> >>
> >> Example :
> >> schema :
> >> - string : id
> >> - float : width
> >> - float : height
> >> - float : depth
> >> - string : color
> >> - float : price
> >>
> >> What I'd like to do is something like :
> >> select?price:[45.5 TO
> >>
> 99.99]&stats=on&stats.facet=color&stats.field={volume=product(product(width,
> >> height), depth)}
> >> I would expect to obtain :
> >>
> >> [expected stats output elided: min/max/mean for the computed volume,
> >> plus the same statistics for each color facet]
> >>
> >> Of course computing the volume can be performed before indexing the
> >> data, but defining virtual fields on the fly from an arbitrary
> >> function is powerful, and I am comfortable with the idea that many
> >> others would appreciate it. Especially for BI needs and so on... :-D
> >> Is there an easy way to do it that I have not been able to find, or
> >> is it actually impossible?
> >>
> >> Thank you very much in advance for your help.
> >>
> >> --
> >> Tanguy
> >>
> >
> >
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
> >
>


Re: Boosting documents based on the vote count

2010-10-18 Thread Ahmet Arslan
> I know but I can't figure out what functions to use. :)

Oh, I see. Why not just use {!boost b=log(vote)}?

Maybe scale(vote,0.5,10)?
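
Putting that together, a request could look roughly like this sketch (the
field name "vote" is taken from your description; the query part and the
0.5..10 bounds are illustrative; scale() maps the observed min..max of
vote onto that range, matching what you described):

q={!boost b=scale(vote,0.5,10)}title:something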



Re: how can i use solrj binary format for indexing?

2010-10-18 Thread Gora Mohanty
On Mon, Oct 18, 2010 at 5:26 PM, Jason, Kim  wrote:
>
> Hi, Gora
> I haven't yet tried indexing a huge amount of XML files through curl or
> pure Java (like post.jar).
> Is indexing through XML really fast?
> How many files did you index? And how did you do it (using curl or pure
> Java)?
[...]

We did it through curl. There were some 3.5 million XML files, and some
60 fields in the Solr schema, with minor tokenising, though with some
facets. A total of about 40GB of data. We used five Solr instances, and
five cores on each instance. From what I recall, it took 6h, though here
we might well have been limited by the read speed on a slow network
drive that held the data. If done in this way, one might need to merge the
data from the various cores, a task which took us about 1.5h.
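
(If it helps: merging can also be driven through the core admin handler's
mergeindexes action, or via the SolrJ helper; a hedged sketch, with the
core name and index paths invented for illustration:

CoreAdminRequest.mergeIndexes("core0",
        new String[] { "/data/core1/index", "/data/core2/index" },
        new CommonsHttpSolrServer("http://localhost:8983/solr"));

Treat this as one option, not necessarily what we did.)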

Regards,
Gora


Re: solr requirements

2010-10-18 Thread satya swaroop
Hi,
   Here is some more info about it. I use Solr to output only the file
names (file IDs). Here I enclose the fields in my schema.xml, and presently
I have only about 40MB of indexed data.

[the schema.xml field definitions did not survive the list archive]

Regards,
satya


RE: query between two date

2010-10-18 Thread Jonathan Rochkind
Recommend using the "pdate" format for faster range queries.

Here's how (or one way) to do a range query in Solr:

defType=lucene&q=some_field:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]

Does that answer your question? I don't really understand what you're
trying to do with your two dates. You can of course combine range queries
with operators in the standard/lucene query parser:

defType=lucene&q=some_field:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
AND other_field:[whatever TO whatever]

There are ways to make a query comparing the values of two fields too,
using function queries. But it's slightly confusing, and I'm not sure
that's what you want to do. Want to give an example of exactly what input
you have (from your application), and what question you are trying to
answer from your index?
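
For the record, the function-query approach I mean looks roughly like this
(field names invented for illustration; frange plus ms() assumes trie date
fields on Solr 1.4):

fq={!frange l=0}ms(checkout_date,checkin_date)

That filter keeps only documents whose checkout_date is at or after their
checkin_date.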

From: nedaha [neda...@gmail.com]
Sent: Monday, October 18, 2010 5:05 AM
To: solr-user@lucene.apache.org
Subject: Re: query between two date

Thanks for your reply.
I know about the Solr date format!! The check-in and check-out dates are in
a user-friendly format that we use in our search form for the system's
users, and I change the format via code before sending them to Solr.
I want to know how I can make a query that compares a range between the
check-in and check-out dates with the separate dates that I have in the
Solr index.
For example:
check-in date is: 2010-10-19T00:00:00Z
and
check-out date is: 2010-10-21T00:00:00Z

When I want to build a query from my application I have a date range, but
in the Solr index I have separate dates.
So how can I compare them to get the appropriate result?
--
View this message in context: 
http://lucene.472066.n3.nabble.com/query-between-two-date-tp1718566p1723752.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR DateTime and SortableLongField field type problems

2010-10-18 Thread Ken Stanley
On Mon, Oct 18, 2010 at 7:52 AM, Michael Sokolov wrote:

> I think if you look closely you'll find the date quoted in the Exception
> report doesn't match any of the declared formats in the schema.  I would
> suggest, as a first step, hunting through your data to see where that date
> is coming from.
>
> -Mike
>
>
[Note: Re-sending this because apparently in my sleepy stupor, I clicked the
wrong Reply button and never sent this to the list (it's a Monday) :)]

I've noticed that date anomaly as well, and I've discovered that it is one
of the gotchas of DIH: it seems to modify my date to that format. All of
the dates in the data are in the correct "yyyy-MM-dd'T'hh:mm:ss'Z'" format.
Once it is run through dateTimeFormat, I assume it is converted into a date
object; trying to use that date object in any other form (i.e., using a
template, or even another dateTimeFormat) results in the exception I've
described (displaying the date in the incorrect format).

Thanks,

Ken Stanley


Re: Implementing Search Suggestion on Solr

2010-10-18 Thread Dennis Gearon
What an interesting application :-)

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/18/10, Pablo Recio Quijano  wrote:

> From: Pablo Recio Quijano 
> Subject: Implementing Search Suggestion on Solr
> To: solr-user@lucene.apache.org
> Date: Monday, October 18, 2010, 3:53 AM
> Hi!
> 
> I'm trying to implement some kind of search suggestion on a
> search engine I have implemented. These search suggestions
> should not be automatic like the ones described for the
> SpellCheckComponent [1]. I'm looking for something like:
> 
> "SAS oppositions" => "Public job offers for
> some-company"
> 
> So I will have to define it manually. I was thinking about
> synonyms [2], but I don't know if it's the proper way to do
> it, because semantically those terms are not synonyms.
> 
> Any ideas or suggestions?
> 
> Regards,
> 
> [1] http://wiki.apache.org/solr/SpellCheckComponent
> [2] 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>


Re: API for using Multi cores with SolrJ

2010-10-18 Thread Tharindu Mathew
Thanks Peter. That helps a lot. It's weird that this is not documented anywhere. :(

On Mon, Oct 18, 2010 at 3:42 PM, Peter Karich  wrote:
> I asked this myself ... here could be some pointers:
>
> http://lucene.472066.n3.nabble.com/SolrJ-and-Multi-Core-Set-up-td1411235.html
> http://lucene.472066.n3.nabble.com/EmbeddedSolrServer-in-Single-Core-td475238.html
>
>> Hi everyone,
>>
>> I'm trying to write some code for creating and using multi cores.
>>
>> Is there a method available for this purpose or do I have to do a HTTP
>> to a URL such as
>> http://localhost:8983/solr/admin/cores?action=STATUS&core=core0
>>
>> Is there an API available for this purpose. For example, if I want to
>> create a new core named "core01" and then check for the status and
>> then insert a document to that index of core01, how do I do it?
>>
>> Any help or a document would help greatly.
>>
>> Thanks in advance.
>>
>> --
>> Regards,
>>
>> Tharindu
>>
>>
>
>
> --
> http://jetwick.com twitter search prototype
>
>



-- 
Regards,

Tharindu


Re: API for using Multi cores with SolrJ

2010-10-18 Thread Ryan McKinley
On Mon, Oct 18, 2010 at 10:12 AM, Tharindu Mathew  wrote:
> Thanks Peter. That helps a lot. It's weird that this is not documented
> anywhere. :(

Feel free to edit the wiki :)


Re: how can i use solrj binary format for indexing?

2010-10-18 Thread Ryan McKinley
Do you already have the files as Solr XML?  If so, I don't think you need SolrJ.

If you need to build SolrInputDocuments from your existing structure,
solrj is a good choice.  If you are indexing lots of stuff, check the
StreamingUpdateSolrServer:
http://lucene.apache.org/solr/api/solrj/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html


On Sun, Oct 17, 2010 at 11:01 PM, Jason, Kim  wrote:
>
> Hi all
> I have a huge amount of XML files for indexing.
> I want to index using the SolrJ binary format to get a performance gain,
> because I heard that indexing with XML files is quite slow.
> But I don't know how to index through the SolrJ binary format and can't
> find examples.
> Please give some help.
> Thanks,
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1722612.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: how can i use solrj binary format for indexing?

2010-10-18 Thread Jason, Kim

Thank you for the reply, Gora

But I still have several questions.
Did you use separate indexes?
If so, you indexed 0.7 million XML files per instance
and merged them. Is that right?
Please let me know how you ran multiple instances and cores in your case.

Regards,
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1725679.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Disable (or prohibit) per-field overrides

2010-10-18 Thread Jonathan Rochkind
You know about the 'invariants' that can be set in the request handler,
right?  Not sure if that will do for you or not, but sounds related.


Added recently to some wiki page somewhere, although the feature has been
there for a long time.  Let's see if I can find the wiki page... Ah yes:


http://wiki.apache.org/solr/SearchHandler#Configuration

Markus Jelsma wrote:

Hi,

Thanks for the suggestion and pointer. We've implemented it using a single 
regex in Nginx for now. 


Cheers,

  

: Anyone knows useful method to disable or prohibit the per-field override
: features for the search components? If not, where to start to make it
: configurable via solrconfig and attempt to come up with a working patch?

If your goal is to prevent *clients* from specifying these (while you're
still allowed to use them in your defaults) then the simplest solution is
probably something external to Solr -- along the lines of mod_rewrite.

Internally...

that would be tough.

You could probably write a SearchComponent (configured to run "first")
that does it fairly easily -- just wrap the SolrParams in an impl that
returns null anytime a component asks for a param name that starts with
"f." (and excludes those param names when asked for a list of the param
names)


It could probably be generalized to support arbitrary rules in a way
that might be handy for other folks, but it would still just be
wrapping all of the params, so it would prevent you from using them
in your config as well.

Ultimately i think a general solution would need to be in
RequestHandlerBase ... where it wraps the request params using the
defaults and invariants ... you'd want the custom exclusion rules to apply
only to the request params from the client.




-Hoss



Re: Disable (or prohibit) per-field overrides

2010-10-18 Thread Markus Jelsma
Thanks for your reply. But I gave the following answer to Erick's
suggestion, which is quite the same:

>  Yes, we're using it but the problem is that there can be many fields
>  and that means quite a large list of parameters to set for each request
>  handler, and there can be many request handlers.
>  
>  It's not very practical for us to maintain such big set of invariants.

It's much easier for us to maintain a very short white list than a huge black 
list.

Cheers

On Monday, October 18, 2010 04:59:09 pm Jonathan Rochkind wrote:
> You know about the 'invariant' that can be set in the request handler,
> right?  Not sure if that will do for you or not, but sounds related.
> 
> > Added recently to some wiki page somewhere, although the feature has
> > been there for a long time.  Let's see if I can find the wiki page...
> > Ah yes:
> 
> http://wiki.apache.org/solr/SearchHandler#Configuration
> 
> Markus Jelsma wrote:
> > Hi,
> > 
> > Thanks for the suggestion and pointer. We've implemented it using a
> > single regex in Nginx for now.
> > 
> > Cheers,
> > 
> >> : Anyone knows useful method to disable or prohibit the per-field
> >> : override features for the search components? If not, where to start
> >> : to make it configurable via solrconfig and attempt to come up with a
> >> : working patch?
> >> 
> >> If your goal is to prevent *clients* from specifying these (while you're
> >> still allowed to use them in your defaults) then the simplest solution
> >> is probably something external to Solr -- along the lines of
> >> mod_rewrite.
> >> 
> >> Internally...
> >> 
> >> that would be tough.
> >> 
> >> You could probably write a SearchComponent (configured to run "first")
> >> that does it fairly easily -- just wrap the SolrParams in an impl that
> >> returns null anytime a component asks for a param name that starts with
> >> "f." (and excludes those param names when asked for a list of the param
> >> names)
> >> 
> >> 
> >> It could probably be generalized to support arbitrary rules in a way
> >> that might be handy for other folks, but it would still just be
> >> wrapping all of the params, so it would prevent you from using them
> >> in your config as well.
> >> 
> >> Ultimately i think a general solution would need to be in
> >> RequestHandlerBase ... where it wraps the request params using the
> >> defaults and invariants ... you'd want the custom exclusion rules to
> >> apply only to the request params from the client.
> >> 
> >> 
> >> 
> >> 
> >> -Hoss

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


query pending commits?

2010-10-18 Thread Ryan McKinley
I have an indexing pipeline that occasionally needs to check if a
document is already in the index (even if not committed yet).

Any suggestions on how to do this without calling <commit/> before each
check?

I have a list of document ids and need to know which ones are in the
index (actually, I need to know which ones are not in the index). I
figured I would write a custom RequestHandler that would check the
main reader and the UpdateHandler reader, but it now looks like
'update' is handled directly within IndexWriter.

Any ideas?

thanks
ryan


Commits on service after shutdown

2010-10-18 Thread Ezequiel Calderara
 Hi, I'm new to the mailing list.
I'm implementing Solr at my current job, and I'm having some problems.
I was testing the consistency of the "commits". I found, for example, that
if we add X documents to the index (without committing) and then we restart
the service, the documents are committed: they show up in the results. This
looks like an error to me.
But when we add X documents to the index (without committing) and then we
kill the process and start it again, the documents don't appear. This
behaviour is the one I want.

Is there any param to avoid the auto-committing of documents after a
shutdown?
Is there any param to keep those un-committed documents "alive" after a kill?

Thanks!

-- 
__
Ezequiel.

Http://www.ironicnet.com 


Re: Commits on service after shutdown

2010-10-18 Thread Israel Ekpo
The documents should be implicitly committed when the Lucene index is
closed.

When you perform a graceful shutdown, the Lucene index gets closed and the
documents get committed implicitly.

When the shutdown is abrupt, as in a KILL -9, this does not happen and
the updates are lost.

You can use the autocommit parameter when sending your updates so that the
changes are saved right away, though this could slow down the indexing
speed considerably; but I do not believe there are parameters to keep those
un-committed documents "alive" after a kill.
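
For reference, server-side autocommit in solrconfig.xml looks something
like this (the thresholds here are purely illustrative, and it trades
durability guarantees against indexing speed):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>   <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>   <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>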



On Mon, Oct 18, 2010 at 2:46 PM, Ezequiel Calderara wrote:

>  Hi, I'm new to the mailing list.
> I'm implementing Solr at my current job, and I'm having some problems.
> I was testing the consistency of the "commits". I found, for example, that
> if we add X documents to the index (without committing) and then we restart
> the service, the documents are committed: they show up in the results. This
> looks like an error to me.
> But when we add X documents to the index (without committing) and then we
> kill the process and start it again, the documents don't appear. This
> behaviour is the one I want.
>
> Is there any param to avoid the auto-committing of documents after a
> shutdown?
> Is there any param to keep those un-committed documents "alive" after a
> kill?
>
> Thanks!
>
> --
> __
> Ezequiel.
>
> Http://www.ironicnet.com 
>



-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


RE: how can i use solrj binary format for indexing?

2010-10-18 Thread Sharp, Jonathan
>Hi all
>I have a huge amount of xml files for indexing.
>I want to index using solrj binary format to get performance gain.
>Because I heard that using xml files to index is quite slow.
>But I don't know how to use index through solrj binary format and can't find 
>examples.
>Please give some help.
>Thanks,

You might want to take a look at this section of the wiki too --
http://wiki.apache.org/solr/Solrj#Setting_the_RequestWriter
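
For what it's worth, here is a minimal SolrJ sketch of that setup (this
assumes Solr 1.4's CommonsHttpSolrServer, and the "body" field is just an
example; note the javabin update handler, solr.BinaryUpdateRequestHandler,
also has to be registered in solrconfig.xml as described on that page):

import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BinaryIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Send updates in the compact javabin format instead of XML.
        server.setRequestWriter(new BinaryRequestWriter());

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("body", "fly away");
        server.add(doc);
        server.commit();
    }
}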

-Jon

-Original Message-
From: Jason, Kim [mailto:hialo...@gmail.com] 
Sent: Monday, October 18, 2010 7:52 AM
To: solr-user@lucene.apache.org
Subject: Re: how can i use solrj binary format for indexing?


Thank you for the reply, Gora.

But I still have several questions.
Did you use separate indexes?
If so, did you index 0.7 million XML files per instance
and then merge them? Is that right?
Please let me know how you worked with multiple instances and cores in your case.

Regards,
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1725679.html
Sent from the Solr - User mailing list archive at Nabble.com.





ApacheCon Atlanta Meetup

2010-10-18 Thread Grant Ingersoll
Is there interest in having a Meetup at ApacheCon?  Who's going?  Would anyone 
like to present?  We could do something less formal, too, and just have drinks 
and Q&A/networking.  Thoughts?

-Grant



Spell checking question from a Solr novice

2010-10-18 Thread Xin Li
Hi, 

I am looking for a quick solution to improve a search engine's spell checking
performance. I was wondering if anyone has tried to integrate the Google
SpellCheck API with the Solr search engine (if possible). Google spellcheck
came to my mind for two reasons. First, it is costly to clean up the data to
be used as the spell check baseline. Second, Google probably has the most
complete set of misspelled search terms. That's why I would like to know if
it is a feasible way to go.

Thanks,
Xin


Re: Commits on service after shutdown

2010-10-18 Thread Ezequiel Calderara
I understand, but I want to have control over what is committed or not.
In our scenario, we want to add documents to the index, and maybe after an
hour trigger the commit.

If in the middle we have a server shutdown, or any process sends a
shutdown signal to the process, I don't want those documents being committed.

Should I file a bug report or an enhancement request?

Thanks


On Mon, Oct 18, 2010 at 3:54 PM, Israel Ekpo  wrote:

> The documents should be implicitly committed when the Lucene index is
> closed.
>
> When you perform a graceful shutdown, the Lucene index gets closed and the
> documents get committed implicitly.
>
> When the shutdown is abrupt, as in a KILL -9, this does not happen and
> the updates are lost.
>
> You can use the auto commit parameter when sending your updates so that the
> changes are saved right away, though this could slow down the indexing
> speed considerably, but I do not believe there are parameters to keep those
> un-committed documents "alive" after a kill.
>
>
>
> On Mon, Oct 18, 2010 at 2:46 PM, Ezequiel Calderara wrote:
>
> >  Hi, I'm new to the mailing list.
> > I'm implementing Solr in my actual job, and I'm having some problems.
> > I was testing the consistency of the "commits". I found, for example, that if
> > we add X documents to the index (without committing) and then we restart the
> > service, the documents are committed. They show up in the results. I
> > interpret this as an error.
> > But when we add X documents to the index (without committing) and then we
> > kill the process and start it again, the documents don't appear. This is the
> > behaviour I want.
> >
> > Is there any param to avoid the auto-committing of documents after a
> > shutdown?
> > Is there any param to keep those un-committed documents "alive" after a
> > kill?
> >
> > Thanks!
> >
> > --
> > __
> > Ezequiel.
> >
> > Http://www.ironicnet.com
> >
>
>
>
> --
> °O°
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
> http://www.israelekpo.com/
>



-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: Commits on service after shutdown

2010-10-18 Thread Matthew Hall
 No... you would just turn autocommit off, and have the thread that is
doing updates to your indexes commit every hour. I'd think that this
would take care of the scenario that you are describing.
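
Something like this rough sketch, for example (assuming SolrJ 1.4's
CommonsHttpSolrServer and autocommit disabled in solrconfig.xml; the URL and
the one-hour interval are just placeholders):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class HourlyCommitter {
    public static void main(String[] args) throws Exception {
        final SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        // With autocommit off, added documents stay pending until this fires.
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    server.commit();
                } catch (Exception e) {
                    // decide what a failed commit should do in your pipeline
                    e.printStackTrace();
                }
            }
        }, 1, 1, TimeUnit.HOURS);
    }
}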


Matt

On 10/18/2010 3:50 PM, Ezequiel Calderara wrote:

I understand, but I want to have control over what is committed or not.
In our scenario, we want to add documents to the index, and maybe after an
hour trigger the commit.

If in the middle we have a server shutdown, or any process sends a
shutdown signal to the process, I don't want those documents being committed.

Should I file a bug report or an enhancement request?

Thanks


On Mon, Oct 18, 2010 at 3:54 PM, Israel Ekpo wrote:

The documents should be implicitly committed when the Lucene index is
closed.

When you perform a graceful shutdown, the Lucene index gets closed and the
documents get committed implicitly.

When the shutdown is abrupt, as in a KILL -9, this does not happen and
the updates are lost.

You can use the auto commit parameter when sending your updates so that the
changes are saved right away, though this could slow down the indexing
speed considerably, but I do not believe there are parameters to keep those
un-committed documents "alive" after a kill.


On Mon, Oct 18, 2010 at 2:46 PM, Ezequiel Calderara wrote:

 Hi, I'm new to the mailing list.
I'm implementing Solr in my actual job, and I'm having some problems.
I was testing the consistency of the "commits". I found, for example, that if
we add X documents to the index (without committing) and then we restart the
service, the documents are committed. They show up in the results. I
interpret this as an error.
But when we add X documents to the index (without committing) and then we
kill the process and start it again, the documents don't appear. This is the
behaviour I want.

Is there any param to avoid the auto-committing of documents after a
shutdown?
Is there any param to keep those un-committed documents "alive" after a
kill?

Thanks!

--
__
Ezequiel.

Http://www.ironicnet.com


--
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/








Re: Commits on service after shutdown

2010-10-18 Thread Ezequiel Calderara
But if something happens in between that hour, I will have lost the documents,
or committed them to the index outside the schedule.

How can I handle this scenario?

I think that Solr (or Lucene) should ensure the durability of
the data even if it's in an uncommitted state.
On Mon, Oct 18, 2010 at 4:53 PM, Matthew Hall wrote:

>  No... you would just turn autocommit off, and have the thread that is doing
> updates to your indexes commit every hour. I'd think that this would take
> care of the scenario that you are describing.
>
> Matt
>
>
> On 10/18/2010 3:50 PM, Ezequiel Calderara wrote:
>
>> I understand, but I want to have control over what is committed or not.
>> In our scenario, we want to add documents to the index, and maybe after an
>> hour trigger the commit.
>>
>> If in the middle we have a server shutdown, or any process sends a
>> shutdown signal to the process, I don't want those documents being
>> committed.
>>
>> Should I file a bug report or an enhancement request?
>>
>> Thanks
>>
>>
>> On Mon, Oct 18, 2010 at 3:54 PM, Israel Ekpo wrote:
>>
>>> The documents should be implicitly committed when the Lucene index is
>>> closed.
>>>
>>> When you perform a graceful shutdown, the Lucene index gets closed and
>>> the documents get committed implicitly.
>>>
>>> When the shutdown is abrupt, as in a KILL -9, this does not happen and
>>> the updates are lost.
>>>
>>> You can use the auto commit parameter when sending your updates so that
>>> the changes are saved right away, though this could slow down the
>>> indexing speed considerably, but I do not believe there are parameters
>>> to keep those un-committed documents "alive" after a kill.
>>>
>>> On Mon, Oct 18, 2010 at 2:46 PM, Ezequiel Calderara wrote:
>>>
>>>>  Hi, I'm new to the mailing list.
>>>> I'm implementing Solr in my actual job, and I'm having some problems.
>>>> I was testing the consistency of the "commits". I found, for example,
>>>> that if we add X documents to the index (without committing) and then
>>>> we restart the service, the documents are committed. They show up in
>>>> the results. I interpret this as an error.
>>>> But when we add X documents to the index (without committing) and then
>>>> we kill the process and start it again, the documents don't appear.
>>>> This is the behaviour I want.
>>>>
>>>> Is there any param to avoid the auto-committing of documents after a
>>>> shutdown?
>>>> Is there any param to keep those un-committed documents "alive" after
>>>> a kill?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> __
>>>> Ezequiel.
>>>>
>>>> Http://www.ironicnet.com
>>>
>>>
>>> --
>>> °O°
>>> "Good Enough" is not good enough.
>>> To give anything less than your best is to sacrifice the gift.
>>> Quality First. Measure Twice. Cut Once.
>>> http://www.israelekpo.com/
>>>
>>
>>
>


-- 
__
Ezequiel.

Http://www.ironicnet.com


RE: Spell checking question from a Solr novice

2010-10-18 Thread Xin Li
Oops, never mind. Just read Google API policy. 1000 queries per day limit & for 
non-commercial use only. 



-Original Message-
From: Xin Li 
Sent: Monday, October 18, 2010 3:43 PM
To: solr-user@lucene.apache.org
Subject: Spell checking question from a Solr novice

Hi, 

I am looking for a quick solution to improve a search engine's spell checking
performance. I was wondering if anyone has tried to integrate the Google
SpellCheck API with the Solr search engine (if possible). Google spellcheck
came to my mind for two reasons. First, it is costly to clean up the data to
be used as the spell check baseline. Second, Google probably has the most
complete set of misspelled search terms. That's why I would like to know if
it is a feasible way to go.

Thanks,
Xin


Re: Commits on service after shutdown

2010-10-18 Thread Ezequiel Calderara
I'll see if I can resolve this by adding an extra core with the same schema for
holding these documents.
So, Core0 will act as a "queue" and Core1 will be the real index. A
commit on Core0 will trigger an add to Core1 and its commit.
That way I can be sure of not losing data.

It surprises me that Solr doesn't have this feature built in. I still have
to verify the performance, but it looks good to me.
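
A rough SolrJ sketch of the hand-off (assuming both cores share the schema,
every field is stored, and the queue fits in one page of results; names and
URLs are just examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class QueueHandOff {
    public static void main(String[] args) throws Exception {
        SolrServer queue = new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
        SolrServer index = new CommonsHttpSolrServer("http://localhost:8983/solr/core1");

        // Make the queued documents visible, then copy them to the real index.
        queue.commit();
        for (SolrDocument d : queue.query(new SolrQuery("*:*").setRows(10000)).getResults()) {
            SolrInputDocument in = new SolrInputDocument();
            for (String f : d.getFieldNames()) {
                in.addField(f, d.getFieldValue(f));
            }
            index.add(in);
        }
        index.commit();

        // Only empty the queue once the real index has committed.
        queue.deleteByQuery("*:*");
        queue.commit();
    }
}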

Anyway, any help would be appreciated.


On Mon, Oct 18, 2010 at 5:05 PM, Ezequiel Calderara wrote:

> But if something happens in between that hour, I will have lost the
> documents, or committed them to the index outside the schedule.
>
> How can I handle this scenario?
>
> I think that Solr (or Lucene) should ensure the durability of the
> data even if it's in an uncommitted state.
>
> On Mon, Oct 18, 2010 at 4:53 PM, Matthew Hall wrote:
>
>>  No... you would just turn autocommit off, and have the thread that is doing
>> updates to your indexes commit every hour. I'd think that this would take
>> care of the scenario that you are describing.
>>
>> Matt
>>
>> On 10/18/2010 3:50 PM, Ezequiel Calderara wrote:
>>
>>> I understand, but I want to have control over what is committed or not.
>>> In our scenario, we want to add documents to the index, and maybe after an
>>> hour trigger the commit.
>>>
>>> If in the middle we have a server shutdown, or any process sends a
>>> shutdown signal to the process, I don't want those documents being
>>> committed.
>>>
>>> Should I file a bug report or an enhancement request?
>>>
>>> Thanks
>>>
>>> On Mon, Oct 18, 2010 at 3:54 PM, Israel Ekpo wrote:
>>>
>>>> The documents should be implicitly committed when the Lucene index is
>>>> closed.
>>>>
>>>> When you perform a graceful shutdown, the Lucene index gets closed and
>>>> the documents get committed implicitly.
>>>>
>>>> When the shutdown is abrupt, as in a KILL -9, this does not happen and
>>>> the updates are lost.
>>>>
>>>> You can use the auto commit parameter when sending your updates so that
>>>> the changes are saved right away, though this could slow down the
>>>> indexing speed considerably, but I do not believe there are parameters
>>>> to keep those un-committed documents "alive" after a kill.
>>>>
>>>> On Mon, Oct 18, 2010 at 2:46 PM, Ezequiel Calderara wrote:
>>>>
>>>>>  Hi, I'm new to the mailing list.
>>>>> I'm implementing Solr in my actual job, and I'm having some problems.
>>>>> I was testing the consistency of the "commits". I found, for example,
>>>>> that if we add X documents to the index (without committing) and then
>>>>> we restart the service, the documents are committed. They show up in
>>>>> the results. I interpret this as an error.
>>>>> But when we add X documents to the index (without committing) and then
>>>>> we kill the process and start it again, the documents don't appear.
>>>>> This is the behaviour I want.
>>>>>
>>>>> Is there any param to avoid the auto-committing of documents after a
>>>>> shutdown?
>>>>> Is there any param to keep those un-committed documents "alive" after
>>>>> a kill?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> --
>>>>> __
>>>>> Ezequiel.
>>>>>
>>>>> Http://www.ironicnet.com
>>>>
>>>>
>>>> --
>>>> °O°
>>>> "Good Enough" is not good enough.
>>>> To give anything less than your best is to sacrifice the gift.
>>>> Quality First. Measure Twice. Cut Once.
>>>> http://www.israelekpo.com/
>>>
>>
>
>
> --
>  __
> Ezequiel.
>
> Http://www.ironicnet.com
>



-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: Spell checking question from a Solr novice

2010-10-18 Thread Pradeep Singh
I haven't yet, but I was going to use the spellchecker in the Lucene contrib
module. That spellchecker is n-gram based, and I have previously noticed that
I get better results from n-gram based spellcheck than from fuzzy string
match based ones.

On Mon, Oct 18, 2010 at 12:43 PM, Xin Li  wrote:

> Hi,
>
> I am looking for a quick solution to improve a search engine's spell
> checking performance. I was wondering if anyone has tried to integrate the
> Google SpellCheck API with the Solr search engine (if possible). Google
> spellcheck came to my mind for two reasons. First, it is costly to clean up
> the data to be used as the spell check baseline. Second, Google probably has
> the most complete set of misspelled search terms. That's why I would like to
> know if it is a feasible way to go.
>
> Thanks,
> Xin


Re: Spell checking question from a Solr novice

2010-10-18 Thread Jonathan Rochkind
In general, the benefit of the built-in Solr spellcheck is that it can 
use a dictionary based on your actual index.


If you want to use some external API, you certainly can, in your actual 
client app -- but it doesn't really need to involve Solr at all anymore, 
does it?  Is there any benefit I'm not thinking of to doing that on the 
solr side, instead of just in your client app?


I think Yahoo (and maybe Microsoft?) have similar APIs with more 
generous ToSs, but I haven't looked in a while.
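
For the built-in route, a rough SolrJ sketch (this assumes a
spellcheck-enabled request handler registered at "/spell" and SolrJ 1.4's
SpellCheckResponse; the handler name and query are just examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SpellDemo {
    public static void suggest(SolrServer server) throws Exception {
        SolrQuery q = new SolrQuery("flyawy");
        q.setQueryType("/spell");        // route to the spellcheck handler
        q.set("spellcheck", "true");
        q.set("spellcheck.collate", "true");
        QueryResponse rsp = server.query(q);
        SpellCheckResponse spell = rsp.getSpellCheckResponse();
        if (spell != null) {
            System.out.println("Did you mean: " + spell.getCollatedResult());
        }
    }
}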


Xin Li wrote:
Oops, never mind. Just read Google API policy. 1000 queries per day limit & for non-commercial use only. 




-Original Message-
From: Xin Li 
Sent: Monday, October 18, 2010 3:43 PM

To: solr-user@lucene.apache.org
Subject: Spell checking question from a Solr novice

Hi, 


I am looking for a quick solution to improve a search engine's spell checking
performance. I was wondering if anyone has tried to integrate the Google
SpellCheck API with the Solr search engine (if possible). Google spellcheck
came to my mind for two reasons. First, it is costly to clean up the data to
be used as the spell check baseline. Second, Google probably has the most
complete set of misspelled search terms. That's why I would like to know if
it is a feasible way to go.

Thanks,
Xin


Re: Spell checking question from a Solr novice

2010-10-18 Thread Pradeep Singh
I think a spellchecker based on your index has clear advantages. You can
spellcheck words specific to your domain which may not be available in an
outside dictionary. You can always dump the list from WordNet to get a
starter English dictionary.

But then it also means that misspelled words from your domain become the
suggested correct words. Hmmm ... you'll need to have a way to prune out such
words. Even then, your own domain-based dictionary is a total go.

On Mon, Oct 18, 2010 at 1:55 PM, Jonathan Rochkind  wrote:

> In general, the benefit of the built-in Solr spellcheck is that it can use
> a dictionary based on your actual index.
>
> If you want to use some external API, you certainly can, in your actual
> client app -- but it doesn't really need to involve Solr at all anymore,
> does it?  Is there any benefit I'm not thinking of to doing that on the solr
> side, instead of just in your client app?
>
> I think Yahoo (and maybe Microsoft?) have similar APIs with more generous
> ToSs, but I haven't looked in a while.
>
>
> Xin Li wrote:
>
>> Oops, never mind. Just read Google API policy. 1000 queries per day limit
>> & for non-commercial use only.
>>
>>
>> -Original Message-
>> From: Xin Li Sent: Monday, October 18, 2010 3:43 PM
>> To: solr-user@lucene.apache.org
>> Subject: Spell checking question from a Solr novice
>>
>> Hi,
>> I am looking for a quick solution to improve a search engine's spell
>> checking performance. I was wondering if anyone has tried to integrate the
>> Google SpellCheck API with the Solr search engine (if possible). Google
>> spellcheck came to my mind for two reasons. First, it is costly to clean up
>> the data to be used as the spell check baseline. Second, Google probably has
>> the most complete set of misspelled search terms. That's why I would like to
>> know if it is a feasible way to go.
>>
>> Thanks,
>> Xin
>>
>>
>


Re: Spell checking question from a Solr novice

2010-10-18 Thread Jason Blackerby
If you know the misspellings you could prevent them from being added to the
dictionary with a StopFilterFactory, like so:

<fieldtype name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="misspelled_words.txt"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldtype>
where misspelled_words.txt contains the misspellings.

On Mon, Oct 18, 2010 at 5:14 PM, Pradeep Singh  wrote:

> I think a spellchecker based on your index has clear advantages. You can
> spellcheck words specific to your domain which may not be available in an
> outside dictionary. You can always dump the list from wordnet to get a
> starter english dictionary.
>
> But then it also means that misspelled words from your domain become the
> suggested correct word. Hmmm ... you'll need to have a way to prune out
> such
> words. Even then, your own domain based dictionary is a total go.
>
> On Mon, Oct 18, 2010 at 1:55 PM, Jonathan Rochkind 
> wrote:
>
> > In general, the benefit of the built-in Solr spellcheck is that it can
> use
> > a dictionary based on your actual index.
> >
> > If you want to use some external API, you certainly can, in your actual
> > client app -- but it doesn't really need to involve Solr at all anymore,
> > does it?  Is there any benefit I'm not thinking of to doing that on the
> solr
> > side, instead of just in your client app?
> >
> > I think Yahoo (and maybe Microsoft?) have similar APIs with more generous
> > ToSs, but I haven't looked in a while.
> >
> >
> > Xin Li wrote:
> >
> >> Oops, never mind. Just read Google API policy. 1000 queries per day
> limit
> >> & for non-commercial use only.
> >>
> >>
> >> -Original Message-
> >> From: Xin Li Sent: Monday, October 18, 2010 3:43 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Spell checking question from a Solr novice
> >>
> >> Hi,
> >> I am looking for a quick solution to improve a search engine's spell
> >> checking performance. I was wondering if anyone has tried to integrate
> >> the Google SpellCheck API with the Solr search engine (if possible).
> >> Google spellcheck came to my mind for two reasons. First, it is costly
> >> to clean up the data to be used as the spell check baseline. Second,
> >> Google probably has the most complete set of misspelled search terms.
> >> That's why I would like to know if it is a feasible way to go.
> >>
> >> Thanks,
> >> Xin
> >>
> >>
> >
>


Schema required?

2010-10-18 Thread Frank Calfo
We need to index documents where the fields in the document can change 
frequently.

It appears that we would need to update our Solr schema definition before we 
can reindex using new fields.

Is there any way to make the Solr schema optional?



--frank



I need to indexing the first character of a field in another field

2010-10-18 Thread Renato Wesenauer
Hello guys,

I need to index the first character of the field "autor" in another field,
"inicialautor".
Example:
   autor = Mark Webber
   inicialautor = M

I did a JavaScript function in the dataimport, but the field inicialautor
is indexed empty.

The function:

function InicialAutor(linha) {
var aut = linha.get("autor");
if (aut != null) {
  if (aut.length > 0) {
  var ch = aut.charAt(0);
  linha.put("inicialautor", ch);
  }
  else {
  linha.put("inicialautor", '');
  }
}
else {
linha.put("inicialautor", '');
}
return linha;
}

What's wrong?

Thank's,

Renato Wesenauer


RE: Schema required?

2010-10-18 Thread Tim Gilbert
Hi Frank,

Check out the Dynamic Fields option from here
http://wiki.apache.org/solr/SchemaXml

Tim

-Original Message-
From: Frank Calfo [mailto:fca...@aravo.com] 
Sent: Monday, October 18, 2010 5:25 PM
To: solr-user@lucene.apache.org
Subject: Schema required?

We need to index documents where the fields in the document can change
frequently.

It appears that we would need to update our Solr schema definition
before we can reindex using new fields.

Is there any way to make the Solr schema optional?



--frank



Admin for spellchecker?

2010-10-18 Thread Pradeep Singh
Do we need an admin screen for the spellchecker, where you can browse the words
and delete the ones you don't like so that they don't get suggested?


Re: Spell checking question from a Solr novice

2010-10-18 Thread Ezequiel Calderara
You can cross-check the new words against a dictionary and keep them in the file
as Jason described...

What Pradeep said is true; it's always better to have "suggestions" related to
your index than to have suggestions with no results...


On Mon, Oct 18, 2010 at 6:24 PM, Jason Blackerby wrote:

> If you know the misspellings you could prevent them from being added to the
> dictionary with a StopFilterFactory like so:
>
> <fieldtype name="textSpell" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="misspelled_words.txt"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
>   </analyzer>
> </fieldtype>
>
> where misspelled_words.txt contains the misspellings.
>
> On Mon, Oct 18, 2010 at 5:14 PM, Pradeep Singh 
> wrote:
>
> > I think a spellchecker based on your index has clear advantages. You can
> > spellcheck words specific to your domain which may not be available in an
> > outside dictionary. You can always dump the list from wordnet to get a
> > starter english dictionary.
> >
> > But then it also means that misspelled words from your domain become the
> > suggested correct word. Hmmm ... you'll need to have a way to prune out
> > such
> > words. Even then, your own domain based dictionary is a total go.
> >
> > On Mon, Oct 18, 2010 at 1:55 PM, Jonathan Rochkind 
> > wrote:
> >
> > > In general, the benefit of the built-in Solr spellcheck is that it can
> > use
> > > a dictionary based on your actual index.
> > >
> > > If you want to use some external API, you certainly can, in your actual
> > > client app -- but it doesn't really need to involve Solr at all
> anymore,
> > > does it?  Is there any benefit I'm not thinking of to doing that on the
> > solr
> > > side, instead of just in your client app?
> > >
> > > I think Yahoo (and maybe Microsoft?) have similar APIs with more
> generous
> > > ToSs, but I haven't looked in a while.
> > >
> > >
> > > Xin Li wrote:
> > >
> > >> Oops, never mind. Just read Google API policy. 1000 queries per day
> > limit
> > >> & for non-commercial use only.
> > >>
> > >>
> > >> -Original Message-
> > >> From: Xin Li Sent: Monday, October 18, 2010 3:43 PM
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Spell checking question from a Solr novice
> > >>
> > >> Hi,
> > >> I am looking for a quick solution to improve a search engine's spell
> > >> checking performance. I was wondering if anyone has tried to integrate
> > >> the Google SpellCheck API with the Solr search engine (if possible).
> > >> Google spellcheck came to my mind for two reasons. First, it is costly
> > >> to clean up the data to be used as the spell check baseline. Second,
> > >> Google probably has the most complete set of misspelled search terms.
> > >> That's why I would like to know if it is a feasible way to go.
> > >>
> > >> Thanks,
> > >> Xin
> > >>
> > >>
> > >
> >
>



-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: I need to indexing the first character of a field in another field

2010-10-18 Thread Ezequiel Calderara
How are you declaring the transformer in the dataconfig?

On Mon, Oct 18, 2010 at 6:31 PM, Renato Wesenauer <
renato.wesena...@gmail.com> wrote:

> Hello guys,
>
> I need to index the first character of the field "autor" in another
> field,
> "inicialautor".
> Example:
>   autor = Mark Webber
>   inicialautor = M
>
> I did a JavaScript function in the dataimport, but the field inicialautor
> is indexed empty.
>
> The function:
>
>function InicialAutor(linha) {
>var aut = linha.get("autor");
>if (aut != null) {
>  if (aut.length > 0) {
>  var ch = aut.charAt(0);
>  linha.put("inicialautor", ch);
>  }
>  else {
>  linha.put("inicialautor", '');
>  }
>}
>else {
>linha.put("inicialautor", '');
>}
>return linha;
>}
>
> What's wrong?
>
> Thank's,
>
> Renato Wesenauer
>



-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: I need to indexing the first character of a field in another field

2010-10-18 Thread Pradeep Singh
You can use the regular-expression-based transformer without writing a
separate function. It's pretty easy to use.

On Mon, Oct 18, 2010 at 2:31 PM, Renato Wesenauer <
renato.wesena...@gmail.com> wrote:

> Hello guys,
>
> I need to index the first character of the field "autor" in another
> field,
> "inicialautor".
> Example:
>   autor = Mark Webber
>   inicialautor = M
>
> I did a JavaScript function in the dataimport, but the field inicialautor
> is indexed empty.
>
> The function:
>
>function InicialAutor(linha) {
>var aut = linha.get("autor");
>if (aut != null) {
>  if (aut.length > 0) {
>  var ch = aut.charAt(0);
>  linha.put("inicialautor", ch);
>  }
>  else {
>  linha.put("inicialautor", '');
>  }
>}
>else {
>linha.put("inicialautor", '');
>}
>return linha;
>}
>
> What's wrong?
>
> Thank's,
>
> Renato Wesenauer
>


Re: Admin for spellchecker?

2010-10-18 Thread Ezequiel Calderara
I was thinking about that; you would also need a way to mark a word as valid,
so it doesn't get marked as wrong.


On Mon, Oct 18, 2010 at 6:37 PM, Pradeep Singh  wrote:

> Do we need an admin screen for spellchecker? Where you can browse the words
> and delete the ones you don't like so that they don't get suggested?
>



-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: Schema required?

2010-10-18 Thread Jonathan Rochkind

Frank Calfo wrote:

We need to index documents where the fields in the document can change 
frequently.

It appears that we would need to update our Solr schema definition before we 
can reindex using new fields.

Is there any way to make the Solr schema optional?
  
No. But you can design your schema more flexibly than you are designing 
it.  Design it in a more abstract way, so it doesn't in fact need to 
change when external factors change.


I mean, every time you change your schema you are going to have to 
change any client applications that use your solr index to look things 
up using new fields and such too, right? You don't want to go changing 
your schema all the time. You want to design your schema so it doesn't 
need to change.


Solr is not an rdbms. You do not need to 'normalize' your data, or 
design your schema in the same way you would an rdbms. Design your 
schema to feed your actual and potential client apps.



Jonathan


  


Re: I need to indexing the first character of a field in another field

2010-10-18 Thread Jonathan Rochkind
You can just do this with a copyField in your schema.xml instead. Copy
to a field which uses a regex filter or some other analyzer to limit it to
the first non-whitespace char (and perhaps force upcase too if you want).
That's what I'd do; it's easier, and it will work if you index to Solr from
something other than dataimport as well.


Renato Wesenauer wrote:

Hello guys,

I need to index the first character of the field "autor" in another field,
"inicialautor".
Example:
   autor = Mark Webber
   inicialautor = M

I did a JavaScript function in the dataimport, but the field inicialautor
is indexed empty.

The function:

function InicialAutor(linha) {
var aut = linha.get("autor");
if (aut != null) {
  if (aut.length > 0) {
  var ch = aut.charAt(0);
  linha.put("inicialautor", ch);
  }
  else {
  linha.put("inicialautor", '');
  }
}
else {
linha.put("inicialautor", '');
}
return linha;
}

What's wrong?

Thank's,

Renato Wesenauer

  


Re: I need to indexing the first character of a field in another field

2010-10-18 Thread Chris Hostetter

This exact topic was just discussed a few days ago...

http://search.lucidimagination.com/search/document/7b6e2cc37bbb95c8/faceting_and_first_letter_of_fields#3059a28929451cb4

My comments on when/where it makes sense to put this logic...

http://search.lucidimagination.com/search/document/7b6e2cc37bbb95c8/faceting_and_first_letter_of_fields#7b6e2cc37bbb95c8


: Date: Mon, 18 Oct 2010 19:31:28 -0200
: From: Renato Wesenauer 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: I need to indexing the first character of a field in another field
: 
: Hello guys,
: 
: I need to index the first character of the field "autor" in another field,
: "inicialautor".
: Example:
:    autor = Mark Webber
:    inicialautor = M
: 
: I did a JavaScript function in the dataimport, but the field inicialautor
: is indexed empty.
: 
: The function:
: 
: function InicialAutor(linha) {
: var aut = linha.get("autor");
: if (aut != null) {
:   if (aut.length > 0) {
:   var ch = aut.charAt(0);
:   linha.put("inicialautor", ch);
:   }
:   else {
:   linha.put("inicialautor", '');
:   }
: }
: else {
: linha.put("inicialautor", '');
: }
: return linha;
: }
: 
: What's wrong?
: 
: Thank's,
: 
: Renato Wesenauer
: 

-Hoss


Removing Common Web Page Header and Footer from All Content Fetched by Nutch

2010-10-18 Thread Israel Ekpo
Hi All,

I am indexing a web application with approximately 9500 distinct URLs and
their contents using Nutch and Solr.

I use Nutch to fetch the URLs and links and to crawl the entire web
application to extract the content of all pages.

Then I run the solrindex command to send the content to Solr.

The problem that I have now is that the first 1000 or so characters of some
pages and the last 400 characters of the pages are showing up in the search
results.

These are the contents of the common header and footer used in the site,
respectively.

The only workaround that I have now is to index everything and then go
through each document one at a time to remove the first 1000 characters if
the Levenshtein distance between the first 1000 characters of the page and
the common header is less than a certain value. The same applies to the
footer content common to all pages.

Is there a way to ignore certain "stop phrases," so to speak, in the Nutch
configuration based on Levenshtein distance or Jaro-Winkler distance, so that
the parts of the fetched data that match these stop phrases will not be
parsed?

Any useful pointers would be highly appreciated.

Thanks in advance.


-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: How can i get collect stemmed query?

2010-10-18 Thread Jerad

Thanks for your reply :)

1. I tested "q=*:*&fl=body", and 1 doc was returned as a result, as I expected.

2. I edited my schema.xml as you instructed.

 
//No filter description.
 

but no results were returned.

3. I wonder about this...

Typically, the tokenizer and filter flow is:

1) The input stream provides a text stream to the tokenizer or filter.
2) The tokenizer or filter gets a token, and the processed token and offset
attribute info are returned.
3) The offset attributes hold the token's position information.

This is part of a typical filter source, as I understand it.
   

public class CustomStemFilter extends TokenFilter {

        private MyCustomStemmer stemmer;
        private TermAttribute termAttr;
        private OffsetAttribute offsetAttr;
        private TypeAttribute typeAttr;
        private int offSet = 0;
        private Hashtable<String, String> reserved = new Hashtable<String, String>();

        public CustomStemFilter( TokenStream tokenStream, boolean isQuery,
                        MyCustomStemmer stemmer ){
                super( tokenStream );

                this.stemmer = stemmer;
                termAttr   = (TermAttribute) addAttribute(TermAttribute.class);
                offsetAttr = (OffsetAttribute) addAttribute(OffsetAttribute.class);
                typeAttr   = (TypeAttribute) addAttribute(TypeAttribute.class);
                addAttribute(PositionIncrementAttribute.class);

                //Some of my custom logic here.
                //do something.
        }

        public boolean incrementToken() throws IOException {
                clearAttributes();

                if (!input.incrementToken())
                        return false;

                StringBuffer queryBuffer = new StringBuffer();

                //stemming logic here.
                //the generated query string is appended to queryBuffer.

                termAttr.setTermBuffer(queryBuffer.toString(), 0,
                                queryBuffer.length());
                offsetAttr.setOffset(0, queryBuffer.length());
                offSet += queryBuffer.length();
                typeAttr.setType("word");

                return true;
        }
}
   


※ MyCustomStemmer analyzes the input string "flyaway" into the query string:
fly +body:away
   and returns it.

At index time, content to be searched is normally analyzed and
indexed as below.

a) Content to be indexed: fly away
b) The token "fly" and the length of "fly" = 3 (set up by the offset
attribute method)
   are returned by the filter or analyzer.
c) The next token "away" and the length of "away" = 4 are returned.

I think this is the general index flow.

But I customized MyCustomFilter so that the filter generates a query string,
not a token.
In the process, the offset value is changed to the query's length, not a
single token's length.

I wonder whether the value set by the offsetAttr.setOffset() method
has an influence on search results when using Solr?
(I tested this on the main page's query input box at
http://localhost:8983/solr/admin/ )


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-i-get-collect-search-result-from-custom-filtered-query-tp1723055p1729717.html
Sent from the Solr - User mailing list archive at Nabble.com.


Setting solr home directory in websphere

2010-10-18 Thread Kevin Cunningham
I've installed Solr a hundred times using Tomcat (on Windows) but now need to 
get it going with WebSphere (on Windows).  For whatever reason this seems to be 
black magic :)  I've installed the war file but have no idea how to set Solr 
home to let WebSphere know where the index and config files are.  Can someone 
enlighten me on how to do this please?


Re: Setting solr home directory in websphere

2010-10-18 Thread Israel Ekpo
You need to make sure that the following system property is one of the
values specified in the JAVA_OPTS environment variable:

-Dsolr.solr.home=path_to_solr_home



On Mon, Oct 18, 2010 at 10:20 PM, Kevin Cunningham <
kcunning...@telligent.com> wrote:

> I've installed Solr a hundred times using Tomcat (on Windows) but now need
> to get it going with WebSphere (on Windows).  For whatever reason this seems
> to be black magic :)  I've installed the war file but have no idea how to
> set Solr home to let WebSphere know where the index and config files are.
>  Can someone enlighten me on how to do this please?
>



-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


snapshot-4.0 and maven

2010-10-18 Thread Matt Mitchell
I'd like to get solr snapshot-4.0 pushed into my local maven repo. Is
this possible to do? If so, could someone give me a tip or two on
getting started?

Thanks,
Matt


Re: snapshot-4.0 and maven

2010-10-18 Thread Tommy Chheng
 Once you've built the Solr 4.0 jar, you can use mvn's install command
like this:

mvn install:install-file -DgroupId=org.apache -DartifactId=solr \
  -Dpackaging=jar -Dversion=4.0-SNAPSHOT -Dfile=solr-4.0-SNAPSHOT.jar \
  -DgeneratePom=true


@tommychheng


On 10/18/10 7:28 PM, Matt Mitchell wrote:

I'd like to get solr snapshot-4.0 pushed into my local maven repo. Is
this possible to do? If so, could someone give me a tip or two on
getting started?

Thanks,
Matt


Re: Spell checking question from a Solr novice

2010-10-18 Thread Dennis Gearon
The first question to ask is: will it work for you?

The SECOND question is: do you want Google to know what's in your data?

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/18/10, Xin Li  wrote:

> From: Xin Li 
> Subject: Spell checking question from a Solr novice
> To: solr-user@lucene.apache.org
> Date: Monday, October 18, 2010, 12:43 PM
> Hi, 
> 
> I am looking for a quick solution to improve a search
> engine's spell checking performance. I was wondering if
> anyone has tried to integrate the Google SpellCheck API with
> the Solr search engine (if possible). Google spellcheck came to my
> mind for two reasons. First, it is costly to clean up
> the data to be used as the spell check baseline. Second,
> Google probably has the most complete set of misspelled
> search terms. That's why I would like to know if it is a
> feasible way to go.
> 
> Thanks,
> Xin
>


Re: ApacheCon Atlanta Meetup

2010-10-18 Thread Dennis Gearon
I would love to go, but funds are low right now. NEXT year, I'd have something 
to demo though :-)


Dennis Gearon



--- On Mon, 10/18/10, Grant Ingersoll  wrote:

> From: Grant Ingersoll 
> Subject: ApacheCon Atlanta Meetup
> To: solr-user@lucene.apache.org
> Date: Monday, October 18, 2010, 11:58 AM
> Is there interest in having a Meetup
> at ApacheCon?  Who's going?  Would anyone like to
> present?  We could do something less formal, too, and
> just have drinks and Q&A/networking.  Thoughts?
> 
> -Grant
> 
>


'Advertising' a site

2010-10-18 Thread Dennis Gearon
When I get my site which uses Solr/Lucene going, is it considered polite to
post a small paragraph about it with a link?


Dennis Gearon



Re: 'Advertising' a site

2010-10-18 Thread Otis Gospodnetic
Hi Dennis,

There is a PoweredBy page on the Wiki that's good for that.


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Dennis Gearon 
> To: solr-user@lucene.apache.org
> Sent: Mon, October 18, 2010 11:35:09 PM
> Subject: 'Advertising' a site
> 
> When I get my site which uses Solr/Lucene going, is it considered polite to
> post a small paragraph about it with a link?
> 
> 
> Dennis  Gearon
> 
>


Re: Schema required?

2010-10-18 Thread Otis Gospodnetic
Solr requires a schema.
But Lucene does not! :)

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Frank Calfo 
> To: "solr-user@lucene.apache.org" 
> Sent: Mon, October 18, 2010 5:25:27 PM
> Subject: Schema required?
> 
> We need to index documents where the fields in the document can change
> frequently.
>
> It appears that we would need to update our Solr schema definition before we
> can reindex using new fields.
>
> Is there any way to make the Solr schema optional?
> 
> 
> 
> --frank
> 
> 


Re: 'Advertising' a site

2010-10-18 Thread Dennis Gearon
Cool, thanks!

Dennis Gearon



--- On Mon, 10/18/10, Otis Gospodnetic  wrote:

> From: Otis Gospodnetic 
> Subject: Re: 'Advertising' a site
> To: solr-user@lucene.apache.org
> Date: Monday, October 18, 2010, 9:28 PM
> Hi Dennis,
> 
> There is a PoweredBy page on the Wiki that's good for
> that.
> 
> 
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> - Original Message 
> > From: Dennis Gearon 
> > To: solr-user@lucene.apache.org
> > Sent: Mon, October 18, 2010 11:35:09 PM
> > Subject: 'Advertising' a site
> > 
> > When I get my site which uses Solr/Lucene going, is it
> > considered polite to
> > post a small paragraph about it with a link?
> > 
> > 
> > Dennis  Gearon
> > 
> >
>


Re: Removing Common Web Page Header and Footer from All Content Fetched by Nutch

2010-10-18 Thread Otis Gospodnetic
Hi Israel,

You can use this: http://search-lucene.com/?q=boilerpipe&fc_project=Tika
Not sure if it's built into Nutch, though...
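
For example, a rough sketch with the boilerpipe library (assuming its
ArticleExtractor API; you would run something like this over the fetched HTML
before it is indexed):

import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class StripBoilerplate {
    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<div id='header'>Common site header ...</div>"
                + "<p>The actual page content.</p>"
                + "<div id='footer'>Common site footer ...</div>"
                + "</body></html>";
        // ArticleExtractor tries to keep the main text and drop
        // headers, footers and navigation blocks.
        String mainText = ArticleExtractor.INSTANCE.getText(html);
        System.out.println(mainText);
    }
}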

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Israel Ekpo 
> To: solr-user@lucene.apache.org; u...@nutch.apache.org
> Sent: Mon, October 18, 2010 9:01:50 PM
> Subject: Removing Common Web Page Header and Footer from All Content
> Fetched by Nutch
> 
> Hi All,
> 
> I am indexing a web application with approximately 9500 distinct URLs and
> contents using Nutch and Solr.
> 
> I use Nutch to fetch the urls and links and to crawl the entire web
> application to extract all the content for all pages.
> 
> Then I run the solrindex command to send the content to Solr.
> 
> The problem that I have now is that the first 1000 or so characters of some
> pages and the last 400 characters of the pages are showing up in the search
> results.
> 
> These are the contents of the common header and footer used in the site,
> respectively.
> 
> The only workaround that I have now is to index everything and then go
> through each document one at a time to remove the first 1000 characters if
> the levenshtein distance between the first 1000 characters of the page and
> the common header is less than a certain value. The same applies to the
> footer content common to all pages.
> 
> Is there a way to ignore certain "stop phrases", so to speak, in the Nutch
> configuration based on levenshtein distance or jaro winkler distance, so
> that certain parts of the fetched data that match these stop phrases will
> not be parsed?
> 
> Any useful pointers would be highly appreciated.
> 
> Thanks in advance.
> 
> 
> -- 
> °O°
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
> http://www.israelekpo.com/
>
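
If post-processing the fetched pages outside Nutch is an option, the boilerpipe 
library Otis points to can strip the repeated header/footer blocks before the 
content ever reaches Solr, with no levenshtein bookkeeping. A minimal sketch, 
assuming the html String holds one fetched page (this says nothing about 
Nutch's own pipeline):

--
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerplateStripper {

    // ArticleExtractor classifies text blocks and keeps only those that
    // look like main content, dropping navigation, header and footer text.
    public static String mainContent(String html)
            throws BoilerpipeProcessingException {
        return ArticleExtractor.INSTANCE.getText(html);
    }
}
--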


count(*) equivalent in Solr/Lucene

2010-10-18 Thread Dennis Gearon
Is there something in Solr/Lucene that could give me the equivalent to:

SELECT 
  COUNT(*) 
WHERE
  date_column1 > :start_date AND
  date_column2 > :end_date;

Provided I take into account deleted documents, of course (i.e., do some sort 
of averaging or some tracking function over time).


Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


Re: 'Advertising' a site

2010-10-18 Thread Chris Hostetter

: There is a PoweredBy page on the Wiki that's good for that.

Even better is a post to the list telling folks about your use case, 
index size, hardware, etc.

A lot of new users find that information really helpful for comparison.


-Hoss


Re: count(*) equivalent in Solr/Lucene

2010-10-18 Thread Chris Hostetter
: 
: SELECT 
:   COUNT(*) 
: WHERE
:   date_column1 > :start_date AND
:   date_column2 > :end_date;

   q=*:*&fq=column1:[start TO *]&fq=column2:[end TO *]&rows=0

...every result includes a total count.

-Hoss
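
In SolrJ (1.4-era API) the same count-only request looks roughly like the 
sketch below; the server URL, field names and dates are placeholders, and the 
two addFilterQuery calls mirror the fq pair above:

--
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CountMatches {
    public static void main(String[] args) throws Exception {
        SolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("date_column1:[2010-01-01T00:00:00Z TO *]");
        q.addFilterQuery("date_column2:[2010-06-01T00:00:00Z TO *]");
        q.setRows(0); // fetch no documents, only the match count

        QueryResponse rsp = server.query(q);
        System.out.println("count = " + rsp.getResults().getNumFound());
    }
}
--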


Re: 'Advertising' a site

2010-10-18 Thread Dennis Gearon
OK, no problem. It's about 4-8 months out. I'm just excited by the idea of 
finally going public.

I'm not a professional DB admin, web designer, Search Engine Analyst, Chief 
Technical Officer, or Backend programmer by education, only self-study and 
about 1/2 of a Bachelors and 1/2 of a Masters in CS. But I've studied and taken 
on about 1/2 of those roles.

It's all for something I WANT out there, and no one seems to have built it. So 
I will . . . and my team :-)

I'd like to get feedback when it's out there and learn from what people point 
out in their reaction to our implementation. I've already learned a lot here, 
and so has the main SE guy in our group.


I/we owe a LOOOT to:
  PHP community
  Symfony community
  Doctrine community
  Dezign for Databases
  Apache Community
  Eclipse community
  Postgres Community
  Ubuntu and its community
  A couple of different library writers
... let's see, who's left ... Oh yeah!
  You guys here at Solr/Lucene

Thanks all of you guys :-)
  
Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/18/10, Chris Hostetter  wrote:

> From: Chris Hostetter 
> Subject: Re: 'Advertising' a site
> To: solr-user@lucene.apache.org
> Date: Monday, October 18, 2010, 10:23 PM
> 
> : There is a PoweredBy page on the Wiki that's good for that.
> 
> Even better is a post to the list telling folks about your use case,
> index size, hardware, etc.
> 
> A lot of new users find that information really helpful for comparison.
> 
> 
> -Hoss
>


Re: count(*) equivalent in Solr/Lucene

2010-10-18 Thread Dennis Gearon
I/my team will have to look at that and decode it, LOL! I get some of it.

The database version returns 1 row, with the answer.

What does this return and how fast is it on BIG indexes?

PS, that should have been:
  ...
  date_column2 < :end_date;

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Mon, 10/18/10, Chris Hostetter  wrote:

> From: Chris Hostetter 
> Subject: Re: count(*) equivalent in Solr/Lucene
> To: solr-user@lucene.apache.org
> Date: Monday, October 18, 2010, 10:26 PM
> : 
> : SELECT 
> :   COUNT(*) 
> : WHERE
> :   date_column1 > :start_date AND
> :   date_column2 > :end_date;
> 
>    q=*:*&fq=column1:[start TO *]&fq=column2:[end TO *]&rows=0
> 
> ...every result includes a total count.
> 
> -Hoss
>
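
To the "what does this return" part: it is an ordinary Solr response, just 
with an empty document list; the count is the numFound attribute. With rows=0 
the XML comes back along these lines (numbers hypothetical):

--
<response>
  <lst name="responseHeader">...</lst>
  <result name="response" numFound="1" start="0"/>
</response>
--

And because the two range clauses are sent as fq parameters, Solr caches each 
one in its filter cache, so repeated counts over the same ranges stay cheap 
even on big indexes.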