criteria for using the property stored="true" and indexed="true"

2007-12-12 Thread Dilip.TS

Hi,

I would like some clarification on which fields we should assign the properties
stored="true" and indexed="true".
What are the criteria for these property assignments?
What would be the impact if no field is assigned these properties?

Thanks in Advance,

Regards,
Dilip TS
Starmark Services Pvt. Ltd.



Solr and word frequencies?

2007-12-12 Thread Recono

Hi,

I am working on the following task.
I have a big Solr index "B" (around 2 million forum-post entries) and 50
sub-indices "S1-50" (sub-forum entries) which are also included in "B".
Now I want Solr to compare the word frequency of each word in "S1-50" to
the word frequency of the whole big index "B", to find the words of
special interest in "S1-50" compared to "B".

My questions are:
I guess Solr uses word frequency itself... is it possible to just access
this Solr functionality for my task (and if yes, how?), or do I have to write
something from scratch?
Do I need to put "S1-50" into standalone Solr instances as well, or is it
enough to set a field in "B" called 'S' with values 1-50?

Thanks in advance!
-- 
View this message in context: 
http://www.nabble.com/Solr-and-word-frequencies--tp14292112p14292112.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR X FAST

2007-12-12 Thread Svein Parnas


On Dec 12, 2007, at 2:50 AM, Nuno Leitao wrote:



FAST uses two pipelines - an ingestion pipeline (for document
feeding) and a query pipeline - which are fully programmable (i.e.,
you can customize them fully). At ingestion time you typically prepare
documents for indexing (tokenize, character-normalize, lemmatize,
clean up text, perform entity extraction for facets, perform static
boosting for certain documents, etc.), while at query time you can
expand synonyms and do other general query-side tasks (not unlike
Solr).


Horizontal scalability means the ability to cluster your search  
engine across a large number of servers, so you can scale up on the  
number of documents, queries, crawls, etc.


There are FAST deployments out there which run on dozens, in some
cases hundreds, of nodes, serving multiple-terabyte indexes and
achieving hundreds of queries per second.


Yet again, if your requirements are relatively simple then Lucene  
might do the job just fine.


Hope this helps.


With Fast, you will also get things like:
- categorization
- clustering
- more flexible collapsing / grouping
- more scalable facets (navigators) - at least for multivalued fields
- gigabytes of poorly documented software
- operations from hell
- a huge number of bugs
- high bills, both for software and hardware.

As for linguistic features (named entity extraction, dictionary-based
lemmatization and so on) and things like categorization / clustering,
these should not be expected to work too well unless you put a
huge amount of work into them, and some of the features are really
primitive.


To sum up, if Solr meets your needs I would highly recommend Solr. If  
you need some additional features and have the knowledge, integrate  
other products with Solr. If you really need the scalability, go for  
Fast or some other commercial software.


As for document preprocessing and connectors for Solr, if you need it,  
you could have a look at OpenPipe, http://openpipe.berlios.de/ (not  
yet announced).


Svein



Leading WildCard in Query

2007-12-12 Thread Eswar K
Hi All,

I understand that a leading wildcard search is not allowed, as it is a very
costly operation. There is an issue logged for it
(http://issues.apache.org/jira/browse/SOLR-218). Is there any other way of
enabling leading wildcards, apart from doing it in code by calling
QueryParser.setAllowLeadingWildcard(true)?

Regards,
Eswar
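
For reference, the code-level switch mentioned above looks like this against
the raw Lucene 2.x QueryParser (field name and analyzer are placeholders;
wiring this into Solr is exactly what SOLR-218 asks to expose via
configuration):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class LeadingWildcardDemo {
        public static void main(String[] args) throws Exception {
            QueryParser parser = new QueryParser("text", new StandardAnalyzer());
            // off by default, because a leading wildcard must scan every term
            parser.setAllowLeadingWildcard(true);
            Query q = parser.parse("*ware");  // legal now, instead of a ParseException
            System.out.println(q);
        }
    }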


Re: criteria for using the property stored="true" and indexed="true"

2007-12-12 Thread Walter Ferrara
See:
http://wiki.apache.org/solr/SchemaXml#head-af67aefdc51d18cd8556de164606030446f56554

indexed means searchable (faceting and sorting also need this); stored
is needed only when you need the original text (i.e. not
tokenized/analyzed) to be returned.
When stored and indexed are not present, I think Solr defaults them
both to true.
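
For example, a minimal schema.xml sketch (field and type names here are
illustrative, not from the original posts): a field you both search and
display gets both flags; a field you only filter, facet, or sort on can skip
stored; a field you only display can skip indexed.

    <!-- searched and shown in results -->
    <field name="title" type="text" indexed="true" stored="true"/>

    <!-- only filtered/faceted/sorted on, never displayed -->
    <field name="category" type="string" indexed="true" stored="false"/>

    <!-- only returned with results, never searched -->
    <field name="thumbnailUrl" type="string" indexed="false" stored="true"/>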

Dilip.TS wrote:
> Hi,
>
> I would like some clarification on which fields we should assign the properties
> stored="true" and indexed="true".
> What are the criteria for these property assignments?
> What would be the impact if no field is assigned these properties?
>
> Thanks in Advance,
>
> Regards,
> Dilip TS
> Starmark Services Pvt. Ltd.
>
>
>   


Re: Leading WildCard in Query

2007-12-12 Thread Michael Kimsal
Please vote for SOLR-218.  I'm not aware of any other way to accomplish the
leading-wildcard functionality that would be convenient.  SOLR-218 is not
asking that it be enabled by default, only that the functionality be
exposed to Solr admins via solrconfig.xml.


On Dec 12, 2007 6:29 AM, Eswar K <[EMAIL PROTECTED]> wrote:

> Hi All,
>
> I understand that a leading wildcard search is not allowed, as it is a
> very costly operation. There is an issue logged for it
> (http://issues.apache.org/jira/browse/SOLR-218). Is there any other way of
> enabling leading wildcards, apart from doing it in code by calling
> QueryParser.setAllowLeadingWildcard(true)?
>
> Regards,
> Eswar
>



-- 
Michael Kimsal
http://webdevradio.com


Re: display tokens

2007-12-12 Thread Ryan McKinley

Chris Hostetter wrote:

: Subject: display tokens
: 
: How can I retrieve the "analyzed tokens" (e.g. the stemmed values) of a

: specific field?

for a field by name independent of documents?  the LukeRequestHandler can 
give you the top N terms for a field ... but if you mean "i did a search, 
i found a document, show me the analyzed tokens for that document in this 
field" there is no easy way to get that information.


if you have a stored value for that field you can feed it into 
analysis.jsp to see what the analyzed tokens are.




also check out faceting.  This returns the analyzed tokens, not the 
stored fields.
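
To make those options concrete, they look roughly like this as URLs (host,
paths, and field names are illustrative and depend on your deployment):

    # top N indexed terms for a field, via the LukeRequestHandler
    http://localhost:8983/solr/admin/luke?fl=myfield&numTerms=20

    # run a stored value back through the field's analyzer chain
    http://localhost:8983/solr/admin/analysis.jsp?name=myfield&val=Some+stored+text

    # analyzed tokens via faceting, restricted to a single document
    http://localhost:8983/solr/select?q=id:1234&facet=true&facet.field=myfield&facet.mincount=1&rows=0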


ryan


Re: Creating document schema at runtime

2007-12-12 Thread Ryan McKinley

Shalin Shekhar Mangar wrote:

Hi,

I'm looking for some tips on how to create a new document schema and
add it to a Solr core at runtime. The use case that I'm trying to solve
is:

1. Using a custom configuration tool, user creates a solr schema
2. The schema is added (uploaded) to a solr instance (on a remote machine).
3. Documents corresponding to the newly added schema are added to solr.

I understand that with SOLR-215, I can create a new core by specifying
the config and schema but still, there is no way for me to do this
from a remote machine using HTTP calls. 


Check SOLR-350 and: http://wiki.apache.org/solr/MultiCore

the 'LOAD' method isn't implemented yet, but that sounds like what you want.



If this capability does not
exist, I would be happy to open an issue in JIRA and contribute
patches.



patches are always welcome!

ryan


Re: Solr and word frequencies?

2007-12-12 Thread Otis Gospodnetic
Recono,

This would be easier to do with Lucene.  Solr uses Lucene under the hood, so 
just write an app that opens appropriate indices and makes use of various 
docFreq methods in the Lucene API.  Look at TermDocs, IndexReader, TermEnum, 
etc.
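
A rough sketch of what Otis describes, against the Lucene 2.x API (index
paths, the field name, and the significance threshold are all placeholders):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class CompareFreqs {
        public static void main(String[] args) throws Exception {
            IndexReader big = IndexReader.open("/indexes/B");   // the whole index
            IndexReader sub = IndexReader.open("/indexes/S1");  // one sub-index
            TermEnum terms = sub.terms();                       // every term in S1
            while (terms.next()) {
                Term t = terms.term();
                if (!"text".equals(t.field())) continue;        // placeholder field
                // relative document frequency in each index
                double fSub = (double) sub.docFreq(t) / sub.numDocs();
                double fBig = (double) big.docFreq(t) / big.numDocs();
                if (fSub > 2.0 * fBig)                          // arbitrary threshold
                    System.out.println(t.text() + " is over-represented in S1");
            }
            terms.close();
            sub.close();
            big.close();
        }
    }

Note that docFreq counts are index-wide, so if the sub-forums exist only as
an 'S' field inside "B" rather than as separate indices, the per-sub counts
are no longer a single call; separate indices keep this approach trivial.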

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Recono <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, December 12, 2007 5:00:49 AM
Subject: Solr and word frequencies?


Hi,

I am working on the following task.
I have a big Solr index "B" (around 2 million forum-post entries) and 50
sub-indices "S1-50" (sub-forum entries) which are also included in "B".
Now I want Solr to compare the word frequency of each word in "S1-50" to
the word frequency of the whole big index "B", to find the words of
special interest in "S1-50" compared to "B".

My questions are:
I guess Solr uses word frequency itself... is it possible to just access
this Solr functionality for my task (and if yes, how?), or do I have to
write something from scratch?
Do I need to put "S1-50" into standalone Solr instances as well, or is it
enough to set a field in "B" called 'S' with values 1-50?

Thanks in advance!
-- 
View this message in context:
 http://www.nabble.com/Solr-and-word-frequencies--tp14292112p14292112.html
Sent from the Solr - User mailing list archive at Nabble.com.






Solr 1.3 expected release date

2007-12-12 Thread Owens, Martin
What date or year do we believe Solr 1.3 will be released?

Regards, Martin Owens


Re: Solr 1.3 expected release date

2007-12-12 Thread Ryan McKinley

Owens, Martin wrote:

What date or year do we believe Solr 1.3 will be released?

Regards, Martin Owens


2008 for sure.  It will be after Lucene 2.3, and that is a month (or more?) 
away.  My honest guess is late Jan to mid Feb.


I think the last *major* change going into 1.3 is SOLR-303 (Distributed 
Search over HTTP) -- this will require some reworking of new features 
like SearchComponents and solrj.  After that, changes will mostly be for 
stability and clarity.


I don't really want to promote using nightly builds, but if you need 1.3 
features, the current ones are stable.  The interfaces may change, but 
it should not crash or anything like that.


ryan



RE: Solr, search result format

2007-12-12 Thread Owens, Martin

 
>> I think your biggest problem is requesting 70,000 records from Solr.
>> That is not going to be fast.

I know it, but the constraints on development don't lend themselves to putting 
all of the fields into Lucene so that a proper search can be conducted. We need to 
return them all because more work is done on the results webserver-side (much 
to my chagrin), so paging is out of the question.

>> 2. Since you are running out of memory parsing XML, I'm guessing
>> that you're using a DOM-style parser. Don't do that. You do not
>> need to create elaborate structures, strip-mine the data, then
>> throw those structures away. Instead, use a streaming parser, like StAX.

Oh I know there are better ways of doing it, I just can't do any of them - 
constraints and all that.

I was looking at the PythonResponseWriter; I'm trying to find a how-to, since a 
response writer would be responsible for writing the response after a search, 
right?

Best regards, Martin Owens


Re: Solr, search result format

2007-12-12 Thread Walter Underwood
I think your biggest problem is requesting 70,000 records from Solr.
That is not going to be fast.

Two suggestions:

1. Use paging. Get the results in chunks, 10, 25, 100, whatever.

2. Since you are running out of memory parsing XML, I'm guessing
that you're using a DOM-style parser. Don't do that. You do not
need to create elaborate structures, strip-mine the data, then
throw those structures away. Instead, use a streaming parser, like StAX.

This sounds like an XY problem. What are you trying to achieve by
fetching 70,000 records? There is probably a better way to do it.
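
A sketch of the chunked approach (URL, query, and chunk size are
illustrative), combining both suggestions - paging plus a streaming StAX
parse of each chunk:

    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;

    public class PagedFetch {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            int rows = 1000;
            for (int start = 0; start < 70000; start += rows) {
                URL url = new URL("http://localhost:8983/solr/select?q=foo"
                        + "&start=" + start + "&rows=" + rows);
                InputStream in = url.openStream();
                XMLStreamReader xml = factory.createXMLStreamReader(in);
                while (xml.hasNext()) {  // pull events one at a time; no DOM
                    xml.next();          // ...extract the fields you need here
                }
                xml.close();
                in.close();
            }
        }
    }

(javax.xml.stream ships with Java 6; on Java 5 you would need a separate
StAX implementation on the classpath.)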

wunder


On 12/12/07 11:58 AM, "Owens, Martin" <[EMAIL PROTECTED]> wrote:

> Hello everyone,
> 
> I'm looking for a better solution than the current XML output we're currently
> getting; if you return more than 70k records the webserver can no longer cope
> with parsing the XML and the machine falls over, out of memory.
> 
> Ideally what we'd like is for the search results to go directly into a
> temporary MySQL table so we can link against it in a further request from the
> web server. Does anyone know any plugins, or people who have done anything along
> these lines?
> 
> We might be able to settle for receiving the single field column as a CSV-type
> file; that would at least let us cut down on the processing and parsing. I see
> there is a CSV indexer, but do we have a CSV output plugin?
> 
> Once again thank you all for your help.
> 
> Best Regards, Martin Owens



Re: Solr, search result format

2007-12-12 Thread Ryan McKinley

Owens, Martin wrote:

Hello everyone,

I'm looking for a better solution than the current XML output we're currently 
getting; if you return more than 70k records the webserver can no longer cope 
with parsing the XML and the machine falls over, out of memory.

Ideally what we'd like is for the search results to go directly into a 
temporary MySQL table so we can link against it in a further request from the 
web server. Does anyone know any plugins, or people who have done anything along 
these lines?



"out of the box" solr does not do that...

maybe try a custom RequestHandler that extends StandardRequestHandler. 
let the base handler do everything, then in handleRequestBody, pull the 
results out of the response and use JDBC to fill your SQL tables.
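
A very rough sketch of that idea - 1.2-era class names from memory, so treat
every name here as an assumption, including the table layout and the use of
internal Lucene doc ids (which you would normally map back to your unique
key field before storing):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import org.apache.solr.handler.StandardRequestHandler;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.search.DocIterator;
    import org.apache.solr.search.DocList;

    public class SqlDumpHandler extends StandardRequestHandler {
        @Override
        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
                throws Exception {
            super.handleRequestBody(req, rsp);  // base handler does the search
            DocList docs = (DocList) rsp.getValues().get("response");
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/tmpdb", "user", "pass");
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO results (doc_id) VALUES (?)");
            DocIterator it = docs.iterator();
            while (it.hasNext()) {
                ps.setInt(1, it.nextDoc());     // internal doc id
                ps.executeUpdate();
            }
            ps.close();
            conn.close();
        }
    }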


otherwise try paging through the results... 70*1K results or something 
like that...


ryan


Re: Solr, search result format

2007-12-12 Thread Walter Underwood
Fetch your 70,000 results in 70 chunks of 1000 results. Parse each chunk
and add it to your internal list.

If you are allowed to parse Python results, why can't you use a different
XML parser?

What sort of "more work" are you doing? I've implemented lots of stuff
on top of a paged model, including customizing the relevance formula
and re-ranking.

wunder

On 12/12/07 12:31 PM, "Owens, Martin" <[EMAIL PROTECTED]> wrote:
>  
>>> I think your biggest problem is requesting 70,000 records from Solr.
>>> That is not going to be fast.
> 
> I know it, but the constraints on development don't lend themselves to putting
> all of the fields into Lucene so that a proper search can be conducted. We need to
> return them all because more work is done on the results webserver-side (much
> to my chagrin), so paging is out of the question.
> 
>>> 2. Since you are running out of memory parsing XML, I'm guessing
>>> that you're using a DOM-style parser. Don't do that. You do not
>>> need to create elaborate structures, strip-mine the data, then
>>> throw those structures away. Instead, use a streaming parser, like StAX.
> 
> Oh I know there are better ways of doing it, I just can't do any of them -
> constraints and all that.
> 
> I was looking at the PythonResponseWriter; I'm trying to find a how-to, since a
> response writer would be responsible for writing the response after a search,
> right?
> 
> Best regards, Martin Owens



Re: Solr, search result format

2007-12-12 Thread Mike Klaas


On 12-Dec-07, at 11:58 AM, Owens, Martin wrote:


Hello everyone,


Hi Martin,

It is usually preferable to not reply to an existing message in the
group when starting a new thread.  Some people (like me) use clients
that properly track the In-Reply-To header that gets added, so
multiple threads get all jumbled together.  (For instance, "Solr and
word frequencies?", "Solr 1.3 expected release date", and "Solr,
search result format" are all now mixed together in my client.)


Thanks!
-Mike


Autocommit

2007-12-12 Thread Michael Thessel
Hello UG

I already posted a while ago about a problem where one of the Solr threads
starts using 100% of one of the processor cores on a 4-core
system. This doesn't happen right after the start; it slowly
increases over about a week until the process runs constantly at
100%. I couldn't figure out a solution for this. I could live with the
problem, but I think it has a side effect: while the processor load
increases, the time between two autocommits increases as well. Currently
autocommit is set to 3 minutes. After 4 weeks the commits run only
every 40 minutes.

I have the following solr version installed:

Solr Specification Version: 1.2.0
Solr Implementation Version: 1.2.0 - Yonik - 2007-06-02 17:35:12
Lucene Specification Version: 2007-05-20_00-04-53
Lucene Implementation Version: build 2007-05-20

Does anyone have a hint what I could look for?

Thanks 

Michael

-- 
Michael Thessel <[EMAIL PROTECTED]>
Gossamer Threads Inc. http://www.gossamer-threads.com/
Tel: (604) 687-5804 Fax: (604) 687-5806


Re: Autocommit

2007-12-12 Thread Yonik Seeley
On Dec 12, 2007 6:15 PM, Michael Thessel <[EMAIL PROTECTED]> wrote:
> I already posted a while ago a problem that one of the solr threads
> starts using 100% of one of the processor cores on a 4 core
> system.

This sounds like warming / autowarming.
The other possibility is garbage collection.

> This doesn't happen right after the start it slightly
> increaes for about a week until the process runs constantly at
> 100%.

Is it still just one CPU at 100%, or is it ever 2 or more at 100%?
That would tell us if it were due to overlapping autowarming.

What happens to your index over time?  Does maxDoc() keep increasing?

> I couldn't figure out a solution for this. I could live with this
> problem but I think it has an side effect. While the processor load
> increases the time between two autocommits increases as well. Currently
> autocommit is set to 3 minutes. After 4 weeks the commits run only
> every 40 minutes.
>
> I have the following solr version installed:
>
> Solr Specification Version: 1.2.0
> Solr Implementation Version: 1.2.0 - Yonik - 2007-06-02 17:35:12
> Lucene Specification Version: 2007-05-20_00-04-53
> Lucene Implementation Version: build 2007-05-20
>
> Does anyone have a hint what I could look for?

Perhaps post the XML you get from the statistics page so we might know more.
Try looking in the logs to see what part of autowarming is taking so long.

-Yonik


Re: Autocommit

2007-12-12 Thread Michael Thessel
> > I already posted a while ago a problem that one of the solr threads
> > starts using 100% of one of the processor cores on a 4 core
> > system.
> 
> This sounds like warming / autowarming.
> The other possibility is garbage collection.

What can I do here? Decrease the autowarmCount?

My current settings:
  filterCache autowarmCount="256"
  queryResultCache autowarmCount="256"
  documentCache autowarmCount="0"
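
For reference, a full 1.2-style solrconfig.xml cache entry with a reduced
autowarmCount would look roughly like this (sizes illustrative):

    <filterCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="64"/>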

> > This doesn't happen right after the start it slightly
> > increaes for about a week until the process runs constantly at
> > 100%.
> 
> Is it still just one CPU at 100%, or is it ever 2 or more at 100%.
> That would tell us if it were due to overlapping autowarming.

It is always just one core. All the other cores run at 100% once in a while,
but only one core is constantly at 100%.

> What happens to your index over time?  Does maxDoc() keep increasing?
Yes, maxDoc is always increasing; it is pretty much the same as the total
number of indexed documents.

> Perhaps post the XML you get from the statistics page so we might
> know more. Try looking in the logs to see what part of autowarming is
> taking so long.
While checking the logs I ran across some performance warnings. I run an
optimize every night. Minutes after I start the optimize I get:

INFO: PERFORMANCE WARNING: Overlapping onDeckSearchers=2

Could the problem be related to this? I have disabled the optimize for now.
If the optimize is the problem, what would be a better strategy for
running the optimize?

Thanks for your help.

Cheers,

Michael

-- 
Michael Thessel <[EMAIL PROTECTED]>
Gossamer Threads Inc. http://www.gossamer-threads.com/
Tel: (604) 687-5804 Fax: (604) 687-5806


RE: Solr 1.3 expected release date

2007-12-12 Thread Norskog, Lance
 
... SOLR-303 (Distributed Search over HTTP)...

Woo-hoo!


-Original Message-
From: Ryan McKinley [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 12, 2007 12:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 1.3 expected release date

Owens, Martin wrote:
> What date or year do we believe Solr 1.3 will be released?
> 
> Regards, Martin Owens

2008 for sure.  It will be after Lucene 2.3, and that is a month (or more?)
away.  My honest guess is late Jan to mid Feb.

I think the last *major* change going into 1.3 is SOLR-303 (Distributed
Search over HTTP) -- this will require some reworking of new features
like SearchComponents and solrj.  After that, changes will mostly be for
stability and clarity.

I don't really want to promote using nightly builds, but if you need 1.3
features, the current ones are stable.  The interfaces may change, but
it should not crash or anything like that.

ryan



Re: Solr 1.3 expected release date

2007-12-12 Thread Venkatraman S
On Dec 13, 2007 1:38 AM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

>
> I think the last *major* change going into 1.3 is SOLR-303 (Distributed
> Search over HTTP) -- this will require some reworking of new features
> like SearchComponents and solrj.  After that, changes will mostly be for
> stability and clarity.
>
>
Interesting!  Are you planning to use Hadoop?
Can you brief the list on the architecture?

-- 
Venkat
Blog @ http://blizzardzblogs.blogspot.com


Re: Solr and Flex

2007-12-12 Thread Venkatraman S
I presume you understand the difference between Solr and Flex - and I am not
sure what you need the code for.
Do you want an AS3 implementation/wrapper for Solr, or are you
expecting an application which uses Solr (to index the docs),
retrieves the docs using some web services, and presents them to the
users in a Flex app?

Either way - you can code :)


On Dec 12, 2007 3:47 AM, jenix <[EMAIL PROTECTED]> wrote:

>
> Has anyone used Solr in a Flex application?
> Any code snippets to share?
>
> Thank you.
> Jennifer
> --
> View this message in context:
> http://www.nabble.com/Solr-and-Flex-tp14284703p14284703.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


--


Re: SOLR X FAST

2007-12-12 Thread Chris Hostetter

: Why use FAST and not use SOLR?  For example.
: What will FAST offer that will justify the investment?

Am I the only one who finds these questions incredibly hilarious?
Particularly on this list?

You should also email FAST customer service and ask them "Why use Solr and 
not use FAST ?"  :)



-Hoss



RE: Two Solr Webapps, one folder for the index data?

2007-12-12 Thread Chris Hostetter

: I asked a question similar to this back in 
: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200709.mbox/[EMAIL 
PROTECTED] 

: SolrDispatchFilter and stored in the global Config). This way, I can 
: have a multiple instances of Solr up and running with the exact same 
: configuration, and their indices contained wholly within their 
: deployment directories.

As I mentioned in that thread (and I don't think you ever replied), this 
seems like a really bad idea ... anytime you want to upgrade Solr, your 
configs and data all get completely blown away.

I think if people want to reuse the same configs multiple times with 
only small variations (for things like the dataDir), it makes a lot more 
sense to add support for variable substitution based on JNDI variables...

: : I actually have a patch for solr config parser which allows you to use
: : context environment variables in the solrconfig.xml
: : I generally use it for development when I'm working with multiple
: : instances and different data dirs.  I'll add it to jira today if you
: : want it.
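
Purely to illustrate the idea - this syntax is hypothetical, not the actual
patch: each webapp would declare its own JNDI/context variable (here in a
Tomcat context file), and the shared solrconfig.xml would reference it:

    <!-- hypothetical Tomcat context entry, one per instance -->
    <Environment name="solr/dataDir" type="java.lang.String"
                 value="/var/solr/instance1/data"/>

    <!-- hypothetical substitution in the shared solrconfig.xml -->
    <dataDir>${solr/dataDir}</dataDir>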




-Hoss



Re: criteria for using the property stored="true" and indexed="true"

2007-12-12 Thread Chris Hostetter

: 
http://wiki.apache.org/solr/SchemaXml#head-af67aefdc51d18cd8556de164606030446f56554
: 
: indexed means searchable (facet and sort also need this), stored instead
: is needed only when you need the original text (i.e. not
: tokenized/analyzed) to be returned.
: When stored and indexed are not present, I think solr put them to a
: default true (both of them)

not exactly ... if you don't have them on the field, they are inherited 
from the <fieldtype> ... if you don't have them on the <fieldtype>, then 
it's whatever the default behavior is for the FieldType class used by the 
<fieldtype>.


: > What would be the impact if no field is assigned these properties?

if no fields are "stored" then you can't see your search results.  if no 
fields are "indexed" you can't search for anything.



-Hoss



Re: does solr handle hierarchical facets?

2007-12-12 Thread Chris Hostetter

: > such that if you search for category, you get all those documents that have
: > been tagged with the category AND any sub categories. If this is possible I
: > think I'll investigate using solr in place of some existing code we have
: > that deals with indexing and searching of such data.
: 
: sort of.   you can index a field literally as "category/subcategory/
: subsubcategory" and query for category/* to get all documents in that category
: and below.

I deal with this kind of stuff all the time ... if you can model your 
hierarchy using unique categoryIds (numbers are easiest) such that no 
categoryId appears in more than one place in the hierarchy (something that 
frequently isn't possible with "category names"), then it's really easy to 
just index the entire "breadcrumb" for a document, and then you can search 
on any categoryId and get all of the documents in any "descendant" 
category.

ie, if this is your hierarchy...

Products/
Products/Computers/
Products/Computers/Laptops
Products/Computers/Desktops
Products/Cases
Products/Cases/Laptops
Products/Cases/CellPhones

Then this trick won't work (because Laptops appears twice), but if you have 
numeric IDs that correspond to each of those categories (so that the two 
instances of Laptops are unique)...

1/
1/2/
1/2/3
1/2/4
1/5/
1/5/6
1/5/7

...then you just index the full "path" (a pattern tokenizer can work fine), 
and then you can search on "5" and get all products which are "Cases".
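
A sketch of what that pattern-tokenizer setup might look like in schema.xml
(type and field names are illustrative):

    <fieldType name="breadcrumb" class="solr.TextField">
      <analyzer>
        <!-- "1/5/6" becomes the tokens 1, 5, 6 -->
        <tokenizer class="solr.PatternTokenizerFactory" pattern="/"/>
      </analyzer>
    </fieldType>

    <field name="categoryId" type="breadcrumb" indexed="true" stored="false"/>

A query of categoryId:5 then matches every document indexed under Cases,
including both of its sub-categories.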




-Hoss



Re: Solr, Multiple processes running

2007-12-12 Thread Chris Hostetter

: Subject: Solr, Multiple processes running
: References: <[EMAIL PROTECTED]>
:  <[EMAIL PROTECTED]>
: <[EMAIL PROTECTED]>
...

http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message; instead, start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss