olr using Solr admin
interface, the documents with the last 20 RECORD_IDs are missing (for
example, the last id is 999,980 instead of 1,000,000).
- Sharmila
Feak, Todd wrote:
>
> A few questions to help the troubleshooting.
>
> Solr version #?
>
> Is there just 1 commit through Solrj fo
A few questions to help the troubleshooting.
Solr version #?
Is there just 1 commit through Solrj for the millions of documents?
Or do you do it on a regular interval (every 100k documents for example) and
then one at the end to be sure?
How are you observing that the last few didn't make it
Id=684936&literal.filingDate=1997-12-04T00:00:00Z&literal.formTypeId=95&literal.companyId=3567904&literal.sourceId=0&resource.name=684936.txt&commit=false
>>
>> Have you verified that all of your indexing jobs (you said you had 4
>> or 5) have commit=false?
Any particular reason for the double quotes in the 2nd and 3rd query example,
but not the 1st, or is this just an artifact of your email?
-Todd
-Original Message-
From: Rakhi Khatwani [mailto:rkhatw...@gmail.com]
Sent: Tuesday, October 06, 2009 2:26 AM
To: solr-user@lucene.apache.org
Su
We use the snapcleaner script.
http://wiki.apache.org/solr/SolrCollectionDistributionScripts#snapcleaner
Will that do the job?
-Todd
-Original Message-
From: solr jay [mailto:solr...@gmail.com]
Sent: Monday, October 05, 2009 1:58 PM
To: solr-user@lucene.apache.org
Subject: cleanup old
It looks like you have some confusion about queries vs. facets. You may want to
look at the Solr wiki regarding facets a bit. In the meanwhile, if you just
want to query for that field containing "21"...
I would suggest that you don't set the query type, don't set any facet fields,
and only set
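For illustration (the field name "itemField" is made up; substitute your
real field name), a plain fielded query on the standard handler would look
like:

  q=itemField:21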
@lucene.apache.org
Subject: RE: Solr Timeouts
I'm not committing at all actually - I'm waiting for all 6 million to be done.
-Original Message-----
From: Feak, Todd [mailto:todd.f...@smss.sony.com]
Sent: Monday, October 05, 2009 12:10 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts
ncade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Monday, October 05, 2009 9:30 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Timeouts
I'm not committing at all actually - I'm waiting for all 6 million to be done.
-Original Message-----
From: Feak, Todd [mailto:todd.f..
How often are you committing?
Every time you commit, Solr will close the old index and open the new one. If
you are doing this in parallel from multiple jobs (4-5 you mention) then
eventually the server gets behind and you start to pile up commit requests.
Once this starts to happen, it will ca
My understanding of NGram tokenizing is that it helps with languages that don't
necessarily contain spaces as a word delimiter (Japanese et al.). In that case
bi-gramming is used to find words contained within a stream of unbroken
characters. In that case, you want to find all of the bi-grams that you
Are the issues you ran into due to non-standard code in Solr, or is there
some WebLogic inconsistency?
-Todd Feak
-Original Message-
From: news [mailto:n...@ger.gmane.org] On Behalf Of Ilan Rabinovitch
Sent: Friday, January 30, 2009 1:11 AM
To: solr-user@lucene.apache.org
Subject: Re: WebLogi
This usually represents anything less than 8ms if you are on a Windows
system. The granularity of timing on Windows systems is around 16ms.
-Todd feak
-Original Message-
From: sunnyfr [mailto:johanna...@gmail.com]
Sent: Thursday, January 29, 2009 9:13 AM
To: solr-user@lucene.apache.org
S
Although the idea that you will need to rebuild from scratch is
unlikely, you might want to fully understand the cost of recovery if you
*do* have to.
If it's incredibly expensive (time or money), you need to keep that in
mind.
-Todd
-Original Message-
From: Ian Connor [mailto:ian.con...
The easiest way is to run maybe 100,000 or more queries and take an
average. A single microsecond value for a query would be incredibly
inaccurate.
-ToddFeak
-Original Message-
From: AHMET ARSLAN [mailto:iori...@yahoo.com]
Sent: Friday, January 23, 2009 1:33 AM
To: solr-user@lucene.apa
Can you share your experience with the IBM JDK once you've evaluated it?
You are working with a heavy load, I think many would benefit from the
feedback.
-Todd Feak
-Original Message-
From: wojtekpia [mailto:wojte...@hotmail.com]
Sent: Thursday, January 22, 2009 3:46 PM
To: solr-user@luc
A ballpark calculation would be
Collected Amount (From GC logging)/ # of Requests.
The GC logging can tell you how much it collected each time, no need to
try and snapshot before and after heap sizes. However (big caveat here),
this is a ballpark figure. The garbage collector is not guaranteed t
From a high level view, there is a certain amount of garbage collection
that must occur. That garbage is generated per request, through a
variety of means (buffers, request, response, cache expulsion). The only
thing that JVM parameters can address is *when* that collection occurs.
It can occur
The large drop in old generation from 27GB->6GB indicates that things
are getting into your old generation prematurely. They really don't need
to get there at all, and should be collected sooner (more frequently).
Look into increasing young generation sizes via JVM parameters. Also
look into concurrent collectors.
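For illustration only (the size below is a placeholder, not a
recommendation), that tuning is typically done with flags along these
lines:

  -Xmn4g -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails

-Xmn sets the young generation size; the other flags enable a concurrent
collector and GC logging so you can see the effect of the change.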
he schema.xml.
Are we on the same wavelength here?
Thanks a lot for the suggestion,
Yogesh
- Original Message
From: "Feak, Todd"
To: solr-user@lucene.apache.org
Sent: Tuesday, January 20, 2009 4:49:56 PM
Subject: RE: New to Solr/Lucene design question
A third option -
A third option - Use dynamic fields.
Add a dynamic field called "*_stash". This will allow new fields for
documents to be added down the road without changing schema.xml, yet
still allow you to query on fields like "arresteeFirstName_stash"
without extra overhead.
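A minimal sketch of the schema.xml entry (the "text" type here is an
assumption; use whichever analyzed type fits your data):

  <dynamicField name="*_stash" type="text" indexed="true" stored="true"/>

Any field whose name ends in "_stash" is then accepted without further
schema changes.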
-Todd Feak
-Original Message-
Anyone that can shed some insight?
-Todd
-Original Message-
From: Feak, Todd [mailto:todd.f...@smss.sony.com]
Sent: Friday, January 16, 2009 9:55 AM
To: solr-user@lucene.apache.org
Subject: How to select *actual* match from a multi-valued field
At a high level, I'm trying to do
At a high level, I'm trying to do some more intelligent searching using
an app that will send multiple queries to Solr. My current issue is
around multi-valued fields and determining which entry actually
generated the "hit" for a particular query.
For example, let's say that I have a multi-valu
I believe that when you commit, a new IndexReader is created, which is
warmed, etc. New incoming queries will be sent to this new IndexReader.
Once all previously existing queries have been answered, the old
IndexReader will shut down.
The commit doesn't wait for the query to finish, but it should
Kind of a side-note, but I think it may be worth your while.
If your queryResultCache hit rate is 65%, consider putting a reverse
proxy in front of Solr. It can give performance boosts over the query
cache in Solr, as it doesn't have to pay the cost of reformulating the
response. I've used Varnish
: Using query functions against a "type" field
On Tue, Jan 6, 2009 at 1:05 PM, Feak, Todd
wrote:
> I'm not sure I followed all that Yonik.
>
> Are you saying that I can achieve this effect now with a bq setting in
> my DisMax query instead of via a bf setting?
Yep, a "
-user@lucene.apache.org
Subject: Re: Using query functions against a "type" field
On Tue, Jan 6, 2009 at 10:41 AM, Feak, Todd
wrote:
> The boost queries are true queries, so the amount of boost can be
> affected by things like term frequency for the query.
Sounds like a constant sco
First suspect would be Filter Cache settings and Query Cache settings.
If they are auto-warming at all, then there is a definite difference
between the first start behavior and the post-commit behavior. This
affects what's in memory, caches, etc.
-Todd Feak
-Original Message-
From: wojte
:It should be fairly predictable, can you elaborate on what problems you
:have just adding boost queries for the specific types?
The boost queries are true queries, so the amount of boost can be affected
by things like term frequency for the query. The functions aren't
affected by this and therefore
. The ngrams are extremely
fast and the recommended way to do this according to the user group. They
work wonderfully except this one issue. So do we basically have to do a
separate index for this or is there a dedup setting to only return unique
brand names?
On 12/24/08 7:51 AM, "Feak,
It sounds like you want to get a list of "brands" that start with a particular
string, out of your index. But your index is based on products, not brands. Is
that correct?
If so, that has nothing to do with NGrams (or even tokenizing, for that matter).
I think you should be doing a Facet query in
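Roughly, such a request could look like this (field name and prefix are
made up for illustration):

  q=*:*&rows=0&facet=true&facet.field=brand&facet.prefix=son

facet.prefix restricts the returned facet values to those starting with the
given string, which gives you the distinct brands without pulling back
product documents.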
Subject: Re: Using query functions against a "type" field
Try document boost at index time. --wunder
On 12/22/08 9:28 AM, "Feak, Todd" wrote:
> I would like to use a query function to boost documents of a certain
> "type". I realize that I can use a boost qu
I would like to use a query function to boost documents of a certain
"type". I realize that I can use a boost query for this, but in
analyzing the scoring it doesn't seem as predictable as the query
functions.
So, imagine I have a field called "foo". Foo contains a value that
indicates what typ
Don't forget to consider scaling concerns (if there are any). There are
strong differences in the number of searches we receive for each
language. We chose to create separate schema and config per language so
that we can throw servers at a particular language (or set of languages)
if we needed to.
utowarm is
done.
Feak, Todd wrote:
>
> It's spending 4-5 seconds warming up your query cache. If 4-5 seconds
is
> too much, you could reduce the number of queries to auto-warm with on
> that cache.
>
> Notice that the 4-5 seconds is spent only putting about 420 que
It's spending 4-5 seconds warming up your query cache. If 4-5 seconds is
too much, you could reduce the number of queries to auto-warm with on
that cache.
Notice that the 4-5 seconds is spent only putting about 420 queries into
the query cache. Your autowarm of 5 for the query cache seems a bi
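For reference, that setting lives on the cache declaration in
solrconfig.xml; a sketch with placeholder numbers:

  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>

Lowering autowarmCount shortens the warm-up at the cost of a colder cache
after each commit.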
You can set the home directory in your Tomcat context snippet/file.
http://wiki.apache.org/solr/SolrTomcat#head-7036378fa48b79c0797cc8230a8aa0965412fb2e
This controls where Solr looks for solrconfig.xml and schema.xml. The
solrconfig.xml in turn specifies where to find the data directory.
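A minimal sketch of such a context file (the paths are placeholders):

  <Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String" value="/opt/solr/home" override="true"/>
  </Context>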
-
I'm pretty sure "*" isn't supported by DisMax.
From the Solr Wiki on DisMaxRequestHandler overview:
http://wiki.apache.org/solr/DisMaxRequestHandler?highlight=(dismax)#head-ce5517b6c702a55af5cc14a2c284dbd9f18a18c2
"This query handler supports an extremely simplified subset of the
Lucene QueryPa
One option is to add an additional field for sorting. Create a copy of the
field you want to sort on and modify the data you insert there so that it will
sort the way you want it to.
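Sketched in schema.xml (field names are illustrative), you keep the
searchable field and add an untokenized sibling used only for sorting:

  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="title_sort" type="string" indexed="true" stored="false"/>

If no transformation of the value is needed a copyField can populate it;
otherwise the client fills title_sort with the massaged value at index time.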
-ToddFeak
-Original Message-
From: Joel Karlsson [mailto:[EMAIL PROTECTED]
Sent: Monday, December 08, 2
Do you have a "dismaxrequest" request handler defined in your solr config xml?
Or is it "dismax"?
-Todd Feak
-Original Message-
From: tushar kapoor [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2008 10:07 AM
To: solr-user@lucene.apache.org
Subject: Encoded search string & qt=Dis
The commit after each one may be hurting you.
I believe that a new searcher is created after each commit. That searcher then
runs through its warm up, which can be costly depending on your warming
settings. Even if it's not overly costly, creating another one while the first
one is running make
Ok sounds reasonable. When you index/update those 4-10 documents, are
you doing a single commit? OR are you doing a commit after each one?
How big is your index? How big are your documents? Ballpark figures are
ok.
-ToddFeak
-Original Message-
From: dudes dudes [mailto:[EMAIL PROTECTED]
Probably going to need a bit more information.
Such as:
What version of Solr and a little info on doc count, index size, etc.
How often are you sending updates to your Master?
How often are you committing?
What are your QueryCache and FilterCache settings for autowarm?
Do you have queries set u
I've found that creating a custom filter and filter factory isn't too
burdensome when the filter doesn't "quite" do what I need. You could
grab the source and create your own version.
-Todd Feak
-Original Message-
From: Jerven Bolleman [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 2
Could you provide your schema and the exact query that you issued?
Things to consider... If you just searched for "the", it used the
default search field, which is declared in your schema. The filters
associated with that default field are what determine whether or not the
stopword list is invoked
Can Nutch crawl newsgroups? Anyone?
-Todd Feak
-Original Message-
From: John Martyniak [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 19, 2008 3:06 PM
To: solr-user@lucene.apache.org
Subject: Searchable/indexable newsgroups
Does anybody know of a good way to index newsgroups using
I see value in this in the form of protecting the client from itself.
For example, our Solr isn't accessible from the Internet. It's all
behind firewalls. But, the client applications can make programming
mistakes. I would love the ability to lock them down to a certain number
of rows, just in cas
There's a patch in to do that as a separate filter. See
https://issues.apache.org/jira/browse/SOLR-813
You could just take the patch. It's the full filter and factory.
-Todd Feak
-Original Message-
From: Brian Whitman [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 13, 2008 12:31 PM
I believe (someone correct me if I'm wrong) that the only fields you
need to store are those fields which you wish returned from the query.
In other words, if you will never put the field on the list of fields
(fl) to return, there is no need to store it.
It would be advantageous not to store more
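For example (hypothetical field), something that is searched but never
returned could be declared:

  <field name="body" type="text" indexed="true" stored="false"/>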
Is support for setting the FSDirectory this way built into 1.3.0
release? Or is it necessary to grab a trunk build.
-Todd Feak
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Wednesday, November 12, 2008 11:59 AM
To: solr-user@lucene.ap
curious.
-Todd Feak
-Original Message-
From: wojtekpia [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 05, 2008 11:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Throughput Optimization
My documentCache hit rate is ~.7, and my queryCache is ~.03. I'm using
FastLRUCache on al
What are your other cache hit rates looking like?
Which caches are you using the FastLRUCache on?
-Todd Feak
-Original Message-
From: wojtekpia [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 05, 2008 8:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Throughput Optimization
Yes,
If you are seeing < 90% CPU usage and are not IO (File or Network)
bound, then you are most probably bound by lock contention. If your CPU
usage goes down as you throw more threads at the box, that's an even
bigger indication that that is the issue.
A good profiling tool should help you locate thi
as afraid of that. Was hoping not to need another big fat box
>>> like
>>> this one...
>>>
>>> ---
>>> Alok K. Dhir
>>> Symplicity Corporation
>>> www.symplicity.com
>>> (703) 351-0200 x 8080
>>> [EMAIL PROTECTED]
>
I believe this is one of the reasons that a master/slave configuration
comes in handy. Commits to the Master don't slow down queries on the
Slave.
-Todd
-Original Message-
From: Alok Dhir [mailto:[EMAIL PROTECTED]
Sent: Monday, November 03, 2008 1:47 PM
To: solr-user@lucene.apache.org
Su
Have you looked into the "bf" and "bq" arguments on the
DisMaxRequestHandler?
http://wiki.apache.org/solr/DisMaxRequestHandler?highlight=(dismax)#head-6862070cf279d9a09bdab971309135c7aea22fb3
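As a rough illustration (the values are invented for the example), bq adds
a boosting query and bf a boosting function on top of the main dismax
query:

  qt=dismax&q=ipod&bq=type:music^5.0&bf=recip(rord(creationDate),1,1000,1000)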
-Todd
-Original Message-
From: George [mailto:[EMAIL PROTECTED]
Sent: Monday, November 03, 200
I realize you said caching won't help because the searches are
different, but what about Document caching? Is every document returned
different? What's your hit rate on the Document cache? Can you throw
memory at the problem by increasing Document cache size?
I ask all this, as the Document cache
It strikes me that removing just the seconds could very well reduce
overhead to 1/60 of original. 30 second query turns into 500ms query.
Just a swag though.
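For instance, rounding the time in the query with Solr date math keeps the
value constant for a whole minute, so the filter can actually be reused
from the cache (the field name is illustrative):

  fq=timestamp:[NOW/MINUTE-1HOUR TO NOW/MINUTE]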
-Todd
-Original Message-
From: Alok Dhir [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 29, 2008 1:48 PM
To: solr-user@lucene.
Have you looked at how long your warm up is taking?
If it's taking longer to warm up a searcher than it does for you to do
an update, you will be behind the curve and eventually run into this no
matter how big that number.
-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf
You may want to take a very close look at what the WordDelimiterFilter
is doing. I believe the underscore is dropped entirely during indexing
AND searching as it's not alphanumeric.
Wiki doco here
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=(tokenizer)#head-1c9b83870ca78
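If the goal is to keep the original token, underscore included, one knob
worth checking for in your Solr version is preserveOriginal on the filter;
a sketch (attribute values are only an example):

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" preserveOriginal="1"/>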
Unless "q=ALL" is a special query I don't know about, the only reason you would
get results is if "ALL" showed up in the default field of the single document
that was inserted/updated.
You could try a query of "*:*" instead. Don't forget to URL encode if you are
doing this via URL.
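For example (host and port are the stock example values), the encoded form
looks like:

  http://localhost:8983/solr/select?q=*%3A*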
-Todd
---
My bad. I misunderstood what you wanted.
The example I gave was for the searching side of things. Not the data
representation in the document.
-Todd
-Original Message-
From: Aleksey Gogolev [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 22, 2008 11:14 AM
To: Feak, Todd
Subject: Re
Sent: Wednesday, October 22, 2008 9:24 AM
To: Feak, Todd
Subject: Re[2]: Question about copyField
Thanks for the reply. I want to make your point more exact, because I'm not
sure that I correctly understood you :)
As far as I know (please correct me if I'm wrong), type defines the way
in which th
The filters and tokenizer that are applied to the copy field are
determined by its type in the schema. Simply create a new field type in
your schema with the filters you would like, and use that type for your
copy field. So, the field description would have its old type, but the
field suggestion
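A small sketch of that in schema.xml (the type name "text_suggest" and its
filters are just an illustration):

  <fieldType name="text_suggest" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="suggestion" type="text_suggest" indexed="true" stored="true"/>
  <copyField source="description" dest="suggestion"/>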
Any chance this is a MySql server configuration issue?
http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
-Todd
-Original Message-
From: sunnyfr [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2008 1:09 PM
To: solr-user@lucene.apache.org
Subject: Re: solr1.3 / tomcat55
ct: Re: Problem implementing a BinaryQueryResponseWriter
do you have handleSelect set to true in solrconfig?
...
if not, it would use a Servlet that is now deprecated
On Oct 20, 2008, at 4:52 PM, Feak, Todd wrote:
> I found out what's going on.
>
> My test queries
PrintWriter out = response.getWriter();
responseWriter.write(out, solrReq, solrRsp);
}
On Oct 20, 2008, at 3:59 PM, Feak, Todd wrote:
> Yes.
>
> I've gotten it to the point where my class is called, but the wrong
> method on it is c
ing a BinaryQueryResponseWriter
Hi Todd,
Did you add your response writer in solrconfig.xml?
On Mon, Oct 20, 2008 at 9:35 PM, Feak, Todd <[EMAIL PROTECTED]>
wrote:
> I switched from dev group for this specific question, in case other
> users have similar issue.
>
>
>
I would look real closely at the data between MySQL and Solr. I don't
know how it got from the database to the index, but I would try and get
a debugger running and look at the actual data as it's moving along.
Possible suspects include, JDBC driver, JDBC driver settings, HTTP
client (whatever sen
That looks like the data in the index is incorrectly encoded.
If the inserts into your index came in via HTTP GET and your Tomcat wasn't
configured for UTF-8 at the time, I could see it going into the index
corrupted. But I'm not sure if that's even possible (depends on Update)
Is it hard to r
I switched from dev group for this specific question, in case other
users have similar issue.
I'm implementing my own BinaryQueryResponseWriter. I've implemented the
interface and successfully plugged it into the Solr configuration.
However, the application always calls the Writer method on the
Two potential issues I see there.
1. Shouldn't your query string on the URL be encoded?
2. Are you using Tomcat, and did you set it up to use UTF-8 encoding? If not,
your connector node in Tomcat needs to have the URIEncoding set to UTF-8.
Documentation here
http://struts.apache.org/2.0.11.2/
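For reference, the change goes on the connector in Tomcat's
conf/server.xml, roughly:

  <Connector port="8080" maxThreads="150" URIEncoding="UTF-8"/>

(port and maxThreads are whatever you already use; URIEncoding is the part
that matters.)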
The current Subversion trunk has the new Lucene 2.4.0 libraries
committed. So, it's definitely under way.
-Todd
-Original Message-
From: Julio Castillo [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 15, 2008 9:48 AM
To: solr-user@lucene.apache.org
Subject: Lucene 2.4 released
Any id
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Tuesday, October 14, 2008 1:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Practical number of Solr instances per machine
On Tue, Oct 14, 2008 at 4:29 PM, Feak, Todd <[EMAIL PROTECTED
In our load testing, the limit for utilizing all of the processor time
on a box was locking (synchronize, mutex, monitor, pick one). There were
a couple of locking points that we saw.
1. Lucene's locking on the index for simultaneous read/write protection.
2. Solr's locking on the LRUCaches for up