Is it possible you have two solr instances running off the same index
folder? This was a mistake I stumbled into early on - I was writing
with one, and reading with the other, so I didn't see updates.
-Mike
On 09/15/2011 12:37 AM, Pawan Darira wrote:
I am committing but not doing replication.
We have the identical problem in our system.
Our plan is to encode the most recent version of a document using an
explicit field/value, i.e.
version=current
(or maybe current=true)
We also need to allow users to search for the most current,
but only within versions they have access to.
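A minimal sketch of how that combined query might look (field names are
illustrative, not from this thread):
+version:current +acl:(groupA OR groupB) +(the user's query terms)
where acl is a multi-valued field naming the groups allowed to see each
version.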
Is there some reason you don't want to leverage Highlighter to do this
work? It has all the necessary code for using the analyzed version of
your query, so it will only match tokens that really contribute to the
search match.
You might also be interested in LUCENE-2878 (which is still under
d
On 10/24/2011 02:35 PM, MBD wrote:
Is this really a stumper? This is my first experience with Solr and having spent
only an hour or so with it I hit this barrier (below). I'm sure *I* am doing
something completely wrong; just hoping someone more familiar with the platform
can help me identify
Since you are performing a complete reload of all of your data, I don't
understand why you can't create a new core, load your new data, swap
your application to look at the new core, and then erase the old one, if
you want.
Even so, you could track the timestamps on all your documents, which
I'm experiencing something really weird: I get different results
depending on whether I specify wt=javabin, and retrieve using SolrJ, or
wt=xml. I spent quite a while staring at query params to make sure
everything else is the same, and they do seem to be. At first I thought
the problem relat
quick follow-up: I also notice that the query from solrj gets version=1,
whereas the admin webapp puts version=2.2 on the query string, although
this param doesn't seem to change the xml results at all. Does this
indicate an older version of solrj perhaps?
-Mike
On 10/21/2010 04:47 PM, Mike Sokolov wrote:
I'm experiencing something really weird: I get different results depending
on whether I specify wt=javabin, and retrieve using SolrJ, or wt=xml. I
spent quite a while staring at query params to make sure everything else is
the sam
I am looking into the virtual hosts config in tomcat; it seems as if
there must indeed be another solr instance running; in fact I'm now
concerned there might be two solr instances running against the same
data folder. yargh.
-Mike
On 10/22/2010 09:05 AM, Mike Sokolov wrote:
Yes - I r
Right - my point was to combine this with the previous approaches to
form a query like:
samsung AND android AND GPS AND word_count:3
in order to exclude documents containing additional words. This would
avoid the combinatoric explosion problem others had alluded to earlier.
Of course this wou
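(A sketch of the indexing side for that word_count field, assuming the
count is computed client-side with SolrJ; the URL and field names are
illustrative:)

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class WordCountIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        String text = "samsung android GPS";
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("text", text);
        // word_count is what lets the query exclude longer documents
        doc.addField("word_count", text.trim().split("\\s+").length); // 3
        server.add(doc);
        server.commit();
    }
}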
seems like the only working idea. Maybe Varun
could comment on the maximum numbers of terms that his queries will
contain?
Regards,
Toke Eskildsen
On Wed, 2010-10-27 at 15:02 +0200, Mike Sokolov wrote:
Right - my point was to combine this with the previous approaches to
form a query like
Another alternative (prettier to my eye), would be:
(city:Chicago AND Romantic AND View)^10 OR (Romantic AND View)
-Mike
On 11/03/2010 09:28 AM, kenf_nc wrote:
Unfortunately the default operator is set to AND and I can't change that at
this time.
If I do (city:Chicago^10 OR Romantic OR Vi
If your ranges are always contiguous, you could index two fields:
range-start and range-end and then perform queries like:
range-start:[* TO 30] AND range-end:[5 TO *]
If you have multiple ranges which could have gaps in between, then you
need something more complicated :)
On 02/27/2012 04:09
I think your example case would end up like this:
<doc>
  ...
  <field name="range-start">1</field>  <!-- single-valued range field -->
  <field name="range-end">15</field>
  ...
</doc>
On 02/27/2012 04:26 PM, federico.wachs wrote:
Michael thanks a lot for your quick answer, but I'm not exactly sure I
understand your solution.
How would the document you are proposin
No; contiguous means there are no gaps between them.
You need something like what you described initially.
Another approach is to de-normalize your data so that you have a single
document for every range. But this might or might not suit your
application. You haven't said anything about the
Yes, I see - I think your best bet is to index every day as a distinct
value. Don't worry about having 100's of values.
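(Sketch, names illustrative: a multi-valued int field avail_day holding one
value per available day, queried with avail_day:[20120301 TO 20120315];
a range query against a multi-valued field matches any apartment with at
least one available day inside the window.)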
-Mike
On 02/27/2012 05:11 PM, federico.wachs wrote:
This is used on an apartment booking system, and what I store as solr
documents can be seen as apartments. These apartmen
I don't know if this would help with OOM conditions, but are you using a
tint type field for this? That should be more efficient to search than
a regular int or string.
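For reference, the example schema defines tint like this; the field
declaration below it is illustrative:
<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<field name="avail_day" type="tint" indexed="true" stored="false"
       multiValued="true"/>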
-Mike
On 02/27/2012 05:27 PM, federico.wachs wrote:
Yeah that's what I'm doing right now.
But whenever I try to index an ap
On 3/27/2012 11:14 AM, Mark Miller wrote:
On Mar 27, 2012, at 10:51 AM, Shawn Heisey wrote:
On 3/26/2012 6:43 PM, Mark Miller wrote:
It doesn't get thrown because that logic needs to continue - you don't
necessarily want one bad document to stop all the following documents from
being added.
You can specify a solr field as "multi-valued", and then supply multiple
values for it. What that really does is concatenate all the values with
a positional gap between them to prevent phrases and other positional
queries from traversing the boundary between the distinct values.
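A minimal schema sketch (the gap value is the example schema's usual
default; field names illustrative):
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  ...
</fieldType>
<field name="author" type="text" indexed="true" stored="true"
       multiValued="true"/>
With a gap of 100 positions, a phrase query can never match across two
distinct author values.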
-Mike
On 05
I'm creating some Solr plugins that index and search documents in a
special way, and I'd like to make them as easy as possible to
configure. Ideally I'd like users to be able to just drop a jar in
place without having to copy any configuration into schema.xml, although
I suppose they will ha
ok, never mind all is well - I had a mismatch between the
schema-declared field and my programmatic field, where I was overzealous
in using OMIT_TF_POSITIONS.
-Mike
On 6/2/2012 5:02 PM, Mike Sokolov wrote:
I'm creating some Solr plugins that index and search documents in a
special way
());
    setQueryAnalyzer(new WhitespaceGapAnalyzer());
}

protected Field.Index getFieldIndex(SchemaField field, String internalVal) {
    return Field.Index.ANALYZED;
}
}
On 6/2/2012 5:48 PM, Mike Sokolov wrote:
ok, never mind all is well - I had a mismatch
I agree, that seems odd. We routinely index XML using either
HTMLStripCharFilter, or XmlCharFilter (see patch:
https://issues.apache.org/jira/browse/SOLR-2597), both of which parse
the XML, and we don't see such a huge speed difference from indexing
other field types. XmlCharFilter also allo
Does anybody know of a way to detect when the highlight snippet begins
at the beginning of the field or ends at the end of the field using one
of the standard highlighters shipped w/Solr? We'd like to display
ellipses only when there is additional text surrounding the snippet in
the original
Yes - I commented out the <dataDir> element in solrconfig.xml and then
got the expected behavior: the core used a data subdirectory in the core
subdirectory.
It seems like the problem arises from using the solrconfig.xml that's
distributed as example/solr/conf/solrconfig.xml
The solrconfig.xml's in
Suppose your analysis stack includes lower-casing, but your synonyms are
only supposed to apply to upper-case tokens. For example, "PET" might
be a synonym of "positron emission tomography", but "pet" wouldn't be.
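A sketch of an analysis chain ordered so the case-sensitive synonym match
happens before lower-casing (the synonyms.txt entry is illustrative):
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- synonyms.txt: PET, positron emission tomography -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="false" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>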
-Mike
On 04/26/2011 09:51 AM, Robert Muir wrote:
On Tue, Apr 26, 2011 at 12:24
cased... it's just
arbitrary that they are only being analyzed with the "tokenizer".
On Tue, Apr 26, 2011 at 4:13 PM, Mike Sokolov wrote:
Suppose your analysis stack includes lower-casing, but your synonyms are
only supposed to apply to upper-case tokens. For example, "
StandardTokenizer will have stripped punctuation, I think. You might try
searching for all the entity names though:
(agrave | egrave | omacron | etc... )
The names are pretty distinctive. Although you might have problems with
greek letters.
-Mike
On 04/28/2011 12:10 PM, Paul wrote:
I'm tr
No clue. Try Wireshark to gather more data?
On 04/28/2011 02:53 PM, Jed Glazner wrote:
Anybody?
On 04/27/2011 01:51 PM, Jed Glazner wrote:
Hello All,
I'm having a very strange problem that I just can't figure out. The
slave is not able to replicate from the master, even though the master
is r
This is in 1.4 - we push updates via SolrJ; our application sees the
updates, but when we use the solr admin screens to run test queries, or
use Luke to view the schema and field values, it sees the database in
its state prior to the commit. I think eventually this seems to
propagate, but I'm
Thanks - we are issuing a commit via SolrJ; I think that's the same
thing, right? Or are you saying really we need to do a separate commit
(via HTTP) to update the admin console's view?
-Mike
On 05/02/2011 11:49 AM, Ahmet Arslan wrote:
This is in 1.4 - we push updates via SolrJ; our applica
Ah - I didn't expect that. Thank you!
On 05/02/2011 12:07 PM, Ahmet Arslan wrote:
Thanks - we are issuing a commit via SolrJ; I think that's the same
thing, right? Or are you saying really we need to do a separate commit
(via HTTP) to update the admin console's view?
Yes separate commit is
I think the key question here is what's the best way to perform indexing
without affecting search performance, or without affecting it much. If
you have a batch of documents to index (say a daily batch that takes an
hour to index and merge), you'd like to do that on an offline system,
and then
Thanks - that sounds like what I was hoping for. So the I/O during
replication will have *some* impact on search performance, but
presumably much less than reindexing and merging/optimizing?
-Mike
Master/slave replication does this out of the box, easily. Just set the slave
to update on Opti
It preserves the location of the terms in the original HTML document so
that you can highlight terms in HTML. This makes it possible (for
instance) to display the entire document, with all the search terms
highlighted, or (with some careful surgery) to display formatted HTML
(bold, italic, etc
Would anyone care to comment on the merits of storing indexed full-text
documents in Solr versus storing them externally?
It seems there are three options for us:
1) store documents both in Solr and externally - this is what we are
doing now, and gives us all sorts of flexibility, but doesn't
On 05/15/2011 11:48 AM, Erick Erickson wrote:
Where are the documents coming from? Because storing them ONLY in
Solr risks losing them if your index is somehow hosed.
In our case, we generally have source documents and can reproduce the
index if need be, but that's a good point.
Storing the
On 05/16/2011 09:24 AM, Dmitry Kan wrote:
Dear list,
Might have missed it from the literature and the list, sorry if so, but:
SOLR 1.4.1
Consider the query:
term1 term2 OR "term1 term2" OR "term1 term3"
I think what's happening is that your query gets rewritten into
something like:
We use log4j explicitly and find it irritating to deal with the built-in
JDK logging default. We also have conflicts with other packages that
have their own ideas about how to bind slf4j, so the less of this the
better, IMO. The 1.6.1 no-op default behavior seems a bit unfortunate
as out-of-t
You might want to create a field that's analyzed using
HtmlStripCharFilter - this will index all the non-tag/non-attribute text
in the document, and if you store the value, will store the entire XML
document as well.
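A minimal fieldType sketch along those lines:
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Declare the field with stored="true" and the original markup is stored
verbatim while only the text content is indexed.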
I've done some work on an XmlStripCharFilter, which does the same thing
(onl
Cool! Suggestion: you might want to replace
externalVal.toLowerCase().split(" ");
with
externalVal.toLowerCase().split("\\s+");
Also, I bet folks might have different ideas about what to do with
hyphens, so maybe:
externalVal.toLowerCase().split("[-\\s]+");
In fact why not make it a config
A possible workaround is to re-fetch the documents in your result set
with a query that is:
+id:(id1 OR id2 OR ... id20) (<original query>)
where id1..20 are the doc ids in your result set.
This would require two round-trips, though.
-Mike
On 05/24/2011 08:19 AM, Koji Sekiguchi wrote:
(11/05/24 20:56), Lord Khan H
The "*" endpoint for range terms wasn't implemented yet in 1.4.1 As a
workaround, we use very large and very small values.
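For example, instead of the then-unsupported open-ended form
timestamp:[* TO NOW], something like
timestamp:[1970-01-01T00:00:00Z TO NOW]
works; the sentinel just has to be smaller (or larger) than any real value
(field name illustrative).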
-Mike
On 05/27/2011 12:55 AM, alucard001 wrote:
Hi all
I am using SOLR 1.4.1 (according to solr info), but no matter what date
field I use (date or tdate) defined in def
I believe there is a query parser that accepts queries formatted in XML,
allowing you to provide a parse tree to Solr; perhaps that would get you
the control you're after.
-Mike
On 05/31/2011 02:24 PM, dar...@ontrenet.com wrote:
Hi,
I want to write my own query expander. It needs to obtain
Wildcard queries aren't analyzed, I think? I'm not completely sure what
the best workaround is here: perhaps simply lowercasing the query terms
yourself in the application. Also - I hope someone more knowledgeable
will say that the new HighlightQuery in trunk doesn't have this
restriction, bu
oops, please s/Highlight/Wildcard/
On 06/14/2011 05:31 PM, Mike Sokolov wrote:
Wildcard queries aren't analyzed, I think? I'm not completely sure
what the best workaround is here: perhaps simply lowercasing the query
terms yourself in the application. Also - I hope so
works but can get complex. The query that
I'm executing may have things like ranges which require some words to
be upper case (i.e. TO). I think this would be much better solved on
Solr's end; is there a JIRA about this?
On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov <soko...
I'd be very interested in this, as well, if you do it before me and are
willing to share...
A related question I have tried to ask on this list, and have never
really gotten a good answer to, is whether it makes sense to just chuck
the external storage and treat the lucene index as the primary
like that.
Thoughts/comments?
On Mon, Jun 20, 2011 at 9:05 AM, Mike Sokolov <soko...@ifactory.com> wrote:
I'd be very interested in this, as well, if you do it before me
and are willing to share...
A related question I have tried to ask on this list, and ha
e should include
but that's all I would need at present.
On Mon, Jun 20, 2011 at 9:54 AM, Mike Sokolov <soko...@ifactory.com> wrote:
Another option for determining whether to go to external
storage would be to examine the SchemaField, see if it
On 06/22/2011 04:01 AM, Dennis de Boer wrote:
Hi Bill,
as far as I understood now, with the help of my friend, you can't.
Multivalued fields don't work that way.
You can however always filter the facet results manually in the JSP. You
know what the user chose as a facet.
Yes - that is the m
We always remove the facet filter when faceting: in other words, for a
good user experience, you generally want to show facets based on the
query excluding any restriction based on the facets.
So in your example (facet B selected), we would continue to show *all*
facets. Only if you performed a
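(Since Solr 1.4 this can be done per request with tagged filter exclusion;
field names illustrative:
  fq={!tag=f}facet_field:B&facet=true&facet.field={!ex=f}facet_field
The {!ex=f} tells faceting to ignore the filter tagged f, so counts are
computed over the unfiltered query.)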
Actually - you are both wrong!
It is true that 0x is a valid UTF8 character, and not a valid UTF8
byte sequence.
But the parser is reporting (or trying to) that 0x is an invalid XML
character.
And Robert - if the wording offends you, you might want to send a note
to Tatu (http://ji
issues like this is to wrap the stream I'm handing to the parser
in some kind of cleanup stream that handles a few yucky issues. You
could, eg, just strip out invalid XML characters. Maybe Nutch should be
doing this, or at least handling the error better?
-Mike
On 06/27/2011 09:19 AM,
I don't think this is a BOM - that would be 0xfeff. Anyway the problem
we usually see w/processing XML with BOMs is in UTF8 (which really
doesn't need a BOM since it's a byte stream anyway), in which if you
transform the stream (bytes) into a reader (chars) before the xml parser
can see it, th
Markus - if you want to make sure not to offend XML parsers, you should
strip all characters not in this list:
http://en.wikipedia.org/wiki/XML#Valid_characters
You'll see that article talks about XML 1.1, which accepts a wider range
of characters than XML 1.0, and I believe the Woodstox parse
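A minimal sketch of that kind of cleanup for the XML 1.0 ranges (my own
untested code, not from the thread):

public static String stripInvalidXmlChars(String in) {
    StringBuilder sb = new StringBuilder(in.length());
    for (int i = 0; i < in.length(); ) {
        int c = in.codePointAt(i);
        i += Character.charCount(c);
        // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF]
        //          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        boolean valid = c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
        if (valid) {
            sb.appendCodePoint(c);
        }
    }
    return sb.toString();
}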
Does the phonetic analysis preserve the offsets of the original text field?
If so, you should probably be able to hack up FastVectorHighlighter to
do what you want.
-Mike
On 06/29/2011 02:22 PM, Jamie Johnson wrote:
I have a schema with a text field and a text_phonetic field and would like
t
t. Having no
familiarity with FastVectorHighlighter, is there somewhere specific I
should be looking?
On Wed, Jun 29, 2011 at 3:20 PM, Mike Sokolov wrote:
Does the phonetic analysis preserve the offsets of the original text field?
If so, you should probably be able to hack up FastVectorHigh
I'm not familiar with the CharFilters, I'll look into those now.
Is the solr.LowerCaseFilterFactory not handling wildcards the expected
result or is this a bug?
On Wed, Jun 15, 2011 at 4:34 PM, Mike Sokolov wrote:
I wonder whether CharFilters are applied to wildcard terms? I suspect
lower casing the works but can get complex. The query that I'm
executing may have things like ranges which require some words to be upper
case (i.e. TO). I think this would be much better solved on Solr's end; is
there a JIRA about this?
On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov wrote:
Yes, that's right. But at the moment the HL code basically has to
reconstruct and re-run your query - it doesn't have any special
knowledge. There's some work going on to try and fix that, but it seems
like it's going to require some fairly major deep re-plumbing.
-Mike
On 07/01/2011 07:54
Did you ever commit?
On 07/07/2011 01:58 PM, Gabriele Kahlout wrote:
so, how about this:
Document doc = searcher.doc(i);  // get the doc
doc.removeField("wc");           // remove the field in case there's
addWc(doc, docLength);           // add the new field
writer.updateDocumen
There is a syntax that allows you to specify different analyzers to use
for indexing and querying, in schema.xml. But if you don't do that, it
should use the same analyzer in both cases.
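The syntax in question, inside a fieldType in schema.xml:
<analyzer type="index"> ... </analyzer>
<analyzer type="query"> ... </analyzer>
A single <analyzer> element with no type attribute is used for both.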
-Mike
On 07/11/2011 10:58 AM, Gabriele Kahlout wrote:
With a lucene QueryParser instance it's possible to
I think you need to list the charfilter earlier in the analysis chain;
before the tokenizer. Probably Solr should tell you this...
-Mike
On 07/25/2011 09:03 AM, Merlin Morgenstern wrote:
sounds logical. I just changed it to the following, restarted and reindexed
with commit:
Hmm - I'm not sure about that; see
https://issues.apache.org/jira/browse/SOLR-2119
On 07/25/2011 12:01 PM, Markus Jelsma wrote:
charFilters are executed first regardless of their position in the analyzer.
On Monday 25 July 2011 17:53:59 Mike Sokolov wrote:
I think you need to lis
keyword      false  false
type         word   word
startOffset  6      10
endOffset    9      13
On Monday 25 July 2011 18:07:29 Mike Sokolov wrote:
Hmm - I'm not sure about that; see
https://issues.apache.org/jira/browse/SOLR-2119
On 07/25/2011 12:01 PM, Markus Jelsma wrote:
I'm not sure I would identify stemming as the culprit here.
Do you have very large documents? If so, there is a patch for FVH
committed to limit the number of phrases it looks at; see
hl.phraseLimit, but this won't be available until 3.4 is released.
You can also limit the amount of each doc
A customer has an interesting problem: some documents will have multiple
versions. In search results, only the most recent version of a given
document should be shown. The trick is that each user has access to a
different set of document versions, and each user should see only the
most recent v
Regards,
Tomás
On Mon, Aug 1, 2011 at 10:47 AM, Mike Sokolov wrote:
A customer has an interesting problem: some documents will have multiple
versions. In search results, only the most recent version of a given
document should be shown. The trick is that each user has access to a
diffe
local test dataset (~30M docs, ~8000 groups) and my
machine. You might encounter different search times when setting
group.ngroups=true.
Martijn
2011/8/1 Mike Sokolov
Thanks, Tomas. Yes we are planning to keep a "current" flag in the most
current document. But there are cases whe
If you want to avoid re-indexing, you could consider building a synonym
file that is generated using your rule set, and then using that to
expand your queries. You'd need to get a list of all terms in your
index and then process them to generate synonyms. Actually, I don't
know how to get a
You have a few choices:
1) flatten your field structure - like your "undesirable" example, but
wouldn't you want to have the document identifier as a field value also?
2) use phrase queries to make sure the key/value pairs are adjacent
3) use a join query (see the sketch below)
That's all I can think of
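For option 3, the join query parser (Solr trunk/4.x) would look roughly
like this, with illustrative field names:
  q={!join from=parent_id to=id}key:value
i.e., run the inner query, collect parent_id from the matches, and return
the documents whose id has one of those values.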
-Mike
On
Although you weren't very clear about it, it sounds as if you want the
results to be sorted by a name that actually matched the query? In
general that is not going to be easy, since it is not something that can
be computed in advance and thus indexed.
-Mike
On 08/03/2011 10:39 AM, Olson, Ro
I don't have any experience with DIH: maybe XPathEntityProcessor doesn't
use a true XML parser?
You might want to try passing your documents through "xmllint -noent"
(basically parse and reserialize) - that should inline the characters as
UTF-8?
On 07/09/2012 03:18 PM, Michael Belenki wrote:
I think the issue here is that DIH uses Woodstox "BasicStreamReader"
(see
http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/wstx/sr/BasicStreamReader.html)
which has only minimal DTD support. It might be best to use
ValidatingStreamReader
(http://woodstox.codehaus.org/3.2.9/javadoc/com/ctc/w