Re: DIH deleting documents

2013-02-23 Thread cveres
Thanks Gora. I do not have automatic optimise enabled, but I am new to
Solr, so I am not 100% familiar with all the steps that go on.

I will try your suggestion, but I was hoping first that I could get the data
straight from Solr. 

thanks, Csaba





Re: Invalid Date String Exception

2013-02-23 Thread Shawn Heisey

On 2/23/2013 12:27 AM, Raja Kulasekaran wrote:

Hi,

I got the exception *"Invalid Date String" *as I run the crawl against
webpages .
*
*
Each one use their own date format and as a developer we don't have a
control on it. Instead of throwing exception, It should suppose to convert
into a Solr based format .

Can you suggest me how do I overcome that ?


Solr doesn't do any crawling, so you must be using another piece of 
software that talks to Solr, most likely Nutch.  Date conversion will 
have to be handled in the program that feeds the data to Solr. 
Questions about Nutch will get the best support on the Nutch user mailing 
list:


http://nutch.apache.org/mailing_lists.html

I did some searching and found something that says you'd have to write 
some code:


http://stackoverflow.com/questions/10445095/nutch-solr-formatting-date-from-web-page-metadata-into-correct-solr-format

Note that if you can get a date into a Java Date object, I'm fairly sure 
that you can get the properly formatted string for Solr with this Java 
code, where dateObject is the Date in question:


// requires java.text.SimpleDateFormat and java.util.TimeZone
SimpleDateFormat formatUTC =
        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
formatUTC.setTimeZone(TimeZone.getTimeZone("UTC"));
return formatUTC.format(dateObject);

Further searching has turned up a possible plugin for handling dates, 
but it does say that it has no support for full timestamps.  This plugin 
has not been added to any released Nutch version, but the source code is 
attached to the issue:


https://issues.apache.org/jira/browse/NUTCH-1406

Thanks,
Shawn



Re: Solr 4.0 committing on an index built by another instance

2013-02-23 Thread Mark Miller
How are you doing the backup? You have to coordinate with Solr - files may be 
changing while you copy them, leading to an inconsistent index. If you want 
to do a live backup, you have to use the backup feature of the replication 
handler.
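
For reference, the backup command can be issued to the replication handler
like this (a minimal sketch using Solr 4.x SolrJ; the URL, core name and
target directory are illustrative, and error handling is omitted):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class LiveBackup {
    public static void main(String[] args)
            throws SolrServerException, IOException {
        // Ask the replication handler for a consistent snapshot
        // of the live index.
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "backup");
        params.set("location", "/var/backups/solr"); // optional target dir
        QueryRequest request = new QueryRequest(params);
        request.setPath("/replication");
        server.request(request);
        server.shutdown();
    }
}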

- Mark

On Feb 23, 2013, at 3:54 AM, Prakhar Birla  wrote:

> Hi,
> 
> We use Solr 4.0 for our main searcher, so it is a very vital part. We have
> set up a process called "index reassurance" which ensures that all documents
> are available in Solr by comparing against our database. In short, this is
> achieved as follows: the production server (read/write) is a slave, while
> another server (write only) is the master where the indexes are built and
> replicated to the slave.
> 
> We are adding a backup/restore feature to this process, which means that the
> backup can originate from either server (while it is running) and will be
> applied to the master, after which the indexes are built and replicated.
> 
> The backup is a tar.gz copy of the core.
> 
> The problem we are facing occurs when a commit is done on data loaded from
> the backup. Here is the stack trace:
> 
> SEVERE: null:java.io.FileNotFoundException: /var/www/locationapp7078/solr/collection1/data/index.20130222154157971/_18.fnm (No such file or directory)
>   at java.io.RandomAccessFile.open(Native Method)
>   at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
>   at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:222)
>   at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
>   at org.apache.lucene.codecs.lucene40.Lucene40FieldInfosReader.read(Lucene40FieldInfosReader.java:52)
>   at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:101)
>   at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:57)
>   at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:120)
>   at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:267)
>   at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:3010)
>   at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3001)
>   at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2974)
>   at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2943)
>   at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1606)
>   at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1582)
>   at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:515)
>   at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:87)
>   at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1007)
>   at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
>   at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
>   at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:157)
>   at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1699)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:455)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:276)
>   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
>   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
>   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
>   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
>   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
>   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
>   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
>   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
>   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
>   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
>   at org.eclipse.jetty.server.handler.H

Re: Subscription to mailing list

2013-02-23 Thread Steve Rowe
Hi Ankush,

Instructions for subscribing, and other info about the Solr mailing lists, are 
here:

http://lucene.apache.org/solr/discussion.html

Steve

On Feb 23, 2013, at 5:47 AM, Ankush Puri  wrote:

> Hi,
> I started using Solr just yesterday. I must say it's really an
> awesome project. I have already explored lots of its features...!! :)
> It will be great if I can also be part of this list.
> 
> -- 
> Best,
> Ankush



Re: Solr Grouping and empty fields

2013-02-23 Thread Jilal Oussama
Thank you Daniel, it is a nice idea.

I wish there were a better solution, but it seems we will go with yours.

(still open for any other idea)
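
For reference, Daniel's suggestion below amounts to something like this in a
SolrJ indexing pipeline (a minimal sketch; the field names and the
buildDocument() helper are illustrative):

import org.apache.solr.common.SolrInputDocument;

// Back-fill the optional grouping field with the document's unique key so
// that no two documents ever share an "empty" group value.
SolrInputDocument doc = buildDocument(record);  // hypothetical helper
if (doc.getFieldValue("cluster_id") == null) {
    doc.setField("cluster_id", doc.getFieldValue("id"));
}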
On Feb 22, 2013 7:47 PM, "Daniel Collins"  wrote:

> We had something similar, to be fair: a cluster-information field which was
> unfortunately optional, so all the documents that didn't have this field
> set grouped together.
>
> It isn't Solr's fault, though: we told it to group on the values of field
> Z; null is a valid value, and lots of documents have that value, so they
> all group together.  We got what we asked for :-)
>
> Our solution was to make that field mandatory: in our indexing pipeline we
> set it to some unique value (the document key if necessary) when it isn't
> already set, ensuring that every document has that field set appropriately.
>
> -Original Message- From: Oussama Jilal
> Sent: Friday, February 22, 2013 5:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Grouping and empty fields
>
> OK, I'm sorry if I did not explain my need well. I'll try to give a
> better explanation.
>
> What I have: millions of documents that have a field X, another field Y,
> and another field Z which is not required (so it can be empty in some
> documents and not in others).
>
> What I want to do: search for docs whose field X equals something and
> group them by field Z (so that only one document is returned for every
> field Z value), BUT with all documents whose field Z is empty included in
> the results, and sort the results by field Y (so I can't split the request
> into two requests).
>
> I hope that this is clearer.
>
>
> On 02/22/2013 03:59 PM, Jack Krupansky wrote:
>
>> What?!?! You want them grouped but not grouped together?? What on earth
>> does that mean?! I mean, either they are included or they are not. All
>> results will be in some group, so where exactly do you want these "not to
>> be grouped together" documents to be grouped? In any case, please clarify
>> what your expectations really are.
>>
>> -- Jack Krupansky
>> -Original Message- From: Oussama Jilal
>> Sent: Friday, February 22, 2013 7:17 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Grouping and empty fields
>>
>> Thank you Johannes, but I want the documents that have the field empty to
>> be included in the results, just not grouped together. If I understood
>> your solution correctly, it would simply remove those documents from the
>> results. (Note: the field values are highly variable and unknown to me.)
>>
>> On 02/22/2013 02:53 PM, Johannes Rodenwald wrote:
>>
>>> Hi Oussama,
>>>
>>> If you have only a few distinct, unchanging values in the field that you
>>> group upon, you could implement a FilterQuery (query parameter "fq") and
>>> add it to the query, allowing all valid values, but not an empty field. For
>>> example:
>>>
>>> fq=my_grouping_string_field:( value_a OR value_b OR value_c OR value_d )
>>>
>>> If you use Solr 4.x, you should be able to group upon an integer field,
>>> allowing a range filter.
>>> (I still work with 3.6, which can only group on string fields, so I
>>> didn't test this one.)
>>>
>>> fq=my_grouping_integer_field:[1 TO *]
>>>
>>> --
>>> Johannes Rodenwald
>>>
>>>
>>> - Original Message -
>>> From: "Oussama Jilal" 
>>> To: solr-user@lucene.apache.org
>>> Sent: Friday, 22 February 2013 12:32:13
>>> Subject: Solr Grouping and empty fields
>>>
>>> Hi,
>>>
>>> I need to group some results in Solr based on a field, but I don't want
>>> documents that have that field empty to be grouped together. Does anyone
>>> know how to achieve that?
>>>
>>>
>>
> --
> Oussama Jilal
>
>


Re: DIH deleting documents

2013-02-23 Thread Arcadius Ahouansou
Hello Cveres.

I know you said the IDs are unique, but having a look at your config,
your id is defined as:

CONCAT(CAST('${book_chapter.title}' AS CHAR), '-', CAST(chapter AS CHAR)) AS solr_id

There are many books out there with the same title and most of them
have a chapter titled "Introduction".
So, your solr_id may or may not be unique depending on your data.
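
A more robust key would include a column that is already unique - for
example (a sketch; the book_id column is illustrative):

CONCAT(CAST(book_id AS CHAR), '-', CAST(chapter AS CHAR)) AS solr_id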

It may help to create a database view that hides the complex queries
from Solr.

You may also want to start with something simpler, such as a single
entity, and see how it goes, then add the rest.

Thanks.

Arcadius.




On 23 February 2013 10:43, cveres  wrote:
> Thanks Gora. I do not have automatic optimise enabled, but I am new to
> Solr, so I am not 100% familiar with all the steps that go on.
>
> I will try your suggestion, but I was hoping first that I could get the data
> straight from Solr.
>
> thanks, Csaba


Re: how to override pre and post tags when useFastVectorHighlighter is set to true

2013-02-23 Thread Koji Sekiguchi

Hi Alex,

(13/02/23 10:53), alx...@aim.com wrote:

Hello,

I was unable to change the pre and post tags for highlighting when 
useFastVectorHighlighter is set to true. Changing the default tags in 
solrconfig.xml works for the standard highlighter, though. I searched the 
mailing list and the net with no success.
I am using solr-4.1.0.


According to Wiki:

hl.simple.pre/hl.simple.post
http://wiki.apache.org/solr/HighlightingParameters#hl.simple.pre.2BAC8-hl.simple.post

"... Use hl.tag.pre and hl.tag.post for FastVectorHighlighter (see example under 
hl.fragmentsBuilder)"

And in the example solrconfig.xml:

  <fragmentsBuilder name="colored"
                    class="solr.highlight.ScoreOrderFragmentsBuilder">
    <lst name="defaults">
      <str name="hl.tag.pre"><![CDATA[
           <b style="background:yellow">,<b style="background:lawgreen">,
           <b style="background:aquamarine">,<b style="background:magenta">,
           <b style="background:palegreen">,<b style="background:coral">,
           <b style="background:wheat">,<b style="background:khaki">,
           <b style="background:lime">,<b style="background:deepskyblue">]]></str>
      <str name="hl.tag.post"><![CDATA[</b>]]></str>
    </lst>
  </fragmentsBuilder>

If you don't use multi-colored tags, you can simply set:

  <fragmentsBuilder name="default" default="true"
                    class="solr.highlight.ScoreOrderFragmentsBuilder">
    <lst name="defaults">
      <str name="hl.tag.pre"><![CDATA[<em>]]></str>
      <str name="hl.tag.post"><![CDATA[</em>]]></str>
    </lst>
  </fragmentsBuilder>

koji
--
http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html


Re: Interesting issue with "special characters" in a string field value

2013-02-23 Thread Jack Park
Ok. I have revisited this issue as deeply as possible using simplistic
unit tests, tossing out indexes, and starting fresh.

A typical Solr document might have a label, e.g. the string inside the
quotes: "Node Type".  According to what I've been able to read, that
would be queried as a phrase query, which means including the quotes
around the text.

When I use the admin query panel with this query:
label:"Node Type"
A fragment of the full document is returned. It is this:

<doc>
  <str name="locator">NodeType</str>
  ...
  <str name="label">Node Type</str>
  ...
</doc>

In my code using SolrJ, I have println calls: one just as the "escaped"
query string comes in, and one which shows what the SolrQuery looks like
after setting it up to go online. I then show what came back:

Solr3Client.runQuery- label:"Node Type" 0 10
Solr3Client.runQuery-1 q=label%3A%22Node+Type%22&start=0&rows=10
 {numFound=1,start=0,docs=[SolrDocument{locator=NodeType,
smallIcon=cogwheel.png, subOf=ClassType, details=The TopicQuests
typology node type., isPrivate=false, creatorId=SystemUser, label=Node
Type, largeIcon=cogwheel.png, lastEditDate=Sat Feb 23 20:43:22 PST
2013, createdDate=Sat Feb 23 20:43:22 PST 2013,
_version_=1427826019119661056}]}

What that says is that SolrQuery inserted a + inside the query string,
and that it found 1 document, but did not return it.

In the largest picture, I have returned to using XMLResponseParser on
the theory that I will now be able to take advantage of partial updates
on multi-valued fields (List<String>), but I haven't tested that yet. I
am not yet escaping such things as "<" or ">", just escaping the
reserved characters mentioned in the Solr documents.

So, the current update is this: learning about phrase queries and
judicious escaping of reserved characters seems to be helping. Next up
are two issues: more robust testing of escaped characters, and
discovering the best approach to dealing with characters that must be
escaped to get past XML, e.g. '<', '>', and others.
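
For what it's worth, SolrJ ships a helper for the query-syntax part of the
escaping (a minimal sketch; the field name and value are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;

// escapeQueryChars() backslash-escapes everything the Lucene query parser
// treats as syntax (including whitespace), so the value survives as a
// single term against a string field.
String value = "Node Type";
SolrQuery query =
        new SolrQuery("label:" + ClientUtils.escapeQueryChars(value));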

Many thanks
Jack


On Fri, Feb 22, 2013 at 2:44 PM, Jack Park  wrote:
> Michael,
> I don't think you misunderstood. I will soon give a full response here, but
> am on the road at the moment.
>
> Many thanks
> Jack
>
>
> On Friday, February 22, 2013, Michael Della Bitta
>  wrote:
>> My mistake, I misunderstood the problem.
>>
>> Michael Della Bitta
>>
>> 
>> Appinions
>> 18 East 41st Street, 2nd Floor
>> New York, NY 10017-6271
>>
>> www.appinions.com
>>
>> Where Influence Isn’t a Game
>>
>>
>> On Fri, Feb 22, 2013 at 3:55 PM, Chris Hostetter
>>  wrote:
>>>
>>> : If you're submitting documents as XML, you're always going to have to
>>> : escape meaningful XML characters going in. If you ask for them back as
>>> : XML, you should be prepared to unescape special XML characters as
>>>
>>> that still wouldn't explain the discrepancy he's claiming to see between
>>> the json & xml responses (the json containing an empty string
>>>
>>> Jack: please elaborate with specifics about your solr version, field,
>>> field type, how you indexed your doc, and what the request urls & raw
>>> responses that you get are (ie: don't trust the XML you see in your
>>> browser, it may be unescaping escaped sequences in element text to be
>>> "helpful" .. use something like curl)
>>>
>>> For example...
>>>
>>> BEGIN GOOD EXAMPLE OF SPECIFICS---
>>>
>>> I'm using Solr 4.x with the 4.x example schema, which has the following
>>> field...
>>>
>>> <field name="cat" type="string" indexed="true" stored="true"
>>>        multiValued="true"/>
>>> <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>>>        />
>>>
>>> I indexed a doc like this...
>>>
>>> $ curl "http://localhost:8983/solr/update?commit=true"; -H
>>> 'Content-type:application/json' -d '[{"id":"hoss", "cat":">> as a source node>" } ]'
>>>
>>> And this is what I get from the following requests...
>>>
>>> $ curl "http://localhost:8983/solr/select?q=id:hoss&wt=xml&indent=true&omitHeader=true"
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <response>
>>> <result name="response" numFound="1" start="0">
>>>   <doc>
>>>     <str name="id">hoss</str>
>>>     <arr name="cat">
>>>       <str>...</str>
>>>     </arr>
>>>     <long name="_version_">1427705631375097856</long>
>>>   </doc>
>>> </result>
>>> </response>
>>>
>>> $ curl "http://localhost:8983/solr/select?q=id:hoss&wt=json&indent=true&omitHeader=true"
>>> {
>>>   "response":{"numFound":1,"start":0,"docs":[
>>>   {
>>> "id":"hoss",
>>> "cat":[""],
>>> "_version_":1427705631375097856}]
>>>   }}
>>>
>>> $ curl "http://localhost:8983/solr/select?q=cat:%22%22&wt=json&indent=true&omitHeader=true"
>>> {
>>>   "response":{"numFound":1,"start":0,"docs":[
>>>   {
>>> "id":"hoss",
>>> "cat":[""],
>>> "_version_":1427705631375097856}]
>>>   }}
>>>
>>> END GOOD EXAMPLE OF SPECIFICS---
>>>
>>> : > Even more curious, if I use this query at the console:
>>> : >
>>> : > details:
>>> : >
>>> : > I get nothing back.
>>>
>>> note in my last example above the importance of using quotes (or the
>>> {!term} qparser) to query string fields that contain special characters
>>> like whitespace -- whitespace is syntactically meaningful to the lucene
>>> query parser, it separates c

From a high level query call, tell Solr / Lucene to automatically apply a leaf operator?

2013-02-23 Thread Mark Bennett
Scenario:

You're submitting a block of text as a query.

You're content to let Solr / Lucene handle query parsing and tokenization,
etc.

But you'd like ALL eventually-produced leaf nodes in the parse tree
to have:
* Boolean .MUST (effectively a + prefix)
* Fuzzy match of ~1 or ~2

In a simple application, and if there were no punctuation, you could
preprocess the query, effectively:
* split on whitespace
* for t in tokens: t = "+" + t + "~2"
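
A minimal sketch of that preprocessing in plain Java (the queryText
variable is illustrative):

// Naive client-side rewrite: require every whitespace-separated term
// and allow an edit distance of 2 on each.
StringBuilder sb = new StringBuilder();
for (String t : queryText.split("\\s+")) {
    sb.append('+').append(t).append("~2 ");
}
String rewritten = sb.toString().trim();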

But this is ugly, and even then I think things like stop words would be
messed up:
* OK in Solr:   the chair   (it can properly remove "the")
* But if this:  +the~2 +chair~2   (I'm not sure this would work)

Sure, at the application level you could also remove the stop words in the
"for t in tokens" loop, but then some other weird case would come up.
Maybe one of the field's analyzers has some other token filter you forgot
about, so you'd have to bring that logic forward as well.

(Long story of why I'd want to do all this... and I know people think
adding ~2 to all tokens will give bad results anyway, trying to fix
inherited code that can't be scrapped, etc)

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513