Re: offsets issues with multiword synonyms since LUCENE_33

2012-08-15 Thread Konrad Lötzsch

I don't know whether this was discussed previously,
but if you tell the SynonymFilter not to break your synonyms apart (which
might be the default), the parts of the synonyms get new
word positions. So you could use a KeywordTokenizer to avoid that behaviour:




with regards,
konrad.

On 14.08.2012 18:51, Marc Sturlese wrote:

Well an example would be:
synonyms.txt:
huge,big size

Then I have the docs:
1- The huge fox attacks first
2- The big size fox attacks first

Then if I query for huge, the highlights for each document are:

1- The huge fox attacks first
2- The big size fox attacks first

The analyzer looks like this:
<fieldType name="sy_text" class="solr.TextField" positionIncrementGap="100">
  [analyzer definition stripped by the mailing-list archive]
</fieldType>

This was working with a previous version of Solr (I couldn't make it work with
3.6, 4.0-alpha, or 4.0-beta).



--
View this message in context: 
http://lucene.472066.n3.nabble.com/offsets-issues-with-multiword-synonyms-since-LUCENE-33-tp4001195p4001213.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Query regarding dataimporthandler

2012-08-15 Thread Shalin Shekhar Mangar
There is no way to do it within DataImportHandler, but you can configure
<autoCommit> in solrconfig.xml to automatically commit pending updates by
time or number of documents.
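For illustration, a minimal solrconfig.xml sketch (the thresholds are example
values):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs> <!-- commit after this many pending docs -->
      <maxTime>60000</maxTime> <!-- or after this many milliseconds -->
    </autoCommit>
  </updateHandler>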

On Tue, Aug 14, 2012 at 4:11 PM, ravicv  wrote:

> Hi,
>
> Is there any way for intermediate commits while indexing data using
> dataimport handler?
> I am using 1.4 solr version.
>
> My problem is :
>
> Sometimes while indexing huge data (about 4 GB), if any user searches the
> data while the commit process is going on after indexing, Solr sometimes
> throws a heap space error.
>
> My data before the commit operation is nearly 8 GB, but after both commit
> and optimize are done it reduces to 4 GB. I am using the full-import option.
>
> Any ideas?
>
> Thanks,
> ravichandra
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Query-regarding-dataimporthandler-tp4001098.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
> When I send a scanned pdf to extraction request
> handler, below icon appears in my Dock.
> 
> http://tinypic.com/r/2mpmo7o/6
> http://tinypic.com/r/28ukxhj/6

I found that text-extractable pdf files trigger the weird icon above too.

curl 
"http://localhost:8983/solr/update/extract?literal.id=solr-word&commit=true" -F 
"myfile=@solr-word.pdf"

I wrote a standalone Java program using Tika. When text-extracting from all 
kinds of pdf files, that weird icon pops up :)

I will ask tika-ML about this.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

 AutoDetectParser _autoParser = new AutoDetectParser();
 File file = new File("solr-word.pdf");
 BodyContentHandler textHandler = new BodyContentHandler();
 Metadata metadata = new Metadata();
 ParseContext context = new ParseContext();
 InputStream input = new FileInputStream(file);

 // Parse the PDF; Tika auto-detects the content type.
 _autoParser.parse(input, textHandler, metadata, context);

 System.out.println("text : " + textHandler.toString());
 input.close();
 // Keep the JVM alive so the Dock icon stays visible.
 while (true) { }


Re: scanned pdf with solr cell

2012-08-15 Thread Paul Libbrecht
Ahmet,

the Dock icon appears when AWT starts, e.g. when a font is loaded.
You can prevent it using headless mode, but this is likely to trigger an
exception. The same happens if your user is not logged in to the UI.

Hope it helps.

Paul

On 15 Aug 2012, at 01:30, Ahmet Arslan wrote:

> Hi All,
> 
> I have set of rich documents. Some of them are scanned pdf files. When I send 
> a scanned pdf to extraction request handler, below icon appears in my Dock.
> 
> http://tinypic.com/r/2mpmo7o/6
> http://tinypic.com/r/28ukxhj/6
> 
> Does anyone know what this is?
> 
> curl 
> "http://localhost:8983/solr/documents/update/extract?literal.ID=ticaret_sicil_gazetesi&literal.URL=ticaret_sicil_gazetesi&commit=true"
>  -F "myfile=@ticaret_sicil_gazetesi.pdf"
> 
> No exception is seen on solr logs. Doc is indexed, content field is: 
> 
> xmpTPg:NPages 4   Creation-Date 2011-08-24T13:03:16Z   stream_source_info 
> myfile   created Wed Aug 24 16:03:16 EEST 2011   stream_content_type 
> application/octet-stream   stream_size 2302337   producer Image Recognition 
> Integrated Systems, Autoformat5,0,0,229   stream_name 
> ticaret_sicil_gazetesi.pdf   Content-Type application/pdf   creator I.R.I.S.  
>  page page page page 
> 
> Environment: solr-trunk, Mac OS X Version 10.7.4, Java HotSpot(TM) 64-Bit 
> Server VM (build 20.8-b03-424, mixed mode), jetty.
> 
> Same thing happens with Solr 4.0-beta and Tomcat too.
> 
> Thanks,



Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
> the dock icon appears when AWT starts, e.g. when a font is
> loaded.
> You can prevent it using the headless mode but this is
> likely to trigger an exception.
> Same if your user is not UI-logged-in.

Hi Paul, thanks for the explanation. So is it nothing to worry about?



Re: SOLR3.6:Field Collapsing/Grouping throws OOM

2012-08-15 Thread Tirthankar Chatterjee
Hi Erick,
You are so right on the memory calculations. I am happy that I now know I
was doing something wrong. Yes, I am getting confused with SQL.

I will back up and let you know the use case. I am tracking file versions, and
I want to give an option to browse your system for the latest files. So in
order to remove dups (same filename) I used grouping.

Also, when you say sharding, is it okay if I do multiple cores, and does it
mean that each core needs a separate Tomcat? I meant to say, can I use the
same machine? 150 million docs have 120 million unique paths too.

One more thing: if I need sharding and need a new box, then it won't be great,
because this system still has horsepower left which I can use.

Thanks a ton for explaining the issue.

Erick Erickson  wrote:


You're putting a lot of data on a single box, then
asking to group on what I presume is a string
field. That's just going to eat up a _bunch_ of
memory.

Let's say your average file name is 16 bytes long. Each
unique value will take up 58 + 32 bytes (58 bytes
of overhead, I'm presuming Solr 3.x, and 16*2 bytes
for the chars). So we're up to 90 bytes/string times the number
of distinct file names. Say you have, for argument's
sake, 100M distinct file names. You're up to a 9G
memory requirement for sorting alone. Solr's
sorting reads all the unique values into memory whether
or not they satisfy the query...

And Grouping can also be expensive. I don't think
you really want to group in this case, I'd simply use
a filter query something like:
fq=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"

Then you're also grouping on conv_sort which doesn't
make much sense, do you really want individual results returned
for _each_ file name?

What it looks like to me is you're confusing SQL with
solr search and getting into bad situations...

Also, 150M documents in a single shard is...really a lot.
You're probably at a point where you need to shard. Not
to mention that your 400G index is trying to be jammed
into 12G of memory.

This actually feels like an XY problem, can you back
up and let us know what the use-case you're
trying to solve is? Perhaps there are less memory-
consumptive solutions possible.

Best
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee
 wrote:
> Editing the query...remove  I don't know where it came from while 
> I did copy/paste
>
> Tirthankar Chatterjee  wrote:
>
>
> Hi,
> I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses Solr 3.6),
> 2 Intel Xeon 64-bit server processors, and a 30TB HDD. JDK 1.7.0_03 x64.
>
>
> Data Index Dir Size: 400GB
> Metadata of files is stored in it. I have around 15 schema fields.
> Total number of items:150million approx.
>
> I have a scenario which I will try to explain to the best of my knowledge 
> here:
>
> Let us consider the fields I am interested in
>
> Url: Entire path of a file in windows file system including the filename. 
> ex:C:\Documents\A.txt
> mtm: Modified Time of the file
> Jid:JOb ID
> conv_sort is string field type where the filename is stored.
>
> I run a job where the following gets inserted
>
> Total Items:2
> Url:C:\personal\A1.txt
> mtm:08/14/2012 12:00:00
> Jid:1
> Conv_sort:A1.txt
> ---
> Url:C:\personal\B1.txt
> mtm:08/14/2012 12:01:00
> Jid:1
> Conv_sort:B1.txt
> In the second run only one item changes:
>
> Url:C:\personal\A1.txt
> mtm:08/15/2012 1:00:00
> Jid:2
> Conv_sort=A1.txt
>
> When queried I would like to return the latest A1.txt and B1.txt back to the
> end user. I am trying to use grouping with no luck. It keeps throwing OOM…
> can someone please help… as it is critical for my project.
>
> The query I am trying: under a folder there are 1000 files, and I am putting
> a filter query param too, asking it to group by filename or URL, and none of
> them work… what am I doing wrong here?
>
>
> http://172.19.108.78:8080/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&group.query=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"&group=true&group.limit=1&group.field=conv_sort&group.ngroup=true
>
>
> The stack trace:
>
>
> SEVERE: java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Unknown Source)
> at java.lang.String.<init>(Unknown Source)
> at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:882)
> at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:233)
> at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:856)
> at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.setNextReader(TermFirstPassGroupingCollector.java:74)
> at org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)
> at org.

Re: scanned pdf with solr cell

2012-08-15 Thread Paul Libbrecht


On 15 Aug 2012, at 13:03, Ahmet Arslan wrote:

> Hi Paul, thanks for the explanation. So is it nothing to worry about?

it is nothing to worry about except to remember that you can't run this step in 
a daemon-like process.
(on Linux, I had to set-up a VNC-server for similar tasks)

paul

Re: Switch from Sphinx to Solr - some basics please

2012-08-15 Thread Ahmet Arslan
> Because I have posted on Stack Overflow, I don't want there to be duplicate
> questions. Can you please read this post:
> 
> http://stackoverflow.com/questions/11956608/sphinx-user-is-switching-to-solr

Your questions require Sphinx knowledge. I suggest you read these books:
http://lucene.apache.org/solr/books.html
http://www.manning.com/hatcher3/

"I have in Sphinx: min_word_len ... How to use this in Solr?"

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/#solr.LengthFilterFactory
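For illustration, a minimal analyzer sketch using LengthFilterFactory; the
field-type name and the min/max values here are example placeholders:

  <fieldType name="text_minlen" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- drops tokens shorter than 3 or longer than 50 characters,
           roughly what Sphinx's min_word_len does for the lower bound -->
      <filter class="solr.LengthFilterFactory" min="3" max="50"/>
    </analyzer>
  </fieldType>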


Re: Switch from Sphinx to Solr - some basics please

2012-08-15 Thread nnikolay
Hi iorixxx, thanks for the reply.

Well, you don't need Sphinx knowledge to answer my questions.

I have written what I want:

1. I need to have two separate indexes. On Stack Overflow I got the answer
that I need to start 2 cores, for example. How many cores can I run in Solr? I
have, for example, over 100 different indexes that should be seen as separate
data. These indexes should be reindexed at different times, and their data
should not be mixed with each other.

You need to understand the following situation:

I have, for example, jobs from country A, jobs from country B, and so on up to
100 countries. I need a separate index for each country, because if someone
searches for jobs in country A I need to query only the index for country A.
How do I solve this problem?

How do I do this? Is there a good tutorial? It is explained very badly in the
Solr wiki.

2. When I get new data, for example: should I rotate the whole index again, or
can I insert the new rows and delete the old rows? What is your suggestion?

Thanks
Nik



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Switch-from-Sphinx-to-Solr-some-basics-please-tp4001234p4001379.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to design index for related versioned database records

2012-08-15 Thread Stefan Burkard
Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables and in each table the
records are versioned - this means that one "real" record can exist
multiple times in a table, each with different validFrom/validUntil
dates. Therefore it is possible to query the valid version of a record
for a given point in time.

The relations of the data are something like this:
Employee <-> LinkTable (=Employment) <-> Employer <-> LinkTable
(=offered services) <-> Service

That means I have data across 5 relations, each of them with versioned records.

***Search needs***
Now I need to be able to search for employees and employers based on
the services they offer for a given point in time.

Therefore I have built an index of all employees and employers with
their services as subentity. So I have one index entry for every
version of every employee/employer and each version collects the
offered services for the given timeframe of the employee/employer
version.

Problem: The offered services of an employee/employer can change
during its validity period. That means I do not only need to take the
version timespan of the employee/employer into account but also the
version timespans of services and the link-tables.

***Question***
I think I could continue with my strategy of having an index entry for an
employee/employer with its services for any given point in time. But
there would be many more entries than now, since every involved
validFrom/validUntil period (if they overlap) produces more entries.
I am not sure if this is a good strategy, or if it would be better
to try to index the whole data structure in another way.

Are there any recommendations how to handle such a case?

Thanks for any help
Stephan


Re: Switch from Sphinx to Solr - some basics please

2012-08-15 Thread Ahmet Arslan

> 1. I need to have two separate indexes. On Stack Overflow I got the answer
> that I need to start 2 cores, for example. How many cores can I run in
> Solr?

Please see : http://search-lucene.com/m/6rYti2ehFZ82


> I have, for example, jobs from country A, jobs from country B, and so on up
> to 100 countries. I need a separate index for each country, because if
> someone searches for jobs in country A I need to query only the index for
> country A. How do I solve this problem?
> How do I do this? Is there a good tutorial? It is explained very badly in
> the Solr wiki.

http://wiki.apache.org/solr/MultipleIndexes talks about different solutions. 
One big index with fq is an option too.

> 2. When I get new data, for example: should I rotate the whole index
> again, or can I insert the new rows and delete the old rows? What is your
> suggestion?

I don't understand this. What do you mean by rotate the whole index?


Re: RAMDirectoryFactory bug

2012-08-15 Thread Michael Della Bitta
Hi, Lance,

Thanks for your reply!

It seems as if RAMDirectoryFactory is being passed the correct path to
the index, as it's being logged correctly. It just doesn't recognize
it as an index.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Tue, Aug 14, 2012 at 9:57 PM, Lance Norskog  wrote:
> I can't remember the property name, but there is a Solr Java property
> that tells where to hunt for the data/ directory. You might be able to
> work around this bug using that property.
>
> On Tue, Aug 14, 2012 at 1:34 PM, Michael Della Bitta
>  wrote:
>> Hi everyone,
>>
>> It looks like I found a bug with RAMDirectoryFactory (I know, I know...)
>>
>> It doesn't seem to be able to load files off the disk. Every time it
>> starts up, it logs:
>>
>> WARNING: [] Solr index directory 'solr/./data/index' doesn't exist.
>> Creating new index...
>>
>> Even if that filesystem path exists and there's a valid index there
>> (verified by switching back to StandardDirectoryFactory).
>>
>> I experienced this first on our infrastructure on AWS, but I confirmed
>> this by downloading the Solr 3.6.1 distribution fresh, indexing the
>> exampledocs, stopping Jetty and reconfiguring for RAMDirectoryFactory,
>> and restarting Jetty. The statement above gets logged, but otherwise
>> the core comes up OK, but empty.
>>
>> Should I file a bug?
>>
>> Michael Della Bitta
>>
>> 
>> Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
>> www.appinions.com
>> Where Influence Isn’t a Game
>
>
>
> --
> Lance Norskog
> goks...@gmail.com


Re: How to design index for related versioned database records

2012-08-15 Thread Jack Krupansky
The date checking can be implemented using a range query as a filter query, 
such as


&fq=startDate:[* TO NOW] AND endDate:[NOW TO *]

(You can also use an "frange" query.)

Then you will have to flatten the database tables. Your Solr schema would 
have a single "merged" record type. You will have to decide whether the 
different record types (tables) will have common fields versus static 
qualification by adding a prefix or suffix, e.g., "name" vs. "employee_name" 
and "employer_name". The latter has the advantage that you do not have to 
separately specify a table "type" field since the fields would be empty for 
records of other types.
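For illustration, a sketch of two flattened documents using the prefixed-field
approach; all field names and values here are hypothetical:

  <add>
    <doc>
      <field name="id">employee-42-v3</field>
      <field name="employee_name">Jane Doe</field>
      <field name="service">plumbing</field>
      <field name="startDate">2012-01-01T00:00:00Z</field>
      <field name="endDate">2012-06-30T00:00:00Z</field>
    </doc>
    <doc>
      <field name="id">employer-7-v1</field>
      <field name="employer_name">Acme AG</field>
      <field name="service">plumbing</field>
      <field name="startDate">2012-01-01T00:00:00Z</field>
      <field name="endDate">2013-01-01T00:00:00Z</field>
    </doc>
  </add>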


-- Jack Krupansky

-Original Message- 
From: Stefan Burkard

Sent: Wednesday, August 15, 2012 8:12 AM
To: solr-user@lucene.apache.org
Subject: How to design index for related versioned database records

Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables and in each table the
records are versioned - this means that one "real" record can exist
multiple times in a table, each with different validFrom/validUntil
dates. Therefore it is possible to query the valid version of a record
for a given point in time.

The relations of the data are something like this:
Employee <-> LinkTable (=Employment) <-> Employer <-> LinkTable
(=offered services) <-> Service

That means I have data across 5 relations, each of them with versioned 
records.


***Search needs***
Now I need to be able to search for employees and employers based on
the services they offer for a given point in time.

Therefore I have built an index of all employees and employers with
their services as subentity. So I have one index entry for every
version of every employee/employer and each version collects the
offered services for the given timeframe of the employee/employer
version.

Problem: The offered services of an employee/employer can change
during its validity period. That means I do not only need to take the
version timespan of the employee/employer into account but also the
version timespans of services and the link-tables.

***Question***
I think I could continue with my strategy of having an index entry for an
employee/employer with its services for any given point in time. But
there would be many more entries than now, since every involved
validFrom/validUntil period (if they overlap) produces more entries.
I am not sure if this is a good strategy, or if it would be better
to try to index the whole data structure in another way.

Are there any recommendations how to handle such a case?

Thanks for any help
Stephan 



Re: scanned pdf with solr cell

2012-08-15 Thread Michael Della Bitta
You can try passing -Djava.awt.headless=true as one of the arguments
when you start Jetty to see if you can get this to go away with no ill
effects.

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 15, 2012 at 7:07 AM, Paul Libbrecht  wrote:
>
>
> On 15 Aug 2012, at 13:03, Ahmet Arslan wrote:
>
>> Hi Paul, thanks for the explanation. So is it nothing to worry about?
>
> it is nothing to worry about except to remember that you can't run this step 
> in a daemon-like process.
> (on Linux, I had to set-up a VNC-server for similar tasks)
>
> paul


Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
> You can try passing
> -Djava.awt.headless=true as one of the arguments
> when you start Jetty to see if you can get this to go away
> with no ill
> effects.

I started jetty using 'java -Djava.awt.headless=true -jar start.jar' and 
successfully indexed two pdf files. That icon didn't appear :) Thanks! 


Re: RAMDirectoryFactory bug

2012-08-15 Thread Mark Miller

On Aug 14, 2012, at 4:34 PM, Michael Della Bitta 
 wrote:

> Hi everyone,
> 
> It looks like I found a bug with RAMDirectoryFactory (I know, I know...)
> 

Fair warning - RAMDir use in Solr is like a third-class citizen. You probably 
should be using the mmap dir anyway.
See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
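For reference, switching directory implementations is a one-line change in
solrconfig.xml; a sketch of the standard directoryFactory element:

  <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>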

> It doesn't seem to be able to load files off the disk. Every time it
> starts up, it logs:
> 
> WARNING: [] Solr index directory 'solr/./data/index' doesn't exist.
> Creating new index...
> 
> Even if that filesystem path exists and there's a valid index there
> (verified by switching back to StandardDirectoryFactory).

I think it *should* work how you want, so it does sound like a bug, perhaps.

> 
> I experienced this first on our infrastructure on AWS, but I confirmed
> this by downloading the Solr 3.6.1 distribution fresh, indexing the
> exampledocs, stopping Jetty and reconfiguring for RAMDirectoryFactory,
> and restarting Jetty. The statement above gets logged, but otherwise
> the core comes up OK, but empty.
> 
> Should I file a bug?

Sure.

> 
> Michael Della Bitta
> 
> 
> Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
> www.appinions.com
> Where Influence Isn’t a Game

- Mark Miller
lucidimagination.com













Re: RAMDirectoryFactory bug

2012-08-15 Thread Michael Della Bitta
Yes, moving to mmap was on our roadmap. I'm in the middle of moving
our infrastructure from 1.4 to 3.6.1, and didn't want to make too many
changes at the same time. However, this bug might push us over the
edge to mmap and away from ram.

I'll file a bug regardless.

Thanks!

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 15, 2012 at 9:05 AM, Mark Miller  wrote:
>
> On Aug 14, 2012, at 4:34 PM, Michael Della Bitta 
>  wrote:
>
>> Hi everyone,
>>
>> It looks like I found a bug with RAMDirectoryFactory (I know, I know...)
>>
>
> Fair warning - RAMDir use in Solr is like a third class citizen. You probably 
> should be using the mmap dir anyway.
> See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
>> It doesn't seem to be able to load files off the disk. Every time it
>> starts up, it logs:
>>
>> WARNING: [] Solr index directory 'solr/./data/index' doesn't exist.
>> Creating new index...
>>
>> Even if that filesystem path exists and there's a valid index there
>> (verified by switching back to StandardDirectoryFactory).
>
> I think it *should* work how you want, so it does sound like a bug, perhaps.
>
>>
>> I experienced this first on our infrastructure on AWS, but I confirmed
>> this by downloading the Solr 3.6.1 distribution fresh, indexing the
>> exampledocs, stopping Jetty and reconfiguring for RAMDirectoryFactory,
>> and restarting Jetty. The statement above gets logged, but otherwise
>> the core comes up OK, but empty.
>>
>> Should I file a bug?
>
> Sure.
>
>>
>> Michael Della Bitta
>>
>> 
>> Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
>> www.appinions.com
>> Where Influence Isn’t a Game
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>


RE: Solr 4.0 - Join performance

2012-08-15 Thread David Smiley (@MITRE.org)
You would index rectangles of 0 height that have a left edge 'x' of the
start time and a right edge 'x' of your end time.  You can index a variable
number of these per Solr document and then query by either a point or
another rectangle to find documents which intersect your query shape.  It
can't do a "completely within" query, just intersection for now.  I
really look forward to seeing this wrapped up in some sort of RangeFieldType
so that users don't have to think in spatial terms.  



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-Join-performance-tp3998827p4001404.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index not loading

2012-08-15 Thread Jonatan Fournier
On Tue, Aug 14, 2012 at 5:37 PM, Jonatan Fournier
 wrote:
> On Tue, Aug 14, 2012 at 10:25 AM, Erick Erickson
>  wrote:
>> This is quite odd, it really sounds like you're not
>> actually committing. So, some questions.
>>
>> 1> What happens if you search before you shut
>> down your tomcat? Do you see docs then? If so,
>> somehow you're doing soft commits and never
>> doing a hard commit.

Yeah, I just realized the behavior is the same as softCommit. Is it the
default for commitWithin?

Cheers,

/jonathan
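For reference, a minimal SolrJ sketch of the pattern being discussed:
documents added with commitWithin, followed by one explicit hard commit before
shutdown. The class name and field values are hypothetical.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public final class LoadAndCommit {
    static void load(SolrServer server)
            throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");  // hypothetical field/value
        server.add(doc, 60000);       // commitWithin, in milliseconds
        server.commit();              // hard commit before shutdown
    }
}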

>>
>> 2> What happens if, as the last statement in your SolrJ
>> program you do a commit()?
>
> When using commitWithin, if I introduce server.commit() within the
> data-load process, the data gets committed (I didn't reproduce this with my
> 89G of data...). If I shut down my EmbeddedServer and restart it and
> send a commit, like on Tomcat, all data gets wiped out too. So I guess
> there's state loss somewhere.
>
> Cheers,
>
> /jonathan
>
>>
>> 3> While you're indexing, what do you see in your index
>> directory? You should see multiple segments being
>> created, and possibly merged so the number of
>> files should go up and down. If you only have a single
>> set of files, you're somehow not doing a commit.
>>
>> 4> Is there something really silly going on like your
>> restart scripts delete the index directory? Or you're
>> using a VM that restores a blank image?
>>
>> 5> When you do restart, are there any files at all
>> in your index directory?
>>
>> I really suspect you've got some configuration problem
>> here
>>
>> Best
>> Erick
>>
>>
>>
>> On Mon, Aug 13, 2012 at 9:11 AM, Jonatan Fournier
>>  wrote:
>>> Hi,
>>>
>>> I'm using Solr 4.0.0-ALPHA and the EmbeddedSolrServer.
>>>
>>> Within my SolrJ application, the documents are added to the server
>>> using the commitWithin parameter (in my case 60s). After 1 day my 125
>>> millions document are all added to the server and I can see 89G of
>>> index data files. I stop my SolrJ application and reload my Solr
>>> instance in Tomcat.
>>>
>>> From the Solr admin panel related to my Core (collection1) I see this info:
>>>
>>>
>>> Last Modified:
>>> Num Docs:0
>>> Max Doc:0
>>> Version:1
>>> Segment Count:0
>>> Optimized: (green check)
>>> Current:  (green check)
>>> Master:
>>> Version: 0
>>> Gen: 1
>>> Size: 88.14 GB
>>>
>>>
>>> From the general Core Admin panel I see:
>>>
>>> lastModified:
>>> version:1
>>> numDocs:0
>>> maxDoc:0
>>> optimized: (red circle)
>>> current: (green check)
>>> hasDeletions: (red circle)
>>>
>>> If I query my index for *:* I get 0 result. If I trigger optimize it
>>> wipes ALL my data inside the index and reset to empty. I've played
>>> around my EmbeddedServer initially using autoCommit/softCommit and it
>>> was working fine. Now that I've switched to commitWithin the document
>>> add query, it always do that! I'm never able to reload my index within
>>> Tomcat/Solr.
>>>
>>> Any idea?
>>>
>>> Cheers,
>>>
>>> /jonathan


Re: Switch from Sphinx to Solr - some basics please

2012-08-15 Thread Walter Underwood
These do require some Sphinx knowledge. I could answer them on StackOverflow 
because I converted Chegg from Sphinx to Solr this year.

As I said there, read about Solr cores. They are independent search 
configurations and indexes within one Solr server: 
http://wiki.apache.org/solr/CoreAdmin 

For your jobs example, I would use filter queries to limit the search to a 
single country. Filter them to country:us or country:de or country:fr and you 
will only get result from that country.

Solr does not use the term "rotate" for indexes. You can delete with a query, 
so you could delete all the jobs for one country, reindex those, then commit.
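For illustration, a country filter and a per-country delete might look like
this (the field name, value, and host are hypothetical):

  http://localhost:8983/solr/select?q=engineer&fq=country:de

  <delete><query>country:de</query></delete>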

Separate cores are best when you have different kinds of data. At Chegg, we 
search books and college courses. Those are in different cores and have very 
different schemas.

wunder

On Aug 15, 2012, at 5:11 AM, nnikolay wrote:

> Hi iorixxx, thanks for the reply.
> 
> Well, you don't need Sphinx knowledge to answer my questions.
> 
> I have written what I want:
> 
> 1. I need to have two separate indexes. On Stack Overflow I got the answer
> that I need to start 2 cores, for example. How many cores can I run in Solr?
> I have, for example, over 100 different indexes that should be seen as
> separate data. These indexes should be reindexed at different times, and
> their data should not be mixed with each other.
> 
> You need to understand the following situation:
> 
> I have, for example, jobs from country A, jobs from country B, and so on up
> to 100 countries. I need a separate index for each country, because if
> someone searches for jobs in country A I need to query only the index for
> country A. How do I solve this problem?
> 
> How do I do this? Is there a good tutorial? It is explained very badly in
> the Solr wiki.
> 
> 2. When I get new data, for example: should I rotate the whole index
> again, or can I insert the new rows and delete the old rows? What is your
> suggestion?
> 
> Thanks
> Nik
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Switch-from-Sphinx-to-Solr-some-basics-please-tp4001234p4001379.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Walter Underwood
wun...@wunderwood.org





Re: Duplicated facet counts in solr 4 beta: user error

2012-08-15 Thread Erick Erickson
No problem, and thanks for posting the resolution

If you have the time and energy, anyone can edit the Wiki if you
create a logon, so any clarification you'd like to provide to keep
others from having this problem would be most welcome!

Best
Erick

On Tue, Aug 14, 2012 at 6:13 PM, Buttler, David  wrote:
> Here are my steps:
>
> 1)  Download apache-solr-4.0.0-BETA
>
> 2)  Untar into a directory
>
> 3)  cp -r example example2
>
> 4)  cp -r example exampleB
>
> 5)  cp -r example example2B
>
> 6)  cd example;  java -Dbootstrap_confdir=./solr/collection1/conf 
> -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
>
> 7)  cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar 
> start.jar
>
> 8)  cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar 
> start.jar
>
> 9)  cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar 
> start.jar
>
> 10)   cd example/exampledocs; java 
> -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
>
> http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22
> 14 results returned
>
> This is correct.  Let's try a slightly more circuitous route by running 
> through the solr tutorial first
>
>
> 1)  Download apache-solr-4.0.0-BETA
>
> 2)  Untar into a directory
>
> 3)  cd example; java  -jar start.jar
>
> 4)  cd example/exampledocs; java 
> -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
>
> 5)  kill jetty server
>
> 6)  cp -r example example2
>
> 7)  cp -r example exampleB
>
> 8)  cp -r example example2B
>
> 9)  cd example;  java -Dbootstrap_confdir=./solr/collection1/conf 
> -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
>
> 10)   cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar 
> start.jar
>
> 11)   cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar 
> start.jar
>
> 12)   cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar 
> start.jar
>
> 13)   cd example/exampledocs; java 
> -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
>
> With the same query as above, 22 results are returned.
>
> Looking at this, it is somewhat obvious that what is happening is that the 
> index was copied over from the tutorial and was not cleaned up before running 
> the cloud examples.
>
> Adding the debug=query parameter to the query URL produces the following:
> <lst name="debug">
> <str name="rawquerystring">*:*</str>
> <str name="querystring">*:*</str>
> <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
> <str name="parsedquery_toString">*:*</str>
> <str name="QParser">LuceneQParser</str>
> <arr name="filter_queries">
> <str>cat:"electronics"</str>
> </arr>
> <arr name="parsed_filter_queries">
> <str>cat:electronics</str>
> </arr>
> </lst>
>
> So, Erick's diagnosis is correct: pilot error.  However, the straightforward
> path through the tutorial and on to SolrCloud makes it easy to make this
> mistake. Maybe a small warning on the SolrCloud page would help?
>
> Now, running a delete operations fixes things:
> cd example/exampledocs;
> java -Dcommit=false -Ddata=args -jar post.jar
> "<delete><query>*:*</query></delete>"
> causes the number of results to be zero.  So, let's reload the data:
> java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
> now the number of results for our query
> http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:"electronics"
> is back to the correct 14 results.
>
> Dave
>
> PS apologizes for hijacking the thread earlier.


Re: Facet sort numeric values

2012-08-15 Thread Erick Erickson
The problem you're running into is that lexical ordering of
numeric data != numeric ordering. If you have mixed
alpha and numeric data, you may not care if the alpha
stuff is first, i.e.

asdb456
asdf490

sorts fine. Problems happen with
9jsdf
100ukel

the 100ukel comes first.

So if you have a mixed alpha and numeric situation,
you have to either live with it or normalize the numeric
data so its lexical ordering == numeric ordering. The most
common way is to left-pad numeric data to a fixed width,
i.e. rather than index asb9fg, index asb009fg. Of
course you have to know the upper limit of any number
for this to work...
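A minimal sketch of such left-padding before indexing (the pad width and class
name are arbitrary choices here):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class PadDigits {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Zero-pads every run of digits to width 6 so that lexical order
    // matches numeric order, e.g. "100ukel" -> "000100ukel".
    public static String pad(String s) {
        Matcher m = DIGITS.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb,
                String.format("%06d", Long.parseLong(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(pad("9jsdf"));   // 000009jsdf
        System.out.println(pad("100ukel")); // 000100ukel
    }
}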

Best
Erick

On Wed, Aug 15, 2012 at 12:33 AM, Aleksander Akerø
 wrote:
> Oh brilliant, didn't think of it being possible to configure that way.
>
> Had made my own "untokenized" type, so I guess it would be better for me to
> control datatype this way.
>
> Bonus question (hehe): What if these field values also contain alphanumeric
> values? E.g. "Alpha, Bravo, Omega, ... "
> How would this affect the sorting? I guess the TrieIntField is not
> applicable then.
>
> Aleksander Akerø
> @ Gurusoft AS
> Mobil: 944 89 054
> QR-Code (Kontaktinfo)
>
> -Original Message-
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: 14. august 2012 17:45
> To: solr-user@lucene.apache.org
> Subject: Re: Facet sort numeric values
>
>
> : I'm having a problem with sorting facets. I am using the facet.sort=index
> : parameter and it works fine for most of the values.
> ...
> : Example, when sorting "15, 6, 23, 7, 10, 90" it sorts like this: "10, 15,
> : 23, 6, 7, 90", but what I wanted was "6, 7, 10, 15, 23, 90".
>
> what field type are you using?
>
> If you use one of the Trie___Field types then the facet values should sort
> exactly as you describe.
>
>  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
>  <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
>  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
>  <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
>
>
>
> -Hoss
>


Re: Solr 3.5 result grouping is failing

2012-08-15 Thread Erick Erickson
Please attach the results of adding &debugQuery=on
to your query in both the success and failure case, there's
very little information to go on here. You might review:

http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Wed, Aug 15, 2012 at 12:57 AM, chethan  wrote:
> Hi,
>
> I'm trying to group (field collapse) my search results on a field called
> "site". The schema says that it has to be indexed: *<field name="site"
> type="string" stored="false" indexed="true"/>.*
> But when I try to query the results with *group.field=site&group.limit=100,*
> I see only 1 group of results being returned, and the group value is null.
> This seems to work on another Solr instance which only has a few documents
> indexed. It seems to fail on bigger indexes. Help is appreciated.
>
> Thanks
> Chethan
>
>
> Sent this message again as it seemed to bounce the first time.


Re: SOLR3.6:Field Collapsing/Grouping throws OOM

2012-08-15 Thread Erick Erickson
No, sharding into multiple cores on the same machine is still
limited by the physical memory available. It's still lots
of stuff on a limited box.

But try backing up and re-thinking the problem a bit.
Some possibilities off the top of my head:

1> have a new field "current". when you update a doc,
 reindex the old doc with current=0 and put current=
 1 in the new doc (boolean field). Getting one and
 only one is really simple.
2> Use external file fields (EFF) for the same purpose, that
 won't require you to re-index the doc. The trick
 here is you use the value in the EFF as a multiplier
 for the score (that's what function queries do). So older
 versions of the doc have scores of 0 and just don't
 show up.
3> Implement a custom collector that replaces older hits
 with newer hits. Actually I don't particularly like this
 because it would potentially replace a higher-scoring
document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach
for this problem

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee
 wrote:
> Hi Erick,
> You are so right on the memory calculations. I am happy that I now know
> I was doing something wrong. Yes, I am getting confused with SQL.
>
> I will back up and let you know the use case. I am tracking file versions,
> and I want to give an option to browse your system for the latest files. So
> in order to remove dups (same filename) I used grouping.
>
> Also, when you say sharding, is it okay if I do multiple cores, and does it
> mean that each core needs a separate Tomcat? I meant to say, can I use the
> same machine? 150 million docs have 120 million unique paths too.
>
> One more thing: if I need sharding and need a new box, then it won't be
> great, because this system still has horsepower left which I can use.
>
> Thanks a ton for explaining the issue.
>
> Erick Erickson  wrote:
>
>
> You're putting a lot of data on a single box, then
> asking to group on what I presume is a string
> field. That's just going to eat up a _bunch_ of
> memory.
>
> let's say your average file name is 16 bytes long. Each
> unique value will take up 58 + 32 bytes (58 bytes
> of overhead, I'm presuming Solr 3.x, and 16*2 bytes
> for the chars). So we're up to 90 bytes/string times the number
> of distinct file names. Say you have, for argument's
> sake, 100M distinct file names. You're up to a 9G
> memory requirement for sorting alone. Solr's
> sorting reads all the unique values into memory whether
> or not they satisfy the query...
>
> And Grouping can also be expensive. I don't think
> you really want to group in this case, I'd simply use
> a filter query something like:
> fq=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"
>
> Then you're also grouping on conv_sort which doesn't
> make much sense, do you really want individual results returned
> for _each_ file name?
>
> What it looks like to me is you're confusing SQL with
> solr search and getting into bad situations...
>
> Also, 150M documents in a single shard is...really a lot.
> You're probably at a point where you need to shard. Not
> to mention that your 400G index is trying to be jammed
> into 12G of memory.
>
> This actually feels like an XY problem, can you back
> up and let us know what the use-case you're
> trying to solve is? Perhaps there are less memory-
> consumptive solutions possible.
>
> Best
> Erick
>
> On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee
>  wrote:
>> Editing the query...remove  I don't know where it came from while 
>> I did copy/paste
>>
>> Tirthankar Chatterjee  wrote:
>>
>>
>> Hi,
>> I have a beefy box with 24Gb RAM (12GB for Tomcat7 which houses SOLR3.6)  2 
>> Processors Intel Xeon 64 bit Server, 30TB HDD. JDK 1.7.0_03 x64 bit
>>
>>
>> Data Index Dir Size: 400GB
>> Metadata of files is stored in it. I have around 15 schema fields.
>> Total number of items:150million approx.
>>
>> I have a scenario which I will try to explain to the best of my knowledge 
>> here:
>>
>> Let us consider the fields I am interested in
>>
>> Url: Entire path of a file in windows file system including the filename. 
>> ex:C:\Documents\A.txt
>> mtm: Modified Time of the file
>> Jid:JOb ID
>> conv_sort is string field type where the filename is stored.
>>
>> I run a job where the following gets inserted
>>
>> Total Items:2
>> Url:C:\personal\A1.txt
>> mtm:08/14/2012 12:00:00
>> Jid:1
>> Conv_sort:A1.txt
>> ---
>> Url:C:\personal\B1.txt
>> mtm:08/14/2012 12:01:00
>> Jid:1
>> Conv_sort:B1.txt
>> In the second run only one item changes:
>>
>> Url:C:\personal\A1.txt
>> mtm:08/15/2012 1:00:00
>> Jid:2
>> Conv_sort=A1.txt
>>
>> When queried I would like to return the latest A1.txt and B1.txt back to the 
>> end user. I am trying to use grouping with no luck. It keeps throwing OOM… 
>> can someone please help… as it is critical for my project
>>
>> The query I am trying is un

Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j

2012-08-15 Thread David Smiley (@MITRE.org)
Hey solr-user, are you by chance indexing LineStrings?  That is something I
never tried with this spatial index.  Depending on which iteration of LSP
you are using, I figure you'd either end up indexing a vast number of points
along the line which would be slow to index and make the index quite big, or
you might end up with a geohash granularity that will look more like a very
blocky (i.e. pixelated) approximation of the line that is much coarser and
will thus trigger searches "near" the line to match the line.  I don't have
this use-case in my work so I haven't put that much thought into handling
lines -- I just do points & polygons & circles & rects.
~ David



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-s-re-lucene-spatial-toolkit-aka-LSP-aka-spatial4j-tp3997757p4001486.html
Sent from the Solr - User mailing list archive at Nabble.com.


Does DataImportHandler do any sanitizing?

2012-08-15 Thread Jon Drukman
I am pulling some fields from a mysql database using DataImportHandler and
some of them have invalid XML in them.  Does DataImportHandler do any kind
of filtering/sanitizing to ensure that it will go in OK or is it all on me?

Example bad data:  orphaned ampersands ("Peanut Butter & Jelly"), curly
quotes ("we’re")

-jsd-


Re: Does DataImportHandler do any sanitizing?

2012-08-15 Thread Michael Della Bitta
Hi, Jon,

As far as I know, DataImportHandler doesn't transfer data to the rest
of Solr via XML so it shouldn't be a problem...

Michael Della Bitta


Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman  wrote:
> I am pulling some fields from a mysql database using DataImportHandler and
> some of them have invalid XML in them.  Does DataImportHandler do any kind
> of filtering/sanitizing to ensure that it will go in OK or is it all on me?
>
> Example bad data:  orphaned ampersands ("Peanut Butter & Jelly"), curly
> quotes ("we’re")
>
> -jsd-


custom complex field - PolyField

2012-08-15 Thread Leonardo Souza
Hi,

I have to index a tuple like ('blah', 'more blah info') in a multivalued
field type.
I have read about the PolyField type and it seems the best solution so far,
but I can't find documentation on how to use or implement a custom
field type.
Any help is appreciated.


--
Leonardo S Souza


solr.xml entries got deleted when powered off

2012-08-15 Thread vempap
Hello,

  I created an index; all the schema.xml & solrconfig.xml files are
created with content (I checked that they have content in the XML files).
But if I power off the system & restart again, the contents of the files
are gone. They are like 0-byte files.

Even the solr.xml file, which got updated when I created a new index (with a
core), has 0 bytes, and all the previous entries are lost too.

I'm using Solr 4.0.

Does anyone have any idea about the scenarios where this might happen?

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr.xml entries got deleted when powered off

2012-08-15 Thread Leonardo Souza
Just guessing:
disk full?

--
Regards,
Leonardo S Souza




2012/8/15 vempap 

> Hello,
>
>   I created an index => all the schema.xml & solrconfig.xml files are
> created with content (I checked that they have contents in the xml files).
> But, if I poweroff the system & restart again - the contents of the files
> are gone. It's like 0 bytes files.
>
> Even, the solr.xml file which got updated when I created a new index (with
> a
> core) has 0 bytes & all the previous entries are lost too.
>
> I'm using Solr 4.0
>
> Does anyone have any idea about the scenarios where this might happen?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr.xml entries got deleted when powered off

2012-08-15 Thread vempap
Nope... there is a good amount of space left on disk.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001502.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr.xml entries got deleted when powered off

2012-08-15 Thread vempap
It's happening when I'm not doing a clean shutdown. Are there any more
scenarios where it might happen?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001503.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: solr.xml entries got deleted when powered off

2012-08-15 Thread Buttler, David
You are not putting these files in /tmp, are you?  That is sometimes wiped by 
different OSes on shutdown.


-Original Message-
From: vempap [mailto:phani.vemp...@emc.com] 
Sent: Wednesday, August 15, 2012 3:31 PM
To: solr-user@lucene.apache.org
Subject: Re: solr.xml entries got deleted when powered off

It's happening when I'm not doing a clean shutdown. Are there any more
scenarios where it might happen?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001503.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: solr.xml entries got deleted when powered off

2012-08-15 Thread vempap
No, I'm not keeping them in /tmp



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-xml-entries-got-deleted-when-powered-off-tp4001496p4001506.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR3.6:Field Collapsing/Grouping throws OOM

2012-08-15 Thread Chris Hostetter

: 2> Use external file fields (EFF) for the same purpose, that
:  won't require you to re-index the doc. The trick
:  here is you use the value in the EFF as a multiplier
:  for the score (that's what function queries do). So older
:  versions of the doc have scores of 0 and just don't
:  show up.

or use it in an fq={!frange ...} to eliminate the older versions 
completely.
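For illustration, with a hypothetical external file field named "current"
holding 1 for the latest version and 0 otherwise, that might look like:

  <fieldType name="eff" class="solr.ExternalFileField" keyField="id"
             defVal="0" valType="pfloat"/>
  <field name="current" type="eff" indexed="false" stored="false"/>

  fq={!frange l=1 u=1}current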

: > I will back up and let you know the use case. I am tracking file 
: versions. And I want to give an option to browse your system for the 
: latest files. So in order to remove dups (same filename) I used 
: grouping.

based on only knowing that sentence, my starting suggestion would be to 
have two indexes: one where the filename is hte unique key, thus only the 
most current versions of files are listed, and one where there is no 
unique key (or you use whatever key you use today) that lets you do the 
full historical archive search, and query whichever index makes sense for 
each user action.



-Hoss


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Nicholas Ball

Haven't managed to find a good way to do this yet. Does anyone have any
ideas on how I could implement this feature?
Really need to move docs across from one core to another atomically.

Many thanks,
Nicholas

On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
 wrote:
> That could work, but then how do you ensure commit is called on the two
> cores at the exact same time?
> 
> Cheers,
> Nicholas
> 
> On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog 
> wrote:
>> Index all documents to both cores, but do not call commit until both
>> report that indexing worked. If one of the cores throws an exception,
>> call roll back on both cores.
>> 
>> On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
>>  wrote:
>>>
>>> Hey all,
>>>
>>> Trying to figure out the best way to perform atomic operation across
>>> multiple cores on the same solr instance i.e. a multi-core
environment.
>>>
>>> An example would be to move a set of docs from one core onto another
> core
>>> and ensure that a softcommit is done as the exact same time. If one
> were
>>> to
>>> fail so would the other.
>>> Obviously this would probably require some customization but wanted to
>>> know what the best way to tackle this would be and where should I be
>>> looking in the source.
>>>
>>> Many thanks for the help in advance,
>>> Nicholas a.k.a. incunix


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
On 2 Jul 2012, at 18:37, "Nicholas Ball" wrote:
>
>
> That could work, but then how do you ensure commit is called on the two
> cores at the exact same time?
That may need something like two-phase commit as in a relational DB. Lucene
has prepareCommit, but to implement 2PC, many things need to be done.
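For illustration, a minimal sketch of the prepare/commit pattern over two
cores' IndexWriters (assuming direct IndexWriter access; this is not full
2PC, since a crash between the two commit() calls still breaks atomicity):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;

public final class TwoCoreCommit {
    static void commitBoth(IndexWriter w1, IndexWriter w2) throws IOException {
        try {
            w1.prepareCommit();  // phase 1: flush + sync, not yet visible
            w2.prepareCommit();
        } catch (IOException e) {
            w1.rollback();       // discard prepared-but-uncommitted changes
            w2.rollback();
            throw e;
        }
        w1.commit();             // phase 2: make both commits visible
        w2.commit();
    }
}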
> Also, any way to commit a specific update rather than all the back-logged
> ones?
>
> Cheers,
> Nicholas
>
> On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog 
> wrote:
> > Index all documents to both cores, but do not call commit until both
> > report that indexing worked. If one of the cores throws an exception,
> > call roll back on both cores.
> >
> > On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
> >  wrote:
> >>
> >> Hey all,
> >>
> >> Trying to figure out the best way to perform atomic operation across
> >> multiple cores on the same solr instance i.e. a multi-core environment.
> >>
> >> An example would be to move a set of docs from one core onto another
> core
> >> and ensure that a softcommit is done as the exact same time. If one
> were
> >> to
> >> fail so would the other.
> >> Obviously this would probably require some customization but wanted to
> >> know what the best way to tackle this would be and where should I be
> >> looking in the source.
> >>
> >> Many thanks for the help in advance,
> >> Nicholas a.k.a. incunix


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
Do you really need this?
Distributed transactions are a difficult problem. In 2PC, every node can fail,
including the coordinator. Something like leader election is needed to make
sure it works; you could try ZooKeeper.
But if the transaction is not very, very important, like transferring money in
a bank, you can do it like this:
coordinator:
On 16 Aug 2012, at 07:42, "Nicholas Ball" wrote:

>
> Haven't managed to find a good way to do this yet. Does anyone have any
> ideas on how I could implement this feature?
> Really need to move docs across from one core to another atomically.
>
> Many thanks,
> Nicholas
>
> On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
>  wrote:
> > That could work, but then how do you ensure commit is called on the two
> > cores at the exact same time?
> >
> > Cheers,
> > Nicholas
> >
> > On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog 
> > wrote:
> >> Index all documents to both cores, but do not call commit until both
> >> report that indexing worked. If one of the cores throws an exception,
> >> call roll back on both cores.
> >>
> >> On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
> >>  wrote:
> >>>
> >>> Hey all,
> >>>
> >>> Trying to figure out the best way to perform atomic operation across
> >>> multiple cores on the same solr instance i.e. a multi-core
> environment.
> >>>
> >>> An example would be to move a set of docs from one core onto another
> > core
> >>> and ensure that a softcommit is done as the exact same time. If one
> > were
> >>> to
> >>> fail so would the other.
> >>> Obviously this would probably require some customization but wanted to
> >>> know what the best way to tackle this would be and where should I be
> >>> looking in the source.
> >>>
> >>> Many thanks for the help in advance,
> >>> Nicholas a.k.a. incunix
>


Re: Atomic Multicore Operations - E.G. Move Docs

2012-08-15 Thread Li Li
http://zookeeper.apache.org/doc/r3.3.6/recipes.html#sc_recipes_twoPhasedCommit

On Thu, Aug 16, 2012 at 7:41 AM, Nicholas Ball
 wrote:
>
> Haven't managed to find a good way to do this yet. Does anyone have any
> ideas on how I could implement this feature?
> Really need to move docs across from one core to another atomically.
>
> Many thanks,
> Nicholas
>
> On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball
>  wrote:
>> That could work, but then how do you ensure commit is called on the two
>> cores at the exact same time?
>>
>> Cheers,
>> Nicholas
>>
>> On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog 
>> wrote:
>>> Index all documents to both cores, but do not call commit until both
>>> report that indexing worked. If one of the cores throws an exception,
>>> call roll back on both cores.
>>>
>>> On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball
>>>  wrote:

 Hey all,

 Trying to figure out the best way to perform atomic operation across
 multiple cores on the same solr instance i.e. a multi-core
> environment.

 An example would be to move a set of docs from one core onto another
>> core
 and ensure that a softcommit is done as the exact same time. If one
>> were
 to
 fail so would the other.
 Obviously this would probably require some customization but wanted to
 know what the best way to tackle this would be and where should I be
 looking in the source.

 Many thanks for the help in advance,
 Nicholas a.k.a. incunix


Re: SOLR3.6:Field Collapsing/Grouping throws OOM

2012-08-15 Thread Tirthankar Chatterjee
Awesome, thanks a lot. I am already on it with option 1. We need to track
deletes to flip the previous version to be the current one.

Erick Erickson  wrote:


No, sharding into multiple cores on the same machine still
is limited by the physical memory available. It's still lots
of stuf on a limited box.

But try backing up and re-thinking the problem a bit.
Some possibilities off the top of my head:

1> have a new field "current". when you update a doc,
 reindex the old doc with current=0 and put current=
 1 in the new doc (boolean field). Getting one and
 only one is really simple.
2> Use external file fields (EFF) for the same purpose, that
 won't require you to re-index the doc. The trick
 here is you use the value in the EFF as a multiplier
 for the score (that's what function queries do). So older
 versions of the doc have scores of 0 and just don't
 show up.
3> Implement a custom collector that replaces older hits
 with newer hits. Actually I don't particularly like this
 because it would potentially replace a higher-scoring
document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach
for this problem

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee
 wrote:
> Hi Erick,
> You are so right on the memory calculations. I am happy that I now know
> I was doing something wrong. Yes, I am getting confused with SQL.
>
> I will back up and let you know the use case. I am tracking file versions,
> and I want to give an option to browse your system for the latest files. So
> in order to remove dups (same filename) I used grouping.
>
> Also, when you say sharding, is it okay if I do multiple cores, and does it
> mean that each core needs a separate Tomcat? I meant to say, can I use the
> same machine? 150 million docs have 120 million unique paths too.
>
> One more thing: if I need sharding and need a new box, then it won't be
> great, because this system still has horsepower left which I can use.
>
> Thanks a ton for explaining the issue.
>
> Erick Erickson  wrote:
>
>
> You're putting a lot of data on a single box, then
> asking to group on what I presume is a string
> field. That's just going to eat up a _bunch_ of
> memory.
>
> Let's say your average file name is 16 bytes long. Each
> unique value will take up 58 + 32 bytes (58 bytes
> of overhead, I'm presuming Solr 3.X, and 16*2 bytes
> for the chars). So we're up to 90 bytes/string * (number
> of distinct file names). Say you have, for argument's
> sake, 100M distinct file names. You're up to a 9G
> memory requirement for sorting alone. Solr's
> sorting reads all the unique values into memory whether
> or not they satisfy the query...
>
> And grouping can also be expensive. I don't think
> you really want to group in this case; I'd simply use
> a filter query, something like:
> fq=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"
>
> Then you're also grouping on conv_sort, which doesn't
> make much sense; do you really want individual results returned
> for _each_ file name?
>
> What it looks like to me is that you're confusing SQL with
> Solr search and getting into bad situations...
>
> Also, 150M documents in a single shard is...really a lot.
> You're probably at a point where you need to shard. Not
> to mention that your 400G index is trying to be jammed
> into 12G of memory.
>
> This actually feels like an XY problem; can you back
> up and let us know what the use-case you're
> trying to solve is? Perhaps there are less memory-
> consumptive solutions possible.
>
> Best
> Erick
>
> On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee
>  wrote:
>> Editing the query...remove  I don't know where it came from while 
>> I did copy/paste
>>
>> Tirthankar Chatterjee  wrote:
>>
>>
>> Hi,
>> I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses Solr 3.6),
>> 2 Intel Xeon 64-bit processors, a 30TB HDD, and 64-bit JDK 1.7.0_03.
>>
>>
>> Data Index Dir Size: 400GB
>> Metadata of files is stored in it. I have around 15 schema fields.
>> Total number of items: 150 million approx.
>>
>> I have a scenario which I will try to explain to the best of my knowledge 
>> here:
>>
>> Let us consider the fields I am interested in
>>
>> Url: the entire path of a file in the Windows file system, including the
>> filename, e.g. C:\Documents\A.txt
>> mtm: modified time of the file
>> Jid: job ID
>> conv_sort: a string field type where the filename is stored.
>>
>> I run a job where the following gets inserted
>>
>> Total Items:2
>> Url:C:\personal\A1.txt
>> mtm:08/14/2012 12:00:00
>> Jid:1
>> Conv_sort:A1.txt
>> ---
>> Url:C:\personal\B1.txt
>> mtm:08/14/2012 12:01:00
>> Jid:1
>> Conv_sort:B1.txt
>> In the second run only one item changes:
>>
>> Url:C:\personal\A1.txt
>> mtm:08/15/2012 1:00:00
>> Jid:2
>> Conv_sort=A1.txt
>>
>> When queried, I would like to return the latest A1.txt and B1.txt to the
>> end user. I am trying 

Re: Does DataImportHandler do any sanitizing?

2012-08-15 Thread Lance Norskog
If you want to sanitize them during indexing, the regular expression
tools can do this. You would create a regular expression that matches
bogus elements. There is a regular expression transformer in the DIH,
and a regular expression CharFilter inside the Lucene text analysis
stack.
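
As a sketch of the kind of pattern involved (the exact expression depends on
what counts as "bad" in your data), escaping orphaned ampersands might look
like this in plain Java; the same pattern could then be dropped into a DIH
RegexTransformer or a PatternReplaceCharFilterFactory:

public class AmpersandFixer {
    // Escapes ampersands that don't already begin an entity, e.g.
    // "Peanut Butter & Jelly" -> "Peanut Butter &amp; Jelly".
    public static String escapeBareAmpersands(String raw) {
        return raw.replaceAll("&(?!amp;|lt;|gt;|quot;|apos;|#\\d+;)", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("Peanut Butter & Jelly"));
    }
}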

On Wed, Aug 15, 2012 at 2:10 PM, Michael Della Bitta
 wrote:
> Hi, Jon,
>
> As far as I know, DataImportHandler doesn't transfer data to the rest
> of Solr via XML so it shouldn't be a problem...
>
> Michael Della Bitta
>
> 
> Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
> www.appinions.com
> Where Influence Isn’t a Game
>
>
> On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman  wrote:
>> I am pulling some fields from a mysql database using DataImportHandler and
>> some of them have invalid XML in them.  Does DataImportHandler do any kind
>> of filtering/sanitizing to ensure that it will go in OK or is it all on me?
>>
>> Example bad data:  orphaned ampersands ("Peanut Butter & Jelly"), curly
>> quotes ("we’re")
>>
>> -jsd-



-- 
Lance Norskog
goks...@gmail.com


MySQL Exception: Communications link failure WITH DataImportHandler

2012-08-15 Thread Jienan Duan
Hi all:
I'm using DataImportHandler to load data from MySQL.
It works fine on my development machine and in the online environment,
but I got an exception in the test environment:

> Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
> Communications link failure
>
> The last packet sent successfully to the server was 0 milliseconds ago.
> The driver has not received any packets from the server.
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at com.mysql.jdbc.Util.handleNewInstance(Util.java:406)
> at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1074)
> at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:343)
> at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2132)
> ... 26 more
>
> Caused by: java.net.ConnectException: Connection timed out
>
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> at java.net.Socket.connect(Socket.java:529)
> at java.net.Socket.connect(Socket.java:478)
> at java.net.Socket.<init>(Socket.java:375)
> at java.net.Socket.<init>(Socket.java:218)
> at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:253)
> at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:292)
> ... 27 more
>
This makes me confused, because the test env and online env are almost the
same: Tomcat runs on a Linux server with JDK6, and MySQL 5 runs on another.
I even wrote a simple JDBC test class and it works; a JSP file with JDBC code
also works. Only DataImportHandler fails.
I'm reading the Solr source code and found that Solr seems to have its
own ClassLoader. I'm not sure if it goes wrong with Tomcat under some specific
configuration.
Does anyone know how to fix this problem? Thank you very much.

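For reference, the simple standalone check mentioned above might look roughly
like this (host, database and credentials are placeholders; connectTimeout is
in milliseconds):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/test?connectTimeout=5000", "user", "pass");
        try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT 1");
            rs.next();
            System.out.println("JDBC OK: " + rs.getInt(1));
        } finally {
            conn.close();
        }
    }
}
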
Best Regards.

Jienan Duan

-- 
--
Not taking detours is itself the shortcut.
http://www.jnan.org/


RE: Facet sort numeric values

2012-08-15 Thread Aleksander Akerø
I see the problem, but normalization isn't possible here, as the upper limit
could be anything in different cases (hard to explain).
I think it is better for me to just apply the correct kind of sorting to
an array/list with some script. This is just for getting the facet values to
look pretty in a filter menu.

I knew this question was a shot in the dark, but thank you for a nice
explanation and possible solution!

Aleksander Akerø
@ Gurusoft AS
Mobil: 944 89 054 

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 15. august 2012 19:00
To: solr-user@lucene.apache.org
Subject: Re: Facet sort numeric values

The problem you're running into is that lexical ordering of numeric data !=
numeric ordering. If you have mixed alpha and numeric data, you may not
care if the alpha stuff is first, i.e.

asdb456
asdf490

sorts fine. Problems happen with
9jsdf
100ukel

the 100ukel comes first.

So if you have a mixed alpha and numeric situation, you have to either live
with it or normalize the numeric data so its lexical ordering == numeric
ordering. The most common way is to left-pad numeric data to a fixed width,
i.e. rather than index asb9fg, index asb009fg. Of course you have to
know what the upper limit of any digit is for this to work...
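
A quick sketch of that normalization (plain Java; the pad width is whatever
your data's upper limit requires):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitPadder {
    // Left-pads every digit run to a fixed width so that lexical order
    // matches numeric order, e.g. pad("asb9fg", 3) -> "asb009fg", and
    // pad("9jsdf", 3) -> "009jsdf", which now sorts before "100ukel".
    public static String pad(String value, int width) {
        Matcher m = Pattern.compile("\\d+").matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String digits = m.group();
            while (digits.length() < width) {
                digits = "0" + digits;
            }
            m.appendReplacement(sb, digits);
        }
        m.appendTail(sb);
        return sb.toString();
    }
}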

Best
Erick

On Wed, Aug 15, 2012 at 12:33 AM, Aleksander Akerø
 wrote:
> Oh brilliant, didn't think of it being possible to configure that way.
>
> I had made my own "untokenized" type, so I guess it would be better for
> me to control the datatype this way.
>
> Bonus question (hehe): What if these field values also contain 
> alphanumeric values? E.g. "Alpha, Bravo, Omega, ... "
> How would this affect the sorting? I guess the TrieIntField is not 
> applicable then.
>
> Aleksander Akerø
> @ Gurusoft AS
> Mobil: 944 89 054
>
> -Original Message-
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: 14. august 2012 17:45
> To: solr-user@lucene.apache.org
> Subject: Re: Facet sort numeric values
>
>
> : I'm having a problem with sorting facets. I am using the 
> facet.sort=index
> : parameter and it works fine for most of the values.
> ...
> : Example, when sorting "15, 6, 23, 7, 10, 90" it sorts like this: 
> "10, 15,
> : 23, 6, 7, 90", but what I wanted was "6, 7, 10, 15, 23, 90".
>
> what field type are you using?
>
> If you use one of the Trie___Field types then the facet values should 
> sort exactly as you describe.
>
> <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
> <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
> <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
> <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
>
>
>
> -Hoss
>