Hi Koji,
Semantic Vectors is here: http://code.google.com/p/semanticvectors/
It is a project that has been around for a number of years and is used by
many people (including me:
http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html).
If you could compare and contrast word2vec
$100 for anyone who gets me a working Long.MAX_VALUE branch! ;-)
I know that for many of the SOLR-with-faceting use cases, things will
not scale to Long.MAX_VALUE documents, but there are a number of more
straightforward use cases where SOLR/Lucene will scale to
Long.MAX_VALUE. Like simple searches, small numbers o
>
> On Feb 12, 2013, at 12:26 PM, Glen Newton wrote:
>
>> Is there a page on the wiki that points out the use cases (or the
>> features) that are best suited for Lucene adoption, and those best
>> suited for SOLR adoption?
>>
>> -Glen
>>
>> On
Is there a page on the wiki that points out the use cases (or the
features) that are best suited for Lucene adoption, and those best
suited for SOLR adoption?
-Glen
On Tue, Feb 12, 2013 at 3:11 PM, Shawn Heisey wrote:
> On 2/12/2013 11:19 AM, JohnRodey wrote:
>>
>> So I have had a fair amount of
+10
On Mon, Oct 29, 2012 at 12:17 PM, Michael Della Bitta
wrote:
> As an external observer, I think the main problem is your branding.
> "Realtime Near Realtime" is definitely an oxymoron, and your ranking
> algorithm is called "Ranking Algorithm," which is generic enough to
> suggest that a. it'
rrow for a reason.
You should post to the nutch list.
If you think the nutch list is not a responsive space, post to
http://stackoverflow.com/ with the appropriate tags
nutch tagged questions: http://stackoverflow.com/questions/tagged/nutch
constructively,
Glen Newton
On Wed, May 9, 2012 at 4:55 A
"Re-Index your data" ~= Reload your data
On Wed, Apr 4, 2012 at 12:46 PM, Joseph Werner wrote:
> Hi,
>
> I'm evaluating Solr for use in a project. In the Solr FAQ under "How can I
> rebuild my index from scratch if I change my schema?" After restarting the
> server, step 5 is to "Re-Index your
millions of cores will not work...
...yet.
-glen
On Fri, Mar 9, 2012 at 1:46 PM, Lan wrote:
> Solr has no limitation on the number of cores. It's limited by your hardware,
> inodes and how many files you could keep open.
>
> I think even if you went the Lucene route you would run into same hardw
; _usually_)
- Network issues if non-local
- DB configuration (driver, etc)
If you can give more information about the above, people on this list
should be able to better indicate whether 18 hours sounds right for
your situation.
-Glen Newton
On Wed, Feb 22, 2012 at 10:14 AM, Devon Baumgarten
wrote
Please include information about your heap size (and other Java
command line arguments) as well as platform OS (version, swap size,
etc), Java version, and underlying hardware (RAM, etc) for us to better
help you.
From the information you have given, increasing your heap size should help.
Thanks,
Glen
Please show how it "doesn't work", i.e. does the application throw an
exception? If yes, please post the stack trace; if not, please be more
explicit.
Thanks,
Glen Newton
On Fri, Sep 2, 2011 at 10:35 AM, angel wrote:
> Hi below is my java program for indexing around
Please take this discussion off list.
Thanks,
Glen
On Thu, Aug 18, 2011 at 3:02 PM, Gora Mohanty wrote:
> On Fri, Aug 19, 2011 at 12:15 AM, Cupbearer wrote:
>> What are the prerequisite libraries required to get Solr to work in PHP.
>> Php.net has libxml2 and libcurlx I think (off the top of my
+1
On 7/27/11, Twomey, David wrote:
>
> Does anyone have examples of indexing SP content using the Google Connectors
> API and using SolrJ.
>
> I know Lucid Imagination has a Sharepoint connector and I have used that
> successfully.
>
> However, I would like to create a thumbnail image of PDF's
On Fri, Mar 11, 2011 at 5:26 PM, Yonik Seeley
wrote:
> That's an apples to oranges comparison - lucene is a library and solr
> is a server.
I partially agree ;-)
Lucene is a library, and Solr is an HTTP server wrapper (and more) around Lucene.
Solr also adds (all sorts of great) significant functional
I have seen little repeatable empirical evidence for the usual answer
"mostly no".
With respect: everyone in the Solr universe seems to answer this
question in the way Yonik has.
However, with a large number of requests the XML
serialization/deserialization must have some, likely significant,
impa
> This application will be built to serve many users
If this means that you have thousands of users, 1000s of VMs and/or
1000s of cores are not going to scale.
Have an ID in the index for each user, and filter using it.
Then they can see only their own documents.
Assuming that you are building an
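For example, with SolrJ the per-user restriction is just a filter query.
A minimal sketch, assuming an illustrative "userid" field, user name, and
server URL (3.x-era SolrJ API):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PerUserSearch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("title:lucene");
        // Restrict results to the requesting user's documents;
        // "userid" is an illustrative field name.
        q.addFilterQuery("userid:user42");
        QueryResponse rsp = server.query(q);
        System.out.println(rsp.getResults().getNumFound() + " hits for user42");
    }
}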
Where do you get your Lucene/Solr downloads from?
[x] ASF Mirrors (linked in our release announcements or via the Lucene website)
[] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.)
[] I/we build them from source via an SVN/Git checkout.
-Glen Newton
--
-
e of this list...
On Tue, Dec 28, 2010 at 4:15 PM, Mark wrote:
> It was due to the way I was writing to the DB using our rails application.
> Everything looked correct, but when retrieving it using the JDBC driver it was
> all mangled.
>
> On 12/27/10 4:38 PM, Glen Newton wrote:
ysql/share/mysql/charsets/ |
> +------++
> 8 rows in set (0.00 sec)
>
>
> Any other ideas? Thanks
>
>
> On 12/27/10 3:23 PM, Glen Newton wrote:
>>
>> [client]
>> > default-character-set = utf8
>> > [mysql]
>> > default-character-set=utf8
>> > [mysqld]
>> > character_set_server = utf8
>> > character_set_client = utf8
>
--
-
[client]
default-character-set = utf8
> [mysql]
> default-character-set=utf8
> [mysqld]
> character_set_server = utf8
> character_set_client = utf8
-Glen
On Mon, Dec 27, 2010 at 6:15 PM, Mark wrote:
> I tried both of those with no such luck.
>
> On 12/27/10 2:49 PM, Glen Newton wrote:
>>
1 - Verify your MySQL is set up to use UTF-8
2 - Does your JDBC connect string contain:
useUnicode=true&characterEncoding=UTF-8
See: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html
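i.e. a connect string along these lines; a minimal sketch, with host,
database, and credentials illustrative:

import java.sql.Connection;
import java.sql.DriverManager;

public class Utf8Connect {
    public static void main(String[] args) throws Exception {
        // useUnicode/characterEncoding make Connector/J talk UTF-8.
        String url = "jdbc:mysql://localhost:3306/mydb"
                   + "?useUnicode=true&characterEncoding=UTF-8";
        Connection conn = DriverManager.getConnection(url, "user", "password");
        System.out.println("connected: " + !conn.isClosed());
        conn.close();
    }
}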
Glen
http://zzzoot.blogspot.com/
On Mon, Dec 27, 2010 at 5:15 PM, Mark wrote:
> Solr: 1.4
the Lucene list.
If you have any questions, please contact me.
Thanks,
Glen Newton
http://zzzoot.blogspot.com
--> Old LuSql benchmarks:
http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html
On Thu, Dec 16, 2010 at 12:04 PM, Dyer, James wrote:
> We have ~50 lon
Does anyone know what technology they are using: http://www.indextank.com/
Is it Lucene under the hood?
Thanks, and apologies for cross-posting.
-Glen
http://zzzoot.blogspot.com
--
-
In a recent blog entry ("The MySQL “swap insanity” problem and the
effects of the NUMA architecture"
http://jcole.us/blog/archives/2010/09/28/mysql-swap-insanity-and-the-numa-architecture/),
Jeremy Cole describes a particular but common problem with large-memory
installations of MySQL on multi-core
Apologies Chris: my mistake.
-Glen
On 31 August 2010 23:27, Chris Hostetter wrote:
>
> : ?
> : The second post was relevant to the original post.
> : And even dealt with some of the questions asked in the original:
>
> The first msg with subject "Memcache for Solr" was a thread-jack of
> an exist
?
The second post was relevant to the original post.
And even dealt with some of the questions asked in the original:
Q > are there any down sides to it and difficult to implement
A > We found it wasn't feasible to cache arbitrary result sets...
?
-glen
On 31 August 2010 15:11, Chris Hostette
Liz,
I've built terabyte (1-2 TB) test Lucene indexes, but have not
reached the petabyte level, so I am not sure. Certainly there is
overhead in using the HTTP and XML marshalling/de-marshalling, which may
or may not be a critical factor for you.
Could you give more information with respect to
I was wondering if anyone has any experience using huge pages[1] to
improve SOLR (or Lucene) performance (esp on 64bit).
Some are reporting major performance gains in large, memory-intensive
applications (like EJBs)[2].
Also, ephemeral but significant performance reductions have been
solved usin
d when the
permissions change.
Does SOLR expose this kind of functionality?
-Glen Newton
http://zzzoot.blogspot.com/
http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
On 7 July 2010 00:38, RL wrote:
>
> I've a question about indexing/searching techniques in relatio
=true
- "Increase the netTimoutForStreamingResults value" from
http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/
See also:
http://lucene.472066.n3.nabble.com/Recommended-MySQL-JDBC-driver-td817458.html
-Glen Newton
http://zzzoot.blogspot.com/
On 09/06/
I have used up to 27GB of heap with no issues, with both SOLR and (just) Lucene.
-Glen Newton
http://zzzoot.blogspot.com/
On 31 March 2010 11:34, Burton-West, Tom wrote:
> Hello all,
>
> We have been running a configuration in production with 3 solr instances
> under one tomcat with 16
That discussion cites a paper via a URL:
http://doc.rero.ch/lm.php?url#16;00,43,4,20091218142456-GY/Dolamic_Ljiljana__When_Stopword_Lists_Make_the_Difference_20091218.pdf
Unfortunately when I go to this URL I get:
"L'accès à ce document est limité."
But I tracked down the paper. Here is its refe
I've also indexed a concatenation of 50k journal articles (making a
single document of several hundred MB of text) and it did not give me
an OOM.
-glen
On 16 March 2010 15:57, Erick Erickson wrote:
> Why do you think you'd hit OOM errors? How big is "very large"? I've
> indexed, as a single docum
I've run Lucene with heap sizes as large as 28GB of RAM (on a 32GB
machine, 64bit, Linux) and a ramBufferSize of 3GB. While I haven't
noticed the GC issues Mark mentioned in this configuration, I have
seen them in the ranges he discusses (on 1.6 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/i
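For reference, the buffer is set on the IndexWriter; a sketch with the
3.0-era API (index path and size illustrative):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BigBufferIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/index")),
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Buffer up to 3GB of added documents in RAM before flushing a segment.
        writer.setRAMBufferSizeMB(3072);
        // ... writer.addDocument(...) calls ...
        writer.close();
    }
}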
(In Lucene) I break the document into smaller pieces, then add each
piece to the Document field in a loop. This seems to work better, but
it will affect analysis details like term offsets.
This should work in your example.
In Lucene, you can also add the field using a Reader on the file in question
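A minimal sketch of both approaches (2.9/3.x-era Lucene; field and path
names illustrative):

import java.io.FileReader;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ChunkedFields {
    // Add one large text as several same-named field instances:
    // Lucene treats the values as one logical field at search time.
    static Document fromChunks(List<String> chunks) {
        Document doc = new Document();
        for (String chunk : chunks) {
            doc.add(new Field("text", chunk, Field.Store.NO,
                    Field.Index.ANALYZED));
        }
        return doc;
    }

    // Or hand Lucene a Reader and let it stream the file's content
    // (tokenized, not stored).
    static Document fromReader(String path) throws Exception {
        Document doc = new Document();
        doc.add(new Field("text", new FileReader(path)));
        return doc;
    }
}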
Have one thread recursing depth first down the directories & adding to
a queue (fixed size).
Have many threads reading off of the queue and doing the work.
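A minimal sketch of that producer/consumer setup (queue size and thread
count illustrative):

import java.io.File;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DirWalker {
    // Fixed-size queue: the producer blocks when the consumers fall behind.
    static final BlockingQueue<File> QUEUE = new ArrayBlockingQueue<File>(1000);

    // Producer: depth-first recursion, enqueueing regular files.
    static void walk(File dir) throws InterruptedException {
        File[] entries = dir.listFiles();
        if (entries == null) return;
        for (File f : entries) {
            if (f.isDirectory()) walk(f);
            else QUEUE.put(f); // blocks while the queue is full
        }
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 4; i++) {   // consumer pool
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            File f = QUEUE.take();
                            // ... hand f to Tika / the indexer here ...
                            System.out.println("processing " + f);
                        }
                    } catch (InterruptedException e) { /* shut down */ }
                }
            }).start();
        }
        walk(new File(args[0]));
    }
}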
-glen
http://zzzoot.blogspot.com/
2009/11/13 Peter Gabriel :
> Hello.
>
> I am on work with Tika 0.5 and want to scan a folder system about 1
I am using Semantic Vectors[1] implementation of LSA in a large scale
digital library project called Project Torngat[2]. I presented some of
the work at the European Conference on Digital Libraries (ECDL)[3], at
the Very Large Digital Libraries (VLDL) workshop[4] in September. A
pre-print of the p
2009/8/27 Fuad Efendi :
> stored="true" means that this piece of info will be stored in a filesystem.
> So that your index will contain 1Mb of pure log PLUS some info related to
> indexing itself: terms, etc.
>
> Search speed is more important than index size...
Not if you run out of space for the
tion using only the
full-text (no metadata).
For more info & howto:
http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html
Glen Newton
--
-
Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> - Original Message
>> From: Glen Newton
>> To: solr-user@lucene.apache.org
>> Sent: Thursday, July 23, 2009 5:52:43 AM
>> Subject: Re: DataImportHandler / Import from DB : one data set comes in
>> multiple rows
http://code4lib.org/files/glen_newton_LuSql.pdf
[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
Disclosure: I am the author of LuSql.
Glen Newton
http://zzzoot.blogspot.com/
2009/7/22 Chantal Ackermann :
> Hi all,
>
> this is my first post, as I am new to SOLR (some Lucene exp).
I am going to do some (large scale) indexing tests using Lucene & will
post to both this and the Lucene list.
More info on compressed pointers:
http://wikis.sun.com/display/HotSpotInternals/CompressedOops
-Glen Newton
http://zzzoot.blogspot.com/search?q=lucene
2009/7/16 Kevin Peterson :
http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
Disclosure: I am the author of LuSql.
Glen Newton
http://zzzoot.blogspot.com/
2009/7/13 Gurjot Singh :
> Hi,
> We have a solr index of size 626 MB and the number of documents indexed is
> 141810. We have configured index based spellchecker with buildOnCommit
>
Try putting all the PDF URLs into a file, download them with something
like 'wget', then index locally.
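e.g. something like this (file name illustrative):

wget -i pdf-urls.txt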
Glen Newton
http://zzzoot.blogspot.com/
2009/7/8 ahammad :
>
> Hello,
>
> I can index rich documents like pdf for instance that are on the filesystem.
> Can we use Extracti
http://zzzoot.blogspot.com/search?q=lucene
>
>
> Thanks
>
> Francis
>
>
> -Original Message-
> From: Glen Newton [mailto:glen.new...@gmail.com]
> Sent: Thursday, July 02, 2009 8:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is there any other way to load the index be
>
> -Original Message-
> From: Glen Newton [mailto:glen.new...@gmail.com]
> Sent: Wednesday, July 01, 2009 8:06 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is there any other way to load the index beside using "http"
> connection?
>
> You can
ster?
>
> You mentioned about LuSql, I am not familiar with that. Can you provide us
> the docs or something? Again I am not the database Guys, I am only the solr
> Guy. The database we have is a different box than Solr master and both are
> running linux(RedHat).
>
> Tha
You can directly load to the backend Lucene using LuSql[1]. It is
faster than Solr, sometimes as much as an order of magnitude faster.
Disclosure: I am the author of LuSql
-Glen
http://zzzoot.blogspot.com/
[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
2009/7/1 Francis Y
"Sunspot: A Solr-Powered Search Engine for Ruby"
http://www.linux-mag.com/id/7341
glen
http://zzzoot.blogspot.com/
--
-
than if the
document is accessed 99% of the time it is in the search result.
I think you could do this with 2 cores in Solr, if I understand Solr correctly.
I have also had good experience with BDB for (non-networked) document storage.
Glen Newton
http://zzzoot.blogspot.com/
2009/5/26 Peter Keane
The next version of LuSql[1] supports solutions for this kind of
issue: reading from JDBC (which may include a long and complex query)
and then writing the results to a single (flattened) JDBC table that
can subsequently be the source table for Solr. This might be helpful
for your particular issue.
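A sketch of the flattening step in (MySQL) SQL, with illustrative table
and column names:

-- One row per document: collapse the many-rows-per-article relation,
-- then point LuSql/Solr at article_flat as the source table.
CREATE TABLE article_flat AS
SELECT a.id,
       a.title,
       GROUP_CONCAT(k.keyword SEPARATOR ' ') AS keywords
FROM article a
JOIN article_keyword k ON k.article_id = a.id
GROUP BY a.id, a.title;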
Amit,
You might want to take a look at LuSql[1] and see if it may be
appropriate for the issues you have.
thanks,
Glen
[1]http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
2009/4/27 Amit Nithian :
> All,
> I have a few questions regarding the data import handler. We have some
You have not indicated how you wish to use the index (inside Solr or not).
It is possible that LuSql might be a preferable alternative to
Solr/DataImportHandler, depending on your requirements.
LuSql: http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
Disclaimer: I am the autho
s. It would need to be ported
> to a different implementation of SolrServer (the base class), one that uses
> java.net.URL. I suggest “JavaNetUrlHttpSolrServer”.
>
> ~ David Smiley
>
>
> On 4/14/09 1:13 PM, "Glen Newton" wrote:
>
> I was wondering if those m
I was wondering if those more up on SolrJ internals could take a look
at whether there are any serious gotchas with the AppEngine's Java
urlfetch with respect to SolrJ.
http://code.google.com/appengine/docs/java/urlfetch/overview.html
"The URL must use the standard ports for HTTP (80) and HTTPS (443).
Th
[I haven't yet implemented work stealing]
-glen
2009/4/9 Glen Newton :
> For Solr / Lucene:
> - use -XX:+AggressiveOpts
> - If available, huge pages can help. See
> http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
> I haven't yet followed-up with my Luce
For Solr / Lucene:
- use -XX:+AggressiveOpts
- If available, huge pages can help. See
http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
I haven't yet followed up with my Lucene performance numbers using
huge pages: the gain is 10-15% for large indexing jobs.
For Lucene:
- mu
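Putting the flags together, a launch line might look like this (heap sizes
and start.jar are illustrative; huge pages need OS configuration first):

java -XX:+AggressiveOpts -XX:+UseLargePages -Xms4g -Xmx4g -jar start.jar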
In MySQL at least, you can achieve what I think you want by
manipulating the SQL, like this:
mysql> select "foo" as Constant1, id from Article limit 10;
+-----------+----+
| Constant1 | id |
+-----------+----+
| foo       |  1 |
| foo       |
Performance comparison link:
- "Jetty vs Tomcat: A Comparative Analysis". prepared by Greg Wilkins
- May, 2008.
http://www.webtide.com/choose/jetty.jsp
2009/3/5 Erik Hatcher :
> That being said... I don't think there is a strong reason to go out of your
> way to install Tomcat and do the addition
You and your colleagues do not have infinite social
capital, and hopefully you will have no reason to be forced to spend
this capital in such an unfortunate manner in the future. :-)
sincerely,
Glen Newton
2009/3/5 Yonik Seeley :
> This morning, an apparently over-zealous marketing firm, on behalf
Also take a look at LuSql:
http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
2009/3/4 Shalin Shekhar Mangar :
> On Wed, Mar 4, 2009 at 7:32 PM, Radha C. wrote:
>
>> Hi,
>>
>> I am working in a software concern. We are having some R&D base work like
>> making use of solr search
Congrats & good luck on this new endeavour!
-Glen :-)
2009/1/26 Grant Ingersoll :
> Hi Lucene and Solr users,
>
> As some of you may know, Yonik, Erik, Sami, Mark and I teamed up with
> Marc Krellenstein to create a company to provide commercial
> support (with SLAs), training, value-add compone
If we are talking short single-term fields (like a file field that has
a single term like "foo.pdf"), then do what DBMS b-tree indexes did
a long time ago: for every field on which you want a leading wildcard,
insert it in reverse order as well. So field file:"foo.pdf" is also stored,
indexed as reverseField:"fdp.oof".
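A minimal sketch of indexing the reversed value alongside the original
(2.4-era Lucene; field names illustrative). Solr later added
ReversedWildcardFilterFactory to do this at analysis time:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ReversedField {
    static Document withReversed(String fileName) {
        Document doc = new Document();
        doc.add(new Field("file", fileName,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // A leading wildcard on "file" becomes a fast trailing
        // wildcard (a prefix query) on "reverseFile".
        String reversed = new StringBuilder(fileName).reverse().toString();
        doc.add(new Field("reverseFile", reversed,
                Field.Store.NO, Field.Index.NOT_ANALYZED));
        return doc;
    }
}

Query side: file:*.pdf becomes reverseFile:fdp.* (a prefix query).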
Depending on your requirements, using Lucene directly instead of Solr
might be appropriate.
Even in a web environment.
Not likely a popular statement on the Solr list, but one that you
should consider. :-)
-Glen
2008/12/23 Manupriya :
>
> Yes... At present I want SOLR to run within my standalone
Hello,
I am using Solr 1.4 (solr-2008-11-19) with Lucene 2.4 dropped in instead of 2.9.
I am indexing 500k records using the JDBC Data Import Request Handler.
Config:
Linux openSUSE 10.2 (X86-64)
Dual dual-core 64-bit Xeon 3GHz Dell blade, 8GB RAM
java version "1.6.0_07"
Java(TM) SE Runtim
When you say "application server", do you mean Tomcat?
If yes, I have allocated >8GB of heap to tomcat and it uses it all no
problem (64 bit Intel/64 bit Java).
-glen
2008/12/5 Jeryl Cook <[EMAIL PROTECTED]>:
> you're out of memory :).
>
> each instance of an application server you can techn
Hello,
I am putting together some performance comparisons of LuSql[1] and
Solr's Data Import Request Handler[2], JdbcDataSource[3]. I want to
make sure I am comparing apples with apples, so I would appreciate
the community's help in making sure I am doing so.
First, LuSql by default uses Lucene's St
Hi Naomi,
Try fixing your data. :-)
No, really:
1 - Sort all of your call numbers using whatever sort makes sense to you.
2 - Assign them - in your sort order - sort keys that are floats, starting:
0.01
0.02
...
1.01
1.02
...
79,999.98
79,999.99
This should ap
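A sketch of the key assignment (the plain sort is illustrative; use
whatever call-number-aware comparator makes sense for your data):

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CallNumberKeys {
    // Sort the call numbers, then assign evenly spaced float sort keys.
    static Map<String, Float> assignKeys(List<String> callNumbers) {
        Collections.sort(callNumbers);
        Map<String, Float> keys = new LinkedHashMap<String, Float>();
        for (int i = 0; i < callNumbers.size(); i++) {
            keys.put(callNumbers.get(i), (i + 1) / 100f);
        }
        return keys;
    }
}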
I have some simple indexing benchmarks comparing Lucene 2.3.1 with 2.4:
http://zzzoot.blogspot.com/2008/11/lucene-231-vs-24-benchmarks-using-lusql.html
In the next couple of days I will be running benchmarks comparing
Solr's DataImportHandler/JdbcDataSource indexing performance with
LuSql and wil
ry.java
> ./src/java/org/apache/solr/analysis/StandardTokenizerFactory.java
>
>
> Does that do it?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
>
>
> From: Glen Newton <[EMAIL PROTECTED]>
Hello,
I am looking for the Solr schema equivalent to Lucene's StandardAnalyzer.
Is it the Solr schema type:
http://issues.apache.org/jira/browse/SOLR-853
>
> On Tue, Nov 18, 2008 at 8:26 PM, Glen Newton <[EMAIL PROTECTED]> wrote:
>
>> Erik,
>>
>> Right now there is no real abstraction like DIH in LuSql. But as
>> indicated in the TODO section of the documentation, I was plan
DIH could borrow from? Or vice versa?
>
> Erik
>
>
> On Nov 17, 2008, at 11:03 PM, Glen Newton wrote:
>>
>> That said, I am very interested in making LuSql useful to the Solr
>> community as well as the broader Lucene community, so if any of you
>> can off
Hello,
I'm Glen Newton, LuSql author.
Thanks for the kind words about LuSql! :-)
I have just joined the Solr list, and while I know about Solr, I have
not used it and have only limited technical knowledge of it.
That said, I am very interested in making LuSql useful to the Solr
comm