mikemccand commented on pull request #101:
URL: https://github.com/apache/lucene/pull/101#issuecomment-836788286


   > > I also tried to run wikibigall, which seems to require 
enwiki-20100302-pages-articles-lines.txt, but the util does not download it. 
It appears the archive should be coming from 
http://home.apache.org/~mikemccand/enwiki-20100302-pages-articles-lines.txt.bz2,
 but that URL returns a 404 now.
   > 
   > Hmm good question, I had downloaded this file years ago, I'm not sure 
where to find it nowadays. @mikemccand Do you know where to find it? Otherwise 
I'll upload mine somewhere.
   
   Egads, it is indeed missing!  I will re-upload it.  Hmm, I do not seem to 
have that exact file locally cached. Confusing ;)  I have the `-1kb` version 
(medium-sized docs), but not the `big` docs.  @jpountz could you please post it 
somewhere, maybe your `home.apache.org/~jpountz`, using `sftp`?  Then I'll 
download and copy it up to `/~mikemccand`.
   
   The nightly benchmarks use the binary form of `wikibigall`, to reduce 
thread bottlenecks when reading/parsing documents to index.  Hmm it is sampled 
from a different date (01/15/2011) ... OK I am uploading that one to 
https://home.apache.org/~mikemccand/enwiki-20110115-lines.bin (ETA ~20 minutes 
more).
   
   BTW there is a [`luceneutil` issue to re-sample the Wikipedia 
export](https://github.com/mikemccand/luceneutil/issues/91), but, alas, it is 
snagged up because the latest `enwiki` download, after converting XML -> text, 
is SMALLER than the sample we pulled 11 years ago!  Which I could not yet 
explain ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
