dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2702007572
I've downloaded and moved all those data sets that were present in gradle
build files (specifically, in external-datasets.gradle). If there is anything
else I should place there, let
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2701189251
@dweiss I can't tell from above -- are there other corpora that need a home
still?
--
This is an automated message from the Apache Git Service.
To respond to the message, pleas
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2701096339
> I guess this 94GB comes from `33M x 768 x 4` bytes? Frankly I never test
with indexes > ~2M docs, but maybe there is a call for the 33M-doc index in
nightlies?
Yeah ...
msokolov commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2698449439
I guess this 94GB comes from 33M*768*4 bytes? Frankly I never test with
indexes > ~2M docs, but maybe there is a call for the 33M-doc index in
nightlies?
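The size estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes exactly 33,000,000 vectors (the comment says "33M", so the real count may differ) with 768 dimensions stored as 4-byte floats.

```python
# Rough size of a flat file of float32 vectors.
# ASSUMPTION: "33M" is taken as exactly 33,000,000 documents.
num_vectors = 33_000_000
dimensions = 768
bytes_per_value = 4  # one 32-bit float per dimension

total_bytes = num_vectors * dimensions * bytes_per_value

print(f"{total_bytes:,} bytes")            # 101,376,000,000 bytes
print(f"{total_bytes / 1e9:.1f} GB")       # 101.4 GB (decimal gigabytes)
print(f"{total_bytes / 2**30:.1f} GiB")    # 94.4 GiB (binary gigabytes)
```

The "94GB" figure matches the binary (GiB) reading; in decimal units the same file is a little over 100 GB.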
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2698418094
Hmm... 100 GB may be stretching Apache Infra's patience... I don't even know
if this bucket has a limit of some sort.
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697666568
Oooh we have an official S3 bucket to use now? I had already uploaded the
benchy corpus files to my own S3 bucket ... I think the URLs are in the
setup.py (just renamed to `init
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697675763
> [@mikemccand](https://github.com/mikemccand) would you be able to expose
the files [@dsmiley](https://github.com/dsmiley) rescued on your server?
oh, hmm, not I haven't y
rmuir commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2697599407
@dweiss we could fetch
https://whimsy.apache.org/public/public_ldap_people.json and retrieve
committer's GPG fingerprint that way?
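A minimal sketch of that idea follows. The URL is the one from the comment; everything else is an assumption for illustration: the helper names `fetch_people` and `key_fingerprints` are hypothetical, and the JSON layout (a top-level `people` object keyed by committer id, each entry carrying a `key_fingerprints` list) should be verified against the real file before relying on it.

```python
import json
import urllib.request

# URL taken from the comment above.
PEOPLE_URL = "https://whimsy.apache.org/public/public_ldap_people.json"


def fetch_people(url=PEOPLE_URL):
    """Download and parse the public people file (performs a network call)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def key_fingerprints(people_json, committer_id):
    """Return the GPG fingerprints recorded for one committer id.

    ASSUMPTION: the file looks like
    {"people": {"<id>": {"key_fingerprints": ["...", ...]}}}.
    An unknown id, or an entry without keys, yields an empty list.
    """
    person = people_json.get("people", {}).get(committer_id.strip(), {})
    return list(person.get("key_fingerprints", []))


# Offline example using made-up data in the assumed layout.
sample = {
    "people": {
        "exampleuser": {
            "name": "Example User",
            "key_fingerprints": [
                "0123 4567 89AB CDEF 0123 4567 89AB CDEF 0123 4567"
            ],
        }
    }
}
print(key_fingerprints(sample, "exampleuser"))
```

Against the live service this would be called as `key_fingerprints(fetch_people(), committer_id)`. Keeping the lookup separate from the download makes it testable without network access.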
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2696786476
There are two or three references in test files. There is one reference
remaining in releaseWizard.py:
```
key_url = "https://home.apache.org/keys/committer/%s.asc" % id.strip(
```
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2696384821
I can generate this file and make it available as a benchmark dataset. Or
would you rather give me one of your own, for consistency with your previous
results?
msokolov commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695713269
Yes, I was referring to files that can be generated with
`infer_token_vectors_cohere.py`. Maybe we take the position that users should
regenerate, but it is kind of slow and demand
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695553260
> [...] but can we attach 3G files here?
I think we can, if it makes sense to do so.
We're not supposed to abuse this service - for example by downloading 3 GB
data file
msokolov commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695517144
There are other vector data files - I think the key one that has become a
reference point is Cohere 768d trained on wikipedia-derived docs, but I'm not
sure where nightly benchmark
benwtrent commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695529583
@msokolov the Python script in luceneutil downloads from Hugging Face. Is
that the data you are talking about?
`infer_token_vectors_cohere.py`
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695433674
We now have an S3 bucket to place those benchmark/reference files on. If
you have any of these files - please let me know and perhaps make it available
to me, somehow -
rmuir commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2695438823
@dweiss
https://issues.apache.org/jira/secure/attachment/12429835/top.100k.words.de.en.fr.uk.wikipedia.2009-11.tar.bz2
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2684281563
Thanks to INFRA-26434 Lucene now has an S3 bucket we can publish those
data/test resources on. I'll try to collect these resources, upload them and
make the necessary build changes s
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2594748074
Fetched, thanks, David. I'm talking to infra about the possibilities of
storing those benchmark files somewhere on Apache services. I don't feel
comfortable uploading it to github/gi
dsmiley commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2593762704
[geonames_20130921_randomOrder_allCountries.txt.bz2](http://gofile.me/5MFBZ/edVjck97c)
297.2MB
If that works for you, I'll share the other. If it doesn't I'll share in
another
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2593500832
I filed https://issues.apache.org/jira/browse/INFRA-26434 and asked if
apache.org can be of any help here. Some of those files are too large to host
on github (even in a separate rep
dweiss commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2593450811
@mikemccand would you be able to expose the files @dsmiley rescued on your
server?
iamsanjay commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2463893766
I was trying to set up
[luceneutil](https://github.com/mikemccand/luceneutil) and ran the script.
```
python3 src/python/setup.py -download
```
It failed on one url where
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2312879477
I also aliased (CNAMEd)
[benchmarks.mikemccandless.com](https://benchmarks.mikemccandless.com/) --
GitHub pages makes this simple-ish, yay.
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2312800530
> > FYI: I clicked on a few random links and found a 404
https://mikemccand.github.io/luceneutil/analyzers.html although this page does
seem to exist on the current site
>
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2312740634
Phew, OK, I think nightly benchy is now successfully publishing
automatically to https://mikemccand.github.io/lucenenightly (using GitHub
pages). Last night's run "just worked".
msokolov commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2305436221
Nice! glad it worked.
FYI: I clicked on a few random links and found a 404
https://mikemccand.github.io/luceneutil/analyzers.html although this page does
seem to exist on t
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2305157923
A nice side effect of this is that the long running (13+ years now!) nightly
reports will be backed up via git/GitHub and no longer single sourced on my
home box, yay. And if ev
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2305153561
> I'm leaning towards a [simple GitHub pages
site](https://docs.github.com/en/pages) (thank you @msokolov for the idea)
I enabled pages for the `luceneutil` repo and pushe
mikemccand commented on issue #13647:
URL: https://github.com/apache/lucene/issues/13647#issuecomment-2304800950
Thanks @rmuir and @ChrisHegarty.
I've downloaded all my content from `home.apache.org` (Lucene benchmark
source corpora, line file docs, large vector file, etc.), so we won