mikemccand commented on issue #61: URL: https://github.com/apache/lucene-jira-archive/issues/61#issuecomment-1193100921
OK I wrote a simple tool to aggregate all labels from my (nearly complete) jira dump:

```
import os
import glob
import json

with_label_count = 0  # number of issues that carry at least one label
label_count = {}      # label -> number of issues using it

for file_name in glob.glob('jira-dump/*.json'):
  with open(file_name) as f:
    d = json.load(f)
  labels = d["fields"]["labels"]
  if len(labels) > 0:
    with_label_count += 1
    #print(f'{file_name}: labels {labels}')
    for label in labels:
      label_count[label] = 1+label_count.get(label, 0)

# print labels, most frequent first
for label, count in sorted(label_count.items(), key=lambda a: -a[1]):
  print(f'{label} {count}')
```

Results:

```
patch 66
newdev 44
performance 39
newbie 30
vector-based-search 26
easyfix 25
gsoc2014 25
Java9 22
features 21
dead 19
build 17
gsoc2011 16
Java7 14
mentor 13
pull-request-available 13
documentation 13
Java8 12
maybe32blocker 11
random-chains 11
lucene-gsoc-11 11
lucene 11
github-pullrequest 11
analysis 10
IBM-J9 9
gsoc 8
search 8
facet 8
gsoc2012 7
lucene-gsoc-12 7
FastVectorHighlighter 7
patch-available 6
highlighter 6
query 6
suggester 6
docValues 6
test 6
Java11 6
similarity 5
stemmer 4
beginner 4
IndexWriter 4
incomplete_fix 4
missing_fixes 4
classification 4
index 4
sort 4
language 3
snowball 3
chinese 3
tokenization 3
compression 3
diffblue 3
queryparser 3
optimization 3
maven 3
solr 3
highlighting 3
Highlighter 3
stemming 3
fastvectorhighlighter 3
memory 3
perfomance 3
api-change 3
codec 3
bug 2
java8 2
pagination 2
sorting 2
parallelmultisearcher 2
jvm 2
rank 2
contrib 2
Documentation 2
Turkish 2
download 2
javadoc 2
hadoop 2
feature 2
blocker 2
locking 2
faceting 2
parser 2
Java10 2
booleanquery 2
regression 2
improvement 2
ICUFoldingFilterFactory 2
ready-to-commit 2
multi-word 2
synonyms 2
lock 2
release 2
filter 2
Arabic 2
highlight 2
faceted-search 2
EdgeNGramTokenFilter 2
analyzers 2
Java15 2
gsoc2013 2
searcher 2
tokenizer 2
morelikethis 2
jenkins 1
HTMLStripCharFilter 1
index, 1
iterators 1
Encoding 1
Front 1
normalize 1
null 1
codestyle 1
crush 1
multisearcher 1
span 1
synonym 1
score 1
Document 1
geo 1
join 1
DIH 1
Clarification 1
New_Users 1
Sort 1
docs 1
collator 1
ant 1
ivy 1
jar 1
javax 1
Analyzer 1
Ansj 1
plugin 1
Windows 1
antlr 1
hdfs 1
elasticsearch 1
refresh 1
static-analysis 1
scorer 1
clover 1
cache 1
explain 1
IndexReader 1
Highlighting 1
NPE 1
optimize 1
CountFacetRequest 1
LuceneFaq 1
Website 1
invalid 1
links 1
arguments/parameters 1
javadocs 1
indexing 1
soft-delete 1
ClassLoader 1
Thread 1
french 1
german 1
concurrency 1
starter 1
QueryParser 1
deprecated 1
missing 1
LZ4 1
BOM 1
Dependencies 1
IOE 1
update 1
policy 1
split 1
github-import 1
usability 1
EarlyTerminatingSortingCollector 1
paging 1
searchafter 1
sortingmergepolicy 1
spatial 1
spatialsearch 1
distance 1
geometric 1
length 1
short 1
suggest 1
lucene, 1
prefix 1
gradle-master 1
complexPhrase 1
cleanup 1
Impact 1
MultiLevelSkipList 1
SimpleTextCodec 1
discussion 1
gsoc2017 1
exception 1
interrupt 1
nio 1
classifier 1
batch 1
refactoring 1
time 1
error 1
checksum 1
double 1
float 1
int 1
long 1
numeric 1
Stemmer 1
SpanNearQuery 1
setMinimumNumberShouldMatch 1
CoreContainer 1
CoreReload 1
JMX 1
complexqueryparser 1
hang 1
NativeFSLockFactory 1
Java17 1
IDE 1
netbeans 1
applet 1
unsigned 1
grouping 1
neardup 1
CloseableThreadLocal 1
knn 1
android8.0 1
Suggestion 1
flex 1
merge 1
spatialrecursiveprefixtreefieldtype 1
fedora_12 1
tomcat 1
zstandard 1
Java13 1
Java14 1
java11 1
jdk11 1
jdk13 1
jdk14 1
jdk15 1
RegEx 1
bucket 1
security 1
sha1sum 1
curiosity 1
jdk16 1
opennlp 1
parallel 1
ShingleFilter 1
StopFilter 1
StopWords 1
writer 1
fieldcache 1
range 1
attribute 1
whitespace 1
Java16 1
SnapPull 1
failed 1
masterSlave 1
sorl 1
f5 1
test-failure 1
lookup 1
archive 1
dist 1
tests 1
query-parser 1
forbiddenapis 1
BTree 1
flamewar 1
logging 1
group 1
totalGroupCount 1
noob 1
patch-with-test 1
NPE, 1
Null-Safety 1
Scorer 1
```

I think some of these are helpful? e.g. `vector-based-search`, `performance`, `newdev`, `newbie`. The highly unstructured nature is indeed a bit ... open-ended.
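Several entries in the results above differ only by case or a stray trailing comma (`Java8`/`java8`, `highlighter`/`Highlighter`, `index`/`index,`). One possible way to fold those together before deciding which labels to keep is sketched below. This is only an illustration: `normalize_label` and `merge_label_counts` are hypothetical helpers, not part of the tool above, and the normalization rule (lower-casing, stripping whitespace and trailing commas) is an assumption that may merge labels someone wants to keep apart.

```python
from collections import Counter


def normalize_label(label):
    # Assumed rule (hypothetical, not from the original tool):
    # trim whitespace, drop trailing commas, and lower-case.
    return label.strip().rstrip(',').lower()


def merge_label_counts(label_count):
    # Sum the counts of all raw labels that normalize to the same key.
    merged = Counter()
    for label, count in label_count.items():
        merged[normalize_label(label)] += count
    return merged


# A handful of raw counts copied from the results above.
sample = {
    'Java8': 12, 'java8': 2,
    'highlighter': 6, 'Highlighter': 3,
    'index': 4, 'index,': 1,
}

merged = merge_label_counts(sample)
for label, count in merged.most_common():
    print(f'{label} {count}')
```

Feeding the full `label_count` dict from the script above into `merge_label_counts` would give a shorter list, though true misspellings such as `perfomance` would still need a manual mapping.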