[GitHub] [lucene] JoeHF opened a new pull request, #1003: LUCENE-10616: optimizing decompress when only retrieving some fields
JoeHF opened a new pull request, #1003:
URL: https://github.com/apache/lucene/pull/1003

### Description (or a Jira issue link if you have one)
[jira] [Commented] (LUCENE-10616) Moving to dictionaries has made stored fields slower at skipping
[ https://issues.apache.org/jira/browse/LUCENE-10616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561823#comment-17561823 ]

fang hou commented on LUCENE-10616:
-----------------------------------

Hi [~jpountz], I tried to resolve this by changing the decompress loop in LZ4WithPresetDictCompressionMode to return early once we have all the fields we need. I'm a little hesitant to do so, though, because it may break the API contract of decompress: the method could return fewer bytes than the requested length. Besides, I'm not sure that changing the logic in LZ4WithPresetDictCompressionMode is the right direction. Should this decompression optimization happen in Lucene90CompressingStoredFieldsReader instead (I haven't found an easy way to do it there)? Here is a WIP PR that demonstrates my current thinking: https://github.com/apache/lucene/pull/1003 - please share your insights, thanks!

> Moving to dictionaries has made stored fields slower at skipping
> -----------------------------------------------------------------
>
>                 Key: LUCENE-10616
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10616
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~ywelsch] has been digging into a regression of stored fields retrieval that is caused by LUCENE-9486.
> Say your documents have two stored fields: one that is 100B and stored first, and another that is 100kB, and you are only interested in the first one. While the idea behind blocks of stored fields is to store multiple documents in the same block to leverage redundancy across documents, sometimes documents are larger than the block size. As soon as documents are larger than 2x the block size, our stored fields format splits such large documents into multiple blocks, so that you wouldn't need to decompress everything only to retrieve a couple of small fields.
> Before LUCENE-9486, BEST_SPEED had a block size of 16kB, so retrieving only the first field value would only need to decompress 16kB of data. With the move to preset dictionaries in LUCENE-9486 and then LUCENE-9917, we now have blocks of 80kB, so stored fields now need to decompress 80kB of data, 5x more than before.
> With dictionaries, our blocks are now split into 10 sub blocks. We happen to eagerly decompress all sub blocks that intersect with the stored document, which is why we decompress 80kB of data, but this is an implementation detail. It should be possible to decompress these sub blocks lazily, so that we only decompress those that intersect with one of the field values the user is interested in retrieving.
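To make the proposal concrete, here is a minimal, self-contained sketch of the early-exit idea under discussion. It is not the actual Lucene code: the class and method names are hypothetical, and an identity function stands in for LZ4-with-preset-dictionary decompression of each sub block.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

public class LazySubBlockReader {
  /**
   * Decompresses sub-blocks in order and stops as soon as the requested
   * window [offset, offset + length) is covered, so trailing sub-blocks
   * are never decompressed.
   */
  static byte[] readWindow(List<byte[]> compressedSubBlocks, int offset, int length) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] subBlock : compressedSubBlocks) {
      out.writeBytes(decompressSubBlock(subBlock));
      // Early exit: every byte the caller asked for is now available.
      if (out.size() >= offset + length) {
        break;
      }
    }
    byte[] window = new byte[length];
    System.arraycopy(out.toByteArray(), offset, window, 0, length);
    return window;
  }

  // Stand-in for per-sub-block LZ4 decompression against a preset
  // dictionary; the identity function keeps the sketch runnable.
  static byte[] decompressSubBlock(byte[] compressed) {
    return compressed;
  }
}
```

The API concern raised in the comment is visible in the sketch: a caller expecting the whole block to be decompressed would now get only a prefix covering the requested window, which is why it may be cleaner to drive this from Lucene90CompressingStoredFieldsReader, where the requested field offsets are known.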
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1:
URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173102674

I found that at least one test issue in the test repo https://github.com/mocobeta/sandbox-lucene-10557/issues already appears in Google's top search results. I didn't think it would happen so quickly, so I might have to make the repo private - if anyone is interested in debugging this issue, please let me know and I'll give you access to the repo.
[GitHub] [lucene] Yuti-G commented on a diff in pull request #974: LUCENE-10614: Properly support getTopChildren in RangeFacetCounts
Yuti-G commented on code in PR #974:
URL: https://github.com/apache/lucene/pull/974#discussion_r912548774

## lucene/demo/src/java/org/apache/lucene/demo/facet/DistanceFacetsExample.java:

```
@@ -212,7 +212,26 @@ public static Query getBoundingBoxQuery(
   }

   /** User runs a query and counts facets. */
-  public FacetResult search() throws IOException {
+  public FacetResult searchAllChildren() throws IOException {
+
+    FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager());
+
+    Facets facets =
+        new DoubleRangeFacetCounts(
+            "field",
+            getDistanceValueSource(),
+            fc,
+            getBoundingBoxQuery(ORIGIN_LATITUDE, ORIGIN_LONGITUDE, 10.0),
+            ONE_KM,
+            TWO_KM,
+            FIVE_KM,
+            TEN_KM);
+
+    return facets.getAllChildren("field");
+  }
+
+  /** User runs a query and counts facets. */
+  public FacetResult searchTopChildren() throws IOException {
```

Review Comment:
Hmmm, I think the added example is intended to do what you described: it facets on one-hour buckets, tracking the error messages in each one-hour slot over the past 168 hours (an entire week). Only the first range (past `0-1` hour) has its endpoint at now. For better readability, I changed the range labels to start from the (0-1) hour in the latest commit. For example, if I call `getAllChildren("error log")`, the results are:

```
dim=error log path=[] value=2758 childCount=168
Past 0-1 hour (0) Past 1-2 hour (1) Past 2-3 hour (2) Past 3-4 hour (3) Past 4-5 hour (4)
Past 5-6 hour (5) Past 6-7 hour (6) Past 7-8 hour (7) Past 8-9 hour (8) Past 9-10 hour (9)
Past 10-11 hour (10) Past 11-12 hour (11) Past 12-13 hour (12) Past 13-14 hour (13) Past 14-15 hour (14)
Past 15-16 hour (15) Past 16-17 hour (16) Past 17-18 hour (17) Past 18-19 hour (18) Past 19-20 hour (19)
Past 20-21 hour (20) Past 21-22 hour (21) Past 22-23 hour (22) Past 23-24 hour (23) Past 24-25 hour (24)
Past 25-26 hour (25) Past 26-27 hour (26) Past 27-28 hour (27) Past 28-29 hour (28) Past 29-30 hour (29)
Past 30-31 hour (30) Past 31-32 hour (31) Past 32-33 hour (32) Past 33-34 hour (33) Past 34-35 hour (34)
Past 35-36 hour (0) Past 36-37 hour (1) Past 37-38 hour (2) Past 38-39 hour (3) Past 39-40 hour (4)
Past 40-41 hour (5) Past 41-42 hour (6) Past 42-43 hour (7) Past 43-44 hour (8) Past 44-45 hour (9)
Past 45-46 hour (10) Past 46-47 hour (11) Past 47-48 hour (12) Past 48-49 hour (13) Past 49-50 hour (14)
Past 50-51 hour (15) Past 51-52 hour (16) Past 52-53 hour (17) Past 53-54 hour (18) Past 54-55 hour (19)
Past 55-56 hour (20) Past 56-57 hour (21) Past 57-58 hour (22) Past 58-59 hour (23) Past 59-60 hour (24)
Past 60-61 hour (25) Past 61-62 hour (26) Past 62-63 hour (27) Past 63-64 hour (28) Past 64-65 hour (29)
Past 65-66 hour (30) Past 66-67 hour (31) Past 67-68 hour (32) Past 68-69 hour (33) Past 69-70 hour (34)
Past 70-71 hour (0) Past 71-72 hour (1) Past 72-73 hour (2) Past 73-74 hour (3) Past 74-75 hour (4)
Past 75-76 hour (5) Past 76-77 hour (6) Past 77-78 hour (7) Past 78-79 hour (8) Past 79-80 hour (9)
Past 80-81 hour (10) Past 81-82 hour (11) Past 82-83 hour (12) Past 83-84 hour (13) Past 84-85 hour (14)
Past 85-86 hour (15) Past 86-87 hour (16) Past 87-88 hour (17) Past 88-89 hour (18) Past 89-90 hour (19)
Past 90-91 hour (20) Past 91-92 hour (21) Past 92-93 hour (22) Past 93-94 hour (23) Past 94-95 hour (24)
Past 95-96 hour (25) Past 96-97 hour (26) Past 97-98 hour (27) Past 98-99 hour (28) Past 99-100 hour (29)
Past 100-101 hour (30) Past 101-102 hour (31) Past 102-103 hour (32) Past 103-104 hour (33) Past 104-105 hour (34)
Past 105-106 hour (0) Past 106-107 hour (1) Past 107-108 hour (2) Past 108-109 hour (3) Past 109-110 hour (4)
Past 110-111 hour (5) Past 111-112 hour (6) Past 112-113 hour (7) Past 113-114 hour (8) Past 114-115 hour (9)
Past 115-116 hour (10) Past 116-117 hour (11) Past 117-118 hour (12) Past 118-119 hour (13) Past 119-120 hour (14)
Past 120-121 hour (15) Past 121-122 hour (16) Past 122-123 hour (17) Past 123-124 hour (18) Past 124-125 hour (19)
Past 125-126 hour (20) Past 126-127 hour (21) Past 127-128 hour (22) Past 128-129 hour (23) Past 129-130 hour (24)
Past 130-131 hour (25) Past 131-132 hour (26) Past 132-1
```
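For contrast with `getAllChildren` above, here is a hedged sketch of what the body of `searchTopChildren` could look like in this example. It mirrors `searchAllChildren` but returns only the highest-count ranges via `Facets#getTopChildren(int, String, String...)`; the topN value of 2 is illustrative, not taken from the PR:

```java
/** User runs a query and counts facets, keeping only the top-N ranges. */
public FacetResult searchTopChildren() throws IOException {

  FacetsCollector fc = searcher.search(new MatchAllDocsQuery(), new FacetsCollectorManager());

  Facets facets =
      new DoubleRangeFacetCounts(
          "field",
          getDistanceValueSource(),
          fc,
          getBoundingBoxQuery(ORIGIN_LATITUDE, ORIGIN_LONGITUDE, 10.0),
          ONE_KM,
          TWO_KM,
          FIVE_KM,
          TEN_KM);

  // Unlike getAllChildren, which returns every range in index order,
  // getTopChildren sorts children by count and truncates to topN.
  return facets.getTopChildren(2, "field");
}
```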
[GitHub] [lucene-jira-archive] madrob commented on issue #4: Which GitHub account should/can we use for migration?
madrob commented on issue #4:
URL: https://github.com/apache/lucene-jira-archive/issues/4#issuecomment-1173298841

Regarding rate limits, would it be possible to reach out to GitHub directly or through ASF infra to get those temporarily raised?
[GitHub] [lucene-jira-archive] mocobeta commented on issue #1: Fix markup conversion error
mocobeta commented on issue #1:
URL: https://github.com/apache/lucene-jira-archive/issues/1#issuecomment-1173409973

I made a small list of error examples in #12 that hopefully covers the typical markup conversion errors. Block elements (quotes, bullet lists, numbered lists, and tables) are sometimes converted correctly and sometimes broken, depending on context; I haven't found a pattern yet.
[GitHub] [lucene-jira-archive] mocobeta commented on issue #4: Which GitHub account should/can we use for migration?
mocobeta commented on issue #4:
URL: https://github.com/apache/lucene-jira-archive/issues/4#issuecomment-1173425738

I'm not familiar with the relationship between the ASF and GitHub, but the ASF organization account may already count as an Enterprise account (with a higher limit of 15,000 requests per hour). I think the difficulty here is that we developers cannot test our migration script with a real ASF account, and [infra would expect us to provide "tested" scripts](https://issues.apache.org/jira/browse/INFRA-20118). If we misestimate the throttling interval, the script could fall into an unstable state in the middle of processing (maybe one or two hours after starting), and there is no way to roll back. As a possible scenario, we could make the interval tunable and ask infra to set it to an appropriate value.
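As an illustration of the tunable-interval idea (this is not the actual migration script; the class name, the fixed-pacing strategy, and the endpoint handling are assumptions for the sketch), every API call waits a configurable number of milliseconds, so the operator can dial the value to whatever the real account's rate limit allows:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ThrottledGitHubClient {
  private final HttpClient client = HttpClient.newHttpClient();
  private final long intervalMillis; // tunable: supplied by the operator, not hard-coded

  public ThrottledGitHubClient(long intervalMillis) {
    this.intervalMillis = intervalMillis;
  }

  public String get(String url, String token) throws Exception {
    Thread.sleep(intervalMillis); // fixed pacing between consecutive requests
    HttpRequest request = HttpRequest.newBuilder(URI.create(url))
        .header("Authorization", "Bearer " + token)
        .header("Accept", "application/vnd.github+json")
        .build();
    HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
    // Log the remaining quota so a misestimated interval is visible
    // before the script falls over in the middle of a migration.
    response.headers().firstValue("x-ratelimit-remaining")
        .ifPresent(r -> System.out.println("rate limit remaining: " + r));
    return response.body();
  }
}
```

With the interval externalized like this, infra could set it once for the real account without the developers having to test against that account themselves.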