It’s broken again. This is the third breakage I am reporting in the past couple 
of years.

Is there some sort of alert or CI test we could set up to catch or prevent this 
going forward?
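
To make the idea concrete, here is a rough sketch of a scheduled smoke test, not a tested implementation. It runs a query that should always return results and fails when none come back. The application ID, API key, and index name are placeholders I made up, and I am assuming the index can be queried through Algolia's search REST endpoint with a search-only key.

```python
import json
import sys
import urllib.request

# Hypothetical placeholders: the real application ID, search-only API key,
# and index name would have to come from the site's search configuration.
APP_ID = "YOUR_APP_ID"
API_KEY = "YOUR_SEARCH_ONLY_API_KEY"
INDEX_NAME = "YOUR_INDEX_NAME"


def fetch_search_response(query: str) -> dict:
    """Send one query to the (assumed) Algolia search endpoint; return parsed JSON."""
    url = f"https://{APP_ID}-dsn.algolia.net/1/indexes/{INDEX_NAME}/query"
    request = urllib.request.Request(
        url,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={
            "X-Algolia-Application-Id": APP_ID,
            "X-Algolia-API-Key": API_KEY,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def search_is_healthy(response: dict, min_hits: int = 1) -> bool:
    """Return True if the response reports at least `min_hits` and carries hits."""
    reported = response.get("nbHits", 0)
    returned = len(response.get("hits", []))
    return reported >= min_hits and returned > 0


if __name__ == "__main__":
    # "analyze" is the query from the original report; any term that is
    # known to appear in the docs would do.
    result = fetch_search_response("analyze")
    if not search_is_healthy(result):
        sys.exit("Search returned no results for 'analyze'; the index may be broken.")
    print("Search looks healthy.")
```

Run on a schedule (for example a nightly CI job), a non-zero exit would flag an empty index without anyone having to notice it by hand.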


> On Dec 21, 2025, at 1:35 PM, Gengliang Wang <[email protected]> wrote:
> 
> Hi all,
> 
> 
> The crawler issue has been identified and fixed.
> 
> The root cause was that the crawler fails when the latest result contains 
> less than 90% of the previous result. Increasing the 
> `maxLostRecordsPercentage` threshold resolves the issue.
> 
> https://www.algolia.com/doc/tools/crawler/apis/configuration/safety-checks
> 
> 
> On Wed, Dec 17, 2025 at 10:03 PM Xiao Li <[email protected] 
> <mailto:[email protected]>> wrote:
>> Thanks for reporting it! Will take a look
>> 
>> Nicholas Chammas <[email protected] 
>> <mailto:[email protected]>> wrote on Fri, Dec 5, 2025 at 04:19:
>>> Bueller?
>>> 
>>> Is anyone on this list able to fix the crawler?
>>> 
>>> 
>>>> On Dec 1, 2025, at 12:19 PM, Nicholas Chammas <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> This seems to be happening again.
>>>> 
>>>> Perhaps we should add a new test (but where, I wonder?) to ensure that 
>>>> Algolia search doesn’t break without us knowing.
>>>> 
>>>> Nick
>>>> 
>>>> 
>>>>> On Dec 11, 2023, at 5:02 AM, Gengliang Wang <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> Hi Nick,
>>>>> 
>>>>> Thank you for reporting the issue with our web crawler.
>>>>> 
>>>>> I've found that the issue was due to a change (specifically, pull request 
>>>>> #40269 <https://github.com/apache/spark/pull/40269>) in the website's 
>>>>> HTML structure, where the JavaScript selector ".container-wrapper" is now 
>>>>> ".container". I've updated the crawler accordingly, and it's working 
>>>>> properly now.
>>>>> 
>>>>> Gengliang
>>>>> 
>>>>> On Sun, Dec 10, 2023 at 8:15 AM Nicholas Chammas 
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>> Pinging Gengliang and Xiao about this, per these docs 
>>>>>> <https://github.com/apache/spark-website/blob/0ceaaaf528ec1d0201e1eab1288f37cce607268b/release-process.md#update-the-configuration-of-algolia-crawler>.
>>>>>> 
>>>>>> It looks like to fix this problem you need access to the Algolia Crawler 
>>>>>> Admin Console.
>>>>>> 
>>>>>> 
>>>>>>> On Dec 5, 2023, at 11:28 AM, Nicholas Chammas 
>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> Should I report this instead on Jira? Apologies if the dev list is not 
>>>>>>> the right place.
>>>>>>> 
>>>>>>> Search on the website appears to be broken. For example, here is a 
>>>>>>> search for “analyze”:
>>>>>>> 
>>>>>>> <Image 12-5-23 at 11.26 AM.jpeg>
>>>>>>> 
>>>>>>> And here is the same search using DDG 
>>>>>>> <https://duckduckgo.com/?q=site:https://spark.apache.org/docs/latest/+analyze&t=osx&ia=web>.
>>>>>>> 
>>>>>>> Nick
>>>>>>> 
>>>>>> 
>>>> 
>>> 
