Hi Nicholas,

The crawler configuration was not updated after the Spark 4.1.1 release, as
documented in the release process
<https://spark.apache.org/release-process.html>. I've fixed it.
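
For anyone curious what that update looks like: the crawler is configured in the Algolia Crawler Admin Console as a JavaScript object. Below is a minimal, hedged sketch only; the app ID, API key, index name, and selectors are placeholders, not Spark's real values. Only the `safetyChecks.beforeIndexPublishing.maxLostRecordsPercentage` shape follows Algolia's documented setting (see the safety-checks link quoted later in this thread).

```javascript
// Sketch of an Algolia Crawler config (illustrative, not the actual Spark config).
new Crawler({
  appId: "YOUR_APP_ID",            // placeholder
  apiKey: "YOUR_CRAWLER_API_KEY",  // placeholder
  // The docs URL must be bumped on each release (e.g. 4.1.1 here):
  startUrls: ["https://spark.apache.org/docs/4.1.1/"],
  actions: [
    {
      indexName: "apache_spark",   // hypothetical index name
      pathsToMatch: ["https://spark.apache.org/docs/4.1.1/**"],
      recordExtractor: ({ helpers }) =>
        helpers.docsearch({
          recordProps: {
            // Selectors must track the site's HTML structure
            // (cf. the ".container-wrapper" -> ".container" change below).
            lvl1: ".container h1",
            content: ".container p, .container li",
          },
        }),
    },
  ],
  safetyChecks: {
    beforeIndexPublishing: {
      // Loosen the "lost records" check so a legitimate drop between
      // crawls (e.g. after a release) doesn't reject the new index.
      maxLostRecordsPercentage: 30,
    },
  },
});
```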

A unit test isn't really feasible here since the doc search is powered by
Algolia, but we could set up an Algolia monitoring alert to catch this
proactively. I'll look into it when I have the bandwidth.
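
As a starting point for such an alert, here is a minimal health-check sketch against Algolia's search REST API. The app ID, search-only key, and index name are placeholders, and the hit-count threshold is just one possible signal; treat this as an assumption-laden sketch, not a finished monitor.

```javascript
// Placeholders -- substitute the real values from the Algolia dashboard.
const ALGOLIA_APP_ID = "YOUR_APP_ID";
const ALGOLIA_SEARCH_KEY = "YOUR_SEARCH_ONLY_KEY";
const INDEX_NAME = "apache_spark"; // hypothetical index name

// Pure check: does a search response contain enough hits to look healthy?
function isSearchHealthy(result, minHits = 1) {
  return (result.nbHits ?? 0) >= minHits;
}

// Query one index via Algolia's documented search endpoint.
async function queryIndex(query) {
  const url =
    `https://${ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/${INDEX_NAME}/query`;
  const resp = await fetch(url, {
    method: "POST",
    headers: {
      "X-Algolia-Application-Id": ALGOLIA_APP_ID,
      "X-Algolia-API-Key": ALGOLIA_SEARCH_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ params: "query=" + encodeURIComponent(query) }),
  });
  return resp.json();
}

// Example (requires real credentials), alerting on a known-good query:
//   const healthy = isSearchHealthy(await queryIndex("analyze"));
```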

Gengliang

On Wed, Apr 1, 2026 at 3:09 PM Nicholas Chammas <[email protected]>
wrote:

> It’s broken again. This is the third breakage I am reporting in the past
> couple of years.
>
> Is there some sort of alert or CI test we could set up to catch or prevent
> this going forward?
>
>
> On Dec 21, 2025, at 1:35 PM, Gengliang Wang <[email protected]> wrote:
>
> Hi all,
>
> The crawler issue has been identified and fixed.
>
> The root cause was a crawler safety check: the crawl fails when the latest
> result contains fewer than 90% of the records from the previous crawl.
> Increasing the `maxLostRecordsPercentage` threshold resolved the issue.
>
> https://www.algolia.com/doc/tools/crawler/apis/configuration/safety-checks
>
> On Wed, Dec 17, 2025 at 10:03 PM Xiao Li <[email protected]> wrote:
>
>> Thanks for reporting it! Will take a look
>>
>> Nicholas Chammas <[email protected]> 于2025年12月5日周五 04:19写道:
>>
>>> Bueller?
>>>
>>> Is anyone on this list able to fix the crawler?
>>>
>>>
>>> On Dec 1, 2025, at 12:19 PM, Nicholas Chammas <
>>> [email protected]> wrote:
>>>
>>> Hello,
>>>
>>> This seems to be happening again.
>>>
>>> Perhaps we should add a new test (but where, I wonder?) to ensure that
>>> Algolia search doesn’t break without us knowing.
>>>
>>> Nick
>>>
>>>
>>> On Dec 11, 2023, at 5:02 AM, Gengliang Wang <[email protected]> wrote:
>>>
>>> Hi Nick,
>>>
>>> Thank you for reporting the issue with our web crawler.
>>>
>>> I've found that the issue was due to a change (specifically, pull
>>> request #40269 <https://github.com/apache/spark/pull/40269>) in the
>>> website's HTML structure: the CSS selector ".container-wrapper" is now
>>> ".container". I've updated the crawler accordingly, and it's working
>>> properly now.
>>>
>>> Gengliang
>>>
>>> On Sun, Dec 10, 2023 at 8:15 AM Nicholas Chammas <
>>> [email protected]> wrote:
>>>
>>>> Pinging Gengliang and Xiao about this, per these docs
>>>> <https://github.com/apache/spark-website/blob/0ceaaaf528ec1d0201e1eab1288f37cce607268b/release-process.md#update-the-configuration-of-algolia-crawler>
>>>> .
>>>>
>>>> It looks like fixing this problem requires access to the Algolia
>>>> Crawler Admin Console.
>>>>
>>>>
>>>> On Dec 5, 2023, at 11:28 AM, Nicholas Chammas <
>>>> [email protected]> wrote:
>>>>
>>>> Should I report this instead on Jira? Apologies if the dev list is not
>>>> the right place.
>>>>
>>>> Search on the website appears to be broken. For example, here is a
>>>> search for “analyze”:
>>>>
>>>> <Image 12-5-23 at 11.26 AM.jpeg>
>>>>
>>>> And here is the same search using DDG
>>>> <https://duckduckgo.com/?q=site:https://spark.apache.org/docs/latest/+analyze&t=osx&ia=web>
>>>> .
>>>>
>>>> Nick
>>>>
>>>>
>>>>
>>>
>>>
>
