[Mirrors] Request to flag always up to date for my mirrors #76

Closed
opened 2024-08-18 15:17:46 +00:00 by cicku · 4 comments

Hello,

I operate a mirror behind a CDN and can pin the hostnames to data centers in certain countries/regions, so it is very efficient for mirroring and has been adopted quite well by other projects.

Now, when it comes to MirrorManager-based systems, an issue surfaces because mm tries to crawl every folder. This does not work for my mirror, since most of the files being downloaded will not be the folder HTML itself, and it often results in the mirror being marked "not updated". Crawling also adds extra load on the server as well as unnecessary spikes of traffic.

Therefore, I'd like to request that this flag be applied to all of my mirrors in site 3979.

Owner

Thank you for opening an issue.

If I'm understanding your issue correctly, mirror manager crawling HTTP is not effective for your mirror and can result in the crawler stating that you are not updated. It doesn't necessarily crawl every directory; it only crawls the necessary directories that contain repo metadata, as that is the baseline mirror manager uses to mark a mirror up to date or not for a given repository+arch combination.
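As a rough illustration of the per-repository check described above, here is a minimal sketch that compares repo metadata between the master and a mirror. It is not MirrorManager's actual code: the function names, the choice of SHA-256, and the idea of comparing `repomd.xml` bytes directly are all assumptions made for the example.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a metadata file's contents."""
    return hashlib.sha256(data).hexdigest()


def is_current(master_repomd: bytes, mirror_repomd: bytes) -> bool:
    """A mirror's repo counts as current if its metadata matches the master's."""
    return digest(master_repomd) == digest(mirror_repomd)


def check_repos(master: dict[str, bytes], mirror: dict[str, bytes]) -> dict[str, bool]:
    """Map each repository+arch key (hypothetical naming) to an up-to-date flag.

    A repo that is missing on the mirror is reported as not up to date.
    """
    return {
        repo: repo in mirror and is_current(data, mirror[repo])
        for repo, data in master.items()
    }


# Hypothetical sample data standing in for downloaded repomd.xml files.
master = {"BaseOS/x86_64": b"<repomd rev='2'/>", "AppStream/x86_64": b"<repomd rev='5'/>"}
mirror = {"BaseOS/x86_64": b"<repomd rev='2'/>", "AppStream/x86_64": b"<repomd rev='4'/>"}
status = check_repos(master, mirror)
```

In this sample, `BaseOS/x86_64` matches and `AppStream/x86_64` does not, so only the second would be dropped for that repository+arch combination.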

Setting "always up to date" is for mirrors that we (Rocky Linux project and the RESF) maintain and control. The main problem with setting "always up to date" for mirrors that are not under our direct control is that if your mirror is behind a CDN and is flagged as always up to date, but it actually is not, there's no real way for mirror manager to detect that and remove you for being out of date.

If you believe the crawler is consistently getting it wrong about your mirror being out of date and/or you are wanting to prevent HEAD requests to your CDN when crawling, you have the option of turning on rsync crawling, which would help alleviate this issue for you. Doing so would require you to open up an rsync endpoint for the crawler to be able to access.
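To show why an rsync endpoint helps, here is a minimal sketch that parses the text produced by a recursive `rsync --list-only` run: a single listing yields path, size, and modification time for every file, instead of one HTTP request per file. This is only an illustration, not how MirrorManager's crawler is implemented; the `Entry` type, `parse_listing` function, and the sample paths are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class Entry(NamedTuple):
    path: str
    size: int
    mtime: datetime
    is_dir: bool


def parse_listing(text: str) -> list[Entry]:
    """Parse `rsync --list-only` output lines of the form:
    permissions, size, date (YYYY/MM/DD), time (HH:MM:SS), path.
    """
    entries = []
    for line in text.splitlines():
        # Split into at most 5 fields so paths containing spaces stay intact.
        parts = line.split(None, 4)
        if len(parts) != 5:
            continue  # blank lines or other non-listing output
        perms, size, date, time, path = parts
        try:
            mtime = datetime.strptime(f"{date} {time}", "%Y/%m/%d %H:%M:%S")
            size_int = int(size.replace(",", ""))  # newer rsync prints 3,502
        except ValueError:
            continue  # e.g. a message-of-the-day line that is not a file entry
        entries.append(Entry(path, size_int, mtime, perms.startswith("d")))
    return entries


# Hypothetical listing excerpt; the paths are made up for the example.
SAMPLE = """\
Welcome to this example mirror, enjoy
drwxr-xr-x          4,096 2024/08/18 20:00:00 9/BaseOS/x86_64/os/repodata
-rw-r--r--          3,502 2024/08/18 20:00:00 9/BaseOS/x86_64/os/repodata/repomd.xml
"""
entries = parse_listing(SAMPLE)
```

The parsed entries can then be compared against the master's file list locally, so the number of network round trips no longer grows with the number of files.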

Tagging @neil for more information if required.

label added the needinfo and component/mirrors labels 2024-08-18 20:07:38 +00:00
Author
- https://mirrors.rockylinux.org/mirrormanager/crawler/4218.log:

2024-08-18 22:01:08,353 - INFO - Worker 'b9c26c4' starting on host <Host(4218 - jp.mirrors.cicku.me)>
2024-08-18 22:01:08,606 - INFO - Crawling with URL http://jp.mirrors.cicku.me/rocky
2024-08-18 22:01:08,963 - INFO - scanning category Rocky Linux
2024-08-18 22:01:39,900 - INFO - Crawling with URL http://jp.mirrors.cicku.me/rocky

...

2024-08-19 00:00:04,171 - WARNING - Host 4218 marked not up2date: Crawler timed out before completing.  Host is likely overloaded.
2024-08-19 00:00:04,485 - INFO - Ending crawl of <Host(4218 - jp.mirrors.cicku.me)> with status 2

What I can tell is that mm has been a horrible system since it was born in the Fedora community. I have no clue why it spent 2 hours crawling my mirror and then determined that my "server was overloading". Honestly, that is really a joke when my mirror actually serves PB-level volumes of data every month and ~25 million requests daily. Considering that there have been many bugfixes since mm's 1.0.0 release (https://github.com/fedora-infra/mirrormanager2), I wonder if you can update the mm running on your end to see if it performs better (I can tell it is not the latest version from the user agent, as I was proposing an improvement to the code). AlmaLinux does not use mm and it works really well with my mirror, Arch Linux works well, CTAN works well... Everything is fine until I start mirroring distros that use mm and often find my mirror dropped from the metalink because of the "overloading".
Owner

If you provide an rsync endpoint, the problem is moot.

The point stands that we need to ensure that mirrors are up to date, and cannot abdicate that trust elsewhere.

Yes, we could update mm--we're working on it. However, that will not solve the underlying issue: crawling several hundred thousand files via HTTP is inefficient, and with a 100+ms RTT it takes more than 2 hours to crawl those files.

I can't comment on other people's mirroring solutions, but I can say that while MM2 may not be perfect, the decisions and architecture exist for an important reason.

Statements like "mm has been a horrible system..." are entirely unproductive, and frankly false. If you can increase the speed of light, I'm all ears.
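To put rough numbers on the latency point above, here is a back-of-the-envelope sketch. The figures are assumptions chosen to match "several hundred thousand files" and "100+ms RTT": 300,000 requests, 100 ms each, issued strictly one after another with one round trip per file. A real crawler may use parallelism or connection reuse, so treat this as an order-of-magnitude estimate only.

```python
def sequential_crawl_seconds(num_requests: int, rtt_ms: int) -> float:
    """Lower bound on crawl time when every request costs one full round trip
    and requests are issued one at a time (no parallelism, no pipelining)."""
    return num_requests * rtt_ms / 1000


# Assumed, illustrative inputs; not measured values.
NUM_REQUESTS = 300_000
RTT_MS = 100

total_seconds = sequential_crawl_seconds(NUM_REQUESTS, RTT_MS)
total_hours = total_seconds / 3600
print(f"{total_hours:.1f} hours")  # prints "8.3 hours"

# Conversely, a 2-hour window at 100 ms per request covers 72,000 requests.
two_hour_budget = sequential_crawl_seconds(72_000, RTT_MS)
```

Under these assumptions the round trips alone exceed a 2-hour window several times over, regardless of how much bandwidth the mirror has.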

Owner

We have not heard back on this issue since August 20th.

As suggested: please set up an rsync endpoint and configure mirror manager to crawl your mirror this way.

As we have not heard anything further, we will be closing this ticket. Please open a new ticket if you need other assistance.

label closed this issue 2024-08-29 22:58:55 +00:00
label locked as Resolved and limited conversation to collaborators 2024-08-29 22:59:02 +00:00