Trendyol Data Centers Fabric Scaling

Deniz AYDIN
Nov 15, 2021

As with all e-commerce companies, the last two years had a significant impact on Trendyol's business. As the business grows, the infrastructure needs more resources for applications, and beyond resources, redundancy becomes more and more critical.
Over the last two years, our infrastructure went through a considerable transformation. We installed two new, geographically separated data centers using new technologies. The project, called multi-DC transformation, covers not only the infrastructure but also the applications and everything else essential to running the Trendyol business.
Our motivation is to build geographically redundant, independent data centers, with no dependencies between them.
Our old data center is a legacy Clos-based switch fabric: a large bridge domain that behaves like one massive switch with centralized routing, made up of tens of switches with hundreds of servers connected to them.

Legacy Data Center

This is a large logical switch stretched over dozens of physical ones. Servers are isolated with VLANs, and operators define large prefixes for each VLAN. Inter-VLAN routing is handled by firewalls or routers. As simple as it is, it has problems with scalability, failure isolation, inefficient use of links, and so on, and these problems become more critical as the network grows.

For the new data centers, we considered deploying a Layer 3-only design, which is the simplest solution for better scalability. From a network perspective, all that is needed is to forward traffic toward the destination endpoint. In legacy data centers this is achieved with Layer 2 protocols, since they are large switch fabrics: servers sit in broadcast domains, and because the fabric forwards traffic at Layer 2, it only needs to know where the destination MAC is. Dynamic learning of MAC addresses is a well-defined process for switches, and it is hard to achieve the same with Layer 3 only. Some may use static routes for connected hosts, but that is very hard at large scale, and practically impossible for virtual machines, since they move around inside the data center. If we can solve this problem and eliminate the requirement for Layer 2 forwarding between endpoints, we are left with an IP-only data center: thousands of switches and a massively high number of servers can be installed in a single data center.
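To make the idea concrete, here is a minimal Python sketch of what an IP-only fabric relies on: every endpoint is reachable via a /32 host route, and a leaf simply does a longest-prefix-match lookup instead of learning MAC addresses in a bridge table. This is purely illustrative; the addresses and leaf names are invented, and real fabrics program this in hardware, not in Python.

```python
import ipaddress

# Hypothetical routing table of a leaf in an IP-only fabric:
# every endpoint is a /32 host route pointing at the switch it sits
# behind, instead of a MAC entry in a Layer 2 forwarding table.
host_routes = {
    ipaddress.ip_network("10.10.1.11/32"): "leaf-101",
    ipaddress.ip_network("10.10.1.12/32"): "leaf-102",
    ipaddress.ip_network("10.10.2.21/32"): "leaf-203",
    ipaddress.ip_network("0.0.0.0/0"): "border-leaf",  # default route
}

def lookup(dst: str) -> str:
    """Longest-prefix match over the host-route table."""
    addr = ipaddress.ip_address(dst)
    best = max(
        (net for net in host_routes if addr in net),
        key=lambda net: net.prefixlen,
    )
    return host_routes[best]

print(lookup("10.10.1.12"))  # -> leaf-102
print(lookup("192.0.2.7"))   # -> border-leaf (default route)
```

The hard part, as described above, is keeping that host-route table up to date when workloads move, and that is exactly the problem the control plane has to solve.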
Moreover, since we would only route inside the data center and would not need any encapsulation, no special hardware is required; we could use white boxes. With segment routing, especially SRv6, even source-based routing and network function virtualization become possible. But this needs more maturity on the server side and in overlay controllers such as OpenStack. We still need an overlay, our overlay controllers are not ready for this, and we would need more time than we have.

We were already using VXLAN & EVPN in our legacy standalone data center, but from a limited perspective. VXLAN is nothing more than data plane encapsulation: it simplifies the transport of Ethernet frames over the IP network and brings benefits like simplified redundancy and equal-cost multipath. But it needs a control plane for flood-and-learn, VTEP discovery, and so on, and this is where EVPN comes in. With EVPN and VXLAN, better Layer 2 domains can be built, spanning hundreds of switches in a single data center. But the data center still has major problems like centralized routing; as a result, significant BUM traffic hits firewall and server interfaces.
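To show what "nothing more than data plane encapsulation" means in practice, here is a small Python sketch that builds the 8-byte VXLAN header defined in RFC 7348 (flags with the I bit set, a 24-bit VNI, reserved fields zero) and prepends it to an inner Ethernet frame. The VNI value and the dummy frame are just placeholders for the example.

```python
import struct

def vxlan_encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header (RFC 7348) to an inner Ethernet frame.

    The result is what gets carried as the payload of a UDP datagram
    (destination port 4789) across the IP underlay between VTEPs.
    """
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit value")
    flags = 0x08  # 'I' bit set: a valid VNI is present
    # Layout: 1 byte flags, 3 reserved bytes, 3-byte VNI, 1 reserved byte
    header = struct.pack("!B3s3sB", flags, b"\x00" * 3, vni.to_bytes(3, "big"), 0)
    return header + inner_frame

# Example: wrap a dummy 64-byte "frame" into VNI 10100
packet = vxlan_encapsulate(b"\x00" * 64, vni=10100)
print(packet[:8].hex())  # flags, reserved, VNI, reserved
```

Everything else (which VTEP to send it to, which VNI a frame belongs to, how remote MACs are learned) is the control plane's job, which is why EVPN matters.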

We decided to use several newer technologies to overcome these problems:

  • Anycast gateways / distributed routing: simply configuring the default gateway of each VLAN on every access switch. This is a significant improvement for scaling and failure isolation: it pushes first-hop redundancy down to each access switch and keeps inter-VLAN routing at the access layer, so east-west traffic is better optimized. Furthermore, together with ARP suppression, it limits the BUM traffic that hurts switch line cards and servers (see the sketch after this list).
  • Tenancy with Layer 3 IP VRFs, where inter-tenant routing can go through a firewall or stay on the switches, depending on the requirement.
  • Dynamic routing with external services.
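Below is a toy Python sketch of the anycast gateway idea from the first bullet: every leaf is programmed with the same gateway IP and the same virtual MAC for a VLAN, so a server's ARP request for its default gateway is answered by the leaf it is attached to, and routed traffic never has to cross the fabric just to reach a centralized gateway. The leaf names, addresses, and virtual MAC are invented for illustration.

```python
# Toy model of distributed anycast gateways: every leaf owns the same
# gateway IP/MAC for a VLAN, so the first hop is always local to the server.
ANYCAST_GW = {"vlan": 100, "ip": "10.10.1.1", "mac": "02:00:0a:0a:01:01"}

LEAVES = ["leaf-101", "leaf-102", "leaf-203"]

# Every leaf installs the same gateway interface for VLAN 100.
gateway_table = {leaf: dict(ANYCAST_GW) for leaf in LEAVES}

def arp_reply(leaf: str, request_ip: str):
    """Which MAC answers an ARP request for the gateway on this leaf?"""
    gw = gateway_table[leaf]
    if request_ip == gw["ip"]:
        return gw["mac"]  # answered locally, no need to flood across the fabric
    return None

# A server behind any leaf gets the same answer from its local switch.
assert arp_reply("leaf-101", "10.10.1.1") == arp_reply("leaf-203", "10.10.1.1")
```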
VXLAN & EVPN

For the numbers, we divided our physical infrastructure into pods of 48 racks, with each rack having redundant switches. Later on, this number was increased to 96 racks.

Now we had better data centers. But what if we want to install more pods or need more resources?

The problem is the number of switches in a single VXLAN domain. All modern data center switches have VXLAN-accelerated chipsets to speed up encapsulation, but they have an upper limit on the number of VXLAN neighbors, which is around 500 or less for most vendors. This is one of the main factors limiting the number of switches in a single VXLAN & EVPN domain. You also may not want to put all of your resources into a single VXLAN & EVPN domain, so that failure domains stay split. This limits a single pod to a maximum of around 250 switches.
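To make the arithmetic behind that limit explicit, here is a short Python sketch. The ~500 peer limit, the ~250 switch ceiling, and the 48/96-rack pod sizes come from the text above; the assumptions I am adding are that "redundant switches" means two per rack and that the 250 ceiling is roughly half of the peer limit to keep headroom and smaller failure domains.

```python
# Rough pod-sizing arithmetic: how many leaf switches fit in one
# VXLAN & EVPN domain given a per-switch VTEP peer limit.
VTEP_PEER_LIMIT = 500        # upper bound quoted by most vendors, or less
FAILURE_DOMAIN_FACTOR = 0.5  # assumption: stay around half the limit
SWITCHES_PER_RACK = 2        # assumption: "redundant" = two ToR switches

max_switches_per_pod = int(VTEP_PEER_LIMIT * FAILURE_DOMAIN_FACTOR)  # ~250

for racks in (48, 96):
    leaves = racks * SWITCHES_PER_RACK
    print(f"{racks} racks -> {leaves} leaf switches, "
          f"fits under {max_switches_per_pod}: {leaves <= max_switches_per_pod}")
# 48 racks ->  96 leaf switches, which fits comfortably
# 96 racks -> 192 leaf switches, which still fits under ~250
```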

The options depend on whether Layer 2 is required between pods, i.e., whether broadcast domains must be stretched across them. The more critical problem is tenant network design. We do not plan to use the same VLANs or VXLAN VNIs in different pods, but the Layer 3 tenant networks will be the same. Thus we cannot simply connect pods with plain Layer 3 interconnections over routers; a dynamic mechanism such as EVPN-to-VPN stitching is needed to carry tenancy between EVPN and the pod interconnect, which is effectively a data center interconnect.

But we are not sure about the Layer 2 need, so there is a better approach: EVPN multi-site can support both Layer 2 and Layer 3, using special border leaf switches that decapsulate and re-encapsulate VXLAN traffic. It depends on the hardware again, though.
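As a rough illustration of what those border switches do, the Python below models a border gateway that terminates the site-internal VXLAN tunnel and re-originates the packet toward the remote site's border gateway, so leaves in one pod never need to peer directly with leaves in another. This is a conceptual sketch, not any vendor's implementation, and it keeps the VNI unchanged for simplicity; all addresses are invented.

```python
from dataclasses import dataclass

@dataclass
class VxlanPacket:
    outer_src: str    # source VTEP IP of the current tunnel
    outer_dst: str    # destination VTEP IP of the current tunnel
    vni: int
    inner_frame: bytes

def border_gateway_forward(pkt: VxlanPacket, bgw_ip: str, remote_bgw_ip: str) -> VxlanPacket:
    """Decapsulate traffic arriving from a site-internal leaf and
    re-encapsulate it toward the remote site's border gateway.

    The inner frame is preserved; only the outer tunnel is terminated and
    rebuilt, so the remote site sees one VTEP peer (the border gateway)
    instead of every leaf in this pod.
    """
    return VxlanPacket(
        outer_src=bgw_ip,         # new tunnel starts at the border gateway
        outer_dst=remote_bgw_ip,  # ...and ends at the remote border gateway
        vni=pkt.vni,
        inner_frame=pkt.inner_frame,
    )

# A packet from leaf 10.0.0.11 heading to another pod is re-originated:
pkt = VxlanPacket("10.0.0.11", "10.0.0.254", vni=10100, inner_frame=b"...")
print(border_gateway_forward(pkt, bgw_ip="10.0.0.254", remote_bgw_ip="10.1.0.254"))
```

This is also what keeps the VTEP neighbor count per switch within the limits discussed above, since only border gateways peer across pods.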

This phase is still in progress. We are at the beginning of adding more pods. But the solution will be based on EVPN multi-site.

For the connection between data centers, we built our own data center interconnect (DCI) and adopted a motto: Layer 2 stays in a single data center. We do not want to stretch Layer 2 domains between data centers. This limits the ability of virtual machines to move between them, but that was never a requirement. The DCI carries traffic between data center tenant networks, and we decided to use L3 MPLS VPN for it. Again, EVPN-to-VPN stitching is an option; still, since we use different vendors in each data center, we do not want to struggle with vendor interoperability problems.
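The tenant separation over that DCI can be pictured as each data center exporting its tenant prefixes tagged with a per-tenant route target and the other side importing them into the matching VRF. The Python sketch below models only that import/export logic conceptually; the VRF names, route targets, and prefixes are invented.

```python
# Conceptual model of per-tenant routing over the L3 MPLS VPN DCI:
# each side exports tenant prefixes tagged with a route target (RT),
# and the other side imports only the RTs of the matching VRF.
dc1_exports = [
    {"vrf": "tenant-red",  "rt": "64512:100", "prefix": "10.10.0.0/16"},
    {"vrf": "tenant-blue", "rt": "64512:200", "prefix": "10.20.0.0/16"},
]

dc2_import_policy = {
    "tenant-red":  {"64512:100"},
    "tenant-blue": {"64512:200"},
}

def import_routes(exports, import_policy):
    """Build each VRF's routing table from the routes whose RT it imports."""
    tables = {vrf: [] for vrf in import_policy}
    for route in exports:
        for vrf, accepted_rts in import_policy.items():
            if route["rt"] in accepted_rts:
                tables[vrf].append(route["prefix"])
    return tables

print(import_routes(dc1_exports, dc2_import_policy))
# {'tenant-red': ['10.10.0.0/16'], 'tenant-blue': ['10.20.0.0/16']}
```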

Data centers connected over DCI

We are operating VXLAN EVPN-based data centers with distributed routing and will even be using multi-site inside a single data center for more scaling. Still, there is a long way to go toward Layer 2-free deployments.
