Twitter Microblogging System Design

8 min read · Jul 4, 2021

Today we’re sharing an article that sheds light on how Twitter’s technology works and how it shapes people's experience across the app. It tries to answer questions like “What is microblogging?”, “How does Twitter get all the tweets for a particular user?”, “How does Twitter get the data for the home timeline?”, “How are trends calculated and displayed?”, and “How does the search timeline work?”

What is Microblogging?

Microblogging is a combination of blogging and instant messaging that allows users to create short messages to be posted and shared with an audience online. Social platforms like Twitter have become extremely popular forms of this new type of blogging, especially on the mobile web, where communicating with people is much more convenient than it was when desktop web browsing and interaction were the norm.

Twitter is one of the oldest and most well-known social platforms to be put under the “microblogging” category. While the 280-character limit still exists today, you can now also share videos, article links, photos, GIFs, sound clips, and more through Twitter Cards in addition to regular text.

The Benefits of Microblogging Versus Traditional Blogging

Less time spent developing content.
Less time spent consuming individual pieces of content.
The opportunity for more frequent posts.
An easier way to share urgent or time-sensitive information.
An easier, more direct way to communicate with followers.
Mobile convenience.

Twitter is an American microblogging and social networking service on which users post and interact with messages known as “tweets”. Registered users can post, like, and retweet tweets, but unregistered users can only read them. Users access Twitter through its website interface or its mobile-device application software (“app”), though the service could also be accessed via SMS before April 2020. Tweets were originally restricted to 140 characters, but the limit was doubled to 280 for non-CJK languages in November 2017. Audio and video tweets remain limited to 140 seconds for most accounts.

Traffic: Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day, and around 200 billion tweets per year.
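As a quick sanity check, the per-minute, per-day, and per-year figures can be derived from the per-second rate. This is only back-of-the-envelope arithmetic; the 6,000 tweets per second is the average quoted above, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the traffic figures, starting from the
# average rate quoted in the article.
TWEETS_PER_SECOND = 6_000

per_minute = TWEETS_PER_SECOND * 60   # tweets in one minute
per_day = per_minute * 60 * 24        # tweets in one day
per_year = per_day * 365              # tweets in one (non-leap) year

print(f"{per_minute:,} tweets per minute")
print(f"{per_day:,} tweets per day")
print(f"{per_year:,} tweets per year")
```

The results land close to the rounded numbers in the text: a bit over 350,000 per minute, roughly 500 million per day, and on the order of 200 billion per year.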

User Features?

The user should be able to tweet as fast as possible
The user should be able to see three tweet timelines:
User timeline: displays the user’s own tweets and retweets
Home timeline: displays tweets from the people the user follows
Search timeline: displays search results based on hashtags or search keywords
The user should be able to follow other users
A tweet should reach millions of followers within a few seconds (about 5 seconds)
The user should be able to see trends

Considerations before designing Twitter

Looking at the features above, the system is read-heavy compared to writes
Eventual consistency is acceptable: it is not much of a problem if a user sees a tweet from someone they follow a little late
Space is not a problem, as tweets are limited to 280 characters

How to get all the tweets for a particular user?

Since Twitter is read-heavy, we need a system that serves reads quickly. For this we can use Redis as a cache for tweets: it keeps data in memory, so reads are fast, and it scales horizontally to raise read throughput. Alongside Redis we still have to store the data in a database, because Redis is primarily an in-memory store and should not be treated as the durable system of record.

The basic Architecture of the Twitter service consists of a User Table, Tweet Table, and Followers Table.

user_id -> user_tweets [1, 2, 3, …]

Tweet_id -> tweet “Hello world”

user_id -> followers [1, 2, 3, …]

First we look up the user_id from the username and get the IDs of all of that user's tweets.
Then we go to the tweet table to fetch the actual tweets for those IDs.
When a user follows another user, the relationship is stored in the followers table and also cached in Redis.

Then, with the help of additional metadata like “created_time”, we can sort the tweets chronologically.
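The lookup described above can be sketched with plain dictionaries standing in for the three tables. This is a minimal illustration, not Twitter's actual schema: the `Tweet` class, the `users_by_name` map, and the sample rows are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    tweet_id: int
    author_id: int
    text: str
    created_time: int  # Unix timestamp; the "created_time" metadata field


# Hypothetical in-memory stand-ins for the tables (or their Redis caches).
users_by_name = {"alice": 1}      # username -> user_id
user_tweets = {1: [101, 102]}     # user_id  -> [tweet_id, ...]
tweets = {                        # tweet_id -> tweet
    101: Tweet(101, 1, "Hello world", 1_625_000_000),
    102: Tweet(102, 1, "Second tweet", 1_625_000_600),
}
followers = {1: [2, 3]}           # user_id  -> [follower_id, ...]


def get_user_tweets(username):
    """Resolve username -> user_id -> tweet ids -> tweets, newest first."""
    user_id = users_by_name[username]
    tweet_ids = user_tweets.get(user_id, [])
    found = [tweets[tweet_id] for tweet_id in tweet_ids]
    return sorted(found, key=lambda t: t.created_time, reverse=True)
```

Calling `get_user_tweets("alice")` returns that user's tweets ordered by `created_time`, newest first.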

How to build a USER TIMELINE?

Fetch all the tweets for a particular user from the global tweet table/Redis
This also includes retweets; save retweets as tweets with a reference to the original tweet
Display them on the user timeline, ordered by date and time

user_id -> user_tweets [1, 2, 3, …]

Tweet_id -> tweet “Hello world”
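One way to picture "retweets saved as tweets with a reference to the original" is an optional field on the tweet record. The sketch below assumes a hypothetical `retweet_of` field and toy data; it is not taken from Twitter's real data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    tweet_id: int
    author_id: int
    text: str
    created_time: int
    retweet_of: Optional[int] = None  # assumed field: id of the original tweet


tweets = {
    1: Tweet(1, 10, "Hello world", 100),
    2: Tweet(2, 20, "Original by someone else", 150),
    3: Tweet(3, 10, "", 200, retweet_of=2),  # user 10 retweets tweet 2
}
user_tweets = {10: [1, 3], 20: [2]}  # user_id -> [tweet_id, ...]


def user_timeline(user_id):
    """Return the texts shown on a user's timeline, newest first."""
    rows = []
    for tweet_id in user_tweets.get(user_id, []):
        entry = tweets[tweet_id]
        # A retweet displays the original tweet's text, but is ordered by
        # the time the retweet itself was made.
        shown = tweets[entry.retweet_of] if entry.retweet_of is not None else entry
        rows.append((entry.created_time, shown.text))
    rows.sort(reverse=True)
    return [text for _, text in rows]
```

Storing only a reference keeps the retweet row small and means edits or deletions of the original need to be handled in one place.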

How to get the data for the Home Timeline?

The home timeline contains the tweets from all the accounts the user follows.

First, get all the accounts the user follows
For each followed account, get its latest tweets
Merge all the tweets and show them sorted by time
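The three steps above amount to a pull-based merge at read time. A minimal sketch, assuming each account's tweets are already kept newest-first and using hypothetical `following` and `tweets_by_user` maps:

```python
import heapq
from itertools import islice

# Hypothetical data: user_id -> accounts that user follows.
following = {1: [2, 3]}
# user_id -> list of (created_time, text), already sorted newest first.
tweets_by_user = {
    2: [(300, "b2"), (100, "b1")],
    3: [(250, "c2"), (200, "c1")],
}


def home_timeline_pull(user_id, limit=10):
    """Build the home timeline at read time by merging followed accounts."""
    streams = [tweets_by_user.get(followed, [])
               for followed in following.get(user_id, [])]
    # heapq.merge lazily merges sorted inputs; reverse=True expects each
    # input to be sorted from newest to oldest.
    merged = heapq.merge(*streams, key=lambda item: item[0], reverse=True)
    return [text for _, text in islice(merged, limit)]
```

Every read touches every followed account, which is why this approach becomes expensive as the follow graph and tweet volume grow.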

Is this scalable?

This huge search operation on a relational DB is not scalable. Even with sharding, it will take time once the tweet table grows to millions of rows. We therefore need a different solution: the fanout approach.

Suppose a user is followed by 10 people. The fanout approach works as follows:

1. When a user tweets, first store the tweet in the DB.

2. Then store the same tweet in the user timeline.

3. Then fan out the tweet to all of the user's followers, so that each follower’s home timeline now contains the tweet.
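The three fanout steps can be sketched as a write path that pushes each new tweet id into every follower's precomputed home timeline. In-memory dictionaries stand in for the DB and Redis here, and the `TIMELINE_LEN` cap is an assumed value chosen only for the example.

```python
from collections import defaultdict, deque

TIMELINE_LEN = 800  # assumed cap on cached home-timeline entries

tweet_store = {}                    # tweet_id -> (author_id, text); the "DB"
user_timelines = defaultdict(list)  # author_id -> [tweet_id, ...]
# follower_id -> newest-first tweet ids; oldest entries fall off the end.
home_timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LEN))
followers = {1: [2, 3, 4]}          # author_id -> [follower_id, ...]


def post_tweet(author_id, tweet_id, text):
    tweet_store[tweet_id] = (author_id, text)         # step 1: store in DB
    user_timelines[author_id].append(tweet_id)        # step 2: user timeline
    for follower_id in followers.get(author_id, []):  # step 3: fan out
        home_timelines[follower_id].appendleft(tweet_id)


def home_timeline(user_id):
    """Reading is now a single lookup plus hydrating the tweet texts."""
    return [tweet_store[tweet_id][1] for tweet_id in home_timelines[user_id]]
```

The cost moves from read time to write time: one tweet causes one write per follower, but each home-timeline read no longer has to query every followed account.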

This approach has a problem. If the user is a celebrity followed by millions of users, we cannot update all of the followers’ timelines in time. So instead we maintain another table for every user, called “celebrity_table”, which is updated every time the user follows a celebrity. When the user logs in to Twitter, he gets the tweets precomputed in the previous step; the system then checks in Redis whether he follows any celebrities and, if so, fetches their tweets and merges them into his home timeline.
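A hybrid of push and pull along these lines might look like the sketch below. The follower-count threshold, the table names, and the sample users are all assumptions for illustration; the article does not say how a celebrity is defined.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000  # assumed cutoff, not from the article

follower_count = {1: 5, 99: 50_000_000}
followers = {1: [7]}                     # used for ordinary (non-celebrity) fanout
celebrities_followed = defaultdict(set)  # the "celebrity_table": user -> celebrity ids
celebrities_followed[7].add(99)

precomputed_home = defaultdict(list)     # user_id -> [(created_time, text)]
recent_by_author = defaultdict(list)     # author_id -> [(created_time, text)]


def is_celebrity(user_id):
    return follower_count.get(user_id, 0) >= CELEBRITY_THRESHOLD


def post_tweet(author_id, created_time, text):
    recent_by_author[author_id].append((created_time, text))
    if is_celebrity(author_id):
        return  # skip fanout; followers pull these tweets at read time
    for follower_id in followers.get(author_id, []):
        precomputed_home[follower_id].append((created_time, text))


def home_timeline(user_id):
    """Merge the precomputed timeline with tweets pulled from celebrities."""
    items = list(precomputed_home[user_id])
    for celebrity_id in celebrities_followed[user_id]:
        items.extend(recent_by_author[celebrity_id])
    return [text for _, text in sorted(items, reverse=True)]
```

Ordinary accounts keep the cheap precomputed read path, while the few very large accounts are handled with a small number of extra lookups per read.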

What are the other optimizations we can make?

Don’t compute the timeline for inactive users, i.e. users who have not logged in to the system for more than 15 days.
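This optimization can be expressed as a filter applied before fanout. The sketch assumes a hypothetical `last_login_by_user` map; the 15-day window is the one mentioned above.

```python
from datetime import datetime, timedelta

INACTIVE_AFTER = timedelta(days=15)  # inactivity window from the article


def is_active(last_login, now):
    return now - last_login <= INACTIVE_AFTER


def fanout_targets(follower_ids, last_login_by_user, now):
    """Keep only followers who logged in recently enough to need a timeline."""
    return [
        follower_id
        for follower_id in follower_ids
        if follower_id in last_login_by_user
        and is_active(last_login_by_user[follower_id], now)
    ]
```

Followers filtered out this way would have their timeline rebuilt on demand the next time they log in.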

How are trending topics calculated?

Twitter uses the Apache Storm and Heron frameworks to compute trending topics.

These jobs run across many containers and perform a real-time analysis of all tweets sent on Twitter, which can be used to determine the so-called trending topics.

Basically, the method consists of counting the most-mentioned terms in the tweets posted on Twitter.

The method is known in the domain of social network data analysis as the “Trending Hashtags” method. Given two subjects A and B, saying that A is more popular than B is equivalent to saying that the number of mentions of subject A is greater than the number of mentions of subject B.

The information required for this process is:

The number of mentions of a subject (hashtag)
The total amount of time taken to generate that volume of tweets.

TweetSpout: Represents a component used for issuing the tweets in the topology
TweetFilterBolt: Reads the tweets issued by the TweetSpout and filters them. Only tweets encoded in standard Unicode are passed on; violation and CC checks are also made here.
ParseTweetBolt: Processes the filtered tweets issued as tuples by the TweetFilterBolt. Because the tuples are already filtered, at this level each tweet is guaranteed to contain at least one hashtag.
CountHashtagBolt: Takes the tweets parsed by the ParseTweetBolt and counts each hashtag, yielding the hashtag and the number of references to it.
TotalRankerBolt: Produces a total ranking of all the counted hashtags, converting counts to ranks in one or more pipelines.
GeoLocationBolt: Takes the hashtags issued by the ParseTweetBolt together with the location of the tweet.
CountLocationHashtagBolt: Similar to CountHashtagBolt, but counts along one more dimension, i.e. location.
RedisBolt: Inserts the results into Redis.
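To make the flow through these components concrete, here is a plain-Python stand-in for the pipeline. It does not use the Storm or Heron APIs: each function only mimics the role of the bolt named in its docstring, a single call is assumed to process one time window of tweets, and the tweet dictionaries are made-up sample data.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def tweet_filter(tweets):
    """Role of TweetFilterBolt: pass on only tweets that contain a hashtag."""
    return (t for t in tweets if HASHTAG_RE.search(t["text"]))


def parse_tweet(tweets):
    """Role of ParseTweetBolt/GeoLocationBolt: emit (hashtag, location)."""
    for t in tweets:
        for tag in HASHTAG_RE.findall(t["text"]):
            yield tag.lower(), t.get("location")


def count_hashtags(pairs):
    """Role of CountHashtagBolt and CountLocationHashtagBolt."""
    total, by_location = Counter(), Counter()
    for tag, location in pairs:
        total[tag] += 1
        if location is not None:
            by_location[(location, tag)] += 1
    return total, by_location


def total_ranker(counts, top_n=3):
    """Role of TotalRankerBolt: turn counts into a ranked list."""
    return counts.most_common(top_n)


def trending(tweets, top_n=3):
    total, by_location = count_hashtags(parse_tweet(tweet_filter(tweets)))
    return total_ranker(total, top_n), by_location
```

In a real topology each stage runs as many parallel tasks and the final ranking is written to Redis, the step the RedisBolt performs; here the ranked list is simply returned.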

Searching:

Earlybird, Twitter's real-time search engine, uses an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index entry for each word that lists the documents containing it. Since the index is an unordered, exact string match, it can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in fact, I think you’ll find the big databases can do a simple string-equality query very quickly in that case.
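A toy inverted index shows the idea in a few lines. This is a sketch of the general technique only, with whitespace tokenization and an AND query; it is not how Earlybird or Lucene are implemented, and the class and method names are made up.

```python
from collections import defaultdict


def tokenize(text):
    # Simplistic tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document ids

    def add(self, doc_id, text):
        for word in tokenize(text):
            self.postings[word].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result
```

Each query word is a single exact-match dictionary lookup, and multi-word queries reduce to intersecting the resulting sets of document ids.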

Earlybird is based on Lucene, and Lucene does not have to optimize for transaction processing. When you add a document, it need not ensure that queries see it instantly, and it need not optimize for updates to existing documents.

At the end of the day, if you really want to know the details, you need to read the source; Lucene is open source, after all.

A search query has to scatter-gather across the data center. It goes to every Earlybird shard and asks, “Do you have content that matches this query?” If you ask for “New York Times”, all shards are queried, and the results are returned, sorted, merged, and reranked. Reranking is by social proof, which means looking at the number of retweets, favorites, and replies.
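The scatter-gather and rerank steps can be sketched as follows. The shards are plain in-memory lists, the substring match stands in for a real index lookup, and scoring social proof as an unweighted sum of retweets, favorites, and replies is an assumption; the actual ranking formula is not given in the article.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards, each holding a slice of the tweet corpus.
SHARDS = [
    [{"text": "New York Times article", "retweets": 5, "favorites": 2, "replies": 1}],
    [{"text": "reading the new york times", "retweets": 50, "favorites": 10, "replies": 3},
     {"text": "unrelated tweet", "retweets": 900, "favorites": 0, "replies": 0}],
]


def search_shard(shard, query):
    """Ask one shard for matches (substring match stands in for the index)."""
    needle = query.lower()
    return [tweet for tweet in shard if needle in tweet["text"].lower()]


def social_proof(tweet):
    # Assumed scoring: an unweighted sum of the three engagement signals.
    return tweet["retweets"] + tweet["favorites"] + tweet["replies"]


def search(query, shards=SHARDS, limit=10):
    # Scatter: query every shard in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query), shards))
    # Gather: merge the partial results, then rerank by social proof.
    merged = [tweet for partial in partials for tweet in partial]
    merged.sort(key=social_proof, reverse=True)
    return merged[:limit]
```

Because every shard is asked, latency is bounded by the slowest shard, which is one reason each shard's index lookup needs to be very fast.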

Databases:

Gizzard is Twitter’s distributed data storage framework built on top of MySQL (InnoDB). InnoDB was chosen because it doesn’t corrupt data. Gizzard is just a datastore: data is fed in and you get it back out again. To get higher performance on individual nodes, a lot of features like binary logs and replication are turned off. Gizzard handles sharding, replicating N copies of the data, and job scheduling.
Cassandra is used for high-velocity writes and lower-velocity reads. The advantages are that Cassandra can run on cheaper hardware than MySQL, it can expand more easily, and they like its schemaless design.
Hadoop is used to process unstructured and large datasets, hundreds of billions of rows.
Vertica is being used for analytics and large aggregations and joins so they don’t have to write MapReduce jobs.

That’s it for this article. We hope it helps you learn how Twitter works, if not completely then at least a bit.

It feels good to work with a team that is so self-reliant and motivated. I would like to thank Apoorva and Shubham Kumar for their invaluable contributions to this article.

Happy Learning!!