
Think less, and do more.

“If I had invested my money in real estate back then, I would already be a billionaire by now.”

“There’s nothing special about Jimmy; we both saw the bright future of AI. It’s only because he graduated two years earlier than me. Otherwise, I would be the CEO of a biotech-AI startup now!”

I group these into two major types of complaints about “why I’m one step behind true success.”

1) I was able to, but I didn’t see it.

2) I did see it, but I was not able to.

The basic assumption of these complaints is that the people who…


I had been having bad days until I slept in my dogs’ bed.

When I saw my two dogs sleeping in their bed with some chasing games in their dreams, I wished I were a dog like them.

I was not able to solve the problems in my life. I even thought it was better if I couldn’t wake up from my bed one day.

However, I never see my dogs getting up unhappy from their bed. Perhaps that is where the magic comes from.

So, I decided to switch beds with my dogs.

Thankfully, I found the magic of life there.

Got up early.

I got up at 7 am.

I used to get up…


DeepCOVID-XR is reported to perform similarly to experienced thoracic radiologists in detecting COVID-19 from chest radiographs, yet it needs much less time to train. How much should we expect from these AI radiologists?

The fact that COVID-19-specific features can be observed on chest imaging (X-ray or CT) has inspired many AI scientists to focus on algorithm development.

Recently, researchers from Northwestern University published a novel classifier, DeepCOVID-XR, that identifies a “COVID-19 positive case” from chest radiographs. The performance of this classifier is reported to be similar to that of experienced thoracic radiologists.

Many studies have been published on similar topics, but the authors claim that DeepCOVID-XR was trained on “the largest clinical dataset of chest radiographs from the COVID-19 era”.

As a data scientist, let me…


Researchers have been focusing on developing artificial intelligence (AI)-driven analysis platforms for biomedical data. DrBioRight is a good initial attempt.

Recently, an AI-driven analytics platform for biomedical research, DrBioRight, was published in Cancer Cell, one of the top scientific journals.

The main function of this tool is to use natural language processing (NLP) so that biomedical researchers without expertise in bioinformatics or programming can perform computational analysis of large omics datasets.

Specifically, DrBioRight allows the user to “talk” to it in natural language instead of writing scripts. It then “translates” the natural language into a specific bioinformatic task by applying a pre-trained neural network. …


Check the model assumptions and outliers of GLM in R.

The generalized linear model (GLM) is popular because it can handle a wide range of data with different response variable types (such as binomial, Poisson, or multinomial).

Compared with non-linear models, such as neural networks or tree-based models, linear models may not be as powerful in terms of prediction. But their ease of interpretation keeps them attractive, especially when we need to understand how each predictor influences the outcome.

The shortcomings of GLM are as obvious as its advantages. The linear relationship may not always hold and it is really sensitive to outliers…


Take a second look at your response variables before the multinomial modeling.

The popular multinomial logistic regression is known as an extension of the binomial logistic regression model that handles more than two possible discrete outcomes.

However, multinomial logistic regression is not designed to be a general multi-class classifier; it is designed specifically for nominal multinomial data.

Note that nominal data and ordinal data are the two major categories of multinomial data. The difference is that the categories in nominal multinomial data have no order, while those in ordinal multinomial data do.

For example, if our goal is to distinguish the three classes…


Eliminating false positive detections of significant coefficients in the Poisson model, with implementations in R.

The Poisson regression model arises naturally when we want to model the average number of occurrences per unit of time or space. Examples include the incidence of a rare cancer, the number of cars crossing an intersection, or the number of earthquakes.

One feature of the Poisson distribution is that the mean equals the variance. However, over- or underdispersion happens in Poisson models, where the variance is larger or smaller than the mean value, respectively. In reality, overdispersion happens more frequently with a limited amount of data.
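The mean-variance property above suggests a quick first check before fitting anything. The following is a minimal Python sketch, not the article's R implementation: the helper name `dispersion_ratio` and the count data are hypothetical, and the ratio is computed on raw counts rather than on the residuals of a fitted Poisson regression, so it is only a rough screening step.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Return sample variance divided by sample mean.

    Under a Poisson assumption this ratio should be close to 1;
    values well above 1 hint at overdispersion, values well below 1
    hint at underdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero, so the ratio is undefined")
    return variance(counts) / m


# Hypothetical event counts per time unit (made up for illustration).
counts = [0, 1, 0, 2, 9, 0, 1, 12, 0, 3]

ratio = dispersion_ratio(counts)
print(f"mean={mean(counts):.2f}  variance={variance(counts):.2f}  ratio={ratio:.2f}")
```

A ratio far from 1 on raw counts does not prove that a regression model is mis-specified, since predictors may explain part of the extra variance, but it is a cheap signal that the dispersion deserves a closer look.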

The overdispersion issue affects the interpretation of the model. It is necessary to…


A small discussion on the binomial regression model and its link functions

A lot of events in our daily life follow the binomial distribution that describes the number of successes in a sequence of independent Bernoulli experiments.

For example, assuming that the probability of James Harden making his shot is constant and each shot is independent, the number of field goals follows the binomial distribution.
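To make the shooting example concrete, here is a small Python sketch of the binomial probability mass function. The number of attempts and the make probability are made-up values for illustration, not real statistics, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Hypothetical inputs: 20 shot attempts, each made with probability 0.45.
n_shots, p_make = 20, 0.45

# Probability of every possible number of made field goals, 0 through 20.
probs = [binomial_pmf(k, n_shots, p_make) for k in range(n_shots + 1)]

# The expected number of makes equals n * p for a binomial variable.
expected_makes = sum(k * pk for k, pk in enumerate(probs))
print(f"P(exactly 9 makes) = {probs[9]:.4f}")
print(f"expected makes     = {expected_makes:.2f}")
```

The sketch relies on the same two assumptions stated above: a constant success probability and independent shots.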

If we want to find the relationship between the success probability (p) of a binomially distributed variable Y and a set of independent variables x, the binomial regression model is among our top choices.

The link function is the major difference between a binomial regression and a…


Who is the NBA playoff league leader in the Orlando bubble?

I’ve tried several different types of NBA analytics articles with my readership, a group of true basketball fans. I found that the most popular articles are not those with state-of-the-art machine learning techniques, but those with straightforward and meaningful graphs.

At a certain stage of my career as a data scientist, I realized that delivering the information is more important than showing off fancy models. Perhaps that’s why linear regression is still one of the most popular models in the finance world.

In this post, I am going to talk about a simple topic. It is how…


A case study of semi-supervised learning on NBA players’ position prediction with limited data labels.

Supervised learning and unsupervised learning are the two major tasks in machine learning. Supervised learning models are used when the output for every instance is available, whereas unsupervised learning is applied when we don’t have the “true label”.

Even though unsupervised learning has huge potential for future research, supervised learning still dominates the field. However, we often need to build a supervised learning model when we don’t have enough labeled samples in our data.

In such a case, the semi-supervised learning can be taken into consideration. The idea is to build a supervised…

Yufeng

Ph.D., Data Scientist, and Bioinformatician. A true lover of data and basketball. Understanding is the path to eliminating discrimination.
