
Sparkify is a digital music service similar to Netease Cloud Music or QQ Music. Many users stream their favorite songs on Sparkify every day, either on the free tier, which plays advertisements between songs, or on the premium subscription, where they stream music without advertisements but pay a flat monthly rate. Users can upgrade, downgrade, or cancel their service at any time.

This is a customer churn prediction problem. There are many similar projects, such as the WSDM - KKBox’s Churn Prediction Challenge competition on Kaggle, and a few helpful links are as follows:

So, our job is to mine the customer data in depth and implement an appropriate model to predict customer churn, in the following steps:

  • Clean data: fill NaN values, correct data types, drop outliers.
  • EDA: explore the data to examine feature distributions and their correlation with the key label (churn).
  • Feature engineering: extract and build customer features and customer-behavior features; apply a standard scaler to the numerical features.
  • Train and evaluate models: I chose logistic regression, a linear SVM classifier, a decision tree, and a random forest classifier to train baseline models, then tuned the best of them into a better model. It is worth mentioning that the data is imbalanced because churned customers are few, so we chose the F1 score as the metric for model performance.
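To make the choice of F1 concrete, here is a minimal sketch in plain Python. The labels are made up for illustration (10 churned users out of 100) and the helper names are hypothetical; they are not taken from the project, which presumably relies on a library implementation of the metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive (churn) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical data: 10 churned users (label 1) and 90 retained users (label 0).
y_true = [1] * 10 + [0] * 90

# A useless model that predicts "no churn" for everyone.
always_stay = [0] * 100

# A model that catches 6 of the 10 churners and raises 4 false alarms.
some_model = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

for name, y_pred in [("always_stay", always_stay), ("some_model", some_model)]:
    _, _, f1 = precision_recall_f1(y_true, y_pred)
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f}, f1={f1:.2f}")
```

The model that never predicts churn still reaches 90% accuracy, while its F1 score is 0, which is why accuracy alone says little on data like this.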
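Read more »

Overview

When working with flexible data, colleagues in different roles on a team each care about different aspects of it. A data analyst therefore cannot just hand over a "dead" static report and call it done; what is needed is a "live" report that lets colleagues explore the data and answer their own questions: a dashboard. This post looks at how to build a dashboard on a local network with Flask and Pyecharts, where Flask provides the web application framework and Pyecharts covers the need for interactive visualization.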

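Read more »

Overview

When processing strings, you sometimes need the position of the nth occurrence of a substring, for example the index of the second occurrence of 'ab' in the string 'abcdabdcsas'. However, the find function provided by Python strings can only "Return the lowest index in S where substring sub is found", so we roll up our sleeves and do it ourselves :joy: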

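This post has two parts:

  • Two ways to solve the problem above, with a comparison of their running time
  • Extensions:
    • All indices at which a substring occurs
    • The index of the last occurrence of a substring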
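As a rough illustration of the problem, here is one possible sketch built on repeated calls to `str.find`. The function names `find_nth` and `find_all` are hypothetical, and this is not necessarily either of the two methods compared in the post. Overlapping matches are counted, which is an assumption about the desired behavior.

```python
def find_nth(text, sub, n):
    """Return the index of the n-th (1-based) occurrence of sub, or -1."""
    if n < 1 or not sub:
        raise ValueError("n must be >= 1 and sub must be non-empty")
    index = -1
    for _ in range(n):
        # Resume the search one character after the previous match.
        index = text.find(sub, index + 1)
        if index == -1:
            return -1
    return index


def find_all(text, sub):
    """Return the indices of every occurrence of sub in text."""
    if not sub:
        raise ValueError("sub must be non-empty")
    indices = []
    index = text.find(sub)
    while index != -1:
        indices.append(index)
        index = text.find(sub, index + 1)
    return indices


s = "abcdabdcsas"
print(find_nth(s, "ab", 2))  # second occurrence of 'ab'
print(find_all(s, "a"))      # every index where 'a' occurs
print(s.rfind("a"))          # last occurrence, using the built-in str.rfind
```

For the last occurrence no helper is needed, since the built-in `str.rfind` already returns the highest matching index.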
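Read more »

Overview

In the previous post, we removed the watermarks from the images and darkened the text. After that I normalized the image sizes and even added reference fields to each of them, then tried both Baidu AIP's table recognition service and Face++'s custom-template text recognition service. Probably because the images are low-resolution and the text is dense, the results from both were unsatisfactory, with very high error rates. So I decided to try splitting each image into rows and columns first and then recognizing the pieces one by one, and the result surprised me.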

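Read more »

Overview

This year, the independent admissions lists published by Sunshine Gaokao (阳光高考) came as images, with watermarks added (Editor Hong said "Sunshine Gaokao has gone bad this year" :joy:). It really has gone bad, and it has quietly added a lot of work for me :angry:. To convert the lists to Excel we need text recognition, and for that we first have to get rid of the annoying watermarks.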

This post has two parts:

  • Removing image watermarks with OpenCV
  • Packaging the Python script as an executable exe program
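The idea behind one simple kind of watermark removal can be sketched without any dependencies. The post itself uses OpenCV; the sketch below swaps in plain nested lists so that it runs as-is, and it assumes the watermark is lighter than the text, which may not hold for every image. The function name and the threshold value are hypothetical.

```python
def remove_light_watermark(gray, threshold=128):
    """Binarize a grayscale image given as rows of 0-255 pixel values.

    Pixels at or above the threshold (background and light watermark)
    become white; darker pixels (text) become pure black.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


# Hypothetical 3x3 image: 255 = background, 200 = watermark, 40 = text.
image = [
    [255, 200, 255],
    [40, 200, 40],
    [255, 255, 255],
]

for row in remove_light_watermark(image):
    print(row)
```

Besides dropping the light watermark pixels, the same threshold turns gray text into pure black, which darkens the characters before text recognition.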
Read more »

Shortly after moving to San Francisco in October 2007, roommates and former schoolmates Brian Chesky and Joe Gebbia could not afford the rent for their loft apartment. Chesky and Gebbia came up with the idea of putting an air mattress in their living room and turning it into a bed and breakfast. The goal at first was just “to make a few bucks”.

—— Airbnb Wiki

Airbnb, Inc. is an online marketplace and hospitality service brokerage company. It has a cool brand story: hosts share their spare living space, guests stay for a fee, and guests get to experience real local life instead of booking a hotel. It sounds really cool, but what is the real situation in Beijing? Let’s check it out in two parts:

  • Listings in Beijing
  • Hosts in Beijing
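Read more »

Overview

I have recently been working with independent admissions data. For any given university, the number of admitted applicants is far smaller than the number not admitted; in other words, the admitted class has far fewer samples than the non-admitted class. This is imbalanced data. Handling imbalanced data is not one of the hard problems in machine learning, but it is still something we have to take into account, so in this post we explore techniques for dealing with it.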
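One widely used technique for imbalanced data is random oversampling: duplicating minority-class samples until the classes are the same size. The sketch below uses only the standard library; the function name and the toy data are hypothetical, and it is offered as an illustration of the general idea rather than as the method the post settles on.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())

    out_rows, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical data: 2 admitted applicants (label 1) and 8 not admitted (label 0).
rows = list(range(10))
labels = [1, 1] + [0] * 8

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(labels))      # class counts before resampling
print(Counter(new_labels))  # class counts after resampling
```

Resampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data.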

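Read more »

Overview

My workplace recently needed to produce reports in bulk. Fortunately, the reports all share the same overall template, and only some user-specific information needs to be replaced. Several thousand rounds of repetitive work are best left to Python.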
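The core idea, one shared template plus per-user substitutions, can be sketched with the standard library's `string.Template`. The template text, field names, and user records below are all made up for illustration; the real reports and the tooling used in the post may look quite different.

```python
from string import Template

# Hypothetical report template; $name and $score are placeholder fields.
REPORT_TEMPLATE = Template("Report for $name\nScore: $score\n")


def render_reports(users):
    """Return a dict mapping each user_id to that user's rendered report text."""
    return {
        user["user_id"]: REPORT_TEMPLATE.substitute(user)
        for user in users
    }


# Hypothetical user records.
users = [
    {"user_id": "u001", "name": "Alice", "score": 92},
    {"user_id": "u002", "name": "Bob", "score": 85},
]

for user_id, text in render_reports(users).items():
    print(f"--- {user_id} ---")
    print(text)
```

`Template.substitute` raises `KeyError` when a record lacks a field the template needs, which helps catch incomplete data before thousands of reports are written out.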

Read more »

A positive attitude causes a chain reaction of positive thoughts, events and outcomes.

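Hi everyone, last week we focused on the basics of Python. From knowing nothing to writing your first piece of code, from feeling intimidated to solving your first coding problem, you have taken the first step from complete beginner to Python learner! You are all the best! Please keep up this motivation, strike while the iron is hot, and let's continue with the course!

Read more »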