Guangning Yu's Blog

Power BI Performance Tuning

2022-04-14 11:51:40

Best Practices

Import

  1. Turn on query caching for imported tables
  2. Turn on the large dataset storage format

DirectQuery

  1. Use a star schema for the data model, and ensure that dimension tables contain proper keys that relate to a fact table
  2. Materialize all aggregations, transformations, and calculations in SQL Server
  3. Pre-filter large fact tables in SQL Server and remove filters from visuals
    • Adding filters causes Power BI to generate multiple SQL queries
  4. Use dynamic M query parameters to move measure logic from Power BI to SQL Server
  5. Optimize measure definitions to generate efficient DAX queries
    • Power BI translates complex DAX into multiple SQL Server queries
  6. Use dual storage mode for dimension tables
    • Otherwise Power BI uses the imported dimension tables to generate long SQL queries for filters
  7. Refine the table structure and indexing strategy
    • A clustered rowstore index (e.g. on TimeId) plus a non-clustered columnstore index is recommended
  8. Only

Azure Solution Architect Notes

2021-08-09 17:03:39

Azure Certification Guide

Compute

Virtual Machine

Azure Functions

App Services

Azure Container Instances (ACI)

Azure Kubernetes Service (AKS)

Azure Container Registry (ACR)

Windows Virtual Desktop

Networking

Virtual Network (VNet)

Network Security Group (NSG)

Application Security Group

Load Balancer

VPN Gateway

Application Gateway

ExpressRoute

Paired Region

Storage

Storage Account

Blob

Disk

File

Archive

Database

Cosmos DB

Azure SQL

Azure Database for PostgreSQL

Database Migration Services

Identity and Access Management (IAM)

Azure Active Directory (AAD)

Group

Role

Scope

Role-Based Access Control (RBAC)

Azure AD Join

Azure AD Connect

Single Sign-On

Self-Service Password

AWS Solution Architect Professional Notes

2021-06-04 16:30:48

S3

  • Maximum object size is 5TB; largest object in a single PUT is 5GB.
  • Multipart uploads are recommended for objects larger than 100 MB (see the sketch after this list).
  • Security: IAM policies -> Bucket policy -> Object ACL
  • Versioning cannot be enabled at the object level. It's a bucket-level feature.
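
A minimal boto3 sketch of the multipart-upload recommendation above; the bucket and file names are placeholders:

  import boto3
  from boto3.s3.transfer import TransferConfig

  s3 = boto3.client("s3")

  # Switch to multipart above 100 MB and upload 100 MB parts in parallel
  config = TransferConfig(
      multipart_threshold=100 * 1024 * 1024,
      multipart_chunksize=100 * 1024 * 1024,
      max_concurrency=8,
  )

  # upload_file handles the part splitting and the final completion call
  s3.upload_file("backup.tar.gz", "my-bucket", "backups/backup.tar.gz", Config=config)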

Amazon Neptune

Amazon Redshift

Amazon Athena

Amazon Quantum Ledger Database

Amazon Managed Blockchain

Amazon Timestream Database

Amazon DocumentDB

Amazon ElasticSearch

Databases

S3 Select vs Athena vs Redshift Spectrum

S3 Select is focused on retrieving data from S3 using SQL:

S3 Select enables applications to retrieve only a subset of data from an object by using simple SQL expressions. By using S3 Select to retrieve only the data needed by your application, you can achieve drastic performance increases; in many cases you can get as much as a 400% improvement compared with classic S3 retrieval.
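
A small boto3 sketch of an S3 Select call (bucket, key, and columns are placeholders); the response is an event stream whose Records events carry the matching rows:

  import boto3

  s3 = boto3.client("s3")

  resp = s3.select_object_content(
      Bucket="my-bucket",
      Key="data/events.csv",
      ExpressionType="SQL",
      Expression="SELECT s.id, s.status FROM S3Object s WHERE s.status = 'ERROR'",
      InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
      OutputSerialization={"CSV": {}},
  )

  # Only the rows matching the WHERE clause are returned by S3
  for event in resp["Payload"]:
      if "Records" in event:
          print(event["Records"]["Payload"].decode("utf-8"))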

Test PySpark max()/min() function

2020-07-13 09:22:44  |  Spark

test.csv

  1. key,a,b,c
  2. a,1,,-1
  3. a,2,,
  4. a,3,,4

test.py

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession \
      .builder \
      .appName("spark-app") \
      .getOrCreate()
  spark.sparkContext.setLogLevel("WARN")

  df = spark.read.csv("test.csv", header=True)
  res = df.groupBy(["key"]).agg(*[
      F.max("a"),
      F.max("b"),
      F.max("c"),
      F.min("a"),
      F.min("b"),
      F.min("c"),
  ])
  print(res.toPandas())

spark-submit test.py

    key max(a) max(b) max(c) min(a) min(b) min(c)
  0   a      3   None      4      1   None     -1
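
The None values appear because column b has no non-null rows at all, while max()/min() simply skip nulls in partially filled columns such as c. A quick follow-up sketch (reusing df and F from test.py above) makes that explicit by counting non-null values per column:

  # F.count() counts only non-null values, so count_b comes back as 0,
  # which is why max(b)/min(b) are None above.
  counts = df.groupBy("key").agg(
      F.count("a").alias("count_a"),   # 3
      F.count("b").alias("count_b"),   # 0
      F.count("c").alias("count_c"),   # 2
  )
  print(counts.toPandas())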

AWS Certified Solutions Architect Associate Notes

2019-12-05 17:59:42  |  AWS

Compute

EC2

  • Billing for interrupted Spot Instance
  • When you launch an instance from AMI, it uses either paravirtual (PV) or hardware virtual machine (HVM) virtualization. HVM virtualization uses hardware-assist technology provided by the AWS platform.
  • Information about a running instance can be retrieved from the instance metadata service (see the sketch after this list).
  • The underlying Hypervisor for EC2:
    • Xen
    • Nitro
  • Standard Reserved Instances cannot be moved between regions. You can choose if a Reserved Instance applies to either a specific AZ or an entire region, but you cannot change the region.
  • About EC2 Auto Scaling
    • Can span multi-AZ
  • About Placement Group
    • Three types of Placement Groups
      • Clustered Placement Group
        • Within a single AZ
        • Used for applications that need low network latency, high network throughput, or both
        • Only certain instances can be launched into a Clustered Placement Group
        • AWS r
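
As a follow-up to the instance metadata bullet above, one way to read instance information from inside an EC2 instance is the metadata endpoint; a minimal IMDSv2 sketch using requests:

  import requests

  # IMDSv2: request a session token first, then query the metadata endpoint
  token = requests.put(
      "http://169.254.169.254/latest/api/token",
      headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
      timeout=2,
  ).text

  instance_id = requests.get(
      "http://169.254.169.254/latest/meta-data/instance-id",
      headers={"X-aws-ec2-metadata-token": token},
      timeout=2,
  ).text
  print(instance_id)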

Kubeless Basics

2019-05-14 10:09:51  |  Kubernetes

Deploy kubeless to a Kubernetes cluster

  1. $ export RELEASE=$(curl -s https://api.github.com/repos/kubeless/kubeless/releases/latest | grep tag_name | cut -d '"' -f 4)
  2. $ kubectl create ns kubeless
  3. $ kubectl create -f https://github.com/kubeless/kubeless/releases/download/$RELEASE/kubeless-$RELEASE.yaml
  1. $ kubectl get pods -n kubeless
  2. $ kubectl get deployment -n kubeless
  3. $ kubectl get customresourcedefinition

Deploy sample function

  def hello(event, context):
      print event
      return event['data']
  1. $ kubeless function deploy hello --runtime python2.7 \
  2. --from-file test.py \
  3. --handler test.hello
  1. $ kubectl get functions
  2. $ kubeless function ls
  1. $ kubeless function call hello --data 'Hello world!'

Windows cmd

2019-03-19 18:31:36  |  Windows
  • create a file

    1. echo This is a sample text file > sample.txt
  • delete a file

    1. del file_name
  • move a file

    1. move stats.doc c:\statistics
  • combine files

    1. copy /b file1 + file2 file3

Load Excel file into SQL Server

2019-03-07 15:57:56  |  Python
  import pandas as pd
  import pyodbc
  import sqlalchemy
  import urllib

  def get_sqlalchemy_engine(driver, server, uid, pwd, database):
      conn_str = 'DRIVER={};SERVER={};UID={};PWD={};DATABASE={}'.format(driver, server, uid, pwd, database)
      quoted = urllib.parse.quote_plus(conn_str)
      engine = sqlalchemy.create_engine('mssql+pyodbc:///?odbc_connect={}'.format(quoted))
      return engine

  if __name__ == '__main__':
      # create engine
      driver = 'ODBC Driver 17 for SQL Server'
      server = 'xxx'
      uid = 'xxx'
      pwd = 'xxx'
      database = 'xxx'
      engine = get_sqlalchemy_engine(driver, server, uid, pwd, database)
      # read excel
      file_path = 'xxx'
      df = pd.read_excel(file_path)
      # load into SQL Server
      schema_name = 'xxx'
      table_name = 'xxx'
      df.to_sql(table_name, schema=schema_name, con=engine, index=False, if_exists='replace')
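
For larger workbooks the load can usually be sped up by enabling pyodbc's fast_executemany on the engine and writing in chunks; a sketch of the two relevant changes:

  # inside get_sqlalchemy_engine: batch the INSERTs on the pyodbc side
  engine = sqlalchemy.create_engine(
      'mssql+pyodbc:///?odbc_connect={}'.format(quoted),
      fast_executemany=True,
  )

  # when writing: insert 10,000 rows per batch
  df.to_sql(table_name, schema=schema_name, con=engine, index=False,
            if_exists='replace', chunksize=10000)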

Access another container in Docker

2019-02-26 14:28:04  |  Docker

Steps

  1. Create a network named "test"

    1. docker network create test
  2. Create two containers using the network

    1. docker run --name c1 --network "test" --rm --entrypoint tail mongo -f
    2. docker run --name c2 --network "test" --rm --entrypoint tail mongo -f
  3. Enter one container to ping the other and it will work

    1. docker exec -it c1 bash
    1. apt-get install iputils-ping # install command ping
    1. root@79568c5ce391:/usr/src/app# ping c2
    2. PING c2 (172.18.0.3) 56(84) bytes of data.
    3. 64 bytes from c2.test (172.18.0.3): icmp_seq=1 ttl=64 time=0.137 ms
    4. 64 bytes from c2.test (172.18.0.3): icmp_seq=2 ttl=64 time=0.221 ms
    5. 64 bytes from c2.test (172.18.0.3): icmp_seq=3 ttl=64 time=0.232 ms
    6. ...

Notes

Using the default "bridge" network (or not specifying --network at all) does not work, because automatic DNS resolution between container names is only provided on user-defined networks:

  1. docker run --name c1 --rm --entrypoint tail web_scraper:v1 -f
  2. docker run --name c2 --rm --entrypoint tail web_scraper:v1 -f
  1. docker run --name c1 --network "bridge" --rm --entrypoint tail web_scraper:v1
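
The same setup can also be scripted with the Docker SDK for Python (the docker package); a minimal sketch:

  import docker

  client = docker.from_env()

  # Create the user-defined network shared by both containers
  client.networks.create("test", driver="bridge")

  for name in ("c1", "c2"):
      client.containers.run(
          "mongo",
          name=name,
          network="test",
          entrypoint=["tail", "-f", "/dev/null"],  # keep the container running
          detach=True,
          auto_remove=True,
      )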

Publish or expose port (-p, --expose)

2019-02-26 13:42:45  |  Docker
  1. $ docker run -p 127.0.0.1:80:8080/tcp ubuntu bash

This binds port 8080 of the container to TCP port 80 on 127.0.0.1 of the host machine. You can also specify udp and sctp ports.

  1. $ docker run --expose 80 ubuntu bash

This exposes port 80 of the container without publishing the port to the host system’s interfaces.

Docker Basics

2019-02-20 22:04:31  |  Docker

Images

  1. docker images
  2. docker build -t image_name .
  3. docker rmi $(docker images | grep "^<none>" | awk '{print $3}') # remove all untagged images (single quotes so the shell does not expand $3)
  4. docker save image_name > image_name.tar # save image as a tar file
  5. docker load < busybox.tar.gz # load image

Containers

  1. docker run -p 27017:27017 -v mongodbdata:/data/db mongo
  2. docker ps -a
  3. docker exec -it ubuntu_bash bash
  4. docker rm container_name
  5. docker rm $(docker ps -a -q) # remove all stopped containers

Volumes

  1. docker volume create mongodbdata
  2. docker volume ls
  3. docker volume inspect mongodbdata

Networks

  1. docker network ls
  2. docker network create network_name
  3. docker network inspect network_name
  4. docker network rm network_name

Webpack4 Setup

2019-02-12 23:42:38  |  Javascript

Kubernetes Basics

2019-02-17 01:40:29  |  Kubernetes

WeChat Mini Program Components and APIs

2017-12-18 23:15:24