Fast On-device LLM Inference with NPUs