pandas: powerful Python data analysis toolkit

Date: Dec 30, 2017 Version: 0.22.0

Binary Installers: http://pypi.python.org/pypi/pandas

Source Repository: http://github.com/pandas-dev/pandas

Issues & Ideas: https://github.com/pandas-dev/pandas/issues

Q&A Support: http://stackoverflow.com/questions/tagged/pandas

Developer Mailing List: http://groups.google.com/group/pydata

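pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.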

pandas is well suited for many different kinds of data:

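  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data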
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure
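
As a minimal sketch of the last point, the snippet below places plain, unlabeled values into a DataFrame and attaches labels afterwards. The variable names and the labels "x", "y", "a", "b", "c" are illustrative assumptions, not part of the original text.

```python
import pandas as pd

# Plain, unlabeled values: a nested list with no row or column names.
raw = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# With no labels supplied, pandas assigns default integer labels on both axes.
unlabeled = pd.DataFrame(raw)

# Labels can be attached later; the names below are hypothetical.
labeled = unlabeled.copy()
labeled.columns = ["x", "y"]
labeled.index = ["a", "b", "c"]

print(unlabeled)
print(labeled)
```

Once labels are attached, values can be looked up by name, for example `labeled.loc["b", "y"]`.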

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
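
A short sketch of the two structures and their relationship to NumPy may help here. The data and the names `prices` and `trades` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A Series is a 1-dimensional labeled array.
prices = pd.Series([10.5, 11.0, 9.75], index=["mon", "tue", "wed"])

# A DataFrame is a 2-dimensional table whose columns may have different types.
trades = pd.DataFrame({
    "ticker": ["AAA", "BBB", "AAA"],
    "qty": [100, 50, 25],
    "price": [10.5, 20.0, 11.0],
})

# Selecting a single DataFrame column returns a Series.
qty = trades["qty"]

# NumPy functions accept pandas objects and preserve their labels.
log_prices = np.log(prices)

# The underlying values are available as a NumPy array.
price_array = trades["price"].values

print(prices.ndim, trades.ndim)
print(log_prices)
```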

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
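
Several of the features listed above can be sketched in a few lines. This is an illustrative example on made-up data, not an exhaustive tour; every variable name and value below is an assumption introduced for the example.

```python
import pandas as pd

# Automatic alignment: arithmetic matches on labels, not on position.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])
total = a + b  # labels present in only one operand produce NaN

# Missing data: NaN values can be filled or dropped.
filled = total.fillna(0.0)
present = total.dropna()

# Size mutability: insert and delete DataFrame columns.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
df["double"] = df["value"] * 2
del df["double"]

# Split-apply-combine: aggregate per group, then transform per group.
sums = df.groupby("key")["value"].sum()
centered = df["value"] - df.groupby("key")["value"].transform("mean")

# Merging: join two tables on a shared key column.
labels = pd.DataFrame({"key": ["a", "b"], "label": ["alpha", "beta"]})
merged = df.merge(labels, on="key")

# Time series: build a daily date range, then downsample to 2-day means.
dates = pd.date_range("2017-01-01", periods=4, freq="D")
ts = pd.Series([0, 1, 2, 3], index=dates)
two_day = ts.resample("2D").mean()

print(total)
print(sums)
print(two_day)
```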

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.
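
The three stages described above can be sketched on a small, hypothetical data set. The column names `site`, `year`, and `reading`, and all values, are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements with a missing value and inconsistent casing.
raw = pd.DataFrame({
    "site": ["north", "North", "south", "south"],
    "year": [2016, 2017, 2016, 2017],
    "reading": [1.0, np.nan, 3.0, 5.0],
})

# Stage 1: munge and clean (normalize text, drop rows with no reading).
clean = raw.copy()
clean["site"] = clean["site"].str.lower()
clean = clean.dropna(subset=["reading"])

# Stage 2: analyze (mean reading per site).
mean_by_site = clean.groupby("site")["reading"].mean()

# Stage 3: organize the results for tabular display (sites by years).
table = clean.pivot_table(index="site", columns="year", values="reading")

print(mean_by_site)
print(table)
```

Combinations with no reading after cleaning, such as the north site in 2017 here, appear as NaN in the pivoted table.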

Some other notes

  • pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else, generalization usually sacrifices performance. So if you focus on one feature for your application, you may be able to create a faster specialized tool.
  • pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
  • pandas has been used extensively in production in financial applications.

Note

This documentation assumes general familiarity with NumPy. If you haven’t used NumPy much or at all, do invest some time in learning about NumPy first.

See the package overview for more detail about what’s in the library.