Below is the file 'sqlite/btree.c' from this revision. You can also download the file.
/* ** 2004 April 6 ** ** The author disclaims copyright to this source code. In place of ** a legal notice, here is a blessing: ** ** May you do good and not evil. ** May you find forgiveness for yourself and forgive others. ** May you share freely, never taking more than you give. ** ************************************************************************* ** $Id: btree.c,v 1.269 2005/09/17 15:20:27 drh Exp $ ** ** This file implements a external (disk-based) database using BTrees. ** For a detailed discussion of BTrees, refer to ** ** Donald E. Knuth, THE ART OF COMPUTER PROGRAMMING, Volume 3: ** "Sorting And Searching", pages 473-480. Addison-Wesley ** Publishing Company, Reading, Massachusetts. ** ** The basic idea is that each page of the file contains N database ** entries and N+1 pointers to subpages. ** ** ---------------------------------------------------------------- ** | Ptr(0) | Key(0) | Ptr(1) | Key(1) | ... | Key(N) | Ptr(N+1) | ** ---------------------------------------------------------------- ** ** All of the keys on the page that Ptr(0) points to have values less ** than Key(0). All of the keys on page Ptr(1) and its subpages have ** values greater than Key(0) and less than Key(1). All of the keys ** on Ptr(N+1) and its subpages have values greater than Key(N). And ** so forth. ** ** Finding a particular key requires reading O(log(M)) pages from the ** disk where M is the number of entries in the tree. ** ** In this implementation, a single file can hold one or more separate ** BTrees. Each BTree is identified by the index of its root page. The ** key and data for any entry are combined to form the "payload". A ** fixed amount of payload can be carried directly on the database ** page. If the payload is larger than the preset amount then surplus ** bytes are stored on overflow pages. The payload for an entry ** and the preceding pointer are combined to form a "Cell". Each ** page has a small header which contains the Ptr(N+1) pointer and other ** information such as the size of key and data. ** ** FORMAT DETAILS ** ** The file is divided into pages. The first page is called page 1, ** the second is page 2, and so forth. A page number of zero indicates ** "no such page". The page size can be anything between 512 and 65536. ** Each page can be either a btree page, a freelist page or an overflow ** page. ** ** The first page is always a btree page. The first 100 bytes of the first ** page contain a special header (the "file header") that describes the file. ** The format of the file header is as follows: ** ** OFFSET SIZE DESCRIPTION ** 0 16 Header string: "SQLite format 3\000" ** 16 2 Page size in bytes. ** 18 1 File format write version ** 19 1 File format read version ** 20 1 Bytes of unused space at the end of each page ** 21 1 Max embedded payload fraction ** 22 1 Min embedded payload fraction ** 23 1 Min leaf payload fraction ** 24 4 File change counter ** 28 4 Reserved for future use ** 32 4 First freelist page ** 36 4 Number of freelist pages in the file ** 40 60 15 4-byte meta values passed to higher layers ** ** All of the integer values are big-endian (most significant byte first). ** ** The file change counter is incremented when the database is changed more ** than once within the same second. This counter, together with the ** modification time of the file, allows other processes to know ** when the file has changed and thus when they need to flush their ** cache. ** ** The max embedded payload fraction is the amount of the total usable ** space in a page that can be consumed by a single cell for standard ** B-tree (non-LEAFDATA) tables. A value of 255 means 100%. The default ** is to limit the maximum cell size so that at least 4 cells will fit ** on one page. Thus the default max embedded payload fraction is 64. ** ** If the payload for a cell is larger than the max payload, then extra ** payload is spilled to overflow pages. Once an overflow page is allocated, ** as many bytes as possible are moved into the overflow pages without letting ** the cell size drop below the min embedded payload fraction. ** ** The min leaf payload fraction is like the min embedded payload fraction ** except that it applies to leaf nodes in a LEAFDATA tree. The maximum ** payload fraction for a LEAFDATA tree is always 100% (or 255) and it ** not specified in the header. ** ** Each btree pages is divided into three sections: The header, the ** cell pointer array, and the cell area area. Page 1 also has a 100-byte ** file header that occurs before the page header. ** ** |----------------| ** | file header | 100 bytes. Page 1 only. ** |----------------| ** | page header | 8 bytes for leaves. 12 bytes for interior nodes ** |----------------| ** | cell pointer | | 2 bytes per cell. Sorted order. ** | array | | Grows downward ** | | v ** |----------------| ** | unallocated | ** | space | ** |----------------| ^ Grows upwards ** | cell content | | Arbitrary order interspersed with freeblocks. ** | area | | and free space fragments. ** |----------------| ** ** The page headers looks like this: ** ** OFFSET SIZE DESCRIPTION ** 0 1 Flags. 1: intkey, 2: zerodata, 4: leafdata, 8: leaf ** 1 2 byte offset to the first freeblock ** 3 2 number of cells on this page ** 5 2 first byte of the cell content area ** 7 1 number of fragmented free bytes ** 8 4 Right child (the Ptr(N+1) value). Omitted on leaves. ** ** The flags define the format of this btree page. The leaf flag means that ** this page has no children. The zerodata flag means that this page carries ** only keys and no data. The intkey flag means that the key is a integer ** which is stored in the key size entry of the cell header rather than in ** the payload area. ** ** The cell pointer array begins on the first byte after the page header. ** The cell pointer array contains zero or more 2-byte numbers which are ** offsets from the beginning of the page to the cell content in the cell ** content area. The cell pointers occur in sorted order. The system strives ** to keep free space after the last cell pointer so that new cells can ** be easily added without having to defragment the page. ** ** Cell content is stored at the very end of the page and grows toward the ** beginning of the page. ** ** Unused space within the cell content area is collected into a linked list of ** freeblocks. Each freeblock is at least 4 bytes in size. The byte offset ** to the first freeblock is given in the header. Freeblocks occur in ** increasing order. Because a freeblock must be at least 4 bytes in size, ** any group of 3 or fewer unused bytes in the cell content area cannot ** exist on the freeblock chain. A group of 3 or fewer free bytes is called ** a fragment. The total number of bytes in all fragments is recorded. ** in the page header at offset 7. ** ** SIZE DESCRIPTION ** 2 Byte offset of the next freeblock ** 2 Bytes in this freeblock ** ** Cells are of variable length. Cells are stored in the cell content area at ** the end of the page. Pointers to the cells are in the cell pointer array ** that immediately follows the page header. Cells is not necessarily ** contiguous or in order, but cell pointers are contiguous and in order. ** ** Cell content makes use of variable length integers. A variable ** length integer is 1 to 9 bytes where the lower 7 bits of each ** byte are used. The integer consists of all bytes that have bit 8 set and ** the first byte with bit 8 clear. The most significant byte of the integer ** appears first. A variable-length integer may not be more than 9 bytes long. ** As a special case, all 8 bytes of the 9th byte are used as data. This ** allows a 64-bit integer to be encoded in 9 bytes. ** ** 0x00 becomes 0x00000000 ** 0x7f becomes 0x0000007f ** 0x81 0x00 becomes 0x00000080 ** 0x82 0x00 becomes 0x00000100 ** 0x80 0x7f becomes 0x0000007f ** 0x8a 0x91 0xd1 0xac 0x78 becomes 0x12345678 ** 0x81 0x81 0x81 0x81 0x01 becomes 0x10204081 ** ** Variable length integers are used for rowids and to hold the number of ** bytes of key and data in a btree cell. ** ** The content of a cell looks like this: ** ** SIZE DESCRIPTION ** 4 Page number of the left child. Omitted if leaf flag is set. ** var Number of bytes of data. Omitted if the zerodata flag is set. ** var Number of bytes of key. Or the key itself if intkey flag is set. ** * Payload ** 4 First page of the overflow chain. Omitted if no overflow ** ** Overflow pages form a linked list. Each page except the last is completely ** filled with data (pagesize - 4 bytes). The last page can have as little ** as 1 byte of data. ** ** SIZE DESCRIPTION ** 4 Page number of next overflow page ** * Data ** ** Freelist pages come in two subtypes: trunk pages and leaf pages. The ** file header points to first in a linked list of trunk page. Each trunk ** page points to multiple leaf pages. The content of a leaf page is ** unspecified. A trunk page looks like this: ** ** SIZE DESCRIPTION ** 4 Page number of next trunk page ** 4 Number of leaf pointers on this page ** * zero or more pages numbers of leaves */ #include "sqliteInt.h" #include "pager.h" #include "btree.h" #include "os.h" #include <assert.h> /* Round up a number to the next larger multiple of 8. This is used ** to force 8-byte alignment on 64-bit architectures. */ #define ROUND8(x) ((x+7)&~7) /* The following value is the maximum cell size assuming a maximum page ** size give above. */ #define MX_CELL_SIZE(pBt) (pBt->pageSize-8) /* The maximum number of cells on a single page of the database. This ** assumes a minimum cell size of 3 bytes. Such small cells will be ** exceedingly rare, but they are possible. */ #define MX_CELL(pBt) ((pBt->pageSize-8)/3) /* Forward declarations */ typedef struct MemPage MemPage; /* ** This is a magic string that appears at the beginning of every ** SQLite database in order to identify the file as a real database. ** ** You can change this value at compile-time by specifying a ** -DSQLITE_FILE_HEADER="..." on the compiler command-line. The ** header must be exactly 16 bytes including the zero-terminator so ** the string itself should be 15 characters long. If you change ** the header, then your custom library will not be able to read ** databases generated by the standard tools and the standard tools ** will not be able to read databases created by your custom library. */ #ifndef SQLITE_FILE_HEADER /* 123456789 123456 */ # define SQLITE_FILE_HEADER "SQLite format 3" #endif static const char zMagicHeader[] = SQLITE_FILE_HEADER; /* ** Page type flags. An ORed combination of these flags appear as the ** first byte of every BTree page. */ #define PTF_INTKEY 0x01 #define PTF_ZERODATA 0x02 #define PTF_LEAFDATA 0x04 #define PTF_LEAF 0x08 /* ** As each page of the file is loaded into memory, an instance of the following ** structure is appended and initialized to zero. This structure stores ** information about the page that is decoded from the raw file page. ** ** The pParent field points back to the parent page. This allows us to ** walk up the BTree from any leaf to the root. Care must be taken to ** unref() the parent page pointer when this page is no longer referenced. ** The pageDestructor() routine handles that chore. */ struct MemPage { u8 isInit; /* True if previously initialized. MUST BE FIRST! */ u8 idxShift; /* True if Cell indices have changed */ u8 nOverflow; /* Number of overflow cell bodies in aCell[] */ u8 intKey; /* True if intkey flag is set */ u8 leaf; /* True if leaf flag is set */ u8 zeroData; /* True if table stores keys only */ u8 leafData; /* True if tables stores data on leaves only */ u8 hasData; /* True if this page stores data */ u8 hdrOffset; /* 100 for page 1. 0 otherwise */ u8 childPtrSize; /* 0 if leaf==1. 4 if leaf==0 */ u16 maxLocal; /* Copy of Btree.maxLocal or Btree.maxLeaf */ u16 minLocal; /* Copy of Btree.minLocal or Btree.minLeaf */ u16 cellOffset; /* Index in aData of first cell pointer */ u16 idxParent; /* Index in parent of this node */ u16 nFree; /* Number of free bytes on the page */ u16 nCell; /* Number of cells on this page, local and ovfl */ struct _OvflCell { /* Cells that will not fit on aData[] */ u8 *pCell; /* Pointers to the body of the overflow cell */ u16 idx; /* Insert this cell before idx-th non-overflow cell */ } aOvfl[5]; struct Btree *pBt; /* Pointer back to BTree structure */ u8 *aData; /* Pointer back to the start of the page */ Pgno pgno; /* Page number for this page */ MemPage *pParent; /* The parent of this page. NULL for root */ }; /* ** The in-memory image of a disk page has the auxiliary information appended ** to the end. EXTRA_SIZE is the number of bytes of space needed to hold ** that extra information. */ #define EXTRA_SIZE sizeof(MemPage) /* ** Everything we need to know about an open database */ struct Btree { Pager *pPager; /* The page cache */ BtCursor *pCursor; /* A list of all open cursors */ MemPage *pPage1; /* First page of the database */ u8 inTrans; /* True if a transaction is in progress */ u8 inStmt; /* True if we are in a statement subtransaction */ u8 readOnly; /* True if the underlying file is readonly */ u8 maxEmbedFrac; /* Maximum payload as % of total page size */ u8 minEmbedFrac; /* Minimum payload as % of total page size */ u8 minLeafFrac; /* Minimum leaf payload as % of total page size */ u8 pageSizeFixed; /* True if the page size can no longer be changed */ #ifndef SQLITE_OMIT_AUTOVACUUM u8 autoVacuum; /* True if database supports auto-vacuum */ #endif u16 pageSize; /* Total number of bytes on a page */ u16 usableSize; /* Number of usable bytes on each page */ int maxLocal; /* Maximum local payload in non-LEAFDATA tables */ int minLocal; /* Minimum local payload in non-LEAFDATA tables */ int maxLeaf; /* Maximum local payload in a LEAFDATA table */ int minLeaf; /* Minimum local payload in a LEAFDATA table */ BusyHandler *pBusyHandler; /* Callback for when there is lock contention */ }; typedef Btree Bt; /* ** Btree.inTrans may take one of the following values. */ #define TRANS_NONE 0 #define TRANS_READ 1 #define TRANS_WRITE 2 /* ** An instance of the following structure is used to hold information ** about a cell. The parseCellPtr() function fills in this structure ** based on information extract from the raw disk page. */ typedef struct CellInfo CellInfo; struct CellInfo { u8 *pCell; /* Pointer to the start of cell content */ i64 nKey; /* The key for INTKEY tables, or number of bytes in key */ u32 nData; /* Number of bytes of data */ u16 nHeader; /* Size of the cell content header in bytes */ u16 nLocal; /* Amount of payload held locally */ u16 iOverflow; /* Offset to overflow page number. Zero if no overflow */ u16 nSize; /* Size of the cell content on the main b-tree page */ }; /* ** A cursor is a pointer to a particular entry in the BTree. ** The entry is identified by its MemPage and the index in ** MemPage.aCell[] of the entry. */ struct BtCursor { Btree *pBt; /* The Btree to which this cursor belongs */ BtCursor *pNext, *pPrev; /* Forms a linked list of all cursors */ int (*xCompare)(void*,int,const void*,int,const void*); /* Key comp func */ void *pArg; /* First arg to xCompare() */ Pgno pgnoRoot; /* The root page of this tree */ MemPage *pPage; /* Page that contains the entry */ int idx; /* Index of the entry in pPage->aCell[] */ CellInfo info; /* A parse of the cell we are pointing at */ u8 wrFlag; /* True if writable */ u8 isValid; /* TRUE if points to a valid entry */ }; /* ** The TRACE macro will print high-level status information about the ** btree operation when the global variable sqlite3_btree_trace is ** enabled. */ #if SQLITE_TEST # define TRACE(X) if( sqlite3_btree_trace )\ { sqlite3DebugPrintf X; fflush(stdout); } #else # define TRACE(X) #endif int sqlite3_btree_trace=0; /* True to enable tracing */ /* ** Forward declaration */ static int checkReadLocks(Btree*,Pgno,BtCursor*); /* ** Read or write a two- and four-byte big-endian integer values. */ static u32 get2byte(unsigned char *p){ return (p[0]<<8) | p[1]; } static u32 get4byte(unsigned char *p){ return (p[0]<<24) | (p[1]<<16) | (p[2]<<8) | p[3]; } static void put2byte(unsigned char *p, u32 v){ p[0] = v>>8; p[1] = v; } static void put4byte(unsigned char *p, u32 v){ p[0] = v>>24; p[1] = v>>16; p[2] = v>>8; p[3] = v; } /* ** Routines to read and write variable-length integers. These used to ** be defined locally, but now we use the varint routines in the util.c ** file. */ #define getVarint sqlite3GetVarint #define getVarint32 sqlite3GetVarint32 #define putVarint sqlite3PutVarint /* The database page the PENDING_BYTE occupies. This page is never used. ** TODO: This macro is very similary to PAGER_MJ_PGNO() in pager.c. They ** should possibly be consolidated (presumably in pager.h). */ #define PENDING_BYTE_PAGE(pBt) ((PENDING_BYTE/(pBt)->pageSize)+1) #ifndef SQLITE_OMIT_AUTOVACUUM /* ** These macros define the location of the pointer-map entry for a ** database page. The first argument to each is the number of usable ** bytes on each page of the database (often 1024). The second is the ** page number to look up in the pointer map. ** ** PTRMAP_PAGENO returns the database page number of the pointer-map ** page that stores the required pointer. PTRMAP_PTROFFSET returns ** the offset of the requested map entry. ** ** If the pgno argument passed to PTRMAP_PAGENO is a pointer-map page, ** then pgno is returned. So (pgno==PTRMAP_PAGENO(pgsz, pgno)) can be ** used to test if pgno is a pointer-map page. PTRMAP_ISPAGE implements ** this test. */ #define PTRMAP_PAGENO(pgsz, pgno) (((pgno-2)/(pgsz/5+1))*(pgsz/5+1)+2) #define PTRMAP_PTROFFSET(pgsz, pgno) (((pgno-2)%(pgsz/5+1)-1)*5) #define PTRMAP_ISPAGE(pgsz, pgno) (PTRMAP_PAGENO(pgsz,pgno)==pgno) /* ** The pointer map is a lookup table that identifies the parent page for ** each child page in the database file. The parent page is the page that ** contains a pointer to the child. Every page in the database contains ** 0 or 1 parent pages. (In this context 'database page' refers ** to any page that is not part of the pointer map itself.) Each pointer map ** entry consists of a single byte 'type' and a 4 byte parent page number. ** The PTRMAP_XXX identifiers below are the valid types. ** ** The purpose of the pointer map is to facility moving pages from one ** position in the file to another as part of autovacuum. When a page ** is moved, the pointer in its parent must be updated to point to the ** new location. The pointer map is used to locate the parent page quickly. ** ** PTRMAP_ROOTPAGE: The database page is a root-page. The page-number is not ** used in this case. ** ** PTRMAP_FREEPAGE: The database page is an unused (free) page. The page-number ** is not used in this case. ** ** PTRMAP_OVERFLOW1: The database page is the first page in a list of ** overflow pages. The page number identifies the page that ** contains the cell with a pointer to this overflow page. ** ** PTRMAP_OVERFLOW2: The database page is the second or later page in a list of ** overflow pages. The page-number identifies the previous ** page in the overflow page list. ** ** PTRMAP_BTREE: The database page is a non-root btree page. The page number ** identifies the parent page in the btree. */ #define PTRMAP_ROOTPAGE 1 #define PTRMAP_FREEPAGE 2 #define PTRMAP_OVERFLOW1 3 #define PTRMAP_OVERFLOW2 4 #define PTRMAP_BTREE 5 /* ** Write an entry into the pointer map. ** ** This routine updates the pointer map entry for page number 'key' ** so that it maps to type 'eType' and parent page number 'pgno'. ** An error code is returned if something goes wrong, otherwise SQLITE_OK. */ static int ptrmapPut(Btree *pBt, Pgno key, u8 eType, Pgno parent){ u8 *pPtrmap; /* The pointer map page */ Pgno iPtrmap; /* The pointer map page number */ int offset; /* Offset in pointer map page */ int rc; assert( pBt->autoVacuum ); if( key==0 ){ return SQLITE_CORRUPT_BKPT; } iPtrmap = PTRMAP_PAGENO(pBt->usableSize, key); rc = sqlite3pager_get(pBt->pPager, iPtrmap, (void **)&pPtrmap); if( rc!=SQLITE_OK ){ return rc; } offset = PTRMAP_PTROFFSET(pBt->usableSize, key); if( eType!=pPtrmap[offset] || get4byte(&pPtrmap[offset+1])!=parent ){ TRACE(("PTRMAP_UPDATE: %d->(%d,%d)\n", key, eType, parent)); rc = sqlite3pager_write(pPtrmap); if( rc==SQLITE_OK ){ pPtrmap[offset] = eType; put4byte(&pPtrmap[offset+1], parent); } } sqlite3pager_unref(pPtrmap); return rc; } /* ** Read an entry from the pointer map. ** ** This routine retrieves the pointer map entry for page 'key', writing ** the type and parent page number to *pEType and *pPgno respectively. ** An error code is returned if something goes wrong, otherwise SQLITE_OK. */ static int ptrmapGet(Btree *pBt, Pgno key, u8 *pEType, Pgno *pPgno){ int iPtrmap; /* Pointer map page index */ u8 *pPtrmap; /* Pointer map page data */ int offset; /* Offset of entry in pointer map */ int rc; iPtrmap = PTRMAP_PAGENO(pBt->usableSize, key); rc = sqlite3pager_get(pBt->pPager, iPtrmap, (void **)&pPtrmap); if( rc!=0 ){ return rc; } offset = PTRMAP_PTROFFSET(pBt->usableSize, key); if( pEType ) *pEType = pPtrmap[offset]; if( pPgno ) *pPgno = get4byte(&pPtrmap[offset+1]); sqlite3pager_unref(pPtrmap); if( *pEType<1 || *pEType>5 ) return SQLITE_CORRUPT_BKPT; return SQLITE_OK; } #endif /* SQLITE_OMIT_AUTOVACUUM */ /* ** Given a btree page and a cell index (0 means the first cell on ** the page, 1 means the second cell, and so forth) return a pointer ** to the cell content. ** ** This routine works only for pages that do not contain overflow cells. */ static u8 *findCell(MemPage *pPage, int iCell){ u8 *data = pPage->aData; assert( iCell>=0 ); assert( iCell<get2byte(&data[pPage->hdrOffset+3]) ); return data + get2byte(&data[pPage->cellOffset+2*iCell]); } /* ** This a more complex version of findCell() that works for ** pages that do contain overflow cells. See insert */ static u8 *findOverflowCell(MemPage *pPage, int iCell){ int i; for(i=pPage->nOverflow-1; i>=0; i--){ int k; struct _OvflCell *pOvfl; pOvfl = &pPage->aOvfl[i]; k = pOvfl->idx; if( k<=iCell ){ if( k==iCell ){ return pOvfl->pCell; } iCell--; } } return findCell(pPage, iCell); } /* ** Parse a cell content block and fill in the CellInfo structure. There ** are two versions of this function. parseCell() takes a cell index ** as the second argument and parseCellPtr() takes a pointer to the ** body of the cell as its second argument. */ static void parseCellPtr( MemPage *pPage, /* Page containing the cell */ u8 *pCell, /* Pointer to the cell text. */ CellInfo *pInfo /* Fill in this structure */ ){ int n; /* Number bytes in cell content header */ u32 nPayload; /* Number of bytes of cell payload */ pInfo->pCell = pCell; assert( pPage->leaf==0 || pPage->leaf==1 ); n = pPage->childPtrSize; assert( n==4-4*pPage->leaf ); if( pPage->hasData ){ n += getVarint32(&pCell[n], &nPayload); }else{ nPayload = 0; } n += getVarint(&pCell[n], (u64 *)&pInfo->nKey); pInfo->nHeader = n; pInfo->nData = nPayload; if( !pPage->intKey ){ nPayload += pInfo->nKey; } if( nPayload<=pPage->maxLocal ){ /* This is the (easy) common case where the entire payload fits ** on the local page. No overflow is required. */ int nSize; /* Total size of cell content in bytes */ pInfo->nLocal = nPayload; pInfo->iOverflow = 0; nSize = nPayload + n; if( nSize<4 ){ nSize = 4; /* Minimum cell size is 4 */ } pInfo->nSize = nSize; }else{ /* If the payload will not fit completely on the local page, we have ** to decide how much to store locally and how much to spill onto ** overflow pages. The strategy is to minimize the amount of unused ** space on overflow pages while keeping the amount of local storage ** in between minLocal and maxLocal. ** ** Warning: changing the way overflow payload is distributed in any ** way will result in an incompatible file format. */ int minLocal; /* Minimum amount of payload held locally */ int maxLocal; /* Maximum amount of payload held locally */ int surplus; /* Overflow payload available for local storage */ minLocal = pPage->minLocal; maxLocal = pPage->maxLocal; surplus = minLocal + (nPayload - minLocal)%(pPage->pBt->usableSize - 4); if( surplus <= maxLocal ){ pInfo->nLocal = surplus; }else{ pInfo->nLocal = minLocal; } pInfo->iOverflow = pInfo->nLocal + n; pInfo->nSize = pInfo->iOverflow + 4; } } static void parseCell( MemPage *pPage, /* Page containing the cell */ int iCell, /* The cell index. First cell is 0 */ CellInfo *pInfo /* Fill in this structure */ ){ parseCellPtr(pPage, findCell(pPage, iCell), pInfo); } /* ** Compute the total number of bytes that a Cell needs in the cell ** data area of the btree-page. The return number includes the cell ** data header and the local payload, but not any overflow page or ** the space used by the cell pointer. */ #ifndef NDEBUG static int cellSize(MemPage *pPage, int iCell){ CellInfo info; parseCell(pPage, iCell, &info); return info.nSize; } #endif static int cellSizePtr(MemPage *pPage, u8 *pCell){ CellInfo info; parseCellPtr(pPage, pCell, &info); return info.nSize; } #ifndef SQLITE_OMIT_AUTOVACUUM /* ** If the cell pCell, part of page pPage contains a pointer ** to an overflow page, insert an entry into the pointer-map ** for the overflow page. */ static int ptrmapPutOvflPtr(MemPage *pPage, u8 *pCell){ if( pCell ){ CellInfo info; parseCellPtr(pPage, pCell, &info); if( (info.nData+(pPage->intKey?0:info.nKey))>info.nLocal ){ Pgno ovfl = get4byte(&pCell[info.iOverflow]); return ptrmapPut(pPage->pBt, ovfl, PTRMAP_OVERFLOW1, pPage->pgno); } } return SQLITE_OK; } /* ** If the cell with index iCell on page pPage contains a pointer ** to an overflow page, insert an entry into the pointer-map ** for the overflow page. */ static int ptrmapPutOvfl(MemPage *pPage, int iCell){ u8 *pCell; pCell = findOverflowCell(pPage, iCell); return ptrmapPutOvflPtr(pPage, pCell); } #endif /* ** Do sanity checking on a page. Throw an exception if anything is ** not right. ** ** This routine is used for internal error checking only. It is omitted ** from most builds. */ #if defined(BTREE_DEBUG) && !defined(NDEBUG) && 0 static void _pageIntegrity(MemPage *pPage){ int usableSize; u8 *data; int i, j, idx, c, pc, hdr, nFree; int cellOffset; int nCell, cellLimit; u8 *used; used = sqliteMallocRaw( pPage->pBt->pageSize ); if( used==0 ) return; usableSize = pPage->pBt->usableSize; assert( pPage->aData==&((unsigned char*)pPage)[-pPage->pBt->pageSize] ); hdr = pPage->hdrOffset; assert( hdr==(pPage->pgno==1 ? 100 : 0) ); assert( pPage->pgno==sqlite3pager_pagenumber(pPage->aData) ); c = pPage->aData[hdr]; if( pPage->isInit ){ assert( pPage->leaf == ((c & PTF_LEAF)!=0) ); assert( pPage->zeroData == ((c & PTF_ZERODATA)!=0) ); assert( pPage->leafData == ((c & PTF_LEAFDATA)!=0) ); assert( pPage->intKey == ((c & (PTF_INTKEY|PTF_LEAFDATA))!=0) ); assert( pPage->hasData == !(pPage->zeroData || (!pPage->leaf && pPage->leafData)) ); assert( pPage->cellOffset==pPage->hdrOffset+12-4*pPage->leaf ); assert( pPage->nCell = get2byte(&pPage->aData[hdr+3]) ); } data = pPage->aData; memset(used, 0, usableSize); for(i=0; i<hdr+10-pPage->leaf*4; i++) used[i] = 1; nFree = 0; pc = get2byte(&data[hdr+1]); while( pc ){ int size; assert( pc>0 && pc<usableSize-4 ); size = get2byte(&data[pc+2]); assert( pc+size<=usableSize ); nFree += size; for(i=pc; i<pc+size; i++){ assert( used[i]==0 ); used[i] = 1; } pc = get2byte(&data[pc]); } idx = 0; nCell = get2byte(&data[hdr+3]); cellLimit = get2byte(&data[hdr+5]); assert( pPage->isInit==0 || pPage->nFree==nFree+data[hdr+7]+cellLimit-(cellOffset+2*nCell) ); cellOffset = pPage->cellOffset; for(i=0; i<nCell; i++){ int size; pc = get2byte(&data[cellOffset+2*i]); assert( pc>0 && pc<usableSize-4 ); size = cellSize(pPage, &data[pc]); assert( pc+size<=usableSize ); for(j=pc; j<pc+size; j++){ assert( used[j]==0 ); used[j] = 1; } } for(i=cellOffset+2*nCell; i<cellimit; i++){ assert( used[i]==0 ); used[i] = 1; } nFree = 0; for(i=0; i<usableSize; i++){ assert( used[i]<=1 ); if( used[i]==0 ) nFree++; } assert( nFree==data[hdr+7] ); sqliteFree(used); } #define pageIntegrity(X) _pageIntegrity(X) #else # define pageIntegrity(X) #endif /* ** Defragment the page given. All Cells are moved to the ** beginning of the page and all free space is collected ** into one big FreeBlk at the end of the page. */ static int defragmentPage(MemPage *pPage){ int i; /* Loop counter */ int pc; /* Address of a i-th cell */ int addr; /* Offset of first byte after cell pointer array */ int hdr; /* Offset to the page header */ int size; /* Size of a cell */ int usableSize; /* Number of usable bytes on a page */ int cellOffset; /* Offset to the cell pointer array */ int brk; /* Offset to the cell content area */ int nCell; /* Number of cells on the page */ unsigned char *data; /* The page data */ unsigned char *temp; /* Temp area for cell content */ assert( sqlite3pager_iswriteable(pPage->aData) ); assert( pPage->pBt!=0 ); assert( pPage->pBt->usableSize <= SQLITE_MAX_PAGE_SIZE ); assert( pPage->nOverflow==0 ); temp = sqliteMalloc( pPage->pBt->pageSize ); if( temp==0 ) return SQLITE_NOMEM; data = pPage->aData; hdr = pPage->hdrOffset; cellOffset = pPage->cellOffset; nCell = pPage->nCell; assert( nCell==get2byte(&data[hdr+3]) ); usableSize = pPage->pBt->usableSize; brk = get2byte(&data[hdr+5]); memcpy(&temp[brk], &data[brk], usableSize - brk); brk = usableSize; for(i=0; i<nCell; i++){ u8 *pAddr; /* The i-th cell pointer */ pAddr = &data[cellOffset + i*2]; pc = get2byte(pAddr); assert( pc<pPage->pBt->usableSize ); size = cellSizePtr(pPage, &temp[pc]); brk -= size; memcpy(&data[brk], &temp[pc], size); put2byte(pAddr, brk); } assert( brk>=cellOffset+2*nCell ); put2byte(&data[hdr+5], brk); data[hdr+1] = 0; data[hdr+2] = 0; data[hdr+7] = 0; addr = cellOffset+2*nCell; memset(&data[addr], 0, brk-addr); sqliteFree(temp); return SQLITE_OK; } /* ** Allocate nByte bytes of space on a page. ** ** Return the index into pPage->aData[] of the first byte of ** the new allocation. Or return 0 if there is not enough free ** space on the page to satisfy the allocation request. ** ** If the page contains nBytes of free space but does not contain ** nBytes of contiguous free space, then this routine automatically ** calls defragementPage() to consolidate all free space before ** allocating the new chunk. */ static int allocateSpace(MemPage *pPage, int nByte){ int addr, pc, hdr; int size; int nFrag; int top; int nCell; int cellOffset; unsigned char *data; data = pPage->aData; assert( sqlite3pager_iswriteable(data) ); assert( pPage->pBt ); if( nByte<4 ) nByte = 4; if( pPage->nFree<nByte || pPage->nOverflow>0 ) return 0; pPage->nFree -= nByte; hdr = pPage->hdrOffset; nFrag = data[hdr+7]; if( nFrag<60 ){ /* Search the freelist looking for a slot big enough to satisfy the ** space request. */ addr = hdr+1; while( (pc = get2byte(&data[addr]))>0 ){ size = get2byte(&data[pc+2]); if( size>=nByte ){ if( size<nByte+4 ){ memcpy(&data[addr], &data[pc], 2); data[hdr+7] = nFrag + size - nByte; return pc; }else{ put2byte(&data[pc+2], size-nByte); return pc + size - nByte; } } addr = pc; } } /* Allocate memory from the gap in between the cell pointer array ** and the cell content area. */ top = get2byte(&data[hdr+5]); nCell = get2byte(&data[hdr+3]); cellOffset = pPage->cellOffset; if( nFrag>=60 || cellOffset + 2*nCell > top - nByte ){ if( defragmentPage(pPage) ) return 0; top = get2byte(&data[hdr+5]); } top -= nByte; assert( cellOffset + 2*nCell <= top ); put2byte(&data[hdr+5], top); return top; } /* ** Return a section of the pPage->aData to the freelist. ** The first byte of the new free block is pPage->aDisk[start] ** and the size of the block is "size" bytes. ** ** Most of the effort here is involved in coalesing adjacent ** free blocks into a single big free block. */ static void freeSpace(MemPage *pPage, int start, int size){ int addr, pbegin, hdr; unsigned char *data = pPage->aData; assert( pPage->pBt!=0 ); assert( sqlite3pager_iswriteable(data) ); assert( start>=pPage->hdrOffset+6+(pPage->leaf?0:4) ); assert( (start + size)<=pPage->pBt->usableSize ); if( size<4 ) size = 4; /* Add the space back into the linked list of freeblocks */ hdr = pPage->hdrOffset; addr = hdr + 1; while( (pbegin = get2byte(&data[addr]))<start && pbegin>0 ){ assert( pbegin<=pPage->pBt->usableSize-4 ); assert( pbegin>addr ); addr = pbegin; } assert( pbegin<=pPage->pBt->usableSize-4 ); assert( pbegin>addr || pbegin==0 ); put2byte(&data[addr], start); put2byte(&data[start], pbegin); put2byte(&data[start+2], size); pPage->nFree += size; /* Coalesce adjacent free blocks */ addr = pPage->hdrOffset + 1; while( (pbegin = get2byte(&data[addr]))>0 ){ int pnext, psize; assert( pbegin>addr ); assert( pbegin<=pPage->pBt->usableSize-4 ); pnext = get2byte(&data[pbegin]); psize = get2byte(&data[pbegin+2]); if( pbegin + psize + 3 >= pnext && pnext>0 ){ int frag = pnext - (pbegin+psize); assert( frag<=data[pPage->hdrOffset+7] ); data[pPage->hdrOffset+7] -= frag; put2byte(&data[pbegin], get2byte(&data[pnext])); put2byte(&data[pbegin+2], pnext+get2byte(&data[pnext+2])-pbegin); }else{ addr = pbegin; } } /* If the cell content area begins with a freeblock, remove it. */ if( data[hdr+1]==data[hdr+5] && data[hdr+2]==data[hdr+6] ){ int top; pbegin = get2byte(&data[hdr+1]); memcpy(&data[hdr+1], &data[pbegin], 2); top = get2byte(&data[hdr+5]); put2byte(&data[hdr+5], top + get2byte(&data[pbegin+2])); } } /* ** Decode the flags byte (the first byte of the header) for a page ** and initialize fields of the MemPage structure accordingly. */ static void decodeFlags(MemPage *pPage, int flagByte){ Btree *pBt; /* A copy of pPage->pBt */ assert( pPage->hdrOffset==(pPage->pgno==1 ? 100 : 0) ); pPage->intKey = (flagByte & (PTF_INTKEY|PTF_LEAFDATA))!=0; pPage->zeroData = (flagByte & PTF_ZERODATA)!=0; pPage->leaf = (flagByte & PTF_LEAF)!=0; pPage->childPtrSize = 4*(pPage->leaf==0); pBt = pPage->pBt; if( flagByte & PTF_LEAFDATA ){ pPage->leafData = 1; pPage->maxLocal = pBt->maxLeaf; pPage->minLocal = pBt->minLeaf; }else{ pPage->leafData = 0; pPage->maxLocal = pBt->maxLocal; pPage->minLocal = pBt->minLocal; } pPage->hasData = !(pPage->zeroData || (!pPage->leaf && pPage->leafData)); } /* ** Initialize the auxiliary information for a disk block. ** ** The pParent parameter must be a pointer to the MemPage which ** is the parent of the page being initialized. The root of a ** BTree has no parent and so for that page, pParent==NULL. ** ** Return SQLITE_OK on success. If we see that the page does ** not contain a well-formed database page, then return ** SQLITE_CORRUPT. Note that a return of SQLITE_OK does not ** guarantee that the page is well-formed. It only shows that ** we failed to detect any corruption. */ static int initPage( MemPage *pPage, /* The page to be initialized */ MemPage *pParent /* The parent. Might be NULL */ ){ int pc; /* Address of a freeblock within pPage->aData[] */ int hdr; /* Offset to beginning of page header */ u8 *data; /* Equal to pPage->aData */ Btree *pBt; /* The main btree structure */ int usableSize; /* Amount of usable space on each page */ int cellOffset; /* Offset from start of page to first cell pointer */ int nFree; /* Number of unused bytes on the page */ int top; /* First byte of the cell content area */ pBt = pPage->pBt; assert( pBt!=0 ); assert( pParent==0 || pParent->pBt==pBt ); assert( pPage->pgno==sqlite3pager_pagenumber(pPage->aData) ); assert( pPage->aData == &((unsigned char*)pPage)[-pBt->pageSize] ); if( pPage->pParent!=pParent && (pPage->pParent!=0 || pPage->isInit) ){ /* The parent page should never change unless the file is corrupt */ return SQLITE_CORRUPT_BKPT; } if( pPage->isInit ) return SQLITE_OK; if( pPage->pParent==0 && pParent!=0 ){ pPage->pParent = pParent; sqlite3pager_ref(pParent->aData); } hdr = pPage->hdrOffset; data = pPage->aData; decodeFlags(pPage, data[hdr]); pPage->nOverflow = 0; pPage->idxShift = 0; usableSize = pBt->usableSize; pPage->cellOffset = cellOffset = hdr + 12 - 4*pPage->leaf; top = get2byte(&data[hdr+5]); pPage->nCell = get2byte(&data[hdr+3]); if( pPage->nCell>MX_CELL(pBt) ){ /* To many cells for a single page. The page must be corrupt */ return SQLITE_CORRUPT_BKPT; } if( pPage->nCell==0 && pParent!=0 && pParent->pgno!=1 ){ /* All pages must have at least one cell, except for root pages */ return SQLITE_CORRUPT_BKPT; } /* Compute the total free space on the page */ pc = get2byte(&data[hdr+1]); nFree = data[hdr+7] + top - (cellOffset + 2*pPage->nCell); while( pc>0 ){ int next, size; if( pc>usableSize-4 ){ /* Free block is off the page */ return SQLITE_CORRUPT_BKPT; } next = get2byte(&data[pc]); size = get2byte(&data[pc+2]); if( next>0 && next<=pc+size+3 ){ /* Free blocks must be in accending order */ return SQLITE_CORRUPT_BKPT; } nFree += size; pc = next; } pPage->nFree = nFree; if( nFree>=usableSize ){ /* Free space cannot exceed total page size */ return SQLITE_CORRUPT_BKPT; } pPage->isInit = 1; pageIntegrity(pPage); return SQLITE_OK; } /* ** Set up a raw page so that it looks like a database page holding ** no entries. */ static void zeroPage(MemPage *pPage, int flags){ unsigned char *data = pPage->aData; Btree *pBt = pPage->pBt; int hdr = pPage->hdrOffset; int first; assert( sqlite3pager_pagenumber(data)==pPage->pgno ); assert( &data[pBt->pageSize] == (unsigned char*)pPage ); assert( sqlite3pager_iswriteable(data) ); memset(&data[hdr], 0, pBt->usableSize - hdr); data[hdr] = flags; first = hdr + 8 + 4*((flags&PTF_LEAF)==0); memset(&data[hdr+1], 0, 4); data[hdr+7] = 0; put2byte(&data[hdr+5], pBt->usableSize); pPage->nFree = pBt->usableSize - first; decodeFlags(pPage, flags); pPage->hdrOffset = hdr; pPage->cellOffset = first; pPage->nOverflow = 0; pPage->idxShift = 0; pPage->nCell = 0; pPage->isInit = 1; pageIntegrity(pPage); } /* ** Get a page from the pager. Initialize the MemPage.pBt and ** MemPage.aData elements if needed. */ static int getPage(Btree *pBt, Pgno pgno, MemPage **ppPage){ int rc; unsigned char *aData; MemPage *pPage; rc = sqlite3pager_get(pBt->pPager, pgno, (void**)&aData); if( rc ) return rc; pPage = (MemPage*)&aData[pBt->pageSize]; pPage->aData = aData; pPage->pBt = pBt; pPage->pgno = pgno; pPage->hdrOffset = pPage->pgno==1 ? 100 : 0; *ppPage = pPage; return SQLITE_OK; } /* ** Get a page from the pager and initialize it. This routine ** is just a convenience wrapper around separate calls to ** getPage() and initPage(). */ static int getAndInitPage( Btree *pBt, /* The database file */ Pgno pgno, /* Number of the page to get */ MemPage **ppPage, /* Write the page pointer here */ MemPage *pParent /* Parent of the page */ ){ int rc; if( pgno==0 ){ return SQLITE_CORRUPT_BKPT; } rc = getPage(pBt, pgno, ppPage); if( rc==SQLITE_OK && (*ppPage)->isInit==0 ){ rc = initPage(*ppPage, pParent); } return rc; } /* ** Release a MemPage. This should be called once for each prior ** call to getPage. */ static void releasePage(MemPage *pPage){ if( pPage ){ assert( pPage->aData ); assert( pPage->pBt ); assert( &pPage->aData[pPage->pBt->pageSize]==(unsigned char*)pPage ); sqlite3pager_unref(pPage->aData); } } /* ** This routine is called when the reference count for a page ** reaches zero. We need to unref the pParent pointer when that ** happens. */ static void pageDestructor(void *pData, int pageSize){ MemPage *pPage; assert( (pageSize & 7)==0 ); pPage = (MemPage*)&((char*)pData)[pageSize]; if( pPage->pParent ){ MemPage *pParent = pPage->pParent; pPage->pParent = 0; releasePage(pParent); } pPage->isInit = 0; } /* ** During a rollback, when the pager reloads information into the cache ** so that the cache is restored to its original state at the start of ** the transaction, for each page restored this routine is called. ** ** This routine needs to reset the extra data section at the end of the ** page to agree with the restored data. */ static void pageReinit(void *pData, int pageSize){ MemPage *pPage; assert( (pageSize & 7)==0 ); pPage = (MemPage*)&((char*)pData)[pageSize]; if( pPage->isInit ){ pPage->isInit = 0; initPage(pPage, pPage->pParent); } } /* ** Open a database file. ** ** zFilename is the name of the database file. If zFilename is NULL ** a new database with a random name is created. This randomly named ** database file will be deleted when sqlite3BtreeClose() is called. */ int sqlite3BtreeOpen( const char *zFilename, /* Name of the file containing the BTree database */ Btree **ppBtree, /* Pointer to new Btree object written here */ int flags /* Options */ ){ Btree *pBt; int rc; int nReserve; unsigned char zDbHeader[100]; /* ** The following asserts make sure that structures used by the btree are ** the right size. This is to guard against size changes that result ** when compiling on a different architecture. */ assert( sizeof(i64)==8 ); assert( sizeof(u64)==8 ); assert( sizeof(u32)==4 ); assert( sizeof(u16)==2 ); assert( sizeof(Pgno)==4 ); pBt = sqliteMalloc( sizeof(*pBt) ); if( pBt==0 ){ *ppBtree = 0; return SQLITE_NOMEM; } rc = sqlite3pager_open(&pBt->pPager, zFilename, EXTRA_SIZE, flags); if( rc!=SQLITE_OK ){ if( pBt->pPager ) sqlite3pager_close(pBt->pPager); sqliteFree(pBt); *ppBtree = 0; return rc; } sqlite3pager_se